Comments on [pydata/xarray#3386](https://github.com/pydata/xarray/issues/3386)

---

**[@sipposip](https://github.com/pydata/xarray/issues/3386#issuecomment-540477057), 2019-10-10T09:11:31Z:**

@dcherian, here is a dump of a single file:

```
ncdump -hs era5_mean_sea_level_pressure_2002.nc
netcdf era5_mean_sea_level_pressure_2002 {
dimensions:
	longitude = 1440 ;
	latitude = 721 ;
	time = 8760 ;
variables:
	float longitude(longitude) ;
		longitude:units = "degrees_east" ;
		longitude:long_name = "longitude" ;
	float latitude(latitude) ;
		latitude:units = "degrees_north" ;
		latitude:long_name = "latitude" ;
	int time(time) ;
		time:units = "hours since 1900-01-01 00:00:00.0" ;
		time:long_name = "time" ;
		time:calendar = "gregorian" ;
	short msl(time, latitude, longitude) ;
		msl:scale_factor = 0.23025422306319 ;
		msl:add_offset = 99003.8223728885 ;
		msl:_FillValue = -32767s ;
		msl:missing_value = -32767s ;
		msl:units = "Pa" ;
		msl:long_name = "Mean sea level pressure" ;
		msl:standard_name = "air_pressure_at_mean_sea_level" ;

// global attributes:
		:Conventions = "CF-1.6" ;
		:history = "2019-10-03 16:05:54 GMT by grib_to_netcdf-2.10.0: /opt/ecmwf/eccodes/bin/grib_to_netcdf -o /cache/data5/adaptor.mars.internal-1570117777.9045198-23871-11-c8564b6f-4db5-48d8-beab-ba9fef91d4e8.nc /cache/tmp/c8564b6f-4db5-48d8-beab-ba9fef91d4e8-adaptor.mars.internal-1570117777.905033-23871-3-tmp.grib" ;
		:_Format = "64-bit offset" ;
}
```

@shoyer: thanks for the tip; I think that simply adding more data-loading threads is indeed the best solution.

---

**[@crusaderky](https://github.com/pydata/xarray/issues/3386#issuecomment-540474492), 2019-10-10T09:05:21Z:**

@sipposip, if your dask graph is resolved straight after the load from disk, you can try disabling the dask optimizer to see if you can squeeze some milliseconds out of `load()`. You can look up the setting syntax in the dask documentation.
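A minimal sketch of one way to skip dask's graph-optimization pass, per the suggestion above. The file pattern and chunking below are placeholders, not taken from the thread; the sketch relies on `Dataset.load` forwarding its extra keyword arguments to `dask.compute`, which accepts `optimize_graph`:

```python
import xarray as xr

# Placeholder file pattern and chunking (not from the thread).
ds = xr.open_mfdataset("era5_*.nc", chunks={"time": 1})

# Dataset.load() passes extra keyword arguments on to dask.compute;
# optimize_graph=False skips the graph-optimization pass for this call.
ds.load(optimize_graph=False)
```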
---

**[@shoyer](https://github.com/pydata/xarray/issues/3386#issuecomment-540208420), 2019-10-09T21:28:48Z:**

`netCDF4.MFDataset` works on a much more restricted set of netCDF files than `xarray.open_mfdataset`. I'm not surprised it's a little bit faster, but I'm not sure it's worth the maintenance burden of supporting this separate code path, and making a fully featured version of `open_mfdataset` without dask would be challenging. Can you simply add more threads in TensorFlow/Keras for loading the data? My other suggestion is to pre-shuffle the data on disk, so you don't need random access inside your training loop (see the sketch at the end of this thread).

---

**[@dcherian](https://github.com/pydata/xarray/issues/3386#issuecomment-540033550), 2019-10-09T14:43:29Z:**

It would be useful to see what a single file looks like and what the combined dataset looks like. `open_mfdataset` can sometimes require some tuning to get good performance.

---

**[@sipposip](https://github.com/pydata/xarray/issues/3386#issuecomment-539916279), 2019-10-09T09:20:06Z:**

Setting `dask.config.set(scheduler="synchronous")` globally indeed resolved the threading issues, thanks. However, loading and preprocessing a single timeslice of data is ~40% slower with dask and `open_mfdataset` (with `chunks={'time': 1}`) compared to `netCDF4.MFDataset`. Is this expected, or a known issue? If not, I can try to create a minimal reproducible example.

---

**[@crusaderky](https://github.com/pydata/xarray/issues/3386#issuecomment-539907822), 2019-10-09T08:58:21Z:**

@sipposip, xarray doesn't use `netCDF4.MFDataset`, but `netCDF4.Dataset`, which is wrapped by dask arrays that are then concatenated.

> Opening each file separately with open_dataset, and then concatenating them with xr.concat does not work, as this loads the data into memory.

This is by design, for the reason above. NetCDF/HDF5 lazy loading means that data is loaded into a `numpy.ndarray` on the first operation performed upon it, and that includes concatenation.

I'm aware that threads within threads, threads within processes, and processes within threads cause a world of pain in the form of random deadlocks; I've been there myself. You can completely disable dask threads process-wide with

```python
dask.config.set(scheduler="synchronous")
...
ds.load()
```

or as a context manager:

```python
with dask.config.set(scheduler="synchronous"):
    ds.load()
```

or for a single operation:

```python
ds.load(scheduler="synchronous")
```

Does this address your issue?
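As a coda, a minimal sketch of the pre-shuffling idea @shoyer suggests above. The input pattern, seed, and output name are placeholders; the point is that one random permutation along `time`, materialized back to disk, turns random access during training into sequential reads:

```python
import numpy as np
import xarray as xr

# Placeholder input pattern and output name (not from the thread).
ds = xr.open_mfdataset("era5_mean_sea_level_pressure_*.nc", chunks={"time": 1})

# Draw one random permutation of the time axis and write the reordered
# dataset back out; a one-off cost paid before the training loop starts.
rng = np.random.default_rng(0)
order = rng.permutation(ds.sizes["time"])
ds.isel(time=order).to_netcdf("era5_msl_shuffled.nc")
```

For multi-year hourly data this rewrite is large, but it happens once, outside the training loop, rather than on every epoch.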