Comments on pydata/xarray issue #1981 (issue id 304201107), oldest first.

https://github.com/pydata/xarray/issues/1981#issuecomment-372195137
user 1217238 (MEMBER), 2018-03-12T05:09:16Z:
I think this is definitely worth exploring and could potentially be a large win. One potential challenge is global locking with HDF5. If opening many datasets is slow because much data needs to get read with HDF5, then multiple threads will not help -- you'll need to use multiple processes, e.g., with dask-distributed.

https://github.com/pydata/xarray/issues/1981#issuecomment-372316094
user 2443309 (MEMBER), 2018-03-12T13:51:07Z:
@shoyer - we can sidestep the global HDF lock if we use multiprocessing (or the distributed scheduler, as you mentioned) together with the `autoclose` option. This is the approach I took during my initial tests. It would be great if we could use the threading library too, but that seems less applicable given the current state of the HDF library.

https://github.com/pydata/xarray/issues/1981#issuecomment-373794415
user 6181563 (CONTRIBUTOR), 2018-03-16T17:53:44Z:
For what it's worth, this is exactly the workflow I use (https://github.com/OceansAus/cosima-cookbook) when opening a large number of netCDF files:

    bag = dask.bag.from_sequence(ncfiles)
    load_variable = lambda ncfile: xr.open_dataset(ncfile, chunks=chunks, decode_times=False)[variables]
    bag = bag.map(load_variable)
    dataarrays = bag.compute()

and then

    dataarray = xr.concat(dataarrays, dim='time', coords='all')

and it appears to work well. The code snippets are from cosima-cookbook/cosima_cookbook/netcdf_index.py.
Reactions: +1 x2, hooray x1

https://github.com/pydata/xarray/issues/1981#issuecomment-373802503
user 2443309 (MEMBER), 2018-03-16T18:21:20Z:
@jmunroe - this is good to know. Have you been using the default scheduler (multiprocessing for dask.bag) or the distributed scheduler?

https://github.com/pydata/xarray/issues/1981#issuecomment-373806224
user 6181563 (CONTRIBUTOR), 2018-03-16T18:34:19Z:
distributed
Reactions: +1 x2
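
For reference, a minimal runnable sketch of the dask.bag workflow quoted above, with an optional dask.distributed client as discussed in the scheduler question. The file list, chunking, and variable subset (`ncfiles`, `chunks`, `variables`) are hypothetical placeholders rather than values from the thread, and this illustrates the approach only; it is not the cosima-cookbook implementation itself.

    # Sketch only: open many netCDF files in parallel with dask.bag, then concatenate.
    # ncfiles, chunks, and variables below are hypothetical placeholders.
    import dask.bag
    import xarray as xr
    from dask.distributed import Client

    client = Client()  # distributed scheduler; omit to use dask's default scheduler

    ncfiles = ['ocean_000.nc', 'ocean_001.nc']  # placeholder file list
    chunks = {'time': 1}                        # placeholder chunking
    variables = ['temp']                        # placeholder variable subset

    def load_variable(ncfile):
        # Open one file lazily and keep only the variables of interest.
        return xr.open_dataset(ncfile, chunks=chunks, decode_times=False)[variables]

    bag = dask.bag.from_sequence(ncfiles)
    datasets = bag.map(load_variable).compute()  # files are opened in parallel tasks

    # Stitch the per-file datasets back together along the record dimension.
    combined = xr.concat(datasets, dim='time', coords='all')

Whether the default scheduler (multiprocessing for dask.bag) or a distributed client is faster will depend on the setup; the contributor above reports using the distributed scheduler.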