html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/pull/1983#issuecomment-382487555,https://api.github.com/repos/pydata/xarray/issues/1983,382487555,MDEyOklzc3VlQ29tbWVudDM4MjQ4NzU1NQ==,2443309,2018-04-18T18:38:47Z,2018-04-18T18:38:47Z,MEMBER,"With my last commits here, this feature is completely optional and defaults to the current behavior. I cleaned up the tests a bit further and am now ready to merge this. Barring any objections, I'll merge this on Friday.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-382157273,https://api.github.com/repos/pydata/xarray/issues/1983,382157273,MDEyOklzc3VlQ29tbWVudDM4MjE1NzI3Mw==,2443309,2018-04-17T21:41:03Z,2018-04-17T21:41:03Z,MEMBER,"I think that makes sense for now. We need to experiment with this a bit more, but I don't see a problem merging the basic workflow we have now (with a minor change to the default behavior).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-382154051,https://api.github.com/repos/pydata/xarray/issues/1983,382154051,MDEyOklzc3VlQ29tbWVudDM4MjE1NDA1MQ==,1217238,2018-04-17T21:30:53Z,2018-04-17T21:30:53Z,MEMBER,It sounds like the right resolution for now would be to leave the default as `parallel=False` and leave this as an optional feature.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-382146851,https://api.github.com/repos/pydata/xarray/issues/1983,382146851,MDEyOklzc3VlQ29tbWVudDM4MjE0Njg1MQ==,2443309,2018-04-17T21:08:29Z,2018-04-17T21:08:29Z,MEMBER,"@NicWayand - Thanks for giving this a go. Some thoughts on your problem... I have been using this feature for the past few days and have been seeing a speedup on datasets with many files, along the lines of what I showed above. I am running my tests on perhaps the perfect architecture for this (parallel shared filesystem, fast interconnect, etc.). I think there are many reasons/cases where this won't work as well.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-382071801,https://api.github.com/repos/pydata/xarray/issues/1983,382071801,MDEyOklzc3VlQ29tbWVudDM4MjA3MTgwMQ==,1117224,2018-04-17T17:14:33Z,2018-04-17T17:38:42Z,NONE,"Thanks @jhamman for working on this! I did a test on my real-world data (1202 ~3 MB files) on my local computer and am not getting the results I expected:

1) No speedup with `parallel=True`.
2) A _slow down_ when using distributed (processes=16, cores=16).

Am I missing something?

```python
nc_files = glob.glob(E.obs['NSIDC_0081']['sipn_nc']+'/*.nc')
print(len(nc_files))
1202

# Parallel False
%time ds = xr.open_mfdataset(nc_files, concat_dim='time', parallel=False, autoclose=True)
CPU times: user 57.8 s, sys: 3.2 s, total: 1min 1s
Wall time: 1min

# Parallel True with default scheduler
%time ds = xr.open_mfdataset(nc_files, concat_dim='time', parallel=True, autoclose=True)
CPU times: user 1min 16s, sys: 9.82 s, total: 1min 26s
Wall time: 1min 16s

# Parallel True with distributed
from dask.distributed import Client
client = Client()
print(client)
%time ds = xr.open_mfdataset(nc_files, concat_dim='time', parallel=True, autoclose=True)
CPU times: user 2min 17s, sys: 12.3 s, total: 2min 29s
Wall time: 3min 48s
```

On branch feature/parallel_open_netcdf at commit 280a46f13426a462fb3e983cfd5ac7a0565d1826.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-381277673,https://api.github.com/repos/pydata/xarray/issues/1983,381277673,MDEyOklzc3VlQ29tbWVudDM4MTI3NzY3Mw==,2443309,2018-04-13T22:42:59Z,2018-04-13T22:42:59Z,MEMBER,"@rabernat - I got the tests passing here again. If you can make the time to try your example/test again, it would be great to figure out what wasn't working before.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-380257320,https://api.github.com/repos/pydata/xarray/issues/1983,380257320,MDEyOklzc3VlQ29tbWVudDM4MDI1NzMyMA==,2443309,2018-04-10T21:44:28Z,2018-04-10T21:45:02Z,MEMBER,"@rabernat - I just pushed a few more commits here. Can I ask two questions?

When using the distributed scheduler, what configuration are you using? Can you try:

- `autoclose=True` (in `open_mfdataset`)
- `processes=True` (in the client)

If this turns out to be a corner case with the distributed scheduler, I can add an integration test for that specific use case.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
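A minimal sketch of the configuration suggested in the comment above, assuming a local `dask.distributed` cluster and this PR's `parallel` keyword; the glob pattern is hypothetical:

```python
import xarray as xr
from dask.distributed import Client

# Process-based workers keep the netCDF4/HDF5 C libraries, which are not
# thread-safe, from being entered concurrently within a single process.
client = Client(processes=True)

# autoclose=True releases each file handle after reading, so many small
# files do not exhaust the per-process open-file limit.
ds = xr.open_mfdataset('data/*.nc', autoclose=True, parallel=True)
```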
https://github.com/pydata/xarray/pull/1983#issuecomment-380150362,https://api.github.com/repos/pydata/xarray/issues/1983,380150362,MDEyOklzc3VlQ29tbWVudDM4MDE1MDM2Mg==,2443309,2018-04-10T15:49:06Z,2018-04-10T15:49:06Z,MEMBER,"@rabernat - my last commit(s) seem to have broken the CI, so I'll need to revisit this.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-380121937,https://api.github.com/repos/pydata/xarray/issues/1983,380121937,MDEyOklzc3VlQ29tbWVudDM4MDEyMTkzNw==,1197350,2018-04-10T14:32:25Z,2018-04-10T14:32:25Z,MEMBER,"I recently tried this branch with my data server and got an error. I opened a dataset this way

```python
# works fine with parallel=False
ds = xr.open_mfdataset(os.path.join(ddir, '*V1_1.204*.nc'), decode_cf=False, parallel=True)
```

and got the following error.

```
distributed.utils - ERROR - NetCDF: HDF error
Traceback (most recent call last):
  File ""/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/distributed/utils.py"", line 237, in f
    result[0] = yield make_coro()
  File ""/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/tornado/gen.py"", line 1055, in run
    value = future.result()
  File ""/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/tornado/concurrent.py"", line 238, in result
    raise_exc_info(self._exc_info)
  File ""<string>"", line 4, in raise_exc_info
  File ""/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/tornado/gen.py"", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File ""/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/distributed/client.py"", line 1356, in _gather
    traceback)
  File ""/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/six.py"", line 692, in reraise
    raise value.with_traceback(tb)
  File ""/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/dask/compatibility.py"", line 48, in apply
    return func(*args, **kwargs)
  File ""/home/rpa/xarray/xarray/backends/api.py"", line 318, in open_dataset
    return maybe_decode_store(store, lock)
  File ""/home/rpa/xarray/xarray/backends/api.py"", line 238, in maybe_decode_store
    drop_variables=drop_variables)
  File ""/home/rpa/xarray/xarray/conventions.py"", line 594, in decode_cf
    vars, attrs = obj.load()
  File ""/home/rpa/xarray/xarray/backends/common.py"", line 217, in load
    for k, v in self.get_variables().items())
  File ""/home/rpa/xarray/xarray/backends/netCDF4_.py"", line 319, in get_variables
    iteritems(self.ds.variables))
  File ""/home/rpa/xarray/xarray/core/utils.py"", line 308, in FrozenOrderedDict
    return Frozen(OrderedDict(*args, **kwargs))
  File ""/home/rpa/xarray/xarray/backends/netCDF4_.py"", line 318, in <genexpr>
    for k, v in
  File ""/home/rpa/xarray/xarray/backends/netCDF4_.py"", line 311, in open_store_variable
    encoding['original_shape'] = var.shape
  File ""netCDF4/_netCDF4.pyx"", line 3381, in netCDF4._netCDF4.Variable.shape.__get__ (netCDF4/_netCDF4.c:34388)
  File ""netCDF4/_netCDF4.pyx"", line 2759, in netCDF4._netCDF4.Dimension.__len__ (netCDF4/_netCDF4.c:27006)
RuntimeError: NetCDF: HDF error
```

Without the distributed scheduler (but with `parallel=True`), I get no error, but the command never returns, and eventually I have to restart the kernel. Any idea what could be going on?

(Sorry for the non-reproducible bug report... I figured some trials ""in the field"" might be useful.)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
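A hedged sketch, not from this thread, of one way to probe the threaded-scheduler hang described above: force dask's process-based local scheduler for the open, so the netCDF4/HDF5 libraries are never entered from multiple threads of one process (the `scheduler` keyword is the modern `dask.config` spelling and is an assumption here):

```python
import dask
import xarray as xr

# If the threaded scheduler deadlocks but processes succeed, the hang
# points at thread-safety in the underlying HDF5/netCDF4 stack.
with dask.config.set(scheduler='processes'):
    ds = xr.open_mfdataset('data/*.nc', parallel=True)
```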
,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831 https://github.com/pydata/xarray/pull/1983#issuecomment-379305062,https://api.github.com/repos/pydata/xarray/issues/1983,379305062,MDEyOklzc3VlQ29tbWVudDM3OTMwNTA2Mg==,1197350,2018-04-06T16:24:22Z,2018-04-06T16:24:22Z,MEMBER,Can we imagine cases where it might actually *degrade* performance?,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831 https://github.com/pydata/xarray/pull/1983#issuecomment-379304351,https://api.github.com/repos/pydata/xarray/issues/1983,379304351,MDEyOklzc3VlQ29tbWVudDM3OTMwNDM1MQ==,1217238,2018-04-06T16:21:51Z,2018-04-06T16:21:51Z,MEMBER,My reason for suggesting default `parallel=True` when using distributed is default to turning this feature on when we can expect it will probably improve performance.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831 https://github.com/pydata/xarray/pull/1983#issuecomment-379303753,https://api.github.com/repos/pydata/xarray/issues/1983,379303753,MDEyOklzc3VlQ29tbWVudDM3OTMwMzc1Mw==,2443309,2018-04-06T16:19:35Z,2018-04-06T16:19:35Z,MEMBER,"> I'm curious about the logic of defaulting to parallel when using distributed. I'm not tied to the behavior. It was [suggested](https://github.com/pydata/xarray/pull/1983#discussion_r173990300) by @shoyer a while back. Perhaps we try this and evaluate how it works in the wild? ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831 https://github.com/pydata/xarray/pull/1983#issuecomment-376689828,https://api.github.com/repos/pydata/xarray/issues/1983,376689828,MDEyOklzc3VlQ29tbWVudDM3NjY4OTgyOA==,2443309,2018-03-27T21:59:35Z,2018-03-27T21:59:35Z,MEMBER,"> Have you tested this with both a local system and an HPC cluster? I have. See below for a simple example using this feature on [Cheyenne](https://www2.cisl.ucar.edu/resources/computational-systems/cheyenne). ```python In [1]: import xarray as xr ...: ...: import glob ...: In [2]: pattern = '/glade/u/home/jhamman/workdir/LOCA_daily/met_data/CESM1-BGC/16th/rcp45/r1i1p1/*/*nc' In [3]: len(glob.glob(pattern)) Out[3]: 285 In [4]: %time ds = xr.open_mfdataset(pattern) CPU times: user 15.5 s, sys: 2.62 s, total: 18.1 s Wall time: 42.4 s In [5]: ds.close() In [6]: %time ds = xr.open_mfdataset(pattern, parallel=True) CPU times: user 18.4 s, sys: 5.28 s, total: 23.6 s Wall time: 30.7 s In [7]: ds.close() In [8]: from dask.distributed import Client In [9]: client = Client() clien In [10]: client Out[10]: In [11]: %time ds = xr.open_mfdataset(pattern, parallel=True, autoclose=True) CPU times: user 10.8 s, sys: 808 ms, total: 11.6 s Wall time: 12.4 s ```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831 https://github.com/pydata/xarray/pull/1983#issuecomment-375799794,https://api.github.com/repos/pydata/xarray/issues/1983,375799794,MDEyOklzc3VlQ29tbWVudDM3NTc5OTc5NA==,2443309,2018-03-23T21:12:33Z,2018-03-23T21:12:33Z,MEMBER,"> I'm tempted to just skip this test there but thought I should ask for help first... I've skipped the offending test on appveyor for now. Objectors speak up please. 
https://github.com/pydata/xarray/pull/1983#issuecomment-379303753,https://api.github.com/repos/pydata/xarray/issues/1983,379303753,MDEyOklzc3VlQ29tbWVudDM3OTMwMzc1Mw==,2443309,2018-04-06T16:19:35Z,2018-04-06T16:19:35Z,MEMBER,"> I'm curious about the logic of defaulting to parallel when using distributed.

I'm not tied to the behavior. It was [suggested](https://github.com/pydata/xarray/pull/1983#discussion_r173990300) by @shoyer a while back. Perhaps we should try this and evaluate how it works in the wild?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-376689828,https://api.github.com/repos/pydata/xarray/issues/1983,376689828,MDEyOklzc3VlQ29tbWVudDM3NjY4OTgyOA==,2443309,2018-03-27T21:59:35Z,2018-03-27T21:59:35Z,MEMBER,"> Have you tested this with both a local system and an HPC cluster?

I have. See below for a simple example using this feature on [Cheyenne](https://www2.cisl.ucar.edu/resources/computational-systems/cheyenne).

```python
In [1]: import xarray as xr
   ...: import glob

In [2]: pattern = '/glade/u/home/jhamman/workdir/LOCA_daily/met_data/CESM1-BGC/16th/rcp45/r1i1p1/*/*nc'

In [3]: len(glob.glob(pattern))
Out[3]: 285

In [4]: %time ds = xr.open_mfdataset(pattern)
CPU times: user 15.5 s, sys: 2.62 s, total: 18.1 s
Wall time: 42.4 s

In [5]: ds.close()

In [6]: %time ds = xr.open_mfdataset(pattern, parallel=True)
CPU times: user 18.4 s, sys: 5.28 s, total: 23.6 s
Wall time: 30.7 s

In [7]: ds.close()

In [8]: from dask.distributed import Client

In [9]: client = Client()

In [10]: client
Out[10]:

In [11]: %time ds = xr.open_mfdataset(pattern, parallel=True, autoclose=True)
CPU times: user 10.8 s, sys: 808 ms, total: 11.6 s
Wall time: 12.4 s
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-375799794,https://api.github.com/repos/pydata/xarray/issues/1983,375799794,MDEyOklzc3VlQ29tbWVudDM3NTc5OTc5NA==,2443309,2018-03-23T21:12:33Z,2018-03-23T21:12:33Z,MEMBER,"> I'm tempted to just skip this test there but thought I should ask for help first...

I've skipped the offending test on AppVeyor for now. Objectors, speak up please. I don't have a Windows machine to test on, and iterating via AppVeyor is not something a sane person does 😉.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-373245814,https://api.github.com/repos/pydata/xarray/issues/1983,373245814,MDEyOklzc3VlQ29tbWVudDM3MzI0NTgxNA==,2443309,2018-03-15T03:05:08Z,2018-03-15T03:05:08Z,MEMBER,"If anyone understands Windows file handling with Python, I'm all ears as to why this is failing on AppVeyor. I'm tempted to just skip this test there, but thought I should ask for help first...","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-372807932,https://api.github.com/repos/pydata/xarray/issues/1983,372807932,MDEyOklzc3VlQ29tbWVudDM3MjgwNzkzMg==,2443309,2018-03-13T20:30:49Z,2018-03-13T20:30:49Z,MEMBER,@shoyer - I updated this to use dask.delayed. I actually like it more because I only have to call compute once. Thanks for the suggestion.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
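A minimal sketch of the `dask.delayed` pattern mentioned in the last comment, with a single `compute` call covering all of the file opens; the file list and the concat dimension are hypothetical, and this mirrors the idea rather than the PR's actual code:

```python
import glob

import dask
import xarray as xr

paths = sorted(glob.glob('data/*.nc'))  # hypothetical file list

# Wrap each open in a delayed task; nothing touches the files yet.
delayed_opens = [dask.delayed(xr.open_dataset)(p) for p in paths]

# One compute() call materializes every open at once, letting the active
# scheduler (threads, processes, or distributed) parallelize the I/O.
datasets = dask.compute(*delayed_opens)

ds = xr.concat(datasets, dim='time')
```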