html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/pull/1983#issuecomment-382487555,https://api.github.com/repos/pydata/xarray/issues/1983,382487555,MDEyOklzc3VlQ29tbWVudDM4MjQ4NzU1NQ==,2443309,2018-04-18T18:38:47Z,2018-04-18T18:38:47Z,MEMBER,"With my last commits here, this feature is completely optional and defaults to the current behavior. I cleaned up the tests a bit further and am now ready to merge this. Barring any objections, I'll merge this on Friday.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-382157273,https://api.github.com/repos/pydata/xarray/issues/1983,382157273,MDEyOklzc3VlQ29tbWVudDM4MjE1NzI3Mw==,2443309,2018-04-17T21:41:03Z,2018-04-17T21:41:03Z,MEMBER,I think that makes sense for now. We need to experiment with this a bit more but I don't see a problem merging the basic workflow we have now (with a minor change to the default behavior). ,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-382154051,https://api.github.com/repos/pydata/xarray/issues/1983,382154051,MDEyOklzc3VlQ29tbWVudDM4MjE1NDA1MQ==,1217238,2018-04-17T21:30:53Z,2018-04-17T21:30:53Z,MEMBER,It sounds like the right resolution for now would be to leave the default as `parallel=False` and leave this as an optional feature.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-382146851,https://api.github.com/repos/pydata/xarray/issues/1983,382146851,MDEyOklzc3VlQ29tbWVudDM4MjE0Njg1MQ==,2443309,2018-04-17T21:08:29Z,2018-04-17T21:08:29Z,MEMBER,"@NicWayand - Thanks for giving this a go. Some thoughts on your problem...
I have been using this feature for the past few days and have been seeing a speedup on datasets with many files, along the lines of what I showed above. I am running my tests on perhaps the perfect test architecture (parallel shared fs, fast interconnect, etc.), so I think there are many cases where this won't work as well.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-382071801,https://api.github.com/repos/pydata/xarray/issues/1983,382071801,MDEyOklzc3VlQ29tbWVudDM4MjA3MTgwMQ==,1117224,2018-04-17T17:14:33Z,2018-04-17T17:38:42Z,NONE,"Thanks @jhamman for working on this! I did a test on my real-world data (1202 ~3 MB files) on my local computer and am not getting the results I expected:
1) No speed up with parallel=True
2) _Slow down_ when using distributed (processes=16 cores=16).
Am I missing something?
```python
nc_files = glob.glob(E.obs['NSIDC_0081']['sipn_nc']+'/*.nc')
print(len(nc_files))
1202
# Parallel False
%time ds = xr.open_mfdataset(nc_files, concat_dim='time', parallel=False, autoclose=True)
CPU times: user 57.8 s, sys: 3.2 s, total: 1min 1s
Wall time: 1min
# Parallel True with default scheduler
%time ds = xr.open_mfdataset(nc_files, concat_dim='time', parallel=True, autoclose=True)
CPU times: user 1min 16s, sys: 9.82 s, total: 1min 26s
Wall time: 1min 16s
# Parallel True with distributed
from dask.distributed import Client
client = Client()
print(client)
%time ds = xr.open_mfdataset(nc_files, concat_dim='time', parallel=True, autoclose=True)
CPU times: user 2min 17s, sys: 12.3 s, total: 2min 29s
Wall time: 3min 48s
```
On feature/parallel_open_netcdf commit 280a46f13426a462fb3e983cfd5ac7a0565d1826","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-381277673,https://api.github.com/repos/pydata/xarray/issues/1983,381277673,MDEyOklzc3VlQ29tbWVudDM4MTI3NzY3Mw==,2443309,2018-04-13T22:42:59Z,2018-04-13T22:42:59Z,MEMBER,"@rabernat - I got the tests passing here again. If you can make the time to try your example/test again, it would be great to figure out what wasn't working before. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-380257320,https://api.github.com/repos/pydata/xarray/issues/1983,380257320,MDEyOklzc3VlQ29tbWVudDM4MDI1NzMyMA==,2443309,2018-04-10T21:44:28Z,2018-04-10T21:45:02Z,MEMBER,"@rabernat - I just pushed a few more commits here. Can I ask two questions:
When using the distributed scheduler, what configuration are you using? Can you try:
- `autoclose=True` (in open_mfdataset)
- `processes=True` (in client)
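For concreteness, a minimal sketch of the configuration I have in mind (the file pattern is hypothetical):
```python
# Hypothetical sketch of the suggested configuration: process-based
# distributed workers plus autoclose on the xarray side.
from dask.distributed import Client

import xarray as xr

client = Client(processes=True)  # process-based workers rather than threads
ds = xr.open_mfdataset('path/to/files/*.nc', parallel=True, autoclose=True)
```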
If this turns out to be a corner case with the distributed scheduler, I can add an integration test for that specific use case.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-380150362,https://api.github.com/repos/pydata/xarray/issues/1983,380150362,MDEyOklzc3VlQ29tbWVudDM4MDE1MDM2Mg==,2443309,2018-04-10T15:49:06Z,2018-04-10T15:49:06Z,MEMBER,@rabernat - my last commit(s) seem to have broken the CI so I'll need to revisit this.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-380121937,https://api.github.com/repos/pydata/xarray/issues/1983,380121937,MDEyOklzc3VlQ29tbWVudDM4MDEyMTkzNw==,1197350,2018-04-10T14:32:25Z,2018-04-10T14:32:25Z,MEMBER,"I recently tried this branch with my data server and got an error.
I opened a dataset this way
```python
# works fine with parallel=False
ds = xr.open_mfdataset(os.path.join(ddir, '*V1_1.204*.nc'), decode_cf=False, parallel=True)
```
and got the following error.
```
distributed.utils - ERROR - NetCDF: HDF error
Traceback (most recent call last):
File ""/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/distributed/utils.py"", line 237, in f
result[0] = yield make_coro()
File ""/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/tornado/gen.py"", line 1055, in run
value = future.result()
File ""/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/tornado/concurrent.py"", line 238, in result
raise_exc_info(self._exc_info)
File """", line 4, in raise_exc_info
File ""/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/tornado/gen.py"", line 1063, in run
yielded = self.gen.throw(*exc_info)
File ""/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/distributed/client.py"", line 1356, in _gather
traceback)
File ""/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/six.py"", line 692, in reraise
raise value.with_traceback(tb)
File ""/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/dask/compatibility.py"", line 48, in apply
return func(*args, **kwargs)
File ""/home/rpa/xarray/xarray/backends/api.py"", line 318, in open_dataset
return maybe_decode_store(store, lock)
File ""/home/rpa/xarray/xarray/backends/api.py"", line 238, in maybe_decode_store
drop_variables=drop_variables)
File ""/home/rpa/xarray/xarray/conventions.py"", line 594, in decode_cf
vars, attrs = obj.load()
File ""/home/rpa/xarray/xarray/backends/common.py"", line 217, in load
for k, v in self.get_variables().items())
File ""/home/rpa/xarray/xarray/backends/netCDF4_.py"", line 319, in get_variables
iteritems(self.ds.variables))
File ""/home/rpa/xarray/xarray/core/utils.py"", line 308, in FrozenOrderedDict
return Frozen(OrderedDict(*args, **kwargs))
File ""/home/rpa/xarray/xarray/backends/netCDF4_.py"", line 318, in
for k, v in
File ""/home/rpa/xarray/xarray/backends/netCDF4_.py"", line 311, in open_store_variable
encoding['original_shape'] = var.shape
File ""netCDF4/_netCDF4.pyx"", line 3381, in netCDF4._netCDF4.Variable.shape.__get__ (netCDF4/_netCDF4.c:34388)
File ""netCDF4/_netCDF4.pyx"", line 2759, in netCDF4._netCDF4.Dimension.__len__ (netCDF4/_netCDF4.c:27006)
RuntimeError: NetCDF: HDF error
```
Without the distributed scheduler (but with `parallel=True`), I get no error, but the command never returns, and eventually I have to restart the kernel.
Any idea what could be going on? (Sorry for the non-reproducible bug report...I figured some trials ""in the field"" might be useful.)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-379323343,https://api.github.com/repos/pydata/xarray/issues/1983,379323343,MDEyOklzc3VlQ29tbWVudDM3OTMyMzM0Mw==,2443309,2018-04-06T17:33:45Z,2018-04-06T17:33:45Z,MEMBER,All the tests are passing here. Any final objections?,"{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-379306351,https://api.github.com/repos/pydata/xarray/issues/1983,379306351,MDEyOklzc3VlQ29tbWVudDM3OTMwNjM1MQ==,2443309,2018-04-06T16:29:15Z,2018-04-06T16:29:15Z,MEMBER,I imagine there will be a small performance cost when the number of files is small. That cost is probably lost in the noise of most I/O operations.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-379305062,https://api.github.com/repos/pydata/xarray/issues/1983,379305062,MDEyOklzc3VlQ29tbWVudDM3OTMwNTA2Mg==,1197350,2018-04-06T16:24:22Z,2018-04-06T16:24:22Z,MEMBER,Can we imagine cases where it might actually *degrade* performance?,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-379304351,https://api.github.com/repos/pydata/xarray/issues/1983,379304351,MDEyOklzc3VlQ29tbWVudDM3OTMwNDM1MQ==,1217238,2018-04-06T16:21:51Z,2018-04-06T16:21:51Z,MEMBER,"My reason for suggesting a default of `parallel=True` when using distributed is to turn this feature on by default in the cases where we can expect it will probably improve performance.
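As a purely illustrative sketch (not necessarily the logic in this PR), the default could be derived from whether a distributed client is active:
```python
# Hypothetical helper, not the PR's actual code: default to parallel opens
# only when a dask.distributed client is already running.
def _default_parallel():
    try:
        from dask.distributed import get_client
        get_client()  # raises ValueError when no client is active
        return True
    except (ImportError, ValueError):
        return False
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831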
https://github.com/pydata/xarray/pull/1983#issuecomment-379303753,https://api.github.com/repos/pydata/xarray/issues/1983,379303753,MDEyOklzc3VlQ29tbWVudDM3OTMwMzc1Mw==,2443309,2018-04-06T16:19:35Z,2018-04-06T16:19:35Z,MEMBER,"> I'm curious about the logic of defaulting to parallel when using distributed.
I'm not tied to the behavior. It was [suggested](https://github.com/pydata/xarray/pull/1983#discussion_r173990300) by @shoyer a while back. Perhaps we could try this and evaluate how it works in the wild?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-376689828,https://api.github.com/repos/pydata/xarray/issues/1983,376689828,MDEyOklzc3VlQ29tbWVudDM3NjY4OTgyOA==,2443309,2018-03-27T21:59:35Z,2018-03-27T21:59:35Z,MEMBER,"> Have you tested this with both a local system and an HPC cluster?
I have. See below for a simple example using this feature on [Cheyenne](https://www2.cisl.ucar.edu/resources/computational-systems/cheyenne).
```python
In [1]: import xarray as xr
...:
...: import glob
...:
In [2]: pattern = '/glade/u/home/jhamman/workdir/LOCA_daily/met_data/CESM1-BGC/16th/rcp45/r1i1p1/*/*nc'
In [3]: len(glob.glob(pattern))
Out[3]: 285
In [4]: %time ds = xr.open_mfdataset(pattern)
CPU times: user 15.5 s, sys: 2.62 s, total: 18.1 s
Wall time: 42.4 s
In [5]: ds.close()
In [6]: %time ds = xr.open_mfdataset(pattern, parallel=True)
CPU times: user 18.4 s, sys: 5.28 s, total: 23.6 s
Wall time: 30.7 s
In [7]: ds.close()
In [8]: from dask.distributed import Client
In [9]: client = Client()
In [10]: client
Out[10]:
In [11]: %time ds = xr.open_mfdataset(pattern, parallel=True, autoclose=True)
CPU times: user 10.8 s, sys: 808 ms, total: 11.6 s
Wall time: 12.4 s
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-375799794,https://api.github.com/repos/pydata/xarray/issues/1983,375799794,MDEyOklzc3VlQ29tbWVudDM3NTc5OTc5NA==,2443309,2018-03-23T21:12:33Z,2018-03-23T21:12:33Z,MEMBER,"> I'm tempted to just skip this test there but thought I should ask for help first...
I've skipped the offending test on appveyor for now. Objectors, speak up please. I don't have a windows machine to test on, and iterating via appveyor is not something a sane person does 😉.
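For the record, the skip is the standard pytest platform guard, along these lines (the test name here is hypothetical):
```python
import sys

import pytest

# Hypothetical test name; illustrates the platform-conditional skip only.
@pytest.mark.skipif(sys.platform == 'win32',
                    reason='flaky file handling on Windows/AppVeyor')
def test_open_mfdataset_parallel():
    ...
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831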
https://github.com/pydata/xarray/pull/1983#issuecomment-373245814,https://api.github.com/repos/pydata/xarray/issues/1983,373245814,MDEyOklzc3VlQ29tbWVudDM3MzI0NTgxNA==,2443309,2018-03-15T03:05:08Z,2018-03-15T03:05:08Z,MEMBER,"If anyone understands Windows file handling with Python, I'm all ears as to why this is failing on AppVeyor. I'm tempted to just skip this test there but thought I should ask for help first...","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831
https://github.com/pydata/xarray/pull/1983#issuecomment-372807932,https://api.github.com/repos/pydata/xarray/issues/1983,372807932,MDEyOklzc3VlQ29tbWVudDM3MjgwNzkzMg==,2443309,2018-03-13T20:30:49Z,2018-03-13T20:30:49Z,MEMBER,"@shoyer - I updated this to use dask.delayed. I actually like it more because I only have to call compute once. Thanks for the suggestion.
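For illustration, a minimal sketch of the dask.delayed pattern described here (simplified, with a hypothetical file pattern; not the exact code from this PR):
```python
# Sketch: wrap each open in dask.delayed, then materialize every file with a
# single compute call instead of one compute per file.
import glob

import dask
import xarray as xr

paths = sorted(glob.glob('data/*.nc'))  # hypothetical file pattern
delayed_open = dask.delayed(xr.open_dataset)
datasets = dask.compute(*[delayed_open(p) for p in paths])  # one compute for all opens
combined = xr.concat(datasets, dim='time')
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,304589831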