html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/2501#issuecomment-768470505,https://api.github.com/repos/pydata/xarray/issues/2501,768470505,MDEyOklzc3VlQ29tbWVudDc2ODQ3MDUwNQ==,2448579,2021-01-27T18:06:16Z,2021-01-27T18:06:16Z,MEMBER,I think this is stale now. See https://xarray.pydata.org/en/stable/io.html#reading-multi-file-datasets for the latest guidance on reading such datasets. Please open a new issue if you are still having trouble with `open_mfdataset`.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-512663861,https://api.github.com/repos/pydata/xarray/issues/2501,512663861,MDEyOklzc3VlQ29tbWVudDUxMjY2Mzg2MQ==,7799184,2019-07-18T04:51:06Z,2019-07-18T04:52:17Z,CONTRIBUTOR,"Hi guys, I'm having an issue that looks similar to @rsignell-usgs's. I'm trying to open 413 NetCDF files using `open_mfdataset` with `parallel=True`. The dataset (successfully opened with `parallel=False`) is ~300 GB on disk and looks like:
```ipython
In [1]: import xarray as xr
In [2]: dset = xr.open_mfdataset(""./bom-ww3/bom-ww3_*.nc"", chunks={'time': 744, 'latitude': 100, 'longitude': 100}, parallel=False)
In [3]: dset
Out[3]:
<xarray.Dataset>
Dimensions: (latitude: 190, longitude: 289, time: 302092)
Coordinates:
* longitude (longitude) float32 70.0 70.4 70.8 71.2 ... 184.4 184.8 185.2
* latitude (latitude) float32 -55.6 -55.2 -54.8 -54.4 ... 19.2 19.6 20.0
* time (time) datetime64[ns] 1979-01-01 ... 2013-05-31T23:00:00.000013440
Data variables:
hs (time, latitude, longitude) float32 dask.array
fp (time, latitude, longitude) float32 dask.array
dp (time, latitude, longitude) float32 dask.array
wl (time, latitude, longitude) float32 dask.array
U10 (time, latitude, longitude) float32 dask.array
V10 (time, latitude, longitude) float32 dask.array
hs1 (time, latitude, longitude) float32 dask.array
hs2 (time, latitude, longitude) float32 dask.array
tp1 (time, latitude, longitude) float32 dask.array
tp2 (time, latitude, longitude) float32 dask.array
lp0 (time, latitude, longitude) float32 dask.array
lp1 (time, latitude, longitude) float32 dask.array
lp2 (time, latitude, longitude) float32 dask.array
th0 (time, latitude, longitude) float32 dask.array
th1 (time, latitude, longitude) float32 dask.array
th2 (time, latitude, longitude) float32 dask.array
hs0 (time, latitude, longitude) float32 dask.array
tp0 (time, latitude, longitude) float32 dask.array
```
Trying to read it in a standard Python session gives me a core dump:
```ipython
In [1]: import xarray as xr
In [2]: dset = xr.open_mfdataset(""./bom-ww3/bom-ww3_*.nc"", chunks={'time': 744, 'latitude': 100, 'longitude': 100}, parallel=True)
Bus error (core dumped)
```
Trying to read it on a dask cluster, I get:
```ipython
In [1]: from dask.distributed import Client
In [2]: import xarray as xr
In [3]: client = Client()
In [4]: dset = xr.open_mfdataset(""./bom-ww3/bom-ww3_*.nc"", chunks={'time': 744, 'latitude': 100, 'longitude': 100}, parallel=True)
free(): double free detected in tcache 2
free(): double free detected in tcache 2
free(): double free detected in tcache 2
distributed.nanny - WARNING - Worker process 18744 was killed by signal 11
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 18740 was killed by signal 6
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 18742 was killed by signal 7
distributed.nanny - WARNING - Worker process 18738 was killed by signal 6
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
free(): double free detected in tcache 2
munmap_chunk(): invalid pointer
free(): double free detected in tcache 2
free(): double free detected in tcache 2
distributed.nanny - WARNING - Worker process 19082 was killed by signal 6
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 19073 was killed by signal 6
distributed.nanny - WARNING - Restarting worker
---------------------------------------------------------------------------
KilledWorker Traceback (most recent call last)
<ipython-input-4-...> in <module>()
----> 1 dset = xr.open_mfdataset(""./bom-ww3/bom-ww3_*.nc"", chunks={'time': 744, 'latitude': 100, 'longitude': 100}, parallel=True)
/usr/local/lib/python3.7/dist-packages/xarray/backends/api.py in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, lock, data_vars, coords, combine, autoclose, parallel, **kwargs)
772 # calling compute here will return the datasets/file_objs lists,
773 # the underlying datasets will still be stored as dask arrays
--> 774 datasets, file_objs = dask.compute(datasets, file_objs)
775
776 # Combine all datasets, closing them in case of a ValueError
/usr/local/lib/python3.7/dist-packages/dask/base.py in compute(*args, **kwargs)
444 keys = [x.__dask_keys__() for x in collections]
445 postcomputes = [x.__dask_postcompute__() for x in collections]
--> 446 results = schedule(dsk, keys, **kwargs)
447 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
448
/home/oceanum/.local/lib/python3.7/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
2525 should_rejoin = False
2526 try:
-> 2527 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
2528 finally:
2529 for f in futures.values():
/home/oceanum/.local/lib/python3.7/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
1821 direct=direct,
1822 local_worker=local_worker,
-> 1823 asynchronous=asynchronous,
1824 )
1825
/home/oceanum/.local/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
761 else:
762 return sync(
--> 763 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
764 )
765
/home/oceanum/.local/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
330 e.wait(10)
331 if error[0]:
--> 332 six.reraise(*error[0])
333 else:
334 return result[0]
/usr/lib/python3/dist-packages/six.py in reraise(tp, value, tb)
691 if value.__traceback__ is not tb:
692 raise value.with_traceback(tb)
--> 693 raise value
694 finally:
695 value = None
/home/oceanum/.local/lib/python3.7/site-packages/distributed/utils.py in f()
315 if callback_timeout is not None:
316 future = gen.with_timeout(timedelta(seconds=callback_timeout), future)
--> 317 result[0] = yield future
318 except Exception as exc:
319 error[0] = sys.exc_info()
/home/oceanum/.local/lib/python3.7/site-packages/tornado/gen.py in run(self)
733
734 try:
--> 735 value = future.result()
736 except Exception:
737 exc_info = sys.exc_info()
/home/oceanum/.local/lib/python3.7/site-packages/tornado/gen.py in run(self)
740 if exc_info is not None:
741 try:
--> 742 yielded = self.gen.throw(*exc_info) # type: ignore
743 finally:
744 # Break up a reference to itself
/home/oceanum/.local/lib/python3.7/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1678 exc = CancelledError(key)
1679 else:
-> 1680 six.reraise(type(exception), exception, traceback)
1681 raise exc
1682 if errors == ""skip"":
/usr/lib/python3/dist-packages/six.py in reraise(tp, value, tb)
691 if value.__traceback__ is not tb:
692 raise value.with_traceback(tb)
--> 693 raise value
694 finally:
695 value = None
KilledWorker: ('open_dataset-e7916acb-6d9f-4532-ab76-5b9c1b1a39c2', )
```
Is there anything obviously wrong with what I'm trying here?
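One thing I wonder about, assuming the double frees come from concurrent netCDF4/HDF5 reads (a common suspect for this kind of crash), is whether serializing all file access through a single lock avoids it. A rough sketch, not a verified fix; `open_mfdataset` accepts a `lock` argument in this version (see its signature in the traceback above):
```python
# hedged sketch: force all netCDF4/HDF5 reads through one lock
# (SerializableLock only serializes within a single process; a
# distributed.Lock would be the cluster-wide equivalent)
from dask.utils import SerializableLock
import xarray as xr

lock = SerializableLock()
dset = xr.open_mfdataset('./bom-ww3/bom-ww3_*.nc',
                         chunks={'time': 744, 'latitude': 100, 'longitude': 100},
                         parallel=True, lock=lock)
```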
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-510144707,https://api.github.com/repos/pydata/xarray/issues/2501,510144707,MDEyOklzc3VlQ29tbWVudDUxMDE0NDcwNw==,1872600,2019-07-10T16:59:12Z,2019-07-11T11:47:02Z,NONE,"@TomAugspurger, I sat down here at SciPy with @rabernat and he instantly realized that we needed to drop the `feature_id` coordinate to prevent `open_mfdataset` from trying to harmonize that coordinate across all the files.
So if I use this code, the `open_mfdataset` command finishes:
```python
def drop_coords(ds):
    ds = ds.drop(['reference_time', 'feature_id'])
    return ds.reset_coords(drop=True)
```
and I can then add back in the dropped coordinate values at the end:
```python
dsets = [xr.open_dataset(f) for f in files[:3]]  # only dsets[0] is actually used below
ds.coords['feature_id'] = dsets[0].coords['feature_id']
```
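Putting the two snippets together, the whole pattern looks roughly like this (a sketch; `files` is the list of NetCDF paths from the original code):
```python
# drop the coordinates that open_mfdataset would otherwise try to harmonize
# across every file, then re-attach them from one representative file
import xarray as xr

def drop_coords(ds):
    ds = ds.drop(['reference_time', 'feature_id'])
    return ds.reset_coords(drop=True)

ds = xr.open_mfdataset(files, preprocess=drop_coords, parallel=True)

# every file carries the same feature_id values, so the first file suffices
ds.coords['feature_id'] = xr.open_dataset(files[0]).coords['feature_id']
```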
I'm now running into memory issues when I write the zarr data -- but I should raise that as a new issue, right?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-510217080,https://api.github.com/repos/pydata/xarray/issues/2501,510217080,MDEyOklzc3VlQ29tbWVudDUxMDIxNzA4MA==,1312546,2019-07-10T20:30:41Z,2019-07-10T20:30:41Z,MEMBER,"Yep, that’s my suspicion as well. I’m still plugging away at it. Currently the pausing logic isn’t working quite right.
> On Jul 10, 2019, at 12:10, Ryan Abernathey wrote:
>
> I believe that the memory issue is basically the same as dask/distributed#2602.
>
> The graphs look like: read --> rechunk --> write.
>
> Reading and rechunking increase memory consumption. Writing relieves it. In Rich's case, the workers just load too much data before they write it. Eventually they run out of memory.
>
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-510169853,https://api.github.com/repos/pydata/xarray/issues/2501,510169853,MDEyOklzc3VlQ29tbWVudDUxMDE2OTg1Mw==,1197350,2019-07-10T18:10:37Z,2019-07-10T18:10:37Z,MEMBER,"I believe that the memory issue is basically the same as https://github.com/dask/distributed/issues/2602.
The graphs look like: `read --> rechunk --> write`.
Reading and rechunking increase memory consumption. Writing relieves it. In Rich's case, the workers just load too much data before they write it. Eventually they run out of memory.
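If that diagnosis is right, one mitigation (an untested sketch, reusing the names and chunk sizes from Rich's snippet earlier in the thread) is to pick the chunking at open time so that no cross-file rechunk step appears in the graph at all, e.g. keep the natural one-time-step-per-file chunking and only split `feature_id`:
```python
# sketch: avoid the rechunk step so the graph is just read --> write;
# the trade-off is many small chunks along time in the resulting store
ds = xr.open_mfdataset(files, preprocess=drop_coords, parallel=True,
                       chunks={'feature_id': 209929})
ds.to_zarr('zarr/2009', mode='w', consolidated=True)
```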
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-510167911,https://api.github.com/repos/pydata/xarray/issues/2501,510167911,MDEyOklzc3VlQ29tbWVudDUxMDE2NzkxMQ==,1312546,2019-07-10T18:05:07Z,2019-07-10T18:05:07Z,MEMBER,"Great, thanks. I’ll look into the memory issue when writing. We may already have an issue for it.
> On Jul 10, 2019, at 10:59, Rich Signell wrote:
>
> @TomAugspurger, I sat down here at SciPy with @rabernat and he instantly realized that we needed to drop the feature_id coordinate to prevent open_mfdataset from trying to harmonize that coordinate across all the files.
>
> So if I use this code, the open_mfdataset command finishes:
>
> def drop_coords(ds):
>     ds = ds.drop(['reference_time','feature_id'])
>     return ds.reset_coords(drop=True)
> and I can then add back in the dropped coordinate values at the end:
>
> dsets = [xr.open_dataset(f) for f in files[:3]]
> ds.coords['feature_id'] = dsets[0].coords['feature_id']
> I'm now running into memory issues when I write the zarr data -- but I should raise that as a new issue, right?
>
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-509379294,https://api.github.com/repos/pydata/xarray/issues/2501,509379294,MDEyOklzc3VlQ29tbWVudDUwOTM3OTI5NA==,1872600,2019-07-08T20:28:48Z,2019-07-08T20:29:20Z,NONE,"@TomAugspurger , I thought @rabernat's suggestion of implementing
```python
def drop_coords(ds):
return ds.reset_coords(drop=True)
```
would avoid this checking. Did I understand or implement this incorrectly?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-509346055,https://api.github.com/repos/pydata/xarray/issues/2501,509346055,MDEyOklzc3VlQ29tbWVudDUwOTM0NjA1NQ==,1312546,2019-07-08T18:46:58Z,2019-07-08T18:46:58Z,MEMBER,"@rsignell-usgs very helpful, thanks. I'd noticed that there was a pause after the `open_dataset` tasks finished, indicating that either the scheduler or (more likely) the client was doing work rather than the cluster. Most likely @rabernat's guess
> In open_mfdataset, all of the dimensions and coordinates of the individual files have to be checked and verified to be compatible. That is often the source of slow performance with open_mfdataset.
is correct. Verifying all that now, and looking into whether / how that can be done on the workers.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-509341467,https://api.github.com/repos/pydata/xarray/issues/2501,509341467,MDEyOklzc3VlQ29tbWVudDUwOTM0MTQ2Nw==,1872600,2019-07-08T18:34:02Z,2019-07-08T18:34:02Z,NONE,"@rabernat , to answer your question, if I open just two files:
```python
ds = xr.open_mfdataset(files[:2], preprocess=drop_coords, autoclose=True, parallel=True)
```
the resulting dataset is:
```
<xarray.Dataset>
Dimensions: (feature_id: 2729077, reference_time: 1, time: 2)
Coordinates:
* reference_time (reference_time) datetime64[ns] 2009-01-01
* feature_id (feature_id) int32 101 179 181 ... 1180001803 1180001804
* time (time) datetime64[ns] 2009-01-01 2009-01-01T01:00:00
Data variables:
streamflow (time, feature_id) float64 dask.array
q_lateral (time, feature_id) float64 dask.array
velocity (time, feature_id) float64 dask.array
qSfcLatRunoff (time, feature_id) float64 dask.array
qBucket (time, feature_id) float64 dask.array
qBtmVertRunoff (time, feature_id) float64 dask.array
Attributes:
featureType: timeSeries
proj4: +proj=longlat +datum=NAD83 +no_defs
model_initialization_time: 2009-01-01_00:00:00
station_dimension: feature_id
model_output_valid_time: 2009-01-01_00:00:00
stream_order_output: 1
cdm_datatype: Station
esri_pe_string: GEOGCS[GCS_North_American_1983,DATUM[D_North_...
Conventions: CF-1.6
model_version: NWM 1.2
dev_OVRTSWCRT: 1
dev_NOAH_TIMESTEP: 3600
dev_channel_only: 0
dev_channelBucket_only: 0
dev: dev_ prefix indicates development/internal me...
```
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-509340139,https://api.github.com/repos/pydata/xarray/issues/2501,509340139,MDEyOklzc3VlQ29tbWVudDUwOTM0MDEzOQ==,1872600,2019-07-08T18:30:18Z,2019-07-08T18:30:18Z,NONE,"@TomAugspurger, okay, I just ran the above code again and here's what happens:
The `open_mfdataset` call proceeds nicely on my 8 workers with 40 cores, eventually completing the 8760 `open_dataset` tasks in about 10 minutes. One interesting thing is that the number of tasks keeps dropping as time goes on. Not sure why that would be:
*(dask dashboard screenshots)*
The memory usage on the workers seems okay during this process:
*(screenshot of worker memory usage)*
Then, although the dashboard shows the tasks as completed, the `open_mfdataset` command does not return. Nothing has died, and I'm not sure what's happening. I check `top` and see this:
*(screenshot of top output)*
Then, after about 10 more minutes, I get these warnings:
*(screenshot of warnings)*
and then the errors:
```python-traceback
distributed.client - WARNING - Couldn't gather 17520 keys, rescheduling {'getattr-fd038834-befa-4a9b-b78f-51f9aa2b28e5': ('tcp://127.0.0.1:45640',), 'drop_coords-39be9e52-59de-4e1f-b6d8-27e7d931b5af': ('tcp://127.0.0.1:55881',), 'drop_coords-8bd07037-9ca4-4f97-83fb-8b02d7ad0333': ('tcp://127.0.0.1:56164',), 'drop_coords-ca3dd72b-e5af-4099-b593-89dc97717718': ('tcp://127.0.0.1:59961',), 'getattr-c0af8992-e928-4d42-9e64-340303143454': ('tcp://127.0.0.1:42989',), 'drop_coords-8cdfe5fb-7a29-4606-8692-efa747be5bc1': ('tcp://127.0.0.1:35445',), 'getattr-03669206-0d26-46a1-988d-690fe830e52f':
...
```
Full error listing here:
https://gist.github.com/rsignell-usgs/3b7101966b8c6d05f48a0e01695f35d6
Does this help? I'd be happy to screenshare if that would be useful.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-509307081,https://api.github.com/repos/pydata/xarray/issues/2501,509307081,MDEyOklzc3VlQ29tbWVudDUwOTMwNzA4MQ==,1312546,2019-07-08T16:57:15Z,2019-07-08T16:57:15Z,MEMBER,"I'm looking into it today. Can you clarify
> The memory use kept growing until the process died.
by ""process"" do you mean a dask worker process, or just the main python process executing the `ds = xr.open_mfdataset(...)` code?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-509282831,https://api.github.com/repos/pydata/xarray/issues/2501,509282831,MDEyOklzc3VlQ29tbWVudDUwOTI4MjgzMQ==,1872600,2019-07-08T15:51:23Z,2019-07-08T15:51:23Z,NONE,"@TomAugspurger, I'm back from vacation now and ready to attack this again. Any updates on your end?
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-506497180,https://api.github.com/repos/pydata/xarray/issues/2501,506497180,MDEyOklzc3VlQ29tbWVudDUwNjQ5NzE4MA==,1312546,2019-06-27T20:24:26Z,2019-06-27T20:24:26Z,MEMBER,"> The datasets in our cloud datastore are designed explicitly to avoid this problem!
Good to know!
FYI, https://github.com/pydata/xarray/issues/2501#issuecomment-506478508 was user error (I can access it, but need to specify the us-east-1 region). Taking a look now.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-506482057,https://api.github.com/repos/pydata/xarray/issues/2501,506482057,MDEyOklzc3VlQ29tbWVudDUwNjQ4MjA1Nw==,1197350,2019-06-27T19:36:51Z,2019-06-27T19:36:51Z,MEMBER,"@rsignell-usgs
Can you post the xarray repr of two sample files post pre-processing function?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-506481845,https://api.github.com/repos/pydata/xarray/issues/2501,506481845,MDEyOklzc3VlQ29tbWVudDUwNjQ4MTg0NQ==,1197350,2019-06-27T19:36:11Z,2019-06-27T19:36:11Z,MEMBER,"> Are there any datasets on https://pangeo-data.github.io/pangeo-datastore/ that would exhibit this poor behavior?
The datasets in our cloud datastore are designed explicitly to avoid this problem!","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-506478508,https://api.github.com/repos/pydata/xarray/issues/2501,506478508,MDEyOklzc3VlQ29tbWVudDUwNjQ3ODUwOA==,1312546,2019-06-27T19:25:05Z,2019-06-27T19:25:05Z,MEMBER,"Thanks, will take a look this afternoon. Are there any datasets on https://pangeo-data.github.io/pangeo-datastore/ that would exhibit this poor behavior? I may not have access to the bucket (or I'm misusing `rclone`):
```
2019/06/27 14:23:50 NOTICE: Config file ""/Users/taugspurger/.config/rclone/rclone.conf"" not found - using defaults
2019/06/27 14:23:50 Failed to create file system for ""aws-east:nwm-archive/2009"": didn't find section in config file
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-506475819,https://api.github.com/repos/pydata/xarray/issues/2501,506475819,MDEyOklzc3VlQ29tbWVudDUwNjQ3NTgxOQ==,1872600,2019-06-27T19:16:28Z,2019-06-27T19:24:31Z,NONE,"I tried this, and either I didn't apply it right, or it didn't work. The memory use kept growing until the process died. My code to process the 8760 netcdf files with `open_mfdataset` looks like this:
```python
import pandas as pd
import numcodecs
import xarray as xr
from dask.distributed import Client, progress, LocalCluster

cluster = LocalCluster()
client = Client(cluster)

# one file per hour for all of 2009 (8760 files)
dates = pd.date_range(start='2009-01-01 00:00', end='2009-12-31 23:00', freq='1h')
files = ['./nc/{}/{}.CHRTOUT_DOMAIN1.comp'.format(date.strftime('%Y'), date.strftime('%Y%m%d%H%M')) for date in dates]

def drop_coords(ds):
    return ds.reset_coords(drop=True)

ds = xr.open_mfdataset(files, preprocess=drop_coords, autoclose=True, parallel=True)

# rechunk to one-week blocks along time before writing
ds1 = ds.chunk(chunks={'time': 168, 'feature_id': 209929})
numcodecs.blosc.use_threads = False
ds1.to_zarr('zarr/2009', mode='w', consolidated=True)
```
I transferred the NetCDF files from AWS S3 to my local disk to run this, using this command:
```
rclone sync --include '*.CHRTOUT_DOMAIN1.comp' aws-east:nwm-archive/2009 . --checksum --fast-list --transfers 16
```
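A stripped-down synthetic version would be something along these lines (a hypothetical sketch at a much smaller scale; I have not verified that it reproduces the problem):
```python
# hypothetical sketch: generate many tiny hourly NetCDF files with the same
# one-time-step-per-file structure, then open them the same way
import os
import numpy as np
import pandas as pd
import xarray as xr

os.makedirs('nc_synth', exist_ok=True)
dates = pd.date_range('2009-01-01', periods=100, freq='1h')
for i, date in enumerate(dates):
    ds = xr.Dataset(
        {'streamflow': (('time', 'feature_id'), np.random.rand(1, 1000))},
        coords={'time': [date], 'feature_id': np.arange(1000)})
    ds.to_netcdf('nc_synth/{:04d}.nc'.format(i))

ds = xr.open_mfdataset('nc_synth/*.nc', parallel=True)
```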
@TomAugspurger, if you could take a look, that would be great, and if you have any ideas of how to make this example simpler/more easily reproducible, please let me know.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-503641038,https://api.github.com/repos/pydata/xarray/issues/2501,503641038,MDEyOklzc3VlQ29tbWVudDUwMzY0MTAzOA==,1197350,2019-06-19T16:48:29Z,2019-06-19T16:48:29Z,MEMBER,"Try writing a preprocessor function that drops all coordinates:
```python
def drop_coords(ds):
    return ds.reset_coords(drop=True)
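# hedged usage note: pass this via the preprocess argument so it runs on
# each file before the datasets are combined, e.g.
#     ds = xr.open_mfdataset(files, preprocess=drop_coords, parallel=True)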
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-497381301,https://api.github.com/repos/pydata/xarray/issues/2501,497381301,MDEyOklzc3VlQ29tbWVudDQ5NzM4MTMwMQ==,1872600,2019-05-30T15:55:56Z,2019-05-30T15:58:48Z,NONE,"I'm also hitting memory issues when using `open_mfdataset` with a cluster.
Specifically, I'm trying to open 8760 NetCDF files with an 8-node, 40-CPU LocalCluster.
When I issue:
```python
ds = xr.open_mfdataset(files, parallel=True)
```
all looks good on the Dask dashboard:
*(dask dashboard screenshots)*
and the tasks complete with no errors in about 4 minutes.
Then 4 more minutes go by before I get a bunch of errors like:
```
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker process 26054 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
```
and my cell doesn't complete.
Any suggestions?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-432546977,https://api.github.com/repos/pydata/xarray/issues/2501,432546977,MDEyOklzc3VlQ29tbWVudDQzMjU0Njk3Nw==,1492047,2018-10-24T07:38:31Z,2018-10-24T07:38:31Z,CONTRIBUTOR,"Thank you for looking into this.
I just want to point out that I'm not so much concerned with the ""slow performance"" as with the memory consumption and the limitation it implies.
```python
from glob import glob
import xarray as xr
all_files = glob('...*TP110*.nc')
display(xr.open_dataset(all_files[0]))
display(xr.open_dataset(all_files[1]))
```
```
<xarray.Dataset>
Dimensions: (meas_ind: 40, time: 2871, wvf_ind: 128)
Coordinates:
* time (time) datetime64[ns] 2017-06-19T14:24:20.792036992 ... 2017-06-19T15:14:38.491743104
* meas_ind (meas_ind) int8 0 1 2 3 4 ... 36 37 38 39
* wvf_ind (wvf_ind) int8 0 1 2 3 ... 125 126 127
lat (time) float64 ...
lon (time) float64 ...
lon_40hz (time, meas_ind) float64 ...
lat_40hz (time, meas_ind) float64 ...
Data variables:
time_40hz (time, meas_ind) datetime64[ns] ...
surface_type (time) float32 ...
rad_surf_type (time) float32 ...
qual_alt_1hz_range (time) float32 ...
qual_alt_1hz_swh (time) float32 ...
qual_alt_1hz_sig0 (time) float32 ...
qual_alt_1hz_off_nadir_angle_wf (time) float32 ...
qual_inst_corr_1hz_range (time) float32 ...
qual_inst_corr_1hz_swh (time) float32 ...
qual_inst_corr_1hz_sig0 (time) float32 ...
qual_rad_1hz_tb_k (time) float32 ...
qual_rad_1hz_tb_ka (time) float32 ...
alt_state_flag_acq_mode_40hz (time, meas_ind) float32 ...
alt_state_flag_tracking_mode_40hz (time, meas_ind) float32 ...
orb_state_flag_diode (time) float32 ...
orb_state_flag_rest (time) float32 ...
ecmwf_meteo_map_avail (time) float32 ...
trailing_edge_variation_flag (time) float32 ...
trailing_edge_variation_flag_40hz (time, meas_ind) float32 ...
ice_flag (time) float32 ...
interp_flag_mean_sea_surface (time) float32 ...
interp_flag_mdt (time) float32 ...
interp_flag_ocean_tide_sol1 (time) float32 ...
interp_flag_ocean_tide_sol2 (time) float32 ...
interp_flag_meteo (time) float32 ...
alt (time) float64 ...
alt_40hz (time, meas_ind) float64 ...
orb_alt_rate (time) float32 ...
range (time) float64 ...
range_40hz (time, meas_ind) float64 ...
range_used_40hz (time, meas_ind) float32 ...
range_rms (time) float32 ...
range_numval (time) float32 ...
number_of_iterations (time, meas_ind) float32 ...
net_instr_corr_range (time) float64 ...
model_dry_tropo_corr (time) float32 ...
model_wet_tropo_corr (time) float32 ...
rad_wet_tropo_corr (time) float32 ...
iono_corr_gim (time) float32 ...
sea_state_bias (time) float32 ...
swh (time) float32 ...
swh_40hz (time, meas_ind) float32 ...
swh_used_40hz (time, meas_ind) float32 ...
swh_rms (time) float32 ...
swh_numval (time) float32 ...
net_instr_corr_swh (time) float32 ...
sig0 (time) float32 ...
sig0_40hz (time, meas_ind) float32 ...
sig0_used_40hz (time, meas_ind) float32 ...
sig0_rms (time) float32 ...
sig0_numval (time) float32 ...
agc (time) float32 ...
agc_rms (time) float32 ...
agc_numval (time) float32 ...
net_instr_corr_sig0 (time) float32 ...
atmos_corr_sig0 (time) float32 ...
off_nadir_angle_wf (time) float32 ...
off_nadir_angle_wf_40hz (time, meas_ind) float32 ...
tb_k (time) float32 ...
tb_ka (time) float32 ...
mean_sea_surface (time) float64 ...
mean_topography (time) float64 ...
geoid (time) float64 ...
bathymetry (time) float64 ...
inv_bar_corr (time) float32 ...
hf_fluctuations_corr (time) float32 ...
ocean_tide_sol1 (time) float64 ...
ocean_tide_sol2 (time) float64 ...
ocean_tide_equil (time) float32 ...
ocean_tide_non_equil (time) float32 ...
load_tide_sol1 (time) float32 ...
load_tide_sol2 (time) float32 ...
solid_earth_tide (time) float32 ...
pole_tide (time) float32 ...
wind_speed_model_u (time) float32 ...
wind_speed_model_v (time) float32 ...
wind_speed_alt (time) float32 ...
rad_water_vapor (time) float32 ...
rad_liquid_water (time) float32 ...
ice1_range_40hz (time, meas_ind) float64 ...
ice1_sig0_40hz (time, meas_ind) float32 ...
ice1_qual_flag_40hz (time, meas_ind) float32 ...
seaice_range_40hz (time, meas_ind) float64 ...
seaice_sig0_40hz (time, meas_ind) float32 ...
seaice_qual_flag_40hz (time, meas_ind) float32 ...
ice2_range_40hz (time, meas_ind) float64 ...
ice2_le_sig0_40hz (time, meas_ind) float32 ...
ice2_sig0_40hz (time, meas_ind) float32 ...
ice2_sigmal_40hz (time, meas_ind) float32 ...
ice2_slope1_40hz (time, meas_ind) float64 ...
ice2_slope2_40hz (time, meas_ind) float64 ...
ice2_mqe_40hz (time, meas_ind) float32 ...
ice2_qual_flag_40hz (time, meas_ind) float32 ...
mqe_40hz (time, meas_ind) float32 ...
peakiness_40hz (time, meas_ind) float32 ...
ssha (time) float32 ...
tracker_40hz (time, meas_ind) float64 ...
tracker_used_40hz (time, meas_ind) float32 ...
tracker_diode_40hz (time, meas_ind) float64 ...
pri_counter_40hz (time, meas_ind) float64 ...
qual_alt_1hz_off_nadir_angle_pf (time) float32 ...
off_nadir_angle_pf (time) float32 ...
off_nadir_angle_rain_40hz (time, meas_ind) float32 ...
uso_corr (time) float64 ...
internal_path_delay_corr (time) float64 ...
modeled_instr_corr_range (time) float32 ...
doppler_corr (time) float32 ...
cog_corr (time) float32 ...
modeled_instr_corr_swh (time) float32 ...
internal_corr_sig0 (time) float32 ...
modeled_instr_corr_sig0 (time) float32 ...
agc_40hz (time, meas_ind) float32 ...
agc_corr_40hz (time, meas_ind) float32 ...
scaling_factor_40hz (time, meas_ind) float64 ...
epoch_40hz (time, meas_ind) float64 ...
width_leading_edge_40hz (time, meas_ind) float64 ...
amplitude_40hz (time, meas_ind) float64 ...
thermal_noise_40hz (time, meas_ind) float64 ...
seaice_epoch_40hz (time, meas_ind) float64 ...
seaice_amplitude_40hz (time, meas_ind) float64 ...
ice2_epoch_40hz (time, meas_ind) float64 ...
ice2_amplitude_40hz (time, meas_ind) float64 ...
ice2_mean_amplitude_40hz (time, meas_ind) float64 ...
ice2_thermal_noise_40hz (time, meas_ind) float64 ...
ice2_slope_40hz (time, meas_ind) float64 ...
signal_to_noise_ratio (time) float32 ...
waveforms_40hz (time, meas_ind, wvf_ind) float32 ...
Attributes:
Conventions: CF-1.1
title: GDR - Expertise dataset
institution: CNES
source: radar altimeter
history: 2017-07-21 08:25:07 : Creation
contact: CNES aviso@oceanobs.com, EUMETSAT ops@...
references: L1 library=V4.5p1, L2 library=V5.5p2, ...
processing_center: SALP
reference_document: SARAL/ALTIKA Products Handbook, SALP-M...
mission_name: SARAL
altimeter_sensor_name: ALTIKA
radiometer_sensor_name: ALTIKA_RAD
doris_sensor_name: DGXX
cycle_number: 110
absolute_rev_number: 22545
pass_number: 1
absolute_pass_number: 109219
equator_time: 2017-06-19 14:49:32.128000
equator_longitude: 227.77
first_meas_time: 2017-06-19 14:24:20.792037
last_meas_time: 2017-06-19 15:14:38.491743
xref_altimeter_level1: ALK_ALT_1PaS20170619_154722_20170619_1...
xref_radiometer_level1: ALK_RAD_1PaS20170619_154643_20170619_1...
xref_altimeter_characterisation: ALK_CHA_AXVCNE20131115_120000_20100101...
xref_radiometer_characterisation: ALK_CHR_AXVCNE20110207_180000_20110101...
xref_altimeter_ltm: ALK_CAL_AXXCNE20170720_110014_20130102...
xref_doris_uso: SRL_OS1_AXXCNE20170720_083800_20130226...
xref_orbit_data: SRL_VOR_AXVCNE20170720_111700_20170618...
xref_pf_data: SRL_VPF_AXVCNE20170720_111800_20170618...
xref_pole_location: SMM_POL_AXXCNE20170721_071500_19870101...
xref_gim_data: SRL_ION_AXPCNE20170620_074756_20170619...
xref_mog2d_data: SMM_MOG_AXVCNE20170709_191501_20170619...
xref_orf_data: SRL_ORF_AXXCNE20170720_083800_20160704...
xref_meteorological_files: SMM_APA_AXVCNE20170619_170611_20170619...
ellipsoid_axis: 6378136.3
ellipsoid_flattening: 0.0033528131778969
<xarray.Dataset>
Dimensions: (meas_ind: 40, time: 2779, wvf_ind: 128)
Coordinates:
* time (time) datetime64[ns] 2017-06-19T15:14:39.356848 ... 2017-06-19T16:04:56.808873920
* meas_ind (meas_ind) int8 0 1 2 3 4 ... 36 37 38 39
* wvf_ind (wvf_ind) int8 0 1 2 3 ... 125 126 127
lat (time) float64 ...
lon (time) float64 ...
lon_40hz (time, meas_ind) float64 ...
lat_40hz (time, meas_ind) float64 ...
Data variables:
time_40hz (time, meas_ind) datetime64[ns] ...
surface_type (time) float32 ...
rad_surf_type (time) float32 ...
qual_alt_1hz_range (time) float32 ...
qual_alt_1hz_swh (time) float32 ...
qual_alt_1hz_sig0 (time) float32 ...
qual_alt_1hz_off_nadir_angle_wf (time) float32 ...
qual_inst_corr_1hz_range (time) float32 ...
qual_inst_corr_1hz_swh (time) float32 ...
qual_inst_corr_1hz_sig0 (time) float32 ...
qual_rad_1hz_tb_k (time) float32 ...
qual_rad_1hz_tb_ka (time) float32 ...
alt_state_flag_acq_mode_40hz (time, meas_ind) float32 ...
alt_state_flag_tracking_mode_40hz (time, meas_ind) float32 ...
orb_state_flag_diode (time) float32 ...
orb_state_flag_rest (time) float32 ...
ecmwf_meteo_map_avail (time) float32 ...
trailing_edge_variation_flag (time) float32 ...
trailing_edge_variation_flag_40hz (time, meas_ind) float32 ...
ice_flag (time) float32 ...
interp_flag_mean_sea_surface (time) float32 ...
interp_flag_mdt (time) float32 ...
interp_flag_ocean_tide_sol1 (time) float32 ...
interp_flag_ocean_tide_sol2 (time) float32 ...
interp_flag_meteo (time) float32 ...
alt (time) float64 ...
alt_40hz (time, meas_ind) float64 ...
orb_alt_rate (time) float32 ...
range (time) float64 ...
range_40hz (time, meas_ind) float64 ...
range_used_40hz (time, meas_ind) float32 ...
range_rms (time) float32 ...
range_numval (time) float32 ...
number_of_iterations (time, meas_ind) float32 ...
net_instr_corr_range (time) float64 ...
model_dry_tropo_corr (time) float32 ...
model_wet_tropo_corr (time) float32 ...
rad_wet_tropo_corr (time) float32 ...
iono_corr_gim (time) float32 ...
sea_state_bias (time) float32 ...
swh (time) float32 ...
swh_40hz (time, meas_ind) float32 ...
swh_used_40hz (time, meas_ind) float32 ...
swh_rms (time) float32 ...
swh_numval (time) float32 ...
net_instr_corr_swh (time) float32 ...
sig0 (time) float32 ...
sig0_40hz (time, meas_ind) float32 ...
sig0_used_40hz (time, meas_ind) float32 ...
sig0_rms (time) float32 ...
sig0_numval (time) float32 ...
agc (time) float32 ...
agc_rms (time) float32 ...
agc_numval (time) float32 ...
net_instr_corr_sig0 (time) float32 ...
atmos_corr_sig0 (time) float32 ...
off_nadir_angle_wf (time) float32 ...
off_nadir_angle_wf_40hz (time, meas_ind) float32 ...
tb_k (time) float32 ...
tb_ka (time) float32 ...
mean_sea_surface (time) float64 ...
mean_topography (time) float64 ...
geoid (time) float64 ...
bathymetry (time) float64 ...
inv_bar_corr (time) float32 ...
hf_fluctuations_corr (time) float32 ...
ocean_tide_sol1 (time) float64 ...
ocean_tide_sol2 (time) float64 ...
ocean_tide_equil (time) float32 ...
ocean_tide_non_equil (time) float32 ...
load_tide_sol1 (time) float32 ...
load_tide_sol2 (time) float32 ...
solid_earth_tide (time) float32 ...
pole_tide (time) float32 ...
wind_speed_model_u (time) float32 ...
wind_speed_model_v (time) float32 ...
wind_speed_alt (time) float32 ...
rad_water_vapor (time) float32 ...
rad_liquid_water (time) float32 ...
ice1_range_40hz (time, meas_ind) float64 ...
ice1_sig0_40hz (time, meas_ind) float32 ...
ice1_qual_flag_40hz (time, meas_ind) float32 ...
seaice_range_40hz (time, meas_ind) float64 ...
seaice_sig0_40hz (time, meas_ind) float32 ...
seaice_qual_flag_40hz (time, meas_ind) float32 ...
ice2_range_40hz (time, meas_ind) float64 ...
ice2_le_sig0_40hz (time, meas_ind) float32 ...
ice2_sig0_40hz (time, meas_ind) float32 ...
ice2_sigmal_40hz (time, meas_ind) float32 ...
ice2_slope1_40hz (time, meas_ind) float64 ...
ice2_slope2_40hz (time, meas_ind) float64 ...
ice2_mqe_40hz (time, meas_ind) float32 ...
ice2_qual_flag_40hz (time, meas_ind) float32 ...
mqe_40hz (time, meas_ind) float32 ...
peakiness_40hz (time, meas_ind) float32 ...
ssha (time) float32 ...
tracker_40hz (time, meas_ind) float64 ...
tracker_used_40hz (time, meas_ind) float32 ...
tracker_diode_40hz (time, meas_ind) float64 ...
pri_counter_40hz (time, meas_ind) float64 ...
qual_alt_1hz_off_nadir_angle_pf (time) float32 ...
off_nadir_angle_pf (time) float32 ...
off_nadir_angle_rain_40hz (time, meas_ind) float32 ...
uso_corr (time) float64 ...
internal_path_delay_corr (time) float64 ...
modeled_instr_corr_range (time) float32 ...
doppler_corr (time) float32 ...
cog_corr (time) float32 ...
modeled_instr_corr_swh (time) float32 ...
internal_corr_sig0 (time) float32 ...
modeled_instr_corr_sig0 (time) float32 ...
agc_40hz (time, meas_ind) float32 ...
agc_corr_40hz (time, meas_ind) float32 ...
scaling_factor_40hz (time, meas_ind) float64 ...
epoch_40hz (time, meas_ind) float64 ...
width_leading_edge_40hz (time, meas_ind) float64 ...
amplitude_40hz (time, meas_ind) float64 ...
thermal_noise_40hz (time, meas_ind) float64 ...
seaice_epoch_40hz (time, meas_ind) float64 ...
seaice_amplitude_40hz (time, meas_ind) float64 ...
ice2_epoch_40hz (time, meas_ind) float64 ...
ice2_amplitude_40hz (time, meas_ind) float64 ...
ice2_mean_amplitude_40hz (time, meas_ind) float64 ...
ice2_thermal_noise_40hz (time, meas_ind) float64 ...
ice2_slope_40hz (time, meas_ind) float64 ...
signal_to_noise_ratio (time) float32 ...
waveforms_40hz (time, meas_ind, wvf_ind) float32 ...
Attributes:
Conventions: CF-1.1
title: GDR - Expertise dataset
institution: CNES
source: radar altimeter
history: 2017-07-21 08:25:19 : Creation
contact: CNES aviso@oceanobs.com, EUMETSAT ops@...
references: L1 library=V4.5p1, L2 library=V5.5p2, ...
processing_center: SALP
reference_document: SARAL/ALTIKA Products Handbook, SALP-M...
mission_name: SARAL
altimeter_sensor_name: ALTIKA
radiometer_sensor_name: ALTIKA_RAD
doris_sensor_name: DGXX
cycle_number: 110
absolute_rev_number: 22546
pass_number: 2
absolute_pass_number: 109220
equator_time: 2017-06-19 15:39:46.492000
equator_longitude: 35.21
first_meas_time: 2017-06-19 15:14:39.356848
last_meas_time: 2017-06-19 16:04:56.808874
xref_altimeter_level1: ALK_ALT_1PaS20170619_154722_20170619_1...
xref_radiometer_level1: ALK_RAD_1PaS20170619_154643_20170619_1...
xref_altimeter_characterisation: ALK_CHA_AXVCNE20131115_120000_20100101...
xref_radiometer_characterisation: ALK_CHR_AXVCNE20110207_180000_20110101...
xref_altimeter_ltm: ALK_CAL_AXXCNE20170720_110014_20130102...
xref_doris_uso: SRL_OS1_AXXCNE20170720_083800_20130226...
xref_orbit_data: SRL_VOR_AXVCNE20170720_111700_20170618...
xref_pf_data: SRL_VPF_AXVCNE20170720_111800_20170618...
xref_pole_location: SMM_POL_AXXCNE20170721_071500_19870101...
xref_gim_data: SRL_ION_AXPCNE20170620_074756_20170619...
xref_mog2d_data: SMM_MOG_AXVCNE20170709_191501_20170619...
xref_orf_data: SRL_ORF_AXXCNE20170720_083800_20160704...
xref_meteorological_files: SMM_APA_AXVCNE20170619_170611_20170619...
ellipsoid_axis: 6378136.3
ellipsoid_flattening: 0.0033528131778969
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-432342306,https://api.github.com/repos/pydata/xarray/issues/2501,432342306,MDEyOklzc3VlQ29tbWVudDQzMjM0MjMwNg==,1197350,2018-10-23T17:27:50Z,2018-10-23T17:27:50Z,MEMBER,"^ I'm assuming you're in a notebook. If not, call `print` instead of `display`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074
https://github.com/pydata/xarray/issues/2501#issuecomment-432342180,https://api.github.com/repos/pydata/xarray/issues/2501,432342180,MDEyOklzc3VlQ29tbWVudDQzMjM0MjE4MA==,1197350,2018-10-23T17:27:30Z,2018-10-23T17:27:30Z,MEMBER,"In `open_mfdataset`, all of the dimensions and coordinates of the individual files have to be checked and verified to be compatible. That is often the source of slow performance with `open_mfdataset`.
To help us help you debug, please provide more information about the files you are opening. Specifically, please call `open_dataset()` directly on the first two files and copy and paste the output here, like this:
```python
from glob import glob
import xarray as xr
all_files = glob('*1002*.nc')
display(xr.open_dataset(all_files[0]))
display(xr.open_dataset(all_files[1]))
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,372848074