issue_comments

22 rows where issue = 372848074 sorted by updated_at descending

user 6

  • rsignell-usgs 7
  • rabernat 6
  • TomAugspurger 6
  • Thomas-Z 1
  • dcherian 1
  • rafa-guedes 1

author_association 3

  • MEMBER 13
  • NONE 7
  • CONTRIBUTOR 2

issue 1

  • open_mfdataset usage and limitations. 22
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
768470505 https://github.com/pydata/xarray/issues/2501#issuecomment-768470505 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDc2ODQ3MDUwNQ== dcherian 2448579 2021-01-27T18:06:16Z 2021-01-27T18:06:16Z MEMBER

I think this is stale now. See https://xarray.pydata.org/en/stable/io.html#reading-multi-file-datasets for the latest guidance on reading such datasets. Please open a new issue if you are still having trouble with open_mfdataset.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
512663861 https://github.com/pydata/xarray/issues/2501#issuecomment-512663861 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUxMjY2Mzg2MQ== rafa-guedes 7799184 2019-07-18T04:51:06Z 2019-07-18T04:52:17Z CONTRIBUTOR

Hi guys, I'm having an issue that looks similar to the one @rsignell-usgs reported. I'm trying to open 413 NetCDF files using open_mfdataset with parallel=True. The dataset (successfully opened with parallel=False) takes ~300 GB on disk and looks like:

```ipython
In [1]: import xarray as xr

In [2]: dset = xr.open_mfdataset("./bom-ww3/bom-ww3_*.nc", chunks={'time': 744, 'latitude': 100, 'longitude': 100}, parallel=False)

In [3]: dset
Out[3]:
<xarray.Dataset>
Dimensions:    (latitude: 190, longitude: 289, time: 302092)
Coordinates:
  * longitude  (longitude) float32 70.0 70.4 70.8 71.2 ... 184.4 184.8 185.2
  * latitude   (latitude) float32 -55.6 -55.2 -54.8 -54.4 ... 19.2 19.6 20.0
  * time       (time) datetime64[ns] 1979-01-01 ... 2013-05-31T23:00:00.000013440
Data variables:
    hs         (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
    fp         (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
    dp         (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
    wl         (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
    U10        (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
    V10        (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
    hs1        (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
    hs2        (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
    tp1        (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
    tp2        (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
    lp0        (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
    lp1        (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
    lp2        (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
    th0        (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
    th1        (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
    th2        (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
    hs0        (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
    tp0        (time, latitude, longitude) float32 dask.array<shape=(302092, 190, 289), chunksize=(745, 100, 100)>
```
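A back-of-envelope check of the sizes involved can help when choosing chunks. The sketch below uses only the dimensions, the float32 dtype and the count of 18 data variables from the repr above, plus the chunks requested in the call; the helper name `nbytes` is made up for illustration. The result suggests the decoded data is several times larger than the ~300 GB on disk, presumably because the files are compressed or packed (an assumption, not something stated in the comment).

```python
from math import prod

ITEMSIZE = 4  # bytes per float32 value, per the repr above


def nbytes(shape, itemsize=ITEMSIZE):
    """Uncompressed size in bytes of an array with the given shape."""
    return prod(shape) * itemsize


shape = (302092, 190, 289)         # (time, latitude, longitude)
requested_chunk = (744, 100, 100)  # chunks passed to open_mfdataset
n_variables = 18                   # data variables listed in the repr

chunk_bytes = nbytes(requested_chunk)
variable_bytes = nbytes(shape)
dataset_bytes = variable_bytes * n_variables

print(f"one chunk:     {chunk_bytes / 1e6:.1f} MB")
print(f"one variable:  {variable_bytes / 1e9:.1f} GB")
print(f"all variables: {dataset_bytes / 1e12:.2f} TB")
```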

Trying to read it in a standard Python session gives me a core dump:

```ipython
In [1]: import xarray as xr

In [2]: dset = xr.open_mfdataset("./bom-ww3/bom-ww3_*.nc", chunks={'time': 744, 'latitude': 100, 'longitude': 100}, parallel=True)
Bus error (core dumped)
```

Trying to read it on a dask cluster I get:

```ipython
In [1]: from dask.distributed import Client

In [2]: import xarray as xr

In [3]: client = Client()

In [4]: dset = xr.open_mfdataset("./bom-ww3/bom-ww3_*.nc", chunks={'time': 744, 'latitude': 100, 'longitude': 100}, parallel=True)
free(): double free detected in tcache 2free(): double free detected in tcache 2

free(): double free detected in tcache 2
distributed.nanny - WARNING - Worker process 18744 was killed by signal 11
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 18740 was killed by signal 6
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 18742 was killed by signal 7
distributed.nanny - WARNING - Worker process 18738 was killed by signal 6
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
free(): double free detected in tcache 2munmap_chunk(): invalid pointer

free(): double free detected in tcache 2
free(): double free detected in tcache 2
distributed.nanny - WARNING - Worker process 19082 was killed by signal 6
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process 19073 was killed by signal 6
distributed.nanny - WARNING - Restarting worker

KilledWorker                              Traceback (most recent call last)
<ipython-input-4-740561b80fec> in <module>()
----> 1 dset = xr.open_mfdataset("./bom-ww3/bom-ww3_*.nc", chunks={'time': 744, 'latitude': 100, 'longitude': 100}, parallel=True)

/usr/local/lib/python3.7/dist-packages/xarray/backends/api.py in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, lock, data_vars, coords, combine, autoclose, parallel, **kwargs)
    772 # calling compute here will return the datasets/file_objs lists,
    773 # the underlying datasets will still be stored as dask arrays
--> 774 datasets, file_objs = dask.compute(datasets, file_objs)
    775
    776 # Combine all datasets, closing them in case of a ValueError

/usr/local/lib/python3.7/dist-packages/dask/base.py in compute(*args, **kwargs)
    444 keys = [x.__dask_keys__() for x in collections]
    445 postcomputes = [x.__dask_postcompute__() for x in collections]
--> 446 results = schedule(dsk, keys, **kwargs)
    447 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    448

/home/oceanum/.local/lib/python3.7/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2525 should_rejoin = False
   2526 try:
-> 2527 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2528 finally:
   2529 for f in futures.values():

/home/oceanum/.local/lib/python3.7/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
   1821 direct=direct,
   1822 local_worker=local_worker,
-> 1823 asynchronous=asynchronous,
   1824 )
   1825

/home/oceanum/.local/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    761 else:
    762 return sync(
--> 763 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    764 )
    765

/home/oceanum/.local/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    330 e.wait(10)
    331 if error[0]:
--> 332 six.reraise(*error[0])
    333 else:
    334 return result[0]

/usr/lib/python3/dist-packages/six.py in reraise(tp, value, tb)
    691 if value.__traceback__ is not tb:
    692 raise value.with_traceback(tb)
--> 693 raise value
    694 finally:
    695 value = None

/home/oceanum/.local/lib/python3.7/site-packages/distributed/utils.py in f()
    315 if callback_timeout is not None:
    316 future = gen.with_timeout(timedelta(seconds=callback_timeout), future)
--> 317 result[0] = yield future
    318 except Exception as exc:
    319 error[0] = sys.exc_info()

/home/oceanum/.local/lib/python3.7/site-packages/tornado/gen.py in run(self)
    733
    734 try:
--> 735 value = future.result()
    736 except Exception:
    737 exc_info = sys.exc_info()

/home/oceanum/.local/lib/python3.7/site-packages/tornado/gen.py in run(self)
    740 if exc_info is not None:
    741 try:
--> 742 yielded = self.gen.throw(*exc_info) # type: ignore
    743 finally:
    744 # Break up a reference to itself

/home/oceanum/.local/lib/python3.7/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1678 exc = CancelledError(key)
   1679 else:
-> 1680 six.reraise(type(exception), exception, traceback)
   1681 raise exc
   1682 if errors == "skip":

/usr/lib/python3/dist-packages/six.py in reraise(tp, value, tb)
    691 if value.__traceback__ is not tb:
    692 raise value.with_traceback(tb)
--> 693 raise value
    694 finally:
    695 value = None

KilledWorker: ('open_dataset-e7916acb-6d9f-4532-ab76-5b9c1b1a39c2', <Worker 'tcp://10.240.0.5:36019', memory: 0, processing: 63>)
```

Is there anything obviously wrong with what I'm trying here?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
510144707 https://github.com/pydata/xarray/issues/2501#issuecomment-510144707 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUxMDE0NDcwNw== rsignell-usgs 1872600 2019-07-10T16:59:12Z 2019-07-11T11:47:02Z NONE

@TomAugspurger , I sat down here at Scipy with @rabernat and he instantly realized that we needed to drop the feature_id coordinate to prevent open_mfdataset from trying to harmonize that coordinate from all the chunks.

So if I use this code, the open_mfdataset command finishes:

    def drop_coords(ds):
        ds = ds.drop(['reference_time','feature_id'])
        return ds.reset_coords(drop=True)

and I can then add back in the dropped coordinate values at the end:

    dsets = [xr.open_dataset(f) for f in files[:3]]
    ds.coords['feature_id'] = dsets[0].coords['feature_id']

I'm now running into memory issues when I write the zarr data -- but I should raise that as a new issue, right?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
510217080 https://github.com/pydata/xarray/issues/2501#issuecomment-510217080 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUxMDIxNzA4MA== TomAugspurger 1312546 2019-07-10T20:30:41Z 2019-07-10T20:30:41Z MEMBER

Yep, that’s my suspicion as well. I’m still plugging away at it. Currently the pausing logic isn’t quite working well.

On Jul 10, 2019, at 12:10, Ryan Abernathey notifications@github.com wrote:

I believe that the memory issue is basically the same as dask/distributed#2602.

The graphs look like: read --> rechunk --> write.

Reading and rechunking increase memory consumption. Writing relieves it. In Rich's case, the workers just load too much data before they write it. Eventually they run out of memory.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
510169853 https://github.com/pydata/xarray/issues/2501#issuecomment-510169853 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUxMDE2OTg1Mw== rabernat 1197350 2019-07-10T18:10:37Z 2019-07-10T18:10:37Z MEMBER

I believe that the memory issue is basically the same as https://github.com/dask/distributed/issues/2602.

The graphs look like: read --> rechunk --> write.

Reading and rechunking increase memory consumption. Writing relieves it. In Rich's case, the workers just load too much data before they write it. Eventually they run out of memory.
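The read → rechunk → write pattern described above can be illustrated with a toy model in plain Python (no dask involved). It counts how many chunks sit in memory at once when all writes are deferred, compared with when reads pause until a write frees space. The function name `peak_buffered` and the `max_in_flight` cap are hypothetical, chosen only for this sketch; they are not dask or distributed settings.

```python
from collections import deque


def peak_buffered(n_chunks, max_in_flight=None):
    """Return the peak number of chunks held in memory at once.

    max_in_flight=None models workers that read every chunk before any
    write happens; an integer models a cap on chunks that have been read
    but not yet written.
    """
    buffered = deque()
    peak = 0
    for chunk_id in range(n_chunks):
        # At the cap, "write" the oldest chunk first so its memory is freed.
        if max_in_flight is not None and len(buffered) >= max_in_flight:
            buffered.popleft()
        buffered.append(chunk_id)  # "read" one more chunk
        peak = max(peak, len(buffered))
    buffered.clear()  # flush the remaining chunks
    return peak


unbounded_peak = peak_buffered(100)
bounded_peak = peak_buffered(100, max_in_flight=4)
print(unbounded_peak, bounded_peak)
```

In this model the uncapped run holds every chunk at once, while the capped run never holds more than the cap, which is the behaviour one would want from the workers.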

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
510167911 https://github.com/pydata/xarray/issues/2501#issuecomment-510167911 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUxMDE2NzkxMQ== TomAugspurger 1312546 2019-07-10T18:05:07Z 2019-07-10T18:05:07Z MEMBER

Great, thanks. I’ll look into the memory issue when writing. We may already have an issue for it.

On Jul 10, 2019, at 10:59, Rich Signell notifications@github.com wrote:

@TomAugspurger , I sat down here at Scipy with @rabernat and he instantly realized that we needed to drop the feature_id coordinate to prevent open_mfdataset from trying to harmonize that coordinate from all the chunks.

So if I use this code, the open_mfdataset command finishes:

    def drop_coords(ds):
        ds = ds.drop(['reference_time','feature_id'])
        return ds.reset_coords(drop=True)

and I can then add back in the dropped coordinate values at the end:

    dsets = [xr.open_dataset(f) for f in files[:3]]
    ds.coords['feature_id'] = dsets[0].coords['feature_id']

I'm now running into memory issues when I write the zarr data -- but I should raise that as a new issue, right?


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
509379294 https://github.com/pydata/xarray/issues/2501#issuecomment-509379294 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUwOTM3OTI5NA== rsignell-usgs 1872600 2019-07-08T20:28:48Z 2019-07-08T20:29:20Z NONE

@TomAugspurger, I thought @rabernat's suggestion of implementing

    def drop_coords(ds):
        return ds.reset_coords(drop=True)

would avoid this checking. Did I understand or implement this incorrectly?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
509346055 https://github.com/pydata/xarray/issues/2501#issuecomment-509346055 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUwOTM0NjA1NQ== TomAugspurger 1312546 2019-07-08T18:46:58Z 2019-07-08T18:46:58Z MEMBER

@rsignell-usgs very helpful, thanks. I'd noticed that there was a pause after the open_dataset tasks finish, indicating that either the scheduler or (more likely) the client was doing work rather than the cluster. Most likely @rabernat's guess

In open_mfdataset, all of the dimensions and coordinates of the individual files have to be checked and verified to be compatible. That is often the source of slow performance with open_mfdataset.

is correct. Verifying all that now, and looking into if / how that can be done on the workers.
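To get a feel for why checking coordinates across files can be costly, here is a rough estimate using numbers reported in this thread: 8760 files, each carrying a feature_id coordinate of 2729077 int32 values. It assumes, for illustration only, that every file's copy of the coordinate has to be read and held for comparison; how xarray actually schedules that work is not shown here.

```python
n_files = 8760        # hourly files for 2009, as reported in the thread
n_features = 2729077  # length of feature_id in each file
itemsize = 4          # bytes per int32 value

# Size of one file's feature_id coordinate, and of all copies together.
per_file_bytes = n_features * itemsize
all_files_bytes = per_file_bytes * n_files

print(f"feature_id per file:  {per_file_bytes / 1e6:.1f} MB")
print(f"feature_id, all files: {all_files_bytes / 1e9:.1f} GB")
```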

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
509341467 https://github.com/pydata/xarray/issues/2501#issuecomment-509341467 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUwOTM0MTQ2Nw== rsignell-usgs 1872600 2019-07-08T18:34:02Z 2019-07-08T18:34:02Z NONE

@rabernat, to answer your question, if I open just two files:

    ds = xr.open_mfdataset(files[:2], preprocess=drop_coords, autoclose=True, parallel=True)

the resulting dataset is:

    <xarray.Dataset>
    Dimensions:         (feature_id: 2729077, reference_time: 1, time: 2)
    Coordinates:
      * reference_time  (reference_time) datetime64[ns] 2009-01-01
      * feature_id      (feature_id) int32 101 179 181 ... 1180001803 1180001804
      * time            (time) datetime64[ns] 2009-01-01 2009-01-01T01:00:00
    Data variables:
        streamflow      (time, feature_id) float64 dask.array<shape=(2, 2729077), chunksize=(1, 2729077)>
        q_lateral       (time, feature_id) float64 dask.array<shape=(2, 2729077), chunksize=(1, 2729077)>
        velocity        (time, feature_id) float64 dask.array<shape=(2, 2729077), chunksize=(1, 2729077)>
        qSfcLatRunoff   (time, feature_id) float64 dask.array<shape=(2, 2729077), chunksize=(1, 2729077)>
        qBucket         (time, feature_id) float64 dask.array<shape=(2, 2729077), chunksize=(1, 2729077)>
        qBtmVertRunoff  (time, feature_id) float64 dask.array<shape=(2, 2729077), chunksize=(1, 2729077)>
    Attributes:
        featureType:                timeSeries
        proj4:                      +proj=longlat +datum=NAD83 +no_defs
        model_initialization_time:  2009-01-01_00:00:00
        station_dimension:          feature_id
        model_output_valid_time:    2009-01-01_00:00:00
        stream_order_output:        1
        cdm_datatype:               Station
        esri_pe_string:             GEOGCS[GCS_North_American_1983,DATUM[D_North_...
        Conventions:                CF-1.6
        model_version:              NWM 1.2
        dev_OVRTSWCRT:              1
        dev_NOAH_TIMESTEP:          3600
        dev_channel_only:           0
        dev_channelBucket_only:     0
        dev:                        dev_ prefix indicates development/internal me...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
509340139 https://github.com/pydata/xarray/issues/2501#issuecomment-509340139 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUwOTM0MDEzOQ== rsignell-usgs 1872600 2019-07-08T18:30:18Z 2019-07-08T18:30:18Z NONE

@TomAugspurger, okay, I just ran the above code again and here's what happens:

The open_mfdataset call proceeds nicely on my 8 workers with 40 cores, eventually completing the 8760 open_dataset tasks in about 10 minutes. One interesting thing is that the number of tasks keeps dropping as time goes on; I'm not sure why that would be. The memory usage on the workers seems okay during this process.

Then, despite the tasks showing on the dashboard being completed, the open_mfdataset command does not complete, but nothing has died, and I'm not sure what's happening. I check top and get this:

then after about 10 more minutes, I get these warnings:

and then the errors:

    distributed.client - WARNING - Couldn't gather 17520 keys, rescheduling
    {'getattr-fd038834-befa-4a9b-b78f-51f9aa2b28e5': ('tcp://127.0.0.1:45640',),
     'drop_coords-39be9e52-59de-4e1f-b6d8-27e7d931b5af': ('tcp://127.0.0.1:55881',),
     'drop_coords-8bd07037-9ca4-4f97-83fb-8b02d7ad0333': ('tcp://127.0.0.1:56164',),
     'drop_coords-ca3dd72b-e5af-4099-b593-89dc97717718': ('tcp://127.0.0.1:59961',),
     'getattr-c0af8992-e928-4d42-9e64-340303143454': ('tcp://127.0.0.1:42989',),
     'drop_coords-8cdfe5fb-7a29-4606-8692-efa747be5bc1': ('tcp://127.0.0.1:35445',),
     'getattr-03669206-0d26-46a1-988d-690fe830e52f': ...

Full error listing here: https://gist.github.com/rsignell-usgs/3b7101966b8c6d05f48a0e01695f35d6

Does this help? I'd be happy to screenshare if that would be useful.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
509307081 https://github.com/pydata/xarray/issues/2501#issuecomment-509307081 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUwOTMwNzA4MQ== TomAugspurger 1312546 2019-07-08T16:57:15Z 2019-07-08T16:57:15Z MEMBER

I'm looking into it today. Can you clarify

The memory use kept growing until the process died.

by "process" do you mean a dask worker process, or just the main python process executing the ds = xr.open_mfdataset(...) code?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
509282831 https://github.com/pydata/xarray/issues/2501#issuecomment-509282831 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUwOTI4MjgzMQ== rsignell-usgs 1872600 2019-07-08T15:51:23Z 2019-07-08T15:51:23Z NONE

@TomAugspurger, I'm back from vacation now and ready to attack this again. Any updates on your end?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
506497180 https://github.com/pydata/xarray/issues/2501#issuecomment-506497180 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUwNjQ5NzE4MA== TomAugspurger 1312546 2019-06-27T20:24:26Z 2019-06-27T20:24:26Z MEMBER

The datasets in our cloud datastore are designed explicitly to avoid this problem!

Good to know!

FYI, https://github.com/pydata/xarray/issues/2501#issuecomment-506478508 was user error (I can access it, but need to specify the us-east-1 region). Taking a look now.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
506482057 https://github.com/pydata/xarray/issues/2501#issuecomment-506482057 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUwNjQ4MjA1Nw== rabernat 1197350 2019-06-27T19:36:51Z 2019-06-27T19:36:51Z MEMBER

@rsignell-usgs

Can you post the xarray repr of two sample files post pre-processing function?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
506481845 https://github.com/pydata/xarray/issues/2501#issuecomment-506481845 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUwNjQ4MTg0NQ== rabernat 1197350 2019-06-27T19:36:11Z 2019-06-27T19:36:11Z MEMBER

Are there any datasets on https://pangeo-data.github.io/pangeo-datastore/ that would exhibit this poor behavior?

The datasets in our cloud datastore are designed explicitly to avoid this problem!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
506478508 https://github.com/pydata/xarray/issues/2501#issuecomment-506478508 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUwNjQ3ODUwOA== TomAugspurger 1312546 2019-06-27T19:25:05Z 2019-06-27T19:25:05Z MEMBER

Thanks, will take a look this afternoon. Are there any datasets on https://pangeo-data.github.io/pangeo-datastore/ that would exhibit this poor behavior? I may not have access to the bucket (or I'm misusing rclone)

    2019/06/27 14:23:50 NOTICE: Config file "/Users/taugspurger/.config/rclone/rclone.conf" not found - using defaults
    2019/06/27 14:23:50 Failed to create file system for "aws-east:nwm-archive/2009": didn't find section in config file

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
506475819 https://github.com/pydata/xarray/issues/2501#issuecomment-506475819 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUwNjQ3NTgxOQ== rsignell-usgs 1872600 2019-06-27T19:16:28Z 2019-06-27T19:24:31Z NONE

I tried this, and either I didn't apply it right, or it didn't work. The memory use kept growing until the process died. My code to process the 8760 netcdf files with open_mfdataset looks like this:

```python
import xarray as xr
from dask.distributed import Client, progress, LocalCluster

cluster = LocalCluster()
client = Client(cluster)

import pandas as pd

dates = pd.date_range(start='2009-01-01 00:00',end='2009-12-31 23:00', freq='1h')
files = ['./nc/{}/{}.CHRTOUT_DOMAIN1.comp'.format(date.strftime('%Y'),date.strftime('%Y%m%d%H%M')) for date in dates]

def drop_coords(ds):
    return ds.reset_coords(drop=True)

ds = xr.open_mfdataset(files, preprocess=drop_coords, autoclose=True, parallel=True)
ds1 = ds.chunk(chunks={'time':168, 'feature_id':209929})

import numcodecs
numcodecs.blosc.use_threads = False
ds1.to_zarr('zarr/2009', mode='w', consolidated=True)
```
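The count of 8760 files follows from one file per hour over 2009 (365 days × 24 hours). As a quick sanity check, here is a standard-library sketch that builds the same list of paths without pandas, assuming the path template used in the snippet above:

```python
from datetime import datetime, timedelta

start = datetime(2009, 1, 1, 0, 0)
end = datetime(2009, 12, 31, 23, 0)

# One path per hour, inclusive of both endpoints.
files = []
current = start
while current <= end:
    files.append('./nc/{}/{}.CHRTOUT_DOMAIN1.comp'.format(
        current.strftime('%Y'), current.strftime('%Y%m%d%H%M')))
    current += timedelta(hours=1)

print(len(files))
print(files[0])
print(files[-1])
```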

I transferred the netcdf files from AWS S3 to my local disk to run this, using this command:

    rclone sync --include '*.CHRTOUT_DOMAIN1.comp' aws-east:nwm-archive/2009 . --checksum --fast-list --transfers 16

@TomAugspurger, if you could take a look, that would be great, and if you have any ideas of how to make this example simpler/more easily reproducible, please let me know.
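The rechunking step in the code above turns one-timestep chunks spanning all features into chunks of 168 timesteps by 209929 features. Here is a rough sketch of what that implies per variable, assuming float64 data and 2729077 features as in the two-file repr posted in this thread; the variable names are illustrative only.

```python
n_time, n_features = 8760, 2729077
itemsize = 8  # bytes per float64 value (assumed from the posted repr)

source_chunk = (1, n_features)  # one file contributes one timestep
target_chunk = (168, 209929)    # chunks requested via ds.chunk(...)

source_chunk_bytes = source_chunk[0] * source_chunk[1] * itemsize
target_chunk_bytes = target_chunk[0] * target_chunk[1] * itemsize

# Integer ceiling division: number of target chunks along each dimension.
time_chunks = -(-n_time // target_chunk[0])
feature_chunks = -(-n_features // target_chunk[1])

# Each target chunk spans this many source files along time.
files_per_target_chunk = target_chunk[0] // source_chunk[0]

print(f"source chunk: {source_chunk_bytes / 1e6:.1f} MB")
print(f"target chunk: {target_chunk_bytes / 1e6:.1f} MB")
print(f"target grid:  {time_chunks} x {feature_chunks} chunks")
print(f"source files per target chunk: {files_per_target_chunk}")
```

Each target chunk draws on 168 source files, so a worker needs data from many reads before it can finish a single write.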

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
503641038 https://github.com/pydata/xarray/issues/2501#issuecomment-503641038 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUwMzY0MTAzOA== rabernat 1197350 2019-06-19T16:48:29Z 2019-06-19T16:48:29Z MEMBER

Try writing a preprocessor function that drops all coordinates:

    def drop_coords(ds):
        return ds.reset_coords(drop=True)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
497381301 https://github.com/pydata/xarray/issues/2501#issuecomment-497381301 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDQ5NzM4MTMwMQ== rsignell-usgs 1872600 2019-05-30T15:55:56Z 2019-05-30T15:58:48Z NONE

I'm hitting some memory issues with using open_mfdataset with a cluster also.

Specifically, I'm trying to open 8760 NetCDF files with an 8 node, 40 cpu LocalCluster.

When I issue:

    ds = xr.open_mfdataset(files, parallel=True)

all looks good on the Dask dashboard, and the tasks complete with no errors in about 4 minutes.

Then 4 more minutes go by before I get a bunch of errors like:

    distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
    distributed.nanny - WARNING - Worker process 26054 was killed by unknown signal
    distributed.nanny - WARNING - Restarting worker

and my cell doesn't complete.

Any suggestions?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
432546977 https://github.com/pydata/xarray/issues/2501#issuecomment-432546977 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDQzMjU0Njk3Nw== Thomas-Z 1492047 2018-10-24T07:38:31Z 2018-10-24T07:38:31Z CONTRIBUTOR

Thank you for looking into this.

I just want to point out that I'm not that much concerned with the "slow performance" but much more with the memory consumption and the limitation it implies.

```python
from glob import glob
import xarray as xr

all_files = glob('...TP110.nc')
display(xr.open_dataset(all_files[0]))
display(xr.open_dataset(all_files[1]))
```

<xarray.Dataset> Dimensions: (meas_ind: 40, time: 2871, wvf_ind: 128) Coordinates: * time (time) datetime64[ns] 2017-06-19T14:24:20.792036992 ... 2017-06-19T15:14:38.491743104 * meas_ind (meas_ind) int8 0 1 2 3 4 ... 36 37 38 39 * wvf_ind (wvf_ind) int8 0 1 2 3 ... 125 126 127 lat (time) float64 ... lon (time) float64 ... lon_40hz (time, meas_ind) float64 ... lat_40hz (time, meas_ind) float64 ... Data variables: time_40hz (time, meas_ind) datetime64[ns] ... surface_type (time) float32 ... rad_surf_type (time) float32 ... qual_alt_1hz_range (time) float32 ... qual_alt_1hz_swh (time) float32 ... qual_alt_1hz_sig0 (time) float32 ... qual_alt_1hz_off_nadir_angle_wf (time) float32 ... qual_inst_corr_1hz_range (time) float32 ... qual_inst_corr_1hz_swh (time) float32 ... qual_inst_corr_1hz_sig0 (time) float32 ... qual_rad_1hz_tb_k (time) float32 ... qual_rad_1hz_tb_ka (time) float32 ... alt_state_flag_acq_mode_40hz (time, meas_ind) float32 ... alt_state_flag_tracking_mode_40hz (time, meas_ind) float32 ... orb_state_flag_diode (time) float32 ... orb_state_flag_rest (time) float32 ... ecmwf_meteo_map_avail (time) float32 ... trailing_edge_variation_flag (time) float32 ... trailing_edge_variation_flag_40hz (time, meas_ind) float32 ... ice_flag (time) float32 ... interp_flag_mean_sea_surface (time) float32 ... interp_flag_mdt (time) float32 ... interp_flag_ocean_tide_sol1 (time) float32 ... interp_flag_ocean_tide_sol2 (time) float32 ... interp_flag_meteo (time) float32 ... alt (time) float64 ... alt_40hz (time, meas_ind) float64 ... orb_alt_rate (time) float32 ... range (time) float64 ... range_40hz (time, meas_ind) float64 ... range_used_40hz (time, meas_ind) float32 ... range_rms (time) float32 ... range_numval (time) float32 ... number_of_iterations (time, meas_ind) float32 ... net_instr_corr_range (time) float64 ... model_dry_tropo_corr (time) float32 ... model_wet_tropo_corr (time) float32 ... rad_wet_tropo_corr (time) float32 ... iono_corr_gim (time) float32 ... 
sea_state_bias (time) float32 ... swh (time) float32 ... swh_40hz (time, meas_ind) float32 ... swh_used_40hz (time, meas_ind) float32 ... swh_rms (time) float32 ... swh_numval (time) float32 ... net_instr_corr_swh (time) float32 ... sig0 (time) float32 ... sig0_40hz (time, meas_ind) float32 ... sig0_used_40hz (time, meas_ind) float32 ... sig0_rms (time) float32 ... sig0_numval (time) float32 ... agc (time) float32 ... agc_rms (time) float32 ... agc_numval (time) float32 ... net_instr_corr_sig0 (time) float32 ... atmos_corr_sig0 (time) float32 ... off_nadir_angle_wf (time) float32 ... off_nadir_angle_wf_40hz (time, meas_ind) float32 ... tb_k (time) float32 ... tb_ka (time) float32 ... mean_sea_surface (time) float64 ... mean_topography (time) float64 ... geoid (time) float64 ... bathymetry (time) float64 ... inv_bar_corr (time) float32 ... hf_fluctuations_corr (time) float32 ... ocean_tide_sol1 (time) float64 ... ocean_tide_sol2 (time) float64 ... ocean_tide_equil (time) float32 ... ocean_tide_non_equil (time) float32 ... load_tide_sol1 (time) float32 ... load_tide_sol2 (time) float32 ... solid_earth_tide (time) float32 ... pole_tide (time) float32 ... wind_speed_model_u (time) float32 ... wind_speed_model_v (time) float32 ... wind_speed_alt (time) float32 ... rad_water_vapor (time) float32 ... rad_liquid_water (time) float32 ... ice1_range_40hz (time, meas_ind) float64 ... ice1_sig0_40hz (time, meas_ind) float32 ... ice1_qual_flag_40hz (time, meas_ind) float32 ... seaice_range_40hz (time, meas_ind) float64 ... seaice_sig0_40hz (time, meas_ind) float32 ... seaice_qual_flag_40hz (time, meas_ind) float32 ... ice2_range_40hz (time, meas_ind) float64 ... ice2_le_sig0_40hz (time, meas_ind) float32 ... ice2_sig0_40hz (time, meas_ind) float32 ... ice2_sigmal_40hz (time, meas_ind) float32 ... ice2_slope1_40hz (time, meas_ind) float64 ... ice2_slope2_40hz (time, meas_ind) float64 ... ice2_mqe_40hz (time, meas_ind) float32 ... ice2_qual_flag_40hz (time, meas_ind) float32 ... 
mqe_40hz (time, meas_ind) float32 ... peakiness_40hz (time, meas_ind) float32 ... ssha (time) float32 ... tracker_40hz (time, meas_ind) float64 ... tracker_used_40hz (time, meas_ind) float32 ... tracker_diode_40hz (time, meas_ind) float64 ... pri_counter_40hz (time, meas_ind) float64 ... qual_alt_1hz_off_nadir_angle_pf (time) float32 ... off_nadir_angle_pf (time) float32 ... off_nadir_angle_rain_40hz (time, meas_ind) float32 ... uso_corr (time) float64 ... internal_path_delay_corr (time) float64 ... modeled_instr_corr_range (time) float32 ... doppler_corr (time) float32 ... cog_corr (time) float32 ... modeled_instr_corr_swh (time) float32 ... internal_corr_sig0 (time) float32 ... modeled_instr_corr_sig0 (time) float32 ... agc_40hz (time, meas_ind) float32 ... agc_corr_40hz (time, meas_ind) float32 ... scaling_factor_40hz (time, meas_ind) float64 ... epoch_40hz (time, meas_ind) float64 ... width_leading_edge_40hz (time, meas_ind) float64 ... amplitude_40hz (time, meas_ind) float64 ... thermal_noise_40hz (time, meas_ind) float64 ... seaice_epoch_40hz (time, meas_ind) float64 ... seaice_amplitude_40hz (time, meas_ind) float64 ... ice2_epoch_40hz (time, meas_ind) float64 ... ice2_amplitude_40hz (time, meas_ind) float64 ... ice2_mean_amplitude_40hz (time, meas_ind) float64 ... ice2_thermal_noise_40hz (time, meas_ind) float64 ... ice2_slope_40hz (time, meas_ind) float64 ... signal_to_noise_ratio (time) float32 ... waveforms_40hz (time, meas_ind, wvf_ind) float32 ... Attributes: Conventions: CF-1.1 title: GDR - Expertise dataset institution: CNES source: radar altimeter history: 2017-07-21 08:25:07 : Creation contact: CNES aviso@oceanobs.com, EUMETSAT ops@... references: L1 library=V4.5p1, L2 library=V5.5p2, ... processing_center: SALP reference_document: SARAL/ALTIKA Products Handbook, SALP-M... 
mission_name: SARAL altimeter_sensor_name: ALTIKA radiometer_sensor_name: ALTIKA_RAD doris_sensor_name: DGXX cycle_number: 110 absolute_rev_number: 22545 pass_number: 1 absolute_pass_number: 109219 equator_time: 2017-06-19 14:49:32.128000 equator_longitude: 227.77 first_meas_time: 2017-06-19 14:24:20.792037 last_meas_time: 2017-06-19 15:14:38.491743 xref_altimeter_level1: ALK_ALT_1PaS20170619_154722_20170619_1... xref_radiometer_level1: ALK_RAD_1PaS20170619_154643_20170619_1... xref_altimeter_characterisation: ALK_CHA_AXVCNE20131115_120000_20100101... xref_radiometer_characterisation: ALK_CHR_AXVCNE20110207_180000_20110101... xref_altimeter_ltm: ALK_CAL_AXXCNE20170720_110014_20130102... xref_doris_uso: SRL_OS1_AXXCNE20170720_083800_20130226... xref_orbit_data: SRL_VOR_AXVCNE20170720_111700_20170618... xref_pf_data: SRL_VPF_AXVCNE20170720_111800_20170618... xref_pole_location: SMM_POL_AXXCNE20170721_071500_19870101... xref_gim_data: SRL_ION_AXPCNE20170620_074756_20170619... xref_mog2d_data: SMM_MOG_AXVCNE20170709_191501_20170619... xref_orf_data: SRL_ORF_AXXCNE20170720_083800_20160704... xref_meteorological_files: SMM_APA_AXVCNE20170619_170611_20170619... ellipsoid_axis: 6378136.3 ellipsoid_flattening: 0.0033528131778969 <xarray.Dataset> Dimensions: (meas_ind: 40, time: 2779, wvf_ind: 128) Coordinates: * time (time) datetime64[ns] 2017-06-19T15:14:39.356848 ... 2017-06-19T16:04:56.808873920 * meas_ind (meas_ind) int8 0 1 2 3 4 ... 36 37 38 39 * wvf_ind (wvf_ind) int8 0 1 2 3 ... 125 126 127 lat (time) float64 ... lon (time) float64 ... lon_40hz (time, meas_ind) float64 ... lat_40hz (time, meas_ind) float64 ... Data variables: time_40hz (time, meas_ind) datetime64[ns] ... surface_type (time) float32 ... rad_surf_type (time) float32 ... qual_alt_1hz_range (time) float32 ... qual_alt_1hz_swh (time) float32 ... qual_alt_1hz_sig0 (time) float32 ... qual_alt_1hz_off_nadir_angle_wf (time) float32 ... qual_inst_corr_1hz_range (time) float32 ... 
qual_inst_corr_1hz_swh (time) float32 ... qual_inst_corr_1hz_sig0 (time) float32 ... qual_rad_1hz_tb_k (time) float32 ... qual_rad_1hz_tb_ka (time) float32 ... alt_state_flag_acq_mode_40hz (time, meas_ind) float32 ... alt_state_flag_tracking_mode_40hz (time, meas_ind) float32 ... orb_state_flag_diode (time) float32 ... orb_state_flag_rest (time) float32 ... ecmwf_meteo_map_avail (time) float32 ... trailing_edge_variation_flag (time) float32 ... trailing_edge_variation_flag_40hz (time, meas_ind) float32 ... ice_flag (time) float32 ... interp_flag_mean_sea_surface (time) float32 ... interp_flag_mdt (time) float32 ... interp_flag_ocean_tide_sol1 (time) float32 ... interp_flag_ocean_tide_sol2 (time) float32 ... interp_flag_meteo (time) float32 ... alt (time) float64 ... alt_40hz (time, meas_ind) float64 ... orb_alt_rate (time) float32 ... range (time) float64 ... range_40hz (time, meas_ind) float64 ... range_used_40hz (time, meas_ind) float32 ... range_rms (time) float32 ... range_numval (time) float32 ... number_of_iterations (time, meas_ind) float32 ... net_instr_corr_range (time) float64 ... model_dry_tropo_corr (time) float32 ... model_wet_tropo_corr (time) float32 ... rad_wet_tropo_corr (time) float32 ... iono_corr_gim (time) float32 ... sea_state_bias (time) float32 ... swh (time) float32 ... swh_40hz (time, meas_ind) float32 ... swh_used_40hz (time, meas_ind) float32 ... swh_rms (time) float32 ... swh_numval (time) float32 ... net_instr_corr_swh (time) float32 ... sig0 (time) float32 ... sig0_40hz (time, meas_ind) float32 ... sig0_used_40hz (time, meas_ind) float32 ... sig0_rms (time) float32 ... sig0_numval (time) float32 ... agc (time) float32 ... agc_rms (time) float32 ... agc_numval (time) float32 ... net_instr_corr_sig0 (time) float32 ... atmos_corr_sig0 (time) float32 ... off_nadir_angle_wf (time) float32 ... off_nadir_angle_wf_40hz (time, meas_ind) float32 ... tb_k (time) float32 ... tb_ka (time) float32 ... mean_sea_surface (time) float64 ... 
mean_topography (time) float64 ... geoid (time) float64 ... bathymetry (time) float64 ... inv_bar_corr (time) float32 ... hf_fluctuations_corr (time) float32 ... ocean_tide_sol1 (time) float64 ... ocean_tide_sol2 (time) float64 ... ocean_tide_equil (time) float32 ... ocean_tide_non_equil (time) float32 ... load_tide_sol1 (time) float32 ... load_tide_sol2 (time) float32 ... solid_earth_tide (time) float32 ... pole_tide (time) float32 ... wind_speed_model_u (time) float32 ... wind_speed_model_v (time) float32 ... wind_speed_alt (time) float32 ... rad_water_vapor (time) float32 ... rad_liquid_water (time) float32 ... ice1_range_40hz (time, meas_ind) float64 ... ice1_sig0_40hz (time, meas_ind) float32 ... ice1_qual_flag_40hz (time, meas_ind) float32 ... seaice_range_40hz (time, meas_ind) float64 ... seaice_sig0_40hz (time, meas_ind) float32 ... seaice_qual_flag_40hz (time, meas_ind) float32 ... ice2_range_40hz (time, meas_ind) float64 ... ice2_le_sig0_40hz (time, meas_ind) float32 ... ice2_sig0_40hz (time, meas_ind) float32 ... ice2_sigmal_40hz (time, meas_ind) float32 ... ice2_slope1_40hz (time, meas_ind) float64 ... ice2_slope2_40hz (time, meas_ind) float64 ... ice2_mqe_40hz (time, meas_ind) float32 ... ice2_qual_flag_40hz (time, meas_ind) float32 ... mqe_40hz (time, meas_ind) float32 ... peakiness_40hz (time, meas_ind) float32 ... ssha (time) float32 ... tracker_40hz (time, meas_ind) float64 ... tracker_used_40hz (time, meas_ind) float32 ... tracker_diode_40hz (time, meas_ind) float64 ... pri_counter_40hz (time, meas_ind) float64 ... qual_alt_1hz_off_nadir_angle_pf (time) float32 ... off_nadir_angle_pf (time) float32 ... off_nadir_angle_rain_40hz (time, meas_ind) float32 ... uso_corr (time) float64 ... internal_path_delay_corr (time) float64 ... modeled_instr_corr_range (time) float32 ... doppler_corr (time) float32 ... cog_corr (time) float32 ... modeled_instr_corr_swh (time) float32 ... internal_corr_sig0 (time) float32 ... 
modeled_instr_corr_sig0 (time) float32 ... agc_40hz (time, meas_ind) float32 ... agc_corr_40hz (time, meas_ind) float32 ... scaling_factor_40hz (time, meas_ind) float64 ... epoch_40hz (time, meas_ind) float64 ... width_leading_edge_40hz (time, meas_ind) float64 ... amplitude_40hz (time, meas_ind) float64 ... thermal_noise_40hz (time, meas_ind) float64 ... seaice_epoch_40hz (time, meas_ind) float64 ... seaice_amplitude_40hz (time, meas_ind) float64 ... ice2_epoch_40hz (time, meas_ind) float64 ... ice2_amplitude_40hz (time, meas_ind) float64 ... ice2_mean_amplitude_40hz (time, meas_ind) float64 ... ice2_thermal_noise_40hz (time, meas_ind) float64 ... ice2_slope_40hz (time, meas_ind) float64 ... signal_to_noise_ratio (time) float32 ... waveforms_40hz (time, meas_ind, wvf_ind) float32 ... Attributes: Conventions: CF-1.1 title: GDR - Expertise dataset institution: CNES source: radar altimeter history: 2017-07-21 08:25:19 : Creation contact: CNES aviso@oceanobs.com, EUMETSAT ops@... references: L1 library=V4.5p1, L2 library=V5.5p2, ... processing_center: SALP reference_document: SARAL/ALTIKA Products Handbook, SALP-M... mission_name: SARAL altimeter_sensor_name: ALTIKA radiometer_sensor_name: ALTIKA_RAD doris_sensor_name: DGXX cycle_number: 110 absolute_rev_number: 22546 pass_number: 2 absolute_pass_number: 109220 equator_time: 2017-06-19 15:39:46.492000 equator_longitude: 35.21 first_meas_time: 2017-06-19 15:14:39.356848 last_meas_time: 2017-06-19 16:04:56.808874 xref_altimeter_level1: ALK_ALT_1PaS20170619_154722_20170619_1... xref_radiometer_level1: ALK_RAD_1PaS20170619_154643_20170619_1... xref_altimeter_characterisation: ALK_CHA_AXVCNE20131115_120000_20100101... xref_radiometer_characterisation: ALK_CHR_AXVCNE20110207_180000_20110101... xref_altimeter_ltm: ALK_CAL_AXXCNE20170720_110014_20130102... xref_doris_uso: SRL_OS1_AXXCNE20170720_083800_20130226... xref_orbit_data: SRL_VOR_AXVCNE20170720_111700_20170618... xref_pf_data: SRL_VPF_AXVCNE20170720_111800_20170618... 
xref_pole_location: SMM_POL_AXXCNE20170721_071500_19870101... xref_gim_data: SRL_ION_AXPCNE20170620_074756_20170619... xref_mog2d_data: SMM_MOG_AXVCNE20170709_191501_20170619... xref_orf_data: SRL_ORF_AXXCNE20170720_083800_20160704... xref_meteorological_files: SMM_APA_AXVCNE20170619_170611_20170619... ellipsoid_axis: 6378136.3 ellipsoid_flattening: 0.0033528131778969

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
432342306 https://github.com/pydata/xarray/issues/2501#issuecomment-432342306 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDQzMjM0MjMwNg== rabernat 1197350 2018-10-23T17:27:50Z 2018-10-23T17:27:50Z MEMBER

^ I'm assuming you're in a notebook. If not, call print instead of display.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
432342180 https://github.com/pydata/xarray/issues/2501#issuecomment-432342180 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDQzMjM0MjE4MA== rabernat 1197350 2018-10-23T17:27:30Z 2018-10-23T17:27:30Z MEMBER

In open_mfdataset, all of the dimensions and coordinates of the individual files have to be checked and verified to be compatible. That is often the source of slow performance with open_mfdataset.

To help us help you debug, please provide more information about the files you are opening. Specifically, please call open_dataset() directly on the first two files and copy and paste the output here. For example:

```python
from glob import glob
import xarray as xr

all_files = glob('*1002*.nc')
display(xr.open_dataset(all_files[0]))
display(xr.open_dataset(all_files[1]))
```
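As a rough illustration of the kind of compatibility check described above, here is a minimal sketch that compares the structure of two datasets. It is not what open_mfdataset does internally; `compare_structure` and `make_pass` are hypothetical helpers, and the two small in-memory datasets only stand in for two real files so the example runs without any data on disk.

```python
import numpy as np
import xarray as xr


def compare_structure(a: xr.Dataset, b: xr.Dataset) -> dict:
    """Hypothetical helper: summarise how two datasets differ structurally."""
    report = {}
    # Dimensions present in both datasets but with different lengths.
    report["dim_size_mismatch"] = {
        name: (a.sizes[name], b.sizes[name])
        for name in set(a.sizes) & set(b.sizes)
        if a.sizes[name] != b.sizes[name]
    }
    # Variables (including coordinates) that exist in only one dataset.
    report["only_in_a"] = sorted(set(a.variables) - set(b.variables))
    report["only_in_b"] = sorted(set(b.variables) - set(a.variables))
    # Shared coordinates whose values are identical in both datasets.
    report["identical_coords"] = sorted(
        name for name in set(a.coords) & set(b.coords)
        if a[name].equals(b[name])
    )
    return report


def make_pass(n_time: int, start: str) -> xr.Dataset:
    """Build a tiny synthetic stand-in for one file (assumed layout)."""
    time = np.datetime64(start) + np.arange(n_time) * np.timedelta64(1, "s")
    return xr.Dataset(
        {
            "swh": ("time", np.zeros(n_time, dtype="float32")),
            "swh_40hz": (
                ("time", "meas_ind"),
                np.zeros((n_time, 40), dtype="float32"),
            ),
        },
        coords={"time": time, "meas_ind": np.arange(40, dtype="int8")},
    )


first = make_pass(5, "2017-06-19T14:24:20")
second = make_pass(7, "2017-06-19T15:14:39")
report = compare_structure(first, second)
print(report)
```

With real files, `first` and `second` would instead come from xr.open_dataset on the first two paths. A coordinate that is identical in every file (here `meas_ind`) does not need to be concatenated, while a dimension whose length differs between files (here `time`) is the natural one to concatenate along.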

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);