issues: 372848074

id: 372848074
node_id: MDU6SXNzdWUzNzI4NDgwNzQ=
number: 2501
title: open_mfdataset usage and limitations.
user: 1492047
state: closed
locked: 0
assignee: (none)
milestone: (none)
comments: 22
created_at: 2018-10-23T07:31:42Z
updated_at: 2021-01-27T18:06:16Z
closed_at: 2021-01-27T18:06:16Z
author_association: CONTRIBUTOR
active_lock_reason: (none)
draft: (none)
pull_request: (none)
performed_via_github_app: (none)
reactions: 0
state_reason: completed
repo: 13221727
type: issue

I'm trying to understand and use the open_mfdataset function to open a huge number of files. I thought this function would be quite similar to dask.dataframe.from_delayed and would allow one to "load" and work on an amount of data limited only by the number of Dask workers (or "unlimited", considering it could be "lazily loaded").

But my tests showed something quite different. It seems xarray requires the index to be copied back to the Dask client in order to "auto_combine" the data.
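For comparison, the dask.dataframe pattern I have in mind is roughly the sketch below (read_one and the file list are placeholders, not part of my actual code):

```python
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def read_one(path):
    # Placeholder per-file reader: each call runs on a worker,
    # so the client only ever holds the task graph, not the data.
    return pd.read_csv(path)

files = ["part-0001.csv", "part-0002.csv"]  # placeholder file list
ddf = dd.from_delayed([read_one(f) for f in files])
```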

Doing some tests on a small portion of my data, I get something like this.

Each file has these dimensions: time: ~2871, xx_ind: 40, yy_ind: 128. The files are concatenated along the time dimension, and my understanding is that only the time coordinate is loaded and brought back to the client (the other dimensions are constant).

The parallel tests are run with 200 Dask workers.
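For context, the measurements were collected roughly as in the sketch below (the scheduler address and glob patterns are placeholders; the memory figures match memory_profiler's %memit output and the timings IPython's %time):

```python
from dask.distributed import Client
import xarray as xr

# Connect to the existing 200-worker cluster (address is a placeholder).
client = Client("tcp://scheduler-address:8786")

# In IPython, each call is then wrapped with the magics that produce the
# output below, e.g.:
#   %load_ext memory_profiler
#   %memit ds = xr.open_mfdataset('1002/*.nc')                 # peak memory / increment
#   %time  ds = xr.open_mfdataset('1002/*.nc', parallel=True)  # wall time
```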

```python
=================== Loading 1002 files ===================

xr.open_mfdataset('1002.nc')
peak memory: 1660.59 MiB, increment: 1536.25 MiB
Wall time: 1min 29s

xr.open_mfdataset('1002.nc', parallel=True)
peak memory: 1745.14 MiB, increment: 1602.43 MiB
Wall time: 53 s

=================== Loading 5010 files ===================

xr.open_mfdataset('5010.nc')
peak memory: 7419.99 MiB, increment: 7315.36 MiB
Wall time: 8min 33s

xr.open_mfdataset('5010.nc', parallel=True)
peak memory: 8249.75 MiB, increment: 8112.07 MiB
Wall time: 4min 48s
```

As you can see, the amount of memory used for this operation is significant, and I won't be able to do this on many more files. When using the parallel option, the loading of the files takes only a few seconds (judging from what the Dask dashboard shows), and I'm guessing the rest of the time is spent in the "auto_combine".
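One way to check that guess (a sketch; the glob is a placeholder) would be to separate the open step from the combine step and time the second one on its own:

```python
import glob
import xarray as xr

paths = sorted(glob.glob("1002/*.nc"))                      # placeholder glob
datasets = [xr.open_dataset(p, chunks={}) for p in paths]   # lazy, dask-backed opens
combined = xr.auto_combine(datasets, concat_dim="time")     # time this step on its own
```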

So I'm wondering if I'm doing something wrong, if there is another way to load the data, or if I cannot use xarray directly for this quantity of data and have to use Dask directly.
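For completeness, the kind of lower-overhead call I am hoping for would look like the sketch below; it assumes a newer xarray release where combine='nested' and compat='override' are available (they are not in the 0.10.9 build reported below), and the glob is a placeholder:

```python
import xarray as xr

ds = xr.open_mfdataset(
    "5010/*.nc",            # placeholder glob
    combine="nested",       # files are already ordered; skip coordinate-based inference
    concat_dim="time",      # only the time dimension differs between files
    data_vars="minimal",    # only concatenate variables that contain the 'time' dimension
    coords="minimal",
    compat="override",      # take non-concatenated variables from the first file
    parallel=True,          # open files inside dask tasks instead of on the client
)
```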

Thanks in advance.

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-34-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

xarray: 0.10.9+32.g9f4474d.dirty
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: 0.6.2
h5py: 2.8.0
Nio: None
zarr: 2.2.0
cftime: 1.0.1
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: None
cyordereddict: None
dask: 0.19.4
distributed: 1.23.3
matplotlib: 3.0.0
cartopy: None
seaborn: None
setuptools: 40.4.3
pip: 18.1
conda: None
pytest: 3.9.1
IPython: 7.0.1
sphinx: None
