issues: 372848074

id: 372848074
node_id: MDU6SXNzdWUzNzI4NDgwNzQ=
number: 2501
title: open_mfdataset usage and limitations.
user: 1492047
state: closed
locked: 0
assignee: (none)
milestone: (none)
comments: 22
created_at: 2018-10-23T07:31:42Z
updated_at: 2021-01-27T18:06:16Z
closed_at: 2021-01-27T18:06:16Z
author_association: CONTRIBUTOR
active_lock_reason: (none)
draft: (none)
pull_request: (none)
performed_via_github_app: (none)
reactions: 0
state_reason: completed
repo: 13221727
type: issue

I'm trying to understand and use the open_mfdataset function to open a huge number of files. I thought this function would be quite similar to dask.dataframe.from_delayed and would allow one to "load" and work on an amount of data limited only by the number of Dask workers (or "unlimited", considering it could be "lazily loaded").

But my tests showed something quite different. It seems xarray requires the index to be copied back to the Dask client in order to "auto_combine" the data.
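For comparison, the dask.dataframe pattern I have in mind is roughly the sketch below (read_one and the file list are placeholders, not part of my actual code):

```python
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def read_one(path):
    # Placeholder per-file reader: each call runs on a worker,
    # so the client only ever holds the task graph, not the data.
    return pd.read_csv(path)

files = ["part-0001.csv", "part-0002.csv"]  # placeholder file list
ddf = dd.from_delayed([read_one(f) for f in files])
```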

Doing some tests on a small portion of my data, I get something like this.

Each file has these dimensions: time: ~2871, xx_ind: 40, yy_ind: 128. The files are concatenated along the time dimension, and my understanding is that only the time coordinate is loaded and brought back to the client (the other dimensions are constant).

The parallel tests are run with 200 Dask workers.
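For context, the measurements were collected roughly as in the sketch below (the scheduler address and glob patterns are placeholders; the memory figures match memory_profiler's %memit output and the timings IPython's %time):

```python
from dask.distributed import Client
import xarray as xr

# Connect to the existing 200-worker cluster (address is a placeholder).
client = Client("tcp://scheduler-address:8786")

# In IPython, each call is then wrapped with the magics that produce the
# output below, e.g.:
#   %load_ext memory_profiler
#   %memit ds = xr.open_mfdataset('1002/*.nc')                 # peak memory / increment
#   %time  ds = xr.open_mfdataset('1002/*.nc', parallel=True)  # wall time
```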

```python
=================== Loading 1002 files ===================

xr.open_mfdataset('1002.nc')
peak memory: 1660.59 MiB, increment: 1536.25 MiB
Wall time: 1min 29s

xr.open_mfdataset('1002.nc', parallel=True)
peak memory: 1745.14 MiB, increment: 1602.43 MiB
Wall time: 53 s

=================== Loading 5010 files ===================

xr.open_mfdataset('5010.nc')
peak memory: 7419.99 MiB, increment: 7315.36 MiB
Wall time: 8min 33s

xr.open_mfdataset('5010.nc', parallel=True)
peak memory: 8249.75 MiB, increment: 8112.07 MiB
Wall time: 4min 48s
```

As you can see, the amount of memory used for this operation is significant, and I won't be able to do this on many more files. When using the parallel option, the loading of the files takes only a few seconds (judging from what the Dask dashboard shows), and I'm guessing the rest of the time is spent in the "auto_combine".
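One way to check that guess (a sketch; the glob is a placeholder) would be to separate the open step from the combine step and time the second one on its own:

```python
import glob
import xarray as xr

paths = sorted(glob.glob("1002/*.nc"))                      # placeholder glob
datasets = [xr.open_dataset(p, chunks={}) for p in paths]   # lazy, dask-backed opens
combined = xr.auto_combine(datasets, concat_dim="time")     # time this step on its own
```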

So I'm wondering if I'm doing something wrong, if there is another way to load the data, or if I cannot use xarray directly for this quantity of data and have to use Dask directly.
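For completeness, the kind of lower-overhead call I am hoping for would look like the sketch below; it assumes a newer xarray release where combine='nested' and compat='override' are available (they are not in the 0.10.9 build reported below), and the glob is a placeholder:

```python
import xarray as xr

ds = xr.open_mfdataset(
    "5010/*.nc",            # placeholder glob
    combine="nested",       # files are already ordered; skip coordinate-based inference
    concat_dim="time",      # only the time dimension differs between files
    data_vars="minimal",    # only concatenate variables that contain the 'time' dimension
    coords="minimal",
    compat="override",      # take non-concatenated variables from the first file
    parallel=True,          # open files inside dask tasks instead of on the client
)
```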

Thanks in advance.

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-34-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

xarray: 0.10.9+32.g9f4474d.dirty
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: 0.6.2
h5py: 2.8.0
Nio: None
zarr: 2.2.0
cftime: 1.0.1
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: None
cyordereddict: None
dask: 0.19.4
distributed: 1.23.3
matplotlib: 3.0.0
cartopy: None
seaborn: None
setuptools: 40.4.3
pip: 18.1
conda: None
pytest: 3.9.1
IPython: 7.0.1
sphinx: None
