id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
372848074,MDU6SXNzdWUzNzI4NDgwNzQ=,2501,open_mfdataset usage and limitations.,1492047,closed,0,,,22,2018-10-23T07:31:42Z,2021-01-27T18:06:16Z,2021-01-27T18:06:16Z,CONTRIBUTOR,,,,"I'm trying to understand and use the open_mfdataset function to open a huge number of files.
I thought this function would be quite similar to dask.dataframe.from_delayed and would let me ""load"" and work on an amount of data limited only by the number of Dask workers (or ""unlimited"", considering it could be ""lazily loaded"").
But my tests showed something quite different.
It seems xarray requires the index to be copied back to the Dask client in order to ""auto_combine"" the data.
Running some tests on a small portion of my data, I see something like this.
Each file has these dimensions: time: ~2871, xx_ind: 40, yy_ind: 128.
The concatenation of these files is done along the time dimension, and my understanding is that only the time coordinate is loaded and brought back to the client (the other dimensions are constant).
Parallel tests are run with 200 Dask workers.
```python
# =================== Loading 1002 files ===================
xr.open_mfdataset('*1002*.nc')
# peak memory: 1660.59 MiB, increment: 1536.25 MiB
# Wall time: 1min 29s
xr.open_mfdataset('*1002*.nc', parallel=True)
# peak memory: 1745.14 MiB, increment: 1602.43 MiB
# Wall time: 53 s
# =================== Loading 5010 files ===================
xr.open_mfdataset('*5010*.nc')
# peak memory: 7419.99 MiB, increment: 7315.36 MiB
# Wall time: 8min 33s
xr.open_mfdataset('*5010*.nc', parallel=True)
# peak memory: 8249.75 MiB, increment: 8112.07 MiB
# Wall time: 4min 48s
```
As you can see, the amount of memory used for this operation is significant, and I won't be able to do this with many more files.
When using the parallel option, the loading of the files takes only a few seconds (judging from what the Dask dashboard shows), and I'm guessing the rest of the time is spent in the ""auto_combine"".
So I'm wondering whether I'm doing something wrong, whether there is another way to load the data, or whether I cannot use xarray directly for this quantity of data and have to use Dask directly.
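For reference, a minimal sketch of the loading pattern described above; the scheduler address and file glob are placeholders:
```python
# Sketch of the workflow above; the scheduler address is a placeholder.
from dask.distributed import Client
import xarray as xr

client = Client('tcp://scheduler:8786')  # cluster with ~200 workers

# Each file holds (time: ~2871, xx_ind: 40, yy_ind: 128); files are
# concatenated along the shared time dimension.
ds = xr.open_mfdataset('*1002*.nc', parallel=True)
```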
Thanks in advance.
#### Output of ``xr.show_versions()``
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-34-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
xarray: 0.10.9+32.g9f4474d.dirty
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: 0.6.2
h5py: 2.8.0
Nio: None
zarr: 2.2.0
cftime: 1.0.1
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: None
cyordereddict: None
dask: 0.19.4
distributed: 1.23.3
matplotlib: 3.0.0
cartopy: None
seaborn: None
setuptools: 40.4.3
pip: 18.1
conda: None
pytest: 3.9.1
IPython: 7.0.1
sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2501/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
676696822,MDExOlB1bGxSZXF1ZXN0NDY1OTYxOTIw,4333,Support explicitly setting a dimension order with to_dataframe(),1492047,closed,0,,,11,2020-08-11T08:46:45Z,2020-08-19T20:37:38Z,2020-08-14T18:28:26Z,CONTRIBUTOR,,0,pydata/xarray/pulls/4333,"
- [x] Closes #4331
- [x] Tests added
- [x] Passes `isort . && black . && mypy . && flake8`
- [x] User visible changes (including notable bug fixes) are documented in `whats-new.rst`
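A short usage sketch of the change (the data here is illustrative, and this assumes the keyword is named ``dim_order`` as in the merged change):
```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'foo': (('y', 'x'), np.random.randn(3, 2))})
# Request an explicit dimension order for the resulting MultiIndex
# instead of the default alphabetically sorted order.
df = ds.to_dataframe(dim_order=['y', 'x'])
```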
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4333/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
347895055,MDU6SXNzdWUzNDc4OTUwNTU=,2346,Dataset/DataArray to_dataframe() dimensions order mismatch.,1492047,closed,0,,,4,2018-08-06T12:03:00Z,2020-08-10T17:45:43Z,2020-08-08T07:10:28Z,CONTRIBUTOR,,,,"#### Code Sample
```python
import xarray as xr
import numpy as np
data = xr.DataArray(np.random.randn(3, 2), coords={'x': ['a', 'b']}, dims=('y', 'x'))
ds = xr.Dataset({'foo': data})
# Applied to the Dataset
ds.to_dataframe()
# foo
#x y
#a 0 0.348519
# 1 -0.322634
# 2 -0.683181
#b 0 0.197501
# 1 0.504810
# 2 -1.871626
# Applied to the DataArray
ds['foo'].to_dataframe()
# foo
#y x
#0 a 0.348519
# b 0.197501
#1 a -0.322634
# b 0.504810
#2 a -0.683181
# b -1.871626
```
#### Problem description
The **to_dataframe** method applied to a DataArray respects the dimension order, whereas the same method applied to a Dataset uses an alphabetically sorted order.
In both cases **to_dataframe** calls **_to_dataframe()** with the dimensions as an argument.
The DataArray passes an **OrderedDict**, but the Dataset passes **self.dims** (which is a **SortedKeysDict**).
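A possible interim workaround (a sketch, not part of the report): reorder the MultiIndex levels of the Dataset result on the pandas side so it matches the DataArray's dimension order.
```python
# Workaround sketch: rebuild the index order with pandas.
df = ds.to_dataframe().reorder_levels(['y', 'x']).sort_index()
```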
#### Output of ``xr.show_versions()``
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-23-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
xarray: 0.10.8
pandas: 0.23.4
numpy: 1.14.5
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: None
h5py: 2.8.0
Nio: None
zarr: 2.2.0
bottleneck: None
cyordereddict: None
dask: 0.18.2
distributed: 1.22.1
matplotlib: 2.2.2
cartopy: None
seaborn: None
setuptools: 40.0.0
pip: 18.0
conda: None
pytest: 3.7.1
IPython: 6.5.0
sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2346/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue