
issues


3 rows where state = "closed" and user = 1492047 sorted by updated_at descending

id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
372848074 MDU6SXNzdWUzNzI4NDgwNzQ= 2501 open_mfdataset usage and limitations. Thomas-Z 1492047 closed 0     22 2018-10-23T07:31:42Z 2021-01-27T18:06:16Z 2021-01-27T18:06:16Z CONTRIBUTOR      

I'm trying to understand and use the open_mfdataset function to open a huge number of files. I thought this function would be quite similar to dask.dataframe.from_delayed and would let me "load" and work on an amount of data limited only by the number of Dask workers (or "unlimited", considering it could be lazily loaded).

But my tests showed something quite different. It seems xarray requires the index to be copied back to the Dask client in order to "auto_combine" the data.

Running some tests on a small portion of my data, I get something like the following.

Each file has these dimensions: time: ~2871, xx_ind: 40, yy_ind: 128. The files are concatenated along the time dimension, and my understanding is that only the time coordinate is loaded and brought back to the client (the other dimensions are constant).

Parallel tests are run with 200 Dask workers.

```
=================== Loading 1002 files ===================

xr.open_mfdataset('1002.nc')
peak memory: 1660.59 MiB, increment: 1536.25 MiB
Wall time: 1min 29s

xr.open_mfdataset('1002.nc', parallel=True)
peak memory: 1745.14 MiB, increment: 1602.43 MiB
Wall time: 53 s

=================== Loading 5010 files ===================

xr.open_mfdataset('5010.nc')
peak memory: 7419.99 MiB, increment: 7315.36 MiB
Wall time: 8min 33s

xr.open_mfdataset('5010.nc', parallel=True)
peak memory: 8249.75 MiB, increment: 8112.07 MiB
Wall time: 4min 48s
```

As you can see, the amount of memory used for this operation is significant, and I won't be able to do this on many more files. When using the parallel option, loading the files takes only a few seconds (judging from what the Dask dashboard shows), and I'm guessing the rest of the time is spent in the "auto_combine".

So I'm wondering whether I'm doing something wrong, whether there is another way to load the data, or whether I cannot use xarray directly for this quantity of data and have to use Dask directly.
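For reference, the concatenation step that appears to dominate here can be sketched with small in-memory stand-ins for the per-file datasets; the `make_file_like` helper and the tiny sizes below are illustrative, not taken from the report:

```python
import numpy as np
import xarray as xr

# Small in-memory stand-ins for the per-file datasets described above
# (time varies per file; xx_ind and yy_ind stay constant).
def make_file_like(t0, nt=4):
    time = np.arange(t0, t0 + nt)
    data = np.zeros((nt, 2, 3))
    return xr.Dataset(
        {"var": (("time", "xx_ind", "yy_ind"), data)},
        coords={"time": time},
    )

parts = [make_file_like(0), make_file_like(4)]

# Concatenating along time is essentially what open_mfdataset does after
# opening each file; aligning the parts requires the time coordinate of
# every file on the client, which is where the memory cost discussed
# above comes from.
combined = xr.concat(parts, dim="time")
print(dict(combined.sizes))  # {'time': 8, 'xx_ind': 2, 'yy_ind': 3}
```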

Thanks in advance.

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-34-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
xarray: 0.10.9+32.g9f4474d.dirty
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: 0.6.2
h5py: 2.8.0
Nio: None
zarr: 2.2.0
cftime: 1.0.1
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: None
cyordereddict: None
dask: 0.19.4
distributed: 1.23.3
matplotlib: 3.0.0
cartopy: None
seaborn: None
setuptools: 40.4.3
pip: 18.1
conda: None
pytest: 3.9.1
IPython: 7.0.1
sphinx: None
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2501/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
676696822 MDExOlB1bGxSZXF1ZXN0NDY1OTYxOTIw 4333 Support explicitly setting a dimension order with to_dataframe() Thomas-Z 1492047 closed 0     11 2020-08-11T08:46:45Z 2020-08-19T20:37:38Z 2020-08-14T18:28:26Z CONTRIBUTOR   0 pydata/xarray/pulls/4333
  • [x] Closes #4331
  • [x] Tests added
  • [x] Passes isort . && black . && mypy . && flake8
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
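A minimal sketch of the behaviour this PR enables, assuming the `dim_order` keyword it adds to `to_dataframe()`; the dataset below is illustrative:

```python
import numpy as np
import xarray as xr

# Illustrative dataset; 'foo' has dims ('y', 'x'), an order that the
# default (alphabetically sorted) Dataset.to_dataframe() would not keep.
ds = xr.Dataset(
    {"foo": (("y", "x"), np.arange(6).reshape(2, 3))},
    coords={"x": ["a", "b", "c"]},
)

# With dim_order, the MultiIndex levels follow the requested dimension
# order instead of the sorted one.
df = ds.to_dataframe(dim_order=["y", "x"])
print(list(df.index.names))  # ['y', 'x']
```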
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4333/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
347895055 MDU6SXNzdWUzNDc4OTUwNTU= 2346 Dataset/DataArray to_dataframe() dimensions order mismatch. Thomas-Z 1492047 closed 0     4 2018-08-06T12:03:00Z 2020-08-10T17:45:43Z 2020-08-08T07:10:28Z CONTRIBUTOR      

Code Sample

```python
import xarray as xr
import numpy as np

data = xr.DataArray(np.random.randn(2, 3),
                    coords={'x': ['a', 'b']}, dims=('y', 'x'))
ds = xr.Dataset({'foo': data})

# Applied to the Dataset
ds.to_dataframe()
#           foo
# x y
# a 0  0.348519
#   1 -0.322634
#   2 -0.683181
# b 0  0.197501
#   1  0.504810
#   2 -1.871626

# Applied to the DataArray
ds['foo'].to_dataframe()
#           foo
# y x
# 0 a  0.348519
#   b  0.197501
# 1 a -0.322634
#   b  0.504810
# 2 a -0.683181
#   b -1.871626
```

Problem description

The to_dataframe method applied to a DataArray respects the dimension order, whereas the same method applied to a Dataset uses an alphabetically sorted order.

In both situations to_dataframe calls _to_dataframe() with an argument: the DataArray passes an OrderedDict, but the Dataset passes self.dims (which is a SortedKeysDict).
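Until the ordering is made consistent, the sorted Dataset output can be forced back into dimension order with plain pandas; a sketch, with the shapes adjusted from the sample above so the coordinates line up:

```python
import numpy as np
import xarray as xr

# Self-consistent variant of the sample above (shape transposed so the
# 'x' coordinate matches its dimension length). reorder_levels/sort_index
# is plain pandas, used here as a workaround to force the frame into
# dimension order whatever order to_dataframe happens to return.
data = xr.DataArray(np.random.randn(3, 2),
                    coords={"x": ["a", "b"]}, dims=("y", "x"))
ds = xr.Dataset({"foo": data})

df = ds.to_dataframe()
fixed = df.reorder_levels(list(data.dims)).sort_index()
print(list(fixed.index.names))  # ['y', 'x']
```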

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-23-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
xarray: 0.10.8
pandas: 0.23.4
numpy: 1.14.5
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: None
h5py: 2.8.0
Nio: None
zarr: 2.2.0
bottleneck: None
cyordereddict: None
dask: 0.18.2
distributed: 1.22.1
matplotlib: 2.2.2
cartopy: None
seaborn: None
setuptools: 40.0.0
pip: 18.0
conda: None
pytest: 3.7.1
IPython: 6.5.0
sphinx: None
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2346/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
Powered by Datasette · Queries took 25.183ms · About: xarray-datasette