
issues


3 rows where state = "closed" and user = 1492047 sorted by updated_at descending

id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
372848074 MDU6SXNzdWUzNzI4NDgwNzQ= 2501 open_mfdataset usage and limitations. Thomas-Z 1492047 closed 0     22 2018-10-23T07:31:42Z 2021-01-27T18:06:16Z 2021-01-27T18:06:16Z CONTRIBUTOR      

I'm trying to understand and use the open_mfdataset function to open a huge number of files. I thought this function would be quite similar to dask.dataframe.from_delayed and would let me "load" and work on an amount of data limited only by the number of Dask workers (or "unlimited", considering it could be lazily loaded).

But my tests showed something quite different. It seems xarray requires the index to be copied back to the Dask client in order to "auto_combine" the data.

Running some tests on a small portion of my data, I get something like the following.

Each file has these dimensions: time: ~2871, xx_ind: 40, yy_ind: 128. The files are concatenated along the time dimension, and my understanding is that only the time coordinate is loaded and brought back to the client (the other dimensions are constant).

Parallel tests are run with 200 Dask workers.

```
=================== Loading 1002 files ===================

xr.open_mfdataset('1002.nc')
peak memory: 1660.59 MiB, increment: 1536.25 MiB
Wall time: 1min 29s

xr.open_mfdataset('1002.nc', parallel=True)
peak memory: 1745.14 MiB, increment: 1602.43 MiB
Wall time: 53 s

=================== Loading 5010 files ===================

xr.open_mfdataset('5010.nc')
peak memory: 7419.99 MiB, increment: 7315.36 MiB
Wall time: 8min 33s

xr.open_mfdataset('5010.nc', parallel=True)
peak memory: 8249.75 MiB, increment: 8112.07 MiB
Wall time: 4min 48s
```

As you can see, the amount of memory used for this operation is significant, and I won't be able to do this on many more files. When using the parallel option, loading the files takes only a few seconds (judging from what the Dask dashboard shows), and I'm guessing the rest of the time is spent in the "auto_combine".

So I'm wondering whether I'm doing something wrong, whether there is another way to load the data, or whether I cannot use xarray directly for this quantity of data and have to use Dask directly.
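For reference, the concatenation step that appears to dominate here can be sketched with small in-memory stand-ins for the per-file datasets; the `make_file_like` helper and the tiny sizes below are illustrative, not taken from the report:

```python
import numpy as np
import xarray as xr

# Small in-memory stand-ins for the per-file datasets described above
# (time varies per file; xx_ind and yy_ind stay constant).
def make_file_like(t0, nt=4):
    time = np.arange(t0, t0 + nt)
    data = np.zeros((nt, 2, 3))
    return xr.Dataset(
        {"var": (("time", "xx_ind", "yy_ind"), data)},
        coords={"time": time},
    )

parts = [make_file_like(0), make_file_like(4)]

# Concatenating along time is essentially what open_mfdataset does after
# opening each file; aligning the parts requires the time coordinate of
# every file on the client, which is where the memory cost discussed
# above comes from.
combined = xr.concat(parts, dim="time")
print(dict(combined.sizes))  # {'time': 8, 'xx_ind': 2, 'yy_ind': 3}
```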

Thanks in advance.

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-34-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
xarray: 0.10.9+32.g9f4474d.dirty
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: 0.6.2
h5py: 2.8.0
Nio: None
zarr: 2.2.0
cftime: 1.0.1
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: None
cyordereddict: None
dask: 0.19.4
distributed: 1.23.3
matplotlib: 3.0.0
cartopy: None
seaborn: None
setuptools: 40.4.3
pip: 18.1
conda: None
pytest: 3.9.1
IPython: 7.0.1
sphinx: None
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2501/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
676696822 MDExOlB1bGxSZXF1ZXN0NDY1OTYxOTIw 4333 Support explicitly setting a dimension order with to_dataframe() Thomas-Z 1492047 closed 0     11 2020-08-11T08:46:45Z 2020-08-19T20:37:38Z 2020-08-14T18:28:26Z CONTRIBUTOR   0 pydata/xarray/pulls/4333
  • [x] Closes #4331
  • [x] Tests added
  • [x] Passes isort . && black . && mypy . && flake8
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
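A minimal sketch of the behaviour this PR enables, assuming the `dim_order` keyword it adds to `to_dataframe()`; the dataset below is illustrative:

```python
import numpy as np
import xarray as xr

# Illustrative dataset; 'foo' has dims ('y', 'x'), an order that the
# default (alphabetically sorted) Dataset.to_dataframe() would not keep.
ds = xr.Dataset(
    {"foo": (("y", "x"), np.arange(6).reshape(2, 3))},
    coords={"x": ["a", "b", "c"]},
)

# With dim_order, the MultiIndex levels follow the requested dimension
# order instead of the sorted one.
df = ds.to_dataframe(dim_order=["y", "x"])
print(list(df.index.names))  # ['y', 'x']
```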
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4333/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
347895055 MDU6SXNzdWUzNDc4OTUwNTU= 2346 Dataset/DataArray to_dataframe() dimensions order mismatch. Thomas-Z 1492047 closed 0     4 2018-08-06T12:03:00Z 2020-08-10T17:45:43Z 2020-08-08T07:10:28Z CONTRIBUTOR      

Code Sample

```python
import xarray as xr
import numpy as np

data = xr.DataArray(np.random.randn(2, 3),
                    coords={'x': ['a', 'b']}, dims=('y', 'x'))
ds = xr.Dataset({'foo': data})

# Applied to the Dataset
ds.to_dataframe()
#           foo
# x y
# a 0  0.348519
#   1 -0.322634
#   2 -0.683181
# b 0  0.197501
#   1  0.504810
#   2 -1.871626

# Applied to the DataArray
ds['foo'].to_dataframe()
#           foo
# y x
# 0 a  0.348519
#   b  0.197501
# 1 a -0.322634
#   b  0.504810
# 2 a -0.683181
#   b -1.871626
```

Problem description

The to_dataframe method applied to a DataArray respects the dimension order, whereas the same method applied to a Dataset uses an alphabetically sorted order.

In both situations to_dataframe calls _to_dataframe() with an argument: the DataArray passes an OrderedDict, but the Dataset passes self.dims (which is a SortedKeysDict).
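Until the ordering is made consistent, the sorted Dataset output can be forced back into dimension order with plain pandas; a sketch, with the shapes adjusted from the sample above so the coordinates line up:

```python
import numpy as np
import xarray as xr

# Self-consistent variant of the sample above (shape transposed so the
# 'x' coordinate matches its dimension length). reorder_levels/sort_index
# is plain pandas, used here as a workaround to force the frame into
# dimension order whatever order to_dataframe happens to return.
data = xr.DataArray(np.random.randn(3, 2),
                    coords={"x": ["a", "b"]}, dims=("y", "x"))
ds = xr.Dataset({"foo": data})

df = ds.to_dataframe()
fixed = df.reorder_levels(list(data.dims)).sort_index()
print(list(fixed.index.names))  # ['y', 'x']
```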

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-23-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
xarray: 0.10.8
pandas: 0.23.4
numpy: 1.14.5
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: None
h5py: 2.8.0
Nio: None
zarr: 2.2.0
bottleneck: None
cyordereddict: None
dask: 0.18.2
distributed: 1.22.1
matplotlib: 2.2.2
cartopy: None
seaborn: None
setuptools: 40.0.0
pip: 18.0
conda: None
pytest: 3.7.1
IPython: 6.5.0
sphinx: None
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2346/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
Powered by Datasette · Queries took 25.183ms · About: xarray-datasette