issues
4 rows where comments = 11, repo = 13221727, and user = 1197350, sorted by updated_at descending
| id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at ▲ | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 218260909 | MDU6SXNzdWUyMTgyNjA5MDk= | 1340 | round-trip performance with save_mfdataset / open_mfdataset | rabernat 1197350 | closed | 0 |  |  | 11 | 2017-03-30T16:52:26Z | 2019-05-01T22:12:06Z | 2019-05-01T22:12:06Z | MEMBER |  |  |  | I have encountered some major performance bottlenecks in trying to write and then read multi-file netCDF datasets. I start with an xarray dataset created by xgcm with the following repr:
An important point to note is that there are lots of "non-dimension coordinates" corresponding to various parameters of the numerical grid. I save this dataset to a multi-file netCDF dataset as follows:
Then I try to re-load this dataset
This raises an error:
I need to specify … I just thought I would document this, because 18 minutes seems way too long to load a dataset. (A generic sketch of this save/load round trip follows this record.) |
{
"url": "https://api.github.com/repos/pydata/xarray/issues/1340/reactions",
"total_count": 0,
"+1": 0,
"-1": 0,
"laugh": 0,
"hooray": 0,
"confused": 0,
"heart": 0,
"rocket": 0,
"eyes": 0
} |
 | completed | xarray 13221727 | issue |
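The save/load snippets from this issue body are not preserved in the export above. Purely as a hedged illustration (splitting along the time dimension, the file names, and the `ds` variable are assumptions, not code from the issue), the round trip being described looks roughly like this:

```python
# Hypothetical sketch of a save_mfdataset / open_mfdataset round trip;
# not the code from the issue. Assumes `ds` is an existing xarray.Dataset
# with a "time" dimension and many non-dimension coordinates.
import xarray as xr

# split the dataset into one piece per year and write one netCDF file per piece
years, datasets = zip(*ds.groupby('time.year'))
paths = ['out_%s.nc' % y for y in years]
xr.save_mfdataset(datasets, paths)

# re-open the pieces as a single dataset; the issue reports that this step
# becomes very slow, presumably because every non-dimension coordinate is
# compared across all of the files
ds_roundtrip = xr.open_mfdataset('out_*.nc')
```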
| 280626621 | MDU6SXNzdWUyODA2MjY2MjE= | 1770 | slow performance when storing datasets in gcsfs-backed zarr stores | rabernat 1197350 | closed | 0 |  |  | 11 | 2017-12-08T21:46:32Z | 2019-01-13T03:52:46Z | 2019-01-13T03:52:46Z | MEMBER |  |  |  | We are working on integrating zarr with xarray. In the process, we have encountered a performance issue that I am documenting here. At this point, it is not clear if the core issue is in zarr, gcsfs, dask, or xarray. I originally started posting this in zarr, but in the process, I became more convinced the issue was with xarray.

**Dask Only**

Here is an example using only dask and zarr.

```python
# connect to a local dask scheduler
from dask.distributed import Client
client = Client('tcp://129.236.20.45:8786')

# create a big dask array
import dask.array as dsa
shape = (30, 50, 1080, 2160)
chunkshape = (1, 1, 1080, 2160)
ar = dsa.random.random(shape, chunks=chunkshape)

# connect to gcs and create MutableMapping
import gcsfs
fs = gcsfs.GCSFileSystem(project='pangeo-181919')
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/test999', gcs=fs, check=True, create=True)

# create a zarr array to store into
import zarr
za = zarr.create(ar.shape, chunks=chunkshape, dtype=ar.dtype, store=gcsmap)

# write it
ar.store(za, lock=False)
```

When you do this, it spends a long time serializing stuff before the computation starts. For a more fine-grained look at the process, one can instead do …
Some debugging by @mrocklin revealed that the following step is quite slow (the snippet itself is elided here; an illustrative timing check, not the original code, follows this record):
There is room for improvement here, but overall, zarr + gcsfs + dask seem to integrate well and give decent performance.

**Xarray**

This gets much worse once xarray enters the picture. (Note that this example requires the xarray PR pydata/xarray#1528, which has not been merged yet.)

```python
# wrap the dask array in an xarray
import xarray as xr
import numpy as np
ds = xr.DataArray(ar, dims=['time', 'depth', 'lat', 'lon'],
                  coords={'lat': np.linspace(-90, 90, Ny),
                          'lon': np.linspace(0, 360, Nx)}).to_dataset(name='temperature')

# store to a different bucket
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/test1', gcs=fs, check=True, create=True)
ds.to_zarr(store=gcsmap, mode='w')
```

Now the store step takes 18 minutes. Most of this time is upfront, during which there is little CPU activity and no network activity. After about 15 minutes or so, it finally starts computing, at which point the writes to gcs proceed more-or-less at the same rate as with the dask-only example. Profiling the …
I don't understand this, since I specifically eliminated locks when storing the zarr arrays. |
{
"url": "https://api.github.com/repos/pydata/xarray/issues/1770/reactions",
"total_count": 0,
"+1": 0,
"-1": 0,
"laugh": 0,
"hooray": 0,
"confused": 0,
"heart": 0,
"rocket": 0,
"eyes": 0
} |
 | completed | xarray 13221727 | issue |
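The snippet @mrocklin timed is elided in the body above. The following is only an illustrative check (not the original code) for gauging how expensive it is to serialize the GCS-backed store target before any computation starts:

```python
# Illustrative only -- not the snippet from the issue. Times how long pickling
# the zarr target takes; `za` is the zarr array backed by the GCSMap created
# in the dask-only example above.
import pickle
import time

t0 = time.time()
payload = pickle.dumps(za)
print('pickled zarr target in %.2f s (%d bytes)' % (time.time() - t0, len(payload)))
```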
| 287569331 | MDExOlB1bGxSZXF1ZXN0MTYyMjI0MTg2 | 1817 | fix rasterio chunking with s3 datasets | rabernat 1197350 | closed | 0 |  |  | 11 | 2018-01-10T20:37:45Z | 2018-01-24T09:33:07Z | 2018-01-23T16:33:28Z | MEMBER |  | 0 | pydata/xarray/pulls/1817 |
This is a simple fix for token generation of non-filename targets for rasterio. The problem is that I have no idea how to test it without actually hitting S3 (which requires boto and AWS credentials). (A hedged sketch of the general idea follows this record.) |
{
"url": "https://api.github.com/repos/pydata/xarray/issues/1817/reactions",
"total_count": 0,
"+1": 0,
"-1": 0,
"laugh": 0,
"hooray": 0,
"confused": 0,
"heart": 0,
"rocket": 0,
"eyes": 0
} |
 |  | xarray 13221727 | pull |
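The diff itself is not shown in this export. As a rough sketch only, with an assumed helper name and fallback behavior that may not match the actual change in this PR, generating a dask token for a target that is not a local filename could look like this:

```python
# Hypothetical sketch; make_rasterio_token is an illustrative helper, not an
# xarray function. Local paths contribute their mtime to the token, while
# non-filename targets such as s3:// URLs fall back to tokenizing the URI alone.
import os
from dask.base import tokenize


def make_rasterio_token(filename):
    try:
        mtime = os.path.getmtime(filename)  # works for local files
    except OSError:
        mtime = None  # s3:// and other non-filename targets have no local mtime
    return tokenize(filename, mtime)
```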
| 229474101 | MDExOlB1bGxSZXF1ZXN0MTIxMTQyODkw | 1413 | concat prealigned objects | rabernat 1197350 | closed | 0 |  |  | 11 | 2017-05-17T20:16:00Z | 2017-07-17T21:53:53Z | 2017-07-17T21:53:40Z | MEMBER |  | 0 | pydata/xarray/pulls/1413 |
This is an initial PR to bypass index alignment and coordinate checking when concatenating datasets. (A usage sketch with the concat options that later exposed this kind of behavior follows this record.) |
{
"url": "https://api.github.com/repos/pydata/xarray/issues/1413/reactions",
"total_count": 0,
"+1": 0,
"-1": 0,
"laugh": 0,
"hooray": 0,
"confused": 0,
"heart": 0,
"rocket": 0,
"eyes": 0
} |
 |  | xarray 13221727 | pull |
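The PR body above does not include a usage example. For context, later xarray releases expose related behavior directly through xr.concat options; a minimal sketch, assuming `datasets` is a list of already-aligned datasets and an xarray version recent enough to accept compat='override' and join='override':

```python
# Minimal sketch, assuming `datasets` are already aligned along "time".
# compat='override' and join='override' skip most equality and alignment
# checks (the spirit of this PR), and coords='minimal' avoids concatenating
# non-dimension coordinates that are identical across inputs.
import xarray as xr

combined = xr.concat(
    datasets,
    dim='time',
    coords='minimal',
    compat='override',
    join='override',
)
```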
CREATE TABLE [issues] (
[id] INTEGER PRIMARY KEY,
[node_id] TEXT,
[number] INTEGER,
[title] TEXT,
[user] INTEGER REFERENCES [users]([id]),
[state] TEXT,
[locked] INTEGER,
[assignee] INTEGER REFERENCES [users]([id]),
[milestone] INTEGER REFERENCES [milestones]([id]),
[comments] INTEGER,
[created_at] TEXT,
[updated_at] TEXT,
[closed_at] TEXT,
[author_association] TEXT,
[active_lock_reason] TEXT,
[draft] INTEGER,
[pull_request] TEXT,
[body] TEXT,
[reactions] TEXT,
[performed_via_github_app] TEXT,
[state_reason] TEXT,
[repo] INTEGER REFERENCES [repos]([id]),
[type] TEXT
);
CREATE INDEX [idx_issues_repo]
ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
ON [issues] ([user]);
