github: issues: 4 rows where comments = 11, repo = 13221727 and user = 1197350 sorted by updated

4 rows where comments = 11, repo = 13221727 and user = 1197350 sorted by updated_at descending

Search:

descending

id	node_id	number	title	user	state	comments	created_at	updated_at ▲	closed_at	author_association	draft	pull_request	body	reactions	state_reason	repo	type
218260909	MDU6SXNzdWUyMTgyNjA5MDk=	1340	round-trip performance with save_mfdataset / open_mfdataset	rabernat 1197350	closed	11	2017-03-30T16:52:26Z	2019-05-01T22:12:06Z	2019-05-01T22:12:06Z	MEMBER			I have encountered some major performance bottlenecks in trying to write and then read multi-file netcdf datasets. I start with an xarray dataset created by xgcm with the following repr: <xarray.Dataset> Dimensions: (XC: 400, XG: 400, YC: 400, YG: 400, Z: 40, Zl: 40, Zp1: 41, Zu: 40, layer_1TH_bounds: 43, layer_1TH_center: 42, layer_1TH_interface: 41, time: 1566) Coordinates: iter (time) int64 8294400 8294976 8295552 8296128 ... * time (time) int64 8294400 8294976 8295552 8296128 ... * XC (XC) >f4 2500.0 7500.0 12500.0 17500.0 22500.0 ... * YG (YG) >f4 0.0 5000.0 10000.0 15000.0 20000.0 25000.0 ... * XG (XG) >f4 0.0 5000.0 10000.0 15000.0 20000.0 25000.0 ... * YC (YC) >f4 2500.0 7500.0 12500.0 17500.0 22500.0 ... * Zu (Zu) >f4 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 -91.0 ... * Zl (Zl) >f4 0.0 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 ... * Zp1 (Zp1) >f4 0.0 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 ... * Z (Z) >f4 -5.0 -15.0 -25.0 -36.0 -49.0 -64.0 -81.5 ... rAz (YG, XG) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ... dyC (YG, XC) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ... rAw (YC, XG) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ... dxC (YC, XG) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ... dxG (YG, XC) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ... dyG (YC, XG) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ... rAs (YG, XC) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ... Depth (YC, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... rA (YC, XC) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ... PHrefF (Zp1) >f4 0.0 98.1 196.2 294.3 412.02 549.36 706.32 ... PHrefC (Z) >f4 49.05 147.15 245.25 353.16 480.69 627.84 ... drC (Zp1) >f4 5.0 10.0 10.0 11.0 13.0 15.0 17.5 20.5 ... drF (Z) >f4 10.0 10.0 10.0 12.0 14.0 16.0 19.0 22.0 ... hFacC (Z, YC, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... hFacW (Z, YC, XG) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... hFacS (Z, YG, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... * layer_1TH_bounds (layer_1TH_bounds) >f4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 ... * layer_1TH_interface (layer_1TH_interface) >f4 0.0 0.2 0.4 0.6 0.8 1.0 ... * layer_1TH_center (layer_1TH_center) float32 -0.1 0.1 0.3 0.5 0.7 0.9 ... Data variables: T (time, Z, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... U (time, Z, YC, XG) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... V (time, Z, YG, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... S (time, Z, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ... Eta (time, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... W (time, Zl, YC, XC) float32 -0.0 -0.0 -0.0 -0.0 -0.0 ... An important point to note is that there are lots of "non-dimension coordinates" corresponding to various parameters of the numerical grid. I save this dataset to a multi-file netCDF dataset as follows: `python iternums, datasets = zip(ds.groupby('time')) paths = [outdir + 'xmitgcm_data.%010d.nc' % it for it in iternums] xr.save_mfdataset(datasets, paths)` This takes many hours to run, since it has to read and write all the data. (I think there are some performance issues here too, related to how dask schedules the read / write tasks, but that is probably a separate issue.) Then I try to re-load this dataset `python ds_nc = xr.open_mfdataset('xmitgcm_data..nc')` This raises an error: `ValueError: too many different dimensions to concatenate: {'YG', 'Z', 'Zl', 'Zp1', 'layer_1TH_interface', 'YC', 'XC', 'layer_1TH_center', 'Zu', 'layer_1TH_bounds', 'XG'}` I need to specify `concat_dim='time'` in order to properly concatenate the data. It seems like this should be unnecessary, since I am reading back data that was just written with xarray, but I understand why (the dimensions of the Data Variables in each file are just Z, YC, XC, with no time dimension). Once I do that, it works, but it takes 18 minutes to load the dataset. I assume this is because it has to check the compatibility of all all the non-dimension coordinates. I just thought I would document this, because 18 minutes seems way too long to load a dataset.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1340/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
280626621	MDU6SXNzdWUyODA2MjY2MjE=	1770	slow performance when storing datasets in gcsfs-backed zarr stores	rabernat 1197350	closed	11	2017-12-08T21:46:32Z	2019-01-13T03:52:46Z	2019-01-13T03:52:46Z	MEMBER			We are working on integrating zarr with xarray. In the process, we have encountered a performance issue that I am documenting here. At this point, it is not clear if the core issue is in zarr, gcsfs, dask, or xarray. I originally started posting this in zarr, but in the process, I became more convinced the issue was with xarray. Dask Only Here is an example using only dask and zarr. ```python connect to a local dask scheduler from dask.distributed import Client client = Client('tcp://129.236.20.45:8786') create a big dask array import dask.array as dsa shape = (30, 50, 1080, 2160) chunkshape = (1, 1, 1080, 2160) ar = dsa.random.random(shape, chunks=chunkshape) connect to gcs and create MutableMapping import gcsfs fs = gcsfs.GCSFileSystem(project='pangeo-181919') gcsmap = gcsfs.mapping.GCSMap('pangeo-data/test999', gcs=fs, check=True, create=True) create a zarr array to store into import zarr za = zarr.create(ar.shape, chunks=chunkshape, dtype=ar.dtype, store=gcsmap) write it ar.store(za, lock=False) ``` When you do this, it spends a long time serializing stuff before the computation starts. For a more fine-grained look at the process, one can instead do `python delayed_obj = a.store(za, compute=False, lock=False) %prun future = client.compute(dobj)` This reveals that the pre-compute step takes about 10s. Monitoring the distributed scheduler, I can see that, once the computation starts, it takes about 1:30 to store the array (27 GB). (This is actually not bad!) Some debugging by @mrocklin revealed the following step is quite slow `python import cloudpickle %time len(cloudpickle.dumps(za))` On my system, this was taking close to 1s. On contrast, when the `store` passed to `gcsmap` is not a `GCSMap` but instead a path, it is in the microsecond territory. So pickling `GCSMap` objects is relatively slow. I'm not sure whether this pickling happens when we call `client.compute` or during the task execution. There is room for improvement here, but overall, zarr + gcsfs + dask seem to integrate well and give decent performance. Xarray This get much worse once xarray enters the picture. (Note that this example requires the xarray PR pydata/xarray#1528, which has not been merged yet.) ```python wrap the dask array in an xarray import xarray as xr import numpy as np ds = xr.DataArray(ar, dims=['time', 'depth', 'lat', 'lon'], coords={'lat': np.linspace(-90, 90, Ny), 'lon': np.linspace(0, 360, Nx)}).to_dataset(name='temperature') store to a different bucket gcsmap = gcsfs.mapping.GCSMap('pangeo-data/test1', gcs=fs, check=True, create=True) ds.to_zarr(store=gcsmap, mode='w') ``` Now the store step takes 18 minutes. Most of this time, is upfront, during which there is little CPU activity and no network activity. After about 15 minutes or so, it finally starts computing, at which point the writes to gcs proceed more-or-less at the same rate as with the dask-only example. Profiling the `to_zarr` with snakeviz reveals that it is spending most of its time waiting for thread locks. I don't understand this, since I specifically eliminated locks when storing the zarr arrays.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1770/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue
287569331	MDExOlB1bGxSZXF1ZXN0MTYyMjI0MTg2	1817	fix rasterio chunking with s3 datasets	rabernat 1197350	closed	11	2018-01-10T20:37:45Z	2018-01-24T09:33:07Z	2018-01-23T16:33:28Z	MEMBER	0	pydata/xarray/pulls/1817	[x] Closes #1816 (remove if there is no corresponding issue, which should only be the case for minor changes) [x] Tests added (for all bug fixes or enhancements) [x] Tests passed (for all non-documentation changes) [x] Passes `git diff upstream/master */py \| flake8 --diff` (remove if you did not edit any Python files) [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later) This is a simple fix for token generation of non-filename targets for rasterio. The problem is that I have no idea how to test it without actually hitting s3 (which requires boto and aws credentials).	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1817/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
229474101	MDExOlB1bGxSZXF1ZXN0MTIxMTQyODkw	1413	concat prealigned objects	rabernat 1197350	closed	11	2017-05-17T20:16:00Z	2017-07-17T21:53:53Z	2017-07-17T21:53:40Z	MEMBER	0	pydata/xarray/pulls/1413	[x] Closes #1385 [ ] Tests added / passed [ ] Passes `git diff upstream/master \| flake8 --diff` [ ] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API This is an initial PR to bypass index alignment and coordinate checking when concatenating datasets.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/1413/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull

Advanced export

JSON shape: default, array, newline-delimited, object

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);

issues

4 rows where comments = 11, repo = 13221727 and user = 1197350 sorted by updated_at descending

Dask Only

connect to a local dask scheduler

create a big dask array

connect to gcs and create MutableMapping

create a zarr array to store into

write it

Xarray

wrap the dask array in an xarray

store to a different bucket

Advanced export