issues: 280626621
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
280626621 | MDU6SXNzdWUyODA2MjY2MjE= | 1770 | slow performance when storing datasets in gcsfs-backed zarr stores | 1197350 | closed | 0 |  |  | 11 | 2017-12-08T21:46:32Z | 2019-01-13T03:52:46Z | 2019-01-13T03:52:46Z | MEMBER |  |  |  | We are working on integrating zarr with xarray. In the process, we have encountered a performance issue that I am documenting here. At this point, it is not clear whether the core issue is in zarr, gcsfs, dask, or xarray. I originally started posting this in zarr, but in the process I became more convinced that the issue is with xarray.

## Dask Only

Here is an example using only dask and zarr.

```python
# connect to a local dask scheduler
from dask.distributed import Client
client = Client('tcp://129.236.20.45:8786')

# create a big dask array
import dask.array as dsa
shape = (30, 50, 1080, 2160)
chunkshape = (1, 1, 1080, 2160)
ar = dsa.random.random(shape, chunks=chunkshape)

# connect to gcs and create a MutableMapping
import gcsfs
fs = gcsfs.GCSFileSystem(project='pangeo-181919')
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/test999', gcs=fs,
                              check=True, create=True)

# create a zarr array to store into
import zarr
za = zarr.create(ar.shape, chunks=chunkshape, dtype=ar.dtype, store=gcsmap)

# write it
ar.store(za, lock=False)
```

When you do this, it spends a long time serializing stuff before the computation starts. For a more fine-grained look at the process, one can instead do
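The excerpt cuts off before showing the snippet. A plausible reconstruction (my sketch, not the issue's original code) uses dask's `store(..., compute=False)`, which returns a delayed object, so graph construction and execution can be timed separately. To keep it runnable standalone, this version writes into a local numpy array rather than the gcsfs-backed zarr store:

```python
import time

import dask.array as dsa
import numpy as np

# small stand-in for the big array in the issue
ar = dsa.random.random((4, 4, 8, 8), chunks=(1, 1, 8, 8))
target = np.zeros(ar.shape)

# build the task graph only; nothing is computed yet
delayed_store = ar.store(target, lock=False, compute=False)

# time execution separately from graph construction,
# isolating scheduler/serialization overhead from the writes
t0 = time.time()
delayed_store.compute()
print('store took %.2f s' % (time.time() - t0))
```

With the real gcsfs store, the gap between building `delayed_store` and the first network activity after `.compute()` is where the serialization time shows up.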
Some debugging by @mrocklin revealed the following step is quite slow
There is room for improvement here, but overall, zarr + gcsfs + dask integrate well and give decent performance.

## Xarray

This gets much worse once xarray enters the picture. (Note that this example requires the xarray PR pydata/xarray#1528, which has not been merged yet.)

```python
# wrap the dask array in an xarray Dataset
import xarray as xr
import numpy as np

# Ny and Nx are not defined in this excerpt; the values here are
# inferred from the array shape (1080 lat points, 2160 lon points)
Ny, Nx = 1080, 2160

ds = xr.DataArray(ar, dims=['time', 'depth', 'lat', 'lon'],
                  coords={'lat': np.linspace(-90, 90, Ny),
                          'lon': np.linspace(0, 360, Nx)}
                  ).to_dataset(name='temperature')

# store to a different bucket
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/test1', gcs=fs,
                              check=True, create=True)
ds.to_zarr(store=gcsmap, mode='w')
```

Now the store step takes 18 minutes. Most of this time is spent upfront, during which there is little CPU activity and no network activity. After about 15 minutes, it finally starts computing, at which point the writes to gcs proceed more or less at the same rate as in the dask-only example. I don't understand this, since I specifically eliminated locks when storing the zarr arrays. |
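The long idle period before computation suggests overhead in building or serializing the task graph rather than in the writes themselves. One way to probe this (my sketch, using toy-sized arrays rather than the issue's) is to compare the number of tasks in the bare dask graph with the graph xarray produces for the same data:

```python
import dask.array as dsa
import xarray as xr

ar = dsa.random.random((4, 4, 8, 8), chunks=(1, 1, 8, 8))
ds = xr.DataArray(ar, dims=['time', 'depth', 'lat', 'lon']
                  ).to_dataset(name='temperature')

# __dask_graph__() exposes the task graph of any dask collection
n_dask = len(dict(ar.__dask_graph__()))
n_xr = len(dict(ds['temperature'].data.__dask_graph__()))
print('dask tasks: %d, xarray tasks: %d' % (n_dask, n_xr))
```

A substantially larger or more expensive-to-serialize xarray graph would account for the extra upfront time before the scheduler starts executing.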
{ "url": "https://api.github.com/repos/pydata/xarray/issues/1770/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
 | completed | 13221727 | issue |