issues: 613012939
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
613012939 | MDExOlB1bGxSZXF1ZXN0NDEzODQ3NzU0 | 4035 | Support parallel writes to regions of zarr stores | 1217238 | closed | 0 | | | 17 | 2020-05-06T02:40:19Z | 2020-11-04T06:19:01Z | 2020-11-04T06:19:01Z | MEMBER | | 0 | pydata/xarray/pulls/4035 | (see body below) | (see reactions below) | | | 13221727 | pull |

body:

This PR adds support for a `region` keyword argument to `to_zarr()`, for writing to a limited region of an existing Zarr store.

This is useful for creating large Zarr datasets without requiring dask. For example, the separate workers in a simulation job might each write a single non-overlapping chunk of a Zarr file. The standard way to handle such datasets today is to first write netCDF files in each process, and then consolidate them afterwards with dask (see #3096).

## Creating empty Zarr stores

In order to do so, the Zarr store must already exist with the desired variables in the right shapes/chunks. It is desirable to be able to create such stores without actually writing data, because the datasets we want to write in parallel may be very large. In the example below, I achieve this by filling a Dataset with lazy dask arrays and writing it with `compute=False`; a minimal sketch of the same pattern follows.
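A rough sketch of that pattern (not taken from the PR itself; the shapes, chunk sizes, and store name are invented for illustration):

```python
import dask.array as da
import xarray

# Lazy dask placeholder arrays define the shape, dtype, and chunking of each
# variable without holding any data in memory.
template = xarray.Dataset(
    {"u": (("time", "x"), da.zeros((100_000, 1_000), chunks=(1_000, 1_000)))}
)

# compute=False writes the Zarr metadata (and any non-dask variables) but
# defers the chunked array data, so this stays cheap even for huge datasets.
template.to_zarr("empty-template.zarr", compute=False)
```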
I think (1) is maybe the cleanest option (no extra API endpoints).

## Unchunked variables

One potential gotcha concerns coordinate arrays that are not chunked, e.g., consider parallel writing of a dataset divided along time that also carries 2D coordinate arrays shared by every region: each worker would end up writing those same arrays. If a Zarr store does not have atomic writes, then conceivably this could result in corrupted data. The default DirectoryStore has atomic writes, and cloud-based object stores should also be atomic, so perhaps this doesn't matter in practice, but at the very least it's inefficient and could cause issues for large-scale jobs due to resource contention. Options include:
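For concreteness, a hedged sketch of one possible mitigation, in which the caller explicitly drops variables that do not overlap the region before each write; the dataset, variable names, and store name here are invented, not quoted from the PR:

```python
import dask.array as da
import numpy as np
import xarray

# A dataset chunked along "time", plus 2D coordinates that are identical for
# every region and therefore only need to be written once.
ds = xarray.Dataset(
    {"t2m": (("time", "y", "x"), da.zeros((365, 10, 20), chunks=(30, 10, 20)))},
    coords={
        "lat": (("y", "x"), np.zeros((10, 20))),
        "lon": (("y", "x"), np.zeros((10, 20))),
    },
)

# Initializing the store writes the small, non-chunked coordinates up front.
ds.to_zarr("regions-sketch.zarr", compute=False)

# Each parallel job writes only its own time range, dropping the shared
# coordinates so that workers never re-write (or race on) the same arrays.
region = {"time": slice(0, 30)}
ds.isel(region).drop_vars(["lat", "lon"]).to_zarr("regions-sketch.zarr", region=region)
```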
I think (4) would be my preferred option. Some users would undoubtedly find this annoying, but the power-users for whom we are adding this feature would likely appreciate it.

## Usage example

```python
import xarray
import dask.array as da

ds = xarray.Dataset({'u': (('x',), da.arange(1000, chunks=100))})

# create the new zarr store, but don't write data
path = 'my-data.zarr'
ds.to_zarr(path, compute=False)

# look at the unwritten data
ds_opened = xarray.open_zarr(path)
print('Data before writing:', ds_opened.u.data[::100].compute())
# Data before writing: [ 1 100 1 100 100 1 1 1 1 1]

# write out each slice (could be in separate processes)
for start in range(0, 1000, 100):
    selection = {'x': slice(start, start + 100)}
    ds.isel(selection).to_zarr(path, region=selection)

print('Data after writing:', ds_opened.u.data[::100].compute())
# Data after writing: [ 0 100 200 300 400 500 600 700 800 900]
```
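To make "could be in separate processes" concrete, here is a hedged sketch using concurrent.futures; the store path, chunk size, and worker function are invented for illustration:

```python
from concurrent.futures import ProcessPoolExecutor

import dask.array as da
import numpy as np
import xarray

PATH = "parallel-data.zarr"
CHUNK = 100
N = 1000


def write_block(start: int) -> None:
    # Each worker builds (or loads) only its own slice of the data and writes
    # it into the pre-allocated region of the store.
    block = xarray.Dataset(
        {"u": (("x",), np.arange(start, start + CHUNK, dtype="int64"))}
    )
    block.to_zarr(PATH, region={"x": slice(start, start + CHUNK)})


if __name__ == "__main__":
    # Create the store up front without writing any chunk data, as above.
    template = xarray.Dataset(
        {"u": (("x",), da.zeros(N, chunks=CHUNK, dtype="int64"))}
    )
    template.to_zarr(PATH, compute=False)

    # Regions align with whole 100-element chunks, so no two workers ever
    # touch the same chunk file and no inter-process coordination is needed.
    with ProcessPoolExecutor() as pool:
        list(pool.map(write_block, range(0, N, CHUNK)))

    print(xarray.open_zarr(PATH).u.values[::100])
```

Keeping each region aligned with whole chunks is what makes the uncoordinated writes safe for the chunked variable; the unchunked-variable question above is the remaining place where a convention is needed.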
{ "url": "https://api.github.com/repos/pydata/xarray/issues/4035/reactions", "total_count": 4, "+1": 4, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |