pull_requests: 137819104
field | value
---|---
id | 137819104
node_id | MDExOlB1bGxSZXF1ZXN0MTM3ODE5MTA0
number | 1528
state | closed
locked | 0
title | WIP: Zarr backend
user | 1197350
created_at | 2017-08-27T02:38:01Z
updated_at | 2018-02-13T21:35:03Z
closed_at | 2017-12-14T02:11:36Z
merged_at | 2017-12-14T02:11:36Z
merge_commit_sha | 8fe7eb0fbcb7aaa90d894bcf32dc1408735e5d9d
assignee |
milestone |
draft | 0
head | f5633cabd19189675b607379badc2c19b86c0b8e
base | 89a1a9883c0c8409dad8dbcccf1ab73a3ea2cafc
author_association | MEMBER
auto_merge |
repo | 13221727
url | https://github.com/pydata/xarray/pull/1528
merged_by |

body:

- [x] Closes #1223
- [x] Tests added / passed
- [x] Passes ``git diff upstream/master | flake8 --diff``
- [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API

I think that a zarr backend could be the ideal storage format for xarray datasets, overcoming many of the frustrations associated with netCDF and enabling optimal performance on cloud platforms.

This is a very basic start to implementing a zarr backend (as proposed in #1223); however, I am taking a somewhat different approach. I store the whole dataset in a single zarr group. I encode the extra metadata needed by xarray (so far just dimension information) as attributes within the zarr group and child arrays. I hide these special attributes from the user by wrapping the attribute dictionaries in a "`HiddenKeyDict`", so that they can't be viewed or modified (a minimal sketch of such a wrapper follows after the questions below).

I have no tests yet (:flushed:), but the following code works:

```python
from xarray.backends.zarr import ZarrStore
import xarray as xr
import numpy as np

ds = xr.Dataset(
    {'foo': (('y', 'x'), np.ones((100, 200)), {'myattr1': 1, 'myattr2': 2}),
     'bar': (('x',), np.zeros(200))},
    {'y': (('y',), np.arange(100)), 'x': (('x',), np.arange(200))},
    {'some_attr': 'copana'}
).chunk({'y': 50, 'x': 40})

zs = ZarrStore(store='zarr_test')
ds.dump_to_store(zs)
ds2 = xr.Dataset.load_store(zs)
assert ds2.equals(ds)
```

There is a very long way to go here, but I thought I would just get a PR started. Here are some questions whose answers would help me move forward:

1. What is "encoding" at the variable level? (I have never understood this part of xarray.) How should encoding be handled with zarr? (See the note below.)
2. Should we encode / decode CF for zarr stores?
3. Do we want to always automatically align dask chunks with the underlying zarr chunks? (See the chunk-alignment sketch below.)
4. What sort of public API should the zarr backend have? Should you be able to load zarr stores via `open_dataset`? Or do we need a new method? I think `.to_zarr()` would be quite useful. (A hypothetical usage sketch follows below.)
5. zarr arrays are extensible along all axes. What does this imply for unlimited dimensions? (See the resize/append sketch below.)
6. Is any autoclose logic needed? As far as I can tell, zarr objects don't need to be closed.
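For concreteness, a `HiddenKeyDict`-style wrapper could be as simple as the following sketch. This is not the implementation from this PR, and the hidden attribute name `_ARRAY_DIMENSIONS` is used purely for illustration:

```python
from collections.abc import MutableMapping

class HiddenKeyDict(MutableMapping):
    """Dict wrapper that makes a fixed set of keys invisible and
    unmodifiable (a sketch, not this PR's actual implementation)."""

    def __init__(self, data, hidden_keys):
        self._data = data
        self._hidden = frozenset(hidden_keys)

    def _raise_if_hidden(self, key):
        if key in self._hidden:
            raise KeyError('%r is a hidden key' % key)

    def __getitem__(self, key):
        self._raise_if_hidden(key)
        return self._data[key]

    def __setitem__(self, key, value):
        self._raise_if_hidden(key)
        self._data[key] = value

    def __delitem__(self, key):
        self._raise_if_hidden(key)
        del self._data[key]

    def __iter__(self):
        return (k for k in self._data if k not in self._hidden)

    def __len__(self):
        return sum(1 for _ in self)

# The xarray-specific attribute stays in the underlying zarr attributes
# but is invisible through the wrapper.
attrs = {'_ARRAY_DIMENSIONS': ['y', 'x'], 'myattr1': 1}
visible = HiddenKeyDict(attrs, ['_ARRAY_DIMENSIONS'])
assert list(visible) == ['myattr1']
```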
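On question 1: each xarray variable carries an `encoding` dict that records how the data is stored on disk (dtype, compression, fill value, chunk shape), separate from its in-memory representation. A freshly constructed dataset has empty encodings; backends populate them on read and consult them on write. A quick way to see this, reusing the `ds` from the example above:

```python
# For an in-memory dataset the encoding dict starts out empty.
print(ds['foo'].encoding)   # -> {}
# Hypothetically, a zarr backend might record something like
# {'chunks': (50, 40), 'compressor': ...} here after a round trip.
```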
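On question 3, "aligning dask chunks with the underlying zarr chunks" could mean constructing the dask array directly from the zarr array's own chunking, so that no dask task reads across a zarr chunk boundary. A sketch, assuming the store written above contains an array named `foo`:

```python
import dask.array as da
import zarr

# Open one array from the store (read-only) and wrap it in a dask
# array whose chunks exactly match the on-disk zarr chunks.
z = zarr.open('zarr_test/foo', mode='r')
arr = da.from_array(z, chunks=z.chunks)
print(arr.chunks)  # ((50, 50), (40, 40, 40, 40, 40)) for the 100x200 array
```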
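On question 4, the public API being floated might eventually look something like this. This is entirely hypothetical; neither the method nor the engine name exists in this PR:

```python
# Hypothetical round trip mirroring the existing netCDF-style API.
ds.to_zarr('zarr_test')                            # proposed writer
ds2 = xr.open_dataset('zarr_test', engine='zarr')  # or a new open_zarr()
```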
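On question 5, the extensibility in question is plain zarr behavior, independent of this PR: any zarr array can be resized or appended to along any axis, so in principle every dimension behaves like a netCDF unlimited dimension:

```python
import numpy as np
import zarr

# Grow a zarr array after creation, along any axis.
z = zarr.zeros((100, 200), chunks=(50, 40), dtype='f8')
z.resize(150, 200)            # grow the first axis in place
z.append(np.ones((10, 200)))  # append along axis 0 (the default)
print(z.shape)                # (160, 200)
```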
Links from other tables
- 2 rows from pull_requests_id in labels_pull_requests