pydata/xarray pull request #1528: WIP: Zarr backend

State: closed · Author association: MEMBER · 103 comments
Created 2017-08-27 · Closed 2017-12-14 · Last updated 2018-02-13
  • [x] Closes #1223
  • [x] Tests added / passed
  • [x] Passes git diff upstream/master | flake8 --diff
  • [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API

I think that a zarr backend could be the ideal storage format for xarray datasets, overcoming many of the frustrations associated with netcdf and enabling optimal performance on cloud platforms.

This is a very basic start to implementing a zarr backend (as proposed in #1223); however, I am taking a somewhat different approach. I store the whole dataset in a single zarr group. I encode the extra metadata needed by xarray (so far just dimension information) as attributes within the zarr group and its child arrays. I hide these special attributes from the user by wrapping the attribute dictionaries in a "HiddenKeyDict", so that they can't be viewed or modified.
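The idea behind the attribute wrapping can be sketched as follows. This is a hypothetical illustration, not the PR's actual implementation: the class name `HiddenKeyDict` comes from the text above, but its interface and the hidden-key name `_dimensions` are placeholders here.

```python
from collections.abc import MutableMapping


class HiddenKeyDict(MutableMapping):
    """Mapping wrapper that hides a fixed set of keys from the user.

    Lookups, writes, deletions, and iteration all skip the hidden keys,
    while the underlying dict keeps them intact for the backend's use.
    """

    def __init__(self, data, hidden_keys):
        self._data = data
        self._hidden = set(hidden_keys)

    def _check(self, key):
        if key in self._hidden:
            raise KeyError(f'{key!r} is a hidden key')

    def __getitem__(self, key):
        self._check(key)
        return self._data[key]

    def __setitem__(self, key, value):
        self._check(key)
        self._data[key] = value

    def __delitem__(self, key):
        self._check(key)
        del self._data[key]

    def __iter__(self):
        return (k for k in self._data if k not in self._hidden)

    def __len__(self):
        return sum(1 for _ in self)


# Backend-internal attrs include dimension metadata under a special key
# ('_dimensions' is a placeholder name); the user-facing view hides it.
raw_attrs = {'_dimensions': ('y', 'x'), 'myattr1': 1, 'myattr2': 2}
user_attrs = HiddenKeyDict(raw_attrs, ['_dimensions'])
assert sorted(user_attrs) == ['myattr1', 'myattr2']
```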

I have no tests yet (:flushed:), but the following code works.

```python
from xarray.backends.zarr import ZarrStore
import xarray as xr
import numpy as np

ds = xr.Dataset(
    {'foo': (('y', 'x'), np.ones((100, 200)), {'myattr1': 1, 'myattr2': 2}),
     'bar': (('x',), np.zeros(200))},
    {'y': (('y',), np.arange(100)), 'x': (('x',), np.arange(200))},
    {'some_attr': 'copana'}
).chunk({'y': 50, 'x': 40})

zs = ZarrStore(store='zarr_test')
ds.dump_to_store(zs)
ds2 = xr.Dataset.load_store(zs)
assert ds2.equals(ds)
```

There is a very long way to go here, but I thought I would just get a PR started. Here are some questions whose answers would help me move forward:

  1. What is "encoding" at the variable level? (I have never understood this part of xarray.) How should encoding be handled with zarr?
  2. Should we encode / decode CF for zarr stores?
  3. Do we want to always automatically align dask chunks with the underlying zarr chunks?
  4. What sort of public API should the zarr backend have? Should you be able to load zarr stores via open_dataset? Or do we need a new method? I think .to_zarr() would be quite useful.
  5. zarr arrays are extensible along all axes. What does this imply for unlimited dimensions?
  6. Is any autoclose logic needed? As far as I can tell, zarr objects don't need to be closed.
Reactions: 1 (+1: 1)
