
issue_comments

71 rows where author_association = "MEMBER" and issue = 253136694 sorted by updated_at descending

user 5

  • rabernat 37
  • shoyer 17
  • mrocklin 9
  • jhamman 7
  • fmaussion 1

issue 1

  • WIP: Zarr backend · 71

author_association 1

  • MEMBER · 71
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
364954680 https://github.com/pydata/xarray/pull/1528#issuecomment-364954680 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM2NDk1NDY4MA== rabernat 1197350 2018-02-12T15:21:51Z 2018-02-12T15:21:51Z MEMBER

I'm enjoying this discussion. Zarr offers lots of new possibilities for appending / updating datasets that we should try to support. I personally would really like to be able to append / extend existing arrays from within xarray.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
364804265 https://github.com/pydata/xarray/pull/1528#issuecomment-364804265 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM2NDgwNDI2NQ== shoyer 1217238 2018-02-12T00:15:23Z 2018-02-12T00:15:23Z MEMBER

See https://github.com/dask/dask/issues/2000 for the dask issue. Once this works in dask it should be quite easy to implement in xarray, too.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
364804162 https://github.com/pydata/xarray/pull/1528#issuecomment-364804162 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM2NDgwNDE2Mg== shoyer 1217238 2018-02-12T00:14:22Z 2018-02-12T00:14:22Z MEMBER

@martindurant that could probably be addressed most cleanly by improving __setitem__ support for dask.array.

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
364802374 https://github.com/pydata/xarray/pull/1528#issuecomment-364802374 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM2NDgwMjM3NA== jhamman 2443309 2018-02-11T23:54:01Z 2018-02-11T23:54:01Z MEMBER

@martindurant - If I understand your question correctly, I think you should be able to follow a pretty standard xarray workflow:

```Python
ds = xr.Dataset()
ds['your_varname'] = xr.DataArray(some_dask_array,
                                  dims=['dimname0', 'dimname1', ...],
                                  coords=dict_of_preknown_coords)

# repeat for each variable you want in your dataset

ds.to_zarr(some_zarr_store)

# then to open
ds2 = xr.open_zarr(some_zarr_store)
```

Two things to note:

1) if you are looking for decent performance when writing to a remote store, make sure you're working off xarray@master, as #1800 fixed a number of choke points in the to_zarr implementation
2) if you are pushing to GCS, some_zarr_store can be a GCSMap (see the sketch below)
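
For point 2, a minimal sketch of passing a GCSMap as the store (assuming gcsfs is installed; the project and bucket names are placeholders, and ds is the dataset built in the workflow above):

```python
import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem(project='your-project')                # placeholder project
store = gcsfs.mapping.GCSMap('your-bucket/store.zarr', gcs=fs)  # placeholder bucket/path
ds.to_zarr(store=store)      # ds built as in the workflow above
ds2 = xr.open_zarr(store)
```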

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
364801395 https://github.com/pydata/xarray/pull/1528#issuecomment-364801395 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM2NDgwMTM5NQ== mrocklin 306380 2018-02-11T23:40:18Z 2018-02-11T23:40:18Z MEMBER

Does the to_zarr method suffice: http://xarray.pydata.org/en/latest/generated/xarray.Dataset.to_zarr.html#xarray.Dataset.to_zarr ?

On Sun, Feb 11, 2018 at 6:35 PM, Martin Durant notifications@github.com wrote:

Question: how would one build a zarr-xarray dataset?

With zarr you can open an array that contains no data, and use set-slice notation to fill in the values (which is what dask's store essentially does).

If I have some pre-known coordinates and bigger-than-memory data arrays, how would I go about getting the values into the zarr structure? If this can't be done directly with the xarray interface, is there a way to call zarr's open/create/zeros such that the corresponding array will appear as a variable when the same dataset is opened with xarray?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/pull/1528#issuecomment-364801073, or mute the thread https://github.com/notifications/unsubscribe-auth/AASszIWtzhFRhlOoLnRJiQrTubrDuQ0Xks5tT3lIgaJpZM4PDrlp .

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
351588678 https://github.com/pydata/xarray/pull/1528#issuecomment-351588678 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM1MTU4ODY3OA== shoyer 1217238 2017-12-14T02:23:03Z 2017-12-14T02:23:03Z MEMBER

woohoo, thank you Ryan!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
351401474 https://github.com/pydata/xarray/pull/1528#issuecomment-351401474 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM1MTQwMTQ3NA== rabernat 1197350 2017-12-13T14:09:12Z 2017-12-13T14:09:12Z MEMBER

Will merge later today if no further comments.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
350557153 https://github.com/pydata/xarray/pull/1528#issuecomment-350557153 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM1MDU1NzE1Mw== fmaussion 10050469 2017-12-10T15:45:13Z 2017-12-10T15:45:13Z MEMBER

Thanks for the tremendous work @rabernat , looking forward to testing this!

In the future it would be nice to briefly describe the advantages of zarr over netCDF for new users. A speed benchmark could help, too! This can be done once the backend has matured and when we refactor the I/O docs.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
350365780 https://github.com/pydata/xarray/pull/1528#issuecomment-350365780 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM1MDM2NTc4MA== rabernat 1197350 2017-12-08T20:36:26Z 2017-12-08T20:36:26Z MEMBER

Any more reviews? @fmaussion & @pwolfram: you have experience with backends. Your reviews would be valuable.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
350352097 https://github.com/pydata/xarray/pull/1528#issuecomment-350352097 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM1MDM1MjA5Nw== shoyer 1217238 2017-12-08T19:34:09Z 2017-12-08T19:34:09Z MEMBER

The default keyword was introduced in python 3.4, so this doesn't work in 2.7. I have tried a couple of options to overcome this but none of them have worked.

Oops, this is my fault!

Instead, try:

```python
ndims = [k.ndim for k in key if isinstance(k, np.ndarray)]
array_subspace_size = max(ndims) if ndims else 0
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
350343117 https://github.com/pydata/xarray/pull/1528#issuecomment-350343117 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM1MDM0MzExNw== mrocklin 306380 2017-12-08T18:55:35Z 2017-12-08T18:55:35Z MEMBER

Not as far as I know.

On Fri, Dec 8, 2017 at 1:53 PM, Ryan Abernathey notifications@github.com wrote:

@rabernat commented on this pull request.

In xarray/backends/common.py https://github.com/pydata/xarray/pull/1528#discussion_r155848074:

```diff
@@ -184,7 +185,7 @@ def sync(self):
         import dask.array as da
         import dask
         if LooseVersion(dask.__version__) > LooseVersion('0.8.1'):
-            da.store(self.sources, self.targets, lock=GLOBAL_LOCK)
+            da.store(self.sources, self.targets, lock=self.lock)
```

There is no reason that a task run on the distributed system will not show up on the dashboard. My first guess is that somehow you're using a local scheduler.

I was not using a local scheduler. After digging further, I can see the tasks on the distributed dashboard when using a regular zarr.DirectoryStore, but not when I pass a gcsfs.mapping.GCSMap to to_zarr. Is there any reason these two should behave differently?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/pull/1528#discussion_r155848074, or mute the thread https://github.com/notifications/unsubscribe-auth/AASszH6-lNha6n9cCYIa-jDFFiH2Jk4Xks5s-YWvgaJpZM4PDrlp .

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
350336238 https://github.com/pydata/xarray/pull/1528#issuecomment-350336238 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM1MDMzNjIzOA== rabernat 1197350 2017-12-08T18:26:58Z 2017-12-08T18:26:58Z MEMBER

There is a silly lingering issue that I need help resolving.

In a8b478543a978bd98c37711609c610432fdc7d07, @jhamman added a function _replace_slices_with_arrays related to vectorized indexing. This function contains a line

```python
array_subspace_size = max(
    (k.ndim for k in key if isinstance(k, np.ndarray)), default=0)
```

The default keyword was introduced in python 3.4, so this doesn't work in 2.7. I have tried a couple of options to overcome this but none of them have worked. Would someone care to help out with this? It is possibly the last remaining issue to resolve before this PR is really ready to be merged.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
349992006 https://github.com/pydata/xarray/pull/1528#issuecomment-349992006 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0OTk5MjAwNg== rabernat 1197350 2017-12-07T14:59:12Z 2017-12-07T14:59:12Z MEMBER

@jhamman, I can't reproduce your error. If you can give me a reproducible example, I will make a test for it.

I think this is converging.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
349766763 https://github.com/pydata/xarray/pull/1528#issuecomment-349766763 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0OTc2Njc2Mw== rabernat 1197350 2017-12-06T20:36:03Z 2017-12-06T20:36:03Z MEMBER

@jhamman - but the error being raised is wrong! There is a string formatting error raised in trying to generate a useful, informative error message.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
349738624 https://github.com/pydata/xarray/pull/1528#issuecomment-349738624 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0OTczODYyNA== jhamman 2443309 2017-12-06T18:54:41Z 2017-12-06T18:54:56Z MEMBER

@rabernat - in trying out your branch, I've run into this error (mentioned by @mrocklin in pangeo-data/pangeo#19):

```Python-traceback
...
~/anaconda/envs/pangeo-dev/lib/python3.6/site-packages/xarray-0.10.0_79_g7b50320-py3.6.egg/xarray/backends/zarr.py in _extract_zarr_variable_encoding(variable, raise_on_invalid)
    228
    229     chunks = _determine_zarr_chunks(encoding.get('chunks'), variable.chunks,
--> 230                                     variable.ndim)
    231     encoding['chunks'] = chunks
    232     return encoding

~/anaconda/envs/pangeo-dev/lib/python3.6/site-packages/xarray-0.10.0_79_g7b50320-py3.6.egg/xarray/backends/zarr.py in _determine_zarr_chunks(enc_chunks, var_chunks, ndim)
    134                 "Zarr requires uniform chunk sizes excpet for final chunk."
    135                 " Variable %r has incompatible chunks. Consider "
--> 136                 "rechunking using chunk()." % var_chunks)
    137     # last chunk is allowed to be smaller
    138     last_var_chunk = all_var_chunks[-1]

TypeError: not all arguments converted during string formatting
```

As far as I can tell, reworking my chunk sizes to divide evenly into the dataset dimensions has corrected the problem.
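
A minimal sketch of that workaround, assuming a dimension whose length the new chunk size divides evenly:

```python
# e.g. if len(ds.time) == 1000, chunks of 100 divide it evenly, so all
# zarr chunks (except possibly the last) are uniform
ds = ds.chunk({'time': 100})
ds.to_zarr(some_zarr_store)
```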

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
349554730 https://github.com/pydata/xarray/pull/1528#issuecomment-349554730 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0OTU1NDczMA== shoyer 1217238 2017-12-06T07:10:37Z 2017-12-06T07:10:37Z MEMBER

I just pushed a commit adding a test for backends.zarr._replace_slices_with_arrays.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
349540155 https://github.com/pydata/xarray/pull/1528#issuecomment-349540155 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0OTU0MDE1NQ== rabernat 1197350 2017-12-06T05:38:26Z 2017-12-06T05:38:26Z MEMBER

I believe that this is now complete enough to consider merging. I have addressed nearly all of @shoyer's suggestions. I have added a bunch more tests and am now quite satisfied with the test suite. I wrote some basic documentation, with the usual disclaimers about the experimental nature of this new feature.

The zarr tests will not run if the zarr version is less than 2.2.0, which has not been released yet. This means that only the py36-zarr-dev build actually runs the zarr tests. Once @alimanfoo releases the next version, the zarr tests should kick in on all the builds.
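
A sketch of the kind of version gate described here (the test suite's actual helper may differ):

```python
import distutils.version

import pytest
import zarr

# skip zarr backend tests unless zarr >= 2.2.0 is installed
requires_zarr_2_2 = pytest.mark.skipif(
    distutils.version.LooseVersion(zarr.__version__) < '2.2.0',
    reason='zarr backend tests require zarr >= 2.2.0')
```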

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
349495568 https://github.com/pydata/xarray/pull/1528#issuecomment-349495568 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0OTQ5NTU2OA== rabernat 1197350 2017-12-06T01:08:11Z 2017-12-06T01:08:11Z MEMBER

@jhamman - could you elaborate on the nature of the error you got with uneven dask chunks. We should be catching this and raising a useful error message.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
349488598 https://github.com/pydata/xarray/pull/1528#issuecomment-349488598 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0OTQ4ODU5OA== mrocklin 306380 2017-12-06T00:30:21Z 2017-12-06T00:30:21Z MEMBER

We tried this out on a cloud-deployed cluster on GCE and things worked pleasantly. Some conversation here: https://github.com/pangeo-data/pangeo/issues/19

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
348569223 https://github.com/pydata/xarray/pull/1528#issuecomment-348569223 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0ODU2OTIyMw== shoyer 1217238 2017-12-01T18:20:32Z 2017-12-01T18:20:32Z MEMBER

To finish it up, I propose to raise an error when attempting to encode variable-length string data. If someone can give me a quick one liner to help identify such datatypes, that would be helpful.

Variable length strings are stored with dtype=object. So something like dtype.kind == 'O' should work.
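
A quick illustration of that check:

```python
import numpy as np

values = np.array([b'ab', b'cdef'], dtype=object)
if values.dtype.kind == 'O':  # variable-length strings land here
    raise NotImplementedError('variable-length string data is not yet supported')
```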

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
348564159 https://github.com/pydata/xarray/pull/1528#issuecomment-348564159 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0ODU2NDE1OQ== rabernat 1197350 2017-12-01T17:58:59Z 2017-12-01T17:59:06Z MEMBER

Sorry this has become such a behemoth. I know it is hard to review. I couldn't see how to make a more atomic PR because a new backend has lots of interrelated parts that need each other in order to work.

To finish it up, I propose to raise an error when attempting to encode variable-length string data. If someone can give me a quick one liner to help identify such datatypes, that would be helpful.

We will revisit these encoding issues once Stephan's refactoring is merged.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
348560326 https://github.com/pydata/xarray/pull/1528#issuecomment-348560326 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0ODU2MDMyNg== shoyer 1217238 2017-12-01T17:43:03Z 2017-12-01T17:43:03Z MEMBER

I'll give this another look over the weekend.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
348414545 https://github.com/pydata/xarray/pull/1528#issuecomment-348414545 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0ODQxNDU0NQ== jhamman 2443309 2017-12-01T06:40:47Z 2017-12-01T06:40:47Z MEMBER

@rabernat - following @shoyer's thoughts here and in #1753, I'm not opposed to skipping the last few failing tests and living to fight strings another day.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347989858 https://github.com/pydata/xarray/pull/1528#issuecomment-347989858 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0Nzk4OTg1OA== rabernat 1197350 2017-11-29T20:42:34Z 2017-11-29T20:42:34Z MEMBER

Actually, I think I just realized how to do it without too much pain. Stand by.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347987097 https://github.com/pydata/xarray/pull/1528#issuecomment-347987097 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0Nzk4NzA5Nw== rabernat 1197350 2017-11-29T20:32:07Z 2017-11-29T20:32:07Z MEMBER

Is it possible to add one of these filters to XArray's default use of Zarr?

Because of the way the backends are structured right now, it is hard to bypass the existing encoding and replace it with a new encoding scheme. #1087 will make this easy to do, but for now it is complicated.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347984582 https://github.com/pydata/xarray/pull/1528#issuecomment-347984582 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0Nzk4NDU4Mg== shoyer 1217238 2017-11-29T20:22:33Z 2017-11-29T20:22:33Z MEMBER

I'm fine skipping strings entirely for now. They are indeed unneeded for most netCDF datasets.

On Wed, Nov 29, 2017 at 8:18 PM Ryan Abernathey notifications@github.com wrote:

Right now I am in a dilemma over how to move forward. Fixing this string encoding issue will require some serious hacks to cf encoding. If I do this before #1087 https://github.com/pydata/xarray/pull/1087 is finished, it will be a waste of time (and a pain). On the other hand #1087 https://github.com/pydata/xarray/pull/1087 could take a long time, since it is a major refactor itself.

Is there some way to punt on the multi-length string encoding for now? We could just error if such variables are present. This would allow us to get the experimental zarr backend out into the wild. FWIW, none of the datasets I want to use this with actually have any string data variables at all. I believe 95% of netcdf datasets are just regular numbers. This is an edge case.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/pull/1528#issuecomment-347983448, or mute the thread https://github.com/notifications/unsubscribe-auth/ABKS1pSXGvZDcCgNX3DRBZs3yupZB118ks5s7bwBgaJpZM4PDrlp .

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347983854 https://github.com/pydata/xarray/pull/1528#issuecomment-347983854 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0Nzk4Mzg1NA== mrocklin 306380 2017-11-29T20:19:37Z 2017-11-29T20:19:37Z MEMBER

FWIW I think the best option at the moment is to make sure you add either a Pickle or MsgPack filter for any zarr array with an object dtype.

Is it possible to add one of these filters to XArray's default use of Zarr?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347983448 https://github.com/pydata/xarray/pull/1528#issuecomment-347983448 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0Nzk4MzQ0OA== rabernat 1197350 2017-11-29T20:18:08Z 2017-11-29T20:18:08Z MEMBER

Right now I am in a dilemma over how to move forward. Fixing this string encoding issue will require some serious hacks to cf encoding. If I do this before #1087 is finished, it will be a waste of time (and a pain). On the other hand #1087 could take a long time, since it is a major refactor itself.

Is there some way to punt on the multi-length string encoding for now? We could just error if such variables are present. This would allow us to get the experimental zarr backend out into the wild. FWIW, none of the datasets I want to use this with actually have any string data variables at all. I believe 95% of netcdf datasets are just regular numbers. This is an edge case.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347981682 https://github.com/pydata/xarray/pull/1528#issuecomment-347981682 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0Nzk4MTY4Mg== mrocklin 306380 2017-11-29T20:11:25Z 2017-11-29T20:11:25Z MEMBER

FWIW my vote is for msgpack over pickle for both performance and cross-language reasons

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347351224 https://github.com/pydata/xarray/pull/1528#issuecomment-347351224 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NzM1MTIyNA== shoyer 1217238 2017-11-27T22:32:47Z 2017-11-28T07:51:31Z MEMBER

Overall, I find the conventions module to be a bit unwieldy. There is a lot of stuff in there, not all of which is related to CF conventions. It would be useful to separate the actual conventions from the encoding / decoding needed for different backends.

Agreed!

I wonder why zarr doesn't have a UTF-8 variable length string type (https://github.com/alimanfoo/zarr/issues/206) -- that would feel like the obvious first choice for encoding this data.

That said, xarray should be able to use fixed-length bytes just fine, doing UTF-8 encoding/decoding on the fly.
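
A rough sketch of that on-the-fly round trip with plain numpy (not what the backend currently does):

```python
import numpy as np

strings = np.array(['ab', 'cdef'])          # unicode ('U') dtype
encoded = np.char.encode(strings, 'utf-8')  # fixed-length bytes ('S') dtype
decoded = np.char.decode(encoded, 'utf-8')  # back to unicode on read
assert (decoded == strings).all()
```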

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347382612 https://github.com/pydata/xarray/pull/1528#issuecomment-347382612 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NzM4MjYxMg== rabernat 1197350 2017-11-28T01:21:34Z 2017-11-28T01:21:34Z MEMBER

When still in the original interpreter session, all the objects still exist in memory, so all the pointers stored in the array are still valid.

Do you think this persistence could affect xarray's tests? The way the tests work is via a context manager, like this:

```python
@contextlib.contextmanager
def roundtrip(self, data, save_kwargs={}, open_kwargs={},
              allow_cleanup_failure=False):
    with create_tmp_file(
            suffix='.zarr',
            allow_cleanup_failure=allow_cleanup_failure) as tmp_file:
        data.to_zarr(store=tmp_file, **save_kwargs)
        yield xr.open_zarr(tmp_file, **open_kwargs)
```

Do we need to add an extra step after data.to_zarr to somehow purge such objects?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347381865 https://github.com/pydata/xarray/pull/1528#issuecomment-347381865 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NzM4MTg2NQ== rabernat 1197350 2017-11-28T01:16:58Z 2017-11-28T01:16:58Z MEMBER

Out[2]: Bus error: 10 😱

Perhaps zarr should raise an error when assigning zgs.x[:] = values?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347380750 https://github.com/pydata/xarray/pull/1528#issuecomment-347380750 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NzM4MDc1MA== rabernat 1197350 2017-11-28T01:10:01Z 2017-11-28T01:10:10Z MEMBER

zarr needs a filter that can encode and pack the strings into a single buffer, except in the special case where the data are being stored in-memory

@alimanfoo: the following also seems to work with a directory store:

```python
values = np.array([b'ab', b'cdef', np.nan], dtype=object)
zgs = zarr.open_group(store='zarr_directory')
zgs.create('x', shape=values.shape, dtype=values.dtype)
zgs.x[:] = values
```

This seems to contradict your statement above. What am I missing?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347323043 https://github.com/pydata/xarray/pull/1528#issuecomment-347323043 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NzMyMzA0Mw== rabernat 1197350 2017-11-27T20:48:35Z 2017-11-27T20:53:28Z MEMBER

After a few more tweaks, this is now quite close to passing all the CFEncodedDataTest tests.

The remaining issues are all related to the encoding of strings. Basically, zarr's handling of strings (http://zarr.readthedocs.io/en/latest/tutorial.html?highlight=strings#string-arrays) is considerably different from netCDF's. Because ZarrStore is a subclass of WritableCFDataStore, all of the dataset variables get passed through encode_cf_variable before writing. This screws up things that already work quite naturally.

Consider the following direct creation of a variable length string in zarr:

```python
values = np.array([b'ab', b'cdef', np.nan], dtype=object)
zgs = zarr.open_group()
zgs.create('x', shape=values.shape, dtype=values.dtype)
zgs.x[:] = values
zgs.x
```

```
Array(/x, (3,), object, chunks=(3,), order=C)
  nbytes: 24; nbytes_stored: 350; ratio: 0.1; initialized: 1/1
  compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
  store: DictStore
```

It seems we can encode variable-length strings into objects just fine. (np.testing.assert_array_equal(values, zgs.x[:]) fails only because of the nan value. The array round-trips just fine.)

However, after passing through xarray's cf encoding, this no longer works:

```python
encoding = {'_FillValue': b'X', 'dtype': 'S1'}
original = xr.Dataset({'x': ('t', values, {}, encoding)})
zarr_dict_store = {}
original.to_zarr(store=zarr_dict_store)
zs = zarr.open_group(store=zarr_dict_store)
print(zs.x)
print(zs.x[:])
```

```
Array(/x, (3, 4), |S1, chunks=(3, 4), order=C)
  nbytes: 12; nbytes_stored: 428; ratio: 0.0; initialized: 1/1
  compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
  store: dict
array([[b'a', b'b', b'', b''],
       [b'c', b'd', b'e', b'f'],
       [b'X', b'', b'', b'']],
      dtype='|S1')
```

Here is everything that happens in encode_cf_variable:

```python
var = maybe_encode_datetime(var, name=name)
var = maybe_encode_timedelta(var, name=name)
var, needs_copy = maybe_encode_offset_and_scale(var, needs_copy, name=name)
var, needs_copy = maybe_encode_fill_value(var, needs_copy, name=name)
var = maybe_encode_nonstring_dtype(var, name=name)
var = maybe_default_fill_value(var)
var = maybe_encode_bools(var)
var = ensure_dtype_not_object(var, name=name)
var = maybe_encode_string_dtype(var, name=name)
```

The challenge now is to figure out which parts of this we need to bypass for zarr and how to implement that bypassing.

Overall, I find the conventions module to be a bit unwieldy. There is a lot of stuff in there, not all of which is related to CF conventions. It would be useful to separate the actual conventions from the encoding / decoding needed for different backends.

At this point, I would appreciate some input from an encoding expert before I go refactoring stuff.

edit: The actual tests that fail are CFEncodedDataTest.test_roundtrip_bytes_with_fill_value and CFEncodedDataTest.test_roundtrip_string_encoded_characters. One option to move forward would be just to skip those tests for zarr. I am eager to get this out in the wild to see how it plays with real datasets.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345778844 https://github.com/pydata/xarray/pull/1528#issuecomment-345778844 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTc3ODg0NA== mrocklin 306380 2017-11-20T18:05:25Z 2017-11-20T18:05:25Z MEMBER

This is, of course, by design :)

It's so nice when well-designed things come together and just work as planned :)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345575240 https://github.com/pydata/xarray/pull/1528#issuecomment-345575240 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTU3NTI0MA== mrocklin 306380 2017-11-20T02:28:07Z 2017-11-20T02:28:07Z MEMBER

That is, indeed, quite exciting. Also exciting is that I was able to look at and compute on your data easily.

```python
In [1]: import zarr

In [2]: import gcsfs

In [3]: fs = gcsfs.GCSFileSystem(project='pangeo-181919')

In [4]: gcsmap = gcsfs.mapping.GCSMap('zarr_store_test', gcs=fs, check=True, create=False)

In [5]: import xarray as xr

In [6]: ds_gcs = xr.open_zarr(gcsmap, mode='r')

In [7]: ds_gcs
Out[7]:
<xarray.Dataset>
Dimensions:  (x: 200, y: 100)
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
    bar      (x) float64 dask.array<shape=(200,), chunksize=(40,)>
    foo      (y, x) float32 dask.array<shape=(100, 200), chunksize=(50, 40)>
Attributes:
    array_atr:  [1, 2]
    some_attr:  copana

In [8]: ds_gcs.sum()
Out[8]:
<xarray.Dataset>
Dimensions:  ()
Data variables:
    bar      float64 dask.array<shape=(), chunksize=()>
    foo      float32 dask.array<shape=(), chunksize=()>

In [9]: ds_gcs.sum().compute()
Out[9]:
<xarray.Dataset>
Dimensions:  ()
Data variables:
    bar      float64 0.0
    foo      float32 20000.0
```

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345574445 https://github.com/pydata/xarray/pull/1528#issuecomment-345574445 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTU3NDQ0NQ== rabernat 1197350 2017-11-20T02:21:08Z 2017-11-20T02:21:08Z MEMBER

Those following this thread will probably be very excited to learn that the following code works with my zarr_backend branch:

```python
import gcsfs
fs = gcsfs.GCSFileSystem(project='pangeo-181919', token=None)
gcsmap = gcsfs.mapping.GCSMap('zarr_store_test', gcs=fs, check=True, create=False)
ds.to_zarr(store=gcsmap)
ds_gcs = xr.open_zarr(gcsmap, mode='r')
```

I never doubted this would be possible, but seeing it in action is quite exciting!

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345128506 https://github.com/pydata/xarray/pull/1528#issuecomment-345128506 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTEyODUwNg== jhamman 2443309 2017-11-17T02:38:41Z 2017-11-17T02:38:41Z MEMBER

@rabernat - It might a little, but we'll sort it out. See https://github.com/rabernat/xarray/pull/3.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345126452 https://github.com/pydata/xarray/pull/1528#issuecomment-345126452 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTEyNjQ1Mg== rabernat 1197350 2017-11-17T02:24:56Z 2017-11-17T02:24:56Z MEMBER

@jhamman would it screw you up if I pushed a few commits tonight? I won’t touch the ZarrArrayWrapper. But I figured out how to fix auto_chunk.

Sent from my iPhone

On Nov 16, 2017, at 7:12 PM, Matthew Rocklin notifications@github.com wrote:

Hooray for standard interfaces!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345104713 https://github.com/pydata/xarray/pull/1528#issuecomment-345104713 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTEwNDcxMw== mrocklin 306380 2017-11-17T00:12:01Z 2017-11-17T00:12:01Z MEMBER

Hooray for standard interfaces!

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 1,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345101150 https://github.com/pydata/xarray/pull/1528#issuecomment-345101150 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTEwMTE1MA== mrocklin 306380 2017-11-16T23:52:48Z 2017-11-16T23:52:48Z MEMBER

The gcsfs library also provides a MutableMapping for Google Cloud Storage.

The dask.distributed library now also provides a distributed lock for synchronization if necessary, though in practice we should just rechunk the dask.array before writing.
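
A sketch of using that lock with da.store (assumes a running dask.distributed cluster; the scheduler address is a placeholder, and sources/targets are as in the backend's sync method):

```python
import dask.array as da
from distributed import Client, Lock

client = Client('scheduler-address:8786')  # placeholder scheduler address
lock = Lock('zarr-write', client=client)
da.store(sources, targets, lock=lock)      # sources/targets as in sync()
```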

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345091139 https://github.com/pydata/xarray/pull/1528#issuecomment-345091139 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTA5MTEzOQ== shoyer 1217238 2017-11-16T23:02:14Z 2017-11-16T23:02:14Z MEMBER

can we brainstorm what a ZarrArrayWrapper would need to be compatible with the new indexing API?

We will need to write new adapter code to map xarray's explicit indexer classes onto the appropriate zarr methods, e.g.,

```python
def __getitem__(self, key):
    array = self.get_array()
    if isinstance(key, BasicIndexer):
        return array[key.tuple]
    elif isinstance(key, VectorizedIndexer):
        return array.vindex[_replace_slices_with_arrays(key.tuple, self.shape)]
    else:
        assert isinstance(key, OuterIndexer)
        return array.oindex[key.tuple]


# untested, but I think this does the appropriate shape munging to make slices
# appear as the last axes of the result array
def _replace_slices_with_arrays(key, shape):
    num_slices = sum(1 for k in key if isinstance(k, slice))
    num_arrays = len(shape) - num_slices
    new_key = []
    slice_count = 0
    for k, size in zip(key, shape):
        if isinstance(k, slice):
            array = np.arange(*k.indices(size))
            sl = [np.newaxis] * len(shape)
            sl[num_arrays + slice_count] = slice(None)
            k = array[tuple(sl)]
            slice_count += 1
        else:
            assert isinstance(k, np.ndarray)
            k = k[(slice(None),) * num_arrays + (np.newaxis,) * num_slices]
        new_key.append(k)
    return tuple(new_key)
```
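
Hypothetical usage, to show the shape munging on a mixed key (zarr_array is a placeholder):

```python
# one indexing array plus one slice on a (5, 10) array: the array keeps the
# leading axis, the slice becomes a broadcastable trailing axis
key = (np.array([0, 2, 4]), slice(0, 10))
new_key = _replace_slices_with_arrays(key, shape=(5, 10))
result = zarr_array.vindex[new_key]
```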

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345034208 https://github.com/pydata/xarray/pull/1528#issuecomment-345034208 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTAzNDIwOA== rabernat 1197350 2017-11-16T19:22:01Z 2017-11-16T19:22:01Z MEMBER

Some things I would like to add to the zarr test suite:

  • [ ] specifying zarr-specific encoding options (compressors and filters)
  • [ ] writing to different zarr storage backends (e.g. dict store, can we mock an S3 store?)
  • [ ] different combinations of zarr and dask chunks. one <=> one and many <=> one are supported; one <=> many and many <=> many should raise errors / warnings (not thread safe)
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345030848 https://github.com/pydata/xarray/pull/1528#issuecomment-345030848 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTAzMDg0OA== rabernat 1197350 2017-11-16T19:10:31Z 2017-11-16T19:10:31Z MEMBER

FYI: I'm playing with your branch a bit today.

Great! If you use the latest zarr master, you should get the same test results as this travis build: https://travis-ci.org/pydata/xarray/jobs/301606996

There are two outstanding failures related to encoding (test_roundtrip_bytes_with_fill_value and test_roundtrip_string_encoded_characters). And auto-caching is not working (test_dataset_caching). I consider these pretty minor.

The biggest problem is that, for reasons I don't understand, my "auto-chunking" behavior does not work (this is covered by the only zarr-specific test method: test_auto_chunk). My goal is to have zarr be lazy-by-default and create dask chunks for every zarr chunk. However, my implementation of this does not work: https://github.com/pydata/xarray/pull/1528/files#diff-1bba25ab0d8275d763572bfdd10377c6R325
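
Roughly, the intended behavior for a single variable is equivalent to this sketch (zarr_group, ds, and 'foo' are placeholders, not the actual implementation):

```python
# one dask chunk per zarr chunk
zarr_chunks = zarr_group['foo'].chunks   # e.g. (50, 40)
ds['foo'] = ds['foo'].chunk(dict(zip(ds['foo'].dims, zarr_chunks)))
```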

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345026224 https://github.com/pydata/xarray/pull/1528#issuecomment-345026224 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTAyNjIyNA== jhamman 2443309 2017-11-16T18:53:42Z 2017-11-16T18:53:42Z MEMBER

@rabernat - FYI: I'm playing with your branch a bit today.

@shoyer and @rabernat, can we brainstorm what a ZarrArrayWrapper would need to be compatible with the new indexing API? I'm happy to implement it but could use a few pointers to get started.

```Python
class ZarrArrayWrapper(BackendArray):
    def __init__(self, variable_name, datastore):
        self.datastore = datastore
        self.variable_name = variable_name
        array = self.get_array()
        self.shape = array.shape
        self.dtype = np.dtype(array.dtype.kind + str(array.dtype.itemsize))

    def get_array(self):
        self.datastore.assert_open()
        return self.datastore.ds[self.variable_name]  # returns a zarr array

    def __getitem__(self, key):
        with self.datastore.ensure_open(autoclose=True):
            data = IndexingAdapter(self.get_array())[key]  # which indexing adapter?
            return np.array(data, dtype=self.dtype, copy=True)
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
344040853 https://github.com/pydata/xarray/pull/1528#issuecomment-344040853 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NDA0MDg1Mw== rabernat 1197350 2017-11-13T20:04:12Z 2017-11-13T20:04:12Z MEMBER

😬 that's my punishment for being slow!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
344040250 https://github.com/pydata/xarray/pull/1528#issuecomment-344040250 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NDA0MDI1MA== shoyer 1217238 2017-11-13T20:02:03Z 2017-11-13T20:02:03Z MEMBER

@rabernat sorry for the churn here, but you are also probably going to need to update after the explicit indexing changes in #1705.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
339815147 https://github.com/pydata/xarray/pull/1528#issuecomment-339815147 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzOTgxNTE0Nw== rabernat 1197350 2017-10-26T22:07:10Z 2017-10-26T22:07:10Z MEMBER

Fantastic! Are you planning a release any time soon? If not we can set up to test against the github master.

Sent from my iPhone

On Oct 26, 2017, at 5:04 PM, Alistair Miles notifications@github.com wrote:

Just to say, support for 0d arrays, and for arrays with one or more zero-length dimensions, is in zarr master.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
335204883 https://github.com/pydata/xarray/pull/1528#issuecomment-335204883 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzNTIwNDg4Mw== rabernat 1197350 2017-10-09T16:09:50Z 2017-10-09T16:09:50Z MEMBER

I'm on paternity leave for the next 2 weeks

Congratulations! If you could just merge alimanfoo/zarr#154, it would really help us move forward.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
335162205 https://github.com/pydata/xarray/pull/1528#issuecomment-335162205 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzNTE2MjIwNQ== rabernat 1197350 2017-10-09T13:43:49Z 2017-10-09T13:43:49Z MEMBER

I won't be able to put any effort into zarr in the next month

Does this include merging PRs?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
335027491 https://github.com/pydata/xarray/pull/1528#issuecomment-335027491 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzNTAyNzQ5MQ== rabernat 1197350 2017-10-08T18:23:50Z 2017-10-08T18:23:50Z MEMBER

For thoroughness this might be worth doing with custom JSON encoder on the zarr side, but would also be easy to do in the xarray wrapper.

My impression is that zarr development is moving conservatively, so we would be better off finding workarounds in xarray.

@shoyer: where in the code would you recommend putting this logic? It seems like part of encoding / decoding to me.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
334981929 https://github.com/pydata/xarray/pull/1528#issuecomment-334981929 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzNDk4MTkyOQ== rabernat 1197350 2017-10-08T04:16:58Z 2017-10-08T18:21:30Z MEMBER

There are two zarr issues that are causing some tests to fail:

  1. zarr can't store zero-dimensional arrays:

     ```python
     za = zarr.create(shape=(), store='tmp_file')
     za[...] = 0
     ```

     raises a file permission error. I believe that this is alimanfoo/zarr#150.
  2. lots of the things that xarray likes to put in attributes are not serializable by zarr:

     ```python
     za = zarr.create(shape=(1), store='tmp_file')
     za.attrs['foo'] = np.float32(0)
     ```

     raises TypeError: Object of type 'float32' is not JSON serializable. This is alimanfoo/zarr#156.

Most of the failures of tests inherited from CFEncodedDataTest can be attributed to one of these two issues.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
335015485 https://github.com/pydata/xarray/pull/1528#issuecomment-335015485 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzNTAxNTQ4NQ== shoyer 1217238 2017-10-08T15:46:36Z 2017-10-08T15:46:36Z MEMBER

For serializing attributes, the easiest fix is to call .item() on any numpy scalars (instances of np.generic) and .tolist() on any numpy arrays. For thoroughness this might be worth doing with custom JSON encoder on the zarr side, but would also be easy to do in the xarray wrapper.
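
A sketch of what that could look like in the xarray wrapper (the helper name is hypothetical):

```python
import numpy as np

def _encode_attr_value(value):
    # coerce numpy scalars/arrays to JSON-serializable Python builtins
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, np.generic):
        return value.item()
    return value
```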

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
334982373 https://github.com/pydata/xarray/pull/1528#issuecomment-334982373 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzNDk4MjM3Mw== rabernat 1197350 2017-10-08T04:31:02Z 2017-10-08T04:31:09Z MEMBER

I worked on this on the plane back from Seattle. Yay for having no internet access!

Would appreciate feedback on the questions raised above from @shoyer, @jhamman, and anyone else with backend expertise.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
334633708 https://github.com/pydata/xarray/pull/1528#issuecomment-334633708 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzNDYzMzcwOA== rabernat 1197350 2017-10-06T01:15:05Z 2017-10-06T01:15:05Z MEMBER

Here is where we are at with the Zarr backend tests:

```
xarray/tests/test_backends.py::ZarrDataTest::test_coordinates_encoding PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_dataset_caching FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_dataset_compute PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_default_fill_value FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_encoding_kwarg FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_encoding_same_dtype PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_invalid_dataarray_names_raise FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_load PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_orthogonal_indexing FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_pickle FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_pickle_dataarray PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_None_variable PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_boolean_dtype PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_coordinates PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_datetime_data FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_endian PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_example_1_netcdf FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_float64_data PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_mask_and_scale FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_object_dtype FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_string_data PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_strings_with_fill_value FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_test_data PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_timedelta_data FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_unsigned_roundtrip_mask_and_scale FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_write_store PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_zero_dimensional_variable FAILED
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
334633152 https://github.com/pydata/xarray/pull/1528#issuecomment-334633152 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzNDYzMzE1Mg== rabernat 1197350 2017-10-06T01:10:29Z 2017-10-06T01:10:29Z MEMBER

With @jhamman's help, I just made a little progress on this.

We now have a bare bones test suite for the zarr backend. This is very helpful for revealing where more work is needed: encoding. So the next step is to seriously confront that issue.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
334316122 https://github.com/pydata/xarray/pull/1528#issuecomment-334316122 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzNDMxNjEyMg== jhamman 2443309 2017-10-04T23:14:58Z 2017-10-04T23:14:58Z MEMBER

@rabernat - testing should be fully functional now.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
333579128 https://github.com/pydata/xarray/pull/1528#issuecomment-333579128 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzMzU3OTEyOA== jhamman 2443309 2017-10-02T15:58:05Z 2017-10-02T15:58:05Z MEMBER

@rabernat - re backends testing, #1557 is pretty close. I can wrap it up this week.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
333336320 https://github.com/pydata/xarray/pull/1528#issuecomment-333336320 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzMzMzNjMyMA== rabernat 1197350 2017-09-30T21:13:48Z 2017-09-30T21:13:48Z MEMBER

@martindurant: I may have some time to get back to working on this next week. (Especially if @jhamman can help me sort out the backend testing.) What is the status of your branch?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
327900874 https://github.com/pydata/xarray/pull/1528#issuecomment-327900874 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNzkwMDg3NA== shoyer 1217238 2017-09-07T19:32:41Z 2017-09-07T19:32:41Z MEMBER

@rabernat indeed, the backend tests are not terribly well organized right now. Probably the place to start is to inherit from DatasetIOTestCases and TestCase and then implement create_store and roundtrip. DaskTest abuses the "backend" notation a little bit, but these lines cover the essentials: https://github.com/pydata/xarray/blob/98a05f11c6f38489c82e86c9e9df796e7fb65fd2/xarray/tests/test_backends.py#L1271-L1279
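
Following those pointers, a skeleton for the zarr case might look something like this (class and constructor details are illustrative):

```python
import contextlib

class ZarrDataTest(DatasetIOTestCases, TestCase):
    @contextlib.contextmanager
    def create_store(self):
        with create_tmp_file(suffix='.zarr') as tmp_file:
            yield backends.ZarrStore(tmp_file, mode='w')  # illustrative constructor

    @contextlib.contextmanager
    def roundtrip(self, data, save_kwargs={}, open_kwargs={}):
        with create_tmp_file(suffix='.zarr') as tmp_file:
            data.to_zarr(store=tmp_file, **save_kwargs)
            yield xr.open_zarr(tmp_file, **open_kwargs)
```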

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
327849640 https://github.com/pydata/xarray/pull/1528#issuecomment-327849640 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNzg0OTY0MA== rabernat 1197350 2017-09-07T16:17:13Z 2017-09-07T16:17:13Z MEMBER

I am stuck on figuring out how to develop a new test case for this. (It doesn't help that #1531 is messing up the backend tests.)

If @shoyer can give us a few hints about how to best implement a test class (i.e. what to subclass, etc.), I think that could jumpstart testing and move the PR forward.

I welcome contributions from others such as @martindurant on this. I won't have much time in the near future, since a new semester just dropped on me like a load of bricks.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325742232 https://github.com/pydata/xarray/pull/1528#issuecomment-325742232 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTc0MjIzMg== shoyer 1217238 2017-08-29T17:50:04Z 2017-08-29T17:50:04Z MEMBER

If we think there is an advantage to using the zarr native filters, that could be added via a future PR once we have the basic backend working.

The only advantage here would be for non-xarray users, who could use zarr to do this decoding/encoding automatically.

For what it's worth, the implementation of scale offsets in xarray looks basically equivalent to what's done in zarr. I don't think there's a performance difference either way.

A further rather big advantage in zarr that I'm not aware of in cdf/hdf (I may be wrong) is not just null values, but not having a given block be written to disc at all if it only contains null data.

If you use chunks, I believe HDF5/NetCDF4 do the same thing, e.g.,

```python
In [10]: with h5py.File('one-chunk.h5') as f:
    ...:     f.create_dataset('foo', (100, 100), chunks=(100, 100))

In [11]: with h5py.File('many-chunk.h5') as f:
    ...:     f.create_dataset('foo', (100000, 100000), chunks=(100, 100))

In [12]: ls -l | grep chunk.h5
-rw-r--r--  1 shoyer  eng  1400 Aug 29 10:48 many-chunk.h5
-rw-r--r--  1 shoyer  eng  1400 Aug 29 10:48 one-chunk.h5
```

(Note the same file size.)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325738019 https://github.com/pydata/xarray/pull/1528#issuecomment-325738019 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTczODAxOQ== rabernat 1197350 2017-08-29T17:35:09Z 2017-08-29T17:35:09Z MEMBER

One path forward for now would be to ignore the filters like FixedScaleOffset that are not present in netCDF, let xarray handle the CF encoding / decoding, and just put the compressors (e.g. Blosc, Zlib) and their parameters in the xarray variable encoding.

If we think there is an advantage to using the zarr native filters, that could be added via a future PR once we have the basic backend working.

@alimanfoo: when do you anticipate the 2.2 zarr release to happen? Will the API change significantly? If so, I will wait for that to move forward here.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325723577 https://github.com/pydata/xarray/pull/1528#issuecomment-325723577 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTcyMzU3Nw== shoyer 1217238 2017-08-29T16:43:58Z 2017-08-29T16:44:25Z MEMBER

Is the goal here to be able to round-trip the file, such that calling .to_netcdf() produces an identical file to the original source file?

Yes, exactly.

I don't understand how encoding interacts with attributes? When is something an attribute vs. an encoding (add_offset for example)?

Typically, we store things in encoding that are attributes on the underlying NetCDF file, but no longer make sense to describe the decoded data. For example:

  • On the file, add_offset is an attribute.
  • If loaded with open_dataset(..., mask_and_scale=True), add_offset can be found in encoding, not attrs, because the data has already been offset.
  • If loaded with open_dataset(..., mask_and_scale=False), add_offset will still be on attrs (the data has not been offset).
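
Concretely (file name, variable, and values are illustrative):

```python
import xarray as xr

ds = xr.open_dataset('file.nc', mask_and_scale=True)
ds['t'].attrs.get('add_offset')      # None: offset already applied to the data
ds['t'].encoding.get('add_offset')   # e.g. 273.15, kept for round-tripping

ds = xr.open_dataset('file.nc', mask_and_scale=False)
ds['t'].attrs.get('add_offset')      # 273.15: the data is still packed
```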

How does xarray know whether the store automatically encodes / decodes the encodings vs. when it has to be done by xarray, e.g. by calling mask_and_scale

Currently, we assume that stores never do this, and always handle it ourselves. We might need a special exception for zarr and scale/offset encoding.

Does this mean that my ZarrStore should inherit from WritableCFDataStore instead of AbstractWritableDataStore?

Maybe, though again it will probably need slightly customized conventions for writing data (if we let zarr handle scale/offset encoding).

I don't yet understand how to make these elements work together properly, for example, to avoid applying the scale / offset function twice, as I mentioned above.

We have two options:

1. Handle it all in xarray via the machinery in conventions.py. Never pass the arguments to do scale/offset encoding to zarr (just save them as attributes).
2. Handle it all in zarr. We'll need special-case logic to skip this part of encoding in xarray.

I think (2) would be the preferred way to do this.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325716892 https://github.com/pydata/xarray/pull/1528#issuecomment-325716892 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTcxNjg5Mg== shoyer 1217238 2017-08-29T16:19:57Z 2017-08-29T16:19:57Z MEMBER

@rabernat I think this is #1531 -- require_pynio seems to have infected all our other requirements!

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 1,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325690352 https://github.com/pydata/xarray/pull/1528#issuecomment-325690352 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTY5MDM1Mg== rabernat 1197350 2017-08-29T14:54:53Z 2017-08-29T14:54:53Z MEMBER

I am now trying to understand the backend test suite structure.

Can someone explain to me why so many tests are skipped? For example, if I run py.test -v xarray/tests/test_backends.py -rsx -k GenericNetCDFDataTest

I get:

```
================================================== test session starts ==================================================
platform darwin -- Python 3.6.1, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /Users/rpa/anaconda/bin/python
cachedir: .cache
rootdir: /Users/rpa/RND/Public/xarray, inifile: setup.cfg
plugins: cov-2.5.1
collected 683 items

xarray/tests/test_backends.py::GenericNetCDFDataTest::test_coordinates_encoding SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_cross_engine_read_write_netcdf3 PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_dataset_caching SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_dataset_compute SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_default_fill_value SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_encoding_kwarg SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_encoding_same_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_encoding_unlimited_dims PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_engine PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_invalid_dataarray_names_raise SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_load SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_orthogonal_indexing PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_pickle SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_pickle_dataarray SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_None_variable SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_boolean_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_coordinates SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_datetime_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_endian SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_example_1_netcdf SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_float64_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_mask_and_scale SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_object_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_string_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_strings_with_fill_value SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_test_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_timedelta_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_write_store PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_zero_dimensional_variable SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_coordinates_encoding SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_cross_engine_read_write_netcdf3 PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_dataset_caching SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_dataset_compute SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_default_fill_value SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_encoding_kwarg SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_encoding_same_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_encoding_unlimited_dims PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_engine PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_invalid_dataarray_names_raise SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_load SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_orthogonal_indexing PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_pickle SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_pickle_dataarray SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_None_variable SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_boolean_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_coordinates SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_datetime_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_endian SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_example_1_netcdf SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_float64_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_mask_and_scale SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_object_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_string_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_strings_with_fill_value SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_test_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_timedelta_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_write_store PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_zero_dimensional_variable SKIPPED

================================================ short test summary info ================================================
SKIP [2] xarray/tests/test_backends.py:382: requires pynio
SKIP [2] xarray/tests/test_backends.py:214: requires pynio
SKIP [2] xarray/tests/test_backends.py:178: requires pynio
SKIP [2] xarray/tests/test_backends.py:468: requires pynio
SKIP [2] xarray/tests/test_backends.py:439: requires pynio
SKIP [2] xarray/tests/test_backends.py:490: requires pynio
SKIP [2] xarray/tests/test_backends.py:428: requires pynio
SKIP [2] xarray/tests/test_backends.py:145: requires pynio
SKIP [2] xarray/tests/test_backends.py:197: requires pynio
SKIP [2] xarray/tests/test_backends.py:207: requires pynio
SKIP [2] xarray/tests/test_backends.py:230: requires pynio
SKIP [2] xarray/tests/test_backends.py:311: requires pynio
SKIP [2] xarray/tests/test_backends.py:300: requires pynio
SKIP [2] xarray/tests/test_backends.py:271: requires pynio
SKIP [2] xarray/tests/test_backends.py:409: requires pynio
SKIP [2] xarray/tests/test_backends.py:291: requires pynio
SKIP [2] xarray/tests/test_backends.py:286: requires pynio
SKIP [2] xarray/tests/test_backends.py:362: requires pynio
SKIP [2] xarray/tests/test_backends.py:235: requires pynio
SKIP [2] xarray/tests/test_backends.py:264: requires pynio
SKIP [2] xarray/tests/test_backends.py:334: requires pynio
SKIP [2] xarray/tests/test_backends.py:139: requires pynio
SKIP [2] xarray/tests/test_backends.py:280: requires pynio
SKIP [2] xarray/tests/test_backends.py:109: requires pynio
```

Those line numbers refer to all of the skipped methods. Why should I need pynio to run those tests?

It looks like the same thing is happening on travis: https://travis-ci.org/pydata/xarray/jobs/268805771#L1527

Maybe @pwolfram understands this stuff?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325660754 https://github.com/pydata/xarray/pull/1528#issuecomment-325660754 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTY2MDc1NA== rabernat 1197350 2017-08-29T13:18:33Z 2017-08-29T13:18:33Z MEMBER

encoding keeps track of how variables are represented in a file (e.g., chunking schemes, _FillValue/add_offset/scale_factor, compression, time units), so we can reconstruct a netCDF file that looks almost exactly like the file we've read from disk.

Is the goal here to be able to round-trip the file, such that calling .to_netcdf() produces an identical file to the original source file? For zarr, I think this would mean having the ability to read from one zarr store into xarray, and then write back to a different store, and have these two stores be identical. That makes sense to me.
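In code, the goal would be something like this (store names are illustrative, and open_zarr / to_zarr are the API being discussed here, not something that exists yet):

```python
import xarray as xr

ds = xr.open_zarr('store_a.zarr')   # read everything, including encoding
ds.to_zarr('store_b.zarr')          # write it back out unchanged
# ideally store_a and store_b now hold equivalent metadata and chunks
```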

I don't understand how encoding interacts with attributes. When is something an attribute vs. an encoding (add_offset, for example)? How does xarray know whether the store automatically encodes / decodes the encodings, or when it has to be done by xarray, e.g. by calling mask_and_scale?

Should we encode / decode CF for zarr stores?

Yes, probably, if we want to handle netcdf conventions for times, fill values and scaling.

Does this mean that my ZarrStore should inherit from WritableCFDataStore instead of AbstractWritableDataStore?

Regarding encoding, zarr has its own internal encoding mechanism, which it calls "filters"; these closely resemble some of the CF encoding options. For example, the FixedScaleOffset filter does something similar to xarray's mask_and_scale function.

I don't yet understand how to make these elements work together properly, for example, to avoid applying the scale / offset function twice, as I mentioned above.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325525827 https://github.com/pydata/xarray/pull/1528#issuecomment-325525827 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTUyNTgyNw== shoyer 1217238 2017-08-29T01:14:05Z 2017-08-29T01:14:05Z MEMBER

What is "encoding" at the variable level? (I have never understood this part of xarray.) How should encoding be handled with zarr?

encoding keeps track of how variables are represented in a file (e.g., chunking schemes, _FillValue/add_offset/scale_factor, compression, time units), so we can reconstruct a netCDF file that looks almost exactly like the file we've read from disk. In the case of zarr, I guess we might include chunking, fill values, compressor options....
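For instance, this is the sort of thing encoding holds after reading a netCDF file (file and variable names are hypothetical, values illustrative):

```python
import xarray as xr

ds = xr.open_dataset('example.nc')
print(ds['temp'].encoding)
# e.g. {'zlib': True, 'complevel': 4, 'chunksizes': (1, 90, 180),
#       'dtype': dtype('int16'), 'scale_factor': 0.01,
#       'add_offset': 273.15, 'source': 'example.nc'}
```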

Should we encode / decode CF for zarr stores?

Yes, probably, if we want to handle netcdf conventions for times, fill values and scaling.

Do we want to always automatically align dask chunks with the underlying zarr chunks?

This would be nice! But it's also a bigger issue (will look for the number, I think it's already been opened).
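In the meantime, manual alignment is a one-liner; a sketch (store name illustrative):

```python
import dask.array as da
import zarr

z = zarr.open('example.zarr', mode='r')
# one dask task per compressed zarr block: no task ever straddles a chunk boundary
arr = da.from_array(z, chunks=z.chunks)
```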

What sort of public API should the zarr backend have? Should you be able to load zarr stores via open_dataset? Or do we need a new method? I think .to_zarr() would be quite useful.

Still need to think about this one.

zarr arrays are extensible along all axes. What does this imply for unlimited dimensions?

I guess we can ignore them (maybe add a warning?) -- they're not part of the zarr data model.

Is any autoclose logic needed? As far as I can tell, zarr objects don't need to be closed.

I don't think we need any autoclose logic at all -- zarr doesn't leave files hanging open in the first place.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325226656 https://github.com/pydata/xarray/pull/1528#issuecomment-325226656 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTIyNjY1Ng== rabernat 1197350 2017-08-27T21:42:23Z 2017-08-27T21:42:23Z MEMBER

Is the aim to reduce the number of metadata files hanging around?

This is also part of my goal. I think all the metadata can be stored internally to zarr via attributes. There just have to be some "special" attributes that xarray hides from the user. This is the same approach h5netcdf takes.

@alimanfoo suggested this should be possible in that earlier thread:

Specifically I'm wondering if this could all be stored as attributes on the Zarr array, with some conventions for special xarray attribute names?
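A minimal sketch of that convention (the attribute name _ARRAY_DIMENSIONS is one possible choice, not something zarr itself defines):

```python
import zarr

group = zarr.open_group('example.zarr', mode='w')
arr = group.create_dataset('temperature', shape=(12, 90, 180),
                           chunks=(1, 90, 180), dtype='f4')
# "special" attribute recording dimension names; the xarray backend would
# read this on open and then hide it from the user-facing attrs
arr.attrs['_ARRAY_DIMENSIONS'] = ['time', 'lat', 'lon']
arr.attrs['units'] = 'K'  # ordinary attribute, stays visible
```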

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325226495 https://github.com/pydata/xarray/pull/1528#issuecomment-325226495 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTIyNjQ5NQ== rabernat 1197350 2017-08-27T21:38:35Z 2017-08-27T21:38:35Z MEMBER

Could you comment more on the difference between your approach and mine?

Your functions are a great proof of concept for the relative ease of interoperability between xarray and zarr. What I have done here is to implement an xarray "backend" (i.e. DataStore) that uses zarr as its storage medium. This puts zarr on the same level as netCDF and HDF5 as a "first class" storage format for xarray data, as suggested by @shoyer in the comment on that thread. My hope is that this will enable the magical performance benefits that you have anticipated.

Digging deeper into that thread, I see @shoyer makes the following proposition:

So we could either directly write a DataStore or write a separate "znetcdf" or "netzdf" module that implements an interface similar to h5netcdf (which itself is a thin wrapper on top of h5py).

With this PR, I have started to do the former (write a DataStore). However, I can already see the wisdom of what he says next:

All things being equal, I would prefer the later approach, because people seem to find these intermediate interfaces useful, and it would help clarify the specification of the file format vs. details of how xarray uses it.

I have already implemented my own custom DataStore for a different project, so I felt comfortable diving into this. But I might end up reinventing the wheel several times over if I continue down this road. In particular, I can see that my HiddenKeyDict is very similar to h5netcdf's treatment of attributes. (I had never looked at the h5netcdf code until just now!)
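For reference, the idea behind HiddenKeyDict is roughly this (a sketch, not necessarily the exact implementation in this PR):

```python
from collections.abc import MutableMapping

class HiddenKeyDict(MutableMapping):
    """Wrap a mapping, making selected keys invisible to the user."""

    def __init__(self, data, hidden_keys):
        self._data = data
        self._hidden = frozenset(hidden_keys)

    def _check(self, key):
        if key in self._hidden:
            raise KeyError('%r is hidden' % key)

    def __getitem__(self, key):
        self._check(key)
        return self._data[key]

    def __setitem__(self, key, value):
        self._check(key)
        self._data[key] = value

    def __delitem__(self, key):
        self._check(key)
        del self._data[key]

    def __iter__(self):
        return (k for k in self._data if k not in self._hidden)

    def __len__(self):
        return sum(1 for _ in self)

attrs = HiddenKeyDict({'_ARRAY_DIMENSIONS': ['x'], 'units': 'K'},
                      hidden_keys=['_ARRAY_DIMENSIONS'])
print(dict(attrs))  # {'units': 'K'} -- the special key is invisible
```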

On the other hand, zarr is so simple to use that a separate wrapper package might be overkill.

So I am still not sure whether the approach I am taking here is worth pursuing further. I consider this a highly experimental PR, and I'm really looking for feedback.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325173551 https://github.com/pydata/xarray/pull/1528#issuecomment-325173551 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTE3MzU1MQ== rabernat 1197350 2017-08-27T02:40:22Z 2017-08-27T02:40:22Z MEMBER

cc @martindurant, @mrocklin, @alimanfoo

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
