html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/pull/1528#issuecomment-365412033,https://api.github.com/repos/pydata/xarray/issues/1528,365412033,MDEyOklzc3VlQ29tbWVudDM2NTQxMjAzMw==,6042212,2018-02-13T21:35:03Z,2018-02-13T21:35:03Z,CONTRIBUTOR,"Yeah, ideally when adding a variable like
```
ds['myvar'] = xr.DataArray(data=da.zeros(..., chunks=(..)), dims=['l', 'b', 'v'])
ds.to_zarr(mapping)
```
we should be able to apply an optimization strategy in which the zarr array is created without filling in all those unnecessary zeros. This seems doable.
On the other hand, implementing
```
ds.myvar[slice, slice, slice] = some data
ds.to_zarr(mapping)
```
(which cannot currently be done with dask arrays at all), in such a way that only partitions with data get updated - this seems really hard.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-364954680,https://api.github.com/repos/pydata/xarray/issues/1528,364954680,MDEyOklzc3VlQ29tbWVudDM2NDk1NDY4MA==,1197350,2018-02-12T15:21:51Z,2018-02-12T15:21:51Z,MEMBER,I'm enjoying this discussion. Zarr offers lots of new possibilities for appending / updating datasets that we should try to support. I personally would really like to be able to append / extend existing arrays from within xarray.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-364817111,https://api.github.com/repos/pydata/xarray/issues/1528,364817111,MDEyOklzc3VlQ29tbWVudDM2NDgxNzExMQ==,6042212,2018-02-12T02:43:43Z,2018-02-12T03:47:48Z,CONTRIBUTOR,"OK, so the way to do this in pure zarr appears to be to simply create the appropriate zarr array and set its dimensions attribute:
```
ds = xr.Dataset(coords={'b': np.arange(-4, 6, 0.005),
                        'l': np.arange(150, 72, -0.005),
                        'v': np.arange(58722.24288, -164706.4225401, -8.2446e2)})
ds.to_zarr(mapping)
g = zarr.open_group(mapping)
arr = g.zeros(...)  # pseudocode: shape matching the l, b, v coords
arr.attrs['_ARRAY_DIMENSIONS'] = ['l', 'b', 'v']
```
`xr.open_zarr(mapping)` now shows the new array, without having to materialize any data into it, and `arr` can be written to piecemeal - without the convenience of the coordinate mapping, of course.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-364812486,https://api.github.com/repos/pydata/xarray/issues/1528,364812486,MDEyOklzc3VlQ29tbWVudDM2NDgxMjQ4Ng==,3019665,2018-02-12T01:51:40Z,2018-02-12T01:51:40Z,NONE,"So Zarr supports storing structured arrays. Maybe that's what you are looking for, @martindurant? Would suggest using the latest 2.2.0 RC though as it fixed a few issues in this regard (particularly with NumPy 1.14).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-364804697,https://api.github.com/repos/pydata/xarray/issues/1528,364804697,MDEyOklzc3VlQ29tbWVudDM2NDgwNDY5Nw==,6042212,2018-02-12T00:19:55Z,2018-02-12T00:19:55Z,CONTRIBUTOR,"It might be enough, in this case, to provide some helper function in zarr to create and fetch arrays that will show up as variables in xarray - this need not be specific to being used via dask. I am assuming, with the work done in this PR, that there is an unambiguous way to determine if a zarr group can be interpreted as an xarray dataset, and that zarr then knows how to add things that look like variables (which generally, in the zarr case, don't involve writing any actual data until the parts of the array are filled in).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-364804265,https://api.github.com/repos/pydata/xarray/issues/1528,364804265,MDEyOklzc3VlQ29tbWVudDM2NDgwNDI2NQ==,1217238,2018-02-12T00:15:23Z,2018-02-12T00:15:23Z,MEMBER,"See https://github.com/dask/dask/issues/2000 for the dask issue. Once this works in dask it should be quite easy to implement in xarray, too.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-364804162,https://api.github.com/repos/pydata/xarray/issues/1528,364804162,MDEyOklzc3VlQ29tbWVudDM2NDgwNDE2Mg==,1217238,2018-02-12T00:14:22Z,2018-02-12T00:14:22Z,MEMBER,@martindurant that could probably be addressed most cleanly by improving `__setitem__` support for dask.array.,"{""total_count"": 2, ""+1"": 2, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-364803984,https://api.github.com/repos/pydata/xarray/issues/1528,364803984,MDEyOklzc3VlQ29tbWVudDM2NDgwMzk4NA==,6042212,2018-02-12T00:12:36Z,2018-02-12T00:12:36Z,CONTRIBUTOR,"@jhamman, that partially solves what I mean; I can probably turn my data into dask arrays with some difficulty, but really I was hoping for something like the following:
```
ds = xr.Dataset(coords={'b': np.arange(-4, 6, 0.005),
                        'l': np.arange(150, 72, -0.005),
                        'v': np.arange(58722.24288, -164706.4225401, -8.2446e2)})
arr = ds.create_new_zero_array(dims=['l', 'b', 'v'])  # hypothetical API
arr[0:10, :, :] = 1
```
and expect to be able to set the values of the new variable in the same way that you can with the equivalent zarr array. I can probably get around this by setting the values with `da.zeros`, finding the zarr array in the dataset, and then setting its values.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-364802374,https://api.github.com/repos/pydata/xarray/issues/1528,364802374,MDEyOklzc3VlQ29tbWVudDM2NDgwMjM3NA==,2443309,2018-02-11T23:54:01Z,2018-02-11T23:54:01Z,MEMBER,"@martindurant - If I understand your question correctly, I think you should be able to follow a pretty standard xarray workflow:
```Python
ds = xr.Dataset()
ds['your_varname'] = xr.DataArray(some_dask_array,
                                  dims=['dimname0', 'dimname1', ...],
                                  coords=dict_of_preknown_coords)
# repeat for each variable you want in your dataset
ds.to_zarr(some_zarr_store)
# then to open
ds2 = xr.open_zarr(some_zarr_store)
```
Two things to note:
1) if you are looking for decent performance when writing to a remote store, make sure you're working off xarray@master, as #1800 fixed a number of choke points in the `to_zarr` implementation
2) if you are pushing to GCS, `some_zarr_store` can be a `GCSMap`. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-364801395,https://api.github.com/repos/pydata/xarray/issues/1528,364801395,MDEyOklzc3VlQ29tbWVudDM2NDgwMTM5NQ==,306380,2018-02-11T23:40:18Z,2018-02-11T23:40:18Z,MEMBER,"Does the `to_zarr` method suffice?
http://xarray.pydata.org/en/latest/generated/xarray.Dataset.to_zarr.html#xarray.Dataset.to_zarr
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-364801073,https://api.github.com/repos/pydata/xarray/issues/1528,364801073,MDEyOklzc3VlQ29tbWVudDM2NDgwMTA3Mw==,6042212,2018-02-11T23:35:34Z,2018-02-11T23:35:34Z,CONTRIBUTOR,"Question: how would one *build* a zarr-xarray dataset?
With zarr you can open an array that contains no data, and use set-slice notation to fill in the values (which is what dask's store essentially does).
If I have some pre-known coordinates and bigger-than-memory data arrays, how would I go about getting the values into the zarr structure? If this can't be done directly with the xarray interface, is there a way to call zarr's open/create/zeros such that the corresponding array will appear as a variable when the same dataset is opened with xarray?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-351588678,https://api.github.com/repos/pydata/xarray/issues/1528,351588678,MDEyOklzc3VlQ29tbWVudDM1MTU4ODY3OA==,1217238,2017-12-14T02:23:03Z,2017-12-14T02:23:03Z,MEMBER,"woohoo, thank you Ryan!","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-351401474,https://api.github.com/repos/pydata/xarray/issues/1528,351401474,MDEyOklzc3VlQ29tbWVudDM1MTQwMTQ3NA==,1197350,2017-12-13T14:09:12Z,2017-12-13T14:09:12Z,MEMBER,Will merge later today if no further comments.,"{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-350557153,https://api.github.com/repos/pydata/xarray/issues/1528,350557153,MDEyOklzc3VlQ29tbWVudDM1MDU1NzE1Mw==,10050469,2017-12-10T15:45:13Z,2017-12-10T15:45:13Z,MEMBER,"Thanks for the tremendous work @rabernat, looking forward to testing this!
In the future it would be nice to briefly describe the advantages of zarr over netCDF for new users. A speed benchmark could help, too! This can be done once the backend has more maturity and when we refactor the I/O docs.","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-350504017,https://api.github.com/repos/pydata/xarray/issues/1528,350504017,MDEyOklzc3VlQ29tbWVudDM1MDUwNDAxNw==,3019665,2017-12-09T20:38:58Z,2017-12-09T20:38:58Z,NONE,"> Just to confirm, if writes are aligned with chunk boundaries in the destination array then no locking is required.
As a minor point to complement what Matthew and Alistair have already said, one can pretty easily `rechunk` beforehand so that the chunks will have a nice 1-to-1 non-overlapping mapping on disk. Not sure whether this strategy is good enough to make the default; however, I have had no issues doing this myself. I would also expect it to be better than holding one lock over the whole Zarr Array, though there may be some strange edge cases that I have not encountered.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-350375750,https://api.github.com/repos/pydata/xarray/issues/1528,350375750,MDEyOklzc3VlQ29tbWVudDM1MDM3NTc1MA==,703554,2017-12-08T21:24:45Z,2017-12-08T22:27:47Z,CONTRIBUTOR,"Just to confirm, if writes are aligned with chunk boundaries in the destination array then no locking is required.
Also, if you're going to be moving large datasets into cloud storage and doing distributed computing, it may be worth investigating compressors and compressor options, as a good compression ratio may make a big difference where network bandwidth is the limiting factor. I would suggest using the Blosc compressor with cname='zstd'. I would also suggest using shuffle; the Blosc codec in the latest numcodecs has an AUTOSHUFFLE option, so byte shuffle is used for arrays with >1 byte item size and bit shuffle is used for arrays with 1 byte item size. I would also experiment with compression level (clevel) to see how speed balances against compression ratio. E.g., Blosc(cname='zstd', clevel=5, shuffle=Blosc.AUTOSHUFFLE) may be a good starting point. The default compressor, Blosc(cname='lz4', ...), is more optimised for fast local storage, so speed is very good but compression ratio is moderate; this may not be best for distributed computing.
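As a rough sketch of how this might look when writing from xarray (the Dataset `ds`, the store, and the variable name 'foo' are illustrative, and this assumes the zarr backend accepts a per-variable `compressor` key in `encoding`):
```python
import numcodecs

# Blosc + zstd with automatic shuffle, per the suggestion above
compressor = numcodecs.Blosc(cname='zstd', clevel=5,
                             shuffle=numcodecs.Blosc.AUTOSHUFFLE)
ds.to_zarr(store, encoding={'foo': {'compressor': compressor}})
```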
","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-350379064,https://api.github.com/repos/pydata/xarray/issues/1528,350379064,MDEyOklzc3VlQ29tbWVudDM1MDM3OTA2NA==,703554,2017-12-08T21:40:40Z,2017-12-08T22:27:35Z,CONTRIBUTOR,"Some examples of compressor benchmarking here may be useful: http://alimanfoo.github.io/2016/09/21/genotype-compression-benchmark.html
The specific conclusions probably won't apply to your data, but some of the code and ideas may be useful. Since writing that article I added Zstd and LZ4 compressors in numcodecs, so those may also be worth trying in addition to Blosc with various configurations. (Blosc breaks up each chunk into blocks, which enables multithreaded compression/decompression but can also reduce compression ratio over the same compressor library used without Blosc. I.e., Blosc(cname='zstd', clevel=1) will behave differently from Zstd(level=1) even though the same underlying compression library (Zstandard) is being used.)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-350365780,https://api.github.com/repos/pydata/xarray/issues/1528,350365780,MDEyOklzc3VlQ29tbWVudDM1MDM2NTc4MA==,1197350,2017-12-08T20:36:26Z,2017-12-08T20:36:26Z,MEMBER,Any more reviews? @fmaussion & @pwolfram: you have experience with backends. Your reviews would be valuable.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-350352097,https://api.github.com/repos/pydata/xarray/issues/1528,350352097,MDEyOklzc3VlQ29tbWVudDM1MDM1MjA5Nw==,1217238,2017-12-08T19:34:09Z,2017-12-08T19:34:09Z,MEMBER,"> The default keyword was introduced in python 3.4, so this doesn't work in 2.7. I have tried a couple of options to overcome this but none of them have worked.
Oops, this is my fault!
Instead, try:
```python
ndims = [k.ndim for k in key if isinstance(k, np.ndarray)]
array_subspace_size = max(ndims) if ndims else 0
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-350343117,https://api.github.com/repos/pydata/xarray/issues/1528,350343117,MDEyOklzc3VlQ29tbWVudDM1MDM0MzExNw==,306380,2017-12-08T18:55:35Z,2017-12-08T18:55:35Z,MEMBER,"Not as far as I know.
On Fri, Dec 8, 2017 at 1:53 PM, Ryan Abernathey wrote:
> @rabernat commented on this pull request, in xarray/backends/common.py:
>
> > @@ -184,7 +185,7 @@ def sync(self):
> >     import dask.array as da
> >     import dask
> >     if LooseVersion(dask.__version__) > LooseVersion('0.8.1'):
> > -        da.store(self.sources, self.targets, lock=GLOBAL_LOCK)
> > +        da.store(self.sources, self.targets, lock=self.lock)
>
> > There is no reason that a task run on the distributed system will not show up on the dashboard. My first guess is that somehow you're using a local scheduler.
>
> I was not using a local scheduler. After digging further, I can see the tasks on the distributed dashboard using a regular zarr.DirectoryStore, but not when I pass a gcsfs.mapping.GCSMap to to_zarr. Is there any reason these two should behave differently?
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-350336238,https://api.github.com/repos/pydata/xarray/issues/1528,350336238,MDEyOklzc3VlQ29tbWVudDM1MDMzNjIzOA==,1197350,2017-12-08T18:26:58Z,2017-12-08T18:26:58Z,MEMBER,"There is a silly lingering issue that I need help resolving.
In a8b478543a978bd98c37711609c610432fdc7d07, @jhamman added a function `_replace_slices_with_arrays` related to vectorized indexing. This function contains a line
```python
array_subspace_size = max(
    (k.ndim for k in key if isinstance(k, np.ndarray)), default=0)
```
The `default` keyword was introduced in python 3.4, so this doesn't work in 2.7. I have tried a couple of options to overcome this but none of them have worked. Would someone care to help out with this? It is possibly the last remaining issue to resolve before this PR is really ready to be merged.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-349992006,https://api.github.com/repos/pydata/xarray/issues/1528,349992006,MDEyOklzc3VlQ29tbWVudDM0OTk5MjAwNg==,1197350,2017-12-07T14:59:12Z,2017-12-07T14:59:12Z,MEMBER,"@jhamman, I can't reproduce your error. If you can give me a reproducible example, I will make a test for it.
I think this is converging.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-349766763,https://api.github.com/repos/pydata/xarray/issues/1528,349766763,MDEyOklzc3VlQ29tbWVudDM0OTc2Njc2Mw==,1197350,2017-12-06T20:36:03Z,2017-12-06T20:36:03Z,MEMBER,"@jhamman - but the error being raised is wrong! There is a string formatting error raised in trying to generate a useful, informative error message.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-349738624,https://api.github.com/repos/pydata/xarray/issues/1528,349738624,MDEyOklzc3VlQ29tbWVudDM0OTczODYyNA==,2443309,2017-12-06T18:54:41Z,2017-12-06T18:54:56Z,MEMBER,"@rabernat - in trying out your branch, I've run into this error (mentioned by @mrocklin in pangeo-data/pangeo#19):
```Python-traceback
...
~/anaconda/envs/pangeo-dev/lib/python3.6/site-packages/xarray-0.10.0_79_g7b50320-py3.6.egg/xarray/backends/zarr.py in _extract_zarr_variable_encoding(variable, raise_on_invalid)
228
229 chunks = _determine_zarr_chunks(encoding.get('chunks'), variable.chunks,
--> 230 variable.ndim)
231 encoding['chunks'] = chunks
232 return encoding
~/anaconda/envs/pangeo-dev/lib/python3.6/site-packages/xarray-0.10.0_79_g7b50320-py3.6.egg/xarray/backends/zarr.py in _determine_zarr_chunks(enc_chunks, var_chunks, ndim)
134 ""Zarr requires uniform chunk sizes excpet for final chunk.""
135 "" Variable %r has incompatible chunks. Consider ""
--> 136 ""rechunking using `chunk()`."" % var_chunks)
137 # last chunk is allowed to be smaller
138 last_var_chunk = all_var_chunks[-1]
TypeError: not all arguments converted during string formatting
```
As far as I can tell, reworking my chunk sizes to divide evenly into the dataset dimensions has corrected the problem.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-349554730,https://api.github.com/repos/pydata/xarray/issues/1528,349554730,MDEyOklzc3VlQ29tbWVudDM0OTU1NDczMA==,1217238,2017-12-06T07:10:37Z,2017-12-06T07:10:37Z,MEMBER,I just pushed a commit adding a test for `backends.zarr._replace_slices_with_arrays`.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-349540155,https://api.github.com/repos/pydata/xarray/issues/1528,349540155,MDEyOklzc3VlQ29tbWVudDM0OTU0MDE1NQ==,1197350,2017-12-06T05:38:26Z,2017-12-06T05:38:26Z,MEMBER,"I believe that this is now complete enough to consider merging. I have addressed nearly all of @shoyer's suggestions. I have added a bunch more tests and am now quite satisfied with the test suite. I wrote some basic documentation, with the usual disclaimers about the experimental nature of this new feature.
The zarr tests will not run if the zarr version is less than 2.2.0. This is not released yet. This means that only the py36-zarr-dev build actually runs the zarr tests. Once @alimanfoo releases the next version, the zarr tests should kick in on all the builds.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-349495568,https://api.github.com/repos/pydata/xarray/issues/1528,349495568,MDEyOklzc3VlQ29tbWVudDM0OTQ5NTU2OA==,1197350,2017-12-06T01:08:11Z,2017-12-06T01:08:11Z,MEMBER,@jhamman - could you elaborate on the nature of the error you got with uneven dask chunks? We should be catching this and raising a useful error message.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-349488598,https://api.github.com/repos/pydata/xarray/issues/1528,349488598,MDEyOklzc3VlQ29tbWVudDM0OTQ4ODU5OA==,306380,2017-12-06T00:30:21Z,2017-12-06T00:30:21Z,MEMBER,We tried this out on a cloud-deployed cluster on GCE and things worked pleasantly. Some conversation here: https://github.com/pangeo-data/pangeo/issues/19,"{""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 1, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-348839453,https://api.github.com/repos/pydata/xarray/issues/1528,348839453,MDEyOklzc3VlQ29tbWVudDM0ODgzOTQ1Mw==,703554,2017-12-04T01:40:57Z,2017-12-04T01:40:57Z,CONTRIBUTOR,"I know you're not including string support in this PR, but for interest, there are a couple of changes coming into zarr via https://github.com/alimanfoo/zarr/pull/212 that may be relevant in future.
It should now be impossible to generate a segfault via a badly configured object array. It is also now much harder to badly configure an object array. When creating an object array, an object codec should be provided via the ``object_codec`` parameter. There are now three codecs in numcodecs that can be used for variable length text strings: MsgPack, Pickle and JSON (new). [Examples notebook here](https://github.com/alimanfoo/zarr/blob/14ac8d9bf19633232f6522dfcd925f300722b82b/notebooks/object_arrays.ipynb). In that notebook I also ran some simple benchmarks and MsgPack comes out well, but JSON isn't too shabby either.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-348569223,https://api.github.com/repos/pydata/xarray/issues/1528,348569223,MDEyOklzc3VlQ29tbWVudDM0ODU2OTIyMw==,1217238,2017-12-01T18:20:32Z,2017-12-01T18:20:32Z,MEMBER,"> To finish it up, I propose to raise an error when attempting to encode variable-length string data. If someone can give me a quick one liner to help identify such datatypes, that would be helpful.
Variable length strings are stored with `dtype=object`. So something like `dtype.kind == 'O'` should work.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-348564159,https://api.github.com/repos/pydata/xarray/issues/1528,348564159,MDEyOklzc3VlQ29tbWVudDM0ODU2NDE1OQ==,1197350,2017-12-01T17:58:59Z,2017-12-01T17:59:06Z,MEMBER,"Sorry this has become such a behemoth. I know it is hard to review. I couldn't see how to make a more atomic PR because a new backend has lots of interrelated parts that need each other in order to work.
To finish it up, I propose to raise an error when attempting to encode variable-length string data. If someone can give me a quick one liner to help identify such datatypes, that would be helpful.
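For concreteness, something like the following hedged sketch is what I have in mind, using an object-dtype check (the helper name is hypothetical):
```python
def _check_no_object_dtype(var, name=None):
    # variable-length strings show up as object arrays (dtype.kind == 'O')
    if var.dtype.kind == 'O':
        raise NotImplementedError(
            'variable %r has dtype=object; variable-length strings are '
            'not yet supported by the zarr backend' % name)
```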
We will revisit these encoding issues once Stephan's refactoring is merged. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-348560326,https://api.github.com/repos/pydata/xarray/issues/1528,348560326,MDEyOklzc3VlQ29tbWVudDM0ODU2MDMyNg==,1217238,2017-12-01T17:43:03Z,2017-12-01T17:43:03Z,MEMBER,I'll give this another look over the weekend.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-348414545,https://api.github.com/repos/pydata/xarray/issues/1528,348414545,MDEyOklzc3VlQ29tbWVudDM0ODQxNDU0NQ==,2443309,2017-12-01T06:40:47Z,2017-12-01T06:40:47Z,MEMBER,"@rabernat - following @shoyer's thoughts here and in #1753, I'm not opposed to skipping the last few failing tests and living to fight strings another day.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-347989858,https://api.github.com/repos/pydata/xarray/issues/1528,347989858,MDEyOklzc3VlQ29tbWVudDM0Nzk4OTg1OA==,1197350,2017-11-29T20:42:34Z,2017-11-29T20:42:34Z,MEMBER,"Actually, I think I just realized how to do it without too much pain. Stand by.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-347987097,https://api.github.com/repos/pydata/xarray/issues/1528,347987097,MDEyOklzc3VlQ29tbWVudDM0Nzk4NzA5Nw==,1197350,2017-11-29T20:32:07Z,2017-11-29T20:32:07Z,MEMBER,"> Is it possible to add one of these filters to XArray's default use of Zarr?
Because of the way the backends are structured right now, it is hard to bypass the existing encoding and replace it with a new encoding scheme. #1087 will make this easy to do, but for now it is complicated.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-347984582,https://api.github.com/repos/pydata/xarray/issues/1528,347984582,MDEyOklzc3VlQ29tbWVudDM0Nzk4NDU4Mg==,1217238,2017-11-29T20:22:33Z,2017-11-29T20:22:33Z,MEMBER,"I'm fine skipping strings entirely for now. They are indeed unneeded for most netCDF datasets.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-347983854,https://api.github.com/repos/pydata/xarray/issues/1528,347983854,MDEyOklzc3VlQ29tbWVudDM0Nzk4Mzg1NA==,306380,2017-11-29T20:19:37Z,2017-11-29T20:19:37Z,MEMBER,"> FWIW I think the best option at the moment is to make sure you add either Pickle or MsgPack filter for any zarr array with an object dtype.
Is it possible to add one of these filters to XArray's default use of Zarr?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-347983448,https://api.github.com/repos/pydata/xarray/issues/1528,347983448,MDEyOklzc3VlQ29tbWVudDM0Nzk4MzQ0OA==,1197350,2017-11-29T20:18:08Z,2017-11-29T20:18:08Z,MEMBER,"Right now I am in a dilemma over how to move forward. Fixing this string encoding issue will require some serious hacks to cf encoding. If I do this before #1087 is finished, it will be a waste of time (and a pain). On the other hand #1087 could take a long time, since it is a major refactor itself.
Is there some way to punt on the multi-length string encoding for now? We could just error if such variables are present. This would allow us to get the experimental zarr backend out into the wild. FWIW, none of the datasets I want to use this with actually have any string data variables at all. I believe 95% of netcdf datasets are just regular numbers. This is an edge case.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-347981682,https://api.github.com/repos/pydata/xarray/issues/1528,347981682,MDEyOklzc3VlQ29tbWVudDM0Nzk4MTY4Mg==,306380,2017-11-29T20:11:25Z,2017-11-29T20:11:25Z,MEMBER,FWIW my vote is for msgpack over pickle for both performance and cross-language reasons,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-347351224,https://api.github.com/repos/pydata/xarray/issues/1528,347351224,MDEyOklzc3VlQ29tbWVudDM0NzM1MTIyNA==,1217238,2017-11-27T22:32:47Z,2017-11-28T07:51:31Z,MEMBER,"> Overall, I find the conventions module to be a bit unwieldy. There is a lot of stuff in there, not all of which is related to CF conventions. It would be useful to separate the actual conventions from the encoding / decoding needed for different backends.
Agreed!
I wonder why zarr doesn't have a UTF-8 variable length string type (https://github.com/alimanfoo/zarr/issues/206) -- that would feel like the obvious first choice for encoding this data.
That said, xarray *should* be able to use fixed-length bytes just fine, doing UTF-8 encoding/decoding on the fly.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-347385269,https://api.github.com/repos/pydata/xarray/issues/1528,347385269,MDEyOklzc3VlQ29tbWVudDM0NzM4NTI2OQ==,703554,2017-11-28T01:36:29Z,2017-11-28T01:49:24Z,CONTRIBUTOR,"FWIW I think the best option at the moment is to make sure you add either Pickle or MsgPack filter for any zarr array with an object dtype.
BTW I was thinking that zarr should automatically add one of these filters any time someone creates an array with an object dtype, to avoid them hitting the pointer issue. If you have any thoughts on best solution drop them here: https://github.com/alimanfoo/zarr/issues/208
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-347382612,https://api.github.com/repos/pydata/xarray/issues/1528,347382612,MDEyOklzc3VlQ29tbWVudDM0NzM4MjYxMg==,1197350,2017-11-28T01:21:34Z,2017-11-28T01:21:34Z,MEMBER,"> When still in the original interpreter session, all the objects still exist
> in memory, so all the pointers stored in the array are still valid.
Do you think this persistence could affect xarray's tests? The way the tests work is via a context manager, like this:
```python
@contextlib.contextmanager
def roundtrip(self, data, save_kwargs={}, open_kwargs={},
              allow_cleanup_failure=False):
    with create_tmp_file(
            suffix='.zarr',
            allow_cleanup_failure=allow_cleanup_failure) as tmp_file:
        data.to_zarr(store=tmp_file, **save_kwargs)
        yield xr.open_zarr(tmp_file, **open_kwargs)
```
Do we need to add an extra step after `data.to_zarr` to somehow purge such objects?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-347381865,https://api.github.com/repos/pydata/xarray/issues/1528,347381865,MDEyOklzc3VlQ29tbWVudDM0NzM4MTg2NQ==,1197350,2017-11-28T01:16:58Z,2017-11-28T01:16:58Z,MEMBER,"`Out[2]: Bus error: 10`
Perhaps zarr should raise an error when assigning `zgs.x[:] = values`?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-347381734,https://api.github.com/repos/pydata/xarray/issues/1528,347381734,MDEyOklzc3VlQ29tbWVudDM0NzM4MTczNA==,703554,2017-11-28T01:16:07Z,2017-11-28T01:16:07Z,CONTRIBUTOR,"When still in the original interpreter session, all the objects still exist in memory, so all the pointers stored in the array are still valid. Restart the session and the objects are gone and the pointers are invalid.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-347381500,https://api.github.com/repos/pydata/xarray/issues/1528,347381500,MDEyOklzc3VlQ29tbWVudDM0NzM4MTUwMA==,703554,2017-11-28T01:14:42Z,2017-11-28T01:14:42Z,CONTRIBUTOR,"Try exiting and restarting the interpreter, then running:
```python
zgs = zarr.open_group(store='zarr_directory')
zgs.x[:]
```
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-347380750,https://api.github.com/repos/pydata/xarray/issues/1528,347380750,MDEyOklzc3VlQ29tbWVudDM0NzM4MDc1MA==,1197350,2017-11-28T01:10:01Z,2017-11-28T01:10:10Z,MEMBER,"> zarr needs a filter that can encode and pack the strings into a single buffer, except in the special case where the data are being stored in-memory
@alimanfoo: the following also seems to work with a directory store:
```python
values = np.array([b'ab', b'cdef', np.nan], dtype=object)
zgs = zarr.open_group(store='zarr_directory')
zgs.create('x', shape=values.shape, dtype=values.dtype)
zgs.x[:] = values
```
This seems to contradict your statement above. What am I missing?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-347363503,https://api.github.com/repos/pydata/xarray/issues/1528,347363503,MDEyOklzc3VlQ29tbWVudDM0NzM2MzUwMw==,703554,2017-11-27T23:27:41Z,2017-11-27T23:27:41Z,CONTRIBUTOR,"For variable length strings (or any array with an object dtype) zarr needs a filter that can encode and pack the strings into a single buffer, except in the special case where the data are being stored in-memory (as in your first example). The filter has to be specified manually; some examples here: http://zarr.readthedocs.io/en/master/tutorial.html#string-arrays. There are two codecs currently in numcodecs that can do this: one is Pickle, the other is MsgPack. I haven't done any benchmarking of data size or encoding speed, but MsgPack may be preferable because it's more portable.
There was some discussion a while back about creating a codec that handles variable-length strings by encoding via UTF8 then concatenating encoded bytes and lengths or offsets, IIRC similar to Arrow, and maybe even creating a special ""text"" dtype that inserts this filter automatically so you don't have to add it manually. But there hasn't been a strong motivation so far.
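For concreteness, a minimal sketch of the manual-filter approach (assuming numcodecs is installed with msgpack support; the array contents are illustrative):
```python
import numpy as np
import zarr
from numcodecs import MsgPack

values = np.array(['ab', 'cdef', 'ghi'], dtype=object)
# the MsgPack filter packs the variable-length strings into a single buffer
z = zarr.array(values, dtype=object, filters=[MsgPack()])
z[:]  # round-trips the strings through the filter
```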
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-347323043,https://api.github.com/repos/pydata/xarray/issues/1528,347323043,MDEyOklzc3VlQ29tbWVudDM0NzMyMzA0Mw==,1197350,2017-11-27T20:48:35Z,2017-11-27T20:53:28Z,MEMBER,"After a few more tweaks, this is now quite close to passing all the `CFEncodedDataTest` tests.
The remaining issues are all related to the encoding of strings. Basically, zarr's handling of strings:
http://zarr.readthedocs.io/en/latest/tutorial.html?highlight=strings#string-arrays
is considerably different from netCDF's. Because `ZarrStore` is a subclass of `WritableCFDataStore`, all of the dataset variables get passed through `encode_cf_variable` before writing. This screws up things that already work quite naturally.
Consider the following direct creation of a variable length string in zarr:
```python
values = np.array([b'ab', b'cdef', np.nan], dtype=object)
zgs = zarr.open_group()
zgs.create('x', shape=values.shape, dtype=values.dtype)
zgs.x[:] = values
zgs.x
```
```
Array(/x, (3,), object, chunks=(3,), order=C)
nbytes: 24; nbytes_stored: 350; ratio: 0.1; initialized: 1/1
compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
store: DictStore
```
It seems we can encode variable-length strings into objects just fine. (`np.testing.assert_array_equal(values, zgs.x[:])` fails only because of the `nan` value. The array round-trips just fine.)
However, after passing through xarray's cf encoding, this no longer works:
```python
encoding = {'_FillValue': b'X', 'dtype': 'S1'}
original = xr.Dataset({'x': ('t', values, {}, encoding)})
zarr_dict_store = {}
original.to_zarr(store=zarr_dict_store)
zs = zarr.open_group(store=zarr_dict_store)
print(zs.x)
print(zs.x[:])
```
```
Array(/x, (3, 4), |S1, chunks=(3, 4), order=C)
nbytes: 12; nbytes_stored: 428; ratio: 0.0; initialized: 1/1
compressor: Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
store: dict
array([[b'a', b'b', b'', b''],
[b'c', b'd', b'e', b'f'],
[b'X', b'', b'', b'']],
dtype='|S1')
```
Here is everything that happens in `encode_cf_variable`:
```python
var = maybe_encode_datetime(var, name=name)
var = maybe_encode_timedelta(var, name=name)
var, needs_copy = maybe_encode_offset_and_scale(var, needs_copy, name=name)
var, needs_copy = maybe_encode_fill_value(var, needs_copy, name=name)
var = maybe_encode_nonstring_dtype(var, name=name)
var = maybe_default_fill_value(var)
var = maybe_encode_bools(var)
var = ensure_dtype_not_object(var, name=name)
var = maybe_encode_string_dtype(var, name=name)
```
The challenge now is to figure out which parts of this we need to bypass for zarr and how to implement that bypassing.
Overall, I find the `conventions` module to be a bit unwieldy. There is a lot of stuff in there, not all of which is related to CF conventions. It would be useful to separate the actual conventions from the encoding / decoding needed for different backends.
At this point, I would appreciate some input from an encoding expert before I go refactoring stuff.
edit: The actual tests that fail are `CFEncodedDataTest.test_roundtrip_bytes_with_fill_value` and `CFEncodedDataTest.test_roundtrip_string_encoded_characters`. One option to move forward would be just to skip those tests for zarr. I am eager to get this out in the wild to see how it plays with real datasets.","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-345778844,https://api.github.com/repos/pydata/xarray/issues/1528,345778844,MDEyOklzc3VlQ29tbWVudDM0NTc3ODg0NA==,306380,2017-11-20T18:05:25Z,2017-11-20T18:05:25Z,MEMBER,"> This is, of course, by design :)
It's so nice when well-designed things come together and just work as planned :)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-345770374,https://api.github.com/repos/pydata/xarray/issues/1528,345770374,MDEyOklzc3VlQ29tbWVudDM0NTc3MDM3NA==,6042212,2017-11-20T17:37:01Z,2017-11-20T17:37:01Z,CONTRIBUTOR,"This is, of course, by design :)
I imagine there is much that could be done to optimise performance, but for fewer, larger chunks, it should be pretty good.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-345619509,https://api.github.com/repos/pydata/xarray/issues/1528,345619509,MDEyOklzc3VlQ29tbWVudDM0NTYxOTUwOQ==,703554,2017-11-20T08:07:44Z,2017-11-20T08:07:44Z,CONTRIBUTOR,"Fantastic!
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-345575240,https://api.github.com/repos/pydata/xarray/issues/1528,345575240,MDEyOklzc3VlQ29tbWVudDM0NTU3NTI0MA==,306380,2017-11-20T02:28:07Z,2017-11-20T02:28:07Z,MEMBER,"That is, indeed, quite exciting. Also exciting is that I was able to look at and compute on your data easily.
```python
In [1]: import zarr
In [2]: import gcsfs
In [3]: fs = gcsfs.GCSFileSystem(project='pangeo-181919')
In [4]: gcsmap = gcsfs.mapping.GCSMap('zarr_store_test', gcs=fs, check=True, create=False)
In [5]: import xarray as xr
In [6]: ds_gcs = xr.open_zarr(gcsmap, mode='r')
In [7]: ds_gcs
Out[7]:
Dimensions: (x: 200, y: 100)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
* y (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
bar (x) float64 dask.array
foo (y, x) float32 dask.array
Attributes:
array_atr: [1, 2]
some_attr: copana
In [8]: ds_gcs.sum()
Out[8]:
Dimensions: ()
Data variables:
bar float64 dask.array
foo float32 dask.array
In [9]: ds_gcs.sum().compute()
Out[9]:
Dimensions: ()
Data variables:
bar float64 0.0
foo float32 20000.0
```","{""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 1, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-345574445,https://api.github.com/repos/pydata/xarray/issues/1528,345574445,MDEyOklzc3VlQ29tbWVudDM0NTU3NDQ0NQ==,1197350,2017-11-20T02:21:08Z,2017-11-20T02:21:08Z,MEMBER,"Those following this thread will probably be very excited to learn that the following code works with my zarr_backend branch:
```python
import gcsfs
fs = gcsfs.GCSFileSystem(project='pangeo-181919', token=None)
gcsmap = gcsfs.mapping.GCSMap('zarr_store_test', gcs=fs, check=True, create=False)
ds.to_zarr(store=gcsmap)
ds_gcs = xr.open_zarr(gcsmap, mode='r')
```
I never doubted this would be possible, but seeing it in action is quite exciting!","{""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 1, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-345128506,https://api.github.com/repos/pydata/xarray/issues/1528,345128506,MDEyOklzc3VlQ29tbWVudDM0NTEyODUwNg==,2443309,2017-11-17T02:38:41Z,2017-11-17T02:38:41Z,MEMBER,@rabernat - It might a little but we'll sort it out. See https://github.com/rabernat/xarray/pull/3.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-345126452,https://api.github.com/repos/pydata/xarray/issues/1528,345126452,MDEyOklzc3VlQ29tbWVudDM0NTEyNjQ1Mg==,1197350,2017-11-17T02:24:56Z,2017-11-17T02:24:56Z,MEMBER,"@jhamman would it screw you up if I pushed a few commits tonight? I won't touch the ZarrArrayWrapper. But I figured out how to fix auto_chunk.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-345104713,https://api.github.com/repos/pydata/xarray/issues/1528,345104713,MDEyOklzc3VlQ29tbWVudDM0NTEwNDcxMw==,306380,2017-11-17T00:12:01Z,2017-11-17T00:12:01Z,MEMBER,Hooray for standard interfaces!,"{""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 1, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-345104440,https://api.github.com/repos/pydata/xarray/issues/1528,345104440,MDEyOklzc3VlQ29tbWVudDM0NTEwNDQ0MA==,6042212,2017-11-17T00:10:19Z,2017-11-17T00:10:19Z,CONTRIBUTOR,"`hdfs3` also has a MutableMapping for HDFS. I did not succeed in getting one into azure-datalake-store, but it would not be hard to make. In this way, zarr can become a pretty general array cloud storage mechanism.","{""total_count"": 2, ""+1"": 2, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-345101150,https://api.github.com/repos/pydata/xarray/issues/1528,345101150,MDEyOklzc3VlQ29tbWVudDM0NTEwMTE1MA==,306380,2017-11-16T23:52:48Z,2017-11-16T23:52:48Z,MEMBER,"The gcsfs library also provides a MutableMapping for Google Cloud Storage.
The dask.distributed library now also provides a distributed lock for synchronization, if necessary, though in practice we should just rechunk the dask.array before writing.","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-345091139,https://api.github.com/repos/pydata/xarray/issues/1528,345091139,MDEyOklzc3VlQ29tbWVudDM0NTA5MTEzOQ==,1217238,2017-11-16T23:02:14Z,2017-11-16T23:02:14Z,MEMBER,"> can we brainstorm what a ZarrArrayWrapper would need to be compatible with the new indexing API?
We will need to write new adapter code to map xarray's explicit indexer classes onto the appropriate zarr methods, e.g.,
```python
def __getitem__(self, key):
    array = self.get_array()
    if isinstance(key, BasicIndexer):
        return array[key.tuple]
    elif isinstance(key, VectorizedIndexer):
        return array.vindex[_replace_slices_with_arrays(key.tuple, self.shape)]
    else:
        assert isinstance(key, OuterIndexer)
        return array.oindex[key.tuple]

# untested, but I think this does the appropriate shape munging to make
# slices appear as the last axes of the result array
def _replace_slices_with_arrays(key, shape):
    num_slices = sum(1 for k in key if isinstance(k, slice))
    num_arrays = len(shape) - num_slices
    new_key = []
    slice_count = 0
    for k, size in zip(key, shape):
        if isinstance(k, slice):
            array = np.arange(*k.indices(size))
            sl = [np.newaxis] * len(shape)
            sl[num_arrays + slice_count] = slice(None)
            k = array[tuple(sl)]
            slice_count += 1
        else:
            assert isinstance(k, np.ndarray)
            k = k[(slice(None),) * num_arrays + (np.newaxis,) * num_slices]
        new_key.append(k)
    return tuple(new_key)
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-345080945,https://api.github.com/repos/pydata/xarray/issues/1528,345080945,MDEyOklzc3VlQ29tbWVudDM0NTA4MDk0NQ==,703554,2017-11-16T22:18:04Z,2017-11-16T22:18:04Z,CONTRIBUTOR,"Re different zarr storage backends, the main options are a plain dict, DirectoryStore, ZipStore, and there's a [new DBMStore class just merged](https://github.com/alimanfoo/zarr/pull/186) which enables storage in any DBM-style database (e.g., Berkeley DB). ZipStore has some constraints because of how zip files work: you can't really replace an entry in a zip file, which means anything that writes the same array chunk more than once will generate warnings. Dask's S3Map should also work; I haven't tried it, and it's obviously not ideal for unit tests, but I'd be interested if you get any experience with it.
Re different combinations of zarr and dask chunks, it can be thread safe even if chunks are not aligned; you just need to pass a synchronizer when instantiating the array or group. Zarr has a ThreadSynchronizer class which can be used for thread-based parallelism. If a synchronizer is provided, it is used to lock each chunk individually during write operations. [More info here](http://zarr.readthedocs.io/en/latest/tutorial.html#parallel-computing-and-synchronization).
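A small sketch of the synchronizer option (the store path and array parameters are illustrative):
```python
import zarr

# chunk-level locking for thread-based parallel writes
sync = zarr.ThreadSynchronizer()
z = zarr.open_array('example.zarr', mode='w', shape=(10000, 10000),
                    chunks=(1000, 1000), dtype='f8', synchronizer=sync)
z[0:1000, :] = 42.0  # concurrent writers lock individual chunks
```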
Re fill values, zarr has a native concept of a fill value for each array, with the fill value stored as part of the array metadata. Array metadata are stored as JSON, and I recently [merged a fix](https://github.com/alimanfoo/zarr/pull/176) so that bytes fill values can be used (via base64 encoding). I believe the netCDF way is to store the fill value separately as the value of a ""_FillValue"" attribute? You could do this with zarr, but user attributes are also JSON, so you would need to do your own encoding/decoding. If possible, though, I'd suggest using the native zarr fill_value support, as it handles bytes fill value encoding and also checks to ensure fill values are valid wrt the array dtype.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-345034208,https://api.github.com/repos/pydata/xarray/issues/1528,345034208,MDEyOklzc3VlQ29tbWVudDM0NTAzNDIwOA==,1197350,2017-11-16T19:22:01Z,2017-11-16T19:22:01Z,MEMBER,"Some things I would like to add to the zarr test suite:
- [ ] specifying zarr-specific encoding options ([compressors and filters](http://zarr.readthedocs.io/en/latest/tutorial.html#compressors))
- [ ] writing to different zarr storage backends (e.g. dict store, can we mock an S3 store?)
- [ ] different combinations of zarr and dask chunks. one <=> one and many <=> one are supported; one <=> many and many <=> many should raise errors / warnings (not thread safe)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-345030848,https://api.github.com/repos/pydata/xarray/issues/1528,345030848,MDEyOklzc3VlQ29tbWVudDM0NTAzMDg0OA==,1197350,2017-11-16T19:10:31Z,2017-11-16T19:10:31Z,MEMBER,"> FYI: I'm playing with your branch a bit today.
Great! If you use the latest zarr master, you should get the same test results as this travis build:
https://travis-ci.org/pydata/xarray/jobs/301606996
There are two outstanding failures related to encoding (`test_roundtrip_bytes_with_fill_value` and `test_roundtrip_string_encoded_characters`). And auto-caching is not working (`test_dataset_caching`). I consider these pretty minor.
The biggest problem is that, for reasons I don't understand, my ""auto-chunking"" behavior does not work (this is covered by the only zarr-specific test method: `test_auto_chunk`). My goal is to have zarr be lazy-by-default and create dask chunks for every zarr chunk. However, my implementation of this does not work:
https://github.com/pydata/xarray/pull/1528/files#diff-1bba25ab0d8275d763572bfdd10377c6R325
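In outline, the intended behaviour amounts to something like this sketch (not the PR's actual code): wrap each zarr array in a dask array with one dask chunk per zarr chunk.
```python
import dask.array as da
import zarr

z = zarr.open_array('example.zarr', mode='r')  # path is illustrative
lazy = da.from_array(z, chunks=z.chunks)       # one dask chunk per zarr chunk
```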
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-345026224,https://api.github.com/repos/pydata/xarray/issues/1528,345026224,MDEyOklzc3VlQ29tbWVudDM0NTAyNjIyNA==,2443309,2017-11-16T18:53:42Z,2017-11-16T18:53:42Z,MEMBER,"@rabernat - FYI: I'm playing with your branch a bit today.
@shoyer and @rabernat, can we brainstorm what a `ZarrArrayWrapper` would need in order to be compatible with the new indexing API? I'm happy to implement it but could use a few pointers to get started.
```Python
class ZarrArrayWrapper(BackendArray):
    def __init__(self, variable_name, datastore):
        self.datastore = datastore
        self.variable_name = variable_name
        array = self.get_array()
        self.shape = array.shape
        self.dtype = np.dtype(array.dtype.kind +
                              str(array.dtype.itemsize))

    def get_array(self):
        self.datastore.assert_open()
        return self.datastore.ds[self.variable_name]  # returns a zarr array

    def __getitem__(self, key):
        with self.datastore.ensure_open(autoclose=True):
            data = IndexingAdapter(self.get_array())[key]  # which indexing adapter?
            return np.array(data, dtype=self.dtype, copy=True)
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-344040853,https://api.github.com/repos/pydata/xarray/issues/1528,344040853,MDEyOklzc3VlQ29tbWVudDM0NDA0MDg1Mw==,1197350,2017-11-13T20:04:12Z,2017-11-13T20:04:12Z,MEMBER,😬 that's my punishment for being slow!,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-344040250,https://api.github.com/repos/pydata/xarray/issues/1528,344040250,MDEyOklzc3VlQ29tbWVudDM0NDA0MDI1MA==,1217238,2017-11-13T20:02:03Z,2017-11-13T20:02:03Z,MEMBER,"@rabernat sorry for the churn here, but you are also probably going to need to update after the explicit indexing changes in #1705.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-339897936,https://api.github.com/repos/pydata/xarray/issues/1528,339897936,MDEyOklzc3VlQ29tbWVudDMzOTg5NzkzNg==,703554,2017-10-27T07:42:34Z,2017-10-27T07:42:34Z,CONTRIBUTOR,"Suggest testing against GitHub master; there are a few other issues I'd like to work through before the next release.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-339815147,https://api.github.com/repos/pydata/xarray/issues/1528,339815147,MDEyOklzc3VlQ29tbWVudDMzOTgxNTE0Nw==,1197350,2017-10-26T22:07:10Z,2017-10-26T22:07:10Z,MEMBER,"Fantastic! Are you planning a release any time soon? If not we can set up to test against the github master.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-339800443,https://api.github.com/repos/pydata/xarray/issues/1528,339800443,MDEyOklzc3VlQ29tbWVudDMzOTgwMDQ0Mw==,703554,2017-10-26T21:04:17Z,2017-10-26T21:04:17Z,CONTRIBUTOR,"Just to say, support for 0d arrays, and for arrays with one or more zero-length dimensions, is in zarr master.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-335186616,https://api.github.com/repos/pydata/xarray/issues/1528,335186616,MDEyOklzc3VlQ29tbWVudDMzNTE4NjYxNg==,703554,2017-10-09T15:07:29Z,2017-10-09T17:23:21Z,CONTRIBUTOR,"I'm on paternity leave for the next 2 weeks, then will be catching up for a couple of weeks I expect. May be able to merge straightforward PRs but will have limited bandwidth.","{""total_count"": 3, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 3, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-335204883,https://api.github.com/repos/pydata/xarray/issues/1528,335204883,MDEyOklzc3VlQ29tbWVudDMzNTIwNDg4Mw==,1197350,2017-10-09T16:09:50Z,2017-10-09T16:09:50Z,MEMBER,"> I'm on paternity leave for the next 2 weeks
Congratulations! If you could just merge alimanfoo/zarr#154, it would really help us move forward.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-335162205,https://api.github.com/repos/pydata/xarray/issues/1528,335162205,MDEyOklzc3VlQ29tbWVudDMzNTE2MjIwNQ==,1197350,2017-10-09T13:43:49Z,2017-10-09T13:43:49Z,MEMBER,"> I won't be able to put any effort into zarr in the next month
Does this include merging PRs?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-335030993,https://api.github.com/repos/pydata/xarray/issues/1528,335030993,MDEyOklzc3VlQ29tbWVudDMzNTAzMDk5Mw==,703554,2017-10-08T19:17:27Z,2017-10-08T23:37:47Z,CONTRIBUTOR,"FWIW I think some JSON encoders for attributes would ultimately be a useful addition to zarr, but I won't be able to put any effort into zarr in the next month, so workarounds in xarray sound like a good idea for now.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-335027491,https://api.github.com/repos/pydata/xarray/issues/1528,335027491,MDEyOklzc3VlQ29tbWVudDMzNTAyNzQ5MQ==,1197350,2017-10-08T18:23:50Z,2017-10-08T18:23:50Z,MEMBER,"> For thoroughness this might be worth doing with a custom JSON encoder on the zarr side, but it would also be easy to do in the xarray wrapper.
My impression is that zarr development is moving conservatively, so we would be better off finding workarounds in xarray.
@shoyer: where in the code would you recommend putting this logic? It seems like part of encoding / decoding to me.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-334981929,https://api.github.com/repos/pydata/xarray/issues/1528,334981929,MDEyOklzc3VlQ29tbWVudDMzNDk4MTkyOQ==,1197350,2017-10-08T04:16:58Z,2017-10-08T18:21:30Z,MEMBER,"There are two zarr issues that are causing some tests to fail:
1. zarr can't store zero-dimensional arrays.
```python
za = zarr.create(shape=(), store='tmp_file')
za[...] = 0
```
raises a file permission error. I believe that this is alimanfoo/zarr#150.
1. lots of the things that xarray likes to put in attributes are not serializable by zarr
```python
za = zarr.create(shape=(1,), store='tmp_file')
za.attrs['foo'] = np.float32(0)
```
raises `TypeError: Object of type 'float32' is not JSON serializable`. This is alimanfoo/zarr#156 (a sketch of a workaround follows below).
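A sketch of the workaround for the second issue (the helper name is made up; see also the suggestion later in this thread):
```python
import numpy as np

def encode_attr(value):
    # numpy scalars and arrays are not JSON serializable; coerce them first
    if isinstance(value, np.generic):
        return value.item()    # numpy scalar -> python scalar
    if isinstance(value, np.ndarray):
        return value.tolist()  # numpy array -> python list
    return value

za.attrs['foo'] = encode_attr(np.float32(0))  # stores a plain 0.0
```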
Most of the failures of tests inherited from `CFEncodedDataTest` can be attributed to one of these two issues.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-335015485,https://api.github.com/repos/pydata/xarray/issues/1528,335015485,MDEyOklzc3VlQ29tbWVudDMzNTAxNTQ4NQ==,1217238,2017-10-08T15:46:36Z,2017-10-08T15:46:36Z,MEMBER,"For serializing attributes, the easiest fix is to call `.item()` on any numpy scalars (instances of `np.generic`) and `.tolist()` on any numpy arrays. For thoroughness this might be worth doing with a custom JSON encoder on the zarr side, but it would also be easy to do in the xarray wrapper.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-334982373,https://api.github.com/repos/pydata/xarray/issues/1528,334982373,MDEyOklzc3VlQ29tbWVudDMzNDk4MjM3Mw==,1197350,2017-10-08T04:31:02Z,2017-10-08T04:31:09Z,MEMBER,"I worked on this on the plane back from Seattle. Yay for having no internet access!
Would appreciate feedback on the questions raised above from @shoyer, @jhamman, and anyone else with backend expertise.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-334633708,https://api.github.com/repos/pydata/xarray/issues/1528,334633708,MDEyOklzc3VlQ29tbWVudDMzNDYzMzcwOA==,1197350,2017-10-06T01:15:05Z,2017-10-06T01:15:05Z,MEMBER,"Here is where we are at with the Zarr backend tests
```
xarray/tests/test_backends.py::ZarrDataTest::test_coordinates_encoding PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_dataset_caching FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_dataset_compute PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_default_fill_value FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_encoding_kwarg FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_encoding_same_dtype PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_invalid_dataarray_names_raise FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_load PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_orthogonal_indexing FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_pickle FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_pickle_dataarray PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_None_variable PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_boolean_dtype PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_coordinates PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_datetime_data FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_endian PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_example_1_netcdf FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_float64_data PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_mask_and_scale FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_object_dtype FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_string_data PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_strings_with_fill_value FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_test_data PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_roundtrip_timedelta_data FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_unsigned_roundtrip_mask_and_scale FAILED
xarray/tests/test_backends.py::ZarrDataTest::test_write_store PASSED
xarray/tests/test_backends.py::ZarrDataTest::test_zero_dimensional_variable FAILED
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-334633152,https://api.github.com/repos/pydata/xarray/issues/1528,334633152,MDEyOklzc3VlQ29tbWVudDMzNDYzMzE1Mg==,1197350,2017-10-06T01:10:29Z,2017-10-06T01:10:29Z,MEMBER,"With @jhamman's help, I just made a little progress on this.
We now have a bare bones test suite for the zarr backend. This is very helpful for revealing where more work is needed: encoding. So the next step is to seriously confront that issue. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-334316122,https://api.github.com/repos/pydata/xarray/issues/1528,334316122,MDEyOklzc3VlQ29tbWVudDMzNDMxNjEyMg==,2443309,2017-10-04T23:14:58Z,2017-10-04T23:14:58Z,MEMBER,@rabernat - testing should be fully functional now. ,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-333579128,https://api.github.com/repos/pydata/xarray/issues/1528,333579128,MDEyOklzc3VlQ29tbWVudDMzMzU3OTEyOA==,2443309,2017-10-02T15:58:05Z,2017-10-02T15:58:05Z,MEMBER,"@rabernat - re backends testing, #1557 is pretty close. I can wrap it up this week.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-333400272,https://api.github.com/repos/pydata/xarray/issues/1528,333400272,MDEyOklzc3VlQ29tbWVudDMzMzQwMDI3Mg==,6042212,2017-10-01T19:26:22Z,2017-10-01T19:26:22Z,CONTRIBUTOR,"I have not done anything since posting my commit, I'm afraid. Its content is just an example of how you might pass parameters down to zarr, plus a test case which shows that the basic data round-trips properly, although the dataset does not come back with the same structure it started with.
We can loop back and decide where to go from here.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-333336320,https://api.github.com/repos/pydata/xarray/issues/1528,333336320,MDEyOklzc3VlQ29tbWVudDMzMzMzNjMyMA==,1197350,2017-09-30T21:13:48Z,2017-09-30T21:13:48Z,MEMBER,@martindurant: I may have some time to get back to working on this next week. (Especially if @jhamman can help me sort out the backend testing.) What is the status of your branch?,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-327901739,https://api.github.com/repos/pydata/xarray/issues/1528,327901739,MDEyOklzc3VlQ29tbWVudDMyNzkwMTczOQ==,6042212,2017-09-07T19:36:15Z,2017-09-07T19:36:15Z,CONTRIBUTOR,"@shoyer, is https://github.com/martindurant/xarray/commit/6c1fb6b76ebba862a1c5831210ce026160da0065 a reasonable start?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-327900874,https://api.github.com/repos/pydata/xarray/issues/1528,327900874,MDEyOklzc3VlQ29tbWVudDMyNzkwMDg3NA==,1217238,2017-09-07T19:32:41Z,2017-09-07T19:32:41Z,MEMBER,"@rabernat indeed, the backend tests are not terribly well organized right now. Probably the place to start is to inherit from `DatasetIOTestCases` and `TestCase` and then implement `create_store` and `roundtrip`. `DaskTest` abuses the ""backend"" notion a little bit, but these lines cover the essentials:
https://github.com/pydata/xarray/blob/98a05f11c6f38489c82e86c9e9df796e7fb65fd2/xarray/tests/test_backends.py#L1271-L1279","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-327849640,https://api.github.com/repos/pydata/xarray/issues/1528,327849640,MDEyOklzc3VlQ29tbWVudDMyNzg0OTY0MA==,1197350,2017-09-07T16:17:13Z,2017-09-07T16:17:13Z,MEMBER,"I am stuck on figuring out how to develop a new test case for this. (It doesn't help that #1531 is messing up the backend tests.)
If @shoyer can give us a few hints about how to best implement a test class (i.e. what to subclass, etc.), I think that could jumpstart testing and move the PR forward.
I welcome contributions from others such as @martindurant on this. I won't have much time in the near future, since a new semester just dropped on me like a load of bricks.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-327833777,https://api.github.com/repos/pydata/xarray/issues/1528,327833777,MDEyOklzc3VlQ29tbWVudDMyNzgzMzc3Nw==,6042212,2017-09-07T15:23:31Z,2017-09-07T15:23:31Z,CONTRIBUTOR,"@rabernat , is there anything I can do to help push this along?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325813339,https://api.github.com/repos/pydata/xarray/issues/1528,325813339,MDEyOklzc3VlQ29tbWVudDMyNTgxMzMzOQ==,703554,2017-08-29T21:43:48Z,2017-08-29T21:43:48Z,CONTRIBUTOR,"Zarr 2.2 will hopefully happen some time in the next 2 months, but it will be fully backwards-compatible, no breaking API changes.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325742232,https://api.github.com/repos/pydata/xarray/issues/1528,325742232,MDEyOklzc3VlQ29tbWVudDMyNTc0MjIzMg==,1217238,2017-08-29T17:50:04Z,2017-08-29T17:50:04Z,MEMBER,"> If we think there is an advantage to using the zarr native filters, that could be added via a future PR once we have the basic backend working.
The only advantage here would be for non-xarray users, who could use zarr to do this decoding/encoding automatically.
For what it's worth, the implementation of scale offsets in xarray looks basically equivalent to what's done in zarr. I don't think there's a performance difference either way.
> A further rather big advantage in zarr that I'm not aware of in cdf/hdf (I may be wrong) is not just null values, but not having a given block be written to disc at all if it only contains null data.
If you use chunks, I believe HDF5/NetCDF4 do the same thing, e.g.,
```
In [9]: import h5py
In [10]: with h5py.File('one-chunk.h5') as f: f.create_dataset('foo', (100, 100), chunks=(100, 100))
In [11]: with h5py.File('many-chunk.h5') as f: f.create_dataset('foo', (100000, 100000), chunks=(100, 100))
In [12]: ls -l | grep chunk.h5
-rw-r--r-- 1 shoyer eng 1400 Aug 29 10:48 many-chunk.h5
-rw-r--r-- 1 shoyer eng 1400 Aug 29 10:48 one-chunk.h5
```
(Note the same file-size)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325738019,https://api.github.com/repos/pydata/xarray/issues/1528,325738019,MDEyOklzc3VlQ29tbWVudDMyNTczODAxOQ==,1197350,2017-08-29T17:35:09Z,2017-08-29T17:35:09Z,MEMBER,"One path forward for now would be to ignore the filters like `FixedScaleOffset` that are not present in netCDF, let xarray handle the CF encoding / decoding, and just put the compressors (e.g. `Blosc`, `Zlib`) and their parameters in the xarray variable encoding.
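Concretely, that could eventually look something like this (a hypothetical sketch; the variable name and compressor parameters are illustrative):
```python
import numpy as np
import xarray as xr
import zarr

ds = xr.Dataset({'myvar': ('x', np.zeros(100))})
encoding = {'myvar': {'compressor': zarr.Blosc(cname='zstd', clevel=3)}}
ds.to_zarr({}, encoding=encoding)  # to_zarr is the entry point proposed below
```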
If we think there is an advantage to using the zarr native filters, that could be added via a future PR once we have the basic backend working.
@alimanfoo: when do you anticipate the 2.2 zarr release to happen? Will the API change significantly? If so, I will wait for that to move forward here.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325729013,https://api.github.com/repos/pydata/xarray/issues/1528,325729013,MDEyOklzc3VlQ29tbWVudDMyNTcyOTAxMw==,703554,2017-08-29T17:02:41Z,2017-08-29T17:02:41Z,CONTRIBUTOR,"FWIW all filter (codec) classes have been migrated from zarr to a separate package called numcodecs and will be imported from there in the next (2.2) zarr release. Here is [FixedScaleOffset](https://github.com/alimanfoo/numcodecs/blob/master/numcodecs/fixedscaleoffset.py). The implementation is basic numpy; there is probably some room for optimization.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325728378,https://api.github.com/repos/pydata/xarray/issues/1528,325728378,MDEyOklzc3VlQ29tbWVudDMyNTcyODM3OA==,6042212,2017-08-29T17:00:29Z,2017-08-29T17:00:29Z,CONTRIBUTOR,"A further, rather big advantage of zarr that I'm not aware of in cdf/hdf (I may be wrong) is not just null values, but not having a given block written to disc at all if it only contains null data. This probably meshes perfectly well with most users' understanding of missing data/fill value.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325727354,https://api.github.com/repos/pydata/xarray/issues/1528,325727354,MDEyOklzc3VlQ29tbWVudDMyNTcyNzM1NA==,6042212,2017-08-29T16:57:10Z,2017-08-29T16:57:10Z,CONTRIBUTOR,"Worth pointing out here that the zarr filter-set is extensible (I suppose hdf5 is too, but I don't think this is ever done in practice), though I don't think it makes any particular claims to performance.
I think both of the options above are reasonable, and there is no particular reason to exclude either: a zarr variable could look to xarray like floats but actually be stored as ints (i.e., arguments are passed to zarr), or it could look like ints which xarray expects to inflate to floats (i.e., stored as an attribute). I mean, if a user stores a float variable, but includes kwargs to zarr for scale/filter (or any other filter arguments), we should make no attempt to interrupt that.
The only question is: if the user wishes to apply scale/offset, which is their most likely intention? I would guess the latter (compute in xarray and use attributes), since xarray users probably don't know about zarr and its filters.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325727280,https://api.github.com/repos/pydata/xarray/issues/1528,325727280,MDEyOklzc3VlQ29tbWVudDMyNTcyNzI4MA==,703554,2017-08-29T16:56:55Z,2017-08-29T16:56:55Z,CONTRIBUTOR,"Following this with interest.
Regarding autoclose, just to confirm that zarr doesn't really have any notion of whether something is open or closed. When using the DirectoryStore storage class (most common use case I imagine), all files are automatically closed, nothing is kept open. There are some storage classes (e.g., ZipStore) that do require an explicit close call to finalise the file on disk if you have been writing data, but I think you can ignore this in xarray and leave it up to the user to manage this themselves.
Out of interest, @shoyer do you still think there would be value in writing a wrapper for zarr analogous to h5netcdf? Or does this PR provide all the necessary functionality?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325723577,https://api.github.com/repos/pydata/xarray/issues/1528,325723577,MDEyOklzc3VlQ29tbWVudDMyNTcyMzU3Nw==,1217238,2017-08-29T16:43:58Z,2017-08-29T16:44:25Z,MEMBER,"> Is the goal here to be able to round-trip the file, such that calling .to_netcdf() produces an identical file to the original source file?
Yes, exactly.
> I don't understand how encoding interacts with attributes? When is something an attribute vs. an encoding (add_offset for example)?
Typically, we store things in encoding that are attributes on the underlying NetCDF file but no longer make sense as descriptions of the decoded data. For example (a sketch follows the list):
- On the file, `add_offset` is an attribute.
- If loaded with `open_dataset(..., mask_and_scale=True)`, `add_offset` can be found in `encoding`, not `attrs`, because the data has already been offset.
- If loaded with `open_dataset(..., mask_and_scale=False)`, `add_offset` will still be on `attrs` (the data has not been offset).
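To make that concrete (an illustrative sketch; the file and variable names are made up):
```python
import xarray as xr

ds = xr.open_dataset('example.nc', mask_and_scale=True)
offset = ds['t'].encoding.get('add_offset')  # data was already offset on read
ds = xr.open_dataset('example.nc', mask_and_scale=False)
offset = ds['t'].attrs.get('add_offset')     # still an ordinary attribute
```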
> How does xarray know whether the store automatically encodes / decodes the encodings vs. when it has to be done by xarray, e.g. by calling mask_and_scale
Currently, we assume that stores never do this, and always handle it ourselves. We might need a special exception for zarr and scale/offset encoding.
> Does this mean that my ZarrStore should inherit from WritableCFDataStore instead of AbstractWritableDataStore?
Maybe, though again it will probably need slightly customized conventions for writing data (if we let zarr handling scale/offset encoding).
> I don't yet understand how to make these elements work together properly, for example, to avoid applying the scale / offset function twice, as I mentioned above.
We have two options:
1. Handle it all in xarray via the machinery in `conventions.py`. Never pass the arguments to do scale/offset encoding to zarr (just save them as attributes).
2. Handle it all in zarr. We'll need special case logic to skip this part of encoding.
I think (2) would be the preferred way to do this.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325716892,https://api.github.com/repos/pydata/xarray/issues/1528,325716892,MDEyOklzc3VlQ29tbWVudDMyNTcxNjg5Mg==,1217238,2017-08-29T16:19:57Z,2017-08-29T16:19:57Z,MEMBER,@rabernat I think this is #1531 -- `require_pynio` seems to have infected all our other requirements!,"{""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 1, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325690352,https://api.github.com/repos/pydata/xarray/issues/1528,325690352,MDEyOklzc3VlQ29tbWVudDMyNTY5MDM1Mg==,1197350,2017-08-29T14:54:53Z,2017-08-29T14:54:53Z,MEMBER,"I am now trying to understand the backend test suite structure.
Can someone explain to me why so many tests are skipped? For example, if I run
```
py.test -v xarray/tests/test_backends.py -rsx -k GenericNetCDFDataTest
```
I get
```
================================================== test session starts ==================================================
platform darwin -- Python 3.6.1, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -- /Users/rpa/anaconda/bin/python
cachedir: .cache
rootdir: /Users/rpa/RND/Public/xarray, inifile: setup.cfg
plugins: cov-2.5.1
collected 683 items
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_coordinates_encoding SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_cross_engine_read_write_netcdf3 PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_dataset_caching SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_dataset_compute SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_default_fill_value SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_encoding_kwarg SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_encoding_same_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_encoding_unlimited_dims PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_engine PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_invalid_dataarray_names_raise SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_load SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_orthogonal_indexing PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_pickle SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_pickle_dataarray SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_None_variable SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_boolean_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_coordinates SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_datetime_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_endian SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_example_1_netcdf SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_float64_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_mask_and_scale SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_object_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_string_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_strings_with_fill_value SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_test_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_roundtrip_timedelta_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_write_store PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTest::test_zero_dimensional_variable SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_coordinates_encoding SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_cross_engine_read_write_netcdf3 PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_dataset_caching SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_dataset_compute SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_default_fill_value SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_encoding_kwarg SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_encoding_same_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_encoding_unlimited_dims PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_engine PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_invalid_dataarray_names_raise SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_load SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_orthogonal_indexing PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_pickle SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_pickle_dataarray SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_None_variable SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_boolean_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_coordinates SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_datetime_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_endian SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_example_1_netcdf SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_float64_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_mask_and_scale SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_object_dtype SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_string_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_strings_with_fill_value SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_test_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_roundtrip_timedelta_data SKIPPED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_write_store PASSED
xarray/tests/test_backends.py::GenericNetCDFDataTestAutocloseTrue::test_zero_dimensional_variable SKIPPED
================================================ short test summary info ================================================
SKIP [2] xarray/tests/test_backends.py:382: requires pynio
SKIP [2] xarray/tests/test_backends.py:214: requires pynio
SKIP [2] xarray/tests/test_backends.py:178: requires pynio
SKIP [2] xarray/tests/test_backends.py:468: requires pynio
SKIP [2] xarray/tests/test_backends.py:439: requires pynio
SKIP [2] xarray/tests/test_backends.py:490: requires pynio
SKIP [2] xarray/tests/test_backends.py:428: requires pynio
SKIP [2] xarray/tests/test_backends.py:145: requires pynio
SKIP [2] xarray/tests/test_backends.py:197: requires pynio
SKIP [2] xarray/tests/test_backends.py:207: requires pynio
SKIP [2] xarray/tests/test_backends.py:230: requires pynio
SKIP [2] xarray/tests/test_backends.py:311: requires pynio
SKIP [2] xarray/tests/test_backends.py:300: requires pynio
SKIP [2] xarray/tests/test_backends.py:271: requires pynio
SKIP [2] xarray/tests/test_backends.py:409: requires pynio
SKIP [2] xarray/tests/test_backends.py:291: requires pynio
SKIP [2] xarray/tests/test_backends.py:286: requires pynio
SKIP [2] xarray/tests/test_backends.py:362: requires pynio
SKIP [2] xarray/tests/test_backends.py:235: requires pynio
SKIP [2] xarray/tests/test_backends.py:264: requires pynio
SKIP [2] xarray/tests/test_backends.py:334: requires pynio
SKIP [2] xarray/tests/test_backends.py:139: requires pynio
SKIP [2] xarray/tests/test_backends.py:280: requires pynio
SKIP [2] xarray/tests/test_backends.py:109: requires pynio
```
Those line numbers refer to all of the skipped methods. Why should I need pynio to run those tests?
It looks like the same thing is happening on travis: https://travis-ci.org/pydata/xarray/jobs/268805771#L1527
Maybe @pwolfram understands this stuff?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325660754,https://api.github.com/repos/pydata/xarray/issues/1528,325660754,MDEyOklzc3VlQ29tbWVudDMyNTY2MDc1NA==,1197350,2017-08-29T13:18:33Z,2017-08-29T13:18:33Z,MEMBER,"> encoding keeps track of how variables are represented in a file (e.g., chunking schemes, _FillValue/add_offset/scale_factor, compression, time units), so we reconstruct a netCDF file that looks almost exactly like the file we've read from disk.
Is the goal here to be able to round-trip the file, such that calling `.to_netcdf()` produces an identical file to the original source file? For zarr, I think this would mean having the ability to read from one zarr store into xarray, and then write back to a different store, and have these two stores be identical. That makes sense to me.
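Something like this, perhaps (a hypothetical sketch using the to_zarr/open_zarr entry points discussed elsewhere in this thread):
```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'foo': ('x', np.arange(10))})
src, dst = {}, {}               # two in-memory zarr stores
ds.to_zarr(src)                 # write to the first store
xr.open_zarr(src).to_zarr(dst)  # read it back, write to the second
assert set(src) == set(dst)     # both stores should hold the same keys
```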
I *don't* understand how encoding interacts with attributes. When is something an attribute vs. an encoding (`add_offset`, for example)? How does xarray know whether the store automatically encodes / decodes the encodings vs. when it has to be done by xarray, e.g. by calling [`mask_and_scale`](https://github.com/pydata/xarray/blob/master/xarray/conventions.py#L35)?
> > Should we encode / decode CF for zarr stores?
> Yes, probably, if we want to handle netcdf conventions for times, fill values and scaling.
Does this mean that my `ZarrStore` should inherit from `WritableCFDataStore` instead of `AbstractWritableDataStore`?
Regarding encoding, zarr has its own internal mechanism for encoding, which it calls ""filters""; these closely resemble some of the CF encoding options. For example, the [`FixedScaleOffset`](http://zarr.readthedocs.io/en/latest/api/codecs.html#zarr.codecs.FixedScaleOffset) filter does something similar to xarray's [`mask_and_scale`](https://github.com/pydata/xarray/blob/master/xarray/conventions.py#L35) function.
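For reference, a minimal sketch of that filter in use (parameters are illustrative, using the pre-2.2 import path):
```python
import zarr
from zarr.codecs import FixedScaleOffset

# floats presented to the user, scaled integers stored on disk
filters = [FixedScaleOffset(offset=273.15, scale=100, dtype='f8', astype='i4')]
z = zarr.zeros(100, chunks=10, dtype='f8', filters=filters)
z[:] = 293.67  # stored as round((293.67 - 273.15) * 100) = 2052
```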
I don't yet understand how to make these elements work together properly, for example, to avoid applying the scale / offset function twice, as I mentioned above.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325525827,https://api.github.com/repos/pydata/xarray/issues/1528,325525827,MDEyOklzc3VlQ29tbWVudDMyNTUyNTgyNw==,1217238,2017-08-29T01:14:05Z,2017-08-29T01:14:05Z,MEMBER,"> What is ""encoding"" at the variable level? (I have never understood this part of xarray.) How should encoding be handled with zarr?
`encoding` keeps track of how variables are represented in a file (e.g., chunking schemes, `_FillValue`/`add_offset`/`scale_factor`, compression, time units), so we reconstruct a netCDF file that looks almost exactly like the file we've read from disk. In the case of zarr, I guess we might include chunking, fill values, compressor options....
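For a zarr variable, that might amount to something like the following (purely illustrative values):
```python
import zarr

encoding = {'chunks': (100, 100),      # zarr chunk shape on disk
            'fill_value': -9999,
            'compressor': zarr.Blosc(cname='lz4')}
```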
> Should we encode / decode CF for zarr stores?
Yes, probably, if we want to handle netcdf conventions for times, fill values and scaling.
> Do we want to always automatically align dask chunks with the underlying zarr chunks?
This would be nice! But it's also a bigger issue (will look for the number, I think it's already been opened).
> What sort of public API should the zarr backend have? Should you be able to load zarr stores via open_dataset? Or do we need a new method? I think .to_zarr() would be quite useful.
Still need to think about this one.
> zarr arrays are extensible along all axes. What does this imply for unlimited dimensions?
I guess we can ignore them (maybe add a warning?) -- they're not part of the zarr data model.
> Is any autoclose logic needed? As far as I can tell, zarr objects don't need to be closed.
I don't think we need any autoclose logic at all -- zarr doesn't leave open files hanging around already.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325390391,https://api.github.com/repos/pydata/xarray/issues/1528,325390391,MDEyOklzc3VlQ29tbWVudDMyNTM5MDM5MQ==,6042212,2017-08-28T15:41:08Z,2017-08-28T15:41:08Z,CONTRIBUTOR,"@rabernat: on actually looking through your code :) I'm happy to see you doing exactly what I felt I wasn't knowledgeable enough to do and poking at xarray's guts. If I can help in any way, please let me know, although I don't have a lot of spare hours right now.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325226656,https://api.github.com/repos/pydata/xarray/issues/1528,325226656,MDEyOklzc3VlQ29tbWVudDMyNTIyNjY1Ng==,1197350,2017-08-27T21:42:23Z,2017-08-27T21:42:23Z,MEMBER,"> Is the aim to reduce the number of metadata files hanging around?
This is also part of my goal. I think all the metadata can be stored internally to zarr via attributes. There just have to be some ""special"" attributes that xarray hides from the user. This is the same as h5netcdf.
@alimanfoo suggested this should be possible in that earlier thread:
> Specifically I'm wondering if this could all be stored as attributes on the Zarr array, with some conventions for special xarray attribute names?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694