html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/pull/1528#issuecomment-364804265,https://api.github.com/repos/pydata/xarray/issues/1528,364804265,MDEyOklzc3VlQ29tbWVudDM2NDgwNDI2NQ==,1217238,2018-02-12T00:15:23Z,2018-02-12T00:15:23Z,MEMBER,"See https://github.com/dask/dask/issues/2000 for the dask issue. Once this works in dask it should be quite easy to implement in xarray, too.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-364804162,https://api.github.com/repos/pydata/xarray/issues/1528,364804162,MDEyOklzc3VlQ29tbWVudDM2NDgwNDE2Mg==,1217238,2018-02-12T00:14:22Z,2018-02-12T00:14:22Z,MEMBER,@martindurant that could probably be addressed most cleanly by improving `__setitem__` support for dask.array.,"{""total_count"": 2, ""+1"": 2, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-351588678,https://api.github.com/repos/pydata/xarray/issues/1528,351588678,MDEyOklzc3VlQ29tbWVudDM1MTU4ODY3OA==,1217238,2017-12-14T02:23:03Z,2017-12-14T02:23:03Z,MEMBER,"woohoo, thank you Ryan!","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-350352097,https://api.github.com/repos/pydata/xarray/issues/1528,350352097,MDEyOklzc3VlQ29tbWVudDM1MDM1MjA5Nw==,1217238,2017-12-08T19:34:09Z,2017-12-08T19:34:09Z,MEMBER,"> The default keyword was introduced in python 3.4, so this doesn't work in 2.7. I have tried a couple of options to overcome this but none of them have worked.
Oops, this is my fault!
Instead, try:
```python
ndims = [k.ndim for k in key if isinstance(k, np.ndarray)]
array_subspace_size = max(ndims) if ndims else 0
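# (equivalent to max(ndims, default=0), but also works on Python 2.7)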
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-349554730,https://api.github.com/repos/pydata/xarray/issues/1528,349554730,MDEyOklzc3VlQ29tbWVudDM0OTU1NDczMA==,1217238,2017-12-06T07:10:37Z,2017-12-06T07:10:37Z,MEMBER,I just pushed a commit adding a test for `backends.zarr._replace_slices_with_arrays`.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-348569223,https://api.github.com/repos/pydata/xarray/issues/1528,348569223,MDEyOklzc3VlQ29tbWVudDM0ODU2OTIyMw==,1217238,2017-12-01T18:20:32Z,2017-12-01T18:20:32Z,MEMBER,"> To finish it up, I propose to raise an error when attempting to encode variable-length string data. If someone can give me a quick one liner to help identify such datatypes, that would be helpful.
Variable length strings are stored with `dtype=object`. So something like `dtype.kind == 'O'` should work.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-348560326,https://api.github.com/repos/pydata/xarray/issues/1528,348560326,MDEyOklzc3VlQ29tbWVudDM0ODU2MDMyNg==,1217238,2017-12-01T17:43:03Z,2017-12-01T17:43:03Z,MEMBER,I'll give this another look over the weekend.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-347984582,https://api.github.com/repos/pydata/xarray/issues/1528,347984582,MDEyOklzc3VlQ29tbWVudDM0Nzk4NDU4Mg==,1217238,2017-11-29T20:22:33Z,2017-11-29T20:22:33Z,MEMBER,"I'm fine skipping strings entirely for now. They are indeed unneeded for
most netCDF datasets.
On Wed, Nov 29, 2017 at 8:18 PM Ryan Abernathey wrote:
> Right now I am in a dilemma over how to move forward. Fixing this string
> encoding issue will require some serious hacks to cf encoding. If I do this
> before #1087 is finished, it will be a waste of time (and a pain). On the
> other hand #1087 could take a long time, since it is a major refactor itself.
>
> Is there some way to punt on the multi-length string encoding for now? We
> could just error if such variables are present. This would allow us to get
> the experimental zarr backend out into the wild. FWIW, none of the datasets
> I want to use this with actually have any string data variables at all. I
> believe 95% of netcdf datasets are just regular numbers. This is an edge
> case.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-347351224,https://api.github.com/repos/pydata/xarray/issues/1528,347351224,MDEyOklzc3VlQ29tbWVudDM0NzM1MTIyNA==,1217238,2017-11-27T22:32:47Z,2017-11-28T07:51:31Z,MEMBER,"> Overall, I find the conventions module to be a bit unwieldy. There is a lot of stuff in there, not all of which is related to CF conventions. It would be useful to separate the actual conventions from the encoding / decoding needed for different backends.
Agreed!
I wonder why zarr doesn't have a UTF-8 variable length string type (https://github.com/alimanfoo/zarr/issues/206) -- that would feel like the obvious first choice for encoding this data.
That said, xarray *should* be able to use fixed-length bytes just fine, doing UTF-8 encoding/decoding on the fly.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-345091139,https://api.github.com/repos/pydata/xarray/issues/1528,345091139,MDEyOklzc3VlQ29tbWVudDM0NTA5MTEzOQ==,1217238,2017-11-16T23:02:14Z,2017-11-16T23:02:14Z,MEMBER,"> can we brainstorm what a ZarrArrayWrapper would need to be compatible with the new indexing API?
We will need to write new adapter code to map xarray's explicit indexer classes onto the appropriate zarr methods, e.g.,
```python
def __getitem__(self, key):
    array = self.get_array()
    if isinstance(key, BasicIndexer):
        return array[key.tuple]
    elif isinstance(key, VectorizedIndexer):
        return array.vindex[_replace_slices_with_arrays(key.tuple, self.shape)]
    else:
        assert isinstance(key, OuterIndexer)
        return array.oindex[key.tuple]


# untested, but I think this does the appropriate shape munging to make slices
# appear as the last axes of the result array
def _replace_slices_with_arrays(key, shape):
    num_slices = sum(1 for k in key if isinstance(k, slice))
    num_arrays = len(shape) - num_slices
    new_key = []
    slice_count = 0
    for k, size in zip(key, shape):
        if isinstance(k, slice):
            # expand the slice into an explicit index array, shaped so it
            # lands on one of the trailing axes of the vindex result
            array = np.arange(*k.indices(size))
            sl = [np.newaxis] * len(shape)
            sl[num_arrays + slice_count] = slice(None)
            k = array[tuple(sl)]
            slice_count += 1
        else:
            assert isinstance(k, np.ndarray)
            k = k[(slice(None),) * num_arrays + (np.newaxis,) * num_slices]
        new_key.append(k)
    return tuple(new_key)
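
# illustrative example (not part of the original sketch): with
# key = (np.array([0, 2]), slice(None)) and shape = (3, 4), the array indexer
# is reshaped to (2, 1) and the slice becomes np.arange(4) with shape (1, 4),
# so the vindex result has the array subspace first and the slice as its last axis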
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-344040250,https://api.github.com/repos/pydata/xarray/issues/1528,344040250,MDEyOklzc3VlQ29tbWVudDM0NDA0MDI1MA==,1217238,2017-11-13T20:02:03Z,2017-11-13T20:02:03Z,MEMBER,"@rabernat sorry for the churn here, but you are also probably going to need to update after the explicit indexing changes in #1705.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-335015485,https://api.github.com/repos/pydata/xarray/issues/1528,335015485,MDEyOklzc3VlQ29tbWVudDMzNTAxNTQ4NQ==,1217238,2017-10-08T15:46:36Z,2017-10-08T15:46:36Z,MEMBER,"For serializing attributes, the easiest fix is to call `.item()` on any numpy scalars (instances of `np.generic`) and `.tolist()` on any numpy arrays. For thoroughness this might be worth doing with a custom JSON encoder on the zarr side, but it would also be easy to do in the xarray wrapper.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-327900874,https://api.github.com/repos/pydata/xarray/issues/1528,327900874,MDEyOklzc3VlQ29tbWVudDMyNzkwMDg3NA==,1217238,2017-09-07T19:32:41Z,2017-09-07T19:32:41Z,MEMBER,"@rabernat indeed, the backend tests are not terribly well organized right now. Probably the place to start is to inherit from `DatasetIOTestCases` and `TestCase` and then implement `create_store` and `roundtrip`. `DaskTest` abuses the ""backend"" notion a little bit, but these lines cover the essentials:
https://github.com/pydata/xarray/blob/98a05f11c6f38489c82e86c9e9df796e7fb65fd2/xarray/tests/test_backends.py#L1271-L1279","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325742232,https://api.github.com/repos/pydata/xarray/issues/1528,325742232,MDEyOklzc3VlQ29tbWVudDMyNTc0MjIzMg==,1217238,2017-08-29T17:50:04Z,2017-08-29T17:50:04Z,MEMBER,"> If we think there is an advantage to using the zarr native filters, that could be added via a future PR once we have the basic backend working.
The only advantage here would be for non-xarray users, who could use zarr to do this decoding/encoding automatically.
For what it's worth, the implementation of scale/offset encoding in xarray looks basically equivalent to what's done in zarr. I don't think there's a performance difference either way.
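To make the comparison concrete, here is a minimal sketch of CF-style scale/offset decoding (the packed values, `scale_factor`, and `add_offset` are made up for illustration); zarr's `FixedScaleOffset` filter implements the same idea with its own parameter convention:
```python
import numpy as np

packed = np.array([0, 100, 200], dtype='int16')  # packed values as stored on disk
scale_factor, add_offset = 0.01, 273.15
# CF convention: unpacked = packed * scale_factor + add_offset
unpacked = packed * scale_factor + add_offset    # [273.15, 274.15, 275.15]
```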
> A further rather big advantage in zarr that I'm not aware of in cdf/hdf (I may be wrong) is not just null values, but not having a given block be written to disc at all if it only contains null data.
If you use chunks, I believe HDF5/NetCDF4 do the same thing, e.g.,
```
In [10]: with h5py.File('one-chunk.h5') as f: f.create_dataset('foo', (100, 100), chunks=(100, 100))
In [11]: with h5py.File('many-chunk.h5') as f: f.create_dataset('foo', (100000, 100000), chunks=(100, 100))
In [12]: ls -l | grep chunk.h5
-rw-r--r-- 1 shoyer eng 1400 Aug 29 10:48 many-chunk.h5
-rw-r--r-- 1 shoyer eng 1400 Aug 29 10:48 one-chunk.h5
```
(Note the same file-size)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325723577,https://api.github.com/repos/pydata/xarray/issues/1528,325723577,MDEyOklzc3VlQ29tbWVudDMyNTcyMzU3Nw==,1217238,2017-08-29T16:43:58Z,2017-08-29T16:44:25Z,MEMBER,"> Is the goal here to be able to round-trip the file, such that calling .to_netcdf() produces an identical file to the original source file?
Yes, exactly.
> I don't understand how encoding interacts with attributes? When is something an attribute vs. an encoding (add_offset for example)?
Typically, we store things in `encoding` that are attributes on the underlying NetCDF file but that no longer make sense as descriptions of the decoded data. For example (see the short sketch after this list):
- On the file, `add_offset` is an attribute.
- If loaded with `open_dataset(..., mask_and_scale=True)`, `add_offset` can be found in `encoding`, not `attrs`, because the data has already been offset.
- If loaded with `open_dataset(..., mask_and_scale=False)`, `add_offset` will still be in `attrs` (the data has not been offset).
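A minimal sketch of that distinction (the file name and variable name below are hypothetical):
```python
import xarray as xr

# mask_and_scale=True (the default): the offset is applied on read,
# so add_offset moves from attrs into encoding
ds = xr.open_dataset('example.nc')
var = ds['temperature']
print(var.encoding.get('add_offset'))  # e.g. 273.15
print('add_offset' in var.attrs)       # False -- data is already offset

# mask_and_scale=False: raw values are kept, add_offset stays in attrs
ds_raw = xr.open_dataset('example.nc', mask_and_scale=False)
print('add_offset' in ds_raw['temperature'].attrs)  # True
```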
> How does xarray know whether the store automatically encodes / decodes the encodings vs. when it has to be done by xarray, e.g. by calling mask_and_scale
Currently, we assume that stores never do this, and always handle it ourselves. We might need a special exception for zarr and scale/offset encoding.
> Does this mean that my ZarrStore should inherit from WritableCFDataStore instead of AbstractWritableDataStore?
Maybe, though again it will probably need slightly customized conventions for writing data (if we let zarr handle scale/offset encoding).
> I don't yet understand how to make these elements work together properly, for example, to avoid applying the scale / offset function twice, as I mentioned above.
We have two options:
1. Handle it all in xarray via the machinery in `conventions.py`. Never pass the arguments to do scale/offset encoding to zarr (just save them as attributes).
2. Handle it all in zarr. We'll need special case logic to skip this part of encoding.
I think (2) would be the preferred way to do this.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325716892,https://api.github.com/repos/pydata/xarray/issues/1528,325716892,MDEyOklzc3VlQ29tbWVudDMyNTcxNjg5Mg==,1217238,2017-08-29T16:19:57Z,2017-08-29T16:19:57Z,MEMBER,@rabernat I think this is #1531 -- `require_pynio` seems to have infected all our other requirements!,"{""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 1, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694
https://github.com/pydata/xarray/pull/1528#issuecomment-325525827,https://api.github.com/repos/pydata/xarray/issues/1528,325525827,MDEyOklzc3VlQ29tbWVudDMyNTUyNTgyNw==,1217238,2017-08-29T01:14:05Z,2017-08-29T01:14:05Z,MEMBER,"> What is ""encoding"" at the variable level? (I have never understood this part of xarray.) How should encoding be handled with zarr?
`encoding` keeps track of how variables are represented in a file (e.g., chunking schemes, `_FillValue`/`add_offset`/`scale_factor`, compression, time units), so we can reconstruct a netCDF file that looks almost exactly like the file we've read from disk. In the case of zarr, I guess we might include chunking, fill values, compressor options....
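For illustration, here is roughly what `encoding` might contain for a variable read from a netCDF4 file (the file name, variable name, and values below are made up):
```python
import xarray as xr

ds = xr.open_dataset('example.nc')  # hypothetical netCDF4 file
print(ds['temperature'].encoding)
# e.g. {'zlib': True, 'complevel': 4, 'chunksizes': (10, 100),
#       'dtype': dtype('int16'), 'scale_factor': 0.01, '_FillValue': -9999,
#       'source': 'example.nc', 'original_shape': (100, 1000)}
```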
> Should we encode / decode CF for zarr stores?
Yes, probably, if we want to handle netcdf conventions for times, fill values and scaling.
> Do we want to always automatically align dask chunks with the underlying zarr chunks?
This would be nice! But it's also a bigger issue (will look for the number, I think it's already been opened).
> What sort of public API should the zarr backend have? Should you be able to load zarr stores via open_dataset? Or do we need a new method? I think .to_zarr() would be quite useful.
Still need to think about this one.
> zarr arrays are extensible along all axes. What does this imply for unlimited dimensions?
I guess we can ignore them (maybe add a warning?) -- they're not part of the zarr data model.
> Is any autoclose logic needed? As far as I can tell, zarr objects don't need to be closed.
I don't think we need any autoclose logic at all -- zarr doesn't leave open files hanging around in the first place.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,253136694