
issue_comments


10 rows where author_association = "NONE" and user = 3922329 sorted by updated_at descending


id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
573550514 https://github.com/pydata/xarray/issues/3668#issuecomment-573550514 https://api.github.com/repos/pydata/xarray/issues/3668 MDEyOklzc3VlQ29tbWVudDU3MzU1MDUxNA== dmedv 3922329 2020-01-13T08:13:10Z 2020-01-13T09:01:02Z NONE

@jhamman I did already confirm it with a zarr-only test, pickling and unpickling a zarr group object. I get the same error as with an xarray dataset: ValueError: group not found at path ''

Not sure if we can call it a bug, though. According to the storage specification (https://zarr.readthedocs.io/en/stable/spec/v2.html#storage), for a group to exist, a `.zgroup` key must exist under the corresponding logical path, so in the case of `DirectoryStore` it's natural to check whether a `.zgroup` file exists at group-object creation time.
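
For illustration, a minimal sketch of the zarr-only test described above (the store path is hypothetical): pickling a `DirectoryStore`-backed group captures the store's directory path rather than its metadata, so unpickling on a machine where that path does not exist fails.

```python
# Minimal sketch; the store path is hypothetical.
import pickle
import zarr

g = zarr.open_group("/data/llc4320/SST", mode="r")  # works where .zgroup exists
blob = pickle.dumps(g)

# On a client where /data/llc4320/SST is not mounted, unpickling re-runs the
# group's __init__, which looks for .zgroup on the local filesystem:
g2 = pickle.loads(blob)  # ValueError: group not found at path ''
```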

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset: support for multiple zarr datasets 546562676
573455625 https://github.com/pydata/xarray/issues/3686#issuecomment-573455625 https://api.github.com/repos/pydata/xarray/issues/3686 MDEyOklzc3VlQ29tbWVudDU3MzQ1NTYyNQ== dmedv 3922329 2020-01-12T20:48:20Z 2020-01-12T20:51:01Z NONE

Actually, there is no need to separate them. One can simply do something like this to apply the mask: `ds.analysed_sst.where(ds.analysed_sst != fill_value).mean() * scale_factor + offset`. It's not a bug, but if we set mask_and_scale=False, it's left up to us to apply the mask manually.
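
As a rough sketch of that manual approach (the file name is hypothetical, and the usual CF attribute names are assumed):

```python
# Manual mask-and-scale, assuming CF-style attributes on the raw variable.
import xarray as xr

ds = xr.open_dataset("sst.nc", mask_and_scale=False)
raw = ds.analysed_sst
fill_value = raw.attrs["_FillValue"]       # sentinel for missing data
scale_factor = raw.attrs["scale_factor"]   # multiplicative decode factor
offset = raw.attrs["add_offset"]           # additive decode offset

# Mask out fill values, average the raw integers, then decode the result.
mean_sst = raw.where(raw != fill_value).mean() * scale_factor + offset
```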

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Different data values from xarray open_mfdataset when using chunks  548475127
573451230 https://github.com/pydata/xarray/issues/3686#issuecomment-573451230 https://api.github.com/repos/pydata/xarray/issues/3686 MDEyOklzc3VlQ29tbWVudDU3MzQ1MTIzMA== dmedv 3922329 2020-01-12T19:59:31Z 2020-01-12T20:25:16Z NONE

@abarciauskas-bgse Yes, indeed, I forgot about _FillValue. That would mess up the mean calculation with mask_and_scale=False. I think it would be nice if it were possible to control the mask application in open_dataset separately from scale/offset.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Different data values from xarray open_mfdataset when using chunks  548475127
573393003 https://github.com/pydata/xarray/issues/3668#issuecomment-573393003 https://api.github.com/repos/pydata/xarray/issues/3668 MDEyOklzc3VlQ29tbWVudDU3MzM5MzAwMw== dmedv 3922329 2020-01-12T08:23:01Z 2020-01-12T08:23:01Z NONE

The zarr documentation is not entirely clear on whether metadata gets pickled or not with zarr.storage.DirectoryStore (https://zarr.readthedocs.io/en/stable/tutorial.html#pickle-support), but the code shows that the metadata is read from a file upon `__init__`. I guess xarray is simply relying on zarr's own serialization, and there is no easy way to bypass it.

See https://github.com/zarr-developers/zarr-python/blob/v2.4.0/zarr/hierarchy.py#L113 and https://github.com/zarr-developers/zarr-python/blob/v2.4.0/zarr/storage.py#L785-L791
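
A rough, simplified sketch of that mechanism (not the actual zarr code): `__getstate__` saves only the constructor arguments, and `__setstate__` re-runs `__init__`, which re-reads the group metadata from the store, so unpickling fails wherever the metadata files are not reachable.

```python
# Simplified sketch of zarr's Group pickling behaviour; not the real class.
class GroupLike:
    def __init__(self, store, path=""):
        self.store, self.path = store, path
        key = (path + "/" if path else "") + ".zgroup"
        if key not in store:
            raise ValueError("group not found at path %r" % path)

    def __getstate__(self):
        # Only the constructor arguments are pickled, not the loaded metadata.
        return (self.store, self.path)

    def __setstate__(self, state):
        # Re-running __init__ re-reads .zgroup, which must exist locally.
        self.__init__(*state)
```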

I think at this point I will just give up and mount the necessary directories on the client, but at least I have a much better understanding of the issue now.

Feel free to close if you think there's nothing else that can/should be done in xarray code about it.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset: support for multiple zarr datasets 546562676
573380688 https://github.com/pydata/xarray/issues/3686#issuecomment-573380688 https://api.github.com/repos/pydata/xarray/issues/3686 MDEyOklzc3VlQ29tbWVudDU3MzM4MDY4OA== dmedv 3922329 2020-01-12T04:18:43Z 2020-01-12T04:27:23Z NONE

Actually, that's true not just for open_mfdataset, but even for open_dataset with a single file. I've tried it with one of those files from PO.DAAC, and got similar results - slightly different values depending on the chunking strategy.

Just a guess, but I think the problem here is that the calculations are done in floating-point arithmetic (probably float32...), and you get accumulated precision errors depending on the number of chunks.
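
A toy illustration of this effect with plain NumPy (synthetic data, not the PO.DAAC file):

```python
# Chunk-dependent float32 rounding: the same mean computed whole vs. in chunks.
import numpy as np

x = np.random.default_rng(0).random(1_000_000).astype(np.float32)

whole = x.mean(dtype=np.float32)
chunked = np.mean([c.mean(dtype=np.float32) for c in np.split(x, 10)],
                  dtype=np.float32)

print(whole, chunked)  # typically differ in the last few digits
```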

Internally, in the NetCDF file, the analysed_sst values are stored as int16 with real-valued scale and offset attributes, so the correct way to calculate the mean would be to do it on the original int16 values and then apply the scale and offset to the result. Automatic scaling is on by default (i.e. it will replace the original array values with new scaled values), but you can turn it off in open_dataset with the mask_and_scale=False option: http://xarray.pydata.org/en/stable/generated/xarray.open_dataset.html

I tried doing this, and then I got identical results with the chunked and unchunked versions. You can pass this option to open_mfdataset as well via **kwargs.
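
A sketch of that comparison (the file name and chunk sizes are hypothetical):

```python
# With mask_and_scale=False the values stay int16, so the sums are exact and
# the chunked and unchunked means agree.
import xarray as xr

ds_whole = xr.open_dataset("sst.nc", mask_and_scale=False)
ds_chunked = xr.open_dataset("sst.nc", mask_and_scale=False,
                             chunks={"lat": 1000, "lon": 1000})

print(ds_whole.analysed_sst.mean().values)
print(ds_chunked.analysed_sst.mean().values)
```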

I'm basically just starting to use xarray myself, so please someone correct me if any of the above is wrong.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Different data values from xarray open_mfdataset when using chunks  548475127
573367338 https://github.com/pydata/xarray/issues/3668#issuecomment-573367338 https://api.github.com/repos/pydata/xarray/issues/3668 MDEyOklzc3VlQ29tbWVudDU3MzM2NzMzOA== dmedv 3922329 2020-01-12T00:24:55Z 2020-01-12T02:24:39Z NONE

I did another experiment: copied the metadata to the client (.zgroup, .zarray, and .zattrs files only), preserving the directory structure. That worked, i.e. I could run calculations with remote data by wrapping them inside dask.delayed. I guess if the metadata could be cached in the object, that would solve my problem.
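
A sketch of that metadata-only copy (the target path is hypothetical):

```python
# Replicate only the zarr metadata files to the client, preserving the
# directory structure; chunk data stays on the workers.
import os
import shutil

SRC = "/sciserver/filedb02-01/ocean/LLC4320/SST"  # store path on the workers
DST = "/home/client/LLC4320/SST"                  # hypothetical client path
META = {".zgroup", ".zarray", ".zattrs"}

for root, dirs, files in os.walk(SRC):
    rel = os.path.relpath(root, SRC)
    for name in files:
        if name in META:
            os.makedirs(os.path.join(DST, rel), exist_ok=True)
            shutil.copy2(os.path.join(root, name),
                         os.path.join(DST, rel, name))
```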

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset: support for multiple zarr datasets 546562676
572605475 https://github.com/pydata/xarray/issues/3668#issuecomment-572605475 https://api.github.com/repos/pydata/xarray/issues/3668 MDEyOklzc3VlQ29tbWVudDU3MjYwNTQ3NQ== dmedv 3922329 2020-01-09T15:13:59Z 2020-01-09T15:13:59Z NONE

@rabernat Fair enough. In our case it would be possible to mount the NFS shares on the client, and if all else fails I will do exactly that. However, from an architectural perspective, that would make the whole system a bit more tightly coupled than I would like, and it's easy to imagine other use cases where mounting data on the client would not be possible. Also, the ability to work with remote data using just xarray and dask, the way it already works with NetCDF, looks pretty neat, even if unintentional, and I am inclined to pursue that route at least a bit further.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset: support for multiple zarr datasets 546562676
572355926 https://github.com/pydata/xarray/issues/3668#issuecomment-572355926 https://api.github.com/repos/pydata/xarray/issues/3668 MDEyOklzc3VlQ29tbWVudDU3MjM1NTkyNg== dmedv 3922329 2020-01-09T02:40:44Z 2020-01-09T02:40:44Z NONE

I tried to do serialization/deserialization by hand:

  • logged in to one of the Dask workers, loaded the zarr data locally using open_zarr, and pickled the resulting dataset:

```python
ds = xr.open_zarr("/sciserver/filedb02-01/ocean/LLC4320/SST")
pickle.dump(ds, open("/home/dask/zarr.p", "wb"))
```

  • copied the pickle file to the client and tried to unpickle it:

```python
ds = pickle.load(open("zarr.p", "rb"))
```

It failed with the same error:

```
UnpicklingErrorTraceback (most recent call last)
<ipython-input-77-4809dc01c404> in <module>
----> 1 a = pickle.loads(s)

UnpicklingError: pickle data was truncated

import pickle, xarray
pickle.load(open("zarr.p", "rb"))
zarr = pickle.load(open("zarr.p", "rb"))

KeyErrorTraceback (most recent call last)
~/miniconda3/lib/python3.6/site-packages/zarr/hierarchy.py in __init__(self, store, path, read_only, chunk_store, cache_attrs, synchronizer)
    109             mkey = self._key_prefix + group_meta_key
--> 110             meta_bytes = store[mkey]
    111         except KeyError:

~/miniconda3/lib/python3.6/site-packages/zarr/storage.py in __getitem__(self, key)
    726         else:
--> 727             raise KeyError(key)
    728

KeyError: '.zgroup'

During handling of the above exception, another exception occurred:

ValueErrorTraceback (most recent call last)
<ipython-input-83-cd9f4ae936eb> in <module>
----> 1 zarr = pickle.load(open("zarr.p", "rb"))

~/miniconda3/lib/python3.6/site-packages/zarr/hierarchy.py in __setstate__(self, state)
    269
    270     def __setstate__(self, state):
--> 271         self.__init__(*state)
    272
    273     def _item_path(self, item):

~/miniconda3/lib/python3.6/site-packages/zarr/hierarchy.py in __init__(self, store, path, read_only, chunk_store, cache_attrs, synchronizer)
    110             meta_bytes = store[mkey]
    111         except KeyError:
--> 112             err_group_not_found(path)
    113         else:
    114             meta = decode_group_metadata(meta_bytes)

~/miniconda3/lib/python3.6/site-packages/zarr/errors.py in err_group_not_found(path)
     27
     28 def err_group_not_found(path):
---> 29     raise ValueError('group not found at path %r' % path)
     30
     31

ValueError: group not found at path ''
```

I then tried the same thing with a NetCDF dataset, and it worked fine. Also, the pickle file for NetCDF was much smaller. So I guess in the case of a zarr dataset there is some initialization code that tries to open the zarr files when the dataset object gets deserialized on the client, and of course it cannot, because there is no data on the client. That explains a lot... although I'm still not sure whether xarray was ever intended to be used this way. Maybe I'm trying to do something completely wrong here?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset: support for multiple zarr datasets 546562676
572332890 https://github.com/pydata/xarray/issues/3668#issuecomment-572332890 https://api.github.com/repos/pydata/xarray/issues/3668 MDEyOklzc3VlQ29tbWVudDU3MjMzMjg5MA== dmedv 3922329 2020-01-09T01:07:39Z 2020-01-09T01:22:53Z NONE

Here is the stacktrace (somewhat abbreviated). Looks like a deserialization problem. As far as I can see from the Dask status dashboard and worker logs, open_zarr does finish normally on the worker. Just in case, I ran client.get_versions(check=True), and it didn't show any library mismatches.

```
distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x95\x92\x13\x01\x00\x00\x00\x00\x00\x8c\x13xarray.core.dataset\x94\x8c\x07Dataset\x94\x93\x94)\x81\x94 ...

...

KeyErrorTraceback (most recent call last)
~/miniconda3/lib/python3.6/site-packages/zarr/hierarchy.py in __init__(self, store, path, read_only, chunk_store, cache_attrs, synchronizer)
    109             mkey = self._key_prefix + group_meta_key
--> 110             meta_bytes = store[mkey]
    111         except KeyError:

~/miniconda3/lib/python3.6/site-packages/zarr/storage.py in __getitem__(self, key)
    726         else:
--> 727             raise KeyError(key)
    728

KeyError: '.zgroup'

During handling of the above exception, another exception occurred:

ValueErrorTraceback (most recent call last)
<ipython-input-60-5c7db35096c7> in <module>
      6     chunks={}
      7 )
----> 8 ds = dask.compute(dask.delayed(_xr.open_zarr)('/sciserver/filedb02-01/ocean/LLC4320/SST',**open_kwargs))[0]

...

~/miniconda3/lib/python3.6/site-packages/distributed/protocol/pickle.py in loads(x)
     57 def loads(x):
     58     try:
---> 59         return pickle.loads(x)
     60     except Exception:
     61         logger.info("Failed to deserialize %s", x[:10000], exc_info=True)

~/miniconda3/lib/python3.6/site-packages/zarr/hierarchy.py in __setstate__(self, state)
    269
    270     def __setstate__(self, state):
--> 271         self.__init__(*state)
    272
    273     def _item_path(self, item):

~/miniconda3/lib/python3.6/site-packages/zarr/hierarchy.py in __init__(self, store, path, read_only, chunk_store, cache_attrs, synchronizer)
    110             meta_bytes = store[mkey]
    111         except KeyError:
--> 112             err_group_not_found(path)
    113         else:
    114             meta = decode_group_metadata(meta_bytes)

~/miniconda3/lib/python3.6/site-packages/zarr/errors.py in err_group_not_found(path)
     27
     28 def err_group_not_found(path):
---> 29     raise ValueError('group not found at path %r' % path)
     30
     31

ValueError: group not found at path ''
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset: support for multiple zarr datasets 546562676
572311400 https://github.com/pydata/xarray/issues/3668#issuecomment-572311400 https://api.github.com/repos/pydata/xarray/issues/3668 MDEyOklzc3VlQ29tbWVudDU3MjMxMTQwMA== dmedv 3922329 2020-01-08T23:41:22Z 2020-01-08T23:45:59Z NONE

@rabernat Each Dask worker is running on its own machine. The data that I am trying to work with is distributed among workers, but all of it is accessible from any individual worker via cross-mounted NFS shares, so this works like a shared data storage, basically. None of that data is available on the client.

For now, I'm trying to open just a single zarr store. I have only mentioned open_mfdataset as an example, because it has this parallel option, unlike open_dataset or open_zarr. This is really not about combining multiple datasets, but about working with data on a remote Dask cluster. Sorry if I haven't made that absolutely clear from the start.

@dcherian You mean this code?

```python
def modify(ds):
    # modify ds here
    return ds

# this is basically what open_mfdataset does
open_kwargs = dict(decode_cf=True, decode_times=False)
open_tasks = [dask.delayed(xr.open_dataset)(f, **open_kwargs) for f in file_names]
tasks = [dask.delayed(modify)(task) for task in open_tasks]
datasets = dask.compute(tasks)  # get a list of xarray.Datasets
combined = xr.combine_nested(datasets)  # or some combination of concat, merge
```

In the case of a single data source, I think it can be condensed into this:

```python
open_kwargs = dict(
    decode_cf=True,
    decode_times=False
)
ds = dask.compute(dask.delayed(xr.open_dataset)(file_name, **open_kwargs))[0]
```

But it doesn't work quite as I expected, either with zarr or with NetCDF. First I'll have to explain what I get with open_dataset and a NetCDF file. The code above runs, but when I try to do calculations on the resulting dataset, for example

ds['Temp'].mean().compute()

I get

FileNotFoundError: [Errno 2] No such file or directory

on the client. Only if I wrap it in dask.delayed again will it run properly:

dask.compute(dask.delayed(ds['Temp'].mean)())

So, this approach is not fully equivalent to what open_mfdataset does, and unfortunately that doesn't work for me, because I would like to be able to use the xarray dataset transparently, without having to program Dask explicitly.

If I add chunks={} to open_kwargs, similar to this line in the open_mfdataset implementation (https://github.com/pydata/xarray/blob/v0.14.1/xarray/backends/api.py#L885), then it starts behaving exactly like open_mfdataset and I can use the dataset transparently. I don't quite understand what's going on there, but so far so good.
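
For reference, a sketch of that working NetCDF variant (the file name is hypothetical):

```python
# Opening with chunks={} returns a dataset backed by dask arrays, so later
# computations are shipped to the cluster instead of reading the file locally.
import dask
import xarray as xr

file_name = "data.nc"  # hypothetical
open_kwargs = dict(decode_cf=True, decode_times=False, chunks={})
ds = dask.compute(dask.delayed(xr.open_dataset)(file_name, **open_kwargs))[0]

print(ds["Temp"].mean().compute())  # no FileNotFoundError this time
```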

Now, back to zarr: `ds = dask.compute(dask.delayed(xr.open_zarr)(zarr_dataset_path, **open_kwargs))[0]` doesn't run at all, regardless of the chunks setting, giving me

ValueError: group not found at path ''

so I don't even get a dataset object. It seems that something is quite different in the zarr backend implementation. I haven't had the chance to look at the code carefully yet, but I will do so in the next few days.

Sorry for this long-winded explanation, I hope it clarifies what I'm trying to achieve here.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset: support for multiple zarr datasets 546562676

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);