issue_comments
14 rows where issue = 546562676 (open_mfdataset: support for multiple zarr datasets), sorted by updated_at descending

---

573910792 | rabernat (MEMBER) | 2020-01-13T22:50:41Z | https://github.com/pydata/xarray/issues/3668#issuecomment-573910792

It would be wonderful if we could translate this complex xarray issue into a minimally simple zarr issue. Then the zarr devs can decide whether this use case is compatible with the zarr spec or not.

Reactions: +1 (1)
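
A zarr-only reduction along these lines might look like the following sketch. This is an illustration, not code from the thread; it assumes zarr-python 2.x, and the store path is hypothetical:

```python
import pickle

import zarr

# Hypothetical local path; the point is that it exists on this machine
# but not on the machine that later unpickles the group.
store = zarr.DirectoryStore("/data/example.zarr")
group = zarr.group(store=store)

payload = pickle.dumps(group)

# Unpickling re-runs Group.__init__, which reads the ".zgroup" key from
# the store; where the directory is missing, this raises
# "ValueError: group not found at path ''", matching the tracebacks
# further down the thread.
restored = pickle.loads(payload)
```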

---

573550514 | dmedv (NONE) | 2020-01-13T08:13:10Z | https://github.com/pydata/xarray/issues/3668#issuecomment-573550514

@jhamman I did already confirm it with a zarr-only test, pickling and unpickling a zarr group object. I get the same error as with an xarray dataset (`ValueError: group not found at path ''`).

Not sure if we can call it a bug, though. According to the storage specification (https://zarr.readthedocs.io/en/stable/spec/v2.html#storage), for a group to exist a `.zgroup` key must be present under the corresponding logical path in the store.

Reactions: +1 (1)
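
To make that spec rule concrete, here is a small zarr 2.x check (a sketch, with a hypothetical path) showing that a group "exists" exactly when the store holds a `.zgroup` key:

```python
import zarr

store = zarr.DirectoryStore("/tmp/example.zarr")  # hypothetical path
zarr.group(store=store)

# Per the v2 storage spec, the group exists because this key does:
print(".zgroup" in store)   # True
print(store[".zgroup"])     # JSON bytes like b'{"zarr_format": 2}'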

---

573509747 | jhamman (MEMBER) | 2020-01-13T05:06:45Z | https://github.com/pydata/xarray/issues/3668#issuecomment-573509747

@dmedv and @rabernat - after thinking about this a bit more and reviewing the links in the last post, I'm pretty sure we're bumping into a bug in zarr's directory store pickle support. It would be nice to confirm this with some zarr-only tests, but I don't see why the store needs to reference the `.zgroup` files when the object is unpickled.

---

573393003 | dmedv (NONE) | 2020-01-12T08:23:01Z | https://github.com/pydata/xarray/issues/3668#issuecomment-573393003

Zarr documentation is not entirely clear on whether metadata gets pickled or not, but the code shows that the metadata is read from a file upon object initialization. See https://github.com/zarr-developers/zarr-python/blob/v2.4.0/zarr/hierarchy.py#L113 and https://github.com/zarr-developers/zarr-python/blob/v2.4.0/zarr/storage.py#L785-L791

I think at this point I will just give up and mount the necessary directories on the client, but at least I have a much better understanding of the issue now. Feel free to close if you think there's nothing else that can/should be done in xarray code about it.
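
The behaviour described here can be paraphrased in a few lines. This is a simplified model of the pickling-relevant shape of zarr 2.4.0's `Group`, not the actual source:

```python
# Simplified model of zarr.hierarchy.Group (zarr-python 2.4.0);
# the real class does much more.
class Group:
    def __init__(self, store, path=""):
        self._store = store
        self._path = path
        # Metadata is read from the store on every construction ...
        self._meta = store[path + ".zgroup"]  # KeyError if unreachable

    def __getstate__(self):
        # ... but only the constructor arguments are pickled:
        return (self._store, self._path)

    def __setstate__(self, state):
        # Unpickling therefore re-reads ".zgroup" from the store.
        self.__init__(*state)
```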

---

573367338 | dmedv (NONE) | 2020-01-12T00:24:55Z | https://github.com/pydata/xarray/issues/3668#issuecomment-573367338

I did another experiment: copied the metadata to the client ([…]).

---

573197896 | jhamman (MEMBER) | 2020-01-10T20:43:30Z | https://github.com/pydata/xarray/issues/3668#issuecomment-573197896

Also, @dmedv, can you add the output of […]?

---

573196874 | jhamman (MEMBER) | 2020-01-10T20:40:14Z | https://github.com/pydata/xarray/issues/3668#issuecomment-573196874

True. I think it's fair to say that the behavior you are enjoying (accessing data that the client cannot see) is the exception, not the rule. I expect there are many places in our backends that will not support this functionality at present. The motivation for implementing the `parallel` option in `open_mfdataset` was to speed up the opening of many files by doing the opens on dask workers.

Ironically, this dask issue also popped up and has some significant overlap here: https://github.com/dask/dask/issues/5769

In both of these cases, the desire is for the worker to open the file (or zarr dataset), construct the underlying dask arrays, and return the meta object. This requires the object to be fully pickle-able and for any references to be maintained. It is possible, as indicated by your traceback, that the zarr backend is trying to reference the `.zgroup` key when the object is unpickled.
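
In other words, the contract that `parallel=True` relies on can be written as a round-trip test. A sketch, with a hypothetical store path:

```python
import pickle

import xarray as xr

# What parallel open requires: a dataset opened on one machine must
# survive a pickle round trip on another machine that can still reach
# the underlying store.
ds = xr.open_zarr("/path/to/store.zarr")   # hypothetical store path
restored = pickle.loads(pickle.dumps(ds))

# Per the experiments in this thread, the zarr backend re-reads
# ".zgroup" on the load side, so this fails wherever the store path is
# not visible; the netCDF backend reportedly does not hit the file here.
```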

---

572605475 | dmedv (NONE) | 2020-01-09T15:13:59Z | https://github.com/pydata/xarray/issues/3668#issuecomment-572605475

@rabernat Fair enough. In our case it would be possible to mount the NFS shares on the client, and if all else fails I will do exactly that. However, from an architectural perspective, that would make the whole system a bit more tightly coupled than I would like, and it's easy to imagine other use cases where mounting data on the client would not be possible.

Also, the ability to work with remote data using just xarray and dask, the way it already works with NetCDF, looks pretty neat, even if unintentional, and I am inclined to pursue that route at least a bit further.

Reactions: +1 (1)

---

572369966 | rabernat (MEMBER) | 2020-01-09T03:42:23Z | https://github.com/pydata/xarray/issues/3668#issuecomment-572369966

Thanks for these detailed reports! The scenario you are describing -- trying to open a file that is not accessible at all from the client -- is certainly not something we ever considered when designing this. It is a miracle to me that it does work with netCDF.

I think you are on track with the serialization diagnostics. I believe that @jhamman has the best understanding of this topic. He implemented the parallel mode in `open_mfdataset`.

In the meantime, it seems worth asking the obvious question... how hard would it be to mount the NFS volume on the client? That would avoid having to go down this route.

---

572355926 | dmedv (NONE) | 2020-01-09T02:40:44Z | https://github.com/pydata/xarray/issues/3668#issuecomment-572355926

I tried to do serialization/deserialization by hand. It failed with the same error:

```
UnpicklingErrorTraceback (most recent call last)
<ipython-input-77-4809dc01c404> in <module>
----> 1 a = pickle.loads(s)

UnpicklingError: pickle data was truncated
```

```python
import pickle, xarray
zarr = pickle.load(open("zarr.p", "rb"))
```

```
KeyErrorTraceback (most recent call last)
~/miniconda3/lib/python3.6/site-packages/zarr/hierarchy.py in __init__(self, store, path, read_only, chunk_store, cache_attrs, synchronizer)
    109             mkey = self._key_prefix + group_meta_key
--> 110             meta_bytes = store[mkey]
    111         except KeyError:

~/miniconda3/lib/python3.6/site-packages/zarr/storage.py in __getitem__(self, key)
    726         else:
--> 727             raise KeyError(key)
    728

KeyError: '.zgroup'

During handling of the above exception, another exception occurred:

ValueErrorTraceback (most recent call last)
<ipython-input-83-cd9f4ae936eb> in <module>
----> 1 zarr = pickle.load(open("zarr.p", "rb"))

~/miniconda3/lib/python3.6/site-packages/zarr/hierarchy.py in __setstate__(self, state)
    269
    270     def __setstate__(self, state):
--> 271         self.__init__(*state)
    272
    273     def _item_path(self, item):

~/miniconda3/lib/python3.6/site-packages/zarr/hierarchy.py in __init__(self, store, path, read_only, chunk_store, cache_attrs, synchronizer)
    110             meta_bytes = store[mkey]
    111         except KeyError:
--> 112             err_group_not_found(path)
    113         else:
    114             meta = decode_group_metadata(meta_bytes)

~/miniconda3/lib/python3.6/site-packages/zarr/errors.py in err_group_not_found(path)
     27
     28 def err_group_not_found(path):
---> 29     raise ValueError('group not found at path %r' % path)
     30
     31

ValueError: group not found at path ''
```

I then tried the same thing with a NetCDF dataset, and it worked fine. Also, the pickle file for NetCDF was much smaller. So I guess in the case of a zarr dataset there is some initialization code that tries to open the zarr files when the dataset object gets deserialized on the client, and of course it cannot, because there is no data on the client. That explains a lot... although I'm still not sure if xarray was ever intended to be used that way. Maybe I'm trying to do a completely wrong thing here?
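
The netCDF comparison mentioned here would look something like this sketch (file names hypothetical):

```python
import pickle

import xarray as xr

# Same by-hand experiment with a netCDF file instead of a zarr store.
ds = xr.open_dataset("data.nc", chunks={})    # hypothetical file
with open("netcdf.p", "wb") as f:
    pickle.dump(ds, f)

# Per the report above, this load succeeds (and the pickle is much
# smaller), because unpickling does not re-read metadata from disk:
with open("netcdf.p", "rb") as f:
    restored = pickle.load(f)
```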

---

572332890 | dmedv (NONE) | 2020-01-09T01:07:39Z | https://github.com/pydata/xarray/issues/3668#issuecomment-572332890

Here is the stack trace (somewhat abbreviated). Looks like a deserialization problem. As far as I can see from the Dask status dashboard and worker logs, the `open_zarr` task itself completes on the worker; the failure happens when the result is deserialized on the client:

```
distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x95\x92\x13\x01\x00\x00\x00\x00\x00\x8c\x13xarray.core.dataset\x94\x8c\x07Dataset\x94\x93\x94)\x81\x94 ...
...
KeyErrorTraceback (most recent call last)
~/miniconda3/lib/python3.6/site-packages/zarr/hierarchy.py in __init__(self, store, path, read_only, chunk_store, cache_attrs, synchronizer)
    109             mkey = self._key_prefix + group_meta_key
--> 110             meta_bytes = store[mkey]
    111         except KeyError:

~/miniconda3/lib/python3.6/site-packages/zarr/storage.py in __getitem__(self, key)
    726         else:
--> 727             raise KeyError(key)
    728

KeyError: '.zgroup'

During handling of the above exception, another exception occurred:

ValueErrorTraceback (most recent call last)
<ipython-input-60-5c7db35096c7> in <module>
      6     chunks={}
      7 )
----> 8 ds = dask.compute(dask.delayed(_xr.open_zarr)('/sciserver/filedb02-01/ocean/LLC4320/SST', **open_kwargs))[0]
...
~/miniconda3/lib/python3.6/site-packages/distributed/protocol/pickle.py in loads(x)
     57 def loads(x):
     58     try:
---> 59         return pickle.loads(x)
     60     except Exception:
     61         logger.info("Failed to deserialize %s", x[:10000], exc_info=True)

~/miniconda3/lib/python3.6/site-packages/zarr/hierarchy.py in __setstate__(self, state)
    269
    270     def __setstate__(self, state):
--> 271         self.__init__(*state)
    272
    273     def _item_path(self, item):

~/miniconda3/lib/python3.6/site-packages/zarr/hierarchy.py in __init__(self, store, path, read_only, chunk_store, cache_attrs, synchronizer)
    110             meta_bytes = store[mkey]
    111         except KeyError:
--> 112             err_group_not_found(path)
    113         else:
    114             meta = decode_group_metadata(meta_bytes)

~/miniconda3/lib/python3.6/site-packages/zarr/errors.py in err_group_not_found(path)
     27
     28 def err_group_not_found(path):
---> 29     raise ValueError('group not found at path %r' % path)
     30
     31

ValueError: group not found at path ''
```

---

572311400 | dmedv (NONE) | 2020-01-08T23:41:22Z | https://github.com/pydata/xarray/issues/3668#issuecomment-572311400

@rabernat Each Dask worker is running on its own machine. The data that I am trying to work with is distributed among workers, but all of it is accessible from any individual worker via cross-mounted NFS shares, so this works like a shared data storage, basically. None of that data is available on the client. For now, I'm trying to open just a single zarr store. I have only mentioned `open_mfdataset` […]

@dcherian You mean this code?

```python
def modify(ds):
    # modify ds here
    return ds

# this is basically what open_mfdataset does
open_kwargs = dict(decode_cf=True, decode_times=False)
open_tasks = [dask.delayed(xr.open_dataset)(f, **open_kwargs) for f in file_names]
tasks = [dask.delayed(modify)(task) for task in open_tasks]
datasets = dask.compute(tasks)  # get a list of xarray.Datasets
combined = xr.combine_nested(datasets)  # or some combination of concat, merge
```

In case of a single data source, I think, it can be condensed into this: […]

I get […] on the client. Only if I wrap it in […]

So, this approach is not fully equivalent to what `open_mfdataset` does. If I add […]

Now, back to zarr: […] so I don't even get a dataset object. Seems that something is quite different in the zarr backend implementation. I haven't had the chance to look at the code carefully yet, but I will do so in the next few days.

Sorry for this long-winded explanation, I hope it clarifies what I'm trying to achieve here.
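
For illustration, the "condensed" single-source form referred to above was presumably something along these lines. This is a guess reconstructed from the traceback earlier in the thread, not the original snippet:

```python
import dask

import xarray as xr

# Hypothetical reconstruction: open one dataset remotely via a single
# delayed task instead of going through open_mfdataset.
open_kwargs = dict(decode_cf=True, decode_times=False, chunks={})
ds = dask.compute(
    dask.delayed(xr.open_dataset)("/path/to/file.nc", **open_kwargs)
)[0]
```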

---

572205386 | rabernat (MEMBER) | 2020-01-08T18:51:06Z | https://github.com/pydata/xarray/issues/3668#issuecomment-572205386

Hi @dmedv -- thanks a lot for raising this issue here!

One clarification question: is there just a single zarr store you are trying to read? Or are you trying to combine multiple stores, like […]?

Can you provide more detail about how the zarr data is distributed across the different workers and the client?

---

572196698 | dcherian (MEMBER) | 2020-01-08T18:28:57Z | https://github.com/pydata/xarray/issues/3668#issuecomment-572196698

You can use the pseudocode here: https://xarray.pydata.org/en/stable/io.html#reading-multi-file-datasets and change `open_dataset` to `open_zarr`.
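
Adapted to zarr as suggested, the documentation's pseudocode would look roughly like this sketch, where `store_paths` and `modify` are placeholders:

```python
import dask

import xarray as xr

def modify(ds):
    # modify ds here
    return ds

# The docs' open_mfdataset pseudocode with open_dataset swapped
# for open_zarr:
store_paths = ["/path/a.zarr", "/path/b.zarr"]   # hypothetical stores
open_tasks = [dask.delayed(xr.open_zarr)(p) for p in store_paths]
tasks = [dask.delayed(modify)(t) for t in open_tasks]
datasets = dask.compute(tasks)[0]   # a list of xarray.Datasets
combined = xr.combine_nested(datasets, concat_dim="time")  # for example
```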