issue_comments
9 rows where issue = 466994138 sorted by updated_at descending
id | html_url | issue_url | node_id | user | created_at | updated_at | author_association | body | reactions | performed_via_github_app | issue |
---|---|---|---|---|---|---|---|---|---|---|---|
1092436439 | https://github.com/pydata/xarray/issues/3096#issuecomment-1092436439 | https://api.github.com/repos/pydata/xarray/issues/3096 | IC_kwDOAMm_X85BHUHX | max-sixty 5635139 | 2022-04-08T04:43:14Z | 2022-04-08T04:43:14Z | MEMBER | I think this was closed by https://github.com/pydata/xarray/pull/4035 (which I'm going to start using shortly!), so I'll close this, but feel free to reopen if I missed something. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Support parallel writes to zarr store 466994138 | |
730446943 | https://github.com/pydata/xarray/issues/3096#issuecomment-730446943 | https://api.github.com/repos/pydata/xarray/issues/3096 | MDEyOklzc3VlQ29tbWVudDczMDQ0Njk0Mw== | rabernat 1197350 | 2020-11-19T15:22:41Z | 2020-11-19T15:22:41Z | MEMBER | Just a note that #4035 provides a new way to do parallel writing to zarr stores. @VincentDehaye & @cdibble, would you be willing to test this out and see if it meets your needs? |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Support parallel writes to zarr store 466994138 | |
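For context, the region-based writing that #4035 introduced looks roughly like the sketch below (a hedged illustration assuming xarray >= 0.16.2; the store path, sizes, and chunking are hypothetical): first lay out the whole store with compute=False, then let each worker fill a disjoint, chunk-aligned region.

```python
import dask.array as da
import numpy as np
import pandas as pd
import xarray as xr

PATH = "example_store.zarr"  # hypothetical path; an fsspec/S3 mapping also works

# Step 1: write only the store layout (dimensions, coordinates, dtypes).
# Dask-backed data plus compute=False means no data chunks are stored yet.
time = pd.date_range("2020-01-01", periods=365)
template = xr.Dataset(
    {"temperature": (("time", "x"), da.zeros((365, 10), chunks=(73, 10)))},
    coords={"time": time, "x": np.arange(10)},
)
template.to_zarr(PATH, mode="w", compute=False)

# Step 2: each worker writes its own chunk-aligned slab of the "time" axis.
def write_block(start, stop):
    block = xr.Dataset(
        {"temperature": (("time", "x"), np.random.rand(stop - start, 10))},
        coords={"time": time[start:stop]},  # only coords sharing the region dims
    )
    block.to_zarr(PATH, region={"time": slice(start, stop)})

for start in range(0, 365, 73):  # in practice these calls run on separate workers
    write_block(start, start + 73)
```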
672978363 | https://github.com/pydata/xarray/issues/3096#issuecomment-672978363 | https://api.github.com/repos/pydata/xarray/issues/3096 | MDEyOklzc3VlQ29tbWVudDY3Mjk3ODM2Mw== | cdibble 8380659 | 2020-08-12T16:26:46Z | 2020-08-12T16:26:46Z | NONE | Hi All, Thanks for all of your great work, support, and discussion on these and other pages. I very much appreciate it, as I am working with Xarray and Zarr quite a lot for large geospatial data storage and manipulation. I wanted to add a note to this discussion that I have had success using Zarr's built-in synchronizer (locking) support. It does seem that providing explicit chunking rules as you have mentioned above (or using the Zarr encoding argument, which I haven't tried but I think is another option) is a great way to handle this, and it likely outperforms the locking approach (just a guess; I would love to hear from others about this). But the locks are pretty easy to implement and seem to have helped me avoid the problems related to race conditions with Zarr. For the sake of completeness, here is a simple example of how you might do this:
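A minimal sketch of such a lock-based setup, assuming zarr's ProcessSynchronizer and the synchronizer argument of to_zarr; the store and lock-file paths, variable names, and shapes are hypothetical:

```python
import numpy as np
import xarray as xr
import zarr

# File-based locks shared by every writing process (hypothetical path).
sync = zarr.ProcessSynchronizer("example.sync")
store = "example.zarr"  # hypothetical target store

def make_batch(t0):
    # Stand-in for whatever produces each batch of data to be written.
    return xr.Dataset(
        {"temperature": (("time", "x"), np.random.rand(10, 5))},
        coords={"time": np.arange(t0, t0 + 10), "x": np.arange(5)},
    )

# The first write creates the store; subsequent writes append along "time",
# with the synchronizer serializing access to any chunk touched by two writers.
make_batch(0).to_zarr(store, mode="w", synchronizer=sync)
make_batch(10).to_zarr(store, mode="a", append_dim="time", synchronizer=sync)
```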
I would be happy to discuss further and am very much open to critique, instruction, etc. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Support parallel writes to zarr store 466994138 | |
549730000 | https://github.com/pydata/xarray/issues/3096#issuecomment-549730000 | https://api.github.com/repos/pydata/xarray/issues/3096 | MDEyOklzc3VlQ29tbWVudDU0OTczMDAwMA== | VincentDehaye 18643609 | 2019-11-05T09:06:27Z | 2019-11-05T09:06:27Z | NONE | Coming back on this issue in order not to leave it inactive and to provide some feedback to the community. The problem with the open_mfdataset solution was that the lazy open of a single lead-time dataset was still taking 150 MB in memory, leading to a minimum memory requirement of 150 MB × 209 = 31.35 GB. When I tried with a bigger machine (64 GB of memory), I was then blocked by the rechunking, which exceeded the machine's resources and made the script crash. So we ended up using a dask cluster, which solved the concurrency and resource limitations. My second use case (https://github.com/pydata/xarray/issues/3096#issuecomment-516043946) still remains, though; I am wondering whether it matches the intended use of zarr and whether we want to do something about it, in which case I can open a separate issue documenting it. All in all, I would say my original problem is no longer relevant: either you do it with open_mfdataset on a single machine as proposed by @rabernat, in which case you just need enough memory (and probably much more if you need to rechunk), or you do it with a dask cluster, which is the solution we chose. |
{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Support parallel writes to zarr store 466994138 | |
516047812 | https://github.com/pydata/xarray/issues/3096#issuecomment-516047812 | https://api.github.com/repos/pydata/xarray/issues/3096 | MDEyOklzc3VlQ29tbWVudDUxNjA0NzgxMg== | rabernat 1197350 | 2019-07-29T15:47:13Z | 2019-07-29T15:47:13Z | MEMBER | @VincentDehaye - we are eager to help you. But it is difficult to hit a moving target. I would like to politely suggest that we keep this issue on topic: making sure that parallel append to a zarr store works as expected. Your latest post revealed that you did not try our suggested resolution (use open_mfdataset and then a single to_zarr call). I recommend you open a new, separate issue related to "storing different variables being indexed by the same dimension". |
{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Support parallel writes to zarr store 466994138 | |
516043946 | https://github.com/pydata/xarray/issues/3096#issuecomment-516043946 | https://api.github.com/repos/pydata/xarray/issues/3096 | MDEyOklzc3VlQ29tbWVudDUxNjA0Mzk0Ng== | VincentDehaye 18643609 | 2019-07-29T15:37:27Z | 2019-07-29T15:38:31Z | NONE | Coming back on this issue (I still haven't had time to try the open_mfdataset approach), I have another use case where I would like to store different variables indexed by the same dimension, but not all available at the same moment. For example, I would have variables V1 and V2 indexed on dimension D1. V1 would be available at time T, and I would like to store it in my S3 bucket at that moment, but V2 would only be available at time T+1. In this case, I would like to be able to save the values of V2 at time T+1, leaving the missing V2 values filled with the fill_value specified in the metadata between T and T+1. What actually happens is that you can append such data, but then if you want to open the resulting zarr, the open_zarr function needs to be given V2 as the value of its drop_variables argument, otherwise you get the error shown in my original post. However, since open_zarr is also called when appending (cf. the error trace in my original post) and in that case you cannot provide this argument, the next append attempts will fail, preventing you from ever appending the values of V2. Your dataset is now frozen. Am I misusing the functionality, or do you know of any workaround using xarray that does not require coding everything myself (for optimization reasons)? |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Support parallel writes to zarr store 466994138 | |
511174605 | https://github.com/pydata/xarray/issues/3096#issuecomment-511174605 | https://api.github.com/repos/pydata/xarray/issues/3096 | MDEyOklzc3VlQ29tbWVudDUxMTE3NDYwNQ== | shoyer 1217238 | 2019-07-14T05:28:22Z | 2019-07-14T05:28:43Z | MEMBER |
Yes, this is the suggested workflow! It is definitely possible to create a zarr dataset and then write to it in parallel with a bunch of processes, but not via xarray's to_zarr method. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Support parallel writes to zarr store 466994138 | |
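A minimal sketch of the workflow described above, assuming the zarr 2.x API is used directly rather than to_zarr (the path, shape, and chunking are hypothetical): create the array layout once, then let each process fill disjoint, chunk-aligned slices.

```python
import numpy as np
import zarr
from multiprocessing import Pool

PATH = "example.zarr"  # hypothetical local path

def write_slice(i):
    # Each process opens the existing array read/write and fills a disjoint,
    # chunk-aligned block of rows; no lock is needed because no two writers
    # ever touch the same chunk.
    arr = zarr.open(PATH, mode="r+")
    arr[i * 100:(i + 1) * 100, :] = np.random.rand(100, 100)

if __name__ == "__main__":
    # Create the store layout up front in a single process.
    zarr.open(PATH, mode="w", shape=(1000, 100), chunks=(100, 100), dtype="f8")
    with Pool(4) as pool:
        pool.map(write_slice, range(10))
```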
510816294 | https://github.com/pydata/xarray/issues/3096#issuecomment-510816294 | https://api.github.com/repos/pydata/xarray/issues/3096 | MDEyOklzc3VlQ29tbWVudDUxMDgxNjI5NA== | VincentDehaye 18643609 | 2019-07-12T09:19:41Z | 2019-07-12T09:19:41Z | NONE |
Hi @rabernat, thank you for your quick answer. I edited my MCVE so that you can reproduce the error (as long as you have access to an S3 bucket). I actually forgot about With regards to |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Support parallel writes to zarr store 466994138 | |
510659320 | https://github.com/pydata/xarray/issues/3096#issuecomment-510659320 | https://api.github.com/repos/pydata/xarray/issues/3096 | MDEyOklzc3VlQ29tbWVudDUxMDY1OTMyMA== | rabernat 1197350 | 2019-07-11T21:23:33Z | 2019-07-11T21:23:33Z | MEMBER | Hi @VincentDehaye. Thanks for being an early adopter! We really appreciate your feedback. I'm sorry it didn't work as expected; we are in really new territory with this feature. I'm a bit confused about why you are using the multiprocessing module here. The recommended way of parallelizing xarray operations is via the built-in dask support, and there are no guarantees that multiprocessing the way you're doing it will work right. When we talk about parallel append, we are always talking about dask. Your MCVE is not especially helpful for debugging because the two key functions (make_xarray_dataset and upload_to_s3) are not shown. Could you try simplifying your example a bit? I know it is hard when the cloud is involved, but try to let us see more of what is happening under the hood. If you are creating a dataset for the first time, you probably don't want append. You want to open everything as a single dataset (for example with open_mfdataset) and write it to zarr with a single to_zarr call.
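A minimal sketch of that pattern (the file pattern and target path are hypothetical):

```python
import xarray as xr

# Open every source file lazily as one dask-backed dataset, then write the
# whole thing to zarr in a single call.
ds = xr.open_mfdataset("forecasts/*.nc", combine="by_coords", parallel=True)
ds.to_zarr("forecasts.zarr", mode="w")
```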
If you are using a dask cluster, this will automatically parallelize everything. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Support parallel writes to zarr store 466994138 |
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user] ON [issue_comments] ([user]);