issue_comments
7 rows where issue = 345715825 and user = 12278765 sorted by updated_at descending
id | html_url | issue_url | node_id | user | created_at | updated_at ▲ | author_association | body | reactions | performed_via_github_app | issue |
---|---|---|---|---|---|---|---|---|---|---|---|
415450958 | https://github.com/pydata/xarray/issues/2329#issuecomment-415450958 | https://api.github.com/repos/pydata/xarray/issues/2329 | MDEyOklzc3VlQ29tbWVudDQxNTQ1MDk1OA== | lrntct 12278765 | 2018-08-23T15:02:57Z | 2018-08-23T15:02:57Z | NONE | It seems that I managed to get something working as it should.
I first load my monthly grib files with iris, convert to xarray, then write to zarr. This uses all the CPU cores, but loads the full array into memory. Since the individual arrays are relatively small, that is not an issue.
Then I load the monthly zarr stores with xarray, concatenate them with Using that big zarr storage, plotting a map of the mean values along the time axis takes around 15min, uses all the cores and around 24GB of RAM. The strange part is: I think I tried that before and it was not working... |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Out-of-core processing with dask not working properly? 345715825 | |
415005804 | https://github.com/pydata/xarray/issues/2329#issuecomment-415005804 | https://api.github.com/repos/pydata/xarray/issues/2329 | MDEyOklzc3VlQ29tbWVudDQxNTAwNTgwNA== | lrntct 12278765 | 2018-08-22T11:51:36Z | 2018-08-22T11:51:36Z | NONE | The dask task graph seems right (mean on the time dimension, with a lower number of chunks to make the visualisation practical): If I understand correctly, the 'getter' tasks do the actual reading of the file, but in practice they do not seem to run in parallel. As for the zarr writing part, I do not know how to check the task graph. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Out-of-core processing with dask not working properly? 345715825 | |
409565674 | https://github.com/pydata/xarray/issues/2329#issuecomment-409565674 | https://api.github.com/repos/pydata/xarray/issues/2329 | MDEyOklzc3VlQ29tbWVudDQwOTU2NTY3NA== | lrntct 12278765 | 2018-08-01T12:58:31Z | 2018-08-01T12:58:31Z | NONE | I ran a comparison of the impact of chunk sizes with a profiler:
I am not sure if the profiler results are useful:
In the case of chunks on I don't know if this helps, but it would be great to have a solution or workaround for that. Surely I am not the only one working with datasets of that size? What would be the best practice in my case? |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Out-of-core processing with dask not working properly? 345715825 | |
409276937 | https://github.com/pydata/xarray/issues/2329#issuecomment-409276937 | https://api.github.com/repos/pydata/xarray/issues/2329 | MDEyOklzc3VlQ29tbWVudDQwOTI3NjkzNw== | lrntct 12278765 | 2018-07-31T16:08:33Z | 2018-07-31T16:08:33Z | NONE | I did some tests with my big netcdf. The chunking indeed makes a difference.
```
chunks = {'time': 'auto', 'lat': 'auto', 'lon': 'auto'}
ds = xr.open_dataset('era5_precip.nc', chunks=chunks)
ds.sum().load()

real 161m37.119s
user 33m9.720s
sys 63m47.696s

chunks = {'time': 1}
ds = xr.open_dataset('era5_precip.nc', chunks=chunks)
print(ds.sum().load())

real 109m55.839s
user 303m40.665s
sys 451m30.788s
```
I'll do some more tests with the calculation of the mean on the time axis, it might be more representative of what I want to do. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Out-of-core processing with dask not working properly? 345715825 | |
409167123 | https://github.com/pydata/xarray/issues/2329#issuecomment-409167123 | https://api.github.com/repos/pydata/xarray/issues/2329 | MDEyOklzc3VlQ29tbWVudDQwOTE2NzEyMw== | lrntct 12278765 | 2018-07-31T10:04:10Z | 2018-07-31T10:04:41Z | NONE | @fmaussion |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Out-of-core processing with dask not working properly? 345715825 | |
409157118 | https://github.com/pydata/xarray/issues/2329#issuecomment-409157118 | https://api.github.com/repos/pydata/xarray/issues/2329 | MDEyOklzc3VlQ29tbWVudDQwOTE1NzExOA== | lrntct 12278765 | 2018-07-31T09:28:48Z | 2018-07-31T09:28:48Z | NONE | @rabernat I tried to do the sum. I have the same issue. The process just seems to read the disk endlessly, without even writing to the RAM: I tried to lower the chunk size, but it doesn't seem to change anything. Without chunk, I logically get a I plan to do time-series analysis, so I thought that having contiguous chunks in time would be more efficient. The netcdf was created with |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Out-of-core processing with dask not working properly? 345715825 | |
408894639 | https://github.com/pydata/xarray/issues/2329#issuecomment-408894639 | https://api.github.com/repos/pydata/xarray/issues/2329 | MDEyOklzc3VlQ29tbWVudDQwODg5NDYzOQ== | lrntct 12278765 | 2018-07-30T15:01:27Z | 2018-07-30T15:10:43Z | NONE | @rabernat Thanks for your answer. I have one big NetCDF of ~500GB.
What I have changed:
- Run in a Jupyter notebook with distributed to get the dashboard
- Change the chunks to Now when I run The Dask dashboard looks like this: After a while I get warnings:
Is this the expected behaviour? I was somehow expecting that each worker would read a chunk and then write it to zarr, in a streaming fashion. This does not seem to be the case. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Out-of-core processing with dask not working properly? 345715825 |
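
The comments above revolve around computing a mean along the time axis without loading the whole array into memory. As a rough illustration of that idea only, here is a minimal plain-Python sketch of a chunked, streaming time mean. It is not how dask or xarray implement the reduction; `fake_read_step`, the 2x3 grid, and the chunk sizes are all hypothetical stand-ins for reading one time step from disk.

```python
from typing import Callable, Iterator, List

Grid = List[List[float]]  # one 2D (lat, lon) field for a single time step


def iter_time_chunks(n_time: int, chunk: int) -> Iterator[range]:
    """Yield ranges of time indices, at most `chunk` steps at a time."""
    for start in range(0, n_time, chunk):
        yield range(start, min(start + chunk, n_time))


def streaming_time_mean(read_step: Callable[[int], Grid],
                        n_time: int, chunk: int) -> Grid:
    """Mean over time via a running sum; only one chunk is held at once."""
    if n_time <= 0:
        raise ValueError("n_time must be positive")
    total: Grid = []
    for steps in iter_time_chunks(n_time, chunk):
        # Stands in for reading one chunk of time steps from disk.
        block = [read_step(t) for t in steps]
        for field in block:
            if not total:
                total = [[0.0] * len(row) for row in field]
            for i, row in enumerate(field):
                for j, value in enumerate(row):
                    total[i][j] += value
    return [[v / n_time for v in row] for row in total]


def fake_read_step(t: int) -> Grid:
    # Hypothetical data source: a 2x3 (lat, lon) grid for time step t.
    return [[float(t + i + j) for j in range(3)] for i in range(2)]


mean_map = streaming_time_mean(fake_read_step, n_time=10, chunk=4)
```

The result does not depend on the chunk size; only the peak memory use and the number of reads do, which is the trade-off the timing comparison in the comments is probing.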
CREATE TABLE [issue_comments] ( [html_url] TEXT, [issue_url] TEXT, [id] INTEGER PRIMARY KEY, [node_id] TEXT, [user] INTEGER REFERENCES [users]([id]), [created_at] TEXT, [updated_at] TEXT, [author_association] TEXT, [body] TEXT, [reactions] TEXT, [performed_via_github_app] TEXT, [issue] INTEGER REFERENCES [issues]([id]) ); CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]); CREATE INDEX [idx_issue_comments_user] ON [issue_comments] ([user]);