issue_comments
5 rows where issue = 233350060 and user = 12229877 sorted by updated_at descending
Issue: If a NetCDF file is chunked on disk, open it with compatible dask chunks (id 233350060) · 5 comments

308928211 · Zac-HD (12229877) · CONTRIBUTOR
created 2017-06-16T04:11:10Z · updated 2018-02-10T07:13:11Z
https://github.com/pydata/xarray/issues/1440#issuecomment-308928211

@matt-long, I think that's a separate issue. Please open a new issue, including a link to data that will let us reproduce the problem.

@jhamman - [updated] I was always keen to work on this if I could make time, but I have since changed jobs. However, I'd still be happy to help anyone who wants to work on it with design and review. I definitely want to preserve the exact present semantics of dict arguments (so users have exact control, with a warning if the requested chunks are incompatible with the chunks on disk). I may interpret int arguments as a (deprecated) hint though, as that's what they're mostly used for, and will add a fairly limited hints API to start with - more advanced users can just specify exact chunks.

reactions: none
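
As a rough sketch of the dict semantics described above - exact control, plus a warning when the requested chunks don't line up with the chunks on disk - something like the following. The helper name and the {dimension: length} representation are illustrative assumptions, not xarray's implementation:

```python
import warnings

def warn_if_incompatible(requested, on_disk):
    """Warn when requested dask chunk lengths would split netCDF chunks.

    Both arguments are hypothetical {dimension name: chunk length} dicts.
    A requested length that is not a whole multiple of the on-disk length
    means each disk chunk may be read by more than one dask task.
    """
    for dim, req in requested.items():
        disk = on_disk.get(dim)
        if disk is not None and req % disk != 0:
            warnings.warn(
                f"chunk length {req} along {dim!r} is not a multiple of "
                f"the on-disk chunk length {disk}; disk chunks will be "
                "read more than once"
            )

# warn_if_incompatible({'time': 365}, {'time': 100})  # -> warns
```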

309353545 · Zac-HD (12229877) · CONTRIBUTOR
created 2017-06-19T06:50:57Z · updated 2017-07-14T02:35:04Z
https://github.com/pydata/xarray/issues/1440#issuecomment-309353545

I've just had a meeting at NCI which has helped clarify what I'm trying to do and how to tell if it's working. This comment is mostly for my own notes, and public for anyone interested. (I'll refer to dask chunks as 'blocks', as in 'blocked algorithms', and to netcdf chunks in a file as 'chunks', to avoid confusion.)

The approximate algorithm I'm thinking about is outlined in this comment above. Considerations, in rough order of performance impact, are:

Bottom line: I could come up with something pretty quickly, but would prefer to take a little longer to write and explore some benchmarks.

reactions: none
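
The considerations themselves are elided above, but the core constraint running through the thread - never split a single on-disk chunk across multiple blocks - suggests picking block lengths as whole multiples of the chunk lengths. A minimal per-dimension sketch, assuming a simple target block length (the real algorithm would budget across all dimensions at once):

```python
def block_length(dim_len, chunk_len, target):
    """Largest whole multiple of the on-disk chunk length that fits
    within `target`, capped at the dimension length.

    Illustrative only: `target` stands in for whatever per-dimension
    budget the full heuristic would derive.
    """
    if chunk_len >= target:
        # A single disk chunk already meets the budget; never split it.
        return min(chunk_len, dim_len)
    return min((target // chunk_len) * chunk_len, dim_len)

# block_length(7305, 5, 1000) == 1000; block_length(7305, 512, 1000) == 512
```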

307074048 · Zac-HD (12229877) · CONTRIBUTOR
created 2017-06-08T11:16:30Z · updated 2017-06-08T11:16:30Z
https://github.com/pydata/xarray/issues/1440#issuecomment-307074048

🎉 My view is actually that anyone who can beat the default heuristic should just specify their chunks - you'd already need a good sense for the data and your computation (and the heuristic!). IMO, the few cases where tuning is desirable - but manual chunks are impractical - don't justify adding yet another kwarg to the fairly busy interfaces.

reactions: 👍 1

307002325 · Zac-HD (12229877) · CONTRIBUTOR
created 2017-06-08T05:28:04Z · updated 2017-06-08T05:28:04Z
https://github.com/pydata/xarray/issues/1440#issuecomment-307002325

I love a real-world example 😄 This sounds pretty similar to how I'm thinking of doing it, with a few caveats.

Taking a step back for a moment: chunks are great for avoiding out-of-memory errors, for faster processing of reorderable operations, and for efficient indexing. The overhead is not great when the data or the chunks are small; it's bad when a single on-disk chunk spans multiple dask chunks, and very bad when a dask chunk includes several files. (All of these are generalisations with pathological cases, but IMO good enough to build some heuristics on.)

With that in mind, here's how I'd decide whether to use the heuristic:

Having decided to use a heuristic, we know the array shape and dimensions, the chunk shape (if any), and the hint (if any):

It's probably a good idea to constrain this further, so that the ratio of chunk edge lengths across dimensions does not exceed the greater of 100:1 or four times the corresponding ratio of the chunks on disk (I don't have universal profiling to back this up, but it's always worked well for me). This will mitigate the potentially very large effects of dimension order, especially for unchunked files or files with large chunks. For datasets (as opposed to arrays), I'd calculate chunks once for the largest dtype and just reuse that shape.

reactions: 👍 1
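
The edge-ratio constraint in the last paragraph is concrete enough to sketch directly. Function names are hypothetical, and chunk shapes are plain tuples of lengths:

```python
def elongation(shape):
    """Ratio of the longest edge of a chunk shape to its shortest."""
    return max(shape) / min(shape)

def within_ratio_limit(block_shape, disk_chunk_shape):
    """Check the constraint described above: a candidate dask block
    should be no more elongated than the greater of 100:1 or four
    times the elongation of the chunks on disk."""
    limit = max(100.0, 4.0 * elongation(disk_chunk_shape))
    return elongation(block_shape) <= limit

# within_ratio_limit((1000, 5), (100, 1))  # True: 200:1 <= 400:1
```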

306688091 · Zac-HD (12229877) · CONTRIBUTOR
created 2017-06-07T05:09:06Z · updated 2017-06-07T05:09:06Z
https://github.com/pydata/xarray/issues/1440#issuecomment-306688091

This sounds like a very good idea to me 👍

I think that depends on the size of the data - a very common workflow in our group is to open some national-scale collection, select a small (MB to low GB) section, and proceed with that. At this scale we only use chunks because many of the input files are larger than memory, and shape is basically irrelevant - chunks avoid loading anything until after selecting the subset (I think this is related to #1396). It's certainly good to know the main processing dimensions though, and user-guided chunk selection heuristics could take us a long way - I actually think a dimension hint and good heuristics are likely to perform better than most users (who are not experts and have not profiled their performance).

The set notation is also very elegant, but I wonder about the interpretation. With

reactions: none
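
On the set notation: one possible reading - an assumption about the proposal under discussion, not anything xarray implements - is that a set names the dimensions to split along, leaving lengths to the heuristic, while a dict keeps its exact-length meaning:

```python
def interpret_chunks(chunks, dims):
    """Split a hypothetical `chunks` argument into (hinted dims, exact lengths).

    Sketch of one reading: a set of names is a dimension hint for the
    heuristic; a {name: length} dict keeps the current exact semantics.
    Not the real open_dataset signature.
    """
    if isinstance(chunks, set):
        unknown = chunks - set(dims)
        if unknown:
            raise ValueError(f"unknown dimensions: {sorted(unknown)}")
        return chunks, {}
    if isinstance(chunks, dict):
        return set(chunks), dict(chunks)
    raise TypeError("expected a set of dimension names or a {dim: length} dict")

# interpret_chunks({'time'}, ('time', 'y', 'x'))  # -> ({'time'}, {})
```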
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user] ON [issue_comments] ([user]);