issue_comments

5 rows where issue = 233350060 and user = 12229877 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
308928211 https://github.com/pydata/xarray/issues/1440#issuecomment-308928211 https://api.github.com/repos/pydata/xarray/issues/1440 MDEyOklzc3VlQ29tbWVudDMwODkyODIxMQ== Zac-HD 12229877 2017-06-16T04:11:10Z 2018-02-10T07:13:11Z CONTRIBUTOR

@matt-long, I think that's a separate issue. Please open a new pull request, including a link to data that will let us reproduce the problem.

@jhamman - [updated] I was always keen to work on this if I could make time, but have since changed jobs. However I'd still be happy to help anyone who wants to work on it with design and review.

I definitely want to preserve the exact present semantics of dict arguments (so users have exact control, with a warning if it's incompatible with disk chunks). I may interpret int arguments as a (deprecated) hint though, as that's what it's mostly used for, and will add a fairly limited hints API to start with - more advanced users can just specify exact chunks.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  If a NetCDF file is chunked on disk, open it with compatible dask chunks 233350060
309353545 https://github.com/pydata/xarray/issues/1440#issuecomment-309353545 https://api.github.com/repos/pydata/xarray/issues/1440 MDEyOklzc3VlQ29tbWVudDMwOTM1MzU0NQ== Zac-HD 12229877 2017-06-19T06:50:57Z 2017-07-14T02:35:04Z CONTRIBUTOR

I've just had a meeting at NCI which has helped clarify what I'm trying to do and how to tell if it's working. This comment is mostly for my own notes, and public for anyone interested. To avoid confusion, I'll refer to dask chunks as 'blocks' (as in 'blocked algorithms') and to netcdf chunks in a file as 'chunks'.

The approximate algorithm I'm thinking about is outlined in this comment above. Considerations, in rough order of performance impact, are:

  • No block should include data from multiple files (near-absolute, due to locking - though concurrent read is supported on lower levels?)
  • Contiguous access is critical in un-chunked files - reading the fastest-changing dimension in small parts murders IO performance. It may be important in chunked files too, at the level of chunks, but that's mostly down to Dask and user access patterns.
  • Each chunk, if the file is chunked, must fall entirely into a single block.
  • Blocks should usually be around 100MB, to balance scheduler overhead with memory usage of intermediate results.
  • Blocks should be of even(ish) size - if a dimension is of size 100 with chunks of 30, better to have blocks of 60 at the expense of relative shape than have blocks of 90 with one almost empty.
  • Chunks are cached by underlying libraries, so benchmarks need to clear the caches each run for valid results. Note that this affects IO but typically not decompression of chunks.
  • Contra contiguous access (above), it might be good to prevent very skewed block shapes (ie 1*X*Y or T*1*1). Possibly limiting the lowest edge length of a block (10px?), or limiting the edge ratio of a block (20:1?)
  • If a dimension hint is given (eg 'my analysis will be spatial'), making blocks larger along other dimensions will make whole-file processing more efficient, and subsetting less efficient. (Unless Dask can optimise away loading of part of a block?)
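
The "even(ish) size" consideration above can be illustrated with a minimal sketch. The function name `even_block_length` and the `max_multiple` parameter are hypothetical, and the tie-breaking rule (fewest blocks first, then the fullest final block) is an assumption chosen to reproduce the 100/30 example, not something settled in this thread.

```python
import math


def even_block_length(dim_size, chunk_len, max_multiple):
    """Pick a block length that is a whole multiple of the on-disk chunk length.

    Prefers the fewest blocks, then the most evenly filled final block.
    """
    best_key, best_block = None, None
    for m in range(1, max_multiple + 1):
        block = m * chunk_len
        n_blocks = math.ceil(dim_size / block)
        # Length of the final (possibly partial) block along this dimension.
        last = dim_size - (n_blocks - 1) * block
        key = (n_blocks, -last / block)
        if best_key is None or key < best_key:
            best_key, best_block = key, block
    return best_block


# Dimension of size 100 with on-disk chunks of 30, at most 3 chunks per block.
print(even_block_length(100, 30, 3))  # 60
```

With blocks of 60 the dimension splits into 60 + 40, whereas blocks of 90 would give 90 + 10.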

Bottom line, I could come up with something pretty quickly but would prefer to take a little longer to write and explore some benchmarks.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  If a NetCDF file is chunked on disk, open it with compatible dask chunks 233350060
307074048 https://github.com/pydata/xarray/issues/1440#issuecomment-307074048 https://api.github.com/repos/pydata/xarray/issues/1440 MDEyOklzc3VlQ29tbWVudDMwNzA3NDA0OA== Zac-HD 12229877 2017-06-08T11:16:30Z 2017-06-08T11:16:30Z CONTRIBUTOR

🎉

My view is actually that anyone who can beat the default heuristic should just specify their chunks - you'd already need a good sense for the data and your computation (and the heuristic!). IMO, the few cases where tuning is desirable - but manual chunks are impractical - don't justify adding yet another kwarg to the fairly busy interfaces.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  If a NetCDF file is chunked on disk, open it with compatible dask chunks 233350060
307002325 https://github.com/pydata/xarray/issues/1440#issuecomment-307002325 https://api.github.com/repos/pydata/xarray/issues/1440 MDEyOklzc3VlQ29tbWVudDMwNzAwMjMyNQ== Zac-HD 12229877 2017-06-08T05:28:04Z 2017-06-08T05:28:04Z CONTRIBUTOR

I love a real-world example 😄 This sounds pretty similar to how I'm thinking of doing it, with a few caveats - mostly that cate assumes that the data is 3D, has lat and lon, has a single time step, and has spatial dimensions wholly divisible by some small N. Obviously this is fine for CCI data, but not generally true of things Xarray might open.

Taking a step back for a moment, chunks are great for avoiding out-of-memory errors, faster processing of reorderable operations, and efficient indexing. The overhead is not great when data is small or chunks are small, it's bad when a single on-disk chunk is split across multiple dask chunks, and very bad when a dask chunk includes several files. (Of course all of these are generalisations with pathological cases, but IMO good enough to build some heuristics on.)

With that in mind, here's how I'd decide whether to use the heuristic:

  • If chunks is a dict, never use heuristic (always use explicit user chunks)
  • If chunks is a hint, eg set or list as discussed above, or a later proposal, always use heuristic mode - guided by the hint, of course. Files which may otherwise default to non-heuristic or non-chunking mode (eg in mfdataset) could use eg. the empty set to activate the heuristics without hints.
  • If chunks is None, and the uncompressed data is above a size threshold (eg 500MB, 1GB), use chunks given by the heuristic
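
A minimal sketch of the three rules above, assuming a hypothetical helper `use_heuristic` and a 500MB default threshold (one of the example values mentioned). How other argument types, such as ints, should be treated is not covered by these rules, so the sketch simply rejects them.

```python
def use_heuristic(chunks, uncompressed_nbytes, threshold_nbytes=500 * 2**20):
    """Decide whether heuristic chunking applies (hypothetical helper)."""
    if isinstance(chunks, dict):
        # Explicit user chunks always win.
        return False
    if isinstance(chunks, (set, frozenset, list)):
        # A hint, possibly empty, always activates the heuristic.
        return True
    if chunks is None:
        # No argument: only chunk data above the size threshold.
        return uncompressed_nbytes > threshold_nbytes
    # Assumption: anything else is outside the scope of this sketch.
    raise TypeError(f"unsupported chunks argument: {chunks!r}")


print(use_heuristic({"time": 10}, 2 * 2**30))
print(use_heuristic(set(), 10 * 2**20))
print(use_heuristic(None, 2 * 2**30))
```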

Having decided to use a heuristic, we know the array shape and dimensions, the chunk shape if any, and the hint if any:

  • Start by selecting a maximum nbytes for the chunk, eg 250 MB
  • If the total array nbytes is <= max_nbytes, use a single chunk for the whole thing; return
  • If the array is stored as (a, b, ...) chunks on disk, our dask chunks must be (m.a, n.b, ...), ie each dimension has some independent integer multiple.
  • Loop over dimensions not to chunk (per above, either those not in the set, or those in a string), adding one to the respective multiplier. Alternatively if the file is now five or fewer dask chunks across, simply divide the on-disk chunks into 4, 3, 2, or 1 dask chunks (avoiding a dimension of 1.1 chunks, etc)
  • For each step, if this would make the chunk larger than max_nbytes, return.
  • Repeat the loop for remaining dimensions.
  • If the array is not chunked on disk, increase a divisor for each dimension to chunk until the chunk size is <= max_nbytes and return.
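
The steps above can be sketched in plain Python. This is one possible reading under stated assumptions, not a definitive implementation: `heuristic_chunks`, `keep_large` (indices of the dimensions not to chunk) and the round-robin growth order are hypothetical, and the "five or fewer chunks across" shortcut is left out.

```python
import math


def heuristic_chunks(shape, itemsize, disk_chunks=None, keep_large=(),
                     max_nbytes=250 * 2**20):
    """Hypothetical helper: pick one dask chunk length per dimension."""
    ndim = len(shape)

    def nbytes(lengths):
        return itemsize * math.prod(lengths)

    if nbytes(shape) <= max_nbytes:
        return tuple(shape)  # small array: a single chunk for the whole thing

    others = [d for d in range(ndim) if d not in keep_large]

    if disk_chunks is not None:
        mult = [1] * ndim  # integer multiple of the on-disk chunk, per dimension

        def lengths(m):
            # Clip at the dimension size so a chunk never exceeds the array.
            return tuple(min(m[d] * disk_chunks[d], shape[d]) for d in range(ndim))

        # Grow the dimensions to keep large first, then the remaining ones.
        for group in (list(keep_large), others):
            growing = True
            while growing:
                growing = False
                for d in group:
                    if mult[d] * disk_chunks[d] >= shape[d]:
                        continue  # already spans the whole dimension
                    trial = list(mult)
                    trial[d] += 1
                    if nbytes(lengths(trial)) > max_nbytes:
                        return lengths(mult)  # next step would be too large
                    mult = trial
                    growing = True
        return lengths(mult)

    # Not chunked on disk: raise a divisor on the dimensions to chunk.
    div = [1] * ndim
    to_split = others or list(range(ndim))

    def split_lengths(dv):
        return tuple(math.ceil(shape[d] / dv[d]) for d in range(ndim))

    while nbytes(split_lengths(div)) > max_nbytes:
        progressed = False
        for d in to_split:
            if div[d] < shape[d]:
                div[d] += 1
                progressed = True
                if nbytes(split_lengths(div)) <= max_nbytes:
                    break
        if not progressed:
            break  # every dimension is already split down to length 1
    return split_lengths(div)


# float32 data stored as (1, 400, 400) chunks, with a spatial hint (dims 1 and 2).
print(heuristic_chunks((1000, 4000, 4000), 4,
                       disk_chunks=(1, 400, 400), keep_large=(1, 2)))
```

In the chunked case every returned length is either a whole multiple of the on-disk chunk length or the full dimension, so no on-disk chunk straddles two dask chunks.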

It's probably a good idea to constrain this further, so that the ratio of chunk edge length along dimensions should not exceed the greater of 100:1 or four times the ratio of chunks on disk (I don't have universal profiling to back this up, but it's always worked well for me). This will mitigate the potentially-very-large effects of dimension order, especially in unchunked files or large chunks.

For datasets (as opposed to arrays), I'd calculate chunks once for the largest dtype and just reuse that shape.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  If a NetCDF file is chunked on disk, open it with compatible dask chunks 233350060
306688091 https://github.com/pydata/xarray/issues/1440#issuecomment-306688091 https://api.github.com/repos/pydata/xarray/issues/1440 MDEyOklzc3VlQ29tbWVudDMwNjY4ODA5MQ== Zac-HD 12229877 2017-06-07T05:09:06Z 2017-06-07T05:09:06Z CONTRIBUTOR

I'd certainly support a warning when dask chunks do not align with the on-disk chunks.

This sounds like a very good idea to me 👍
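
A minimal sketch of such a warning, assuming both the requested dask chunks and the on-disk chunks are available as dicts mapping dimension name to chunk length. The helper names are hypothetical, and the sketch ignores the case where a dask chunk spans a whole dimension whose size is not a multiple of the on-disk chunk length.

```python
import warnings


def misaligned_dims(dask_chunks, disk_chunks):
    """Dimensions whose dask chunk length is not a whole multiple of the disk chunk."""
    return [
        dim
        for dim, length in dask_chunks.items()
        if dim in disk_chunks and length % disk_chunks[dim] != 0
    ]


def warn_if_misaligned(dask_chunks, disk_chunks):
    """Emit a UserWarning naming any misaligned dimensions (hypothetical helper)."""
    bad = misaligned_dims(dask_chunks, disk_chunks)
    if bad:
        warnings.warn(
            f"dask chunks do not align with on-disk chunks along: {', '.join(bad)}",
            UserWarning,
            stacklevel=2,
        )
    return bad


print(misaligned_dims({"time": 10, "lat": 150}, {"time": 5, "lat": 100}))  # ['lat']
```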

I think it's unavoidable that users understand how their data will be processed (e.g., whether operations will be mapped over time or space). But maybe some sort of heuristics (if not a fully automated solution) are possible.

I think that depends on the size of the data - a very common workflow in our group is to open some national-scale collection, select a small (MB to low GB) section, and proceed with that. At this scale we only use chunks because many of the input files are larger than memory, and shape is basically irrelevant - chunks avoid loading anything until after selecting the subset (I think this is related to #1396).

It's certainly good to know the main processing dimensions though, and user-guided chunk selection heuristics could take us a long way - I actually think a dimension hint and good heuristics are likely to perform better than most users (who are not experts and have not profiled their performance).

The set notation is also very elegant, but I wonder about the interpretation. With chunks=, I specify how to break up the data - and any omitted dimensions are not chunked. For the hint, I'd expect to express which dimension(s) to keep - ie {'lat', 'lon'} should indicate that my analysis is mostly spatial, rather than mostly not. Maybe we could use a string (eg time for timeseries or lat lon for spatial) instead of a set to specify large chunk dimensions?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  If a NetCDF file is chunked on disk, open it with compatible dask chunks 233350060

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);