issue_comments


7 rows where issue = 717410970 sorted by updated_at descending


id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
735372002 https://github.com/pydata/xarray/issues/4496#issuecomment-735372002 https://api.github.com/repos/pydata/xarray/issues/4496 MDEyOklzc3VlQ29tbWVudDczNTM3MjAwMg== aurghs 35919497 2020-11-29T10:29:34Z 2020-11-29T10:29:34Z COLLABORATOR

@ravwojdyla I think that currently there is no way to do this. But it would be nice to have an interface that allows defining different chunks for each variable. The main problem that I see in implementing that is keeping the `xr.open_dataset(... chunks=)`, `ds.chunk` and `ds.chunks` interfaces backwards compatible. A new issue for that would probably be better, since this refactor is already a little bit tricky and your proposal could be implemented separately.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible backends - Harmonise zarr chunking with other backends chunking 717410970
732486436 https://github.com/pydata/xarray/issues/4496#issuecomment-732486436 https://api.github.com/repos/pydata/xarray/issues/4496 MDEyOklzc3VlQ29tbWVudDczMjQ4NjQzNg== ravwojdyla 1419010 2020-11-23T23:31:25Z 2020-11-23T23:31:39Z NONE

Hi. I'm trying to find an issue that is closest to the problem that I have, and this seems to be the best one, and most related.

Say I have a zarr dataset with multiple variables Foo, Bar and Baz (and potentially many more) and 2 dimensions, x and y (potentially more). Foo and Bar are large 2d arrays with dims x, y; Baz is a relatively small 1d array with dim y. I would like to read that dataset with xarray and increase the chunk size from the native zarr chunk size for x and y, but only for Foo and Bar; I would like to keep the native chunking for Baz. As far as I understand, currently I would do that with the chunks parameter to open_dataset/open_zarr, but if I pass, say, dict(x=N, y=M), that changes the chunking for all variables that use those dimensions, which isn't exactly what I need: I need it changed only for Foo and Bar. Is there a way to do that? Should that be part of the "harmonisation"? One could imagine that xarray could accept a dict of dicts akin to {var: {dim: chunk_spec}} to specify chunking for specific variables.

Note that rechunking after reading is not what I want; I would like to specify the chunking at read time.

Let me know if you would prefer me to open a completely new issue for this.
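
To make the {var: {dim: chunk_spec}} idea concrete, here is a minimal sketch of how such a mapping might be expanded per variable. The helper name `resolve_per_variable_chunks` and its arguments are hypothetical and not part of xarray; a value of None is assumed to mean "keep the native chunking".

```python
from typing import Dict, Mapping, Tuple, Union

ChunkSpec = Union[int, str, None]


def resolve_per_variable_chunks(
    var_dims: Mapping[str, Tuple[str, ...]],
    per_var_chunks: Mapping[str, Mapping[str, ChunkSpec]],
) -> Dict[str, Dict[str, ChunkSpec]]:
    """Expand a {var: {dim: chunk_spec}} mapping to every variable.

    Hypothetical helper, not part of xarray. Dimensions that are not
    mentioned for a variable map to None ("keep native chunking").
    """
    resolved: Dict[str, Dict[str, ChunkSpec]] = {}
    for name, dims in var_dims.items():
        requested = per_var_chunks.get(name, {})
        unknown = set(requested) - set(dims)
        if unknown:
            raise ValueError(f"{name!r} has no dimension(s) {sorted(unknown)}")
        resolved[name] = {dim: requested.get(dim) for dim in dims}
    return resolved


# Foo and Bar get larger chunks; Baz is left untouched.
var_dims = {"Foo": ("x", "y"), "Bar": ("x", "y"), "Baz": ("y",)}
spec = {"Foo": {"x": 1000, "y": 500}, "Bar": {"x": 1000, "y": 500}}
resolved = resolve_per_variable_chunks(var_dims, spec)
```

With this shape, Baz resolves to {"y": None} even though Foo and Bar change their chunking along y, which is the behaviour a single dict(x=N, y=M) cannot express.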

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible backends - Harmonise zarr chunking with other backends chunking 717410970
721590240 https://github.com/pydata/xarray/issues/4496#issuecomment-721590240 https://api.github.com/repos/pydata/xarray/issues/4496 MDEyOklzc3VlQ29tbWVudDcyMTU5MDI0MA== aurghs 35919497 2020-11-04T08:35:08Z 2020-11-04T09:22:01Z COLLABORATOR

@weiji14 Thank you very much for your feedback. I think we should also align xr.open_mfdataset. With engine="zarr" and chunks=-1, xr.open_dataset also emits a UserWarning, which I think should be removed.

For the future, we could consider using the dask function dask.array.core.normalize_chunks (https://docs.dask.org/en/latest/array-api.html#dask.array.core.normalize_chunks) with the previous_chunks argument (see the comment at https://github.com/pydata/xarray/pull/2530#discussion_r247352940). It could be particularly useful for (re-)chunking that takes into account the previous chunks or the on-disk chunks, especially if the on-disk chunks are small.
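
To illustrate why the on-disk chunks matter, here is a simplified sketch that picks a chunk size close to a target while staying a whole multiple of the on-disk chunk. This is not dask's normalize_chunks algorithm; the function name `align_to_disk_chunks` and the round-down rule are assumptions made only for illustration.

```python
def align_to_disk_chunks(target: int, disk_chunk: int, dim_size: int) -> int:
    """Return a chunk size near `target` that is a multiple of `disk_chunk`.

    Hypothetical helper: rounds the target down to a whole number of
    on-disk chunks (at least one) and never exceeds the dimension size.
    """
    if target <= 0 or disk_chunk <= 0 or dim_size <= 0:
        raise ValueError("sizes must be positive")
    multiple = max(1, target // disk_chunk)
    return min(multiple * disk_chunk, dim_size)


# Small on-disk chunks of 64 elements, a requested chunk of about 1000.
aligned = align_to_disk_chunks(target=1000, disk_chunk=64, dim_size=10_000)
```

Keeping in-memory chunks aligned with the on-disk ones means each stored chunk is read by exactly one task instead of being split across several.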

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible backends - Harmonise zarr chunking with other backends chunking 717410970
721466404 https://github.com/pydata/xarray/issues/4496#issuecomment-721466404 https://api.github.com/repos/pydata/xarray/issues/4496 MDEyOklzc3VlQ29tbWVudDcyMTQ2NjQwNA== weiji14 23487320 2020-11-04T01:47:30Z 2020-11-04T01:49:39Z CONTRIBUTOR

Just a general comment on the xr.open_dataset(engine="zarr") part: I prefer to keep or reduce the number of chunks= options (i.e. Option 1) rather than add another chunks="encoded" option.

For those who are confused, this is the current state of xr.open_mfdataset (correct me if I'm wrong):

| :arrow_down: engine\chunk :arrow_right: | None (default) | 'auto' | {} | -1 |
|---|---|---|---|---|
| None (i.e. default for NetCDF) | np.ndarray | dask.Array (produces original chunks as in NetCDF obj??) | dask.Array (rechunked into 1 chunk) | dask.Array (rechunked into 1 chunk) |
| zarr | np.ndarray | dask.Array (original chunks as in Zarr obj) | dask.Array (original chunks as in Zarr obj) | dask.Array (rechunked into 1 chunk + UserWarning) |

Sample code to test (run in jupyter notebook to see the dask chunk visual):

```python
import xarray as xr
import fsspec

# Opening NetCDF
dataset: xr.Dataset = xr.open_dataset(
    "http://thredds.ucar.edu/thredds/dodsC/grib/NCEP/HRRR/CONUS_2p5km/Best", chunks={}
)
dataset.Temperature_height_above_ground.data

# Opening Zarr
zstore = fsspec.get_mapper(
    url="gs://cmip6/CMIP/NCAR/CESM2/historical/r9i1p1f1/Amon/tas/gn/"
)
dataset: xr.Dataset = xr.open_dataset(
    filename_or_obj=zstore,
    engine="zarr",
    chunks={},
    backend_kwargs=dict(consolidated=True),
)
dataset.tas.data
```
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible backends - Harmonise zarr chunking with other backends chunking 717410970
720785384 https://github.com/pydata/xarray/issues/4496#issuecomment-720785384 https://api.github.com/repos/pydata/xarray/issues/4496 MDEyOklzc3VlQ29tbWVudDcyMDc4NTM4NA== aurghs 35919497 2020-11-02T23:32:48Z 2020-11-03T09:28:48Z COLLABORATOR

I think we can keep talking here about the xarray chunking interface.

It seems that the chunking interface is a tricky problem in xarray. Several already-implemented interfaces are involved:

- dask: da.rechunk, da.from_array
- xarray: xr.open_dataset
- xarray: ds.chunk
- xarray-zarr: xr.open_dataset(engine="zarr") (≈ xr.open_zarr)

They are similar, but there are some inconsistencies.

dask

The allowed values for chunking in dask are:

- dictionary (or tuple)
- integers > 0
- -1: no chunking (along this dimension)
- auto: allow the chunking (in this dimension) to accommodate ideal chunk sizes (default 128MiB)

The allowed values in the dictionary are: -1, auto, None (no change to the chunking along this dimension).

Note: None isn't supported outside the dictionary.

Note: If chunking along some dimension is not specified then the chunking along this dimension will not change (e.g. {} is equivalent to {0: None}).

xarray: xr.open_dataset for all the engines != "zarr"

It works as dask but also None is supported. If chunk is None then it doesn't use dask at all.

xarray: ds.chunk

It works as dask but also None is supported. None is equivalent to a dictionary with all values None (and equivalent to the empty dictionary).

xarray: xr.open_dataset(engine="zarr")

It works as dask except for:

- None is supported. If chunk is None then it doesn't use dask at all.
- If chunking along some dimension is not specified then encoded chunks are used.
- auto is equivalent to the empty dictionary, encoded chunks are used.
- auto inside the dictionary is passed on to dask and behaves as in dask.

Points to be discussed:

1) auto and {}

The main problem is how to make dask and xarray-zarr consistent.

Option 1

Maybe the encoded chunking provided by the backend can be seen just as the current on-disk data chunking. According to the dask interface, if the chunks for some dimension in a dictionary are None or not defined, then the current chunking along that dimension doesn't change. From this perspective, we would have:

- with auto it uses dask auto-chunking.
- with -1 it uses dask but no chunking.
- with {} it uses the backend encoded chunks (when available) for on-disk data (xr.open_dataset) and the current chunking for already opened datasets (ds.chunk)

Note: ds.chunk behavior would be unchanged.

Note: xr.open_dataset would be unchanged, except for engine="zarr", since currently the var.encodings["chunks"] is defined only by zarr.
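
The Option 1 rules can be sketched as a small resolver. This is only an illustration under stated assumptions: `resolve_option1` and its arguments are hypothetical names, and the fallback to a single chunk (-1) when the backend reports no encoded chunks is an assumption, not something decided in this thread.

```python
from typing import Dict, Mapping, Optional, Tuple, Union

ChunkValue = Union[int, str, None]
ChunkArg = Union[None, int, str, Mapping[str, ChunkValue]]


def resolve_option1(
    chunks: ChunkArg,
    dims: Tuple[str, ...],
    encoded: Optional[Mapping[str, int]] = None,
) -> Optional[Dict[str, Union[int, str]]]:
    """Per-dimension chunk spec under Option 1 (hypothetical helper).

    Returns None when dask should not be used at all.
    """
    if chunks is None:
        return None  # plain NumPy arrays, no dask
    encoded = encoded or {}
    if isinstance(chunks, (int, str)):
        # A scalar such as "auto" or -1 applies to every dimension.
        return {dim: chunks for dim in dims}
    resolved: Dict[str, Union[int, str]] = {}
    for dim in dims:
        value = chunks.get(dim)
        if value is None:
            # Missing or None: keep the on-disk chunking when the backend
            # reports it; otherwise assume a single chunk (-1).
            value = encoded.get(dim, -1)
        resolved[dim] = value
    return resolved


on_disk = {"x": 100, "y": 200}
from_empty = resolve_option1({}, ("x", "y"), on_disk)
from_partial = resolve_option1({"x": 10}, ("x", "y"), on_disk)
```

Here {} resolves to the encoded chunks for every dimension, while a partial dictionary overrides only the dimensions it names.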

Option 2

We could use a different new value for the encoded chunks (e.g. encoded, TBC). Something like:

- open_dataset(chunks="encoded")
- open_dataset(chunks={"x": "encoded", "y": 10,...})

Both expressions could be supported.

cons:

- chunks="encoded": with zarr the user probably always needs to specify that the encoded chunks should be used.
- chunks={"x": "encoded", ...}: the user must specify explicitly in the dictionary which dimensions should be chunked with the encoded chunks, which is very inconvenient (but is it really used? @weiji14 do you have some idea about it?).
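
For comparison, Option 2 could be sketched as follows. The "encoded" sentinel is the value proposed above and does not exist in xarray; `resolve_option2` is a hypothetical helper, and leaving unnamed dimensions as None (dask's "no change") is an assumption.

```python
from typing import Dict, Mapping, Tuple, Union

ENCODED = "encoded"  # sentinel proposed in Option 2; not an existing xarray value

ChunkValue = Union[int, str, None]


def resolve_option2(
    chunks: Union[str, Mapping[str, ChunkValue]],
    dims: Tuple[str, ...],
    encoded: Mapping[str, int],
) -> Dict[str, ChunkValue]:
    """Replace the "encoded" sentinel with the backend's on-disk chunks."""
    if isinstance(chunks, str) and chunks == ENCODED:
        # chunks="encoded" is shorthand for "encoded" on every dimension.
        chunks = {dim: ENCODED for dim in dims}
    if not isinstance(chunks, Mapping):
        raise TypeError("this sketch only handles 'encoded' or a mapping")
    resolved: Dict[str, ChunkValue] = {}
    for dim in dims:
        value = chunks.get(dim)  # None means: leave chunking unchanged
        if value == ENCODED:
            if dim not in encoded:
                raise ValueError(f"no encoded chunks available for {dim!r}")
            value = encoded[dim]
        resolved[dim] = value
    return resolved


on_disk = {"x": 100, "y": 200}
all_encoded = resolve_option2("encoded", ("x", "y"), on_disk)
mixed = resolve_option2({"x": "encoded", "y": 10}, ("x", "y"), on_disk)
```

The sketch makes the inconvenience listed under "cons" visible: any dimension that should use the on-disk chunks has to be named explicitly, or the whole argument has to be the sentinel.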

2) None

chunks=None should produce the same result in xr.open_dataset and ds.chunk.

@shoyer, @alexamici, @jhamman, @dcherian, @weiji14 suggestions are welcome

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible backends - Harmonise zarr chunking with other backends chunking 717410970
709403137 https://github.com/pydata/xarray/issues/4496#issuecomment-709403137 https://api.github.com/repos/pydata/xarray/issues/4496 MDEyOklzc3VlQ29tbWVudDcwOTQwMzEzNw== shoyer 1217238 2020-10-15T15:27:29Z 2020-10-15T15:27:29Z MEMBER

With regards to overwrite_encoded_chunks=True, see https://github.com/pydata/xarray/pull/2530

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible backends - Harmonise zarr chunking with other backends chunking 717410970
706098129 https://github.com/pydata/xarray/issues/4496#issuecomment-706098129 https://api.github.com/repos/pydata/xarray/issues/4496 MDEyOklzc3VlQ29tbWVudDcwNjA5ODEyOQ== aurghs 35919497 2020-10-09T10:18:10Z 2020-10-09T10:18:10Z COLLABORATOR
  • The key value auto is redundant because it has the same behavior as {}, we could remove one of them.

That's not completely true. With no dask installed auto uses chunk=None, while {} raises an error. Probably it makes sense.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Flexible backends - Harmonise zarr chunking with other backends chunking 717410970


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);