issue_comments

9 rows where issue = 393214032 ("Xarray to Zarr error (in compress / numcodecs functions)"), sorted by updated_at descending

jhamman (MEMBER) · 2019-01-13T03:54:07Z · comment 453799948
https://github.com/pydata/xarray/issues/2624#issuecomment-453799948

I'm going to close this as the original issue (error in compression/codecs) has been resolved. @ktyle - I'd be happy to continue this discussion on the Pangeo issue tracker if you'd like to discuss optimal chunk layout more.

jhamman (MEMBER) · 2019-01-03T16:59:06Z · comment 451206728
https://github.com/pydata/xarray/issues/2624#issuecomment-451206728

@ktyle - glad to hear things are moving for you. I'm pretty sure the last chunk in each of your datasets is smaller than the rest. So after concatenation, you end up with a small chunk in the middle and at the end of the time dimension. I bet if you used a chunk size of 172 (divides evenly into 2924), you wouldn't need to rechunk.
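
A minimal sketch of the even-division suggestion, assuming hypothetical CFSR file paths (`cfsr_slp_*.nc` is a stand-in, not a path from the thread):

```python
import glob
import xarray as xr

# Hypothetical file paths standing in for the CFSR netCDFs in the notebook.
files = sorted(glob.glob("cfsr_slp_*.nc"))

# 172 divides 2924 evenly, so no file ends in a smaller trailing chunk and the
# concatenated dataset keeps uniform chunks, which to_zarr accepts directly.
ds = xr.open_mfdataset(files, chunks={"time": 172})
assert len(set(ds.chunks["time"])) == 1  # every time chunk is the same size
```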

ktyle (NONE) · 2018-12-31T21:44:39Z · comment 450692965
https://github.com/pydata/xarray/issues/2624#issuecomment-450692965

Ok, thanks all for the advice. Clearly further subdivisions of the multi-level variables are in order.

However, working with a single level (sea-level pressure) from our CFSR datasets, I find that if I specify the chunksize on the Time dimension when using xr.open_mfdataset, the to_zarr function fails on the resulting dataset with a "non-uniform chunksize" error.

If, however, I take the resulting dataset and "re-chunk" it with the .chunk method, the to_zarr write succeeds, even though the two datasets "look identical".

Link to notebook:

https://nbviewer.jupyter.org/url/www.atmos.albany.edu/facstaff/ktyle/temp/Xarray_to_zarr_ex1.ipynb
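
A sketch of how the "identical-looking" datasets can be told apart, assuming a dataset `ds` opened as in the notebook (the chunk tuples shown are hypothetical): the repr reports only a single chunksize, but `Dataset.chunks` lists every chunk along each dimension.

```python
# A dataset concatenated from per-file chunks can hide a short interior chunk
# that the repr does not show:
print(ds.chunks["time"])   # e.g. (172, ..., 24, 172, ..., 24) - non-uniform
# Rechunking flattens this into uniform chunks, which is why to_zarr succeeds:
ds2 = ds.chunk({"time": 172})
print(ds2.chunks["time"])  # (172, 172, ..., 172) - uniform
```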

rabernat (MEMBER) · 2018-12-21T00:16:40Z · comment 449184669
https://github.com/pydata/xarray/issues/2624#issuecomment-449184669

> You can also rechunk your dataset after the fact using the chunk method:

Not a good idea in this case. The original 49GB chunks will still exist in the task graph and will have to be computed before the rechunking step.
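
A sketch of the contrast, with hypothetical file paths:

```python
import xarray as xr

f1, f2 = "cfsr_2013.nc", "cfsr_2014.nc"  # hypothetical paths

# Discouraged here: the original ~49 GB per-file chunks remain as nodes in the
# dask task graph and must be computed before the rechunk step can split them.
ds = xr.open_mfdataset([f1, f2])
ds = ds.chunk({"time": 1})

# Preferred: request small chunks at open time, so dask reads time-step-sized
# pieces directly and no ~49 GB intermediate chunk ever exists.
ds = xr.open_mfdataset([f1, f2], chunks={"time": 1})
```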

jhamman (MEMBER) · 2018-12-21T00:14:22Z · comment 449184291
https://github.com/pydata/xarray/issues/2624#issuecomment-449184291

You can also rechunk your dataset after the fact using the chunk method:

```python
ds = ds.chunk({'time': 1})
```

rabernat (MEMBER) · 2018-12-20T22:09:20Z · comment 449151325
https://github.com/pydata/xarray/issues/2624#issuecomment-449151325

So the key information is this: `dask.array<shape=(2920, 32, 361, 720), chunksize=(1460, 32, 361, 720)>`

This says that your dask chunk size is 1460 x 32 x 361 x 720 (x 4 bytes for float32 data) = 48573849600 bytes = ~49 GB. So this dataset is probably unusable for any purpose, including serialization (to zarr, netCDF, or any other format supported by xarray).

Furthermore, the dask chunks will be automatically mapped to zarr chunks by xarray. These zarr chunks would be much too big to be useful. The Zarr docs say "at least 1 MB"; in my example notebook I recommended 10-100 MB.

For both zarr and dask, you can think of a chunk as an amount of data that can be comfortably held in memory and passed around the network. (That's where the 10 - 100 MB estimate comes from.) It is also the minimum size of data that can be read from the dataset at once. Even if you only need one single value, the whole chunk needs to be read into memory and decompressed.

I would recommend you chunk along the time dimension. You can accomplish this by adding the chunks keyword when opening the dataset:

```python
ds = xr.open_mfdataset([f1, f2], chunks={'time': 1})
```

I imagine that will fix most of your issues.
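
A quick editorial check of the arithmetic behind that recommendation, using the shape from the dataset repr further down the thread:

```python
# One time step of the (time, lev, lat, lon) float32 variable:
per_step = 32 * 361 * 720 * 4             # bytes per chunk with {'time': 1}
print(per_step / 1e6)                     # ~33.3 MB -> inside the 10-100 MB range
# versus the original chunking:
print(1460 * 32 * 361 * 720 * 4 / 1e9)    # ~48.6 GB per chunk
```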

ktyle (NONE) · 2018-12-20T21:49:15Z (edited 2018-12-20T21:59:33Z) · comment 449146417
https://github.com/pydata/xarray/issues/2624#issuecomment-449146417

@rabernat Yeah I think the chunksize in the time dimension is too large:

```
<xarray.Dataset>
Dimensions:  (lat: 361, lev: 32, lon: 720, time: 2920)
Coordinates:
  * lat      (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0
  * lon      (lon) float32 -180.0 -179.5 -179.0 -178.5 ... 178.5 179.0 179.5
  * lev      (lev) float32 1000.0 975.0 950.0 925.0 ... 50.0 30.0 20.0 10.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    g        (time, lev, lat, lon) float32 dask.array<shape=(2920, 32, 361, 720), chunksize=(1460, 32, 361, 720)>
```

rabernat (MEMBER) · 2018-12-20T21:43:40Z · comment 449145011
https://github.com/pydata/xarray/issues/2624#issuecomment-449145011

> I thought I might try specifying no compression, as supported in Zarr, by adding "compressor = None" as a kwarg in the to_zarr call in xarray, but that is not supported.

The syntax and an example for specifying a compressor are given in the docs here: http://xarray.pydata.org/en/latest/io.html#zarr-compressors-and-filters. It needs to be part of the encoding keyword. But I don't think this will solve your problem.
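
A sketch of the encoding route the linked docs describe, assuming a dataset `ds`; 'slp' is a hypothetical variable name standing in for sea-level pressure:

```python
import numcodecs

# Compression is configured per variable through `encoding`, not as a direct
# to_zarr keyword argument.
ds.to_zarr("slp_uncompressed.zarr", encoding={"slp": {"compressor": None}})

# Or supply an explicit numcodecs compressor rather than disabling compression:
ds.to_zarr("slp_zstd.zarr",
           encoding={"slp": {"compressor": numcodecs.Blosc(cname="zstd")}})
```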

rabernat (MEMBER) · 2018-12-20T21:40:44Z · comment 449144275
https://github.com/pydata/xarray/issues/2624#issuecomment-449144275

@ktyle - it sounds like your chunks are too big.

Can you post xarray's representation of your dataset before writing it to zarr? Call `print(ds)` and paste the output here.

p.s. I edited your comment a bit to put the code into code blocks.

