issues: 1433534927

  • id: 1433534927
  • node_id: I_kwDOAMm_X85VcgHP
  • number: 7248
  • title: xarray.align Inflating Source NetCDF Data
  • user: 92732695
  • state: closed (state_reason: completed)
  • locked: 1
  • comments: 3
  • created_at: 2022-11-02T17:36:04Z
  • updated_at: 2023-09-12T15:33:43Z
  • closed_at: 2023-09-12T15:33:43Z
  • author_association: NONE
  • repo: 13221727
  • type: issue

What is your issue?

Hi there,

I've been experiencing a peculiar issue that I hope I can get some further insight on. Some background: I'm processing data from raw NetCDF with Xarray to be written to a Zarr store. The hope is that I can append new pieces of data as they are received. The code is as follows.

First I write an empty Zarr store to prepare for incoming data. Each of the dimensions/coordinates is pre-populated with all known values. Dimension w is one million in length; the others vary in length but are generally < 50.

```
import xarray

# a through h and w are the pre-known 1-D coordinate arrays
empty_ds = xarray.Dataset(
    coords=dict(
        a=a, b=b, c=c, d=d, e=e,
        f=f, g=g, h=h, w=w,
    ),
    attrs=dict(description=f"Dataset for {a[0]}"),
)

empty_ds.to_zarr('./zarr', compute=False, mode="w")
```
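Reopening the template right after writing it is a quick way to confirm the coordinate layout that later appends will be aligned against. A small sanity-check sketch, assuming the store path used above:

```
# Sanity check: the template store should already expose every coordinate
# value, with w reporting its full one-million length.
template = xarray.open_zarr("./zarr/")
print(template.sizes)
print(list(template.coords))
```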

I then need to open individual NetCDF files, whose coordinates are a subset of those that exist in the Zarr store. Each file's dataset is ~8 MB.

I then perform an alignment between the Zarr store, which has all coordinate values, and the contents of the individual NetCDF file, to make sure dimensions and coordinates match:

```
ds1 = xarray.open_mfdataset('data.nc')
zarr_ds = xarray.open_zarr("./zarr/")

# we exclude h because it's the dimension we will append along
a, b = xarray.align(zarr_ds, ds1, join="outer", exclude="h")
b.to_zarr("./zarr/", mode="a", append_dim="h")
```

These operations work; however, the new aligned dataset b that I want to write to the Zarr store has blown up in size as a result, and after writing it back to the Zarr store it produces a large file whose size should theoretically be only ~8 MB before compression.
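A quick way to quantify the blow-up (a diagnostic sketch reusing ds1 and b from the snippet above; nbytes is derived from shape and dtype, so it doesn't load the dask-backed arrays):

```
# Compare the uncompressed in-memory footprint before and after the align.
print(f"ds1:       {ds1.nbytes / 1e6:.1f} MB")
print(f"aligned b: {b.nbytes / 1e6:.1f} MB")

# Per-variable view: shape, dtype and dask chunking after alignment.
for name, var in b.data_vars.items():
    print(name, var.shape, var.dtype, var.chunks)
```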

Does anyone have insight as to why this might be happening? I've tried changing chunking settings both when opening and when writing data, changing the dtypes of dimensions, etc. My hunch is that it has something to do with the dimension that is one million in length. For context, the data variable contains one million data points that correspond to the one million values of w. Thanks for the help!
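For what it's worth, here is a tiny synthetic example (made-up names and sizes, not my real data) of the kind of inflation I suspect the outer alignment introduces when an incoming file only covers a subset of one of the store's coordinates:

```
import numpy as np
import xarray

# Miniature stand-in for the real layout: a store holding the full
# coordinate values and a file covering only part of one dimension.
w = np.arange(1_000)                # stand-in for the one-million-long dim
b_full = np.arange(40)              # a short dimension, fully populated
b_part = b_full[:2]                 # the incoming file has only 2 of the 40

store = xarray.Dataset(coords=dict(b=b_full, w=w))
file_ds = xarray.Dataset(
    {"measurement": (("b", "w"), np.ones((b_part.size, w.size), dtype="float32"))},
    coords=dict(b=b_part, w=w),
)

_, aligned = xarray.align(store, file_ds, join="outer")

# The outer join reindexes "measurement" onto all 40 b values, padding the
# missing rows with NaN, so the array is now 20x larger. (NaN padding would
# also upcast an integer variable to float64.)
print(file_ds["measurement"].shape, file_ds["measurement"].nbytes)  # (2, 1000) 8000
print(aligned["measurement"].shape, aligned["measurement"].nbytes)  # (40, 1000) 160000
```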

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7248/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}

Links from other tables

  • 1 row from issues_id in issues_labels
  • 3 rows from issue in issue_comments