
issues


1 row where state = "closed" and user = 99441529 sorted by updated_at descending

issue 7987: Existing chunks not being respected on to_zarr()

  • id: 1804983457
  • node_id: I_kwDOAMm_X85rldyh
  • user: hansmohrmann (99441529)
  • state: closed
  • locked: 0
  • comments: 2
  • created_at: 2023-07-14T14:30:20Z
  • updated_at: 2023-11-24T22:15:04Z
  • closed_at: 2023-11-24T22:15:04Z
  • author_association: NONE

What is your issue?

Hi folks, I'm not sure what I'm doing wrong here. Context: I have a dataset and add a coordinate variable with some specified chunking, but that chunking is reset when I write with to_zarr() and open the store from disk. Even if I call unify_chunks() before writing, or explicitly set the preferred_chunks encoding, the chunks get reset on write. I managed to force the chunks by explicitly setting ds.end_date.encoding['chunks'] = (2,) before writing (note that no encoding was set before this, so this isn't the old issue of a previous encoding overriding the dask chunks). It took a while to find this, and I don't understand why the default behavior disregards the existing chunks on single-dimension variables or coordinates. The problem is that, with this default behavior, the dataset written to disk has inconsistent chunks along a dimension on read, throws an error, and requires a call to unify_chunks() to be usable.
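The read-side symptom described above can be sketched in memory, without writing anything to disk. This is only an illustration under assumptions: the tiny dataset and the names `chunks_error` and `unified_chunks` are made up here, and the exact error text may differ between xarray versions.

```python
import dask.array as da
import xarray as xr

# "tas" is split into two chunks along "time", while "end_date" is a single
# chunk, mimicking the mismatch seen after reading the store back.
ds = xr.Dataset(
    data_vars={"tas": (("time",), da.ones(4, chunks=2))},
    coords={"end_date": (("time",), da.arange(4, chunks=4))},
)

# Dataset.chunks refuses to summarise chunking when variables disagree.
try:
    ds.chunks
    chunks_error = None
except ValueError as err:
    chunks_error = str(err)
print("error from ds.chunks:", chunks_error)

# unify_chunks() rechunks every variable onto a common grid.
unified = ds.unify_chunks()
unified_chunks = dict(unified.chunks)
print("chunks after unify_chunks():", unified_chunks)
```

Calling `unify_chunks()` after every read works, but it only papers over the mismatch; the chunk layout stored on disk stays as it was written.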

Reproducer below:

Versions: xarray 2023.6.0, dask 2023.6.0, zarr 2.14.2

```
import pandas as pd
import numpy as np
import xarray as xr
from datetime import datetime
import dask.array as da

from pandas.tseries.offsets import MonthEnd

# create toy dataset
dates = [datetime(2021, 1, 1), datetime(2021, 2, 1), datetime(2021, 3, 1), datetime(2021, 4, 1)]
ds = xr.Dataset(
    data_vars=dict(
        tas=(["time", "lat", "lon"], np.array([300]) * np.ones((len(dates), 4, 4))),
    ),
    coords=dict(
        time=dates,
        lat=np.array([10, 11, 12, 13]),
        lon=np.array([20, 21, 22, 23]),
    ),
)

# chunk dataset
ds = ds.chunk(chunks={"lat": 2, "lon": 2, "time": 2})

# add end date coordinate
time_df = pd.DataFrame({"time": list(ds["time"].values)})
time_df["end_date"] = time_df["time"] + MonthEnd(0)
ds = ds.assign_coords(end_date=("time", da.from_array(time_df["end_date"].values)))

print("---data before unify chunks--- \n", ds.end_date.data)
ds = ds.unify_chunks()  # surely you'll respect my chunking
print("---data before writing to s3--- \n", ds.end_date.data)
ds.time.encoding["preferred_chunks"] = {"time": 2}  # please use this chunking?
ds.end_date.encoding["preferred_chunks"] = {"time": 2}  # pretty please use this chunking?

# write data to s3
data_path = "/tmp/mydata.zarr"
ds.to_zarr(data_path, mode="w")

# read data back from s3
ds_check = xr.open_zarr(data_path)

print("---data after writing to disk and reading back in--- \n", ds_check.end_date.data)
```

Output from the code above:

    ---data before unify chunks---
     dask.array<array, shape=(4,), dtype=datetime64[ns], chunksize=(4,), chunktype=numpy.ndarray>
    ---data before writing to s3---
     dask.array<rechunk-merge, shape=(4,), dtype=datetime64[ns], chunksize=(2,), chunktype=numpy.ndarray>
    ---data after writing to disk and reading back in---
     dask.array<open_dataset-end_date, shape=(4,), dtype=datetime64[ns], chunksize=(4,), chunktype=numpy.ndarray>

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7987/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  • state_reason: completed
  • repo: xarray (13221727)
  • type: issue

