issues

1 row where repo = 13221727, state = "open" and user = 42455466 sorted by updated_at descending

id: 1083621690
node_id: I_kwDOAMm_X85AlsE6
number: 6084
title: Initialise zarr metadata without computing dask graph
user: dougiesquire (42455466)
state: open
locked: 0
assignee: (none)
milestone: (none)
comments: 6
created_at: 2021-12-17T21:17:42Z
updated_at: 2024-04-03T19:08:26Z
closed_at: (none)
author_association: NONE
repo: xarray (13221727)
type: issue

body:

Is your feature request related to a problem? Please describe.

When writing large Zarr stores, the xarray docs recommend first creating the store without writing any of its array data. The recommended approach is to create a dummy dask-backed Dataset and then call to_zarr with compute=False, so that only the metadata is written to the store. This works great.
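
A minimal sketch of that documented pattern. The variable name, shapes, chunk sizes, and store path are all illustrative, not taken from the issue:

import dask.array
import numpy as np
import xarray as xr

# Build a dummy Dataset backed by lazy dask arrays with the target
# shape and chunking; no array data is materialised here.
dummies = dask.array.zeros((365, 180, 360), chunks=(10, 180, 360))
ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), dummies)},
    coords={"time": np.arange(365)},
)

# With compute=False, only the Zarr metadata is written to the store.
ds.to_zarr("path/to/store.zarr", compute=False)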

It seems that in one common use case for this approach (including the example in the docs linked above), the entire dataset to be written to Zarr is already represented in a Dataset (call it ds). Rather than creating a dummy Dataset with exactly the same metadata as ds, it is then more convenient to initialise the Zarr store with ds.to_zarr(..., compute=False). See for example:

  • https://discourse.pangeo.io/t/many-netcdf-to-single-zarr-store-using-concurrent-futures/2029
  • https://discourse.pangeo.io/t/map-blocks-and-to-zarr-region/2019
  • https://discourse.pangeo.io/t/netcdf-to-zarr-best-practices/1119/12
  • https://discourse.pangeo.io/t/best-practice-for-memory-management-to-iteratively-write-a-large-dataset-with-xarray/1989

However, calling to_zarr with compute=False still builds the full dask graph for writing the Zarr store. That graph is never executed in this use case, but constructing it can take a very long time for large datasets, as the sketch below illustrates.
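
A sketch of that workflow, under the same illustrative assumptions as above, with ds standing in for the real, fully dask-backed dataset:

# Initialise the store directly from ds. With compute=False nothing is
# computed, but the dask write graph is still constructed and then
# discarded, which is the overhead this issue is about.
ds.to_zarr("path/to/store.zarr", compute=False)

# The store is later filled piece by piece, e.g. one region per task.
# (Variables that lack the region dimension may need to be dropped first.)
ds.isel(time=slice(0, 10)).to_zarr(
    "path/to/store.zarr", region={"time": slice(0, 10)}, mode="r+"
)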

Describe the solution you'd like

Is there scope to add an option to to_zarr that initialises the store without constructing the dask graph? Or perhaps a dedicated initialise_zarr method would be cleaner?
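
Neither spelling exists in xarray; the following is purely a hypothetical sketch of what the two suggestions in the issue might look like:

# Option A (hypothetical flag): skip graph construction entirely.
ds.to_zarr("path/to/store.zarr", compute=False, build_graph=False)

# Option B (hypothetical method): a dedicated metadata-only initialiser.
ds.initialise_zarr("path/to/store.zarr")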

reactions:
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6084/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
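
A minimal sketch of the query behind this page ("1 row where repo = 13221727, state = 'open' and user = 42455466 sorted by updated_at descending"), assuming the table above lives in a local SQLite file whose name, github.db, is hypothetical:

import sqlite3

# Connect to the (hypothetical) database file containing the issues table.
conn = sqlite3.connect("github.db")
rows = conn.execute(
    """
    SELECT id, number, title, state, updated_at
    FROM issues
    WHERE repo = ? AND state = ? AND user = ?
    ORDER BY updated_at DESC
    """,
    (13221727, "open", 42455466),
).fetchall()
for row in rows:
    print(row)
conn.close()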