issues

4 rows where comments = 0, state = "open" and user = 5635139 sorted by updated_at descending


type 1

  • issue 4

state 1

  • open · 4 ✖

repo 1

  • xarray 4
id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
1916677049 I_kwDOAMm_X85yPiu5 8245 Tools for writing distributed zarrs max-sixty 5635139 open 0     0 2023-09-28T04:25:45Z 2024-01-04T00:15:09Z   MEMBER      

What is your issue?

There seems to be a common pattern for writing zarrs from a distributed set of machines, in parallel. It's somewhat described in the prose of the io docs. Quoting:

  • Creating the template — "the first step is creating an initial Zarr store without writing all of its array data. This can be done by first creating a Dataset with dummy values stored in dask, and then calling to_zarr with compute=False to write only metadata to Zarr"
  • Writing out each region from workers — "a Zarr store with the correct variable shapes and attributes exists that can be filled out by subsequent calls to to_zarr. The region provides a mapping from dimension names to Python slice objects indicating where the data should be written (in index space, not coordinate space)"

I've been using this fairly successfully recently. It's much better than writing hundreds or thousands of data variables, since many small data variables create a huge number of files.

Are there some tools we can provide to make this easier? Some ideas:

  • [ ] compute=False is arguably a less-than-obvious kwarg meaning "write metadata". Maybe this should be a method, maybe it's a candidate for renaming? Or maybe make_template can be an abstraction over it. Something like xarray_beam.make_template to make the template from a Dataset?
    • Or from an array of indexes?
    • https://github.com/pydata/xarray/issues/8343
    • https://github.com/pydata/xarray/pull/8460
  • [ ] What happens if one worker's data isn't aligned on some dimensions? Will that write to the wrong location? Could we offer an option, similar to the above, to reindex on the template dimensions?

  • [ ] When writing a region, we need to drop other vars. Can we offer this as a kwarg? Occasionally I'll add a dimension with an index to a dataset, run the function to write it — and it'll fail, because I forgot to add that index to the .drop_vars call that precedes the write. When we're writing a template, all the indexes are written up front anyway. (edit: #6260)
    • https://github.com/pydata/xarray/pull/8460
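As a workaround for the manual .drop_vars call, a small helper can drop everything that shares no dimension with the region. This is only a sketch: `drop_vars_outside_region` is a hypothetical name, not an xarray function, and the dataset is invented for the example.

```python
import numpy as np
import xarray as xr


def drop_vars_outside_region(ds, region):
    """Drop every variable (coordinates included) sharing no dimension with `region`."""
    region_dims = set(region)
    unrelated = [
        name
        for name, var in ds.variables.items()
        if not region_dims & set(var.dims)
    ]
    return ds.drop_vars(unrelated)


ds = xr.Dataset(
    {"temperature": (("time", "x"), np.zeros((4, 3)))},
    coords={"time": np.arange(4), "x": [10, 20, 30]},
)
# The "x" index has no "time" dimension, so it is dropped before the write.
trimmed = drop_vars_outside_region(ds, {"time": slice(0, 4)})
```

Because the list of variables to drop is derived from the region, adding a new indexed dimension to the dataset no longer requires touching the write code.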

More minor papercuts:

  • [ ] I've hit an issue where writing a region seemed to cause the worker to attempt to load the whole array into memory. Can we offer guarantees for when (non-metadata) data will be loaded during to_zarr?
  • [ ] How about adding raise_if_dask_computes to our public API? The alternative I've been doing is watching htop and exiting if I see memory ballooning, which is less cerebral...
  • [ ] It doesn't seem easy to write coords on a DataArray. For example, writing xr.tutorial.load_dataset('air_temperature').assign_coords(lat2=da.lat + 2, a=(('lon',), ['a'] * len(da.lon))).chunk().to_zarr('foo.zarr', compute=False) will cause the non-index coords to be written as empty. But writing them separately conflicts with having a single variable. Currently I manually load each coord before writing, which is not super-friendly.
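Until something like raise_if_dask_computes is public, a similar guard can be approximated by installing a dask scheduler that refuses to run. This is a sketch under that assumption, not xarray's own implementation; `forbid_dask_computes` and `_refuse_to_compute` are hypothetical names.

```python
from contextlib import contextmanager

import dask
import dask.array as da


def _refuse_to_compute(dsk, keys, **kwargs):
    """Stand-in scheduler: any attempt to execute a graph raises instead."""
    raise RuntimeError("a dask graph was computed inside forbid_dask_computes()")


@contextmanager
def forbid_dask_computes():
    """Raise if anything inside the block triggers a dask computation."""
    with dask.config.set(scheduler=_refuse_to_compute):
        yield


lazy = da.ones(4, chunks=2) + 1
with forbid_dask_computes():
    still_lazy = lazy * 2  # building the graph is fine; nothing is executed
```

Wrapping the template-writing step in such a block would turn a silent memory blow-up into an immediate error.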

Some items that were on this list, kept here since they've been completed:

  • [x] Requiring region to be specified as an int range can be inconvenient. Would it be feasible to have a function that grabs the template metadata, calculates the region ints, and then calculates the implied indexes?
    • Edit: suggested at https://github.com/pydata/xarray/issues/7702
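The idea of deriving the integer region from labels can be illustrated without xarray at all. The sketch below is plain Python; `region_from_labels` is a hypothetical helper and the month labels are invented. It also refuses pieces that are not aligned with the template, which touches on the alignment question raised earlier.

```python
def region_from_labels(template_labels, piece_labels):
    """Return the integer slice of `template_labels` covered by `piece_labels`."""
    if len(piece_labels) == 0:
        raise ValueError("piece has no labels")
    position = {label: i for i, label in enumerate(template_labels)}
    try:
        indexes = [position[label] for label in piece_labels]
    except KeyError as err:
        raise ValueError(f"label {err.args[0]!r} is not in the template") from None
    start = indexes[0]
    # A region is a single slice, so the piece must be one ordered, gap-free block.
    if indexes != list(range(start, start + len(indexes))):
        raise ValueError("piece labels are not a contiguous, ordered block")
    return slice(start, start + len(indexes))


template_time = ["2023-01", "2023-02", "2023-03", "2023-04"]
region = {"time": region_from_labels(template_time, ["2023-02", "2023-03"])}
```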

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8245/reactions",
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 1
}
    xarray 13221727 issue
1918061661 I_kwDOAMm_X85yU0xd 8251 `.chunk()` doesn't create chunks on 0 dim arrays max-sixty 5635139 open 0     0 2023-09-28T18:30:50Z 2023-09-30T21:31:05Z   MEMBER      

What happened?

.chunk's docstring states:

```
    """Coerce this array's data into a dask arrays with the given chunks.

    If this variable is a non-dask array, it will be converted to dask
    array. If it's a dask array, it will be rechunked to the given chunk
    sizes.
```

...but this doesn't happen for 0 dim arrays; example below.

For context, as part of #8245, I had a function that creates a template array. It created an empty DataArray, then expanded dims for each dimension. And it kept blowing up memory! ...until I realized that it was actually not a lazy array.

What did you expect to happen?

It may be that we can't have a 0-dim dask array — but then we should raise in this method, rather than return the wrong thing.
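Until the method itself raises, a caller-side guard could look like the sketch below. `chunk_or_raise` is a hypothetical wrapper, not part of xarray; it relies on `DataArray.chunks` being `None` when the data is not dask-backed, and it assumes dask is installed.

```python
import xarray as xr


def chunk_or_raise(arr):
    """Call .chunk() and fail loudly if the result is still not dask-backed."""
    chunked = arr.chunk()
    if chunked.chunks is None:
        raise ValueError(
            f"chunk() returned a non-dask array for dims={arr.dims!r}"
        )
    return chunked


lazy = chunk_or_raise(xr.DataArray([1, 2, 3]))
```

What this does for a 0-dim array depends on the installed xarray version, so no result is shown for that case.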

Minimal Complete Verifiable Example

```Python
[ins] In [1]: type(xr.DataArray().chunk().data)
Out[1]: numpy.ndarray

[ins] In [2]: type(xr.DataArray(1).chunk().data)
Out[2]: numpy.ndarray

[ins] In [3]: type(xr.DataArray([1]).chunk().data)
Out[3]: dask.array.core.Array
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: 0d6cd2a39f61128e023628c4352f653537585a12
python: 3.9.18 (main, Aug 24 2023, 21:19:58) [Clang 14.0.3 (clang-1403.0.22.14.1)]
python-bits: 64
OS: Darwin
OS-release: 22.6.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: en_US.UTF-8
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2023.8.1.dev25+g8215911a.d20230914
pandas: 2.1.1
numpy: 1.25.2
scipy: 1.11.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.16.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.4.0
distributed: 2023.7.1
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: 0.2.3.dev30+gd26e29e
fsspec: 2021.11.1
cupy: None
pint: None
sparse: None
flox: 0.7.2
numpy_groupies: 0.9.19
setuptools: 68.1.2
pip: 23.2.1
conda: None
pytest: 7.4.0
mypy: 1.5.1
IPython: 8.15.0
sphinx: 4.3.2
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8251/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1917820711 I_kwDOAMm_X85yT58n 8248 `write_empty_chunks` not in `DataArray.to_zarr` max-sixty 5635139 open 0     0 2023-09-28T15:48:22Z 2023-09-28T15:49:35Z   MEMBER      

What is your issue?

Our to_zarr methods on DataArray & Dataset are slightly inconsistent — Dataset.to_zarr has write_empty_chunks and chunkmanager_store_kwargs. They're also in a different order.


Up a level: not sure of the best way of enforcing consistency here; a couple of ideas.

  • We could have tests that operate on both a DataArray and Dataset, parameterized by fixtures (might also help reduce the duplication in some of our tests), though we then need to make the tests generic. We could have some general tests which just test that methods work, and then delegate to the current per-object tests for finer guarantees.
  • We could have a tool which collects the differences between DataArray & Dataset methods and snapshots them, so we'll see if they diverge, while allowing for some divergences.
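The second idea could start from something as small as comparing parameter lists with `inspect`. The sketch below uses two toy classes so that it runs anywhere; their signatures are invented for illustration and do not reproduce xarray's. `signature_diff` is a hypothetical helper.

```python
import inspect


def signature_diff(cls_a, cls_b, method_name):
    """Compare the parameters of one method as defined on two classes."""
    params_a = list(inspect.signature(getattr(cls_a, method_name)).parameters)
    params_b = list(inspect.signature(getattr(cls_b, method_name)).parameters)
    shared_a = [p for p in params_a if p in params_b]
    shared_b = [p for p in params_b if p in params_a]
    return {
        "only_in_a": [p for p in params_a if p not in params_b],
        "only_in_b": [p for p in params_b if p not in params_a],
        "same_order": shared_a == shared_b,
    }


class ToyArray:  # stand-in for a DataArray-like class
    def to_zarr(self, store=None, mode=None, compute=True):
        pass


class ToyDataset:  # stand-in for a Dataset-like class
    def to_zarr(self, store=None, compute=True, mode=None, write_empty_chunks=None):
        pass


diff = signature_diff(ToyArray, ToyDataset, "to_zarr")
```

Storing such a dictionary per shared method as a snapshot would make any new divergence show up as a diff in review, while known divergences stay recorded in the snapshot.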

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8248/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1125030343 I_kwDOAMm_X85DDpnH 6243 Maintenance improvements max-sixty 5635139 open 0     0 2022-02-05T21:01:51Z 2022-02-05T21:01:51Z   MEMBER      

Is your feature request related to a problem?

At the end of the dev call, we discussed ways to do better at maintenance. I'd like to make Xarray a wonderful place to contribute, partly because it was so formative for me in becoming more involved with software engineering.

Describe the solution you'd like

We've already come far, because of the hard work of many of us!

A few ideas, in increasing order of radical-ness:

  • We looked at @andersy005's dashboards for PRs & Issues. Could we expose this, both to hold ourselves accountable and signal to potential contributors that we care about turnaround time for their contributions?
  • Is there a systematic way of understanding who should review something?
    • FWIW a few months ago I looked for a bot that would recommend a reviewer based on who had contributed code in the past, which I think I've seen before. But I couldn't find one generally available. This would be really helpful: we wouldn't have n people each assessing whether they're the best reviewer for each contribution. If anyone does better than me at finding something like this, that would be awesome.
  • Could we add a label so people can say "now I'm waiting for a review", and track how long those stay up?
    • Ensuring the 95th percentile is < 2 days is more important than the median being in the hours. It does pain me when I see PRs get dropped for a few weeks. TBC, I'm as responsible as anyone.
  • Could we have a bot that asks for feedback on the review process, i.e. "I received a prompt and helpful review", "I would recommend a friend contribute to Xarray", etc?
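To make the percentile point concrete, here is a small calculation on invented waiting times (hours until first review). The numbers are purely illustrative and do not come from Xarray's dashboards.

```python
import statistics

# Invented data: most PRs are reviewed within hours, two wait for weeks.
hours_to_first_review = [2] * 18 + [200, 400]

median_hours = statistics.median(hours_to_first_review)
# With n=20 the cut points are the 5th, 10th, ..., 95th percentiles.
p95_hours = statistics.quantiles(hours_to_first_review, n=20)[-1]
meets_target = p95_hours < 48  # the "< 2 days" target from the list above
```

The median looks excellent here, yet the target is missed because of the two long waits, which is exactly the case the 95th percentile is meant to catch.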

Describe alternatives you've considered

No response

Additional context

There's always a danger with making stats legible that Goodhart's law strikes. And sometimes stats are not joyful, and lots of people come here for joy. So probably there's a tradeoff.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6243/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
Powered by Datasette · Queries took 36.311ms · About: xarray-datasette