id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
1916677049,I_kwDOAMm_X85yPiu5,8245,Tools for writing distributed zarrs,5635139,open,0,,,0,2023-09-28T04:25:45Z,2024-01-04T00:15:09Z,,MEMBER,,,,"### What is your issue?

There seems to be a common pattern for writing zarrs from a distributed set of machines, in parallel. It's somewhat described in the prose of [the io docs](https://docs.xarray.dev/en/stable/user-guide/io.html#appending-to-existing-zarr-stores). Quoting:

- Creating the template — ""the first step is creating an initial Zarr store without writing all of its array data. This can be done by first creating a Dataset with dummy values stored in [dask](https://docs.xarray.dev/en/stable/user-guide/dask.html#dask), and then calling to_zarr with compute=False to write only metadata to Zarr""
- Writing out each region from workers — ""a Zarr store with the correct variable shapes and attributes exists that can be filled out by subsequent calls to to_zarr. The region provides a mapping from dimension names to Python slice objects indicating where the data should be written (in index space, not coordinate space)""

I've been using this fairly successfully recently. It's much better than writing hundreds or thousands of data variables, since many small data variables create a huge number of files.

Are there some tools we can provide to make this easier? Some ideas:

- [ ] `compute=False` is arguably a less-than-obvious kwarg meaning ""write metadata"". Maybe this should be a method, maybe it's a candidate for renaming? Or maybe `make_template` can be an abstraction over it. Something like [`xarray_beam.make_template`](https://xarray-beam.readthedocs.io/en/latest/_autosummary/xarray_beam.make_template.html) to make the template from a Dataset?
  - Or from an array of indexes?
  - https://github.com/pydata/xarray/issues/8343
  - https://github.com/pydata/xarray/pull/8460
- [ ] What happens if one worker's data isn't aligned on some dimensions? Will that write to the wrong location? Could we offer an option, similar to the above, to reindex on the template dimensions?
- [ ] When writing a region, we need to drop other vars. Can we offer this as a kwarg? Occasionally I'll add a dimension with an index to a dataset, run the function to write it — and it'll fail, because I forgot to add that index to the `.drop_vars` call that precedes the write. When we're writing a template, all the indexes are written up front anyway. (edit: #6260)
  - https://github.com/pydata/xarray/pull/8460

More minor papercuts:

- [ ] I've hit an issue where writing a region seemed to cause the worker to attempt to load the whole array into memory — can we offer guarantees for when (non-metadata) data will be loaded during `to_zarr`?
- [ ] How about adding [`raise_if_dask_computes`](https://tgsmc.slack.com/archives/C04TBA4NBHD/p1695840018643219) to our public API? The alternative I've been doing is watching `htop` and exiting if I see memory ballooning, which is less cerebral...
- [ ] It doesn't seem easy to write coords on a DataArray. For example, writing `ds = xr.tutorial.load_dataset('air_temperature'); ds.assign_coords(lat2=ds.lat + 2, a=(('lon',), ['a'] * len(ds.lon))).chunk().to_zarr('foo.zarr', compute=False)` will cause the non-index coords to be written as empty. But writing them separately conflicts with having a single variable.
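For concreteness, here's a minimal sketch of the whole template-plus-region pattern described above — the store path, variable names, shape, and chunking are all made up for illustration, not a proposed API:

```python
import dask.array as darr
import numpy as np
import xarray as xr

# 1. Build a lazy template with the full shape — no real data is computed
template = xr.Dataset(
    {'air': (('time', 'lat'), darr.zeros((1000, 100), chunks=(10, 100)))},
    coords={'time': np.arange(1000), 'lat': np.linspace(-90, 90, 100)},
)
# compute=False writes only metadata & the (eager) index coords; any
# non-index coords would need loading first to avoid the empty-coords
# papercut above
template.to_zarr('out.zarr', compute=False)

# 2. On each worker: compute one slab and write it into its region
def write_slab(start, stop):
    slab = template.isel(time=slice(start, stop)).copy()
    slab['air'] = (('time', 'lat'), np.random.rand(stop - start, 100))
    # variables without the region dimension must currently be dropped by
    # hand — forgetting one is the failure mode described above
    slab.drop_vars('lat').to_zarr('out.zarr', region={'time': slice(start, stop)})

write_slab(0, 10)
```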
Currently I manually load each coord before writing, which is not super-friendly.

Some things that were in the list here have been checked off, as they've been completed!!

- [x] Requiring `region` to be specified as an int range can be inconvenient — would it be feasible to have a function that grabs the template metadata, calculates the region ints, and then calculates the implied indexes?
  - Edit: suggested at https://github.com/pydata/xarray/issues/7702","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8245/reactions"", ""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 1}",,,13221727,issue
1918061661,I_kwDOAMm_X85yU0xd,8251,`.chunk()` doesn't create chunks on 0 dim arrays,5635139,open,0,,,0,2023-09-28T18:30:50Z,2023-09-30T21:31:05Z,,MEMBER,,,,"### What happened?

`.chunk`'s docstring states:

```
""""""Coerce this array's data into a dask arrays with the given chunks.

If this variable is a non-dask array, it will be converted to dask array.
If it's a dask array, it will be rechunked to the given chunk sizes.
```

...but this doesn't happen for 0 dim arrays; example below.

For context, as part of #8245, I had a function that creates a template array. It created an empty `DataArray`, then expanded dims for each dimension. And it kept blowing up memory! ...until I realized that it was actually not a lazy array.

### What did you expect to happen?

It may be that we can't have a 0-dim dask array — but then we should raise in this method, rather than return the wrong thing.

### Minimal Complete Verifiable Example

```Python
[ins] In [1]: type(xr.DataArray().chunk().data)
Out[1]: numpy.ndarray

[ins] In [2]: type(xr.DataArray(1).chunk().data)
Out[2]: numpy.ndarray

[ins] In [3]: type(xr.DataArray([1]).chunk().data)
Out[3]: dask.array.core.Array
```

### MVCE confirmation

- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

### Relevant log output

_No response_

### Anything else we need to know?

_No response_

### Environment
INSTALLED VERSIONS
------------------
commit: 0d6cd2a39f61128e023628c4352f653537585a12
python: 3.9.18 (main, Aug 24 2023, 21:19:58) [Clang 14.0.3 (clang-1403.0.22.14.1)]
python-bits: 64
OS: Darwin
OS-release: 22.6.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: en_US.UTF-8
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 2023.8.1.dev25+g8215911a.d20230914
pandas: 2.1.1
numpy: 1.25.2
scipy: 1.11.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.16.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.4.0
distributed: 2023.7.1
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: 0.2.3.dev30+gd26e29e
fsspec: 2021.11.1
cupy: None
pint: None
sparse: None
flox: 0.7.2
numpy_groupies: 0.9.19
setuptools: 68.1.2
pip: 23.2.1
conda: None
pytest: 7.4.0
mypy: 1.5.1
IPython: 8.15.0
sphinx: 4.3.2
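
One more data point — a quick check rather than anything authoritative, worth re-verifying against a current dask: dask itself has no trouble representing a 0-dim lazy array, which suggests `.chunk()` silently returning numpy is a bug rather than a hard limitation:

```python
import dask.array as darr
import numpy as np

# dask happily represents a 0-dim lazy array directly...
x = darr.from_array(np.array(1))
print(type(x), x.ndim, x.chunks)  # dask.array.core.Array, 0, ()

# ...so xr.DataArray(1).chunk() returning numpy-backed data is surprising
```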
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8251/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 1917820711,I_kwDOAMm_X85yT58n,8248,`write_empty_chunks` not in `DataArray.to_zarr`,5635139,open,0,,,0,2023-09-28T15:48:22Z,2023-09-28T15:49:35Z,,MEMBER,,,,"### What is your issue? Our `to_zarr` methods on `DataArray` & `Dataset` are slightly inconsistent — `Dataset.to_zarr` has `write_empty_chunks` and `chunkmanager_store_kwargs`. They're also in a different order. --- Up a level — not sure of the best way of enforcing consistency here; a couple of ideas. - We could have tests that operate on both a `DataArray` and `Dataset`, parameterized by fixtures (might also help reduce the duplication in some of our tests), though we then need to make the tests generic. We could have some general tests which just test that methods work, and then delegate to the current per-object tests for finer guarantees. - We could have a tool which collects the differences between `DataArray` & `Dataset` methods and snapshots them — then we'll see if they diverge, while allowing for some divergences.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8248/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 1125030343,I_kwDOAMm_X85DDpnH,6243,Maintenance improvements,5635139,open,0,,,0,2022-02-05T21:01:51Z,2022-02-05T21:01:51Z,,MEMBER,,,,"### Is your feature request related to a problem? At the end of the dev call, we discussed ways to do better at maintenance. I'd like to make Xarray a wonderful place to contribute, partly because it was so formative for me in becoming more involved with software engineering. ### Describe the solution you'd like We've already come far, because of the hard work of many of us! A few ideas, in increasing order of radical-ness - We looked at @andersy005's dashboards for PRs & Issues. Could we expose this, both to hold ourselves accountable and signal to potential contributors that we care about turnaround time for their contributions? - Is there a systematic way of understanding who should review something? - FWIW a few months ago I looked for a bot that would recommend a reviewer based on who had contributed code in the past, which I think I've seen before. But I couldn't find one generally available. This would be really helpful — we wouldn't have n people each assessing whether they're the best reviewer for each contribution. If anyone does better than me at finding something like this, that would be awesome. - Could we add a label so people can say ""now I'm waiting for a review"", and track how long those stay up? - Ensuring the 95th percentile is < 2 days is more important than the median being in the hours. It does pain me when I see PRs get dropped for a few weeks. TBC, I'm as responsible as anyone. - Could we have a bot that asks for feedback on the review process — i.e. ""I received a prompt and helpful review"", ""I would recommend a friend contribute to Xarray"", etc? ### Describe alternatives you've considered _No response_ ### Additional context There's always a danger with making stats legible that Goodhart's law strikes. And sometimes stats are not joyful, and lots of people come here for joy. 
So probably there's a tradeoff.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6243/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
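
Following up the snapshot idea from #8248 above: a minimal sketch of what the signature-diff tool could look like. This is not an existing xarray utility — `signature_diff` is an invented name, and a real version would live in the test suite and compare against a checked-in snapshot:

```python
import inspect
import xarray as xr

def signature_diff(method_name):
    # Parameter names, in order, for the same-named method on each class
    da_params = list(inspect.signature(getattr(xr.DataArray, method_name)).parameters)
    ds_params = list(inspect.signature(getattr(xr.Dataset, method_name)).parameters)
    shared = set(da_params) & set(ds_params)
    return {
        'only_on_DataArray': sorted(set(da_params) - shared),
        'only_on_Dataset': sorted(set(ds_params) - shared),
        # True when the shared parameters appear in different orders
        'order_differs': [p for p in da_params if p in shared]
        != [p for p in ds_params if p in shared],
    }

# A test could assert this equals a stored snapshot, so any new divergence
# between the two methods has to be acknowledged explicitly
print(signature_diff('to_zarr'))
```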