
issues


35 rows where user = 90008 sorted by updated_at descending


type 2

  • pull 18
  • issue 17

state 2

  • closed 25
  • open 10

repo 1

  • xarray 35
id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
1933712083 I_kwDOAMm_X85zQhrT 8289 segfault with a particular netcdf4 file hmaarrfk 90008 open 0     11 2023-10-09T20:07:17Z 2024-05-03T16:54:18Z   CONTRIBUTOR      

What happened?

The following code yields a segfault on my machine (and many other machines with a similar environment)

```python
import xarray

filename = 'tiny.nc.txt'
engine = "netcdf4"

dataset = xarray.open_dataset(filename, engine=engine)

for i in range(60):
    xarray.open_dataset(filename, engine=engine)
```

tiny.nc.txt mrc.nc.txt

What did you expect to happen?

Not to segfault.

Minimal Complete Verifiable Example

  1. Generate some netcdf4 with my application.
  2. Trim the netcdf4 file down (load it, and drop all the vars I can while still reproducing this bug)
  3. Try to read it.

```python
import xarray
from tqdm import tqdm

filename = 'mrc.nc.txt'
engine = "h5netcdf"
dataset = xarray.open_dataset(filename, engine=engine)
for i in tqdm(range(60), desc=f"filename={filename}, engine={engine}"):
    xarray.open_dataset(filename, engine=engine)

engine = "netcdf4"
dataset = xarray.open_dataset(filename, engine=engine)
for i in tqdm(range(60), desc=f"filename={filename}, engine={engine}"):
    xarray.open_dataset(filename, engine=engine)

filename = 'tiny.nc.txt'
engine = "h5netcdf"
dataset = xarray.open_dataset(filename, engine=engine)
for i in tqdm(range(60), desc=f"filename={filename}, engine={engine}"):
    xarray.open_dataset(filename, engine=engine)

engine = "netcdf4"
dataset = xarray.open_dataset(filename, engine=engine)
for i in tqdm(range(60), desc=f"filename={filename}, engine={engine}"):
    xarray.open_dataset(filename, engine=engine)
```

Hand-crafting the file from start to finish does not seem to segfault:

```python
import xarray
import numpy as np

engine = 'netcdf4'

dataset = xarray.Dataset()

coords = {}
coords['image_x'] = np.arange(1, dtype='int')
dataset = dataset.assign_coords(coords)

dataset['image'] = xarray.DataArray(
    np.zeros((1,), dtype='uint8'),
    dims=('image_x',),
)

# %%
dataset.to_netcdf('mrc.nc.txt')

# %%
dataset = xarray.open_dataset('mrc.nc.txt', engine=engine)

for i in range(10):
    xarray.open_dataset('mrc.nc.txt', engine=engine)
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

```
i=0 passes
i=1 mostly segfaults, but sometimes it can take more than 1 iteration
```

Anything else we need to know?

At first I thought it was deep in hdf5, but I am less convinced now.

xref: https://github.com/HDFGroup/hdf5/issues/3649

Environment

``` INSTALLED VERSIONS ------------------ commit: None python: 3.10.12 | packaged by Ramona Optics | (main, Jun 27 2023, 02:59:09) [GCC 12.3.0] python-bits: 64 OS: Linux OS-release: 6.5.1-060501-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.14.2 libnetcdf: 4.9.2 xarray: 2023.9.1.dev25+g46643bb1.d20231009 pandas: 2.1.1 numpy: 1.24.4 scipy: 1.11.3 netCDF4: 1.6.4 pydap: None h5netcdf: 1.2.0 h5py: 3.9.0 Nio: None zarr: 2.16.1 cftime: 1.6.2 nc_time_axis: None PseudoNetCDF: None iris: None bottleneck: None dask: 2023.3.0 distributed: 2023.3.0 matplotlib: 3.8.0 cartopy: None seaborn: None numbagg: None fsspec: 2023.9.2 cupy: None pint: 0.22 sparse: None flox: None numpy_groupies: None setuptools: 68.2.2 pip: 23.2.1 conda: 23.7.4 pytest: 7.4.2 mypy: None IPython: 8.16.1 sphinx: 7.2.6 ```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8289/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
2129180716 PR_kwDOAMm_X85mld8X 8736 Make list_chunkmanagers more resilient to broken entrypoints hmaarrfk 90008 closed 0     6 2024-02-11T21:37:38Z 2024-03-13T17:54:02Z 2024-03-13T17:54:02Z CONTRIBUTOR   0 pydata/xarray/pulls/8736

As I'm developing my custom chunk manager, I often switch between my development branch and my production branch, which breaks the entrypoint.

This made xarray impossible to import unless I re-ran pip install -e . -vv, which is somewhat tiring.

This should help xarray be more resilient to bugs in other software that installs malformed entrypoints.

Example:

```python
from xarray.core.parallelcompat import list_chunkmanagers

list_chunkmanagers()
# <ipython-input-3-19326f4950bc>:1: UserWarning: Failed to load entrypoint MyChunkManager
#   due to No module named 'my.array._chunkmanager'. Skipping.
#   list_chunkmanagers()
# {'dask': <xarray.core.daskmanager.DaskManager at 0x7f5b826231c0>}
```
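A minimal sketch of the kind of guard this implies, with illustrative names rather than xarray's actual internals: each entrypoint is loaded inside a try/except so one broken package cannot prevent the others (or xarray itself) from being used.

```python
import warnings


def load_chunkmanagers(entrypoints):
    """Load chunk manager entrypoints, skipping any that fail to import (illustrative sketch)."""
    loaded = {}
    for entrypoint in entrypoints:
        try:
            loaded[entrypoint.name] = entrypoint.load()()
        except Exception as exc:
            warnings.warn(
                f"Failed to load entrypoint {entrypoint.name} due to {exc}. Skipping.",
                stacklevel=2,
            )
    return loaded
```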

Thank you for considering.

  • [x] Closes #xxxx
  • [x] Tests added
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [x] New functions/methods are listed in api.rst

This is mostly a quality-of-life improvement for developers; I don't see it as a user-visible change.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8736/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
2128501296 I_kwDOAMm_X85-3low 8733 A basic default ChunkManager for arrays that report their own chunks hmaarrfk 90008 open 0     21 2024-02-10T14:36:55Z 2024-03-10T17:26:13Z   CONTRIBUTOR      

Is your feature request related to a problem?

I'm creating duck arrays for various file-backed data structures of mine that are naturally "chunked", i.e. different parts of the array may live in completely different files.

Using these "chunks" and "strides", algorithms can make better decisions about how to iterate in a convenient manner.

For example, an MP4 file's chunks may be defined as being delimited by I frames, while images stored in a TIFF may be delimited by a page.

So for me, chunks are not so useful for parallel computing, but more for computing locally and choosing the appropriate way to iterate through large arrays (terabytes of uncompressed data).

Describe the solution you'd like

I think a default chunk manager could simply implement compute as np.asarray and act as a catch-all for any chunked duck array not claimed by another chunk manager.

Advanced users could then go in and reimplement their own chunk manager, but I was unable to use my duck arrays that included a chunks property because they weren't associated with any chunk manager.

Something as simple as:

```patch
diff --git a/xarray/core/parallelcompat.py b/xarray/core/parallelcompat.py
index c009ef48..bf500abb 100644
--- a/xarray/core/parallelcompat.py
+++ b/xarray/core/parallelcompat.py
@@ -681,3 +681,26 @@ class ChunkManagerEntrypoint(ABC, Generic[T_ChunkedArray]):
         cubed.store
         """
         raise NotImplementedError()
+
+
+class DefaultChunkManager(ChunkManagerEntrypoint):
+    def __init__(self) -> None:
+        self.array_cls = None
+
+    def is_chunked_array(self, data: Any) -> bool:
+        return is_duck_array(data) and hasattr(data, "chunks")
+
+    def chunks(self, data: T_ChunkedArray) -> T_NormalizedChunks:
+        return data.chunks
+
+    def compute(self, *data: T_ChunkedArray | Any, **kwargs) -> tuple[np.ndarray, ...]:
+        return tuple(np.asarray(d) for d in data)
+
+    def normalize_chunks(self, *args, **kwargs):
+        raise NotImplementedError()
+
+    def from_array(self, *args, **kwargs):
+        raise NotImplementedError()
+
+    def apply_gufunc(self, *args, **kwargs):
+        raise NotImplementedError()
```

Describe alternatives you've considered

I created my own chunk manager, with my own chunk manager entry point.

Kinda tedious...

Additional context

It seems that this is related to: https://github.com/pydata/xarray/pull/7019

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8733/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
2131345470 PR_kwDOAMm_X85ms1Q6 8738 Don't break users that were already using ChunkManagerEntrypoint hmaarrfk 90008 closed 0     1 2024-02-13T02:17:55Z 2024-02-13T15:37:54Z 2024-02-13T03:21:32Z CONTRIBUTOR   0 pydata/xarray/pulls/8738

For example, this change just broke cubed:

https://github.com/xarray-contrib/cubed-xarray/blob/main/cubed_xarray/cubedmanager.py#L15

Not sure how much you care; it didn't seem like anybody other than me ever tried this module on GitHub...

  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8738/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
2131364916 PR_kwDOAMm_X85ms5QB 8739 Add a test for usability of duck arrays with chunks property hmaarrfk 90008 open 0     1 2024-02-13T02:46:47Z 2024-02-13T03:35:24Z   CONTRIBUTOR   0 pydata/xarray/pulls/8739

xref: https://github.com/pydata/xarray/issues/8733

```python
xarray/tests/test_variable.py F
=================================== FAILURES ===================================
_____________ TestAsCompatibleData.test_duck_array_with_chunks ________________
self = <xarray.tests.test_variable.TestAsCompatibleData object at 0x7f3d1b122e60>

    def test_duck_array_with_chunks(self):
        # Non indexable type
        class CustomArray(NDArrayMixin, indexing.ExplicitlyIndexed):
            def __init__(self, array):
                self.array = array

            @property
            def chunks(self):
                return self.shape

            def __array_function__(self, *args, **kwargs):
                return NotImplemented

            def __array_ufunc__(self, *args, **kwargs):
                return NotImplemented

        array = CustomArray(np.arange(3))
        assert is_chunked_array(array)
        var = Variable(dims=("x"), data=array)
>       var.load()

/home/mark/git/xarray/xarray/tests/test_variable.py:2745:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/mark/git/xarray/xarray/core/variable.py:936: in load
    self._data = to_duck_array(self._data, **kwargs)
/home/mark/git/xarray/xarray/namedarray/pycompat.py:129: in to_duck_array
    chunkmanager = get_chunked_array_type(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (CustomArray(array=array([0, 1, 2])),)
chunked_arrays = [CustomArray(array=array([0, 1, 2]))]
chunked_array_types = {<class 'xarray.tests.test_variable.TestAsCompatibleData.test_duck_array_with_chunks.<locals>.CustomArray'>}
chunkmanagers = {'dask': <xarray.namedarray.daskmanager.DaskManager object at 0x7f3d1b1568f0>}

    def get_chunked_array_type(*args: Any) -> ChunkManagerEntrypoint[Any]:
        """
        Detects which parallel backend should be used for given set of arrays.

        Also checks that all arrays are of same chunking type (i.e. not a mix of
        cubed and dask).
        """
        # TODO this list is probably redundant with something inside xarray.apply_ufunc
        ALLOWED_NON_CHUNKED_TYPES = {int, float, np.ndarray}

        chunked_arrays = [
            a
            for a in args
            if is_chunked_array(a) and type(a) not in ALLOWED_NON_CHUNKED_TYPES
        ]

        # Asserts all arrays are the same type (or numpy etc.)
        chunked_array_types = {type(a) for a in chunked_arrays}
        if len(chunked_array_types) > 1:
            raise TypeError(
                f"Mixing chunked array types is not supported, but received multiple types: {chunked_array_types}"
            )
        elif len(chunked_array_types) == 0:
            raise TypeError("Expected a chunked array but none were found")

        # iterate over defined chunk managers, seeing if each recognises this array type
        chunked_arr = chunked_arrays[0]
        chunkmanagers = list_chunkmanagers()
        selected = [
            chunkmanager
            for chunkmanager in chunkmanagers.values()
            if chunkmanager.is_chunked_array(chunked_arr)
        ]
        if not selected:
>           raise TypeError(
                f"Could not find a Chunk Manager which recognises type {type(chunked_arr)}"
E           TypeError: Could not find a Chunk Manager which recognises type <class 'xarray.tests.test_variable.TestAsCompatibleData.test_duck_array_with_chunks.<locals>.CustomArray'>

/home/mark/git/xarray/xarray/namedarray/parallelcompat.py:158: TypeError
============================== warnings summary ================================
xarray/testing/assertions.py:9
  /home/mark/git/xarray/xarray/testing/assertions.py:9: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
    import pandas as pd

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED xarray/tests/test_variable.py::TestAsCompatibleData::test_duck_array_with_chunks - TypeError: Could not find a Chunk Manager which recognises type <class 'xarray.tests.test_variable.Te...
========================= 1 failed, 1 warning in 0.77s =========================
```
  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8739/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
2034395026 PR_kwDOAMm_X85hnUnc 8534 Point users to where in their code they should make mods for Dataset.dims hmaarrfk 90008 closed 0     8 2023-12-10T14:31:29Z 2023-12-10T18:50:10Z 2023-12-10T18:23:42Z CONTRIBUTOR   0 pydata/xarray/pulls/8534

It's somewhat annoying to get warnings that point to a line within a library where the warning is issued. It really makes it unclear what one needs to change.

This points the warning to the user's access of the dims attribute.

  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8534/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1429172192 I_kwDOAMm_X85VL2_g 7239 include/exclude lists in Dataset.expand_dims hmaarrfk 90008 closed 0     6 2022-10-31T03:01:52Z 2023-11-05T06:29:06Z 2023-11-05T06:29:06Z CONTRIBUTOR      

Is your feature request related to a problem?

I would like to be able to expand the dimensions of a dataset, but most of the time I only want to expand a few key variables.

It would be nice if there were some kind of filter mechanism.

Describe the solution you'd like

```python
import xarray as xr

dataset = xr.Dataset(data_vars={'foo': 1, 'bar': 2})
dataset.expand_dims("zar", include_variables=["foo"])
# Only foo is expanded; bar is left alone.
```

Describe alternatives you've considered

Writing my own function. I'll probably do this.

Subclassing. Too confusing and easy to "diverge" from you all when you do decide to implement this.
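A minimal sketch of such a wrapper, assuming the goal is to expand a dimension on selected data variables only; the helper name and its keyword are illustrative, not an xarray API:

```python
import xarray as xr


def expand_dims_only(dataset, dim, variables):
    """Expand `dim` on the listed variables only, leaving the rest untouched (illustrative)."""
    expanded = dataset[list(variables)].expand_dims(dim)
    return expanded.merge(dataset.drop_vars(list(variables)))


dataset = xr.Dataset(data_vars={'foo': 1, 'bar': 2})
result = expand_dims_only(dataset, "zar", ["foo"])
# result['foo'] gains the "zar" dimension; result['bar'] stays a scalar.
```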

Additional context

For large datasets, you likely want only a few key variables expanded, not all of them.

xarray version: 2022.10.0

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7239/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1152047670 I_kwDOAMm_X85Eqto2 6309 Read/Write performance optimizations for netcdf files hmaarrfk 90008 open 0     5 2022-02-26T17:40:40Z 2023-09-13T08:27:47Z   CONTRIBUTOR      

What happened?

I'm not too sure this is a bug report, but I figured I would share some of the investigation I've done on the topic of writing large datasets to netcdf.

For clarity, the use case I'm considering is writing a large in-memory array to persistent storage on Linux.

  • Array size 4-100GB
  • File Format Netcdf
  • Dask: No.
  • Hardware: Some very modern SSD that can write more than 1GB/s (Sabrent Rocket 4 Plus for example).
  • Operating System: Linux
  • Xarray version: 0.21.1 (or around there)

The symptoms are twofold:

  1. The write speed is slow: about 1 GB/s, much less than the 2-3 GB/s you can get with other means.
  2. The Linux disk cache just keeps filling up.

It's quite hard to get good performance from these systems, so I'm going to put a few more constraints on the type of data we are writing:

  1. The underlying numpy array must be aligned to the Linux page boundary of 4096 bytes.
  2. The underlying numpy array must have been pre-faulted and not swapped. (Do not use np.zeros; it doesn't fault the memory.)

I feel these two conditions are rather easy to satisfy, as I'll show in my example.

What did you expect to happen?

I want to be able to write at 3.2GB/s with my shiny new SSD.

I want to leave my RAM unused when I'm archiving to disk.

Minimal Complete Verifiable Example

```python
import numpy as np
import xarray as xr


def empty_aligned(shape, dtype=np.float64, align=4096):
    if not isinstance(shape, tuple):
        shape = (shape,)

    dtype = np.dtype(dtype)
    size = dtype.itemsize
    # Compute the final size of the array
    for s in shape:
        size *= s

    a = np.empty(size + (align - 1), dtype=np.uint8)
    data_align = a.ctypes.data % align
    offset = 0 if data_align == 0 else (align - data_align)
    arr = a[offset:offset + size].view(dtype)
    # Don't use reshape since reshape might copy the data.
    # This is the suggested way to assign a new shape with guarantee
    # that the data won't be copied.
    arr.shape = shape
    return arr


dataset = xr.DataArray(
    empty_aligned((4, 1024, 1024, 1024), dtype='uint8'),
    name='mydata').to_dataset()

# Fault and write data to this dataset
dataset['mydata'].data[...] = 1

%time dataset.to_netcdf("test", engine='h5netcdf')
%time dataset.to_netcdf("test", engine='netcdf4')
```

Relevant log output

Both take about 3.5 s, equivalent to just about 1 GB/s.

To get to about 3 GB/s (taking about 1.27 s to write a 4 GB array), one needs to do a few things:

  1. You must align the underlying data to disk.
  2. h5netcdf (h5py) backend https://github.com/h5py/h5py/pull/2040
  3. netcdf4: https://github.com/Unidata/netcdf-c/pull/2206
  4. You must use a driver that bypasses the operating system cache
  5. https://github.com/h5py/h5py/pull/2041

For the h5netcdf backend you would have to add the following kwargs to the h5netcdf constructor:

```python
kwargs = {
    "invalid_netcdf": invalid_netcdf,
    "phony_dims": phony_dims,
    "decode_vlen_strings": decode_vlen_strings,
    "alignment_threshold": alignment_threshold,
    "alignment_interval": alignment_interval,
}
```

Anything else we need to know?

The main challenge is that while writing aligned data this way is REALLY fast, writing small chunks and unaligned data becomes REALLY slow.

Personally, I think that someone might be able to write a new HDF5 driver that does better optimization. I feel this could help people loading large datasets, which seems to be a large part of the xarray user community.

Environment

``` INSTALLED VERSIONS


commit: None python: 3.9.9 (main, Dec 29 2021, 07:47:36) [GCC 9.4.0] python-bits: 64 OS: Linux OS-release: 5.13.0-30-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.1 libnetcdf: 4.8.1

xarray: 0.21.1 pandas: 1.4.0 numpy: 1.22.2 scipy: 1.8.0 netCDF4: 1.5.8 pydap: None h5netcdf: 0.13.1 h5py: 3.6.0.post1 Nio: None zarr: None cftime: 1.5.2 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2022.01.1 distributed: None matplotlib: 3.5.1 cartopy: None seaborn: None numbagg: None fsspec: 2022.01.0 cupy: None pint: None sparse: None setuptools: 60.8.1 pip: 22.0.3 conda: None pytest: None IPython: 8.0.1 sphinx: None ```

h5py includes some additions of mine that allow you to use the DIRECT driver and I am using a version of HDF5 that is built with the DIRECT driver.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6309/reactions",
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 1
}
    xarray 13221727 issue
1773296009 I_kwDOAMm_X85pslmJ 7940 decide on how to handle `empty_like` hmaarrfk 90008 open 0     8 2023-06-25T13:48:46Z 2023-07-05T16:36:35Z   CONTRIBUTOR      

Is your feature request related to a problem?

Calling np.empty_like seems to instantiate the whole array.

```python
from xarray.tests import InaccessibleArray
import xarray as xr
import numpy as np

array = InaccessibleArray(np.zeros((3, 3), dtype="uint8"))
da = xr.DataArray(array, dims=["x", "y"])

np.empty_like(da)
```

```python
Traceback (most recent call last):
  File "/home/mark/t.py", line 8, in <module>
    np.empty_like(da)
  File "/home/mark/mambaforge/envs/dev/lib/python3.9/site-packages/xarray/core/common.py", line 165, in __array__
    return np.asarray(self.values, dtype=dtype)
  File "/home/mark/mambaforge/envs/dev/lib/python3.9/site-packages/xarray/core/dataarray.py", line 732, in values
    return self.variable.values
  File "/home/mark/mambaforge/envs/dev/lib/python3.9/site-packages/xarray/core/variable.py", line 614, in values
    return _as_array_or_item(self._data)
  File "/home/mark/mambaforge/envs/dev/lib/python3.9/site-packages/xarray/core/variable.py", line 314, in _as_array_or_item
    data = np.asarray(data)
  File "/home/mark/mambaforge/envs/dev/lib/python3.9/site-packages/xarray/tests/__init__.py", line 151, in __array__
    raise UnexpectedDataAccess("Tried accessing data")
xarray.tests.UnexpectedDataAccess: Tried accessing data
```

Describe the solution you'd like

I'm not too sure. This is why I raised this as a "feature" and not a bug.

On one hand, it is pretty hard to "get" the underlying class.

Is it a:

  • numpy array
  • a lazy thing that looks like a numpy array?
  • a dask array when it is dask?

I think that there are also some nuances between:

  1. Loading an nc file from a file (where things might be handled by dask even though you don't want them to be)
  2. Creating your xarray from in memory.

Describe alternatives you've considered

For now, I'm trying to avoid empty_like and zeros_like.

In general, we haven't seen much benefit from dask and cuda still needs careful memory management.
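A minimal sketch of the kind of workaround this implies, assuming all you need is a fresh in-memory array with the same dims, shape, and dtype, without touching the (possibly lazy) values; the helper name is illustrative:

```python
import numpy as np
import xarray as xr


def empty_like_without_loading(da: xr.DataArray) -> xr.DataArray:
    """Allocate a new array matching `da` using only its metadata (illustrative)."""
    data = np.empty(da.shape, dtype=da.dtype)  # shape/dtype do not require loading values
    return xr.DataArray(data, dims=da.dims, coords=da.coords, attrs=da.attrs)
```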

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7940/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1731320789 PR_kwDOAMm_X85Rougi 7883 Avoid one call to len when getting ndim of Variables hmaarrfk 90008 closed 0     3 2023-05-29T23:37:10Z 2023-07-03T15:44:32Z 2023-07-03T15:44:31Z CONTRIBUTOR   0 pydata/xarray/pulls/7883

I admit this is a super micro optimization but it avoids in certain cases the creation of a tuple, and a call to len on it.

I hit this as I was trying to understand why Variable indexing was so much slower than numpy indexing. It seems that bounds checking in python is just slower than in C.

Feel free to close this one if you don't want this kind of optimization.

  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7883/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1428549868 I_kwDOAMm_X85VJfDs 7237 The new NON_NANOSECOND_WARNING is not very nice to end users hmaarrfk 90008 closed 0     5 2022-10-30T01:56:59Z 2023-05-09T12:52:54Z 2022-11-04T20:13:20Z CONTRIBUTOR      

What is your issue?

The new nanosecond warning doesn't really point anybody to where they should change their code.

Nor does it really tell them how to fix it.

```python
import xarray as xr
import numpy as np

xr.DataArray(np.zeros(1, dtype='datetime64[us]'))
```

yields

xarray/core/variable.py:194: UserWarning: Converting non-nanosecond precision datetime values to nanosecond precision. This behavior can eventually be relaxed in xarray, as it is an artifact from pandas which is now beginning to support non-nanosecond precision values.

https://github.com/pydata/xarray/blob/f32d354e295c05fb5c5ece7862f77f19d82d5894/xarray/core/variable.py#L194

I think at the very least, the stacklevel should be specified when calling the warn function.

It isn't really pretty, but I've been passing a parameter when I expect to pass a warning up to the end user, e.g. https://github.com/vispy/vispy/pull/2405

However, others have not liked that approach.
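A minimal sketch of the stacklevel idea, assuming the warning is emitted a couple of frames below the user-facing constructor; the helper name and the exact depth are illustrative, not xarray's actual code:

```python
import warnings


def _warn_non_nanosecond():
    # stacklevel points the warning at the caller of the public constructor
    # rather than at the library line that emits it; the right depth depends
    # on the real call chain.
    warnings.warn(
        "Converting non-nanosecond precision datetime values to nanosecond precision.",
        UserWarning,
        stacklevel=3,
    )
```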

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7237/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1306457778 I_kwDOAMm_X85N3vay 6791 get_data or get_varibale method hmaarrfk 90008 closed 0     3 2022-07-15T20:24:31Z 2023-04-29T03:40:01Z 2023-04-29T03:40:01Z CONTRIBUTOR      

Is your feature request related to a problem?

I often store a few scalars or arrays in xarray containers.

However, when I want to optionally access their data, the code I have to run is:

```python
import numpy as np
import xarray as xr

dataset = xr.Dataset()

my_variable = dataset.get('my_variable', None)
if my_variable is not None:
    my_variable = my_variable.data
else:
    my_variable = np.asarray(1.0)  # the default value I actually want
```

Describe the solution you'd like

```python
import numpy as np
import xarray as xr

dataset = xr.Dataset()

my_variable = dataset.get_data('my_variable', np.asarray(1.0))
```
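A minimal sketch of how this could be written as a standalone helper today, pending such a method; get_data here mirrors the proposal and is not an existing xarray API:

```python
import numpy as np
import xarray as xr


def get_data(dataset: xr.Dataset, name: str, default):
    """Return the raw data of `name` if present, otherwise `default` (illustrative)."""
    variable = dataset.get(name)
    return variable.data if variable is not None else default


dataset = xr.Dataset()
my_variable = get_data(dataset, 'my_variable', np.asarray(1.0))
```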

Describe alternatives you've considered

No response

Additional context

Thank you!

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6791/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1675299031 I_kwDOAMm_X85j2wjX 7770 Provide a public API for adding new backends hmaarrfk 90008 closed 0     3 2023-04-19T17:06:24Z 2023-04-20T00:15:23Z 2023-04-20T00:15:23Z CONTRIBUTOR      

Is your feature request related to a problem?

I understand that this is a double-edged sword, but we were relying on BACKEND_ENTRYPOINTS being a dictionary mapping to a class, and that broke in

https://github.com/pydata/xarray/pull/7523

Describe the solution you'd like

Some agreed upon way that we could create a new backend. This would allow users to provide more custom parameters to file creation attributes and other options that are currently not exposed via xarray.

I've used this to overwrite some parameters like netcdf global variables.

I've also used this to add alignment_threshold and alignment_interval to h5netcdf.

I did it through a custom backend because it felt like a contentious feature at the time. (I really do think it helps performance).

Describe alternatives you've considered

A deprecation cycle in the future???

Maybe this could have been achieved by defining RELOADABLE_BACKEND_ENTRYPOINTS and leaving BACKEND_ENTRYPOINTS unchanged in signature.

Additional context

We used this to define the alignment within a file. netcdf4 exposed this as a global variable so we have to somewhat hack around it just before creation time.

I mean, you can probably say:

"Doing this is too complicated, we don't want to give any guarantees on this front."

I would agree with you.....

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7770/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
690546795 MDExOlB1bGxSZXF1ZXN0NDc3NDIwMTkz 4400 [WIP] Support nano second time encoding. hmaarrfk 90008 closed 0     10 2020-09-02T00:16:04Z 2023-03-26T20:59:00Z 2023-03-26T20:08:50Z CONTRIBUTOR   0 pydata/xarray/pulls/4400

Not too sure I have the bandwidth to complete this, seeing as cftime and datetime don't have nanoseconds, but maybe it can help somebody.

  • [x] Closes #4183
  • [x] Tests added
  • [ ] Passes isort . && black . && mypy . && flake8
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4400/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1475567394 PR_kwDOAMm_X85ESe3u 7356 Avoid loading entire dataset by getting the nbytes in an array hmaarrfk 90008 closed 0     14 2022-12-05T03:29:53Z 2023-03-17T17:31:22Z 2022-12-12T16:46:40Z CONTRIBUTOR   0 pydata/xarray/pulls/7356

Using .data accidentally tries to load whole lazy arrays into memory.

Sad.

  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7356/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
689502005 MDExOlB1bGxSZXF1ZXN0NDc2NTM3Mzk3 4395 WIP: Ensure that zarr.ZipStores are closed hmaarrfk 90008 closed 0     4 2020-08-31T20:57:49Z 2023-01-31T21:39:15Z 2023-01-31T21:38:23Z CONTRIBUTOR   0 pydata/xarray/pulls/4395

ZipStores aren't always closed, making it hard to use them as fluidly as regular zarr stores.

  • [ ] Closes #xxxx
  • [x] Tests added
  • [x] Passes isort . && black . && mypy . && flake8 # master doesn't pass black
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4395/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1432388736 I_kwDOAMm_X85VYISA 7245 coordinates not removed for variable encoding during reset_coords hmaarrfk 90008 open 0     5 2022-11-02T02:46:56Z 2023-01-15T16:23:15Z   CONTRIBUTOR      

What happened?

When calling reset_coords on a dataset that is loaded from disk, the coordinates are not removed from the encoding of the variable.

This means that at save time they will be re-saved as coordinates... annoying (and erroneous).

What did you expect to happen?

No response

Minimal Complete Verifiable Example

```python
import numpy as np
import xarray as xr

dataset = xr.Dataset(
    data_vars={'images': (('y', 'x'), np.zeros((10, 2)))},
    coords={'zar': 1}
)

dataset.to_netcdf('foo.nc', mode='w')

# %%
foo_loaded = xr.open_dataset('foo.nc')

foo_loaded_reset = foo_loaded.reset_coords()

# %%
assert 'zar' in foo_loaded.coords
assert 'zar' not in foo_loaded_reset.coords
assert 'zar' in foo_loaded_reset.data_vars
foo_loaded_reset.to_netcdf('bar.nc', mode='w')

# %% Now load the dataset
bar_loaded = xr.open_dataset('bar.nc')
assert 'zar' not in bar_loaded.coords, 'zar is erroneously a coordinate'

# %% This is the problem
assert 'zar' not in foo_loaded_reset.images.encoding['coordinates'].split(' '), "zar should not be in here"
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

Suggested fix in dataset.py, reset_coords:

```python
for _, variable in obj._variables.items():
    coords_in_encoding = set(variable.encoding.get('coordinates', ' ').split(' '))
    variable.encoding['coordinates'] = ' '.join(coords_in_encoding - set(names))
```

https://github.com/pydata/xarray/blob/513ee34f16cc8f9250a72952e33bf9b4c95d33d1/xarray/core/dataset.py#L1734
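Until something along those lines lands, a user-side workaround is to strip the stale entry from each variable's encoding after reset_coords and before saving. A minimal sketch, with an illustrative helper name:

```python
import xarray as xr


def drop_stale_coordinate_encoding(dataset: xr.Dataset, names) -> xr.Dataset:
    """Remove `names` from each variable's 'coordinates' encoding entry (illustrative)."""
    names = set(names)
    for variable in dataset.variables.values():
        coords_in_encoding = set(variable.encoding.get('coordinates', '').split())
        remaining = coords_in_encoding - names
        if remaining:
            variable.encoding['coordinates'] = ' '.join(sorted(remaining))
        else:
            variable.encoding.pop('coordinates', None)
    return dataset
```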

Environment

``` INSTALLED VERSIONS ------------------ commit: None python: 3.9.13 | packaged by Ramona Optics | (main, Aug 31 2022, 22:30:30) [GCC 10.4.0] python-bits: 64 OS: Linux OS-release: 5.15.0-50-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.2 libnetcdf: 4.8.1 xarray: 2022.10.0 pandas: 1.5.1 numpy: 1.23.4 scipy: 1.9.3 netCDF4: 1.6.1 pydap: None h5netcdf: 1.0.2 h5py: 3.7.0 Nio: None zarr: 2.13.3 cftime: 1.6.2 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2022.10.0 distributed: 2022.10.0 matplotlib: 3.6.1 cartopy: None seaborn: None numbagg: None fsspec: 2022.10.0 cupy: None pint: 0.20.1 sparse: None flox: None numpy_groupies: None setuptools: 65.5.0 pip: 22.3 conda: 22.9.0 pytest: 7.2.0 IPython: 7.33.0 sphinx: 5.3.0 /home/mark/mambaforge/envs/mcam_dev/lib/python3.9/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils. warnings.warn("Setuptools is replacing distutils.") ```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7245/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1524642393 I_kwDOAMm_X85a4DJZ 7428 Avoid instantiating the data in prepare_variable hmaarrfk 90008 open 0     0 2023-01-08T19:18:49Z 2023-01-09T06:25:52Z   CONTRIBUTOR      

Is your feature request related to a problem?

I'm trying to extend the features of xarray for a new backend I'm developing internally. The main use case is that we are trying to open a dataset of many hundreds of GB, slice out a smaller dataset (tens of GB), and write it.

However, functions like prepare_variable, as currently written, implicitly instantiate the whole data (potentially tens of GB), which incurs a huge "time cost" at a surprising (to me) point in the code.

https://github.com/pydata/xarray/blob/6e77f5e8942206b3e0ab08c3621ade1499d8235b/xarray/backends/h5netcdf_.py#L338

Describe the solution you'd like

Would it be possible to just remove the second return value from prepare_variable? It isn't particularly "useful", and it is easy to obtain from the inputs to the function.

Describe alternatives you've considered

I'm probably going to create a new method, with a not-so-well-chosen name like prepare_variable_no_data, that does the above, but only for my backend. My code path that needs this only uses our custom backend.

Additional context

I think this would be useful, in general for other users that need more out of memory computation. I've found that you really have to "buy into" dask, all the way to the end, if you want to see any benefits. As such, if somebody used a dask array, this would create a serial choke point in:

https://github.com/pydata/xarray/blob/6e77f5e8942206b3e0ab08c3621ade1499d8235b/xarray/backends/common.py#L308

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7428/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1468595351 PR_kwDOAMm_X85D6oci 7334 Remove code used to support h5py<2.10.0 hmaarrfk 90008 closed 0     1 2022-11-29T19:34:24Z 2022-11-30T23:30:41Z 2022-11-30T23:30:41Z CONTRIBUTOR   0 pydata/xarray/pulls/7334

It seems that the relevant issue was fixed in 2.10.0 https://github.com/h5py/h5py/commit/466181b178c1b8a5bfa6fb8f217319e021f647e0

I'm not sure how far back you want to fix things. I'm hoping to test this on the CI.

I found this since I've been auditing slowdowns in our codebase, which has caused me to review much of the reading pipeline.

Do you want to add a test for h5py>=2.10.0? Or can we assume that users won't install those versions together? https://pypi.org/project/h5py/2.10.0/

I could, for example, set the backend to not be available if a version of h5py that is too old is detected. Alternatively, one could just keep the code here.
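A minimal sketch of that kind of version guard, using importlib.metadata and the packaging library; the function name and where it would hook into the backend's availability check are illustrative:

```python
from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version


def h5py_is_recent_enough(minimum: str = "2.10.0") -> bool:
    """Return True only if an installed h5py meets the minimum version (illustrative)."""
    try:
        return Version(version("h5py")) >= Version(minimum)
    except PackageNotFoundError:
        return False
```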

  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7334/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1428274982 PR_kwDOAMm_X85BzXXR 7236 Expand benchmarks for dataset insertion and creation hmaarrfk 90008 closed 0     8 2022-10-29T13:55:19Z 2022-10-31T15:04:13Z 2022-10-31T15:03:58Z CONTRIBUTOR   0 pydata/xarray/pulls/7236

Taken from discussions in https://github.com/pydata/xarray/issues/7224#issuecomment-1292216344

Thank you @Illviljan

  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7236/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1423948375 I_kwDOAMm_X85U37pX 7224 Insertion speed of new dataset elements hmaarrfk 90008 open 0     3 2022-10-26T12:34:51Z 2022-10-29T22:39:39Z   CONTRIBUTOR      

What is your issue?

In https://github.com/pydata/xarray/pull/7221 I showed that a major contributor to the slowdown in inserting a new element was the cost associated with an internal-only debugging assert statement.

The benchmark results in 7221 and 7222 are pretty useful to look at.

Thank you for encouraging the creation of a "benchmark" so that we can monitor the performance of element insertion.

Unfortunately, that was the only "free" lunch I got.

A few other minor improvements can be obtained with: https://github.com/pydata/xarray/pull/7222

However, it seems to me that the fundamental reason this is "slow" is that element insertion is not so much "insertion" as it is:

  • a Dataset merge
  • a Dataset replacement of the internal variables

This is really solidified in the https://github.com/pydata/xarray/blob/main/xarray/core/dataset.py#L4918

In my benchmarks, I found that in the limit of large datasets, list comprehensions of 1000 elements or more were often used to "search" for variables that were "indexed": https://github.com/pydata/xarray/blob/ca57e5cd984e626487636628b1d34dca85cc2e7c/xarray/core/merge.py#L267

I think a few speedups can be obtained by avoiding these kinds of "searches" and list comprehensions. However, I think the dataset would have to provide this kind of information to the merge_core routine, instead of merge_core recreating it all the time.

Ultimately, I think you trade off "memory footprint" (due to the potential increase of datastructures you keep around) of a dataset, and "speed".

Anyway, I just wanted to share where I got.
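One practical consequence for user code, shown as a minimal sketch (variable names are illustrative): if you are adding many variables, batching them into a single assign call runs the merge machinery once instead of once per variable.

```python
import xarray as xr

N = 1000

# One merge per insertion: the merge/replace machinery runs N times.
slow = xr.Dataset()
for i in range(N):
    slow[f"var{i}"] = i

# A single merge for all N variables.
fast = xr.Dataset().assign({f"var{i}": i for i in range(N)})
```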

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7224/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1428264468 PR_kwDOAMm_X85BzVOE 7235 Fix type in benchmarks/merge.py hmaarrfk 90008 closed 0     0 2022-10-29T13:28:12Z 2022-10-29T15:52:45Z 2022-10-29T15:52:45Z CONTRIBUTOR   0 pydata/xarray/pulls/7235

I don't think this affects what is displayed; that is determined by param_names.

  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7235/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1423321834 PR_kwDOAMm_X85Bi5BN 7222 Actually make the fast code path return early for Aligner.align hmaarrfk 90008 closed 0     6 2022-10-26T01:59:09Z 2022-10-28T16:22:36Z 2022-10-28T16:22:35Z CONTRIBUTOR   0 pydata/xarray/pulls/7222

In relation to my other PR.

Without this PR

With the early return

Removing the frivolous copy (does not pass tests) ![image](https://user-images.githubusercontent.com/90008/197916632-dbc89c21-94a9-4b92-af11-5b1fa5f5cddd.png)
Code for benchmark:

```python
from tqdm import tqdm
import xarray as xr
from time import perf_counter
import numpy as np

N = 1000

# Everybody is lazy loading now, so let's force modules to get instantiated
dummy_dataset = xr.Dataset()
dummy_dataset['a'] = 1
dummy_dataset['b'] = 1
del dummy_dataset

time_elapsed = np.zeros(N)
dataset = xr.Dataset()
# tqdm = iter
for i in tqdm(range(N)):
    time_start = perf_counter()
    dataset[f"var{i}"] = i
    time_end = perf_counter()
    time_elapsed[i] = time_end - time_start

# %%
from matplotlib import pyplot as plt

plt.plot(np.arange(N), time_elapsed * 1E3, label='Time to add one variable')
plt.xlabel("Number of existing variables")
plt.ylabel("Time to add a variable (ms)")
plt.ylim([0, 10])
plt.grid(True)
```

xref: https://github.com/pydata/xarray/pull/7221

  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7222/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1423312198 PR_kwDOAMm_X85Bi3Dp 7221 Remove debugging slow assert statement hmaarrfk 90008 closed 0     13 2022-10-26T01:43:08Z 2022-10-28T02:49:44Z 2022-10-28T02:49:44Z CONTRIBUTOR   0 pydata/xarray/pulls/7221

We've been trying to understand why our code is slow. One part is that we use xarray.Datasets almost like dictionaries for our data. The following code is quite common for us

```python
import xarray as xr

dataset = xr.Dataset()
dataset['a'] = 1
dataset['b'] = 2
```

However, through benchmarks, it became obvious that the merge_core method of xarray was causing a lot of slowdowns. Main branch:

With this merge request:

```python
from tqdm import tqdm
import xarray as xr
from time import perf_counter
import numpy as np

N = 1000

# Everybody is lazy loading now, so let's force modules to get instantiated
dummy_dataset = xr.Dataset()
dummy_dataset['a'] = 1
dummy_dataset['b'] = 1
del dummy_dataset

time_elapsed = np.zeros(N)
dataset = xr.Dataset()

for i in tqdm(range(N)):
    time_start = perf_counter()
    dataset[f"var{i}"] = i
    time_end = perf_counter()
    time_elapsed[i] = time_end - time_start

# %%
from matplotlib import pyplot as plt

plt.plot(np.arange(N), time_elapsed * 1E3, label='Time to add one variable')
plt.xlabel("Number of existing variables")
plt.ylabel("Time to add a variable (ms)")
plt.ylim([0, 50])
plt.grid(True)
```

  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7221/reactions",
    "total_count": 2,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 2,
    "eyes": 0
}
    xarray 13221727 pull
1423916687 PR_kwDOAMm_X85Bk2By 7223 Dataset insertion benchmark hmaarrfk 90008 closed 0     2 2022-10-26T12:09:14Z 2022-10-27T15:38:09Z 2022-10-27T15:38:09Z CONTRIBUTOR   0 pydata/xarray/pulls/7223

xref: https://github.com/pydata/xarray/pull/7221

  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7223/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
1410575877 PR_kwDOAMm_X85A4LHp 7172 Lazy import dask.distributed to reduce import time of xarray hmaarrfk 90008 closed 0     9 2022-10-16T18:25:31Z 2022-10-18T17:41:50Z 2022-10-18T17:06:34Z CONTRIBUTOR   0 pydata/xarray/pulls/7172

I was auditing the import time of my software and found that distributed added a not-insignificant amount of time to the import of xarray:

Using tuna, one can find that the following are sources of delay in import time for xarray:

To audit, one can use the command:

```
python -X importtime -c "import numpy as np; import pandas as pd; import dask.array; import xarray as xr" 2> import.log && tuna import.log
```

The command as written breaks out the import time of numpy, pandas, and dask.array, to allow you to focus on "other" costs within xarray. Main branch:

Proposed:

One would be tempted to think that this is due to xarray.testing and xarray.tutorial, but those just move the imports one level down in the tuna graphs.
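A minimal sketch of the lazy-import pattern this relies on, with an illustrative function: the expensive module is imported inside the code path that needs it, so a plain `import xarray` no longer pays for it.

```python
def _acquire_distributed_lock(key):
    # Importing here defers the cost of dask.distributed until this
    # code path is actually used.
    from dask.distributed import Lock

    return Lock(key)
```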

  • [x] ~~Closes~~
  • [x] ~~Tests added~~
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [x] ~~New functions/methods are listed in api.rst~~
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7172/reactions",
    "total_count": 3,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 3,
    "eyes": 0
}
    xarray 13221727 pull
1098915891 I_kwDOAMm_X85BgCAz 6153 [FEATURE]: to_netcdf and additional keyword arguments hmaarrfk 90008 open 0     2 2022-01-11T09:39:35Z 2022-01-20T06:54:25Z   CONTRIBUTOR      

Is your feature request related to a problem?

I briefly tried to see if any issue was brought up about this but couldn't find one.

I'm hoping to be able to pass additional keyword arguments to the engine when using to_netcdf. https://xarray.pydata.org/en/stable/generated/xarray.open_dataset.html

However, it doesn't seem too easy to do so.

Similar to how open_dataset has an additional **kwargs parameter, would it be reasonable to add a similar parameter, maybe engine_kwargs to the to_netcdf to allow users to pass additional parameters to the engine?

Describe the solution you'd like

```python
import xarray as xr
import numpy as np

dataset = xr.DataArray(
    data=np.zeros(3),
    name="hello"
).to_dataset()

dataset.to_netcdf(
    "my_file.nc",
    engine="h5netcdf",
    engine_kwargs={"decode_vlen_strings": True},
)
```

Describe alternatives you've considered

One could forward the additional keyword arguments with **kwargs. I just feel like this makes things less "explicit".

Additional context

h5netcdf emits a warning that is hard to disable without passing a keyword argument to the constructor. https://github.com/h5netcdf/h5netcdf/issues/132

Also, for performance reasons, it might be very good to tune things like the storage data alignment.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6153/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1098924491 PR_kwDOAMm_X84wyU7M 6154 Use base ImportError not MoudleNotFoundError when testing for plugins hmaarrfk 90008 closed 0     4 2022-01-11T09:48:36Z 2022-01-11T10:28:51Z 2022-01-11T10:24:57Z CONTRIBUTOR   0 pydata/xarray/pulls/6154

Admittedly i had a pretty broken environment (I manually uninstalled C dependencies for python packages installed with conda), but I still expected xarray to "work" with a different backend.

I hope the comments in the code explain why ImportError is preferred to ModuleNotFoundError.

Thank you for considering.

  • [ ] Closes #xxxx
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6154/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
347962055 MDU6SXNzdWUzNDc5NjIwNTU= 2347 Serialization of just coordinates hmaarrfk 90008 closed 0     6 2018-08-06T15:03:29Z 2022-01-09T04:28:49Z 2022-01-09T04:28:49Z CONTRIBUTOR      

In the search for the perfect data storage mechanism, I find myself needing to store some of the images I am generating and their metadata separately. It is really useful for me to serialize just the coordinates of my DataArray.

My serialization method of choice is json since it allows me to read the metadata with just a text editor. For that, having the coordinates as a self contained dictionary is really important.

Currently, I convert just the coordinates to a dataset, and serialize that. The code looks something like this:

```python
import xarray as xr
import numpy as np

# Set up an array with coordinates
n = np.zeros(3)
coords = {'x': np.arange(3)}
m = xr.DataArray(n, dims=['x'], coords=coords)

coords_dataset_dict = m.coords.to_dataset().to_dict()
coords_dict = coords_dataset_dict['coords']

# Read/Write the dictionary to a JSON file here.

# This works, but I'm essentially creating an empty dataset for it
coords_set = xr.Dataset.from_dict(coords_dataset_dict)
coords2 = coords_set.coords  # so many coords :D
m2 = xr.DataArray(np.zeros(shape=m.shape), dims=m.dims, coords=coords2)
```
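A minimal sketch of the JSON round trip mentioned in the comment above, assuming the coordinate values are JSON-serializable (the file name is illustrative):

```python
import json

# Write just the coordinates to a human-readable text file...
with open('coords.json', 'w') as f:
    json.dump(coords_dict, f)

# ...and read it back into the structure expected by Dataset.from_dict.
with open('coords.json') as f:
    coords_dataset_dict = {'dims': {}, 'attrs': {}, 'coords': json.load(f)}
```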

Would encapsulating this functionality in the Coordinates class be accepted as a PR?

It would add 2 functions that would look like:

```python
def to_dict(self):
    # offload the heavy lifting to the Dataset class
    return self.to_dataset().to_dict()['coords']

def from_dict(self, d):
    # Offload the heavy lifting again to the Dataset class
    d_dataset = {'dims': [], 'attrs': [], 'coords': d}
    return Dataset.from_dict(d_dataset).coords
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2347/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
689390592 MDU6SXNzdWU2ODkzOTA1OTI= 4394 Is it possible to append_dim to netcdf stores hmaarrfk 90008 closed 0     2 2020-08-31T18:02:46Z 2020-08-31T22:11:10Z 2020-08-31T22:11:09Z CONTRIBUTOR      

Is your feature request related to a problem? Please describe.

Feature request: it seems that it should be possible to append to netcdf4 stores along the unlimited dimensions. Is there an example of this?

Describe the solution you'd like

I would like the following code to be valid:

```python
from xarray.tests.test_dataset import create_append_test_data

ds, ds_to_append, ds_with_new_var = create_append_test_data()

filename = 'test_dataset.nc'

# Choose any one of
# engine : {'netcdf4', 'scipy', 'h5netcdf'}
engine = 'netcdf4'
ds.to_netcdf(filename, mode='w', unlimited_dims=['time'], engine=engine)
ds_to_append.to_netcdf(filename, mode='a', unlimited_dims=['time'], engine=engine)
```

Describe alternatives you've considered

I guess you could use zarr, but the fact that it creates multiple files is a problem.

Additional context

xarray version: 0.16.0

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4394/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
587398134 MDExOlB1bGxSZXF1ZXN0MzkzMzQ5NzIx 3888 [WIP] [DEMO] Add tests for ZipStore for zarr hmaarrfk 90008 closed 0     6 2020-03-25T02:29:20Z 2020-03-26T04:23:05Z 2020-03-25T21:57:09Z CONTRIBUTOR   0 pydata/xarray/pulls/3888
  • [ ] Related to #3815
  • [ ] Tests added
  • [ ] Passes isort -rc . && black . && mypy . && flake8
  • [ ] Fully documented, including whats-new.rst for all changes and api.rst for new API
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3888/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
573577844 MDU6SXNzdWU1NzM1Nzc4NDQ= 3815 Opening from zarr.ZipStore fails to read (store???) unicode characters hmaarrfk 90008 open 0     20 2020-03-01T16:49:25Z 2020-03-26T04:22:29Z   CONTRIBUTOR      

See upstream: https://github.com/zarr-developers/zarr-python/issues/551

It seems that using a ZipStore creates 1 byte objects for Unicode string attributes.

For example, saving the same Dataset with a DirectoryStore and a ZipStore creates an attribute for a unicode array that is 20 bytes in size in the first and 1 byte in size in the second.

In fact, Ubuntu's file-roller isn't even able to extract the files.

I have a feeling it is due to the note in the zarr documentation

Note that Zip files do not provide any way to remove or replace existing entries.

https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.ZipStore

MCVE Code Sample

ZipStore:

```python
import xarray as xr
import zarr

x = xr.Dataset()
x['hello'] = 'world'

with zarr.ZipStore('test_store.zip', mode='w') as store:
    x.to_zarr(store)
with zarr.ZipStore('test_store.zip', mode='r') as store:
    x_read = xr.open_zarr(store).compute()
```

Issued error:

```python
---------------------------------------------------------------------------
BadZipFile                                Traceback (most recent call last)
<ipython-input-1-2a92a6db56ab> in <module>
      7     x.to_zarr(store)
      8 with zarr.ZipStore('test_store.zip', mode='r') as store:
----> 9     x_read = xr.open_zarr(store).compute()

~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/core/dataset.py in compute(self, **kwargs)
    803         """
    804         new = self.copy(deep=False)
--> 805         return new.load(**kwargs)

~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/core/dataset.py in load(self, **kwargs)
    655         for k, v in self.variables.items():
    656             if k not in lazy_data:
--> 657                 v.load()

~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/core/variable.py in load(self, **kwargs)
    370             self._data = as_compatible_data(self._data.compute(**kwargs))
    371         elif not hasattr(self._data, "__array_function__"):
--> 372             self._data = np.asarray(self._data)

~/miniconda3/envs/dev/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83
     84     """
---> 85     return array(a, dtype, copy=False, order=order)

~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/core/indexing.py in __array__(self, dtype)
    545     def __array__(self, dtype=None):
    546         array = as_indexable(self.array)
--> 547         return np.asarray(array[self.key], dtype=None)

~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/backends/zarr.py in __getitem__(self, key)
     46         array = self.get_array()
     47         if isinstance(key, indexing.BasicIndexer):
---> 48             return array[key.tuple]

~/miniconda3/envs/dev/lib/python3.7/site-packages/zarr/core.py in __getitem__(self, selection)
    570
    571         fields, selection = pop_fields(selection)
--> 572         return self.get_basic_selection(selection, fields=fields)

~/miniconda3/envs/dev/lib/python3.7/site-packages/zarr/core.py in get_basic_selection(self, selection, out, fields)
    693         if self._shape == ():
    694             return self._get_basic_selection_zd(selection=selection, out=out,
--> 695                                                 fields=fields)

~/miniconda3/envs/dev/lib/python3.7/site-packages/zarr/core.py in _get_basic_selection_zd(self, selection, out, fields)
    709             # obtain encoded data for chunk
    710             ckey = self._chunk_key((0,))
--> 711             cdata = self.chunk_store[ckey]

~/miniconda3/envs/dev/lib/python3.7/site-packages/zarr/storage.py in __getitem__(self, key)
   1249         with self.mutex:
   1250             with self.zf.open(key) as f:  # will raise KeyError
-> 1251                 return f.read()

~/miniconda3/envs/dev/lib/python3.7/zipfile.py in read(self, n)
    914             self._offset = 0
    915         while not self._eof:
--> 916             buf += self._read1(self.MAX_N)

~/miniconda3/envs/dev/lib/python3.7/zipfile.py in _read1(self, n)
   1018         if self._left <= 0:
   1019             self._eof = True
-> 1020         self._update_crc(data)

~/miniconda3/envs/dev/lib/python3.7/zipfile.py in _update_crc(self, newdata)
    946         # Check the CRC if we're at the end of the file
    947         if self._eof and self._running_crc != self._expected_crc:
--> 948             raise BadZipFile("Bad CRC-32 for file %r" % self.name)

BadZipFile: Bad CRC-32 for file 'hello/0'
```

Working Directory Store example

```python
import xarray as xr
import zarr

x = xr.Dataset()
x['hello'] = 'world'
x
store = zarr.DirectoryStore('test_store2.zarr')
x.to_zarr(store)
x_read = xr.open_zarr(store)
x_read.compute()
assert x_read.hello == x.hello
```

Expected Output

Reading the string variable back from the ZipStore should succeed, just as it does in the DirectoryStore example above.

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:33:48) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-40-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8
libhdf5: None
libnetcdf: None
xarray: 0.14.1
pandas: 1.0.0
numpy: 1.17.5
scipy: 1.4.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.10.1
distributed: 2.10.0
matplotlib: 3.1.3
cartopy: None
seaborn: None
numbagg: None
setuptools: 45.1.0.post20200119
pip: 20.0.2
conda: None
pytest: 5.3.1
IPython: 7.12.0
sphinx: 2.3.1
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3815/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
335608017 MDU6SXNzdWUzMzU2MDgwMTc= 2251 netcdf roundtrip fails to preserve the shape of numpy arrays in attributes hmaarrfk 90008 closed 0     5 2018-06-25T23:52:07Z 2018-08-29T16:06:29Z 2018-08-29T16:06:28Z CONTRIBUTOR      

Code Sample

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.zeros((3, 3)), dims=('y', 'x'))
a.attrs['my_array'] = np.arange(6, dtype='uint8').reshape(2, 3)

a.to_netcdf('a.nc')
b = xr.open_dataarray('a.nc')
b.load()
assert np.all(b == a)
print('all arrays equal')

assert b.dtype == a.dtype
print('dtypes equal')

print(a.my_array.shape)
print(b.my_array.shape)
assert a.my_array.shape == b.my_array.shape
```

Problem description

I have some metadata in the form of numpy arrays, and I would expect it to round-trip through netCDF; instead, the attribute's shape is not preserved.

Expected Output

The attribute array should come back with the same shape it was written with, (2, 3).
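One possible workaround, not taken from the issue itself: since netCDF stores attributes as one-dimensional vectors, the shape can be recorded in a companion attribute (here the hypothetical name my_array_shape) and restored after reading.

```python
# Workaround sketch: store the attribute flattened plus its shape, then reshape
# on read. 'my_array_shape' is a made-up companion attribute name.
import numpy as np
import xarray as xr

a = xr.DataArray(np.zeros((3, 3)), dims=('y', 'x'))
arr = np.arange(6, dtype='uint8').reshape(2, 3)
a.attrs['my_array'] = arr.ravel()
a.attrs['my_array_shape'] = arr.shape

a.to_netcdf('a_workaround.nc')
b = xr.open_dataarray('a_workaround.nc')

restored = np.asarray(b.attrs['my_array']).reshape(b.attrs['my_array_shape'])
assert restored.shape == arr.shape
assert np.array_equal(restored, arr)
```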

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.16.15-300.fc28.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
xarray: 0.10.7
pandas: 0.23.0
numpy: 1.14.4
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.5
distributed: 1.21.8
matplotlib: 2.2.2
cartopy: None
seaborn: None
setuptools: 39.2.0
pip: 9.0.3
conda: None
pytest: 3.6.1
IPython: 6.4.0
sphinx: 1.7.5
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2251/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
347712372 MDExOlB1bGxSZXF1ZXN0MjA2MjQ3MjE4 2344 FutureWarning: creation of DataArrays w/ coords Dataset hmaarrfk 90008 closed 0     7 2018-08-05T16:34:59Z 2018-08-06T16:02:09Z 2018-08-06T16:02:09Z CONTRIBUTOR   0 pydata/xarray/pulls/2344

Previously, this would raise a:

FutureWarning: iteration over an xarray.Dataset will change in xarray v0.11 to only include data variables, not coordinates. Iterate over the Dataset.variables property instead to preserve existing behavior in a forwards compatible manner.
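For reference, a small sketch (not from the pull request) of what the warning's advice means in practice: iterating over Dataset.variables yields both data variables and coordinates, while iterating over the Dataset itself yields only data variables in newer xarray versions.

```python
# Illustration of the forward-compatible iteration pattern recommended by the
# warning text above.
import numpy as np
import xarray as xr

ds = xr.Dataset({'a': ('x', np.arange(3))}, coords={'x': np.arange(3)})

print(list(ds.variables))  # includes both the data variable 'a' and the coordinate 'x'
print(list(ds))            # data variables only in xarray >= 0.11
```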

  • [ ] Closes #xxxx (remove if there is no corresponding issue, which should only be the case for minor changes)
  • [ ] Tests added (for all bug fixes or enhancements)
  • [ ] Tests passed (for all non-documentation changes)
  • [ ] Fully documented, including whats-new.rst for all changes and api.rst for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2344/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
347558405 MDU6SXNzdWUzNDc1NTg0MDU= 2340 expand_dims erases named dim in the array's coordinates hmaarrfk 90008 closed 0     5 2018-08-03T23:00:07Z 2018-08-05T01:15:49Z 2018-08-04T03:39:49Z CONTRIBUTOR      

Code Sample, a copy-pastable example if possible

```python
# %%
import xarray as xa
import numpy as np

n = np.zeros((3, 2))

data = xa.DataArray(n, dims=['y', 'x'], coords={'y': range(3), 'x': range(2)})

data = data.assign_coords(z=xa.DataArray(np.arange(6).reshape((3, 2)), dims=['y', 'x']))

print('Original Data')
print('=============')
print(data)

# %%
my_slice = data[0, 1]
print("Sliced data")
print("===========")
print("z coordinate remembers it's own x value")
print(f'x = {my_slice.z.x}')

# %%
expanded_slice = data[0, 1].expand_dims('x')
print("expanded slice")
print("==============")
print("forgot that 'z' had 'x' coordinates")
print("but remembered it had a 'y' coordinate")
print(f"z = {expanded_slice.z}")
print(expanded_slice.z.x)
```

Output:

```
Original Data
=============
<xarray.DataArray (y: 3, x: 2)>
array([[0., 0.],
       [0., 0.],
       [0., 0.]])
Coordinates:
  * y        (y) int32 0 1 2
  * x        (x) int32 0 1
    z        (y, x) int32 0 1 2 3 4 5
Sliced data
===========
z coordinate remembers it's own x value
x = <xarray.DataArray 'x' ()>
array(1)
Coordinates:
    y        int32 0
    x        int32 1
    z        int32 1
expanded slice
==============
forgot that 'z' had 'x' coordinates
but remembered it had a 'y' coordinate
z = <xarray.DataArray 'z' ()>
array(1)
Coordinates:
    y        int32 0
    z        int32 1
AttributeError: 'DataArray' object has no attribute 'x'
```

Problem description

The coordinate used to have an explicit dimension. When we expand the dimension, that information should not be erased. Note that information about the other coordinates is maintained.

The challenge

The coordinates probably have fewer dimensions than the original data. I'm not sure about xarray's model, but a few challenges come to mind:

1. Is the relative order of dimensions maintained between arrays in the same Dataset/DataArray?
2. Can coordinates have MORE dimensions than the array itself?

The answer to these two questions may make or break a fix. If the relative order is not guaranteed, this becomes a very difficult problem to solve, since we don't know where to insert the new dimension in the coordinate array.
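A possible manual workaround, sketched here on the assumption that re-attaching the coordinate by hand is acceptable (this is not something the issue proposes): after expand_dims, re-assign z as a coordinate along the restored x dimension.

```python
# Workaround sketch: put the 'x' dimension back on the 'z' coordinate manually.
import numpy as np
import xarray as xr

data = xr.DataArray(np.zeros((3, 2)), dims=['y', 'x'],
                    coords={'y': range(3), 'x': range(2)})
data = data.assign_coords(z=xr.DataArray(np.arange(6).reshape(3, 2), dims=['y', 'x']))

my_slice = data[0, 1]
expanded = my_slice.expand_dims('x')

# expand_dims dropped the 'x' dimension of 'z'; re-attach it explicitly.
expanded = expanded.assign_coords(z=('x', [my_slice.z.item()]))
assert 'x' in expanded.z.dims
```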

Output of xr.show_versions()

```
xa.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None
xarray: 0.10.7
pandas: 0.23.1
numpy: 1.14.3
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.1
distributed: 1.22.0
matplotlib: 2.2.2
cartopy: None
seaborn: None
setuptools: 39.2.0
pip: 9.0.3
conda: None
pytest: 3.7.1
IPython: 6.4.0
sphinx: 1.7.5
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2340/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
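As an illustration of how this schema can be queried outside of Datasette (the database filename below is hypothetical), here is a minimal Python snippet using only the standard library:

```python
# Query a local SQLite copy of the issues table defined above.
import sqlite3

conn = sqlite3.connect("github.db")  # hypothetical local copy of this database
rows = conn.execute(
    "SELECT number, title, state, updated_at "
    "FROM issues ORDER BY updated_at DESC"
).fetchall()
for number, title, state, updated_at in rows:
    print(number, state, updated_at, title)
conn.close()
```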