
issues

416 rows where comments = 4 and type = "issue" sorted by updated_at descending

id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
2276352251 I_kwDOAMm_X86HrmD7 8994 Improving performance of open_datatree TomNicholas 35968931 open 0     4 2024-05-02T19:43:17Z 2024-05-03T15:25:33Z   MEMBER      

What is your issue?

The implementation of open_datatree works, but is inefficient, because it calls open_dataset once for every group in the file. We should refactor this to improve the performance, which would fix issues like https://github.com/xarray-contrib/datatree/issues/330.

We discussed this in the datatree meeting, and my understanding is that concretely we need to:

  • [ ] Create an asv benchmark for open_datatree, probably involving first writing and then benchmarking the opening of a special netCDF file that has no data but lots of groups (see the sketch after this list).
  • [ ] Refactor the NetCDFDatastore class to only create one CachingFileManager object per file, not one per group, see https://github.com/pydata/xarray/blob/748bb3a328a65416022ec44ced8d461f143081b5/xarray/backends/netCDF4_.py#L406.
  • [ ] Refactor NetCDF4BackendEntrypoint.open_datatree to use an implementation that goes through NetCDFDatastore without calling the top-level xr.open_dataset again.
  • [ ] Check the performance of calling xr.open_datatree on a netCDF file has actually improved.
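
A rough sketch of what that benchmark could look like, following common asv conventions (the class name, file layout, and the availability of xr.open_datatree are assumptions here, not the actual xarray asv_bench code):

```python
import numpy as np
import xarray as xr


class IOReadDataTreeNetCDF4:
    """Hypothetical asv benchmark: one file, many near-empty groups."""

    def setup(self):
        self.filepath = "many_groups.nc"
        ds = xr.Dataset({"x": ("dim", np.arange(1))})
        # Write one tiny dataset per group; mode="a" adds groups to the file.
        ds.to_netcdf(self.filepath, mode="w", group="/group0")
        for i in range(1, 100):
            ds.to_netcdf(self.filepath, mode="a", group=f"/group{i}")

    def time_open_datatree(self):
        # Assumes open_datatree is exposed as xr.open_datatree once integrated.
        xr.open_datatree(self.filepath)
```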

It would be great to get this done soon as part of the datatree integration project. @kmuehlbauer I know you were interested - are you willing / do you have time to take this task on?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8994/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
2163608564 I_kwDOAMm_X86A9gv0 8802 Error when using `apply_ufunc` with `datetime64` as output dtype gcaria 44147817 open 0     4 2024-03-01T15:09:57Z 2024-05-03T12:19:14Z   CONTRIBUTOR      

What happened?

When using apply_ufunc with datetime64[ns] as the output dtype, the code throws an error about converting from specific units to generic datetime units.

What did you expect to happen?

No response

Minimal Complete Verifiable Example

```Python
import xarray as xr
import numpy as np

def _fn(arr: np.ndarray, time: np.ndarray) -> np.ndarray:
    return time[:10]

def fn(da: xr.DataArray) -> xr.DataArray:
    dim_out = "time_cp"

    return xr.apply_ufunc(
        _fn,
        da,
        da.time,
        input_core_dims=[["time"], ["time"]],
        output_core_dims=[[dim_out]],
        vectorize=True,
        dask="parallelized",
        output_dtypes=["datetime64[ns]"],
        dask_gufunc_kwargs={"allow_rechunk": True,
                            "output_sizes": {dim_out: 10}},
        exclude_dims=set(("time",)),
    )

da_fake = xr.DataArray(
    np.random.rand(5, 5, 5),
    coords=dict(
        x=range(5),
        y=range(5),
        time=np.array(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
                      dtype='datetime64[ns]'),
    ),
).chunk(dict(x=2, y=2))

fn(da_fake.compute()).compute()  # ValueError: Cannot convert from specific units to generic units in NumPy datetimes or timedeltas

fn(da_fake).compute()  # same error as above
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

```Python
ValueError                                Traceback (most recent call last)
Cell In[211], line 1
----> 1 fn(da_fake).compute()

File /srv/conda/envs/notebook/lib/python3.10/site-packages/xarray/core/dataarray.py:1163, in DataArray.compute(self, **kwargs)
   1144 """Manually trigger loading of this array's data from disk or a
   1145 remote source into memory and return a new array. The original is
   1146 left unaltered.
   (...)
   1160 dask.compute
   1161 """
   1162 new = self.copy(deep=False)
-> 1163 return new.load(**kwargs)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/xarray/core/dataarray.py:1137, in DataArray.load(self, **kwargs)
   1119 def load(self, **kwargs) -> Self:
   1120     """Manually trigger loading of this array's data from disk or a
   1121     remote source into memory and return this array.
   1122     (...)
   1135     dask.compute
   1136     """
-> 1137 ds = self._to_temp_dataset().load(**kwargs)
   1138 new = self._from_temp_dataset(ds)
   1139 self._variable = new._variable

File /srv/conda/envs/notebook/lib/python3.10/site-packages/xarray/core/dataset.py:853, in Dataset.load(self, **kwargs)
   850 chunkmanager = get_chunked_array_type(*lazy_data.values())
   852 # evaluate all the chunked arrays simultaneously
--> 853 evaluated_data = chunkmanager.compute(*lazy_data.values(), **kwargs)
   855 for k, data in zip(lazy_data, evaluated_data):
   856     self.variables[k].data = data

File /srv/conda/envs/notebook/lib/python3.10/site-packages/xarray/core/daskmanager.py:70, in DaskManager.compute(self, *data, **kwargs)
   67 def compute(self, *data: DaskArray, **kwargs) -> tuple[np.ndarray, ...]:
   68     from dask.array import compute
---> 70 return compute(*data, **kwargs)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/dask/base.py:628, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
   625 postcomputes.append(x.__dask_postcompute__())
   627 with shorten_traceback():
--> 628 results = schedule(dsk, keys, **kwargs)
   630 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File /srv/conda/envs/notebook/lib/python3.10/site-packages/numpy/lib/function_base.py:2372, in vectorize.__call__(self, *args, **kwargs)
   2369 self._init_stage_2(*args, **kwargs)
   2370 return self
-> 2372 return self._call_as_normal(*args, **kwargs)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/numpy/lib/function_base.py:2365, in vectorize._call_as_normal(self, *args, **kwargs)
   2362 vargs = [args[_i] for _i in inds]
   2363 vargs.extend([kwargs[_n] for _n in names])
-> 2365 return self._vectorize_call(func=func, args=vargs)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/numpy/lib/function_base.py:2446, in vectorize._vectorize_call(self, func, args)
   2444 """Vectorized call to `func` over positional `args`."""
   2445 if self.signature is not None:
-> 2446 res = self._vectorize_call_with_signature(func, args)
   2447 elif not args:
   2448 res = func()

File /srv/conda/envs/notebook/lib/python3.10/site-packages/numpy/lib/function_base.py:2506, in vectorize._vectorize_call_with_signature(self, func, args)
   2502 outputs = _create_arrays(broadcast_shape, dim_sizes,
   2503                          output_core_dims, otypes, results)
   2505 for output, result in zip(outputs, results):
-> 2506 output[index] = result
   2508 if outputs is None:
   2509 # did not call the function even once
   2510 if otypes is None:

ValueError: Cannot convert from specific units to generic units in NumPy datetimes or timedeltas
```

Anything else we need to know?

No response

Environment

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8802/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
2270275688 I_kwDOAMm_X86HUaho 8985 update `to_netcdf` docstring to list support for explicit CDF5 writes JulioTBacmeister 9221710 open 0     4 2024-04-30T00:41:13Z 2024-04-30T20:48:46Z   NONE      

Is your feature request related to a problem?

I cannot get to_netcdf() to write files in CDF5 format as identified by the 'ncdump -k' command.

Describe the solution you'd like

When I write a netcdf file using:

D.to_netcdf( filename )

then ask ncdump to tell me the kind of file I have,

ncdump -k filename

it returns 'netCDF-4'. Unfortunately, this file won't work in the Community Atmosphere Model (CAM), as an initial condition for example. CAM will bomb when it tries to read it. After converting the file with this command:

nccopy -k cdf5 filename cdf5_filename

the file now works in CAM. Also, the command

ncdump -k cdf5_filename

returns 'cdf5'.

I confess I don't know what the nccopy command is doing, but it seems to be needed for the file to be readable by CAM. I am looking for an option in the to_netcdf method that will explicitly write 'cdf5' files without needing to resort to the nccopy command.
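
For reference, this is what an explicit CDF5 write might look like if to_netcdf forwards the format string to the netcdf4 engine, which is the support this request asks to have documented. Treating "NETCDF3_64BIT_DATA" (netCDF4-python's name for the CDF5 variant) as accepted here is an assumption, worth verifying with ncdump -k:

```python
import xarray as xr

ds = xr.Dataset({"T": ("x", [1.0, 2.0, 3.0])})

# Assumption: "NETCDF3_64BIT_DATA" (the CDF5 variant in netCDF4-python) is
# passed through by to_netcdf; check the result with `ncdump -k out.nc`.
ds.to_netcdf("out.nc", format="NETCDF3_64BIT_DATA", engine="netcdf4")
```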

Describe alternatives you've considered

Writing netcdf-4 files from xarray and converting via nccopy -k cdf5 filename cdf5_filename

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8985/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1389295853 I_kwDOAMm_X85Szvjt 7099 Pass arbitrary options to sel() benbovy 4160723 open 0     4 2022-09-28T12:44:52Z 2024-04-30T00:44:18Z   MEMBER      

Is your feature request related to a problem?

Currently .sel() accepts two options, method and tolerance. These are relevant for the default (pandas) indexes but not necessarily for other, custom indexes.

It would be also useful for custom indexes to expose their own selection options, e.g.,

  • index query optimization like the dualtree flag of sklearn.neighbors.KDTree.query
  • k-nearest neighbors selection with the creation of a new "k" dimension (+ coordinate / index) with user-defined name and size.

From #3223, it would also be nice if we could pass distinct option values per index.

What would be a good API for that?

Describe the solution you'd like

Some ideas:

A. Allow passing a tuple (labels, options_dict) as indexer value

```python
ds.sel(x=([0, 2], {"method": "nearest"}), y=3)
```

B. Expose an options kwarg that would accept a nested dict

```python
ds.sel(x=[0, 2], y=3, options={"x": {"method": "nearest"}})
```

Option A does not look very readable. Option B is slightly better, although the nested dictionary is not great.

Any other ideas? Some sort of context manager? Some Index specific API?
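
For comparison, the closest the current API gets is chaining separate sel calls so that each indexer receives its own method/tolerance (a workaround, not one of the proposals above):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"v": (("x", "y"), np.zeros((3, 4)))},
    coords={"x": [0.1, 1.9, 3.0], "y": [0, 1, 2, 3]},
)

# Options apply per call, so chaining gives each indexer its own options.
ds.sel(x=[0, 2], method="nearest").sel(y=3)
```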

Describe alternatives you've considered

The API proposed in #3223 would look great if method and tolerance were the only accepted options, but less so for arbitrary options.

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7099/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
481761508 MDU6SXNzdWU0ODE3NjE1MDg= 3223 Feature request for multiple tolerance values when using nearest method and sel() NicWayand 1117224 open 0     4 2019-08-16T19:53:31Z 2024-04-29T23:21:04Z   NONE      

```python
import xarray as xr
import numpy as np
import pandas as pd

# Create test data
ds = xr.Dataset()
ds.coords['lon'] = np.arange(-120, -60)
ds.coords['lat'] = np.arange(30, 50)
ds.coords['time'] = pd.date_range('2018-01-01', '2018-01-30')
ds['AirTemp'] = xr.DataArray(np.ones((ds.lat.size, ds.lon.size, ds.time.size)),
                             dims=['lat', 'lon', 'time'])

target_lat = [36.83]
target_lon = [-110]
target_time = [np.datetime64('2019-06-01')]

# Nearest pulls a date too far away
ds.sel(lat=target_lat, lon=target_lon, time=target_time, method='nearest')

# Adding tolerance for lat/lon, but also applied to time
ds.sel(lat=target_lat, lon=target_lon, time=target_time, method='nearest', tolerance=0.5)

# Ideally tolerance could accept a dictionary but currently fails
ds.sel(lat=target_lat, lon=target_lon, time=target_time, method='nearest',
       tolerance={'lat': 0.5, 'lon': 0.5, 'time': np.timedelta64(1, 'D')})
```

Expected Output

A dataset with nearest values to tolerances on each dim.

Problem Description

I would like to add the ability for tolerance to accept a dictionary of multiple tolerance values for different dimensions. Before I try implementing it, I wanted to 1) check that it doesn't already exist and that someone isn't already working on it, and 2) get suggestions for how to proceed.

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7 | packaged by conda-forge | (default, Feb 20 2019, 02:51:38) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.9.184-0.1.ac.235.83.329.metal1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2
xarray: 0.11.3
pandas: 0.24.1
numpy: 1.15.4
scipy: 1.2.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: 2.9.0
Nio: 1.5.5
zarr: 2.2.0
cftime: 1.0.3.4
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
cyordereddict: None
dask: 1.1.2
distributed: 1.26.0
matplotlib: 3.0.3
cartopy: 0.17.0
seaborn: 0.9.0
setuptools: 40.8.0
pip: 19.0.3
conda: None
pytest: None
IPython: 7.3.0
sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3223/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
2259316341 I_kwDOAMm_X86Gqm51 8965 Support concurrent loading of variables dcherian 2448579 open 0     4 2024-04-23T16:41:24Z 2024-04-29T22:21:51Z   MEMBER      

Is your feature request related to a problem?

Today, if users want to concurrently load multiple variables in a DataArray or Dataset, they have to use dask.

It struck me that it'd be pretty easy for .load to gain an executor kwarg that accepts anything that follows the concurrent.futures executor interface, and parallelize this loop.

https://github.com/pydata/xarray/blob/b0036749542145794244dee4c4869f3750ff2dee/xarray/core/dataset.py#L853-L857
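
A minimal sketch of the idea, emulated outside xarray with today's public API (the executor kwarg itself does not exist yet, "data.nc" is a placeholder, and thread-safety of the underlying backend library is an assumption):

```python
from concurrent.futures import ThreadPoolExecutor

import xarray as xr


def load_concurrently(ds: xr.Dataset, executor) -> xr.Dataset:
    # Submit one load per variable; reads overlap if the backend releases
    # the GIL and tolerates concurrent access.
    futures = [executor.submit(v.load) for v in ds.variables.values()]
    for future in futures:
        future.result()  # re-raise any exception from the worker threads
    return ds


with ThreadPoolExecutor(max_workers=4) as executor:
    ds = load_concurrently(xr.open_dataset("data.nc"), executor)
```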

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8965/reactions",
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1250939008 I_kwDOAMm_X85Kj9CA 6646 `dim` vs `dims` max-sixty 5635139 closed 0     4 2022-05-27T16:15:02Z 2024-04-29T18:24:56Z 2024-04-29T18:24:56Z MEMBER      

What is your issue?

I've recently been hit by this when experimenting with xr.dot and xr.cov: xr.dot takes dims, and xr.cov takes dim. Because they each take multiple arrays as positional args, kwargs are more conventional.

Should we standardize on one of these?
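
A quick illustration of the inconsistency as it stood when this was filed:

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.random.rand(3, 4), dims=("x", "y"))
b = xr.DataArray(np.random.rand(3, 4), dims=("x", "y"))

xr.dot(a, b, dims="x")  # plural keyword
xr.cov(a, b, dim="x")   # singular keyword
```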

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6646/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1024011835 I_kwDOAMm_X849CS47 5857 Incorrect results when using xarray.ufuncs.angle(..., deg=True) cvr 1119116 closed 0     4 2021-10-12T16:24:11Z 2024-04-28T20:58:55Z 2024-04-28T20:58:54Z NONE      

What happened:

The xarray.ufuncs.angle function is broken. From the help docstring, one may use the option deg=True to have the result in degrees instead of radians (which is consistent with the numpy.angle function). Yet results show that this is not the case. Moreover, specifying deg=True or deg=False leads to the same result, with the values in radians.

What you expected to happen:

To have the result of xarray.ufuncs.angle converted to degrees when option deg=True is specified.

Minimal Complete Verifiable Example:

```python
# Put your MCVE code here
import numpy as np
import xarray as xr

ds = xr.Dataset(coords={'wd': ('wd', np.arange(0, 360, 30, dtype=float))})

Z = xr.ufuncs.exp(1j * xr.ufuncs.radians(ds.wd))
D = xr.ufuncs.angle(Z, deg=True)  # YIELDS INCORRECT RESULTS
if not np.allclose(ds.wd, (D % 360)):
    print(f"Issue with angle operation: {D.values%360} instead of {ds.wd.values}"
          + f"\n\tERROR xr.ufuncs.angle(Z, deg=True) gives incorrect results !!!")

D = xr.ufuncs.degrees(xr.ufuncs.angle(Z))  # Works OK
if not np.allclose(ds.wd, (D % 360)):
    print(f"Issue with angle operation: {D%360} instead of {ds.wd}"
          + f"\n\tERROR xr.ufuncs.degrees(xr.ufuncs.angle(Z)) gives incorrect results!!!")

D = xr.apply_ufunc(np.angle, Z, kwargs={'deg': True})  # Works OK
if not np.allclose(ds.wd, (D % 360)):
    print(f"Issue with angle operation: {D%360} instead of {ds.wd}"
          + f"\n\tERROR xr.apply_ufunc(np.angle, Z, kwargs={{'deg': True}}) gives incorrect results!!!")
```

Anything else we need to know?:

Though xarray.ufuncs has a deprecation warning stating that the numpy equivalent may be used, this is not true for numpy.angle. Example:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(coords={'wd': ('wd', np.arange(0, 360, 30, dtype=float))})

Z = np.exp(1j * np.radians(ds.wd))
print(Z)
print(f"Is Z an XArray? {isinstance(Z, xr.DataArray)}")

D = np.angle(ds.wd, deg=True)
print(D)
print(f"Is D an XArray? {isinstance(D, xr.DataArray)}")
```

If this code is run, the result of `numpy.angle(xarray.DataArray)` is not a DataArray object, contrary to other numpy operations (for all versions of xarray I've used). Hence `xarray.ufuncs.angle` is a great option, if it were not for the current problem.

Environment:

No issues with xarray versions 0.16.2 and 0.17.0. This error happens from 0.18.0 onwards, up to 0.19.0 (the most recent at the time of writing).

Output of <tt>xr.show_versions()</tt>

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.10 (default, Feb 26 2021, 18:47:35) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-18-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.utf8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 0.19.0
pandas: 1.2.3
numpy: 1.20.2
scipy: 1.5.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 58.2.0
pip: 21.3
conda: 4.10.3
pytest: None
IPython: None
sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5857/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
2224036575 I_kwDOAMm_X86EkBrf 8905 Variable doesn't have an .expand_dims method TomNicholas 35968931 closed 0     4 2024-04-03T22:19:10Z 2024-04-28T19:54:08Z 2024-04-28T19:54:08Z MEMBER      

Is your feature request related to a problem?

DataArray and Dataset have an .expand_dims method, but it looks like Variable doesn't.

Describe the solution you'd like

Variable should also have this method, the only difference being that it wouldn't create any coordinates or indexes.
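
To illustrate the gap (a small sketch; the failure on Variable reflects the state at the time this issue was filed):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros(3), dims="x")
da.expand_dims("t")  # DataArray: works

v = xr.Variable(["x"], np.zeros(3))
v.expand_dims("t")   # Variable: AttributeError when this was filed
```

Variable.set_dims covers related ground and is presumably a starting point for the implementation.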

Describe alternatives you've considered

No response

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8905/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
590630281 MDU6SXNzdWU1OTA2MzAyODE= 3921 issues discovered by the all-but-dask CI keewis 14808389 closed 0     4 2020-03-30T22:08:46Z 2024-04-25T14:48:15Z 2024-02-10T02:57:34Z MEMBER      

After adding the py38-all-but-dask CI in #3919, it discovered a few backend issues:

  • zarr:
      • [x] open_zarr with chunks="auto" always tries to chunk, even if dask is not available (fixed in #3919)
      • [x] ZarrArrayWrapper.__getitem__ incorrectly passes the indexer's tuple attribute to _arrayize_vectorized_indexer (this only happens if dask is not available) (fixed in #3919)
      • [x] slice indexers with negative steps get transformed incorrectly if dask is not available https://github.com/pydata/xarray/pull/8674
  • rasterio:
      • ~calling pickle.dumps on a Dataset object returned by open_rasterio fails because a non-serializable lock was used (if dask is installed, a serializable lock is used instead)~

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3921/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
2243685081 I_kwDOAMm_X86Fu-rZ 8945 netCDF4 indexing: `reindex_like` is very slow if dataset not loaded into memory brendan-m-murphy 11130776 closed 0     4 2024-04-15T13:26:08Z 2024-04-23T21:49:28Z 2024-04-23T15:33:36Z NONE      

What is your issue?

Reindexing a dataset without loading it into memory seems to be very slow (about 1000x slower than reindexing after loading into memory).

Here is a minimal working example:

```
times = 100
nlat = 200
nlon = 300

fp = xr.Dataset(
    {"fp": (["time", "lat", "lon"], np.arange(times * nlat * nlon).reshape(times, nlat, nlon))},
    coords={"time": pd.date_range(start="2019-01-01T02:00:00", periods=times, freq="1H"),
            "lat": np.arange(nlat),
            "lon": np.arange(nlon)},
)

flux = xr.Dataset(
    {"flux": (["time", "lat", "lon"], np.arange(nlat * nlon).reshape(1, nlat, nlon))},
    coords={"time": [pd.to_datetime("2019-01-01")],
            "lat": np.arange(nlat) + np.random.normal(0.0, 0.01, nlat),
            "lon": np.arange(nlon) + np.random.normal(0.0, 0.01, nlon)},
)

fp.to_netcdf("combine_datasets_tests/fp.nc")
flux.to_netcdf("combine_datasets_tests/flux.nc")

fp1 = xr.open_dataset("combine_datasets_tests/fp.nc")
flux1 = xr.open_dataset("combine_datasets_tests/flux.nc")
```

Then flux1 = flux1.reindex_like(fp1, method="ffill", tolerance=None) takes over a minute, while flux1 = flux1.load().reindex_like(fp1, method="ffill", tolerance=None) is almost instantaneous (timeit says 91ms, including opening the dataset... I'm not sure if caching is influencing this).

Profiling the "reindex without load" cell: ``` 804936 function calls (804622 primitive calls) in 93.285 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function) 1 92.211 92.211 93.191 93.191 {built-in method _operator.getitem} 1 0.289 0.289 0.980 0.980 utils.py:81(_StartCountStride) 6 0.239 0.040 0.613 0.102 shape_base.py:267(apply_along_axis) 72656 0.109 0.000 0.109 0.000 utils.py:429(<lambda>) 72656 0.085 0.000 0.136 0.000 utils.py:430(<lambda>) 72661 0.051 0.000 0.051 0.000 {built-in method numpy.arange} 145318 0.048 0.000 0.115 0.000 shape_base.py:370(<genexpr>) 2 0.045 0.023 0.046 0.023 indexing.py:1334(getitem) 6 0.044 0.007 0.044 0.007 numeric.py:136(ones) 145318 0.044 0.000 0.067 0.000 index_tricks.py:690(next) 14 0.033 0.002 0.033 0.002 {built-in method numpy.empty} 145333/145325 0.023 0.000 0.023 0.000 {built-in method builtins.next} 1 0.020 0.020 93.275 93.275 duck_array_ops.py:317(where) 21 0.018 0.001 0.018 0.001 {method 'astype' of 'numpy.ndarray' objects} 145330 0.013 0.000 0.013 0.000 {built-in method numpy.asanyarray} 1 0.002 0.002 0.002 0.002 {built-in method _functools.reduce} 1 0.002 0.002 93.279 93.279 variable.py:821(_getitem_with_mask) 18 0.001 0.000 0.001 0.000 {built-in method numpy.zeros} 1 0.000 0.000 0.000 0.000 file_manager.py:226(close) ```

The `__getitem__` call at the top is from xarray.backends.netCDF4_.py, line 114. Because of the jittered coordinates in flux, I'm assuming that the index passed to netCDF4 is not consecutive/strictly monotonic integers (0, 1, 2, 3, ...). In the past, this has caused issues: https://github.com/Unidata/netcdf4-python/issues/680.

In my venv, netCDF4 was installed from a wheel with the following versions: netcdf4-python version: 1.6.5 HDF5 lib version: 1.12.2 netcdf lib version: 4.9.3-development

This is with xarray version 2023.12.0, numpy 1.26, and pandas 1.5.3.

I will try to investigate more and hopefully simplify the example. (Can't quite justify spending more time on it at work because this is just to tag a version that was used in some experiments before we switch to zarr as a backend, so hopefully it won't be relevant at that point.)

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8945/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1664193419 I_kwDOAMm_X85jMZOL 7748 diff('non existing dimension') does not raise exception LunarLanding 4441338 open 0     4 2023-04-12T09:29:58Z 2024-04-21T22:31:37Z   NONE      

What happened?

Calling xr.DataArray.diff with a non-existing dimension does not raise an exception.

What did you expect to happen?

An exception to be raised.

Minimal Complete Verifiable Example

```Python
import xarray as xr
import numpy as np

xr.DataArray(np.arange(10), dims=('a',)).diff('b')
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.0-21-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1
xarray: 2023.3.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.10.1
netCDF4: 1.6.2
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.14.2
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2023.3.1
distributed: 2023.3.1
matplotlib: 3.7.1
cartopy: None
seaborn: 0.12.2
numbagg: None
fsspec: 2023.3.0
cupy: None
pint: None
sparse: 0.14.0
flox: 0.6.9
numpy_groupies: 0.9.20
setuptools: 67.6.0
pip: 23.0.1
conda: 23.1.0
pytest: 7.2.2
mypy: 1.1.1
IPython: 8.11.0
sphinx: 6.1.3
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7748/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
2237228079 I_kwDOAMm_X86FWWQv 8927 Use a neutral format to have lossless interface with JSON, scipp, Astropy, pandas loco-philippe 92333742 open 0     4 2024-04-11T08:50:34Z 2024-04-12T14:25:35Z   NONE      

Is your feature request related to a problem?

Each tool has a specific structure for processing multidimensional data with the following consequences:

  • interfaces dedicated to each tool,
  • partially processed data,
  • no unified representation of data structures

Describe the solution you'd like

The proposed format (see jupyter notebook, github repository, PyPI package) is based on the following principles:

  • neutral format available for tabular or multidimensional tools (e.g. Numpy, pandas, xarray, scipp, astropy),
  • taking into account a wide variety of data types as defined in the NTV format,
  • high interoperability: reversible (lossless round-trip) interface with tabular or multidimensional tools,
  • reversible and compact JSON format,
  • ease of sharing and exchanging multidimensional and tabular data.

Describe alternatives you've considered

No response

Additional context

https://github.com/numpy/numpy/issues/12481#issuecomment-2049179803 https://github.com/astropy/astropy/issues/16286 https://github.com/scipp/scipp/issues/3422

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8927/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1959816045 I_kwDOAMm_X8500Gtt 8368 to_netcdf: Unexpected drop of "units" attribute of attached "bounds" leonfoks 15173535 open 0     4 2023-10-24T18:15:05Z 2024-04-09T11:11:20Z   NONE      

What happened?

When writing a Dataset to netcdf, any DataArrays that are linked as bounds through another variable's attrs['bounds'] entry have their 'units' attribute (specifically) dropped inside the written netcdf file.

See example

What did you expect to happen?

Units attribute to be written to the netcdf file.

Minimal Complete Verifiable Example

```Python
import numpy as np
import xarray as xr

# Create a new Dataset
ds = xr.Dataset()

# Add the x variable. Specify 'x_bnds' as bounds, defined later.
ds['x'] = xr.DataArray(np.arange(10), dims='x', attrs={'units': 'm', 'bounds': 'x_bnds'})

# Bounds require an extra dimension equal to the number of vertices.
ds['nv'] = xr.DataArray(np.r_[0, 1], dims='nv')

# Add the actual bounding values for variable x.
ds['x_bnds'] = xr.DataArray(np.squeeze(np.dstack([np.arange(10) - 0.5, np.arange(10) + 0.5])),
                            dims=['x', 'nv'], attrs={'test': 4, 'units': 'm'})

print('Units is attached to the bounds in the dataset before writing:', 'units' in ds['x_bnds'].attrs)

# Write to netcdf file
ds.to_netcdf('tmp.nc', format='netcdf4', engine='netcdf4')

# Open the dataset and check x_bnds attrs. units is dropped.
new = xr.open_dataset('tmp.nc')
print(new['x_bnds'].attrs)

# Confirm that units were never written to the file.
!h5dump -d /x_bnds tmp.nc
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.5 (main, Sep 11 2023, 08:19:27) [Clang 14.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 21.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development
xarray: 2023.10.1
pandas: 2.1.1
numpy: 1.26.1
scipy: 1.11.3
netCDF4: 1.6.4
pydap: None
h5netcdf: None
h5py: 3.10.0
Nio: None
zarr: None
cftime: 1.6.3
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.8.0
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.0.0
pip: 23.3
conda: None
pytest: None
mypy: None
IPython: 8.16.1
sphinx: 7.2.6
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8368/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
2230680765 I_kwDOAMm_X86E9Xy9 8919 Using the xarray.Dataset.where() function takes up a lot of memory isLiYang 69391863 closed 0     4 2024-04-08T09:15:49Z 2024-04-09T02:45:09Z 2024-04-09T02:45:08Z NONE      

What is your issue?

My Python script was killed because it used too much memory. After checking, I found that the problem is the ds.where() function.

The original netcdf file opened from the hard disk takes up about 10 MB of storage, but when I mask out the data that doesn't match the latitude and longitude bounds, the variable ds takes up over a dozen GB of memory. When I deleted this variable using del ds, the memory used by the script immediately returned to normal.

```
# Open this netcdf file.
ds = xr.open_dataset(track)

# If longitude range is [-180, 180], then convert to [0, 360].
if np.any(ds[var_lon] < 0):
    ds[var_lon] = ds[var_lon] % 360

# Extract data by longitude and latitude.
ds = ds.where((ds[var_lon] >= region[0]) & (ds[var_lon] <= region[1])
              & (ds[var_lat] >= region[2]) & (ds[var_lat] <= region[3]))

# Select data by range and value of some variables.
for key, value in range_select.items():
    ds = ds.where((ds[key] >= value[0]) & (ds[key] <= value[1]))
for key, value in value_select.items():
    ds = ds.where(ds[key].isin(value))
```
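
If the latitude and longitude here are 1-D, monotonically ordered coordinates, label-based slicing would be a lower-memory alternative for the spatial subsetting, since where() keeps the full-size arrays (NaN-filled where the condition fails) in memory. A sketch under those assumptions, reusing the names from the snippet above:

```python
import xarray as xr

ds = xr.open_dataset(track)

# Assumes var_lon / var_lat are sorted 1-D coordinate variables: slicing
# subsets the data instead of building full-size NaN-masked copies.
ds = ds.sel({
    var_lon: slice(region[0], region[1]),
    var_lat: slice(region[2], region[3]),
})
```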

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8919/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
2228373305 I_kwDOAMm_X86E0kc5 8915 Weird behavior of DataSet.where(... , drop=True) johannespletzer 22961670 closed 0     4 2024-04-05T16:03:05Z 2024-04-08T09:32:48Z 2024-04-08T09:32:48Z NONE      

What happened?

I work with an aircraft emission dataset that is freely available online: emission dataset

During my calculations I eventually convert the Dataset to a DataFrame. My motivation is to avoid unnecessary rows in the DataFrame. While doing some calculations, my code returned unexpected results. Eventually I could narrow it down to a DataSet.where(..., drop=True) argument I added along the way, which introduces differences in the data. Here are two examples:

Example 1: Along some dimensions data points vanished if drop=True

Example 2: For other dimensions (these?) data points appeared elsewhere if drop=True

What did you expect to happen?

I expect my calculations to return the same results, regardless of whether drop=True is active or not.

Minimal Complete Verifiable Example

```Python
!wget "https://zenodo.org/records/10818082/files/Emission_Inventory_H2O_Optimized_v0.1_MR3_Fleet_BRU-MYA_2075.nc"

import matplotlib.pyplot as plt
import xarray as xr

nc_file = xr.open_dataset('Emission_Inventory_H2O_Optimized_v0.1_MR3_Fleet_BRU-MYA_2075.nc')

fig, axs = plt.subplots(1, 2, figsize=(10, 4))

nc_file.H2O.where(nc_file.H2O != 0, drop=True).sum(('lon', 'time')).plot.contour(x='lat', ax=axs[0])
axs[0].set_xlim(-50, 90)
axs[0].set_title('With drop=True')

nc_file.H2O.where(nc_file.H2O != 0, drop=False).sum(('lon', 'time')).plot.contour(x='lat', ax=axs[1])
axs[1].set_xlim(-50, 90)
axs[1].set_title('With drop=False')

plt.tight_layout()
plt.show()

fig, axs = plt.subplots(1, 2, figsize=(10, 4))

nc_file.H2O.where(nc_file.H2O != 0, drop=True).sum(('lat', 'time')).plot.contour(x='lon', ax=axs[0])
axs[0].set_title('With drop=True')

nc_file.H2O.where(nc_file.H2O != 0, drop=False).sum(('lat', 'time')).plot.contour(x='lon', ax=axs[1])
axs[1].set_title('With drop=False')

plt.tight_layout()
plt.show()
```

MVCE confirmation

  • [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [ ] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.9 | packaged by Anaconda, Inc. | (main, Mar 1 2023, 18:18:15) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_US', 'ISO8859-1')
libhdf5: 1.14.0
libnetcdf: 4.9.2
xarray: 2022.11.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.13.0
netCDF4: 1.6.5
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.3
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.5
dask: None
distributed: None
matplotlib: 3.7.0
cartopy: 0.21.1
seaborn: 0.12.2
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.6.3
pip: 22.3.1
conda: None
pytest: None
IPython: 8.10.0
sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8915/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
2206243581 I_kwDOAMm_X86DgJr9 8876 Possible race condition when appending to an existing zarr rsemlal-murmuration 157591329 closed 0     4 2024-03-25T16:59:52Z 2024-04-03T15:23:14Z 2024-03-29T14:35:52Z NONE      

What happened?

When appending to an existing zarr along a dimension (to_zarr(..., mode='a', append_dim="x" ,..)), if the dask chunking of the dataset to append does not align with the chunking of the existing zarr, the resulting consolidated zarr store may have NaNs instead of the actual values it is supposed to have.

What did you expect to happen?

We would expect the zarr append to behave the same as if we concatenated the datasets in memory (using concat) and wrote the whole result to a new zarr store in one go.

Minimal Complete Verifiable Example

```Python
from distributed import Client, LocalCluster
import xarray as xr
import tempfile

ds1 = xr.Dataset({"a": ("x", [1., 1.])}, coords={'x': [1, 2]}).chunk({"x": 3})
ds2 = xr.Dataset({"a": ("x", [1., 1., 1., 1.])}, coords={'x': [3, 4, 5, 6]}).chunk({"x": 3})

# The issue happens only when: threads_per_worker > 1
with Client(LocalCluster(processes=False, n_workers=1, threads_per_worker=2)):
    for i in range(0, 100):
        with tempfile.TemporaryDirectory() as store:
            print(store)
            ds1.to_zarr(store, mode="w")  # write first dataset
            ds2.to_zarr(store, mode="a", append_dim="x")  # append second dataset

            rez = xr.open_zarr(store).compute()  # open consolidated dataset
            nb_values = rez.a.count().item(0)  # count non-NaN values
            if nb_values != 6:
                print("found NaNs:")
                print(rez.to_dataframe())
                break
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

```Python
/tmp/tmptg_pe6ox
/tmp/tmpm7ncmuxd
/tmp/tmpiqcgoiw2
/tmp/tmppma1ieo7
/tmp/tmpw5vi4cf0
/tmp/tmp1rmgwju0
/tmp/tmpm6tfswzi
found NaNs:
     a
x
1  1.0
2  1.0
3  1.0
4  1.0
5  1.0
6  NaN
```

Anything else we need to know?

The example code snippet provided here, reproduces the issue.

Since the issue occurs randomly, the example loops a number of times and stops when the issue occurs.

In the example, when ds1 is first written, since it only contains 2 values along the x dimension, the resulting .zarr store has the chunking {'x': 2}, even though we called .chunk({"x": 3}).

Side note: This behaviour in itself is not problematic in this case, but the fact that the chunking is silently changed made this issue harder to spot.

However, when we try to append the second dataset ds2, which contains 4 values, the .chunk({"x": 3}) at the beginning splits the dask array into 2 dask chunks, but in a way that does not align with the zarr chunks.

Zarr chunks:

  • chunk1: x: [1; 2]
  • chunk2: x: [3; 4]
  • chunk3: x: [5; 6]

Dask chunks for ds2:

  • chunk A: x: [3; 4; 5]
  • chunk B: x: [6]

Both dask chunks A and B are supposed to write to zarr chunk3, and depending on which writes first, we can end up with NaN at x = 5 or x = 6 instead of the actual values.

The issue obviously happens only when dask tasks are run in parallel. Using safe_chunks = True when calling to_zarr does not seem to help.

We couldn't figure out from the documentation how to detect this kind of issue, or how to prevent it from happening (maybe using a synchronizer? see the sketch below).
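
As a sketch of that synchronizer idea, reusing ds2 and store from the example above (synchronizer is a real to_zarr parameter, but whether it actually serializes these overlapping threaded writes is untested here):

```python
import zarr

# Ask zarr to lock chunks so concurrent threads don't clobber each
# other's read-modify-write of the shared chunk.
synchronizer = zarr.ThreadSynchronizer()
ds2.to_zarr(store, mode="a", append_dim="x", synchronizer=synchronizer)
```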

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
python-bits: 64
OS: Linux
OS-release: 5.15.133.1-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development
xarray: 2024.2.0
pandas: 2.2.1
numpy: 1.26.4
scipy: 1.12.0
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.3.0
h5py: 3.10.0
Nio: None
zarr: 2.17.1
cftime: 1.6.3
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.3.8
dask: 2024.3.1
distributed: 2024.3.1
matplotlib: 3.8.3
cartopy: None
seaborn: 0.13.2
numbagg: 0.8.1
fsspec: 2024.3.1
cupy: None
pint: None
sparse: None
flox: 0.9.5
numpy_groupies: 0.10.2
setuptools: 69.2.0
pip: 24.0
conda: None
pytest: 8.1.1
mypy: None
IPython: 8.22.2
sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8876/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
2211106929 I_kwDOAMm_X86DytBx 8882 to_zarr silently loses data when using append_dim, if chunks are different to zarr store harryC-space-intelligence 140395181 closed 0     4 2024-03-27T15:27:02Z 2024-03-29T14:35:51Z 2024-03-29T14:35:51Z NONE      

What happened?

When writing a chunked DataArray to an existing zarr store, appending along an existing dimension of the store, I have found that some data are not written if there are multiple array chunks to one zarr chunk.

I appreciate it is probably bad practice to have different chunk sizes in my DataArray and zarr store, but I think it's a realistic scenario that needs to be caught.

This may be related to / the same underlying issue as #8371. Perhaps the checks mentioned in https://github.com/pydata/xarray/issues/8371#issuecomment-1814589157 are somehow getting bypassed? Using zarr's ThreadSynchronizer is the only way I have found to ensure that all the data gets written.

What did you expect to happen?

I expected that either

  • to_zarr would recognise the different chunk sizes, and re-chunk or wait for all the chunks to be written
  • or an error would be raised, given that the mismatch results in loss of data in an unpredictable way

Minimal Complete Verifiable Example

```Python
import xarray as xr
import numpy as np
from matplotlib import pyplot as plt

x_coords = np.arange(10)
y_coords = np.arange(10)
t_coords = np.array([np.datetime64('2020-01-01').astype('datetime64[ns]')])
data = np.ones((10, 10))

for i in range(4):
    plt.subplot(1, 4, i + 1)

    da = xr.DataArray(data.reshape((-1, 10, 10)),
                      dims=['time', 'x', 'y'],
                      coords={'x': x_coords, 'y': y_coords, 'time': t_coords},
                      ).chunk({'x': 5, 'y': 5, 'time': 1}).rename('foo')

    da.to_zarr('foo.zarr', mode='w')

    new_time = np.array([np.datetime64('2021-01-01').astype('datetime64[ns]')])

    da2 = xr.DataArray(data.reshape((-1, 10, 10)),
                       dims=['time', 'x', 'y'],
                       coords={'x': x_coords, 'y': y_coords, 'time': new_time},
                       ).chunk({'x': 1, 'y': 1, 'time': 1}).rename('foo')

    da2.to_zarr('foo.zarr', append_dim='time', mode='a')

    plt.imshow(xr.open_zarr('foo.zarr').isel(time=-1).foo.values)
```
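
One defensive workaround, assuming the root cause is the misaligned chunk grids, is to rechunk to the store's on-disk chunking before appending. A sketch reusing da2 from the example above:

```python
import zarr

# Read the store's chunk grid and align the dask chunks to it, so no two
# dask chunks target the same zarr chunk.
on_disk = zarr.open('foo.zarr')['foo'].chunks
da2 = da2.chunk(dict(zip(('time', 'x', 'y'), on_disk)))
da2.to_zarr('foo.zarr', append_dim='time', mode='a')
```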

MVCE confirmation

  • [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [ ] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [ ] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

Output from the plots above:

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:08:17) [GCC 12.2.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-1041-azure
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2
xarray: 2024.2.0
pandas: 2.2.1
numpy: 1.26.4
scipy: 1.12.0
netCDF4: 1.6.5
pydap: installed
h5netcdf: 1.3.0
h5py: 3.10.0
Nio: None
zarr: 2.17.1
cftime: 1.6.3
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.3.8
dask: 2024.3.1
distributed: 2024.3.1
matplotlib: 3.8.3
cartopy: 0.22.0
seaborn: 0.13.2
numbagg: None
fsspec: 2024.3.1
cupy: None
pint: 0.23
sparse: 0.15.1
flox: 0.9.5
numpy_groupies: 0.10.2
setuptools: 69.2.0
pip: 24.0
conda: 24.1.2
pytest: 8.1.1
mypy: None
IPython: 8.22.2
sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8882/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
935607748 MDU6SXNzdWU5MzU2MDc3NDg= 5563 Decoding non-utf-8 encoded strings with the h5netcdf engine kiksekage 11391714 closed 0     4 2021-07-02T09:49:58Z 2024-03-26T15:08:41Z 2024-03-26T15:08:41Z NONE      

What happened: Trying to load a netCDF file-like object (an io.BytesIO) with attribute strings in a non-utf-8 encoding with the h5netcdf engine leads to a UnicodeDecodeError.

What you expected to happen: Loading the same file, albeit persisted to disk, with the netcdf4 engine works fine; however, since the netcdf4 engine doesn't support file-like objects, I ran into this issue.

Traceback:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/thns/.venv/cmems/lib/python3.8/site-packages/xarray/backends/api.py", line 242, in load_dataset
    with open_dataset(filename_or_obj, **kwargs) as ds:
  File "/home/thns/.venv/cmems/lib/python3.8/site-packages/xarray/backends/api.py", line 496, in open_dataset
    backend_ds = backend.open_dataset(
  File "/home/thns/.venv/cmems/lib/python3.8/site-packages/xarray/backends/h5netcdf_.py", line 384, in open_dataset
    ds = store_entrypoint.open_dataset(
  File "/home/thns/.venv/cmems/lib/python3.8/site-packages/xarray/backends/store.py", line 22, in open_dataset
    vars, attrs = store.load()
  File "/home/thns/.venv/cmems/lib/python3.8/site-packages/xarray/backends/common.py", line 126, in load
    attributes = FrozenDict(self.get_attrs())
  File "/home/thns/.venv/cmems/lib/python3.8/site-packages/xarray/backends/h5netcdf_.py", line 234, in get_attrs
    return FrozenDict(read_attributes(self.ds))
  File "/home/thns/.venv/cmems/lib/python3.8/site-packages/xarray/backends/h5netcdf_.py", line 75, in read_attributes
    v = maybe_decode_bytes(v)
  File "/home/thns/.venv/cmems/lib/python3.8/site-packages/xarray/backends/h5netcdf_.py", line 63, in maybe_decode_bytes
    return txt.decode("utf-8")
```

Minimal Complete Verifiable Example:

```python
import xarray as xr
import netCDF4

title = b'\xc3'

f = netCDF4.Dataset('test.nc', 'w')
f.title = title
f.close()
xr.load_dataset("test.nc", engine="h5netcdf")
```
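
A sketch of the kind of fallback decoding being asked for (illustrative only, not xarray's actual maybe_decode_bytes):

```python
def decode_bytes_with_fallback(txt):
    # Try utf-8 first; latin-1 maps every byte to a code point, so the
    # fallback never raises.
    if isinstance(txt, bytes):
        try:
            return txt.decode("utf-8")
        except UnicodeDecodeError:
            return txt.decode("latin-1")
    return txt


print(decode_bytes_with_fallback(b'\xc3'))  # 'Ã' instead of a UnicodeDecodeError
```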

Environment:

Output of <tt>xr.show_versions()</tt>

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.0 (default, Feb 25 2021, 22:10:10) [GCC 8.4.0]
python-bits: 64
OS: Linux
OS-release: 4.15.0-136-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4
xarray: 0.18.1
pandas: 1.2.4
numpy: 1.20.3
scipy: None
netCDF4: 1.5.6
pydap: None
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: None
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 57.0.0
pip: 21.1.3
conda: None
pytest: 6.2.4
IPython: 7.25.0
sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5563/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
2117248281 I_kwDOAMm_X85-MqUZ 8704 Currently no way to create a Coordinates object without indexes for 1D variables TomNicholas 35968931 closed 0     4 2024-02-04T18:30:18Z 2024-03-26T13:50:16Z 2024-03-26T13:50:15Z MEMBER      

What happened?

The workaround described in https://github.com/pydata/xarray/pull/8107#discussion_r1311214263 does not seem to work on main, meaning that I think there is currently no way to create an xr.Coordinates object without 1D variables being coerced to indexes. This in turn means there is no way to create a Dataset object without its 1D coordinate variables being coerced to IndexVariables backed by indexes.

What did you expect to happen?

I expected to at least be able to use the workaround described in https://github.com/pydata/xarray/pull/8107#discussion_r1311214263, i.e.

```python
xr.Coordinates({'x': ('x', uarr)}, indexes={})
```

where uarr is an un-indexable array-like.

Minimal Complete Verifiable Example

```Python
class UnindexableArrayAPI:
    ...


class UnindexableArray:
    """
    Presents like an N-dimensional array but doesn't support changes of any
    kind, nor can it be coerced into a np.ndarray or pd.Index.
    """

    _shape: tuple[int, ...]
    _dtype: np.dtype

    def __init__(self, shape: tuple[int, ...], dtype: np.dtype) -> None:
        self._shape = shape
        self._dtype = dtype
        self.__array_namespace__ = UnindexableArrayAPI

    @property
    def dtype(self) -> np.dtype:
        return self._dtype

    @property
    def shape(self) -> tuple[int, ...]:
        return self._shape

    @property
    def ndim(self) -> int:
        return len(self.shape)

    @property
    def size(self) -> int:
        return np.prod(self.shape)

    @property
    def T(self) -> Self:
        raise NotImplementedError()

    def __repr__(self) -> str:
        return f"UnindexableArray(shape={self.shape}, dtype={self.dtype})"

    def _repr_inline_(self, max_width):
        """
        Format to a single line with at most max_width characters. Used by xarray.
        """
        return self.__repr__()

    def __getitem__(self, key, /) -> Self:
        """
        Only supports extremely limited indexing.

        I only added this method because xarray will apparently attempt to index into its lazy indexing classes even if the operation would be a no-op anyway.
        """
        from xarray.core.indexing import BasicIndexer

        if isinstance(key, BasicIndexer) and key.tuple == ((slice(None),) * self.ndim):
            # no-op
            return self
        else:
            raise NotImplementedError()

    def __array__(self) -> np.ndarray:
        raise NotImplementedError("UnindexableArrays can't be converted into numpy arrays or pandas Index objects")
```

```python
uarr = UnindexableArray(shape=(3,), dtype=np.dtype('int32'))

xr.Variable(data=uarr, dims=['x'])  # works fine

xr.Coordinates({'x': ('x', uarr)}, indexes={})  # works in xarray v2023.08.0 but in versions after that it triggers the NotImplementedError in `__array__`:
```

```python
NotImplementedError                       Traceback (most recent call last)
Cell In[59], line 1
----> 1 xr.Coordinates({'x': ('x', uarr)}, indexes={})

File ~/Documents/Work/Code/xarray/xarray/core/coordinates.py:301, in Coordinates.__init__(self, coords, indexes)
   299 variables = {}
   300 for name, data in coords.items():
--> 301     var = as_variable(data, name=name)
   302     if var.dims == (name,) and indexes is None:
   303         index, index_vars = create_default_index_implicit(var, list(coords))

File ~/Documents/Work/Code/xarray/xarray/core/variable.py:159, in as_variable(obj, name)
   152 raise TypeError(
   153     f"Variable {name!r}: unable to convert object into a variable without an "
   154     f"explicit list of dimensions: {obj!r}"
   155 )
   157 if name is not None and name in obj.dims and obj.ndim == 1:
   158     # automatically convert the Variable into an Index
--> 159     obj = obj.to_index_variable()
   161 return obj

File ~/Documents/Work/Code/xarray/xarray/core/variable.py:572, in Variable.to_index_variable(self)
   570 def to_index_variable(self) -> IndexVariable:
   571     """Return this variable as an xarray.IndexVariable"""
--> 572     return IndexVariable(
   573         self._dims, self._data, self._attrs, encoding=self._encoding, fastpath=True
   574     )

File ~/Documents/Work/Code/xarray/xarray/core/variable.py:2642, in IndexVariable.__init__(self, dims, data, attrs, encoding, fastpath)
   2640 # Unlike in Variable, always eagerly load values into memory
   2641 if not isinstance(self._data, PandasIndexingAdapter):
-> 2642     self._data = PandasIndexingAdapter(self._data)

File ~/Documents/Work/Code/xarray/xarray/core/indexing.py:1481, in PandasIndexingAdapter.__init__(self, array, dtype)
   1478 def __init__(self, array: pd.Index, dtype: DTypeLike = None):
   1479     from xarray.core.indexes import safe_cast_to_index
-> 1481     self.array = safe_cast_to_index(array)
   1483 if dtype is None:
   1484     self._dtype = get_valid_numpy_dtype(array)

File ~/Documents/Work/Code/xarray/xarray/core/indexes.py:469, in safe_cast_to_index(array)
   459 emit_user_level_warning(
   460     (
   461         "pandas.Index does not support the float16 dtype."
   (...)
   465     category=DeprecationWarning,
   466 )
   467 kwargs["dtype"] = "float64"
--> 469 index = pd.Index(np.asarray(array), **kwargs)
   471 return _maybe_cast_to_cftimeindex(index)

Cell In[55], line 63, in UnindexableArray.__array__(self)
   62 def __array__(self) -> np.ndarray:
---> 63     raise NotImplementedError("UnindexableArrays can't be converted into numpy arrays or pandas Index objects")

NotImplementedError: UnindexableArrays can't be converted into numpy arrays or pandas Index objects
```

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

Context is #8699

Environment

Versions described above

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8704/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
957918751 MDU6SXNzdWU5NTc5MTg3NTE= 5664 Interpolation behaviour inconsistent with numpy? mathisc 7017525 open 0     4 2021-08-02T08:56:28Z 2024-03-12T01:15:46Z   NONE      

Hey all, when running dataset.interp(time=dataset.time), values are filled with np.nan if one of the neighbors is np.nan, even when interpolation is not actually needed.

Here is the sample code to reproduce the issue:

```python
def test_crop_times_nan():
    ds = xr.Dataset(
        data_vars={
            "some_variable": (['x', 'time'], np.array([[np.nan, 0, 1]]))
        },
        coords={
            "time": np.array([0, 1, 2])
        }
    )
    result = ds.interp(time=ds.time)

    # result["some_variable"].value == [nan, nan, 1.0]
    # whereas [nan, 0, 1.0] is EXPECTED
    xr.testing.assert_allclose(ds, result)
```

Please note that numpy does not have the same behavior:

```python
>>> import numpy as np
>>> np.interp([0, 1, 2], xp=[0, 1, 2], fp=[np.nan, 0, 1])
array([nan,  0.,  1.])
```

Is that an intended behaviour for xarray? If so, does this mean that I first have to check whether interpolation is needed instead of doing it no matter what (and use reindex instead of interp when it is not needed)?
(This will be kind of tricky if interpolation is needed for certain values and not for others...)
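
For what it's worth, a sketch of that check (exact label matching is the assumption here):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={"some_variable": (['x', 'time'], np.array([[np.nan, 0, 1]]))},
    coords={"time": np.array([0, 1, 2])},
)

target_times = ds.time
if np.isin(target_times, ds.time).all():
    result = ds.reindex(time=target_times)  # exact matches: NaN neighbors untouched
else:
    result = ds.interp(time=target_times)
```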

Thanks for your help ;)

Environment:

Output of <tt>xr.show_versions()</tt>

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.5 (default, Jul 28 2020, 12:59:40) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.8.0-7642-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4
xarray: 0.18.2
pandas: 1.2.4
numpy: 1.19.4
scipy: 1.6.0
netCDF4: 1.5.6
pydap: None
h5netcdf: 0.8.1
h5py: 3.1.0
Nio: None
zarr: None
cftime: 1.3.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.01.0
distributed: 2021.01.0
matplotlib: 3.4.2
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 57.4.0
pip: 20.2.4
conda: None
pytest: None
IPython: 7.19.0
sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5664/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
2140090923 I_kwDOAMm_X85_jzIr 8759 Passing datasets with different group hierarchy to open_mfdataset KareemShalabi 111437410 closed 0     4 2024-02-17T13:31:18Z 2024-03-03T18:43:09Z 2024-03-03T10:53:34Z NONE      

Is your feature request related to a problem?

When you want to open multiple datasets located at different nodes of the group hierarchy in an HDF file, you can't pass a list of group keys (save_mfdataset offers a 'groups' keyword; emphasis on the s). On top of that, the 'paths' keyword argument does not accept a datastore as a valid input.

Describe the solution you'd like

No response

Describe alternatives you've considered

One can, of course, open_dataset each file in a loop and combine afterwards, as sketched below. One possible fix is to modify the 'group' argument to accept a list of the same length as the paths list. Another could be changing the 'paths' keyword to accept datastore or h5py objects. Both are trivial in my opinion; most of the code is already there in other functions (open_dataset, save_mfdataset).
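
For reference, a minimal sketch of the loop-and-combine alternative (the file paths, group names, and h5netcdf engine below are assumptions for illustration):

```python
import xarray as xr

paths = ["a.h5", "b.h5"]
groups = ["/group1", "/nested/group2"]

# open each (file, group) pair separately, then combine
datasets = [
    xr.open_dataset(path, group=group, engine="h5netcdf")
    for path, group in zip(paths, groups)
]
combined = xr.combine_by_coords(datasets)
```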

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8759/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
2141899767 I_kwDOAMm_X85_qsv3 8769 Errors started appearing after release v2024.02.0 navidcy 7112768 closed 0     4 2024-02-19T09:23:16Z 2024-02-22T04:54:06Z 2024-02-22T04:54:06Z NONE      

What happened?

I started seeing errors in my CI after the latest xarray release. See, e.g.,

https://github.com/COSIMA/regional-mom6/actions/runs/7957078139/job/21719091616#step:7:226

After I added a compatibility pin for xarray to exclude the latest release, the error went away. See:

https://github.com/COSIMA/regional-mom6/actions/runs/7957192738
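
For reference, a hedged sketch of such a pin (the exact file depends on the project setup; the bound assumes the problems began with v2024.02.0):

```
# e.g. in requirements.txt / setup dependencies
xarray<2024.2.0
```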

What did you expect to happen?

No response

Minimal Complete Verifiable Example

No response

MVCE confirmation

  • [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8769/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
2142982259 I_kwDOAMm_X85_u1Bz 8771 Unable to use Xarray to work on RCM Dataset with xsar and safe_rcm by umr-lops sparshgarg23 34626942 closed 0     4 2024-02-19T18:58:50Z 2024-02-20T05:29:33Z 2024-02-20T05:29:33Z NONE      

What happened?

UMR-LOPS has introduced xsar, a library to work with RCM datasets. When working with the following code

```python
import xsar
import geoviews as gv
import holoviews as hv
import geoviews.feature as gf

hv.extension('bokeh')
path = xsar.get_test_file('RCM1_OK1050603_PK1050605_1_SC50MB_20200214_115905_HH_HV_Z010')
meta = xsar.RcmMeta(name=path)
meta.dt
```

I am encountering the following error:

```
ValueError                                Traceback (most recent call last)
<ipython-input-5-3d49b63ff406> in <cell line: 2>()
      1 #rs2meta = xsar.RadarSat2Meta(name=path)
----> 2 meta = xsar.RcmMeta(name=path)

14 frames /usr/local/lib/python3.10/dist-packages/xsar/utils.py in wrapper(args, kwargs) 93 startrss = process.memory_info().rss 94 starttime = time.time() ---> 95 result = f(args, **kwargs) 96 endtime = time.time() 97 if mem_monitor:

/usr/local/lib/python3.10/dist-packages/xsar/rcm_meta.py in init(self, name) 32 self.dt = api.open_rcm(name.split(':')[1]) 33 else: ---> 34 self.dt = api.open_rcm(name) 35 if not name.startswith('RCM_DS:'): 36 name = 'RCM_DS:%s:' % name

/usr/local/lib/python3.10/dist-packages/safe_rcm/api.py in open_rcm(url, backend_kwargs, manifest_ignores, **dataset_kwargs) 95 ) 96 ---> 97 tree = read_product(mapper, "metadata/product.xml") 98 99 calibration_root = "metadata/calibration"

/usr/local/lib/python3.10/dist-packages/safe_rcm/product/reader.py in read_product(mapper, product_path) 272 } 273 --> 274 converted = valmap( 275 lambda x: execute(**x)(decoded), 276 layout,

/usr/local/lib/python3.10/dist-packages/toolz/dicttoolz.py in valmap(func, d, factory) 83 """ 84 rv = factory() ---> 85 rv.update(zip(d.keys(), map(func, d.values()))) 86 return rv 87

/usr/local/lib/python3.10/dist-packages/safe_rcm/product/reader.py in <lambda>(x) 273 274 converted = valmap( --> 275 lambda x: execute(**x)(decoded), 276 layout, 277 )

/usr/local/lib/python3.10/dist-packages/toolz/functoolz.py in call(self, args, kwargs) 302 def call(self, args, kwargs): 303 try: --> 304 return self._partial(*args, kwargs) 305 except TypeError as exc: 306 if self._should_curry(args, kwargs, exc):

/usr/local/lib/python3.10/dist-packages/safe_rcm/product/reader.py in execute(mapping, f, path) 29 subset = query(path, mapping) 30 ---> 31 return compose_left(f, attach_path(path=path))(subset) 32 33

/usr/local/lib/python3.10/dist-packages/toolz/functoolz.py in call(self, args, kwargs) 485 486 def call(self, args, kwargs): --> 487 ret = self.first(*args, kwargs) 488 for f in self.funcs: 489 ret = f(ret)

/usr/local/lib/python3.10/dist-packages/toolz/functoolz.py in call(self, args, kwargs) 487 ret = self.first(args, **kwargs) 488 for f in self.funcs: --> 489 ret = f(ret) 490 return ret 491

/usr/local/lib/python3.10/dist-packages/safe_rcm/product/reader.py in <lambda>(obj) 126 ), 127 lambda obj: obj.set_index({"stacked": ["pole", "pulse"]}), --> 128 lambda obj: obj.unstack("stacked"), 129 ), 130 },

/usr/local/lib/python3.10/dist-packages/xarray/util/deprecation_helpers.py in inner(args, kwargs) 113 return func(args[:-n_extra_args], kwargs) 114 --> 115 return func(*args, kwargs) 116 117 return inner

/usr/local/lib/python3.10/dist-packages/xarray/core/dataset.py in unstack(self, dim, fill_value, sparse) 5576 ) 5577 else: -> 5578 result = result._unstack_once(d, stacked_indexes[d], fill_value, sparse) 5579 return result 5580

/usr/local/lib/python3.10/dist-packages/xarray/core/dataset.py in _unstack_once(self, dim, index_and_vars, fill_value, sparse) 5395 indexes = {k: v for k, v in self._indexes.items() if k != dim} 5396 -> 5397 new_indexes, clean_index = index.unstack() 5398 indexes.update(new_indexes) 5399

/usr/local/lib/python3.10/dist-packages/xarray/core/indexes.py in unstack(self) 1019 1020 if not clean_index.is_unique: -> 1021 raise ValueError( 1022 "Cannot unstack MultiIndex containing duplicates. Make sure entries " 1023 f"are unique, e.g., by calling .drop_duplicates('{self.dim}'), "

ValueError: Cannot unstack MultiIndex containing duplicates. Make sure entries are unique, e.g., by calling .drop_duplicates('stacked'), before unstacking.
```

As you can see from the last frames in the trace, the issue is in xarray/core/dataset.py when we unstack the dataset. Any ideas why this is happening? The issue doesn't occur with RADARSAT-2 or any other dataset, so is this an xarray problem, or should I raise the issue at umr-lops?
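
For what it's worth, the same ValueError can be reproduced without xsar (a minimal sketch of ours that mirrors the set_index/unstack steps visible in the trace):

```python
import xarray as xr

ds = xr.Dataset(
    {"v": ("stacked", [1, 2])},
    coords={
        "pole": ("stacked", ["HH", "HH"]),  # duplicate (pole, pulse) pairs
        "pulse": ("stacked", [0, 0]),
    },
).set_index(stacked=["pole", "pulse"])

ds.unstack("stacked")  # raises: Cannot unstack MultiIndex containing duplicates
```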

What did you expect to happen?

The error shouldn't be there, and I should be able to view the dataframe, as shown in the link below: https://cyclobs.ifremer.fr/static/sarwing_datarmor/xsar/examples/rcm.html

Minimal Complete Verifiable Example

```Python
import xsar
import geoviews as gv
import holoviews as hv
import geoviews.feature as gf

hv.extension('bokeh')
path = xsar.get_test_file('RCM1_OK1050603_PK1050605_1_SC50MB_20200214_115905_HH_HV_Z010')
meta = xsar.RcmMeta(name=path)
meta.dt
```

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

commit: None python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] python-bits: 64 OS: Linux OS-release: 6.1.58+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.2 libnetcdf: None xarray: 2023.7.0 pandas: 1.5.3 numpy: 1.25.2 scipy: 1.11.4 netCDF4: None pydap: None h5netcdf: 1.3.0 h5py: 3.9.0 Nio: None zarr: None cftime: None nc_time_axis: None PseudoNetCDF: None iris: None bottleneck: None dask: 2023.8.1 distributed: 2023.8.1 matplotlib: 3.7.1 cartopy: None seaborn: 0.13.1 numbagg: None fsspec: 2023.6.0 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 67.7.2 pip: 23.1.2 conda: None pytest: 7.4.4 mypy: None IPython: 7.34.0 sphinx: 5.0.2 /usr/local/lib/python3.10/dist-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils. warnings.warn("Setuptools is replacing distutils.")
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8771/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  not_planned xarray 13221727 issue
1912094632 I_kwDOAMm_X85x-D-o 8231 xr.concat concatenates along dimensions that it wasn't asked to TomNicholas 35968931 open 0     4 2023-09-25T18:50:29Z 2024-02-14T20:30:26Z   MEMBER      

What happened?

Here are two toy datasets designed to represent sections of a dataset that has variables living on a staggered grid. This type of dataset is common in fluid modelling (it's why xGCM exists).

```python
import xarray as xr

ds1 = xr.Dataset(
    coords={
        'x_center': ('x_center', [1, 2, 3]),
        'x_outer': ('x_outer', [0.5, 1.5, 2.5, 3.5]),
    },
)

ds2 = xr.Dataset(
    coords={
        'x_center': ('x_center', [4, 5, 6]),
        'x_outer': ('x_outer', [4.5, 5.5, 6.5]),
    },
)
```

Calling xr.concat on these with dim='x_center' happily concatenates them:

```python
>>> xr.concat([ds1, ds2], dim='x_center')
<xarray.Dataset>
Dimensions:   (x_outer: 7, x_center: 6)
Coordinates:
  * x_outer   (x_outer) float64 0.5 1.5 2.5 3.5 4.5 5.5 6.5
  * x_center  (x_center) int64 1 2 3 4 5 6
Data variables:
    *empty*
```

but notice that the returned result has been concatenated along both x_center and x_outer.

What did you expect to happen?

I did not expect this to work. I definitely didn't expect the datasets to be concatenated along a dimension I didn't ask them to be concatenated along (i.e. x_outer).

What I expected to happen was that (as by default coords='different') both variables would be attempted to be concatenated along the x_center dimension, which would have succeeded for the x_center variable but failed for the x_outer variable. Indeed, if I name the variables differently so that they are no longer coordinate variables then that is what happens:

```python
import xarray as xr

ds1 = xr.Dataset(
    data_vars={
        'a': ('x_center', [1, 2, 3]),
        'b': ('x_outer', [0.5, 1.5, 2.5, 3.5]),
    },
)

ds2 = xr.Dataset(
    data_vars={
        'a': ('x_center', [4, 5, 6]),
        'b': ('x_outer', [4.5, 5.5, 6.5]),
    },
)
```

```python
>>> xr.concat([ds1, ds2], dim='x_center', data_vars='different')
ValueError: cannot reindex or align along dimension 'x_outer' because of conflicting dimension sizes: {3, 4}
```

Minimal Complete Verifiable Example

No response

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

I was trying to create an example for which you would need the automatic combined concat/merge that happens within xr.combine_by_coords.

Environment

xarray 2023.8.0

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8231/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1390228572 I_kwDOAMm_X85S3TRc 7104 Duplicate values on unstack znichollscr 114576287 closed 0     4 2022-09-29T04:16:26Z 2024-02-13T09:48:37Z 2024-02-13T09:48:37Z NONE      

What happened?

I unstacked a dataset and got values I didn't expect. It turns out that, when unstacking, my dataset had multiple values for the same index. This is clearly a case of user error, but it silently passed.

What did you expect to happen?

A warning or error would be raised to say, "this isn't going to work".

Minimal Complete Verifiable Example

```Python
import datetime as dt
import xarray as xr

ds = xr.DataArray(
    [[1, 2, 3], [4, 5, 6]],
    dims=("lat", "time"),
    coords={"lat": [-60, 60], "time": [dt.datetime(2010, 1, d) for d in range(1, 4)]},
    name="test",
).to_dataset()

ds = (
    ds.assign_coords(
        {
            "month": ds["time"].dt.month,
            "year": ds["time"].dt.year,
        }
    )
    .set_index(time=["month", "year"])
)
ds = ds.unstack("time")

# the output only has 2 values, which isn't what I expected
ds["test"].data
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

It's not clear to me where the error is. It might just be that this particular order of operations leads to a case that isn't otherwise caught. Looking at intermediate output, I thought the error was in unstack but maybe it's more complex than that...
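
One way to see the collision before it is silently dropped (a diagnostic sketch, not a fix): the stacked index is not unique right before the unstack.

```python
# run just before ds.unstack("time") in the example above
print(ds.indexes["time"].is_unique)     # False: all three dates map to (1, 2010)
print(ds.indexes["time"].duplicated())  # [False  True  True]
```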

Environment

INSTALLED VERSIONS ------------------ commit: e678a1d7884a3c24dba22d41b2eef5d7fe5258e7 python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:14) [Clang 12.0.1 ] python-bits: 64 OS: Darwin OS-release: 21.5.0 machine: arm64 processor: arm byteorder: little LC_ALL: None LANG: en_AU.UTF-8 LOCALE: ('en_AU', 'UTF-8') libhdf5: 1.12.2 libnetcdf: 4.8.1 xarray: 0.1.dev4312+ge678a1d.d20220928 pandas: 1.5.0 numpy: 1.22.4 scipy: 1.9.1 netCDF4: 1.6.1 pydap: installed h5netcdf: 1.0.2 h5py: 3.7.0 Nio: None zarr: 2.13.2 cftime: 1.6.2 nc_time_axis: 1.4.1 PseudoNetCDF: 3.2.2 rasterio: 1.3.1 cfgrib: 0.9.10.1 iris: 3.3.0 bottleneck: 1.3.5 dask: 2022.9.1 distributed: 2022.9.1 matplotlib: 3.6.0 cartopy: 0.21.0 seaborn: 0.12.0 numbagg: 0.2.1 fsspec: 2022.8.2 cupy: None pint: 0.19.2 sparse: 0.13.0 flox: 0.5.9 numpy_groupies: 0.9.19 setuptools: 65.4.0 pip: 22.2.2 conda: None pytest: 7.1.3 IPython: 8.5.0 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7104/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
2126375172 I_kwDOAMm_X85-vekE 8726 PRs requiring approval & merging main? max-sixty 5635139 closed 0     4 2024-02-09T02:35:58Z 2024-02-09T18:23:52Z 2024-02-09T18:21:59Z MEMBER      

What is your issue?

Sorry I haven't been on the calls at all recently (unfortunately the schedule is difficult for me). Maybe this was discussed there? 

PRs now seem to require a separate approval prior to merging. Is there an upside to this? Is there any difference between those who can approve and those who can merge? Otherwise it just seems like more clicking.

PRs also now seem to require merging the latest main prior to merging? I get there's some theoretical value to this, because changes can semantically conflict with each other. But it's extremely rare that this actually happens (can we point to cases?), and it limits the immediacy & throughput of PRs. If the bad outcome does ever happen, we find out quickly when main tests fail and can revert.

(fwiw I wrote a few principles around this down a while ago here; those are much stronger than what I'm suggesting in this issue though)

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8726/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
2115049090 I_kwDOAMm_X85-ERaC 8694 Error while saving an altered dataset to NetCDF when loaded from a file tarik 12544636 open 0     4 2024-02-02T14:18:03Z 2024-02-07T13:38:40Z   NONE      

What happened?

When attempting to save an altered Xarray dataset to a NetCDF file using the to_netcdf method, an error occurs if the original dataset is loaded from a file. Specifically, this error does not occur when the dataset is created directly but only when it is loaded from a file.

What did you expect to happen?

The altered Xarray dataset is saved as a NetCDF file using the to_netcdf method.

Minimal Complete Verifiable Example

```Python
import xarray as xr

ds = xr.Dataset(
    data_vars=dict(
        win_1=("attempt", [True, False, True, False, False, True]),
        win_2=("attempt", [False, True, False, True, False, False]),
    ),
    coords=dict(
        attempt=[1, 2, 3, 4, 5, 6],
        player_1=("attempt", ["paper", "paper", "scissors", "scissors", "paper", "paper"]),
        player_2=("attempt", ["rock", "scissors", "paper", "rock", "paper", "rock"]),
    ),
)
ds.to_netcdf("dataset.nc")

ds_from_file = xr.load_dataset("dataset.nc")

ds_altered = ds_from_file.where(ds_from_file["player_1"] == "paper", drop=True)
ds_altered.to_netcdf("dataset_altered.nc")
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

```Python
Traceback (most recent call last):
  File "example.py", line 20, in <module>
    ds_altered.to_netcdf("dataset_altered.nc")
  File ".../python3.9/site-packages/xarray/core/dataset.py", line 2303, in to_netcdf
    return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
  File ".../python3.9/site-packages/xarray/backends/api.py", line 1315, in to_netcdf
    dump_to_store(
  File ".../python3.9/site-packages/xarray/backends/api.py", line 1362, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File ".../python3.9/site-packages/xarray/backends/common.py", line 356, in store
    self.set_variables(
  File ".../python3.9/site-packages/xarray/backends/common.py", line 398, in set_variables
    writer.add(source, target)
  File ".../python3.9/site-packages/xarray/backends/common.py", line 243, in add
    target[...] = source
  File ".../python3.9/site-packages/xarray/backends/scipy_.py", line 78, in __setitem__
    data[key] = value
  File ".../python3.9/site-packages/scipy/io/_netcdf.py", line 1019, in __setitem__
    self.data[index] = data
ValueError: could not broadcast input array from shape (4,5) into shape (4,8)
```

Anything else we need to know?

Findings:

The issue is related to the encoding information of the dataset becoming invalid after filtering data with the where method. The to_netcdf method takes the available encoding information instead of considering the actual shape of the data.

In the provided examples, the maximum length of strings stored in "player_1" and "player_2" is originally set to 8 characters. However, after filtering with the where method, the maximum length of the string becomes 5 in "player_1" and remains 8 in "player_2". But the encoding information of the variables still shows a length of 8, particularly the attribute char_dim_name.

Workaround:

A workaround to resolve this issue is to call the drop_encoding method on the dataset before saving it with to_netcdf. This action ensures that the encoding information is not available, and the to_netcdf method is forced to take the actual shapes of the data, preventing the broadcasting error.
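
Applied to the example above, the workaround is a single extra call (a sketch using Dataset.drop_encoding, available in recent xarray releases):

```python
ds_altered = ds_from_file.where(ds_from_file["player_1"] == "paper", drop=True)
ds_altered.drop_encoding().to_netcdf("dataset_altered.nc")  # stale char_dim_name is discarded
```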

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.9.14 (main, Aug 24 2023, 14:01:46) [GCC 11.4.0] python-bits: 64 OS: Linux OS-release: 6.3.1-060301-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: None libnetcdf: None xarray: 2024.1.1 pandas: 2.2.0 numpy: 1.26.3 scipy: 1.12.0 netCDF4: None pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: None nc_time_axis: None iris: None bottleneck: None dask: None distributed: None matplotlib: None cartopy: None seaborn: None numbagg: None fsspec: None cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 69.0.3 pip: 23.3.2 conda: None pytest: None mypy: None IPython: None sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8694/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
782440858 MDU6SXNzdWU3ODI0NDA4NTg= 4784 Opening a tiff with scale_factor/add_offset attrs then saving as zarr and opening causes a UFuncTypeError ohiat 53100696 closed 0     4 2021-01-08T22:45:21Z 2024-02-06T10:40:15Z 2024-02-06T10:40:14Z NONE      

What happened: When opening a geotiff that has scale_factor and add_offset metadata and then saving it as a zarr, the scale_factor and add_offset attributes are loaded and then saved as strings. When the resulting zarr is opened, xarray attempts to apply the scale_factor and add_offset attributes, but raises an exception because they are of type <U32.

```
/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/coding/variables.py in _scale_offset_decoding(data, scale_factor, add_offset, dtype)
    218     data = np.array(data, dtype=dtype, copy=True)
    219     if scale_factor is not None:
--> 220         data *= scale_factor
    221     if add_offset is not None:
    222         data += add_offset

UFuncTypeError: Cannot cast ufunc 'multiply' output from dtype('<U32') to dtype('float32') with casting rule 'same_kind'
```

What you expected to happen:
1. scale_factor and add_offset are converted to floats and applied when the tiff is opened
2. When attempting to apply scale_factor and add_offset attributes, check their types and/or cast them to floats.

Minimal Complete Verifiable Example:

```python
import xarray as xr

img = xr.open_rasterio('https://hlssa.blob.core.windows.net/hls/S30/HLS.S30.T10TET.2019001.v1.4_04.tif')
img.to_dataset(name='img', promote_attrs=True).to_zarr('./test.zarr', mode='w')
xr.open_zarr('./test.zarr').persist()
```
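
In the meantime, a possible workaround (a sketch, assuming the string attributes survive a plain float() cast) is to cast the offending attributes before writing the zarr store:

```python
ds = img.to_dataset(name='img', promote_attrs=True)
for attr in ('scale_factor', 'add_offset'):
    if attr in ds.attrs:
        ds.attrs[attr] = float(ds.attrs[attr])  # write as numbers, not strings
    if attr in ds['img'].attrs:
        ds['img'].attrs[attr] = float(ds['img'].attrs[attr])
ds.to_zarr('./test.zarr', mode='w')
```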

Anything else we need to know?:

Environment:

Output of <tt>xr.show_versions()</tt> INSTALLED VERSIONS ------------------ commit: None python: 3.8.6 | packaged by conda-forge | (default, Dec 26 2020, 05:05:16) [GCC 9.3.0] python-bits: 64 OS: Linux OS-release: 5.4.0-1034-azure machine: x86_64 processor: x86_64 byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.6 libnetcdf: 4.7.4 xarray: 0.16.2 pandas: 1.2.0 numpy: 1.19.5 scipy: 1.6.0 netCDF4: 1.5.5.1 pydap: None h5netcdf: 0.8.1 h5py: 2.10.0 Nio: None zarr: 2.6.1 cftime: 1.3.0 nc_time_axis: None PseudoNetCDF: None rasterio: 1.1.8 cfgrib: None iris: None bottleneck: None dask: 2020.12.0 distributed: 2020.12.0 matplotlib: 3.3.3 cartopy: 0.18.0 seaborn: None numbagg: None pint: None setuptools: 49.6.0.post20201009 pip: 20.3.3 conda: None pytest: 6.2.1 IPython: 7.19.0 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4784/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
2112742578 I_kwDOAMm_X8597eSy 8693 reading netcdf with engine=scipy fails with a typeerror under certain conditions eivindjahren 32731672 open 0     4 2024-02-01T15:03:23Z 2024-02-05T09:35:51Z   CONTRIBUTOR      

What happened?

Saving and loading from netcdf with engine=scipy produces an unexpected valueerror on read. The file seems to be corrupted.

What did you expect to happen?

reading works just fine.

Minimal Complete Verifiable Example

```Python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "values": (
            ["name", "time"],
            np.array([[]], dtype=np.float32).T,
        )
    },
    coords={"time": [1], "name": []},
).expand_dims({"index": [0]})

ds.to_netcdf("file.nc", engine="scipy")
_ = xr.open_dataset("file.nc", engine="scipy")
```
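
As a side observation (ours, not from the report): the dataset being written has a zero-length name dimension, which lines up with the shape/size arithmetic that fails inside scipy's reader.

```Python
print(dict(ds.sizes))  # expected: {'index': 1, 'name': 0, 'time': 1}
```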

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

```Python
KeyError                                  Traceback (most recent call last)
File ~/.../python3.11/site-packages/xarray/backends/file_manager.py:211, in CachingFileManager._acquire_with_cache_info(self, needs_lock)
    210 try:
--> 211     file = self._cache[self._key]
    212 except KeyError:

File ~/.../python3.11/site-packages/xarray/backends/lru_cache.py:56, in LRUCache.__getitem__(self, key)
     55 with self._lock:
---> 56     value = self._cache[key]
     57     self._cache.move_to_end(key)

KeyError: [<function _open_scipy_netcdf at 0x7fe96afa9120>, ('/home/eivind/Projects/ert/file.nc',), 'r', (('mmap', None), ('version', 2)), '264ec6b3-78b3-4766-bb41-7656d6a51962']

During handling of the above exception, another exception occurred:

TypeError Traceback (most recent call last) Cell In[1], line 18 4 ds = ( 5 xr.Dataset( 6 { (...) 15 .expand_dims({"index": [0]}) 16 ) 17 ds.to_netcdf("file.nc", engine="scipy") ---> 18 _ = xr.open_dataset("file.nc", engine="scipy")

File .../python3.11/site-packages/xarray/backends/api.py:572 , in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, d ecode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked _array_type, from_array_kwargs, backend_kwargs, kwargs) 560 decoders = _resolve_decoders_kwargs( 561 decode_cf, 562 open_backend_dataset_parameters=backend.open_dataset_parameters, (...) 568 decode_coords=decode_coords, 569 ) 571 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None) --> 572 backend_ds = backend.open_dataset( 573 filename_or_obj, 574 drop_variables=drop_variables, 575 decoders, 576 kwargs, 577 ) 578 ds = _dataset_from_backend_dataset( 579 backend_ds, 580 filename_or_obj, (...) 590 kwargs, 591 ) 592 return ds

File .../python3.11/site-packages/xarray/backends/scipy_.py: 315, in ScipyBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, con cat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, mode, format, group, mm ap, lock) 313 store_entrypoint = StoreBackendEntrypoint() 314 with close_on_error(store): --> 315 ds = store_entrypoint.open_dataset( 316 store, 317 mask_and_scale=mask_and_scale, 318 decode_times=decode_times, 319 concat_characters=concat_characters, 320 decode_coords=decode_coords, 321 drop_variables=drop_variables, 322 use_cftime=use_cftime, 323 decode_timedelta=decode_timedelta, 324 ) 325 return ds

File .../python3.11/site-packages/xarray/backends/store.py:4 3, in StoreBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, conca t_characters, decode_coords, drop_variables, use_cftime, decode_timedelta) 29 def open_dataset( # type: ignore[override] # allow LSP violation, not supporting **kwargs 30 self, 31 filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore, (...) 39 decode_timedelta=None, 40 ) -> Dataset: 41 assert isinstance(filename_or_obj, AbstractDataStore) ---> 43 vars, attrs = filename_or_obj.load() 44 encoding = filename_or_obj.get_encoding() 46 vars, attrs, coord_names = conventions.decode_cf_variables( 47 vars, 48 attrs, (...) 55 decode_timedelta=decode_timedelta, 56 )

File .../python3.11/site-packages/xarray/backends/common.py: 210, in AbstractDataStore.load(self) 188 def load(self): 189 """ 190 This loads the variables and attributes simultaneously. 191 A centralized loading function makes it easier to create (...) 207 are requested, so care should be taken to make sure its fast. 208 """ 209 variables = FrozenDict( --> 210 (_decode_variable_name(k), v) for k, v in self.get_variables().items() 211 ) 212 attributes = FrozenDict(self.get_attrs()) 213 return variables, attributes

File .../python3.11/site-packages/xarray/backends/scipy_.py: 181, in ScipyDataStore.get_variables(self) 179 def get_variables(self): 180 return FrozenDict( --> 181 (k, self.open_store_variable(k, v)) for k, v in self.ds.variables.items() 182 )

File .../python3.11/site-packages/xarray/backends/scipy_.py: 170, in ScipyDataStore.ds(self) 168 @property 169 def ds(self): --> 170 return self._manager.acquire()

File .../python3.11/site-packages/xarray/backends/file_manag er.py:193, in CachingFileManager.acquire(self, needs_lock) 178 def acquire(self, needs_lock=True): 179 """Acquire a file object from the manager. 180 181 A new file is only opened if it has expired from the (...) 191 An open file object, as returned by opener(*args, **kwargs). 192 """ --> 193 file, _ = self._acquire_with_cache_info(needs_lock) 194 return file

File .../python3.11/site-packages/xarray/backends/file_manag er.py:217, in CachingFileManager._acquire_with_cache_info(self, needs_lock) 215 kwargs = kwargs.copy() 216 kwargs["mode"] = self._mode --> 217 file = self._opener(self._args, *kwargs) 218 if self._mode == "w": 219 # ensure file doesn't get overridden when opened again 220 self._mode = "a"

File .../python3.11/site-packages/xarray/backends/scipy_.py: 109, in _open_scipy_netcdf(filename, mode, mmap, version) 106 filename = io.BytesIO(filename) 108 try: --> 109 return scipy.io.netcdf_file(filename, mode=mode, mmap=mmap, version=version) 110 except TypeError as e: # netcdf3 message is obscure in this case 111 errmsg = e.args[0]

File .../python3.11/site-packages/scipy/io/_netcdf.py:278, i n netcdf_file.init(self, filename, mode, mmap, version, maskandscale) 275 self._attributes = {} 277 if mode in 'ra': --> 278 self._read()

File .../python3.11/site-packages/scipy/io/_netcdf.py:607, i n netcdf_file._read(self) 605 self._read_dim_array() 606 self._read_gatt_array() --> 607 self._read_var_array()

File .../python3.11/site-packages/scipy/io/netcdf.py:688, i n netcdf_file._read_var_array(self) 685 data = None 686 else: # not a record variable 687 # Calculate size to avoid problems with vsize (above) --> 688 a_size = reduce(mul, shape, 1) * size 689 if self.use_mmap: 690 data = self._mm_buf[begin:begin_+a_size].view(dtype=dtype_)

TypeError: unsupported operand type(s) for *: 'int' and 'NoneType' ```

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.11.4 (main, Dec 7 2023, 15:43:41) [GCC 12.3.0] python-bits: 64 OS: Linux OS-release: 6.2.0-39-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.2 libnetcdf: 4.9.3-development xarray: 2024.1.1 pandas: 2.1.1 numpy: 1.26.1 scipy: 1.11.3 netCDF4: 1.6.5 pydap: None h5netcdf: None h5py: 3.10.0 Nio: None zarr: None cftime: 1.6.3 nc_time_axis: None iris: None bottleneck: None dask: None distributed: None matplotlib: 3.8.0 cartopy: None seaborn: 0.13.1 numbagg: None fsspec: None cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 63.4.3 pip: 23.3.1 conda: None pytest: 7.4.4 mypy: 1.8.0 IPython: 8.17.2 sphinx: 7.2.6
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8693/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
2111051033 I_kwDOAMm_X8591BUZ 8691 xarray.open_dataset with chunks={} returns a single chunk and not engine (h5netcdf) preferred chunks abarciauskas-bgse 15016780 closed 0     4 2024-01-31T22:04:02Z 2024-01-31T22:56:17Z 2024-01-31T22:56:17Z NONE      

What happened?

When opening MUR SST netcdfs from S3, xarray.open_dataset(file, engine="h5netcdf", chunks={}) returns a single chunk, whereas the h5netcdf library reports a chunk shape of (1, 1023, 2047).

A notebook version of the code below includes the output: https://gist.github.com/abarciauskas-bgse/9366e04d2af09b79c9de466f6c1d3b90

What did you expect to happen?

I thought the chunks={} option would return the same chunks (1, 1023, 2047) exposed by the h5netcdf engine.

Minimal Complete Verifiable Example

```Python
#!/usr/bin/env python
# coding: utf-8

# This notebook looks at how xarray and h5netcdf return different chunks.

import pandas as pd
import h5netcdf
import s3fs
import xarray as xr

dates = [
    d.to_pydatetime().strftime('%Y%m%d')
    for d in pd.date_range('2023-02-01', '2023-03-01', freq='D')
]

SHORT_NAME = 'MUR-JPL-L4-GLOB-v4.1'
s3_fs = s3fs.S3FileSystem(anon=False)
var = 'analysed_sst'

def make_filename(time):
    base_url = f's3://podaac-ops-cumulus-protected/{SHORT_NAME}/'
    # example file: "/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
    return f'{base_url}{time}090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'

s3_urls = [make_filename(d) for d in dates]

def print_chunk_shape(s3_url):
    try:
        # Open the dataset using xarray
        file = s3_fs.open(s3_url)
        dataset = xr.open_dataset(file, engine='h5netcdf', chunks={})

        # Print chunk shapes for each variable in the dataset
        print(f"\nChunk shapes for {s3_url}:")
        if dataset[var].chunks is not None:
            print(f"xarray open_dataset chunks for {var}: {dataset[var].chunks}")
        else:
            print(f"xarray open_dataset chunks for {var}: Not chunked")

        with h5netcdf.File(file, 'r') as file:
            dataset = file[var]

            # Check if the dataset is chunked
            if dataset.chunks:
                print(f"h5netcdf chunks for {var}:", dataset.chunks)
            else:
                print(f"h5netcdf dataset is not chunked.")

    except Exception as e:
        print(f"Failed to process {s3_url}: {e}")

[print_chunk_shape(s3_url) for s3_url in s3_urls]
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] python-bits: 64 OS: Linux OS-release: 5.10.198-187.748.amzn2.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.14.1 libnetcdf: 4.9.2 xarray: 2023.6.0 pandas: 2.0.3 numpy: 1.24.4 scipy: 1.11.1 netCDF4: 1.6.4 pydap: installed h5netcdf: 1.2.0 h5py: 3.9.0 Nio: None zarr: 2.15.0 cftime: 1.6.2 nc_time_axis: 1.4.1 PseudoNetCDF: None iris: None bottleneck: 1.3.7 dask: 2023.6.1 distributed: 2023.6.1 matplotlib: 3.7.1 cartopy: 0.21.1 seaborn: 0.12.2 numbagg: None fsspec: 2023.6.0 cupy: None pint: 0.22 sparse: 0.14.0 flox: 0.7.2 numpy_groupies: 0.9.22 setuptools: 68.0.0 pip: 23.1.2 conda: None pytest: 7.4.0 mypy: None IPython: 8.14.0 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8691/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
2104267494 I_kwDOAMm_X859bJLm 8677 Add rolling.rank() same as pandas Mirac-Le 39230130 open 0     4 2024-01-28T17:27:21Z 2024-01-29T19:50:20Z   NONE      

Is your feature request related to a problem?

Dear xarray maintainers,

I would like to express my heartfelt gratitude for the significant optimizations your xarray library has brought to my project. Xarray combines the speed of numpy with the highly customizable parameters of pandas. The extensive parameters in the rolling module have allowed me to achieve functionality similar to pandas more efficiently.

I am wondering if it would be possible to incorporate a ranking method for rolling windows, including the ability to specify parameters such as pct, similar to the pandas rolling.rank function. Your consideration of this feature would be greatly appreciated.
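
Until something like this lands, a rough workaround is possible with rolling.construct (a sketch of ours, assuming a fixed window of 5; counting ties this way resembles pandas' method="max" rather than the default method="average"):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(20), dims="time")

# Build an explicit window dimension, then rank the window's last
# (i.e. current) element against the other values in the window.
windowed = da.rolling(time=5).construct("window")
current = windowed.isel(window=-1)
rank = (windowed <= current).sum("window")     # 1-based rank of the current value
pct = rank / windowed.notnull().sum("window")  # rough equivalent of rank(pct=True)
```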

Once again, thank you for your contributions!

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8677/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1716228662 I_kwDOAMm_X85mS5I2 7848 Compatibility with the Array API standard TomNicholas 35968931 open 0     4 2023-05-18T20:34:43Z 2024-01-25T04:03:42Z   MEMBER      

What is your issue?

Meta-issue to track all the smaller issues around making xarray and the array API standard compatible with each other.

We've already had

  • #6804
  • #7067
  • #7847

and there will likely be many others.


I suspect this might require changes to the standard as well as to xarray - in particular see this list of common numpy functions which are not currently in the array API standard. Of these xarray currently uses (FYI @ralfgommers ):

  • np.clip
  • np.diff
  • np.pad
  • np.repeat
  • ~np.take~
  • ~np.tile~
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7848/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
2079089277 I_kwDOAMm_X8577GJ9 8607 allow computing just a small number of variables keewis 14808389 open 0     4 2024-01-12T15:21:27Z 2024-01-12T20:20:29Z   MEMBER      

Is your feature request related to a problem?

I frequently find myself computing a handful of variables of a dataset (typically coordinates) and assigning them back to the dataset, and wishing we had a method / function that allowed that.

Describe the solution you'd like

I'd imagine something like

```python
ds.compute(variables=variable_names)
```

but I'm undecided on whether that's a good idea (it might make .compute more complex?)

Describe alternatives you've considered

So far I've been using something like

```python
ds.assign_coords({k: lambda ds: ds[k].compute() for k in variable_names})
ds.pipe(lambda ds: ds.merge(ds[variable_names].compute()))
```

but both are not easy to type / understand (though having .merge take a callable would make this much easier). Also, the first option computes variables separately, which may not be ideal?

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8607/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
2073024461 I_kwDOAMm_X857j9fN 8602 `DataArray.mean()` and `Dataset.mean()` fail with `sparse==0.15.0` martinkim0 46072231 closed 0     4 2024-01-09T19:27:47Z 2024-01-10T14:44:57Z 2024-01-10T14:44:57Z NONE      

What happened?

The following script leads to an error:

```python
import numpy as np
import xarray as xr
from sparse import GCXS

x = np.random.negative_binomial(1, 0.5, size=(100, 100))
array = xr.DataArray(GCXS.from_numpy(x))
array.mean()
```

```

AttributeError                            Traceback (most recent call last)
Cell In[16], line 1
----> 1 array.mean()

File ~/.../python3.11/site-packages/xarray/core/_aggregations.py:1663, in DataArrayAggregations.mean(self, dim, skipna, keep_attrs, kwargs) 1588 def mean( 1589 self, 1590 dim: Dims = None, (...) 1594 kwargs: Any, 1595 ) -> Self: 1596 """ 1597 Reduce this DataArray's data by applying mean along some dimension(s). 1598 (...) 1661 array(nan) 1662 """ -> 1663 return self.reduce( 1664 duck_array_ops.mean, 1665 dim=dim, 1666 skipna=skipna, 1667 keep_attrs=keep_attrs, 1668 **kwargs, 1669 )

File ~/.../python3.11/site-packages/xarray/core/dataarray.py:3776, in DataArray.reduce(self, func, dim, axis, keep_attrs, keepdims, kwargs) 3732 def reduce( 3733 self, 3734 func: Callable[..., Any], (...) 3740 kwargs: Any, 3741 ) -> Self: 3742 """Reduce this array by applying func along some dimension(s). 3743 3744 Parameters (...) 3773 summarized data and the indicated dimension(s) removed. 3774 """ -> 3776 var = self.variable.reduce(func, dim, axis, keep_attrs, keepdims, **kwargs) 3777 return self._replace_maybe_drop_dims(var)

File ~/.../python3.11/site-packages/xarray/core/variable.py:1756, in Variable.reduce(self, func, dim, axis, keep_attrs, keepdims, kwargs) 1749 keep_attrs_ = ( 1750 _get_keep_attrs(default=False) if keep_attrs is None else keep_attrs 1751 ) 1753 # Noe that the call order for Variable.mean is 1754 # Variable.mean -> NamedArray.mean -> Variable.reduce 1755 # -> NamedArray.reduce -> 1756 result = super().reduce( 1757 func=func, dim=dim, axis=axis, keepdims=keepdims, kwargs 1758 ) 1760 # return Variable always to support IndexVariable 1761 return Variable( 1762 result.dims, result.data, attrs=result._attrs if keep_attrs else None 1763 )

File ~/.../python3.11/site-packages/xarray/namedarray/core.py:772, in NamedArray.reduce(self, func, dim, axis, keepdims, kwargs) 770 data = func(self.data, axis=axis, kwargs) 771 else: --> 772 data = func(self.data, **kwargs) 774 if getattr(data, "shape", ()) == self.shape: 775 dims = self.dims

File ~/.../python3.11/site-packages/xarray/core/duck_array_ops.py:637, in mean(array, axis, skipna, kwargs) 635 return _to_pytimedelta(mean_timedeltas, unit="us") + offset 636 else: --> 637 return _mean(array, axis=axis, skipna=skipna, kwargs)

File ~/.../python3.11/site-packages/xarray/core/duck_array_ops.py:399, in _create_nan_agg_method.<locals>.f(values, axis, skipna, **kwargs) 396 kwargs.pop("min_count", None) 398 xp = get_array_namespace(values) --> 399 func = getattr(xp, name) 401 try: 402 with warnings.catch_warnings():

AttributeError: module 'sparse' has no attribute 'mean' ```

What did you expect to happen?

Reproducible script runs without error with sparse==0.14.0.

Minimal Complete Verifiable Example

No response

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0] python-bits: 64 OS: Linux OS-release: 6.2.0-34-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.14.2 libnetcdf: None xarray: 2023.12.0 pandas: 1.5.3 numpy: 1.24.4 scipy: 1.11.4 netCDF4: None pydap: None h5netcdf: None h5py: 3.10.0 Nio: None zarr: None cftime: None nc_time_axis: None iris: None bottleneck: None dask: 2023.12.0 distributed: 2023.12.0 matplotlib: 3.8.2 cartopy: None seaborn: 0.12.2 numbagg: None fsspec: 2023.12.0 cupy: None pint: None sparse: 0.15.0 flox: None numpy_groupies: 0.10.2 setuptools: 68.2.2 pip: 23.3.1 conda: None pytest: 7.4.3 mypy: None IPython: 8.18.1 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8602/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
2041076267 I_kwDOAMm_X855qFor 8551 Make _obj_repr public BENR0 12115839 closed 0     4 2023-12-14T07:19:16Z 2023-12-21T16:00:52Z 2023-12-21T16:00:52Z NONE      

What is your issue?

We are using https://github.com/pydata/xarray/blob/2971994ef1dd67f44fe59e846c62b47e1e5b240b/xarray/core/formatting_html.py#L278

in the html representation of AreaDefinitions in https://github.com/pytroll/pyresample and would rather not import private functions. Would it be OK to make _obj_repr public?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8551/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
2027147099 I_kwDOAMm_X854089b 8523 tree-reduce the combine for `open_mfdataset(..., parallel=True, combine="nested")` dcherian 2448579 open 0     4 2023-12-05T21:24:51Z 2023-12-18T19:32:39Z   MEMBER      

Is your feature request related to a problem?

When parallel=True and a distributed client is active, Xarray reads every file in parallel, constructs a Dataset per file with indexed coordinates loaded, and then sends all of that back to the "head node" for the combine.

Instead we can tree-reduce the combine (example) by switching to dask.bag instead of dask.delayed and skip the overhead of shipping 1000s of copies of an indexed coordinate back to the head node.

  1. The downside is the dask graph is "worse" but perhaps that shouldn't stop us.
  2. I think this is only feasible for combine="nested"
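
For illustration, a minimal sketch (ours, not the proposed implementation) of what the tree reduce could look like with dask.bag, using tiny in-memory datasets as stand-ins for the per-file ones:

```python
import dask.bag as db
import xarray as xr

# stand-ins for the datasets that open_mfdataset would build per file
datasets = [
    xr.Dataset({"a": ("time", [i])}, coords={"time": [i]}) for i in range(8)
]

bag = db.from_sequence(datasets, npartitions=4)
combined = bag.fold(
    binop=lambda a, b: xr.concat([a, b], dim="time"),
    split_every=2,  # combine pairwise, in a tree, instead of all at the root
).compute()
```

Here split_every controls the fan-in of the reduction tree, which is the knob that trades graph depth against how much data any single combine step sees.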

cc @TomNicholas

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8523/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1223031600 I_kwDOAMm_X85I5fsw 6561 Excessive memory consumption by to_dataframe() sgdecker 8419421 closed 0     4 2022-05-02T15:33:33Z 2023-12-15T20:47:32Z 2023-12-15T20:47:32Z NONE      

What happened?

This is a reincarnation of #2534 with a reproducible example.

A 51 MB netCDF file leads to to_dataframe() requesting 23 GB.

What did you expect to happen?

I expect to_dataframe() to require much less than 23 GB of memory for this operation.

Minimal Complete Verifiable Example

```Python
import urllib.request
import xarray as xr

url = 'http://people.envsci.rutgers.edu/decker/Surface_METAR_20220501_0000.nc'
fname = 'metar.nc'
urllib.request.urlretrieve(url, filename=fname)
ncdata = xr.open_dataset(fname)
df = ncdata.to_dataframe()
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```Python
Traceback (most recent call last):
  File "/chariton/decker/test/bug/xarraymem.py", line 8, in <module>
    df = ncdata.to_dataframe()
  File "/home/decker/local/miniconda3/envs/xarraybug/lib/python3.10/site-packages/xarray/core/dataset.py", line 5399, in to_dataframe
    return self._to_dataframe(ordered_dims=ordered_dims)
  File "/home/decker/local/miniconda3/envs/xarraybug/lib/python3.10/site-packages/xarray/core/dataset.py", line 5363, in _to_dataframe
    data = [
  File "/home/decker/local/miniconda3/envs/xarraybug/lib/python3.10/site-packages/xarray/core/dataset.py", line 5364, in <listcomp>
    self._variables[k].set_dims(ordered_dims).values.reshape(-1)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 23.3 GiB for an array with shape (5021, 127626) and data type |S39
```
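
For scale, the failing allocation is consistent with broadcasting the |S39 strings across both dimensions: 5021 × 127626 cells × 39 bytes ≈ 24.99 × 10⁹ bytes ≈ 23.3 GiB, which matches the shape and dtype reported in the error.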

Anything else we need to know?

No response

Environment

/home/decker/local/miniconda3/envs/xarraybug/lib/python3.10/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils. warnings.warn("Setuptools is replacing distutils.") INSTALLED VERSIONS ------------------ commit: None python: 3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:39:04) [GCC 10.3.0] python-bits: 64 OS: Linux OS-release: 3.10.0-1160.62.1.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.1 libnetcdf: 4.8.1 xarray: 2022.3.0 pandas: 1.4.2 numpy: 1.22.3 scipy: None netCDF4: 1.5.8 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.6.0 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: None distributed: None matplotlib: None cartopy: None seaborn: None numbagg: None fsspec: None cupy: None pint: None sparse: None setuptools: 62.1.0 pip: 22.0.4 conda: None pytest: None IPython: None sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6561/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
384002323 MDU6SXNzdWUzODQwMDIzMjM= 2570 np.clip() executes eagerly Hoeze 1200058 closed 0     4 2018-11-24T16:25:03Z 2023-12-03T05:29:17Z 2023-12-03T05:29:17Z NONE      

Example:

```python
import numpy as np
import xarray as xr

x = xr.DataArray(np.random.uniform(size=[100, 100])).chunk(10)
x
```

```
<xarray.DataArray (dim_0: 100, dim_1: 100)>
dask.array<shape=(100, 100), dtype=float64, chunksize=(10, 10)>
Dimensions without coordinates: dim_0, dim_1
```

```python
np.clip(x, 0, 0.5)
```

```
<xarray.DataArray (dim_0: 100, dim_1: 100)>
array([[0.264276, 0.32227 , 0.336396, ..., 0.110182, 0.28255 , 0.399041],
       [0.5     , 0.030289, 0.5     , ..., 0.428923, 0.262249, 0.5     ],
       [0.5     , 0.5     , 0.280971, ..., 0.427334, 0.026649, 0.5     ],
       ...,
       [0.5     , 0.5     , 0.294943, ..., 0.053143, 0.5     , 0.488239],
       [0.5     , 0.341485, 0.5     , ..., 0.5     , 0.250441, 0.5     ],
       [0.5     , 0.156285, 0.179123, ..., 0.5     , 0.076242, 0.319699]])
Dimensions without coordinates: dim_0, dim_1
```

```python
x.clip(0, 0.5)
```

```
<xarray.DataArray (dim_0: 100, dim_1: 100)>
dask.array<shape=(100, 100), dtype=float64, chunksize=(10, 10)>
Dimensions without coordinates: dim_0, dim_1
```

Problem description

Using np.clip() directly calculates the result, while xr.DataArray.clip() does not.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2570/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  not_planned xarray 13221727 issue
1902108672 I_kwDOAMm_X85xX-AA 8207 Getting `NETCDF: HDF error` while writing a NetCDF file opened using `open_mfdataset` kasra-keshavarz 50383939 open 0     4 2023-09-19T02:44:02Z 2023-12-01T22:29:49Z   NONE      

What is your issue?

I am simply reading 366 small (~15 MB) NetCDF files to create one big NetCDF file at the end. Below is the relevant workflow:

```python-console In [1]: import os; import dask

In [2]: import xarray as xr

In [3]: from dask.distributed import Client, LocalCluster

In [4]: cluster = LocalCluster(n_workers=4, threads_per_worker=1) # 1 core to each worker

In [5]: client = Client(cluster)

In [6]: os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'

In [7]: ds = xr.open_mfdataset('./remapped/*.nc', chunks={'COMID': 1400}, parallel=True)

In [8]: ds.to_netcdf('./out2.nc')

```

And below is the error I am getting:

Error message

```python-console
In [8]: ds.to_netcdf('./out2.nc')
/home/kasra545/virtual-envs/meshflow/lib/python3.10/site-packages/distributed/client.py:3149: UserWarning: Sending large graph of size 9.97 MiB. This may cause some slowdown. Consider scattering data ahead of time and using futures.
  warnings.warn(
2023-09-18 22:26:14,279 - distributed.worker - WARNING - Compute Failed
Key:       ('open_dataset-concatenate-concatenate-be7dd534c459e2f316d9149df2d9ec95', 178, 0)
Function:  getter
args:      (ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyIndexedArray(array=_ElementwiseFunctionArray(LazilyIndexedArray(array=<xarray.backends.netCDF4_.NetCDF4ArrayWrapper object at 0x2b863b0e94c0>, key=BasicIndexer((slice(None, None, None), slice(None, None, None)))), func=functools.partial(<function _apply_mask at 0x2b86218d4ee0>, encoded_fill_values={-9999.0}, decoded_fill_value=nan, dtype=dtype('float64')), dtype=dtype('float64')), key=BasicIndexer((slice(None, None, None), slice(None, None, None)))))), (slice(0, 24, None), slice(0, 1400, None)))
kwargs:    {}
Exception: "RuntimeError('NetCDF: HDF error')"

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[8], line 1
----> 1 ds.to_netcdf('./out2.nc')

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/core/dataset.py:2252, in Dataset.to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   2249     encoding = {}
   2250 from xarray.backends.api import to_netcdf
-> 2252 return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
   2253     self,
   2254     path,
   2255     mode=mode,
   2256     format=format,
   2257     group=group,
   2258     engine=engine,
   2259     encoding=encoding,
   2260     unlimited_dims=unlimited_dims,
   2261     compute=compute,
   2262     multifile=False,
   2263     invalid_netcdf=invalid_netcdf,
   2264 )

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/backends/api.py:1255, in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
   1252 if multifile:
   1253     return writer, store
-> 1255 writes = writer.sync(compute=compute)
   1257 if isinstance(target, BytesIO):
   1258     store.sync()

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/backends/common.py:256, in ArrayWriter.sync(self, compute, chunkmanager_store_kwargs)
    253 if chunkmanager_store_kwargs is None:
    254     chunkmanager_store_kwargs = {}
--> 256 delayed_store = chunkmanager.store(
    257     self.sources,
    258     self.targets,
    259     lock=self.lock,
    260     compute=compute,
    261     flush=True,
    262     regions=self.regions,
    263     **chunkmanager_store_kwargs,
    264 )
    265 self.sources = []
    266 self.targets = []

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/core/daskmanager.py:211, in DaskManager.store(self, sources, targets, **kwargs)
    203 def store(
    204     self,
    205     sources: DaskArray | Sequence[DaskArray],
    206     targets: Any,
    207     **kwargs,
    208 ):
    209     from dask.array import store
--> 211     return store(
    212         sources=sources,
    213         targets=targets,
    214         **kwargs,
    215     )

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/dask/array/core.py:1236, in store(***failed resolving arguments***)
   1234 elif compute:
   1235     store_dsk = HighLevelGraph(layers, dependencies)
-> 1236     compute_as_if_collection(Array, store_dsk, map_keys, **kwargs)
   1237     return None
   1239 else:

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/dask/base.py:369, in compute_as_if_collection(cls, dsk, keys, scheduler, get, **kwargs)
    367 schedule = get_scheduler(scheduler=scheduler, cls=cls, get=get)
    368 dsk2 = optimization_function(cls)(dsk, keys, **kwargs)
--> 369 return schedule(dsk2, keys, **kwargs)

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/distributed/client.py:3267, in Client.get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   3265     should_rejoin = False
   3266 try:
-> 3267     results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   3268 finally:
   3269     for f in futures.values():

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/distributed/client.py:2393, in Client.gather(self, futures, errors, direct, asynchronous)
   2390 local_worker = None
   2392 with shorten_traceback():
-> 2393     return self.sync(
   2394         self._gather,
   2395         futures,
   2396         errors=errors,
   2397         direct=direct,
   2398         local_worker=local_worker,
   2399         asynchronous=asynchronous,
   2400     )

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/core/indexing.py:484, in __array__()
    483 def __array__(self, dtype: np.typing.DTypeLike = None) -> np.ndarray:
--> 484     return np.asarray(self.get_duck_array(), dtype=dtype)

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/core/indexing.py:487, in get_duck_array()
    486 def get_duck_array(self):
--> 487     return self.array.get_duck_array()

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/core/indexing.py:664, in get_duck_array()
    663 def get_duck_array(self):
--> 664     return self.array.get_duck_array()

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/core/indexing.py:557, in get_duck_array()
    552 # self.array[self.key] is now a numpy array when
    553 # self.array is a BackendArray subclass
    554 # and self.key is BasicIndexer((slice(None, None, None),))
    555 # so we need the explicit check for ExplicitlyIndexed
    556 if isinstance(array, ExplicitlyIndexed):
--> 557     array = array.get_duck_array()
    558 return _wrap_numpy_scalars(array)

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/coding/variables.py:74, in get_duck_array()
     73 def get_duck_array(self):
---> 74     return self.func(self.array.get_duck_array())

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/core/indexing.py:551, in get_duck_array()
    550 def get_duck_array(self):
--> 551     array = self.array[self.key]
    552 # self.array[self.key] is now a numpy array when
    553 # self.array is a BackendArray subclass
    554 # and self.key is BasicIndexer((slice(None, None, None),))
    555 # so we need the explicit check for ExplicitlyIndexed
    556 if isinstance(array, ExplicitlyIndexed):

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:100, in __getitem__()
     99 def __getitem__(self, key):
--> 100     return indexing.explicit_indexing_adapter(
    101         key, self.shape, indexing.IndexingSupport.OUTER, self._getitem
    102     )

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/core/indexing.py:858, in explicit_indexing_adapter()
    836 """Support explicit indexing by delegating to a raw indexing method.
    837
    838 Outer and/or vectorized indexers are supported by indexing a second time
   (...)
    855 Indexing result, in the form of a duck numpy-array.
    856 """
    857 raw_key, numpy_indices = decompose_indexer(key, shape, indexing_support)
--> 858 result = raw_indexing_method(raw_key.tuple)
    859 if numpy_indices.tuple:
    860     # index the loaded np.ndarray
    861     result = NumpyIndexingAdapter(result)[numpy_indices]

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:112, in _getitem()
    110 try:
    111     with self.datastore.lock:
--> 112         original_array = self.get_array(needs_lock=False)
    113         array = getitem(original_array, key)
    114 except IndexError:
    115     # Catch IndexError in netCDF4 and return a more informative
    116     # error message. This is most often called when an unsorted
    117     # indexer is used before the data is loaded from disk.

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:91, in get_array()
     90 def get_array(self, needs_lock=True):
---> 91     ds = self.datastore._acquire(needs_lock)
     92     variable = ds.variables[self.variable_name]
     93     variable.set_auto_maskandscale(False)

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/backends/netCDF4_.py:403, in _acquire()
    402 def _acquire(self, needs_lock=True):
--> 403     with self._manager.acquire_context(needs_lock) as root:
    404         ds = _nc4_require_group(root, self._group, self._mode)
    405     return ds

File /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.10.2/lib/python3.10/contextlib.py:135, in __enter__()
    133 del self.args, self.kwds, self.func
    134 try:
--> 135     return next(self.gen)
    136 except StopIteration:
    137     raise RuntimeError("generator didn't yield") from None

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/backends/file_manager.py:199, in acquire_context()
    196 @contextlib.contextmanager
    197 def acquire_context(self, needs_lock=True):
    198     """Context manager for acquiring a file."""
--> 199     file, cached = self._acquire_with_cache_info(needs_lock)
    200     try:
    201         yield file

File ~/virtual-envs/meshflow/lib/python3.10/site-packages/xarray/backends/file_manager.py:217, in _acquire_with_cache_info()
    215     kwargs = kwargs.copy()
    216     kwargs["mode"] = self._mode
--> 217 file = self._opener(*self._args, **kwargs)
    218 if self._mode == "w":
    219     # ensure file doesn't get overridden when opened again
    220     self._mode = "a"

File src/netCDF4/_netCDF4.pyx:2487, in netCDF4._netCDF4.Dataset.__init__()

File src/netCDF4/_netCDF4.pyx:1928, in netCDF4._netCDF4._get_vars()

File src/netCDF4/_netCDF4.pyx:2029, in netCDF4._netCDF4._ensure_nc_success()

RuntimeError: NetCDF: HDF error
```

The header of individual NetCDF ones are also in the following:

Individual NetCDF header

```console
$ ncdump -h ab_models_remapped_1980-04-20-13-00-00.nc
netcdf ab_models_remapped_1980-04-20-13-00-00 {
dimensions:
        COMID = 14980 ;
        time = UNLIMITED ; // (24 currently)
variables:
        int time(time) ;
                time:long_name = "time" ;
                time:units = "hours since 1980-04-20 12:00:00" ;
                time:calendar = "gregorian" ;
                time:standard_name = "time" ;
                time:axis = "T" ;
        double latitude(COMID) ;
                latitude:long_name = "latitude" ;
                latitude:units = "degrees_north" ;
                latitude:standard_name = "latitude" ;
        double longitude(COMID) ;
                longitude:long_name = "longitude" ;
                longitude:units = "degrees_east" ;
                longitude:standard_name = "longitude" ;
        double COMID(COMID) ;
                COMID:long_name = "shape ID" ;
                COMID:units = "1" ;
        double RDRS_v2.1_P_P0_SFC(time, COMID) ;
                RDRS_v2.1_P_P0_SFC:_FillValue = -9999. ;
                RDRS_v2.1_P_P0_SFC:long_name = "Forecast: Surface pressure" ;
                RDRS_v2.1_P_P0_SFC:units = "mb" ;
        double RDRS_v2.1_P_HU_1.5m(time, COMID) ;
                RDRS_v2.1_P_HU_1.5m:_FillValue = -9999. ;
                RDRS_v2.1_P_HU_1.5m:long_name = "Forecast: Specific humidity" ;
                RDRS_v2.1_P_HU_1.5m:units = "kg kg**-1" ;
        double RDRS_v2.1_P_TT_1.5m(time, COMID) ;
                RDRS_v2.1_P_TT_1.5m:_FillValue = -9999. ;
                RDRS_v2.1_P_TT_1.5m:long_name = "Forecast: Air temperature" ;
                RDRS_v2.1_P_TT_1.5m:units = "deg_C" ;
        double RDRS_v2.1_P_UVC_10m(time, COMID) ;
                RDRS_v2.1_P_UVC_10m:_FillValue = -9999. ;
                RDRS_v2.1_P_UVC_10m:long_name = "Forecast: Wind Modulus (derived using UU and VV)" ;
                RDRS_v2.1_P_UVC_10m:units = "kts" ;
        double RDRS_v2.1_A_PR0_SFC(time, COMID) ;
                RDRS_v2.1_A_PR0_SFC:_FillValue = -9999. ;
                RDRS_v2.1_A_PR0_SFC:long_name = "Analysis: Quantity of precipitation" ;
                RDRS_v2.1_A_PR0_SFC:units = "m" ;
        double RDRS_v2.1_P_FB_SFC(time, COMID) ;
                RDRS_v2.1_P_FB_SFC:_FillValue = -9999. ;
                RDRS_v2.1_P_FB_SFC:long_name = "Forecast: Downward solar flux" ;
                RDRS_v2.1_P_FB_SFC:units = "W m**-2" ;
        double RDRS_v2.1_P_FI_SFC(time, COMID) ;
                RDRS_v2.1_P_FI_SFC:_FillValue = -9999. ;
                RDRS_v2.1_P_FI_SFC:long_name = "Forecast: Surface incoming infrared flux" ;
                RDRS_v2.1_P_FI_SFC:units = "W m**-2" ;
```

I am running xarray and Dask on an HPC, so the "modules" I have loaded are the following:

```console
module list

Currently Loaded Modules:
  1) CCconfig
  2) gentoo/2020 (S)
  3) gcccore/.9.3.0 (H)
  4) imkl/2020.1.217 (math)
  5) intel/2020.1.217 (t)
  6) ucx/1.8.0
  7) libfabric/1.10.1
  8) openmpi/4.0.3 (m)
  9) StdEnv/2020 (S)
 10) mii/1.1.2
 11) netcdf-mpi/4.9.0 (io)
 12) hdf5-mpi/1.12.1 (io)
 13) libffi/3.3
 14) python/3.10.2 (t)
 15) mpi4py/3.1.3 (t)
 16) freexl/1.0.5 (t)
 17) geos/3.10.2 (geo)
 18) librttopo-proj9/1.1.0
 19) proj/9.0.1 (geo)
 20) libspatialite-proj901/5.0.1
 21) scipy-stack/2023a (math)
 22) libspatialindex/1.8.5 (phys)
 23) ipykernel/2023a
 24) sqlite/3.38.5
```

Any suggestion is greatly appreciated!
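In case it is useful context, a workaround sketch I have been experimenting with, assuming the crash comes from concurrent access to the HDF5 files under the distributed scheduler, is to force a single-threaded compute for the write:

```python
import dask

# Sketch: serialize the read/write through dask's synchronous scheduler so
# only one task touches the HDF5 library at a time (slower, but sidesteps
# the concurrent access this error may come from).
with dask.config.set(scheduler="synchronous"):
    ds.to_netcdf('./out2.nc')
```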

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8207/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
2019789753 I_kwDOAMm_X854Y4u5 8499 'drop_duplicates' behaves differently when using 1 vs many coordinates for an index jbweston 6654709 open 0     4 2023-12-01T00:36:42Z 2023-12-01T09:55:39Z   NONE      

What happened?

I am trying to drop_duplicates from a DataArray based on the values of some of the coordinates, starting from a DataArray with coordinates, but no indexes.

To accomplish this, I call 'DataArray.set_xindex' with the appropriate coordinate names, and then call 'drop_duplicates' on the resulting DataArray, like so:

```python
from xarray import DataArray
import numpy as np

test_array = DataArray(
    np.random.rand(5),
    coords=dict(x=("sample", [1, 2, 1, 2, 1]), y=("sample", [-1] * 5)),
    dims="sample",
)

# output DataArray's 'sample' dimension has length 2, as expected
good = test_array.set_xindex(["x", "y"]).drop_duplicates("sample")
assert len(good) == 2
```

The above functions as expected; 'good' has had its duplicates dropped, and we are left with a DataArray of length 2.

However, the following does not function as I would expect:

```python
# All the 'y's are '-1', so we expect the same duplicates as before to be dropped,
# even if we don't include the 'y' values in the index.
bad = test_array.set_xindex("x").drop_duplicates("sample")

# But this assert fails! 'drop_duplicates' does not drop anything
assert not bad.equals(test_array)
```

What did you expect to happen?

I expected drop_duplicates to drop the duplicates when I was using only a single coordinate for the index.

Minimal Complete Verifiable Example

```Python
from xarray import DataArray
import numpy as np

test_array = DataArray(
    range(5),
    coords=dict(x=("sample", [1, 2, 1, 2, 1]), y=("sample", [-1] * 5)),
    dims="sample",
)

# output DataArray's 'sample' dimension has length 2, as expected
good = test_array.set_xindex(["x", "y"]).drop_duplicates("sample")

# And indeed there are only 2 elements left after dropping duplicates.
assert len(good) == 2

# All the 'y's are '-1', so we expect the same duplicates as before to be dropped,
bad = test_array.drop_vars("y").set_xindex("x").drop_duplicates("sample")

# But this assert fails! 'drop_duplicates' does not drop anything
assert not bad.equals(test_array.drop_vars("y"))
```
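In the meantime, a workaround sketch that deduplicates on a single coordinate without going through set_xindex (assuming keeping the first occurrence of each x value is acceptable):

```python
import numpy as np

# Sketch: keep the first occurrence of each 'x' value via np.unique.
_, first_idx = np.unique(test_array.x.values, return_index=True)
workaround = test_array.isel(sample=np.sort(first_idx))
assert len(workaround) == 2
```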

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.11.5 | packaged by conda-forge | (main, Aug 27 2023, 03:34:09) [GCC 12.3.0] python-bits: 64 OS: Linux OS-release: 5.15.133.1-microsoft-standard-WSL2 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.2 libnetcdf: 4.9.1 xarray: 2023.11.0 pandas: 2.1.0 numpy: 1.24.4 scipy: 1.11.2 netCDF4: 1.6.3 pydap: None h5netcdf: 1.2.0 h5py: 3.8.0 Nio: None zarr: None cftime: 1.6.2 nc_time_axis: None iris: None bottleneck: None dask: 2023.9.1 distributed: 2023.9.1 matplotlib: 3.7.2 cartopy: None seaborn: 0.12.2 numbagg: None fsspec: 2023.9.0 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 68.1.2 pip: 23.2.1 conda: 23.7.3 pytest: 7.4.2 mypy: None IPython: 8.15.0 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8499/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1983891070 I_kwDOAMm_X852P8Z- 8427 Ambiguous behavior with coordinates when appending to Zarr store with append_dim rabernat 1197350 closed 0     4 2023-11-08T15:40:19Z 2023-12-01T03:58:56Z 2023-12-01T03:58:55Z MEMBER      

What happened?

There are two quite different scenarios covered by "append" with Zarr

  • Adding new variables to a dataset
  • Extending arrays along a dimensions (via append_dim)

This issue is about what should happen when using append_dim with variables that do not contain append_dim.

Here's the current behavior.

```python
import numpy as np
import xarray as xr
import zarr

ds1 = xr.DataArray(
    np.array([1, 2, 3]).reshape(3, 1, 1),
    dims=('time', 'y', 'x'),
    coords={'x': [1], 'y': [2]},
    name="foo",
).to_dataset()

ds2 = xr.DataArray(
    np.array([4, 5]).reshape(2, 1, 1),
    dims=('time', 'y', 'x'),
    coords={'x': [-1], 'y': [-2]},
    name="foo",
).to_dataset()

# how concat works: data are aligned
ds_concat = xr.concat([ds1, ds2], dim="time")
assert ds_concat.dims == {"time": 5, "y": 2, "x": 2}

# now do a Zarr append
store = zarr.storage.MemoryStore()
ds1.to_zarr(store, consolidated=False)

# we do not check that the coordinates are aligned--just that they have the same shape and dtype
ds2.to_zarr(store, append_dim="time", consolidated=False)
ds_append = xr.open_zarr(store, consolidated=False)

# coordinates data have been overwritten
assert ds_append.dims == {"time": 5, "y": 1, "x": 1}

# ...with the latest values
assert ds_append.x.data[0] == -1
```

Currently, we always write all data variables in this scenario. That includes overwriting the coordinates every time we append. That makes appending more expensive than it needs to be. I don't think that is the behavior most users want or expect.

What did you expect to happen?

There are a couple of different options we could consider for how to handle this "extending" situation (with append_dim)

  1. Do not attempt to align coordinates
     a. [current behavior] Overwrite coordinates with new data
     b. Keep original coordinates
     c. Force the user to explicitly drop the coordinates, as we do for region operations.
  2. Attempt to align coordinates
     a. Fail if coordinates don't match
     b. Extend the arrays to replicate the behavior of concat

We currently do 1a. I propose to switch to 1b. I think it is closer to what users want, and it requires less I/O.
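For reference, a sketch of how a user can approximate 1b today, assuming they know which coordinates should stay fixed:

```python
# Sketch: drop the unchanged coordinates before appending so the values
# already on disk are left untouched ('x' and 'y' here are assumed fixed).
ds2.drop_vars(["x", "y"]).to_zarr(store, append_dim="time", consolidated=False)
```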

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0] python-bits: 64 OS: Linux OS-release: 5.10.176-157.645.amzn2.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.14.2 libnetcdf: 4.9.2 xarray: 2023.10.1 pandas: 2.1.2 numpy: 1.24.4 scipy: 1.11.3 netCDF4: 1.6.5 pydap: installed h5netcdf: 1.2.0 h5py: 3.10.0 Nio: None zarr: 2.16.0 cftime: 1.6.2 nc_time_axis: 1.4.1 PseudoNetCDF: None iris: None bottleneck: 1.3.7 dask: 2023.10.1 distributed: 2023.10.1 matplotlib: 3.8.0 cartopy: 0.22.0 seaborn: 0.13.0 numbagg: 0.6.0 fsspec: 2023.10.0 cupy: None pint: 0.22 sparse: 0.14.0 flox: 0.8.1 numpy_groupies: 0.10.2 setuptools: 68.2.2 pip: 23.3.1 conda: None pytest: 7.4.3 mypy: None IPython: 8.16.1 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8427/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1044693438 I_kwDOAMm_X84-RMG- 5937 DataArray.dt.seconds returns incorrect value for negative `timedelta64[ns]` leifdenby 2405019 closed 0     4 2021-11-04T12:05:24Z 2023-11-10T00:39:17Z 2023-11-10T00:39:17Z CONTRIBUTOR      

What happened:

For a negative timedelta64[ns] of 42 nanoseconds, DataArray.dt.seconds returned a non-zero value (the returned value was 86399). When I pass in a positive 42-nanosecond timedelta64[ns], the TimedeltaAccessor correctly returns zero. I would have expected both assertions in the example below to pass, but the second fails. This seems to be a general issue with negative timedelta64[ns] values.

```bash
<xarray.DataArray 'seconds' (dim_0: 1)>
array([0])
Dimensions without coordinates: dim_0
<xarray.DataArray 'seconds' (dim_0: 1)>
array([86399])
Dimensions without coordinates: dim_0
Traceback (most recent call last):
  File "bug_dt_seconds.py", line 15, in <module>
    assert da.dt.seconds == 0
AssertionError
```

What you expected to happen:

```bash
<xarray.DataArray 'seconds' (dim_0: 1)>
array([0])
Dimensions without coordinates: dim_0
<xarray.DataArray 'seconds' (dim_0: 1)>
array([0])
Dimensions without coordinates: dim_0
```

Minimal Complete Verifiable Example:

```python
# coding: utf-8

import xarray as xr
import numpy as np

# number of nanoseconds
value = 42

da = xr.DataArray([np.timedelta64(value, "ns")])
print(da.dt.seconds)
assert da.dt.seconds == 0

da = xr.DataArray([np.timedelta64(-value, "ns")])
print(da.dt.seconds)
assert da.dt.seconds == 0
```

Anything else we need to know?:

I've narrowed this down to the call to pd.Series(values.ravel()) in xarray.core.accessor_dt._access_through_series:

```python
ipdb> pd.Series(values.ravel())
0   -1 days +23:59:59.999999958
dtype: timedelta64[ns]
```

I think the issue arises because pandas turns the numpy timedelta64 into a "minus one day plus a time". This actually does have a number of "seconds" in it, but the "total_seconds" has the expected value:

```python
ipdb> pd.Series(values.ravel()).dt.total_seconds()
0   -4.200000e-08
dtype: float64
```

Which would correctly round to zero.

I don't think the issue is in pandas, although the output from pandas is counter-intuitive:

```python
ipdb> pd.Series(values.ravel()).dt.seconds
0    86399
dtype: int64
```

Maybe we should handle this as a special case by taking the absolute value before passing the values to pandas (and then applying the original sign again afterwards)?
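A minimal sketch of that special-casing, with a hypothetical helper standing in for whatever the accessor would actually call:

```python
import numpy as np
import pandas as pd

def signed_seconds(values: np.ndarray) -> np.ndarray:
    # Hypothetical helper: route the magnitude through pandas and re-apply
    # the sign, so -42 ns yields 0 rather than 86399.
    flat = values.ravel()
    sign = np.sign(flat.astype("timedelta64[ns]").astype(np.int64))
    abs_seconds = pd.Series(np.abs(flat)).dt.seconds.to_numpy()
    return (sign * abs_seconds).reshape(values.shape)
```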

Environment:

Output of <tt>xr.show_versions()</tt> ``` INSTALLED VERSIONS ------------------ commit: None python: 3.7.7 (default, May 6 2020, 04:59:01) [Clang 4.0.1 (tags/RELEASE_401/final)] python-bits: 64 OS: Darwin OS-release: 19.6.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: en_GB.UTF-8 LANG: None LOCALE: ('en_GB', 'UTF-8') libhdf5: 1.10.4 libnetcdf: 4.6.2 xarray: 0.18.2 pandas: 1.3.4 numpy: 1.19.1 scipy: 1.5.0 netCDF4: 1.4.2 pydap: installed h5netcdf: None h5py: 2.9.0 Nio: None zarr: 2.10.1 cftime: 1.5.1.1 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: 1.3.2 dask: 2021.09.1 distributed: 2021.09.1 matplotlib: 3.2.2 cartopy: 0.18.0 seaborn: 0.10.1 numbagg: None fsspec: 2021.06.1 cupy: None pint: 0.18 sparse: None setuptools: 46.4.0.post20200518 pip: 21.1.2 conda: None pytest: 6.0.1 IPython: 7.16.1 sphinx: None ```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5937/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1981799811 I_kwDOAMm_X852H92D 8423 Support remote string paths for `h5netcdf` engine jrbourbeau 11656932 open 0     4 2023-11-07T16:52:18Z 2023-11-09T07:24:45Z   CONTRIBUTOR      

Is your feature request related to a problem?

Currently the h5netcdf engine supports opening remote files, but only as already open file-like objects (e.g. s3fs.open(...)), not as string paths like s3://.... There are situations where I'd like to use string paths instead of open file-like objects:

  • Opening files can sometimes be slow (xref https://github.com/fsspec/s3fs/issues/816)
  • When using parallel=True for opening lots of files, serializing open file-like objects back and forth from a remote cluster can be slow
  • Some systems (e.g. NASA Earthdata) only hand out credentials that are valid when run in the same region as the data. Being able to use parallel=True + storage_options would be convenient/performant in that case.

Describe the solution you'd like

It would be nice if I could do something like the following:

```python
ds = xr.open_mfdataset(
    files,  # A bunch of files like `s3://bucket/file`
    engine="h5netcdf",
    ...
    parallel=True,
    storage_options={...},  # fsspec-compatible options
)
```

and have my files opened prior to handing off to h5netcdf. storage_options is already supported for Zarr, so hopefully extending to h5netcdf feels natural.
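For comparison, the workaround available today, a sketch that opens the files with fsspec first and hands the file-like objects to the engine (the bucket path and options are placeholders):

```python
import fsspec
import xarray as xr

# Sketch: open the remote files up front, then pass the resulting
# file-like objects to the h5netcdf engine.
open_files = fsspec.open_files("s3://bucket/*.nc", anon=False)
ds = xr.open_mfdataset(
    [of.open() for of in open_files],
    engine="h5netcdf",
)
```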

Describe alternatives you've considered

No response

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8423/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1975845455 I_kwDOAMm_X851xQJP 8410 Segmentation fault 139 (SIGSEGV) lucadix 39524075 closed 0     4 2023-11-03T10:14:03Z 2023-11-06T20:34:46Z 2023-11-06T20:34:45Z NONE      

What happened?

While opening a set of netCDF files in a for loop, using xr.open_dataset().load(), I get a segmentation error (nr. 139). Please see the code example below:

```
for region in region_list:
    # [some code to read data associated to each region...]

    region_pred = xr.open_dataset(io.BytesIO(data)).load()

    # [other code working on region_pred...]
```

The error is shown on Linux/Mac after running my Python code, whereas Windows seems to be masking it. I was able to catch it on Windows by launching my code as:

```
python3 my_code.py && echo ok || echo KO
```

In this way, KO gets printed and the segmentation fault is now noticeable. I managed to fix the issue by using a second variable (called reg_pred) in addition to region_pred:

```
for region in region_list:
    # [some code to read data associated to each region...]

    region_pred = xr.open_dataset(io.BytesIO(data))
    reg_pred = region_pred.load()

    # [other code working on reg_pred...]
```

What did you expect to happen?

I don't know if the behavior I described is something the developers intended. Personally, I think it is an issue, which is why I am reporting it. If it is not an issue, I would like a clarification in order to understand what I am missing. Thank you in advance.

Minimal Complete Verifiable Example

```Python
for region in region_list:
    with storage_client.open(region, "rb") as f:
        data = f.read()
    region_pred = xr.open_dataset(io.BytesIO(data)).load()

    # some code working on region_pred to compute weather indices...
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [ ] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)] python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 141 Stepping 1, GenuineIntel byteorder: little LC_ALL: None LANG: it_IT.UTF-8 LOCALE: ('Italian_Italy', '1252') libhdf5: 1.14.0 libnetcdf: 4.9.2 xarray: 2023.8.0 pandas: 2.1.0 numpy: 1.26.0 scipy: 1.11.2 netCDF4: 1.6.4 pydap: None h5netcdf: 1.2.0 h5py: 3.9.0 Nio: None zarr: None cftime: 1.6.2 nc_time_axis: None PseudoNetCDF: None iris: None bottleneck: None dask: 2023.10.0 distributed: 2023.10.0 matplotlib: 3.8.0 cartopy: None seaborn: None numbagg: None fsspec: 2023.9.1 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 68.2.2 pip: 23.2.1 conda: None pytest: None mypy: None IPython: 8.15.0 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8410/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  not_planned xarray 13221727 issue
1977485456 I_kwDOAMm_X8513giQ 8413 Add a perception of a __xarray__ magic method swamidass 6273919 open 0     4 2023-11-04T19:55:14Z 2023-11-05T18:50:14Z   NONE      

Is your feature request related to a problem?

I am often moving data from external objects (of all sorts!) into xarray. This is a common use case.

Much of this code would be greatly simplified if there were a way of giving non-xarray classes a way of declaring to xarray how these objects can be marshaled into xarray objects.

Describe the solution you'd like

So here is an initial proposal for comment. Much of this could be implemented in a third party library. But doing this in xarray itself would likely be best.

Magic Methods

It would be great to see these magic method signatures become integrated throughout the library:

```python
__xarray__          -> xr.Dataset | xr.DataArray
__xarray_array__    -> xr.DataArray
__xarray_dataset__  -> xr.Dataset
__xarray_datatree__ -> xr.DataTree  # when DataTree is finally integrated into xarray
```

Conversion Registry

And these extension functions to register converters:

```python
def register_xarray_converter(cls, name: str, func: Callable[[cls, ...], xr.Dataset | xr.DataArray] | None):
    ...

def register_dataarray_converter(cls, name: str, func: Callable[[cls, ...], xr.DataArray] | None):
    ...

def register_dataset_converter(cls, name: str, func: Callable[[cls, ...], xr.Dataset] | None):
    ...

def register_datatree_converter(cls, name: str, func: Callable[[cls, ...], DataTree] | None):
    # when DataTree is finally integrated into xarray
    ...
```

Registering a converter should fail if cls implements a corresponding __xarray_*__ method or another converter is already registered for cls. Perhaps add an argument that specifies whether the converter should or should not be added if there is a clash. Perhaps these functions return the replaced converter so it can be added back in if needed?

Ideally, "deregister" versions (e.g. a deregister counterpart for each register function above) would also be available, so context managers that change marshaling behavior could easily be constructed.

User API

Along with the following new user API functions:

```python
def as_xarray(x, *args, **kwargs) -> xr.Dataset | xr.DataArray:
    ...

def as_dataarray(x, *args, **kwargs) -> xr.DataArray:
    ...

def as_dataset(x, *args, **kwargs) -> xr.Dataset:
    ...

def as_datatree(x, *args, **kwargs) -> DataTree:  # when DataTree is finally integrated into xarray
    ...
```

"as_xarray" returns (in order of precedence):

  • x unaltered if it is an xarray object
  • registered_xarray_converter(x, *args, **kwargs) if it is callable and does not throw an exception
  • registered_dataarray_converter(x, *args, **kwargs) if it is callable and does not throw an exception
  • registered_dataset_converter(x, *args, **kwargs) if it is callable and does not throw an exception
  • x.__xarray__(*args, **kwargs), if it exists, is callable, and does not throw an exception
  • x.__xarray_dataset__(*args, **kwargs), if it exists, is callable, and does not throw an exception
  • x.__xarray_dataarray__(*args, **kwargs), if it exists, is callable, and does not throw an exception
  • well known aliases of __xarray_dataarray__, such as x.to_xarray(*args, **kwargs) (see pandas)
  • [DESIGN DECISION] convert and return a tuple [dims, data, [attrs, encoding]] to a DataArray?
  • [DESIGN DECISION] convert and return a tuple encoding of a Dataset?
  • [DESIGN DECISION] return a duck-typed array wrapped in a DataArray?

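A rough sketch of that dispatch, with all names hypothetical and the registry details elided:

```python
import xarray as xr

_converters = {}  # hypothetical registry filled by register_xarray_converter

def as_xarray(x, *args, **kwargs):
    # 1. Pass xarray objects through unchanged.
    if isinstance(x, (xr.Dataset, xr.DataArray)):
        return x
    # 2. Prefer registered converters, so callers can override class behavior.
    for cls in type(x).__mro__:
        if cls in _converters:
            return _converters[cls](x, *args, **kwargs)
    # 3. Fall back to the magic methods (and well known aliases) on the object.
    for name in ("__xarray__", "__xarray_dataset__", "__xarray_dataarray__", "to_xarray"):
        method = getattr(x, name, None)
        if callable(method):
            return method(*args, **kwargs)
    raise TypeError(f"cannot convert {type(x).__name__} to an xarray object")
```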
The rationale for putting the registered functions first is that this would enable callers to override an object's built-in conversion behavior without modifying the class.

"as_dataarray" would be similar, but it would only call x.__xarray_dataarray__ and well known aliases.

"as_dataset" would be similar, but it would only call x.__xarray_dataset__ and well known aliases, perhaps falling back to calling x.__xarray_dataarray__ and converting the return value to a dataset if it has a name attribute.

"as_datatree" would be similar, but it would only call x.__xarray_datatree__, perhaps falling back to calling x.__xarray_dataarray__ and wrapping the result in a single-node datatree. (Though of course at this point this method would probably be implemented by the DataTree package, not xarray.)

The design decisions are flexible from my point of view, and might be decided in a way that makes the code base simplest or most usable. There is also a question of whether or not this method should default to the backup methods. These decisions can also be deferred entirely by delegating to the converter registry.

Across the Xarray Library

Finally, across the xarray library, there may be places where passing input arguments through as_xarray, as_dataarray, or as_dataset would make a lot of sense. This could be the final thing to do, but cannot be handled by a third party library.

Doing this would give third party libraries another pathway to integrate with xarray, one far easier than the converter registry or explicit calls to as_* functions.

Describe alternatives you've considered

This can be done with a private library. But that seems like a lot of code that would be pretty useful in other use cases.

Most of this (but not all) can be accomplished in a 3rd party library, but it wouldn't allow the seamless sort of integration with, for example, xarray's use of _repr_html_ to integrate with pandas.

The existing backend hooks work great when we are marshaling from file-based sources. See, for example, tiffslide-xarray (https://github.com/swamidasslab/tiffslide-xarray). This approach is seamless for reading files, but cannot marshal objects. For example, this is possible:

```python
x = xr.open_dataset("slide.tiff")
```

But this doesn't work.

```python
t = tiffslide.TiffSlide("slide.tiff")
x = xr.open_dataset(t)  # won't work
x = xr.DataArray(t)  # won't work either
```

This is an important use case because there are cases where we want to create an xarray like this from objects that are never stored on the filesystem.

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8413/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
887711474 MDU6SXNzdWU4ODc3MTE0NzQ= 5290 Inconclusive error messages using to_zarr with regions niowniow 5802846 closed 0     4 2021-05-11T15:54:39Z 2023-11-05T06:28:39Z 2023-11-05T06:28:39Z CONTRIBUTOR      

What happened: The idea is to use a xarray dataset (stored as dummy zarr file), which is subsequently filled with the region argument, as explained in the documentation. Ideally, almost nothing is stored to disk upfront.

It seems the current implementation is only designed to either store coordinates for the whole dataset and write them to disk or to write without coordinates. I failed to understand this from the documentation and tried to create a dataset without coordinates and fill it with a dataset subset with coordinates. It gave some inconclusive errors depending on the actual code example (see below):

```
ValueError: parameter 'value': expected array with shape (0,), got (10,)
```

or

```
ValueError: conflicting sizes for dimension 'x': length 10 on 'x' and length 30 on 'foo'
```

It might also be a bug and it should in fact be possible to add a dataset with coordinates to a dummy dataset without coordinates. Then there seems to be an issue regarding the handling of the variables during storing the region.

... or I might just have done it wrong... and I'm looking forward to suggestions.

What you expected to happen:

Either an error message telling me that that i should use coordinates during creation of the dummy dataset. Alternatively, if this is a bug and should be possible then it should just work.

Minimal Complete Verifiable Example:

```python
import dask.array
import xarray as xr
import numpy as np

error = 1  # choose between 0 (no error), 1, 2, 3

dummies = dask.array.zeros(30, chunks=10)

# chunks in coords are not taken into account while saving!?
coord_x = dask.array.zeros(30, chunks=10)  # or coord_x = np.zeros((30,))
if error == 0:
    ds = xr.Dataset({"foo": ("x", dummies)}, coords={"x": coord_x})
else:
    ds = xr.Dataset({"foo": ("x", dummies)})

print(ds)
path = "./tmp/test.zarr"
ds.to_zarr(path, mode='w', compute=False, consolidated=True)

# create a new dataset to be input into a region
ds = xr.Dataset({"foo": ('x', np.arange(10))}, coords={"x": np.arange(10)})

if error == 1:
    ds.to_zarr(path, region={"x": slice(10, 20)})
    # ValueError: parameter 'value': expected array with shape (0,), got (10,)
elif error == 2:
    ds.to_zarr(path, region={"x": slice(0, 10)})
    ds.to_zarr(path, region={"x": slice(10, 20)})
    # ValueError: conflicting sizes for dimension 'x': length 10 on 'x' and length 30 on 'foo'
elif error == 3:
    ds.to_zarr(path, region={"x": slice(0, 10)})
    ds = xr.Dataset({"foo": ('x', np.arange(10))}, coords={"x": np.arange(10)})
    ds.to_zarr(path, region={"x": slice(10, 20)})
    # ValueError: parameter 'value': expected array with shape (0,), got (10,)
else:
    ds.to_zarr(path, region={"x": slice(10, 20)})

ds = xr.open_zarr(path)
print('reopen', ds['x'])
```
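For reference, a sketch of the variant that does work for me, where the template gets real coordinates and the region writes carry only the data variable:

```python
# Sketch: write the coordinates once in the template, then never again.
ds = xr.Dataset({"foo": ("x", dummies)}, coords={"x": np.arange(30)})
ds.to_zarr(path, mode='w', compute=False, consolidated=True)

# The region write omits 'x', so the stored coordinate is not touched.
part = xr.Dataset({"foo": ("x", np.arange(10))})
part.to_zarr(path, region={"x": slice(10, 20)})
```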

Anything else we need to know?:

Environment:

Output of <tt>xr.show_versions()</tt> INSTALLED VERSIONS ------------------ commit: None python: 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 19:08:05) [GCC 7.5.0] python-bits: 64 OS: Linux OS-release: 4.19.0-16-amd64 machine: x86_64 processor: byteorder: little LC_ALL: None LANG: C.UTF-8 LOCALE: en_US.UTF-8 libhdf5: None libnetcdf: None xarray: 0.18.0 pandas: 1.2.3 numpy: 1.19.2 scipy: 1.6.2 netCDF4: None pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.8.1 cftime: 1.4.1 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2021.04.0 distributed: None matplotlib: 3.4.1 cartopy: None seaborn: None numbagg: None pint: None setuptools: 49.6.0.post20210108 pip: 21.0.1 conda: None pytest: None IPython: None sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5290/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
377356113 MDU6SXNzdWUzNzczNTYxMTM= 2542 full_like, ones_like, zeros_like should retain subclasses gerritholl 500246 closed 0     4 2018-11-05T11:22:49Z 2023-11-05T06:27:31Z 2023-11-05T06:27:31Z CONTRIBUTOR      

Code Sample

```python
# Your code here

import numpy
import xarray

class MyDataArray(xarray.DataArray):
    pass

da = MyDataArray(numpy.arange(5))
da2 = xarray.zeros_like(da)
print(type(da), type(da2))
```

Problem description

I would expect that type(da2) is type(da), but this is not the case. The type of da2 is always <class 'xarray.core.dataarray.DataArray'>. Rather, the output of this script is:

```
<class '__main__.MyDataArray'> <class 'xarray.core.dataarray.DataArray'>
```

Expected Output

I would hope for the following output:

```
<class '__main__.MyDataArray'> <class '__main__.MyDataArray'>
```

In principle changing this could break people's code, so if a change is implemented it should probably be through an optional keyword argument to the full_like/ones_like/zeros_like family.
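In the meantime, a user-side sketch of the behavior I am after, assuming the subclass keeps DataArray's constructor signature (da and MyDataArray are from the code sample above):

```python
import numpy

def zeros_like_keep_type(da):
    # Sketch: rebuild through the input's own class so the subclass survives.
    return type(da)(
        numpy.zeros_like(da.values), coords=da.coords, dims=da.dims, attrs=da.attrs
    )

da2 = zeros_like_keep_type(da)
assert type(da2) is MyDataArray
```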

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.7.0.final.0 python-bits: 64 OS: Linux OS-release: 2.6.32-754.el6.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_GB.UTF-8 LOCALE: en_GB.UTF-8 xarray: 0.10.7 pandas: 0.23.2 numpy: 1.15.2 scipy: 1.1.0 netCDF4: 1.4.0 h5netcdf: 0.6.1 h5py: 2.8.0 Nio: None zarr: None bottleneck: 1.2.1 cyordereddict: None dask: 0.18.1 distributed: 1.22.0 matplotlib: 3.0.0 cartopy: 0.16.0 seaborn: 0.9.0 setuptools: 39.2.0 pip: 18.0 conda: None pytest: 3.2.2 IPython: 6.4.0 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2542/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  not_planned xarray 13221727 issue
1966675016 I_kwDOAMm_X851ORRI 8388 Type annotation compatibility with numpy ufuncs djhoese 1828519 closed 0     4 2023-10-28T17:25:11Z 2023-11-02T12:44:50Z 2023-11-02T12:44:50Z CONTRIBUTOR      

Is your feature request related to a problem?

I'd like mypy to understand that xarray DataArrays passed to numpy ufuncs have a return type of xarray DataArray.

```python
import xarray as xr
import numpy as np

def compute_relative_azimuth(sat_azi: xr.DataArray, sun_azi: xr.DataArray) -> xr.DataArray:
    abs_diff = np.absolute(sun_azi - sat_azi)
    ssadiff = np.minimum(abs_diff, 360 - abs_diff)
    return ssadiff
```

```bash
$ mypy ./xarray_mypy.py
xarray_mypy.py:7: error: Incompatible return value type (got "ndarray[Any, dtype[Any]]", expected "DataArray")  [return-value]
Found 1 error in 1 file (checked 1 source file)
```

Describe the solution you'd like

I'm not sure if this is possible, if it is something xarray can fix, or something numpy needs to "fix". I'd like the above situation to "just work" without anything more than maybe some extra type-stub package.

Describe alternatives you've considered

Cast the types, use other type coercion, or tell mypy to ignore the type issues for these numpy calls.
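For example, a cast-based sketch of the first alternative:

```python
from typing import cast

import numpy as np
import xarray as xr

def compute_relative_azimuth(sat_azi: xr.DataArray, sun_azi: xr.DataArray) -> xr.DataArray:
    # Sketch: assert to mypy what we know holds at runtime for DataArray inputs.
    abs_diff = cast(xr.DataArray, np.absolute(sun_azi - sat_azi))
    return cast(xr.DataArray, np.minimum(abs_diff, 360 - abs_diff))
```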

Additional context

https://stackoverflow.com/questions/77369042/typing-when-passing-xarray-dataarray-objects-to-numpy-ufuncs

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8388/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1445905299 I_kwDOAMm_X85WLsOT 7282 groupby and mean on a MultiIndex level raises ValueError jjpr-mit 25231875 closed 0     4 2022-11-11T19:15:58Z 2023-10-30T09:18:54Z 2023-08-31T03:50:33Z NONE      

What happened?

After using set_index to create a MultiIndex, calling groupby on a MultiIndex level and then mean raises an error.

What did you expect to happen?

Apply mean to groups, no error.

Minimal Complete Verifiable Example

```Python
from xarray import DataArray

d = DataArray(
    data=[
        [0, 1, 2, 3, 4, 5, 6],
        [7, 8, 9, 10, 11, 12, 13],
        [14, 15, 16, 17, 18, 19, 20],
    ],
    coords={
        "greek": ("a", ['alpha', 'beta', 'gamma']),
        "colors": ("a", ['red', 'green', 'blue']),
        "compass": ("b", ['north', 'south', 'east', 'west', 'northeast', 'southeast', 'southwest']),
        "integer": ("b", [0, 1, 2, 3, 4, 5, 6]),
    },
    dims=("a", "b"),
)
d = d.set_index(a=['greek', 'colors'], b=['compass', 'integer'])
g = d.groupby('greek')
m = g.mean(...)
```
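A workaround sketch that avoids the error, grouping on the level before it is absorbed into a MultiIndex (assuming the index is not otherwise needed):

```python
# Sketch: detach the MultiIndex so 'greek' is a plain coordinate again,
# then group on it.
m = d.reset_index("a").groupby("greek").mean(...)
```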

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```Python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/xarray/core/_aggregations.py", line 5698, in mean
    return self.reduce(
  File "/usr/local/lib/python3.10/site-packages/xarray/core/groupby.py", line 1201, in reduce
    return self.map(reduce_array, shortcut=shortcut)
  File "/usr/local/lib/python3.10/site-packages/xarray/core/groupby.py", line 1104, in map
    return self._combine(applied, shortcut=shortcut)
  File "/usr/local/lib/python3.10/site-packages/xarray/core/groupby.py", line 1136, in _combine
    index, index_vars = create_default_index_implicit(coord)
  File "/usr/local/lib/python3.10/site-packages/xarray/core/indexes.py", line 1045, in create_default_index_implicit
    index = PandasMultiIndex(array, name)
  File "/usr/local/lib/python3.10/site-packages/xarray/core/indexes.py", line 615, in __init__
    raise ValueError(
ValueError: conflicting multi-index level name 'greek' with dimension 'greek'
```

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.10.7 (main, Sep 13 2022, 14:31:33) [GCC 10.2.1 20210110] python-bits: 64 OS: Linux OS-release: 5.15.49-linuxkit machine: x86_64 processor: byteorder: little LC_ALL: None LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: None libnetcdf: None xarray: 2022.11.0 pandas: 1.5.1 numpy: 1.23.4 scipy: None netCDF4: None pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: None nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: None distributed: None matplotlib: None cartopy: None seaborn: None numbagg: None fsspec: None cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 63.2.0 pip: 22.2.2 conda: None pytest: None IPython: None sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7282/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1953059418 I_kwDOAMm_X850aVJa 8345 `.stack` produces large chunks yt87 40218891 closed 0     4 2023-10-19T21:09:56Z 2023-10-26T21:20:05Z 2023-10-26T21:20:05Z NONE      

What happened?

Xarray stack does not chunk along the last coordinate, producing huge chunks, as described in #5754. Dask, seeing code like this:

```python
da2 = da.stack(new=("z", "t")).groupby("new").map(sum).unstack("new")
```

produces a warning and a suggestion to use a context manager:

```python
with dask.config.set(**{"array.slicing.split_large_chunks": True}):
    da2 = da.stack(new=("z", "t")).groupby("new").map(sum).unstack("new")
```

This fails with the message IndexError: tuple index out of range.

What did you expect to happen?

I expect this to work. #5754 is closed.

Minimal Complete Verifiable Example

```Python
import dask.array
import numpy as np

import xarray as xr

var = xr.Variable(
    ("t", "z", "u", "x", "y"),
    dask.array.random.random((1200, 4, 2, 1000, 100), chunks=(1, 1, -1, -1, -1)),
)
da = xr.DataArray(var)

def sum(ds):
    return ds.sum(dim="u")

with dask.config.set(**{"array.slicing.split_large_chunks": True}):
    da2 = da.stack(new=("z", "t")).groupby("new").map(sum).unstack("new")
da2
```
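In the meantime, a workaround sketch that sidesteps the heuristic by rechunking the stacked dimension explicitly (the chunk size of 100 is arbitrary):

```python
# Sketch: choose the chunking of the stacked dimension by hand instead of
# relying on dask's split-large-chunks setting.
stacked = da.stack(new=("z", "t")).chunk({"new": 100})
da2 = stacked.groupby("new").map(sum).unstack("new")
```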

MVCE confirmation

  • [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [ ] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

```Python

IndexError Traceback (most recent call last) Cell In[21], line 5 2 return ds.sum(dim="u") 4 with dask.config.set(**{"array.slicing.split_large_chunks": True}): ----> 5 da2 = da.stack(new=("z", "t")).groupby("new").map(sum).unstack("new") 6 da2

File ~/mambaforge/envs/icec/lib/python3.11/site-packages/xarray/core/dataarray.py:2855, in DataArray.unstack(self, dim, fill_value, sparse) 2795 def unstack( 2796 self, 2797 dim: Dims = None, 2798 fill_value: Any = dtypes.NA, 2799 sparse: bool = False, 2800 ) -> Self: 2801 """ 2802 Unstack existing dimensions corresponding to MultiIndexes into 2803 multiple new dimensions. (...) 2853 DataArray.stack 2854 """ -> 2855 ds = self._to_temp_dataset().unstack(dim, fill_value, sparse) 2856 return self._from_temp_dataset(ds)

File ~/mambaforge/envs/icec/lib/python3.11/site-packages/xarray/core/dataset.py:5500, in Dataset.unstack(self, dim, fill_value, sparse) 5498 for d in dims: 5499 if needs_full_reindex: -> 5500 result = result._unstack_full_reindex( 5501 d, stacked_indexes[d], fill_value, sparse 5502 ) 5503 else: 5504 result = result._unstack_once(d, stacked_indexes[d], fill_value, sparse)

File ~/mambaforge/envs/icec/lib/python3.11/site-packages/xarray/core/dataset.py:5395, in Dataset._unstack_full_reindex(self, dim, index_and_vars, fill_value, sparse) 5393 if name not in index_vars: 5394 if dim in var.dims: -> 5395 variables[name] = var.unstack({dim: new_dim_sizes}) 5396 else: 5397 variables[name] = var

File ~/mambaforge/envs/icec/lib/python3.11/site-packages/xarray/core/variable.py:1930, in Variable.unstack(self, dimensions, **dimensions_kwargs) 1928 result = self 1929 for old_dim, dims in dimensions.items(): -> 1930 result = result._unstack_once_full(dims, old_dim) 1931 return result

File ~/mambaforge/envs/icec/lib/python3.11/site-packages/xarray/core/variable.py:1820, in Variable._unstack_once_full(self, dims, old_dim) 1817 reordered = self.transpose(*dim_order) 1819 new_shape = reordered.shape[: len(other_dims)] + new_dim_sizes -> 1820 new_data = reordered.data.reshape(new_shape) 1821 new_dims = reordered.dims[: len(other_dims)] + new_dim_names 1823 return type(self)( 1824 new_dims, new_data, self._attrs, self._encoding, fastpath=True 1825 )

File ~/mambaforge/envs/icec/lib/python3.11/site-packages/dask/array/core.py:2219, in Array.reshape(self, merge_chunks, limit, *shape) 2217 if len(shape) == 1 and not isinstance(shape[0], Number): 2218 shape = shape[0] -> 2219 return reshape(self, shape, merge_chunks=merge_chunks, limit=limit)

File ~/mambaforge/envs/icec/lib/python3.11/site-packages/dask/array/reshape.py:285, in reshape(x, shape, merge_chunks, limit) 283 else: 284 chunk_plan.append("auto") --> 285 outchunks = normalize_chunks( 286 chunk_plan, 287 shape=shape, 288 limit=limit, 289 dtype=x.dtype, 290 previous_chunks=inchunks, 291 ) 293 x2 = x.rechunk(inchunks) 295 # Construct graph

File ~/mambaforge/envs/icec/lib/python3.11/site-packages/dask/array/core.py:3095, in normalize_chunks(chunks, shape, limit, dtype, previous_chunks) 3092 chunks = tuple("auto" if isinstance(c, str) and c != "auto" else c for c in chunks) 3094 if any(c == "auto" for c in chunks): -> 3095 chunks = auto_chunks(chunks, shape, limit, dtype, previous_chunks) 3097 if shape is not None: 3098 chunks = tuple(c if c not in {None, -1} else s for c, s in zip(chunks, shape))

File ~/mambaforge/envs/icec/lib/python3.11/site-packages/dask/array/core.py:3218, in auto_chunks(chunks, shape, limit, dtype, previous_chunks) 3212 largest_block = math.prod( 3213 cs if isinstance(cs, Number) else max(cs) for cs in chunks if cs != "auto" 3214 ) 3216 if previous_chunks: 3217 # Base ideal ratio on the median chunk size of the previous chunks -> 3218 result = {a: np.median(previous_chunks[a]) for a in autos} 3220 ideal_shape = [] 3221 for i, s in enumerate(shape):

File ~/mambaforge/envs/icec/lib/python3.11/site-packages/dask/array/core.py:3218, in <dictcomp>(.0) 3212 largest_block = math.prod( 3213 cs if isinstance(cs, Number) else max(cs) for cs in chunks if cs != "auto" 3214 ) 3216 if previous_chunks: 3217 # Base ideal ratio on the median chunk size of the previous chunks -> 3218 result = {a: np.median(previous_chunks[a]) for a in autos} 3220 ideal_shape = [] 3221 for i, s in enumerate(shape):

IndexError: tuple index out of range ```

Anything else we need to know?

The most recent traceback entries point to an issue in dask code.

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0] python-bits: 64 OS: Linux OS-release: 6.5.5-1-MANJARO machine: x86_64 processor: byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.14.2 libnetcdf: 4.9.2 xarray: 2023.9.0 pandas: 2.1.1 numpy: 1.24.4 scipy: 1.11.3 netCDF4: 1.6.4 pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.16.1 cftime: 1.6.2 nc_time_axis: None PseudoNetCDF: None iris: None bottleneck: 1.3.7 dask: 2023.9.3 distributed: 2023.9.3 matplotlib: 3.8.0 cartopy: 0.22.0 seaborn: None numbagg: None fsspec: 2023.9.2 cupy: None pint: None sparse: 0.14.0 flox: 0.7.2 numpy_groupies: 0.10.2 setuptools: 68.2.2 pip: 23.2.1 conda: None pytest: None mypy: None IPython: 8.16.1 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8345/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1923431725 I_kwDOAMm_X85ypT0t 8264 Improve error messages max-sixty 5635139 open 0     4 2023-10-03T06:42:57Z 2023-10-24T18:40:04Z   MEMBER      

Is your feature request related to a problem?

Coming back to xarray, and using it based on what I remember from a year ago or so, means I make lots of mistakes. I've also been using it outside of a repl, where error messages are more important, given I can't explore a dataset inline.

Some of the error messages could be much more helpful. Take one example:

xarray.core.merge.MergeError: conflicting values for variable 'date' on objects to be combined. You can skip this check by specifying compat='override'.

The second sentence is nice. But the first could be give us much more information: - Which variables conflict? I'm merging four objects, so would be so helpful to know which are causing the issue. - What is the conflict? Is one a superset and I can join=...? Are they off by 1 or are they completely different types? - Our testing.assert_equal produces pretty nice errors, as a comparison

Having these good is really useful, lets folks stay in the flow while they're working, and it signals that we're a well-built, refined library.

Describe the solution you'd like

I'm not sure the best way to surface the issues — error messages make for less legible contributions than features or bug fixes, and the primary audience for good error messages is often the opposite of those actively developing the library. They're also more difficult to manage as GH issues — there could be scores of marginal issues which would often be out of date.

One thing we do in PRQL is have a file that snapshots error messages test_bad_error_messages.rs, which can then be a nice contribution to change those from bad to good. I'm not sure whether that would work here (python doesn't seem to have a great snapshotter, pytest-regtest is the best I've found; I wrote pytest-accept but requires doctests).

Any other ideas?

Describe alternatives you've considered

No response

Additional context

A couple of specific error-message issues: - https://github.com/pydata/xarray/issues/2078 - https://github.com/pydata/xarray/issues/5290

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8264/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
529644880 MDU6SXNzdWU1Mjk2NDQ4ODA= 3580 xr.DataArray.values fails with latest versions of netcdf4 kpegion 16332933 closed 0     4 2019-11-28T01:26:07Z 2023-10-18T17:01:17Z 2023-10-18T17:01:17Z NONE      

MCVE Code Sample

```python import xarray as xr xr.show_versions()

url = 'http://iridl.ldeo.columbia.edu/SOURCES/.Models/.NMME/NCEP-CFSv2/.HINDCAST/.MONTHLY/.sst/dods' fullda = xr.open_dataset(url, decode_times=False,chunks={'S': 'auto', 'L': 'auto', 'M':'auto','X':'auto','Y':'auto'}) print(fullda) print(fullda['sst'][:10,0,0,0,0].values)

```

Expected Output

python <xarray.Dataset> Dimensions: (L: 10, M: 24, S: 348, X: 360, Y: 181) Coordinates: * X (X) float32 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0 * L (L) float32 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 * S (S) float32 264.0 265.0 266.0 267.0 ... 608.0 609.0 610.0 611.0 * M (M) float32 1.0 2.0 3.0 4.0 5.0 6.0 ... 20.0 21.0 22.0 23.0 24.0 * Y (Y) float32 -90.0 -89.0 -88.0 -87.0 -86.0 ... 87.0 88.0 89.0 90.0 Data variables: sst (S, L, M, Y, X) float32 dask.array<chunksize=(29, 10, 24, 51, 45), meta=np.ndarray> Attributes: Conventions: IRIDL [-25.652588 -35.577393 -48.702896 -51.3853 -50.687195 -50.341995 -50.407593 -54.955994 -52.052994 -47.31279 ]

Problem Description

This should return the array’s data as a numpy.ndarray according to the documentation and as shown above. I tested this with various versions of netcdf4 and I get the error below for netcdf4 versions 1.5.1, 1.5.1.2, 1.5.3 (latest version). If I use netcdf4 version 1.5.1, I get the expected output as above.

``` python <xarray.Dataset> Dimensions: (L: 10, M: 24, S: 348, X: 360, Y: 181) Coordinates: * X (X) float32 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0 * L (L) float32 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 * S (S) float32 264.0 265.0 266.0 267.0 ... 608.0 609.0 610.0 611.0 * M (M) float32 1.0 2.0 3.0 4.0 5.0 6.0 ... 20.0 21.0 22.0 23.0 24.0 * Y (Y) float32 -90.0 -89.0 -88.0 -87.0 -86.0 ... 87.0 88.0 89.0 90.0 Data variables: sst (S, L, M, Y, X) float32 dask.array<chunksize=(29, 10, 24, 51, 45), meta=np.ndarray> Attributes: Conventions: IRIDL Traceback (most recent call last): File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/xarray/backends/netCDF4_.py", line 84, in _getitem array = getitem(original_array, key) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/xarray/backends/common.py", line 54, in robust_getitem return array[key] File "netCDF4/_netCDF4.pyx", line 4408, in netCDF4._netCDF4.Variable.getitem File "netCDF4/_netCDF4.pyx", line 5350, in netCDF4._netCDF4.Variable._get IndexError: index exceeds dimension bounds

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "testpython.py", line 7, in <module> print(fullda['sst'][:10,0,0,0,0].values) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/xarray/core/dataarray.py", line 567, in values return self.variable.values File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/xarray/core/variable.py", line 448, in values return as_array_or_item(self._data) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/xarray/core/variable.py", line 254, in _as_array_or_item data = np.asarray(data) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/numpy/core/_asarray.py", line 85, in asarray return array(a, dtype, copy=False, order=order) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/dask/array/core.py", line 1314, in __array__ x = self.compute() File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/dask/base.py", line 165, in compute (result,) = compute(self, traverse=False, kwargs) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/dask/base.py", line 436, in compute results = schedule(dsk, keys, kwargs) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/dask/threaded.py", line 81, in get *kwargs File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/dask/local.py", line 486, in get_async raise_exception(exc, tb) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/dask/local.py", line 316, in reraise raise exc File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/dask/local.py", line 222, in execute_task result = _execute_task(task, data) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/dask/core.py", line 119, in _execute_task return func(args2) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/dask/array/core.py", line 106, in getter c = np.asarray(c) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/numpy/core/_asarray.py", line 85, in asarray return array(a, dtype, copy=False, order=order) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/xarray/core/indexing.py", line 481, in array return np.asarray(self.array, dtype=dtype) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/numpy/core/_asarray.py", line 85, in asarray return array(a, dtype, copy=False, order=order) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/xarray/core/indexing.py", line 643, in array return np.asarray(self.array, dtype=dtype) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/numpy/core/_asarray.py", line 85, in asarray return array(a, dtype, copy=False, order=order) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/xarray/core/indexing.py", line 547, in array return np.asarray(array[self.key], dtype=None) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/xarray/backends/netCDF4.py", line 72, in getitem key, self.shape, indexing.IndexingSupport.OUTER, self.getitem File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/xarray/core/indexing.py", line 827, in explicit_indexing_adapter result = raw_indexing_method(raw_key.tuple) File "/homes/kpegion/.conda/envs/testenv3-dev/lib/python3.6/site-packages/xarray/backends/netCDF4.py", line 94, in _getitem raise IndexError(msg) IndexError: The indexing operation you are attempting to 
perform is not valid on netCDF4.Variable object. Try loading your data into memory first by calling .load(). ```

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.7 | packaged by conda-forge | (default, Nov 6 2019, 16:19:42) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 3.10.0-1062.4.3.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.5 libnetcdf: 4.7.1 xarray: 0.14.1 pandas: 0.25.3 numpy: 1.17.3 scipy: None netCDF4: 1.5.3 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.0.4.2 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2.8.1 distributed: 2.8.1 matplotlib: None cartopy: None seaborn: None numbagg: None setuptools: 42.0.1.post20191125 pip: 19.3.1 conda: None pytest: None IPython: None sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3580/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1924497392 I_kwDOAMm_X85ytX_w 8269 open_dataset with engine='zarr' changed from '2023.8.0' to '2023.9.0' mps01060 6819509 closed 0     4 2023-10-03T16:19:54Z 2023-10-18T16:50:20Z 2023-10-18T16:50:20Z NONE      

What is your issue?

When moving from xarray version '2023.8.0' to '2023.9.0' the behavior of importing a zarr changed for me (code to create the example zarr is at the end of this post). When importing a variable with units "days accumulated", the values are scaled differently between the two versions. The latest version seems to automatically treat this as a time-like array (I think the -9.223372e+18 values seen are NaT-like?).

Open the zarr:

```python
import xarray as xr
ds = xr.open_dataset('debug.zarr', engine='zarr', chunks={})
```

Print as a pandas-like table for each version of xarray for readability:

```python
ds.to_dataframe()
```

Version '2023.8.0':

| time | dapr (dtype=float32) | mdpr (dtype=float32) |
| --- | --- | --- |
| 2000-01-01 | NaN | NaN |
| 2000-01-02 | NaN | NaN |
| 2000-01-03 | 2.0 | 1.5 |

Version '2023.9.0':

| time | dapr (dtype=float64) | mdpr (dtype=float32) |
| --- | --- | --- |
| 2000-01-01 | -9.223372e+18 | NaN |
| 2000-01-02 | -9.223372e+18 | NaN |
| 2000-01-03 | 2.000000e+00 | 1.5 |

I can manually disable this by using "use_cf=False" and "mask_and_scale=False" and then scaling the variable by hand, though that is not ideal. "decode_timedelta" doesn't seem to have an effect on this data either.
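For reference, a minimal sketch of that manual route (assuming the debug.zarr written by the code below, whose encoding uses a -32768 fill value and, for dapr, scale_factor=1.0 / add_offset=0.0):

```python
# Sketch: open without mask-and-scale decoding, then apply the known
# fill value and scaling for 'dapr' by hand.
import xarray as xr

raw = xr.open_dataset('debug.zarr', engine='zarr', chunks={}, mask_and_scale=False)
dapr = raw['dapr'].where(raw['dapr'] != -32768) * 1.0 + 0.0
```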

I understand the "days" keyword is in my units, however the full unit is "days accumulated". Has the behavior of xarray changed to find keywords such as "days" occurring anywhere in the units (eg. as a substring)? Do you have any other suggestions? Thank you for the help.

Code to create the debug.zarr for the tables above:

```python
import numpy as np
import pandas as pd
import xarray as xr
import zarr

# Create some multiday precipitation data (similar to
# https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily).
# mdpr is the amount of a multiday total (inches).
# dapr is the number of days each multiday total occurred over (days accumulated).
# In this example, 1.50 inches of rain fell over 2 days (2 observation periods), ending on 2000-01-03.
# I use float32 to represent these, but pack these as int16 values in the zarr.
mdpr = np.array([np.NaN, np.NaN, 1.50], dtype=np.float32)
dapr = np.array([np.NaN, np.NaN, 2.0], dtype=np.float32)
time = pd.date_range('2000-01-01', periods=3)

# Create a dataset from these values
ds = xr.Dataset(
    data_vars=dict(
        mdpr=(['time'], mdpr),
        dapr=(['time'], dapr),
    ),
    coords=dict(
        time=time,
    ),
    attrs=dict(description='multiday precipitation data'),
)

# Specify encoding to pack these float32 values as int16
encoding = {
    'mdpr': {
        'chunks': (3,),
        'compressor': zarr.Blosc(cname='zstd', clevel=3, shuffle=1),
        'filters': None,
        'missing_value': -32768,
        '_FillValue': -32768,
        'scale_factor': 0.01,
        'add_offset': 0.0,
        'dtype': np.int16,
    },
    'dapr': {
        'chunks': (3,),
        'compressor': zarr.Blosc(cname='zstd', clevel=3, shuffle=1),
        'filters': None,
        'missing_value': -32768,
        '_FillValue': -32768,
        'scale_factor': 1.0,
        'add_offset': 0.0,
        'dtype': np.int16,
    },
}

# Create attributes. The "units" for the dapr variable seems to be the issue:
# "days" in the "days accumulated"
ds.mdpr.attrs['units'] = 'inches'
ds.mdpr.attrs['description'] = 'multiday precip amount'

ds.dapr.attrs['units'] = 'days accumulated'
ds.dapr.attrs['description'] = 'number of days included in the multiday precipitation'

# Save to zarr
ds.to_zarr('debug.zarr', mode='w', encoding=encoding)
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8269/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1384226112 I_kwDOAMm_X85SgZ1A 7075 Convert xarray dataset to pandas dataframe is much slower in newest xarray version rilllydi 20794996 closed 0     4 2022-09-23T19:36:28Z 2023-10-14T20:37:40Z 2023-10-14T20:37:40Z NONE      

What is your issue?

Converting an xarray dataset to a pandas dataframe has become much slower in the newest xarray version.

I want to read in very large netcdf files, extract a slice, and convert the slice to a pandas dataframe. For an input size of 2 GB, xarray version 0.21.0 takes 3 seconds, whereas version 2022.6.0 takes 44 seconds. See the table below for more tests with increasing dataset size.

| Number of NetCDF input files in xarray dataset (~1 GB per file) | 2 | 5 | 10 | 15 | 20 | 30 | 40 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Older xarray version 0.21.0 | 0:03 | 0:02 | 0:04 | 0:06 | 0:09 | 0:13 | 0:17 |
| Newer xarray version 2022.6.0 | 0:44 | 1:30 | 2:46 | 4:01 | 5:23 | 7:56 | 10:29 |

Here is my code:

```python
# Read in a list of netcdf files and combine into a single dataset.
with xr.open_mfdataset(infile_list, combine='by_coords') as ds:

    # Extract the data for a single location (the nearest grid point) using the provided coordinates (lat/lon).
    ds_slice = ds.sel(lon=-84.725, lat=42.3583, method='nearest')

    # Convert xarray dataset to a pandas dataframe.
    # This is now the slow part since the xarray library was updated.
    df = ds_slice.to_dataframe()
```

The netcdf files I am reading in are about 1 GB each, containing daily weather data for the entire CONUS. There is 1 file per year, so if I read in 2 files, the dimensions are (lon: 1386, lat: 585, day: 731, crs: 1) with coordinates of lon, lat, day, and crs. They include 8 float data variables.
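One comparison worth making (a sketch, not a confirmed fix): pull the small slice into memory before converting, so that to_dataframe() operates on plain numpy arrays rather than lazy dask arrays.

```python
# Sketch: load the single-point slice eagerly, then convert.
ds_slice = ds.sel(lon=-84.725, lat=42.3583, method='nearest')
df = ds_slice.load().to_dataframe()
```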

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7075/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  not_planned xarray 13221727 issue
1943355490 I_kwDOAMm_X85z1UBi 8308 Different plotting reaults compared to matplotlib zxdawn 30388627 closed 0     4 2023-10-14T15:54:32Z 2023-10-14T20:02:16Z 2023-10-14T20:02:16Z NONE      

What happened?

I got different results when I tried to plot the 2D data in test.npy.zip using matplotlib and xarray.

matplotlib

xarray

What did you expect to happen?

Same plot.

Minimal Complete Verifiable Example

```Python
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt

test = np.load('test.npy')

plt.imshow(test, vmin=0, vmax=200)
plt.colorbar()

xr.DataArray(test).plot.imshow(vmin=0, vmax=200)
```
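One quick check (an assumption about the cause, not a confirmed diagnosis): matplotlib's imshow defaults to origin='upper', while xarray draws against ascending coordinate values, so the two renderings can differ simply in vertical orientation. Forcing the same orientation rules that out:

```python
# Sketch: render the xarray version with a downward y axis to match
# matplotlib's default origin='upper'.
xr.DataArray(test).plot.imshow(vmin=0, vmax=200, yincrease=False)
```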

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:41:52) [Clang 15.0.7 ] python-bits: 64 OS: Darwin OS-release: 22.3.0 machine: arm64 processor: arm byteorder: little LC_ALL: None LANG: None LOCALE: (None, 'UTF-8') libhdf5: None libnetcdf: None xarray: 2023.9.0 pandas: 2.1.1 numpy: 1.26.0 scipy: None netCDF4: None pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: None nc_time_axis: None PseudoNetCDF: None iris: None bottleneck: None dask: None distributed: None matplotlib: 3.8.0 cartopy: None seaborn: None numbagg: None fsspec: None cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 68.2.2 pip: 23.2.1 conda: None pytest: None mypy: None IPython: 8.16.1 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8308/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1821467933 I_kwDOAMm_X85skWUd 8021 Specify chunks in bytes mrocklin 306380 open 0     4 2023-07-26T02:29:43Z 2023-10-06T10:09:33Z   MEMBER      

Is your feature request related to a problem?

I'm playing around with xarray performance and would like a way to easily tweak chunk sizes. I'm able to do this by backing out what xarray chooses in an open_zarr call and then providing the right chunks= argument. I'll admit, though, that I wouldn't mind giving Xarray a value like "1 GiB" and having it use that when determining "auto" chunk sizes.

Dask array does this in two ways. We can provide a value in chunks like the following:

```python
x = da.random.random(..., chunks="1 GiB")
```

We can also refer to a value in the Dask config:

```python
In [1]: import dask

In [2]: dask.config.get("array.chunk-size")
Out[2]: '128MiB'
```

This is not very important (I'm unblocked) but I thought I'd mention it in case someone is looking for some fun work 🙂
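For anyone who wants this behavior today, a rough sketch of the manual route described above (target_bytes, the store path, and the variable name are illustrative, and this is not an xarray API):

```python
# Sketch: turn a byte budget into a chunk length along one dimension,
# then pass explicit chunks to xarray.
import xarray as xr

target_bytes = 1 << 30                   # "1 GiB"
ds = xr.open_zarr("store.zarr")          # hypothetical store
var = ds["foo"]                          # hypothetical variable
other = 1
for d in var.dims[1:]:
    other *= var.sizes[d]                # elements per step of the first dim
chunk_len = max(1, min(var.sizes[var.dims[0]],
                       target_bytes // (var.dtype.itemsize * other)))
ds = ds.chunk({var.dims[0]: chunk_len})
```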

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8021/reactions",
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1169750048 I_kwDOAMm_X85FuPgg 6360 Multidimensional `interpolate_na()` iuryt 5797727 open 0     4 2022-03-15T14:27:46Z 2023-09-28T11:51:20Z   NONE      

Is your feature request related to a problem?

I think that having a way to run a multidimensional interpolation for filling missing values would be awesome.

The code snippet below creates some data and shows the problem I am having now. If the data has some orientation, we can't simply interpolate each dimension separately.

```python
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt

n = 30
x = xr.DataArray(np.linspace(0, 2*np.pi, n), dims=['x'])
y = xr.DataArray(np.linspace(0, 2*np.pi, n), dims=['y'])
z = np.sin(x) * xr.ones_like(y)

mask = xr.DataArray(np.random.randint(0, 1+1, (n, n)).astype('bool'), dims=['x', 'y'])

kw = dict(add_colorbar=False)

fig, ax = plt.subplots(1, 3, figsize=(11, 3))
z.plot(ax=ax[0], **kw)
z.where(mask).plot(ax=ax[1], **kw)
z.where(mask).interpolate_na('x').plot(ax=ax[2], **kw)
```

I tried to use advanced interpolation for that, but it doesn't look like the best solution.

```python
zs = z.where(mask).stack(k=['x', 'y'])
zs = zs.where(np.isnan(zs), drop=True)
xi, yi = zs.k.x.drop('k'), zs.k.y.drop('k')
zi = z.interp(x=xi, y=yi)

fig, ax = plt.subplots()
z.where(mask).plot(ax=ax, **kw)
ax.scatter(xi, yi, c=zi, **kw, linewidth=1, edgecolor='k')
```
returns

Describe the solution you'd like

Simply z.interpolate_na(['x','y'])

Describe alternatives you've considered

I could extract the data to numpy and interpolate using scipy.interpolate.griddata, but this is not the way xarray should work.
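For completeness, a sketch of that scipy fallback (reusing x, y, z, and mask from the example above):

```python
# Sketch: fill the masked holes with scipy's unstructured interpolation,
# then wrap the result back into a DataArray.
from scipy.interpolate import griddata

zm = z.where(mask)
xx, yy = np.meshgrid(x.values, y.values, indexing='ij')
valid = ~np.isnan(zm.values)
filled = griddata((xx[valid], yy[valid]), zm.values[valid], (xx, yy), method='linear')
z_filled = zm.copy(data=filled)
```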

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6360/reactions",
    "total_count": 11,
    "+1": 9,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 2
}
    xarray 13221727 issue
1905824568 I_kwDOAMm_X85xmJM4 8221 Frequent doc build timeout / OOM max-sixty 5635139 open 0     4 2023-09-20T23:02:37Z 2023-09-21T03:50:07Z   MEMBER      

What is your issue?

I'm frequently seeing Command killed due to timeout or excessive memory consumption in the doc build.

It fails after 1552 seconds; since that's not a round number, it might be the memory rather than a timeout?

It follows writing output... [ 90%] generated/xarray.core.rolling.DatasetRolling.max, which I wouldn't have thought was a particularly memory-intensive part of the build?

Here's an example: https://readthedocs.org/projects/xray/builds/21983708/

Any thoughts on what might be going on?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8221/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1326238990 I_kwDOAMm_X85PDM0O 6870 `rolling_exp` loses coords max-sixty 5635139 closed 0     4 2022-08-02T18:27:44Z 2023-09-19T01:13:23Z 2023-09-19T01:13:23Z MEMBER      

What happened?

We lose the time coord here — Dimensions without coordinates: time:

```python
ds = xr.tutorial.load_dataset("air_temperature")
ds.rolling_exp(time=5).mean()

<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Dimensions without coordinates: time
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.4 296.1 295.7
```

(I realize I wrote this, I didn't think this used to happen, but either it always did or I didn't write good enough tests... mea culpa)

What did you expect to happen?

We keep the time coords, like we do for normal rolling:

```python
In [2]: ds.rolling(time=5).mean()
Out[2]:
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
```
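Until this is fixed, a possible workaround (a sketch, not the eventual fix) is to copy the coordinate back afterwards:

```python
# Sketch: restore the time coordinate that rolling_exp dropped.
out = ds.rolling_exp(time=5).mean().assign_coords(time=ds.time)
```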

Minimal Complete Verifiable Example

Python (as above)

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.9.13 (main, May 24 2022, 21:13:51) [Clang 13.1.6 (clang-1316.0.21.2)] python-bits: 64 OS: Darwin OS-release: 21.6.0 machine: arm64 processor: arm byteorder: little LC_ALL: en_US.UTF-8 LANG: None LOCALE: ('en_US', 'UTF-8') libhdf5: None libnetcdf: None xarray: 2022.6.0 pandas: 1.4.3 numpy: 1.21.6 scipy: 1.8.1 netCDF4: None pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.12.0 cftime: None nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2021.12.0 distributed: 2021.12.0 matplotlib: 3.5.1 cartopy: None seaborn: None numbagg: 0.2.1 fsspec: 2021.11.1 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 62.3.2 pip: 22.1.2 conda: None pytest: 7.1.2 IPython: 8.4.0 sphinx: 4.3.2
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6870/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
598991028 MDU6SXNzdWU1OTg5OTEwMjg= 3967 Support static type analysis eric-czech 6130352 closed 0     4 2020-04-13T16:34:43Z 2023-09-17T19:43:32Z 2023-09-17T19:43:31Z NONE      

As a related discussion to https://github.com/pydata/xarray/issues/3959, I wanted to see what possibilities exist for a user or API developer building on Xarray to enforce Dataset/DataArray structure through static analysis.

In my specific scenario, I would like to model several different types of data in my domain as Dataset objects, but I'd like to be able to enforce that the names and dtypes associated with both data variables and coordinates meet certain constraints.

@keewis mentioned an example of this in https://github.com/pydata/xarray/issues/3959#issuecomment-612076605 where it might be possible to use something like a TypedDict to constrain variable/coord names and array dtypes, but this won't work with TypedDict as it's currently implemented. Another possibility could be generics, and I took a stab at that in https://github.com/pydata/xarray/issues/3959#issuecomment-612513722 (though this would certainly be more intrusive).

An example of where this would be useful is in adding extensions through accessors:

```python
@xr.register_dataset_accessor('ext')
class ExtAccessor:
    def __init__(self, ds):
        self.ds = ds

    def is_zero(self):
        return self.ds['data'] == 0

ds = xr.Dataset(dict(DATA=xr.DataArray([0.0])))

# I'd like to catch that "data" was misspelled as "DATA" and that
# this particular method shouldn't be run against floats prior to runtime
ds.ext.is_zero()
```
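As a rough illustration of the kind of check being asked for (a hand-rolled sketch, not an xarray API): routing access through typed properties lets mypy catch a misspelled name at call sites, although the string lookup inside stays unchecked.

```python
# Sketch: a typed facade over a Dataset; a typo like `w.DATA` is a mypy
# error, while the runtime string lookup is only checked when executed.
import xarray as xr

class Wind:
    def __init__(self, ds: xr.Dataset) -> None:
        self._ds = ds

    @property
    def data(self) -> xr.DataArray:
        return self._ds['data']

w = Wind(xr.Dataset(dict(data=xr.DataArray([0.0]))))
w.data  # ok; `w.DATA` would be flagged statically
```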

I probably care more about this as someone looking to build an API on top of Xarray, but I imagine typical users would find a solution to this problem beneficial too.

There is a related conversation about doing something like this for Pandas DataFrames at https://github.com/python/typing/issues/28#issuecomment-351284520, so that might be helpful context for possibilities with TypedDict.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3967/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  not_planned xarray 13221727 issue
561921094 MDU6SXNzdWU1NjE5MjEwOTQ= 3762 xarray groupby/map fails to parallelize bjcosta 6491058 closed 1     4 2020-02-07T23:20:59Z 2023-09-15T15:52:42Z 2023-09-15T15:52:41Z NONE      

MCVE Code Sample

```python
import sys
import math
import logging
import dask
import dask.distributed
import xarray
import numpy

logger = logging.getLogger('main')

if __name__ == '__main__':
    logging.basicConfig(
        stream=sys.stdout,
        format='%(asctime)s %(levelname)-8s %(message)s',
        level=logging.INFO,
        datefmt='%Y-%m-%d %H:%M:%S')

logger.info('Starting dask client')
client = dask.distributed.Client()

SIZE = 100000
SONAR_BINS = 2000
time = range(0, SIZE)
upper_limit = numpy.random.randint(0, 10, (SIZE))
lower_limit = numpy.random.randint(20, 30, (SIZE))
sonar_data = numpy.random.randint(0, 255, (SIZE, SONAR_BINS))

channel = xarray.Dataset({
        'upper_limit': (['time'], upper_limit, {'units': 'depth meters'}),
        'lower_limit': (['time'],  lower_limit, {'units': 'depth meters'}),
        'data': (['time', 'depth_bin'], sonar_data, {'units': 'amplitude'}),
    },
    coords={
        'depth_bin': (['depth_bin'], range(0,SONAR_BINS)),
        'time': (['time'], time)
    })

logger.info('get overall min/max radar range we want to normalize to called the adjusted range')
adjusted_min, adjusted_max = channel.upper_limit.min().values.item(), channel.lower_limit.max().values.item()
adjusted_min = math.floor(adjusted_min)
adjusted_max = math.ceil(adjusted_max)
logger.info('adjusted_min: %s, adjusted_max: %s', adjusted_min, adjusted_max)

bin_count = len(channel.depth_bin)
logger.info('bin_count: %s', bin_count)

adjusted_depth_per_bin = (adjusted_max - adjusted_min) / bin_count
logger.info('adjusted_depth_per_bin: %s', adjusted_depth_per_bin)

adjusted_bin_depths = [adjusted_min + (j * adjusted_depth_per_bin) for j in range(0, bin_count)]
logger.info('adjusted_bin_depths[0]: %s ... [-1]: %s', adjusted_bin_depths[0], adjusted_bin_depths[-1])

def Interp(ds):
    # Ideally instead of using interp we will use some kind of downsampling and shift
    # this doesnt exist in xarray though and interp is good enough for the moment

    # I just added this to debug
    t = ds.time.values.item()
    if (t % 100) == 0:
        total = len(channel.time)
        perc = 100.0 * t / total
        logger.info('%s : %s of %s', perc, t, total)

    unadjusted_depth_amplitudes = ds.data
    unadjusted_min = ds.upper_limit.values.item()
    unadjusted_max = ds.lower_limit.values.item()
    unadjusted_depth_per_bin = (unadjusted_max - unadjusted_min) / bin_count

    index_mapping = [((adjusted_min + (bin * adjusted_depth_per_bin)) - unadjusted_min) / unadjusted_depth_per_bin for bin in range(0, bin_count)]
    adjusted_depth_amplitudes = unadjusted_depth_amplitudes.interp(coords={'depth_bin':index_mapping}, method='linear', assume_sorted=True)
    adjusted_depth_amplitudes = adjusted_depth_amplitudes.rename({'depth_bin':'depth'}).assign_coords({'depth':adjusted_bin_depths})

    #logger.info('%s, \n\tunadjusted_depth_amplitudes.values:%s\n\tunadjusted_min:%s\n\tunadjusted_max:%s\n\tunadjusted_depth_per_bin:%s\n\tindex_mapping:%s\n\tadjusted_depth_amplitudes:%s\n\tadjusted_depth_amplitudes.values:%s\n\n', ds, unadjusted_depth_amplitudes.values, unadjusted_min, unadjusted_max, unadjusted_depth_per_bin, index_mapping, adjusted_depth_amplitudes, adjusted_depth_amplitudes.values)
    return adjusted_depth_amplitudes

# Lets split into chunks so could be performed in parallel
# This doesnt work to parallelize and only slows it down a lot
#logger.info('chunk')
#channel = channel.chunk({'time':100})

logger.info('groupby')
g = channel.groupby('time')

logger.info('do interp')
normalized_depth_data = g.map(Interp)

logger.info('done')

```

Expected Output

I am fairly new to xarray, but I feel this example could be executed much better than xarray currently does. From what I can tell, each map call of the custom function above should be parallelizable. I imagined that, in the backend, xarray would chunk the data and run it in parallel on dask. However, I find it is very slow even in the single-threaded case, and it also doesn't seem to parallelize.

It takes roughly 5 ms per map call on my hardware when I don't include the chunk call, and 70 ms with the chunk call you can find in the code.

Problem Description

The single-threaded performance is very slow, and it also fails to parallelize the computation across the cores on my machine.

If you are after more background on what I am trying to do, I also asked an SO question about how to reorganize the code to improve performance. I feel, though, that the current behavior is a performance bug (assuming I didn't do something completely wrong in the code).

https://stackoverflow.com/questions/60103317/can-the-performance-of-using-xarray-groupby-map-be-improved
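For what it's worth, a sketch of a vectorized alternative (reusing names from the example above; an illustration of xarray's advanced interpolation, not a verified drop-in). Building the full index array once and making a single interp call avoids one Python call per timestep:

```python
# Sketch: one vectorized interp over all timesteps instead of groupby/map.
unadjusted_depth_per_bin = (channel.lower_limit - channel.upper_limit) / bin_count
target = xarray.DataArray(adjusted_bin_depths, dims='depth')
index_mapping = (target - channel.upper_limit) / unadjusted_depth_per_bin
normalized = channel.data.interp(depth_bin=index_mapping,
                                 method='linear', assume_sorted=True)
```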

Output of xr.show_versions()

# Paste the output here xr.show_versions() here xarray.show_versions() INSTALLED VERSIONS ------------------ commit: None python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 21:48:41) [MSC v.1916 64 bit (AMD64)] python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None libhdf5: 1.10.4 libnetcdf: 4.6.1 xarray: 0.14.1 pandas: 0.25.3 numpy: 1.17.3 scipy: 1.3.1 netCDF4: 1.4.2 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.0.4.2 nc_time_axis: None PseudoNetCDF: None rasterio: 1.1.2 cfgrib: None iris: None bottleneck: None dask: 2.9.1 distributed: 2.9.1 matplotlib: 3.1.1 cartopy: 0.17.0 seaborn: None numbagg: None setuptools: 44.0.0.post20200102 pip: 19.3.1 conda: None pytest: None IPython: 7.11.1 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3762/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1473152374 I_kwDOAMm_X85XzoV2 7348 Using entry_points to register dataset and dataarray accessors? nbren12 1386642 open 0     4 2022-12-02T16:48:42Z 2023-09-14T19:53:46Z   CONTRIBUTOR      

Is your feature request related to a problem?

External libraries often use the dataset/dataarray accessor pattern (e.g. metpy). These accessors are not available until the external package where the registration occurs has been imported. This means scripts using these accessors must include a seemingly unused import that linters will complain about, e.g.:

```python
import metpy  # linter complains here

# some data
ds: xr.Dataset = ...

ds.metpy....
```

Describe the solution you'd like

Use importlib entry points to register accessors so that registration is handled automatically. This is currently enabled for array backends, but not for accessors (e.g. metpy's setup.cfg).
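A sketch of what the xarray side could look like (the entry-point group name "xarray.accessors" is invented here; only the backends group exists today):

```python
# Hypothetical discovery hook: importing each advertised module runs its
# @register_dataset_accessor decorators as a side effect.
from importlib.metadata import entry_points

def load_accessor_plugins() -> None:
    for ep in entry_points(group="xarray.accessors"):
        ep.load()
```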

Describe alternatives you've considered

No response

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7348/reactions",
    "total_count": 2,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 1
}
    xarray 13221727 issue
1098241812 I_kwDOAMm_X85BddcU 6149 [Bug]: `numpy` `DeprecationWarning` with `DType` and `xr.testing.assert_all_close()` + Dask tomvothecoder 25624127 closed 0     4 2022-01-10T18:34:27Z 2023-09-13T20:06:59Z 2023-09-13T20:06:58Z CONTRIBUTOR      

What happened?

A numpy DeprecationWarning regarding DType is output when using xr.testing.assert_allclose() to compare two chunked Datasets. The warning does not appear when comparing two non-chunked datasets.

What did you expect to happen?

The warning should not appear.

Minimal Complete Verifiable Example

```python
class TestTemporalAvg:
    class TestTimeseries:
        @pytest.fixture(autouse=True)
        def setup(self):
            self.ds: xr.Dataset = generate_dataset(cf_compliant=True, has_bounds=True)

    # No warning with this test
    def test_weighted_annual_avg(self):
        ds = self.ds.copy()

        result = ds.temporal.temporal_avg("timeseries", "year", data_var="ts")
        expected = ds.copy()
        expected["ts"] = xr.DataArray(
            name="ts",
            data=np.ones((2, 4, 4)),
            coords={
                "lat": self.ds.lat,
                "lon": self.ds.lon,
                "year": pd.MultiIndex.from_tuples(
                    [(2000,), (2001,)],
                ),
            },
            dims=["year", "lat", "lon"],
            attrs={
                "operation": "temporal_avg",
                "mode": "timeseries",
                "freq": "year",
                "groupby": "year",
                "weighted": "True",
                "centered_time": "True",
            },
        )

        # For some reason, there is a floating point difference between both
        # for ts so we have to use floating point comparison
        xr.testing.assert_allclose(result, expected)
        assert result.ts.attrs == expected.ts.attrs

    # Warning with this test
    @requires_dask
    def test_weighted_annual_avg_with_chunking(self):
        ds = self.ds.copy().chunk({"time": 2})

        result = ds.temporal.temporal_avg("timeseries", "year", data_var="ts")
        expected = ds.copy()
        expected["ts"] = xr.DataArray(
            name="ts",
            data=np.ones((2, 4, 4)),
            coords={
                "lat": ds.lat,
                "lon": ds.lon,
                "year": pd.MultiIndex.from_tuples(
                    [(2000,), (2001,)],
                ),
            },
            dims=["year", "lat", "lon"],
            attrs={
                "operation": "temporal_avg",
                "mode": "timeseries",
                "freq": "year",
                "groupby": "year",
                "weighted": "True",
                "centered_time": "True",
            },
        )

        # For some reason, there is a floating point difference between both
        # for ts so we have to use floating point comparison
        xr.testing.assert_allclose(result, expected)
        assert result.ts.attrs == expected.ts.attrs

```

Relevant log output

```python
DeprecationWarning: The `dtype` and `signature` arguments to ufuncs only select the
general DType and not details such as the byte order or time unit (with rare
exceptions see release notes). To avoid this warning please use the scalar types
`np.float64`, or string notation. In rare cases where the time unit was preserved,
either cast the inputs or provide an output array. In the future NumPy may
transition to allow providing `dtype=` to denote the outputs `dtype` as well.
(Deprecated NumPy 1.21)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
```
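In the meantime, the noise can be contained with pytest's standard warning filter (a workaround sketch, not a fix for the underlying cast):

```python
# Sketch: silence just this numpy deprecation on the affected test.
import pytest

@pytest.mark.filterwarnings(
    "ignore:The `dtype` and `signature` arguments to ufuncs:DeprecationWarning"
)
def test_weighted_annual_avg_with_chunking():
    ...
```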

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None python: 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) [GCC 9.4.0] python-bits: 64 OS: Linux OS-release: 3.10.0-1160.45.1.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.1 libnetcdf: 4.8.1

xarray: 0.20.1 pandas: 1.3.4 numpy: 1.21.4 scipy: None netCDF4: 1.5.8 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.5.1.1 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: 1.3.2 dask: 2021.11.2 distributed: 2021.11.2 matplotlib: None cartopy: None seaborn: None numbagg: None fsspec: 2021.11.1 cupy: None pint: None sparse: None setuptools: 59.6.0 pip: 21.3.1 conda: None pytest: 6.2.5 IPython: 7.30.1 sphinx: 4.3.1

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6149/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  not_planned xarray 13221727 issue
1075765204 I_kwDOAMm_X85AHt_U 6055 Unexpected type conversion in variables with _FillValue jp-dark 24235303 closed 0     4 2021-12-09T16:26:54Z 2023-09-13T12:40:14Z 2023-09-13T12:40:13Z CONTRIBUTOR      

What happened: When opening a dataset containing an int16 variable that has the _FillValue attribute set, the variable is converted from int16 to float32. This was originally reported to the TileDB-CF-Py Git repo that contains a TileDB backend for xarray. See TileDB-CF-Py issue #117.

What you expected to happen: I would expect the type to remain the same when applying the _FillValue.

Minimal Complete Verifiable Example:

Original example from TileDB-CF-Py issue #117 using the TileDB backend:

```python
import tiledb
import xarray as xr
import numpy as np

index = tiledb.Dim(name='index', domain=(0, 3))
domain = tiledb.Domain(index)
var = tiledb.Attr(name='var', dtype=np.int16)
schema = tiledb.ArraySchema(domain=domain, attrs=[var], sparse=False)
tiledb.Array.create('dense_array0', schema)

with tiledb.open('dense_array0', 'w') as A:
    A[:] = np.array([5, 6, 7, 8], dtype=np.int16)

ds = xr.open_dataset('dense_array0', engine='tiledb')
ds['var'].dtype
```

NetCDF example with the same behavior:

```python
import netCDF4
import xarray as xr
import numpy as np

filename = 'temp_file.nc'
with netCDF4.Dataset(filename, mode="w") as group:
    group.createDimension("index", 4)
    var = group.createVariable("var", np.int16, ("index",), fill_value=-1)
    var[:] = np.array([5, 6, 7, 8], dtype=np.int16)
dataset = xr.open_dataset(filename)
dataset["var"].dtype
```

Anything else we need to know?:
* I was able to verify the type conversion from int16 to float32 occurs in the conventions.decode_cf_variables call in the open_dataset method of StoreBackendEntrypoint.
* I was able to verify the conversion does not happen if mask_and_scale=False.
* Note that TileDB automatically sets a fill value for all dense numerical arrays, so we are always setting the _FillValue attribute for variables from the TileDB backend.
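For context, the promotion is the standard CF mask-and-scale path: values matching _FillValue are replaced with NaN, which has no integer representation, so the variable is upcast to float. A sketch of keeping the on-disk dtype (the second bullet above), at the cost of keeping raw fill values:

```python
# Sketch: skip mask-and-scale so the int16 dtype survives; fill values
# then stay as -1 instead of becoming NaN.
dataset = xr.open_dataset(filename, mask_and_scale=False)
dataset["var"].dtype  # dtype('int16')
```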

Environment: I was able to reproduce this with both xarray 0.19.0 and 0.20.1

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6055/reactions",
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 1
}
  completed xarray 13221727 issue
514672231 MDU6SXNzdWU1MTQ2NzIyMzE= 3466 RuntimeError: NetCDF: DAP failure b-kode 47066389 closed 1     4 2019-10-30T13:32:34Z 2023-09-12T16:00:57Z 2023-09-12T16:00:57Z NONE      

Hi all,

I am interested in extracting specific point and variable information from the GEOS-CF product, accessible via OPeNDAP.

Loading the data seems to work fine, and I can do some processing for my specific needs. Ideally, I would like to convert this selection to a dataframe or, if needed, store it as an intermediate file that I can read from again.

Yet when doing so, I get the following error: RuntimeError: NetCDF: DAP failure

I am not sure what is causing this. Perhaps I chunk the data in the wrong (inefficient) way? Or there is an error with the GEOS netcdf files? Or ...

Below is a working code snippet.

```python
import xarray as xr

idir_geos = 'https://opendap.nccs.nasa.gov/dods/gmao/geos-cf/assim/chm_tavg_1hr_g1440x721_v1'

def preprocess(ds):
    '''Rename variables and select the relevant ones. Remove lev.'''
    ds = ds.rename({'pm25_rh35_gcc': 'PM2.5', 'no': 'NO', 'no2': 'NO2',
                    'o3': 'O3', 'so2': 'SO2', 'co': 'CO'})
    ds = ds[['PM2.5', 'NO', 'NO2', 'O3', 'SO2', 'CO']]
    ds = ds.squeeze('lev')
    return ds

ds = xr.open_mfdataset([idir_geos], preprocess=preprocess, combine='by_coords')

lat = 51.25
lon = 4.25
pol = 'O3'
ds_sel = ds.sel(lat=lat, lon=lon, method='nearest')[pol]

df_sel = ds_sel.to_dataframe().drop(['lat', 'lon'], axis=1)

ds_sel.to_netcdf('test.nc')  # Runtime error
```

Traceback error:

Traceback (most recent call last): File "/home/demuzmp4/.local/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3291, in run_code exec(code_obj, self.user_global_ns, self.user_ns) File "<ipython-input-2-fccd11da2246>", line 57, in <module> df_sel = ds_sel.to_dataframe().drop(['lat','lon'],axis=1) File "/home/demuzmp4/.local/lib/python3.6/site-packages/xarray/core/dataset.py", line 4285, in to_dataframe return self.to_dataframe(self.dims) File "/home/demuzmp4/.local/lib/python3.6/site-packages/xarray/core/dataset.py", line 4273, in _to_dataframe for k in columns File "/home/demuzmp4/.local/lib/python3.6/site-packages/xarray/core/dataset.py", line 4273, in <listcomp> for k in columns File "/home/demuzmp4/.local/lib/python3.6/site-packages/xarray/core/variable.py", line 437, in values return _as_array_or_item(self._data) File "/home/demuzmp4/.local/lib/python3.6/site-packages/xarray/core/variable.py", line 250, in _as_array_or_item data = np.asarray(data) File "/home/demuzmp4/.local/lib/python3.6/site-packages/numpy/core/_asarray.py", line 85, in asarray return array(a, dtype, copy=False, order=order) File "/usr/lib/python3/dist-packages/dask/array/core.py", line 1138, in __array__ x = self.compute() File "/usr/lib/python3/dist-packages/dask/base.py", line 135, in compute (result,) = compute(self, traverse=False, kwargs) File "/usr/lib/python3/dist-packages/dask/base.py", line 333, in compute results = get(dsk, keys, kwargs) File "/usr/lib/python3/dist-packages/dask/threaded.py", line 75, in get pack_exception=pack_exception, *kwargs) File "/usr/lib/python3/dist-packages/dask/local.py", line 521, in get_async raise_exception(exc, tb) File "/usr/lib/python3/dist-packages/dask/compatibility.py", line 60, in reraise raise exc File "/usr/lib/python3/dist-packages/dask/local.py", line 290, in execute_task result = _execute_task(task, data) File "/usr/lib/python3/dist-packages/dask/local.py", line 271, in _execute_task return func(args2) File "/usr/lib/python3/dist-packages/dask/array/core.py", line 72, in getter c = np.asarray(c) File "/home/demuzmp4/.local/lib/python3.6/site-packages/numpy/core/_asarray.py", line 85, in asarray return array(a, dtype, copy=False, order=order) File "/home/demuzmp4/.local/lib/python3.6/site-packages/xarray/core/indexing.py", line 490, in array return np.asarray(self.array, dtype=dtype) File "/home/demuzmp4/.local/lib/python3.6/site-packages/numpy/core/_asarray.py", line 85, in asarray return array(a, dtype, copy=False, order=order) File "/home/demuzmp4/.local/lib/python3.6/site-packages/xarray/core/indexing.py", line 652, in array return np.asarray(self.array, dtype=dtype) File "/home/demuzmp4/.local/lib/python3.6/site-packages/numpy/core/_asarray.py", line 85, in asarray return array(a, dtype, copy=False, order=order) File "/home/demuzmp4/.local/lib/python3.6/site-packages/xarray/core/indexing.py", line 556, in array return np.asarray(array[self.key], dtype=None) File "/home/demuzmp4/.local/lib/python3.6/site-packages/numpy/core/_asarray.py", line 85, in asarray return array(a, dtype, copy=False, order=order) File "/home/demuzmp4/.local/lib/python3.6/site-packages/xarray/coding/variables.py", line 73, in array return self.func(self.array) File "/home/demuzmp4/.local/lib/python3.6/site-packages/xarray/coding/variables.py", line 142, in _apply_mask data = np.asarray(data, dtype=dtype) File "/home/demuzmp4/.local/lib/python3.6/site-packages/numpy/core/_asarray.py", line 85, in asarray return array(a, dtype, copy=False, order=order) File 
"/home/demuzmp4/.local/lib/python3.6/site-packages/xarray/core/indexing.py", line 556, in array return np.asarray(array[self.key], dtype=None) File "/home/demuzmp4/.local/lib/python3.6/site-packages/xarray/backends/netCDF4.py", line 72, in getitem key, self.shape, indexing.IndexingSupport.OUTER, self.getitem File "/home/demuzmp4/.local/lib/python3.6/site-packages/xarray/core/indexing.py", line 836, in explicit_indexing_adapter result = raw_indexing_method(raw_key.tuple) File "/home/demuzmp4/.local/lib/python3.6/site-packages/xarray/backends/netCDF4.py", line 84, in _getitem array = getitem(original_array, key) File "/home/demuzmp4/.local/lib/python3.6/site-packages/xarray/backends/common.py", line 54, in robust_getitem return array[key] File "netCDF4/_netCDF4.pyx", line 4408, in netCDF4._netCDF4.Variable.getitem File "netCDF4/_netCDF4.pyx", line 5352, in netCDF4._netCDF4.Variable._get File "netCDF4/_netCDF4.pyx", line 1887, in netCDF4._netCDF4._ensure_nc_success RuntimeError: NetCDF: DAP failure

More info on my xarray installation:

commit: None python: 3.6.9 (default, Jul 3 2019, 07:38:46) [GCC 8.3.0] python-bits: 64 OS: Linux OS-release: 4.15.0-66-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_GB.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.6.3 xarray: 0.14.0 pandas: 0.25.2 numpy: 1.17.3 scipy: 1.3.1 netCDF4: 1.5.3 pydap: installed h5netcdf: None h5py: 2.9.0 Nio: None zarr: 2.3.2 cftime: 1.0.4 nc_time_axis: None PseudoNetCDF: None rasterio: 1.0.28 cfgrib: None iris: None bottleneck: 1.2.1 dask: 0.16.0 distributed: None matplotlib: 3.1.1 cartopy: 0.17.0 seaborn: 0.9.0 numbagg: None setuptools: 41.4.0 pip: 9.0.1 conda: None pytest: 5.2.1 IPython: 7.3.0 sphinx: 1.8.4

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3466/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1339921253 I_kwDOAMm_X85P3ZNl 6919 Parallel read with MPI mengaldo 8100801 closed 0     4 2022-08-16T07:19:14Z 2023-09-12T15:16:32Z 2023-09-12T15:16:31Z NONE      

Is your feature request related to a problem?

Is it possible to somehow extend xarray to use MPI I/O?

Describe the solution you'd like

We would need to know the offset from where the actual data starts within the file. Is there a way of retrieving that? Disclaimer: I am not an expert on the NetCDF format, so apologies if the question is trivial!
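For what it's worth, a pointer rather than an xarray feature: for NetCDF4 files, the underlying HDF5 layer already supports MPI I/O, which h5py exposes when built against parallel HDF5 (the file and variable names below are hypothetical):

```python
# Sketch: collective parallel read outside xarray, assuming an h5py build
# with the mpio driver and mpi4py available.
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
with h5py.File("data.nc", "r", driver="mpio", comm=comm) as f:
    block = f["temperature"][comm.rank]  # each rank reads its own slice
```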

Describe alternatives you've considered

No response

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6919/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1861335844 I_kwDOAMm_X85u8bsk 8096 Errors when saving PyObject coordinates krokosik 38408316 closed 0     4 2023-08-22T12:14:53Z 2023-09-06T11:44:41Z 2023-09-06T11:44:41Z CONTRIBUTOR      

What happened?

Hi, I'm trying to create a DataArray with coordinates that are tuples and potentially even higher-dimensional objects. The way I did it is to create an empty numpy array with dtype=object and then insert my tuples into it. This doesn't throw an error when creating a DataArray (as opposed to using a 2D ndarray or a list of lists). However, when trying to save it to zarr or netcdf, I get an error saying ValueError: setting an array element with a sequence.

What did you expect to happen?

I want to be able to save and load such coordinates without errors. Maybe there is a cleaner way to do it than the object dtype ndarray?

Minimal Complete Verifiable Example

```Python
import numpy as np
import xarray as xr

n = 5
x = np.empty(n, dtype=object)
for i in range(n):
    x[i] = (i, i)
xr.DataArray(np.arange(n), dims=("x"), coords={"x": x}).to_zarr("test")
```
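On the "cleaner way" question: one pattern that avoids dtype=object entirely (a sketch; the auxiliary dimension name "component" is invented here) is to store the tuples as a 2D integer coordinate:

```python
# Sketch: represent each (i, i) pair along an extra "component" dimension.
import numpy as np
import xarray as xr

n = 5
pairs = np.array([(i, i) for i in range(n)])  # shape (n, 2), plain ints
ds = xr.Dataset(
    {"values": ("x", np.arange(n))},
    coords={"x_pair": (("x", "component"), pairs)},
)
ds.to_zarr("test2", mode="w")
```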

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```Python
File c:\Users\Wiktor\AppData\Local\pypoetry\Cache\virtualenvs\spin1-JGuolXDk-py3.11\Lib\site-packages\xarray\core\dataarray.py:4014, in DataArray.to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   4010 else:
   4011     # No problems with the name - so we're fine!
   4012     dataset = self.to_dataset()
-> 4014 return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads :(
   4015     dataset,
   4016     path,
   4017     mode=mode,
   4018     format=format,
   4019     group=group,
   4020     engine=engine,
   4021     encoding=encoding,
   4022     unlimited_dims=unlimited_dims,
...
    101 result = np.empty(data.shape, dtype)
--> 102 result[...] = data
    103 return result

ValueError: setting an array element with a sequence.
```

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.11.3 (tags/v3.11.3:f3909b8, Apr 4 2023, 23:49:59) [MSC v.1934 64 bit (AMD64)] python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 183 Stepping 1, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: ('Polish_Poland', '1250') libhdf5: None libnetcdf: None xarray: 2023.8.0 pandas: 2.0.3 numpy: 1.25.2 scipy: 1.11.2 netCDF4: None pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.16.0 cftime: None nc_time_axis: None PseudoNetCDF: None iris: None bottleneck: None dask: None distributed: None matplotlib: 3.7.2 cartopy: None seaborn: None numbagg: None fsspec: None cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 68.0.0 pip: 23.2.1 conda: None pytest: None mypy: None IPython: 8.14.0 sphinx: 7.1.2
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8096/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1870484988 I_kwDOAMm_X85vfVX8 8120 `open_mfdataset` exits while sending a "Segmentation fault" error kasra-keshavarz 50383939 closed 0     4 2023-08-28T20:51:23Z 2023-09-01T15:43:08Z 2023-09-01T15:43:08Z NONE      

What is your issue?

I try to open about 10 files, each ~5 MB, as a test case, using xarray's open_mfdataset method with the parallel=True option; however, it throws a "Segmentation fault" error as follows:

```python
$ ipython
Python 3.10.2 (main, Feb 4 2022, 19:10:35) [GCC 9.3.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import xarray as xr

In [2]: ds = xr.open_mfdataset('./ab_models_198001*.nc', chunks={'time':10})

In [3]: ds
Out[3]:
<xarray.Dataset>
Dimensions:              (time: 744, rlat: 140, rlon: 105)
Coordinates:
  * time                 (time) datetime64[ns] 1980-01-01T13:00:00 ... 1980-0...
    lon                  (rlat, rlon) float32 dask.array<chunksize=(140, 105), meta=np.ndarray>
    lat                  (rlat, rlon) float32 dask.array<chunksize=(140, 105), meta=np.ndarray>
  * rlon                 (rlon) float64 342.1 342.2 342.2 ... 351.2 351.3 351.4
  * rlat                 (rlat) float64 -7.83 -7.74 -7.65 ... 4.5 4.59 4.68
Data variables:
    rotated_pole         (time) int32 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1
    RDRS_v2.1_P_UVC_10m  (time, rlat, rlon) float32 dask.array<chunksize=(10, 140, 105), meta=np.ndarray>
    RDRS_v2.1_P_FI_SFC   (time, rlat, rlon) float32 dask.array<chunksize=(10, 140, 105), meta=np.ndarray>
    RDRS_v2.1_P_FB_SFC   (time, rlat, rlon) float32 dask.array<chunksize=(10, 140, 105), meta=np.ndarray>
    RDRS_v2.1_A_PR0_SFC  (time, rlat, rlon) float32 dask.array<chunksize=(10, 140, 105), meta=np.ndarray>
    RDRS_v2.1_P_P0_SFC   (time, rlat, rlon) float32 dask.array<chunksize=(10, 140, 105), meta=np.ndarray>
    RDRS_v2.1_P_TT_1.5m  (time, rlat, rlon) float32 dask.array<chunksize=(10, 140, 105), meta=np.ndarray>
    RDRS_v2.1_P_HU_1.5m  (time, rlat, rlon) float32 dask.array<chunksize=(10, 140, 105), meta=np.ndarray>
Attributes:
    CDI:          Climate Data Interface version 2.0.4 (https://mpimet.mpg.de...
    Conventions:  CF-1.6
    product:      RDRS_v2.1
    Remarks:      Variable names are following the convention <Product>_<Type...
    License:      These data are provided by the Canadian Surface Prediction ...
    history:      Mon Aug 28 13:44:02 2023: cdo -z zip -s -L -sellonlatbox,-1...
    NCO:          netCDF Operators version 5.0.6 (Homepage = http://nco.sf.ne...
    CDO:          Climate Data Operators version 2.0.4 (https://mpimet.mpg.de...

In [4]: type(ds)
Out[4]: xarray.core.dataset.Dataset

In [5]: ds = xr.open_mfdataset('./ab_models_198001*.nc', chunks={'time':10}, parallel=True)
[gra-login3:25527:0:6913] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
[gra-login3:25527] *** Process received signal ***
[gra-login3:25527] Signal: Segmentation fault (11)
[gra-login3:25527] Signal code:  (128)
[gra-login3:25527] Failing at address: (nil)
Segmentation fault

```
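One variable worth isolating (an assumption to test, not a diagnosis): parallel=True opens the files inside dask tasks, and some HDF5/netCDF builds are not thread-safe. Pinning the scheduler narrows it down:

```python
# Sketch: the same open, but with single-threaded task execution.
import dask
import xarray as xr

with dask.config.set(scheduler="synchronous"):
    ds = xr.open_mfdataset('./ab_models_198001*.nc',
                           chunks={'time': 10}, parallel=True)
```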

Here is the version of xarray:

```python In [5]: xr.show_versions() /home/user/virtual-envs/scienv/lib/python3.10/site-packages/_distutils_hack/init.py:36: UserWarning: Setuptools is replacing distutils. warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS

commit: None python: 3.10.2 (main, Feb 4 2022, 19:10:35) [GCC 9.3.0] python-bits: 64 OS: Linux OS-release: 3.10.0-1160.88.1.el7.x86_64 machine: x86_64 processor: byteorder: little LC_ALL: None LANG: en_CA.UTF-8 LOCALE: ('en_CA', 'UTF-8') libhdf5: 1.12.1 libnetcdf: 4.9.0

xarray: 2023.7.0 pandas: 1.4.0 numpy: 1.21.2 scipy: 1.8.0 netCDF4: 1.6.4 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.6.2 nc_time_axis: None PseudoNetCDF: None iris: None bottleneck: None dask: 2023.8.0 distributed: 2023.8.0 matplotlib: 3.5.1 cartopy: None seaborn: None numbagg: None fsspec: 2023.6.0 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 60.2.0 pip: 23.2.1 conda: None pytest: 7.4.0 mypy: None IPython: 8.10.0 sphinx: None ```

I'm working on an HPC, so if a list of the modules I have loaded helps, here it is:

```console
$ module list

Currently Loaded Modules: 1) CCconfig 5) gcccore/.9.3.0 (H) 9) libfabric/1.10.1 13) ipykernel/2023a 17) sqlite/3.38.5 21) postgresql/12.4 (t) 25) gdal/3.5.1 (geo) 29) udunits/2.2.28 (t) 33) cdo/2.2.1 (geo) 2) gentoo/2020 (S) 6) imkl/2020.1.217 (math) 10) openmpi/4.0.3 (m) 14) scipy-stack/2023a (math) 18) jasper/2.0.16 (vis) 22) freexl/1.0.5 (t) 26) geos/3.10.2 (geo) 30) libaec/1.0.6 34) mpi4py/3.1.3 (t) 3) StdEnv/2020 (S) 7) gcc/9.3.0 (t) 11) libffi/3.3 15) hdf5/1.10.6 (io) 19) libgeotiff-proj901/1.7.1 23) librttopo-proj9/1.1.0 27) proj/9.0.1 (geo) 31) eccodes/2.25.0 (geo) 35) netcdf-fortran/4.5.2 (io) 4) mii/1.1.2 8) ucx/1.8.0 12) python/3.10.2 (t) 16) netcdf/4.7.4 (io) 20) cfitsio/4.1.0 (vis) 24) libspatialite-proj901/5.0.1 28) expat/2.4.1 (t) 32) yaxt/0.9.0 (t) 36) libspatialindex/1.8.5 (phys)

Where: S: Module is Sticky, requires --force to unload or purge m: MPI implementations / Implémentations MPI math: Mathematical libraries / Bibliothèques mathématiques io: Input/output software / Logiciel d'écriture/lecture t: Tools for development / Outils de développement vis: Visualisation software / Logiciels de visualisation geo: Geography libraries/apps / Logiciels de géographie phys: Physics libraries/apps / Logiciels de physique H: Hidden Module ```

Thanks.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8120/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1611701140 I_kwDOAMm_X85gEJuU 7588 xr.merge with compat="minimal" returns corrupted Dataset and causes __len__ to return wrong and possibly negative values. Metamess 2466330 closed 0     4 2023-03-06T15:47:40Z 2023-08-30T09:14:19Z 2023-08-30T07:57:37Z CONTRIBUTOR      

What happened?

When merging multiple datasets with the compat="minimal" option, coordinates whose variables are dropped due to incompatibility are still kept in the dataset's _coord_names. I believe the cause of this originates on line 752 of merge_core, where the coordinate names are based on the datasets in coerced, which are not affected by the dropping of (coordinate) variables/indexes in the merge_collected function.

This is directly related to the bug described in issue 7405. As seen there, one result is that a dropped coordinate still evaluates as being contained in the resulting dataset's coords. The effects of this bug are more widespread, which this issue attempts to dive into.

At least one other (perhaps more severe) result of this bug is connected to the fact that the __len__ function of a DataVariable is implemented as follows: return len(self._dataset._variables) - len(self._dataset._coord_names)

If a coordinate was dropped as a result of the merge, it is no longer part of the _variables, but still listed in the _coord_names, and as such the result of len() will be off by 1 for each such coordinate. This also means that the result of len() can become negative, which causes python to raise ValueError: __len__() should return >= 0.

One instance where this causes immediate errors is when trying to print the resulting dataset. As part of the __repr__ of a Dataset, a boolean evaluation of the DataVariable is performed (if mapping: in xarray/core/formatting.py in _mapping_repr), calling __len__ to check the truth value and triggering the ValueError.

While this is undoubtedly only one of many places where the incorrect __len__ causes issues, it is a rather pressing one as it even stops one from inspecting the Dataset in the most common way (printing it). The ValueError it produces is also very hard to trace back to the actual cause, likely completely throwing users off from fixing their code.

What did you expect to happen?

To get a Dataset with the correct _coord_names property, and in no circumstance whatsoever to get a Dataset which reports a negative length

Minimal Complete Verifiable Example

```Python
import xarray as xr
ds1 = xr.Dataset(coords={"foo": [1, 2, 3], "bar": 4})
ds2 = xr.Dataset(coords={"foo": [1, 2, 3], "bar": 5})

# If the result is not captured in res, this will cause a ValueError as the
# interpreter attempts to print the result
res = xr.merge([ds1, ds2], compat="minimal")

res.coords
# Coordinates:
#   * foo      (foo) int64 1 2 3

res._coord_names
# {'foo', 'bar'}

# As shown in issue #7405. Note "bar" is not printed in res.coords, revealing
# an interesting disconnect in the behaviors of different functions targeting
# a dataset's coordinates
"bar" in res.coords
# True

res
# ValueError: __len__() should return >= 0
```
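Spelling out the off-by-one with the res from above (private attributes used purely for illustration):

```python
# Sketch: one dropped coordinate leaves a dangling name behind.
len(res._variables)    # 1 -- only "foo" survived the merge
len(res._coord_names)  # 2 -- {"foo", "bar"} is still recorded
# len(res.data_vars) would compute 1 - 2 = -1, so Python raises
# "ValueError: __len__() should return >= 0" when the mapping is evaluated.
```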

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```Python
>>> import xarray as xr
>>> ds1 = xr.Dataset(coords={"foo": [1, 2, 3], "bar": 4})
>>> ds2 = xr.Dataset(coords={"foo": [1, 2, 3], "bar": 5})
>>> res = xr.merge([ds1, ds2], compat="minimal")
>>> res.coords
Coordinates:
  * foo      (foo) int64 1 2 3
>>> res._coord_names
{'bar', 'foo'}
>>> "bar" in res.coords
True
>>> res
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/redacted/.venv/lib/python3.10/site-packages/xarray/core/dataset.py", line 2116, in __repr__
    return formatting.dataset_repr(self)
  File "/usr/lib/python3.10/reprlib.py", line 21, in wrapper
    result = user_function(self)
  File "/home/redacted/.venv/lib/python3.10/site-packages/xarray/core/formatting.py", line 673, in dataset_repr
    summary.append(data_vars_repr(ds.data_vars, col_width=col_width, max_rows=max_rows))
  File "/home/redacted/.lvenv/lib/python3.10/site-packages/xarray/core/formatting.py", line 357, in _mapping_repr
    if mapping:
ValueError: __len__() should return >= 0
```

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] python-bits: 64 OS: Linux OS-release: 5.10.16.3-microsoft-standard-WSL2 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: None libnetcdf: None xarray: 2023.2.0 pandas: 1.5.1 numpy: 1.24.2 scipy: 1.10.0 netCDF4: None pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.13.6 cftime: None nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: 0.9.10.3 iris: None bottleneck: 1.3.6 dask: None distributed: None matplotlib: None cartopy: None seaborn: None numbagg: None fsspec: 2023.1.0 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 59.6.0 pip: 23.0.1 conda: None pytest: 7.2.1 mypy: 1.0.1 IPython: 7.34.0 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7588/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1858062203 I_kwDOAMm_X85uv8d7 8090 DataArrayResampleAggregations break with _flox_reduce where source DataArray has a discontinuous time dimension ollie-bell 56110893 open 0     4 2023-08-20T09:48:42Z 2023-08-24T04:20:32Z   NONE      

What happened?

When resampling a DataArray with a discontinuity in the time dimension, the resample object contains placeholder groups for the missing times in between the present times.

This seems to break the flox reductions any, count, and all, which complain about a fill_value of None. See the example provided below.

What did you expect to happen?

The result should be computed successfully in the same way that it is without using flox.

Minimal Complete Verifiable Example

```Python
import xarray as xr
import numpy as np

dates = (("1980-12-01", "1990-11-30"), ("2000-12-01", "2010-11-30"))
times = [xr.cftime_range(*d, freq="D", calendar="360_day") for d in dates]

da = xr.concat(
    [xr.DataArray(np.random.rand(len(t)), coords={"time": t}, dims="time") for t in times],
    dim="time",
)

da = da.chunk(time=360)

with xr.set_options(use_flox=True):
    # FAILS - discontinuous time dimension before resample
    (da > 0.5).resample(time="AS-DEC").any(dim="time")

with xr.set_options(use_flox=True):
    # SUCCEEDS - continuous time dimension before resample
    (da.sel(time=slice(*dates[0])) > 0.5).resample(time="AS-DEC").any(dim="time")

with xr.set_options(use_flox=True):
    # SUCCEEDS - compute chunks before resample
    (da > 0.5).compute().resample(time="AS-DEC").any(dim="time")

with xr.set_options(use_flox=False):
    # SUCCEEDS - don't use flox
    (da > 0.5).resample(time="AS-DEC").any(dim="time")
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```Python

ValueError                                Traceback (most recent call last)
Cell In[60], line 1
----> 1 (da > 0.5).resample(time="AS-DEC").any(dim="time")

File ~/miniconda3/envs/forge310/lib/python3.10/site-packages/xarray/core/_aggregations.py:7029, in DataArrayResampleAggregations.any(self, dim, keep_attrs, **kwargs)
   6960 """
   6961 Reduce this DataArray's data by applying any along some dimension(s).
   (...)
   7022   * time     (time) datetime64[ns] 2001-01-31 2001-04-30 2001-07-31
   7023 """
   7024 if (
   7025     flox_available
   7026     and OPTIONS["use_flox"]
   7027     and contains_only_chunked_or_numpy(self._obj)
   7028 ):
-> 7029     return self._flox_reduce(
   7030         func="any",
   7031         dim=dim,
   7032         # fill_value=fill_value,
   7033         keep_attrs=keep_attrs,
   7034         **kwargs,
   7035     )
   7036 else:
   7037     return self.reduce(
   7038         duck_array_ops.array_any,
   7039         dim=dim,
   7040         keep_attrs=keep_attrs,
   7041         **kwargs,
   7042     )

File ~/miniconda3/envs/forge310/lib/python3.10/site-packages/xarray/core/resample.py:57, in Resample._flox_reduce(self, dim, keep_attrs, **kwargs)
     51 def _flox_reduce(
     52     self,
     53     dim: Dims,
     54     keep_attrs: bool | None = None,
     55     **kwargs,
     56 ) -> T_Xarray:
---> 57     result = super()._flox_reduce(dim=dim, keep_attrs=keep_attrs, **kwargs)
     58     result = result.rename({RESAMPLE_DIM: self._group_dim})
     59     return result

File ~/miniconda3/envs/forge310/lib/python3.10/site-packages/xarray/core/groupby.py:1018, in GroupBy._flox_reduce(self, dim, keep_attrs, **kwargs)
   1015     kwargs.setdefault("min_count", 1)
   1017 output_index = grouper.full_index
-> 1018 result = xarray_reduce(
   1019     obj.drop_vars(non_numeric.keys()),
   1020     self._codes,
   1021     dim=parsed_dim,
   1022     # pass RangeIndex as a hint to flox that by is already factorized
   1023     expected_groups=(pd.RangeIndex(len(output_index)),),
   1024     isbin=False,
   1025     keep_attrs=keep_attrs,
   1026     **kwargs,
   1027 )
   1029 # we did end up reducing over dimension(s) that are
   1030 # in the grouped variable
   1031 group_dims = grouper.group.dims

File ~/miniconda3/envs/forge310/lib/python3.10/site-packages/flox/xarray.py:408, in xarray_reduce(obj, func, expected_groups, isbin, sort, dim, fill_value, dtype, method, engine, keep_attrs, skipna, min_count, reindex, *by, **finalize_kwargs)
    406 output_core_dims = [d for d in input_core_dims[0] if d not in dim_tuple]
    407 output_core_dims.extend(group_names)
--> 408 actual = xr.apply_ufunc(
    409     wrapper,
    410     ds_broad.drop_vars(tuple(missing_dim)).transpose(..., *grouper_dims),
    411     *by_da,
    412     input_core_dims=input_core_dims,
    413     # for xarray's test_groupby_duplicate_coordinate_labels
    414     exclude_dims=set(dim_tuple),
    415     output_core_dims=[output_core_dims],
    416     dask="allowed",
    417     dask_gufunc_kwargs=dict(
    418         output_sizes=group_sizes, output_dtypes=[dtype] if dtype is not None else None
    419     ),
    420     keep_attrs=keep_attrs,
    421     kwargs={
    422         "func": func,
    423         "axis": axis,
    424         "sort": sort,
    425         "fill_value": fill_value,
    426         "method": method,
    427         "min_count": min_count,
    428         "skipna": skipna,
    429         "engine": engine,
    430         "reindex": reindex,
    431         "expected_groups": tuple(expected_groups),
    432         "isbin": isbins,
    433         "finalize_kwargs": finalize_kwargs,
    434         "dtype": dtype,
    435         "core_dims": input_core_dims,
    436     },
    437 )
    439 # restore non-dim coord variables without the core dimension
    440 # TODO: shouldn't apply_ufunc handle this?
    441 for var in set(ds_broad._coord_names) - set(ds_broad._indexes) - set(ds_broad.dims):

File ~/miniconda3/envs/forge310/lib/python3.10/site-packages/xarray/core/computation.py:1185, in apply_ufunc(func, input_core_dims, output_core_dims, exclude_dims, vectorize, join, dataset_join, dataset_fill_value, keep_attrs, kwargs, dask, output_dtypes, output_sizes, meta, dask_gufunc_kwargs, *args)
   1183 # feed datasets apply_variable_ufunc through apply_dataset_vfunc
   1184 elif any(is_dict_like(a) for a in args):
-> 1185     return apply_dataset_vfunc(
   1186         variables_vfunc,
   1187         *args,
   1188         signature=signature,
   1189         join=join,
   1190         exclude_dims=exclude_dims,
   1191         dataset_join=dataset_join,
   1192         fill_value=dataset_fill_value,
   1193         keep_attrs=keep_attrs,
   1194     )
   1195 # feed DataArray apply_variable_ufunc through apply_dataarray_vfunc
   1196 elif any(isinstance(a, DataArray) for a in args):

File ~/miniconda3/envs/forge310/lib/python3.10/site-packages/xarray/core/computation.py:469, in apply_dataset_vfunc(func, signature, join, dataset_join, fill_value, exclude_dims, keep_attrs, *args)
    464 list_of_coords, list_of_indexes = build_output_coords_and_indexes(
    465     args, signature, exclude_dims, combine_attrs=keep_attrs
    466 )
    467 args = tuple(getattr(arg, "data_vars", arg) for arg in args)
--> 469 result_vars = apply_dict_of_variables_vfunc(
    470     func, *args, signature=signature, join=dataset_join, fill_value=fill_value
    471 )
    473 out: Dataset | tuple[Dataset, ...]
    474 if signature.num_outputs > 1:

File ~/miniconda3/envs/forge310/lib/python3.10/site-packages/xarray/core/computation.py:411, in apply_dict_of_variables_vfunc(func, signature, join, fill_value, *args)
    409 result_vars = {}
    410 for name, variable_args in zip(names, grouped_by_name):
--> 411     result_vars[name] = func(*variable_args)
    413 if signature.num_outputs > 1:
    414     return _unpack_dict_tuples(result_vars, signature.num_outputs)

File ~/miniconda3/envs/forge310/lib/python3.10/site-packages/xarray/core/computation.py:761, in apply_variable_ufunc(func, signature, exclude_dims, dask, output_dtypes, vectorize, keep_attrs, dask_gufunc_kwargs, *args)
    756 if vectorize:
    757     func = _vectorize(
    758         func, signature, output_dtypes=output_dtypes, exclude_dims=exclude_dims
    759     )
--> 761 result_data = func(*input_data)
    763 if signature.num_outputs == 1:
    764     result_data = (result_data,)

File ~/miniconda3/envs/forge310/lib/python3.10/site-packages/flox/xarray.py:379, in xarray_reduce.<locals>.wrapper(array, func, skipna, core_dims, *by, **kwargs)
    376     offset = min(array)
    377     array = datetime_to_numeric(array, offset, datetime_unit="us")
--> 379 result, groups = groupby_reduce(array, *by, func=func, **kwargs)
    381 # Output of count has an int dtype.
    382 if requires_numeric and func != "count":

File ~/miniconda3/envs/forge310/lib/python3.10/site-packages/flox/core.py:2011, in groupby_reduce(array, func, expected_groups, sort, isbin, axis, fill_value, dtype, min_count, method, engine, reindex, finalize_kwargs, *by)
   2005     groups = (groups[0][sorted_idx],)
   2007 if factorize_early:
   2008     # nan group labels are factorized to -1, and preserved
   2009     # now we get rid of them by reindexing
   2010     # This also handles bins with no data
-> 2011     result = reindex_(
   2012         result, from_=groups[0], to=expected_groups, fill_value=fill_value
   2013     ).reshape(result.shape[:-1] + grp_shape)
   2014     groups = final_groups
   2016 if is_bool_array and (_is_minmax_reduction(func) or _is_first_last_reduction(func)):

File ~/miniconda3/envs/forge310/lib/python3.10/site-packages/flox/core.py:428, in reindex_(array, from_, to, fill_value, axis, promote)
    426 if any(idx == -1):
    427     if fill_value is None:
--> 428         raise ValueError("Filling is required. fill_value cannot be None.")
    429     indexer[axis] = idx == -1
    430 # This allows us to match xarray's type promotion rules

ValueError: Filling is required. fill_value cannot be None.
```

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:41:52) [Clang 15.0.7 ] python-bits: 64 OS: Darwin OS-release: 22.5.0 machine: arm64 processor: arm byteorder: little LC_ALL: None LANG: en_GB.UTF-8 LOCALE: ('en_GB', 'UTF-8') libhdf5: 1.14.1 libnetcdf: 4.9.2 xarray: 2023.7.0 pandas: 1.5.3 numpy: 1.24.4 scipy: 1.11.1 netCDF4: 1.6.4 pydap: installed h5netcdf: 1.2.0 h5py: 3.9.0 Nio: None zarr: 2.16.0 cftime: 1.6.2 nc_time_axis: 1.4.1 PseudoNetCDF: None iris: 3.6.1 bottleneck: 1.3.7 dask: 2023.8.1 distributed: 2023.8.1 matplotlib: 3.7.2 cartopy: 0.22.0 seaborn: 0.12.2 numbagg: 0.2.2 fsspec: 2023.6.0 cupy: None pint: 0.22 sparse: 0.14.0 flox: 0.7.2 numpy_groupies: 0.9.22 setuptools: 68.1.2 pip: 23.2.1 conda: None pytest: None mypy: None IPython: 8.14.0 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8090/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1325665237 I_kwDOAMm_X85PBAvV 6866 Confusing terminologies and some errors in the official documentation v-liuwei 49091585 closed 0     4 2022-08-02T10:48:07Z 2023-08-23T14:20:23Z 2023-08-23T14:20:23Z NONE      

What happened?

To note, I'm using the stable version (2022.6.0).

First, I'm confused that both dimension coordinate/non-dimension coordinate and index coordinate/non-index coordinate appear in the documentation (search to see), but they seem to be the same thing.

Second, I found that there are some errors in the documentation:

  • It says that "The index associated with dimension name x can be retrieved by arr.indexes[x]. By construction, len(arr.dims) == len(arr.indexes)", which is inconsistent with actual behavior. See example code below: ```python In [0]: import xarray as xr, numpy as np In [1]: arr = xr.DataArray(np.zeros((2, 3)), dims=['x', 'y'], coords={'x': ['a', 'b']}) In [2]: assert len(arr.dims) == len(arr.indexes), f"{len(arr.dims)=}, {len(arr.indexes)=}"

AssertionError Traceback (most recent call last) <ipython-input-202-f217d18e6979> in <module> ----> 1 assert len(arr.dims) == len(arr.indexes), f"{len(arr.dims)=}, {len(arr.indexes)=}"

AssertionError: len(arr.dims)=2, len(arr.indexes)=1 In [3]: arr.indexes Out[3]: Indexes: x: Index(['a', 'b'], dtype='object', name='x') It seems that `arr.indexes` only returns indexes of dimensions that have coordinates. However, it's possible to get the index of dimension `y` through `get_index()`:python In [4]: arr.get_index('y') Out[4]: RangeIndex(start=0, stop=3, step=1, name='y') ```

  • It says that: (see link)

For convenience multi-index levels are directly accessible as "virtual" or "derived" coordinates (marked by - when printing a dataset or data array):

```python
In [77]: mda["band"]
Out[77]:
<xarray.DataArray 'band' (spec: 4)>
array(['R', 'R', 'V', 'V'], dtype=object)
Coordinates:
  * spec     (spec) object MultiIndex
  * band     (spec) object 'R' 'R' 'V' 'V'
  * wn       (spec) float64 0.1 0.2 0.7 0.9

In [78]: mda.wn
Out[78]:
<xarray.DataArray 'wn' (spec: 4)>
array([0.1, 0.2, 0.7, 0.9])
Coordinates:
  * spec     (spec) object MultiIndex
  * band     (spec) object 'R' 'R' 'V' 'V'
  * wn       (spec) float64 0.1 0.2 0.7 0.9
```

As you can see, even in the given example code offered by the official docs, all the "virtual" coordinates are marked as `*` instead of `-`, which is a little bit confusing when handling multi-index coordinates in my experience.

May I have missed something? Thanks in advance for the reply.

What did you expect to happen?

No response

Minimal Complete Verifiable Example

No response

MVCE confirmation

  • [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [ ] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.8.10 (default, Sep 28 2021, 16:10:42) [GCC 9.3.0] python-bits: 64 OS: Linux OS-release: 5.10.102.1-microsoft-standard-WSL2 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: None libnetcdf: None xarray: 2022.6.0 pandas: 1.4.3 numpy: 1.23.1 scipy: 1.3.3 netCDF4: None pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: None nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: None distributed: None matplotlib: 3.1.2 cartopy: None seaborn: None numbagg: None fsspec: None cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 45.2.0 pip: 22.2.1 conda: None pytest: None IPython: 7.13.0 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6866/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
979316661 MDU6SXNzdWU5NzkzMTY2NjE= 5738 Flexible indexes: how to handle possible dimension vs. coordinate name conflicts? benbovy 4160723 closed 0     4 2021-08-25T15:31:39Z 2023-08-23T13:28:41Z 2023-08-23T13:28:40Z MEMBER      

Another thing that I've noticed while working on #5692.

Currently it is not possible to have a Dataset with a same name used for both a dimension and a multi-index level. I guess the reason is to prevent some errors like unmatched dimension sizes when eventually the multi-index is dropped with renamed dimension(s) according to the level names (e.g., with sel or unstack). See #2299.
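For reference, a minimal sketch of the conflict described above (my own construction; the exact error message varies across versions, but a ValueError about conflicting level / dimension names is expected):

```python
import pandas as pd
import xarray as xr

# Level name "x" deliberately collides with the dimension name "x";
# per the behavior described above, xarray rejects this at construction.
midx = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b"], [0, 1, 0, 1]], names=("x", "y")
)
ds = xr.Dataset({"var": ("x", range(4))}, coords={"x": midx})  # raises ValueError
```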

I'm wondering how we should handle this in the context of flexible / custom indexes:

A. Keep this current behavior as a special case for (pandas) multi-indexes. This would avoid breaking changes but how to support custom indexes that could eventually be used like pandas multi-indexes in sel or stack?

B. Introduce some tag in xarray.Index so that we can identify a multi-coordinate index that behaves like a hierarchical index (i.e., levels may be dropped into a single index/coordinate with dimension renaming)

C. Do not allow any dimension name matching the name of a coordinate attached to a multi-coordinate index. This seems silly?

D. Eventually revert #2353 and let users taking care of potential conflicts.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5738/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
448082431 MDU6SXNzdWU0NDgwODI0MzE= 2986 How to add a custom indexer. fbriol 397386 closed 0     4 2019-05-24T09:56:25Z 2023-08-23T12:24:21Z 2023-08-23T12:24:20Z CONTRIBUTOR      

Hello,

I have written a set of indexers for 1D, 2D and 3D geodetic and Cartesian data (up to 5 dimensions for Cartesian data).

I used the Boost/C++ library to write the multidimensional data search algorithm. This tree (R*Tree) is impressive for its performance. It can be built in a few seconds with several million points and made requests for a few seconds with several million points.

```python
import numpy as np

# Install it with conda, if you want, only for python3.7:
# conda install pyindex -c fbriol
import pyindex.core as core

lon = np.random.uniform(-180.0, 180.0, 2048 * 4096)
lat = np.random.uniform(-90.0, 90.0, 2048 * 4096)

# You can not set an altitude if it is not necessary.
alt = np.random.uniform(-10000, 100000, 2048 * 4096)

# WGS system used
system = core.geodetic.System()

# RTree
tree = core.geodetic.RTree(system)
%timeit tree.packing(np.asarray((lon, lat, alt)).T)
# 3.84 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

coordinates = np.asarray((
    np.random.uniform(-180.0, 180.0, 10000),
    np.random.uniform(-90.0, 90.0, 10000),
    np.random.uniform(-10000, 100000, 10000))).T
%timeit tree.query(coordinates)
# 18 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

I'm trying to use these indexes with Xarray, but I didn't quite understand how to interface with xarray.

Is there anyone who could explain to me how to write my own indexer to test these indexers with xarray? Thank you in advance.
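Lacking a custom-index hook, one pragmatic pattern (a sketch only, using scipy's cKDTree as a stand-in for the R*Tree above) is to run the spatial query outside xarray and map the resulting positions back with isel:

```python
import numpy as np
import xarray as xr
from scipy.spatial import cKDTree

# Scattered points with lon/lat attached as non-index coordinates.
npoints = 10_000
ds = xr.Dataset(
    {"temperature": ("points", np.random.rand(npoints))},
    coords={
        "lon": ("points", np.random.uniform(-180.0, 180.0, npoints)),
        "lat": ("points", np.random.uniform(-90.0, 90.0, npoints)),
    },
)

# Query the external tree for nearest neighbours, then select by position.
tree = cKDTree(np.column_stack([ds.lon.values, ds.lat.values]))
_, idx = tree.query([[5.0, 45.0], [-120.0, 30.0]])
nearest = ds.isel(points=idx)
```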

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2986/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1603957501 I_kwDOAMm_X85fmnL9 7573 Add optional min versions to conda-forge recipe (`run_constrained`) dcherian 2448579 closed 0     4 2023-02-28T23:12:15Z 2023-08-21T16:12:34Z 2023-08-21T16:12:21Z MEMBER      

Is your feature request related to a problem?

I opened this PR to add minimum versions for our optional dependencies: https://github.com/conda-forge/xarray-feedstock/pull/84/files to prevent issues like #7467

I think we'd need a policy to choose which ones to list. Here's the current list:

```yaml
run_constrained:
  - bottleneck >=1.3
  - cartopy >=0.20
  - cftime >=1.5
  - dask-core >=2022.1
  - distributed >=2022.1
  - flox >=0.5
  - h5netcdf >=0.13
  - h5py >=3.6
  - hdf5 >=1.12
  - iris >=3.1
  - matplotlib-base >=3.5
  - nc-time-axis >=1.4
  - netcdf4 >=1.5.7
  - numba >=0.55
  - pint >=0.18
  - scipy >=1.7
  - seaborn >=0.11
  - sparse >=0.13
  - toolz >=0.11
  - zarr >=2.10
```

Some examples to think about:

1. iris seems like a bad one to force. It seems like people might use Iris and Xarray independently and Xarray shouldn't force a minimum version.
2. For backends, I arbitrarily kept netcdf4, h5netcdf and zarr.
3. It seems like we should keep array types: so dask, sparse, pint.

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7573/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1845132891 I_kwDOAMm_X85t-n5b 8062 Dataset.chunk() does not overwrite encoding["chunks"] Metamess 2466330 open 0     4 2023-08-10T12:54:12Z 2023-08-14T18:23:36Z   CONTRIBUTOR      

What happened?

When using the chunk function to change the chunk sizes of a Dataset (or DataArray, which uses the Dataset implementation of chunk), the chunk sizes of the Dask arrays are changed, but the "chunks" entry of the encoding attributes are not changed accordingly. This causes the raising of a NotImplementedError when attempting to write the Dataset to a zarr (and presumably other formats as well).

Looking at the implementation of chunk, every variable is rechunked using the _maybe_chunk function, which actually has the parameter overwrite_encoded_chunks to control just this behavior. However, it is an optional parameter which defaults to False, and the call in chunk does not provide a value for this parameter, nor does it offer the caller to influence it (by having an overwrite_encoded_chunks parameter itself, for example).

I do not know why this default value was chosen as False, or what could break if it was changed to True, but looking at the documentation, it seems the opposite of the intended effect. From the documentation of to_zarr:

Zarr chunks are determined in the following way: From the chunks attribute in each variable’s encoding (can be set via Dataset.chunk).

Which is exactly what it does not do.

What did you expect to happen?

I would expect the "chunks" entry of the encoding attribute to be changed to reflect the new chunking scheme.

Minimal Complete Verifiable Example

```Python
import xarray as xr
import numpy as np

# Create a test Dataset with dimensions x and y, each of size 100, and a chunksize of 50
ds_original = xr.Dataset({"my_var": (["x", "y"], np.random.randn(100, 100))})

# Since 'chunk' does not work, manually set encoding
ds_original.my_var.encoding["chunks"] = (50, 50)

# To best showcase the real-life example, write it to file and read it back again.
# The same could be achieved by just calling .chunk() with chunksizes of 25,
# but this feels more 'complete'
filepath = "~/chunk_test.zarr"
ds_original.to_zarr(filepath)
ds = xr.open_zarr(filepath)

# Check the chunksizes and "chunks" encoding
print(ds.my_var.chunks)
# >>> ((50, 50), (50, 50))
print(ds.my_var.encoding["chunks"])
# >>> (50, 50)

# Rechunk the Dataset
ds = ds.chunk({"x": 25, "y": 25})

# The chunksizes have changed
print(ds.my_var.chunks)
# >>> ((25, 25, 25, 25), (25, 25, 25, 25))

# But the encoding value remains the same
print(ds.my_var.encoding["chunks"])
# >>> (50, 50)

# Attempting to write this back to zarr raises an error
ds.to_zarr("~/chunk_test_rechunked.zarr")
# NotImplementedError: Specified zarr chunks encoding['chunks']=(50, 50) for
# variable named 'my_var' would overlap multiple dask chunks
# ((25, 25, 25, 25), (25, 25, 25, 25)). Writing this array in parallel with
# dask could lead to corrupted data. Consider either rechunking using chunk(),
# deleting or modifying encoding['chunks'], or specify safe_chunks=False.
```
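As a workaround sketch (given the behaviour above), clearing the stale encoding before writing sidesteps the error, since to_zarr then derives the chunks from the variable's current dask chunking:

```python
# Workaround sketch: drop the stale chunk encoding so to_zarr falls back to
# the variable's current dask chunks.
del ds.my_var.encoding["chunks"]
ds.to_zarr("~/chunk_test_rechunked.zarr")
```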

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0] python-bits: 64 OS: Linux OS-release: 5.10.16.3-microsoft-standard-WSL2 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.10.7 libnetcdf: 4.8.1 xarray: 2023.7.0 pandas: 1.5.3 numpy: 1.24.2 scipy: 1.10.0 netCDF4: 1.5.8 pydap: None h5netcdf: 0.12.0 h5py: 3.6.0 Nio: None zarr: 2.14.1 cftime: 1.5.2 nc_time_axis: None PseudoNetCDF: None iris: None bottleneck: 1.3.6 dask: 2022.01.0+dfsg distributed: 2022.01.0+ds.1 matplotlib: 3.5.1 cartopy: None seaborn: None numbagg: None fsspec: 2023.1.0 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 59.6.0 pip: 23.2.1 conda: None pytest: 7.2.2 mypy: 1.1.1 IPython: 7.31.1 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8062/reactions",
    "total_count": 2,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 1
}
    xarray 13221727 issue
1845508562 I_kwDOAMm_X85uADnS 8065 .mfdataset fail to open a kerchunked zarr file from an object-store bucket pl-marasco 22492773 closed 0     4 2023-08-10T16:22:05Z 2023-08-14T14:18:17Z 2023-08-14T14:13:58Z NONE      

What happened?

Trying to open a kerchunk .json through open_mfdataset, a ValueError is raised.

What did you expect to happen?

It should open a Dataset as described below:

```
<xarray.Dataset>
Dimensions:  (lat: 15680, lon: 40320, time: 36)
Coordinates:
  * lat      (lat) float64 80.0 79.99 79.98 79.97 ... -59.97 -59.98 -59.99
  * lon      (lon) float64 -180.0 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
  * time     (time) float64 nan 1.0 2.0 3.0 4.0 5.0 ... 31.0 32.0 33.0 34.0 35.0
Data variables:
    crs      object ...
    max      (time, lat, lon) float32 dask.array<chunksize=(1, 1207, 3102), meta=np.ndarray>
    mean     (time, lat, lon) float32 dask.array<chunksize=(1, 1207, 3102), meta=np.ndarray>
    median   (time, lat, lon) float32 dask.array<chunksize=(1, 1207, 3102), meta=np.ndarray>
    min      (time, lat, lon) float32 dask.array<chunksize=(1, 1207, 3102), meta=np.ndarray>
    nobs     (time, lat, lon) float32 dask.array<chunksize=(1, 1207, 3102), meta=np.ndarray>
    stdev    (time, lat, lon) float32 dask.array<chunksize=(1, 1207, 3102), meta=np.ndarray>
Attributes: (12/19)
    Conventions:          CF-1.6
    archive_facility:     VITO
    copyright:            Copernicus Service information 2021
    history:              2021-03-01 - Processing line NDVI LTS
    identifier:           urn:cgls:global:ndvi_stats_all:NDVI-LTS_1999-2019-0...
    institution:          VITO NV
    ...                   ...
    references:           https://land.copernicus.eu/global/products/ndvi
    sensor:               VEGETATION-1, VEGETATION-2, VEGETATION
    source:               Derived from EO satellite imagery
    time_coverage_end:    2019-12-31T23:59:59Z
    time_coverage_start:  1999-01-01T00:00:00Z
    title:                Normalized Difference Vegetation Index: Long Term S...
```

Minimal Complete Verifiable Example

```python
import xarray as xr

catalogue = "https://object-store.cloud.muni.cz/swift/v1/foss4g-catalogue/c_gls_NDVI-LTS_1999-2019.json"
LTS = xr.open_mfdataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "storage_options": {"fo": catalogue},
        "consolidated": False,
    },
)
```

MVCE confirmation

  • [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [ ] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```Python
ValueError: Cannot specify both fs and storage_options
```

Anything else we need to know?

Seems to be related to zarr's version: if tested with <= 2.12 it works but with the latest versions > 2.12 it doesn't.
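Edit: in the meantime, one workaround sketch (assuming fsspec's reference filesystem; parameter names per fsspec, not verified against every zarr version) is to build the mapper explicitly and hand it to open_dataset:

```python
import fsspec
import xarray as xr

# Workaround sketch: construct the reference filesystem ourselves instead of
# routing storage_options through the "reference://" URL.
catalogue = "https://object-store.cloud.muni.cz/swift/v1/foss4g-catalogue/c_gls_NDVI-LTS_1999-2019.json"
fs = fsspec.filesystem("reference", fo=catalogue)
LTS = xr.open_dataset(fs.get_mapper(""), engine="zarr", consolidated=False)
```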

Environment

xarray version 2023.7.0 zarr >2.12
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8065/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1817880272 I_kwDOAMm_X85sWqbQ 8013 np.cumproduct deprecated quantsnus 25102059 closed 0     4 2023-07-24T08:11:01Z 2023-07-31T16:46:00Z 2023-07-31T16:46:00Z CONTRIBUTOR      

What is your issue?

Since numpy version 1.25.0 np.cumproduct is deprecated in favor of np.cumprod.

The coordinates to_index() method still uses it https://github.com/pydata/xarray/blob/971be103d6376d6572d1f12d32526f12f07ae2c7/xarray/core/coordinates.py#L144 which results in an unnecessary DeprecationWarning.
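The fix is a one-token change, since np.cumprod has long been the equivalent spelling:

```python
import numpy as np

# np.cumproduct emits a DeprecationWarning on numpy >= 1.25;
# np.cumprod is the drop-in replacement with identical semantics.
np.cumprod([1, 2, 3])  # array([1, 2, 6])
```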

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8013/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1789989152 I_kwDOAMm_X85qsREg 7962 Better chunk manager error dcherian 2448579 closed 0     4 2023-07-05T17:27:25Z 2023-07-24T22:26:14Z 2023-07-24T22:26:13Z MEMBER      

What happened?

I just ran into this error in an environment without dask.

```
TypeError: Could not find a Chunk Manager which recognises type <class 'dask.array.core.Array'>
```

I think we could easily recommend that the user install a package that provides dask by looking at type(array).__name__. This would make the message a lot friendlier.
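A sketch of what that lookup could look like (the helper name and hint table are hypothetical, not existing xarray API):

```python
def _chunkmanager_hint(data: object) -> str:
    # Hypothetical helper: map the array's top-level module to an install hint.
    hints = {
        "dask": "Try installing dask to handle dask arrays.",
        "cubed": "Try installing cubed to handle cubed arrays.",
    }
    top_module = type(data).__module__.partition(".")[0]
    return hints.get(top_module, "")

# e.g. raise TypeError(
#     f"Could not find a Chunk Manager which recognises type {type(data)}. "
#     + _chunkmanager_hint(data)
# )
```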

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7962/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1752520008 I_kwDOAMm_X85odVVI 7907 `plot.scatter(hue_style="discrete")` does nothing mgunyho 20118130 closed 0     4 2023-06-12T11:21:33Z 2023-07-13T23:17:49Z 2023-07-13T23:17:49Z CONTRIBUTOR      

What happened?

I was trying to do a scatterplot of my data with one dimension determining the color. The dimension has only a few values so I used hue_style="discrete" to have a different color for each value. However, the resulting scatterplot has a continuous colorbar, which is the same as when I pass hue_style="continuous":

What did you expect to happen?

The colorbar should have discrete colors. I was also expecting the colors to be from the default matplotlib color palette, C0, C1, etc, when there's less than 10 items, like this:

Although the examples in the documentation show the discrete case also using viridis.

What I was really expecting is a plot like one would get by passing add_colorbar=False, add_legend=True:

But that may be a bit too automagical.

Minimal Complete Verifiable Example

```Python
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

x = xr.DataArray(
    np.random.default_rng().random((10, 3)),
    coords=[
        ("idx", np.linspace(0, 1, 10)),
        ("color", [1, 2, 3]),
    ]
)
y = x + np.random.default_rng().random(x.shape)

ds = xr.Dataset({
    "x": x,
    "y": y,
})

# the output is the same regardless of hue_style="discrete" or "continuous" or just leaving it out
ds.plot.scatter(x="x", y="y", hue="color", hue_style="discrete", ax=plt.figure().gca())
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

This is the code for the "expected" plot:

```python
from matplotlib.colors import ListedColormap

ds.plot.scatter(
    x="x", y="y", hue="color", hue_style="discrete", ax=plt.figure().gca(),
    # these lines added in addition to the MVCE
    cmap=ListedColormap(["C0", "C1", "C2"]),
    vmin=0.5, vmax=3.5,
    cbar_kwargs=dict(ticks=ds.color.data),
)

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.8.10 (default, May 26 2023, 14:05:08) [GCC 9.4.0] python-bits: 64 OS: Linux OS-release: 5.14.0-1059-oem machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: None libnetcdf: None xarray: 2023.1.0 pandas: 1.4.3 numpy: 1.23.0 scipy: None netCDF4: None pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: None nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: None distributed: None matplotlib: 3.5.3 cartopy: None seaborn: None numbagg: None fsspec: None cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 44.0.0 pip: 20.0.2 conda: None pytest: None mypy: None IPython: 8.12.2 sphinx: None

I also tried this on main at 3459e6fa, the behavior is the same.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7907/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1775657305 I_kwDOAMm_X85p1mFZ 7945 engine='cfgrib' no longer an option in xr.open_dataset() but works anyway parsellsx 74011857 closed 0     4 2023-06-26T21:32:01Z 2023-06-27T00:06:27Z 2023-06-26T21:37:05Z NONE      

What is your issue?

Looking at the documentation for xr.open_dataset(), the "engine" argument to that function is listed as accepting one of 7 different engines (or None), but the "cfgrib" engine is not among them. Looking at older versions of the documentation, I see that "cfgrib" was delisted starting with v2023.04.0 (it's still present in v2023.03.0).

In what I think is a related issue, this tutorial on reading in ERA5 GRIB files with the "engine='cfgrib'" option on xr.load_dataset() gives a ValueError in documentation versions starting with v2023.04.0 and going through v2023.05.0 and 'stable' due to the unrecognized engine 'cfgrib', although it seems to have been fixed for v2023.06.0 and 'latest'.

Given both of the above, I was surprised to find that using xr.open_dataset() on a GRIB file with engine='cfgrib' does work for me using xarray v2023.05.0. To me it seems that the documentation for xr.open_dataset() should be edited to include the 'cfgrib' option again, but I'd like to get an opinion from someone more familiar with xarray.
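For checking what is actually registered at runtime (independent of the prose docs), xarray can list the installed backends:

```python
import xarray as xr

# Lists every registered backend entrypoint in the current environment;
# "cfgrib" shows up here once the cfgrib package (which now ships its own
# xarray plugin) is installed, even though the docstring no longer names it.
print(xr.backends.list_engines())
```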

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7945/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1718143526 I_kwDOAMm_X85maMom 7854 Freezing Issue When Accessing Precipitation Values with xarray yanivgolds 118670091 closed 0     4 2023-05-20T11:30:54Z 2023-06-26T15:33:19Z 2023-06-26T15:33:19Z NONE      

What is your issue?

I am encountering a freezing issue in my project that utilizes xarray when trying to access precipitation values for a specific longitude-latitude position over a time period. This issue occurs on the slurm system but is not reproduced on my Jupyter Notebook setup. As a result, whenever I attempt to run the project, the job freezes. I would greatly appreciate your assistance in determining the cause of this problem.

Below is a figure showing the result from Jupyter Notebook (this works):

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7854/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1691902604 I_kwDOAMm_X85k2GKM 7805 [FR] add support for rss and rss button to xarray blog danieltomasz 7980381 closed 0     4 2023-05-02T07:15:12Z 2023-06-21T21:10:32Z 2023-06-21T21:10:32Z NONE      

Is your feature request related to a problem?

An easy way to subscribe to news from the xarray blog.

Describe the solution you'd like

Support for publishing news and a button to subscribe to the RSS feed from the blog (alongside the Twitter icon, etc.).

Describe alternatives you've considered

No response

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7805/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1760733017 I_kwDOAMm_X85o8qdZ 7924 Migrate from nbsphinx to myst, myst-nb dcherian 2448579 open 0     4 2023-06-16T14:17:41Z 2023-06-20T22:07:42Z   MEMBER      

Is your feature request related to a problem?

I think we should switch to MyST markdown for our docs. I've been using MyST markdown and MyST-NB in docs in other projects and it works quite well.

Advantages: 1. We get HTML reprs in the docs (example) which is a big improvement. (#6620) 2. I think many find markdown a lot easier to write than RST

There's a tool to migrate RST to MyST (RTD's migration guide).
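On the Sphinx side the switch is a small config change; a minimal sketch, assuming myst-nb's documented extension name and leaving the rest of conf.py untouched:

```python
# doc/conf.py - sketch of swapping nbsphinx for MyST-NB
extensions = [
    # "nbsphinx",  # removed
    "myst_nb",     # parses both MyST markdown (.md) and notebooks (.ipynb)
    # ... other extensions unchanged
]
```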

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7924/reactions",
    "total_count": 5,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1722614979 I_kwDOAMm_X85mrQTD 7870 Name collision with Pulsar Timing package 'PINT' vhaasteren 3092444 closed 0     4 2023-05-23T18:54:18Z 2023-05-26T16:19:37Z 2023-05-26T16:19:37Z CONTRIBUTOR      

What is your issue?

In the astrophysics community of pulsar timers, there is an analysis package called PINT. PINT is widely used in that community. As you can see on their github, they have been aware of the name collision and on pip/conda the package is available as pint-pulsar. This has not been a problem so far, because most if not all astrophysicists use the great astropy to keep track of units where necessary.

However, Bayesian modeling through PyMC is becoming more and more popular, meaning that arviz and xarray are now getting installed alongside pint-pulsar, giving obvious issues.

A very simple workaround would be to change line 37 in https://github.com/pydata/xarray/blob/main/xarray/core/pycompat.py to something like:

```python
except (ImportError, AttributeError):
```

This means that pint-pulsar would still get imported through import_module(mod), the AttributeError gets caught, and all should be well. It fits the design of duck-typing, since the package doesn't quack like pint should. Would xarray be willing to accommodate the pulsar timing community this way? As you are all aware, changing the name of a package that is integral in projects with many dependencies is kind of painful.
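For concreteness, a sketch of the guarded import with the proposed change (structure paraphrased from the description, not the verbatim xarray source):

```python
from importlib import import_module

def duck_array_type(mod: str, attr: str):
    # Sketch: pint-pulsar imports fine under the name "pint" but lacks the
    # attributes real pint exposes, so catching AttributeError alongside
    # ImportError lets xarray skip it gracefully.
    try:
        module = import_module(mod)
        return (getattr(module, attr),)
    except (ImportError, AttributeError):  # AttributeError is the addition
        return ()
```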

EDIT: fixed typo

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7870/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1160309381 I_kwDOAMm_X85FKOqF 6335 ValueError: did not find a match in any of xarray's currently installed IO backends ['netcdf4']. morestart 35556811 closed 0     4 2022-03-05T10:26:49Z 2023-05-12T14:09:52Z 2022-03-05T10:28:29Z NONE      

What is your issue?

ValueError: did not find a match in any of xarray's currently installed IO backends ['netcdf4']. Consider explicitly selecting one of the installed engines via the engine parameter, or installing additional IO dependencies, see:

but I installed netCDF4 using pip install netCDF4
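A quick sanity check (a generic diagnostic, not xarray-specific) is to confirm that the package landed in the interpreter you are actually running:

```python
import sys
print(sys.executable)   # the Python that must match the one pip installed into

import netCDF4          # ImportError here means pip targeted a different environment
print(netCDF4.__version__)
```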

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6335/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1517575123 I_kwDOAMm_X85adFvT 7409 Implement `DataArray.to_dask_dataframe()` gcaria 44147817 closed 0     4 2023-01-03T15:44:11Z 2023-04-28T15:09:31Z 2023-04-28T15:09:31Z CONTRIBUTOR      

Is your feature request related to a problem?

It'd be nice to go from a chunked DataArray to a dask dataframe object directly.

Describe the solution you'd like

I think something along these lines should work (although a less convoluted way might exist):

```python
from typing import Union

import dask.array as dka
import dask.dataframe as dkd
import xarray as xr

def to_dask(da: xr.DataArray) -> Union[dkd.Series, dkd.DataFrame]:

    if da.data.ndim > 2:
        raise ValueError(f"Can only convert 1D and 2D DataArrays, found {da.data.ndim} dimensions")

    indexes = [da.get_index(dim) for dim in da.dims]
    darr_index = dka.from_array(indexes[0], chunks=da.data.chunks[0])
    columns = [da.name] if da.data.ndim == 1 else indexes[1]
    ddf = dkd.from_dask_array(da.data, columns=columns)
    ddf[indexes[0].name] = darr_index
    return ddf.set_index(indexes[0].name).squeeze()
```
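A usage sketch for the helper above (untested; the shapes and chunking are my assumptions):

```python
import numpy as np

# 2-D case: the first dimension becomes the dask index,
# the second dimension supplies the column labels.
da = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("t", "col"),
    coords={"t": range(4), "col": ["a", "b", "c"]},
    name="vals",
).chunk({"t": 2})
ddf = to_dask(da)  # intended: a dask DataFrame with columns a, b, c, indexed by t
```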

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7409/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1652227927 I_kwDOAMm_X85iev9X 7713 `Variable/IndexVariable` do not accept a tuple for data. zoj613 44142765 closed 0     4 2023-04-03T14:50:58Z 2023-04-28T14:26:37Z 2023-04-28T14:26:37Z NONE      

What happened?

It appears that Variable and IndexVariable do not accept a tuple for the data parameter even though the docstring suggests it should be able to accept array_like objects (tuple falls under this type of object, right?).

What did you expect to happen?

Successful instantiation of a Variable/IndexVariable object, but instead a ValueError exception is raised.

Minimal Complete Verifiable Example

```Python
import xarray as xr

xr.Variable(data=(2, 3, 45), dims="day")
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

Python ValueError: dimensions ('day',) must have the same length as the number of data dimensions, ndim=0

Anything else we need to know?

This error seems to be triggered by the self._parse_dimensions(dims) call inside the Variable class. This problem does not happen if I use a list. But I find it strange that the array_like data specifically needs to be a certain type of object for the call to work. Maybe if it has to be a list then the docstring should reflect that.
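For what it's worth, this matches Variable treating a bare tuple as a 0-d scalar (I believe as_compatible_data wraps tuples into a 0-d object array); converting first is a reliable workaround:

```python
import numpy as np
import xarray as xr

# Workarounds that behave as the docstring suggests:
xr.Variable(data=[2, 3, 45], dims="day")            # a list works
xr.Variable(data=np.array((2, 3, 45)), dims="day")  # explicit conversion works
```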

Environment

``` INSTALLED VERSIONS ------------------ commit: None python: 3.8.16 | packaged by conda-forge | (default, Feb 1 2023, 16:01:55) [GCC 11.3.0] python-bits: 64 OS: Linux OS-release: 6.1.21-1-lts machine: x86_64 processor: byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.2 libnetcdf: 4.8.1 xarray: 2023.1.0 pandas: 1.5.3 numpy: 1.23.5 scipy: 1.10.1 netCDF4: 1.6.2 pydap: None h5netcdf: 1.1.0 h5py: 3.8.0 Nio: None zarr: 2.14.2 cftime: 1.6.2 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2023.3.2 distributed: 2023.3.2 matplotlib: None cartopy: None seaborn: None numbagg: None fsspec: 2023.3.0 cupy: None pint: None sparse: 0.14.0 flox: None numpy_groupies: None setuptools: 67.6.1 pip: 23.0.1 conda: None pytest: 7.2.2 mypy: 1.1.1 IPython: 8.12.0 sphinx: None ```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7713/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
575939446 MDU6SXNzdWU1NzU5Mzk0NDY= 3830 Documentation request: add examples for carrying out "ncecat" in xarray lukelbd 19657652 open 0     4 2020-03-05T01:58:17Z 2023-04-13T20:06:20Z   NONE      

In climate science, a very common task involves concatenating NetCDF files with identical variables, dimensions, and coordinates along a brand new "ensemble member" or "record" dimension. With the NetCDF Operators, this is accomplished using ncecat.

MCVE Code Sample

Currently, it seems the correct way to do this in xarray is with xarray.combine_nested as follows:

```python
import xarray as xr

files = ['member1.nc', 'member2.nc', ...]
ds = xr.open_mfdataset(
    files,
    combine='nested',
    concat_dim='record',
)
```

Problem Description

While this works, there does not seem to be any mention of this use case in the combine_nested or open_mfdataset docs... and using combine='nested' to concatenate along a brand new dimension feels quite unintuitive to me.

It would be nice to have examples in combine_nested and/or open_mfdataset with this special usage or mention the possibility of creating brand new dimensions with concat_dim. For example:

```python
In [1]: import xarray as xr
   ...: datasets = [
   ...:     xr.Dataset({'temp': (('x', 'y'), np.random.rand(10, 20))})
   ...:     for i in range(3)
   ...: ]
   ...: xr.combine_nested(datasets, concat_dim='record')
Out[1]:
<xarray.Dataset>
Dimensions:  (record: 3, x: 10, y: 20)
Dimensions without coordinates: record, x, y
Data variables:
    temp     (record, x, y) float64 0.32 0.4897 0.2659 ... 0.3485 0.0251 0.399
```

Output of xr.show_versions()

n/a

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3830/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1659786592 I_kwDOAMm_X85i7lVg 7742 About save char into netcdf ChristmasZCY 61818189 closed 0     4 2023-04-09T07:49:50Z 2023-04-11T06:36:27Z 2023-04-11T06:36:27Z NONE      

What is your issue?

When I want to save char data into netCDF, it produces a new dimension. However, when I read this netCDF file with xarray, I can't find anything that uses this dimension.
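A sketch of what I believe is happening (my reading of xarray's character-array encoding, not verified against this exact file): fixed-width byte strings are written as arrays of single characters with an extra stringN dimension, which xarray joins back on read, so no variable appears to use it:

```python
import numpy as np
import xarray as xr

# Fixed-width bytes get split into single chars on write, adding a
# "string2" dimension in the file; on read xarray joins them back,
# so nothing references that dimension anymore.
ds = xr.Dataset({"name": ("x", np.array([b"ab", b"cd"], dtype="S2"))})
ds.to_netcdf("chars.nc")
print(xr.open_dataset("chars.nc"))
```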

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7742/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1419825696 I_kwDOAMm_X85UoNIg 7199 Deprecate cfgrib backend headtr1ck 43316012 closed 0     4 2022-10-23T15:09:14Z 2023-03-29T15:19:53Z 2023-03-29T15:19:53Z COLLABORATOR      

What is your issue?

Since cfgrib 0.9.9 (04/2021) it comes with its own xarray backend plugin (looks mainly like a copy of our internal version). We should deprecate our internal plugin.

The deprecation is complicated since we usually bind the minimum version to a minor step, but cfgrib seems to have been on 0.9 for 4 years already. Maybe an exception like for netCDF4?

Anyway, if we decide to leave it as it is for now, this ticket is just a reminder to remove it someday :)

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7199/reactions",
    "total_count": 4,
    "+1": 4,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1620573171 I_kwDOAMm_X85gl_vz 7617 The documentation contains some non-descriptive link texts. remigathoni 51911758 closed 0     4 2023-03-13T00:34:09Z 2023-03-27T21:37:21Z 2023-03-27T21:37:20Z CONTRIBUTOR      

What is your issue?

I've been going through the docs and noticed some links could be more descriptive.

Here are a few examples with options on how we could rewrite them:

- See the user guide for more. -> Check out the indexing section in the user guide for a detailed explanation.
- For more, see the Xarray documentation. -> See the documentation on automatic alignment to learn more.
- This tutorial notebook also covers alignment and broadcasting (highly recommended) -> You can also check out this tutorial notebook on alignment and broadcasting (highly recommended).
- For more see the user guide, the gallery, and the tutorial material. -> For more information, check out the following resources:
  * The plotting documentation in the user guide.
  * The visualization gallery.
  * The plotting and visualization tutorial materials.

With more specific link texts, you get a clearer idea of what to expect when you click on the link which improves the reading experience. It also makes the links more accessible.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7617/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
928381010 MDU6SXNzdWU5MjgzODEwMTA= 5515 NetCDF: Attempting netcdf-4 operation on netcdf-3 file mickaellalande 20254164 open 0     4 2021-06-23T15:23:55Z 2023-03-27T21:07:32Z   CONTRIBUTOR      

I'm trying to open MODIS .hdf files, but I get the error : NetCDF: Attempting netcdf-4 operation on netcdf-3 file. Does anyone knows how to open that files? (https://nsidc.org/data/MOD10C1)

```python
import xarray as xr
xr.open_dataset('MOD10C1.A2000055.061.2020037182124.hdf')
# RuntimeError: NetCDF: Attempting netcdf-4 operation on netcdf-3 file
```

I already opened hdf files from another product without any issue... (https://nsidc.org/data/MOD10CM)

Here are two examples, with one that works and the other one that causes the issue: MODIS.zip

Thanks in advance for your help!
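Edit: one workaround sketch, in case it helps others. MOD10C1 is an HDF4-EOS file, so reading a named subdataset through rioxarray/GDAL avoids the netCDF engine entirely (the subdataset string below is illustrative; list the real ones with gdalinfo):

```python
import rioxarray

# Illustrative GDAL subdataset path; the exact grid/field names come from
# `gdalinfo MOD10C1.A2000055.061.2020037182124.hdf`.
da = rioxarray.open_rasterio(
    'HDF4_EOS:EOS_GRID:"MOD10C1.A2000055.061.2020037182124.hdf":MOD_CMG_Snow_5km:Day_CMG_Snow_Cover'
)
```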

Output of <tt>xr.show_versions()</tt> INSTALLED VERSIONS ------------------ commit: None python: 3.8.5 | packaged by conda-forge | (default, Jul 24 2020, 01:25:15) [GCC 7.5.0] python-bits: 64 OS: Linux OS-release: 4.19.0-16-amd64 machine: x86_64 processor: byteorder: little LC_ALL: None LANG: fr_FR.UTF-8 LOCALE: fr_FR.UTF-8 libhdf5: 1.10.6 libnetcdf: 4.7.4 xarray: 0.16.0 pandas: 1.1.0 numpy: 1.19.1 scipy: 1.5.2 netCDF4: 1.5.4 pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.4.0 cftime: 1.2.1 nc_time_axis: 1.2.0 PseudoNetCDF: None rasterio: 1.1.5 cfgrib: 0.9.8.5 iris: None bottleneck: None dask: 2.21.0 distributed: 2.21.0 matplotlib: 3.2.0 cartopy: 0.17.0 seaborn: None numbagg: None pint: None setuptools: 49.2.0.post20200712 pip: 20.2 conda: None pytest: 6.0.0 IPython: 7.16.1 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5515/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1338173609 I_kwDOAMm_X85Pwuip 6914 plt.imshow() vs xarray_dataset.plot.imshow() not rendering correctly | Potential Bug melioristic 32569566 closed 0     4 2022-08-14T08:40:56Z 2023-03-22T20:46:23Z 2023-03-22T20:46:23Z NONE      

What is your issue?

I have 2d data which I want to visualise. The visuals look completely different if I use plt.imshow() vs xarray_dataset.plot.imshow(). There are mainly two issues:

- First, the array is flipped. (I think this is manageable but inconsistent.)
- Secondly, the plots don't look correct. This can be best illustrated by the figures themselves.

For example this is the xarray code I am using.

```python
day_data.plot.imshow(cmap="Blues", vmin=1, vmax=100)
plt.show()
```

And this is the image that I get.

Secondly, when I use matplotlib to plot the values:

```python
plt.imshow(day_data.values, vmin=1, vmax=100, cmap='Blues')
plt.show()
```

I get this plot.

Since it is discharge data, I would expect to see the second plot. Can someone tell me what the issue is here?
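A guess at the cause (an assumption, not verified on this file): xarray orients imshow from the coordinate values, while bare plt.imshow always draws row 0 at the top, so with a descending y coordinate the two disagree. xarray's yincrease flag controls the orientation explicitly:

```python
# Sketch: force xarray to draw with y decreasing from top to bottom,
# matching plt.imshow's default row ordering.
day_data.plot.imshow(cmap="Blues", vmin=1, vmax=100, yincrease=False)
plt.show()
```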

P.S.

This is what day_data looks like.

```
<xarray.DataArray 'dis06' (y: 950, x: 1000)>
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]], dtype=float32)
Coordinates:
    time        datetime64[ns] 2019-10-24T06:00:00
    step        timedelta64[ns] 06:00:00
    surface     float64 0.0
    latitude    (y, x) float64 ...
    longitude   (y, x) float64 ...
    valid_time  datetime64[ns] 2019-10-24T12:00:00
Attributes:
    GRIB_paramId:                    240023
    GRIB_dataType:                   sfo
    GRIB_numberOfPoints:             950000
    GRIB_typeOfLevel:                surface
    GRIB_stepUnits:                  1
    GRIB_stepType:                   avg
    GRIB_gridType:                   lambert_azimuthal_equal_area
    GRIB_NV:                         0
    GRIB_cfName:                     unknown
    GRIB_cfVarName:                  dis06
    GRIB_gridDefinitionDescription:  Lambert azimuthal equal area projection
    GRIB_missingValue:               9999
    GRIB_name:                       Mean discharge in the last 6 hours
    GRIB_shortName:                  dis06
    GRIB_units:                      m**3 s**-1
    long_name:                       Mean discharge in the last 6 hours
    units:                           m**3 s**-1
    standard_name:                   unknown
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6914/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1499473190 I_kwDOAMm_X85ZYCUm 7385 Unexpected NaNs in broadcast dopplershift 221526 open 0     4 2022-12-16T02:42:44Z 2023-03-14T20:43:00Z   CONTRIBUTOR      

What happened?

When running the broadcast in the sample code, I end up with nan in the output when there are not any in the original source array. While I know the construction is really odd (this came from user-submitted code), I'm shocked that it resulted in nans the resulting broadcasted data and honestly assumed MetPy's code was doing something dumb for quite awhile. I would have expected (regardless of the nature of the coordinates) that the result for broad_a be [[1, 2], [1, 2]].

What did you expect to happen?

No response

Minimal Complete Verifiable Example

```Python
import numpy as np
import xarray as xr

levs = np.array([100000, 85000])
a = xr.Dataset({'a': (('lev',), [1, 2])}, coords={'lev': levs}).to_array()
b = xr.Dataset({'b': (('lev',), [3, 4])}, coords={'lev': levs}).to_array()

broad_a, broad_b = xr.broadcast(a, b)
print(broad_a)
```
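A note on what seems to be happening (my reading, not a confirmed diagnosis): each to_array() result carries a length-1 "variable" coordinate (['a'] vs ['b']), so broadcast outer-aligns on it and fills the non-matching half with NaN. The alignment step alone reproduces it:

```python
# Sketch: the NaNs appear at the alignment step, before any broadcasting.
aligned_a, aligned_b = xr.align(a, b, join="outer")
print(aligned_a)  # variable = ['a', 'b'], with NaN for the 'b' row
```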

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

Python <xarray.DataArray (variable: 2, lev: 2)> array([[ 1., 2.], [nan, nan]]) Coordinates: * lev (lev) int64 100000 85000 * variable (variable) object 'a' 'b'

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:31:57) [Clang 14.0.6 ] python-bits: 64 OS: Darwin OS-release: 21.6.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.2 libnetcdf: 4.8.1 xarray: 2022.12.0 pandas: 1.5.2 numpy: 1.23.5 scipy: 1.9.3 netCDF4: 1.6.2 pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.13.3 cftime: 1.6.2 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: 0.9.10.3 iris: None bottleneck: 1.3.5 dask: 2022.6.1 distributed: 2022.6.1 matplotlib: 3.6.2 cartopy: 0.21.0 seaborn: None numbagg: None fsspec: 2022.11.0 cupy: None pint: 0.20.1 sparse: None flox: None numpy_groupies: None setuptools: 65.5.1 pip: 22.3.1 conda: None pytest: 7.2.0 mypy: 0.991 IPython: 8.7.0 sphinx: 5.3.0
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7385/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
706507153 MDU6SXNzdWU3MDY1MDcxNTM= 4449 Did copy(deep=True) break with 0.16.1? blaylockbk 6249613 closed 0     4 2020-09-22T15:59:41Z 2023-03-12T21:08:42Z 2023-03-12T21:08:42Z NONE      

What happened: I have a script that downloads a file, reads and copies it to memory with ds.copy(deep=True), and then removes the downloaded file from disk. In 0.16.1, I get an error "No such file or directory" when I try to read the data from the deep-copied Dataset as if the Dataset was not actually copied into memory.

What you expected to happen: In 0.16.0 and earlier, the variable data is available (ds.varName.data) after it is copied into memory even after the original file was removed. But this doesn't work anymore in 0.16.1.

Minimal Complete Verifiable Example:

```python
import xarray as xr
import os
import urllib.request

# Get sample NetCDF file
url = 'https://www.unidata.ucar.edu/software/netcdf/examples/tos_O1_2001-2002.nc'
FILE = 'tos_O1_2001-2002.nc'
urllib.request.urlretrieve(url, FILE)

# Open the NetCDF file
ds1 = xr.open_dataset(FILE)

# Make a copy of the Dataset
ds2 = ds1.copy(deep=True)

# and close the original
ds1.close()

# remove the NetCDF file
os.remove(FILE)

# Read the copied dataset
ds2
```

Anything else we need to know?:

Output for xarray v0.16.0

Output for xarray v0.16.1:

```
FileNotFoundError: [Errno 2] No such file or directory: ...tos_O1_2001-2002.nc'
```

Environment:

Output of <tt>xr.show_versions()</tt> for xarray 0.16.0 INSTALLED VERSIONS ------------------ commit: None python: 3.8.5 | packaged by conda-forge | (default, Sep 16 2020, 17:19:16) [MSC v.1916 64 bit (AMD64)] python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 142 Stepping 12, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: English_United States.1252 libhdf5: 1.10.6 libnetcdf: 4.7.4 xarray: 0.16.0 pandas: 1.1.2 numpy: 1.19.1 scipy: 1.5.0 netCDF4: 1.5.4 pydap: None h5netcdf: None h5py: 2.10.0 Nio: None zarr: None cftime: 1.2.1 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: 0.9.8.4 iris: None bottleneck: None dask: None distributed: None matplotlib: 3.3.2 cartopy: 0.18.0 seaborn: None numbagg: None pint: 0.16 setuptools: 49.6.0.post20200917 pip: 20.2.3 conda: None pytest: None IPython: 7.18.1 sphinx: None
Output of <tt>xr.show_versions()</tt> for xarray 0.16.1 INSTALLED VERSIONS ------------------ commit: None python: 3.8.5 | packaged by conda-forge | (default, Sep 16 2020, 17:19:16) [MSC v.1916 64 bit (AMD64)] python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 142 Stepping 12, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: English_United States.1252 libhdf5: 1.10.6 libnetcdf: 4.7.4 xarray: 0.16.1 pandas: 1.1.2 numpy: 1.19.1 scipy: 1.5.0 netCDF4: 1.5.4 pydap: None h5netcdf: None h5py: 2.10.0 Nio: None zarr: None cftime: 1.2.1 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: 0.9.8.4 iris: None bottleneck: None dask: None distributed: None matplotlib: 3.3.2 cartopy: 0.18.0 seaborn: None numbagg: None pint: 0.16 setuptools: 49.6.0.post20200917 pip: 20.2.3 conda: None pytest: None IPython: 7.18.1 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4449/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1598266728 I_kwDOAMm_X85fQ51o 7556 broken documentation link arfriedman 76110149 closed 0     4 2023-02-24T09:37:57Z 2023-03-12T18:02:59Z 2023-03-12T18:02:59Z CONTRIBUTOR      

What is your issue?

Hi,

I found this broken link at the bottom of the Datetime Indexing subsection in the User Guide.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7556/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1468838643 I_kwDOAMm_X85XjLLz 7336 Instability when calculating standard deviation ShihengDuan 26401994 closed 0     4 2022-11-29T23:33:55Z 2023-03-10T20:32:51Z 2023-03-10T20:32:50Z NONE      

What happened?

I noticed that for some large values (not really that large) and lots of samples, data.std() yields different values than np.std(data). This seems to be related to the magnitude. See the code here:

```python
nino34_tas_picontrol_detrend = nino34_tas_picontrol - 298
std_dev = nino34_tas_picontrol_detrend.std()
print(std_dev.data)

std_dev = nino34_tas_picontrol.std()
print(std_dev.data)

nino34_tas_picontrol_detrend = nino34_tas_picontrol - 10
std_dev = nino34_tas_picontrol_detrend.std()
print(std_dev.data)
```

and the results are:

```
1.4448999166488647
24.911161422729492
20.054718017578125
```

So I guess this is related to the magnitude, but I am not sure. Has anyone seen a similar issue?
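A plausible cause, not confirmed here: with bottleneck installed (bottleneck 1.3.5 appears in the environment below), xarray may dispatch std to bottleneck's single-pass kernel, and in float32 a running E[x²] − E[x]² computation cancels catastrophically when the mean is large relative to the spread. A minimal sketch with synthetic stand-in data (the nino34 series itself is not reproduced here):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
# Synthetic stand-in: a million float32 values near 298 with spread ~1.5
da = xr.DataArray((298 + 1.5 * rng.standard_normal(1_000_000)).astype("float32"))

print(da.std().data)  # can be inaccurate if bottleneck handles the float32 reduction

with xr.set_options(use_bottleneck=False):  # fall back to numpy's pairwise summation
    print(da.std().data)

print(da.astype("float64").std().data)  # promoting the dtype also stabilizes the result
```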

What did you expect to happen?

Adding or subtracting a constant should not change the standard deviation. (A screenshot showing what the data look like accompanied the original report.)

Minimal Complete Verifiable Example

No response

MVCE confirmation

  • [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [ ] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0] python-bits: 64 OS: Linux OS-release: 3.10.0-1160.71.1.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.2 libnetcdf: 4.8.1 xarray: 2022.6.0 pandas: 1.4.4 numpy: 1.22.3 scipy: 1.8.1 netCDF4: 1.6.1 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.6.2 nc_time_axis: 1.4.1 PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: 1.3.5 dask: 2022.9.0 distributed: 2022.9.0 matplotlib: 3.5.2 cartopy: 0.21.0 seaborn: None numbagg: None fsspec: 2022.10.0 cupy: None pint: None sparse: 0.13.0 flox: None numpy_groupies: None setuptools: 65.5.0 pip: 22.2.2 conda: None pytest: None IPython: 8.6.0 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7336/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
1588461863 I_kwDOAMm_X85ergEn 7539 Concat doesn't concatenate dimension coordinates along new dims TomNicholas 35968931 open 0     4 2023-02-16T22:32:33Z 2023-02-21T19:07:48Z   MEMBER      

What is your issue?

xr.concat doesn't concatenate dimension coordinates along new dimensions, which leads to pretty unintuitive behavior.

Take this example (motivated by https://github.com/pydata/xarray/discussions/7532#discussioncomment-4988792):

```python
segments = []
for i in range(2):
    time = np.sort(np.random.random(4))
    da = xr.DataArray(
        np.random.randn(4, 2),
        dims=["time", "cols"],
        coords=dict(time=('time', time), cols=["col1", "col2"]),
    )
    segments.append(da)
```

```python
In [86]: segments
Out[86]:
[<xarray.DataArray (time: 4, cols: 2)>
 array([[-0.61199576, -0.9012078 ],
        [-0.54187577,  1.30509994],
        [-3.53720471,  0.97607797],
        [ 0.2593455 ,  0.95920031]])
 Coordinates:
   * time     (time) float64 0.1048 0.168 0.869 0.9432
   * cols     (cols) <U4 'col1' 'col2',
 <xarray.DataArray (time: 4, cols: 2)>
 array([[ 0.90266408, -0.54294821],
        [-1.09087103, -0.17484417],
        [-0.21679558, -0.57377412],
        [ 0.07570151,  0.27433728]])
 Coordinates:
   * time     (time) float64 0.03627 0.09754 0.2434 0.592
   * cols     (cols) <U4 'col1' 'col2']
```

```python
In [85]: xr.concat(segments, dim='new')
Out[85]:
<xarray.DataArray (new: 2, time: 8, cols: 2)>
array([[[        nan,         nan],
        [        nan,         nan],
        [-0.61199576, -0.9012078 ],
        [-0.54187577,  1.30509994],
        [        nan,         nan],
        [        nan,         nan],
        [-3.53720471,  0.97607797],
        [ 0.2593455 ,  0.95920031]],

       [[ 0.90266408, -0.54294821],
        [-1.09087103, -0.17484417],
        [        nan,         nan],
        [        nan,         nan],
        [-0.21679558, -0.57377412],
        [ 0.07570151,  0.27433728],
        [        nan,         nan],
        [        nan,         nan]]])
Coordinates:
  * time     (time) float64 0.03627 0.09754 0.1048 0.168 ... 0.592 0.869 0.9432
  * cols     (cols) <U4 'col1' 'col2'
Dimensions without coordinates: new
```

I would have expected to get a result of size {new: 2, time: 4, cols: 2}. That would be intuitive, because the default is coords='different', and that would be the result of concatenating the time coordinates (which have different values) and just propagating the cols coordinate (as the copies have the same values).

Instead what happened is that xr.concat treats the dimension coordinates as indexes to align, and defaults to an outer join. This auto-alignment behaviour has been discussed at length before; I'm just trying to point out another place in which it's problematic.
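For anyone who just wants the intuitively sized result, one workaround (a suggestion added here, not something proposed in the issue) is to opt out of alignment with join="override", which reuses the first object's index and simply stacks the data, at the cost of silently discarding the other segments' time values:

```python
import numpy as np
import xarray as xr

segments = []
for i in range(2):
    time = np.sort(np.random.random(4))
    da = xr.DataArray(
        np.random.randn(4, 2),
        dims=["time", "cols"],
        coords=dict(time=("time", time), cols=["col1", "col2"]),
    )
    segments.append(da)

# join="override" skips index alignment: the result keeps the first
# segment's time coordinate and concatenates along the new dimension.
result = xr.concat(segments, dim="new", join="override")
print(result.sizes)  # {'new': 2, 'time': 4, 'cols': 2}
```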

This is briefly mentioned in the concat docstring under coords='all' (“all”: All coordinate variables will be concatenated, except those corresponding to other dimensions.), but it's not even mentioned under coords='different'.

I don't really know what I would prefer to happen with the coordinates. I guess I would have wanted it to create a time coordinate of size {new: 2, time: 4, cols: 2}, but then I don't know what that implies for the underlying index. @benbovy do you have any thoughts?

At the very least we should make this a lot clearer in the docs.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7539/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
1470583016 I_kwDOAMm_X85Xp1Do 7340 xr.corr produces incorrect output for complex arrays mattragoza 7647340 closed 0     4 2022-12-01T03:00:09Z 2023-02-14T16:38:29Z 2023-02-14T16:38:29Z NONE      

What happened?

I create a DataArray full of complex numbers, and I compute the correlation of the DataArray with itself.

What did you expect to happen?

The absolute value of the correlation coefficient should be equal to 1, up to numerical precision. However, this is not the case. The returned correlation coefficient is around 0.26 and changes depending on the number of values in the array.
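For reference, |corr(x, x)| = 1 holds for complex data only if the covariance conjugates one argument, i.e. cov(x, y) = E[(x − x̄) · conj(y − ȳ)]; a formula without the conjugate breaks the identity. A minimal NumPy sketch of the conjugated definition (complex_corr is a hypothetical helper written for illustration, not xarray API):

```python
import numpy as np

def complex_corr(x, y):
    # Conjugate one argument in the covariance so corr(x, x) has modulus 1.
    xd = x - x.mean()
    yd = y - y.mean()
    cov = (xd * np.conj(yd)).mean()
    sx = np.sqrt((xd * np.conj(xd)).real.mean())
    sy = np.sqrt((yd * np.conj(yd)).real.mean())
    return cov / (sx * sy)

rng = np.random.default_rng(0)
x = rng.standard_normal(50) + 1j * rng.standard_normal(50)
print(abs(complex_corr(x, x)))  # 1.0 up to floating-point error
```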

Minimal Complete Verifiable Example

```Python
import numpy as np  # needed for np.abs / np.isclose below; missing from the original snippet
import xarray as xr

array = xr.DataArray([
    -4.21904583e-03-1.53714478e-03j, -4.24663044e-03-1.12832926e-03j,
    -4.26968892e-03-4.87451439e-04j, -6.99917538e-03+3.07376860e-04j,
    0.00000000e+00+0.00000000e+00j, -2.42585590e-02+1.42052459e-02j,
    -5.53404148e-03+4.60188062e-03j, -4.68829482e-03+4.90179019e-03j,
    -7.02331258e-03+8.75908673e-03j, -1.31233383e-01+1.86572484e-01j,
    -4.05137401e-03+6.59972035e-03j, -4.20701822e-03+7.29813816e-03j,
    -3.56487231e-03+6.51759430e-03j, -3.68077200e-03+7.04388575e-03j,
    -8.16459981e-02+1.70084145e-01j, -5.11737898e-03+1.98164995e-02j,
    6.72772914e-04-7.28110367e-05j, 2.13957504e-03-1.82525995e-03j,
    1.60369835e-03-1.54029189e-03j, 8.77788719e-02-8.45568854e-02j,
    1.04277417e-01-9.38854749e-02j, 7.58465696e-03-6.07906563e-03j,
    8.00776452e-03-5.70470615e-03j, 8.36166252e-03-5.14978313e-03j,
    0.00000000e+00+0.00000000e+00j, 0.00000000e+00+0.00000000e+00j,
    0.00000000e+00+0.00000000e+00j, 7.26422461e-03+4.40382166e-04j,
    4.01364547e-03+1.09269127e-03j, -1.99069471e-01-1.20355081e-01j,
    1.56511579e-01+2.59839758e-01j, 9.14046953e-04+5.42262898e-03j,
    -8.37800782e-04+5.67555708e-03j, -3.36561822e-03+7.50108018e-03j,
    -4.22682090e-03+5.36279242e-03j, 5.95438564e-02-3.48209841e-02j,
    -6.77184281e-03+2.10711488e-03j, -4.84293269e-03+3.78698499e-04j,
    -5.13547723e-03-6.86765713e-04j, 4.48392070e-01+1.54568226e-01j,
    -3.17412047e-01-2.35431216e-01j, -2.95731737e-03-3.39078899e-03j,
    -1.95111443e-03-3.77545168e-03j, -2.82719903e-04-1.61393513e-03j,
    7.20241467e-04-1.73515565e-03j, -1.96675563e-01-4.42259734e-02j,
    0.00000000e+00+0.00000000e+00j, 4.84813452e-03+7.60742077e-03j,
    6.31707602e-03+1.51808252e-02j, 2.99277774e-03+1.18667410e-02j,
    5.64640060e-04+1.58372118e-02j, -1.74137347e-03+1.70383706e-02j,
    -5.91398408e-03+2.30008930e-02j, -7.12027831e-03+1.87732435e-02j,
    9.30919156e-02-1.65255887e-01j, -2.09716130e-01+2.30490479e-01j,
    -1.80115101e-02+1.37248240e-02j, -1.85851718e-02+9.23420957e-03j,
    -1.88459965e-02+5.12854226e-03j, 1.09175874e+00-9.17875627e-02j,
    -1.63766142e-02-5.32431671e-03j, -1.24749963e-02-9.63714407e-03j,
    -7.58657222e-03-1.27728267e-02j, -1.99052439e-03-1.35879033e-02j,
    -5.70595470e-01+2.27742231e+00j, 1.24516564e-02-1.21867738e-02j,
    1.82174257e-02-8.67884733e-03j, 2.27204879e-02-3.77097224e-03j,
    2.66143091e-02+2.68683768e-03j, 1.06983372e+00+3.19301893e-01j,
    -6.86033738e-01-4.72910865e-01j, 3.00291320e-02+3.10297521e-02j,
    2.22880055e-02+3.45332319e-02j, 1.61724440e-02+4.04122368e-02j,
    9.78881043e-03+4.96053678e-02j, -6.51085120e-03+5.27227722e-02j,
    -1.76752380e-02+5.26095806e-02j, -3.81856382e-02+6.41735764e-02j,
    0.00000000e+00+0.00000000e+00j, -4.32481463e-02+3.88706950e-02j
])
r = np.abs(xr.corr(array, array).item())
assert np.isclose(r, 1.0), r
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```Python
The exact output I get for the self-contained example is:

AssertionError                            Traceback (most recent call last)
Cell In [44], line 46
      3 array = xr.DataArray([
      4     -4.21904583e-03-1.53714478e-03j, -4.24663044e-03-1.12832926e-03j,
      5     -4.26968892e-03-4.87451439e-04j, -6.99917538e-03+3.07376860e-04j,
   (...)
     43     0.00000000e+00+0.00000000e+00j, -4.32481463e-02+3.88706950e-02j
     44 ])
     45 r = np.abs(xr.corr(array, array).item())
---> 46 assert np.isclose(r, 1.0), r

AssertionError: 0.2664911388214005
```

Anything else we need to know?

Python 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]

Xarray version is '2022.9.0'

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0] python-bits: 64 OS: Linux OS-release: 4.18.0-193.28.1.el8_2.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: None LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.1 libnetcdf: 4.8.1 xarray: 2022.9.0 pandas: 1.5.0 numpy: 1.23.3 scipy: 1.9.1 netCDF4: 1.6.0 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.6.2 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2022.11.0 distributed: None matplotlib: 3.6.2 cartopy: None seaborn: 0.12.1 numbagg: None fsspec: 2022.11.0 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 65.4.1 pip: 22.2.2 conda: None pytest: None IPython: 8.5.0 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7340/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);