id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
1933712083,I_kwDOAMm_X85zQhrT,8289,segfault with a particular netcdf4 file,90008,open,0,,,11,2023-10-09T20:07:17Z,2024-05-03T16:54:18Z,,CONTRIBUTOR,,,,"### What happened?
The following code yields a segfault on my machine (and on many other machines with a similar environment):
```
import xarray
filename = 'tiny.nc.txt'
engine = ""netcdf4""
dataset = xarray.open_dataset(filename, engine=engine)
i = 0
for i in range(60):
    xarray.open_dataset(filename, engine=engine)
```
[tiny.nc.txt](https://github.com/pydata/xarray/files/12850060/tiny.nc.txt)
[mrc.nc.txt](https://github.com/pydata/xarray/files/12850061/mrc.nc.txt)
### What did you expect to happen?
Not to segfault.
### Minimal Complete Verifiable Example
1. Generate some netcdf4 with my application.
2. Trim the netcdf4 file down (load it, and drop all the vars I can while still reproducing this bug)
3. Try to read it.
```Python
import xarray
from tqdm import tqdm
filename = 'mrc.nc.txt'
engine = ""h5netcdf""
dataset = xarray.open_dataset(filename, engine=engine)
for i in tqdm(range(60), desc=f""filename={filename}, enine={engine}""):
xarray.open_dataset(filename, engine=engine)
engine = ""netcdf4""
dataset = xarray.open_dataset(filename, engine=engine)
for i in tqdm(range(60), desc=f""filename={filename}, enine={engine}""):
xarray.open_dataset(filename, engine=engine)
filename = 'tiny.nc.txt'
engine = ""h5netcdf""
dataset = xarray.open_dataset(filename, engine=engine)
for i in tqdm(range(60), desc=f""filename={filename}, enine={engine}""):
xarray.open_dataset(filename, engine=engine)
engine = ""netcdf4""
dataset = xarray.open_dataset(filename, engine=engine)
for i in tqdm(range(60), desc=f""filename={filename}, enine={engine}""):
xarray.open_dataset(filename, engine=engine)
```
Hand-crafting the file from start to finish does not seem to segfault:
```
import xarray
import numpy as np
engine = 'netcdf4'
dataset = xarray.Dataset()
coords = {}
coords['image_x'] = np.arange(1, dtype='int')
dataset = dataset.assign_coords(coords)
dataset['image'] = xarray.DataArray(
    np.zeros((1,), dtype='uint8'),
    dims=('image_x',),
)
# %%
dataset.to_netcdf('mrc.nc.txt')
# %%
dataset = xarray.open_dataset('mrc.nc.txt', engine=engine)
for i in range(10):
    xarray.open_dataset('mrc.nc.txt', engine=engine)
```
### MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.
### Relevant log output
```Python
i=0 passes
i=1 mostly segfaults, but sometimes it can take more than 1 iteration
```
### Anything else we need to know?
At first I thought it was deep in hdf5, but I am less convinced now.
xref: https://github.com/HDFGroup/hdf5/issues/3649
### Environment
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.12 | packaged by Ramona Optics | (main, Jun 27 2023, 02:59:09) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 6.5.1-060501-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.2
xarray: 2023.9.1.dev25+g46643bb1.d20231009
pandas: 2.1.1
numpy: 1.24.4
scipy: 1.11.3
netCDF4: 1.6.4
pydap: None
h5netcdf: 1.2.0
h5py: 3.9.0
Nio: None
zarr: 2.16.1
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.3.0
distributed: 2023.3.0
matplotlib: 3.8.0
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.9.2
cupy: None
pint: 0.22
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.2.1
conda: 23.7.4
pytest: 7.4.2
mypy: None
IPython: 8.16.1
sphinx: 7.2.6
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8289/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
2128501296,I_kwDOAMm_X85-3low,8733,A basic default ChunkManager for arrays that report their own chunks,90008,open,0,,,21,2024-02-10T14:36:55Z,2024-03-10T17:26:13Z,,CONTRIBUTOR,,,,"### Is your feature request related to a problem?
I'm creating duck arrays for various file-backed data structures of mine that are naturally ""chunked"", i.e. different parts of the array may live in completely different files.
Using these ""chunks"" and ""strides"", algorithms can better decide how to iterate in a convenient manner.
For example, an MP4 file's chunks may be delimited by I-frames, while images stored in a TIFF may be delimited by a page.
So for me, chunks are not so much for parallel computing as for computing locally and choosing the appropriate way to iterate through large arrays (TBs of uncompressed data).
### Describe the solution you'd like
I think a default chunk manager could simply implement `compute` as `np.asarray`, and act as a catch-all for chunked duck arrays that have no dedicated chunk manager.
Advanced users could then go in and reimplement their own chunk manager, but I was unable to use my duck arrays that included a `chunks` property because they weren't associated with any chunk manager.
Something as simple as:
```patch
diff --git a/xarray/core/parallelcompat.py b/xarray/core/parallelcompat.py
index c009ef48..bf500abb 100644
--- a/xarray/core/parallelcompat.py
+++ b/xarray/core/parallelcompat.py
@@ -681,3 +681,26 @@ class ChunkManagerEntrypoint(ABC, Generic[T_ChunkedArray]):
cubed.store
""""""
raise NotImplementedError()
+
+
+class DefaultChunkManager(ChunkManagerEntrypoint):
+    def __init__(self) -> None:
+        self.array_cls = None
+
+    def is_chunked_array(self, data: Any) -> bool:
+        return is_duck_array(data) and hasattr(data, ""chunks"")
+
+    def chunks(self, data: T_ChunkedArray) -> T_NormalizedChunks:
+        return data.chunks
+
+    def compute(self, *data: T_ChunkedArray | Any, **kwargs) -> tuple[np.ndarray, ...]:
+        return tuple(np.asarray(d) for d in data)
+
+    def normalize_chunks(self, *args, **kwargs):
+        raise NotImplementedError()
+
+    def from_array(self, *args, **kwargs):
+        raise NotImplementedError()
+
+    def apply_gufunc(self, *args, **kwargs):
+        raise NotImplementedError()
```
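For context, the kind of duck array this is aimed at looks roughly like this (a toy sketch with hypothetical names):

```python
import numpy as np


class FileBackedArray:
    # A toy duck array whose data notionally lives in several files,
    # so it reports natural chunks without being a dask array.
    def __init__(self, shape, dtype, chunks):
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self.chunks = chunks  # same normalized-chunks layout that dask reports
        self.ndim = len(shape)

    def __array__(self, dtype=None):
        # In reality this would read from the backing files; here we just
        # materialize zeros so the sketch is self-contained.
        return np.zeros(self.shape, dtype=dtype or self.dtype)
```

With a catch-all manager along the lines of the patch above, arrays like this could be wrapped without having to ship a dedicated chunk manager entry point.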
### Describe alternatives you've considered
I created my own chunk manager, with my own chunk manager entry point.
Kinda tedious...
### Additional context
It seems that this is related to: https://github.com/pydata/xarray/pull/7019
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8733/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1429172192,I_kwDOAMm_X85VL2_g,7239,include/exclude lists in Dataset.expand_dims,90008,closed,0,,,6,2022-10-31T03:01:52Z,2023-11-05T06:29:06Z,2023-11-05T06:29:06Z,CONTRIBUTOR,,,,"### Is your feature request related to a problem?
I would like to be able to expand the dimensions of a dataset, but most of the time, I only want to expand the datasets of a few key variables.
It would be nice if there were some kind of filter mechanism.
### Describe the solution you'd like
```python
import xarray as xr
dataset = xr.Dataset(data_vars={'foo': 1, 'bar': 2})
dataset.expand_dims(""zar"", include_variables=[""foo""])
# Only foo is expanded, bar is left alone.
```
### Describe alternatives you've considered
Writing my own function. I'll probably do this (see the sketch below).
Subclassing. Too confusing, and too easy to ""diverge"" from you all when you do decide to implement this.
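A minimal sketch of that first alternative, using only the public API (my own workaround, not a proposed implementation):

```python
import xarray as xr


def expand_dims_subset(dataset, dim, include_variables):
    # Expand `dim` only on the listed data variables; leave the rest untouched.
    expanded = {
        name: (var.expand_dims(dim) if name in include_variables else var)
        for name, var in dataset.data_vars.items()
    }
    return xr.Dataset(expanded, coords=dataset.coords, attrs=dataset.attrs)


dataset = xr.Dataset(data_vars={'foo': 1, 'bar': 2})
expand_dims_subset(dataset, 'zar', ['foo'])
```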
### Additional context
For large datasets, you likely just want some key parameters expanded, and not all parameters expanded.
xarray version: 2022.10.0","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7239/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
1152047670,I_kwDOAMm_X85Eqto2,6309,Read/Write performance optimizations for netcdf files,90008,open,0,,,5,2022-02-26T17:40:40Z,2023-09-13T08:27:47Z,,CONTRIBUTOR,,,,"### What happened?
I'm not too sure this is a bug report, but I figured I would share some of the investigation I've done on the topic of writing large datasets to netCDF.
For clarity, the use case I'm considering is writing a large in-memory array to persistent storage on Linux.
* Array size: 4-100 GB
* File format: netCDF
* Dask: no
* Hardware: a very modern SSD that can write more than 1 GB/s (a Sabrent Rocket 4 Plus, for example)
* Operating system: Linux
* xarray version: 0.21.1 (or thereabouts)
The symptoms are twofold:
1. The write speed is slow. About 1GB/s, much less than the 2-3 GB/s you can get with other means.
2. The Linux disk cache just keeps filling up.
It's quite hard to get good performance from these systems, so I'm going to put a few more constraints on the type of data we are writing:
1. The underlying numpy array must be aligned to the Linux page boundary of 4096 bytes.
2. The underlying numpy array must have been pre-faulted and not swapped. (Do not use `np.zeros`; it doesn't fault the memory.)
I feel like these two constraints are rather easy to satisfy, as I'll show in my example.
### What did you expect to happen?
I want to be able to write at 3.2GB/s with my shiny new SSD.
I want to leave my RAM unused when I'm archiving to disk.
### Minimal Complete Verifiable Example
```Python
import numpy as np
import xarray as xr
def empty_aligned(shape, dtype=np.float64, align=4096):
    if not isinstance(shape, tuple):
        shape = (shape,)
    dtype = np.dtype(dtype)
    size = dtype.itemsize
    # Compute the final size of the array
    for s in shape:
        size *= s
    a = np.empty(size + (align - 1), dtype=np.uint8)
    data_align = a.ctypes.data % align
    offset = 0 if data_align == 0 else (align - data_align)
    arr = a[offset:offset + size].view(dtype)
    # Don't use reshape since reshape might copy the data.
    # This is the suggested way to assign a new shape with a guarantee
    # that the data won't be copied.
    arr.shape = shape
    return arr


dataset = xr.DataArray(
    empty_aligned((4, 1024, 1024, 1024), dtype='uint8'),
    name='mydata',
).to_dataset()

# Fault and write data to this dataset
dataset['mydata'].data[...] = 1
%time dataset.to_netcdf(""test"", engine='h5netcdf')
%time dataset.to_netcdf(""test"", engine='netcdf4')
```
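As a quick sanity check on constraint 1, the buffer returned by `empty_aligned` really does start on a page boundary:

```Python
arr = empty_aligned((1024, 1024), dtype='uint8')
assert arr.ctypes.data % 4096 == 0  # start address is page-aligned
```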
### Relevant log output
Both take about 3.5 s, equivalent to just about 1 GB/s.
To get to about 3 GB/s (about 1.27 s to write a 4 GB array), one needs to do a few things:
1. You must align the underlying data to disk.
* h5netcdf (h5py) backend https://github.com/h5py/h5py/pull/2040
* netcdf4: https://github.com/Unidata/netcdf-c/pull/2206
2. You must use a driver that bypasses the operating system cache
* https://github.com/h5py/h5py/pull/2041
For the h5netcdf backend, you would have to add the following kwargs to the h5netcdf constructor:
```
kwargs = {
    ""invalid_netcdf"": invalid_netcdf,
    ""phony_dims"": phony_dims,
    ""decode_vlen_strings"": decode_vlen_strings,
    ""alignment_threshold"": alignment_threshold,
    ""alignment_interval"": alignment_interval,
}
```
### Anything else we need to know?
The main challenge is that while writing aligned data this way is REALLY fast, writing small chunks and unaligned data becomes REALLY slow.
Personally, I think someone might be able to write a new HDF5 driver that optimizes this better; I feel like this could help people loading large datasets, which seems to be a large part of the xarray user community.
### Environment
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.9 (main, Dec 29 2021, 07:47:36)
[GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.13.0-30-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 0.21.1
pandas: 1.4.0
numpy: 1.22.2
scipy: 1.8.0
netCDF4: 1.5.8
pydap: None
h5netcdf: 0.13.1
h5py: 3.6.0.post1
Nio: None
zarr: None
cftime: 1.5.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.01.1
distributed: None
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.01.0
cupy: None
pint: None
sparse: None
setuptools: 60.8.1
pip: 22.0.3
conda: None
pytest: None
IPython: 8.0.1
sphinx: None
```
h5py includes some additions of mine that allow you to use the DIRECT driver and I am using a version of HDF5 that is built with the DIRECT driver.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6309/reactions"", ""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 1}",,,13221727,issue
1773296009,I_kwDOAMm_X85pslmJ,7940,decide on how to handle `empty_like`,90008,open,0,,,8,2023-06-25T13:48:46Z,2023-07-05T16:36:35Z,,CONTRIBUTOR,,,,"### Is your feature request related to a problem?
Calling `np.empty_like` on a `DataArray` seems to instantiate the whole array.
```python
from xarray.tests import InaccessibleArray
import xarray as xr
import numpy as np
array = InaccessibleArray(np.zeros((3, 3), dtype=""uint8""))
da = xr.DataArray(array, dims=[""x"", ""y""])
np.empty_like(da)
```
```python
Traceback (most recent call last):
File ""/home/mark/t.py"", line 8, in
np.empty_like(da)
File ""/home/mark/mambaforge/envs/dev/lib/python3.9/site-packages/xarray/core/common.py"", line 165, in __array__
return np.asarray(self.values, dtype=dtype)
File ""/home/mark/mambaforge/envs/dev/lib/python3.9/site-packages/xarray/core/dataarray.py"", line 732, in values
return self.variable.values
File ""/home/mark/mambaforge/envs/dev/lib/python3.9/site-packages/xarray/core/variable.py"", line 614, in values
return _as_array_or_item(self._data)
File ""/home/mark/mambaforge/envs/dev/lib/python3.9/site-packages/xarray/core/variable.py"", line 314, in _as_array_or_item
data = np.asarray(data)
File ""/home/mark/mambaforge/envs/dev/lib/python3.9/site-packages/xarray/tests/__init__.py"", line 151, in __array__
raise UnexpectedDataAccess(""Tried accessing data"")
xarray.tests.UnexpectedDataAccess: Tried accessing data
```
### Describe the solution you'd like
I'm not too sure. This is why I raised this as a ""feature"" and not a bug.
On one hand, it is pretty hard to ""get"" the underlying class.
Is it a:
* numpy array
* a lazy thing that looks like a numpy array?
* a dask array when it is dask?
I think that there are also some nuances between:
1. Loading a dataset from an nc file (where things might be handled by dask even though you don't want them to be).
2. Creating your xarray objects from in-memory data.
### Describe alternatives you've considered
For now, I'm trying to avoid `empty_like` or `zeros_like` (see the sketch below).
In general, we haven't seen much benefit from dask and cuda still needs careful memory management.
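What I mean by avoiding them is roughly the following: build the new array from metadata only, so the wrapped data is never touched (my own workaround sketch, not something xarray provides):

```python
import numpy as np
import xarray as xr


def empty_like_lazy(da: xr.DataArray) -> xr.DataArray:
    # Allocate from shape/dtype alone; never touches da's underlying data.
    return xr.DataArray(
        np.empty(da.shape, dtype=da.dtype),
        dims=da.dims,
        coords=da.coords,
        attrs=da.attrs,
    )
```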
### Additional context
_No response_","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7940/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1428549868,I_kwDOAMm_X85VJfDs,7237,The new NON_NANOSECOND_WARNING is not very nice to end users,90008,closed,0,,,5,2022-10-30T01:56:59Z,2023-05-09T12:52:54Z,2022-11-04T20:13:20Z,CONTRIBUTOR,,,,"### What is your issue?
The new nanosecond warning doesn't really point anybody to where they should change their code.
Nor does it really tell them how to fix it.
```
import xarray as xr
import numpy as np
xr.DataArray(np.zeros(1, dtype='datetime64[us]'))
```
yields
```
xarray/core/variable.py:194: UserWarning: Converting non-nanosecond precision datetime values to nanosecond precision. This behavior can eventually be relaxed in xarray, as it is an artifact from pandas which is now beginning to support non-nanosecond precision values.
```
https://github.com/pydata/xarray/blob/f32d354e295c05fb5c5ece7862f77f19d82d5894/xarray/core/variable.py#L194
I think that, at the very least, the `stacklevel` should be specified when calling the `warn` function.
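For illustration, something along these lines (the exact `stacklevel` xarray needs depends on its internal call depth, so treat the `2` as a placeholder):

```python
import warnings

warnings.warn(
    'Converting non-nanosecond precision datetime values to nanosecond precision.',
    UserWarning,
    stacklevel=2,  # point the warning at the caller's code rather than variable.py
)
```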
It isn't really pretty, but I've been passing a parameter down the call chain when I expect a warning to propagate up to the end user;
e.g. https://github.com/vispy/vispy/pull/2405.
However, others have not liked that approach.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7237/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
1306457778,I_kwDOAMm_X85N3vay,6791,get_data or get_variable method,90008,closed,0,,,3,2022-07-15T20:24:31Z,2023-04-29T03:40:01Z,2023-04-29T03:40:01Z,CONTRIBUTOR,,,,"### Is your feature request related to a problem?
I often store a few scalars or arrays in xarray containers.
However, when I want to optionally access their data, the code I have to run looks like this:
```python
import numpy as np
import xarray as xr

dataset = xr.Dataset()
my_variable = dataset.get('my_variable', None)
if my_variable is not None:
    my_variable = my_variable.data
else:
    my_variable = np.asarray(1.0)  # the default value I actually want
```
### Describe the solution you'd like
```python
import numpy as np
import xarray as xr
dataset = xr.Dataset()
my_variable = dataset.get_data('my_variable', np.asarray(1.0))
```
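A possible implementation is tiny (a sketch of the idea only; `get_data` is the hypothetical method being requested):

```python
def get_data(self, name, default=None):
    # Return the variable's underlying data, or `default` if the variable is absent.
    variable = self.variables.get(name)
    return default if variable is None else variable.data
```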
### Describe alternatives you've considered
_No response_
### Additional context
Thank you!","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6791/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
1675299031,I_kwDOAMm_X85j2wjX,7770,Provide a public API for adding new backends,90008,closed,0,,,3,2023-04-19T17:06:24Z,2023-04-20T00:15:23Z,2023-04-20T00:15:23Z,CONTRIBUTOR,,,,"### Is your feature request related to a problem?
I understand that this is a double-edged sword, but we were relying on `BACKEND_ENTRYPOINTS` being a dictionary mapping names to classes, and that broke in
https://github.com/pydata/xarray/pull/7523
### Describe the solution you'd like
Some agreed-upon way to create and register a new backend. This would allow users to provide custom parameters for file creation and other options that are currently not exposed via xarray.
I've used this to overwrite some parameters like netcdf global variables.
I've also used this to add `alignment_threshold` and `alignment_interval` to h5netcdf.
I did it through a custom backend because it felt like a contentious feature at the time. (I really do think it helps performance).
### Describe alternatives you've considered
A deprecation cycle in the future???
Maybe this could have been achieved with the definition of `RELOADABLE_BACKEND_ENTRYPOINTS`, leaving the signature of `BACKEND_ENTRYPOINTS` unchanged.
### Additional context
We used this to define the alignment within a file. netcdf4 exposed this as a global variable so we have to somewhat hack around it just before creation time.
I mean, you can probably say:
""Doing this is too complicated, we don't want to give any guarantees on this front.""
I would agree with you.....","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7770/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
1432388736,I_kwDOAMm_X85VYISA,7245,coordinates not removed for variable encoding during reset_coords,90008,open,0,,,5,2022-11-02T02:46:56Z,2023-01-15T16:23:15Z,,CONTRIBUTOR,,,,"### What happened?
When calling `reset_coords` on a dataset that is loaded from disk, the coordinates are not removed from the encoding of the variable.
This means that at save time they will be re-saved as coordinates, which is annoying (and erroneous).
### What did you expect to happen?
_No response_
### Minimal Complete Verifiable Example
```Python
import numpy as np
import xarray as xr

dataset = xr.Dataset(
    data_vars={'images': (('y', 'x'), np.zeros((10, 2)))},
    coords={'zar': 1},
)
dataset.to_netcdf('foo.nc', mode='w')
# %%
foo_loaded = xr.open_dataset('foo.nc')
foo_loaded_reset = foo_loaded.reset_coords()
# %%
assert 'zar' in foo_loaded.coords
assert 'zar' not in foo_loaded_reset.coords
assert 'zar' in foo_loaded_reset.data_vars
foo_loaded_reset.to_netcdf('bar.nc', mode='w')
# %% Now load the dataset
bar_loaded = xr.open_dataset('bar.nc')
assert 'zar' not in bar_loaded.coords, 'zar is erroneously a coordinate'
# %%
# This is the problem
assert 'zar' not in foo_loaded_reset.images.encoding['coordinates'].split(' '), ""zar should not be in here""
```
### MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or [Binder notebook](https://mybinder.org/v2/gh/pydata/xarray/main?urlpath=lab/tree/doc/examples/blank_template.ipynb), returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
### Relevant log output
_No response_
### Anything else we need to know?
```
for _, variable in obj._variables.items():
    coords_in_encoding = set(variable.encoding.get('coordinates', ' ').split(' '))
    variable.encoding['coordinates'] = ' '.join(coords_in_encoding - set(names))
```
Suggested fix in `Dataset.reset_coords` (`dataset.py`):
https://github.com/pydata/xarray/blob/513ee34f16cc8f9250a72952e33bf9b4c95d33d1/xarray/core/dataset.py#L1734
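A slightly more defensive variant of the same idea (still just a sketch, untested):

```python
for variable in obj._variables.values():
    coords_in_encoding = set(variable.encoding.get('coordinates', '').split())
    remaining = coords_in_encoding - set(names)
    if remaining:
        variable.encoding['coordinates'] = ' '.join(sorted(remaining))
    else:
        # Drop the key entirely so stale coordinate hints never reach the writer.
        variable.encoding.pop('coordinates', None)
```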
### Environment
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.13 | packaged by Ramona Optics | (main, Aug 31 2022, 22:30:30)
[GCC 10.4.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-50-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1
xarray: 2022.10.0
pandas: 1.5.1
numpy: 1.23.4
scipy: 1.9.3
netCDF4: 1.6.1
pydap: None
h5netcdf: 1.0.2
h5py: 3.7.0
Nio: None
zarr: 2.13.3
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.10.0
distributed: 2022.10.0
matplotlib: 3.6.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.10.0
cupy: None
pint: 0.20.1
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.0
pip: 22.3
conda: 22.9.0
pytest: 7.2.0
IPython: 7.33.0
sphinx: 5.3.0
/home/mark/mambaforge/envs/mcam_dev/lib/python3.9/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn(""Setuptools is replacing distutils."")
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7245/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1524642393,I_kwDOAMm_X85a4DJZ,7428,Avoid instantiating the data in prepare_variable,90008,open,0,,,0,2023-01-08T19:18:49Z,2023-01-09T06:25:52Z,,CONTRIBUTOR,,,,"### Is your feature request related to a problem?
I'm trying to extend the features of xarray for a new backend I'm developing internally. The main use case is that we are trying to open a dataset of multiple hundreds of GB, slice out a smaller dataset (tens of GB), and write it.
However, when we try to use functions like `prepare_variable`, the way they are currently written, they implicitly instantiate the whole data (potentially tens of GB), which incurs a huge ""time cost"" at a surprising (to me) point in the code.
https://github.com/pydata/xarray/blob/6e77f5e8942206b3e0ab08c3621ade1499d8235b/xarray/backends/h5netcdf_.py#L338
### Describe the solution you'd like
Would it be possible to just remove the second return value from `prepare_variable`? It isn't particularly ""useful"", and it is easy to obtain from the inputs to the function.
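Roughly, the change would look like this at the call site (a sketch of the idea, not the actual code):

```python
# current: prepare_variable returns (target, source), where source is
# variable.data and therefore forces lazy data to materialize
target, source = store.prepare_variable(name, variable, check, unlimited_dims)

# proposed: return only the on-disk target; the caller already holds `variable`
target = store.prepare_variable(name, variable, check, unlimited_dims)
source = variable.data  # only materialized if/when the caller actually needs it
```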
### Describe alternatives you've considered
I'm probably going to create a new method, with a not-so-well-chosen name like `prepare_variable_no_data`, that does the above, but only for my backend. The code path that needs this only uses our custom backend.
### Additional context
I think this would be useful in general for other users that need more out-of-memory computation. I've found that you really have to ""buy into"" dask all the way to the end if you want to see any benefits. As such, if somebody used a dask array, this would create a serial choke point in:
https://github.com/pydata/xarray/blob/6e77f5e8942206b3e0ab08c3621ade1499d8235b/xarray/backends/common.py#L308","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7428/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1423948375,I_kwDOAMm_X85U37pX,7224,Insertion speed of new dataset elements,90008,open,0,,,3,2022-10-26T12:34:51Z,2022-10-29T22:39:39Z,,CONTRIBUTOR,,,,"### What is your issue?
In https://github.com/pydata/xarray/pull/7221 I showed that a major contributor to the slowdown in inserting a new element was the cost associated with an internal-only debugging assert statement.
The benchmark results in #7221 and #7222 are pretty useful to look at.
Thank you for encouraging the creation of a ""benchmark"" so that we can monitor the performance of element insertion.
Unfortunately, that was the only ""free"" lunch I got.
A few other minor improvements can be obtained with:
https://github.com/pydata/xarray/pull/7222
However, it seems to me that the fundamental reason this is ""slow"" is that element insertion is not so much ""insertion"" as it is:
* Dataset Merge
* Dataset Replacement of the internal methods.
This is really solidified in https://github.com/pydata/xarray/blob/main/xarray/core/dataset.py#L4918
In my benchmarks, I found that in the limit of large datasets, list comprehensions over 1000 elements or more were often used to ""search"" for variables that were ""indexed"": https://github.com/pydata/xarray/blob/ca57e5cd984e626487636628b1d34dca85cc2e7c/xarray/core/merge.py#L267
I think a few speedups can be obtained by avoiding these kinds of ""searches"" and list comprehensions. However, I think that the dataset would have to provide this kind of information to the `merge_core` routine, instead of `merge_core` recreating it all the time.
Ultimately, I think you trade off the ""memory footprint"" of a dataset (due to the potential increase in data structures you keep around) against ""speed"".
Anyway, I just wanted to share where I got.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7224/reactions"", ""total_count"": 2, ""+1"": 2, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1098915891,I_kwDOAMm_X85BgCAz,6153,[FEATURE]: to_netcdf and additional keyword arguments,90008,open,0,,,2,2022-01-11T09:39:35Z,2022-01-20T06:54:25Z,,CONTRIBUTOR,,,,"### Is your feature request related to a problem?
I briefly looked for an existing issue about this but couldn't find one.
I'm hoping to be able to pass additional keyword arguments to the engine when using `to_netcdf`. https://xarray.pydata.org/en/stable/generated/xarray.open_dataset.html
However, it doesn't seem too easy to do so.
Similar to how `open_dataset` has an additional `**kwargs` parameter, would it be reasonable to add a similar parameter, maybe `engine_kwargs` to the `to_netcdf` to allow users to pass additional parameters to the engine?
### Describe the solution you'd like
```python
import xarray as xr
import numpy as np
dataset = xr.DataArray(
data=np.zeros(3),
name=""hello""
).to_dataset()
dataset.to_netcdf(""my_file.nc"", engine=""h5netcdf"", engine_kwargs={""decode_vlen_strings=True""})
```
### Describe alternatives you've considered
One could forward the additional keyword arguments with `**kwargs`. I just feel like this makes things less ""explicit"".
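For comparison, the `**kwargs` route would look roughly like this (hypothetical signature, only to illustrate why it feels less explicit):

```python
def to_netcdf(self, path=None, mode='w', *, engine=None, **kwargs):
    # anything xarray does not consume itself is forwarded to the backend
    # file constructor, e.g. h5netcdf.File(path, mode, **kwargs)
    ...
```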
### Additional context
h5netcdf emits a warning that is hard to disable without passing a keyword argument to the constructor.
https://github.com/h5netcdf/h5netcdf/issues/132
Also, for performance reasons, it might be very good to tune things like the storage data alignment.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6153/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
347962055,MDU6SXNzdWUzNDc5NjIwNTU=,2347,Serialization of just coordinates,90008,closed,0,,,6,2018-08-06T15:03:29Z,2022-01-09T04:28:49Z,2022-01-09T04:28:49Z,CONTRIBUTOR,,,,"In the search for the perfect data storage mechanism, I find myself needing to store some of the images I am generating the metadata seperately. It is really useful for me to serialize just the coordinates of my DataArray.
My serialization method of choice is json since it allows me to read the metadata with just a text editor. For that, having the coordinates as a self contained dictionary is really important.
Currently, I convert just the coordinates to a [dataset](http://xarray.pydata.org/en/stable/data-structures.html#coordinates-methods), and serialize that. The code looks something like this:
```python
import xarray as xr
import numpy as np
# Setup an array with coordinates
n = np.zeros(3)
coords={'x': np.arange(3)}
m = xr.DataArray(n, dims=['x'], coords=coords)
coords_dataset_dict = m.coords.to_dataset().to_dict()
coords_dict = coords_dataset_dict['coords']
# Read/Write dictionary to JSON file
# This works, but I'm essentially creating an empty dataset for it
coords_set = xr.Dataset.from_dict(coords_dataset_dict)
coords2 = coords_set.coords # so many `coords` :D
m2 = xr.DataArray(np.zeros(shape=m.shape), dims=m.dims, coords=coords2)
```
Would encapsulating this functionality in the `Coordinates` class be accepted as a PR?
It would add 2 functions that would look like:
```python
def to_dict(self):
    # offload the heavy lifting to the Dataset class
    return self.to_dataset().to_dict()['coords']

def from_dict(self, d):
    # Offload the heavy lifting again to the Dataset class
    d_dataset = {'dims': [], 'attrs': [], 'coords': d}
    return Dataset.from_dict(d_dataset).coords
```
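Usage would then look roughly like this (hypothetical API; in practice `from_dict` would presumably end up as a classmethod or a standalone helper):

```python
coords_dict = m.coords.to_dict()      # plain dict, ready for json.dump
# ... write coords_dict to a JSON file and read it back ...
coords2 = xr.Coordinates.from_dict(coords_dict)   # hypothetical
m2 = xr.DataArray(np.zeros(m.shape), dims=m.dims, coords=coords2)
```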
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2347/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
689390592,MDU6SXNzdWU2ODkzOTA1OTI=,4394,Is it possible to append_dim to netcdf stores,90008,closed,0,,,2,2020-08-31T18:02:46Z,2020-08-31T22:11:10Z,2020-08-31T22:11:09Z,CONTRIBUTOR,,,,"
**Is your feature request related to a problem? Please describe.**
Feature request: It seems that it should be possible to append to netcdf4 stores along the unlimited dimensions. Is there an example of this?
**Describe the solution you'd like**
I would like the following code to be valid:
```python
from xarray.tests.test_dataset import create_append_test_data
ds, ds_to_append, ds_with_new_var = create_append_test_data()
filename = 'test_dataset.nc'
# Choose any one of
# engine : {'netcdf4', 'scipy', 'h5netcdf'}
engine = 'netcdf4'
ds.to_netcdf(filename, mode='w', unlimited_dims=['time'], engine=engine)
ds_to_append.to_netcdf(filename, mode='a', unlimited_dims=['time'], engine=engine)
```
**Describe alternatives you've considered**
I guess you could use zarr, but the fact that it creates multiple files is a problem.
**Additional context**
xarray version: 0.16.0
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4394/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
573577844,MDU6SXNzdWU1NzM1Nzc4NDQ=,3815,Opening from zarr.ZipStore fails to read (store???) unicode characters,90008,open,0,,,20,2020-03-01T16:49:25Z,2020-03-26T04:22:29Z,,CONTRIBUTOR,,,,"See upstream: https://github.com/zarr-developers/zarr-python/issues/551
It seems that using a ZipStore creates 1 byte objects for Unicode string attributes.
For example, saving the same Dataset with a DirectoryStore and a ZipStore creates an attribute for a unicode array that is 20 bytes in size in the first, and 1 byte in size in the second.
In fact, Ubuntu's File Roller isn't even allowing me to extract the files.
I have a feeling it is due to the note in the zarr documentation
> Note that Zip files do not provide any way to remove or replace existing entries.
https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.ZipStore
#### MCVE Code Sample
ZipStore
```python
import xarray as xr
import zarr
x = xr.Dataset()
x['hello'] = 'world'
x
with zarr.ZipStore('test_store.zip', mode='w') as store:
    x.to_zarr(store)
with zarr.ZipStore('test_store.zip', mode='r') as store:
    x_read = xr.open_zarr(store).compute()
```
Issued error
```python
---------------------------------------------------------------------------
BadZipFile Traceback (most recent call last)
in
7 x.to_zarr(store)
8 with zarr.ZipStore('test_store.zip', mode='r') as store:
----> 9 x_read = xr.open_zarr(store).compute()
~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/core/dataset.py in compute(self, **kwargs)
803 """"""
804 new = self.copy(deep=False)
--> 805 return new.load(**kwargs)
806
807 def _persist_inplace(self, **kwargs) -> ""Dataset"":
~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/core/dataset.py in load(self, **kwargs)
655 for k, v in self.variables.items():
656 if k not in lazy_data:
--> 657 v.load()
658
659 return self
~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/core/variable.py in load(self, **kwargs)
370 self._data = as_compatible_data(self._data.compute(**kwargs))
371 elif not hasattr(self._data, ""__array_function__""):
--> 372 self._data = np.asarray(self._data)
373 return self
374
~/miniconda3/envs/dev/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """"""
---> 85 return array(a, dtype, copy=False, order=order)
86
87
~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/core/indexing.py in __array__(self, dtype)
545 def __array__(self, dtype=None):
546 array = as_indexable(self.array)
--> 547 return np.asarray(array[self.key], dtype=None)
548
549 def transpose(self, order):
~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/backends/zarr.py in __getitem__(self, key)
46 array = self.get_array()
47 if isinstance(key, indexing.BasicIndexer):
---> 48 return array[key.tuple]
49 elif isinstance(key, indexing.VectorizedIndexer):
50 return array.vindex[
~/miniconda3/envs/dev/lib/python3.7/site-packages/zarr/core.py in __getitem__(self, selection)
570
571 fields, selection = pop_fields(selection)
--> 572 return self.get_basic_selection(selection, fields=fields)
573
574 def get_basic_selection(self, selection=Ellipsis, out=None, fields=None):
~/miniconda3/envs/dev/lib/python3.7/site-packages/zarr/core.py in get_basic_selection(self, selection, out, fields)
693 if self._shape == ():
694 return self._get_basic_selection_zd(selection=selection, out=out,
--> 695 fields=fields)
696 else:
697 return self._get_basic_selection_nd(selection=selection, out=out,
~/miniconda3/envs/dev/lib/python3.7/site-packages/zarr/core.py in _get_basic_selection_zd(self, selection, out, fields)
709 # obtain encoded data for chunk
710 ckey = self._chunk_key((0,))
--> 711 cdata = self.chunk_store[ckey]
712
713 except KeyError:
~/miniconda3/envs/dev/lib/python3.7/site-packages/zarr/storage.py in __getitem__(self, key)
1249 with self.mutex:
1250 with self.zf.open(key) as f: # will raise KeyError
-> 1251 return f.read()
1252
1253 def __setitem__(self, key, value):
~/miniconda3/envs/dev/lib/python3.7/zipfile.py in read(self, n)
914 self._offset = 0
915 while not self._eof:
--> 916 buf += self._read1(self.MAX_N)
917 return buf
918
~/miniconda3/envs/dev/lib/python3.7/zipfile.py in _read1(self, n)
1018 if self._left <= 0:
1019 self._eof = True
-> 1020 self._update_crc(data)
1021 return data
1022
~/miniconda3/envs/dev/lib/python3.7/zipfile.py in _update_crc(self, newdata)
946 # Check the CRC if we're at the end of the file
947 if self._eof and self._running_crc != self._expected_crc:
--> 948 raise BadZipFile(""Bad CRC-32 for file %r"" % self.name)
949
950 def read1(self, n):
BadZipFile: Bad CRC-32 for file 'hello/0'
```
Working Directory Store example
```python
import xarray as xr
import zarr
x = xr.Dataset()
x['hello'] = 'world'
x
store = zarr.DirectoryStore('test_store2.zarr')
x.to_zarr(store)
x_read = xr.open_zarr(store)
x_read.compute()
assert x_read.hello == x.hello
```
#### Expected Output
The string metadata should round-trip and be readable from the ZipStore.
#### Output of ``xr.show_versions()``
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:33:48)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-40-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8
libhdf5: None
libnetcdf: None
xarray: 0.14.1
pandas: 1.0.0
numpy: 1.17.5
scipy: 1.4.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.10.1
distributed: 2.10.0
matplotlib: 3.1.3
cartopy: None
seaborn: None
numbagg: None
setuptools: 45.1.0.post20200119
pip: 20.0.2
conda: None
pytest: 5.3.1
IPython: 7.12.0
sphinx: 2.3.1
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3815/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
335608017,MDU6SXNzdWUzMzU2MDgwMTc=,2251,netcdf roundtrip fails to preserve the shape of numpy arrays in attributes,90008,closed,0,,,5,2018-06-25T23:52:07Z,2018-08-29T16:06:29Z,2018-08-29T16:06:28Z,CONTRIBUTOR,,,,"#### Code Sample
```python
import numpy as np
import xarray as xr
a = xr.DataArray(np.zeros((3, 3)), dims=('y', 'x'))
a.attrs['my_array'] = np.arange(6, dtype='uint8').reshape(2, 3)
a.to_netcdf('a.nc')
b = xr.open_dataarray('a.nc')
b.load()
assert np.all(b == a)
print('all arrays equal')
assert b.dtype == a.dtype
print('dtypes equal')
print(a.my_array.shape)
print(b.my_array.shape)
assert a.my_array.shape == b.my_array.shape
```
#### Problem description
I have some metadata that is in the form of numpy arrays.
I would think that it should round trip with netcdf.
#### Expected Output
equal shapes inside the metadata
#### Output of ``xr.show_versions()``
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.16.15-300.fc28.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
xarray: 0.10.7
pandas: 0.23.0
numpy: 1.14.4
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.5
distributed: 1.21.8
matplotlib: 2.2.2
cartopy: None
seaborn: None
setuptools: 39.2.0
pip: 9.0.3
conda: None
pytest: 3.6.1
IPython: 6.4.0
sphinx: 1.7.5
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2251/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
347558405,MDU6SXNzdWUzNDc1NTg0MDU=,2340,expand_dims erases named dim in the array's coordinates,90008,closed,0,,,5,2018-08-03T23:00:07Z,2018-08-05T01:15:49Z,2018-08-04T03:39:49Z,CONTRIBUTOR,,,,"#### Code Sample, a copy-pastable example if possible
```python
# %%
import xarray as xa
import numpy as np
n = np.zeros((3, 2))
data = xa.DataArray(n, dims=['y', 'x'], coords={'y':range(3), 'x':range(2)})
data = data.assign_coords(z=xa.DataArray(np.arange(6).reshape((3, 2)),
dims=['y', 'x']))
print('Original Data')
print('=============')
print(data)
# %%
my_slice = data[0, 1]
print(""Sliced data"")
print(""==========="")
print(""z coordinate remembers it's own x value"")
print(f'x = {my_slice.z.x}')
# %%
expanded_slice = data[0, 1].expand_dims('x')
print(""expanded slice"")
print(""=============="")
print(""forgot that 'z' had 'x' coordinates"")
print(""but remembered it had a 'y' coordinate"")
print(f""z = {expanded_slice.z}"")
print(expanded_slice.z.x)
```
Output:
```
Original Data
=============
array([[0., 0.],
[0., 0.],
[0., 0.]])
Coordinates:
* y (y) int32 0 1 2
* x (x) int32 0 1
z (y, x) int32 0 1 2 3 4 5
Sliced data
===========
z coordinate remembers its own x value
x =
array(1)
Coordinates:
y int32 0
x int32 1
z int32 1
expanded slice
==============
forgot that 'z' had 'x' coordinates
but remembered it had a 'y' coordinate
z =
array(1)
Coordinates:
y int32 0
z int32 1
AttributeError: 'DataArray' object has no attribute 'x'
```
#### Problem description
The coordinate used to have an explicit dimension.
When we expanded the dimension, that information should not be erased.
Note that information about other coordinates is maintained.
#### The challenge
The coordinates probably have fewer dimensions than the original data. I'm not sure about xarray's model, but a few challenges come to mind:
1. is the relative order of dimensions maintained between data in the same dataset/dataarray?
2. Can coordinates have MORE dimensions than the array itself?
The answers to these two questions might make or break a solution. If not, then this becomes a very difficult problem to solve, since we don't know where to insert this new dimension in the coordinate array.
#### Output of ``xr.show_versions()``
xa.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None
xarray: 0.10.7
pandas: 0.23.1
numpy: 1.14.3
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.1
distributed: 1.22.0
matplotlib: 2.2.2
cartopy: None
seaborn: None
setuptools: 39.2.0
pip: 9.0.3
conda: None
pytest: 3.7.1
IPython: 6.4.0
sphinx: 1.7.5
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2340/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue