github: issues: 10 rows where state = "open" and user = 90008 sorted by updated

10 rows where state = "open" and user = 90008 sorted by updated_at descending

id	node_id	number	title	user	state	comments	created_at	updated_at ▲	author_association	draft	pull_request	body	reactions	repo	type
1933712083	I_kwDOAMm_X85zQhrT	8289	segfault with a particular netcdf4 file	hmaarrfk 90008	open	11	2023-10-09T20:07:17Z	2024-05-03T16:54:18Z	CONTRIBUTOR			What happened? The following code yields a segfault on my machine (and many other machines with a similar environment) ``` import xarray filename = 'tiny.nc.txt' engine = "netcdf4" dataset = xarray.open_dataset(filename, engine=engine) i = 0 for i in range(60): xarray.open_dataset(filename, engine=engine) ``` tiny.nc.txt mrc.nc.txt What did you expect to happen? Not to segfault. Minimal Complete Verifiable Example Generate some netcdf4 with my application. Trim the netcdf4 file down (load it, and drop all the vars I can while still reproducing this bug) Try to read it. ```Python import xarray from tqdm import tqdm filename = 'mrc.nc.txt' engine = "h5netcdf" dataset = xarray.open_dataset(filename, engine=engine) for i in tqdm(range(60), desc=f"filename={filename}, enine={engine}"): xarray.open_dataset(filename, engine=engine) engine = "netcdf4" dataset = xarray.open_dataset(filename, engine=engine) for i in tqdm(range(60), desc=f"filename={filename}, enine={engine}"): xarray.open_dataset(filename, engine=engine) filename = 'tiny.nc.txt' engine = "h5netcdf" dataset = xarray.open_dataset(filename, engine=engine) for i in tqdm(range(60), desc=f"filename={filename}, enine={engine}"): xarray.open_dataset(filename, engine=engine) engine = "netcdf4" dataset = xarray.open_dataset(filename, engine=engine) for i in tqdm(range(60), desc=f"filename={filename}, enine={engine}"): xarray.open_dataset(filename, engine=engine) ``` hand crafting the file from start to finish seems to not segfault: ``` import xarray import numpy as np engine = 'netcdf4' dataset = xarray.Dataset() coords = {} coords['image_x'] = np.arange(1, dtype='int') dataset = dataset.assign_coords(coords) dataset['image'] = xarray.DataArray( np.zeros((1,), dtype='uint8'), dims=('image_x',) ) %% dataset.to_netcdf('mrc.nc.txt') %% dataset = 
xarray.open_dataset('mrc.nc.txt', engine=engine) for i in range(10): xarray.open_dataset('mrc.nc.txt', engine=engine) ``` MVCE confirmation [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray. [X] Complete example — the example is self-contained, including all data and the text of any traceback. [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result. [X] New issue — a search of GitHub Issues suggests this is not a duplicate. [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies. Relevant log output `Python i=0 passes i=1 mostly segfaults, but sometimes it can take more than 1 iteration` Anything else we need to know? At first I thought it was deep in hdf5, but I am less convinced now xref: https://github.com/HDFGroup/hdf5/issues/3649 Environment ``` INSTALLED VERSIONS ------------------ commit: None python: 3.10.12 \| packaged by Ramona Optics \| (main, Jun 27 2023, 02:59:09) [GCC 12.3.0] python-bits: 64 OS: Linux OS-release: 6.5.1-060501-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.14.2 libnetcdf: 4.9.2 xarray: 2023.9.1.dev25+g46643bb1.d20231009 pandas: 2.1.1 numpy: 1.24.4 scipy: 1.11.3 netCDF4: 1.6.4 pydap: None h5netcdf: 1.2.0 h5py: 3.9.0 Nio: None zarr: 2.16.1 cftime: 1.6.2 nc_time_axis: None PseudoNetCDF: None iris: None bottleneck: None dask: 2023.3.0 distributed: 2023.3.0 matplotlib: 3.8.0 cartopy: None seaborn: None numbagg: None fsspec: 2023.9.2 cupy: None pint: 0.22 sparse: None flox: None numpy_groupies: None setuptools: 68.2.2 pip: 23.2.1 conda: 23.7.4 pytest: 7.4.2 mypy: None IPython: 8.16.1 sphinx: 7.2.6 ```	{ "url": "https://api.github.com/repos/pydata/xarray/issues/8289/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 
13221727	issue
2128501296	I_kwDOAMm_X85-3low	8733	A basic default ChunkManager for arrays that report their own chunks	hmaarrfk 90008	open	21	2024-02-10T14:36:55Z	2024-03-10T17:26:13Z	CONTRIBUTOR			Is your feature request related to a problem? I'm creating duckarrays for various file backed datastructures for mine that are naturally "chunked". i.e. different parts of the array may appear in completely different files. Using these "chunks" and the "strides" algorithms can better decide on how to iterate in a convenient manner. For example, an MP4 file's chunks may be defined as being delimited by I frames, while images stored in a TIFF may be delimited by a page. So for me, chunks are not so useful for parallel computing, but more for computing locally and choosing the appropriate way to iterate through a large arrays (TB of uncompressed data). Describe the solution you'd like I think a default Chunk manager could simply implement `compute` as `np.asarray` as a default instance, and be a catchall to all other instances. Advanced users could then go in an reimplement their own chunkmanager, but I was unable to use my duckarrays that incldued a `chunk` property because they weren't associated with any chunk manager. 
Something as simple as: ```patch diff --git a/xarray/core/parallelcompat.py b/xarray/core/parallelcompat.py index c009ef48..bf500abb 100644 --- a/xarray/core/parallelcompat.py +++ b/xarray/core/parallelcompat.py @@ -681,3 +681,26 @@ class ChunkManagerEntrypoint(ABC, Generic[T_ChunkedArray]): cubed.store """ raise NotImplementedError() + + +class DefaultChunkManager(ChunkMangerEntrypoint): + def init(self) -> None: + self.array_cls = None + + def is_chunked_array(self, data: Any) -> bool: + return is_duck_array(data) and hasattr(data, "chunks") + + def chunks(self, data: T_ChunkedArray) -> T_NormalizedChunks: + return data.chunks + + def compute(self, data: T_ChunkedArray \| Any, kwargs) -> tuple[np.ndarray, ...]: + raise tuple(np.asarray(d) for d in data) + + def normalize_chunks(self, args, *kwargs): + raise NotImplementedError() + + def from_array(self, args,** kwargs): + raise NotImplementedError() + + def apply_gufunc(self, args, *kwargs): + raise NotImplementedError() ``` Describe alternatives you've considered I created my own chunk manager, with my own chunk manager entry point. Kinda tedious... Additional context It seems that this is related to: https://github.com/pydata/xarray/pull/7019	{ "url": "https://api.github.com/repos/pydata/xarray/issues/8733/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
2131364916	PR_kwDOAMm_X85ms5QB	8739	Add a test for usability of duck arrays with chunks property	hmaarrfk 90008	open	1	2024-02-13T02:46:47Z	2024-02-13T03:35:24Z	CONTRIBUTOR	0	pydata/xarray/pulls/8739	xref: https://github.com/pydata/xarray/issues/8733 ```python xarray/tests/test_variable.py F ================================================ FAILURES ================================================ ____________________________ TestAsCompatibleData.test_duck_array_with_chunks ____________________________ self = <xarray.tests.test_variable.TestAsCompatibleData object at 0x7f3d1b122e60> def test_duck_array_with_chunks(self): # Non indexable type class CustomArray(NDArrayMixin, indexing.ExplicitlyIndexed): def __init__(self, array): self.array = array @property def chunks(self): return self.shape def __array_function__(self, args, kwargs): return NotImplemented def __array_ufunc__(self, args, kwargs): return NotImplemented array = CustomArray(np.arange(3)) assert is_chunked_array(array) var = Variable(dims=("x"), data=array) > var.load() /home/mark/git/xarray/xarray/tests/test_variable.py:2745: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ /home/mark/git/xarray/xarray/core/variable.py:936: in load self._data = to_duck_array(self._data, kwargs) /home/mark/git/xarray/xarray/namedarray/pycompat.py:129: in to_duck_array chunkmanager = get_chunked_array_type(data) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ args = (CustomArray(array=array([0, 1, 2])),), chunked_arrays = [CustomArray(array=array([0, 1, 2]))] chunked_array_types = {<class 'xarray.tests.test_variable.TestAsCompatibleData.test_duck_array_with_chunks.<locals>.CustomArray'>} chunkmanagers = {'dask': <xarray.namedarray.daskmanager.DaskManager object at 0x7f3d1b1568f0>} def get_chunked_array_type(*args: Any) -> ChunkManagerEntrypoint[Any]: """ Detects which parallel backend should be used for 
given set of arrays. Also checks that all arrays are of same chunking type (i.e. not a mix of cubed and dask). """ # TODO this list is probably redundant with something inside xarray.apply_ufunc ALLOWED_NON_CHUNKED_TYPES = {int, float, np.ndarray} chunked_arrays = [ a for a in args if is_chunked_array(a) and type(a) not in ALLOWED_NON_CHUNKED_TYPES ] # Asserts all arrays are the same type (or numpy etc.) chunked_array_types = {type(a) for a in chunked_arrays} if len(chunked_array_types) > 1: raise TypeError( f"Mixing chunked array types is not supported, but received multiple types: {chunked_array_types}" ) elif len(chunked_array_types) == 0: raise TypeError("Expected a chunked array but none were found") # iterate over defined chunk managers, seeing if each recognises this array type chunked_arr = chunked_arrays[0] chunkmanagers = list_chunkmanagers() selected = [ chunkmanager for chunkmanager in chunkmanagers.values() if chunkmanager.is_chunked_array(chunked_arr) ] if not selected: > raise TypeError( f"Could not find a Chunk Manager which recognises type {type(chunked_arr)}" E TypeError: Could not find a Chunk Manager which recognises type <class 'xarray.tests.test_variable.TestAsCompatibleData.test_duck_array_with_chunks.<locals>.CustomArray'> /home/mark/git/xarray/xarray/namedarray/parallelcompat.py:158: TypeError ============================================ warnings summary ============================================ xarray/testing/assertions.py:9 /home/mark/git/xarray/xarray/testing/assertions.py:9: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. 
If this would cause problems for you, please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466 import pandas as pd -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ======================================== short test summary info ========================================= FAILED xarray/tests/test_variable.py::TestAsCompatibleData::test_duck_array_with_chunks - TypeError: Could not find a Chunk Manager which recognises type <class 'xarray.tests.test_variable.Te... ====================================== 1 failed, 1 warning in 0.77s ====================================== (dev) ✘-1 ~/git/xarray [add_test_for_duck_array\|✔] ``` </details> [ ] Closes #xxxx [ ] Tests added [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst` [ ] New functions/methods are listed in `api.rst`	{ "url": "https://api.github.com/repos/pydata/xarray/issues/8739/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	pull
1152047670	I_kwDOAMm_X85Eqto2	6309	Read/Write performance optimizations for netcdf files	hmaarrfk 90008	open	5	2022-02-26T17:40:40Z	2023-09-13T08:27:47Z	CONTRIBUTOR			What happened? I'm not too sure this is a bug report, but I figured I would share some of the investigation I've done on the topic of writing large datasets to netcdf. For clarity, the usecase I'm considering is writing large in-memory array to persistant storage on Linux. Array size 4-100GB File Format Netcdf Dask: No. Hardware: Some very modern SSD that can write more than 1GB/s (Sabrent Rocket 4 Plus for example). Operating System: Linux Xarray version: 0.21.1 (or around there) The symptoms are two fold: 1. The write speed is slow. About 1GB/s, much less than the 2-3 GB/s you can get with other means. 2. The Linux disk cache just keeps filling up. Its quite hard to get good performance from systems, so I"m going to put a few more constraints on the type of data we are are writing: 1. The underlying numpy array must be alight to the linux Page boundary of 4096 bytes. 2. The underlying numpy array must have been pre-faulted and not swapped. (Do not use `np.zeros`, it doesn't fault the memory) I feel like these two options are rather easy to get to as I'll show in my example. What did you expect to happen? I want to be able to write at 3.2GB/s with my shiny new SSD. I want to leave my RAM unused when I'm archiving to disk. Minimal Complete Verifiable Example ```Python import numpy as np import xarray as xr def empty_aligned(shape, dtype=np.float64, align=4096): if not isinstance(shape, tuple): shape = (shape,) `dtype = np.dtype(dtype) size = dtype.itemsize # Compute the final size of the array for s in shape: size *= s a = np.empty(size + (align - 1), dtype=np.uint8) data_align = a.ctypes.data % align offset = 0 if data_align == 0 else (align - data_align) arr = a[offset:offset + size].view(dtype) # Don't use reshape since reshape might copy the data. 
# This is the suggested way to assign a new shape with guarantee # That the data won't be copied. arr.shape = shape return arr` dataset = xr.DataArray( empty_aligned((4, 1024, 1024, 1024), dtype='uint8'), name='mydata').to_dataset() Fault and write data to this dataset dataset['mydata'].data[...] = 1 %time dataset.to_netcdf("test", engine='h5netcdf') %time dataset.to_netcdf("test", engine='netcdf4') ``` Relevant log output Both output about 3.5s equivalent to just about 1GB/s. To get to about 3 ish GB/s (taking about 1.27s to write a 4GB array). One needs to do a few things: You must align the underlying data to disk. h5netcdf (h5py) backend https://github.com/h5py/h5py/pull/2040 netcdf4: https://github.com/Unidata/netcdf-c/pull/2206 You must use a driver that bypasses the operating system cache https://github.com/h5py/h5py/pull/2041 For the h5netcdf backend you would have to add the following kwargs to h5netcdf constructor `kwargs = { "invalid_netcdf": invalid_netcdf, "phony_dims": phony_dims, "decode_vlen_strings": decode_vlen_strings, 'alignment_threshold': alignment_threshold, 'alignment_interval': alignment_interval, }` Anything else we need to know? The main challenge is that while writing aligned data this way is REALLY fast, writing small chunks and unaligned data becomes REALLY slow. Personally, I think that someone might be able to write a new HDF5 driver that does better optimization, I feel like this can help people loading large datasets which seems to be a large part of the community of xarray users. 
Environment ``` INSTALLED VERSIONS commit: None python: 3.9.9 (main, Dec 29 2021, 07:47:36) [GCC 9.4.0] python-bits: 64 OS: Linux OS-release: 5.13.0-30-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.1 libnetcdf: 4.8.1 xarray: 0.21.1 pandas: 1.4.0 numpy: 1.22.2 scipy: 1.8.0 netCDF4: 1.5.8 pydap: None h5netcdf: 0.13.1 h5py: 3.6.0.post1 Nio: None zarr: None cftime: 1.5.2 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2022.01.1 distributed: None matplotlib: 3.5.1 cartopy: None seaborn: None numbagg: None fsspec: 2022.01.0 cupy: None pint: None sparse: None setuptools: 60.8.1 pip: 22.0.3 conda: None pytest: None IPython: 8.0.1 sphinx: None ``` h5py includes some additions of mine that allow you to use the DIRECT driver and I am using a version of HDF5 that is built with the DIRECT driver.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/6309/reactions", "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 1 }	xarray 13221727	issue
1773296009	I_kwDOAMm_X85pslmJ	7940	decide on how to handle `empty_like`	hmaarrfk 90008	open	8	2023-06-25T13:48:46Z	2023-07-05T16:36:35Z	CONTRIBUTOR			Is your feature request related to a problem? calling `np.empty_like` seems to be instantiating the whole array. ```python from xarray.tests import InaccessibleArray import xarray as xr import numpy as np array = InaccessibleArray(np.zeros((3, 3), dtype="uint8")) da = xr.DataArray(array, dims=["x", "y"]) np.empty_like(da) ``` python Traceback (most recent call last): File "/home/mark/t.py", line 8, in <module> np.empty_like(da) File "/home/mark/mambaforge/envs/dev/lib/python3.9/site-packages/xarray/core/common.py", line 165, in __array__ return np.asarray(self.values, dtype=dtype) File "/home/mark/mambaforge/envs/dev/lib/python3.9/site-packages/xarray/core/dataarray.py", line 732, in values return self.variable.values File "/home/mark/mambaforge/envs/dev/lib/python3.9/site-packages/xarray/core/variable.py", line 614, in values return _as_array_or_item(self._data) File "/home/mark/mambaforge/envs/dev/lib/python3.9/site-packages/xarray/core/variable.py", line 314, in _as_array_or_item data = np.asarray(data) File "/home/mark/mambaforge/envs/dev/lib/python3.9/site-packages/xarray/tests/__init__.py", line 151, in __array__ raise UnexpectedDataAccess("Tried accessing data") xarray.tests.UnexpectedDataAccess: Tried accessing data Describe the solution you'd like I'm not too sure. This is why I raised this as a "feature" and not a bug. On one hand, it is pretty hard to "get" the underlying class. Is it a: numpy array a lazy thing that looks like a numpy array? a dask array when it is dask? I think that there are also some nuances between: Loading an nc file from a file (where things might be handled by dask even though you don't want them to be) Creating your xarray from in memory. Describe alternatives you've considered for now, i'm trying to avoid `empty_like` or `zeros_like`. 
In general, we haven't seen much benefit from dask and cuda still needs careful memory management. Additional context No response	{ "url": "https://api.github.com/repos/pydata/xarray/issues/7940/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
1432388736	I_kwDOAMm_X85VYISA	7245	coordinates not removed for variable encoding during reset_coords	hmaarrfk 90008	open	5	2022-11-02T02:46:56Z	2023-01-15T16:23:15Z	CONTRIBUTOR			What happened? When calling `reset_coords` on a dataset that is loaded from disk, the coordinates are not removed from the encoding of the variable. This means, that at save time they will be resaved as coordinates... annoying. (and erroneous) What did you expect to happen? No response Minimal Complete Verifiable Example ```Python import xarray as xr dataset = xr.Dataset( data_vars={'images': (('y', 'x'), np.zeros((10, 2)))}, coords={'zar': 1} ) dataset.to_netcdf('foo.nc', mode='w') %% foo_loaded = xr.open_dataset('foo.nc') foo_loaded_reset = foo_loaded.reset_coords() %% assert 'zar' in foo_loaded.coords assert 'zar' not in foo_loaded_reset.coords assert 'zar' in foo_loaded_reset.data_vars foo_loaded_reset.to_netcdf('bar.nc', mode='w') %% Now load the dataset bar_loaded = xr.open_dataset('bar.nc') assert 'zar' not in bar_loaded.coords, 'zar is erroneously a coordinate' %% This is the problem assert 'zar' not in foo_loaded_reset.images.encoding['coordinates'].split(' '), "zar should not be in here" ``` MVCE confirmation [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray. [X] Complete example — the example is self-contained, including all data and the text of any traceback. [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result. [X] New issue — a search of GitHub Issues suggests this is not a duplicate. Relevant log output No response Anything else we need to know? 
`for _, variable in obj._variables.items(): coords_in_encoding = set(variable.encoding.get('coordinates', ' ').split(' ')) variable.encoding['coordinates'] = ' '.join(coords_in_encoding - set(names))` suggested fix in `dataset.py, reset_coords` https://github.com/pydata/xarray/blob/513ee34f16cc8f9250a72952e33bf9b4c95d33d1/xarray/core/dataset.py#L1734 Environment ``` INSTALLED VERSIONS ------------------ commit: None python: 3.9.13 \| packaged by Ramona Optics \| (main, Aug 31 2022, 22:30:30) [GCC 10.4.0] python-bits: 64 OS: Linux OS-release: 5.15.0-50-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.2 libnetcdf: 4.8.1 xarray: 2022.10.0 pandas: 1.5.1 numpy: 1.23.4 scipy: 1.9.3 netCDF4: 1.6.1 pydap: None h5netcdf: 1.0.2 h5py: 3.7.0 Nio: None zarr: 2.13.3 cftime: 1.6.2 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2022.10.0 distributed: 2022.10.0 matplotlib: 3.6.1 cartopy: None seaborn: None numbagg: None fsspec: 2022.10.0 cupy: None pint: 0.20.1 sparse: None flox: None numpy_groupies: None setuptools: 65.5.0 pip: 22.3 conda: 22.9.0 pytest: 7.2.0 IPython: 7.33.0 sphinx: 5.3.0 /home/mark/mambaforge/envs/mcam_dev/lib/python3.9/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils. warnings.warn("Setuptools is replacing distutils.") ```	{ "url": "https://api.github.com/repos/pydata/xarray/issues/7245/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
1524642393	I_kwDOAMm_X85a4DJZ	7428	Avoid instantiating the data in prepare_variable	hmaarrfk 90008	open	0	2023-01-08T19:18:49Z	2023-01-09T06:25:52Z	CONTRIBUTOR			Is your feature request related to a problem? I'm trying to extend the features of xarray for a new backend I'm developing internally. The main use case is that we are trying to open a dataset of several hundred GB, slice out a smaller dataset (10s of GB) and write it. However, when we try to use functions like `prepare_variable`, the way they are currently written, they implicitly instantiate the whole data (potentially 10s of GB), which incurs a huge "time cost" at a surprising (to me) point in the code. https://github.com/pydata/xarray/blob/6e77f5e8942206b3e0ab08c3621ade1499d8235b/xarray/backends/h5netcdf_.py#L338 Describe the solution you'd like Would it be possible to just remove the second return value from `prepare_variable`? It isn't particularly "useful" and is easy to obtain from the inputs to the function. Describe alternatives you've considered I'm probably going to create a new method, with a not so well chosen name like `prepare_variable_no_data` that does the above, but only for my backend. My code path that needs this only uses our custom backend. Additional context I think this would be useful in general for other users that need more out-of-memory computation. I've found that you really have to "buy into" dask, all the way to the end, if you want to see any benefits. As such, if somebody used a dask array, this would create a serial choke point in: https://github.com/pydata/xarray/blob/6e77f5e8942206b3e0ab08c3621ade1499d8235b/xarray/backends/common.py#L308	{ "url": "https://api.github.com/repos/pydata/xarray/issues/7428/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
1423948375	I_kwDOAMm_X85U37pX	7224	Insertion speed of new dataset elements	hmaarrfk 90008	open	3	2022-10-26T12:34:51Z	2022-10-29T22:39:39Z	CONTRIBUTOR			What is your issue? In https://github.com/pydata/xarray/pull/7221 I showed that a major contributor to the slowdown in inserting a new element was the cost associated with an internal-only debugging assert statement. The benchmark results in 7221 and 7222 are pretty useful to look at. Thank you for encouraging the creation of a "benchmark" so that we can monitor the performance of element insertion. Unfortunately, that was the only "free" lunch I got. A few other minor improvements can be obtained with: https://github.com/pydata/xarray/pull/7222 However, it seems to me that the fundamental reason this is "slow" is because element insertion is not so much "insertion" as it is: * Dataset Merge * Dataset Replacement of the internal methods. This is really solidified in https://github.com/pydata/xarray/blob/main/xarray/core/dataset.py#L4918 In my benchmarks, I found that in the limit of large datasets, list comprehensions of 1000 elements or more were often used to "search" for variables that were "indexed" https://github.com/pydata/xarray/blob/ca57e5cd984e626487636628b1d34dca85cc2e7c/xarray/core/merge.py#L267 I think a few speedups can be obtained by avoiding these kinds of "searches" and list comprehensions. However, I think that the dataset would have to provide this kind of information to the `merge_core` routine, instead of the `merge_core` routine recreating it all the time. Ultimately, I think you trade off "memory footprint" (due to the potential increase of data structures you keep around) of a dataset, and "speed". Anyway, I just wanted to share where I got.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/7224/reactions", "total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
1098915891	I_kwDOAMm_X85BgCAz	6153	[FEATURE]: to_netcdf and additional keyword arguments	hmaarrfk 90008	open	2	2022-01-11T09:39:35Z	2022-01-20T06:54:25Z	CONTRIBUTOR			Is your feature request related to a problem? I briefly tried to see if any issue was brought up but couldn't find one. I'm hoping to be able to pass additional keyword arguments to the engine when using `to_netcdf`. https://xarray.pydata.org/en/stable/generated/xarray.open_dataset.html However, it doesn't seem too easy to do so. Similar to how `open_dataset` has an additional `**kwargs` parameter, would it be reasonable to add a similar parameter, maybe `engine_kwargs`, to `to_netcdf` to allow users to pass additional parameters to the engine? Describe the solution you'd like ```python import xarray as xr import numpy as np dataset = xr.DataArray( data=np.zeros(3), name="hello" ).to_dataset() dataset.to_netcdf("my_file.nc", engine="h5netcdf", engine_kwargs={"decode_vlen_strings": True}) ``` Describe alternatives you've considered One could forward the additional keyword arguments with `**kwargs`. I just feel like this makes things less "explicit". Additional context h5netcdf emits a warning that is hard to disable without passing a keyword argument to the constructor. https://github.com/h5netcdf/h5netcdf/issues/132 Also, for performance reasons, it might be very good to tune things like the storage data alignment.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/6153/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
573577844	MDU6SXNzdWU1NzM1Nzc4NDQ=	3815	Opening from zarr.ZipStore fails to read (store???) unicode characters	hmaarrfk 90008	open	20	2020-03-01T16:49:25Z	2020-03-26T04:22:29Z	CONTRIBUTOR			See upstream: https://github.com/zarr-developers/zarr-python/issues/551 It seems that using a ZipStore creates 1 byte objects for Unicode string attributes. For example, saving the same Dataset with a DirectoryStore and a Zip Store creates an attribute for a unicode array with 20 bytes in size in the first, and 1 byte in size in the second. In fact, ubuntu file roller isn't even allowing me to extract the files. I have a feeling it is due to the note in the zarr documentation Note that Zip files do not provide any way to remove or replace existing entries. https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.ZipStore MCVE Code Sample ZipStore `python import xarray as xr import zarr x = xr.Dataset() x['hello'] = 'world' x with zarr.ZipStore('test_store.zip', mode='w') as store: x.to_zarr(store) with zarr.ZipStore('test_store.zip', mode='r') as store: x_read = xr.open_zarr(store).compute()` Issued error ```python --------------------------------------------------------------------------- BadZipFile Traceback (most recent call last) <ipython-input-1-2a92a6db56ab> in <module> 7 x.to_zarr(store) 8 with zarr.ZipStore('test_store.zip', mode='r') as store: ----> 9 x_read = xr.open_zarr(store).compute() ~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/core/dataset.py in compute(self, kwargs) 803 """ 804 new = self.copy(deep=False) --> 805 return new.load(kwargs) 806 807 def _persist_inplace(self, kwargs) -> "Dataset": ~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/core/dataset.py in load(self, kwargs) 655 for k, v in self.variables.items(): 656 if k not in lazy_data: --> 657 v.load() 658 659 return self ~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/core/variable.py in load(self, kwargs) 370 self._data = 
as_compatible_data(self._data.compute(kwargs)) 371 elif not hasattr(self._data, "__array_function__"): --> 372 self._data = np.asarray(self._data) 373 return self 374 ~/miniconda3/envs/dev/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order) 83 84 """ ---> 85 return array(a, dtype, copy=False, order=order) 86 87 ~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/core/indexing.py in __array__(self, dtype) 545 def __array__(self, dtype=None): 546 array = as_indexable(self.array) --> 547 return np.asarray(array[self.key], dtype=None) 548 549 def transpose(self, order): ~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/backends/zarr.py in __getitem__(self, key) 46 array = self.get_array() 47 if isinstance(key, indexing.BasicIndexer): ---> 48 return array[key.tuple] 49 elif isinstance(key, indexing.VectorizedIndexer): 50 return array.vindex[ ~/miniconda3/envs/dev/lib/python3.7/site-packages/zarr/core.py in __getitem__(self, selection) 570 571 fields, selection = pop_fields(selection) --> 572 return self.get_basic_selection(selection, fields=fields) 573 574 def get_basic_selection(self, selection=Ellipsis, out=None, fields=None): ~/miniconda3/envs/dev/lib/python3.7/site-packages/zarr/core.py in get_basic_selection(self, selection, out, fields) 693 if self._shape == (): 694 return self._get_basic_selection_zd(selection=selection, out=out, --> 695 fields=fields) 696 else: 697 return self._get_basic_selection_nd(selection=selection, out=out, ~/miniconda3/envs/dev/lib/python3.7/site-packages/zarr/core.py in _get_basic_selection_zd(self, selection, out, fields) 709 # obtain encoded data for chunk 710 ckey = self._chunk_key((0,)) --> 711 cdata = self.chunk_store[ckey] 712 713 except KeyError: ~/miniconda3/envs/dev/lib/python3.7/site-packages/zarr/storage.py in __getitem__(self, key) 1249 with self.mutex: 1250 with self.zf.open(key) as f: # will raise KeyError -> 1251 return f.read() 1252 1253 def __setitem__(self, key, value): 
~/miniconda3/envs/dev/lib/python3.7/zipfile.py in read(self, n) 914 self._offset = 0 915 while not self._eof: --> 916 buf += self._read1(self.MAX_N) 917 return buf 918 ~/miniconda3/envs/dev/lib/python3.7/zipfile.py in _read1(self, n) 1018 if self._left <= 0: 1019 self._eof = True -> 1020 self._update_crc(data) 1021 return data 1022 ~/miniconda3/envs/dev/lib/python3.7/zipfile.py in _update_crc(self, newdata) 946 # Check the CRC if we're at the end of the file 947 if self._eof and self._running_crc != self._expected_crc: --> 948 raise BadZipFile("Bad CRC-32 for file %r" % self.name) 949 950 def read1(self, n): BadZipFile: Bad CRC-32 for file 'hello/0' 0 2 Untitled10.ipynb ``` Working Directory Store example `python import xarray as xr import zarr x = xr.Dataset() x['hello'] = 'world' x store = zarr.DirectoryStore('test_store2.zarr') x.to_zarr(store) x_read = xr.open_zarr(store) x_read.compute() assert x_read.hello == x.hello` Expected Output The string metadata should work. Output of `xr.show_versions()` ``` INSTALLED VERSIONS ------------------ commit: None python: 3.7.6 \| packaged by conda-forge \| (default, Jan 7 2020, 22:33:48) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 5.3.0-40-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_CA.UTF-8 LOCALE: en_CA.UTF-8 libhdf5: None libnetcdf: None xarray: 0.14.1 pandas: 1.0.0 numpy: 1.17.5 scipy: 1.4.1 netCDF4: None pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.4.0 cftime: None nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2.10.1 distributed: 2.10.0 matplotlib: 3.1.3 cartopy: None seaborn: None numbagg: None setuptools: 45.1.0.post20200119 pip: 20.0.2 conda: None pytest: 5.3.1 IPython: 7.12.0 sphinx: 2.3.1 ```	{ "url": "https://api.github.com/repos/pydata/xarray/issues/3815/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
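
The `reactions` column in each row above is stored as JSON text, so it has to be decoded before the counts can be used. A minimal sketch, using the value copied from row 6309; the variable names are illustrative only and are not part of the export:

```python
import json

# The reactions value for issue 6309, copied verbatim from the table above.
reactions_text = (
    '{ "url": "https://api.github.com/repos/pydata/xarray/issues/6309/reactions", '
    '"total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, '
    '"confused": 0, "heart": 0, "rocket": 0, "eyes": 1 }'
)

reactions = json.loads(reactions_text)

# Keep only the reaction kinds that were actually used (skip metadata keys).
nonzero = {
    kind: count
    for kind, count in reactions.items()
    if kind not in ("url", "total_count") and count
}
print(reactions["total_count"], nonzero)
```

The same decoding would apply to every row, since the column is declared as `TEXT` in the schema.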

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
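
The filter described in the page title (state = "open", user = 90008, sorted by updated_at descending) can be reproduced against this schema with a short query. The sketch below is an illustration under stated assumptions: it builds an in-memory SQLite table with only a subset of the columns, loads three rows from the table above plus two made-up rows to show the filter working, and the helper name `open_issues_for_user` is hypothetical.

```python
import sqlite3

# In-memory stand-in for the real database; only a subset of columns is created.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE [issues] (
        [id] INTEGER PRIMARY KEY,
        [number] INTEGER,
        [title] TEXT,
        [user] INTEGER,
        [state] TEXT,
        [updated_at] TEXT
    )
    """
)
conn.execute("CREATE INDEX [idx_issues_user] ON [issues] ([user])")

conn.executemany(
    "INSERT INTO [issues] VALUES (?, ?, ?, ?, ?, ?)",
    [
        # Three rows taken from the table above.
        (573577844, 3815, "Opening from zarr.ZipStore fails to read (store???) unicode characters",
         90008, "open", "2020-03-26T04:22:29Z"),
        (1933712083, 8289, "segfault with a particular netcdf4 file",
         90008, "open", "2024-05-03T16:54:18Z"),
        (2128501296, 8733, "A basic default ChunkManager for arrays that report their own chunks",
         90008, "open", "2024-03-10T17:26:13Z"),
        # Hypothetical rows (not in the export) that the filter should exclude.
        (1, 1, "hypothetical closed issue", 90008, "closed", "2024-06-01T00:00:00Z"),
        (2, 2, "hypothetical issue from another user", 12345, "open", "2024-06-02T00:00:00Z"),
    ],
)


def open_issues_for_user(conn, user_id):
    """Return (number, title, updated_at) for a user's open issues, newest first."""
    cur = conn.execute(
        "SELECT [number], [title], [updated_at] FROM [issues] "
        "WHERE [state] = ? AND [user] = ? "
        "ORDER BY [updated_at] DESC",
        ("open", user_id),
    )
    return cur.fetchall()


rows = open_issues_for_user(conn, 90008)
for number, title, updated_at in rows:
    print(number, updated_at, title)
```

Because `updated_at` holds ISO 8601 UTC timestamps as text, ordering the strings also orders them chronologically, so no date parsing is needed for the sort.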
