id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type 1167883842,I_kwDOAMm_X85FnH5C,6352,to_netcdf from subsetted Dataset with strings loaded from char array netCDF can sometimes fail,868027,open,0,,,0,2022-03-14T04:52:38Z,2022-04-09T16:59:52Z,,CONTRIBUTOR,,,,"### What happened? Not quite sure what to actually title this, so feel free to edit it. I have some netcdf files modeled after the Argo _prof file format ([CF Discrete sampling geometry incomplete multidimensional array representation](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#_incomplete_multidimensional_array_representation)). While working on splitting these into individual profiles, I would occasionally get exceptions thrown complaining about broadcasting. I eventually narrowed this down to some string variables we maintain for historic purposes. Depending on the row split apart, the string data in each cell could be shorter which would result in a stringN having some different N (e.g. string4 = 3 in the CDL). If while serializing, a different string variable is being encoded that actually has length 4, it would reuse the now incorrect string4 dim name. The above situation seems to only occur when a netCDF file is read back into xarray and the `char_dim_name` encoding key is set. ### What did you expect to happen? Successful serialization to netCDF. ### Minimal Complete Verifiable Example ```Python # setup import numpy as np import xarray as xr one_two = xr.DataArray(np.array([""a"", ""aa""], dtype=""object""), dims=[""dim0""]) two_two = xr.DataArray(np.array([""aa"", ""aa""], dtype=""object""), dims=[""dim0""]) ds = xr.Dataset({""var0"": one_two, ""var1"": two_two}) ds.var0.encoding[""dtype""] = ""S1"" ds.var1.encoding[""dtype""] = ""S1"" # need to write out and read back in ds.to_netcdf(""test.nc"") # only selecting the shorter string will fail ds1 = xr.load_dataset(""test.nc"") ds1[{""dim0"": 1}].to_netcdf(""ok.nc"") ds1[{""dim0"": 0}].to_netcdf(""error.nc"") # will work if the char dim name is removed from encoding of the now shorter arr ds1 = xr.load_dataset(""test.nc"") del ds1.var0.encoding[""char_dim_name""] ds1[{""dim0"": 0}].to_netcdf(""will_work.nc"") ``` ### Relevant log output ```Python --------------------------------------------------------------------------- IndexError Traceback (most recent call last) /var/folders/y1/63dlf4614h5d2cgr5g1t_5lh0000gn/T/ipykernel_64155/447008818.py in 2 ds1 = xr.load_dataset(""test.nc"") 3 ds1[{""dim0"": 1}].to_netcdf(""ok.nc"") ----> 4 ds1[{""dim0"": 0}].to_netcdf(""error.nc"") ~/.dotfiles/pyenv/versions/3.9.9/envs/jupyter/lib/python3.9/site-packages/xarray/core/dataset.py in to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf) 1899 from ..backends.api import to_netcdf 1900 -> 1901 return to_netcdf( 1902 self, 1903 path, ~/.dotfiles/pyenv/versions/3.9.9/envs/jupyter/lib/python3.9/site-packages/xarray/backends/api.py in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf) 1070 # TODO: allow this work (setting up the file for writing array data) 1071 # to be parallelized with dask -> 1072 dump_to_store( 1073 dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims 1074 ) ~/.dotfiles/pyenv/versions/3.9.9/envs/jupyter/lib/python3.9/site-packages/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims) 1117 variables, attrs = encoder(variables, attrs) 1118 -> 1119 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims) 1120 1121 ~/.dotfiles/pyenv/versions/3.9.9/envs/jupyter/lib/python3.9/site-packages/xarray/backends/common.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims) 263 self.set_attributes(attributes) 264 self.set_dimensions(variables, unlimited_dims=unlimited_dims) --> 265 self.set_variables( 266 variables, check_encoding_set, writer, unlimited_dims=unlimited_dims 267 ) ~/.dotfiles/pyenv/versions/3.9.9/envs/jupyter/lib/python3.9/site-packages/xarray/backends/common.py in set_variables(self, variables, check_encoding_set, writer, unlimited_dims) 305 ) 306 --> 307 writer.add(source, target) 308 309 def set_dimensions(self, variables, unlimited_dims=None): ~/.dotfiles/pyenv/versions/3.9.9/envs/jupyter/lib/python3.9/site-packages/xarray/backends/common.py in add(self, source, target, region) 154 target[region] = source 155 else: --> 156 target[...] = source 157 158 def sync(self, compute=True): ~/.dotfiles/pyenv/versions/3.9.9/envs/jupyter/lib/python3.9/site-packages/xarray/backends/netCDF4_.py in __setitem__(self, key, value) 70 with self.datastore.lock: 71 data = self.get_array(needs_lock=False) ---> 72 data[key] = value 73 if self.datastore.autoclose: 74 self.datastore.close(needs_lock=False) src/netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__setitem__() src/netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable._put() IndexError: size of data array does not conform to slice ``` ### Anything else we need to know? I've been unable to recreate the specific error I'm getting in a minimal example. However, removing the `char_dim_name` encoding key does solve this. When digging in the xarray issues, these looked maybe relevant: #2219 #2895
Actual traceback I get with my data ```python --------------------------------------------------------------------------- ValueError Traceback (most recent call last) /var/folders/y1/63dlf4614h5d2cgr5g1t_5lh0000gn/T/ipykernel_64155/3328648456.py in ----> 1 ds[{""N_PROF"": 0}].to_netcdf(""test.nc"") ~/.dotfiles/pyenv/versions/3.9.9/envs/jupyter/lib/python3.9/site-packages/xarray/core/dataset.py in to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf) 1899 from ..backends.api import to_netcdf 1900 -> 1901 return to_netcdf( 1902 self, 1903 path, ~/.dotfiles/pyenv/versions/3.9.9/envs/jupyter/lib/python3.9/site-packages/xarray/backends/api.py in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf) 1070 # TODO: allow this work (setting up the file for writing array data) 1071 # to be parallelized with dask -> 1072 dump_to_store( 1073 dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims 1074 ) ~/.dotfiles/pyenv/versions/3.9.9/envs/jupyter/lib/python3.9/site-packages/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims) 1117 variables, attrs = encoder(variables, attrs) 1118 -> 1119 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims) 1120 1121 ~/.dotfiles/pyenv/versions/3.9.9/envs/jupyter/lib/python3.9/site-packages/xarray/backends/common.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims) 263 self.set_attributes(attributes) 264 self.set_dimensions(variables, unlimited_dims=unlimited_dims) --> 265 self.set_variables( 266 variables, check_encoding_set, writer, unlimited_dims=unlimited_dims 267 ) ~/.dotfiles/pyenv/versions/3.9.9/envs/jupyter/lib/python3.9/site-packages/xarray/backends/common.py in set_variables(self, variables, check_encoding_set, writer, unlimited_dims) 305 ) 306 --> 307 writer.add(source, target) 308 309 def set_dimensions(self, variables, unlimited_dims=None): ~/.dotfiles/pyenv/versions/3.9.9/envs/jupyter/lib/python3.9/site-packages/xarray/backends/common.py in add(self, source, target, region) 154 target[region] = source 155 else: --> 156 target[...] = source 157 158 def sync(self, compute=True): ~/.dotfiles/pyenv/versions/3.9.9/envs/jupyter/lib/python3.9/site-packages/xarray/backends/netCDF4_.py in __setitem__(self, key, value) 70 with self.datastore.lock: 71 data = self.get_array(needs_lock=False) ---> 72 data[key] = value 73 if self.datastore.autoclose: 74 self.datastore.close(needs_lock=False) src/netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__setitem__() ~/.dotfiles/pyenv/versions/3.9.9/envs/jupyter/lib/python3.9/site-packages/netCDF4/utils.py in _StartCountStride(elem, shape, dimensions, grp, datashape, put, use_get_vars) 354 fullslice = False 355 if fullslice and datashape and put and not hasunlim: --> 356 datashape = broadcasted_shape(shape, datashape) 357 358 # pad datashape with zeros for dimensions not being sliced (issue #906) ~/.dotfiles/pyenv/versions/3.9.9/envs/jupyter/lib/python3.9/site-packages/netCDF4/utils.py in broadcasted_shape(shp1, shp2) 962 a = as_strided(x, shape=shp1, strides=[0] * len(shp1)) 963 b = as_strided(x, shape=shp2, strides=[0] * len(shp2)) --> 964 return np.broadcast(a, b).shape ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (5,) and arg 1 with shape (6,). ```
### Environment INSTALLED VERSIONS ------------------ commit: None python: 3.9.9 (main, Jan 5 2022, 11:21:18) [Clang 13.0.0 (clang-1300.0.29.30)] python-bits: 64 OS: Darwin OS-release: 21.3.0 machine: arm64 processor: arm byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.13.0 libnetcdf: 4.8.1 xarray: 2022.3.0 pandas: 1.3.5 numpy: 1.22.0 scipy: None netCDF4: 1.5.8 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.5.2 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: None distributed: None matplotlib: None cartopy: None seaborn: None numbagg: None fsspec: None cupy: None pint: 0.18 sparse: None setuptools: 58.1.0 pip: 21.2.4 conda: None pytest: 6.2.5 IPython: 7.31.0 sphinx: 4.4.0","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6352/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 344631360,MDU6SXNzdWUzNDQ2MzEzNjA=,2315,Behavior of filter_by_attrs() does not match netCDF4.Dataset.get_variables_by_attributes,868027,closed,0,,,8,2018-07-25T22:35:14Z,2018-08-28T01:21:20Z,2018-08-28T01:21:20Z,CONTRIBUTOR,,,,"When using the `filter_by_attrs()` method of a Dataset it was returning more matches than I expected. The returned match set appears to be the logical OR of all the key=value pairs passed into the method. This does not match the behavior of the `netCDF4.Dataset.get_variables_by_attributes` which returns the logical AND of all the key=value pairs passed into it. I couldn't tell from the documentation if this was the intended behavior. #### Minimal Example ```python import xarray as xr example_dataset = xr.Dataset({ ""var1"": xr.DataArray([], attrs={""standard_name"": ""example1"", ""priority"": 0}), ""var2"": xr.DataArray([], attrs={""standard_name"": ""example2""}) }) example_dataset.filter_by_attrs(standard_name=""example2"", priority=0) ``` #### Example Output ```python Dimensions: (dim_0: 0) Dimensions without coordinates: dim_0 Data variables: var1 (dim_0) float64 var2 (dim_0) float64 ``` #### Expected Output ```python Dimensions: () Data variables: *empty* ``` Alternatively, chaining calls to filter_by_attrs will result in the expected behavior: ```python import xarray as xr example_dataset = xr.Dataset({ ""var1"": xr.DataArray([], attrs={""standard_name"": ""example1"", ""priority"": 0}), ""var2"": xr.DataArray([], attrs={""standard_name"": ""example2""}) }) example_dataset.filter_by_attrs(standard_name=""example2"").filter_by_attrs(priority=0) Dimensions: () Data variables: *empty* ``` #### Output of ``xr.show_versions()``
INSTALLED VERSIONS ------------------ commit: None python: 3.7.0.final.0 python-bits: 64 OS: Darwin OS-release: 17.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 xarray: 0.10.8 pandas: 0.23.3 numpy: 1.14.5 scipy: 1.1.0 netCDF4: 1.4.0 h5netcdf: None h5py: None Nio: None zarr: None bottleneck: 1.2.1 cyordereddict: None dask: 0.18.2 distributed: 1.22.0 matplotlib: 2.2.2 cartopy: 0.16.0 seaborn: None setuptools: 39.0.1 pip: 10.0.1 conda: None pytest: 3.6.3 IPython: 6.4.0 sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2315/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue