
issues


5 rows where user = 1492047 sorted by updated_at descending


#8957 · netCDF encoding and decoding issues.
id: 2250654663 · node_id: I_kwDOAMm_X86GJkPH · user: Thomas-Z (1492047) · state: open · locked: 0 · comments: 6 · created_at: 2024-04-18T13:06:49Z · updated_at: 2024-04-19T13:12:04Z · author_association: CONTRIBUTOR

What happened?

Reading or writing netCDF variables containing scale_factor and/or fill_value might raise the following error:

```python
UFuncTypeError: Cannot cast ufunc 'multiply' output from dtype('float64') to dtype('int64') with casting rule 'same_kind'
```

This problem might be related to the following changes: #7654.
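
The error itself comes from NumPy's in-place casting rules rather than anything netCDF-specific. A minimal sketch (not from the report) that reproduces the same exception:

```python
import numpy as np

data = np.array([1, 2, 3], dtype=np.int64)

# An in-place multiply must write the float64 result back into the
# int64 array, which the default 'same_kind' casting rule forbids,
# so this line raises UFuncTypeError.
data *= 1e-3
```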

What did you expect to happen?

I expect it to work as it did before xarray 2024.03.0.

Minimal Complete Verifiable Example

```python
# Example 1, decoding problem.

import netCDF4 as nc
import numpy as np
import xarray as xr

with nc.Dataset("test1.nc", mode="w") as ncds:
    ncds.createDimension(dimname="d")
    ncx = ncds.createVariable(
        varname="x",
        datatype=np.int64,
        dimensions=("d",),
        fill_value=-1,
    )
    ncx.scale_factor = 1e-3
    ncx.units = "seconds"
    ncx[:] = np.array([0.001, 0.002, 0.003])

# This will raise the error
xr.load_dataset("test1.nc")

# Example 2, encoding problem.

import netCDF4 as nc
import numpy as np
import xarray as xr

with nc.Dataset("test2.nc", mode="w") as ncds:
    ncds.createDimension(dimname="d")
    ncx = ncds.createVariable(varname="x", datatype=np.int8, dimensions=("d",))
    ncx.scale_factor = 1000
    ncx[:] = np.array([1000, 2000, 3000])

# Reading it does work
data = xr.load_dataset("test2.nc")

# Writing read data does not work
data.to_netcdf("text2x.nc")
```
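
A possible workaround, not part of the original report: open the file with xarray's scale/offset decoding disabled (mask_and_scale is a documented open_dataset option) and decode by hand into a float array. A sketch, assuming the test1.nc file from Example 1:

```python
import xarray as xr

# Read the raw integers; scale_factor and _FillValue stay in .attrs
# instead of being applied during decoding.
raw = xr.open_dataset("test1.nc", mask_and_scale=False)
x = raw["x"]

scale = x.attrs.get("scale_factor", 1.0)
fill = x.attrs.get("_FillValue")

# Decode into a new float64 array rather than multiplying in place
# into the on-disk int64 dtype.
decoded = x.astype("float64") * scale
if fill is not None:
    decoded = decoded.where(x != fill)
```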

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

```python
# Example 1 error

UFuncTypeError                            Traceback (most recent call last)
Cell In[38], line 1
----> 1 xr.load_dataset("test2.nc")

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/backends/api.py:280, in load_dataset(filename_or_obj, **kwargs)
    277     raise TypeError("cache has no effect in this context")
    279 with open_dataset(filename_or_obj, **kwargs) as ds:
--> 280     return ds.load()

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/core/dataset.py:855, in Dataset.load(self, **kwargs)
    853 for k, v in self.variables.items():
    854     if k not in lazy_data:
--> 855         v.load()
    857 return self

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/core/variable.py:961, in Variable.load(self, **kwargs)
    944 def load(self, **kwargs):
    945     """Manually trigger loading of this variable's data from disk or a
    946     remote source into memory and return this variable.
   (...)
    959     dask.array.compute
    960     """
--> 961     self._data = to_duck_array(self._data, **kwargs)
    962     return self

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/namedarray/pycompat.py:134, in to_duck_array(data, **kwargs)
    131     return loaded_data
    133 if isinstance(data, ExplicitlyIndexed):
--> 134     return data.get_duck_array()  # type: ignore[no-untyped-call, no-any-return]
    135 elif is_duck_array(data):
    136     return data

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/core/indexing.py:809, in MemoryCachedArray.get_duck_array(self)
    808 def get_duck_array(self):
--> 809     self._ensure_cached()
    810     return self.array.get_duck_array()

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/core/indexing.py:803, in MemoryCachedArray._ensure_cached(self)
    802 def _ensure_cached(self):
--> 803     self.array = as_indexable(self.array.get_duck_array())

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/core/indexing.py:760, in CopyOnWriteArray.get_duck_array(self)
    759 def get_duck_array(self):
--> 760     return self.array.get_duck_array()

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/core/indexing.py:630, in LazilyIndexedArray.get_duck_array(self)
    625 # self.array[self.key] is now a numpy array when
    626 # self.array is a BackendArray subclass
    627 # and self.key is BasicIndexer((slice(None, None, None),))
    628 # so we need the explicit check for ExplicitlyIndexed
    629 if isinstance(array, ExplicitlyIndexed):
--> 630     array = array.get_duck_array()
    631 return _wrap_numpy_scalars(array)

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/coding/variables.py:81, in _ElementwiseFunctionArray.get_duck_array(self)
    80 def get_duck_array(self):
---> 81     return self.func(self.array.get_duck_array())

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/coding/variables.py:81, in _ElementwiseFunctionArray.get_duck_array(self)
    80 def get_duck_array(self):
---> 81     return self.func(self.array.get_duck_array())

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/coding/variables.py:399, in _scale_offset_decoding(data, scale_factor, add_offset, dtype)
    397     data = data.astype(dtype=dtype, copy=True)
    398 if scale_factor is not None:
--> 399     data *= scale_factor
    400 if add_offset is not None:
    401     data += add_offset

UFuncTypeError: Cannot cast ufunc 'multiply' output from dtype('float64') to dtype('int64') with casting rule 'same_kind'

# Example 2 error

UFuncTypeError                            Traceback (most recent call last)
Cell In[42], line 1
----> 1 data.to_netcdf("text1x.nc")

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/core/dataset.py:2298, in Dataset.to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   2295     encoding = {}
   2296 from xarray.backends.api import to_netcdf
-> 2298 return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
   2299     self,
   2300     path,
   2301     mode=mode,
   2302     format=format,
   2303     group=group,
   2304     engine=engine,
   2305     encoding=encoding,
   2306     unlimited_dims=unlimited_dims,
   2307     compute=compute,
   2308     multifile=False,
   2309     invalid_netcdf=invalid_netcdf,
   2310 )

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/backends/api.py:1339, in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
   1334 # TODO: figure out how to refactor this logic (here and in save_mfdataset)
   1335 # to avoid this mess of conditionals
   1336 try:
   1337     # TODO: allow this work (setting up the file for writing array data)
   1338     # to be parallelized with dask
-> 1339     dump_to_store(
   1340         dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
   1341     )
   1342     if autoclose:
   1343         store.close()

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/backends/api.py:1386, in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1383 if encoder:
   1384     variables, attrs = encoder(variables, attrs)
-> 1386 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/backends/common.py:393, in AbstractWritableDataStore.store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    390 if writer is None:
    391     writer = ArrayWriter()
--> 393 variables, attributes = self.encode(variables, attributes)
    395 self.set_attributes(attributes)
    396 self.set_dimensions(variables, unlimited_dims=unlimited_dims)

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/backends/common.py:482, in WritableCFDataStore.encode(self, variables, attributes)
    479 def encode(self, variables, attributes):
    480     # All NetCDF files get CF encoded by default, without this attempting
    481     # to write times, for example, would fail.
--> 482     variables, attributes = cf_encoder(variables, attributes)
    483     variables = {k: self.encode_variable(v) for k, v in variables.items()}
    484     attributes = {k: self.encode_attribute(v) for k, v in attributes.items()}

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/conventions.py:795, in cf_encoder(variables, attributes)
    792 # add encoding for time bounds variables if present.
    793 _update_bounds_encoding(variables)
--> 795 new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}
    797 # Remove attrs from bounds variables (issue #2921)
    798 for var in new_vars.values():

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/conventions.py:196, in encode_cf_variable(var, needs_copy, name)
    183 ensure_not_multiindex(var, name=name)
    185 for coder in [
    186     times.CFDatetimeCoder(),
    187     times.CFTimedeltaCoder(),
   (...)
    194     variables.BooleanCoder(),
    195 ]:
--> 196     var = coder.encode(var, name=name)
    198 # TODO(kmuehlbauer): check if ensure_dtype_not_object can be moved to backends:
    199 var = ensure_dtype_not_object(var, name=name)

File .../conda/envs/ong312_local/lib/python3.12/site-packages/xarray/coding/variables.py:476, in CFScaleOffsetCoder.encode(self, variable, name)
    474     data -= pop_to(encoding, attrs, "add_offset", name=name)
    475 if "scale_factor" in encoding:
--> 476     data /= pop_to(encoding, attrs, "scale_factor", name=name)
    478 return Variable(dims, data, attrs, encoding, fastpath=True)

UFuncTypeError: Cannot cast ufunc 'divide' output from dtype('float64') to dtype('int64') with casting rule 'same_kind'
```

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-92-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2
xarray: 2024.3.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.0
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.3.0
h5py: 3.11.0
Nio: None
zarr: 2.17.2
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.4.1
distributed: 2024.4.1
matplotlib: 3.8.4
cartopy: 0.23.0
seaborn: None
numbagg: None
fsspec: 2024.3.1
cupy: None
pint: 0.23
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.5.1
pip: 24.0
conda: 24.3.0
pytest: 8.1.1
mypy: 1.9.0
IPython: 8.22.2
sphinx: 7.3.5
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8957/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: xarray (13221727) · type: issue
#7516 · Dataset.where performances regression.
id: 1575938277 · node_id: I_kwDOAMm_X85d7ujl · user: Thomas-Z (1492047) · state: open · locked: 0 · comments: 10 · created_at: 2023-02-08T11:19:34Z · updated_at: 2023-05-03T12:58:14Z · author_association: CONTRIBUTOR

What happened?

Hello,

I'm using the Dataset.where function to select data based on the values of some fields, and it takes way too much time! The dask dashboard seems to show some tasks repeating themselves many times.

The provided example uses a 1D array, for which the selection could be done with Dataset.sel, but in our real use case we make selections on 2D variables.

This problem seems to have appeared with the 2022.6.0 xarray release; 2022.3.0 works as expected.

What did you expect to happen?

With the 2022.3.0 release, this selection takes 1.37 seconds. With releases from 2022.6.0 up to 2023.2.0 (the one from yesterday), the same selection takes 8.47 seconds.

This example is a very simple and small one; with real data and our actual use case we simply cannot use this function anymore.

Minimal Complete Verifiable Example

```python
import dask.array as da
import distributed as dist
import xarray as xr

client = dist.Client()

# Using small chunks emphasizes the problem.
ds = xr.Dataset(
    {"field": xr.DataArray(data=da.empty(shape=10000, chunks=10), dims=("x"))}
)
sel = ds["field"] > 0

ds.where(sel, drop=True)
```
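
To quantify the regression, a small timing harness along these lines can be run under each xarray release (a sketch added here, reusing the synthetic dataset from the example above):

```python
import time

import dask.array as da
import distributed as dist
import xarray as xr

client = dist.Client()

ds = xr.Dataset(
    {"field": xr.DataArray(data=da.empty(shape=10000, chunks=10), dims=("x"))}
)
sel = ds["field"] > 0

# Time only the where(..., drop=True) call; the reported numbers were
# ~1.37 s on 2022.3.0 and ~8.47 s on 2022.6.0 through 2023.2.0.
t0 = time.perf_counter()
ds.where(sel, drop=True)
print(f"where(drop=True): {time.perf_counter() - t0:.2f} s")
```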

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

Problematic version

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1
xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.8.1
netCDF4: 1.6.2
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.13.6
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.4
cfgrib: 0.9.10.3
iris: None
bottleneck: None
dask: 2023.1.1
distributed: 2023.1.1
matplotlib: 3.6.3
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: 0.20.1
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.1.0
pip: 23.0
conda: 22.11.1
pytest: 7.2.1
mypy: None
IPython: 8.7.0
sphinx: 5.3.0

Working version

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1
xarray: 2022.3.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.8.1
netCDF4: 1.6.2
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.13.6
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.4
cfgrib: 0.9.10.3
iris: None
bottleneck: None
dask: 2023.1.1
distributed: 2023.1.1
matplotlib: 3.6.3
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: 0.20.1
sparse: None
setuptools: 67.1.0
pip: 23.0
conda: 22.11.1
pytest: 7.2.1
IPython: 8.7.0
sphinx: 5.3.0
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7516/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: reopened · repo: xarray (13221727) · type: issue
#2501 · open_mfdataset usage and limitations.
id: 372848074 · node_id: MDU6SXNzdWUzNzI4NDgwNzQ= · user: Thomas-Z (1492047) · state: closed · locked: 0 · comments: 22 · created_at: 2018-10-23T07:31:42Z · updated_at: 2021-01-27T18:06:16Z · closed_at: 2021-01-27T18:06:16Z · author_association: CONTRIBUTOR

I'm trying to understand and use the open_mfdataset function to open a huge number of files. I thought this function would be quite similar to dask.dataframe.from_delayed and would let me "load" and work on an amount of data limited only by the number of Dask workers (or "unlimited", considering it could be lazily loaded).

But my tests showed something quite different. It seems xarray requires the index to be copied back to the Dask client in order to "auto_combine" the data.

Doing some tests on a small portion of my data I have something like this.

Each file has these dimensions: time: ~2871, xx_ind: 40, yy_ind: 128. The concatenation of these files is made along the time dimension, and my understanding is that only the time coordinate is loaded and brought back to the client (the other dimensions are constant).

Parallel tests are made with 200 dask workers.

```python
# =================== Loading 1002 files ===================

xr.open_mfdataset('1002.nc')
# peak memory: 1660.59 MiB, increment: 1536.25 MiB
# Wall time: 1min 29s

xr.open_mfdataset('1002.nc', parallel=True)
# peak memory: 1745.14 MiB, increment: 1602.43 MiB
# Wall time: 53 s

# =================== Loading 5010 files ===================

xr.open_mfdataset('5010.nc')
# peak memory: 7419.99 MiB, increment: 7315.36 MiB
# Wall time: 8min 33s

xr.open_mfdataset('5010.nc', parallel=True)
# peak memory: 8249.75 MiB, increment: 8112.07 MiB
# Wall time: 4min 48s
```

As you can see, the amount of memory used for this operation is significant, and I won't be able to do this on many more files. When using the parallel option, the loading of the files takes a few seconds (judging from what the Dask dashboard shows), and I'm guessing the rest of the time goes to the "auto_combine".

So I'm wondering if I'm doing something wrong, if there is another way to load the data, or if I cannot use xarray directly for this quantity of data and have to use Dask directly.

Thanks in advance.
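
Not part of the original report: in current xarray releases, declaring how the files combine avoids much of the coordinate comparison described above. A hypothetical sketch (the glob pattern is illustrative, not from the report):

```python
from glob import glob

import xarray as xr

# Illustrative file pattern; the report does not give real paths.
paths = sorted(glob("files/*.nc"))

# combine="nested" with an explicit concat_dim tells xarray that the
# files simply stack along time, so it can skip the index comparison
# that automatic combining performs across all files.
ds = xr.open_mfdataset(paths, combine="nested", concat_dim="time", parallel=True)
```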

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-34-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
xarray: 0.10.9+32.g9f4474d.dirty
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: 0.6.2
h5py: 2.8.0
Nio: None
zarr: 2.2.0
cftime: 1.0.1
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: None
cyordereddict: None
dask: 0.19.4
distributed: 1.23.3
matplotlib: 3.0.0
cartopy: None
seaborn: None
setuptools: 40.4.3
pip: 18.1
conda: None
pytest: 3.9.1
IPython: 7.0.1
sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2501/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: xarray (13221727) · type: issue
#4333 · Support explicitly setting a dimension order with to_dataframe()
id: 676696822 · node_id: MDExOlB1bGxSZXF1ZXN0NDY1OTYxOTIw · user: Thomas-Z (1492047) · state: closed · locked: 0 · comments: 11 · created_at: 2020-08-11T08:46:45Z · updated_at: 2020-08-19T20:37:38Z · closed_at: 2020-08-14T18:28:26Z · author_association: CONTRIBUTOR · draft: 0 · pull_request: pydata/xarray/pulls/4333
  • [x] Closes #4331
  • [x] Tests added
  • [x] Passes isort . && black . && mypy . && flake8
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst
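
The feature named in this pull request's title lets the caller set the row order explicitly. A minimal usage sketch, assuming an xarray release that includes this change (the data mirrors issue #2346 below):

```python
import numpy as np
import xarray as xr

data = xr.DataArray(np.random.randn(2, 3), coords={"x": ["a", "b"]}, dims=("y", "x"))
ds = xr.Dataset({"foo": data})

# Without dim_order the Dataset version sorts dimensions alphabetically;
# passing dim_order makes it follow the DataArray's ('y', 'x') order.
df = ds.to_dataframe(dim_order=["y", "x"])
print(df.index.names)  # ['y', 'x']
```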
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4333/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: xarray (13221727) · type: pull
#2346 · Dataset/DataArray to_dataframe() dimensions order mismatch.
id: 347895055 · node_id: MDU6SXNzdWUzNDc4OTUwNTU= · user: Thomas-Z (1492047) · state: closed · locked: 0 · comments: 4 · created_at: 2018-08-06T12:03:00Z · updated_at: 2020-08-10T17:45:43Z · closed_at: 2020-08-08T07:10:28Z · author_association: CONTRIBUTOR

Code Sample

```python
import xarray as xr
import numpy as np

data = xr.DataArray(np.random.randn(2, 3), coords={'x': ['a', 'b']}, dims=('y', 'x'))
ds = xr.Dataset({'foo': data})

# Applied on the Dataset
ds.to_dataframe()
#           foo
# x y
# a 0  0.348519
#   1 -0.322634
#   2 -0.683181
# b 0  0.197501
#   1  0.504810
#   2 -1.871626

# Applied to the DataArray
ds['foo'].to_dataframe()
#           foo
# y x
# 0 a  0.348519
#   b  0.197501
# 1 a -0.322634
#   b  0.504810
# 2 a -0.683181
#   b -1.871626
```

Problem description

The to_dataframe method applied to a DataArray respects the dimension order, whereas the same method applied to a Dataset uses an alphabetically sorted order.

In both situations, to_dataframe calls _to_dataframe() with an argument: the DataArray uses an OrderedDict, but the Dataset uses self.dims (which is a SortedKeyDict).
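
A quick way to observe the difference described above, reusing ds from the code sample (behavior as reported for xarray 0.10.8; newer releases may differ):

```python
# The DataArray keeps its dimensions in insertion order.
print(ds['foo'].dims)  # ('y', 'x')

# Dataset.dims is a sorted mapping in this version, so iteration yields
# the alphabetical order that to_dataframe() ends up using.
print(list(ds.dims))   # ['x', 'y']
```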

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-23-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
xarray: 0.10.8
pandas: 0.23.4
numpy: 1.14.5
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: None
h5py: 2.8.0
Nio: None
zarr: 2.2.0
bottleneck: None
cyordereddict: None
dask: 0.18.2
distributed: 1.22.1
matplotlib: 2.2.2
cartopy: None
seaborn: None
setuptools: 40.0.0
pip: 18.0
conda: None
pytest: 3.7.1
IPython: 6.5.0
sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2346/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: xarray (13221727) · type: issue


CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);