issues


5 rows where user = 33886395 sorted by updated_at descending


#8377: Slow performance with groupby using a custom DataArray grouper

opened by alessioarena · closed (completed) · 6 comments · created 2023-10-26T04:28:00Z · updated 2024-02-15T22:44:18Z · closed 2024-02-15T22:44:18Z

What is your issue?

I have code that calculates a per-pixel nearest-neighbor match between two datasets, and then performs a groupby + aggregation. The calculation I perform is generally lazy, using dask.

I recently noticed slow performance of groupby used this way, with the lazy calculation taking in excess of 10 minutes for an index of approximately 4000 by 4000.

I did a bit of digging around and noticed that the slow line is this:

```Python
Timer unit: 1e-09 s

Total time: 0.263679 s
File: /env/lib/python3.10/site-packages/xarray/core/duck_array_ops.py
Function: array_equiv at line 260

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   260                                           def array_equiv(arr1, arr2):
   261                                               """Like np.array_equal, but also allows values to be NaN in both arrays"""
   262     22140   96490101.0   4358.2     36.6      arr1 = asarray(arr1)
   263     22140   34155953.0   1542.7     13.0      arr2 = asarray(arr2)
   264     22140  119855572.0   5413.5     45.5      lazy_equiv = lazy_array_equiv(arr1, arr2)
   265     22140    7390478.0    333.8      2.8      if lazy_equiv is None:
   266                                                   with warnings.catch_warnings():
   267                                                       warnings.filterwarnings("ignore", "In the future, 'NAT == x'")
   268                                                       flag_array = (arr1 == arr2) | (isnull(arr1) & isnull(arr2))
   269                                                       return bool(flag_array.all())
   270                                               else:
   271     22140    5787053.0    261.4      2.2          return lazy_equiv

Total time: 242.247 s
File: /env/lib/python3.10/site-packages/xarray/core/indexing.py
Function: __getitem__ at line 1419

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1419                                           def __getitem__(self, key):
  1420     22140   26764337.0   1208.9      0.0      if not isinstance(key, VectorizedIndexer):
  1421                                                   # if possible, short-circuit when keys are effectively slice(None)
  1422                                                   # This preserves dask name and passes lazy array equivalence checks
  1423                                                   # (see duck_array_ops.lazy_array_equiv)
  1424     22140   10513930.0    474.9      0.0          rewritten_indexer = False
  1425     22140    4602305.0    207.9      0.0          new_indexer = []
  1426     66420   61804870.0    930.5      0.0          for idim, k in enumerate(key.tuple):
  1427     88560   78516641.0    886.6      0.0              if isinstance(k, Iterable) and (
  1428     22140  151748667.0   6854.1      0.1                  not is_duck_dask_array(k)
  1429     22140        2e+11    1e+07     93.6                  and duck_array_ops.array_equiv(k, np.arange(self.array.shape[idim]))
  1430                                                       ):
  1431                                                           new_indexer.append(slice(None))
  1432                                                           rewritten_indexer = True
  1433                                                       else:
  1434     44280   40322984.0    910.6      0.0                  new_indexer.append(k)
  1435     22140    4847251.0    218.9      0.0          if rewritten_indexer:
  1436                                                       key = type(key)(tuple(new_indexer))
  1437
  1438     22140   24251221.0   1095.4      0.0      if isinstance(key, BasicIndexer):
  1439                                                   return self.array[key.tuple]
  1440     22140    9613954.0    434.2      0.0      elif isinstance(key, VectorizedIndexer):
  1441                                                   return self.array.vindex[key.tuple]
  1442                                               else:
  1443     22140    8618414.0    389.3      0.0          assert isinstance(key, OuterIndexer)
  1444     22140   26601491.0   1201.5      0.0          key = key.tuple
  1445     22140    6010672.0    271.5      0.0          try:
  1446     22140        2e+10 678487.7      6.2              return self.array[key]
  1447                                                   except NotImplementedError:
  1448                                                       # manual orthogonal indexing.
  1449                                                       # TODO: port this upstream into dask in a saner way.
  1450                                                       value = self.array
  1451                                                       for axis, subkey in reversed(list(enumerate(key))):
  1452                                                           value = value[(slice(None),) * axis + (subkey,)]
  1453                                                       return value
```

The test `duck_array_ops.array_equiv(k, np.arange(self.array.shape[idim]))` is repeated many times, and although each individual call is reasonably fast, the total cost adds up. It could be minimized by introducing a prior test of equal length, like:

```python
if isinstance(k, Iterable) and (
    not is_duck_dask_array(k)
    and len(k) == self.array.shape[idim]
    and duck_array_ops.array_equiv(k, np.arange(self.array.shape[idim]))
):
```

This would help because, even though `array_equiv` performs an equivalent check internally, the array to test against is always created with `np.arange` first, and that allocation is ultimately the bottleneck:

```Python
         74992059 function calls (73375414 primitive calls) in 298.934 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    22140  225.296    0.010  225.296    0.010 {built-in method numpy.arange}
   177123    3.192    0.000    3.670    0.000 inspect.py:2920(__init__)
110702/110701    2.180    0.000    2.180    0.000 {built-in method numpy.asarray}
11690863/11668723    2.036    0.000    5.043    0.000 {built-in method builtins.isinstance}
   287827    1.876    0.000    3.768    0.000 utils.py:25(meta_from_array)
   132843    1.872    0.000    7.649    0.000 inspect.py:2280(_signature_from_function)
   974166    1.485    0.000    2.558    0.000 inspect.py:2637(__init__)
```
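The proposed length guard can be sketched in isolation. `maybe_rewrite` below is a hypothetical stand-in for the logic inside `__getitem__`, not xarray's actual code; it only illustrates why a cheap `len` comparison avoids building the `np.arange` array in the common case:

```python
import numpy as np

def maybe_rewrite(k, axis_len):
    # Compare lengths first: np.arange(axis_len) (the profiled bottleneck)
    # is only allocated when the indexer could actually match it.
    if len(k) == axis_len and np.array_equal(k, np.arange(axis_len)):
        return slice(None)   # indexer is effectively slice(None)
    return k                 # cheap length check short-circuited

# A short indexer against a huge axis never allocates np.arange(16_000_000)
assert isinstance(maybe_rewrite(np.array([0, 1, 2]), 4000 * 4000), np.ndarray)
# A full-range indexer is still rewritten to slice(None)
assert maybe_rewrite(np.arange(5), 5) == slice(None)
```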

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8377/reactions",
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 1
}
#7132: Saving a DataArray of datetime objects as zarr is not a lazy operation despite compute=False

opened by alessioarena · closed (completed) · 2 comments · created 2022-10-05T09:50:34Z · updated 2024-01-29T19:12:32Z · closed 2024-01-29T19:12:32Z

What happened?

Trying to save a lazy xr.DataArray of datetime objects as zarr forces a dask.compute operation and retrieves the data to the local notebook. This is generally not a problem for indices of datetime objects, as those are already stored locally and are generally small in size.

However, if the whole underlying array is a datetime object, that can be a serious problem. In my case it simply crashed the scheduler upon attempting to retrieve the data persisted on workers.

I managed to isolate the problem to the call stack shown in the log output. The issue is in the encode_cf_datetime function.

What did you expect to happen?

Storing the data in zarr format should be performed directly by the dask workers, bypassing the scheduler/Client, when compute=True, and should be a completely lazy operation when compute=False.

Minimal Complete Verifiable Example

```Python
import numpy as np
import xarray as xr
import dask.array as da

test = xr.DataArray(
    data=da.full((20000, 20000), np.datetime64('2005-02-25T03:30', 'ns')),
    coords={'x': range(20000), 'y': range(20000)}
).to_dataset(name='test')

print(test.test.dtype)
# dtype('<M8[ns]')

test.to_zarr('test.zarr', compute=False)
# this will take a while and trigger the computation of the array.
# No data will be actually saved though
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```Python
File /env/lib/python3.8/site-packages/xarray/core/dataset.py:2036, in Dataset.to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options)
   2033 if encoding is None:
   2034     encoding = {}
-> 2036 return to_zarr(
   2037     self,
   2038     store=store,
   2039     chunk_store=chunk_store,
   2040     storage_options=storage_options,
   2041     mode=mode,
   2042     synchronizer=synchronizer,
   2043     group=group,
   2044     encoding=encoding,
   2045     compute=compute,
   2046     consolidated=consolidated,
   2047     append_dim=append_dim,
   2048     region=region,
   2049     safe_chunks=safe_chunks,
   2050 )

File /env/lib/python3.8/site-packages/xarray/backends/api.py:1431, in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options)
   1429 writer = ArrayWriter()
   1430 # TODO: figure out how to properly handle unlimited_dims
-> 1431 dump_to_store(dataset, zstore, writer, encoding=encoding)
   1432 writes = writer.sync(compute=compute)
   1434 if compute:

File /env/lib/python3.8/site-packages/xarray/backends/api.py:1119, in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1116 if encoder:
   1117     variables, attrs = encoder(variables, attrs)
-> 1119 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)

File /env/lib/python3.8/site-packages/xarray/backends/zarr.py:500, in ZarrStore.store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    498 new_variables = set(variables) - existing_variable_names
    499 variables_without_encoding = {vn: variables[vn] for vn in new_variables}
--> 500 variables_encoded, attributes = self.encode(
    501     variables_without_encoding, attributes
    502 )
    504 if existing_variable_names:
    505     # Decode variables directly, without going via xarray.Dataset to
    506     # avoid needing to load index variables into memory.
    507     # TODO: consider making loading indexes lazy again?
    508     existing_vars, _, _ = conventions.decode_cf_variables(
    509         self.get_variables(), self.get_attrs()
    510     )

File /env/lib/python3.8/site-packages/xarray/backends/common.py:200, in AbstractWritableDataStore.encode(self, variables, attributes)
    183 def encode(self, variables, attributes):
    184     """
    185     Encode the variables and attributes in this store
    186     (...)
    198
    199     """
--> 200     variables = {k: self.encode_variable(v) for k, v in variables.items()}
    201     attributes = {k: self.encode_attribute(v) for k, v in attributes.items()}
    202     return variables, attributes

File /env/lib/python3.8/site-packages/xarray/backends/common.py:200, in <dictcomp>(.0)
    183 def encode(self, variables, attributes):
    184     """
    185     Encode the variables and attributes in this store
    186     (...)
    198
    199     """
--> 200     variables = {k: self.encode_variable(v) for k, v in variables.items()}
    201     attributes = {k: self.encode_attribute(v) for k, v in attributes.items()}
    202     return variables, attributes

File /env/lib/python3.8/site-packages/xarray/backends/zarr.py:459, in ZarrStore.encode_variable(self, variable)
    458 def encode_variable(self, variable):
--> 459     variable = encode_zarr_variable(variable)
    460     return variable

File /env/lib/python3.8/site-packages/xarray/backends/zarr.py:258, in encode_zarr_variable(var, needs_copy, name)
    237 def encode_zarr_variable(var, needs_copy=True, name=None):
    238     """
    239     Converts an Variable into an Variable which follows some
    240     of the CF conventions:
    (...)
    255     A variable which has been encoded as described above.
    256     """
--> 258     var = conventions.encode_cf_variable(var, name=name)
    260     # zarr allows unicode, but not variable-length strings, so it's both
    261     # simpler and more compact to always encode as UTF-8 explicitly.
    262     # TODO: allow toggling this explicitly via dtype in encoding.
    263     coder = coding.strings.EncodedStringCoder(allows_unicode=True)

File /env/lib/python3.8/site-packages/xarray/conventions.py:273, in encode_cf_variable(var, needs_copy, name)
    264 ensure_not_multiindex(var, name=name)
    266 for coder in [
    267     times.CFDatetimeCoder(),
    268     times.CFTimedeltaCoder(),
    (...)
    271     variables.UnsignedIntegerCoder(),
    272 ]:
--> 273     var = coder.encode(var, name=name)
    275 # TODO(shoyer): convert all of these to use coders, too:
    276 var = maybe_encode_nonstring_dtype(var, name=name)

File /env/lib/python3.8/site-packages/xarray/coding/times.py:659, in CFDatetimeCoder.encode(self, variable, name)
    655 dims, data, attrs, encoding = unpack_for_encoding(variable)
    656 if np.issubdtype(data.dtype, np.datetime64) or contains_cftime_datetimes(
    657     variable
    658 ):
--> 659     (data, units, calendar) = encode_cf_datetime(
    660         data, encoding.pop("units", None), encoding.pop("calendar", None)
    661     )
    662 safe_setitem(attrs, "units", units, name=name)
    663 safe_setitem(attrs, "calendar", calendar, name=name)

File /env/lib/python3.8/site-packages/xarray/coding/times.py:592, in encode_cf_datetime(dates, units, calendar)
    582 def encode_cf_datetime(dates, units=None, calendar=None):
    583     """Given an array of datetime objects, returns the tuple (num, units,
    584     calendar) suitable for a CF compliant time variable.
    585     (...)
    590     cftime.date2num
    591     """
--> 592     dates = np.asarray(dates)
    594     if units is None:
    595         units = infer_datetime_units(dates)
```

Anything else we need to know?

Our system uses dask_gateway in an AWS infrastructure (S3 for storage).

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.4.209-116.367.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.4
libnetcdf: 4.7.3
xarray: 2022.3.0
pandas: 1.5.0
numpy: 1.22.4
scipy: 1.9.1
netCDF4: 1.6.1
pydap: installed
h5netcdf: 1.0.2
h5py: 3.7.0
Nio: None
zarr: 2.13.2
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.2
cfgrib: None
iris: None
bottleneck: 1.3.5
dask: 2022.9.2
distributed: 2022.9.2
matplotlib: 3.6.0
cartopy: 0.20.2
seaborn: 0.12.0
numbagg: None
fsspec: 2022.8.2
cupy: None
pint: None
sparse: 0.13.0
setuptools: 65.4.1
pip: 22.2.2
conda: None
pytest: 7.1.3
IPython: 8.5.0
sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7132/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
#6803: Passing a distributed.Future to the kwargs of apply_ufunc should resolve the future

opened by alessioarena · closed (completed) · 10 comments · created 2022-07-18T07:31:28Z · updated 2024-01-09T18:21:15Z · closed 2023-12-19T05:40:20Z

What is your issue?

I am trying to scatter a large array and pass it as a keyword argument to a function applied using apply_ufunc, but that is currently not working. The same function works if I provide the actual array, but if I provide the Future linked to the scattered data, the task fails.

Here is a minimal example to reproduce this issue

```python
import dask.array as da
import xarray as xr
import numpy as np

data = xr.DataArray(
    data=da.random.random((15, 15, 20)),
    coords={'x': range(15), 'y': range(15), 'z': range(20)},
    dims=('x', 'y', 'z')
)

test = np.full((20,), 30)
test_future = client.scatter(test, broadcast=True)

def _copy_test(d, test=None):
    return test

new_data_actual = xr.apply_ufunc(
    _copy_test,
    data,
    input_core_dims=[['z']],
    output_core_dims=[['new_z']],
    vectorize=True,
    dask='parallelized',
    output_dtypes="float64",
    kwargs={'test': test},
    dask_gufunc_kwargs={'output_sizes': {'new_z': 20}}
)

new_data_future = xr.apply_ufunc(
    _copy_test,
    data,
    input_core_dims=[['z']],
    output_core_dims=[['new_z']],
    vectorize=True,
    dask='parallelized',
    output_dtypes="float64",
    kwargs={'test': test_future},
    dask_gufunc_kwargs={'output_sizes': {'new_z': 20}}
)

data[0, 0].compute()
# [0.3034994 , 0.08172002, 0.34731092, ...]

new_data_actual[0, 0].compute()
# [30.0, 30.0, 30.0, ...]

new_data_future[0, 0].compute()
# KilledWorker
```

I tried different versions of this, from explicitly calling test.result() to changing the way the Future was passed, but nothing worked. I also tried raising exceptions within the function, and various ways of printing information, but that did not work either. This makes me think that when passing a Future, execution never actually reaches the scope of that function.

Am I trying to do something completely silly, or is this unexpected behavior?

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6803/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
#4637: Support for monotonically decreasing indices in interpolate_na

opened by alessioarena · open · 3 comments · created 2020-12-01T22:58:23Z · updated 2023-03-31T18:10:24Z

Currently interpolate_na requires all indices to be monotonically increasing. For geographical datasets, indices are generally monotonic; however, in the Southern hemisphere the latitude is monotonically decreasing.

The current workaround is to flip the image before and after the interpolation; however, it should not take much effort to support all monotonic indices directly within interpolate_na.
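The flip workaround described above can be sketched as follows; `lat` is a hypothetical monotonically decreasing coordinate standing in for a Southern-hemisphere latitude axis:

```python
import numpy as np
import xarray as xr

# 'lat' decreases, as it would for a Southern-hemisphere raster.
da = xr.DataArray(
    [3.0, np.nan, 1.0],
    dims='lat',
    coords={'lat': [30.0, 20.0, 10.0]},   # monotonically decreasing
)

filled = (
    da.isel(lat=slice(None, None, -1))    # flip so the index increases
      .interpolate_na(dim='lat')          # linear interpolation now allowed
      .isel(lat=slice(None, None, -1))    # flip back to the original order
)
```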

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/4637/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
#3872: Operations resulting in np.timedelta64 are not properly coerced

opened by alessioarena · closed (completed) · 2 comments · created 2020-03-20T06:14:30Z · updated 2020-03-23T20:55:54Z · closed 2020-03-23T20:55:53Z

It seems that operations resulting in timedelta64 (for example, datetime64 arithmetic) are not properly coerced. In fact, the result of such an operation is an xarray object whose dt accessor is of type xarray.core.accessor_dt.DatetimeAccessor instead of the expected xarray.core.accessor_dt.TimedeltaAccessor.

This follows the numpy documentation describing datetime arithmetic as resulting in timedelta objects (http://lagrange.univ-lyon1.fr/docs/numpy/1.11.0/reference/arrays.datetime.html#datetime-and-timedelta-arithmetic).

MCVE Code Sample

```python
import numpy as np
import pandas as pd
import xarray as xr

# this is a DataArray of type np.datetime64
da = xr.DataArray(data=pd.date_range('2020-01-01', '2020-01-30', freq='D'))

# this simple arithmetic will result in np.timedelta64
delta = (da - np.datetime64('2020-01-01'))

type(delta.data[0])
# > numpy.timedelta64

type(delta.dt)
# > xarray.core.accessor_dt.DatetimeAccessor
```

Expected Output

```python
type(delta.dt)
# > xarray.core.accessor_dt.TimedeltaAccessor
```

Problem Description

Having .data of type timedelta64, it would be beneficial for the .dt accessor to be of type TimedeltaAccessor. This would allow representing such timedelta64 values using the relevant time units, like days.
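Until the coercion is fixed, one workaround sketch is to bypass the .dt accessor and convert the underlying numpy data directly; this assumes in-memory numpy data (not dask), and `days` is a hypothetical name:

```python
import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(data=pd.date_range('2020-01-01', '2020-01-30', freq='D'))
delta = da - np.datetime64('2020-01-01')

# Workaround sketch: divide the underlying timedelta64 data by a unit
# timedelta to express it in the desired unit, without touching .dt at all.
days = delta.data / np.timedelta64(1, 'D')   # plain float64 days
```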

Versions

Output of `xr.show_versions()`:

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.10 | packaged by conda-forge | (default, Mar 5 2020, 10:05:08) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.12.14-95.32-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2
xarray: 0.14.1
pandas: 0.24.1
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.1.2
pydap: None
h5netcdf: 0.8.0
h5py: 2.9.0
Nio: None
zarr: 2.2.0
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.12.0
distributed: 2.12.0
matplotlib: 3.0.3
cartopy: None
seaborn: 0.10.0
numbagg: None
setuptools: 45.2.0.post20200209
pip: 20.0.2
conda: None
pytest: None
IPython: 7.13.0
sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3872/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
