id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
1962680526,I_kwDOAMm_X850_CDO,8377,Slow performance with groupby using a custom DataArray grouper,33886395,closed,0,,,6,2023-10-26T04:28:00Z,2024-02-15T22:44:18Z,2024-02-15T22:44:18Z,NONE,,,,"### What is your issue?

I have code that calculates a per-pixel nearest-neighbor match between two datasets and then performs a groupby + aggregation. The calculation I perform is generally lazy, using dask. I recently noticed slow performance of groupby used this way, with the lazy setup of the computation taking in excess of 10 minutes for an index of approximately 4000 by 4000. After some digging I found that the slow line is [this](https://github.com/pydata/xarray/blob/main/xarray/core/indexing.py#L1429):

```Python
Timer unit: 1e-09 s

Total time: 0.263679 s
File: /env/lib/python3.10/site-packages/xarray/core/duck_array_ops.py
Function: array_equiv at line 260

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   260                                           def array_equiv(arr1, arr2):
   261                                               """"""Like np.array_equal, but also allows values to be NaN in both arrays""""""
   262     22140   96490101.0   4358.2     36.6      arr1 = asarray(arr1)
   263     22140   34155953.0   1542.7     13.0      arr2 = asarray(arr2)
   264     22140  119855572.0   5413.5     45.5      lazy_equiv = lazy_array_equiv(arr1, arr2)
   265     22140    7390478.0    333.8      2.8      if lazy_equiv is None:
   266                                                   with warnings.catch_warnings():
   267                                                       warnings.filterwarnings(""ignore"", ""In the future, 'NAT == x'"")
   268                                                       flag_array = (arr1 == arr2) | (isnull(arr1) & isnull(arr2))
   269                                                       return bool(flag_array.all())
   270                                               else:
   271     22140    5787053.0    261.4      2.2          return lazy_equiv

Total time: 242.247 s
File: /env/lib/python3.10/site-packages/xarray/core/indexing.py
Function: __getitem__ at line 1419

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1419                                           def __getitem__(self, key):
  1420     22140   26764337.0   1208.9      0.0      if not isinstance(key, VectorizedIndexer):
  1421                                                   # if possible, short-circuit when keys are effectively slice(None)
  1422                                                   # This preserves dask name and passes lazy array equivalence checks
  1423                                                   # (see duck_array_ops.lazy_array_equiv)
  1424     22140   10513930.0    474.9      0.0          rewritten_indexer = False
  1425     22140    4602305.0    207.9      0.0          new_indexer = []
  1426     66420   61804870.0    930.5      0.0          for idim, k in enumerate(key.tuple):
  1427     88560   78516641.0    886.6      0.0              if isinstance(k, Iterable) and (
  1428     22140  151748667.0   6854.1      0.1                  not is_duck_dask_array(k)
  1429     22140        2e+11    1e+07     93.6                  and duck_array_ops.array_equiv(k, np.arange(self.array.shape[idim]))
  1430                                                       ):
  1431                                                           new_indexer.append(slice(None))
  1432                                                           rewritten_indexer = True
  1433                                                       else:
  1434     44280   40322984.0    910.6      0.0                  new_indexer.append(k)
  1435     22140    4847251.0    218.9      0.0          if rewritten_indexer:
  1436                                                       key = type(key)(tuple(new_indexer))
  1437
  1438     22140   24251221.0   1095.4      0.0      if isinstance(key, BasicIndexer):
  1439                                                   return self.array[key.tuple]
  1440     22140    9613954.0    434.2      0.0      elif isinstance(key, VectorizedIndexer):
  1441                                                   return self.array.vindex[key.tuple]
  1442                                               else:
  1443     22140    8618414.0    389.3      0.0          assert isinstance(key, OuterIndexer)
  1444     22140   26601491.0   1201.5      0.0          key = key.tuple
  1445     22140    6010672.0    271.5      0.0          try:
  1446     22140        2e+10  678487.7      6.2              return self.array[key]
  1447                                                   except NotImplementedError:
  1448                                                       # manual orthogonal indexing.
  1449                                                       # TODO: port this upstream into dask in a saner way.
  1450                                                       value = self.array
  1451                                                       for axis, subkey in reversed(list(enumerate(key))):
  1452                                                           value = value[(slice(None),) * axis + (subkey,)]
  1453                                                       return value
```

The test `duck_array_ops.array_equiv(k, np.arange(self.array.shape[idim]))` is repeated many times; each call is reasonably fast, but together they account for most of the runtime. Much of that could be avoided by first testing for equal length, like:

```python
if isinstance(k, Iterable) and (
    not is_duck_dask_array(k)
    and len(k) == self.array.shape[idim]
    and duck_array_ops.array_equiv(k, np.arange(self.array.shape[idim]))
):
```

This would work better because, although an equivalent shape check is already performed [by array_equiv](https://github.com/pydata/xarray/blob/main/xarray/core/duck_array_ops.py#L233), the array to test against is currently always created with `np.arange` first, and that allocation is ultimately the bottleneck:

```Python
74992059 function calls (73375414 primitive calls) in 298.934 seconds

Ordered by: internal time

        ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         22140  225.296    0.010  225.296    0.010 {built-in method numpy.arange}
        177123    3.192    0.000    3.670    0.000 inspect.py:2920(__init__)
 110702/110701    2.180    0.000    2.180    0.000 {built-in method numpy.asarray}
11690863/11668723   2.036    0.000    5.043    0.000 {built-in method builtins.isinstance}
        287827    1.876    0.000    3.768    0.000 utils.py:25(meta_from_array)
        132843    1.872    0.000    7.649    0.000 inspect.py:2280(_signature_from_function)
        974166    1.485    0.000    2.558    0.000 inspect.py:2637(__init__)
```
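
To get a feel for the cost, here is a minimal standalone sketch (not xarray's actual code; the sizes are assumptions matching the figures above) comparing always materializing the `np.arange` comparison array against short-circuiting on a length mismatch first:

```python
import time

import numpy as np

n = 4000 * 4000     # roughly the index size reported above
k = np.arange(100)  # a small indexer that cannot possibly match

start = time.perf_counter()
for _ in range(100):
    np.array_equal(k, np.arange(n))  # always pays for the allocation
print('no pre-check:  ', time.perf_counter() - start)

start = time.perf_counter()
for _ in range(100):
    _ = len(k) == n and np.array_equal(k, np.arange(n))  # allocation skipped
print('with pre-check:', time.perf_counter() - start)
```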
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8377/reactions"", ""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 1}",,completed,13221727,issue
1397532790,I_kwDOAMm_X85TTKh2,7132,Saving a DataArray of datetime objects as zarr is not a lazy operation despite compute=False,33886395,closed,0,,,2,2022-10-05T09:50:34Z,2024-01-29T19:12:32Z,2024-01-29T19:12:32Z,NONE,,,,"### What happened?

Trying to save a lazy xr.DataArray of datetime objects as zarr forces a dask.compute operation and pulls the data back to the local notebook. This is generally not a problem for indexes of datetime objects, as those are already stored locally and are generally small in size. However, if the whole underlying array holds datetime objects, it can be a serious problem: in my case it simply crashed the scheduler upon attempting to retrieve the data persisted on the workers. I managed to isolate the problem to the call stack below. The issue is in the `encode_cf_datetime` function.

### What did you expect to happen?

I expected storing the data in zarr format to be performed directly by the dask workers (bypassing the scheduler/Client) when `compute=True`, and to be a completely lazy operation when `compute=False`.

### Minimal Complete Verifiable Example

```Python
import numpy as np
import xarray as xr
import dask.array as da

test = xr.DataArray(
    data=da.full((20000, 20000), np.datetime64('2005-02-25T03:30', 'ns')),
    coords={'x': range(20000), 'y': range(20000)}
).to_dataset(name='test')

print(test.test.dtype)  # dtype('<M8[ns]')
```

Attempting to write this dataset with `to_zarr(..., compute=False)` triggers the computation; the relevant part of the call stack:

```Python
--> 2036 return to_zarr(
   2037     self,
   2038     store=store,
   2039     chunk_store=chunk_store,
   2040     storage_options=storage_options,
   2041     mode=mode,
   2042     synchronizer=synchronizer,
   2043     group=group,
   2044     encoding=encoding,
   2045     compute=compute,
   2046     consolidated=consolidated,
   2047     append_dim=append_dim,
   2048     region=region,
   2049     safe_chunks=safe_chunks,
   2050 )

File /env/lib/python3.8/site-packages/xarray/backends/api.py:1431, in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options)
   1429 writer = ArrayWriter()
   1430 # TODO: figure out how to properly handle unlimited_dims
-> 1431 dump_to_store(dataset, zstore, writer, encoding=encoding)
   1432 writes = writer.sync(compute=compute)
   1434 if compute:

File /env/lib/python3.8/site-packages/xarray/backends/api.py:1119, in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1116 if encoder:
   1117     variables, attrs = encoder(variables, attrs)
-> 1119 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)

File /env/lib/python3.8/site-packages/xarray/backends/zarr.py:500, in ZarrStore.store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    498 new_variables = set(variables) - existing_variable_names
    499 variables_without_encoding = {vn: variables[vn] for vn in new_variables}
--> 500 variables_encoded, attributes = self.encode(
    501     variables_without_encoding, attributes
    502 )
    504 if existing_variable_names:
    505     # Decode variables directly, without going via xarray.Dataset to
    506     # avoid needing to load index variables into memory.
    507     # TODO: consider making loading indexes lazy again?
    508     existing_vars, _, _ = conventions.decode_cf_variables(
    509         self.get_variables(), self.get_attrs()
    510     )

File /env/lib/python3.8/site-packages/xarray/backends/common.py:200, in AbstractWritableDataStore.encode(self, variables, attributes)
    183 def encode(self, variables, attributes):
    184     """"""
    185     Encode the variables and attributes in this store
    186    (...)
    198
    199     """"""
--> 200     variables = {k: self.encode_variable(v) for k, v in variables.items()}
    201     attributes = {k: self.encode_attribute(v) for k, v in attributes.items()}
    202     return variables, attributes

File /env/lib/python3.8/site-packages/xarray/backends/common.py:200, in <dictcomp>(.0)
    183 def encode(self, variables, attributes):
    184     """"""
    185     Encode the variables and attributes in this store
    186    (...)
    198
    199     """"""
--> 200     variables = {k: self.encode_variable(v) for k, v in variables.items()}
    201     attributes = {k: self.encode_attribute(v) for k, v in attributes.items()}
    202     return variables, attributes

File /env/lib/python3.8/site-packages/xarray/backends/zarr.py:459, in ZarrStore.encode_variable(self, variable)
    458 def encode_variable(self, variable):
--> 459     variable = encode_zarr_variable(variable)
    460     return variable

File /env/lib/python3.8/site-packages/xarray/backends/zarr.py:258, in encode_zarr_variable(var, needs_copy, name)
    237 def encode_zarr_variable(var, needs_copy=True, name=None):
    238     """"""
    239     Converts an Variable into an Variable which follows some
    240     of the CF conventions:
   (...)
    255     A variable which has been encoded as described above.
    256     """"""
--> 258     var = conventions.encode_cf_variable(var, name=name)
    260     # zarr allows unicode, but not variable-length strings, so it's both
    261     # simpler and more compact to always encode as UTF-8 explicitly.
    262     # TODO: allow toggling this explicitly via dtype in encoding.
    263     coder = coding.strings.EncodedStringCoder(allows_unicode=True)

File /env/lib/python3.8/site-packages/xarray/conventions.py:273, in encode_cf_variable(var, needs_copy, name)
    264 ensure_not_multiindex(var, name=name)
    266 for coder in [
    267     times.CFDatetimeCoder(),
    268     times.CFTimedeltaCoder(),
   (...)
    271     variables.UnsignedIntegerCoder(),
    272 ]:
--> 273     var = coder.encode(var, name=name)
    275 # TODO(shoyer): convert all of these to use coders, too:
    276 var = maybe_encode_nonstring_dtype(var, name=name)

File /env/lib/python3.8/site-packages/xarray/coding/times.py:659, in CFDatetimeCoder.encode(self, variable, name)
    655 dims, data, attrs, encoding = unpack_for_encoding(variable)
    656 if np.issubdtype(data.dtype, np.datetime64) or contains_cftime_datetimes(
    657     variable
    658 ):
--> 659     (data, units, calendar) = encode_cf_datetime(
    660         data, encoding.pop(""units"", None), encoding.pop(""calendar"", None)
    661     )
    662     safe_setitem(attrs, ""units"", units, name=name)
    663     safe_setitem(attrs, ""calendar"", calendar, name=name)

File /env/lib/python3.8/site-packages/xarray/coding/times.py:592, in encode_cf_datetime(dates, units, calendar)
    582 def encode_cf_datetime(dates, units=None, calendar=None):
    583     """"""Given an array of datetime objects, returns the tuple `(num, units,
    584     calendar)` suitable for a CF compliant time variable.
   (...)
    590     cftime.date2num
    591     """"""
--> 592     dates = np.asarray(dates)
    594     if units is None:
    595         units = infer_datetime_units(dates)
```

### Anything else we need to know?

Our system uses dask_gateway in an AWS infrastructure (S3 for storage).
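
A workaround sketch that keeps the write lazy (an assumption on my part, not a tested fix: it bypasses CF datetime encoding entirely by storing raw integer nanoseconds, with `'test.zarr'` as a placeholder store path):

```python
import numpy as np
import xarray as xr
import dask.array as da

# Cast the datetimes to int64 nanoseconds ourselves, chunk by chunk, so that
# CFDatetimeCoder never sees a datetime dtype and never calls np.asarray on
# the full dask array.
data = da.full((20000, 20000), np.datetime64('2005-02-25T03:30', 'ns'))
encoded = data.astype('int64')  # still lazy

ds = xr.DataArray(
    encoded,
    coords={'x': range(20000), 'y': range(20000)},
    attrs={'units': 'nanoseconds since 1970-01-01'},
).to_dataset(name='test')

delayed = ds.to_zarr('test.zarr', compute=False)  # no compute is triggered
```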
### Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.4.209-116.367.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.4
libnetcdf: 4.7.3

xarray: 2022.3.0
pandas: 1.5.0
numpy: 1.22.4
scipy: 1.9.1
netCDF4: 1.6.1
pydap: installed
h5netcdf: 1.0.2
h5py: 3.7.0
Nio: None
zarr: 2.13.2
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.2
cfgrib: None
iris: None
bottleneck: 1.3.5
dask: 2022.9.2
distributed: 2022.9.2
matplotlib: 3.6.0
cartopy: 0.20.2
seaborn: 0.12.0
numbagg: None
fsspec: 2022.8.2
cupy: None
pint: None
sparse: 0.13.0
setuptools: 65.4.1
pip: 22.2.2
conda: None
pytest: 7.1.3
IPython: 8.5.0
sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7132/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 1307523148,I_kwDOAMm_X85N7zhM,6803,Passing a distributed.Future to the kwargs of apply_ufunc should resolve the future,33886395,closed,0,,,10,2022-07-18T07:31:28Z,2024-01-09T18:21:15Z,2023-12-19T05:40:20Z,NONE,,,,"### What is your issue? I am trying to scatter an large array and pass it as keyword argument to a function applied using `apply_ufunc` but that is currently not working. The same function works if providing the actual array, but if providing the Future linked to the scatter data the task fails. Here is a minimal example to reproduce this issue ```python import dask.array as da import xarray as xr import numpy as np data = xr.DataArray(data=da.random.random((15, 15, 20)), coords={'x': range(15), 'y': range(15), 'z': range(20)}, dims=('x', 'y', 'z')) test = np.full((20,), 30) test_future = client.scatter(test, broadcast=True) def _copy_test(d, test=None): return test new_data_actual = xr.apply_ufunc( _copy_test, data, input_core_dims=[['z']], output_core_dims=[['new_z']], vectorize=True, dask='parallelized', output_dtypes=""float64"", kwargs={'test':test}, dask_gufunc_kwargs = {'output_sizes':{'new_z':20}} ) new_data_future = xr.apply_ufunc( _copy_test, data, input_core_dims=[['z']], output_core_dims=[['new_z']], vectorize=True, dask='parallelized', output_dtypes=""float64"", kwargs={'test':test_future}, dask_gufunc_kwargs = {'output_sizes':{'new_z':20}} ) data[0, 0].compute() #[0.3034994 , 0.08172002, 0.34731092, ...] new_data_actual[0, 0].compute() #[30.0, 30.0, 30.0, ...] new_data_future[0,0].compute() #KilledWorker ``` I tried different versions of this, going from explicitly calling `test.result()` to change the way the Future was passed, but nothing worked. I also tried to raise exceptions within the function and various way to print information, but that also did not work. This last issue makes me think that if passing a Future I actually don't get to the scope of that function Am I trying to do something completely silly? or is this an unexpected behavior? ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6803/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 584865241,MDU6SXNzdWU1ODQ4NjUyNDE=,3872,Operations resulting in np.timedelta64 are not properly coerced,33886395,closed,0,,,2,2020-03-20T06:14:30Z,2020-03-23T20:55:54Z,2020-03-23T20:55:53Z,NONE,,,,"It seems that operations that are resulting in `timedelta64` (for example `datetime64` arithmetic) are not properly coerced. 
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6803/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
584865241,MDU6SXNzdWU1ODQ4NjUyNDE=,3872,Operations resulting in np.timedelta64 are not properly coerced,33886395,closed,0,,,2,2020-03-20T06:14:30Z,2020-03-23T20:55:54Z,2020-03-23T20:55:53Z,NONE,,,,"It seems that operations resulting in `timedelta64` (for example `datetime64` arithmetic) are not properly coerced.

In fact, the result of such an operation is an xarray object whose `.dt` accessor is of type `xarray.core.accessor_dt.DatetimeAccessor` instead of the expected `xarray.core.accessor_dt.TimedeltaAccessor`.

This follows the numpy documentation describing datetime arithmetic as resulting in timedelta objects (http://lagrange.univ-lyon1.fr/docs/numpy/1.11.0/reference/arrays.datetime.html#datetime-and-timedelta-arithmetic).

#### MCVE Code Sample

```python
import numpy as np
import pandas as pd
import xarray as xr

# this is a DataArray of type np.datetime64
da = xr.DataArray(data=pd.date_range('2020-01-01', '2020-01-30', freq='D'))

# this simple arithmetic will result in np.timedelta64
delta = da - np.datetime64('2020-01-01')

type(delta.data[0])  # > numpy.timedelta64
type(delta.dt)       # > xarray.core.accessor_dt.DatetimeAccessor
```

#### Expected Output

```python
type(delta.dt)  # > xarray.core.accessor_dt.TimedeltaAccessor
```

#### Problem Description

A DataArray whose `.data` is of type `timedelta64` should expose a `.dt` accessor of type `TimedeltaAccessor`. That would allow representing such `timedelta64` values in the relevant time units, such as `days`.
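
A workaround sketch in the meantime (assuming, as in the MCVE above, that `delta` is one-dimensional): go through pandas, whose `TimedeltaIndex` already exposes the fields the `TimedeltaAccessor` would provide.

```python
# pandas' TimedeltaIndex exposes the fields the TimedeltaAccessor would
# provide; here we recover whole days from the timedelta64 values.
days = xr.DataArray(pd.TimedeltaIndex(delta.values).days, dims=delta.dims)
print(days.values[:3])  # > [0 1 2]
```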
#### Versions

Output of `xr.show_versions()`:

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.10 | packaged by conda-forge | (default, Mar 5 2020, 10:05:08) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.12.14-95.32-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.14.1
pandas: 0.24.1
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.1.2
pydap: None
h5netcdf: 0.8.0
h5py: 2.9.0
Nio: None
zarr: 2.2.0
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.12.0
distributed: 2.12.0
matplotlib: 3.0.3
cartopy: None
seaborn: 0.10.0
numbagg: None
setuptools: 45.2.0.post20200209
pip: 20.0.2
conda: None
pytest: None
IPython: 7.13.0
sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3872/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue