id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type 309686915,MDU6SXNzdWUzMDk2ODY5MTU=,2027,square-bracket slice a Dataset with a DataArray,6213168,open,0,,,4,2018-03-29T09:39:57Z,2022-04-18T03:51:25Z,,MEMBER,,,,"Given this: ``` ds = xarray.Dataset( data_vars={ 'vote': ('pupil', [5, 7, 8]), 'age': ('pupil', [15, 14, 16]) }, coords={ 'pupil': ['Alice', 'Bob', 'Charlie'] }) Dimensions: (pupil: 3) Coordinates: * pupil (pupil) = 6] array([14, 16]) Coordinates: * pupil (pupil) = 6] KeyError: False ``` ``ds.vote >= 6`` is a DataArray with dims=('pupil', ) and dtype=bool, so I can't think of any ambiguity in what I want to achieve? Workaround: ``` ds.sel(pupil=ds.vote >= 6) Dimensions: (pupil: 2) Coordinates: * pupil (pupil) > import xarray >> a = xarray.DataArray([1, 2, 3], dims=['x'], coords={'y': 10}) >> b = xarray.DataArray([4, 5, 6], dims=['x']) >> a + b array([5, 7, 9]) Coordinates: y int64 10 ``` But this doesn't? ``` >> xarray.concat([a, b], dim='x') KeyError: 'y' ``` It doesn't seem coherent to me...","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1151/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 305757822,MDU6SXNzdWUzMDU3NTc4MjI=,1995,apply_ufunc support for chunks on input_core_dims,6213168,open,0,,,13,2018-03-15T23:50:22Z,2021-05-17T18:59:18Z,,MEMBER,,,,"I am trying to optimize the following function: c = (a * b).sum('x', skipna=False) where a and b are xarray.DataArray's, both with dimension x and both with dask backend. I successfully obtained a 5.5x speedup with the following: @numba.guvectorize(['void(float64[:], float64[:], float64[:])'], '(n),(n)->()', nopython=True, cache=True) def mulsum(a, b, res): acc = 0 for i in range(a.size): acc += a[i] * b[i] res.flat[0] = acc c = xarray.apply_ufunc( mulsum, a, b, input_core_dims=[['x'], ['x']], dask='parallelized', output_dtypes=[float]) The problem is that this introduces a (quite problematic, in my case) constraint that a and b can't be chunked on dimension x - which is theoretically avoidable as long as the kernel function doesn't need interaction between x[i] and x[j] (e.g. it can't work for an interpolator, which would require to rely on dask ghosting). # Proposal Add a parameter to apply_ufunc, ``reduce_func=None``. reduce_func is a function which takes as input two parameters a, b that are the output of func. apply_ufunc will invoke it whenever there's chunking on an input_core_dim. e.g. my use case above would simply become: c = xarray.apply_ufunc( mulsum, a, b, input_core_dims=[['x'], ['x']], dask='parallelized', output_dtypes=[float], reduce_func=operator.sum) So if I have 2 chunks in a and b on dimension x, apply_ufunc will internally do c1 = mulsum(a1, b1) c2 = mulsum(a2, b2) c = operator.sum(c1, c2) Note that reduce_func will be invoked exclusively in presence of dask='parallelized' and when there's chunking on one or more of the input_core_dims. If reduce_func is left to None, apply_ufunc will keep crashing like it does now. 
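In the meantime, a workaround is to collapse the core dimension into a single chunk before calling apply_ufunc. This is only a minimal sketch: it assumes the ``mulsum`` kernel defined above, hypothetical random inputs, and that one chunk along ``x`` fits in memory (exactly the cost that the proposed ``reduce_func`` - spelled ``operator.add`` in Python's ``operator`` module - would avoid):

```
import numpy as np
import xarray

# Hypothetical chunked inputs; 'x' is the core dim that reduce_func would handle.
a = xarray.DataArray(np.random.rand(4, 100), dims=['y', 'x']).chunk({'x': 50})
b = xarray.DataArray(np.random.rand(4, 100), dims=['y', 'x']).chunk({'x': 50})

# Workaround: merge 'x' into a single chunk so that dask='parallelized' accepts it.
# With the proposed reduce_func this rechunk would be unnecessary.
c = xarray.apply_ufunc(
    mulsum,                  # the numba kernel defined above
    a.chunk({'x': -1}),
    b.chunk({'x': -1}),
    input_core_dims=[['x'], ['x']],
    dask='parallelized',
    output_dtypes=[float],
)
```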
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1995/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 417356439,MDU6SXNzdWU0MTczNTY0Mzk=,2801,NaN-sized chunks,6213168,open,0,,,2,2019-03-05T15:30:14Z,2021-04-24T02:41:34Z,,MEMBER,,,,"It would be nice to have support for NaN-sized dask chunks, e.g. ``x[x > 2]``. There are two problems: 1. ``x[x > 2]`` silently resolves the dask graph. It definitely shouldn't. There needs to be some discussion on what needs to happen to indices on the NaN-sized dimension; I can think of 3 options: - silently drop any index that would become undefined - drop any index that would become undefined and issue a warning - hard crash if there is any index that would become undefined - redesign IndexVariable so that it can contain dask data (probably much more complicated than the 3 above). The above design decision is anyway for when there _is_ an index; dims without indices should just work. 2. This crashes: ```>>> a = xarray.DataArray([1, 2, 3, 4]).chunk(2) >>> xarray.DataArray(a.data[a.data > 2]).compute() ValueError: replacement data must match the Variable's shape ``` I didn't investigate but I suspect it should be trivial to fix. I'm not sure why there is a check at all? Any such health check should be in dask only IMHO.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2801/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 311573817,MDU6SXNzdWUzMTE1NzM4MTc=,2039,open_mfdataset: skip loading for indexes and coordinates from all but the first file,6213168,open,0,,,1,2018-04-05T11:32:02Z,2021-01-27T17:49:21Z,,MEMBER,,,,"This is a follow-up from #1521. When invoking open_mfdataset, very frequently the user knows in advance that all of his coords that aren't on the concat_dim are already aligned, and may be willing to blindly trust such assumption in exchange of a huge performance boost. My production data: 200x NetCDF files on a not very performant NFS file system, concatenated on the ""scenario"" dimension: ``` xarray.open_mfdataset('cube.*.nc', engine='h5netcdf', concat_dim='scenario') Dimensions: (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1) Coordinates: * attribute (attribute) object 'THEO/Value' currency (instr_id) object 'ZAR' 'EUR' 'EUR' 'EUR' 'EUR' 'EUR' 'GBP' ... * fx_id (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ... * instr_id (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ... * timestep (timestep) datetime64[ns] 2016-12-31 type (instr_id) object 'American' 'Bond Future' 'Bond Future' ... * scenario (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ... Data variables: FX (fx_id, timestep, scenario) float64 dask.array instruments (instr_id, attribute, timestep, scenario) float64 dask.array CPU times: user 19.6 s, sys: 981 ms, total: 20.6 s Wall time: 24.4 s ``` If I skip loading and comparing the non-index coords from all 200 files: ``` xarray.open_mfdataset('cube.*.nc'), engine='h5netcdf', concat_dim='scenario', coords='all') Dimensions: (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1) Coordinates: * attribute (attribute) object 'THEO/Value' * fx_id (fx_id) object 'GBP' 'USD' 'EUR' 'JPY' 'ARS' 'AUD' 'BRL' ... * instr_id (instr_id) object 'S01626556_ZAE000204921' '537805_1275' ... 
* timestep (timestep) datetime64[ns] 2016-12-31 currency (scenario, instr_id) object dask.array * scenario (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ... type (scenario, instr_id) object dask.array Data variables: FX (fx_id, timestep, scenario) float64 dask.array instruments (instr_id, attribute, timestep, scenario) float64 dask.array CPU times: user 12.7 s, sys: 305 ms, total: 13 s Wall time: 14.8 s ``` If I skip loading and comparing also the index coords from all 200 files: ``` cube = xarray.open_mfdataset(sh.resolve_env(f'{dynamic}/mtf/{cubename}/nc/cube.*.nc'), engine='h5netcdf', concat_dim='scenario', drop_variables=['attribute', 'fx_id', 'instr_id', 'timestep', 'currency', 'type']) Dimensions: (attribute: 1, fx_id: 40, instr_id: 10765, scenario: 500001, timestep: 1) Coordinates: * scenario (scenario) object 'Base Scenario' 'SSMC_1' 'SSMC_2' ... Dimensions without coordinates: attribute, fx_id, instr_id, timestep Data variables: FX (fx_id, timestep, scenario) float64 dask.array instruments (instr_id, attribute, timestep, scenario) float64 dask.array CPU times: user 7.31 s, sys: 61 ms, total: 7.37 s Wall time: 9.05 s ``` # Proposed design Add a new optional parameter to open_mfdataset, ``assume_aligned=None``. It can be valued to a list of variable names or ""all"", and requires ``concat_dim`` to be explicitly set. It causes open_mfdataset to use the first occurrence of every variable and blindly skip loading the subsequent ones. ## Algorithm 1. Perform the first invocation to the underlying open_dataset like it happens now 2. if assume_aligned is not None: for each new NetCDF file, figure out which variables need to be aligned & compared (as opposed to concatenated), and add them to a drop_variables list. 3. if assume_aligned != ""all"": drop_variables &= assume_aligned 3. Pass the increasingly long drop_variables list to the underlying open_dataset","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2039/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 272004812,MDU6SXNzdWUyNzIwMDQ4MTI=,1699,apply_ufunc(dask='parallelized') output_dtypes for datasets,6213168,open,0,,,8,2017-11-07T22:18:23Z,2020-04-06T15:31:17Z,,MEMBER,,,,"When a Dataset has variables with different dtypes, there's no way to tell apply_ufunc that the same function applied to different variables will produce different dtypes: ``` ds1 = xarray.Dataset(data_vars={'a': ('x', [1, 2]), 'b': ('x', [3.0, 4.5])}).chunk() ds2 = xarray.apply_ufunc(lambda x: x + 1, ds1, dask='parallelized', output_dtypes=[float]) ds2 Dimensions: (x: 2) Dimensions without coordinates: x Data variables: a (x) float64 dask.array b (x) float64 dask.array ds2.compute() Dimensions: (x: 2) Dimensions without coordinates: x Data variables: a (x) int64 2 3 b (x) float64 4.0 5.5 ``` ### Proposed solution When the output is a dataset, apply_ufunc could accept either ``output_dtypes=[t]`` (if all output variables will have the same dtype) or ``output_dtypes=[{var1: t1, var2: t2, ...}]``. 
In the example above, it would be ``output_dtypes=[{'a': int, 'b': float}]``.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1699/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 503983776,MDU6SXNzdWU1MDM5ODM3NzY=,3382,Improve indexing performance benchmarks,6213168,open,0,,,0,2019-10-08T11:20:39Z,2019-11-14T15:52:33Z,,MEMBER,,,,"As discussed in #3375 - FYI @jhamman ``asv_bench/benchmarks/indexing.py`` is currently missing some key use cases: - All tests in the above module use arrays with 2~6 million points. While this is important to spot any case where the numpy underlying functions start being unnecessarily called more than once, it also means any performance improvement or degradation in any of the pure-Python code will be completely drowned out. All tests should be run twice, once with the current ``nx = 3000; ny = 2000; nt = 1000`` and again with ``nx = 15; ny = 10; nt = 5``. - DataArray slicing (sel, isel, and square brackets) - Slicing when there are no IndexVariables (verify that we're not creating dummy variables, doing a full scan on them, and then discarding them) - other? ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3382/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 272002705,MDU6SXNzdWUyNzIwMDI3MDU=,1698,apply_ufunc(dask='parallelized') to infer output_dtypes,6213168,open,0,,,3,2017-11-07T22:11:11Z,2019-10-22T08:33:38Z,,MEMBER,,,,"If one doesn't provide the ``dtype`` parameter to ``dask.map_blocks()``, it automatically infers it by running the kernel on trivial dummy data. It should be straightforward to make ``xarray.apply_ufunc(dask='parallelized')`` use the same functionality if the ``output_dtypes`` parameter is omitted.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1698/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 485708282,MDU6SXNzdWU0ODU3MDgyODI=,3268,Stateful user-defined accessors,6213168,open,0,,,15,2019-08-27T09:54:28Z,2019-10-08T11:13:25Z,,MEMBER,,,,"If anybody decorates a stateful class with ``@register_dataarray_accessor`` or ``@register_dataset_accessor``, the instance will lose its state on any method that invokes ``_to_temp_dataset``, as well as on a shallow copy. ```python In [1]: @xarray.register_dataarray_accessor('foo') ...: class Foo: ...: def __init__(self, obj): ...: self.obj = obj ...: self.x = 1 ...: ...: In [2]: a = xarray.DataArray() In [3]: a.foo.x Out[3]: 1 In [4]: a.foo.x = 2 In [5]: a.foo.x Out[5]: 2 In [6]: a.roll().foo.x Out[6]: 1 In [7]: a.copy(deep=False).foo.x Out[7]: 1 ``` While in the case of ``_to_temp_dataset`` it could be possible to spend (substantial) effort to retain the state, on the case of copy() it's impossible without modifying the accessor duck API, as one would need to tamper with the accessor instance in place and modify the pointer back to the DataArray/Dataset. This issue is so glaring that it makes me strongly suspect that nobody saves any state in accessor classes. This kind of use would also be problematic in practical terms, as the accessor object would have a hard time realising when its own state is no longer coherent with the referenced DataArray/Dataset. 
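For comparison, a minimal sketch of how an accessor author can avoid instance state altogether today: keep anything persistent on the wrapped object (here a hypothetical ``foo_x`` key in ``attrs``), so it does not depend on the accessor instance being cached:

```
import xarray

@xarray.register_dataarray_accessor('foo2')
class Foo2:
    def __init__(self, obj):
        # Keep only a reference to the wrapped object; no mutable state here.
        self._obj = obj

    @property
    def x(self):
        # State lives on the DataArray itself, not on the accessor instance.
        return self._obj.attrs.get('foo_x', 1)

    @x.setter
    def x(self, value):
        self._obj.attrs['foo_x'] = value
```

Operations that preserve ``attrs`` (e.g. ``copy()``) then carry the state along, which sidesteps the state-loss problem even if the accessor object is rebuilt every time.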
This design also carries the problem that it introduces a circular reference in the DataArray/Dataset. This means that, after someone invokes an accessor method on his DataArray/Dataset, then the whole object - _including the numpy buffers!_ - won't be instantly collected when it's dereferenced by the user, and it will have to instead wait for the next ``gc`` pass. This could cause huge increases in RAM usage overnight in a user application, which would be very hard to logically link to a change that just added a custom method. Finally, with https://github.com/pydata/xarray/pull/3250/, this statefulness forces us to increase the RAM usage of all datasets and dataarrays by an extra slot, for all users, even if this feature is quite niche. **Proposed solution** Get rid of accessor caching altogether, and just recreate the accessor object from scratch every time it is invoked. In the documentation, clarify that the ``__init__`` method should not perform anything computationally intensive.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3268/reactions"", ""total_count"": 4, ""+1"": 4, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 311578894,MDU6SXNzdWUzMTE1Nzg4OTQ=,2040,to_netcdf() to automatically switch to fixed-length strings for compressed variables ,6213168,open,0,,,2,2018-04-05T11:50:16Z,2019-01-13T01:42:03Z,,MEMBER,,,,"When you have fixed-length numpy arrays of unicode characters (