id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
309686915,MDU6SXNzdWUzMDk2ODY5MTU=,2027,square-bracket slice a Dataset with a DataArray,6213168,open,0,,,4,2018-03-29T09:39:57Z,2022-04-18T03:51:25Z,,MEMBER,,,,"Given this:

```
ds = xarray.Dataset(
    data_vars={
        'vote': ('pupil', [5, 7, 8]),
        'age': ('pupil', [15, 14, 16])
    },
    coords={
        'pupil': ['Alice', 'Bob', 'Charlie']
    })

<xarray.Dataset>
Dimensions:  (pupil: 3)
Coordinates:
  * pupil    (pupil) <U7 'Alice' 'Bob' 'Charlie'
Data variables:
    vote     (pupil) int64 5 7 8
    age      (pupil) int64 15 14 16
```

I can slice a DataArray with a boolean DataArray:

```
ds.age[ds.vote >= 6]

<xarray.DataArray 'age' (pupil: 2)>
array([14, 16])
Coordinates:
  * pupil    (pupil) <U7 'Bob' 'Charlie'
```

but the same square-bracket syntax fails on the Dataset:

```
ds[ds.vote >= 6]

KeyError: False
```

``ds.vote >= 6`` is a DataArray with dims=('pupil', ) and dtype=bool, so I can't think of any ambiguity in what I want to achieve.

Workaround:

```
ds.sel(pupil=ds.vote >= 6)

<xarray.Dataset>
Dimensions:  (pupil: 2)
Coordinates:
  * pupil    (pupil) <U7 'Bob' 'Charlie'
Data variables:
    vote     (pupil) int64 7 8
    age      (pupil) int64 14 16
```

With the current version of dask, there is no automatic alignment of chunks when performing operations between dask arrays with different chunk sizes. If your computation involves multiple dask arrays with different chunks, you may need to explicitly rechunk each array to ensure compatibility.

While chunk auto-alignment could be done within the dask library, it would be limited to arrays with the same dimensionality and the same dims order. For example, no dask library call could align the chunks of xarrays with the following dims:

- (time, latitude, longitude)
- (time)
- (longitude, latitude)

even though it makes perfect sense in xarray.

I think xarray.align() should take care of it automatically. A safe algorithm would be to always scale down the chunksize when in conflict. This would prevent having chunks larger than expected, and should minimise (in a greedy way) the number of operations. It's also a good idea on dask.distributed, where merging two chunks could cause one of them to travel on the network - which is very expensive. e.g.
to reconcile the chunksizes

    a: (5, 10, 6)
    b: (5, 7, 9)

the algorithm would rechunk both arrays to (5, 7, 3, 6).

Finally, when served with a numpy-based array and a dask-based array, align() should convert the numpy array to dask. The critical use case that would benefit from this behaviour is when align() is invoked inside a broadcast() between a tiny constant you just loaded from csv/pandas/a pure python list/whatever - e.g. dims=(time, ) shape=(100, ) - and a huge dask-backed array, e.g. dims=(time, scenario) shape=(100, 2\*\*30) chunks=(25, 2\*\*20).
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/979/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
339611449,MDU6SXNzdWUzMzk2MTE0NDk=,2273,to_netcdf uses deprecated and unnecessary dask call,6213168,closed,0,,,4,2018-07-09T21:20:20Z,2018-07-31T20:03:41Z,2018-07-31T19:42:20Z,MEMBER,,,,"```
>>> ds = xarray.Dataset({'x': 1})
>>> ds.to_netcdf('foo.nc')
dask/utils.py:1010: UserWarning: Deprecated, see dask.base.get_scheduler instead
```

Stack trace:

```
> xarray/backends/common.py(44)get_scheduler()
    43     from dask.utils import effective_get
---> 44     actual_get = effective_get(get, collection)
```

There are two separate problems here:

- dask recently changed API from ``get(get=callable)`` to ``get(scheduler=str)``. Should we
  - just increase the minimum version of dask (I doubt anybody will complain),
  - go through the hoops of dynamically invoking a different API depending on the dask version :sweat:, or
  - silence the warning now, and then increase the minimum version of dask the day that dask removes the old API entirely (risky)?
- xarray is calling dask even when it's unnecessary, as none of the variables in the example Dataset has a dask backend.

I don't think there are any CI suites for NetCDF without dask.
I'm also wondering if such CI suites would bring any actual added value: dask is small, has no exotic dependencies, and is pure Python, so I doubt anybody will have problems installing it, whatever their setup is.

@shoyer opinion?
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2273/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
172290413,MDU6SXNzdWUxNzIyOTA0MTM=,978,broadcast() broken on dask backend,6213168,closed,0,,,4,2016-08-20T20:56:33Z,2016-12-09T20:28:42Z,2016-12-09T20:28:42Z,MEMBER,,,,"```python
>>> a = xarray.DataArray([1, 2]).chunk(1)
>>> a
<xarray.DataArray (dim_0: 2)>
dask.array<...>
Coordinates:
  * dim_0    (dim_0) int64 0 1
>>> xarray.broadcast(a)
(<xarray.DataArray (dim_0: 2)>
array([1, 2])
Coordinates:
  * dim_0    (dim_0) int64 0 1,)
```

The problem is actually somewhere in the constructor of DataArray. In alignment.py:362 we have `return DataArray(data, ...)`, where data is a Variable with a dask backend, yet the returned DataArray object has a numpy backend. As a workaround, changing that line to `return DataArray(data.data, ...)` (thus passing a dask array) fixes the problem.

After that, however, there's a new issue: whenever broadcast() adds a dimension to an array, it creates it as a single chunk, as opposed to copying the chunking of the other arrays. This can easily cause a host to go out of memory, and makes it harder to work with the arrays afterwards because chunks won't match.
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/978/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
170744285,MDExOlB1bGxSZXF1ZXN0ODEwMjkzMDc=,963,Align broadcast,6213168,closed,0,,,4,2016-08-11T20:55:29Z,2016-08-14T23:25:02Z,2016-08-14T23:24:15Z,MEMBER,,0,pydata/xarray/pulls/963,"- Removed partial_align()
- Added exclude and indexes optional parameters to the align() public API
- Added exclude optional parameter to the broadcast() public API
- Added various unit tests to check that align() and broadcast() do not perform needless data copies
- broadcast() now automatically aligns its inputs

Note: there is a failing unit test, TestDataset.test_broadcast_nocopy, which shows broadcast() on a Dataset doing a data copy when it shouldn't. Could you look into it?
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/963/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
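The `ds.sel(pupil=ds.vote >= 6)` workaround from issue 2027 above can be sketched in pure Python, with no xarray dependency. This is only an illustration of the selection semantics (one boolean mask along a shared dimension, applied to every variable and to the index together); the dict-of-lists layout is an assumption of this sketch, not how xarray stores a Dataset.

```python
# Illustrative stand-in for the Dataset in issue 2027: two variables
# sharing the 'pupil' dimension, plus the index labels themselves.
data_vars = {"vote": [5, 7, 8], "age": [15, 14, 16]}
pupil = ["Alice", "Bob", "Charlie"]

# Rough equivalent of ``ds.vote >= 6``: a boolean mask along 'pupil'.
mask = [v >= 6 for v in data_vars["vote"]]

# Rough equivalent of ``ds.sel(pupil=mask)``: filter every variable and
# the index with the same mask, so they stay aligned.
selected = {k: [x for x, m in zip(v, mask) if m] for k, v in data_vars.items()}
selected_pupil = [p for p, m in zip(pupil, mask) if m]

print(selected_pupil)   # -> ['Bob', 'Charlie']
print(selected["age"])  # -> [14, 16]
```

The point of the issue is that the Dataset's square-bracket form should dispatch to exactly this per-variable masking when handed a boolean DataArray, rather than raising `KeyError`.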
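The greedy "always scale down" chunk-reconciliation rule proposed in issue 979 above (turning a: (5, 10, 6) and b: (5, 7, 9) into (5, 7, 3, 6)) can be sketched along a single dimension in pure Python: split at the union of both sets of chunk boundaries, so no chunk ever grows. `reconcile_chunks` is a hypothetical helper written for this sketch, not an existing dask or xarray function.

```python
from itertools import accumulate


def reconcile_chunks(a, b):
    """Return a common chunking for two chunk tuples along one dimension.

    a, b: tuples of chunk sizes covering the same total size, e.g. (5, 10, 6).
    Chunks are only ever split, never merged, matching the greedy
    scale-down rule described above.
    """
    if sum(a) != sum(b):
        raise ValueError("chunk tuples must cover the same total size")
    # Cumulative chunk boundaries, e.g. (5, 10, 6) -> {5, 15, 21},
    # merged across both inputs.
    edges = sorted(set(accumulate(a)) | set(accumulate(b)))
    # Convert the merged boundaries back into chunk sizes.
    return tuple(hi - lo for lo, hi in zip([0] + edges, edges))


print(reconcile_chunks((5, 10, 6), (5, 7, 9)))  # -> (5, 7, 3, 6)
```

Because the result only subdivides existing chunks, no chunk's data ever has to be combined from two source chunks, which is what makes the rule cheap on dask.distributed: nothing needs to travel over the network to build the realigned chunks.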