html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/3484#issuecomment-549627590,https://api.github.com/repos/pydata/xarray/issues/3484,549627590,MDEyOklzc3VlQ29tbWVudDU0OTYyNzU5MA==,10554254,2019-11-05T01:50:29Z,2020-02-12T02:51:51Z,NONE,"After reading through the issue tracker and PRs, it looks like sparse arrays can safely be wrapped with xarray, thanks to the work done in [PR#3117](https://github.com/pydata/xarray/pull/3117), but built-in functions are still under development (e.g. [PR#3542](https://github.com/pydata/xarray/pull/3542)). As a user, here is what I am seeing when test driving sparse:
Sparse gives me a smaller in-memory array:
```python
In [1]: import xarray as xr, sparse, sys, numpy as np, dask.array as da
In [2]: x = np.random.random((100, 100, 100))
In [3]: x[x < 0.9] = np.nan
In [4]: s = sparse.COO.from_numpy(x, fill_value=np.nan)
In [5]: sys.getsizeof(s)
Out[5]: 3189592
In [6]: sys.getsizeof(x)
Out[6]: 8000128
```
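For scale: after the `x < 0.9` mask, only about 10% of the values are non-fill, so COO stores roughly 100k coordinate/value triples instead of a million float64s, which lines up with the ~3.2 MB vs ~8 MB sizes above (a quick sanity check, using the `s` from above):
```python
# COO keeps one coordinate/value triple per non-fill entry.
print(s.nnz)      # ~100000
print(s.density)  # ~0.1
```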
Both arrays can be wrapped with dask and xarray:
```python
In [7]: x = da.from_array(x)
In [8]: s = da.from_array(s)
In [9]: ds_dense = xr.DataArray(x).to_dataset(name='data_variable')
In [10]: ds_sparse = xr.DataArray(s).to_dataset(name='data_variable')
In [11]: ds_dense
Out[11]:
Dimensions: (dim_0: 100, dim_1: 100, dim_2: 100)
Dimensions without coordinates: dim_0, dim_1, dim_2
Data variables:
data_variable (dim_0, dim_1, dim_2) float64 dask.array
In [12]: ds_sparse
Out[12]:
Dimensions: (dim_0: 100, dim_1: 100, dim_2: 100)
Dimensions without coordinates: dim_0, dim_1, dim_2
Data variables:
data_variable (dim_0, dim_1, dim_2) float64 dask.array
```
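Worth noting: wrapping with dask does not densify anything; the chunks stay sparse, which one can confirm by computing the wrapped array (a quick check, assuming the objects above):
```python
# compute() should hand back a sparse.COO, not a densified ndarray,
# since the dask chunks themselves are sparse.
type(ds_sparse.data_variable.data.compute())  # sparse.COO
```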
However, computing on the sparse-backed Dataset takes longer than on the dense one (which I think is expected...?)
```python
In [13]: %%time
...: ds_sparse.mean().compute()
CPU times: user 487 ms, sys: 22.9 ms, total: 510 ms
Wall time: 518 ms
Out[13]:
Dimensions: ()
Data variables:
data_variable float64 0.9501
In [14]: %%time
...: ds_dense.mean().compute()
CPU times: user 10.9 ms, sys: 3.91 ms, total: 14.8 ms
Wall time: 13.8 ms
Out[14]:
Dimensions: ()
Data variables:
data_variable float64 0.9501
```
And writing to netCDF, to take advantage of the smaller data size, doesn't work out of the box (yet):
```python
In [15]: ds_sparse.to_netcdf('ds_sparse.nc')
...
RuntimeError: Cannot convert a sparse array to dense automatically. To manually densify, use the todense method.
```
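In the meantime, a possible workaround is to densify before writing, which of course gives up the on-disk size advantage (an untested sketch using `COO.todense()` on each dask block, not an official API; `dense_again` is just a local name):
```python
# Replace the sparse-backed dask array with a densified one, then write.
dense_again = ds_sparse.copy(
    data={
        'data_variable': ds_sparse['data_variable'].data.map_blocks(
            lambda block: block.todense(), dtype=float
        )
    }
)
dense_again.to_netcdf('ds_sparse.nc')
```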
Additional discussion happening at #3213
@dcherian @shoyer Am I missing any built-in methods that are working and ready for public release? Happy to send in a PR if any of what is shown here would fit as a basic example for the docs.
I am not using sparse arrays in my own research just yet, but once I reach that stage I can dig in further and hopefully send in some useful PRs for documentation and fixes/features.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,517338735
https://github.com/pydata/xarray/issues/3315#issuecomment-584975960,https://api.github.com/repos/pydata/xarray/issues/3315,584975960,MDEyOklzc3VlQ29tbWVudDU4NDk3NTk2MA==,10554254,2020-02-12T01:46:00Z,2020-02-12T01:46:00Z,NONE,"A few observations after looking at the default flags for `concat`:
```python
xr.concat(
objs,
dim,
data_vars='all',
coords='different',
compat='equals',
positions=None,
fill_value=<NA>,
join='outer',
)
```
The description of `compat='equals'` indicates that combining DataArrays with different names should fail: `'equals': all values and dimensions must be the same.` (though I am not entirely sure what is meant by `values`; perhaps it generically means `keys`?)
Another option is `compat='identical'` which is described as: `'identical': all values, dimensions and attributes must be the same.` Using this flag will cause the operation to fail, as one would expect from the description...
```python
objs = [xr.DataArray([0],
dims='x',
name='a'),
xr.DataArray([1],
dims='x',
name='b')]
xr.concat(objs, dim='x', compat='identical')
```
```python
ValueError: array names not identical
```
... and this is also the case for `concat` on Datasets, as previously shown by @TomNicholas:
```python
objs = [xr.Dataset({'a': ('x', [0])}),
xr.Dataset({'b': ('x', [0])})]
xr.concat(objs, dim='x')
```
```python
ValueError: 'a' is not present in all datasets.
```
However, `'identical': all values, dimensions and **attributes** must be the same.` doesn't quite seem to be the case for DataArrays, as
```python
objs = [xr.DataArray([0],
dims='x',
name='a',
attrs={'foo':1}),
xr.DataArray([1],
dims='x',
name='a',
attrs={'bar':2})]
xr.concat(objs, dim='x', compat='identical')
```
succeeds with
```python
array([0, 1])
Dimensions without coordinates: x
Attributes:
foo: 1
```
but again fails on Datasets, as one would expect from the description.
```python
ds1 = xr.Dataset({'a': ('x', [0])})
ds1.attrs['foo'] = 'example attribute'
ds2 = xr.Dataset({'a': ('x', [1])})
ds2.attrs['bar'] = 'example attribute'
objs = [ds1,ds2]
xr.concat(objs, dim='x',compat='identical')
```
```python
ValueError: Dataset global attributes not equal.
```
I also had a look at `compat='override'`, which overrides an `attrs` inconsistency but not a naming one when applied to Datasets; it works as expected on DataArrays. It is described as: `'override': skip comparing and pick variable from first dataset`.
Potential resolutions:
1. `'identical'` should raise an error when attributes are not the same for DataArrays
2. `'equals'` should raise an error when DataArray names are not identical (unless one is None, which works with Datasets and seems fine to be replaced); see the sketch below
3. `'override'` should override naming inconsistencies when combining Datasets.
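To make resolution 2 concrete: with the default `compat='equals'`, the differently named DataArrays from the first example currently concatenate without error instead of failing; if I am reading the behavior right, the result simply takes the first name (a minimal sketch):
```python
objs = [xr.DataArray([0], dims='x', name='a'),
        xr.DataArray([1], dims='x', name='b')]
xr.concat(objs, dim='x')  # succeeds; result is named 'a'
```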
Final thought: perhaps promoting a DataArray to a Dataset whenever it meets the requirements for one would simplify keeping operations and checks consistent?
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,494906646
https://github.com/pydata/xarray/issues/3445#issuecomment-551359502,https://api.github.com/repos/pydata/xarray/issues/3445,551359502,MDEyOklzc3VlQ29tbWVudDU1MTM1OTUwMg==,10554254,2019-11-08T02:41:13Z,2019-11-08T02:41:13Z,NONE,"@El-minadero from the [sparse API](https://sparse.pydata.org/en/latest/generated/sparse.html) page I'm seeing two methods for combining data:
```python
import sparse
import numpy as np
A = sparse.COO.from_numpy(np.array([[1, 2], [3, 4]]))
B = sparse.COO.from_numpy(np.array([[5, 9], [6, 8]]))
sparse.stack([A,B]).todense()
Out[1]:
array([[[1, 2],
[3, 4]],
[[5, 9],
[6, 8]]])
sparse.concatenate([A,B]).todense()
Out[2]:
array([[1, 2],
[3, 4],
[5, 9],
[6, 8]])
```
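Both functions also take an `axis` argument for combining along other dimensions (same `A` and `B` as above):
```python
sparse.concatenate([A, B], axis=1).todense()
# array([[1, 2, 5, 9],
#        [3, 4, 6, 8]])
```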
Since this is an issue with `sparse`, and merging doesn't seem to be supported there at this time, you might consider closing this issue here and raising it over at [sparse](https://github.com/pydata/sparse/issues).
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,512205079
https://github.com/pydata/xarray/issues/3445#issuecomment-550516745,https://api.github.com/repos/pydata/xarray/issues/3445,550516745,MDEyOklzc3VlQ29tbWVudDU1MDUxNjc0NQ==,10554254,2019-11-06T21:51:31Z,2019-11-06T21:51:31Z,NONE,"Note that `dataset1 = xr.concat([data_array1, data_array2], dim='source')` or `dim='receiver'` seems to work; however, `concat` also fails if `time` is specified as the dimension.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,512205079
https://github.com/pydata/xarray/pull/3312#issuecomment-532754800,https://api.github.com/repos/pydata/xarray/issues/3312,532754800,MDEyOklzc3VlQ29tbWVudDUzMjc1NDgwMA==,10554254,2019-09-18T16:08:09Z,2019-09-18T16:08:09Z,NONE,Opened https://github.com/pydata/xarray/issues/3315 regarding combine_nested() failing when passed nested Datasets.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,494210818
https://github.com/pydata/xarray/pull/3312#issuecomment-532419859,https://api.github.com/repos/pydata/xarray/issues/3312,532419859,MDEyOklzc3VlQ29tbWVudDUzMjQxOTg1OQ==,10554254,2019-09-17T22:03:23Z,2019-09-17T23:51:13Z,NONE,"`pytest -q xarray/tests/test_combine.py` is telling me that
```
def test_concat_name_symmetry(self):
""""""Inspired by the discussion on GH issue #2777""""""
da1 = DataArray(name=""a"", data=[[0]], dims=[""x"", ""y""])
da2 = DataArray(name=""b"", data=[[1]], dims=[""x"", ""y""])
da3 = DataArray(name=""a"", data=[[2]], dims=[""x"", ""y""])
da4 = DataArray(name=""b"", data=[[3]], dims=[""x"", ""y""])
x_first = combine_nested([[da1, da2], [da3, da4]], concat_dim=[""x"", ""y""])
```
fails with:
```
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input> in <module>
3 da3 = xr.DataArray(name=""a"", data=[[2]], dims=[""x"", ""y""])
4 da4 = xr.DataArray(name=""b"", data=[[3]], dims=[""x"", ""y""])
----> 5 xr.combine_nested([[da1, da2], [da3, da4]], concat_dim=[""x"", ""y""])
~/repos/contribute/xarray/xarray/core/combine.py in combine_nested(objects, concat_dim, compat, data_vars, coords, fill_value, join)
468 ids=False,
469 fill_value=fill_value,
--> 470 join=join,
471 )
472
~/repos/contribute/xarray/xarray/core/combine.py in _nested_combine(datasets, concat_dims, compat, data_vars, coords, ids, fill_value, join)
305 coords=coords,
306 fill_value=fill_value,
--> 307 join=join,
308 )
309 return combined
~/repos/contribute/xarray/xarray/core/combine.py in _combine_nd(combined_ids, concat_dims, data_vars, coords, compat, fill_value, join)
196 compat=compat,
197 fill_value=fill_value,
--> 198 join=join,
199 )
200 (combined_ds,) = combined_ids.values()
~/repos/contribute/xarray/xarray/core/combine.py in _combine_all_along_first_dim(combined_ids, dim, data_vars, coords, compat, fill_value, join)
218 datasets = combined_ids.values()
219 new_combined_ids[new_id] = _combine_1d(
--> 220 datasets, dim, compat, data_vars, coords, fill_value, join
221 )
222 return new_combined_ids
~/repos/contribute/xarray/xarray/core/combine.py in _combine_1d(datasets, concat_dim, compat, data_vars, coords, fill_value, join)
246 compat=compat,
247 fill_value=fill_value,
--> 248 join=join,
249 )
250 except ValueError as err:
~/repos/contribute/xarray/xarray/core/concat.py in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join)
131 ""objects, got %s"" % type(first_obj)
132 )
--> 133 return f(objs, dim, data_vars, coords, compat, positions, fill_value, join)
134
135
~/repos/contribute/xarray/xarray/core/concat.py in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join)
363 for k in datasets[0].variables:
364 if k in concat_over:
--> 365 vars = ensure_common_dims([ds.variables[k] for ds in datasets])
366 combined = concat_vars(vars, dim, positions)
367 assert isinstance(combined, Variable)
~/repos/contribute/xarray/xarray/core/concat.py in <listcomp>(.0)
363 for k in datasets[0].variables:
364 if k in concat_over:
--> 365 vars = ensure_common_dims([ds.variables[k] for ds in datasets])
366 combined = concat_vars(vars, dim, positions)
367 assert isinstance(combined, Variable)
~/repos/contribute/xarray/xarray/core/utils.py in __getitem__(self, key)
383
384 def __getitem__(self, key: K) -> V:
--> 385 return self.mapping[key]
386
387 def __iter__(self) -> Iterator[K]:
KeyError: 'a'
```
It looks like the existing combine_nested() routine actually wants DataArrays and fails if passed Datasets.
The following should work with current master.
```python
da1 = xr.DataArray(name=""a"", data=[[0]], dims=[""x"", ""y""])
da2 = xr.DataArray(name=""b"", data=[[1]], dims=[""x"", ""y""])
da3 = xr.DataArray(name=""a"", data=[[2]], dims=[""x"", ""y""])
da4 = xr.DataArray(name=""b"", data=[[3]], dims=[""x"", ""y""])
xr.combine_nested([[da1, da2], [da3, da4]], concat_dim=[""x"", ""y""])
```
Converting to Datasets triggers the same error raised by the test:
```python
ds1 = da1.to_dataset()
ds2 = da2.to_dataset()
ds3 = da3.to_dataset()
ds4 = da4.to_dataset()
xr.combine_nested([[ds1, ds2], [ds3, ds4]], concat_dim=[""x"", ""y""])
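# Hedged aside: if the KeyError 'a' really comes from the datasets not
# sharing variable names, renaming to a common name should let the
# nested combine succeed, e.g.:
# xr.combine_nested([[ds1, ds2.rename({'b': 'a'})],
#                    [ds3, ds4.rename({'b': 'a'})]], concat_dim=['x', 'y'])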
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,494210818
https://github.com/pydata/xarray/issues/3248#issuecomment-531511177,https://api.github.com/repos/pydata/xarray/issues/3248,531511177,MDEyOklzc3VlQ29tbWVudDUzMTUxMTE3Nw==,10554254,2019-09-14T20:31:22Z,2019-09-14T20:31:22Z,NONE,"Some additional information on the topic:
Combining named 1D data arrays works.
```
da1 = xr.DataArray(name='foo', data=np.random.randn(3), coords=[('x', [1, 2, 3])])
da2 = xr.DataArray(name='foo', data=np.random.randn(3), coords=[('x', [5, 6, 7])])
xr.combine_by_coords([da1, da2])
Dimensions: (x: 6)
Coordinates:
* x (x) int64 1 2 3 5 6 7
Data variables:
foo (x) float64 1.443 0.4889 0.9233 0.1946 -1.639 -1.455
```
However, when combining 2D gridded data...
```
da1 = xr.DataArray(name='foo',
data=np.random.rand(3,3),
coords=[('x', [1, 2, 3]),
('y', [1, 2, 3])])
da2 = xr.DataArray(name='foo',
data=np.random.rand(3,3),
coords=[('x', [5, 6, 7]),
('y', [5, 6, 7])])
xr.combine_by_coords([da1, da2])
```
...the method fails, despite passing a data variable name.
```
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input> in <module>
9 ('y', [5, 6, 7])])
10
---> 11 xr.combine_by_coords([da1, da2])
~/xarray/xarray/core/combine.py in combine_by_coords(datasets, compat, data_vars, coords, fill_value, join)
580
581 # Group by data vars
--> 582 sorted_datasets = sorted(datasets, key=vars_as_keys)
583 grouped_by_vars = itertools.groupby(sorted_datasets, key=vars_as_keys)
584
~/xarray/xarray/core/combine.py in vars_as_keys(ds)
465
466 def vars_as_keys(ds):
--> 467 return tuple(sorted(ds))
468
469
~/xarray/xarray/core/common.py in __bool__(self)
119
120 def __bool__(self: Any) -> bool:
--> 121 return bool(self.values)
122
123 def __float__(self: Any) -> float:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
```
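For what it's worth, the traceback points at `vars_as_keys`, which calls `tuple(sorted(ds))`. Iterating a Dataset yields its variable names, but iterating a DataArray yields sub-arrays: with 1D data the elements are scalars and the sort happens to run, while with 2D data comparing sub-arrays hits the ambiguous truth-value error above. A minimal illustration of that hypothesis (not xarray internals, just the objects defined above):
```python
sorted(da1.to_dataset())  # ['foo']: iterating a Dataset yields variable names
sorted(da1.isel(y=0))     # 1D: comparing 0-d elements works, so this runs
# sorted(da1)             # 2D: comparing 1D sub-arrays raises the ValueError above
```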
Again, converting to a dataset bypasses the issue.
```
ds1 = da1.to_dataset()
ds2 = da2.to_dataset()
xr.combine_by_coords([ds1, ds2])
Dimensions: (x: 6, y: 6)
Coordinates:
* x (x) int64 1 2 3 5 6 7
* y (y) int64 1 2 3 5 6 7
Data variables:
foo (x, y) float64 0.5078 0.8981 0.8707 nan ... 0.4172 0.7259 0.8431
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,484270833
https://github.com/pydata/xarray/issues/1301#issuecomment-344949160,https://api.github.com/repos/pydata/xarray/issues/1301,344949160,MDEyOklzc3VlQ29tbWVudDM0NDk0OTE2MA==,10554254,2017-11-16T15:01:59Z,2017-11-16T15:02:48Z,NONE,"Looks like it has been resolved! Tested with the latest pre-release v0.10.0rc2 on the dataset linked by najascutellatus above (https://marine.rutgers.edu/~michaesm/netcdf/data/).
```
da.set_options(get=da.async.get_sync)
%prun -l 10 ds = xr.open_mfdataset('./*.nc')
```
xarray==0.10.0rc2-1-g8267fdb
dask==0.15.4
```
194381 function calls (188429 primitive calls) in 0.869 seconds
Ordered by: internal time
List reduced from 469 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
50 0.393 0.008 0.393 0.008 {numpy.core.multiarray.arange}
50 0.164 0.003 0.557 0.011 indexing.py:266(_index_indexer_1d)
5 0.083 0.017 0.085 0.017 netCDF4_.py:185(_open_netcdf4_group)
190 0.024 0.000 0.066 0.000 netCDF4_.py:256(open_store_variable)
190 0.022 0.000 0.022 0.000 netCDF4_.py:29(__init__)
50 0.018 0.000 0.021 0.000 {operator.getitem}
5145/3605 0.012 0.000 0.019 0.000 indexing.py:493(shape)
2317/1291 0.009 0.000 0.094 0.000 _abcoll.py:548(update)
26137 0.006 0.000 0.013 0.000 {isinstance}
720 0.005 0.000 0.006 0.000 {method 'getncattr' of 'netCDF4._netCDF4.Variable' objects}
```
xarray==0.9.1
dask==0.13.0
```
241253 function calls (229881 primitive calls) in 98.123 seconds
Ordered by: internal time
List reduced from 659 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
30 87.527 2.918 87.527 2.918 {pandas._libs.tslib.array_to_timedelta64}
65 7.055 0.109 7.059 0.109 {operator.getitem}
80 0.799 0.010 0.799 0.010 {numpy.core.multiarray.arange}
7895/4420 0.502 0.000 0.524 0.000 utils.py:412(shape)
68 0.442 0.007 0.442 0.007 {pandas._libs.algos.ensure_object}
80 0.350 0.004 1.150 0.014 indexing.py:318(_index_indexer_1d)
60/30 0.296 0.005 88.407 2.947 timedeltas.py:158(_convert_listlike)
30 0.284 0.009 0.298 0.010 algorithms.py:719(checked_add_with_arr)
123 0.140 0.001 0.140 0.001 {method 'astype' of 'numpy.ndarray' objects}
1049/719 0.096 0.000 96.513 0.134 {numpy.core.multiarray.array}
```","{""total_count"": 3, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 2, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,212561278
https://github.com/pydata/xarray/issues/1301#issuecomment-293619896,https://api.github.com/repos/pydata/xarray/issues/1301,293619896,MDEyOklzc3VlQ29tbWVudDI5MzYxOTg5Ng==,10554254,2017-04-12T15:42:18Z,2017-04-12T15:42:18Z,NONE,"`decode_times=False` significantly reduces read time, but the proportional performance discrepancy between xarray 0.8.2 and 0.9.1 remains the same.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,212561278
https://github.com/pydata/xarray/issues/1301#issuecomment-286220522,https://api.github.com/repos/pydata/xarray/issues/1301,286220522,MDEyOklzc3VlQ29tbWVudDI4NjIyMDUyMg==,10554254,2017-03-13T19:41:25Z,2017-03-13T19:41:25Z,NONE,"Looks like the issue might be that xarray 0.9.1 is decoding all timestamps on load.
xarray==0.9.1, dask==0.13.0
```
da.set_options(get=da.async.get_sync)
%prun -l 10 ds = xr.open_mfdataset('./*.nc')
167305 function calls (160352 primitive calls) in 59.688 seconds
Ordered by: internal time
List reduced from 625 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
18 57.057 3.170 57.057 3.170 {pandas.tslib.array_to_timedelta64}
39 0.860 0.022 0.863 0.022 {operator.getitem}
48 0.402 0.008 0.402 0.008 {numpy.core.multiarray.arange}
4341/2463 0.257 0.000 0.273 0.000 utils.py:412(shape)
88 0.245 0.003 0.245 0.003 {pandas.algos.ensure_object}
48 0.158 0.003 0.561 0.012 indexing.py:318(_index_indexer_1d)
36/18 0.135 0.004 57.509 3.195 timedeltas.py:150(_convert_listlike)
18 0.126 0.007 0.130 0.007 nanops.py:815(_checked_add_with_arr)
51 0.070 0.001 0.070 0.001 {method 'astype' of 'numpy.ndarray' objects}
676/475 0.047 0.000 58.853 0.124 {numpy.core.multiarray.array}
```
`pandas.tslib.array_to_timedelta64` appears to be the most expensive item on the list, and isn't being run when using xarray 0.8.2.
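One quick way to test that hypothesis (a sketch; `decode_times` is a standard `open_mfdataset` keyword, though I have not re-profiled with it here, and `ds_raw` is just a local name):
```python
# If time decoding is the culprit, skipping it on open should recover
# most of the 0.8.2-era load speed.
ds_raw = xr.open_mfdataset('./*.nc', decode_times=False)
```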
xarray==0.8.2, dask==0.13.0
```
da.set_options(get=da.async.get_sync)
%prun -l 10 ds = xr.open_mfdataset('./*.nc')
140668 function calls (136769 primitive calls) in 0.766 seconds
Ordered by: internal time
List reduced from 621 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
2571/1800 0.178 0.000 0.184 0.000 utils.py:387(shape)
18 0.174 0.010 0.174 0.010 {numpy.core.multiarray.arange}
16 0.079 0.005 0.079 0.005 {numpy.core.multiarray.concatenate}
483/420 0.077 0.000 0.125 0.000 {numpy.core.multiarray.array}
15 0.054 0.004 0.197 0.013 indexing.py:259(_index_indexer_1d)
3 0.041 0.014 0.043 0.014 netCDF4_.py:181(__init__)
105 0.013 0.000 0.057 0.001 netCDF4_.py:196(open_store_variable)
15 0.012 0.001 0.013 0.001 {operator.getitem}
2715/1665 0.007 0.000 0.178 0.000 indexing.py:343(shape)
5971 0.006 0.000 0.006 0.000 collections.py:71(__setitem__)
```
The version of dask is held constant in each test.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,212561278