html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/3484#issuecomment-549627590,https://api.github.com/repos/pydata/xarray/issues/3484,549627590,MDEyOklzc3VlQ29tbWVudDU0OTYyNzU5MA==,10554254,2019-11-05T01:50:29Z,2020-02-12T02:51:51Z,NONE,"After reading through the issue tracker and PRs, it looks like sparse arrays can safely be wrapped with xarray, thanks to the work done in [PR#3117](https://github.com/pydata/xarray/pull/3117), but built-in functions are still under development (e.g. [PR#3542](https://github.com/pydata/xarray/pull/3542)). As a user, here is what I am seeing when test-driving sparse:

Sparse gives me a smaller in-memory array

```python
In [1]: import xarray as xr, sparse, sys, numpy as np, dask.array as da

In [2]: x = np.random.random((100, 100, 100))

In [3]: x[x < 0.9] = np.nan

In [4]: s = sparse.COO.from_numpy(x, fill_value=np.nan)

In [5]: sys.getsizeof(s)
Out[5]: 3189592

In [6]: sys.getsizeof(x)
Out[6]: 8000128
```

which I can wrap with dask and xarray

```python
In [7]: x = da.from_array(x)

In [8]: s = da.from_array(s)

In [9]: ds_dense = xr.DataArray(x).to_dataset(name='data_variable')

In [10]: ds_sparse = xr.DataArray(s).to_dataset(name='data_variable')

In [11]: ds_dense
Out[11]:
Dimensions:        (dim_0: 100, dim_1: 100, dim_2: 100)
Dimensions without coordinates: dim_0, dim_1, dim_2
Data variables:
    data_variable  (dim_0, dim_1, dim_2) float64 dask.array

In [12]: ds_sparse
Out[12]:
Dimensions:        (dim_0: 100, dim_1: 100, dim_2: 100)
Dimensions without coordinates: dim_0, dim_1, dim_2
Data variables:
    data_variable  (dim_0, dim_1, dim_2) float64 dask.array
```

However, computation on a sparse array takes longer than running compute on a dense array (which I think is expected...?)

```python
In [13]: %%time
    ...: ds_sparse.mean().compute()
CPU times: user 487 ms, sys: 22.9 ms, total: 510 ms
Wall time: 518 ms
Out[13]:
Dimensions:        ()
Data variables:
    data_variable  float64 0.9501

In [14]: %%time
    ...: ds_dense.mean().compute()
CPU times: user 10.9 ms, sys: 3.91 ms, total: 14.8 ms
Wall time: 13.8 ms
Out[14]:
Dimensions:        ()
Data variables:
    data_variable  float64 0.9501
```

And writing to netcdf, to take advantage of the smaller data size, doesn't work out of the box (yet)

```python
In [15]: ds_sparse.to_netcdf('ds_sparse.nc')
Out[15]:
...
RuntimeError: Cannot convert a sparse array to dense automatically. To manually densify, use the todense method.
```

Additional discussion happening at #3213

@dcherian @shoyer Am I missing any built-in methods that are working and ready for public release? Happy to send in a PR if any of what is provided here should go into a basic example for the docs. At this stage, I am not using sparse arrays for my own research just yet, but when I get to that anticipated phase I can dig in more on this and hopefully send in some useful PRs for improved documentation and fixes/features.
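
For what it's worth, a possible stopgap for the `to_netcdf` failure (an untested sketch, not an xarray API; it simply densifies each dask chunk before writing, so it gives up the size savings at write time):

```python
# Untested sketch: each dask chunk here is a sparse.COO block, so mapping
# todense() over the blocks yields an ordinary dense dask array, which
# to_netcdf can then write as usual.
dense = ds_sparse['data_variable'].data.map_blocks(
    lambda block: block.todense(), dtype=np.float64
)
xr.DataArray(dense, dims=('dim_0', 'dim_1', 'dim_2')).to_dataset(
    name='data_variable'
).to_netcdf('ds_sparse.nc')
```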
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,517338735 https://github.com/pydata/xarray/issues/3315#issuecomment-584975960,https://api.github.com/repos/pydata/xarray/issues/3315,584975960,MDEyOklzc3VlQ29tbWVudDU4NDk3NTk2MA==,10554254,2020-02-12T01:46:00Z,2020-02-12T01:46:00Z,NONE,"Few observations after looking at the default flags for `concat`: ```python xr.concat( objs, dim, data_vars='all', coords='different', compat='equals', positions=None, fill_value=, join='outer', ) ``` The description of `compat='equals'` indicates combining DataArrays with different names should fail: `'equals': all values and dimensions must be the same.` (though I am not entirely sure what is meant by `values`... I assume this perhaps generically means `keys`?) Another option is `compat='identical'` which is described as: `'identical': all values, dimensions and attributes must be the same.` Using this flag will cause the operation to fail, as one would expect from the description... ```python objs = [xr.DataArray([0], dims='x', name='a'), xr.DataArray([1], dims='x', name='b')] xr.concat(objs, dim='x', compat='identical') ``` ```python ValueError: array names not identical ``` ... and is the case for `concat` on Datasets, as previously shown by @TomNicholas ``` objs = [xr.Dataset({'a': ('x', [0])}), xr.Dataset({'b': ('x', [0])})] xr.concat(objs, dim='x') ``` ```python ValueError: 'a' is not present in all datasets. ``` However, `'identical': all values, dimensions and **attributes** must be the same.` doesn't quite seem to be the case for DataArrays, as ```python objs = [xr.DataArray([0], dims='x', name='a', attrs={'foo':1}), xr.DataArray([1], dims='x', name='a', attrs={'bar':2})] xr.concat(objs, dim='x', compat='identical') ``` succeeds with ```python array([0, 1]) Dimensions without coordinates: x Attributes: foo: 1 ``` but again fails on Datasets, as one would expect from the description. ```python ds1 = xr.Dataset({'a': ('x', [0])}) ds1.attrs['foo'] = 'example attribute' ds2 = xr.Dataset({'a': ('x', [1])}) ds2.attrs['bar'] = 'example attribute' objs = [ds1,ds2] xr.concat(objs, dim='x',compat='identical') ``` ```python ValueError: Dataset global attributes not equal. ``` Also had a look at `compat='override'`, which will override an `attrs` inconsistency but not a naming one when applied to Datasets. Works as expected on DataArrays. It is described as `'override': skip comparing and pick variable from first dataset`. Potential resolutions: 1. `'identical'` should raise an error when attributes are not the same for DataArrays 2. `'equals'` should raise an error when DataArray names are not identical (unless one is None, which works with Datasets and seems fine to be replaced) 3. `'override'` should override naming inconsistencies when combining DataSets. Final thought: perhaps promoting to Dataset when all requirements are met for a DataArray to be considered as such, might simplify keeping operations and checks consistent? 
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,494906646 https://github.com/pydata/xarray/issues/3445#issuecomment-551359502,https://api.github.com/repos/pydata/xarray/issues/3445,551359502,MDEyOklzc3VlQ29tbWVudDU1MTM1OTUwMg==,10554254,2019-11-08T02:41:13Z,2019-11-08T02:41:13Z,NONE,"@El-minadero from the [sparse API](https://sparse.pydata.org/en/latest/generated/sparse.html) page I'm seeing two methods for combining data: ```python import sparse import numpy as np A = sparse.COO.from_numpy(np.array([[1, 2], [3, 4]])) B = sparse.COO.from_numpy(np.array([[5, 9], [6, 8]])) sparse.stack([A,B]).todense() Out[1]: array([[[1, 2], [3, 4]], [[5, 9], [6, 8]]]) sparse.concatenate([A,B]).todense() Out[2]: array([[1, 2], [3, 4], [5, 9], [6, 8]]) ``` Since this is an issue with `sparse` and merging data doesn't seem to be supported at this time, you might consider closing this issue out here and raising it over at [sparse](https://github.com/pydata/sparse/issues). ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,512205079 https://github.com/pydata/xarray/issues/3445#issuecomment-550516745,https://api.github.com/repos/pydata/xarray/issues/3445,550516745,MDEyOklzc3VlQ29tbWVudDU1MDUxNjc0NQ==,10554254,2019-11-06T21:51:31Z,2019-11-06T21:51:31Z,NONE,"Note that `dataset1 = xr.concat([data_array1,data_array2],dim='source')` or `dim='receiver'` seem to work, however, concat also fails if `time` is specified as the dimension.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,512205079 https://github.com/pydata/xarray/pull/3312#issuecomment-532754800,https://api.github.com/repos/pydata/xarray/issues/3312,532754800,MDEyOklzc3VlQ29tbWVudDUzMjc1NDgwMA==,10554254,2019-09-18T16:08:09Z,2019-09-18T16:08:09Z,NONE,Opened https://github.com/pydata/xarray/issues/3315 regarding combine_nested() failing when being passed nested DataSets.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,494210818 https://github.com/pydata/xarray/pull/3312#issuecomment-532419859,https://api.github.com/repos/pydata/xarray/issues/3312,532419859,MDEyOklzc3VlQ29tbWVudDUzMjQxOTg1OQ==,10554254,2019-09-17T22:03:23Z,2019-09-17T23:51:13Z,NONE,"`pytest -q xarray/tests/test_combine.py` is telling me that ``` def test_concat_name_symmetry(self): """"""Inspired by the discussion on GH issue #2777"""""" da1 = DataArray(name=""a"", data=[[0]], dims=[""x"", ""y""]) da2 = DataArray(name=""b"", data=[[1]], dims=[""x"", ""y""]) da3 = DataArray(name=""a"", data=[[2]], dims=[""x"", ""y""]) da4 = DataArray(name=""b"", data=[[3]], dims=[""x"", ""y""]) x_first = combine_nested([[da1, da2], [da3, da4]], concat_dim=[""x"", ""y""]) ``` fails with: ``` --------------------------------------------------------------------------- KeyError Traceback (most recent call last) in 3 da3 = xr.DataArray(name=""a"", data=[[2]], dims=[""x"", ""y""]) 4 da4 = xr.DataArray(name=""b"", data=[[3]], dims=[""x"", ""y""]) ----> 5 xr.combine_nested([[da1, da2], [da3, da4]], concat_dim=[""x"", ""y""]) ~/repos/contribute/xarray/xarray/core/combine.py in combine_nested(objects, concat_dim, compat, data_vars, coords, fill_value, join) 468 ids=False, 469 fill_value=fill_value, --> 470 join=join, 471 ) 472 ~/repos/contribute/xarray/xarray/core/combine.py in 
    305         coords=coords,
    306         fill_value=fill_value,
--> 307         join=join,
    308     )
    309     return combined

~/repos/contribute/xarray/xarray/core/combine.py in _combine_nd(combined_ids, concat_dims, data_vars, coords, compat, fill_value, join)
    196         compat=compat,
    197         fill_value=fill_value,
--> 198         join=join,
    199     )
    200     (combined_ds,) = combined_ids.values()

~/repos/contribute/xarray/xarray/core/combine.py in _combine_all_along_first_dim(combined_ids, dim, data_vars, coords, compat, fill_value, join)
    218     datasets = combined_ids.values()
    219     new_combined_ids[new_id] = _combine_1d(
--> 220         datasets, dim, compat, data_vars, coords, fill_value, join
    221     )
    222     return new_combined_ids

~/repos/contribute/xarray/xarray/core/combine.py in _combine_1d(datasets, concat_dim, compat, data_vars, coords, fill_value, join)
    246             compat=compat,
    247             fill_value=fill_value,
--> 248             join=join,
    249         )
    250     except ValueError as err:

~/repos/contribute/xarray/xarray/core/concat.py in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join)
    131             ""objects, got %s"" % type(first_obj)
    132         )
--> 133     return f(objs, dim, data_vars, coords, compat, positions, fill_value, join)
    134
    135

~/repos/contribute/xarray/xarray/core/concat.py in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join)
    363     for k in datasets[0].variables:
    364         if k in concat_over:
--> 365             vars = ensure_common_dims([ds.variables[k] for ds in datasets])
    366             combined = concat_vars(vars, dim, positions)
    367             assert isinstance(combined, Variable)

~/repos/contribute/xarray/xarray/core/concat.py in (.0)
    363     for k in datasets[0].variables:
    364         if k in concat_over:
--> 365             vars = ensure_common_dims([ds.variables[k] for ds in datasets])
    366             combined = concat_vars(vars, dim, positions)
    367             assert isinstance(combined, Variable)

~/repos/contribute/xarray/xarray/core/utils.py in __getitem__(self, key)
    383
    384     def __getitem__(self, key: K) -> V:
--> 385         return self.mapping[key]
    386
    387     def __iter__(self) -> Iterator[K]:

KeyError: 'a'
```

It looks like the existing combine_nested() routine actually wants a DataArray and fails if passed a Dataset. The following should work with current master.

```
da1 = xr.DataArray(name=""a"", data=[[0]], dims=[""x"", ""y""])
da2 = xr.DataArray(name=""b"", data=[[1]], dims=[""x"", ""y""])
da3 = xr.DataArray(name=""a"", data=[[2]], dims=[""x"", ""y""])
da4 = xr.DataArray(name=""b"", data=[[3]], dims=[""x"", ""y""])

xr.combine_nested([[da1, da2], [da3, da4]], concat_dim=[""x"", ""y""])
```

Converting to Datasets, however, causes the same error raised in the test.

```
ds1 = da1.to_dataset()
ds2 = da2.to_dataset()
ds3 = da3.to_dataset()
ds4 = da4.to_dataset()

xr.combine_nested([[ds1, ds2], [ds3, ds4]], concat_dim=[""x"", ""y""])
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,494210818
https://github.com/pydata/xarray/issues/3248#issuecomment-531511177,https://api.github.com/repos/pydata/xarray/issues/3248,531511177,MDEyOklzc3VlQ29tbWVudDUzMTUxMTE3Nw==,10554254,2019-09-14T20:31:22Z,2019-09-14T20:31:22Z,NONE,"Some additional information on the topic: Combining named 1D data arrays works.
```
da1 = xr.DataArray(name='foo', data=np.random.randn(3), coords=[('x', [1, 2, 3])])
da2 = xr.DataArray(name='foo', data=np.random.randn(3), coords=[('x', [5, 6, 7])])

xr.combine_by_coords([da1, da2])

Dimensions:  (x: 6)
Coordinates:
  * x        (x) int64 1 2 3 5 6 7
Data variables:
    foo      (x) float64 1.443 0.4889 0.9233 0.1946 -1.639 -1.455
```

However, when combining 2D gridded data...

```
da1 = xr.DataArray(name='foo',
                   data=np.random.rand(3,3),
                   coords=[('x', [1, 2, 3]),
                           ('y', [1, 2, 3])])
da2 = xr.DataArray(name='foo',
                   data=np.random.rand(3,3),
                   coords=[('x', [5, 6, 7]),
                           ('y', [5, 6, 7])])

xr.combine_by_coords([da1, da2])
```

...the method fails, despite passing a data variable name.

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
in
      9                            ('y', [5, 6, 7])])
     10
---> 11 xr.combine_by_coords([da1, da2])

~/xarray/xarray/core/combine.py in combine_by_coords(datasets, compat, data_vars, coords, fill_value, join)
    580
    581     # Group by data vars
--> 582     sorted_datasets = sorted(datasets, key=vars_as_keys)
    583     grouped_by_vars = itertools.groupby(sorted_datasets, key=vars_as_keys)
    584

~/xarray/xarray/core/combine.py in vars_as_keys(ds)
    465
    466 def vars_as_keys(ds):
--> 467     return tuple(sorted(ds))
    468
    469

~/xarray/xarray/core/common.py in __bool__(self)
    119
    120     def __bool__(self: Any) -> bool:
--> 121         return bool(self.values)
    122
    123     def __float__(self: Any) -> float:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
```

Again, converting to a dataset bypasses the issue.

```
ds1 = da1.to_dataset()
ds2 = da2.to_dataset()

xr.combine_by_coords([ds1, ds2])

Dimensions:  (x: 6, y: 6)
Coordinates:
  * x        (x) int64 1 2 3 5 6 7
  * y        (y) int64 1 2 3 5 6 7
Data variables:
    foo      (x, y) float64 0.5078 0.8981 0.8707 nan ... 0.4172 0.7259 0.8431
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,484270833
https://github.com/pydata/xarray/issues/1301#issuecomment-344949160,https://api.github.com/repos/pydata/xarray/issues/1301,344949160,MDEyOklzc3VlQ29tbWVudDM0NDk0OTE2MA==,10554254,2017-11-16T15:01:59Z,2017-11-16T15:02:48Z,NONE,"Looks like it has been resolved! Tested with the latest pre-release v0.10.0rc2 on the dataset linked by najascutellatus above.
https://marine.rutgers.edu/~michaesm/netcdf/data/

```
da.set_options(get=da.async.get_sync)
%prun -l 10 ds = xr.open_mfdataset('./*.nc')
```

xarray==0.10.0rc2-1-g8267fdb
dask==0.15.4

```
         194381 function calls (188429 primitive calls) in 0.869 seconds

   Ordered by: internal time
   List reduced from 469 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    0.393    0.008    0.393    0.008 {numpy.core.multiarray.arange}
       50    0.164    0.003    0.557    0.011 indexing.py:266(_index_indexer_1d)
        5    0.083    0.017    0.085    0.017 netCDF4_.py:185(_open_netcdf4_group)
      190    0.024    0.000    0.066    0.000 netCDF4_.py:256(open_store_variable)
      190    0.022    0.000    0.022    0.000 netCDF4_.py:29(__init__)
       50    0.018    0.000    0.021    0.000 {operator.getitem}
5145/3605    0.012    0.000    0.019    0.000 indexing.py:493(shape)
2317/1291    0.009    0.000    0.094    0.000 _abcoll.py:548(update)
    26137    0.006    0.000    0.013    0.000 {isinstance}
      720    0.005    0.000    0.006    0.000 {method 'getncattr' of 'netCDF4._netCDF4.Variable' objects}
```

xarray==0.9.1
dask==0.13.0

```
         241253 function calls (229881 primitive calls) in 98.123 seconds

   Ordered by: internal time
   List reduced from 659 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       30   87.527    2.918   87.527    2.918 {pandas._libs.tslib.array_to_timedelta64}
       65    7.055    0.109    7.059    0.109 {operator.getitem}
       80    0.799    0.010    0.799    0.010 {numpy.core.multiarray.arange}
7895/4420    0.502    0.000    0.524    0.000 utils.py:412(shape)
       68    0.442    0.007    0.442    0.007 {pandas._libs.algos.ensure_object}
       80    0.350    0.004    1.150    0.014 indexing.py:318(_index_indexer_1d)
    60/30    0.296    0.005   88.407    2.947 timedeltas.py:158(_convert_listlike)
       30    0.284    0.009    0.298    0.010 algorithms.py:719(checked_add_with_arr)
      123    0.140    0.001    0.140    0.001 {method 'astype' of 'numpy.ndarray' objects}
 1049/719    0.096    0.000   96.513    0.134 {numpy.core.multiarray.array}
```","{""total_count"": 3, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 2, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,212561278
https://github.com/pydata/xarray/issues/1301#issuecomment-293619896,https://api.github.com/repos/pydata/xarray/issues/1301,293619896,MDEyOklzc3VlQ29tbWVudDI5MzYxOTg5Ng==,10554254,2017-04-12T15:42:18Z,2017-04-12T15:42:18Z,NONE,"decode_times=False significantly reduces read time, but the proportional performance discrepancy between xarray 0.8.2 and 0.9.1 remains the same.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,212561278
https://github.com/pydata/xarray/issues/1301#issuecomment-286220522,https://api.github.com/repos/pydata/xarray/issues/1301,286220522,MDEyOklzc3VlQ29tbWVudDI4NjIyMDUyMg==,10554254,2017-03-13T19:41:25Z,2017-03-13T19:41:25Z,NONE,"Looks like the issue might be that xarray 0.9.1 is decoding all timestamps on load.
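
A quick way to test that hypothesis, using only the public `decode_times` flag (a diagnostic, not a fix):

```python
# If eager time decoding is the bottleneck, skipping it at open time
# should make open_mfdataset dramatically faster.
ds = xr.open_mfdataset('./*.nc', decode_times=False)
```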
xarray==0.9.1, dask==0.13.0

```
da.set_options(get=da.async.get_sync)
%prun -l 10 ds = xr.open_mfdataset('./*.nc')

         167305 function calls (160352 primitive calls) in 59.688 seconds

   Ordered by: internal time
   List reduced from 625 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       18   57.057    3.170   57.057    3.170 {pandas.tslib.array_to_timedelta64}
       39    0.860    0.022    0.863    0.022 {operator.getitem}
       48    0.402    0.008    0.402    0.008 {numpy.core.multiarray.arange}
4341/2463    0.257    0.000    0.273    0.000 utils.py:412(shape)
       88    0.245    0.003    0.245    0.003 {pandas.algos.ensure_object}
       48    0.158    0.003    0.561    0.012 indexing.py:318(_index_indexer_1d)
    36/18    0.135    0.004   57.509    3.195 timedeltas.py:150(_convert_listlike)
       18    0.126    0.007    0.130    0.007 nanops.py:815(_checked_add_with_arr)
       51    0.070    0.001    0.070    0.001 {method 'astype' of 'numpy.ndarray' objects}
  676/475    0.047    0.000   58.853    0.124 {numpy.core.multiarray.array}
```

`pandas.tslib.array_to_timedelta64` appears to be the most expensive item on the list, and isn't being run when using xarray 0.8.2.

xarray==0.8.2, dask==0.13.0

```
da.set_options(get=da.async.get_sync)
%prun -l 10 ds = xr.open_mfdataset('./*.nc')

         140668 function calls (136769 primitive calls) in 0.766 seconds

   Ordered by: internal time
   List reduced from 621 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
2571/1800    0.178    0.000    0.184    0.000 utils.py:387(shape)
       18    0.174    0.010    0.174    0.010 {numpy.core.multiarray.arange}
       16    0.079    0.005    0.079    0.005 {numpy.core.multiarray.concatenate}
  483/420    0.077    0.000    0.125    0.000 {numpy.core.multiarray.array}
       15    0.054    0.004    0.197    0.013 indexing.py:259(_index_indexer_1d)
        3    0.041    0.014    0.043    0.014 netCDF4_.py:181(__init__)
      105    0.013    0.000    0.057    0.001 netCDF4_.py:196(open_store_variable)
       15    0.012    0.001    0.013    0.001 {operator.getitem}
2715/1665    0.007    0.000    0.178    0.000 indexing.py:343(shape)
     5971    0.006    0.000    0.006    0.000 collections.py:71(__setitem__)
```

The version of dask is held constant in each test.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,212561278