html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/3213#issuecomment-597825416,https://api.github.com/repos/pydata/xarray/issues/3213,597825416,MDEyOklzc3VlQ29tbWVudDU5NzgyNTQxNg==,18172466,2020-03-11T19:29:31Z,2020-03-11T19:29:31Z,NONE,"Concatenating multiple lazy, **differently sized** xr.DataArrays - each wrapping a sparse.COO via **xr.apply_ufunc(sparse.COO, ds, dask='parallelized')** as @crusaderky suggested - again results in an xr.DataArray whose wrapped dask array chunks are mapped to numpy arrays:

```
dask.array
Coordinates:
  * time     (time) float64 0.0 5e-07 1e-06 1.5e-06 2e-06 ... 4.0 4.0 4.0 4.0
  * cycle    (cycle) int64 1 2 3 4 5 6 7 8 9 10
```

And even when mapping the resulting, concatenated DataArray to sparse.COO afterwards, my main goal - scalable serialization of a lazy xarray - cannot be achieved.

So, one suggestion regarding @shoyer's original question: it would be great if sparse but still lazy DataArrays/Datasets could be serialized without the data overhead itself. Currently, that seems to work only for DataArrays which are merged/aligned with DataArrays of the **same shape**.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,479942077
https://github.com/pydata/xarray/issues/3213#issuecomment-591388766,https://api.github.com/repos/pydata/xarray/issues/3213,591388766,MDEyOklzc3VlQ29tbWVudDU5MTM4ODc2Ng==,18172466,2020-02-26T11:54:40Z,2020-02-26T11:54:40Z,NONE,"Thank you @crusaderky, unfortunately some obstacles appeared when using your loading technique. As thousands of .h5 files are the data source for my use case, and they have various - and sometimes different - internal paths to the datasets, using the _xarray.open_mfdataset(...)_ function does not seem to be possible in a straightforward way. But:

1) I have a routine merging all .h5 datasets into corresponding dask arrays, implicitly wrapping dense numpy arrays.
2) I ""manually"" slice out a part of the huge lazy dask array and wrap that into an xarray.DataArray/Dataset.
3) But applying _xr.apply_ufunc(sparse.COO, ds, dask='allowed')_ on that slice then results in a **NotImplementedError: Format not supported for conversion. Supplied type is , see help(sparse.as_coo) for supported formats.**

(I am not sure if this is the right place to discuss this, so I would be thankful for a response on SO in that case: https://stackoverflow.com/questions/60117268/how-to-make-use-of-xarrays-sparse-functionality-when-combining-differently-size)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,479942077
https://github.com/pydata/xarray/issues/3213#issuecomment-587471646,https://api.github.com/repos/pydata/xarray/issues/3213,587471646,MDEyOklzc3VlQ29tbWVudDU4NzQ3MTY0Ng==,18172466,2020-02-18T13:56:09Z,2020-02-18T13:56:51Z,NONE,"Thank you @crusaderky for your input. I understand and agree with your statements for sparse data files. My approach is different because, within my (HDF5) data files on disk, I have no sparse datasets at all. But as I combine two differently sampled xarray Datasets (initialized via h5py > dask > xarray) with xarray's built-in top-level function ""xarray.merge()"" (resp. xarray.combine_by_coords()), the resulting dataset is sparse. Generally that is nice behaviour, because the two differently sampled datasets get aligned along a coordinate/dimension, and the gaps are filled with NaNs.
Nevertheless, those NaN ""gaps"" seem to need memory for every single NaN, and that is exactly what should be avoided - maybe by implementing a redundant pointer to the same memory address for each NaN?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,479942077
https://github.com/pydata/xarray/issues/3213#issuecomment-585668294,https://api.github.com/repos/pydata/xarray/issues/3213,585668294,MDEyOklzc3VlQ29tbWVudDU4NTY2ODI5NA==,18172466,2020-02-13T10:55:15Z,2020-02-13T10:55:15Z,NONE,"Thank you all for making xarray and its tight development with dask so great! As @shoyer mentioned:

> Yes, it would be useful (eventually) to have lazy loading of sparse arrays from disk, like we currently do for dense arrays. This would indeed require knowing that the indices are sorted.

I am wondering whether creating a **lazy** & **sparse** xarray Dataset/DataArray is already possible, especially when _creating_ the sparse part at runtime and _loading_ only the data part:

Assume two differently sampled - and lazy dask - DataArrays are merged/combined along a coordinate axis into a Dataset. Then the smaller (= less dense) DataVariable is padded with NaNs. As far as I have experienced, the current behaviour is that each NaN value requires memory. The issue might be formulated this way: **Dask integration enables xarray to scale to big data _only_ as long as the data has no sparse character**. Do you agree with that formulation, or am I missing something fundamental?

A code example reproducing the issue is described here: https://stackoverflow.com/q/60117268/9657367","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,479942077
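A minimal sketch of the workflow discussed in the comments above: merging two differently sampled, dask-backed DataArrays (which pads the coarser one with NaN) and then converting each dask chunk to sparse.COO with xr.apply_ufunc(..., dask='parallelized'), as suggested in the thread. The variable names (`fine`, `coarse`), array sizes, chunk sizes and the helper `to_sparse` are illustrative assumptions, not taken from the thread, and - as the first comment notes - this conversion alone does not yet provide sparse-aware serialization.

```
# Sketch only: names, sizes and chunking are illustrative assumptions.
import dask.array as da
import numpy as np
import sparse
import xarray as xr

# Two "differently sampled" lazy DataArrays sharing the dimension/coordinate "time".
fine = xr.DataArray(
    da.random.random(1000, chunks=250),
    dims="time",
    coords={"time": np.arange(1000)},
    name="fine",
)
coarse = xr.DataArray(
    da.random.random(10, chunks=10),
    dims="time",
    coords={"time": np.arange(0, 1000, 100)},
    name="coarse",
)

# merge() outer-joins along "time": the coarser variable is padded with NaN.
# The result is still lazy, but every NaN occupies dense memory once computed.
merged = xr.merge([fine, coarse])

# Convert each dask chunk to sparse.COO lazily. With dask="parallelized" the
# function is applied chunk-wise; output_dtypes tells xarray/dask the result
# dtype without evaluating anything.
def to_sparse(obj):
    return xr.apply_ufunc(
        sparse.COO, obj, dask="parallelized", output_dtypes=[obj.dtype]
    )

merged_sparse = merged.map(to_sparse)  # still lazy

# On compute, the chunks should come back as sparse.COO; note the thread reports
# that the dask graph may still advertise numpy.ndarray chunks before computing.
print(type(merged_sparse["coarse"].data))            # dask array (lazy)
print(type(merged_sparse["coarse"].compute().data))  # expected: sparse.COO
```

The open question raised in the thread remains untouched by this sketch: whether such a lazily sparse Dataset can be serialized to disk without materializing the dense NaN padding.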