html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/3213#issuecomment-592476821,https://api.github.com/repos/pydata/xarray/issues/3213,592476821,MDEyOklzc3VlQ29tbWVudDU5MjQ3NjgyMQ==,6213168,2020-02-28T11:39:50Z,2020-02-28T11:39:50Z,MEMBER,"*xr.apply_ufunc(sparse.COO, ds, dask='parallelized')*
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,479942077
https://github.com/pydata/xarray/issues/3213#issuecomment-587564478,https://api.github.com/repos/pydata/xarray/issues/3213,587564478,MDEyOklzc3VlQ29tbWVudDU4NzU2NDQ3OA==,6213168,2020-02-18T16:58:25Z,2020-02-18T16:58:25Z,MEMBER,"you just need to
1. Load up your NetCDF files with *xarray.open_mfdataset*. This will give
you
- an xarray.Dataset,
- that wraps around one dask.array.Array per variable,
- each of which wraps around one numpy.ndarray (DENSE array) per dask chunk.
2. Convert to sparse with *xarray.apply_ufunc(sparse.COO, ds)*.
This will give you
- an xarray.Dataset,
- that wraps around one dask.array.Array per variable,
- each of which wraps around one sparse.COO (SPARSE array) per dask chunk.
3. Use xarray.merge or whatever to align and merge.
4. You may want to rechunk at this point to obtain fewer, larger chunks. You
can estimate your chunk size in bytes if you know your data density (read
my previous email).
5. Do whatever other calculations you want. All operations will produce
output with the same data type as in point 2.
6. To go back to dense, invoke *xarray.apply_ufunc(lambda x: x.todense(),
ds)* to return to the format as in (1). This step is only necessary if you
have something that won't accept/recognize sparse arrays directly as input,
namely writing to a NetCDF dataset. If your data has not been reduced
enough, you may need to rechunk into smaller chunks first in order to fit
within your RAM constraints. A sketch of the full workflow follows below.
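Here is a minimal, hedged sketch of steps 1-6; the file paths, chunk sizes and *other_ds* are placeholders, and depending on your xarray version you may also need to pass output_dtypes= to apply_ufunc when dask='parallelized' is used:
```python
import sparse
import xarray as xr

# 1. Lazily open the (hypothetical) NetCDF files; path and chunking are placeholders.
ds = xr.open_mfdataset('data/*.nc', chunks={'time': 10_000})

# 2. Turn each dense numpy chunk into a sparse.COO chunk.
#    A COO chunk storing nnz float64 values needs roughly nnz * (8 + 8 * ndim)
#    bytes, which lets you size the chunks from your known data density.
ds = xr.apply_ufunc(sparse.COO, ds, dask='parallelized')

# 3./4. Align and merge with another (hypothetical) dataset, then rechunk
#       into fewer, larger chunks.
# combined = xr.merge([ds, other_ds]).chunk({'time': 50_000})

# 5. Further calculations keep producing sparse-backed dask arrays.
# result = combined.mean('time')

# 6. Densify (rechunking smaller first if RAM is tight) before writing to NetCDF.
dense = xr.apply_ufunc(lambda x: x.todense(), ds, dask='parallelized')
dense.to_netcdf('out.nc')
```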
Regards
On Tue, 18 Feb 2020 at 13:56, fmfreeze wrote:
> Thank you @crusaderky for your input.
>
> I understand and agree with your statements for sparse data files.
> My approach is different, because within my (hdf5) data files on disc, I
> have no sparse datasets at all.
>
> But as I combine two differently sampled xarray datasets (initialized by
> h5py > dask > xarray) with xarray's built-in top-level function
> ""xarray.merge()"" (or xarray.combine_by_coords()), the resulting dataset
> is sparse.
>
> Generally that is nice behaviour, because two differently sampled datasets
> get aligned along a coordinate/dimension, and the gaps are filled by NaNs.
>
> Nevertheless, those NaN ""gaps"" seem to need memory for every single NaN.
> That is what should be avoided.
> Maybe by implementing a redundant pointer to the same memory address for
> each NaN?
>
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,479942077
https://github.com/pydata/xarray/issues/3213#issuecomment-585997533,https://api.github.com/repos/pydata/xarray/issues/3213,585997533,MDEyOklzc3VlQ29tbWVudDU4NTk5NzUzMw==,6213168,2020-02-13T22:12:37Z,2020-02-13T22:12:37Z,MEMBER,"Hi fmfreeze,
> *Dask integration enables xarray to scale to big data, only as long as
> the data has no sparse character.* Do you agree on that formulation or am I
> missing something fundamental?

I don't agree. To my understanding xarray->dask->sparse works very well
(barring bugs), *as long as your data density* (the percentage of non-default
points) *is roughly constant across dask chunks*.
If it isn't, then you'll have some chunks that consume substantially more
RAM and CPU to compute than others. This can be mitigated, if you know in
advance where you are going to have more samples, by setting uneven dask
chunk sizes. For example, if you have a one-dimensional array of 100k
points and you know in advance that the density of non-default samples
follows a Gaussian or triangular distribution, then it may be wise to have
very large chunks at the tails and progressively smaller ones towards the
center, e.g. (30k, 12k, 5k, 2k, 1k, 1k, 2k, 5k, 12k, 30k).
Of course, there are use cases where you're going to have unpredictable
hotspots; I'm afraid that in those cases the only thing you can do is size
your chunks for the worst case and end up oversplitting everywhere else.
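As a concrete illustration of the uneven chunking above (with hypothetical all-zeros data standing in for an array whose density peaks at the centre), the explicit block sizes below sum to the 100k points:
```python
import dask.array as da
import numpy as np
import xarray as xr

# Large blocks at the sparsely populated tails, small blocks at the dense
# centre, so every chunk holds a comparable number of non-default values.
uneven_chunks = (30_000, 12_000, 5_000, 2_000, 1_000,
                 1_000, 2_000, 5_000, 12_000, 30_000)
data = da.from_array(np.zeros(100_000), chunks=(uneven_chunks,))
arr = xr.DataArray(data, dims='x')
print(arr.chunks)  # one tuple of explicit block lengths per dimension
```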
Regards
Guido
On Thu, 13 Feb 2020 at 10:55, fmfreeze wrote:
> Thank you all for making xarray and its tight development with dask so
> great!
>
> As @shoyer mentioned
>
> Yes, it would be useful (eventually) to have lazy loading of sparse arrays
> from disk, like we currently do for dense arrays. This would indeed
> require knowing that the indices are sorted.
>
> I am wondering if creating a *lazy* & *sparse* xarray Dataset/DataArray
> is already possible?
> Especially when *creating* the sparse part at runtime, and *loading* only
> the data part:
> Assume two differently sampled - and lazy dask - DataArrays are
> merged/combined along a coordinate axis into a Dataset.
> Then the smaller (= less dense) DataVariable is filled with NaNs. As far
> as I have experienced, the current behaviour is that each NaN value
> requires memory.
>
> That issue might be formulated this way:
> *Dask integration enables xarray to scale to big data, only as long as the
> data has no sparse character*. Do you agree on that formulation or am I
> missing something fundamental?
>
> A code example reproducing that issue is described here:
> https://stackoverflow.com/q/60117268/9657367
>
","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,479942077
https://github.com/pydata/xarray/issues/3213#issuecomment-527766483,https://api.github.com/repos/pydata/xarray/issues/3213,527766483,MDEyOklzc3VlQ29tbWVudDUyNzc2NjQ4Mw==,6213168,2019-09-04T06:46:08Z,2019-09-04T06:46:08Z,MEMBER,"@p-d-moore what you say makes sense, but it is well outside the domain of xarray. What you're describing is basically a new sparse class, substantially more sophisticated than COO, and it should be proposed to the sparse project, not here. After it's implemented in sparse, xarray will be able to wrap around it. ","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,479942077
https://github.com/pydata/xarray/issues/3213#issuecomment-521224538,https://api.github.com/repos/pydata/xarray/issues/3213,521224538,MDEyOklzc3VlQ29tbWVudDUyMTIyNDUzOA==,6213168,2019-08-14T12:25:39Z,2019-08-14T12:25:39Z,MEMBER,"As for NetCDF, instead of a bespoke xarray-only convention, wouldn't it be much better to push a spec extension upstream?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,479942077
https://github.com/pydata/xarray/issues/3213#issuecomment-521223609,https://api.github.com/repos/pydata/xarray/issues/3213,521223609,MDEyOklzc3VlQ29tbWVudDUyMTIyMzYwOQ==,6213168,2019-08-14T12:22:37Z,2019-08-14T12:22:37Z,MEMBER,"As already mentioned in #3206, ``unstack(sparse=True)`` would be extremely useful.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,479942077
https://github.com/pydata/xarray/issues/3213#issuecomment-521221473,https://api.github.com/repos/pydata/xarray/issues/3213,521221473,MDEyOklzc3VlQ29tbWVudDUyMTIyMTQ3Mw==,6213168,2019-08-14T12:15:39Z,2019-08-14T12:20:59Z,MEMBER,"+1 for the introduction of to_sparse() / to_dense(), but let's please avoid the mistakes that were made with chunk(). DataArray.chunk() is extremely frustrating when you have non-index coords: 9 times out of 10 you only want to chunk the data, yet you have to go through the horrid
```python
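# Workaround: chunk only the wrapped data, then rebuild the DataArray by hand
# so that non-index coords are left untouched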
a = DataArray(a.data.chunk(), dims=a.dims, coords=a.coords, attrs=a.attrs, name=a.name)
```
Exactly the same issue would apply to to_sparse().
Possibly we could define them as
```python
from typing import Hashable, Iterable, Union

class DataArray:
    def to_sparse(
        self,
        data: bool = True,
        coords: Union[Iterable[Hashable], bool] = False,
    ):
        ...

class Dataset:
    def to_sparse(
        self,
        data_vars: Union[Iterable[Hashable], bool] = True,
        coords: Union[Iterable[Hashable], bool] = False,
    ):
        ...
```
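For illustration only, calls under these proposed signatures might look as follows (ds and the 'station' coordinate are hypothetical, and nothing here is implemented):
```python
ds_sparse = ds.to_sparse()                    # sparsify data_vars only
ds_sparse = ds.to_sparse(coords=True)         # sparsify coords as well
ds_sparse = ds.to_sparse(coords=['station'])  # or only selected coords
ds_dense = ds_sparse.to_dense()               # back to dense numpy arrays
```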
The same signatures would apply to to_dense() and chunk() (the latter would require a DeprecationWarning for a few releases before switching the default for coords from True to False, triggered only in the presence of dask-backed coords).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,479942077