issue_comments: 585997533

html_url: https://github.com/pydata/xarray/issues/3213#issuecomment-585997533
issue_url: https://api.github.com/repos/pydata/xarray/issues/3213
id: 585997533
node_id: MDEyOklzc3VlQ29tbWVudDU4NTk5NzUzMw==
user: 6213168
created_at: 2020-02-13T22:12:37Z
updated_at: 2020-02-13T22:12:37Z
author_association: MEMBER
issue: 479942077

Hi fmfreeze,

> Dask integration enables xarray to scale to big data only as long as the data is not sparse. Do you agree with that formulation, or am I missing something fundamental?

I don't agree. To my understanding, the xarray -> dask -> sparse stack works very well (bugs aside), as long as your data density (the percentage of non-default points) is roughly constant across dask chunks. If it isn't, some chunks will consume substantially more RAM and CPU to compute than others. If you know in advance where you are going to have more samples, you can mitigate this by setting uneven dask chunk sizes.

For example, if you have a one-dimensional array of 100k points and you know in advance that the density of non-default samples follows a Gaussian or triangular distribution, then it may be wise to use very large chunks at the tails and make them progressively smaller towards the center, e.g. (30k, 12k, 5k, 2k, 1k, 1k, 2k, 5k, 12k, 30k). Of course, there are use cases where you're going to have unpredictable hotspots; I'm afraid that in those the only thing you can do is size your chunks for the worst case and end up oversplitting everywhere else.
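A minimal sketch of this chunking strategy, assuming the pydata `sparse` package; the triangular signal is synthetic and hypothetical, and the chunk boundaries are the ones from the example above:

```python
import numpy as np
import sparse
import dask.array as da
import xarray as xr

# Hypothetical 100k-point 1-D signal whose non-default samples
# cluster around the centre (triangular density).
rng = np.random.default_rng(0)
dense = np.zeros(100_000)
dense[rng.triangular(0, 50_000, 100_000, size=5_000).astype(int)] = 1.0

# Wrap it as a sparse array, then chunk it unevenly: large chunks over
# the sparse tails, progressively smaller ones towards the dense centre,
# so each chunk holds a comparable number of non-default samples.
chunks = ((30_000, 12_000, 5_000, 2_000, 1_000,
           1_000, 2_000, 5_000, 12_000, 30_000),)
arr = da.from_array(sparse.COO.from_numpy(dense), chunks=chunks)

xda = xr.DataArray(arr, dims="x")
print(xda.chunks)  # ((30000, 12000, 5000, 2000, 1000, 1000, 2000, 5000, 12000, 30000),)
```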

Regards,
Guido

On Thu, 13 Feb 2020 at 10:55, fmfreeze notifications@github.com wrote:

Thank you all for making xarray, and its tight integration with dask, so great!

As @shoyer (https://github.com/shoyer) mentioned:

> Yes, it would be useful (eventually) to have lazy loading of sparse arrays from disk, like we currently do for dense arrays. This would indeed require knowing that the indices are sorted.

I am wondering whether creating a lazy & sparse xarray Dataset/DataArray is already possible, especially when creating the sparse part at runtime and loading only the data part. Assume two differently sampled - and lazy dask - DataArrays are merged/combined along a coordinate axis into a Dataset. Then the smaller (= less dense) data variable is filled with NaNs. As far as I can tell, the current behaviour is that each NaN value requires memory.
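A minimal sketch reproducing that behaviour; the variable names and sizes (1,000 vs. 10 samples) are hypothetical:

```python
import numpy as np
import xarray as xr

# Two variables sampled at different rates along the same axis
# (hypothetical sizes: 1,000 points vs. 10 points).
fine = xr.DataArray(np.ones(1_000), dims="t",
                    coords={"t": np.arange(1_000)}, name="fine")
coarse = xr.DataArray(np.ones(10), dims="t",
                      coords={"t": np.arange(0, 1_000, 100)}, name="coarse")

# merge aligns both variables onto the union of the "t" coordinates,
# padding the coarse variable with NaN where it has no samples.
ds = xr.merge([fine, coarse])

print(ds["coarse"].size)                  # 1000, not 10
print(int(ds["coarse"].notnull().sum()))  # 10 actual samples
# With a dense (numpy or dask) backend, all 990 NaN fill values are
# materialised as float64 in memory.
```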

That issue might be formulated this way: Dask integration enables xarray to scale to big data only as long as the data is not sparse. Do you agree with that formulation, or am I missing something fundamental?

A code example reproducing that issue is described here: https://stackoverflow.com/q/60117268/9657367


reactions:
{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}