issue_comments: 585997533

html_url: https://github.com/pydata/xarray/issues/3213#issuecomment-585997533
issue_url: https://api.github.com/repos/pydata/xarray/issues/3213
id: 585997533
node_id: MDEyOklzc3VlQ29tbWVudDU4NTk5NzUzMw==
user: 6213168
created_at: 2020-02-13T22:12:37Z
updated_at: 2020-02-13T22:12:37Z
author_association: MEMBER
issue: 479942077

Hi fmfreeze,

> Dask integration enables xarray to scale to big data only as long as the data is not sparse. Do you agree with that formulation, or am I missing something fundamental?

I don't agree. To my understanding, the xarray -> dask -> sparse stack works very well (bugs aside), as long as your data density (the percentage of non-default points) is roughly constant across dask chunks. If it isn't, some chunks will consume substantially more RAM and CPU to compute than others. If you know in advance where you are going to have more samples, you can mitigate this by setting uneven dask chunk sizes.

For example, if you have a one-dimensional array of 100k points and you know in advance that the density of non-default samples follows a Gaussian or triangular distribution, then it may be wise to use very large chunks at the tails and make them progressively smaller towards the center, e.g. (30k, 12k, 5k, 2k, 1k, 1k, 2k, 5k, 12k, 30k). Of course, there are use cases where you're going to have unpredictable hotspots; I'm afraid that in those the only thing you can do is size your chunks for the worst case and end up oversplitting everywhere else.
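A minimal sketch of this chunking strategy, assuming the pydata `sparse` package; the triangular signal is synthetic and hypothetical, and the chunk boundaries are the ones from the example above:

```python
import numpy as np
import sparse
import dask.array as da
import xarray as xr

# Hypothetical 100k-point 1-D signal whose non-default samples
# cluster around the centre (triangular density).
rng = np.random.default_rng(0)
dense = np.zeros(100_000)
dense[rng.triangular(0, 50_000, 100_000, size=5_000).astype(int)] = 1.0

# Wrap it as a sparse array, then chunk it unevenly: large chunks over
# the sparse tails, progressively smaller ones towards the dense centre,
# so each chunk holds a comparable number of non-default samples.
chunks = ((30_000, 12_000, 5_000, 2_000, 1_000,
           1_000, 2_000, 5_000, 12_000, 30_000),)
arr = da.from_array(sparse.COO.from_numpy(dense), chunks=chunks)

xda = xr.DataArray(arr, dims="x")
print(xda.chunks)  # ((30000, 12000, 5000, 2000, 1000, 1000, 2000, 5000, 12000, 30000),)
```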

Regards,
Guido

On Thu, 13 Feb 2020 at 10:55, fmfreeze notifications@github.com wrote:

Thank you all for making xarray, and its tight integration with dask, so great!

As @shoyer (https://github.com/shoyer) mentioned:

> Yes, it would be useful (eventually) to have lazy loading of sparse arrays from disk, like we currently do for dense arrays. This would indeed require knowing that the indices are sorted.

I am wondering whether creating a lazy & sparse xarray Dataset/DataArray is already possible, especially when creating the sparse part at runtime and loading only the data part. Assume two differently sampled - and lazy dask - DataArrays are merged/combined along a coordinate axis into a Dataset. Then the smaller (= less dense) data variable is filled with NaNs. As far as I can tell, the current behaviour is that each NaN value requires memory.
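A minimal sketch reproducing that behaviour; the variable names and sizes (1,000 vs. 10 samples) are hypothetical:

```python
import numpy as np
import xarray as xr

# Two variables sampled at different rates along the same axis
# (hypothetical sizes: 1,000 points vs. 10 points).
fine = xr.DataArray(np.ones(1_000), dims="t",
                    coords={"t": np.arange(1_000)}, name="fine")
coarse = xr.DataArray(np.ones(10), dims="t",
                      coords={"t": np.arange(0, 1_000, 100)}, name="coarse")

# merge aligns both variables onto the union of the "t" coordinates,
# padding the coarse variable with NaN where it has no samples.
ds = xr.merge([fine, coarse])

print(ds["coarse"].size)                  # 1000, not 10
print(int(ds["coarse"].notnull().sum()))  # 10 actual samples
# With a dense (numpy or dask) backend, all 990 NaN fill values are
# materialised as float64 in memory.
```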

That issue might be formulated this way: Dask integration enables xarray to scale to big data only as long as the data is not sparse. Do you agree with that formulation, or am I missing something fundamental?

A code example reproducing that issue is described here: https://stackoverflow.com/q/60117268/9657367


reactions:
{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}