issue_comments: 587564478


html_url: https://github.com/pydata/xarray/issues/3213#issuecomment-587564478
issue_url: https://api.github.com/repos/pydata/xarray/issues/3213
id: 587564478
node_id: MDEyOklzc3VlQ29tbWVudDU4NzU2NDQ3OA==
user: 6213168
created_at: 2020-02-18T16:58:25Z
updated_at: 2020-02-18T16:58:25Z
author_association: MEMBER
issue: 479942077

you just need to

  1. load up your NetCDF files with xarray.open_mfdataset. This will give you
     - an xarray.Dataset,
     - that wraps around one dask.array.Array per variable,
     - which in turn wraps around one numpy.ndarray (DENSE array) per dask chunk.

  2. convert to sparse with xarray.apply_ufunc(sparse.COO, ds). This will give you
     - an xarray.Dataset,
     - that wraps around one dask.array.Array per variable,
     - which in turn wraps around one sparse.COO (SPARSE array) per dask chunk.

  3. use xarray.merge or whatever to align and merge

  4. you may want to rechunk at this point to obtain fewer, larger chunks. You can estimate your chunk size in bytes if you know your data density (read my previous email).

  5. Do whatever other calculations you want. All operations will produce in output the same data type as point 2.

  6. To go back to dense, invoke xarray.apply_ufunc(lambda x: x.todense(), ds) to return to the format as in point 1. This step is only necessary if you have something that won't accept/recognize sparse arrays directly in input; namely, writing to a NetCDF dataset. If your data has not been reduced enough, you may need to rechunk into smaller chunks first in order to fit within your RAM constraints.
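A minimal sketch of the recipe above, not taken from the original thread: the file glob, the chunk sizes, the second dataset `other_sparse`, and the float64 dtype are all placeholders, and the exact apply_ufunc keywords (dask="parallelized", output_dtypes) may need adjusting for your xarray/sparse versions.

```python
import numpy as np
import sparse
import xarray as xr

# 1. Lazily open many NetCDF files: one dask array per variable,
#    one dense numpy.ndarray per dask chunk. Path and chunk size are placeholders.
ds = xr.open_mfdataset("data/*.nc", chunks={"time": 100_000})

# 2. Turn every chunk into a sparse.COO array. dask="parallelized" applies the
#    function chunk-by-chunk instead of materializing the whole array.
ds = xr.apply_ufunc(sparse.COO, ds, dask="parallelized",
                    output_dtypes=[np.float64])

# 3. Align and merge with another sparse-backed dataset (hypothetical name).
ds = xr.merge([ds, other_sparse])

# 4. Optionally consolidate into fewer, larger chunks. A COO chunk needs roughly
#    nnz * (itemsize + ndim * 8) bytes (assuming 8-byte coordinate indices), so
#    the chunk size you can afford follows from your data density.
ds = ds.chunk({"time": 1_000_000})

# 5. ...any further lazy computation keeps the sparse backing...

# 6. Densify before writing, because the NetCDF writer only understands dense
#    numpy arrays; rechunk smaller first if RAM is tight.
ds = xr.apply_ufunc(lambda x: x.todense(), ds, dask="parallelized",
                    output_dtypes=[np.float64])
ds.to_netcdf("merged.nc")
```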

Regards

On Tue, 18 Feb 2020 at 13:56, fmfreeze notifications@github.com wrote:

Thank you @crusaderky for your input.

I understand and agree with your statements for sparse data files. My approach is different, because within my (hdf5) data files on disc, I have no sparse datasets at all.

But as I combine two differently sampled xarray datasets (initialized by h5py > dask > xarray) with xarray's built-in top-level function xarray.merge() (resp. xarray.combine_by_coords()), the resulting dataset is sparse.

Generally that is nice behaviour, because two differently sampled datasets get aligned along a coordinate/dimension, and the gaps are filled by NaNs.

Nevertheless, those NaN "gaps" seem to need memory for every single NaN. That is what should be avoided. Maybe by implementing a redundant pointer to the same memory address for each NaN?
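A toy illustration of that alignment behaviour, with invented coordinates and values: merging two datasets sampled on interleaved coordinates yields the union of the coordinates, and with a dense numpy backing every gap is stored as an explicit NaN.

```python
import numpy as np
import xarray as xr

# Two datasets sampled on interleaved "time" coordinates (made-up values).
a = xr.Dataset({"a": ("time", np.arange(3.0))}, coords={"time": [0, 2, 4]})
b = xr.Dataset({"b": ("time", np.arange(3.0))}, coords={"time": [1, 3, 5]})

# Outer join on the union of coordinates: time = 0..5.
merged = xr.merge([a, b])

print(merged["a"].values)  # [ 0. nan  1. nan  2. nan] -- each NaN occupies real memory
```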

