
issue_comments


5 rows where user = 18172466 sorted by updated_at descending


id: 1014487633
html_url: https://github.com/pydata/xarray/issues/1482#issuecomment-1014487633
issue_url: https://api.github.com/repos/pydata/xarray/issues/1482
node_id: IC_kwDOAMm_X848d9pR
user: fmfreeze (18172466)
created_at: 2022-01-17T12:51:45Z
updated_at: 2022-01-17T12:51:45Z
author_association: NONE

As I am not aware of the implementation details, I am not sure there is a useful link, but perhaps the progress on sparse array support in #3213 can also solve the jagged array issue.

A long time ago I asked a question there about how xarray supports sparse arrays. But what I actually meant was "jagged arrays": I simply was not aware of that term and only stumbled over it for the first time a few days ago.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Support for jagged array (243964948)
id: 597825416
html_url: https://github.com/pydata/xarray/issues/3213#issuecomment-597825416
issue_url: https://api.github.com/repos/pydata/xarray/issues/3213
node_id: MDEyOklzc3VlQ29tbWVudDU5NzgyNTQxNg==
user: fmfreeze (18172466)
created_at: 2020-03-11T19:29:31Z
updated_at: 2020-03-11T19:29:31Z
author_association: NONE

Concatenating multiple lazy, differently sized xr.DataArrays - each wrapping a sparse.COO via xr.apply_ufunc(sparse.COO, ds, dask='parallelized'), as @crusaderky suggested - again results in an xr.DataArray whose wrapped dask array chunks are mapped to numpy arrays:

<xarray.DataArray 'myDataset' (cycle: 10, time: 8000000)>
dask.array<concatenate, shape=(10, 8000000), dtype=float64, chunksize=(1, 5273216), chunktype=numpy.ndarray>
Coordinates:
  * time   (time) float64 0.0 5e-07 1e-06 1.5e-06 2e-06 ... 4.0 4.0 4.0 4.0
  * cycle  (cycle) int64 1 2 3 4 5 6 7 8 9 10

But even when mapping the resulting concatenated DataArray to sparse.COO afterwards, my main goal - scalable serialization of a lazy xarray object - cannot be achieved.

So one suggestion regarding @shoyer's original question: it would be great if sparse but still lazy DataArrays/Datasets could be serialized without the data overhead itself. Currently, that seems to work only for DataArrays which are merged/aligned with DataArrays of the same shape.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: How should xarray use/support sparse arrays? (479942077)
id: 591388766
html_url: https://github.com/pydata/xarray/issues/3213#issuecomment-591388766
issue_url: https://api.github.com/repos/pydata/xarray/issues/3213
node_id: MDEyOklzc3VlQ29tbWVudDU5MTM4ODc2Ng==
user: fmfreeze (18172466)
created_at: 2020-02-26T11:54:40Z
updated_at: 2020-02-26T11:54:40Z
author_association: NONE

Thank you @crusaderky; unfortunately, some obstacles appeared when using your loading technique.

As thousands of .h5 files are the data source for my use case, and they have various - and sometimes different paths to - datasets, using the xarray.open_mfdataset(...) function does not seem to be possible in a straightforward way.

But:

1) I have a routine merging all .h5 datasets into corresponding dask arrays, implicitly wrapping dense numpy arrays.

2) I "manually" slice out a part of the huge lazy dask array and wrap it into an xarray.DataArray/Dataset.

3) Applying xr.apply_ufunc(sparse.COO, ds, dask='allowed') on that slice then results in a NotImplementedError: Format not supported for conversion. Supplied type is <class 'dask.array.core.Array'>, see help(sparse.as_coo) for supported formats.

(I am not sure if this is the right place to discuss this, so I would be thankful for a response on Stack Overflow in that case: https://stackoverflow.com/questions/60117268/how-to-make-use-of-xarrays-sparse-functionality-when-combining-differently-size)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: How should xarray use/support sparse arrays? (479942077)
id: 587471646
html_url: https://github.com/pydata/xarray/issues/3213#issuecomment-587471646
issue_url: https://api.github.com/repos/pydata/xarray/issues/3213
node_id: MDEyOklzc3VlQ29tbWVudDU4NzQ3MTY0Ng==
user: fmfreeze (18172466)
created_at: 2020-02-18T13:56:09Z
updated_at: 2020-02-18T13:56:51Z
author_association: NONE

Thank you @crusaderky for your input.

I understand and agree with your statements about sparse data files. My approach is different because, within my (HDF5) data files on disk, I have no sparse datasets at all.

But when I combine two differently sampled xarray datasets (initialized via h5py > dask > xarray) with xarray's built-in top-level function xarray.merge() (resp. xarray.combine_by_coords()), the resulting dataset is sparse.

Generally that is nice behaviour, because the two differently sampled datasets get aligned along a coordinate/dimension, and the gaps are filled with NaNs.

Nevertheless, those NaN "gaps" seem to need memory for every single NaN. That is what should be avoided, maybe by implementing a redundant pointer to the same memory address for each NaN?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: How should xarray use/support sparse arrays? (479942077)
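The alignment behaviour described above can be illustrated with a toy merge (the coordinate values and variable names here are made up for the sketch): two differently sampled arrays get aligned along a shared coordinate, and the gaps are filled with NaNs that are stored as ordinary dense values.

```python
import numpy as np
import xarray as xr

# Two variables sampled at disjoint points along the same coordinate 't'.
a = xr.DataArray(np.ones(3), coords={"t": [0.0, 1.0, 2.0]}, dims="t", name="a")
b = xr.DataArray(np.ones(2), coords={"t": [10.0, 11.0]}, dims="t", name="b")

# merge performs an outer join along 't': each variable is padded with NaN
# where the other variable's samples fall.
merged = xr.merge([a, b])

print(merged["a"].values)  # [ 1.  1.  1. nan nan]
```

Every one of those NaNs occupies a real float64 slot in memory, which is the scaling problem the comment points at.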
id: 585668294
html_url: https://github.com/pydata/xarray/issues/3213#issuecomment-585668294
issue_url: https://api.github.com/repos/pydata/xarray/issues/3213
node_id: MDEyOklzc3VlQ29tbWVudDU4NTY2ODI5NA==
user: fmfreeze (18172466)
created_at: 2020-02-13T10:55:15Z
updated_at: 2020-02-13T10:55:15Z
author_association: NONE

Thank you all for making xarray and its tight development with dask so great!

As @shoyer mentioned:

> Yes, it would be useful (eventually) to have lazy loading of sparse arrays from disk, like we currently do for dense arrays. This would indeed require knowing that the indices are sorted.

I am wondering if creating a lazy & sparse xarray Dataset/DataArray is already possible, especially when creating the sparse part at runtime and loading only the data part. Assume two differently sampled - and lazy dask - DataArrays are merged/combined along a coordinate axis into a Dataset. Then the smaller (= less dense) DataVariable is filled with NaNs. As far as I have experienced, the current behaviour is that each NaN value requires memory.

That issue might be formulated this way: dask integration enables xarray to scale to big data only as long as the data has no sparse character. Do you agree with that formulation, or am I missing something fundamental?

A code example reproducing that issue is described here: https://stackoverflow.com/q/60117268/9657367

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: How should xarray use/support sparse arrays? (479942077)


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
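The schema above can be exercised with Python's stdlib sqlite3, re-running the page's own query ("rows where user = 18172466 sorted by updated_at descending") against a throwaway in-memory database seeded with two of the comment ids shown on this page:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CREATE TABLE/INDEX statements from the page; SQLite accepts the
# REFERENCES clauses even though [users]/[issues] are not created here,
# since foreign keys are not enforced by default.
conn.executescript("""
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user] ON [issue_comments] ([user]);
""")
conn.executemany(
    "INSERT INTO issue_comments (id, user, updated_at) VALUES (?, ?, ?)",
    [
        (585668294, 18172466, "2020-02-13T10:55:15Z"),
        (1014487633, 18172466, "2022-01-17T12:51:45Z"),
    ],
)
# ISO 8601 timestamps sort correctly as TEXT, so ORDER BY works as expected.
rows = [
    r[0]
    for r in conn.execute(
        "SELECT id FROM issue_comments WHERE user = ? ORDER BY updated_at DESC",
        (18172466,),
    )
]
print(rows)  # [1014487633, 585668294]
```

Storing timestamps as ISO 8601 TEXT is what lets the "sorted by updated_at descending" view work with a plain ORDER BY.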
Powered by Datasette · Queries took 16.339ms · About: xarray-datasette