issue_comments

9 rows where issue = 712189206 sorted by updated_at descending

user 3

  • dcherian 4
  • heerad 4
  • shoyer 1

author_association 2

  • MEMBER 5
  • NONE 4

issue 1

  • Preprocess function for save_mfdataset · 9
id html_url issue_url node_id user created_at updated_at author_association body reactions performed_via_github_app issue
702307334 https://github.com/pydata/xarray/issues/4475#issuecomment-702307334 https://api.github.com/repos/pydata/xarray/issues/4475 MDEyOklzc3VlQ29tbWVudDcwMjMwNzMzNA== heerad 2560426 2020-10-01T18:07:55Z 2020-10-01T18:07:55Z NONE

Sounds good, I'll do this in the meantime. Still quite interested in save_mfdataset dealing with these lower-level details, if possible. The ideal case would be loading with open_mfdataset, defining some ops lazily, then piping that directly to save_mfdataset.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preprocess function for save_mfdataset 712189206
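
A minimal sketch of the workflow described in the comment above, assuming hypothetical input files, a `time.year` grouping key, and a placeholder operation; the groupby/save_mfdataset pairing follows the pattern in xarray's documentation:

```
import xarray as xr

# Open lazily (dask-backed), define ops without computing, then write one
# file per group. The glob, the grouping key, and the op are placeholders.
ds = xr.open_mfdataset("input/*.nc", combine="by_coords")
ds = ds * 2  # stand-in for the lazily defined ops

years, datasets = zip(*ds.groupby("time.year"))
paths = [f"output/{year}.nc" for year in years]
xr.save_mfdataset(datasets, paths)
```
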
702276824 https://github.com/pydata/xarray/issues/4475#issuecomment-702276824 https://api.github.com/repos/pydata/xarray/issues/4475 MDEyOklzc3VlQ29tbWVudDcwMjI3NjgyNA== dcherian 2448579 2020-10-01T17:13:16Z 2020-10-01T17:13:16Z MEMBER

> doesn't that then require me to load everything into memory before writing?

I think so.

I would try multiple processes and see if that is fast enough for what you want to do. Or else, write to zarr. This will be parallelized and is a lot easier than dealing with HDF5.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preprocess function for save_mfdataset 712189206
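
A sketch of the zarr route suggested above, with a hypothetical input glob and store name; to_zarr writes a dask-backed dataset chunk by chunk, so the data never has to be loaded into memory all at once:

```
import xarray as xr

# Dask-backed dataset written chunk by chunk to a zarr store.
ds = xr.open_mfdataset("input/*.nc", combine="by_coords")
ds.to_zarr("output.zarr", mode="w")
```
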
702265883 https://github.com/pydata/xarray/issues/4475#issuecomment-702265883 https://api.github.com/repos/pydata/xarray/issues/4475 MDEyOklzc3VlQ29tbWVudDcwMjI2NTg4Mw== heerad 2560426 2020-10-01T16:52:59Z 2020-10-01T16:52:59Z NONE

Multiple threads (the default), because it's recommended "for numeric code that releases the GIL (like NumPy, Pandas, Scikit-Learn, Numba, …)" according to the dask docs.

I guess I could do multi-threaded for the compute part (everything up to the definition of ds), then multi-process for the write part, but doesn't that then require me to load everything into memory before writing?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preprocess function for save_mfdataset 712189206
702226256 https://github.com/pydata/xarray/issues/4475#issuecomment-702226256 https://api.github.com/repos/pydata/xarray/issues/4475 MDEyOklzc3VlQ29tbWVudDcwMjIyNjI1Ng== dcherian 2448579 2020-10-01T15:46:45Z 2020-10-01T15:46:45Z MEMBER

Are you using multiple threads or multiple processes? IIUC you should be using multiple processes for max writing efficiency.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preprocess function for save_mfdataset 712189206
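
One way to act on the threads-versus-processes question above is to pick the dask scheduler explicitly at compute time; the delayed task below is a stand-in for the per-file netCDF writes discussed in this thread:

```
import dask

@dask.delayed
def write_one(path):
    # stand-in for writing a single netCDF file
    return path

if __name__ == "__main__":
    tasks = [write_one(f"out_{i}.nc") for i in range(4)]

    # dask's default scheduler for array workloads is threaded; "processes"
    # runs the tasks in a multiprocessing pool for the duration of the block.
    with dask.config.set(scheduler="processes"):
        dask.compute(tasks)
```
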
702178407 https://github.com/pydata/xarray/issues/4475#issuecomment-702178407 https://api.github.com/repos/pydata/xarray/issues/4475 MDEyOklzc3VlQ29tbWVudDcwMjE3ODQwNw== heerad 2560426 2020-10-01T14:34:28Z 2020-10-01T14:34:28Z NONE

Thank you, this works for me. However, it's quite slow and seems to scale worse than linearly as the length of datasets (the number of groups in the groupby) increases.

Could it be connected to https://github.com/pydata/xarray/issues/2912#issuecomment-485497398, where they suggest using save_mfdataset instead of to_netcdf? If so, there's a stronger case for supporting delayed objects in save_mfdataset, as you said.

Appreciate the help!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preprocess function for save_mfdataset 712189206
701694586 https://github.com/pydata/xarray/issues/4475#issuecomment-701694586 https://api.github.com/repos/pydata/xarray/issues/4475 MDEyOklzc3VlQ29tbWVudDcwMTY5NDU4Ng== shoyer 1217238 2020-09-30T23:13:33Z 2020-09-30T23:13:33Z MEMBER

I think we could support delayed objects in save_mfdataset, at least in principle. But if you're OK using delayed objects, you might as well write each netCDF file separately using dask.delayed, e.g.,

```
def write_dataset(ds, path):
    your_function(ds).to_netcdf(path)

result = [dask.delayed(write_dataset)(ds, path) for ds, path in zip(datasets, paths)]
dask.compute(result)
```

{
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preprocess function for save_mfdataset 712189206
701688956 https://github.com/pydata/xarray/issues/4475#issuecomment-701688956 https://api.github.com/repos/pydata/xarray/issues/4475 MDEyOklzc3VlQ29tbWVudDcwMTY4ODk1Ng== dcherian 2448579 2020-09-30T22:55:28Z 2020-09-30T22:55:28Z MEMBER

You could write to netCDF in your_function and avoid save_mfdataset altogether...

I guess this is a good argument for adding a preprocess kwarg.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preprocess function for save_mfdataset 712189206
701676076 https://github.com/pydata/xarray/issues/4475#issuecomment-701676076 https://api.github.com/repos/pydata/xarray/issues/4475 MDEyOklzc3VlQ29tbWVudDcwMTY3NjA3Ng== heerad 2560426 2020-09-30T22:17:24Z 2020-09-30T22:17:24Z NONE

Unfortunately that doesn't work:

`TypeError: save_mfdataset only supports writing Dataset objects, received type <class 'dask.delayed.Delayed'>`

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preprocess function for save_mfdataset 712189206
701577652 https://github.com/pydata/xarray/issues/4475#issuecomment-701577652 https://api.github.com/repos/pydata/xarray/issues/4475 MDEyOklzc3VlQ29tbWVudDcwMTU3NzY1Mg== dcherian 2448579 2020-09-30T18:51:25Z 2020-09-30T18:51:25Z MEMBER

you could use dask.delayed here

new_datasets = [dask.delayed(your_function)(dset) for dset in datasets]
xr.save_mfdataset(new_datasets, paths)

I think this will work, but I've never used save_mfdataset. This is how preprocess is implemented with open_mfdataset btw.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preprocess function for save_mfdataset 712189206
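
For comparison, the open_mfdataset behaviour referenced above already exists as a keyword argument: preprocess is applied to each file's dataset before the datasets are combined. The glob and the preprocessing function here are placeholders:

```
import xarray as xr

def your_function(ds):
    # hypothetical per-file preprocessing, mirroring the thread's example
    return ds

ds = xr.open_mfdataset("input/*.nc", preprocess=your_function, combine="by_coords")
```
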

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);