
issue_comments


13 rows where author_association = "MEMBER" and issue = 288184220 sorted by updated_at descending


Commenters: dcherian (7), rabernat (2), jhamman (2), TomNicholas (2)

Issue: We need a fast path for open_mfdataset (288184220)

Author association: MEMBER (all 13)
768627652 · dcherian · 2021-01-27T22:43:59Z
https://github.com/pydata/xarray/issues/1823#issuecomment-768627652

That's 34k 3MB files! I suggest combining them into ~1k 100MB files; that would work a lot better.

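A minimal sketch of the suggested consolidation, assuming the small files live under small_files/, concatenate cleanly along a shared dimension, and that ~34 inputs per output gives roughly 100MB files (all of these are illustrative assumptions, not part of the comment):

```python
# Sketch: combine many small netCDF files into fewer, larger ones.
# Paths, batch size, and output layout are assumptions for illustration.
import glob
import os

import xarray as xr

os.makedirs("combined", exist_ok=True)
paths = sorted(glob.glob("small_files/*.nc"))
batch = 34  # ~34 x 3 MB ≈ 100 MB per output file
for i in range(0, len(paths), batch):
    ds = xr.open_mfdataset(paths[i:i + batch], combine="by_coords")
    ds.to_netcdf(f"combined/part_{i // batch:04d}.nc")
    ds.close()
```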
768460310 · dcherian · 2021-01-27T17:50:09Z
https://github.com/pydata/xarray/issues/1823#issuecomment-768460310

Let's close this since there is an opt-in mostly-fast path. I've added an item to #4648 to cover adding an asv benchmark for mfdataset.

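For reference, the opt-in "mostly-fast path" mentioned here is the kwarg combination described further down this thread; a sketch (the glob pattern and parallel=True are assumptions):

```python
import xarray as xr

# The opt-in fast path: skip compat checks and index comparisons.
# "output/*.nc" is an assumed path pattern for illustration.
ds = xr.open_mfdataset(
    "output/*.nc",
    data_vars="minimal",  # only concat data vars that already contain the concat dim
    coords="minimal",     # likewise for non-dimension coordinates
    compat="override",    # skip equality checks; take values from the first file
    join="override",      # skip index comparison; use indexes from the first file
    parallel=True,        # open files in parallel via dask
)
```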
531913598 · dcherian · 2019-09-16T19:03:47Z
https://github.com/pydata/xarray/issues/1823#issuecomment-531913598

PS @rabernat

```python
%%time
ds = xr.open_mfdataset(
    "/glade/p/cesm/community/ASD-HIGH-RES-CESM1/hybrid_v5_rel04_BC5_ne120_t12_pop62/ocn/proc/tseries/monthly/*.nc",
    parallel=True, coords="minimal", data_vars="minimal", compat="override",
)
```

This completes in 40 seconds with 10 workers on cheyenne.

Reactions: hooray ×1, rocket ×2
531912893 · dcherian · 2019-09-16T19:01:57Z
https://github.com/pydata/xarray/issues/1823#issuecomment-531912893

=) @TomNicholas PRs welcome!

531905844 · TomNicholas · 2019-09-16T18:43:52Z
https://github.com/pydata/xarray/issues/1823#issuecomment-531905844

This is big if true!

But surely to close an issue raised by complaints about speed, we should really have some new asv speed tests?

Reactions: +1 ×1
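A hypothetical asv benchmark along the lines requested here (the class, file layout, and sizes are assumptions, not xarray's actual benchmark suite; asv discovers time_* methods automatically):

```python
# Hypothetical asv benchmark sketch for open_mfdataset.
import numpy as np
import xarray as xr


class OpenMFDatasetSuite:
    def setup(self):
        # write a pile of small netCDF files to concatenate along "time"
        self.paths = [f"bench_{i}.nc" for i in range(50)]
        for i, path in enumerate(self.paths):
            xr.Dataset(
                {"var": ("time", np.random.rand(10))},
                coords={"time": np.arange(i * 10, (i + 1) * 10)},
            ).to_netcdf(path)

    def time_open_mfdataset(self):
        xr.open_mfdataset(self.paths, combine="by_coords").close()
```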
531816800 · dcherian · 2019-09-16T15:00:16Z
https://github.com/pydata/xarray/issues/1823#issuecomment-531816800

YES! (well almost)

The PR lets you skip compatibility checks. The magic spell is:

```python
xr.open_mfdataset(..., data_vars="minimal", coords="minimal", compat="override")
```

You can skip index comparison by adding join="override".

What's left is extremely large indexes and lazy index / coordinate loading, but we have #2039 open for that. I will rename that issue.

If you have time, can you test it out?

Reactions: +1 ×1, heart ×1
531813935 · rabernat · 2019-09-16T14:53:57Z
https://github.com/pydata/xarray/issues/1823#issuecomment-531813935

Is this issue really closed?!?

🎉🎂🏆🥇

Reactions: hooray ×1
489135792 · dcherian · 2019-05-03T15:29:14Z (edited 2019-05-03T15:40:27Z)
https://github.com/pydata/xarray/issues/1823#issuecomment-489135792

One common use-case is files with large numbers of concat_dim-invariant non-dimensional co-ordinates. This is easy to speed up by dropping those variables from all but the first file.

e.g. https://github.com/pangeo-data/esgf2xarray/blob/6a5e4df0d329c2f23b403cbfbb65f0f1dfa98d52/esgf2zarr/aggregate.py#L107-L110

```python
# keep only coordinates from first ensemble member to simplify merge
first = member_dsets_aligned[0]
rest = [mds.reset_coords(drop=True) for mds in member_dsets_aligned[1:]]
objs_to_concat = [first] + rest
```

Similarly https://github.com/NCAR/intake-esm/blob/e86a8e8a80ce0fd4198665dbef3ba46af264b5ea/intake_esm/aggregate.py#L53-L57

```python
def merge_vars_two_datasets(ds1, ds2):
    """
    Merge two datasets, dropping all variables from second dataset
    that already exist in the first dataset's coordinates.
    """
```

See also #2039 (second code block)

One way to do this might be to add a master_file kwarg to open_mfdataset. This would imply coords='minimal', join='exact' (I think; prealigned=True in some other proposals) and would drop non-dimensional coordinates from all but the first file and then call concat.

As a bonus, it would assign attributes from the master_file to the merged dataset (for which I think there are open issues); this functionality exists in netCDF4.MFDataset, so that's a plus.

EDIT: #2039 (third code block) is also a possibility. This might look like:

```python
xr.open_mfdataset('files*.nc', master_file='first', concat_dim='time')
```

in which case the first file is read; all coords that are not concat_dim become drop_variables for an open_dataset call that reads the remaining files. We then merge with the first dataset and assign attrs.

EDIT2: master_file combines two different functionalities here: specifying a "template file" and a file to choose attributes from. So maybe we need two kwargs: template_file and attrs_from?

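A user-side sketch of the master_file proposal above. master_file is a proposed kwarg, not existing xarray API, so this emulates the described behaviour by hand; the function name and concat_dim default are assumptions:

```python
# Sketch only: emulates the proposed master_file behaviour in user code.
import xarray as xr


def open_mfdataset_master(paths, concat_dim="time"):
    first = xr.open_dataset(paths[0])  # the "master"/template file
    # coords that don't vary along concat_dim are assumed identical
    # across files, so skip reading them from the remaining files
    drop = [c for c in first.coords if concat_dim not in first[c].dims]
    rest = [xr.open_dataset(p, drop_variables=drop) for p in paths[1:]]
    combined = xr.concat([first.drop_vars(drop)] + rest, dim=concat_dim)
    # restore the dropped coords and take attrs from the master file
    combined = combined.assign_coords({c: first[c] for c in drop})
    combined.attrs = first.attrs
    return combined
```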
489101053 · rabernat · 2019-05-03T13:47:12Z
https://github.com/pydata/xarray/issues/1823#issuecomment-489101053

So I think it is quite important to consider this issue together with #2697. An XML specification called NcML already exists which tells software how to put together multiple netCDF files into a single virtual netCDF. We should leverage this existing spec as much as possible.

A realistic use case for me is that I have, say, 1000 files of high-res model output, each with large coordinate variables, all generated from the same model run, for which we know a priori that certain coordinates (dimension coordinates or otherwise) are identical. In that case, we could save a lot of disk reads (the slow part of open_mfdataset) by never reading those coordinates at all. Enabling this would require a pretty low-level change in xarray. For example, we couldn't even rely on open_dataset in its current form to open files, because open_dataset eagerly loads all dimension coordinates into indexes. One way forward might be to create a new Store class.

For a catalog of tricks I use to optimize opening these sorts of big, complex, multi-file datasets (e.g. CMIP), check out https://github.com/pangeo-data/esgf2xarray/blob/master/esgf2zarr/aggregate.py

489027263 · TomNicholas · 2019-05-03T09:25:00Z
https://github.com/pydata/xarray/issues/1823#issuecomment-489027263

@dcherian I'm sorry, I'm very interested in this but after reading the issues I'm still not clear on what's being proposed:

What exactly is the bottleneck?

  • Is it reading the coords from all the files?
  • Is it loading the coord values into memory?
  • Is it performing the alignment checks on those coords once they're in memory?
  • Is it performing alignment checks on the dimensions?
  • Is this suggestion relevant to datasets that don't have any coords?

Which of these steps would a join='exact' option omit?

> A related optimization would be to allow the user to pass coords='minimal' (or other concat coords options) via open_mfdataset.

But this is already an option to open_mfdataset?

488440840 · dcherian · 2019-05-01T21:42:01Z (edited 2019-05-01T21:45:38Z)
https://github.com/pydata/xarray/issues/1823#issuecomment-488440840

I am currently motivated to fix this.

  1. Over in https://github.com/pydata/xarray/pull/1413#issuecomment-302843502, @rabernat mentioned:

     > allowing the user to pass join='exact' via open_mfdataset. A related optimization would be to allow the user to pass coords='minimal' (or other concat coords options) via open_mfdataset.

  2. @shoyer suggested calling decode_cf later here, though perhaps this won't help too much: https://github.com/pydata/xarray/issues/1385#issuecomment-439263419

Is this all that we can do on the xarray side?

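A sketch of suggestion 2 above, deferring decode_cf until after the combine: open the raw files undecoded, concatenate, then decode once. The paths and the concat dim are assumptions:

```python
# Sketch of "calling decode_cf later"; "output/*.nc" and dim="time"
# are assumed for illustration.
import glob

import xarray as xr

paths = sorted(glob.glob("output/*.nc"))
datasets = [xr.open_dataset(p, decode_cf=False) for p in paths]
combined = xr.concat(datasets, dim="time")
combined = xr.decode_cf(combined)  # decode once, on the combined dataset
```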
372862174 · jhamman · 2018-03-14T00:13:34Z
https://github.com/pydata/xarray/issues/1823#issuecomment-372862174

@jbusecke - No. These options are not mutually exclusive. The parallel open is, in my opinion, the lowest hanging fruit so that's why I started there. There are other improvements that we can tackle incrementally.

357336022 · jhamman · 2018-01-12T19:46:12Z
https://github.com/pydata/xarray/issues/1823#issuecomment-357336022

@rabernat - Depending on the structure of the dataset, another possibility that would speed up some open_mfdataset tasks substantially is to implement the step of opening each file and getting its metadata in some parallel way (dask/joblib/etc.), and either returning just the dataset schema or a picklable version of the dataset itself. I think this will only be able to work with autoclose=True, but it could be quite useful when working with many files.

Reactions: +1 ×3
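A sketch of this parallel-open idea using dask.delayed, which is roughly the approach that later shipped as open_mfdataset(..., parallel=True); the paths are assumed:

```python
# Sketch: open each file's metadata in parallel by wrapping
# xr.open_dataset in dask.delayed. "output/*.nc" is an assumed pattern.
import glob

import dask
import xarray as xr

paths = sorted(glob.glob("output/*.nc"))
open_ = dask.delayed(xr.open_dataset)
# chunks={} keeps the data itself lazy (dask-backed) after the open
datasets = dask.compute(*[open_(p, chunks={}) for p in paths])
combined = xr.combine_by_coords(datasets)
```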

Table schema:
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);