
issue_comments


4 rows where issue = 252541496 and user = 1217238 sorted by updated_at descending


Facets: user = shoyer (4) · issue = open_mfdataset reads coords from disk multiple times (4) · author_association = MEMBER (4)

Columns: id, html_url, issue_url, node_id, user, created_at, updated_at (sort key, descending), author_association, body, reactions, performed_via_github_app, issue
327338750 · shoyer (user 1217238) · MEMBER · created 2017-09-06T00:20:49Z · updated 2017-09-06T00:20:49Z
https://github.com/pydata/xarray/issues/1521#issuecomment-327338750

Enjoy your holiday!

On Tue, Sep 5, 2017 at 5:01 PM crusaderky notifications@github.com wrote:

> P.S. need to put #1522 https://github.com/pydata/xarray/issues/1522 as a prerequisite in order not to lose my sanity, as this change is very much hitting on the same nails

Reactions: none · Issue: open_mfdataset reads coords from disk multiple times (252541496)
326867685 · shoyer (user 1217238) · MEMBER · created 2017-09-04T05:13:59Z · updated 2017-09-04T05:20:50Z
https://github.com/pydata/xarray/issues/1521#issuecomment-326867685

The problem is these lines in combine.py: https://github.com/pydata/xarray/blob/78ca20a6ea1a42eb637ae2ef09189f481cfda9a2/xarray/core/combine.py#L158-L168

We compare coordinates for equality in order to decide whether to ignore redundant coordinates or to stack them up. This happens when coords='different', which is the default choice; that default was convenient before we supported dask, but it is now a source of performance trouble, as you point out.
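
A minimal sketch of the comparison pattern this describes, assuming a list of lazily opened datasets with dask-backed variables; the helper name and structure are illustrative, not the actual xarray source:

def vars_differ(datasets, vname):
    # Keep the first dataset's variable as the lazy reference...
    v = datasets[0].variables[vname]
    for ds in datasets[1:]:
        # ...so every equals() call here recomputes the reference from disk
        # in addition to loading the variable it is compared against.
        if not ds.variables[vname].equals(v):
            return True
    return False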

Reactions: none · Issue: open_mfdataset reads coords from disk multiple times (252541496)
326868217 · shoyer (user 1217238) · MEMBER · created 2017-09-04T05:18:55Z · updated 2017-09-04T05:18:55Z
https://github.com/pydata/xarray/issues/1521#issuecomment-326868217

So, to be more precise, I think the problem is that the first variable is computed many times over (once per comparison), inside the differs helper function above.

A very simple fix, slightly more conservative than loading every coordinate into memory, is to compute only the first dataset's variable, e.g., change v = datasets[0].variables[vname] to v = datasets[0].variables[vname].compute(). I am slightly nervous about the potential memory overhead of loading all coordinates into memory.
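
The same sketch with that one-line change applied (again illustrative, not the actual source): computing the reference variable once up front means each comparison only has to read the other dataset's coordinate:

def vars_differ(datasets, vname):
    # Load the reference coordinate into memory a single time...
    v = datasets[0].variables[vname].compute()
    for ds in datasets[1:]:
        # ...so each equals() now only reads ds's variable from disk.
        if not ds.variables[vname].equals(v):
            return True
    return False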

Reactions: none · Issue: open_mfdataset reads coords from disk multiple times (252541496)
324708622 · shoyer (user 1217238) · MEMBER · created 2017-08-24T17:51:42Z · updated 2017-08-24T17:51:42Z
https://github.com/pydata/xarray/issues/1521#issuecomment-324708622

> change open_dataset() to always eagerly load the coords to memory, regardless of the chunks parameter. Is there any valid use case where lazy coords are actually desirable?

In principle, coords can have the same shape as data variables. In those cases, you probably want to use the same chunking scheme.
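
As a rough illustration of that point (the variable and coordinate names here are made up): a 2-D coordinate with the same shape as the data variable picks up the same chunking when the dataset is chunked:

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"sst": (("y", "x"), np.zeros((1000, 1000)))},
    coords={"lon": (("y", "x"), np.zeros((1000, 1000)))},  # 2-D coord, same shape as the data
)
chunked = ds.chunk({"y": 100, "x": 100})
print(chunked["sst"].data.chunksize)  # (100, 100)
print(chunked["lon"].data.chunksize)  # (100, 100): the coordinate is chunked identically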

> An additional, more radical observation is that, very frequently, a user knows in advance that all coords are aligned. In this use case, the user could explicitly request xarray to blindly trust this assumption, and thus skip loading the coords not based on concat_dim in all datasets beyond the first.

@rabernat is interested in this use case. See https://github.com/pydata/xarray/issues/1385 and https://github.com/pydata/xarray/pull/1413 for discussion.

> This also leads to another inefficiency of open_dataset(chunks=...), where you may have your data e.g. shape=(50000, 230), chunks=(1, 230). If you pass the chunks above to open_dataset, it will break down the coords on the first dim into dask arrays of 1 element - which hardly benefits anybody. Things get worse if the dataset is compressed with zlib or whatever, but only the data vars were chunked at the moment of writing. Am I correct in understanding that the whole coord var will be read from disk 50000 times over?

Yes, I think you're correct here as well. This is also an annoying inefficiency, but the API design is a little tricky.
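
A small self-contained sketch of that inefficiency (names are made up; the shape mirrors the one quoted above): chunking the first dimension with a chunk size of 1 splits any non-index coordinate along that dimension into one-element dask chunks:

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"data": (("t", "x"), np.zeros((50000, 230)))},
    coords={"label": ("t", np.arange(50000))},  # non-index coordinate along the first dim
)
chunked = ds.chunk({"t": 1, "x": 230})
print(chunked["data"].data.chunksize)   # (1, 230)
print(chunked["label"].data.chunksize)  # (1,): the coordinate ends up as 50000 one-element chunks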

Reactions: none · Issue: open_mfdataset reads coords from disk multiple times (252541496)

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
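
A hedged sketch of reproducing the query behind this page in Python, assuming a local copy of the SQLite database (the filename github.db is a placeholder):

import sqlite3

conn = sqlite3.connect("github.db")
rows = conn.execute(
    """
    SELECT id, user, created_at, updated_at, author_association, body
    FROM issue_comments
    WHERE issue = 252541496 AND user = 1217238
    ORDER BY updated_at DESC
    """
).fetchall()
for comment_id, user, created, updated, association, body in rows:
    print(comment_id, updated, association)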