
issue_comments


14 rows where issue = 252541496 sorted by updated_at descending


user 5

  • crusaderky 7
  • shoyer 4
  • rabernat 1
  • jhamman 1
  • fmaussion 1

issue 1

  • open_mfdataset reads coords from disk multiple times · 14

author_association 1

  • MEMBER 14
Columns: id, html_url, issue_url, node_id, user, created_at, updated_at (sorted), author_association, body, reactions, performed_via_github_app, issue
331307451 https://github.com/pydata/xarray/issues/1521#issuecomment-331307451 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMzMTMwNzQ1MQ== crusaderky 6213168 2017-09-21T23:18:10Z 2017-09-21T23:18:10Z MEMBER

@jhamman There's already #1551 open but I need to heavily rethink it to cater for all the various use cases offered by the data_vars and coords parameters of concat().
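A minimal sketch of the data_vars/coords options mentioned above (illustrative datasets, not from the thread):

```
import numpy as np
import xarray as xr

# Two datasets with an identical non-index coord 'c' (illustrative data).
ds0 = xr.Dataset({'data': ('x', np.arange(3.0))},
                 coords={'x': [0, 1, 2], 'c': ('x', np.zeros(3))})
ds1 = xr.Dataset({'data': ('x', np.arange(3.0))},
                 coords={'x': [0, 1, 2], 'c': ('x', np.zeros(3))})

# coords='different' (the default) compares 'c' across datasets to decide
# whether it must be stacked along the new dim; the comparison is what
# forces lazy coords to load.
xr.concat([ds0, ds1], dim='t', coords='different')

# coords='minimal' skips the comparison and keeps 'c' from the first dataset.
xr.concat([ds0, ds1], dim='t', coords='minimal')
```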

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
331306970 https://github.com/pydata/xarray/issues/1521#issuecomment-331306970 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMzMTMwNjk3MA== jhamman 2443309 2017-09-21T23:14:45Z 2017-09-21T23:14:45Z MEMBER

@crusaderky - happy to help with this. Maybe you can get a PR open and then I can provide some ASV benchmarking.
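For reference, an asv benchmark for this case might look roughly like the following sketch (class name, file names, and sizes are hypothetical; a netCDF backend is assumed):

```
import numpy as np
import xarray as xr

class OpenMFDatasetCoords:
    # Hypothetical asv benchmark: time open_mfdataset on files whose
    # non-index coords must be compared across datasets.
    def setup(self):
        data = np.zeros((1, 1000))
        for r in range(5):
            ds = xr.Dataset(
                {'data': (('r', 'c'), data)},
                coords={'r': [r], 'nonindex': ('c', np.zeros(1000))})
            ds.to_netcdf('bench.%02d.nc' % r)

    def time_open_mfdataset(self):
        xr.open_mfdataset('bench.*.nc').close()
```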

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
331303206 https://github.com/pydata/xarray/issues/1521#issuecomment-331303206 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMzMTMwMzIwNg== fmaussion 10050469 2017-09-21T22:50:58Z 2017-09-21T22:50:58Z MEMBER

Thanks @crusaderky for looking into this, I think this is very important.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
331301408 https://github.com/pydata/xarray/issues/1521#issuecomment-331301408 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMzMTMwMTQwOA== crusaderky 6213168 2017-09-21T22:39:55Z 2017-09-21T22:39:55Z MEMBER

Back to banging my head on it. Expect a heavy rewrite of combine.py. Can't give an ETA, but it's going to be a fair number of hours.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
327511317 https://github.com/pydata/xarray/issues/1521#issuecomment-327511317 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNzUxMTMxNw== rabernat 1197350 2017-09-06T14:59:29Z 2017-09-06T14:59:29Z MEMBER

This is closely related to #1385 and my aborted attempted fix in #1413.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
327338750 https://github.com/pydata/xarray/issues/1521#issuecomment-327338750 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNzMzODc1MA== shoyer 1217238 2017-09-06T00:20:49Z 2017-09-06T00:20:49Z MEMBER

Enjoy your holiday!

On Tue, Sep 5, 2017 at 5:01 PM crusaderky <notifications@github.com> wrote:

> P.S. need to put #1522 (https://github.com/pydata/xarray/issues/1522) as a prerequisite in order not to lose my sanity, as this change is very much hitting on the same nails


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
327336207 https://github.com/pydata/xarray/issues/1521#issuecomment-327336207 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNzMzNjIwNw== crusaderky 6213168 2017-09-06T00:01:39Z 2017-09-06T00:01:39Z MEMBER

P.S. need to put #1522 as a prerequisite in order not to lose my sanity, as this change is very much hitting on the same nails

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
327335940 https://github.com/pydata/xarray/issues/1521#issuecomment-327335940 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNzMzNTk0MA== crusaderky 6213168 2017-09-05T23:59:41Z 2017-09-05T23:59:41Z MEMBER

Just realised that you can concat(data_vars='different') and have the exact same problem on data_vars :|

Also, with "different" I realised that you're comparing the variable contents twice: once in _calc_concat_over and again at the end of _dataset_concat. This also slows down concat on a pure-numpy backend.

Need more time to work on this... I'll be on holiday for the next week with no access to my PC; should be able to continue after the 12 Sept.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
326867685 https://github.com/pydata/xarray/issues/1521#issuecomment-326867685 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNjg2NzY4NQ== shoyer 1217238 2017-09-04T05:13:59Z 2017-09-04T05:20:50Z MEMBER

The problem is these lines in combine.py: https://github.com/pydata/xarray/blob/78ca20a6ea1a42eb637ae2ef09189f481cfda9a2/xarray/core/combine.py#L158-L168

We compare coordinates for equality in order to decide whether to ignore redundant coordinates or stack them up. This happens if coords='different', which is the default choice; it was convenient before we supported dask, but is now a source of performance trouble, as you point out.
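A small demonstration of that cost (illustrative, not xarray's internal code): deciding whether two lazy coords are redundant requires computing both of them.

```
import dask.array as da
import xarray as xr

a = xr.Variable('x', da.zeros(5, chunks=5))
b = xr.Variable('x', da.zeros(5, chunks=5))

# equals() pulls both lazy arrays into memory to compare values; this is
# what coords='different' triggers for every non-index coord.
print(a.equals(b))  # True
```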

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
326868217 https://github.com/pydata/xarray/issues/1521#issuecomment-326868217 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNjg2ODIxNw== shoyer 1217238 2017-09-04T05:18:55Z 2017-09-04T05:18:55Z MEMBER

So, to be more precise, I think the problem is that the first variable is computed many times over (once per comparison), inside the differs helper function above.

A very simple fix, slightly more conservative than loading every coordinate into memory, would be to compute just the first variable up front, e.g., changing v = datasets[0].variables[vname] to v = datasets[0].variables[vname].compute(). I am slightly nervous about the potential memory overhead of loading all coordinates into memory.
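A sketch of what that suggestion amounts to (a hypothetical paraphrase, not the actual patch):

```
import dask.array as da
import xarray as xr

datasets = [xr.Dataset(coords={'c': ('x', da.zeros(5, chunks=5))})
            for _ in range(3)]
vname = 'c'

# Compute the reference variable once up front, so each subsequent equals()
# comparison reuses the in-memory copy instead of recomputing it.
v = datasets[0].variables[vname].compute()
differs = any(not ds.variables[vname].equals(v) for ds in datasets[1:])
print(differs)  # False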

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
326754319 https://github.com/pydata/xarray/issues/1521#issuecomment-326754319 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNjc1NDMxOQ== crusaderky 6213168 2017-09-02T16:24:56Z 2017-09-02T16:24:56Z MEMBER

Getting closer. The problem is in xarray.concat, which resolves non-index dask coords TWICE, even though it arguably should not resolve them at all (alignment should be done on index coords only?).

```
import xarray
import numpy
import dask.array

def kernel(label):
    print("Kernel [%s] invoked!" % label)
    return numpy.array([1, 2])

a = dask.array.Array(name='a', dask={('a', 0): (kernel, 'a')}, chunks=((2, ), ), dtype=int)
b = dask.array.Array(name='b', dask={('b', 0): (kernel, 'b')}, chunks=((2, ), ), dtype=int)

ds0 = xarray.Dataset(coords={'x': ('x', [1, 2]), 'y': ('x', a)})
ds1 = xarray.Dataset(coords={'x': ('x', [1, 2]), 'y': ('x', b)})
xarray.concat([ds0, ds1], dim='z')
```

Output:

```
Kernel [a] invoked!
Kernel [b] invoked!
Kernel [b] invoked!
Kernel [a] invoked!

<xarray.Dataset>
Dimensions:  (x: 2)
Coordinates:
  * x        (x) int64 1 2
    y        (x) int64 dask.array<shape=(2,), chunksize=(2,)>
Data variables:
    *empty*
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
326750703 https://github.com/pydata/xarray/issues/1521#issuecomment-326750703 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNjc1MDcwMw== crusaderky 6213168 2017-09-02T15:24:20Z 2017-09-02T15:27:12Z MEMBER

As suspected, the problem is caused specifically by non-index coords:

```
import xarray
import numpy

data = numpy.random.randint(1<<63, size=1000000)

for r in range(50):
    ds = xarray.Dataset(
        coords={'r': [r], 'c': data, 'otherindex': data},
        data_vars={'data': (('r', 'c'), data.reshape(1, data.size))})
    ds.to_netcdf('fast.%02d.nc' % r)
    del ds['otherindex']
    ds.coords['nonindex'] = ('c', data)
    ds.to_netcdf('slow.%02d.nc' % r)

def load_coords(ds):
    for coord in ds.coords.values():
        coord.load()
    return ds

%time xarray.open_mfdataset('fast.*.nc')
%time xarray.open_mfdataset('fast.*.nc', preprocess=load_coords)
%time xarray.open_mfdataset('slow.*.nc')
%time xarray.open_mfdataset('slow.*.nc', preprocess=load_coords)
```

Output:

```
CPU times: user 332 ms, sys: 88 ms, total: 420 ms
Wall time: 420 ms
CPU times: user 348 ms, sys: 84 ms, total: 432 ms
Wall time: 430 ms
CPU times: user 1.13 s, sys: 200 ms, total: 1.33 s
Wall time: 1.07 s
CPU times: user 596 ms, sys: 104 ms, total: 700 ms
Wall time: 697 ms
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
324708622 https://github.com/pydata/xarray/issues/1521#issuecomment-324708622 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNDcwODYyMg== shoyer 1217238 2017-08-24T17:51:42Z 2017-08-24T17:51:42Z MEMBER

> change open_dataset() to always eagerly load the coords to memory, regardless of the chunks parameter. Is there any valid use case where lazy coords are actually desirable?

In principle, coords can have the same shape as data variables. In those cases, you probably want to use the same chunking scheme.
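For example (illustrative, not from the thread), a curvilinear grid carries 2D lat/lon coords as large as the data itself, where keeping coords lazy and chunked like the data is genuinely useful:

```
import dask.array as da
import xarray as xr

# A 2D coord as large as the data itself, chunked the same way (made-up grid).
temp = da.zeros((4, 4), chunks=(2, 2))
lat = da.zeros((4, 4), chunks=(2, 2))
ds = xr.Dataset({'temp': (('y', 'x'), temp)},
                coords={'lat': (('y', 'x'), lat)})
```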

> An additional, more radical observation is that, very frequently, a user knows in advance that all coords are aligned. In this use case, the user could explicitly request xarray to blindly trust this assumption, and thus skip loading the coords not based on concat_dim in all datasets beyond the first.

@rabernat is interested in this use case. See https://github.com/pydata/xarray/issues/1385 and https://github.com/pydata/xarray/pull/1413 for discussion.
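One way users can already approximate that "blind trust" today (a sketch, valid only under the assumption that the coords really are aligned; the file pattern is hypothetical) is to drop non-index coords in a preprocess hook, at the cost of losing them from the result:

```
import xarray as xr

def drop_nonindex_coords(ds):
    # Discard every coord that is not a dimension index, so there is
    # nothing left for open_mfdataset to compare or load.
    return ds.reset_coords(
        [name for name in ds.coords if name not in ds.indexes], drop=True)

# ds = xr.open_mfdataset('files.*.nc', preprocess=drop_nonindex_coords)
```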

> This also leads to another inefficiency of open_dataset(chunks=...), where you may have your data e.g. shape=(50000, 2**30), chunks=(1, 2**30). If you pass the chunks above to open_dataset, it will break down the coords on the first dim into dask arrays of 1 element - which hardly benefits anybody. Things get worse if the dataset is compressed with zlib or whatever, but only the data vars were chunked at the moment of writing. Am I correct in understanding that the whole coord var will be read from disk 50000 times over?

Yes, I think you're correct here as well. This is also an annoying inefficiency, but the API design is a little tricky.
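To make that inefficiency concrete (an illustrative toy shape, not the 50000-row case):

```
import numpy as np
import xarray as xr

ds = xr.Dataset({'data': (('r', 'c'), np.zeros((8, 4)))},
                coords={'nonindex': ('r', np.zeros(8))})

# Chunking the data along 'r' also shreds the coord into 1-element chunks,
# so reading it back means one tiny read per chunk.
chunked = ds.chunk({'r': 1})
print(chunked['nonindex'].chunks)  # ((1, 1, 1, 1, 1, 1, 1, 1),)
```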

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
324586771 https://github.com/pydata/xarray/issues/1521#issuecomment-324586771 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNDU4Njc3MQ== crusaderky 6213168 2017-08-24T09:41:22Z 2017-08-24T09:42:16Z MEMBER

change open_dataset() to always eagerly load the coords to memory, regardless of the chunks parameter. Is there any valid use case where lazy coords are actually desirable?

This also leads to another inefficiency of open_dataset(chunks=...), where you may have your data e.g. shape=(50000, 2**30), chunks=(1, 2**30). If you pass the chunks above to open_dataset, it will break down the coords on the first dim into dask arrays of 1 element - which hardly benefits anybody. Things get worse if the dataset is compressed with zlib or whatever, but only the data vars were chunked at the moment of writing. Am I correct in understanding that the whole coord var will be read from disk 50000 times over?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);