issue_comments

7 rows where author_association = "MEMBER", issue = 252541496 (open_mfdataset reads coords from disk multiple times) and user = 6213168 (crusaderky), sorted by updated_at descending

331307451 · https://github.com/pydata/xarray/issues/1521#issuecomment-331307451
crusaderky (6213168) · MEMBER · created_at: 2017-09-21T23:18:10Z · updated_at: 2017-09-21T23:18:10Z
issue_url: https://api.github.com/repos/pydata/xarray/issues/1521 · node_id: MDEyOklzc3VlQ29tbWVudDMzMTMwNzQ1MQ==

@jhamman There's already #1551 open, but I need to heavily rethink it to cater for all the various use cases offered by the data_vars and coords parameters of concat().
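
For context, a minimal sketch of the concat() options being referred to (this snippet is not from the thread; the datasets and values are illustrative):

```
import numpy
import xarray

ds0 = xarray.Dataset(
    data_vars={'a': ('x', numpy.zeros(3)), 'b': ('x', numpy.zeros(3))},
    coords={'x': [0, 1, 2], 'c': ('x', numpy.zeros(3))})
ds1 = xarray.Dataset(
    data_vars={'a': ('x', numpy.zeros(3)), 'b': ('x', numpy.ones(3))},
    coords={'x': [0, 1, 2], 'c': ('x', numpy.zeros(3))})

# data_vars / coords control which variables get the new 't' dimension:
#   'minimal'   - only variables that already contain the concat dimension
#   'all'       - every data variable / non-index coordinate
#   'different' - additionally those whose values differ between datasets,
#                 which requires loading and comparing their contents
xarray.concat([ds0, ds1], dim='t', data_vars='different', coords='minimal')
```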

reactions: 1 (+1: 1)

331301408 · https://github.com/pydata/xarray/issues/1521#issuecomment-331301408
crusaderky (6213168) · MEMBER · created_at: 2017-09-21T22:39:55Z · updated_at: 2017-09-21T22:39:55Z
issue_url: https://api.github.com/repos/pydata/xarray/issues/1521 · node_id: MDEyOklzc3VlQ29tbWVudDMzMTMwMTQwOA==

Back to banging my head on it. Expect a heavy rewrite of combine.py. I can't give an ETA yet, but it's going to be a fair number of hours.

reactions: none

327336207 · https://github.com/pydata/xarray/issues/1521#issuecomment-327336207
crusaderky (6213168) · MEMBER · created_at: 2017-09-06T00:01:39Z · updated_at: 2017-09-06T00:01:39Z
issue_url: https://api.github.com/repos/pydata/xarray/issues/1521 · node_id: MDEyOklzc3VlQ29tbWVudDMyNzMzNjIwNw==

P.S. I need to make #1522 a prerequisite in order not to lose my sanity, as this change is very much hitting the same nails.

reactions: none

327335940 · https://github.com/pydata/xarray/issues/1521#issuecomment-327335940
crusaderky (6213168) · MEMBER · created_at: 2017-09-05T23:59:41Z · updated_at: 2017-09-05T23:59:41Z
issue_url: https://api.github.com/repos/pydata/xarray/issues/1521 · node_id: MDEyOklzc3VlQ29tbWVudDMyNzMzNTk0MA==

Just realised that you can call concat(data_vars='different') and hit the exact same problem on data_vars :|

Also, with "different" I realised that the variable contents are compared twice, once in _calc_concat_over and again at the end of _dataset_concat. This also slows down concat on a pure-numpy backend.

Need more time to work on this... I'll be on holiday for the next week with no access to my PC; should be able to continue after the 12 Sept.
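
For illustration, a hypothetical repro of the data_vars case, in the same spirit as the dask-coords snippet further down the page (this sketch is not from the original comment; the kernel function stands in for an expensive on-disk read, and how many times it fires depends on the xarray version):

```
import dask.array
import numpy
import xarray

def kernel(label):
    # stand-in for an expensive read from disk
    print("Kernel [%s] invoked!" % label)
    return numpy.array([1, 2])

a = dask.array.Array(name='a', dask={('a', 0): (kernel, 'a')}, chunks=((2,),), dtype=int)
b = dask.array.Array(name='b', dask={('b', 0): (kernel, 'b')}, chunks=((2,),), dtype=int)

ds0 = xarray.Dataset(data_vars={'v': ('x', a)}, coords={'x': [1, 2]})
ds1 = xarray.Dataset(data_vars={'v': ('x', b)}, coords={'x': [1, 2]})

# data_vars='different' must compare the contents of 'v' across the datasets,
# so the kernels fire even though no data was explicitly requested
xarray.concat([ds0, ds1], dim='z', data_vars='different')
```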

reactions: none

326754319 · https://github.com/pydata/xarray/issues/1521#issuecomment-326754319
crusaderky (6213168) · MEMBER · created_at: 2017-09-02T16:24:56Z · updated_at: 2017-09-02T16:24:56Z
issue_url: https://api.github.com/repos/pydata/xarray/issues/1521 · node_id: MDEyOklzc3VlQ29tbWVudDMyNjc1NDMxOQ==

Getting closer. The problem is in xarray.concat, which resolves non-index dask coords TWICE, even though it should not resolve them at all (as alignment should be done on index coords only?).

```
import xarray
import numpy
import dask.array

def kernel(label):
    print("Kernel [%s] invoked!" % label)
    return numpy.array([1, 2])

a = dask.array.Array(name='a', dask={('a', 0): (kernel, 'a')}, chunks=((2, ), ), dtype=int)
b = dask.array.Array(name='b', dask={('b', 0): (kernel, 'b')}, chunks=((2, ), ), dtype=int)

ds0 = xarray.Dataset(coords={'x': ('x', [1, 2]), 'y': ('x', a)})
ds1 = xarray.Dataset(coords={'x': ('x', [1, 2]), 'y': ('x', b)})
xarray.concat([ds0, ds1], dim='z')

Output:

Kernel [a] invoked!
Kernel [b] invoked!
Kernel [b] invoked!Kernel [a] invoked!

<xarray.Dataset>
Dimensions:  (x: 2)
Coordinates:
  * x        (x) int64 1 2
    y        (x) int64 dask.array<shape=(2,), chunksize=(2,)>
Data variables:
    *empty*
```

reactions: none

326750703 · https://github.com/pydata/xarray/issues/1521#issuecomment-326750703
crusaderky (6213168) · MEMBER · created_at: 2017-09-02T15:24:20Z · updated_at: 2017-09-02T15:27:12Z
issue_url: https://api.github.com/repos/pydata/xarray/issues/1521 · node_id: MDEyOklzc3VlQ29tbWVudDMyNjc1MDcwMw==

As suspected, the problem is caused specifically by non-index coords:

```
import xarray
import numpy

data = numpy.random.randint(1<<63, size=1000000)

for r in range(50):
    ds = xarray.Dataset(
        coords={'r': [r], 'c': data, 'otherindex': data},
        data_vars={'data': (('r', 'c'), data.reshape(1, data.size))})
    ds.to_netcdf('fast.%02d.nc' % r)
    del ds['otherindex']
    ds.coords['nonindex'] = ('c', data)
    ds.to_netcdf('slow.%02d.nc' % r)

def load_coords(ds):
    for coord in ds.coords.values():
        coord.load()
    return ds

%time xarray.open_mfdataset('fast.*.nc')
%time xarray.open_mfdataset('fast.*.nc', preprocess=load_coords)
%time xarray.open_mfdataset('slow.*.nc')
%time xarray.open_mfdataset('slow.*.nc', preprocess=load_coords)

output:

CPU times: user 332 ms, sys: 88 ms, total: 420 ms
Wall time: 420 ms
CPU times: user 348 ms, sys: 84 ms, total: 432 ms
Wall time: 430 ms
CPU times: user 1.13 s, sys: 200 ms, total: 1.33 s
Wall time: 1.07 s
CPU times: user 596 ms, sys: 104 ms, total: 700 ms
Wall time: 697 ms
```

reactions: none

324586771 · https://github.com/pydata/xarray/issues/1521#issuecomment-324586771
crusaderky (6213168) · MEMBER · created_at: 2017-08-24T09:41:22Z · updated_at: 2017-08-24T09:42:16Z
issue_url: https://api.github.com/repos/pydata/xarray/issues/1521 · node_id: MDEyOklzc3VlQ29tbWVudDMyNDU4Njc3MQ==

Change open_dataset() to always eagerly load the coords into memory, regardless of the chunks parameter. Is there any valid use case where lazy coords are actually desirable?

This also points at another inefficiency of open_dataset(chunks=...): you may have data with e.g. shape=(50000, 2**30) and chunks=(1, 2**30). If you pass those chunks to open_dataset, it will break the coords on the first dim into dask arrays of 1 element each, which hardly benefits anybody. Things get worse if the dataset is compressed with zlib or similar but only the data vars were chunked at the moment of writing. Am I correct in understanding that the whole coord var will then be read from disk 50000 times over?
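
A possible interim workaround, in the spirit of the load_coords preprocess used in the timing test earlier on this page: eagerly load the coordinates right after opening, so only the data variables stay as dask arrays. The file name and chunking below are illustrative only:

```
import xarray

def load_coords(ds):
    # force every coordinate into memory so that later alignment/concat
    # does not go back to disk for tiny one-element dask chunks
    for coord in ds.coords.values():
        coord.load()
    return ds

# hypothetical file name and chunking, for illustration
ds = load_coords(xarray.open_dataset('example.nc', chunks={'time': 1}))
```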

reactions: none

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);