issue_comments
7 rows where author_association = "MEMBER", issue = 252541496 and user = 6213168 sorted by updated_at descending
id | html_url | issue_url | node_id | user | created_at | updated_at ▲ | author_association | body | reactions | performed_via_github_app | issue |
---|---|---|---|---|---|---|---|---|---|---|---|
331307451 | https://github.com/pydata/xarray/issues/1521#issuecomment-331307451 | https://api.github.com/repos/pydata/xarray/issues/1521 | MDEyOklzc3VlQ29tbWVudDMzMTMwNzQ1MQ== | crusaderky 6213168 | 2017-09-21T23:18:10Z | 2017-09-21T23:18:10Z | MEMBER | @jhamman There's already #1551 open but I need to heavily rethink it to cater for all the various use cases offered by the data_vars and coords parameters of concat(). |
{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
open_mfdataset reads coords from disk multiple times 252541496 | |
331301408 | https://github.com/pydata/xarray/issues/1521#issuecomment-331301408 | https://api.github.com/repos/pydata/xarray/issues/1521 | MDEyOklzc3VlQ29tbWVudDMzMTMwMTQwOA== | crusaderky 6213168 | 2017-09-21T22:39:55Z | 2017-09-21T22:39:55Z | MEMBER | Back to banging my head on it. Expect a heavy rewrite of combine.py. I can't give an ETA, but it's going to take a fair number of hours. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
open_mfdataset reads coords from disk multiple times 252541496 | |
327336207 | https://github.com/pydata/xarray/issues/1521#issuecomment-327336207 | https://api.github.com/repos/pydata/xarray/issues/1521 | MDEyOklzc3VlQ29tbWVudDMyNzMzNjIwNw== | crusaderky 6213168 | 2017-09-06T00:01:39Z | 2017-09-06T00:01:39Z | MEMBER | P.S. need to put #1522 as a prerequisite in order not to lose my sanity, as this change is very much hitting on the same nails |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
open_mfdataset reads coords from disk multiple times 252541496 | |
327335940 | https://github.com/pydata/xarray/issues/1521#issuecomment-327335940 | https://api.github.com/repos/pydata/xarray/issues/1521 | MDEyOklzc3VlQ29tbWVudDMyNzMzNTk0MA== | crusaderky 6213168 | 2017-09-05T23:59:41Z | 2017-09-05T23:59:41Z | MEMBER | Just realised that you can concat(data_vars='different') and have the exact same problem on data_vars :| Also, with "different" I realised that you're comparing the variable contents twice, once in _calc_concat_over and another time at the end of _dataset_concat. This also slows down concat on the pure-numpy backend. Need more time to work on this... I'll be on holiday for the next week with no access to my PC; should be able to continue after 12 Sept. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
open_mfdataset reads coords from disk multiple times 252541496 | |
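(A minimal sketch, not from the thread, of the data_vars='different' cost described in the comment above. The variable name 'v' and the two datasets are invented, and the lazy-kernel construction is borrowed from the 2017-09-02 comment below; the point is only that deciding whether 'v' differs between datasets forces both dask arrays to be evaluated.)

```
# Sketch only: 'v', ds0 and ds1 are illustrative, not taken from the issue.
import numpy
import dask.array
import xarray

def kernel(label):
    # Each print means the lazy array was actually computed.
    print("Kernel [%s] invoked!" % label)
    return numpy.array([1, 2])

a = dask.array.Array(name='a', dask={('a', 0): (kernel, 'a')}, chunks=((2, ), ), dtype=int)
b = dask.array.Array(name='b', dask={('b', 0): (kernel, 'b')}, chunks=((2, ), ), dtype=int)

ds0 = xarray.Dataset(data_vars={'v': ('x', a)})
ds1 = xarray.Dataset(data_vars={'v': ('x', b)})

# With data_vars='different', concat compares the contents of 'v' across the
# datasets to decide whether it needs concatenating, which evaluates both
# kernels even though the inputs and the result are nominally lazy.
xarray.concat([ds0, ds1], dim='z', data_vars='different')
```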
326754319 | https://github.com/pydata/xarray/issues/1521#issuecomment-326754319 | https://api.github.com/repos/pydata/xarray/issues/1521 | MDEyOklzc3VlQ29tbWVudDMyNjc1NDMxOQ== | crusaderky 6213168 | 2017-09-02T16:24:56Z | 2017-09-02T16:24:56Z | MEMBER | Getting closer. The problem is in xarray.concat, which resolves non-index dask coords, TWICE, even if it should not resolve them at all (as alignment should be done on index coords only?)
```
import xarray
import numpy
import dask.array

def kernel(label):
    print("Kernel [%s] invoked!" % label)
    return numpy.array([1, 2])

a = dask.array.Array(name='a', dask={('a', 0): (kernel, 'a')}, chunks=((2, ), ), dtype=int)
b = dask.array.Array(name='b', dask={('b', 0): (kernel, 'b')}, chunks=((2, ), ), dtype=int)

ds0 = xarray.Dataset(coords={'x': ('x', [1, 2]), 'y': ('x', a)})
ds1 = xarray.Dataset(coords={'x': ('x', [1, 2]), 'y': ('x', b)})
xarray.concat([ds0, ds1], dim='z')

<xarray.Dataset>
Dimensions:  (x: 2)
Coordinates:
  * x        (x) int64 1 2
    y        (x) int64 dask.array<shape=(2,), chunksize=(2,)>
Data variables:
    empty
``` |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
open_mfdataset reads coords from disk multiple times 252541496 | |
326750703 | https://github.com/pydata/xarray/issues/1521#issuecomment-326750703 | https://api.github.com/repos/pydata/xarray/issues/1521 | MDEyOklzc3VlQ29tbWVudDMyNjc1MDcwMw== | crusaderky 6213168 | 2017-09-02T15:24:20Z | 2017-09-02T15:27:12Z | MEMBER | As suspected, the problem is caused specifically by non-index coords:
```
import xarray
import numpy

data = numpy.random.randint(1<<63, size=1000000)

for r in range(50):
    ds = xarray.Dataset(
        coords={'r': [r], 'c': data, 'otherindex': data},
        data_vars={'data': (('r', 'c'), data.reshape(1, data.size))})
    ds.to_netcdf('fast.%02d.nc' % r)
    del ds['otherindex']
    ds.coords['nonindex'] = ('c', data)
    ds.to_netcdf('slow.%02d.nc' % r)

def load_coords(ds):
    for coord in ds.coords.values():
        coord.load()
    return ds

%time xarray.open_mfdataset('fast.*.nc')
%time xarray.open_mfdataset('fast.*.nc', preprocess=load_coords)
%time xarray.open_mfdataset('slow.*.nc')
%time xarray.open_mfdataset('slow.*.nc', preprocess=load_coords)
``` |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
open_mfdataset reads coords from disk multiple times 252541496 | |
324586771 | https://github.com/pydata/xarray/issues/1521#issuecomment-324586771 | https://api.github.com/repos/pydata/xarray/issues/1521 | MDEyOklzc3VlQ29tbWVudDMyNDU4Njc3MQ== | crusaderky 6213168 | 2017-08-24T09:41:22Z | 2017-08-24T09:42:16Z | MEMBER |
This also leads to another inefficiency of open_dataset(chunks=...), where your data may have e.g. shape=(50000, 2**30) and chunks=(1, 2**30). If you pass those chunks to open_dataset, it will break the coords along the first dim into one-element dask arrays, which hardly benefits anybody. Things get worse if the dataset is compressed with zlib or similar but only the data vars were chunked at the moment of writing. Am I correct in understanding that the whole coord var will be read from disk 50000 times over? |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
open_mfdataset reads coords from disk multiple times 252541496 |
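(A small self-contained sketch, not from the thread, of the open_dataset(chunks=...) behaviour described in the last comment above. The file name 'chunk_demo.nc', the dimension names and the sizes are invented, and the shapes are scaled down from the 50000 x 2**30 example; the mechanism is the same, since chunks= is applied to coordinates as well as data variables.)

```
# Sketch with made-up names and small sizes; not taken from the issue.
import numpy
import xarray

nr, nc = 100, 1000  # stand-ins for the 50000 x 2**30 shape discussed above
ds = xarray.Dataset(
    coords={'r': numpy.arange(nr), 'c': numpy.arange(nc),
            'nonindex': ('r', numpy.arange(nr))},
    data_vars={'data': (('r', 'c'), numpy.zeros((nr, nc)))})
ds.to_netcdf('chunk_demo.nc')

reopened = xarray.open_dataset('chunk_demo.nc', chunks={'r': 1})
# The data variable gets the requested (1, nc) chunking...
print(reopened['data'].chunks)
# ...but the non-index coordinate along 'r' is also split into nr one-element
# dask chunks, so fully loading it can mean one small read per chunk.
print(reopened['nonindex'].chunks)
```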
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user] ON [issue_comments] ([user]);