
issue_comments


14 rows where issue = 252541496 sorted by updated_at descending


user 5

  • crusaderky 7
  • shoyer 4
  • rabernat 1
  • jhamman 1
  • fmaussion 1

issue 1

  • open_mfdataset reads coords from disk multiple times · 14

author_association 1

  • MEMBER 14
Columns: id, html_url, issue_url, node_id, user, created_at, updated_at (sorted), author_association, body, reactions, performed_via_github_app, issue
331307451 https://github.com/pydata/xarray/issues/1521#issuecomment-331307451 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMzMTMwNzQ1MQ== crusaderky 6213168 2017-09-21T23:18:10Z 2017-09-21T23:18:10Z MEMBER

@jhamman There's already #1551 open but I need to heavily rethink it to cater for all the various use cases offered by the data_vars and coords parameters of concat().
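A minimal sketch of the data_vars/coords options mentioned above (illustrative datasets, not from the thread):

```
import numpy as np
import xarray as xr

# Two datasets with an identical non-index coord 'c' (illustrative data).
ds0 = xr.Dataset({'data': ('x', np.arange(3.0))},
                 coords={'x': [0, 1, 2], 'c': ('x', np.zeros(3))})
ds1 = xr.Dataset({'data': ('x', np.arange(3.0))},
                 coords={'x': [0, 1, 2], 'c': ('x', np.zeros(3))})

# coords='different' (the default) compares 'c' across datasets to decide
# whether it must be stacked along the new dim; the comparison is what
# forces lazy coords to load.
xr.concat([ds0, ds1], dim='t', coords='different')

# coords='minimal' skips the comparison and keeps 'c' from the first dataset.
xr.concat([ds0, ds1], dim='t', coords='minimal')
```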

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
331306970 https://github.com/pydata/xarray/issues/1521#issuecomment-331306970 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMzMTMwNjk3MA== jhamman 2443309 2017-09-21T23:14:45Z 2017-09-21T23:14:45Z MEMBER

@crusaderky - happy to help with this. Maybe you can get a PR open and then I can provide some ASV benchmarking.
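For reference, an asv benchmark for this case might look roughly like the following sketch (class name, file names, and sizes are hypothetical; a netCDF backend is assumed):

```
import numpy as np
import xarray as xr

class OpenMFDatasetCoords:
    # Hypothetical asv benchmark: time open_mfdataset on files whose
    # non-index coords must be compared across datasets.
    def setup(self):
        data = np.zeros((1, 1000))
        for r in range(5):
            ds = xr.Dataset(
                {'data': (('r', 'c'), data)},
                coords={'r': [r], 'nonindex': ('c', np.zeros(1000))})
            ds.to_netcdf('bench.%02d.nc' % r)

    def time_open_mfdataset(self):
        xr.open_mfdataset('bench.*.nc').close()
```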

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
331303206 https://github.com/pydata/xarray/issues/1521#issuecomment-331303206 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMzMTMwMzIwNg== fmaussion 10050469 2017-09-21T22:50:58Z 2017-09-21T22:50:58Z MEMBER

Thanks @crusaderky for looking into this, I think this is very important.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
331301408 https://github.com/pydata/xarray/issues/1521#issuecomment-331301408 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMzMTMwMTQwOA== crusaderky 6213168 2017-09-21T22:39:55Z 2017-09-21T22:39:55Z MEMBER

Back to banging my head on it. Expect a heavy rewrite of combine.py. Can't give an ETA, but it's going to be a fair number of hours.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
327511317 https://github.com/pydata/xarray/issues/1521#issuecomment-327511317 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNzUxMTMxNw== rabernat 1197350 2017-09-06T14:59:29Z 2017-09-06T14:59:29Z MEMBER

This is closely related to #1385 and my aborted attempted fix in #1413.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
327338750 https://github.com/pydata/xarray/issues/1521#issuecomment-327338750 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNzMzODc1MA== shoyer 1217238 2017-09-06T00:20:49Z 2017-09-06T00:20:49Z MEMBER

Enjoy your holiday!

On Tue, Sep 5, 2017 at 5:01 PM crusaderky <notifications@github.com> wrote:

> P.S. need to put #1522 (https://github.com/pydata/xarray/issues/1522) as a prerequisite in order not to lose my sanity, as this change is very much hitting on the same nails


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
327336207 https://github.com/pydata/xarray/issues/1521#issuecomment-327336207 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNzMzNjIwNw== crusaderky 6213168 2017-09-06T00:01:39Z 2017-09-06T00:01:39Z MEMBER

P.S. need to put #1522 as a prerequisite in order not to lose my sanity, as this change is very much hitting on the same nails

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
327335940 https://github.com/pydata/xarray/issues/1521#issuecomment-327335940 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNzMzNTk0MA== crusaderky 6213168 2017-09-05T23:59:41Z 2017-09-05T23:59:41Z MEMBER

Just realised that you can concat(data_vars='different') and have the exact same problem on data_vars :|

Also, with "different" I realised that you're comparing the variable contents twice: once in _calc_concat_over and again at the end of _dataset_concat. This also slows down concat on a pure-numpy backend.

Need more time to work on this... I'll be on holiday for the next week with no access to my PC; should be able to continue after the 12 Sept.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
326867685 https://github.com/pydata/xarray/issues/1521#issuecomment-326867685 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNjg2NzY4NQ== shoyer 1217238 2017-09-04T05:13:59Z 2017-09-04T05:20:50Z MEMBER

The problem is these lines in combine.py: https://github.com/pydata/xarray/blob/78ca20a6ea1a42eb637ae2ef09189f481cfda9a2/xarray/core/combine.py#L158-L168

We compare coordinates for equality in order to decide whether to ignore redundant coordinates or stack them up. This happens if coords='different', which is the default choice; it was convenient before we supported dask, but is now a source of performance trouble, as you point out.
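A small demonstration of that cost (illustrative, not xarray's internal code): deciding whether two lazy coords are redundant requires computing both of them.

```
import dask.array as da
import xarray as xr

a = xr.Variable('x', da.zeros(5, chunks=5))
b = xr.Variable('x', da.zeros(5, chunks=5))

# equals() pulls both lazy arrays into memory to compare values; this is
# what coords='different' triggers for every non-index coord.
print(a.equals(b))  # True
```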

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
326868217 https://github.com/pydata/xarray/issues/1521#issuecomment-326868217 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNjg2ODIxNw== shoyer 1217238 2017-09-04T05:18:55Z 2017-09-04T05:18:55Z MEMBER

So, to be more precise, I think the problem is that the first variable is computed many times over (once per comparison), inside the differs helper function above.

A very simple fix, slightly more conservative than loading every coordinate into memory, would be to compute just the first variable up front, e.g., changing v = datasets[0].variables[vname] to v = datasets[0].variables[vname].compute(). I am slightly nervous about the potential memory overhead of loading all coordinates into memory.
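A sketch of what that suggestion amounts to (a hypothetical paraphrase, not the actual patch):

```
import dask.array as da
import xarray as xr

datasets = [xr.Dataset(coords={'c': ('x', da.zeros(5, chunks=5))})
            for _ in range(3)]
vname = 'c'

# Compute the reference variable once up front, so each subsequent equals()
# comparison reuses the in-memory copy instead of recomputing it.
v = datasets[0].variables[vname].compute()
differs = any(not ds.variables[vname].equals(v) for ds in datasets[1:])
print(differs)  # False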

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
326754319 https://github.com/pydata/xarray/issues/1521#issuecomment-326754319 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNjc1NDMxOQ== crusaderky 6213168 2017-09-02T16:24:56Z 2017-09-02T16:24:56Z MEMBER

Getting closer. The problem is in xarray.concat, which resolves non-index dask coords TWICE, even though it arguably should not resolve them at all (alignment should be done on index coords only?).

```
import xarray
import numpy
import dask.array

def kernel(label):
    print("Kernel [%s] invoked!" % label)
    return numpy.array([1, 2])

a = dask.array.Array(name='a', dask={('a', 0): (kernel, 'a')}, chunks=((2, ), ), dtype=int)
b = dask.array.Array(name='b', dask={('b', 0): (kernel, 'b')}, chunks=((2, ), ), dtype=int)

ds0 = xarray.Dataset(coords={'x': ('x', [1, 2]), 'y': ('x', a)})
ds1 = xarray.Dataset(coords={'x': ('x', [1, 2]), 'y': ('x', b)})
xarray.concat([ds0, ds1], dim='z')
```

Output:

```
Kernel [a] invoked!
Kernel [b] invoked!
Kernel [b] invoked!
Kernel [a] invoked!

<xarray.Dataset>
Dimensions:  (x: 2)
Coordinates:
  * x        (x) int64 1 2
    y        (x) int64 dask.array<shape=(2,), chunksize=(2,)>
Data variables:
    *empty*
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
326750703 https://github.com/pydata/xarray/issues/1521#issuecomment-326750703 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNjc1MDcwMw== crusaderky 6213168 2017-09-02T15:24:20Z 2017-09-02T15:27:12Z MEMBER

As suspected, the problem is caused specifically by non-index coords:

```
import xarray
import numpy

data = numpy.random.randint(1<<63, size=1000000)

for r in range(50):
    ds = xarray.Dataset(
        coords={'r': [r], 'c': data, 'otherindex': data},
        data_vars={'data': (('r', 'c'), data.reshape(1, data.size))})
    ds.to_netcdf('fast.%02d.nc' % r)
    del ds['otherindex']
    ds.coords['nonindex'] = ('c', data)
    ds.to_netcdf('slow.%02d.nc' % r)

def load_coords(ds):
    for coord in ds.coords.values():
        coord.load()
    return ds

%time xarray.open_mfdataset('fast.*.nc')
%time xarray.open_mfdataset('fast.*.nc', preprocess=load_coords)
%time xarray.open_mfdataset('slow.*.nc')
%time xarray.open_mfdataset('slow.*.nc', preprocess=load_coords)
```

Output:

```
CPU times: user 332 ms, sys: 88 ms, total: 420 ms
Wall time: 420 ms
CPU times: user 348 ms, sys: 84 ms, total: 432 ms
Wall time: 430 ms
CPU times: user 1.13 s, sys: 200 ms, total: 1.33 s
Wall time: 1.07 s
CPU times: user 596 ms, sys: 104 ms, total: 700 ms
Wall time: 697 ms
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
324708622 https://github.com/pydata/xarray/issues/1521#issuecomment-324708622 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNDcwODYyMg== shoyer 1217238 2017-08-24T17:51:42Z 2017-08-24T17:51:42Z MEMBER

> change open_dataset() to always eagerly load the coords to memory, regardless of the chunks parameter. Is there any valid use case where lazy coords are actually desirable?

In principle, coords can have the same shape as data variables. In those cases, you probably want to use the same chunking scheme.
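For example (illustrative, not from the thread), a curvilinear grid carries 2D lat/lon coords as large as the data itself, where keeping coords lazy and chunked like the data is genuinely useful:

```
import dask.array as da
import xarray as xr

# A 2D coord as large as the data itself, chunked the same way (made-up grid).
temp = da.zeros((4, 4), chunks=(2, 2))
lat = da.zeros((4, 4), chunks=(2, 2))
ds = xr.Dataset({'temp': (('y', 'x'), temp)},
                coords={'lat': (('y', 'x'), lat)})
```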

> An additional, more radical observation is that, very frequently, a user knows in advance that all coords are aligned. In this use case, the user could explicitly request xarray to blindly trust this assumption, and thus skip loading the coords not based on concat_dim in all datasets beyond the first.

@rabernat is interested in this use case. See https://github.com/pydata/xarray/issues/1385 and https://github.com/pydata/xarray/pull/1413 for discussion.
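One way users can already approximate that "blind trust" today (a sketch, valid only under the assumption that the coords really are aligned; the file pattern is hypothetical) is to drop non-index coords in a preprocess hook, at the cost of losing them from the result:

```
import xarray as xr

def drop_nonindex_coords(ds):
    # Discard every coord that is not a dimension index, so there is
    # nothing left for open_mfdataset to compare or load.
    return ds.reset_coords(
        [name for name in ds.coords if name not in ds.indexes], drop=True)

# ds = xr.open_mfdataset('files.*.nc', preprocess=drop_nonindex_coords)
```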

> This also leads to another inefficiency of open_dataset(chunks=...), where you may have your data e.g. shape=(50000, 2**30), chunks=(1, 2**30). If you pass the chunks above to open_dataset, it will break down the coords on the first dim into dask arrays of 1 element - which hardly benefits anybody. Things get worse if the dataset is compressed with zlib or whatever, but only the data vars were chunked at the moment of writing. Am I correct in understanding that the whole coord var will be read from disk 50000 times over?

Yes, I think you're correct here as well. This is also an annoying inefficiency, but the API design is a little tricky.
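To make that inefficiency concrete (an illustrative toy shape, not the 50000-row case):

```
import numpy as np
import xarray as xr

ds = xr.Dataset({'data': (('r', 'c'), np.zeros((8, 4)))},
                coords={'nonindex': ('r', np.zeros(8))})

# Chunking the data along 'r' also shreds the coord into 1-element chunks,
# so reading it back means one tiny read per chunk.
chunked = ds.chunk({'r': 1})
print(chunked['nonindex'].chunks)  # ((1, 1, 1, 1, 1, 1, 1, 1),)
```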

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496
324586771 https://github.com/pydata/xarray/issues/1521#issuecomment-324586771 https://api.github.com/repos/pydata/xarray/issues/1521 MDEyOklzc3VlQ29tbWVudDMyNDU4Njc3MQ== crusaderky 6213168 2017-08-24T09:41:22Z 2017-08-24T09:42:16Z MEMBER

change open_dataset() to always eagerly load the coords to memory, regardless of the chunks parameter. Is there any valid use case where lazy coords are actually desirable?

This also leads to another inefficiency of open_dataset(chunks=...), where you may have your data e.g. shape=(50000, 2**30), chunks=(1, 2**30). If you pass the chunks above to open_dataset, it will break down the coords on the first dim into dask arrays of 1 element - which hardly benefits anybody. Things get worse if the dataset is compressed with zlib or whatever, but only the data vars were chunked at the moment of writing. Am I correct in understanding that the whole coord var will be read from disk 50000 times over?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset reads coords from disk multiple times 252541496


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);