issues

4 rows where comments = 10, repo = 13221727 and user = 6213168 sorted by updated_at descending

Columns: id, node_id, number, title, user, state, locked, assignee, milestone, comments, created_at, updated_at (sort column, descending), closed_at, author_association, active_lock_reason, draft, pull_request, body, reactions, performed_via_github_app, state_reason, repo, type
id: 484499801 · node_id: MDExOlB1bGxSZXF1ZXN0MzEwMzYxOTMz · number: 3250 · title: __slots__ · user: crusaderky (6213168) · state: closed · locked: 0 · comments: 10 · created_at: 2019-08-23T12:16:44Z · updated_at: 2019-08-30T12:13:28Z · closed_at: 2019-08-29T17:14:20Z · author_association: MEMBER · draft: 0 · pull_request: pydata/xarray/pulls/3250

What changes:

- Most classes now define __slots__
- Removed the _initialized property
- Enforced checks that all subclasses must also define __slots__. For third-party subclasses, this is for now a DeprecationWarning and should be changed into a hard crash later on (see the sketch below).
- 22% reduction in RAM usage
- 5% performance speedup for a DataArray method that performs a _to_temp_dataset roundtrip

DISCUSS: support for third-party subclasses is very poor at the moment (#1097). Should we skip the deprecation altogether?
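For illustration, here is a minimal sketch of how such an enforcement could look. This is not the actual code added to common.py; the SlottedBase name, the warning text, and the use of __init_subclass__ are assumptions (on Python 3.5, where __init_subclass__ is unavailable, a similar check would have to run per instance, which is presumably the overhead mentioned below):

```python
import warnings


class SlottedBase:
    """Hypothetical base class that asks subclasses to declare __slots__."""

    __slots__ = ()

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # A subclass that does not declare __slots__ silently gains a
        # __dict__, which defeats the memory savings. For third-party
        # code this is only a DeprecationWarning for now; it could be
        # turned into a hard error later.
        if "__slots__" not in cls.__dict__:
            warnings.warn(
                f"{cls.__name__} does not define __slots__; "
                "a __dict__ will be created (deprecated)",
                DeprecationWarning,
                stacklevel=2,
            )
```

A subclass declaring `__slots__ = ('x',)` passes silently, while one that omits __slots__ triggers the warning once, at class definition time.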

Performance benchmark:

```python
import timeit
import psutil
import xarray

# Time DataArray.roll (a method that performs a _to_temp_dataset roundtrip)
a = xarray.DataArray([1, 2], dims=['x'], coords={'x': [10, 20]})
RUNS = 10000
t = timeit.timeit("a.roll(x=1, roll_coords=True)", globals=globals(), number=RUNS)
print("{:.0f} us".format(t / RUNS * 1e6))

# Measure the incremental RSS footprint per DataArray instance
p = psutil.Process()
N = 100000
rss0 = p.memory_info().rss
x = [
    xarray.DataArray([1, 2], dims=['x'], coords={'x': [10, 20]})
    for _ in range(N)
]
rss1 = p.memory_info().rss
print("{:.0f} bytes".format((rss1 - rss0) / N))
```

Output:

| test | env | master | slots |
|:-------------:|:---:|:----------:|----------:|
| DataArray.roll | py35-min | 332 us | 360 us |
| DataArray.roll | py37 | 354 us | 337 us |
| RAM usage of a DataArray | py35-min | 2755 bytes | 2074 bytes |
| RAM usage of a DataArray | py37 | 1970 bytes | 1532 bytes |

The performance degradation on Python 3.5 is caused by the deprecation mechanism - see changes to common.py.

I honestly never realised that xarray objects are measured in kilobytes (vs. 32 bytes of underlying buffers!)

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3250/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: xarray (13221727) · type: pull

id: 467756080 · node_id: MDExOlB1bGxSZXF1ZXN0Mjk3MzQwNTEy · number: 3112 · title: More annotations in Dataset · user: crusaderky (6213168) · state: closed · locked: 0 · comments: 10 · created_at: 2019-07-13T19:06:49Z · updated_at: 2019-08-01T10:41:51Z · closed_at: 2019-07-31T17:48:00Z · author_association: MEMBER · draft: 0 · pull_request: pydata/xarray/pulls/3112
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3112/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: xarray (13221727) · type: pull

id: 316618290 · node_id: MDU6SXNzdWUzMTY2MTgyOTA= · number: 2074 · title: xarray.dot() dask problems · user: crusaderky (6213168) · state: closed · locked: 0 · comments: 10 · created_at: 2018-04-22T22:18:10Z · updated_at: 2018-05-04T21:51:00Z · closed_at: 2018-05-04T21:51:00Z · author_association: MEMBER

xarray.dot() has performance comparable to numpy.einsum. However, when it uses a dask backend, it's much slower than the new dask.array.einsum function (https://github.com/dask/dask/pull/3412). The performance gap widens when the dimension you are reducing over is chunked.

Also, for some reason dot(a<s, t>, b<t>, dims=[t]) and dot(a<s,t>, a<s,t>, dims=[s,t]) do work (very slowly) when s and t are chunked, while dot(a<s, t>, a<s, t>, dims=[t]) crashes complaining it can't operate on a chunked core dim (related discussion: https://github.com/pydata/xarray/issues/1995).

The proposed solution is to simply wait for https://github.com/dask/dask/pull/3412 to reach the next release and then reimplement xarray.dot to use dask.array.einsum. This means that dask users will lose the ability to use xarray.dot if they upgrade their xarray version but not their dask version, but I believe that shouldn't be a big problem for most?
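To make the proposal concrete, here is a rough sketch of dispatching to dask.array.einsum. This is not the actual xarray implementation; the dot_via_einsum name is made up, and it ignores coordinates, broadcasting rules and most of the real xarray.dot signature:

```python
import string

import dask.array as da
import numpy as np
import xarray


def dot_via_einsum(a, b, dims):
    """Hypothetical sketch: route an xarray.dot-style contraction to einsum."""
    # Give each dimension name a single einsum letter, in order of appearance.
    all_dims = list(dict.fromkeys(list(a.dims) + list(b.dims)))
    letters = dict(zip(all_dims, string.ascii_lowercase))
    out_dims = [d for d in all_dims if d not in dims]
    subscripts = "{},{}->{}".format(
        "".join(letters[d] for d in a.dims),
        "".join(letters[d] for d in b.dims),
        "".join(letters[d] for d in out_dims),
    )
    # dask-backed inputs go to dask.array.einsum, which copes with chunked
    # contraction dims; plain numpy inputs keep using numpy.einsum.
    if isinstance(a.data, da.Array) or isinstance(b.data, da.Array):
        data = da.einsum(subscripts, a.data, b.data)
    else:
        data = np.einsum(subscripts, a.data, b.data)
    return xarray.DataArray(data, dims=out_dims)
```

The benchmark below measures the gap such a change is meant to close: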

```
import numpy
import dask.array
import xarray

def bench(tchunk, a_by_a, dims, iis):
    print(f"\nbench({tchunk}, {a_by_a}, {dims}, {iis})")

    a = xarray.DataArray(
        dask.array.random.random((500000, 100), chunks=(50000, tchunk)),
        dims=['s', 't'])
    if a_by_a:
        b = a
    else:
        b = xarray.DataArray(
            dask.array.random.random((100, ), chunks=tchunk),
            dims=['t'])

    print("xarray.dot(numpy backend):")
    %timeit xarray.dot(a.compute(), b.compute(), dims=dims)
    print("numpy.einsum:")
    %timeit numpy.einsum(iis, a, b)
    print("xarray.dot(dask backend):")
    try:
        %timeit xarray.dot(a, b, dims=dims).compute()
    except ValueError as e:
        print(e)
    print("dask.array.einsum:")
    %timeit dask.array.einsum(iis, a, b).compute()

bench(100, False, ['t'], '...i,...i')
bench( 20, False, ['t'], '...i,...i')
bench(100, True, ['t'], '...i,...i')
bench( 20, True, ['t'], '...i,...i')
bench(100, True, ['s', 't'], '...ij,...ij')
bench( 20, True, ['s', 't'], '...ij,...ij')
```

Output:

```
bench(100, False, ['t'], ...i,...i)
xarray.dot(numpy backend):
195 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
numpy.einsum:
205 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
xarray.dot(dask backend):
356 ms ± 44.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum:
244 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(20, False, ['t'], ...i,...i)
xarray.dot(numpy backend):
297 ms ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
254 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
732 ms ± 74.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum:
274 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(100, True, ['t'], ...i,...i)
xarray.dot(numpy backend):
438 ms ± 43.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
415 ms ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
633 ms ± 31.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum:
431 ms ± 17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(20, True, ['t'], ...i,...i)
xarray.dot(numpy backend):
457 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
463 ms ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
dimension 't' on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, rechunk into a single dask array chunk along this dimension, i.e., .rechunk({'t': -1}), but beware that this may significantly increase memory usage.
dask.array.einsum:
485 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(100, True, ['s', 't'], ...ij,...ij)
xarray.dot(numpy backend):
418 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
444 ms ± 43.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
384 ms ± 57.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum:
415 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(20, True, ['s', 't'], ...ij,...ij)
xarray.dot(numpy backend):
489 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
443 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
585 ms ± 64.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum:
455 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2074/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: xarray (13221727) · type: issue

id: 184722754 · node_id: MDU6SXNzdWUxODQ3MjI3NTQ= · number: 1058 · title: shallow copies become deep copies when pickling · user: crusaderky (6213168) · state: closed · locked: 0 · comments: 10 · created_at: 2016-10-23T23:12:03Z · updated_at: 2017-02-05T21:13:41Z · closed_at: 2017-01-17T01:53:18Z · author_association: MEMBER

Whenever xarray performs a shallow copy of any object (DataArray, Dataset, Variable), it creates a view of the underlying numpy arrays. This design fails when the object is pickled.

Whenever a numpy view is pickled, it becomes a regular array:

```python
>>> import pickle
>>> import numpy
>>> a = numpy.arange(2**26)
>>> print(len(pickle.dumps(a)) / 2**20)
256.00015354156494
>>> b = a.view()
>>> print(len(pickle.dumps((a, b))) / 2**20)
512.0001964569092
>>> b.base is a
True
>>> a2, b2 = pickle.loads(pickle.dumps((a, b)))
>>> b2.base is a2
False
```

This has devastating effects in my use case. I start from a dask-backed DataArray with a dimension of 500,000 elements and no coord, so the coord is auto-assigned by xarray as an incremental integer. Then I perform ~3000 transformations and dump the resulting dask-backed array with pickle. However, I have to dump all intermediate steps for audit purposes as well. This means that xarray invokes numpy.arange to create (500k * 4 bytes) ~ 2MB worth of coord, then creates 3000 views of it, which, the moment they are pickled, expand to several GBs as they become 3000 independent copies.

I see a few possible solutions to this:

1. Implement pandas range indexes in xarray. This would be nice as a general thing and would solve my specific problem, but anybody who does not fall into my very specific use case won't benefit from it.
2. Do not auto-generate a coord with numpy.arange() if the user doesn't explicitly ask for it; just leave a None and maybe generate it on the fly when requested. Again, this would solve my specific problem but not other people's.
3. Force the coord to be a dask.array.arange. Actually supporting unconverted dask arrays as coordinates would take a considerable amount of work; they would get converted to numpy several times, and there are other issues. Again, it wouldn't solve the general problem.
4. Fix the issue upstream in numpy. I didn't look into it yet and it's definitely worth investigating, but I found out about it as early as 2012, so I suspect there might be some pretty good reason why it works like that...
5. Whenever xarray performs a shallow copy, take the whole numpy array instead of creating a view.

I implemented (5) as a workaround in my __getstate__ method. Before:

```
%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
2.535497265867889
Wall time: 33.3 s
```

Workaround:

```python
import numpy
import xarray

def get_base(array):
    """Return the base of a same-shape, same-dtype view; otherwise the array itself."""
    if not isinstance(array, numpy.ndarray):
        return array
    elif array.base is None:
        return array
    elif array.base.dtype != array.dtype:
        return array
    elif array.base.shape != array.shape:
        return array
    else:
        return array.base

# De-view every array in the cache before pickling
for v in cache.values():
    if isinstance(v, xarray.DataArray):
        v.data = get_base(v.data)
        for coord in v.coords.values():
            coord.data = get_base(coord.data)
    elif isinstance(v, xarray.Dataset):
        for var in v.variables.values():  # Dataset.variables is a mapping, not a method
            var.data = get_base(var.data)
```

After:

```
%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
0.9733252348378301
Wall time: 21.1 s
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1058/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: xarray (13221727) · type: issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
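
The row selection described at the top of this page (comments = 10, repo = 13221727, user = 6213168, sorted by updated_at descending) can be reproduced against this schema with a short query. The sketch below assumes a local SQLite copy of the database named github.db, which is a hypothetical filename:

```python
import sqlite3

# Hypothetical local copy of the database behind this page.
conn = sqlite3.connect("github.db")
rows = conn.execute(
    """
    SELECT id, number, title, state, comments, updated_at
    FROM issues
    WHERE comments = 10 AND repo = 13221727 AND "user" = 6213168
    ORDER BY updated_at DESC
    """
).fetchall()
for row in rows:
    print(row)
conn.close()
```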