issues

4 rows where comments = 10, repo = 13221727 and user = 6213168 sorted by updated_at descending

Columns: id, node_id, number, title, user, state, locked, assignee, milestone, comments, created_at, updated_at (sort column, descending), closed_at, author_association, active_lock_reason, draft, pull_request, body, reactions, performed_via_github_app, state_reason, repo, type
id: 484499801 · node_id: MDExOlB1bGxSZXF1ZXN0MzEwMzYxOTMz · number: 3250 · title: __slots__ · user: crusaderky (6213168) · state: closed · locked: 0 · comments: 10 · created_at: 2019-08-23T12:16:44Z · updated_at: 2019-08-30T12:13:28Z · closed_at: 2019-08-29T17:14:20Z · author_association: MEMBER · draft: 0 · pull_request: pydata/xarray/pulls/3250

What changes:

- Most classes now define __slots__
- Removed the _initialized property
- Enforced checks that all subclasses must also define __slots__. For third-party subclasses, this is for now a DeprecationWarning and should be changed into a hard crash later on (see the sketch below).
- 22% reduction in RAM usage
- 5% performance speedup for a DataArray method that performs a _to_temp_dataset roundtrip

DISCUSS: support for third-party subclasses is very poor at the moment (#1097). Should we skip the deprecation altogether?
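For illustration, here is a minimal sketch of how such an enforcement could look. This is not the actual code added to common.py; the SlottedBase name, the warning text, and the use of __init_subclass__ are assumptions (on Python 3.5, where __init_subclass__ is unavailable, a similar check would have to run per instance, which is presumably the overhead mentioned below):

```python
import warnings


class SlottedBase:
    """Hypothetical base class that asks subclasses to declare __slots__."""

    __slots__ = ()

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # A subclass that does not declare __slots__ silently gains a
        # __dict__, which defeats the memory savings. For third-party
        # code this is only a DeprecationWarning for now; it could be
        # turned into a hard error later.
        if "__slots__" not in cls.__dict__:
            warnings.warn(
                f"{cls.__name__} does not define __slots__; "
                "a __dict__ will be created (deprecated)",
                DeprecationWarning,
                stacklevel=2,
            )
```

A subclass declaring `__slots__ = ('x',)` passes silently, while one that omits __slots__ triggers the warning once, at class definition time.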

Performance benchmark:

```python
import timeit
import psutil
import xarray

# Time DataArray.roll (a method that performs a _to_temp_dataset roundtrip)
a = xarray.DataArray([1, 2], dims=['x'], coords={'x': [10, 20]})
RUNS = 10000
t = timeit.timeit("a.roll(x=1, roll_coords=True)", globals=globals(), number=RUNS)
print("{:.0f} us".format(t / RUNS * 1e6))

# Measure the incremental RSS footprint per DataArray instance
p = psutil.Process()
N = 100000
rss0 = p.memory_info().rss
x = [
    xarray.DataArray([1, 2], dims=['x'], coords={'x': [10, 20]})
    for _ in range(N)
]
rss1 = p.memory_info().rss
print("{:.0f} bytes".format((rss1 - rss0) / N))
```

Output:

| test | env | master | slots |
|:-------------:|:---:|:----------:|----------:|
| DataArray.roll | py35-min | 332 us | 360 us |
| DataArray.roll | py37 | 354 us | 337 us |
| RAM usage of a DataArray | py35-min | 2755 bytes | 2074 bytes |
| RAM usage of a DataArray | py37 | 1970 bytes | 1532 bytes |

The performance degradation on Python 3.5 is caused by the deprecation mechanism - see changes to common.py.

I honestly never realised that xarray objects are measured in kilobytes (vs. 32 bytes of underlying buffers!)

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3250/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: xarray (13221727) · type: pull

id: 467756080 · node_id: MDExOlB1bGxSZXF1ZXN0Mjk3MzQwNTEy · number: 3112 · title: More annotations in Dataset · user: crusaderky (6213168) · state: closed · locked: 0 · comments: 10 · created_at: 2019-07-13T19:06:49Z · updated_at: 2019-08-01T10:41:51Z · closed_at: 2019-07-31T17:48:00Z · author_association: MEMBER · draft: 0 · pull_request: pydata/xarray/pulls/3112
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3112/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: xarray (13221727) · type: pull

id: 316618290 · node_id: MDU6SXNzdWUzMTY2MTgyOTA= · number: 2074 · title: xarray.dot() dask problems · user: crusaderky (6213168) · state: closed · locked: 0 · comments: 10 · created_at: 2018-04-22T22:18:10Z · updated_at: 2018-05-04T21:51:00Z · closed_at: 2018-05-04T21:51:00Z · author_association: MEMBER

xarray.dot() has performance comparable to numpy.einsum. However, when it uses a dask backend, it's much slower than the new dask.array.einsum function (https://github.com/dask/dask/pull/3412). The performance gap widens when the dimension you are reducing over is chunked.

Also, for some reason dot(a<s, t>, b<t>, dims=[t]) and dot(a<s,t>, a<s,t>, dims=[s,t]) do work (very slowly) when s and t are chunked, while dot(a<s, t>, a<s, t>, dims=[t]) crashes complaining it can't operate on a chunked core dim (related discussion: https://github.com/pydata/xarray/issues/1995).

The proposed solution is to simply wait for https://github.com/dask/dask/pull/3412 to reach the next release and then reimplement xarray.dot to use dask.array.einsum. This means that dask users will lose the ability to use xarray.dot if they upgrade their xarray version but not their dask version, but I believe that shouldn't be a big problem for most?
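To make the proposal concrete, here is a rough sketch of dispatching to dask.array.einsum. This is not the actual xarray implementation; the dot_via_einsum name is made up, and it ignores coordinates, broadcasting rules and most of the real xarray.dot signature:

```python
import string

import dask.array as da
import numpy as np
import xarray


def dot_via_einsum(a, b, dims):
    """Hypothetical sketch: route an xarray.dot-style contraction to einsum."""
    # Give each dimension name a single einsum letter, in order of appearance.
    all_dims = list(dict.fromkeys(list(a.dims) + list(b.dims)))
    letters = dict(zip(all_dims, string.ascii_lowercase))
    out_dims = [d for d in all_dims if d not in dims]
    subscripts = "{},{}->{}".format(
        "".join(letters[d] for d in a.dims),
        "".join(letters[d] for d in b.dims),
        "".join(letters[d] for d in out_dims),
    )
    # dask-backed inputs go to dask.array.einsum, which copes with chunked
    # contraction dims; plain numpy inputs keep using numpy.einsum.
    if isinstance(a.data, da.Array) or isinstance(b.data, da.Array):
        data = da.einsum(subscripts, a.data, b.data)
    else:
        data = np.einsum(subscripts, a.data, b.data)
    return xarray.DataArray(data, dims=out_dims)
```

The benchmark below measures the gap such a change is meant to close: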

```
import numpy
import dask.array
import xarray

def bench(tchunk, a_by_a, dims, iis):
    print(f"\nbench({tchunk}, {a_by_a}, {dims}, {iis})")

    a = xarray.DataArray(
        dask.array.random.random((500000, 100), chunks=(50000, tchunk)),
        dims=['s', 't'])
    if a_by_a:
        b = a
    else:
        b = xarray.DataArray(
            dask.array.random.random((100, ), chunks=tchunk),
            dims=['t'])

    print("xarray.dot(numpy backend):")
    %timeit xarray.dot(a.compute(), b.compute(), dims=dims)
    print("numpy.einsum:")
    %timeit numpy.einsum(iis, a, b)
    print("xarray.dot(dask backend):")
    try:
        %timeit xarray.dot(a, b, dims=dims).compute()
    except ValueError as e:
        print(e)
    print("dask.array.einsum:")
    %timeit dask.array.einsum(iis, a, b).compute()

bench(100, False, ['t'], '...i,...i')
bench( 20, False, ['t'], '...i,...i')
bench(100, True, ['t'], '...i,...i')
bench( 20, True, ['t'], '...i,...i')
bench(100, True, ['s', 't'], '...ij,...ij')
bench( 20, True, ['s', 't'], '...ij,...ij')
```

Output:

```
bench(100, False, ['t'], ...i,...i)
xarray.dot(numpy backend):
195 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
numpy.einsum:
205 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
xarray.dot(dask backend):
356 ms ± 44.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum:
244 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(20, False, ['t'], ...i,...i)
xarray.dot(numpy backend):
297 ms ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
254 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
732 ms ± 74.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum:
274 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(100, True, ['t'], ...i,...i)
xarray.dot(numpy backend):
438 ms ± 43.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
415 ms ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
633 ms ± 31.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum:
431 ms ± 17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(20, True, ['t'], ...i,...i)
xarray.dot(numpy backend):
457 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
463 ms ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
dimension 't' on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, rechunk into a single dask array chunk along this dimension, i.e., .rechunk({'t': -1}), but beware that this may significantly increase memory usage.
dask.array.einsum:
485 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(100, True, ['s', 't'], ...ij,...ij)
xarray.dot(numpy backend):
418 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
444 ms ± 43.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
384 ms ± 57.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum:
415 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(20, True, ['s', 't'], ...ij,...ij)
xarray.dot(numpy backend):
489 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
443 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
585 ms ± 64.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum:
455 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2074/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: xarray (13221727) · type: issue

id: 184722754 · node_id: MDU6SXNzdWUxODQ3MjI3NTQ= · number: 1058 · title: shallow copies become deep copies when pickling · user: crusaderky (6213168) · state: closed · locked: 0 · comments: 10 · created_at: 2016-10-23T23:12:03Z · updated_at: 2017-02-05T21:13:41Z · closed_at: 2017-01-17T01:53:18Z · author_association: MEMBER

Whenever xarray performs a shallow copy of any object (DataArray, Dataset, Variable), it creates a view of the underlying numpy arrays. This design fails when the object is pickled.

Whenever a numpy view is pickled, it becomes a regular array:

```python
>>> import pickle
>>> import numpy
>>> a = numpy.arange(2**26)
>>> print(len(pickle.dumps(a)) / 2**20)
256.00015354156494
>>> b = a.view()
>>> print(len(pickle.dumps((a, b))) / 2**20)
512.0001964569092
>>> b.base is a
True
>>> a2, b2 = pickle.loads(pickle.dumps((a, b)))
>>> b2.base is a2
False
```

This has devastating effects in my use case. I start from a dask-backed DataArray with a dimension of 500,000 elements and no coord, so the coord is auto-assigned by xarray as an incremental integer. Then I perform ~3000 transformations and dump the resulting dask-backed array with pickle. However, I have to dump all intermediate steps for audit purposes as well. This means that xarray invokes numpy.arange to create (500k * 4 bytes) ~ 2MB worth of coord, then creates 3000 views of it, which, the moment they are pickled, expand to several GBs as they become 3000 independent copies.

I see a few possible solutions to this:

1. Implement pandas range indexes in xarray. This would be nice as a general thing and would solve my specific problem, but anybody who does not fall into my very specific use case won't benefit from it.
2. Do not auto-generate a coord with numpy.arange() if the user doesn't explicitly ask for it; just leave a None and maybe generate it on the fly when requested. Again, this would solve my specific problem but not other people's.
3. Force the coord to be a dask.array.arange. Actually supporting unconverted dask arrays as coordinates would take a considerable amount of work; they would get converted to numpy several times, and there are other issues. Again, it wouldn't solve the general problem.
4. Fix the issue upstream in numpy. I didn't look into it yet and it's definitely worth investigating, but I found out about it as early as 2012, so I suspect there might be some pretty good reason why it works like that...
5. Whenever xarray performs a shallow copy, take the whole numpy array instead of creating a view.

I implemented (5) as a workaround in my __getstate__ method. Before:

```
%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
2.535497265867889
Wall time: 33.3 s
```

Workaround:

```python
import numpy
import xarray

def get_base(array):
    """Return the base of a same-shape, same-dtype view; otherwise the array itself."""
    if not isinstance(array, numpy.ndarray):
        return array
    elif array.base is None:
        return array
    elif array.base.dtype != array.dtype:
        return array
    elif array.base.shape != array.shape:
        return array
    else:
        return array.base

# De-view every array in the cache before pickling
for v in cache.values():
    if isinstance(v, xarray.DataArray):
        v.data = get_base(v.data)
        for coord in v.coords.values():
            coord.data = get_base(coord.data)
    elif isinstance(v, xarray.Dataset):
        for var in v.variables.values():  # Dataset.variables is a mapping, not a method
            var.data = get_base(var.data)
```

After:

```
%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
0.9733252348378301
Wall time: 21.1 s
```

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1058/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: xarray (13221727) · type: issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
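
The row selection described at the top of this page (comments = 10, repo = 13221727, user = 6213168, sorted by updated_at descending) can be reproduced against this schema with a short query. The sketch below assumes a local SQLite copy of the database named github.db, which is a hypothetical filename:

```python
import sqlite3

# Hypothetical local copy of the database behind this page.
conn = sqlite3.connect("github.db")
rows = conn.execute(
    """
    SELECT id, number, title, state, comments, updated_at
    FROM issues
    WHERE comments = 10 AND repo = 13221727 AND "user" = 6213168
    ORDER BY updated_at DESC
    """
).fetchall()
for row in rows:
    print(row)
conn.close()
```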