id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
316618290,MDU6SXNzdWUzMTY2MTgyOTA=,2074,xarray.dot() dask problems,6213168,closed,0,,,10,2018-04-22T22:18:10Z,2018-05-04T21:51:00Z,2018-05-04T21:51:00Z,MEMBER,,,,"xarray.dot() has performance comparable to numpy.einsum. However, when it uses a dask backend, it's _much_ slower than the new dask.array.einsum function (https://github.com/dask/dask/pull/3412). The performance gap widens when the dimension being reduced over is chunked.

Also, for some reason ``dot(a, b, dims=[t])`` and ``dot(a, a, dims=[s,t])`` do work (very slowly) when ``s`` and ``t`` are chunked, while ``dot(a, a, dims=[t])`` crashes, complaining that it can't operate on a chunked core dim (related discussion: https://github.com/pydata/xarray/issues/1995).

The proposed solution is to simply wait for https://github.com/dask/dask/pull/3412 to reach the next dask release and then reimplement xarray.dot on top of dask.array.einsum. This means that dask users would lose the ability to use xarray.dot if they upgrade their xarray version but not their dask version; I believe that shouldn't be a big problem for most?
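For illustration, here is a minimal sketch of what that reimplementation could look like; this is not xarray's actual code, the helper name ``dot_via_einsum`` is invented, coordinate handling is omitted, and fewer than 26 distinct dimensions are assumed. It builds the einsum subscripts from the DataArray dimensions and lets dask.array.einsum contract over chunked dimensions directly, instead of going through ``apply_ufunc(dask='parallelized')``:

```
import dask.array
import xarray


def dot_via_einsum(a, b, dims):
    # Assign one subscript letter per dimension, in order of first appearance.
    all_dims = list(dict.fromkeys(list(a.dims) + list(b.dims)))
    letters = {d: chr(ord('a') + i) for i, d in enumerate(all_dims)}
    out_dims = [d for d in all_dims if d not in dims]
    subscripts = '{},{}->{}'.format(
        ''.join(letters[d] for d in a.dims),
        ''.join(letters[d] for d in b.dims),
        ''.join(letters[d] for d in out_dims))
    # dask.array.einsum accepts chunked (and plain numpy) operands.
    data = dask.array.einsum(subscripts, a.data, b.data)
    return xarray.DataArray(data, dims=out_dims)
```

E.g. ``dot_via_einsum(a, a, dims=['t']).compute()`` with the chunked ``a`` from the benchmark below.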
Benchmark:

```
import numpy
import dask.array
import xarray


def bench(tchunk, a_by_a, dims, iis):
    print(f""\nbench({tchunk}, {a_by_a}, {dims}, {iis})"")
    a = xarray.DataArray(
        dask.array.random.random((500000, 100), chunks=(50000, tchunk)),
        dims=['s', 't'])
    if a_by_a:
        b = a
    else:
        b = xarray.DataArray(
            dask.array.random.random((100, ), chunks=tchunk),
            dims=['t'])

    print(""xarray.dot(numpy backend):"")
    %timeit xarray.dot(a.compute(), b.compute(), dims=dims)
    print(""numpy.einsum:"")
    %timeit numpy.einsum(iis, a, b)
    print(""xarray.dot(dask backend):"")
    try:
        %timeit xarray.dot(a, b, dims=dims).compute()
    except ValueError as e:
        print(e)
    print(""dask.array.einsum:"")
    %timeit dask.array.einsum(iis, a, b).compute()


bench(100, False, ['t'], '...i,...i')
bench( 20, False, ['t'], '...i,...i')
bench(100, True, ['t'], '...i,...i')
bench( 20, True, ['t'], '...i,...i')
bench(100, True, ['s', 't'], '...ij,...ij')
bench( 20, True, ['s', 't'], '...ij,...ij')
```

Output:

```
bench(100, False, ['t'], ...i,...i)
xarray.dot(numpy backend): 195 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
numpy.einsum: 205 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
xarray.dot(dask backend): 356 ms ± 44.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum: 244 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(20, False, ['t'], ...i,...i)
xarray.dot(numpy backend): 297 ms ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum: 254 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend): 732 ms ± 74.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum: 274 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(100, True, ['t'], ...i,...i)
xarray.dot(numpy backend): 438 ms ± 43.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum: 415 ms ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend): 633 ms ± 31.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum: 431 ms ± 17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(20, True, ['t'], ...i,...i)
xarray.dot(numpy backend): 457 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum: 463 ms ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend): dimension 't' on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, rechunk into a single dask array chunk along this dimension, i.e., ``.rechunk({'t': -1})``, but beware that this may significantly increase memory usage.
dask.array.einsum: 485 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(100, True, ['s', 't'], ...ij,...ij)
xarray.dot(numpy backend): 418 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum: 444 ms ± 43.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend): 384 ms ± 57.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum: 415 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(20, True, ['s', 't'], ...ij,...ij)
xarray.dot(numpy backend): 489 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum: 443 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend): 585 ms ± 64.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum: 455 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
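Until then, the stopgap for the failing case appears to be exactly what the error message above suggests: collapse the contracted dimension into a single chunk before calling xarray.dot, at the cost of larger chunks in memory. A sketch using the ``a`` from the benchmark above (``t`` has 100 elements, so a chunk size of 100 means a single chunk; the name ``a1`` is just for the example):

```
# Illustrative stopgap, not a fix: rechunk 't' into one chunk so that
# apply_ufunc accepts it as a core dimension.
a1 = a.chunk({'t': 100})
xarray.dot(a1, a1, dims=['t']).compute()
```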
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2074/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
184722754,MDU6SXNzdWUxODQ3MjI3NTQ=,1058,shallow copies become deep copies when pickling,6213168,closed,0,,,10,2016-10-23T23:12:03Z,2017-02-05T21:13:41Z,2017-01-17T01:53:18Z,MEMBER,,,,"Whenever xarray performs a shallow copy of any object (DataArray, Dataset, Variable), it creates a view of the underlying numpy arrays. This design breaks down when the object is pickled: whenever a numpy view is pickled, it is serialized as a full, independent array:

```
>>> import numpy
>>> import pickle
>>> a = numpy.arange(2**26)
>>> print(len(pickle.dumps(a)) / 2**20)
256.00015354156494
>>> b = a.view()
>>> print(len(pickle.dumps((a, b))) / 2**20)
512.0001964569092
>>> b.base is a
True
>>> a2, b2 = pickle.loads(pickle.dumps((a, b)))
>>> b2.base is a2
False
```

This has devastating effects in my use case. I start from a dask-backed DataArray with a dimension of 500,000 elements and no coord, so the coord is auto-assigned by xarray as an incremental integer. Then I perform ~3000 transformations and dump the resulting dask-backed array with pickle. However, I have to dump all intermediate steps for audit purposes as well. This means that xarray invokes numpy.arange to create (500k \* 4 bytes) ~ 2MB worth of coord, then creates 3000 views of it, which the moment they're pickled expand to several GB as they become 3000 independent copies.

I see a few possible solutions to this:

1. Implement pandas range indexes in xarray. This would be nice as a general feature and would solve my specific problem, but anybody who does not fall into my very specific use case won't benefit from it.
2. Do not auto-generate a coord with numpy.arange() if the user doesn't explicitly ask for it; just leave it as None and maybe generate it on the fly when requested. Again, this would solve my specific problem but not other people's.
3. Force the coord to be a dask.array.arange. Actually supporting unconverted dask arrays as coordinates would take a considerable amount of work (they would get converted to numpy several times, among other issues), and again it wouldn't solve the general problem.
4. Fix the issue upstream in numpy. I haven't looked into it yet and it's definitely worth investigating, but I found reports of it [as early as 2012](https://stackoverflow.com/questions/13746601/preserving-numpy-view-when-pickling), so I suspect there might be some pretty good reason why it works like that...
5. Whenever xarray performs a shallow copy, reuse the original numpy array instead of creating a view of it.

I implemented (5) as a workaround in my ``__getstate__`` method.

Before:

```
%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)

2.535497265867889
Wall time: 33.3 s
```

Workaround:

```
def get_base(array):
    # Replace a numpy view with its base array, but only when the view
    # covers the whole base (same dtype and shape); otherwise keep it as is.
    if not isinstance(array, numpy.ndarray):
        return array
    elif array.base is None:
        return array
    elif array.base.dtype != array.dtype:
        return array
    elif array.base.shape != array.shape:
        return array
    else:
        return array.base


for v in cache.values():
    if isinstance(v, xarray.DataArray):
        v.data = get_base(v.data)
        for coord in v.coords.values():
            coord.data = get_base(coord.data)
    elif isinstance(v, xarray.Dataset):
        for var in v.variables.values():  # .variables is a mapping, not a method
            var.data = get_base(var.data)
```

After:

```
%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)

0.9733252348378301
Wall time: 21.1 s
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1058/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
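A minimal, self-contained illustration of why workaround (5) helps, relying only on standard numpy and pickle behaviour: pickle memoizes repeated references to the same ndarray object and stores its buffer once, whereas every distinct view is serialized as a full independent copy.

```
import pickle

import numpy as np

base = np.arange(2 ** 16)                   # ~0.5 MB assuming int64
views = [base.view() for _ in range(100)]   # distinct view objects, shared buffer
shared = [base for _ in range(100)]         # repeated references to one object

print(len(pickle.dumps(views)) / 2 ** 20)   # ~50 MB: each view pickled in full
print(len(pickle.dumps(shared)) / 2 ** 20)  # ~0.5 MB: memoized, stored once
```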