html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/2074#issuecomment-385116430,https://api.github.com/repos/pydata/xarray/issues/2074,385116430,MDEyOklzc3VlQ29tbWVudDM4NTExNjQzMA==,6213168,2018-04-27T23:13:20Z,2018-04-27T23:13:20Z,MEMBER,Done the work - but we'll need to wait for dask 0.17.3 to integrate it,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,316618290
https://github.com/pydata/xarray/issues/2074#issuecomment-383817119,https://api.github.com/repos/pydata/xarray/issues/2074,383817119,MDEyOklzc3VlQ29tbWVudDM4MzgxNzExOQ==,306380,2018-04-24T06:22:39Z,2018-04-24T06:22:39Z,MEMBER,"When doing benchmarks with things that might call BLAS operations in
multiple threads I recommend setting the OMP_NUM_THREADS environment
variable to 1.  This will avoid oversubscription.

On Mon, Apr 23, 2018 at 7:32 PM, Keisuke Fujii <notifications@github.com>
wrote:

> @crusaderky <https://github.com/crusaderky> , Thanks for the detailed
> benchmarking.
> Further note:
>
>    - xr.dot uses tensordot if possible, as when I implemented dask did
>    not have einsum.
>    In the other cases, we use dask.atop with np.einsum.
>
> In your example, bench(100, False, ['t'], '...i,...i') uses dask.tensordot
> ,
> bench(100, True, ['t'], '...i,...i') uses np.einsum.
>
> bench(100, True, [], ...i,...i->...i) also uses np.einsum.
> But I have no idea yet why dot(a, b, dims=[]) is faster than a * b.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/pydata/xarray/issues/2074#issuecomment-383754980>, or mute
> the thread
> <https://github.com/notifications/unsubscribe-auth/AASszD_CL-zC6QgDunKQVaIGCiQA7u5Jks5trmSUgaJpZM4TfDSk>
> .
>
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,316618290
https://github.com/pydata/xarray/issues/2074#issuecomment-383754980,https://api.github.com/repos/pydata/xarray/issues/2074,383754980,MDEyOklzc3VlQ29tbWVudDM4Mzc1NDk4MA==,6815844,2018-04-23T23:32:33Z,2018-04-23T23:32:33Z,MEMBER,"@crusaderky , Thanks for the detailed benchmarking.
Further note:

+ `xr.dot` uses `tensordot` if possible, as when I implemented `dask` did not have `einsum`.
In the other cases, we use `dask.atop` with `np.einsum`.

In your example, `bench(100, False, ['t'], '...i,...i')` uses `dask.tensordot`, 
`bench(100, True, ['t'], '...i,...i')` uses `np.einsum`.

`bench(100, True, [], ...i,...i->...i)` also uses `np.einsum`.
But I have no idea yet why `dot(a, b, dims=[])` is faster than `a * b`.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,316618290
https://github.com/pydata/xarray/issues/2074#issuecomment-383754250,https://api.github.com/repos/pydata/xarray/issues/2074,383754250,MDEyOklzc3VlQ29tbWVudDM4Mzc1NDI1MA==,1217238,2018-04-23T23:28:27Z,2018-04-23T23:28:27Z,MEMBER,+1 for using dask.array.einsum in xarray.dot.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,316618290
https://github.com/pydata/xarray/issues/2074#issuecomment-383724765,https://api.github.com/repos/pydata/xarray/issues/2074,383724765,MDEyOklzc3VlQ29tbWVudDM4MzcyNDc2NQ==,6213168,2018-04-23T21:12:04Z,2018-04-23T21:12:14Z,MEMBER,"> What are the arrays used as input for this case?

See blob in the opening post

> dot reduces one dimension from each input

``xarray.dot(a, b, dims=[])`` is functionally identical to  ``a * b`` to my understanding, but faster in some edge cases - which I can't make any sense of.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,316618290
https://github.com/pydata/xarray/issues/2074#issuecomment-383723159,https://api.github.com/repos/pydata/xarray/issues/2074,383723159,MDEyOklzc3VlQ29tbWVudDM4MzcyMzE1OQ==,3019665,2018-04-23T21:06:42Z,2018-04-23T21:06:42Z,NONE,"> from what I understand `da.dot` implements... a limited special case of `da.einsum`?

Basically `dot` is an inner product. Certainly inner products can be formulated using Einstein notation (i.e. calling with `einsum`).

The question is whether the performance keeps up with that formulation. Currently it sounds like chunking causes some problems right now IIUC. However things like `dot` and `tensordot` dispatch through optimized BLAS routines. In theory `einsum` should do the same ( https://github.com/numpy/numpy/pull/9425 ), but the experimental data still shows a few warts. For example, `matmul` is implemented with `einsum`, but is slower than `dot`. ( https://github.com/numpy/numpy/issues/7569 ) ( https://github.com/numpy/numpy/issues/8957 ) Pure `einsum` implementations seem to perform similarly.

> I ran a few more benchmarks...

What are the arrays used as input for this case?

> ...apparently `xarray.dot` on a dask backend is situationally faster than all other implementations when you are not reducing on any dimensions...

Having a little trouble following this. `dot` reduces one dimension from each input. Excepting if one of the inputs is 0-D (i.e. a scalar), then it is just multiplying a single scalar through an array. Is that what you are referring?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,316618290
https://github.com/pydata/xarray/issues/2074#issuecomment-383711323,https://api.github.com/repos/pydata/xarray/issues/2074,383711323,MDEyOklzc3VlQ29tbWVudDM4MzcxMTMyMw==,6213168,2018-04-23T20:26:59Z,2018-04-23T20:26:59Z,MEMBER,"@jakirkham from what I understand ``da.dot`` implements... a limited special case of ``da.einsum``?

Ok this is funny. I ran a few more benchmarks, and apparently ``xarray.dot`` on a dask backend is situationally faster than all other implementations when you are not reducing on any dimensions - which I understand is really the same as (a * b), except that *faster* than (a * b)?!?

```
def bench(...):
   ...
    if not dims:
        print(""a * b (numpy backend):"")
        %timeit a.compute() * b.compute()
        print(""a * b (dask backend):"")
        %timeit (a * b).compute()

bench(100, False, [], '...i,...i->...i')
bench( 20, False, [], '...i,...i->...i')
bench(100, True,  [], '...i,...i->...i')
bench( 20, True,  [], '...i,...i->...i')
```
Output:

```

bench(100, False, [], ...i,...i->...i)
xarray.dot(numpy backend):
291 ms ± 5.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
296 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
dimension 's' on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, rechunk into a single dask array chunk along this dimension, i.e., ``.rechunk({'s': -1})``, but beware that this may significantly increase memory usage.
dask.array.einsum:
296 ms ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
a * b (numpy backend)
279 ms ± 9.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
a * b (dask backend)
241 ms ± 8.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

bench(20, False, [], ...i,...i->...i)
xarray.dot(numpy backend):
345 ms ± 6.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
342 ms ± 4.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
dimension 's' on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, rechunk into a single dask array chunk along this dimension, i.e., ``.rechunk({'s': -1})``, but beware that this may significantly increase memory usage.
dask.array.einsum:
347 ms ± 6.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
a * b (numpy backend)
319 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
a * b (dask backend)
247 ms ± 5.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(100, True, [], ...i,...i->...i)
xarray.dot(numpy backend):
477 ms ± 8.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
514 ms ± 35.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
241 ms ± 8.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum:
497 ms ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
a * b (numpy backend)
439 ms ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
a * b (dask backend)
517 ms ± 41.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(20, True, [], ...i,...i->...i)
xarray.dot(numpy backend):
572 ms ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy.einsum:
563 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
xarray.dot(dask backend):
268 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dask.array.einsum:
563 ms ± 5.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
a * b (numpy backend)
501 ms ± 5.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
a * b (dask backend)
922 ms ± 93.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```


This particular bit is shocking and I can't wrap my head around it?!?
```
bench(100, True, [], ...i,...i->...i)
xarray.dot(dask backend):
241 ms ± 8.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
a * b (dask backend)
517 ms ± 41.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

bench(20, True, [], ...i,...i->...i)
xarray.dot(dask backend):
268 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
a * b (dask backend)
922 ms ± 93.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,316618290
https://github.com/pydata/xarray/issues/2074#issuecomment-383651390,https://api.github.com/repos/pydata/xarray/issues/2074,383651390,MDEyOklzc3VlQ29tbWVudDM4MzY1MTM5MA==,306380,2018-04-23T17:12:04Z,2018-04-23T17:12:04Z,MEMBER,See also https://github.com/dask/dask/issues/2225,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,316618290
https://github.com/pydata/xarray/issues/2074#issuecomment-383637379,https://api.github.com/repos/pydata/xarray/issues/2074,383637379,MDEyOklzc3VlQ29tbWVudDM4MzYzNzM3OQ==,3019665,2018-04-23T16:26:51Z,2018-04-23T16:26:51Z,NONE,"Might be worth revisiting how `da.dot` is implemented as well. That would be the least amount of rewriting for you and would generally be nice for Dask users. If you have not already, @crusaderky, it would be nice to raise an issue over at Dask with a straight Dask benchmark comparing Dask Array's `dot` and `einsum`.

cc @mrocklin","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,316618290
https://github.com/pydata/xarray/issues/2074#issuecomment-383419435,https://api.github.com/repos/pydata/xarray/issues/2074,383419435,MDEyOklzc3VlQ29tbWVudDM4MzQxOTQzNQ==,6815844,2018-04-22T23:05:39Z,2018-04-22T23:06:05Z,MEMBER,"`xr.dot` was implemented before dask/dask#3412 was merged, and thus it is not very efficient for dask now.

> The proposed solution is to simply wait for dask/dask#3412 to reach the next release and then reimplement xarray.dot to use dask.array.einsum. 

Agreed.
I think the reimplementation would be easy,
https://github.com/pydata/xarray/blob/99b457ce5859bd949cfea4671db5150c7297843a/xarray/core/computation.py#L1039-L1043
`dask='parallelrized'` -> `dask='allow'`
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,316618290