issue_comments
15 comments on issue 950882492 (Polyfit performance on large datasets - Suboptimal dask task graph), sorted by updated_at descending.
dcherian (MEMBER) · 2021-11-03T21:39:08Z (updated 2023-01-24T20:19:19Z)
https://github.com/pydata/xarray/issues/5629#issuecomment-960133879

In reading through #5933, I realized there's an alternate solution that preserves the desirable feature of fitting along chunked dimensions (at least for skipna=False).

Since we are only ever fitting along one dimension, we can apply the reshape-other-dimensions operation blockwise. As long as we reshape back in a consistent manner (again blockwise), this is identical to our current stacking approach but substantially more efficient. In particular, it avoids merging chunks.

[Code listing truncated in the export: an updated `polyfit(da, dim, skipna=False)` built on the `reshape_block` helper shown below.]

Here's the graph with 1 chunk along time (like @jbusecke's example, and skipna=True): [task-graph image]

And when the time dimension is chunked (and skipna=False): [task-graph image]

Older version:
```python
from itertools import product

import numpy as np
from dask.array import Array
from dask.array.utils import assert_eq
from dask.base import tokenize
from dask.highlevelgraph import HighLevelGraph


def reshape_block(arr, output_chunks):
    """
    Blockwise reshape array to match output_chunks.

    This method applies a reshape operation blockwise, so it only makes sense if
    you do not care about the order of elements (along reshaped axes) after
    reshaping. The order of elements will depend on the chunking of ``arr``.

    It is substantially more efficient than a usual reshape, which makes sure
    array elements appear in a numpy-like order regardless of chunking.

    Based on https://github.com/dask/dask/issues/4855#issuecomment-497062002
    """
    name = "reshape-block-" + tokenize(arr)
    ichunks = tuple(range(len(chunks_v)) for chunks_v in arr.chunks)
    ochunks = tuple(range(len(chunks_v)) for chunks_v in output_chunks)
    dsk = {
        (name, *ochunk): (np.reshape, (arr.name, *ichunk), oshape)
        for ichunk, ochunk, oshape in zip(
            product(*ichunks), product(*ochunks), product(*output_chunks)
        )
    }
    graph = HighLevelGraph.from_collections(name, dsk, dependencies=[arr])
    res = Array(graph, name, output_chunks, arr.dtype, meta=arr._meta)
    return res


# assumes we're collapsing axes (0, 1) and preserving axis 2
output_chunks = (tuple(map(np.prod, product(*arr.chunks[:2]))),) + (arr.chunks[-1],)
reshaped = reshape_block(arr, output_chunks)

# make sure that we roundtrip
actual = reshape_block(reshaped, arr.chunks)
assert_eq(actual, arr)  # True
```
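Neither listing defines `da` or `arr`; both are assumed from the surrounding discussion. Defining them first with a toy setup like the following (my addition, not part of the comment; the shapes mirror @slevang's example later in this thread) makes the demo lines above and `new_polyfit` below runnable:

```python
import dask.array as dsa
import numpy as np
import xarray as xr

# hypothetical toy input: (x, y, time), chunked along x and y but not time
da = xr.DataArray(
    dsa.random.random((4, 6, 100), chunks=(1, 2, 100)),
    dims=["x", "y", "time"],
    coords={"x": np.arange(4), "y": np.arange(6), "time": np.arange(100)},
)
arr = da.data  # the chunked dask array the snippets operate on
```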
```python
def new_polyfit(da):
    arr = da.data
    output_chunks = (tuple(map(np.prod, product(*arr.chunks[:2]))),) + (
        arr.chunks[-1],
    )
    reshaped = xr.DataArray(reshape_block(arr, output_chunks), dims=("xy", "time"))
    fit = reshaped.polyfit("time", 1, skipna=True)
    result = xr.Dataset()
    for var in fit:
        result[var] = (
            ("degree", "x", "y"),
            reshape_block(fit[var].data, ((2,),) + arr.chunks[:-1]),
        )
    result["x"] = da["x"]
    result["y"] = da["y"]
    result["degree"] = fit["degree"]
    return result


# with no NaNs in the data, the skipna=True path inside new_polyfit
# gives the same coefficients as skipna=False here
poly = da.polyfit("time", 1, skipna=False)
xr.testing.assert_allclose(poly, new_polyfit(da))  # True!
```
Reactions: +1 × 1
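To put "substantially more efficient" in concrete terms, one can compare task counts between the two approaches (my addition, not from the thread; exact numbers depend on the dask version and chunking):

```python
# task-graph sizes: xarray's stacking-based polyfit vs. the blockwise reshape
old_tasks = len(poly.polyfit_coefficients.data.__dask_graph__())
new_tasks = len(new_polyfit(da).polyfit_coefficients.data.__dask_graph__())
print(old_tasks, new_tasks)  # expect far fewer tasks for the blockwise version
```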
dcherian (MEMBER) · 2021-11-04T19:56:35Z (updated 2021-11-04T19:57:13Z)
https://github.com/pydata/xarray/issues/5629#issuecomment-961370482

To clarify, the constraint here is that […]

Yes, same idea. Though when you do it that way, you have to implement the function you're applying (that xhistogram PR reimplements […]).
TomNicholas (MEMBER) · 2021-11-04T19:25:41Z (updated 2021-11-04T19:26:51Z)
https://github.com/pydata/xarray/issues/5629#issuecomment-961349927

The idea seems similar to what we used in xhistogram, which uses blockwise to apply a function that reshapes the data in that block to be 1D.

EDIT: But in that case our algorithm was one that could be applied blockwise […]
Illviljan (MEMBER) · 2021-11-04T19:22:19Z (updated 2021-11-04T19:22:41Z)
https://github.com/pydata/xarray/issues/5629#issuecomment-961347645

I was thinking the general idea of `reshape_block` seems like a clever trick to get around the `apply_ufunc` constraints, and it could be used for some other functions as well. But maybe there aren't that many functions that could make use of it.
dcherian (MEMBER) · 2021-11-04T16:51:43Z (updated 2021-11-04T16:54:02Z)
https://github.com/pydata/xarray/issues/5629#issuecomment-961230139

I interpreted this as asking whether this general idea could be implemented in `apply_ufunc`, to which the answer is "no" (I think), because it's pretty specific. At best we put this logic in a wrapper function and call that with […]

This should handle all current use-cases of […] (see the sketch below for what such a wrapper might look like).

Reactions: +1 × 1
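A minimal sketch of such a wrapper (my illustration, not code from the thread): it generalizes `new_polyfit` above by taking the applied function as a parameter, and assumes the fit dimension is the last axis and that `func` maps a 2-D `(stacked, dim)` dask array to `(stacked, k)`:

```python
def stack_apply_unstack(da, func, dim):
    """Collapse all dims except `dim` blockwise, apply `func`, reshape back.

    Only valid when `func` treats the stacked points independently, since
    the blockwise reshape scrambles their order along the stacked axis.
    """
    arr = da.data
    assert da.dims.index(dim) == arr.ndim - 1  # assume `dim` is last, for brevity
    stacked_chunks = (tuple(map(np.prod, product(*arr.chunks[:-1]))),) + (
        arr.chunks[-1],
    )
    stacked = reshape_block(arr, stacked_chunks)  # 2-D: (stacked, dim)
    applied = func(stacked)                       # 2-D: (stacked, k)
    return reshape_block(applied, arr.chunks[:-1] + (applied.chunks[-1],))
```

For example, `stack_apply_unstack(da, lambda x: x.mean(axis=-1, keepdims=True), "time")` reduces along time while keeping every input chunk intact.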
TomNicholas (MEMBER) · 2021-11-04T16:04:17Z
https://github.com/pydata/xarray/issues/5629#issuecomment-961187488

Sorry, what case does your suggestion not handle? It would be really nice to support chunked core dims if we are going to the effort of rewriting `polyfit`.
dcherian (MEMBER) · 2021-11-04T15:27:49Z
https://github.com/pydata/xarray/issues/5629#issuecomment-961154611

The problem is that there is no general solution here. You have to write a chunk-aware function and then use […]
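The truncated ending presumably refers to `apply_ufunc` with `dask="allowed"`. A minimal sketch of that pattern (my illustration; `chunked_polyfit` is a hypothetical chunk-aware function, not an existing xarray API):

```python
import xarray as xr

def chunked_polyfit(arr):
    # hypothetical chunk-aware fit: receives the underlying dask array with
    # the core dim moved to the last axis (possibly chunked) and must handle
    # that chunking itself, e.g. via the blockwise-reshape trick above; it
    # returns coefficients with a trailing "degree" axis
    ...

coeffs = xr.apply_ufunc(
    chunked_polyfit,
    da,
    input_core_dims=[["time"]],
    output_core_dims=[["degree"]],
    dask="allowed",  # hand the dask array to the function unchanged
)
```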
Illviljan (MEMBER) · 2021-11-04T02:58:33Z
https://github.com/pydata/xarray/issues/5629#issuecomment-960414601

@dcherian Is that something that could be implemented in `apply_ufunc`? Not being able to apply along a chunked dim happens quite often, I think, right?
kmuehlbauer (MEMBER) · 2021-07-27T04:54:54Z
https://github.com/pydata/xarray/issues/5629#issuecomment-887207964

[comment body not captured in the export]
dcherian (MEMBER) · 2021-07-26T17:55:51Z
https://github.com/pydata/xarray/issues/5629#issuecomment-886907556

You could write a wrapper that does […]

Reactions: +1 × 2
aulemahal (CONTRIBUTOR) · 2021-07-26T17:52:42Z
https://github.com/pydata/xarray/issues/5629#issuecomment-886905635

Should be easy! The code was written without `apply_ufunc` for a few reasons, one being that using […]

Reactions: heart × 1
slevang (CONTRIBUTOR) · 2021-07-22T20:37:04Z
https://github.com/pydata/xarray/issues/5629#issuecomment-885216203

Potentially a useful comparison: here is the task graph for the same problem posed to `curvefit`:

```python
import dask.array as dsa
import xarray as xr

def linear(x, m, b):
    return m * x + b

da = xr.DataArray(
    dsa.random.random((4, 6, 100), chunks=(1, 2, 100)),
    dims=["x", "y", "time"],
)
fit = da.curvefit("time", linear).curvefit_coefficients
```

[task-graph image]

Reactions: heart × 3
TomNicholas (MEMBER) · 2021-07-22T17:31:02Z (updated 2021-07-22T17:31:58Z)
https://github.com/pydata/xarray/issues/5629#issuecomment-885087800

Thanks for the clear example @jbusecke. I think your example is helpful enough that we should keep this open (even if the same PR will hopefully close both this and #4554).

Was about to say the same, @dcherian!

@aulemahal, what do you think about how easy it might be to change […]

Reactions: +1 × 1
jbusecke (CONTRIBUTOR) · 2021-07-22T17:28:52Z
https://github.com/pydata/xarray/issues/5629#issuecomment-885086453

Should I close this one then?
dcherian (MEMBER) · 2021-07-22T17:27:00Z
https://github.com/pydata/xarray/issues/5629#issuecomment-885085284

It's the same thing: […]

Reactions: +1 × 2