issue_comments


17 rows where author_association = "MEMBER" and issue = 252358450 sorted by updated_at descending

shoyer (MEMBER) · 2017-10-09T23:28:52Z · https://github.com/pydata/xarray/pull/1517#issuecomment-335316764

I'll start on my PR to expose this as public API -- hopefully will make some progress on my flight from NY to SF tonight.

Reactions: 👍 1

jhamman (MEMBER) · 2017-10-09T23:23:45Z · https://github.com/pydata/xarray/pull/1517#issuecomment-335316029

Great. Go ahead and merge it then. I'm very excited about this feature.

shoyer (MEMBER) · 2017-10-09T23:22:33Z · https://github.com/pydata/xarray/pull/1517#issuecomment-335315831

I think this is ready.

jhamman (MEMBER) · 2017-10-09T23:21:32Z · https://github.com/pydata/xarray/pull/1517#issuecomment-335315689

@shoyer - anything left to do here?

jhamman (MEMBER) · 2017-09-20T00:18:32Z · https://github.com/pydata/xarray/pull/1517#issuecomment-330709438

@shoyer - My vote is for something closer to option 2.

Your example scenario is something I run into frequently. In cases like this, I think it's better to tell the user that they are not providing an appropriate input rather than attempting to rechunk the dataset.

(This is somewhat related to #1440)

mrocklin (MEMBER) · 2017-09-19T23:27:49Z · https://github.com/pydata/xarray/pull/1517#issuecomment-330701921

The heuristics we have are, I think, just of the form "did you make way more chunks than you had previously?" I can imagine other heuristics of the form "some of your new chunks are several times larger than your previous chunks". In general these heuristics might be useful in several places; it might make sense to collect them in a dask/array/utils.py file.
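
A hypothetical sketch of the first kind of heuristic (my own illustration, not dask's actual code): count blocks before and after a proposed rechunk and flag a large blow-up:

```python
import math

def rechunk_blowup(old_chunks, new_chunks, threshold=4):
    """Flag a rechunk that creates way more chunks than there were previously.

    Chunks use dask's representation: one tuple of block sizes per axis,
    so the number of blocks along an axis is len() of that tuple.
    """
    n_old = math.prod(len(c) for c in old_chunks)
    n_new = math.prod(len(c) for c in new_chunks)
    return n_new > threshold * n_old

# 2 x 2 = 4 blocks rechunked into 20 x 20 = 400 blocks: flagged.
assert rechunk_blowup(((100, 100), (50, 50)), ((10,) * 20, (5,) * 20))
```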

shoyer (MEMBER) · 2017-09-19T23:25:08Z · https://github.com/pydata/xarray/pull/1517#issuecomment-330701517

I have a design question here: how should we handle cases where a core dimension exists in multiple chunks? For example, suppose you are applying a function that needs access to every point along the "time" axis at once (e.g., an auto-correlation function).

Should we:

1. Automatically rechunk along "time" into a single chunk, or
2. Raise an error and require the user to rechunk manually (xref https://github.com/dask/dask/issues/2689 for API on this)?

Currently we do behavior 1, but behavior 2 might be more user-friendly. Otherwise it could be pretty easy to inadvertently pass in a dask array (e.g., one in small chunks along time) that apply_ufunc would load into memory by putting it into a single chunk.

dask.array has some heuristics to protect against this in rechunk() but I'm not sure they are effective enough to catch this. (@mrocklin?)
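
To make the two options concrete, here is a minimal sketch with plain dask (my own illustration, not code from this PR), where axis 1 stands in for the "time" core dimension:

```python
import dask.array as da

x = da.random.random((100, 1000), chunks=(50, 100))  # "time" axis split into 10 blocks

# Behavior 1: silently collapse the core axis into a single chunk.
x1 = x.rechunk({1: -1})          # -1 means one chunk along that axis
assert len(x1.chunks[1]) == 1    # may load the whole axis into each task's memory

# Behavior 2: refuse, and make the user rechunk explicitly.
def require_single_chunk(arr, axis):
    if len(arr.chunks[axis]) > 1:
        raise ValueError(
            f"core dimension on axis {axis} spans {len(arr.chunks[axis])} chunks; "
            f"rechunk it into a single chunk first, e.g. arr.rechunk({{{axis}: -1}})")

require_single_chunk(x1, 1)      # passes; require_single_chunk(x, 1) would raise
```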

spencerkclark (MEMBER) · 2017-09-19T21:32:00Z · https://github.com/pydata/xarray/pull/1517#issuecomment-330679808

I was not aware of dask's atop function before reading this PR (it looks pretty cool), so I defer to @nbren12 there.

shoyer (MEMBER) · 2017-09-17T05:39:38Z · https://github.com/pydata/xarray/pull/1517#issuecomment-330022743

> Alternatively apply_ufunc could see if the func object has a pre_dask_atop method, and apply it if it does.

This seems like a reasonable option to me. Once we get this merged, want to make a PR?

@jhamman could you give this a review? I have not included extensive documentation yet, but I am also reluctant to squeeze that into this PR before we make it public API. (Which I'd like to save for another one.)
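
The hook being discussed might take a shape like this hypothetical sketch (pre_dask_atop is only the name from the quoted suggestion, not a real xarray API):

```python
def apply_with_hook(func, *arrays):
    # Duck-typed hook: let func preprocess its dask inputs (e.g. add ghost
    # cells) before the atop-based dispatch runs.
    if hasattr(func, 'pre_dask_atop'):
        arrays = func.pre_dask_atop(*arrays)
    return func(*arrays)
```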

spencerkclark (MEMBER) · 2017-09-10T13:09:40Z · https://github.com/pydata/xarray/pull/1517#issuecomment-328341717

@nbren12 for similar use cases I've had success writing a single function that does the ghosting, applies a function with map_blocks, and trims the edges. Then I apply that single function on a DataArray with apply_ufunc (so a single call to apply_ufunc rather than three). As an example, a simple centered difference on an array with periodic boundaries might be accomplished with:

```python
import numpy as np
import dask.array as darray
from xarray.core import computation  # apply_ufunc was not yet public API here

def centered_diff_numpy(arr, axis=-1, spacing=1.):
    return (np.roll(arr, -1, axis=axis) - np.roll(arr, 1, axis=axis)) / (2. * spacing)

def centered_diff(da, dim, spacing=1.):
    def apply_centered_diff(arr, spacing=1.):
        if isinstance(arr, np.ndarray):
            return centered_diff_numpy(arr, spacing=spacing)
        else:
            axis = len(arr.shape) - 1
            # ghost -> map_blocks -> trim, all inside one function
            g = darray.ghost.ghost(arr, depth={axis: 1}, boundary={axis: 'periodic'})
            result = darray.map_blocks(centered_diff_numpy, g, spacing=spacing)
            return darray.ghost.trim_internal(result, {axis: 1})

    return computation.apply_ufunc(
        apply_centered_diff, da, input_core_dims=[[dim]],
        output_core_dims=[[dim]], dask_array='allowed', kwargs={'spacing': spacing})
```

Depending on your use case, you might also consider `dask.ghost.map_overlap` to do all three of those steps in one line, i.e. replace `apply_centered_diff` with the following:

```python
def apply_centered_diff(arr, spacing=1.):
    if isinstance(arr, np.ndarray):
        return centered_diff_numpy(arr, spacing=spacing)
    else:
        axis = len(arr.shape) - 1
        return darray.ghost.map_overlap(
            arr, centered_diff_numpy, depth={axis: 1},
            boundary={axis: 'periodic'}, spacing=spacing)
```

(Not sure if this is what @shoyer had in mind, but just offering an example.)
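
For context, a hypothetical invocation of the `centered_diff` helper above might look like the sketch below (the data and dimension names are made up, and `dask_array='allowed'` is this PR's spelling of the keyword that later became `dask='allowed'` in the public apply_ufunc API):

```python
import numpy as np
import xarray as xr

# A periodic signal sampled every degree around a circle, chunked along 'lon'.
da_in = xr.DataArray(
    np.sin(np.linspace(0, 2 * np.pi, 360, endpoint=False)),
    dims=['lon'],
).chunk({'lon': 90})

out = centered_diff(da_in, 'lon', spacing=np.pi / 180)
print(out.compute())  # ~cos(lon), computed blockwise with periodic ghost cells
```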

shoyer (MEMBER) · 2017-09-09T19:38:09Z · https://github.com/pydata/xarray/pull/1517#issuecomment-328299112

@nbren12 Probably the best way to do ghosting with the current interface is to write a function that acts on dask array objects to apply the ghosting, and then apply it using apply_ufunc. I don't see an easy way to incorporate it into the current interface, which is already getting pretty complicated.

shoyer (MEMBER) · 2017-08-24T19:34:49Z · https://github.com/pydata/xarray/pull/1517#issuecomment-324734974

@mrocklin I split that discussion off to #1525.

mrocklin (MEMBER) · 2017-08-24T19:25:32Z · https://github.com/pydata/xarray/pull/1517#issuecomment-324732814

Yes, if you don't care strongly about deduplication. The following will be slower:

```python
b = (a.chunk(...) + 1) + (a.chunk(...) + 1)
```

Under the current behavior this is optimized to:

```python
tmp = a.chunk(...) + 1
b = tmp + tmp
```

So you'll lose that, but I suspect that in your case chunking the same dataset many times is somewhat rare.
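
To make the trade-off concrete, here is a minimal sketch of my own (not from the thread) showing how deterministic names drive deduplication, using dask.array.from_array and its documented name=False option:

```python
import numpy as np
import dask.array as da

x = np.arange(1_000_000)

# Default: the array is hashed to produce a deterministic name, so two
# identical from_array calls share one key and the graphs deduplicate.
a1 = da.from_array(x, chunks=100_000)
a2 = da.from_array(x, chunks=100_000)
assert a1.name == a2.name

# name=False skips the (potentially slow) hashing and generates a random
# name instead, so identical inputs are no longer recognized as the same.
b1 = da.from_array(x, chunks=100_000, name=False)
b2 = da.from_array(x, chunks=100_000, name=False)
assert b1.name != b2.name
```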

shoyer (MEMBER) · 2017-08-24T19:22:56Z · https://github.com/pydata/xarray/pull/1517#issuecomment-324732200

@mrocklin Yes, that took a few seconds (due to hashing the array contents). Would you suggest setting name=False by default for xarray's chunk() method?

mrocklin (MEMBER) · 2017-08-24T18:43:30Z · https://github.com/pydata/xarray/pull/1517#issuecomment-324722153

I'm curious, how long does this line take:

```python
r = spearman_correlation(array1.chunk({'place': 10}), array2.chunk({'place': 10}), 'time')
```

Have you considered setting name=False in your from_array call by default when doing this? I often avoid creating deterministic names when going back and forth rapidly between dask.array and numpy.

shoyer (MEMBER) · 2017-08-24T17:38:29Z · https://github.com/pydata/xarray/pull/1517#issuecomment-324705244

> What's rs.randn()?

Oops, fixed.

> When this makes it into the public facing API it would be nice to include some guidance on how the chunking scheme affects the run time.

We already have some tips here: http://xarray.pydata.org/en/stable/dask.html#chunking-and-performance

> More ambitiously I could imagine an API such as array1.chunk('place') or array1.chunk('auto') meaning to figure out a reasonable chunking scheme only once .compute() is called so that all the compute steps are known.

Yes, this would be great.

> Maybe this is more specific to dask than xarray. I believe it would also be difficult.

I agree with both!
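
For reference, later dask releases did add an "auto" chunking option, though it picks chunk sizes eagerly from a configured target block size (the array.chunk-size setting, 128 MiB by default) rather than deferring the decision until .compute() as imagined above:

```python
import dask.array as da

# Block sizes are chosen at creation time to fit the configured target.
x = da.ones((20_000, 20_000), chunks="auto")
print(x.chunksize)  # e.g. (4000, 4000), depending on dtype and config

# rechunk("auto") rebalances an existing array the same way.
y = x.rechunk("auto")
```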

clarkfitzg (MEMBER) · 2017-08-24T16:50:45Z · https://github.com/pydata/xarray/pull/1517#issuecomment-324692881

Wow, this is great stuff!

What's rs.randn()?

When this makes it into the public facing API it would be nice to include some guidance on how the chunking scheme affects the run time. Imagine a plot with run time plotted as a function of chunk size or number of chunks. Of course it also depends on the data size and the number of cores available.

To say it in a different way, array1.chunk({'place': 10}) is a performance tuning parameter, semantically no different than array1.

More ambitiously I could imagine an API such as array1.chunk('place') or array1.chunk('auto') meaning to figure out a reasonable chunking scheme only once .compute() is called so that all the compute steps are known. Maybe this is more specific to dask than xarray. I believe it would also be difficult.
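
A small sketch of my own (with made-up shapes) of the point that .chunk() is purely a performance knob and leaves the semantics unchanged:

```python
import numpy as np
import xarray as xr

array1 = xr.DataArray(np.random.randn(40, 1000), dims=['place', 'time'])

eager = (array1 ** 2).mean('time')                      # plain numpy execution
lazy = (array1.chunk({'place': 10}) ** 2).mean('time')  # dask execution, 4 blocks

# Same values either way; only the execution strategy differs.
xr.testing.assert_allclose(eager, lazy.compute())
```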


Table schema:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);