issue_comments
9 rows where author_association = "MEMBER", issue = 333312849 and user = 1217238 sorted by updated_at descending
id | html_url | issue_url | node_id | user | created_at | updated_at | author_association | body | reactions | performed_via_github_app | issue |
---|---|---|---|---|---|---|---|---|---|---|---|
482274302 | https://github.com/pydata/xarray/issues/2237#issuecomment-482274302 | https://api.github.com/repos/pydata/xarray/issues/2237 | MDEyOklzc3VlQ29tbWVudDQ4MjI3NDMwMg== | shoyer 1217238 | 2019-04-11T19:32:33Z | 2019-04-11T19:32:33Z | MEMBER | The original issue has been fixed, at least in the toy example. I don't know if it's still an issue in more realistic scenarios. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
why time grouping doesn't preserve chunks 333312849 | |
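The toy example referred to in the comment above did not survive this export. A minimal reconstruction of the kind of check being described (an editorial sketch with hypothetical data and variable names, not the original snippet) would group a dask-backed array by day of year and compare the chunk layout before and after the anomaly calculation:
```python
import numpy as np
import pandas as pd
import xarray as xr

# Four years of daily data, chunked one "year" per chunk (hypothetical setup).
time = pd.date_range('2000-01-01', periods=4 * 365, freq='D')
da = xr.DataArray(np.random.rand(time.size),
                  coords=[('time', time)]).chunk({'time': 365})

# Day-of-year anomalies via groupby arithmetic.
anom = da.groupby('time.dayofyear') - da.groupby('time.dayofyear').mean('time')

# If grouping preserves chunks, these two chunk layouts should match.
print(da.chunks)
print(anom.chunks)
```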
398592643 | https://github.com/pydata/xarray/issues/2237#issuecomment-398592643 | https://api.github.com/repos/pydata/xarray/issues/2237 | MDEyOklzc3VlQ29tbWVudDM5ODU5MjY0Mw== | shoyer 1217238 | 2018-06-20T01:10:04Z | 2018-06-20T01:10:04Z | MEMBER |
Maybe it helps to think about these as matrices. The nth row of
Yes, this is definitely a shuffle. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
why time grouping doesn't preserve chunks 333312849 | |
398584002 | https://github.com/pydata/xarray/issues/2237#issuecomment-398584002 | https://api.github.com/repos/pydata/xarray/issues/2237 | MDEyOklzc3VlQ29tbWVudDM5ODU4NDAwMg== | shoyer 1217238 | 2018-06-20T00:11:33Z | 2018-06-20T00:11:33Z | MEMBER |
No worries, this is indeed pretty confusing! Suppose N is the number of years of data:
```
list_of_group_indices = [
    [0, 365, 730, ..., (N-1)*365],      # day 1, ordered by year
    [1, 366, 731, ..., (N-1)*365 + 1],  # day 2, ordered by year
    ...
]
indices_to_restore_orig_order = [
    0, N, 2N, 3N, ...,        # year 1, ordered by day
    1, N+1, 2N+1, 3N+1, ...,  # year 2, ordered by day
    ...
]
```
As you can see, if you concatenate together the first set of indices and index by the second set of indices, it would arrange them into sequential integers. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
why time grouping doesn't preserve chunks 333312849 | |
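A runnable toy version of the concatenate-then-reindex step described in the comment above (an editorial sketch that shrinks 365 days per year down to 4 for readability; the names mirror the pseudocode but are otherwise made up):
```python
import numpy as np

n_years, n_days = 3, 4  # 3 "years" of 4 "days" each, i.e. 12 time steps

# One group per "day of year": the indices of that day across all years.
list_of_group_indices = [np.arange(day, n_years * n_days, n_days)
                         for day in range(n_days)]

concatenated = np.concatenate(list_of_group_indices)      # grouped order
indices_to_restore_orig_order = np.argsort(concatenated)  # inverse permutation

print(concatenated)                                 # [0 4 8 1 5 9 2 6 10 3 7 11]
print(concatenated[indices_to_restore_orig_order])  # [0 1 2 ... 11], back in time order
```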
398581618 | https://github.com/pydata/xarray/issues/2237#issuecomment-398581618 | https://api.github.com/repos/pydata/xarray/issues/2237 | MDEyOklzc3VlQ29tbWVudDM5ODU4MTYxOA== | shoyer 1217238 | 2018-06-19T23:57:03Z | 2018-06-19T23:57:03Z | MEMBER | Some sort of automatic rechunking could also make a big difference for performance, in cases where the groupby operation splits the original chunks into small pieces (like my |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
why time grouping doesn't preserve chunks 333312849 | |
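For context on the kind of automatic rechunking being discussed, a small present-day dask illustration (an editorial sketch, not part of the comment; current dask exposes this via chunks='auto'):
```python
import dask.array as da

# Many tiny chunks, similar to what a groupby can produce when it splits
# each original chunk into per-group pieces.
x = da.ones(1000, chunks=1)
print(len(x.chunks[0]))   # 1000 chunks of size 1

# Automatic rechunking merges them back into a small number of larger chunks.
y = x.rechunk('auto')
print(len(y.chunks[0]))   # far fewer chunks
```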
398580421 | https://github.com/pydata/xarray/issues/2237#issuecomment-398580421 | https://api.github.com/repos/pydata/xarray/issues/2237 | MDEyOklzc3VlQ29tbWVudDM5ODU4MDQyMQ== | shoyer 1217238 | 2018-06-19T23:49:16Z | 2018-06-19T23:50:12Z | MEMBER | Another option would be to rewrite how xarray does groupby/transform operations to make it more dask friendly. Currently it looks roughly like:
For example, we could reverse the order of the last two steps. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
why time grouping doesn't preserve chunks 333312849 | |
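The list of steps in the comment above was lost in this export. As a rough editorial sketch of a split/apply/recombine pipeline of that shape (a hypothetical helper, not xarray's actual code), the "last two steps" being reordered would be the concatenation and the re-indexing back into original order:
```python
import numpy as np
import dask.array as da

def groupby_transform(x, group_indices, func):
    # Hypothetical pipeline, loosely mirroring the steps described above
    # (not xarray's implementation): split, apply, concatenate, restore order.
    pieces = [func(x[idx]) for idx in group_indices]    # split into groups + apply func
    combined = da.concatenate(pieces)                   # concatenate groups back together
    order = np.argsort(np.concatenate(group_indices))   # inverse permutation to time order
    return combined[order]                              # index back into original order

x = da.arange(12, chunks=4)
groups = [np.arange(d, 12, 4) for d in range(4)]        # "day of year"-style groups
print(groupby_transform(x, groups, lambda g: g - g.mean()).compute())
```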
398579480 | https://github.com/pydata/xarray/issues/2237#issuecomment-398579480 | https://api.github.com/repos/pydata/xarray/issues/2237 | MDEyOklzc3VlQ29tbWVudDM5ODU3OTQ4MA== | shoyer 1217238 | 2018-06-19T23:43:18Z | 2018-06-19T23:43:32Z | MEMBER |
Assuming the original array is chunked into one file per year-month (which is probably a reasonable starting point):
- For the |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
why time grouping doesn't preserve chunks 333312849 | |
398575742 | https://github.com/pydata/xarray/issues/2237#issuecomment-398575742 | https://api.github.com/repos/pydata/xarray/issues/2237 | MDEyOklzc3VlQ29tbWVudDM5ODU3NTc0Mg== | shoyer 1217238 | 2018-06-19T23:21:10Z | 2018-06-19T23:21:10Z | MEMBER | Here's an example of what these indices look like for a slightly more realistic groupby example:
```python
import xarray
import pandas
import numpy as np

array = xarray.DataArray(
    range(1000),
    [('time', pandas.date_range('2000-01-01', freq='D', periods=1000))])

# this works with xarray 0.10.7
xarray.core.groupby._inverse_permutation_indices(
    array.groupby('time.month')._group_indices)
```
I think it would work with the "put contiguous index regions into the same chunk" heuristic. On the other hand, this could break pretty badly for other group-by operations, e.g., calculating those anomalies by day of year instead:
This looks like @mrocklin's second case. That said, it's still probably more graceful to fail by creating too many small tasks rather than one giant task. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
why time grouping doesn't preserve chunks 333312849 | |
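The day-of-year snippet that followed "instead:" above was lost in this export; a hedged reconstruction of what it presumably looked like, reusing the same private xarray 0.10.7 helpers as the monthly example (an assumption, not the original code):
```python
import xarray
import pandas

array = xarray.DataArray(
    range(1000),
    [('time', pandas.date_range('2000-01-01', freq='D', periods=1000))])

# Same private helpers as in the monthly example above; grouping by day of
# year gives ~365 groups whose indices interleave across years, so the
# "contiguous index regions" heuristic no longer helps.
xarray.core.groupby._inverse_permutation_indices(
    array.groupby('time.dayofyear')._group_indices)
```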
398198466 | https://github.com/pydata/xarray/issues/2237#issuecomment-398198466 | https://api.github.com/repos/pydata/xarray/issues/2237 | MDEyOklzc3VlQ29tbWVudDM5ODE5ODQ2Ng== | shoyer 1217238 | 2018-06-18T21:16:24Z | 2018-06-18T21:16:24Z | MEMBER | I vaguely recall discussing chunks that result from indexing somewhere in the dask issue tracker (when we added the special case for a monotonic increasing indexer to preserve chunks), but I can't find it now. I think the challenge is that it isn't obvious what the right chunksizes should be. Chunks that are too small also have negative performance implications. Maybe the automatic chunking logic that @mrocklin has been looking into recently would be relevant here. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
why time grouping doesn't preserve chunks 333312849 | |
398157337 | https://github.com/pydata/xarray/issues/2237#issuecomment-398157337 | https://api.github.com/repos/pydata/xarray/issues/2237 | MDEyOklzc3VlQ29tbWVudDM5ODE1NzMzNw== | shoyer 1217238 | 2018-06-18T18:50:39Z | 2018-06-18T18:50:48Z | MEMBER | The source of the indexing operation that brings all the chunks together is the
So basically the issue comes down to indexing with dask.array, which creates a single chunk when the integer indices are not all in order:
```
import dask.array as da
import numpy as np

x = da.ones(4, chunks=1)

print(x[np.arange(4)])
# dask.array<getitem, shape=(4,), dtype=float64, chunksize=(1,)>

print(x[np.arange(4)[::-1]])
# dask.array<getitem, shape=(4,), dtype=float64, chunksize=(4,)>
```
As a work-around in xarray, we could use explicit indexing + concatenation. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
why time grouping doesn't preserve chunks 333312849 |
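A minimal sketch of the "explicit indexing + concatenation" work-around mentioned at the end of the comment above (an editorial illustration, not xarray's actual implementation): indexing each piece separately and concatenating keeps the small chunks instead of collapsing them into one.
```python
import dask.array as da
import numpy as np

x = da.ones(4, chunks=1)
indexer = np.arange(4)[::-1]

# Passing the whole out-of-order indexer at once collapses the result into a
# single chunk (chunksize=(4,)), as shown above. Indexing each element (or
# contiguous run) separately and concatenating preserves the chunk structure.
pieces = [x[i:i + 1] for i in indexer]
result = da.concatenate(pieces)
print(result)   # chunksize stays (1,)
```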
```sql
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user] ON [issue_comments] ([user]);
```