issue_comments


61 rows where issue = 146182176 sorted by updated_at descending


user 6

  • rabernat 27
  • shoyer 19
  • jhamman 7
  • naught101 4
  • clarkfitzg 3
  • monocongo 1

author_association 2

  • MEMBER 56
  • NONE 5

issue 1

  • Multidimensional groupby · 61
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
231256264 https://github.com/pydata/xarray/pull/818#issuecomment-231256264 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIzMTI1NjI2NA== shoyer 1217238 2016-07-08T01:50:30Z 2016-07-08T01:50:30Z MEMBER

OK, merging.....

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
230818687 https://github.com/pydata/xarray/pull/818#issuecomment-230818687 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIzMDgxODY4Nw== shoyer 1217238 2016-07-06T16:00:54Z 2016-07-06T16:00:54Z MEMBER

@rabernat I agree. I have a couple of minor style/pep8 issues, and we need an entry for "what's new", but let's merge this. I can then play around a little bit with potential fixes.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
230796165 https://github.com/pydata/xarray/pull/818#issuecomment-230796165 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIzMDc5NjE2NQ== rabernat 1197350 2016-07-06T14:50:42Z 2016-07-06T14:50:42Z MEMBER

I just rebased and updated this PR. I have not resolved all of the edge cases, such as what to do about non-reducing groupby_bins operations that don't span the entire coordinate. Unfortunately merging @shoyer's fix from #875 did not resolve this problem, at least not in a way that was obvious to me.

My feeling is that this PR in its current form introduces some very useful new features. For my part, I am eager to start using it for actual science projects. Multidimensional grouping is unfamiliar territory. I don't think every potential issue can be resolved by me right now via this PR--I don't have the necessary skills, nor can I anticipate every use case. I think that getting this merged and out in the wild will give us some valuable user feedback which will help figure out where to go next. Plus it would get exposed to developers with the skills to resolve some of the issues. By waiting much longer, we risk it going stale, since lots of other xarray elements are also in flux.

Please let me know what you think.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
224693231 https://github.com/pydata/xarray/pull/818#issuecomment-224693231 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIyNDY5MzIzMQ== shoyer 1217238 2016-06-08T18:58:45Z 2016-06-08T18:58:45Z MEMBER

Looks like I still have a bug (failing Travis builds). Let me see if I can get that sorted out first.

On Wed, Jun 8, 2016 at 11:51 AM, Ryan Abernathey notifications@github.com wrote:

I think #875 https://github.com/pydata/xarray/pull/875 should fix the issue with concatenating index objects.

Should I try to merge your branch with my branch...or wait for your branch to get merged into master?


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
224691235 https://github.com/pydata/xarray/pull/818#issuecomment-224691235 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIyNDY5MTIzNQ== rabernat 1197350 2016-06-08T18:51:37Z 2016-06-08T18:51:37Z MEMBER

I think #875 should fix the issue with concatenating index objects.

Should I try to merge your branch with my branch...or wait for your branch to get merged into master?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
224484574 https://github.com/pydata/xarray/pull/818#issuecomment-224484574 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIyNDQ4NDU3NA== shoyer 1217238 2016-06-08T04:32:29Z 2016-06-08T04:32:29Z MEMBER

I think #875 should fix the issue with concatenating index objects.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
223999761 https://github.com/pydata/xarray/pull/818#issuecomment-223999761 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIyMzk5OTc2MQ== shoyer 1217238 2016-06-06T15:45:49Z 2016-06-06T15:45:49Z MEMBER

Empty groups should be straightforward -- we should be able to handle them.

Indices which don't belong to any group are indeed more problematic. I think we have three options here:

1. Raise an error when calling .groupby_bins(...)
2. Raise an error when calling .groupby_bins(...).apply(...)
3. Simply concatenate back together whatever items were grouped, and give up on the guarantee that applying the identity function restores the original item.

I think my preference would be for option 3, though 1 or 2 could be reasonable workarounds for now (raising NotImplementedError), because 3 is likely to be a little tricky to implement.
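The wrinkle behind option 3 -- values that fall in no bin -- can be seen with plain pandas (a stand-in sketch with made-up values, not the xarray code under discussion):

```python
import numpy as np
import pandas as pd

values = np.arange(4)          # 0, 1, 2, 3
bins = [1, 2, 3]               # bins (1, 2] and (2, 3] cover neither 0 nor 1

labels = pd.cut(values, bins)
# a code of -1 marks a value that belongs to no bin; concatenating only the
# grouped items back together would silently drop those positions
print(labels.codes.tolist())   # [-1, -1, 0, 1]
```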

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
223934668 https://github.com/pydata/xarray/pull/818#issuecomment-223934668 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIyMzkzNDY2OA== rabernat 1197350 2016-06-06T11:36:02Z 2016-06-06T11:36:02Z MEMBER

@shoyer: I'm not sure this is as simple as a technical fix. It is a design question.

With regular groupby, the groups are guaranteed to span the original coordinates exactly, so you can always put the original dataarrays back together from the groupby object, i.e. ds.groupby('dim_0').apply(lambda x: x).

With groupby_bins, the user specifies the bins and might do so in such a way that:

  • there are empty groups
  • there are indices which don't belong to any group

In both cases, it is not obvious to me what should happen when calling .apply(lambda x: x). Especially for the latter, I would probably want to raise an error informing the user that their bins are not sufficient to reconstitute the full index.
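The round-trip guarantee for ordinary groupby can be sketched with pandas (a stand-in for the xarray behaviour described above; the series is made up):

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0], index=[0, 0, 1, 1])

# every element belongs to exactly one group, so an identity apply
# reconstitutes the original series exactly
roundtrip = s.groupby(level=0).transform(lambda x: x)
assert roundtrip.equals(s)
```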

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
223870991 https://github.com/pydata/xarray/pull/818#issuecomment-223870991 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIyMzg3MDk5MQ== shoyer 1217238 2016-06-06T05:23:24Z 2016-06-06T05:23:24Z MEMBER

I think I can fix this, by making concatenation work properly on index objects. Stay tuned...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
223817102 https://github.com/pydata/xarray/pull/818#issuecomment-223817102 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIyMzgxNzEwMg== rabernat 1197350 2016-06-05T14:47:12Z 2016-06-05T14:47:12Z MEMBER

@shoyer, @jhamman, could you give me some feedback on one outstanding issue with this PR? I am stuck on a kind of obscure edge case, but I really want to get this finished.

Consider the following groupby operation, which creates bins which are finer than the original coordinate. In other words, some bins are empty because there are too many bins.

```python
dat = xr.DataArray(np.arange(4))
dim_0_bins = np.arange(0, 4.5, 0.5)
gb = dat.groupby_bins('dim_0', dim_0_bins)
print(gb.groups)
```

gives

```
{'(0.5, 1]': [1], '(2.5, 3]': [3], '(1.5, 2]': [2]}
```

If I try a reducing apply operation, e.g. gb.mean(), it works fine. However, if I do

```python
gb.apply(lambda x: x - x.mean())
```

I get an error on the concat step

```
--> 433 combined = self._concat(applied, shortcut=shortcut)
... [long stack trace]
IndexError: index 3 is out of bounds for axis 1 with size 3
```

I'm really not sure what the "correct behavior" should even be in this case. It is not even possible to reconstitute the original data array by doing gb.apply(lambda x: x). The same problem arises when the groups do not span the entire coordinate (e.g. dim_0_bins = [1,2,3]).

Do you have any thoughts / suggestions? I'm not sure I can solve this issue right now, but I would at least like to have a more useful error message.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
221859813 https://github.com/pydata/xarray/pull/818#issuecomment-221859813 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIyMTg1OTgxMw== rabernat 1197350 2016-05-26T12:42:20Z 2016-05-26T12:42:20Z MEMBER

Just a little update--I realized that calling apply on multidimensional binned groups fails when the group is not reduced. For example

```python
ds.groupby_bins('lat', lat_bins).apply(lambda x: x - x.mean())
```

raises errors because of conflicting coordinates when trying to concat the results. I only discovered this when making my tutorial notebook. I think I know how to fix it, but I haven't had time yet.

So it is moving along... I am excited about this feature and am confident it can make it into the next release.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
220863225 https://github.com/pydata/xarray/pull/818#issuecomment-220863225 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIyMDg2MzIyNQ== clarkfitzg 5356122 2016-05-22T23:28:01Z 2016-05-22T23:28:01Z MEMBER

Ah, now I see what you were going for. More going on here than I realized. That's a nice plot :)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
220861371 https://github.com/pydata/xarray/pull/818#issuecomment-220861371 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIyMDg2MTM3MQ== jhamman 2443309 2016-05-22T22:47:39Z 2016-05-22T22:47:39Z MEMBER

@rabernat - I'm a bit late to the party here but it looks like you have gotten it straightened out. I would have suggested plotting the projected data using Cartopy.

@clarkfitzg - this is the exact functionality we want with 2d plot coordinates and we definitely do not want to change it. It is a little annoying that pcolormesh wraps the x coordinate in the way it does but such is life.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
220859076 https://github.com/pydata/xarray/pull/818#issuecomment-220859076 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIyMDg1OTA3Ng== rabernat 1197350 2016-05-22T21:59:05Z 2016-05-22T21:59:05Z MEMBER

> The right thing for xarray to do is probably to throw an error when any 2d plot method is called with 2 coordinates that actually have higher dimensions.

I disagree. I don't want to use the default dimensions as the x and y coords for the plot. I want to use the true lat / lon coords, which are xc and yc. In this case, I think the plot broke because pcolormesh can't handle the way the coordinates wrap. It's not a problem with xarray. If I pass the plot through cartopy, it actually works great, because cartopy knows how to handle the 2D geographic coordinates a bit better.

```python
ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_global()
ds.Tair[0].plot.pcolormesh(ax=ax, transform=ccrs.PlateCarree(), x='xc', y='yc')
ax.coastlines()
```

This would fail of course if you could only use 1d coords for plotting, so I definitely think we should keep the plot code as is for now (not raise an error).

I am happy with this example for now.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
220844145 https://github.com/pydata/xarray/pull/818#issuecomment-220844145 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIyMDg0NDE0NQ== clarkfitzg 5356122 2016-05-22T17:16:14Z 2016-05-22T18:31:58Z MEMBER

The problem is with the shape of these coordinates.

```python
>>> ds = xr.tutorial.load_dataset('RASM_example_data')
>>> ds['xc'].shape
(205, 275)
```

EDIT: just to be clear, it doesn't make sense to pass in 2d arrays for both x and y coordinates for a 2d plotting function.

Run this:

```python
ds.Tair[0].plot.pcolormesh(x='x', y='y')
```

to produce:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
220844279 https://github.com/pydata/xarray/pull/818#issuecomment-220844279 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIyMDg0NDI3OQ== clarkfitzg 5356122 2016-05-22T17:18:48Z 2016-05-22T17:18:48Z MEMBER

The right thing for xarray to do is probably to throw an error when any 2d plot method is called with 2 coordinates that actually have higher dimensions.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
220833788 https://github.com/pydata/xarray/pull/818#issuecomment-220833788 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIyMDgzMzc4OA== rabernat 1197350 2016-05-22T13:55:51Z 2016-05-22T13:55:51Z MEMBER

@jhamman, @clarkfitzg: I am working on an example notebook for multidimensional coordinates. In addition to the new groupby features, I wanted to include an example of a 2D pcolormesh using the RASM_example_data.nc dataset.

Just doing the simplest possible thing, i.e.

```python
ds.Tair[0].plot.pcolormesh(x='xc', y='yc')
```

gives me a slightly mangled plot:

Am I missing something obvious here?

Seems somehow related to #781, #792.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
220144452 https://github.com/pydata/xarray/pull/818#issuecomment-220144452 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIyMDE0NDQ1Mg== jhamman 2443309 2016-05-18T20:15:04Z 2016-05-18T20:15:04Z MEMBER

@rabernat - the monthly-means example was developed in an ipython notebook and then exported to *.rst. The dataset in that example doesn't have lat/lon coordinates although it should. I'll see if I can add them this afternoon.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
220065292 https://github.com/pydata/xarray/pull/818#issuecomment-220065292 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIyMDA2NTI5Mg== rabernat 1197350 2016-05-18T15:33:45Z 2016-05-18T15:33:45Z MEMBER

> A nice example for the docs.

There is indeed basic documentation, but not a detailed tutorial of what these features are good for. For that, the dataset from @jhamman with a non-uniform grid would actually be ideal. The monthly-means example I think contains a reference to a similar dataset.

How were the files in the doc/examples directory generated?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
220029256 https://github.com/pydata/xarray/pull/818#issuecomment-220029256 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIyMDAyOTI1Ng== rabernat 1197350 2016-05-18T13:41:47Z 2016-05-18T13:41:47Z MEMBER

> Allow specification of which dims to stack.

I think this should wait for a future PR. It is pretty complicated. I think it would be better to get the current features out in the wild first and play with it a bit before moving forward.

> I ran into the index is monotonic issue, it sounds like that was resolved. Do we cover that case in a test?

It is resolved, but not tested. I'll add a test.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
219875456 https://github.com/pydata/xarray/pull/818#issuecomment-219875456 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxOTg3NTQ1Ng== jhamman 2443309 2016-05-17T22:38:56Z 2016-05-17T22:38:56Z MEMBER

@rabernat - I just had a look through the code and it looks pretty good. I have a few broader questions though:

1. You have a few outstanding todo items from the first comment in your PR:

  • [ ] Allow specification of which dims to stack. For example, stack in space but keep time dimension intact. (Currently it just stacks all the dimensions of the group variable.)
  • [ ] A nice example for the docs.

Where do we stand on these? You have some simple examples in the docs now but maybe you were thinking of more complete examples?

2. In https://github.com/pydata/xarray/pull/818#issuecomment-218358050, I ran into the index is monotonic issue, it sounds like that was resolved. Do we cover that case in a test?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
219847587 https://github.com/pydata/xarray/pull/818#issuecomment-219847587 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxOTg0NzU4Nw== rabernat 1197350 2016-05-17T20:43:31Z 2016-05-17T20:43:31Z MEMBER

@shoyer, @jhamman: I'm pretty happy with where this is at. It's quite useful for lots of things I want to do with xarray. Any more feedback?

One outstanding issue involves some buggy behavior with shortcut which I don't really understand.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
219262958 https://github.com/pydata/xarray/pull/818#issuecomment-219262958 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxOTI2Mjk1OA== rabernat 1197350 2016-05-15T02:44:19Z 2016-05-15T02:44:19Z MEMBER

Just updated this to use the groupby_bins syntax, which now exposes all the arguments of pd.cut to the user.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
219231243 https://github.com/pydata/xarray/pull/818#issuecomment-219231243 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxOTIzMTI0Mw== rabernat 1197350 2016-05-14T17:00:33Z 2016-05-14T17:00:33Z MEMBER

This is a good question, with a simple answer (stack), but it doesn't belong on the discussion for this PR. Open a new issue or email your question to the mailing list.
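The "stack" idea can be sketched in plain NumPy (made-up shapes standing in for the grid described in the question below):

```python
import numpy as np

ntime, nlat, nlon = 24, 3, 4
data = np.arange(ntime * nlat * nlon, dtype=float).reshape(ntime, nlat, nlon)

# "stack": collapse lat/lon into one point axis -> (time, npoints)
series = data.reshape(ntime, nlat * nlon)

# apply a per-series function to every grid point at once, no lat/lon loop;
# here the function is a simple anomaly from the time mean
anomalies = series - series.mean(axis=0)

# "unstack": restore the original (time, lat, lon) layout
result = anomalies.reshape(ntime, nlat, nlon)
print(result.shape)   # (24, 3, 4)
```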

On May 14, 2016, at 12:56 PM, James Adams notifications@github.com wrote:

I would also like to do what is described below but so far have had little success using xarray.

I have time series data (x years of monthly values) at each lat/lon point of a grid (x*12 times, lons, lats). I want to apply a function f() against the time series to return a corresponding time series of values. I then write these values to an output NetCDF which corresponds to the input NetCDF in terms of dimensions and coordinate variables. So instead of looping over every lat and every lon I want to apply f() in a vectorized manner such as what's described for xarray's groupby (in order to gain the expected performance from using xarray for the split-apply-combine pattern), but it needs to work for more than a single dimension which is the current capability.

Has anyone done what is described above using xarray? What sort of performance gains can be expected using your approach?

Thanks in advance for any help with this topic. My apologies if there is a more appropriate forum for this sort of discussion (please redirect if so), as this may not be applicable to the original issue...

--James

On Wed, May 11, 2016 at 2:24 AM, naught101 notifications@github.com wrote:

I want to be able to run a scikit-learn model over a bunch of variables in a 3D (lat/lon/time) dataset, and return values for each coordinate point. Is something like this multi-dimensional groupby required (I'm thinking groupby(lat, lon) => 2D matrices that can be fed straight into scikit-learn), or is there already some other mechanism that could achieve something like this? Or is the best way at the moment just to create a null dataset, and loop over lat/lon and fill in the blanks as you go?



{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
219231028 https://github.com/pydata/xarray/pull/818#issuecomment-219231028 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxOTIzMTAyOA== monocongo 1328158 2016-05-14T16:56:37Z 2016-05-14T16:56:37Z NONE

I would also like to do what is described below but so far have had little success using xarray.

I have time series data (x years of monthly values) at each lat/lon point of a grid (x*12 times, lons, lats). I want to apply a function f() against the time series to return a corresponding time series of values. I then write these values to an output NetCDF which corresponds to the input NetCDF in terms of dimensions and coordinate variables. So instead of looping over every lat and every lon I want to apply f() in a vectorized manner such as what's described for xarray's groupby (in order to gain the expected performance from using xarray for the split-apply-combine pattern), but it needs to work for more than a single dimension which is the current capability.

Has anyone done what is described above using xarray? What sort of performance gains can be expected using your approach?

Thanks in advance for any help with this topic. My apologies if there is a more appropriate forum for this sort of discussion (please redirect if so), as this may not be applicable to the original issue...

--James

On Wed, May 11, 2016 at 2:24 AM, naught101 notifications@github.com wrote:

I want to be able to run a scikit-learn model over a bunch of variables in a 3D (lat/lon/time) dataset, and return values for each coordinate point. Is something like this multi-dimensional groupby required (I'm thinking groupby(lat, lon) => 2D matrices that can be fed straight into scikit-learn), or is there already some other mechanism that could achieve something like this? Or is the best way at the moment just to create a null dataset, and loop over lat/lon and fill in the blanks as you go?


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
219096410 https://github.com/pydata/xarray/pull/818#issuecomment-219096410 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxOTA5NjQxMA== shoyer 1217238 2016-05-13T16:42:58Z 2016-05-13T16:42:58Z MEMBER

> Why? This was in fact my original idea, but you encouraged me to use pd.cut instead. One thing I like about cut is that it is very flexible and well documented, while digitize is somewhat obscure.

If you're not going to use the labels it produces I'm not sure there's an advantage to pd.cut. Otherwise I thought they were pretty similar.

groupby_bins seems pretty reasonable.
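For comparison, a quick sketch of the two candidates being discussed (illustrative values only):

```python
import numpy as np
import pandas as pd

x = np.array([0.2, 1.5, 2.7])
edges = [0, 1, 2, 3]

# np.digitize returns bare integer bin indices (here 1-based for these edges)
idx = np.digitize(x, edges)
print(idx.tolist())                 # [1, 2, 3]

# pd.cut returns labeled right-closed intervals; .codes gives 0-based bins
cats = pd.cut(x, edges)
print(cats.codes.tolist())          # [0, 1, 2]
```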

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
219063079 https://github.com/pydata/xarray/pull/818#issuecomment-219063079 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxOTA2MzA3OQ== rabernat 1197350 2016-05-13T14:41:43Z 2016-05-13T14:41:43Z MEMBER

> @rabernat It's possibly a better idea to use np.digitize rather than pd.cut.

Why? This was in fact my original idea, but you encouraged me to use pd.cut instead. One thing I like about cut is that it is very flexible and well documented, while digitize is somewhat obscure.

What about ds.groupby_bins('lat', bins=lat_bins, labels=lat_labels)?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
218879360 https://github.com/pydata/xarray/pull/818#issuecomment-218879360 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxODg3OTM2MA== shoyer 1217238 2016-05-12T20:41:18Z 2016-05-12T20:41:18Z MEMBER

@rabernat It's possibly a better idea to use np.digitize rather than pd.cut.

I would strongly suggest controlling labeling with a keyword argument, maybe similar to diff.

Again, rather than further overloading the user facing API .groupby(), the binning is probably best expressed in a separate method. I would suggest a .bin(bins) method on Dataset/DataArray. Then you could just use a normal call to (multi-dimensional) groupby. So instead, we might have: ds.sample_tsurf.assign(lat_bin=ds.TLAT.bin(lat_bins)).groupby('lat_bin').mean().

On second thought, this is significantly more verbose, so maybe bins in the groupby call is OK.
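The assign-then-groupby pattern sketched above looks roughly like this in pandas (hypothetical column names and values; the proposed .bin() method does not exist):

```python
import pandas as pd

df = pd.DataFrame({"lat":   [22.0, 37.0, 44.0, 58.0],
                   "tsurf": [27.0, 24.0, 15.0, 11.0]})
lat_bins = [20, 30, 40, 50, 60]

# bin first, then an ordinary groupby over the new column
means = (df.assign(lat_bin=pd.cut(df["lat"], lat_bins))
           .groupby("lat_bin", observed=True)["tsurf"]
           .mean())
print(means.tolist())   # [27.0, 24.0, 15.0, 11.0]
```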

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
218806328 https://github.com/pydata/xarray/pull/818#issuecomment-218806328 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxODgwNjMyOA== shoyer 1217238 2016-05-12T16:10:04Z 2016-05-12T16:10:04Z MEMBER

Ah, of course -- forcing_data is a Dataset. You definitely want to pull out the DataArray first. Then .values is what you want.

On Wed, May 11, 2016 at 11:54 PM, naught101 notifications@github.com wrote:

forcing_data.isel(lat=lat, lon=lon).values() returns a ValuesView, which scikit-learn doesn't like. However, forcing_data.isel(lat=lat, lon=lon).to_array().T seems to work..


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
218756580 https://github.com/pydata/xarray/pull/818#issuecomment-218756580 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxODc1NjU4MA== rabernat 1197350 2016-05-12T13:27:38Z 2016-05-12T13:27:38Z MEMBER

I suppose I should also add a test for non-monotonic multidimensional binning.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
218756391 https://github.com/pydata/xarray/pull/818#issuecomment-218756391 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxODc1NjM5MQ== rabernat 1197350 2016-05-12T13:26:58Z 2016-05-12T13:26:58Z MEMBER

@jhamman: My latest commit followed @shoyer's suggestion to fix the "non-monotonic" error.

I successfully loaded your data and took a zonal average in 10-degree bins with the following code:

```python
>>> ds = xr.open_dataset('sample_for_xarray_multigroupby.nc', decode_times=False)
>>> lat_bins = np.arange(20, 90, 10)
>>> t_mean = ds.sample_tsurf.groupby('TLAT', bins=lat_bins).mean()
>>> t_mean
<xarray.DataArray 'sample_tsurf' (TLAT: 6)>
array([ 27.05354874,  24.00267499,  15.74423768,  11.16990181,
         6.45922212,   0.48820518])
Coordinates:
    time     float64 7.226e+05
    z_t      float64 250.0
  * TLAT     (TLAT) object '(20, 30]' '(30, 40]' '(40, 50]' '(50, 60]' ...
```

The only big remaining issue is the values of the new coordinate. Currently it is just using the labels output by pd.cut, which are strings. This means if I try t_mean.plot(), I get TypeError: Plotting requires coordinates to be numeric or dates.

We could either allow the user to specify labels by adding a labels keyword to groupby, or we could infer the labels automatically, e.g. by taking the centered mean of the bins:

```python
bin_labels = 0.5 * (lat_bins[1:] + lat_bins[:-1])
```

Please weigh in if you have an opinion about that.
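For concreteness, the centered-mean labeling with the bin edges from the example above gives:

```python
import numpy as np

lat_bins = np.arange(20, 90, 10)   # edges 20, 30, ..., 80
# center of each bin: a numeric label that t_mean.plot() could use
bin_labels = 0.5 * (lat_bins[1:] + lat_bins[:-1])
print(bin_labels.tolist())         # [25.0, 35.0, 45.0, 55.0, 65.0, 75.0]
```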

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
218675077 https://github.com/pydata/xarray/pull/818#issuecomment-218675077 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxODY3NTA3Nw== naught101 167164 2016-05-12T06:54:53Z 2016-05-12T06:54:53Z NONE

forcing_data.isel(lat=lat, lon=lon).values() returns a ValuesView, which scikit-learn doesn't like. However, forcing_data.isel(lat=lat, lon=lon).to_array().T seems to work..

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
218672116 https://github.com/pydata/xarray/pull/818#issuecomment-218672116 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxODY3MjExNg== shoyer 1217238 2016-05-12T06:34:56Z 2016-05-12T06:34:56Z MEMBER

@naught101 I was mixing up how to_dataframe() works. Please ignore it! (I edited my earlier post.)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
218663446 https://github.com/pydata/xarray/pull/818#issuecomment-218663446 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxODY2MzQ0Ng== shoyer 1217238 2016-05-12T05:27:11Z 2016-05-12T06:34:17Z MEMBER

@naught101 I would consider changing:

``` python
(forcing_data.isel(lat=lat, lon=lon)
             .to_dataframe()
             .drop(['lat', 'lon'], axis=1))
```

to just forcing_data.isel(lat=lat, lon=lon).values, because there's no point in creating a DataFrame with a bunch of variables you wouldn't use -- pandas will be pretty wasteful in allocating this.

Otherwise that looks pretty reasonable, given the limitations of current groupby support. Ideally, though, you could instead write something like:

``` python
def make_prediction(forcing_data_time_series):
    predicted_values = model.predict(forcing_data_time_series.values)
    return xr.DataArray(predicted_values, [flux_vars, time])

forcing_data.groupby(['lat', 'lon']).dask_apply(make_prediction)
```

This would do the 2D groupby, and then apply the predict function in parallel with dask. Sadly we don't have this feature yet, though :).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
218667702 https://github.com/pydata/xarray/pull/818#issuecomment-218667702 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxODY2NzcwMg== naught101 167164 2016-05-12T06:02:55Z 2016-05-12T06:02:55Z NONE

@shoyer: Where does times come from in that code?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
218654978 https://github.com/pydata/xarray/pull/818#issuecomment-218654978 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxODY1NDk3OA== naught101 167164 2016-05-12T04:02:43Z 2016-05-12T04:03:01Z NONE

Example forcing data:

```
<xarray.Dataset>
Dimensions:  (lat: 360, lon: 720, time: 2928)
Coordinates:
  * lon      (lon) float64 0.25 0.75 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75 ...
  * lat      (lat) float64 -89.75 -89.25 -88.75 -88.25 -87.75 -87.25 -86.75 ...
  * time     (time) datetime64[ns] 2012-01-01 2012-01-01T03:00:00 ...
Data variables:
    SWdown   (time, lat, lon) float64 446.5 444.9 445.3 447.8 452.4 456.3 ...
```

Where there might be an arbitrary number of data variables, and the scikit-learn input would be time (rows) by data variables (columns). I'm currently doing this:

``` python
def predict_gridded(model, forcing_data, flux_vars):
    """predict model results for gridded data

    :model: TODO
    :data: TODO
    :returns: TODO
    """
    # set prediction metadata
    prediction = forcing_data[list(forcing_data.coords)]

    # Arrays like (var, lon, lat, time)
    result = np.full([len(flux_vars),
                      forcing_data.dims['lon'],
                      forcing_data.dims['lat'],
                      forcing_data.dims['time']],
                     np.nan)
    print("predicting for lon: ")
    for lon in range(len(forcing_data['lon'])):
        print(lon, end=', ')
        for lat in range(len(forcing_data['lat'])):
            result[:, lon, lat, :] = model.predict(
                forcing_data.isel(lat=lat, lon=lon)
                            .to_dataframe()
                            .drop(['lat', 'lon'], axis=1)
            ).T
    print("")
    for i, fv in enumerate(flux_vars):
        prediction.update(
            {fv: xr.DataArray(result[i, :, :, :],
                              dims=['lon', 'lat', 'time'],
                              coords=forcing_data.coords)}
        )

    return prediction
```

and I think it's working (still debugging, and it's pretty slow running)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
218654283 https://github.com/pydata/xarray/pull/818#issuecomment-218654283 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxODY1NDI4Mw== shoyer 1217238 2016-05-12T03:58:48Z 2016-05-12T03:58:48Z MEMBER

@jhamman @rabernat I'm pretty sure there is a good reason for that check to verify monotonicity, although I can no longer remember exactly why!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
218653355 https://github.com/pydata/xarray/pull/818#issuecomment-218653355 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxODY1MzM1NQ== shoyer 1217238 2016-05-12T03:54:09Z 2016-05-12T03:54:09Z MEMBER

@naught101

I want to be able to run a scikit-learn model over a bunch of variables in a 3D (lat/lon/time) dataset, and return values for each coordinate point. Is something like this multi-dimensional groupby required (I'm thinking groupby(lat, lon) => 2D matrices that can be fed straight into scikit-learn), or is there already some other mechanism that could achieve something like this? Or is the best way at the moment just to create a null dataset, and loop over lat/lon and fill in the blanks as you go?

Can you clarify exactly what shape data you want to put into scikit-learn to make predictions? What are the dimensions of your input? In principle, this is exactly the sort of thing that multi-dimensional groupby should solve, although we might also need support for multiple arguments to handle lat/lon (this should not be too difficult).


For the bins argument, I would suggest a separate DataArray/Dataset method for creating the GroupBy object. The resample method in xarray should be updated to return a GroupBy object (like the pandas method), and extending resample to numbers would be a natural fit. Something like Dataset.resample(longitude=10) could be a good way to spell this. (We would deprecate the how, freq and dim arguments, and ideally make all the remaining arguments keyword only.)
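Independent of where the method ends up, the binning step itself can be sketched directly with pandas.cut (the values and edges below are made up for illustration):

``` python
import numpy as np
import pandas as pd

# Hypothetical longitude values and 10-degree bin edges
lon = np.array([3.0, 12.5, 17.0, 25.0])
edges = np.arange(0, 31, 10)  # 0, 10, 20, 30

# labels=False returns the integer bin index for each value
codes = pd.cut(lon, edges, labels=False)
# -> array([0, 1, 1, 2])
```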

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
218510748 https://github.com/pydata/xarray/pull/818#issuecomment-218510748 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxODUxMDc0OA== jhamman 2443309 2016-05-11T16:18:05Z 2016-05-11T16:18:05Z MEMBER

@rabernat - See link to 2d slice with coordinates below:

sample_for_xarray_multigroupby.nc.zip

As for the TODO, I see now that it was there before and I agree that we should be able to side step the sorted requirement.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
218450849 https://github.com/pydata/xarray/pull/818#issuecomment-218450849 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxODQ1MDg0OQ== rabernat 1197350 2016-05-11T12:56:47Z 2016-05-11T12:56:47Z MEMBER

@jhamman: Could you post [a slice of] your dataset for me to try?

It seems this is only an issue when I specify bins. I see that there is a TODO statement there so maybe that will fix this.

The TODO comment was there when I started working on this. The error is raised by these lines

``` python
index = safe_cast_to_index(group)
if not index.is_monotonic:
    # TODO: sort instead of raising an error
    raise ValueError('index must be monotonic for resampling')
```

I'm not sure this check is necessary for binning.
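Indeed, pd.cut itself imposes no ordering requirement on the values being binned, only on the bin edges, which supports dropping the check for the binning path (a quick sanity check with made-up values):

``` python
import numpy as np
import pandas as pd

# Deliberately unsorted values; only the edges must be monotonic
vals = np.array([5.0, 1.0, 9.0, 3.0])
codes = pd.cut(vals, [0, 4, 8, 12], labels=False)
# -> array([1, 0, 2, 0])
```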

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
218372591 https://github.com/pydata/xarray/pull/818#issuecomment-218372591 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxODM3MjU5MQ== naught101 167164 2016-05-11T06:24:11Z 2016-05-11T06:24:11Z NONE

I want to be able to run a scikit-learn model over a bunch of variables in a 3D (lat/lon/time) dataset, and return values for each coordinate point. Is something like this multi-dimensional groupby required (I'm thinking groupby(lat, lon) => 2D matrices that can be fed straight into scikit-learn), or is there already some other mechanism that could achieve something like this? Or is the best way at the moment just to create a null dataset, and loop over lat/lon and fill in the blanks as you go?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
218358050 https://github.com/pydata/xarray/pull/818#issuecomment-218358050 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIxODM1ODA1MA== jhamman 2443309 2016-05-11T04:21:16Z 2016-05-11T04:21:16Z MEMBER

@rabernat - Sorry this took so long. Comments as I play around with the new feature...

1. I was getting some strange memory errors when trying this multidimensional groupby on a large 4d ocean dataset (nlat: 720, nlon: 1280, time: 424, z_t: 45). My IPython kernel just kept dying. The command was ds.TEMP.groupby('TLONG'). My naive guess is that this was a memory issue where a large number of bins were created - I think your first checkbox above alluded to this possibility.
2. Operating on a 2d field (dropped time and z_t dims), I get the following error:

``` pytb
----> 1 da.groupby('TLAT', bins=[50, 60, 70, 80, 90])

/Users/jhamman/Dropbox/src/xarray/xarray/core/common.py in groupby(self, group, squeeze, bins)
    352         if isinstance(group, basestring):
    353             group = self[group]
--> 354         return self.groupby_cls(self, group, squeeze=squeeze, bins=bins)
    355
    356     def rolling(self, min_periods=None, center=False, **windows):

/Users/jhamman/Dropbox/src/xarray/xarray/core/groupby.py in __init__(self, obj, group, squeeze, grouper, bins)
    141         if not index.is_monotonic:
    142             # TODO: sort instead of raising an error
--> 143             raise ValueError('index must be monotonic for resampling')
    144         s = pd.Series(np.arange(index.size), index)
    145         if grouper is not None:

ValueError: index must be monotonic for resampling
```

It seems this is only an issue when I specify bins. I see that there is a TODO statement there so maybe that will fix this.

Based on the datasets I have handy right now, I think number 2 in my list is a show stopper, so I think we want to make sure that feature makes it into this PR.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
208631007 https://github.com/pydata/xarray/pull/818#issuecomment-208631007 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwODYzMTAwNw== jhamman 2443309 2016-04-12T00:08:27Z 2016-04-12T00:08:27Z MEMBER

This looks really promising. I've gone through the code for the first time and had just a few comments. I'll pull your branch down and give it a test drive on some real data.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
208092684 https://github.com/pydata/xarray/pull/818#issuecomment-208092684 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwODA5MjY4NA== rabernat 1197350 2016-04-10T23:39:29Z 2016-04-10T23:39:29Z MEMBER

@shoyer, @jhamman I think this is ready for a review

There are two distinct features added here:

1. groupby works with multidimensional coordinate variables. (See example at the top of the PR.)
2. groupby accepts a new keyword group_bins, which is passed to pandas.cut to digitize the groups (I have not documented this yet because I could use some feedback on the api). For now, the coordinates are labeled with the category labels determined by cut.

Using the example array above:

``` python
>>> da.groupby('lat', bins=[0,15,20]).apply(lambda x : x.sum())
<xarray.DataArray (lat: 2)>
array([1, 5])
Coordinates:
  * lat      (lat) object '(0, 15]' '(15, 20]'
```

I'm not sure this is the ideal behavior, since the categories are hard to slice. For my purposes, I would rather assign an integer or float index to each bin using e.g. the central value of the bin.

note: Both of these features have problems when used with shortcut=True.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
207983237 https://github.com/pydata/xarray/pull/818#issuecomment-207983237 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwNzk4MzIzNw== rabernat 1197350 2016-04-10T13:15:49Z 2016-04-10T13:15:49Z MEMBER

So I tracked down the cause of the original array dimensions being overwritten. It happens within _concat_shortcut here: https://github.com/pydata/xarray/blob/master/xarray/core/groupby.py#L325

``` python
result._coords[concat_dim.name] = as_variable(concat_dim, copy=True)
```

At this point, self.obj gets modified directly.

@shoyer should I just focus on the case where shortcut==False? Or should I try to debug the _concat_shortcut method? Your inline comments ("don't worry too much about maintaining this method") suggest that it is not going to be around forever.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
207531654 https://github.com/pydata/xarray/pull/818#issuecomment-207531654 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwNzUzMTY1NA== rabernat 1197350 2016-04-08T17:39:10Z 2016-04-08T18:07:11Z MEMBER

I have tried adding a new keyword bins arg to groupby, which should accomplish what I want and more. (It will also work on regular one-dimensional groupby operations.)

The way it works is like this:

``` python
>>> ar = xr.DataArray(np.arange(4), dims='dim_0')
>>> ar
<xarray.DataArray (dim_0: 4)>
array([0, 1, 2, 3])
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3
>>> ar.groupby('dim_0', bins=[2,4]).sum()
<xarray.DataArray (dim_0: 2)>
array([1, 5])
Coordinates:
  * dim_0    (dim_0) int64 2 4
```

The only problem is that it seems to overwrite the original dimension of the array! After calling groupby:

``` python
>>> ar
<xarray.DataArray (dim_0: 4)>
array([0, 1, 2, 3])
Coordinates:
  * dim_0    (dim_0) int64 2 4
```

I think that resample overcomes this issue by renaming the dimension: https://github.com/pydata/xarray/blob/master/xarray/core/common.py#L437

I guess something similar should be possible here...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
207503695 https://github.com/pydata/xarray/pull/818#issuecomment-207503695 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwNzUwMzY5NQ== shoyer 1217238 2016-04-08T16:29:58Z 2016-04-08T16:29:58Z MEMBER

@rabernat I'm not quite sure resample is the right place to put this, given that we aren't resampling on an axis. Just opened a pandas issue to discuss: https://github.com/pydata/pandas/issues/12828

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
207417668 https://github.com/pydata/xarray/pull/818#issuecomment-207417668 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwNzQxNzY2OA== rabernat 1197350 2016-04-08T12:41:00Z 2016-04-08T12:41:00Z MEMBER

@shoyer regarding the binning, should I modify resample to allow for non-time dimensions? Or a new function?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
207077942 https://github.com/pydata/xarray/pull/818#issuecomment-207077942 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwNzA3Nzk0Mg== rabernat 1197350 2016-04-07T20:34:53Z 2016-04-07T20:34:53Z MEMBER

The travis build failure is a conda problem, not my commit.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
207068032 https://github.com/pydata/xarray/pull/818#issuecomment-207068032 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwNzA2ODAzMg== rabernat 1197350 2016-04-07T20:03:48Z 2016-04-07T20:03:48Z MEMBER

I think I got it working.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
207021028 https://github.com/pydata/xarray/pull/818#issuecomment-207021028 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwNzAyMTAyOA== shoyer 1217238 2016-04-07T17:42:03Z 2016-04-07T17:42:26Z MEMBER

I think that if we unstack things properly (only once, instead of on each applied example) we should get something like this, alleviating the need for the new group name:

```
<xarray.DataArray (ny: 2, nx: 2)>
array([[ 0. , -0.5],
       [ 0.5,  0. ]])
Coordinates:
  * ny       (ny) int64 0 1
  * nx       (nx) int64 0 1
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
207000636 https://github.com/pydata/xarray/pull/818#issuecomment-207000636 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwNzAwMDYzNg== rabernat 1197350 2016-04-07T17:14:55Z 2016-04-07T17:14:55Z MEMBER

My new commit supports unstacking in apply with shortcut=True. However, the behavior is kind of weird, in a way that is unique to the multidimensional case.

Consider the behavior of the test case:

``` python
>>> da = xr.DataArray([[0,1],[2,3]],
...                   coords={'lon': (['ny','nx'], [[30,40],[40,50]]),
...                           'lat': (['ny','nx'], [[10,10],[20,20]])},
...                   dims=['ny','nx'])
>>> da.groupby('lon').apply(lambda x : x - x.mean(), shortcut=False)
<xarray.DataArray (lon_groups: 3, ny: 2, nx: 2)>
array([[[ 0. ,  nan],
        [ nan,  nan]],

       [[ nan, -0.5],
        [ 0.5,  nan]],

       [[ nan,  nan],
        [ nan,  0. ]]])
Coordinates:
  * ny          (ny) int64 0 1
  * nx          (nx) int64 0 1
    lat         (lon_groups, ny, nx) float64 10.0 nan nan nan nan 10.0 20.0 ...
    lon         (lon_groups, ny, nx) float64 30.0 nan nan nan nan 40.0 40.0 ...
  * lon_groups  (lon_groups) int64 30 40 50
```

When unstacking, the indices that are not part of the group get filled with nans. We are not able to put these arrays back together into a single array.

Note that if we do not rename the group name here: https://github.com/pydata/xarray/pull/818/files#diff-96b65e0bfec9fd2b9d562483f53661f5R121

Then we get an error here: https://github.com/pydata/xarray/pull/818/files#diff-96b65e0bfec9fd2b9d562483f53661f5R407

ValueError: the variable 'lon' has the same name as one of its dimensions ('lon', 'ny', 'nx'), but it is not 1-dimensional and thus it is not a valid index

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
206655187 https://github.com/pydata/xarray/pull/818#issuecomment-206655187 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwNjY1NTE4Nw== shoyer 1217238 2016-04-07T01:48:01Z 2016-04-07T01:48:01Z MEMBER

@rabernat That looks like exactly the right place to me.

We only use variables for the concatenation in the shortcut=True path. With shortcut=False, we use DataArray/Dataset objects. For now, get it working with shortcut=False (hard code it if necessary) and I can help figure out how to extend it to shortcut=True.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
206628737 https://github.com/pydata/xarray/pull/818#issuecomment-206628737 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwNjYyODczNw== rabernat 1197350 2016-04-07T00:14:17Z 2016-04-07T00:14:17Z MEMBER

@shoyer I'm having a tough time figuring out where to put the unstacking logic...maybe you can give me some advice.

My first idea was to add a method to the GroupBy class called _maybe_unstack_array and make a call to it here. The problem with that approach is that the group iteration happens over Variables, not full DataArrays, which means that unstacking is harder to do. Would need to store lots of metadata about the stacked / unstacked dimension names, sizes, etc.

If you think that is the right approach, I will forge ahead. But maybe, as the author of both the groupby and stack / unstack logic, you can see an easier way.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
206468443 https://github.com/pydata/xarray/pull/818#issuecomment-206468443 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwNjQ2ODQ0Mw== jhamman 2443309 2016-04-06T17:09:31Z 2016-04-06T17:09:31Z MEMBER

@rabernat - I don't have much to add right now but I've very excited about this addition. Once you've filled in few more of the features, ping me and I'll give it a full review and will test it out in some applications we have in house.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
206445686 https://github.com/pydata/xarray/pull/818#issuecomment-206445686 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwNjQ0NTY4Ng== shoyer 1217238 2016-04-06T16:13:01Z 2016-04-06T16:13:01Z MEMBER

(Oops, pressed the wrong button to close)

Can you clarify what you mean by this? At what point should the unstack happen?

Consider ds.groupby('latitude').apply(lambda x: x - x.mean()) or ds.groupby('latitude') - ds.groupby('latitude').mean() (these are two ways of writing the same thing). In each of these cases, the result of a groupby has the same dimensions as the original instead of replacing one or more of the original dimensions.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
206418244 https://github.com/pydata/xarray/pull/818#issuecomment-206418244 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwNjQxODI0NA== rabernat 1197350 2016-04-06T15:05:54Z 2016-04-06T15:05:54Z MEMBER

Let me try to clarify what I mean in item 2:

Allow specification of which dims to stack.

Say you have the following dataset

``` python
>>> ds = xr.Dataset(
...     {'temperature': (['time','nx'], [[1,1,2,2],[2,2,3,3]]),
...      'humidity': (['time','nx'], [[1,1,1,1],[1,1,1,1]])})
```

Now imagine you want to average humidity in temperature coordinates. (This might sound like a bizarre operation, but it is actually the foundation of a sophisticated sort of thermodynamic analysis.)

Currently this works as follows

``` python
>>> ds = ds.set_coords('temperature')
>>> ds.humidity.groupby('temperature').sum()
<xarray.DataArray 'humidity' (temperature: 3)>
array([2, 4, 2])
Coordinates:
  * temperature  (temperature) int64 1 2 3
```

However, this sums over all time. What if you wanted to preserve the time dependence, but replace the nx coordinate with temperature. I would like to be able to say

``` python
ds.humidity.groupby('temperature', group_over='nx').sum()
```

and get back a DataArray with dimensions ('time', 'temperature').

Maybe this is already possible with a sophisticated use of apply. But I don't see how to do it.
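For reference, the equivalent computation is straightforward in pandas on a flattened table. This is a workaround sketch using values matching the toy dataset above, not the proposed xarray API:

``` python
import pandas as pd

# Flattened version of the toy dataset: two time steps, four nx points each
df = pd.DataFrame({
    'time':        [0, 0, 0, 0, 1, 1, 1, 1],
    'temperature': [1, 1, 2, 2, 2, 2, 3, 3],
    'humidity':    [1, 1, 1, 1, 1, 1, 1, 1],
})

# Sum over nx (implicitly) while preserving time, by grouping on both keys
result = (df.groupby(['time', 'temperature'])['humidity']
            .sum()
            .unstack('temperature'))
# rows indexed by time, columns by temperature; missing combinations are NaN
```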

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
206389664 https://github.com/pydata/xarray/pull/818#issuecomment-206389664 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwNjM4OTY2NA== rabernat 1197350 2016-04-06T14:09:43Z 2016-04-06T14:09:43Z MEMBER

As for the specialized "grouper", I agree that that makes sense. It's basically an extension of resample from dates to floating point -- noting that pandas recently changed the resample API so it works a little more like groupby. pandas.cut could probably handle most of the logic here.

I normally used numpy.digitize for this type of thing, but pandas.cut indeed seems like the obvious choice.

Should this go into a separate PR?
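The practical difference between the two, sketched with toy values: numpy.digitize returns bare bin indices, while pandas.cut returns labelled intervals that a groupby can use directly.

``` python
import numpy as np
import pandas as pd

x = np.array([1.0, 4.0, 7.0])
edges = [0, 3, 6, 9]

# numpy.digitize: the index of the bin each value falls in (1-based here)
idx = np.digitize(x, edges)
# -> array([1, 2, 3])

# pandas.cut: labelled half-open intervals, ready to group on;
# the first category is the interval (0, 3]
cats = pd.cut(x, edges)
```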

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
206386864 https://github.com/pydata/xarray/pull/818#issuecomment-206386864 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwNjM4Njg2NA== rabernat 1197350 2016-04-06T14:04:20Z 2016-04-06T14:04:20Z MEMBER

This will need to unstack to handle .apply. That will be nice for things like normalization.

Can you clarify what you mean by this? At what point should the unstack happen?

With the current code, apply seems to work ok:

``` python

da.groupby('lon').apply(lambda x : (x**2).sum()) <xarray.DataArray (lon: 3)> array([0, 5, 9]) Coordinates: * lon (lon) int64 30 40 50 ```

But perhaps I am missing a certain use case you have in mind?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
206182013 https://github.com/pydata/xarray/pull/818#issuecomment-206182013 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwNjE4MjAxMw== shoyer 1217238 2016-04-06T07:31:32Z 2016-04-06T07:31:32Z MEMBER

This will need to unstack to handle .apply. That will be nice for things like normalization.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
206165090 https://github.com/pydata/xarray/pull/818#issuecomment-206165090 https://api.github.com/repos/pydata/xarray/issues/818 MDEyOklzc3VlQ29tbWVudDIwNjE2NTA5MA== shoyer 1217238 2016-04-06T07:05:05Z 2016-04-06T07:05:05Z MEMBER

Yes, this is awesome! I had a vague idea that stack could make something like this possible but hadn't really thought it through.

As for the specialized "grouper", I agree that that makes sense. It's basically an extension of resample from dates to floating point -- noting that pandas recently changed the resample API so it works a little more like groupby. pandas.cut could probably handle most of the logic here.
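The stack idea can be made concrete with a small sketch (reusing the toy array from elsewhere in this thread): flattening the two spatial dimensions turns the 2D lon coordinate into an ordinary 1D group variable.

``` python
import xarray as xr

da = xr.DataArray([[0, 1], [2, 3]],
                  coords={'lon': (['ny', 'nx'], [[30, 40], [40, 50]])},
                  dims=['ny', 'nx'])

# Stack ny/nx into a single dimension; 'lon' becomes 1D along it,
# so an ordinary 1D groupby over it now works.
stacked = da.stack(allpoints=['ny', 'nx'])
result = stacked.groupby('lon').sum()
# one value per unique lon: 30 -> 0, 40 -> 1 + 2, 50 -> 3
```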

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Multidimensional groupby 146182176
