html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/pull/818#issuecomment-219231028,https://api.github.com/repos/pydata/xarray/issues/818,219231028,MDEyOklzc3VlQ29tbWVudDIxOTIzMTAyOA==,1328158,2016-05-14T16:56:37Z,2016-05-14T16:56:37Z,NONE,"I would also like to do what is described below, but so far have had little success using xarray.

I have time series data (x years of monthly values) at each lat/lon point of a grid (x*12 times, lons, lats). I want to apply a function f() to the time series at each point, returning a corresponding time series of values, and then write those values to an output NetCDF that matches the input NetCDF's dimensions and coordinate variables. So instead of looping over every lat and every lon, I want to apply f() in a vectorized manner, like what is described for xarray's groupby (in order to gain the expected performance of xarray's split-apply-combine pattern), but it needs to work over more than the single dimension that is currently supported.
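For illustration, numpy's apply_along_axis can express this kind of per-gridcell operation without explicit lat/lon loops (a minimal sketch with made-up data; f() here is just a placeholder anomaly function, not a real analysis):

``` python
import numpy as np

# made-up forcing data shaped (time, lat, lon)
data = np.random.rand(24, 3, 4)

def f(series):
    # placeholder per-gridcell function: subtract the time mean
    return series - series.mean()

# apply f along the time axis (axis 0) at every lat/lon point in one call
result = np.apply_along_axis(f, 0, data)
```

The result keeps the (time, lat, lon) shape, so it can be wrapped back into a DataArray with the input's dims and coords before writing to NetCDF.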
Has anyone done what is described above using xarray? What sort of performance gains can be expected with your approach?

Thanks in advance for any help with this topic. My apologies if there is a more appropriate forum for this sort of discussion (please redirect me if so), as this may not be applicable to the original issue...

--James
On Wed, May 11, 2016 at 2:24 AM, naught101 notifications@github.com wrote:
> I want to be able to run a scikit-learn model over a bunch of variables in a 3D (lat/lon/time) dataset, and return values for each coordinate point. Is something like this multi-dimensional groupby required (I'm thinking groupby(lat, lon) => 2D matrices that can be fed straight into scikit-learn), or is there already some other mechanism that could achieve something like this? Or is the best way at the moment just to create a null dataset, and loop over lat/lon and fill in the blanks as you go?
>
> https://github.com/pydata/xarray/pull/818#issuecomment-218372591
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,146182176
https://github.com/pydata/xarray/pull/818#issuecomment-218675077,https://api.github.com/repos/pydata/xarray/issues/818,218675077,MDEyOklzc3VlQ29tbWVudDIxODY3NTA3Nw==,167164,2016-05-12T06:54:53Z,2016-05-12T06:54:53Z,NONE,"`forcing_data.isel(lat=lat, lon=lon).values()` returns a `ValuesView`, which scikit-learn doesn't like. However, `forcing_data.isel(lat=lat, lon=lon).to_array().T` seems to work.
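A minimal sketch of the shapes involved (the variables here are made up, just to show why `to_array().T` gives scikit-learn what it wants):

``` python
import numpy as np
import xarray as xr

# two made-up forcing variables at a single grid cell
ds = xr.Dataset({
    'SWdown': ('time', np.arange(4.0)),
    'Tair': ('time', np.arange(4.0) + 273.0),
})

# Dataset.values() is a dict-style view of the variables, not an array;
# to_array() stacks them into one DataArray, and .T puts time on the rows
X = ds.to_array().T  # dims: (time, variable), shape (4, 2)
```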
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,146182176
https://github.com/pydata/xarray/pull/818#issuecomment-218667702,https://api.github.com/repos/pydata/xarray/issues/818,218667702,MDEyOklzc3VlQ29tbWVudDIxODY2NzcwMg==,167164,2016-05-12T06:02:55Z,2016-05-12T06:02:55Z,NONE,"@shoyer: Where does `times` come from in that code?
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,146182176
https://github.com/pydata/xarray/pull/818#issuecomment-218654978,https://api.github.com/repos/pydata/xarray/issues/818,218654978,MDEyOklzc3VlQ29tbWVudDIxODY1NDk3OA==,167164,2016-05-12T04:02:43Z,2016-05-12T04:03:01Z,NONE,"Example forcing data:
```
Dimensions: (lat: 360, lon: 720, time: 2928)
Coordinates:
* lon (lon) float64 0.25 0.75 1.25 1.75 2.25 2.75 3.25 3.75 4.25 4.75 ...
* lat (lat) float64 -89.75 -89.25 -88.75 -88.25 -87.75 -87.25 -86.75 ...
* time (time) datetime64[ns] 2012-01-01 2012-01-01T03:00:00 ...
Data variables:
SWdown (time, lat, lon) float64 446.5 444.9 445.3 447.8 452.4 456.3 ...
```
There may be an arbitrary number of data variables, and the scikit-learn input should be time (rows) by data variables (columns). I'm currently doing this:
``` python
def predict_gridded(model, forcing_data, flux_vars):
    """"""Predict model results for gridded data.

    :model: fitted scikit-learn-style model with a .predict() method
    :forcing_data: xarray.Dataset of forcing variables over (time, lat, lon)
    :flux_vars: names of the predicted flux variables
    :returns: xarray.Dataset of predictions with the input's coordinates
    """"""
    # Start from a dataset carrying only the input's coordinates
    prediction = forcing_data[list(forcing_data.coords)]
    # Result array shaped (var, lon, lat, time), filled cell by cell
    result = np.full([len(flux_vars),
                      forcing_data.dims['lon'],
                      forcing_data.dims['lat'],
                      forcing_data.dims['time']],
                     np.nan)
    print(""predicting for lon: "")
    for lon in range(len(forcing_data['lon'])):
        print(lon, end=', ')
        for lat in range(len(forcing_data['lat'])):
            # One (time x variable) matrix per grid cell; drop the scalar
            # lat/lon columns so only forcing variables reach the model
            result[:, lon, lat, :] = model.predict(
                forcing_data.isel(lat=lat, lon=lon)
                .to_dataframe()
                .drop(['lat', 'lon'], axis=1)
            ).T
    print("""")
    for i, fv in enumerate(flux_vars):
        prediction.update(
            {fv: xr.DataArray(result[i, :, :, :],
                              dims=['lon', 'lat', 'time'],
                              coords=forcing_data.coords)}
        )
    return prediction
```
and I think it's working (still debugging, and it runs pretty slowly).
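As an aside on speed: the inner double loop can often be collapsed into a single predict call by flattening all grid cells into one sample matrix (a rough numpy-only sketch; DummyModel and the array shapes below are made up to keep it self-contained, not the code above):

``` python
import numpy as np

class DummyModel:
    '''Stand-in for a fitted scikit-learn model predicting 2 flux vars.'''
    def predict(self, X):
        # X: (n_samples, n_features) -> (n_samples, 2)
        return np.stack([X.sum(axis=1), X.mean(axis=1)], axis=1)

n_time, n_lat, n_lon, n_vars = 8, 3, 4, 5
# made-up forcing array shaped (time, lat, lon, var)
forcing = np.random.rand(n_time, n_lat, n_lon, n_vars)

# flatten every (time, lat, lon) cell into one big sample matrix
X = forcing.reshape(-1, n_vars)       # (time*lat*lon, var)
pred = DummyModel().predict(X)        # one call instead of lat*lon calls
# unflatten back to (time, lat, lon, flux_var)
result = pred.reshape(n_time, n_lat, n_lon, -1)
```

Whether this helps depends on the model; many scikit-learn predictors are much faster on one big matrix than on thousands of small ones.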
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,146182176
https://github.com/pydata/xarray/pull/818#issuecomment-218372591,https://api.github.com/repos/pydata/xarray/issues/818,218372591,MDEyOklzc3VlQ29tbWVudDIxODM3MjU5MQ==,167164,2016-05-11T06:24:11Z,2016-05-11T06:24:11Z,NONE,"I want to be able to run a scikit-learn model over a bunch of variables in a 3D (lat/lon/time) dataset, and return values for each coordinate point. Is something like this multi-dimensional groupby required (I'm thinking groupby(lat, lon) => 2D matrices that can be fed straight into scikit-learn), or is there already some other mechanism that could achieve something like this? Or is the best way at the moment just to create a null dataset, and loop over lat/lon and fill in the blanks as you go?
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,146182176