id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
425320466,MDU6SXNzdWU0MjUzMjA0NjY=,2852,Allow grouping by dask variables,10595679,open,0,,,10,2019-03-26T09:55:19Z,2022-04-18T15:45:41Z,,CONTRIBUTOR,,,,"#### Code Sample, a copy-pastable example if possible

I am using `xarray` in combination with `dask distributed` on a cluster, so a minimal code sample demonstrating my problem is not easy to come up with.

#### Problem description

Here is what I observe:

1. In my environment, `dask distributed` is correctly set up with auto-scaling. I can verify this by loading data into `xarray` and using aggregation functions like `mean()`: this triggers auto-scaling, and the dask dashboard shows that the processing is spread across the worker nodes.
2. I have the following `xarray` dataset called `geoms_ds`:

```
Dimensions:  (x: 10980, y: 10980)
Coordinates:
  * y        (y) float64 4.9e+06 4.9e+06 4.9e+06 ... 4.79e+06 4.79e+06 4.79e+06
  * x        (x) float64 3e+05 3e+05 3e+05 ... 4.098e+05 4.098e+05 4.098e+05
Data variables:
    label    (y, x) uint16 dask.array
```

I load it with the following code sample:

```python
import xarray as xr

geoms = xr.open_rasterio('test_rasterization_T31TCJ_uint16.tif', chunks={'band': 1, 'x': 10980, 'y': 200})
geoms_squeez = geoms.isel(band=0).squeeze().drop(labels='band')
geoms_ds = geoms_squeez.to_dataset(name='label')
```

The `label` array holds a finite number of integer values denoting groups (or classes, if you like). I would like to perform statistics on those groups (with additional variables), such as the mean value of a given variable for each group.
3. I can do this perfectly for a single group using `.where(label=xxx).mean('variable')`; this behaves as expected, triggering auto-scaling and a dask task graph.
4. The problem is that I have a lot of groups (or classes), and looping through all of them to apply `where()` is not very efficient. From my reading of the `xarray` documentation, `groupby` is what I need to perform stats on all groups at once (see the sketch at the end of this report).
5. When I try to use `geoms_ds.groupby('label').size()`, for instance, here is what I observe:
* Grouping is not lazy; it is evaluated immediately,
* Grouping is not performed through `dask distributed`; only the master node is working, on a single thread,
* The grouping operation takes a large amount of time and eats a large amount of memory (nearly 30 GB, which is a lot more than what is required to store the full dataset in memory),
* Most of the time, the grouping fails with the following errors and warnings:

```
distributed.utils_perf - WARNING - full garbage collections took 52% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 48% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 50% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 53% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 56% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 56% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 57% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 57% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 57% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 57% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 58% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 58% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 59% CPU time recently (threshold: 10%)
WARNING:dask_jobqueue.core:Worker tcp://10.135.39.92:51747 restart in Job 2758934. This can be due to memory issue.
distributed.utils - ERROR - 'tcp://10.135.39.92:51747'
Traceback (most recent call last):
  File ""/work/logiciels/projets/eolab/conda/eolab/lib/python3.6/site-packages/distributed/utils.py"", line 648, in log_errors
    yield
  File ""/work/logiciels/projets/eolab/conda/eolab/lib/python3.6/site-packages/distributed/scheduler.py"", line 1360, in add_worker
    yield self.handle_worker(comm=comm, worker=address)
  File ""/work/logiciels/projets/eolab/conda/eolab/lib/python3.6/site-packages/tornado/gen.py"", line 1133, in run
    value = future.result()
  File ""/work/logiciels/projets/eolab/conda/eolab/lib/python3.6/site-packages/tornado/gen.py"", line 326, in wrapper
    yielded = next(result)
  File ""/work/logiciels/projets/eolab/conda/eolab/lib/python3.6/site-packages/distributed/scheduler.py"", line 2220, in handle_worker
    worker_comm = self.stream_comms[worker]
KeyError: ...
```

I assume this comes from the process being killed by PBS for excessive memory usage.

#### Expected Output

I would expect the following:

* a single call to `groupby` is lazily evaluated,
* evaluation of the aggregation function is performed through `dask distributed`,
* the dataset is not that large; even on a single master thread, the computation should complete in a reasonable time.

#### Output of ``xr.show_versions()``
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 03:09:43) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2
xarray: 0.11.3
pandas: 0.24.1
numpy: 1.16.1
scipy: 1.2.0
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
PseudonetCDF: None
rasterio: 1.0.15
cfgrib: None
iris: None
bottleneck: None
cyordereddict: None
dask: 1.1.1
distributed: 1.25.3
matplotlib: 3.0.2
cartopy: 0.17.0
seaborn: 0.9.0
setuptools: 40.7.1
pip: 19.0.1
conda: None
pytest: None
IPython: 7.1.1
sphinx: None
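
To make the two approaches concrete, here is a minimal sketch (assuming a hypothetical list `class_values` holding the distinct labels present in the raster; the names are illustrative, not part of my actual pipeline):

```python
# Minimal sketch, assuming `geoms_ds` from above and a hypothetical
# list `class_values` of the distinct labels present in the raster.

# Workaround: one lazy pass per class. This stays lazy and runs through
# dask distributed, but scans the whole array once per class.
per_class_size = {v: (geoms_ds['label'] == v).sum() for v in class_values}

# What I would like to behave the same way (lazy and distributed):
sizes = geoms_ds.groupby('label').count()  # currently evaluated eagerly, on the master only
```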
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2852/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue 428374352,MDExOlB1bGxSZXF1ZXN0MjY2NzU2OTEw,2865,BUG: Fix #2864 by adding the missing vrt parameters,10595679,closed,0,,,5,2019-04-02T18:22:07Z,2019-04-11T16:24:17Z,2019-04-11T16:24:13Z,CONTRIBUTOR,,0,pydata/xarray/pulls/2865," - [x] Closes #2864 - [x] Tests added - [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2865/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull 428300345,MDU6SXNzdWU0MjgzMDAzNDU=,2864,Bug in WarpedVRT support of open_rasterio(),10595679,closed,0,,,3,2019-04-02T15:37:08Z,2019-04-11T16:24:13Z,2019-04-11T16:24:13Z,CONTRIBUTOR,,,,"#### Code Sample, a copy-pastable example if possible Using the following data: https://gitlab.orfeo-toolbox.org/orfeotoolbox/otb/blob/develop/Data/Input/QB_Toulouse_Ortho_XS.tif ```python import rasterio as rio from rasterio.crs import CRS from rasterio.warp import calculate_default_transform,aligned_target from rasterio.enums import Resampling from rasterio.vrt import WarpedVRT import xarray as xr path = 'QB_Toulouse_Ortho_XS.tif' # Read input metadata with rasterio with rio.open(path) as src: print('Input file CRS is {}'.format(src.crs)) print('Input file shape is {}'.format(src.shape)) print('Input file transform is {}'.format(src.transform)) # Create a different CRS dst_crs = CRS.from_epsg(2154) # Compute a transform that will resample to dst_crs and change resolution to dst_crs transform, width, height = calculate_default_transform( src.crs, dst_crs, src.width, src.height,resolution=20, *src.bounds) # Fill vrt options as shown in https://rasterio.readthedocs.io/en/stable/topics/virtual-warping.html vrt_options = { 'resampling': Resampling.cubic, 'crs': dst_crs, 'transform': transform, 'height': height, 'width': width } # Create a WarpedVRT using the vrt_options with WarpedVRT(src,**vrt_options) as vrt: print('VRT shape is {}'.format(vrt.shape)) # Open VRT with xarray ds = xr.open_rasterio(vrt) # Shape does not match vrt shape: print(ds) ``` Output: ``` $ python test_rio_vrt.py Input file CRS is EPSG:32631 Input file shape is (500, 500) Input file transform is | 0.60, 0.00, 374149.98| | 0.00,-0.60, 4829183.99| | 0.00, 0.00, 1.00| VRT shape is (16, 16) [1000000 values with dtype=int16] Coordinates: * band (band) int64 1 2 3 4 * y (y) float64 6.28e+06 6.28e+06 6.28e+06 ... 6.279e+06 6.279e+06 * x (x) float64 5.741e+05 5.741e+05 5.741e+05 ... 5.744e+05 5.744e+05 Attributes: transform: (0.6003151072155879, 0.0, 574068.2261249251, 0.0, -0.6003151... crs: EPSG:2154 res: (0.6003151072155879, 0.6003151072155879) is_tiled: 0 nodatavals: (nan, nan, nan, nan) ``` #### Problem description In the above example, `xarray.open_rasterio()` is asked to read a `WarpedVRT` created with `rasterio`. This `WarpedVRT` has custom `transform`, `width` and `height`, which is a very common use case of `WarpedVRT` where you upsample or downsample your data on the fly during read. 
Alas, `open_rasterio()` ignores those custom attributes, which results in a wrong `DataArray` shape: as the output above shows, the `DataArray` keeps the source raster's `(band: 4, y: 500, x: 500)` shape instead of following the VRT.

#### Expected Output

Correct output should be a `DataArray` whose spatial dimensions match the VRT shape:

```
VRT shape is (16, 16)
```

This is fairly easy to obtain by modifying the following lines in `open_rasterio()`:

https://github.com/pydata/xarray/blob/0c73a380745c4792ab440eb020f78f203897abe5/xarray/backends/rasterio_.py#L222

with the following:

```python
vrt_params = dict(crs=vrt.crs.to_string(),
                  resampling=vrt.resampling,
                  src_nodata=vrt.src_nodata,
                  dst_nodata=vrt.dst_nodata,
                  tolerance=vrt.tolerance,
                  # Edit
                  transform=vrt.transform,
                  width=vrt.width,
                  height=vrt.height,
                  # End edit
                  warp_extras=vrt.warp_extras)
```

I can provide this patch in a pull request if needed (a small verification sketch follows the version info below).

#### Output of ``xr.show_versions()``
>>> xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 03:09:43) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.2.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.utf8
LOCALE: fr_FR.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2
xarray: 0.11.3
pandas: 0.24.1
numpy: 1.16.1
scipy: 1.2.0
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
PseudonetCDF: None
rasterio: 1.0.17
cfgrib: None
iris: None
bottleneck: None
cyordereddict: None
dask: 1.1.1
distributed: 1.25.3
matplotlib: 3.0.2
cartopy: 0.17.0
seaborn: 0.9.0
setuptools: 40.7.3
pip: 19.0.1
conda: None
pytest: 4.2.0
IPython: 7.1.1
sphinx: None
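
To make the expected behaviour concrete, here is a minimal verification sketch, reusing `src` and `vrt_options` from the code sample above (inside the same `with rio.open(path) as src:` block):

```python
# Minimal sketch, reusing `src` and `vrt_options` from the example above.
# With the patched `vrt_params`, the DataArray opened from the WarpedVRT
# should follow the VRT grid instead of the source raster grid.
with WarpedVRT(src, **vrt_options) as vrt:
    ds = xr.open_rasterio(vrt)
    # dims are (band, y, x); the spatial dims should match vrt.shape == (height, width)
    assert ds.shape[1:] == vrt.shape
```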
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2864/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue