id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
416962458,MDU6SXNzdWU0MTY5NjI0NTg=,2799,Performance: numpy indexes small amounts of data 1000x faster than xarray,1386642,open,0,,,42,2019-03-04T19:44:17Z,2024-03-18T17:51:25Z,,CONTRIBUTOR,,,,"Machine learning applications often require iterating over every index along some of the dimensions of a dataset. For instance, iterating over all the `(lat, lon)` pairs in a 4D dataset with dimensions `(time, level, lat, lon)`. Unfortunately, this is very slow with xarray objects compared to numpy (or h5py) arrays. When the Pangeo machine learning working group met [today](https://github.com/pangeo-data/ml-workflow-examples/issues/1), we found that several of us have struggled with this.
I made some simplified [benchmarks](https://gist.github.com/nbren12/e781c5a8fe03ee170628194c4b3c3160), which show that xarray is about 1000 times slower than numpy when repeatedly grabbing a small amount of data from an array. This is a problem with both `isel` and `[]` indexing. After doing some profiling, the main culprits seem to be xarray routines like `_validate_indexers` and `_broadcast_indexes`.
While python will always be slower than C when iterating over an array in this fashion, I would hope that xarray could be nearly as fast as numpy. I am not sure what the best way to improve this is though.
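For reference, a minimal sketch of the kind of micro-benchmark in the linked gist (the array shape and iteration count here are illustrative, not the exact gist code):
```python
import timeit

import numpy as np
import xarray as xr

arr = np.random.rand(100, 100)
da = xr.DataArray(arr, dims=['lat', 'lon'])

# repeatedly grab a single (lat, lon) point
t_numpy = timeit.timeit(lambda: arr[0, 0], number=10_000)
t_xarray = timeit.timeit(lambda: da.isel(lat=0, lon=0), number=10_000)
print(f'xarray/numpy slowdown: {t_xarray / t_numpy:.0f}x')
```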
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2799/reactions"", ""total_count"": 9, ""+1"": 9, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
856172272,MDU6SXNzdWU4NTYxNzIyNzI=,5144,Add chunks argument to {zeros/ones/empty}_like.,1386642,closed,0,,,5,2021-04-12T17:01:47Z,2023-10-25T03:18:05Z,2023-10-25T03:18:05Z,CONTRIBUTOR,,,,"**Describe the solution you'd like**
We have started using xarray objects as ""schema"" for initializing zarrs that will be written to using the `region` argument of `to_zarr`. For example,
```
output_schema.to_zarr(path, compute=False)
for region in regions:
    output = func(input_data.isel(region))
    output.to_zarr(path, region=region)
```
Currently, xarray's tools for computing the `output_schema` Dataset are lacking, since rechunking existing datasets can be slow. `dask.array.zeros_like` takes a `chunks` argument; could we add one here too?
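For illustration, a minimal sketch of the current workaround versus the proposed keyword (the `chunks` argument to `xr.zeros_like` in the last line is the proposal, not existing API):
```python
import dask.array as da
import numpy as np
import xarray as xr

ds = xr.Dataset({'a': (('time', 'x'), np.zeros((4, 6)))})

# current workaround: create the schema, then rechunk it (can be slow for large datasets)
schema = xr.zeros_like(ds).chunk({'time': 1, 'x': 3})

# dask already supports this at the array level
arr = da.zeros_like(np.zeros((4, 6)), chunks=(1, 3))

# proposed (hypothetical) signature, mirroring dask:
# schema = xr.zeros_like(ds, chunks={'time': 1, 'x': 3})
```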
**Describe alternatives you've considered**
`.chunk`","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5144/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
1473152374,I_kwDOAMm_X85XzoV2,7348,Using entry_points to register dataset and dataarray accessors?,1386642,open,0,,,4,2022-12-02T16:48:42Z,2023-09-14T19:53:46Z,,CONTRIBUTOR,,,,"### Is your feature request related to a problem?
External libraries often use the dataset/dataarray accessor pattern (e.g. [metpy](https://github.com/Unidata/MetPy/blob/f568aca6325cb23cfccc1006c4965ef7f7b5ad29/src/metpy/xarray.py#L105)). These accessors are not available until importing the external package where the registration occurs. This means scripts using these accessors must include an often-unused import that linters will complain about e.g.
```
import metpy # linter complains here
# some data
ds: xr.Dataset = ...
ds.metpy....
```
### Describe the solution you'd like
Use importlib entry points so that accessor registration is handled automatically when the external package is installed. This is currently enabled for array backends, but not for accessors (e.g. [metpy's setup.cfg](https://github.com/Unidata/MetPy/blob/f568aca6325cb23cfccc1006c4965ef7f7b5ad29/src/metpy/xarray.py#L105)).
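For illustration, a minimal sketch of how xarray could discover such accessors at import time (the `xarray.accessors` group name and the `register_external_accessors` helper are hypothetical; only the registration decorator is existing xarray API):
```python
from importlib.metadata import entry_points  # Python 3.10+ signature used below

import xarray as xr

# hypothetical entry point group that packages like metpy would advertise
ACCESSOR_GROUP = 'xarray.accessors'

def register_external_accessors():
    # each entry point maps an accessor name (e.g. 'metpy') to an accessor class
    for ep in entry_points(group=ACCESSOR_GROUP):
        xr.register_dataset_accessor(ep.name)(ep.load())
```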
### Describe alternatives you've considered
_No response_
### Additional context
_No response_","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/7348/reactions"", ""total_count"": 2, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 1}",,,13221727,issue
753852119,MDU6SXNzdWU3NTM4NTIxMTk=,4628,Lazy concatenation of arrays,1386642,open,0,,,5,2020-11-30T22:32:08Z,2022-05-10T17:02:34Z,,CONTRIBUTOR,,,,"**Is your feature request related to a problem? Please describe.**
Concatenating xarray objects forces the data to load. I recently learned about this [object](https://github.com/pydata/xarray/blob/235b2e5bcec253ca6a85762323121d28c3b06038/xarray/core/indexing.py#L592) allowing lazy indexing into DataArrays/Datasets without using dask. Concatenation along a single dimension is the inverse operation of slicing, so it seems natural to also support it. Also, concatenating along a new dimension (e.g. ""run""/""simulation""/""ensemble"") is a common merging workflow.
**Describe the solution you'd like**
`xr.concat([a, b], dim=...)` does not load any data in a or b.
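A minimal sketch of the desired behavior (file names illustrative; today the `concat` call would eagerly read both inputs unless they are dask-backed):
```python
import xarray as xr

a = xr.open_dataset('run_a.nc')  # hypothetical inputs
b = xr.open_dataset('run_b.nc')

# desired: build the concatenated view lazily, without reading any values
combined = xr.concat([a, b], dim='run')

# values would only be read on explicit access, e.g.:
subset = combined.isel(run=0).load()
```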
**Describe alternatives you've considered**
One could rename the variables in a and b to allow them to be merged (e.g. `a['air_temperature'] -> ""air_temperature_a""`), but it's more natural to make a new dimension.
**Additional context**
This is useful when not using dask for performance reasons (e.g. using another parallelism engine like Apache Beam).
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4628/reactions"", ""total_count"": 8, ""+1"": 8, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
588112617,MDU6SXNzdWU1ODgxMTI2MTc=,3894,Add public API for Dataset._copy_listed,1386642,open,0,,,15,2020-03-26T02:39:34Z,2022-04-18T16:41:39Z,,CONTRIBUTOR,,,,"In my data pipelines, I have been repeatedly burned using indexing notation to grab a few variables from a dataset in the following way:
```
ds = xr.Dataset(...)
vars = ('a', 'b', 'c')
ds[vars] # this errors
ds[list(vars)] # this is ok
```
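A minimal sketch of the kind of type-stable helper being requested (the standalone name `select_vars` is hypothetical; the private `Dataset._copy_listed` mentioned below already does essentially this):
```python
from typing import Hashable, Iterable

import xarray as xr

def select_vars(ds: xr.Dataset, names: Iterable[Hashable]) -> xr.Dataset:
    # always returns a Dataset, so type checkers can follow along
    return ds[list(names)]
```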
Moreover, because `Dataset.__getitem__` is type unstable, it makes it hard to detect this kind of error using mypy, so it often appears 30 minutes into a long data pipeline. It would be great to have a type-stable method that can take any sequence of variable names and return the Dataset consisting of those variables and their coordinates only. In fact, this method already [exists](https://github.com/pydata/xarray/blob/6378a711d50ba7f1ba9b2a451d4d1f5e1fb37353/xarray/core/dataset.py#L1123), but it is currently not public API. Could we make it so? Thanks.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3894/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
224846826,MDU6SXNzdWUyMjQ4NDY4MjY=,1387,FacetGrid with independent colorbars,1386642,open,0,,,7,2017-04-27T16:47:44Z,2022-04-13T11:07:49Z,,CONTRIBUTOR,,,,"Sometimes the magnitude of a variable can vary dramatically across a given coordinate, which makes 2d plots generated by xr.FacetGrid difficult to interpret. It would be useful if an option to xr.FacetGrid could be specified which allows each subplot to have its own colorbar.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1387/reactions"", ""total_count"": 2, ""+1"": 2, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1132894350,I_kwDOAMm_X85DhpiO,6269,Adding CDL Parser/`open_cdl`?,1386642,open,0,,,7,2022-02-11T17:31:36Z,2022-02-14T17:18:38Z,,CONTRIBUTOR,,,,"### Is your feature request related to a problem?
No.
### Describe the solution you'd like
It would be nice to load/generate xarray datasets from Common Data Language ([CDL][1]) descriptions. CDL is a DSL that defines a netCDF dataset, and it is quite nice for testing. We use it to build mock datasets for e.g. integration testing of plotting routines/complex data analysis etc. CDL provides a concise format for storing the schema of this data. This schema can be used for validation or generation (using the CLI `ncgen`).
CDL is basically the format produced by `xarray.Dataset.info`. It looks like this:
```
netcdf example {  // example of CDL notation
dimensions:
    lon = 3 ;
    lat = 8 ;
variables:
    float rh(lon, lat) ;
        rh:units = ""percent"" ;
        rh:long_name = ""Relative humidity"" ;
    // global attributes
    :title = ""Simple example, lacks some conventions"" ;
data:
    // optional ... ncgen will still build
    rh =
        2, 3, 5, 7, 11, 13, 17, 19,
        23, 29, 31, 37, 41, 43, 47, 53,
        59, 61, 67, 71, 73, 79, 83, 89 ;
}
```
I wrote a small pure python [parser](https://github.com/ai2cm/fv3net/blob/5c318e1a594a71baaa502ec4dc6809095b0828d3/external/vcm/vcm/cdl/parser.py#L1) for CDL last night and it seems to work! There are [similar projects](https://github.com/rockdoc/cdlparser) on github. Sadly, these projects seem to be abandoned, so it would be nice to attach this effort to a project like xarray.
[1]: https://www.unidata.ucar.edu/software/netcdf/workshops/most-recent/nc3model/Cdl.html
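For reference, a minimal sketch of the current workaround via the `ncgen` CLI (assuming `ncgen` from the netCDF tools is on PATH); a built-in `open_cdl` would avoid the round trip through a temporary file:
```python
import subprocess
import tempfile

import xarray as xr

def open_cdl(cdl_path: str) -> xr.Dataset:
    # generate a netCDF file from the CDL description with ncgen, then open it
    with tempfile.NamedTemporaryFile(suffix='.nc', delete=False) as tmp:
        subprocess.run(['ncgen', '-o', tmp.name, cdl_path], check=True)
    return xr.open_dataset(tmp.name)
```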
### Describe alternatives you've considered
Some kind of `schema` object that can be used to validate or generate an xarray Dataset, but does not contain any data.
### Additional context
_No response_","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6269/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
484863660,MDExOlB1bGxSZXF1ZXN0MzEwNjQxMzE0,3262,[WIP] Implement 1D to ND interpolation,1386642,closed,0,,,9,2019-08-24T21:23:21Z,2020-12-17T01:29:12Z,2020-12-17T01:29:12Z,CONTRIBUTOR,,0,pydata/xarray/pulls/3262,"
- [x] Closes #3252
- [ ] Tests added
- [ ] Passes `black . && mypy . && flake8`
- [ ] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3262/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
334366223,MDU6SXNzdWUzMzQzNjYyMjM=,2241,Slow performance with isel on stacked coordinates,1386642,closed,0,,,4,2018-06-21T07:13:32Z,2020-06-20T20:51:48Z,2020-06-20T20:51:48Z,CONTRIBUTOR,,,,"#### Code Sample
```python
>>> import numpy as np
>>> import xarray as xr
>>> a = xr.DataArray(np.random.rand(64, 64, 64), dims=list('xyz')).chunk({'x': 8, 'y': 8})
>>> b = a.stack(b=['x', 'y'])
>>> %timeit b.isel(b=0).load()
3.81 ms ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit a.isel(x=0, y=0).load()
822 µs ± 3.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> np.allclose(b.isel(b=0).values, a.isel(x=0, y=0).values)
True
```
#### Problem description
I have noticed some pretty significant slowdowns when using dask and stacked indices. As you can see in the example above, selecting the point x=0, y=0 takes about 4 times as long when the x and y dimensions are stacked together. This big difference only appears when `.load` is called. Does this mean it's a dask issue?
#### Output of ``xr.show_versions()``
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
xarray: 0.10.7
pandas: 0.22.0
numpy: 1.13.3
scipy: 1.0.0
netCDF4: 1.3.0
h5netcdf: 0.4.2
h5py: 2.7.1
Nio: None
zarr: 2.2.0
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.1
distributed: 1.21.1
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: 0.8.1
setuptools: 39.1.0
pip: 9.0.1
conda: None
pytest: 3.5.1
IPython: 6.2.1
sphinx: 1.6.5
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2241/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
636611699,MDExOlB1bGxSZXF1ZXN0NDMyNzU0MDQ5,4144,Improve typehints of xr.Dataset.__getitem__,1386642,closed,0,,,10,2020-06-10T23:33:41Z,2020-06-17T01:41:27Z,2020-06-15T11:25:53Z,CONTRIBUTOR,,0,pydata/xarray/pulls/4144,"To resolve some common type-related errors, this PR adds some overload type hints to `Dataset.__getitem__`. Now mypy can correctly infer that hashable inputs return DataArrays.
- [x] Closes #4125
- [x] Passes `isort -rc . && black . && mypy . && flake8`
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4144/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
631940742,MDU6SXNzdWU2MzE5NDA3NDI=,4125,Improving typing of `xr.Dataset.__getitem__`,1386642,closed,0,,,2,2020-06-05T20:40:39Z,2020-06-15T11:25:53Z,2020-06-15T11:25:53Z,CONTRIBUTOR,,,,"First, I'd like to thank the xarray devs for adding type hints to this library; not many libraries have this feature!
That said, the indexing notation of `xr.Dataset` does not currently play well with mypy since it returns a Union type. This results in a lot of mypy errors like this:
```
workflows/fine_res_budget/budget/budgets.py:284: error: Argument 6 to ""compute_recoarsened_budget_field"" has incompatible type ""Union[DataArray, Dataset]""; expected ""DataArray""
workflows/fine_res_budget/budget/budgets.py:285: error: Argument 1 to ""storage"" has incompatible type ""Union[DataArray, Dataset]""; expected ""DataArray""
workflows/fine_res_budget/budget/budgets.py:286: error: Argument ""unresolved_flux"" to ""compute_recoarsened_budget_field"" has incompatible type ""Union[DataArray, Dataset]""; expected ""DataArray""
workflows/fine_res_budget/budget/budgets.py:287: error: Argument ""saturation_adjustment"" to ""compute_recoarsened_budget_field"" has incompatible type ""Union[DataArray, Dataset]""; expected ""DataArray""
```
#### MCVE Code Sample
```
import xarray as xr

def func(ds: xr.Dataset):
    pass

ds: xr.Dataset = ...
# error:
# this line gives a type error because mypy doesn't know
# whether ds[['a', 'b']] is a Dataset or a DataArray
func(ds[['a', 'b']])
```
#### Expected Output
Mypy should be able to infer that `ds[['a', 'b']]` is a Dataset, and that `ds['a']` is a DataArray.
#### Problem Description
This requires any routine with type hints that consumes an output of `xr.Dataset.__getitem__` to accept a `Union[DataArray, Dataset]`, even if it is really only intended to be used with one of `DataArray` or `Dataset`. Because `ds[something]` is a ubiquitous syntax, this behavior accounts for approximately 50% of mypy errors in my xarray-heavy code.
#### Versions
Output of xr.show_versions()
In [1]: import xarray as xr
In [2]: xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.7 (default, May 7 2020, 21:25:33)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-1020-gcp
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.7.3
xarray: 0.15.1
pandas: 1.0.1
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.1.2
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.17.2
distributed: 2.17.0
matplotlib: 3.1.3
cartopy: 0.17.0
seaborn: 0.10.1
numbagg: None
setuptools: 46.4.0.post20200518
pip: 20.0.2
conda: 4.8.3
pytest: 5.4.2
IPython: 7.13.0
sphinx: None
# Potential solution
I think we can fix this with [typing.overload](https://docs.python.org/3/library/typing.html#typing.overload). I am not too familiar with it, but I think something like the following might work:
```
from typing import Hashable, List, overload

class Dataset:
    @overload
    def __getitem__(self, key: Hashable) -> DataArray: ...
    @overload
    def __getitem__(self, key: List[Hashable]) -> ""Dataset"": ...
    # actual implementation
    def __getitem__(self, key):
        ...
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4125/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
289837692,MDU6SXNzdWUyODk4Mzc2OTI=,1839,Add simple array creation functions for easier unit testing,1386642,closed,0,,,3,2018-01-19T01:53:20Z,2020-01-19T04:21:10Z,2020-01-19T04:21:10Z,CONTRIBUTOR,,,,"When I am writing unit tests for routines that involve `DataArray` objects many lines of code are devoted to creating mock objects. Here is an example of a unit test I recently wrote to test some code which computes the fluid derivative of a field given the velocity.
```python
import numpy as np
import xarray as xr

# material_derivative is the function under test (defined elsewhere)
def test_material_derivative():
    dims = ['x', 'y', 'z', 'time']
    coords = {dim: np.arange(10) for dim in dims}
    shape = [coords[dim].shape[0] for dim in coords]
    f = xr.Dataset({'f': (dims, np.ones(shape))}, coords=coords)
    f = f.f
    one = 0 * f + 1
    zero = 0 * f
    md = material_derivative(zero, one, zero, f.x + 0 * f)
    np.testing.assert_array_almost_equal(md.values, 0)
    md = material_derivative(one, zero, zero, f.x + 0 * f)
    np.testing.assert_array_almost_equal(md.isel(x=slice(1, -1)).values,
                                         one.isel(x=slice(1, -1)).values)
    md = material_derivative(zero, one, zero, f.y + 0 * f)
    np.testing.assert_array_almost_equal(md.isel(y=slice(1, -1)).values,
                                         one.isel(y=slice(1, -1)).values)
    md = material_derivative(zero, zero, one, f.z + 0 * f)
    np.testing.assert_array_almost_equal(md.isel(z=slice(1, -1)).values,
                                         one.isel(z=slice(1, -1)).values)
```
As you can see, I devote many lines to initializing a 4D data array of all ones, where all the coordinates are `np.arange(10)` objects. It isn't too hard to do this once, but it gets pretty annoying to do many times, especially when I forget how the DataArray and Dataset constructors work. Now, I can do something like `xr.DataArray(np.ones(...))`, but I would still have to initialize the coordinates if I use them.
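For illustration, a minimal sketch of the kind of helper that would simplify the setup above (the name and signature of `xr_ones` are hypothetical):
```python
import numpy as np
import xarray as xr

def xr_ones(sizes):
    # hypothetical helper: data of all ones, coords default to np.arange along each dim
    dims = list(sizes)
    coords = {dim: np.arange(n) for dim, n in sizes.items()}
    return xr.DataArray(np.ones([sizes[d] for d in dims]), dims=dims, coords=coords)

f = xr_ones({'x': 10, 'y': 10, 'z': 10, 'time': 10})
```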
In any case, having some sort of functions like `xr.ones`, `xr.zeros`, and `xr.rand` which initialize the coordinates and data would be very nice.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1839/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
497427114,MDU6SXNzdWU0OTc0MjcxMTQ=,3337,"Dataset.groupby reductions give ""Dataset does not contain dimensions error"" in v0.13",1386642,closed,0,,,1,2019-09-24T03:01:00Z,2019-10-10T18:23:22Z,2019-10-10T18:23:22Z,CONTRIBUTOR,,,,"#### MCVE Code Sample
```python
>>> import numpy as np
>>> import xarray as xr
>>> ds = xr.DataArray(np.ones((4, 5)), dims=['z', 'x']).to_dataset(name='a')
>>> ds.a.groupby('z').mean()
array([1., 1., 1., 1.])
Dimensions without coordinates: z
>>> ds.groupby('z').mean()
Traceback (most recent call last):
File """", line 1, in
File ""/Users/noah/miniconda3/envs/broken/lib/python3.7/site-packages/xarray/core/common.py"", line 91, in wrapped_func
**kwargs
File ""/Users/noah/miniconda3/envs/broken/lib/python3.7/site-packages/xarray/core/groupby.py"", line 848, in reduce
return self.apply(reduce_dataset)
File ""/Users/noah/miniconda3/envs/broken/lib/python3.7/site-packages/xarray/core/groupby.py"", line 796, in apply
return self._combine(applied)
File ""/Users/noah/miniconda3/envs/broken/lib/python3.7/site-packages/xarray/core/groupby.py"", line 800, in _combine
applied_example, applied = peek_at(applied)
File ""/Users/noah/miniconda3/envs/broken/lib/python3.7/site-packages/xarray/core/utils.py"", line 181, in peek_at
peek = next(gen)
File ""/Users/noah/miniconda3/envs/broken/lib/python3.7/site-packages/xarray/core/groupby.py"", line 795, in
applied = (func(ds, *args, **kwargs) for ds in self._iter_grouped())
File ""/Users/noah/miniconda3/envs/broken/lib/python3.7/site-packages/xarray/core/groupby.py"", line 846, in reduce_dataset
return ds.reduce(func, dim, keep_attrs, **kwargs)
File ""/Users/noah/miniconda3/envs/broken/lib/python3.7/site-packages/xarray/core/dataset.py"", line 3888, in reduce
""Dataset does not contain the dimensions: %s"" % missing_dimensions
ValueError: Dataset does not contain the dimensions: ['z']
>>> ds.dims
Frozen(SortedKeysDict({'z': 4, 'x': 5}))
```
#### Problem Description
Groupby reduction operations on `Dataset` objects no longer seem to work in xarray v0.13. In the example above, I create an xarray dataset with one dataarray called ""a"". The same groupby operation fails on this `Dataset`, but succeeds when called directly on ""a"". Is this a bug or an intended change?
In addition, the error message is confusing since `z` is one of the Dataset dimensions.
#### Output of ``xr.show_versions()``
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 14:38:56)
[Clang 4.0.1 (tags/RELEASE_401/final)]
python-bits: 64
OS: Darwin
OS-release: 18.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None
xarray: 0.13.0
pandas: 0.25.1
numpy: 1.17.2
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.2.0
pip: 19.2.3
conda: None
pytest: None
IPython: None
sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3337/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
261131958,MDExOlB1bGxSZXF1ZXN0MTQzNTExMTA3,1597,Add methods for combining variables of differing dimensionality,1386642,closed,0,,,46,2017-09-27T22:01:57Z,2019-07-05T15:59:51Z,2019-07-05T00:32:51Z,CONTRIBUTOR,,0,pydata/xarray/pulls/1597," - [x] Closes #1317
- [x] Tests added / passed
- [x] Passes ``git diff upstream/master | flake8 --diff``
- [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API
While working on #1317, I settled upon combining `stack` and `to_array` to create two-dimensional numpy arrays given an xarray Dataset. Unfortunately, `to_array` automatically broadcasts the variables of the dataset, which is not always desirable. For instance, I was trying to combine precipitation (a horizontal field) and temperature (a 3D field) into one array.
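For context, a minimal sketch of the `stack` + `to_array` combination described above and the broadcasting it triggers (field names and shapes illustrative):
```python
import numpy as np
import xarray as xr

ds = xr.Dataset({
    'precip': (('lat', 'lon'), np.random.rand(4, 5)),               # horizontal field
    'temperature': (('z', 'lat', 'lon'), np.random.rand(3, 4, 5)),  # 3D field
})

# to_array broadcasts every variable to the union of dimensions,
# so the 2D precip field gets replicated along z before stacking
data_matrix = ds.to_array().stack(samples=('lat', 'lon'))
```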
This PR enables this by adding two new methods to xarray:
- `Dataset.stack_cat`, and
- `DataArray.unstack_cat`.
`stack_cat` uses `stack`, `expand_dims`, and `concat` to reshape a Dataset into a DataArray with a helpful MultiIndex, and `unstack_cat` reverses the process.
I implemented this functionality as a new method since `to_array` is such a clean method already. I really appreciate your thoughts on this. Thanks!
cc @jhamman @shoyer ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1597/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
216215022,MDU6SXNzdWUyMTYyMTUwMjI=,1317,"API for reshaping DataArrays as 2D ""data matrices"" for use in machine learning",1386642,closed,0,,,9,2017-03-22T21:33:07Z,2019-07-05T00:32:51Z,2019-07-05T00:32:51Z,CONTRIBUTOR,,,,"Machine learning and linear algebra problems are often expressed in terms of operations on matrices rather than arrays of arbitrary dimension, and there is currently no convenient way to turn DataArrays (or combinations of DataArrays) into a single ""data matrix"".
As an example, I have needed to use scikit-learn lately with data from DataArray objects. Scikit-learn requires the data to be expressed in terms of simple 2-dimensional matrices. The rows are called samples, and the columns are known as features. It is annoying and error-prone to transpose and reshape a data array by hand to fit into this format. For instance, this [github repo for xarray aware sklearn-like objects](https://github.com/wy2136/xlearn) devotes many lines of code to massaging data arrays into data matrices. I think that this reshaping workflow might be common enough to warrant some kind of treatment in xarray.
I have written some code in this [gist](https://gist.github.com/nbren12/46767d237e3b1e59f7e2e1165c1e72c5) that I have found pretty convenient for doing this. The gist has an `XRReshaper` class which can be used for reshaping data to and from a matrix format. The basic usage for an EOF analysis of a dataset `A(lat, lon, time)` can be done like this
```python
feature_dims = ['lat', 'lon']
rs = XRReshaper(A)
data_matrix, _ = rs.to(feature_dims)
# Some linear algebra or machine learning
_, _, eofs = svd(data_matrix)
eofs_dataarray = rs.get(eofs[0], ['mode'] + feature_dims)
```
I am not sure this is the best API, but it seems to work pretty well and I have used it [here](https://github.com/nbren12/gnl/blob/master/gnl/xlearn.py) to implement some xarray-aware sklearn-like objects for PCA, which can be used like
```
feature_dims = ['lat', 'lon']
pca = XPCA(feature_dims, n_components=10, weight=cos(A.lat))
pca.fit(A)
pca.transform(A)
eofs = pca.components_
```
Another syntax which might be helpful is some kind of context manager approach like
```python
with XRReshaper(A) as (rs, data_matrix):
# do some stuff with data_matrix
# use rs to restore output to a data array.
```
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1317/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
294089233,MDExOlB1bGxSZXF1ZXN0MTY2OTQ5Nzcw,1885,Raise when pcolormesh coordinate is not sorted,1386642,closed,0,,,18,2018-02-03T06:37:34Z,2018-02-18T19:26:36Z,2018-02-18T19:06:31Z,CONTRIBUTOR,,0,pydata/xarray/pulls/1885," - [x] Closes #1852
- [x] Tests added (for all bug fixes or enhancements)
- [x] Tests passed (for all non-documentation changes)
I added a simple warning to `_infer_interval_breaks` in `xarray/plot/plot.py`. The warning does not currently say the name of the coordinate, because that would require introducing a new function or potentially passing a name argument, which seems overly complicated for such a small edit. Hopefully, this isn't a problem because the user can easily figure out which coordinate is not sorted by process of elimination.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1885/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
291103680,MDU6SXNzdWUyOTExMDM2ODA=,1852,bug: 2D pcolormesh plots are wrong when coordinate is not ascending order,1386642,closed,0,,,9,2018-01-24T07:01:07Z,2018-02-18T19:06:31Z,2018-02-18T19:06:31Z,CONTRIBUTOR,,,,"#### Code Sample, a copy-pastable example if possible
```python
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
x = np.arange(10)
y = np.arange(20)
np.random.shuffle(x)
x = xr.DataArray(x, dims=['x'], coords={'x': x})
y = xr.DataArray(y, dims=['y'], coords={'y': y})
z = x + y
z_sorted = z.isel(x=np.argsort(x.values))
# make plot
fig, axs = plt.subplots(1, 2, figsize=(6, 3))
z_sorted.plot(ax=axs[0])
axs[0].set_title(""X is sorted"")
z.plot(ax=axs[1])
axs[1].set_title(""X is not sorted"")
plt.tight_layout()
```
#### Problem description
Sometimes the coordinates in an xarray dataset are not sorted in ascending order. I recently had an issue where the time coordinate of a 2D dataset was scrambled, so calling `x.plot` gave very strange results. In my opinion, `x.plot` should probably sort the data along its coordinates, or at least provide a warning if the coordinates are unsorted.
#### Expected Output
Here is the image generated by the snippet above:

The left and right panels should be the same.
#### Output of ``xr.show_versions()``
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
xarray: 0.10.0+dev50.ga988dc2
pandas: 0.20.3
numpy: 1.13.1
scipy: 0.19.1
netCDF4: 1.3.1
h5netcdf: 0.5.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.15.2
distributed: 1.18.3
matplotlib: 2.0.2
cartopy: None
seaborn: 0.8.0
setuptools: 36.5.0.post20170921
pip: 9.0.1
conda: 4.3.29
pytest: 3.2.1
IPython: 6.1.0
sphinx: 1.6.3
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1852/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
258640421,MDU6SXNzdWUyNTg2NDA0MjE=,1577,Potential error in apply_ufunc docstring for input_core_dims,1386642,closed,0,,,5,2017-09-18T22:28:10Z,2017-10-10T04:42:21Z,2017-10-10T04:42:21Z,CONTRIBUTOR,,,,"The documentation for `input_core_dims` reads:
```
input_core_dims : Sequence[Sequence], optional
List of the same length as ``args`` giving the list of core dimensions
on each input argument that should be broadcast. By default, we assume
there are no core dimensions on any input arguments.
For example ,``input_core_dims=[[], ['time']]`` indicates that all
dimensions on the first argument and all dimensions other than 'time'
on the second argument should be broadcast.
```
The first and second paragraphs seem contradictory to me. Shouldn't the first paragraph be changed to:
```
List of the same length as ``args`` giving the list of core dimensions
on each input argument that should *not* be broadcast. By default, we assume
there are no core dimensions on any input arguments.
```","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1577/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue