id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
686608969,MDU6SXNzdWU2ODY2MDg5Njk=,4380,Error when rechunking from Zarr store,6130352,closed,0,,,5,2020-08-26T20:53:05Z,2023-11-12T05:50:29Z,2023-11-12T05:50:29Z,NONE,,,,"My assumption for this is that it should be possible to:

1. Write to a zarr store with some chunk size along a dimension
2. Load from that zarr store and rechunk to a multiple of that chunk size
3. Write that result to another zarr store

However I see this behavior instead:

```python
import xarray as xr
import dask.array as da

ds = xr.Dataset(dict(
    x=xr.DataArray(da.random.random(size=100, chunks=10), dims='d1')
))

# Write the store
ds.to_zarr('/tmp/ds1.zarr', mode='w') 

# Read it out, rechunk it, and attempt to write it again
xr.open_zarr('/tmp/ds1.zarr').chunk(chunks=dict(d1=20)).to_zarr('/tmp/ds2.zarr', mode='w')

ValueError: Final chunk of Zarr array must be the same size or smaller than the first. 
Specified Zarr chunk encoding['chunks']=(10,), for variable named 'x' but (20, 20, 20, 20, 20) 
in the variable's Dask chunks ((20, 20, 20, 20, 20),) is incompatible with this encoding. 
Consider either rechunking using `chunk()` or instead deleting or modifying `encoding['chunks']`.
```

<details><summary>Full trace</summary>
<pre>
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-122-e185759d81c5> in <module>
----> 1 xr.open_zarr('/tmp/ds1.zarr').chunk(chunks=dict(d1=20)).to_zarr('/tmp/ds2.zarr', mode='w')

/opt/conda/lib/python3.7/site-packages/xarray/core/dataset.py in to_zarr(self, store, mode, synchronizer, group, encoding, compute, consolidated, append_dim)
   1656             compute=compute,
   1657             consolidated=consolidated,
-> 1658             append_dim=append_dim,
   1659         )
   1660 

/opt/conda/lib/python3.7/site-packages/xarray/backends/api.py in to_zarr(dataset, store, mode, synchronizer, group, encoding, compute, consolidated, append_dim)
   1351     writer = ArrayWriter()
   1352     # TODO: figure out how to properly handle unlimited_dims
-> 1353     dump_to_store(dataset, zstore, writer, encoding=encoding)
   1354     writes = writer.sync(compute=compute)
   1355 

/opt/conda/lib/python3.7/site-packages/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1126         variables, attrs = encoder(variables, attrs)
   1127 
-> 1128     store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
   1129 
   1130 

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    411         self.set_dimensions(variables_encoded, unlimited_dims=unlimited_dims)
    412         self.set_variables(
--> 413             variables_encoded, check_encoding_set, writer, unlimited_dims=unlimited_dims
    414         )
    415 

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in set_variables(self, variables, check_encoding_set, writer, unlimited_dims)
    466                 # new variable
    467                 encoding = extract_zarr_variable_encoding(
--> 468                     v, raise_on_invalid=check, name=vn
    469                 )
    470                 encoded_attrs = {}

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in extract_zarr_variable_encoding(variable, raise_on_invalid, name)
    214 
    215     chunks = _determine_zarr_chunks(
--> 216         encoding.get(""chunks""), variable.chunks, variable.ndim, name
    217     )
    218     encoding[""chunks""] = chunks

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in _determine_zarr_chunks(enc_chunks, var_chunks, ndim, name)
    154             if dchunks[-1] > zchunk:
    155                 raise ValueError(
--> 156                     ""Final chunk of Zarr array must be the same size or ""
    157                     ""smaller than the first. ""
    158                     f""Specified Zarr chunk encoding['chunks']={enc_chunks_tuple}, ""

ValueError: Final chunk of Zarr array must be the same size or smaller than the first. Specified Zarr chunk encoding['chunks']=(10,), for variable named 'x' but (20, 20, 20, 20, 20) in the variable's Dask chunks ((20, 20, 20, 20, 20),) is incompatible with this encoding. Consider either rechunking using `chunk()` or instead deleting or modifying `encoding['chunks']`.
</pre>
</details>


Overwriting chunks on `open_zarr` with `overwrite_encoded_chunks=True` works but I don't want that because it requires providing a uniform chunk size for all variables.  This workaround seems to be fine though:

```python
ds = xr.open_zarr('/tmp/ds1.zarr')
del ds.x.encoding['chunks']
ds.chunk(chunks=dict(d1=20)).to_zarr('/tmp/ds2.zarr', mode='w')
```

Does `encoding['chunks']` serve any purpose after you've loaded a zarr store and all the variables are defined as dask arrays?  In other words, Is there any harm in deleting it from all dask variables if I want those variables to write back out to zarr using the dask chunk definitions instead?

**Environment**:

<details><summary>Output of <tt>xr.show_versions()</tt></summary>
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jun  1 2020, 18:57:50) 
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-42-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: None

xarray: 0.16.0
pandas: 1.0.5
numpy: 1.19.0
scipy: 1.5.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.21.0
distributed: 2.21.0
matplotlib: 3.3.0
cartopy: None
seaborn: 0.10.1
numbagg: None
pint: None
setuptools: 47.3.1.post20200616
pip: 20.1.1
conda: 4.8.2
pytest: 5.4.3
IPython: 7.15.0
sphinx: 3.2.1


</details>
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4380/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
598991028,MDU6SXNzdWU1OTg5OTEwMjg=,3967,Support static type analysis ,6130352,closed,0,,,4,2020-04-13T16:34:43Z,2023-09-17T19:43:32Z,2023-09-17T19:43:31Z,NONE,,,,"As a related discussion to https://github.com/pydata/xarray/issues/3959, I wanted to see what possibilities exist for a user or API developer building on Xarray to enforce Dataset/DataArray structure through static analysis.

In my specific scenario, I would like to model several different types of data in my domain as Dataset objects, but I'd like to be able enforce that names and dtypes associated with both data variables and coordinates meet certain constraints. 

@keewis mentioned an example of this in https://github.com/pydata/xarray/issues/3959#issuecomment-612076605 where it might be possible to use something like a ```TypedDict``` to constrain variable/coord names and array dtypes, but this won't work with TypedDict as it's currently implemented.  Another possibility could be generics, and I took a stab at that in https://github.com/pydata/xarray/issues/3959#issuecomment-612513722 (though this would certainly be more intrusive). 

An example of where this would be useful is in adding extensions through accessors:

```python
@xr.register_dataset_accessor('ext')
def ExtAccessor:
    def __init__(self, ds)
        self.data = ds
    
    def is_zero(self):
        return self.ds['data'] == 0

ds = xr.Dataset(dict(DATA=xr.DataArray([0.0])))
# I'd like to catch that ""data"" was misspelled as ""DATA"" and that 
# this particular method shouldn't be run against floats prior to runtime
ds.ext.is_zero() 
```

I probably care more about this as someone looking to build an API on top of Xarray, but I imagine typical users would find a solution to this problem beneficial too.

There is a related conversation on doing something like this for Pandas DataFrames at https://github.com/python/typing/issues/28#issuecomment-351284520, so that might be helpful context for possibilities with ```TypeDict```.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3967/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,not_planned,13221727,issue
759709924,MDU6SXNzdWU3NTk3MDk5MjQ=,4663,Fancy indexing a Dataset with dask DataArray triggers multiple computes,6130352,closed,0,,,8,2020-12-08T19:17:08Z,2023-03-15T02:48:01Z,2023-03-15T02:48:01Z,NONE,,,,"It appears that boolean arrays (or any slicing array presumably) are evaluated many more times than necessary when applied to multiple variables in a Dataset.  Is this intentional?  Here is an example that demonstrates this:

```python

# Use a custom array type to know when data is being evaluated
class Array():
    
    def __init__(self, x):
        self.shape = (x.shape[0],)
        self.ndim = x.ndim
        self.dtype = 'bool'
        self.x = x
        
    def __getitem__(self, idx):
        if idx[0].stop > 0:
            print('Evaluating')
        return (self.x > .5).__getitem__(idx)

# Control case -- this shows that the print statement is only reached once
da.from_array(Array(np.random.rand(100))).compute();
# Evaluating

# This usage somehow results in two evaluations of this one array?
ds = xr.Dataset(dict(
    a=('x', da.from_array(Array(np.random.rand(100))))
))
ds.sel(x=ds.a)
# Evaluating
# Evaluating
# <xarray.Dataset>
# Dimensions:  (x: 51)
# Dimensions without coordinates: x
# Data variables:
#     a        (x) bool dask.array<chunksize=(51,), meta=np.ndarray>

# The array is evaluated an extra time for each new variable
ds = xr.Dataset(dict(
    a=('x', da.from_array(Array(np.random.rand(100)))),
    b=(('x', 'y'), da.random.random((100, 10))),
    c=(('x', 'y'), da.random.random((100, 10))),
    d=(('x', 'y'), da.random.random((100, 10))),
))
ds.sel(x=ds.a)
# Evaluating
# Evaluating
# Evaluating
# Evaluating
# Evaluating
# <xarray.Dataset>
# Dimensions:  (x: 48, y: 10)
# Dimensions without coordinates: x, y
# Data variables:
#     a        (x) bool dask.array<chunksize=(48,), meta=np.ndarray>
#     b        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>
#     c        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>
#     d        (x, y) float64 dask.array<chunksize=(48, 10), meta=np.ndarray>
```

Given that slicing is already not lazy, why does the same predicate array need to be computed more than once? 

@tomwhite originally pointed this out in https://github.com/pystatgen/sgkit/issues/299.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4663/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
884209406,MDU6SXNzdWU4ODQyMDk0MDY=,5286,Zarr chunks would overlap multiple dask chunks error,6130352,closed,0,,,3,2021-05-10T13:20:46Z,2021-05-12T16:16:05Z,2021-05-12T16:16:05Z,NONE,,,,"Would it be possible to get an explanation on how this situation results in a zarr chunk overlapping multiple dask chunks?  

This code below is generating an array with 2 chunks, selecting one row from each chunk, and then writing that resulting two row array back to zarr.  I don't see how it's possible in this case for one zarr chunk to correspond to different dask chunks.  There are clearly two resulting dask chunks, two input zarr chunks, and a correspondence between them that should be 1 to 1 ... what does this error message really mean then?

```python
import xarray as xr
import dask.array as da

ds = xr.Dataset(dict(
    x=(('a', 'b'), da.ones(shape=(10, 10), chunks=(5, 10))),
)).assign(a=list(range(10)))
ds
# <xarray.Dataset>
# Dimensions:  (a: 10, b: 10)
# Coordinates:
#   * a        (a) int64 0 1 2 3 4 5 6 7 8 9
# Dimensions without coordinates: b
# Data variables:
#     x        (a, b) float64 dask.array<chunksize=(5, 10), meta=np.ndarray>

# Write the dataset out
!rm -rf /tmp/test.zarr
ds.to_zarr('/tmp/test.zarr')

# Read it back in, subset to 1 record in two different chunks (two rows total), write back out
!rm -rf /tmp/test2.zarr
xr.open_zarr('/tmp/test.zarr').sel(a=[0, 11]).to_zarr('/tmp/test2.zarr')
# NotImplementedError: Specified zarr chunks encoding['chunks']=(5, 10) for variable named 'x' would overlap multiple dask chunks ((1, 1), (10,)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.
```

Also what is the difference between ""deleting or modifying `encoding['chunks']`"" 
 and ""specify `safe_chunks=False`""?  That wasn't clear to me in https://github.com/pydata/xarray/issues/5056.

Lastly and most importantly, can data be corrupted when using parallel zarr writes and just deleting `encoding['chunks']` in these situations?

**Environment**:

<details><summary>Output of <tt>xr.show_versions()</tt></summary>

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46) 
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-16-cloud-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.18.0
pandas: 1.2.4
numpy: 1.20.2
scipy: 1.6.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.1
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.04.1
distributed: 2021.04.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.1
conda: None
pytest: 6.2.4
IPython: 7.23.1
sphinx: None


</details>
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5286/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
869792877,MDU6SXNzdWU4Njk3OTI4Nzc=,5229,Index level naming bug with `concat`,6130352,closed,0,,,2,2021-04-28T10:29:34Z,2021-04-28T19:38:26Z,2021-04-28T19:38:26Z,NONE,,,,"There is an inconsistency with how indexes are generated in a concat operation:

```python
def transform(df):
    return (
        df.to_xarray()
        .set_index(index=['id1', 'id2'])
        .pipe(lambda ds: xr.concat([
            ds.isel(index=ds.year == v)
            for v in ds.year.to_series().unique()
        ], dim='dates'))
    )

df1 = pd.DataFrame(dict(
    id1=[1,2,1,2],
    id2=[1,2,1,2],
    data=[1,2,3,4],
    year=[2019, 2019, 2020, 2020]
))
transform(df1)
<xarray.Dataset>
Dimensions:  (dates: 2, index: 2)
Coordinates:
  * index    (index) MultiIndex
  - id1      (index) int64 1 2
  - id2      (index) int64 1 2
Dimensions without coordinates: dates
Data variables:
    data     (dates, index) int64 1 2 3 4
    year     (dates, index) int64 2019 2019 2020 2020


df2 = pd.DataFrame(dict(
    id1=[1,2,1,2],
    id2=[1,2,1,3], # These don't quite align now
    data=[1,2,3,4],
    year=[2019, 2019, 2020, 2020]
))
transform(df2)
<xarray.Dataset>
Dimensions:        (dates: 2, index: 3)
Coordinates:
  * index          (index) MultiIndex
  - index_level_0  (index) int64 1 2 2   #  These names are now different from id1, id2
  - index_level_1  (index) int64 1 2 3
Dimensions without coordinates: dates
Data variables:
    data           (dates, index) float64 1.0 2.0 nan 3.0 nan 4.0
    year           (dates, index) float64 2.019e+03 2.019e+03 ... nan 2.02e+03
```

It only appears to happen when values in a multiindex for the datasets being concatenated differ.

**Environment**:

<details><summary>Output of <tt>xr.show_versions()</tt></summary>

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) 
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-16-cloud-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.8.0

xarray: 0.17.0
pandas: 1.1.1
numpy: 1.20.2
scipy: 1.6.2
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.30.0
distributed: 2.20.0
matplotlib: 3.3.3
cartopy: None
seaborn: 0.11.1
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.0.1
conda: None
pytest: 6.2.3
IPython: 7.22.0
sphinx: None


</details>
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/5229/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
727623263,MDU6SXNzdWU3Mjc2MjMyNjM=,4529,Dataset constructor with DataArray triggers computation,6130352,closed,0,,,5,2020-10-22T18:27:24Z,2021-02-19T23:13:57Z,2021-02-19T23:13:57Z,NONE,,,,"Is it intentional that creating a Dataset with a DataArray and dimension names for a single variable causes computation of that variable?  In other words, why does ```xr.Dataset(dict(a=('d0', xr.DataArray(da.random.random(10)))))``` cause the dask array to compute?

A longer example:

```python
import dask.array as da
import xarray as xr
x = da.random.randint(1, 10, size=(100, 25))
ds = xr.Dataset(dict(a=xr.DataArray(x, dims=('x', 'y'))))
type(ds.a.data)
dask.array.core.Array

# Recreate the dataset with the same array, but also redefine the dimensions
ds2 = xr.Dataset(dict(a=(('x', 'y'), ds.a))
type(ds2.a.data)
numpy.ndarray
```

","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4529/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
660112216,MDU6SXNzdWU2NjAxMTIyMTY=,4238,Missing return type annotations,6130352,closed,0,,,1,2020-07-18T12:09:06Z,2020-08-19T20:32:37Z,2020-08-19T20:32:37Z,NONE,,,,"[Dataset.to_dataframe](https://github.com/pydata/xarray/blob/1be777fe725a85b8cc0f65a2bc41f4bc2ba18043/xarray/core/dataset.py#L4536) should have a return type hint like [DataArray.to_dataframe](https://github.com/pydata/xarray/blob/1be777fe725a85b8cc0f65a2bc41f4bc2ba18043/xarray/core/dataarray.py#L2368).

Similarly, can [concat](https://github.com/pydata/xarray/blob/1be777fe725a85b8cc0f65a2bc41f4bc2ba18043/xarray/core/concat.py#L11) have a `Union[Dataset, DataArray]` return type or is it more complicated than that?
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/4238/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
597475005,MDU6SXNzdWU1OTc0NzUwMDU=,3959,Extending Xarray for domain-specific toolkits,6130352,closed,0,,,10,2020-04-09T18:34:34Z,2020-04-13T16:36:33Z,2020-04-13T16:36:32Z,NONE,,,,"Hi, I have a question about how to design an API over Xarray for a domain-specific use case (in genetics).  Having seen the following now:

- [Extending xarray](http://xarray.pydata.org/en/stable/internals.html#extending-xarray)
- [subclassing DataSet?](https://groups.google.com/forum/#!topic/xarray/wzprk6M-Mfg)
- [Subclassing Dataset and DataArray (issue #706)](https://github.com/pydata/xarray/issues/706)
- [Decorators for registering custom accessors in xarray (PR #806)](https://github.com/pydata/xarray/pull/806)

I wanted to reach out and seek some advice on what I'd like to do given that I don't think any of the solutions there are what I'm looking for.

More specifically, I would like to model the datasets we work with as xr.Dataset subtypes but I'd like to enforce certain preconditions for those types as well as support conversions between them.  An example would be that I may have a domain-specific type ```GenotypeDataset``` that should always contain 3 DataArrays and each of those arrays should meet different dtype and dimensionality constraints.  That type may be converted to another type, say ```HaplotypeDataset```, where the underlying data goes through some kind of transformation to produce a lower dimensional form more amenable to a specific class of algorithms.

One API I envision around these models consists of functions that enforce nominal typing on Xarray classes, so in that case I don't actually care if my subtypes are preserved by Xarray when operations are run.  It would be nice if that subtyping wasn't lost but I can understand that it's a limitation for now.  Here's an example of what I mean:

```python
from genetics import api

arr1 = ??? # some 3D integer DataArray of allele indices
arr2 = ??? # A missing data boolean DataArray
arr3 = ??? # Some other domain-specific stuff like variant phasing
ds = api.GenotypeDataset(arr1, arr2, arr3)

# A function that would be in the API would look like:
def analyze_haplotype(ds: xr.Dataset) -> xr.Dataset:
    # Do stuff assuming that the user has supplied a dataset compliant with 
    # the ""HaplotypeDataset"" constraints
    pass 

analyze_haplotype(ds.to_haplotype_dataset())
```

I like the idea of trying to avoid requiring API-specific data structures for all functionality in favor of conventions over Xarray data structures.   I think conveniences like these subtypes would be great for enforcing those conventions (rather than checking at the beginning of each function) as well as making it easier to go between representations, but I'm certainly open to suggestion.  I think something akin to structural subtyping that extends to what arrays are contained in the Dataset, how coordinates are named, what datatypes are used, etc. would be great but I have no idea if that's possible.

All that said, is it still a bad idea to try to subclass Xarray data structures even if the intent was never to touch any part of the internal APIs?  I noticed Xarray does some stuff like ```type(array)(...)``` internally but that's the only catch I've found so far (which I worked around by dispatching to constructors based on the arguments given).

cc: @alimanfoo - Alistair raised some concerns about trying this to me, so he may have some thoughts here too","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3959/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
569176457,MDU6SXNzdWU1NjkxNzY0NTc=,3791,Self joins with non-unique indexes,6130352,closed,0,,,5,2020-02-21T20:47:35Z,2020-03-26T17:51:35Z,2020-03-05T19:32:38Z,NONE,,,,"Hi, is there a good way to self join arrays?

For example, given a dataset like this:

```python
import pandas as pd
df = pd.DataFrame(dict(
    x=[1, 1, 2, 2], 
    y=['1', '1', '2', '2'], 
    z=['a', 'b', 'c', 'd']))
df
```
![Screen Shot 2020-02-21 at 2 58 57 PM](https://user-images.githubusercontent.com/6130352/75069403-1c63a300-54bf-11ea-8a11-106d499c035e.png)

I am not looking for the pandas ```concat``` behavior for alignment:

```python
pd.concat([
    df.set_index(['x', 'y'])[['z']].rename(columns={'z': 'z_x'}),
    df.set_index(['x', 'y'])[['z']].rename(columns={'z': 'z_y'})
], axis=1, join='inner')
```
![Screen Shot 2020-02-21 at 2 58 40 PM](https://user-images.githubusercontent.com/6130352/75069417-238ab100-54bf-11ea-9903-dd7c91465f27.png)

but rather the ```merge``` behavior for a join by index:

```python
pd.merge(df, df, on=['x', 'y'])
```
![Screen Shot 2020-02-21 at 2 58 46 PM](https://user-images.githubusercontent.com/6130352/75069415-22f21a80-54bf-11ea-8653-4a1ce92e14f7.png)


I tried using ```xarray.merge``` but that seems to give the behavior like ```concat``` (i.e. alignment and not joining).  Even if it is possible, it's a large dataset that I need to process out-of-core via dask, and I have found that it takes some elbow grease to get this working with dask dataframes by ensuring that the number of partitions is set well and that the divisions are known prior to joining by index.  Should I expect that this sort of operation will work well with xarray (if it is possible) knowing that it's hard enough to do directly with dask without hitting OOM errors?

#### Output of ``xr.show_versions()``
<details>

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) 
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-28-generic
machine: x86_64
processor: 
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.7.3

xarray: 0.15.0
pandas: 0.25.2
numpy: 1.17.2
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.3.2
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.1
dask: 2.11.0
distributed: 2.11.0
matplotlib: 3.1.1
cartopy: None
seaborn: 0.9.0
numbagg: None
setuptools: 45.2.0.post20200209
pip: 20.0.2
conda: None
pytest: None
IPython: 7.12.0
sphinx: None
</details>
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/3791/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue