html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/2560#issuecomment-445482824,https://api.github.com/repos/pydata/xarray/issues/2560,445482824,MDEyOklzc3VlQ29tbWVudDQ0NTQ4MjgyNA==,514522,2018-12-08T19:13:08Z,2018-12-08T19:13:08Z,CONTRIBUTOR,"Sorry guys. I've found the problem and the solution. The problem is that the filesystem does not support file locking. The solution is to export the following variable: `export HDF5_USE_FILE_LOCKING=FALSE`.","{""total_count"": 2, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 2, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,383057458
https://github.com/pydata/xarray/issues/2560#issuecomment-440909054,https://api.github.com/repos/pydata/xarray/issues/2560,440909054,MDEyOklzc3VlQ29tbWVudDQ0MDkwOTA1NA==,514522,2018-11-22T04:25:05Z,2018-11-22T04:25:05Z,CONTRIBUTOR,https://github.com/limix/limix/blob/2.0.0/limix/qtl/test/test_qtl_xarr.py,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,383057458
https://github.com/pydata/xarray/issues/2410#issuecomment-422368970,https://api.github.com/repos/pydata/xarray/issues/2410,422368970,MDEyOklzc3VlQ29tbWVudDQyMjM2ODk3MA==,514522,2018-09-18T12:17:06Z,2018-09-18T12:17:06Z,CONTRIBUTOR,"I will first try to have both together. I'm well aware that learning by example is effective (that is true for me at least, and apparently for most people: see the tldr library), so at first I will try to combine everything in one page: 1. Starts with examples, going from simple ones to more complicated ones, with no definitions whatsoever. 2. Begins a section defining terms and giving examples that elucidate them (the first section we have here) 3. 
Ends with a formal description of the algorithm (the second section we have here)

I prefer starting with 2 and 3 for me to actually understand xarray...","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,359240638
https://github.com/pydata/xarray/issues/2410#issuecomment-421998857,https://api.github.com/repos/pydata/xarray/issues/2410,421998857,MDEyOklzc3VlQ29tbWVudDQyMTk5ODg1Nw==,514522,2018-09-17T12:40:52Z,2018-09-17T12:40:52Z,CONTRIBUTOR,"I have updated mainly the _Indexing and selecting data_ section. I'm proposing an indexing notation that uses the `[]` operator vs the `()` function call to differentiate between position-based and label-based dimension lookup. But more importantly, I'm working out a precise definition of data array indexing in the section _Formal indexing definition_.

# Xarray definition

A **data array** `a` has `D` dimensions, ordered from `0` to `D-1`. It contains an array of dimensionality `D`. The first dimension of that array is associated with the first dimension of the data array, and so forth. That array is returned by the data array attribute `values`.

A **named data array** is a data array whose `name` attribute is a string:

```python
>>> import xarray as xr
>>>
>>> a = xr.DataArray([[0, 1], [2, 3], [4, 5]])
>>> a.name = ""My name""
>>> a
<xarray.DataArray 'My name' (dim_0: 3, dim_1: 2)>
array([[0, 1],
       [2, 3],
       [4, 5]])
Dimensions without coordinates: dim_0, dim_1
```

Each data array dimension has a unique `name` attribute of string type and can be accessed via the data array's `dims` attribute, which is a tuple. The name of dimension `i` is `a.dims[i]`:

```python
>>> a.dims[0]
'dim_0'
```

A data array can have zero or more coordinates, represented by a dict-like `coords` attribute. A coordinate is a named data array, also referred to as a **coordinate data array**. Coordinate data arrays have unique names among other coordinate data arrays.
A coordinate data array of name `x` can be retrieved by `a.coords[x]`. A coordinate can have zero or more dimensions associated with it.

A **dimension data array** is a unidimensional coordinate data array associated with one, and only one, dimension having the same name as the coordinate data array itself. A dimension data array always has one, and only one, coordinate. That coordinate, in turn, has a dimension data array associated with it:

```python
>>> import numpy as np
>>>
>>> a = xr.DataArray(np.arange(6).reshape((3, 2)), dims=[""x"", ""y""], coords={""x"": list(""abc"")})
>>> a
<xarray.DataArray (x: 3, y: 2)>
array([[0, 1],
       [2, 3],
       [4, 5]])
Coordinates:
  * x        (x) <U1 'a' 'b' 'c'
Dimensions without coordinates: y
```

The above data array `a` has two dimensions: `""x""` and `""y""`. It has a single coordinate `""x""`, with its associated dimension data array `a.coords[""x""]`. The dimension data array definition implies the following recursion:

```python
>>> a.coords[""x""]
<xarray.DataArray 'x' (x: 3)>
array(['a', 'b', 'c'], dtype='<U1')
Coordinates:
  * x        (x) <U1 'a' 'b' 'c'
>>> a.coords[""x""].coords[""x""]
<xarray.DataArray 'x' (x: 3)>
array(['a', 'b', 'c'], dtype='<U1')
Coordinates:
  * x        (x) <U1 'a' 'b' 'c'
```

Coordinate data arrays are meant to provide labels to array positions, allowing for convenient access to array elements:

```python
>>> a.loc[""b"", :]
<xarray.DataArray (y: 2)>
array([2, 3])
Coordinates:
    x        <U1 'b'
Dimensions without coordinates: y
```

Note that there is no asterisk symbol for coordinate `""x""` of the above resulting data array, as the coordinate is not associated with any dimension. In other words, the coordinate data array `a.loc[""b"", :].coords[""x""]` is not a dimension data array.

## Indexing and selecting data

There are four different but equally powerful ways of selecting data from a data array.
They differ only in the type of dimension and index lookups: **position-based lookup** and **label-based lookup**:

```
| Dimension lookup | Index lookup   | Data array         |
|------------------|----------------|--------------------|
| Position-based   | Position-based | a[:, 0]            |
| Position-based   | Label-based    | a.loc[:, ""UK""]     |
| Label-based      | Position-based | a(country=0)       |
| Label-based      | Label-based    | a.loc(country=""UK"")|
```

A **dimension position-based lookup** is determined by the position used in the index operator: `a[first_dim, second_dim, ...]` and `a.loc[first_dim, second_dim, ...]`. An **index position-based lookup** is determined by the provided integers or slices: `a[0, [3, 9], :, ...]` and `a(country=0, time=[3, 9], space=slice(None))`. A **dimension label-based lookup** is determined by the provided dimension name: `a(country=0)` and `a.loc(country=""UK"")`. An **index label-based lookup** is determined by the provided _index labels_ or slices [1]: `a.loc[:, ""UK""]` and `a.loc(country=""UK"")`.

[1] An **index label** is any [Numpy data type](https://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html) object.

Consider the following data array:

```python
>>> a = xr.DataArray(np.arange(6).reshape((3, 2)), dims=[""year"", ""country""], coords={""year"": [1990, 1994, 1998], ""country"": [""UK"", ""US""]})
>>> a
<xarray.DataArray (year: 3, country: 2)>
array([[0, 1],
       [2, 3],
       [4, 5]])
Coordinates:
  * year     (year) int64 1990 1994 1998
  * country  (country) <U2 'UK' 'US'
```

The expressions `a[:, 0]`, `a.loc[:, ""UK""]`, `a(country=0)`, and `a.loc(country=""UK"")` will all produce the same result:

```python
>>> a.loc[:, ""UK""]
<xarray.DataArray (year: 3)>
array([0, 2, 4])
Coordinates:
  * year     (year) int64 1990 1994 1998
    country  <U2 'UK'
```

### Formal indexing definition

Let `A` be the dimensionality of `a`, the data array being indexed. Let `b` be the resulting data array.
The Python operations for indexing are formally translated into an `A`-tuple:

```
(i_1, i_2, ..., i_A)
```

for which `i_j` is a named data array whose values are index labels. This data array, as usual, can have `0`, `1`, or more dimensions. Its construction is described later in this section.

Let

```
(r_1, r_2, ..., r_A)
```

be the tuple representing the indices of `a`. Precisely, temporarily create dimension data arrays with labels from `0` to the dimension size minus one for those dimensions without an associated dimension data array. Therefore, `r_j` is a dimension data array for dimension `j`. Also, it is required that the values of `i_j` are a subset of the values of `r_j`.

For each `j`, define the lists `I_j = [(i_j0, 0), (i_j1, 1), ...]` and `R_j = [(r_j0, 0), (r_j1, 1), ...]` with pairs of data array values and positions. Perform a SQL JOIN as follows:

1. Apply the Cartesian product between the values of `I_j` and the values of `R_j`.
2. Preserve only those tuples whose two pairs have equal values, i.e. equal first elements.

Consider `i_0` and `r_0` defined as follows:

```python
>>> i_0
<xarray.DataArray (apples: 2, oranges: 1)>
array([['a'],
       ['c']], dtype='<U1')
Dimensions without coordinates: apples, oranges
>>> r_0
<xarray.DataArray (dim_0: 3)>
array(['a', 'b', 'c'], dtype='<U1')
Coordinates:
  * dim_0    (dim_0) <U1 'a' 'b' 'c'
```

Performing operations 1 and 2 will result in the list `IR_j = [(('a', 0), ('a', 0)), (('c', 1), ('c', 2))]`. The positions `IR_j[0][1][1]`, `IR_j[1][1][1]`, and so forth are used to access the values of `a` and assign them to the positions `IR_j[0][0][1]`, `IR_j[1][0][1]`, and so forth of `b`.
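
A minimal sketch of operations 1 and 2 in plain Python, assuming the values of `i_j` are flattened in row-major order before being paired with positions; the helper name `inner_join` is hypothetical and not part of xarray:

```python
from itertools import product


def inner_join(index_values, reference_values):
    # Pair every value with its position, mirroring the lists I_j and R_j.
    I = [(value, pos) for pos, value in enumerate(index_values)]
    R = [(value, pos) for pos, value in enumerate(reference_values)]
    # Operation 1: Cartesian product of I_j and R_j.
    # Operation 2: keep only the tuples whose two values are equal.
    return [(i, r) for i, r in product(I, R) if i[0] == r[0]]


# Flattened values of i_0 and r_0 from the example above.
IR_0 = inner_join(['a', 'c'], ['a', 'b', 'c'])
print(IR_0)
```

When the reference labels repeat, the same helper keeps every matching pair, which is the duplicate-index behaviour an inner join is meant to capture.
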
Precisely, for each `(t_0, t_1, ..., t_A)` in `itertools.product(IR_0, IR_1, ..., IR_A)`, assign

```python
b[reshape(t_0[0][1], i_0.shape), ..., reshape(t_A[0][1], i_A.shape)] = a[t_0[1][1], ..., t_A[1][1]]
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,359240638
https://github.com/pydata/xarray/issues/2410#issuecomment-420446944,https://api.github.com/repos/pydata/xarray/issues/2410,420446944,MDEyOklzc3VlQ29tbWVudDQyMDQ0Njk0NA==,514522,2018-09-11T22:25:23Z,2018-09-11T22:25:23Z,CONTRIBUTOR,"Thanks guys! Just to make sure, this is a work in progress. I realise that I made some wrong assumptions, and there is more to add to it.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,359240638
https://github.com/pydata/xarray/issues/2399#issuecomment-420446624,https://api.github.com/repos/pydata/xarray/issues/2399,420446624,MDEyOklzc3VlQ29tbWVudDQyMDQ0NjYyNA==,514522,2018-09-11T22:24:14Z,2018-09-11T22:24:14Z,CONTRIBUTOR,"Yes, I'm working on that doc for now to come up with definitions that are very precise and as simple as possible.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,357156174
https://github.com/pydata/xarray/issues/2399#issuecomment-420362244,https://api.github.com/repos/pydata/xarray/issues/2399,420362244,MDEyOklzc3VlQ29tbWVudDQyMDM2MjI0NA==,514522,2018-09-11T17:52:29Z,2018-09-11T17:52:29Z,CONTRIBUTOR,Hi again. I'm working on a precise definition of xarray and indexing. I find the official one a bit hard to understand. It might help me come up with a reasonable way to handle duplicate indices. 
https://drive.google.com/file/d/1uJ_U6nedkNe916SMViuVKlkGwPX-mGK7/view?usp=sharing,"{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,357156174
https://github.com/pydata/xarray/issues/2399#issuecomment-419714631,https://api.github.com/repos/pydata/xarray/issues/2399,419714631,MDEyOklzc3VlQ29tbWVudDQxOTcxNDYzMQ==,514522,2018-09-09T13:04:12Z,2018-09-09T13:04:12Z,CONTRIBUTOR,"I see. Now that I have read about it, let me give it another shot.

Let `i` be

```
<xarray.DataArray (y: 1, z: 1)>
array([['a']], dtype='<U1')
Dimensions without coordinates: y, z
```

and `d` be

```
<xarray.DataArray (x: 2)>
array([0, 1])
Coordinates:
  * x        (x) <U1 'a' 'a'
```

The result of `d.loc[i]` is equal to `d.sel(x=i)`. Also, it seems reasonable to expect that its result should be the same as `d0.sel(x=i)` for `d0` given by

```
<xarray.DataArray (x: 2, dim_1: 1)>
array([[0],
       [1]])
Coordinates:
  * x        (x) <U1 'a' 'a'
Dimensions without coordinates: dim_1
```

as per the column vector representation assumption.

## Answer

Laying down the first dimension gives

| y | z | x |
|---|---|---|
| a | a | a |
|   |   | a |

By order, `x` will match with `y` and therefore we will append a new dimension after `x` to match with `z`:

| y | z | x | dim_1 |
|---|---|---|-------|
| a | a | a | ?     |
|   |   | a | ?     |

where `?` means any. Joining the first and second halves of the table gives

| y | z | x | dim_1 |
|---|---|---|-------|
| a | a | a | ?     |
| a | a | a | ?     |

And here is my suggestion. Use the mapping `y|->x` and `z|->dim_1` to decide which axis to expand for the additional element. I will choose the y-axis because the additional `a` was originally appended to the x-axis.
The answer is

```
<xarray.DataArray (y: 2, z: 1)>
array([[0],
       [1]])
Coordinates:
    x        (y, z) <U1 'a' 'a'
Dimensions without coordinates: y, z
```

for

```
>>> ans.coords[""x""]
<xarray.DataArray 'x' (y: 2, z: 1)>
array([['a'],
       ['a']], dtype='<U1')
Coordinates:
    x        (y, z) <U1 'a' 'a'
Dimensions without coordinates: y, z
```
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,357156174
https://github.com/pydata/xarray/issues/2399#issuecomment-419383633,https://api.github.com/repos/pydata/xarray/issues/2399,419383633,MDEyOklzc3VlQ29tbWVudDQxOTM4MzYzMw==,514522,2018-09-07T09:39:01Z,2018-09-07T09:39:01Z,CONTRIBUTOR,"Now I see the problem. But I think it is solvable. I will ignore the dimension names for now as I don't have much experience with xarray yet. The code

```python
import xarray as xr

da_nonunique = xr.DataArray([0, 1], dims=['x'], coords={'x': ['a', 'a']})
indexer = xr.DataArray([['a']], dims=['y', 'z'])
```

can be understood as defining two indexed arrays: `[a, a]` and `[[a]]`. As we are allowing for non-unique indexing, I will denote unique array elements as `[e_0, e_1]` and `[[r_0]]` interchangeably.

Algorithm:

1. Align. `[[a], [a]]` and `[[a]]`.
2. Ravel. `[(a,a), (a,a)]` and `[(a,a)]`.
3. Join. `[(a,a), (a,a)]`. I.e., `[e_0, e_1]`.
4. Unravel. `[[e_0, e_1]]`. Notice that `[e_0, e_1]` has been picked up by `r_0`.
5. Reshape. `[[e_0, e_1]]` (solution).

Concretely, the solution is a bi-dimensional, 1x2 array: | 0 1 |.

There is another relevant example. Let the code be

```python
import xarray as xr

da_nonunique = xr.DataArray([0, 1, 2], dims=['x'], coords={'x': ['a', 'a', 'b']})
indexer = xr.DataArray([['a', 'b']], dims=['y', 'z'])
```

We have `[a, a, b]` and `[[a, b]]`, also denoted as `[e_0, e_1, e_2]` and `[[r_0, r_1]]`.

Algorithm:

1. Align. `[[a], [a], [b]]` and `[[a, b]]`.
2. Ravel. `[(a,a), (a,a), (b,b)]` and `[(a,a), (b,b)]`.
3. Join. `[(a,a), (a,a), (b,b)]`. I.e., `[e_0, e_1, e_2]`.
4. Unravel. `[[e_0, e_1, e_2]]`.
Notice now that `[e_0, e_1]` has been picked up by `r_0` and `[e_2]` by `r_1`.
5. Reshape. `[[e_0, e_1, e_2]]`.

The solution is a bi-dimensional, 1x3 array: | 0 1 2 |

Explanation
-----------

1. Align recursively adds a new dimension in the array with lower dimensionality.
2. Ravel recursively removes a dimension by converting elements into tuples.
3. SQL Join operation: Cartesian product plus match.
4. Unravel performs the inverse of 2.
5. Reshape converts it to the indexer's dimensionality.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,357156174
https://github.com/pydata/xarray/issues/2399#issuecomment-419166914,https://api.github.com/repos/pydata/xarray/issues/2399,419166914,MDEyOklzc3VlQ29tbWVudDQxOTE2NjkxNA==,514522,2018-09-06T16:56:44Z,2018-09-06T16:56:44Z,CONTRIBUTOR,"Thanks for the feedback!

1. You can count on indexing if the is_unique flag is checked beforehand. The way pandas does indexing seems to be both **clear** to the user and **powerful**. It seems **clear** because indexing is the result of a Cartesian product followed by filtering for matching values. It is **powerful** because it allows indexing as complex as SQL INNER JOIN, which covers the trivial case of unique elements. For example, the following operation

```python
import pandas as pd

df = pd.DataFrame(data=[0, 1, 2], index=list(""aab""))
print(df.loc[list(""ab"")])
#    0
# a  0
# a  1
# b  2
```

is an INNER JOIN between the two indexes

```
INNER((a, b) x (a, a, b)) = INNER(aa, aa, ab, ba, ba, bb) = (aa, aa, bb)
```

Another example:

```python
import pandas as pd

df = pd.DataFrame(data=[0, 1], index=list(""aa""))
print(df.loc[list(""aa"")])
#    0
# a  0
# a  1
# a  0
# a  1
```

is again an INNER JOIN between the two indexes

```
INNER((a, a) x (a, a)) = INNER(aa, aa, aa, aa) = (aa, aa, aa, aa)
```

2. Assume a bidimensional array with the following indexing:

```
  0 1
a ! @
a # $
```

**This translates into a unidimensional index:** `(a, 0), (a, 1), (a, 0), (a, 1)`. As such, it can be treated as usual. Assume you index the above matrix using `[('a', 0), ('a', 0)]`. This implies

```
INNER( ((a, 0), (a, 0)) x ((a, 0), (a, 1), (a, 0), (a, 1)) ) =
INNER( (a,0)(a,0), (a,0)(a,1), (a,0)(a,0), (a,0)(a,1), (a,0)(a,0), (a,0)(a,1), (a,0)(a,0), (a,0)(a,1) ) =
((a,0)(a,0), (a,0)(a,0), (a,0)(a,0), (a,0)(a,0))
```

Converting it back to the matrix representation:

```
  0 0
a ! !
a # #
```

In summary, my suggestion is to consider the possibility of defining indexing `B` by using `A` (i.e., `B.loc(A)`) as a Cartesian product followed by match filtering. Or in SQL terms, an INNER JOIN. The multi-dimensional indexing, as far as I can see, can always be transformed into the uni-dimensional case and treated as such.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,357156174