github: issue_comments: 68 rows where issue = 262642978 sorted by updated

68 rows where issue = 262642978 sorted by updated_at descending

id	html_url	issue_url	node_id	user	created_at	updated_at	author_association	body	reactions	issue
1259326037	https://github.com/pydata/xarray/issues/1603#issuecomment-1259326037	https://api.github.com/repos/pydata/xarray/issues/1603	IC_kwDOAMm_X85LD8pV	benbovy 4160723	2022-09-27T10:50:36Z	2022-09-27T10:50:36Z	MEMBER	Should we close this issue and continue the discussion in #6293? For anyone who wants to track the progress on this topic: https://github.com/pydata/xarray/projects/1	{ "total_count": 2, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 2, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
949494376	https://github.com/pydata/xarray/issues/1603#issuecomment-949494376	https://api.github.com/repos/pydata/xarray/issues/1603	IC_kwDOAMm_X844mCJo	benbovy 4160723	2021-10-22T10:27:26Z	2021-10-22T10:27:26Z	MEMBER	well, both "contain the origin dims" or just "generate another one" have its benefit. Agreed, and both are supported by xarray actually. In case we want to keep the original dimensions like ("x", "y") in the example above, it's better to use masking. This discussion is broader than the topic covered in this issue so I'd suggest you start a new discussion if you want to further discuss this with the xarray community. Thanks.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
949485684	https://github.com/pydata/xarray/issues/1603#issuecomment-949485684	https://api.github.com/repos/pydata/xarray/issues/1603	IC_kwDOAMm_X844mAB0	weipeng1999 38346144	2021-10-22T10:15:39Z	2021-10-22T10:15:39Z	NONE	So I think keeping the original dims would break less existing code.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
949484507	https://github.com/pydata/xarray/issues/1603#issuecomment-949484507	https://api.github.com/repos/pydata/xarray/issues/1603	IC_kwDOAMm_X844l_vb	weipeng1999 38346144	2021-10-22T10:14:01Z	2021-10-22T10:14:01Z	NONE	For such case you could already do `ds.stack(z=("t", "x")).set_index(z="C2").sel(z=["a", "e", "h"])`. After the explicit index refactor, we could imagine a custom index that supports multi-dimension coordinates such that you would only need to do something like ```python S_res = S4.sel(C2=("z", ["a", "e", "h"])) S_res <xarray.Dataset> Dimensions: (z: 3) Coordinates: * C2 (z) <U1 'a' 'e' 'h' Data variables: A1 (z) float64 4 3 3 ``` or without explicitly providing the name of the packed dimension: ```python S_res = S4.sel(C2=["a", "e", "h"]) S_res <xarray.Dataset> Dimensions: (C2: 3) Coordinates: * C2 (C2) <U1 'a' 'e' 'h' Data variables: A1 (C2) float64 4 3 3 ``` Well, both "contain the origin dims" and "generate another one" have their benefits. If we keep the original dims, we can ensure that: - there is less difference between 1-d coordinates and multi-dimensional ones: both `S1.sel(C1=["a", "e", "h"])` and `S4.sel(C2=["a", "e", "h"])` run and return a new dataset with the original dims (that's why I strongly advise against the implicit one); - the returned dataset keeps the original dims, which means that if you change C1 to C2, later code such as `S_res.sel(x=[1,2,3])` still works.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
949449312	https://github.com/pydata/xarray/issues/1603#issuecomment-949449312	https://api.github.com/repos/pydata/xarray/issues/1603	IC_kwDOAMm_X844l3Jg	benbovy 4160723	2021-10-22T09:28:01Z	2021-10-22T09:28:01Z	MEMBER	For such case you could already do `ds.stack(z=("t", "x")).set_index(z="C2").sel(z=["a", "e", "h"])`. After the explicit index refactor, we could imagine a custom index that supports multi-dimension coordinates such that you would only need to do something like ```python S_res = S4.sel(C2=("z", ["a", "e", "h"])) S_res <xarray.Dataset> Dimensions: (z: 3) Coordinates: * C2 (z) <U1 'a' 'e' 'h' Data variables: A1 (z) float64 4 3 3 ``` or without explicitly providing the name of the packed dimension: ```python S_res = S4.sel(C2=["a", "e", "h"]) S_res <xarray.Dataset> Dimensions: (C2: 3) Coordinates: * C2 (C2) <U1 'a' 'e' 'h' Data variables: A1 (C2) float64 4 3 3 ```	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
949423480	https://github.com/pydata/xarray/issues/1603#issuecomment-949423480	https://api.github.com/repos/pydata/xarray/issues/1603	IC_kwDOAMm_X844lw14	weipeng1999 38346144	2021-10-22T08:56:38Z	2021-10-22T09:15:17Z	NONE	Well, here are my ideas on how to define coordinates with multiple dims. (Because of a GitHub bug, the characters in the first image are white; I cannot fix it.)	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
949413144	https://github.com/pydata/xarray/issues/1603#issuecomment-949413144	https://api.github.com/repos/pydata/xarray/issues/1603	IC_kwDOAMm_X844luUY	benbovy 4160723	2021-10-22T08:41:36Z	2021-10-22T08:41:36Z	MEMBER	Sorry but this is confusing. To me it still looks like you want implicit broadcasting of the `A3` variable along the `x` dimension. In your last comment you depict `A3` inconsistently with a 2-d shape but with only the `t` dimension. I'm also not sure how your suggestion relates to the issue here.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
949401881	https://github.com/pydata/xarray/issues/1603#issuecomment-949401881	https://api.github.com/repos/pydata/xarray/issues/1603	IC_kwDOAMm_X844lrkZ	weipeng1999 38346144	2021-10-22T08:25:54Z	2021-10-22T08:25:54Z	NONE	Thanks for the detailed description @weipeng1999. For the first 4 slides I don't see how this is different from how `S_res = S1.sel(C1=['a', 'b'])` and `S_res = S2.sel(C1=['a', 'b'])` currently behave. And for the last 2 slides, I don't think that we always want such implicit broadcasting for dimensions that are not involved in the indexed coordinates. Thank you for pointing out my mistakes. It is hard to explain the idea because it is a bit complicated. The last two pictures were wrong and caused a misunderstanding; here are two images that explain what I actually mean:	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
949358898	https://github.com/pydata/xarray/issues/1603#issuecomment-949358898	https://api.github.com/repos/pydata/xarray/issues/1603	IC_kwDOAMm_X844lhEy	benbovy 4160723	2021-10-22T07:22:24Z	2021-10-22T07:22:24Z	MEMBER	Thanks for the detailed description @weipeng1999. For the first 4 slides I don't see how this is different from how `S_res = S1.sel(C1=['a', 'b'])` and `S_res = S2.sel(C1=['a', 'b'])` currently behave. And for the last 2 slides, I don't think that we always want such implicit broadcasting for dimensions that are not involved in the indexed coordinates.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
947480352	https://github.com/pydata/xarray/issues/1603#issuecomment-947480352	https://api.github.com/repos/pydata/xarray/issues/1603	IC_kwDOAMm_X844eWcg	weipeng1999 38346144	2021-10-20T09:15:41Z	2021-10-20T09:15:41Z	NONE	Hi @weipeng1999, I'm not sure I fully understand your suggestion, would you mind sharing some illustrative examples? It is useful to have two distinct `coordinate variable` vs `data variable` concepts. Although both are data arrays, the former is used to locate data in the dimensional space(s) defined by all dimensions in the dataset while the latter is used to store field data. It also helps to have a clear separation between the `coordinate variable` and `index` concepts. An index is a specific data structure or object that allows efficient data extraction or alignment based on one or more coordinate labels. Sometimes an index object may be handled like a data array (like pandas indexes) but this is not always the case (e.g., a KD-Tree). Currently in Xarray the `index` concept is hidden behind "dimension" coordinate variables. The goal of the explicit index refactor is to bring it to light and make it available to any coordinate (and also open it to custom index structures, not only pandas indexes). It looks like what you suggest is some kind of implicit (co-)indexes hidden behind any dataset variable(s)? We actually took the opposite direction, trying to make everything explicit. To explain my idea, I made a PPT.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
946474674	https://github.com/pydata/xarray/issues/1603#issuecomment-946474674	https://api.github.com/repos/pydata/xarray/issues/1603	IC_kwDOAMm_X844ag6y	benbovy 4160723	2021-10-19T08:19:54Z	2021-10-19T08:19:54Z	MEMBER	Hi @weipeng1999, I'm not sure I fully understand your suggestion, would you mind sharing some illustrative examples? It is useful to have two distinct `coordinate variable` vs `data variable` concepts. Although both are data arrays, the former is used to locate data in the dimensional space(s) defined by all dimensions in the dataset while the latter is used to store field data. It also helps to have a clear separation between the `coordinate variable` and `index` concepts. An index is a specific data structure or object that allows efficient data extraction or alignment based on one or more coordinate labels. Sometimes an index object may be handled like a data array (like pandas indexes) but this is not always the case (e.g., a KD-Tree). Currently in Xarray the `index` concept is hidden behind "dimension" coordinate variables. The goal of the explicit index refactor is to bring it to light and make it available to any coordinate (and also open it to custom index structures, not only pandas indexes). It looks like what you suggest is some kind of implicit (co-)indexes hidden behind any dataset variable(s)? We actually took the opposite direction, trying to make everything explicit.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
946337314	https://github.com/pydata/xarray/issues/1603#issuecomment-946337314	https://api.github.com/repos/pydata/xarray/issues/1603	IC_kwDOAMm_X844Z_Yi	weipeng1999 38346144	2021-10-19T03:32:13Z	2021-10-19T03:33:54Z	NONE	Well, maybe we can consider coordinates in a more generic way. Let us define a coordinate as an array in a dataset that causes co-indexing when we index that dataset. This means that if A1, A2 and A3 are in the same dataset S, indexing `S[{'A1': I}]` returns a new dataset in which not only A1 is indexed, but also A2 and A3 wherever they share dims with A1. I call this behavior co-indexing. Dims determine how the other arrays of the dataset are co-indexed. If all dims of A1 (as coordinate) are also in A2 (as a regular, co-indexed array), the behavior can simply follow the old behavior: index along the shared dims and keep the others. If A1 has a dim that is not in A2, we should broadcast A2 along that dim, because the existing behavior is to treat a missing dim as broadcastable in other operations, so co-indexing should follow it. Some compatibility issues: we may need a new type like DataArray that has only dims instead of both dims and coordinates, and just define how Dataset deals with indexing; maybe DataArray is similar.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
822122172	https://github.com/pydata/xarray/issues/1603#issuecomment-822122172	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDgyMjEyMjE3Mg==	Hoeze 1200058	2021-04-19T02:18:58Z	2021-04-19T02:19:24Z	NONE	Many array types do have implicit indices. For example, sparse arrays do have their coordinates / CSR representation as primary index (`.sel()`) while dense array's primary index is the position (`.isel()`). Every labeled dimension is therefore just a separate mapping of a string to the index position in the array. Going one step further, one could have continuous dimensions where positional indexing (`.isel()`) does not really make sense. Looking at TileDB's dimensions provides an example for this. => Having explicit and implicit indices on arrays would be awesome, even if they don't support all xarray features!	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
523240818	https://github.com/pydata/xarray/issues/1603#issuecomment-523240818	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDUyMzI0MDgxOA==	shoyer 1217238	2019-08-21T00:00:43Z	2021-03-03T16:46:25Z	MEMBER	Explicitly propagating indexes requires going through most of xarray's source code and auditing each time we create a Dataset or DataArray object with low-level operations. We have some pretty decent testing functions for this in the form of `xarray.testing._assert_internal_invariants`, so this is now a pretty mechanical process -- you know it's working if you're now setting indexes explicitly and xarray's test suite passes. Here's our current progress: - [x] most of `dataset.py` - [x] `alignment.py` - [x] `merge.py` (#3234) - [ ] `concat.py` - [x] `dataarray.py` (#3519, #3481) - [ ] `computation.py` - [ ] `groupby.py` - [ ] `resample.py` - [ ] `rolling.py` - [ ] everything else!	{ "total_count": 3, "+1": 3, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
557590898	https://github.com/pydata/xarray/issues/1603#issuecomment-557590898	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDU1NzU5MDg5OA==	max-sixty 5635139	2019-11-22T16:04:22Z	2019-11-22T16:04:22Z	MEMBER	I'll make an example of this when I find some free time, along with a contrasting one in Pandas. :) 👍	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
557579503	https://github.com/pydata/xarray/issues/1603#issuecomment-557579503	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDU1NzU3OTUwMw==	NowanIlfideme 2067093	2019-11-22T15:34:57Z	2019-11-22T15:34:57Z	NONE	Thanks @NowanIlfideme for your feedback. Could you perhaps share a gist of code related to your use case? The first example in this comment is similar to my use case: https://github.com/pydata/xarray/issues/3213#issuecomment-520741706 . There are several "core" dimensions, but some part of the coordinates may be hierarchical or cross-defined (e.g. country > province > city > building, but also country > province > voting district > building). We might have a full or nearly-full panel in the MultiIndex representation, but have a huge cross product (even if we keep strictly hierarchical dimensions out). Meanwhile using a true COO sparse representation (as I understand it) will likely end up with slower operations overall, since nearly all machine learning models (think: linear regression) require a dense array input anyways. I'll make an example of this when I find some free time, along with a contrasting one in Pandas. :)	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
557567339	https://github.com/pydata/xarray/issues/1603#issuecomment-557567339	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDU1NzU2NzMzOQ==	dcherian 2448579	2019-11-22T15:08:26Z	2019-11-22T15:08:26Z	MEMBER	My first attempt was to just assume each dimension was orthogonal, which resulted in out-of-memory errors We have experimental support for https://sparse.pydata.org/en/latest/index.html that may help but no documentation unfortunately. There are some details here: https://github.com/pydata/xarray/issues/3213 and https://github.com/pydata/xarray/issues/3484	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
557566798	https://github.com/pydata/xarray/issues/1603#issuecomment-557566798	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDU1NzU2Njc5OA==	rabernat 1197350	2019-11-22T15:07:14Z	2019-11-22T15:07:14Z	MEMBER	Thanks @NowanIlfideme for your feedback. Could you perhaps share a gist of code related to your use case?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
557563566	https://github.com/pydata/xarray/issues/1603#issuecomment-557563566	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDU1NzU2MzU2Ng==	NowanIlfideme 2067093	2019-11-22T14:59:29Z	2019-11-22T14:59:29Z	NONE	I've noticed that basically all my current troubles with xarray lead to this issue (lack of MultiIndex support). I use xarray for machine learning/data science/econometrics. My current problem requires a semi-hierarchical indexing on one of the dimensions, and slicing/aggregation along some levels of those dimensions. My first attempt was to just assume each dimension was orthogonal, which resulted in out-of-memory errors. I ended up using a MultiIndex for the hierarchy dimension to have a "dense" representation of a sparse subspace. Unfortunately, currently `.sel()` and such will cut out MultiIndex dimensions, and I've had to do boolean masking to keep all the dimensions I need. Multidimensional groupby, especially within the MultiIndex, is a headache as it currently stands. I had to resort to making auxiliary dimensions with one-hot encoded levels (dummy variables) and doing multiply-aggregate operations by hand. `xarray` is really beautiful and should be used more by data scientists, but it's really difficult to recommend it to colleagues when not all the familiar `pandas`-style operations are supported.	{ "total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
549179102	https://github.com/pydata/xarray/issues/1603#issuecomment-549179102	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDU0OTE3OTEwMg==	shoyer 1217238	2019-11-03T21:12:25Z	2019-11-03T21:12:25Z	MEMBER	I'm not working on any of these right now. You might start with a few of the `dataarray.py` methods (no need to do them all at once) to get a sense of what piping these arguments around looks like. I suspect you could get quite a few of these working just by handling indexes in `_to_temp_dataset`/`_from_temp_dataset`.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
549097800	https://github.com/pydata/xarray/issues/1603#issuecomment-549097800	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDU0OTA5NzgwMA==	dcherian 2448579	2019-11-03T02:03:35Z	2019-11-03T02:03:35Z	MEMBER	@shoyer I was thinking of starting on one of the listed files. Do you have any tips? Are you working on any of those at present? What might be the easiest one to begin?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
511126208	https://github.com/pydata/xarray/issues/1603#issuecomment-511126208	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDUxMTEyNjIwOA==	rabernat 1197350	2019-07-13T14:27:32Z	2019-07-13T14:27:32Z	MEMBER	After spending a few hours on the issue tracker yesterday, it became clear to me that the issue--more flexible indexes--is a major blocker on many high-priority features going forward. In #2639, @shoyer started to address this. In that now-merged PR, he outlined the following steps, each of which needs its own PR: [x] In #2639 Indexes are recreated from coordinates every time a new DataArray or Dataset is created. [ ] Refactor indexes to be propagated explicitly in xarray operations. This will facilitate future API changes, when indexes will no longer only be associated with dimensions. I will probably add some testing decorator that can be used to mark part of a test as including no creation of default indexes. [ ] Add explicit entries into indexes for MultiIndex levels that are checked instead of MultiIndex variables. Still no public API changes (aside from adding more entries to .indexes). [ ] Support arbitrary coordinates in indexes. So the best way to make progress on all manner of higher-level xarray feature requests is to start working through the next three items in this list.	{ "total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
491229992	https://github.com/pydata/xarray/issues/1603#issuecomment-491229992	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ5MTIyOTk5Mg==	aldanor 2418513	2019-05-10T09:47:39Z	2019-05-10T09:47:39Z	NONE	There's now a good few dozen issues that reference this PR. Wondering if there's any particular help needed (in the form of coding, discussion, or any other fashion), so as to try and speed it up and unblock those issues? (I'm personally interested in resolving problems like #934 myself - allowing selection on non-dim coords, which seems to be a major hassle for a lot of use cases.)	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
450702503	https://github.com/pydata/xarray/issues/1603#issuecomment-450702503	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ1MDcwMjUwMw==	shoyer 1217238	2019-01-01T00:54:27Z	2019-01-01T00:54:27Z	MEMBER	I'm starting to make these changes incrementally -- the first step is in https://github.com/pydata/xarray/pull/2639.	{ "total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
444403484	https://github.com/pydata/xarray/issues/1603#issuecomment-444403484	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0NDQwMzQ4NA==	benbovy 4160723	2018-12-05T08:39:35Z	2018-12-05T08:39:35Z	MEMBER	I guess the error is probably the best idea. Agreed. It seems very strict indeed, but it will be easier to relax this later than the other way. There is also a (very rare?) case where the two indexed coordinates have the same labels but are named differently in the two datasets (e.g., `station_name` and `sname`). In that case an error is probably better too. It would be a sort of indication that the most useful thing to do for future operations is to rename one of those coordinates first.	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
444204957	https://github.com/pydata/xarray/issues/1603#issuecomment-444204957	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0NDIwNDk1Nw==	shoyer 1217238	2018-12-04T18:25:33Z	2018-12-04T18:25:33Z	MEMBER	Sorry for maybe asking this again but I'm a bit confused now: is there any good reason of supporting "multiple single indexes" along the same dimension? After all, perhaps better defaults would be to set indexes (`pandas.Index`) only for 1-d coordinates matching dimension names, like it is the case now. If you want a different behavior, then you need to use `.set_index()`, which would raise if it results in multiple single indexes along a dimension. We could also add a new `indexes` argument to the `Dataset` / `DataArray` constructors to save some typing (and avoid the creation of in-memory `pandas.Index` for very long coordinates if an out-of-core alternative is later supported). I discussed this a little bit above in https://github.com/pydata/xarray/issues/1603#issuecomment-442661526, under "MultiIndex as part of the data schema". I agree that the default behavior should still be to create automatic indexes only for 1d coordinates matching dimension names. But we still will have (rare?) cases where "multiple single indexes" could arise from combining arguments with different indexes. For example, suppose the `station` dimension has an index for `station_name` in one dataset and `city` in another. Should the result be: - A `MultiIndex` with levels `station_name` and `city`? This would be most useful for future operations. - Two individual indexes for `station_name` and `city`? This would be the cheapest result to construct. - An error? This is arguably too strict, because there are no conflicts in either of the indexes. I guess the error is probably the best idea. Where does array([0, 1]) come from? I wouldn't have been surprised if a KeyError was raised instead. Perhaps this specific case was initially for backward compatibility when the "dimensions without indexes" feature has been introduced, but it was a long time ago and I'm not sure this is still necessary. This is indeed the historical genesis, but I agree that this is confusing and we should deprecate/remove it.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
444187219	https://github.com/pydata/xarray/issues/1603#issuecomment-444187219	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0NDE4NzIxOQ==	alimanfoo 703554	2018-12-04T17:33:34Z	2018-12-04T17:33:34Z	CONTRIBUTOR	I think that one big source of confusion has been so far mixing coordinates/variables and indexes. These are really two separate concepts, and the indexes refactoring should address that IMHO. For example, I think that da[some_name] should never return indexes but only coordinates (and/or data variables for Dataset). That would be much simpler. Can't claim to be following every detail here, but this sounds very sensible to me FWIW.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
444132393	https://github.com/pydata/xarray/issues/1603#issuecomment-444132393	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0NDEzMjM5Mw==	benbovy 4160723	2018-12-04T15:06:21Z	2018-12-04T15:19:08Z	MEMBER	It occurs to me that for the case of "multiple single indexes" along the same dimension there is no good way to use them simultaneously for indexing/reindexing at the same time. Sorry for maybe asking this again but I'm a bit confused now: is there any good reason of supporting "multiple single indexes" along the same dimension? After all, perhaps better defaults would be to set indexes (`pandas.Index`) only for 1-d coordinates matching dimension names, like it is the case now. If you want a different behavior, then you need to use `.set_index()`, which would raise if it results in multiple single indexes along a dimension. We could also add a new `indexes` argument to the `Dataset` / `DataArray` constructors to save some typing (and avoid the creation of in-memory `pandas.Index` for very long coordinates if an out-of-core alternative is later supported). da[dim_name] should return all the indexes on that dimension I think that one big source of confusion has been so far mixing coordinates/variables and indexes. These are really two separate concepts, and the indexes refactoring should address that IMHO. For example, I think that `da[some_name]` should never return indexes but only coordinates (and/or data variables for Dataset). That would be much simpler. Take for example ```python da = xr.DataArray(np.random.rand(2, 2), ... dims=('one', 'two'), ... coords={'one_labels': ('one', ['a', 'b'])}) da <xarray.DataArray (one: 2, two: 2)> array([[ 0.536028, 0.291895], [ 0.682108, 0.926003]]) Coordinates: one_labels (one) <U1 'a' 'b' Dimensions without coordinates: one, two ``` I find it so weird being able to do this: ```python da['one'] <xarray.DataArray 'one' (one: 2)> array([0, 1]) Coordinates: one_labels (one) <U1 'a' 'b' Dimensions without coordinates: one ``` Where does `array([0, 1])` come from? I wouldn't have been surprised if a `KeyError` was raised instead. Perhaps this specific case was initially for backward compatibility when the "dimensions without indexes" feature has been introduced, but it was a long time ago and I'm not sure this is still necessary. It might be a good thing explicitly requiring `da.set_index('one_labels')` to enable indexing/alignment (edit: label indexing/alignment) along dimension `one` in the example above.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
443239040	https://github.com/pydata/xarray/issues/1603#issuecomment-443239040	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0MzIzOTA0MA==	max-sixty 5635139	2018-11-30T15:29:15Z	2018-11-30T15:29:15Z	MEMBER	How should dimension names interact with index names - i.e. the "Mapping indexes into pandas" in @shoyer 's comment I'd suggest that option (3) should be invalid, and that `da[dim_name]` should return all the indexes on that dimension	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
443172604	https://github.com/pydata/xarray/issues/1603#issuecomment-443172604	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0MzE3MjYwNA==	benbovy 4160723	2018-11-30T11:14:24Z	2018-11-30T11:14:24Z	MEMBER	A couple of thoughts: If nothing useful can be done in the case of "multiple single indexes", would it make sense to discourage users explicitly creating multiple single indexes along a dimension? "Multiple single indexes" would be just a default situation when nothing specific has been defined yet or resulting from a fallback. For example, why not require that `set_index(['x', 'y'])` (with a list as argument) should always result in a multi-index regardless of the `kind` argument, i.e., raise if a single index is given? This is close to the current behavior, I think. This would require calling `set_index` for each single index that we want to (re)define, but I don't think setting a lot of single indexes at the same time is something that often happens. Hence, would it be possible to avoid `append=None` and instead change the default to `append=True`?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
443044579	https://github.com/pydata/xarray/issues/1603#issuecomment-443044579	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0MzA0NDU3OQ==	shoyer 1217238	2018-11-30T00:24:39Z	2018-11-30T00:24:39Z	MEMBER	I wonder if we should also change the default value of the `append` argument in `set_index()` to `append=None`, which means something like "append if creating a MultiIndex". For most users, keeping a single MultiIndex is the most usable way to use multiple indexes along a dimension, and our default behavior should reflect that.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442965602	https://github.com/pydata/xarray/issues/1603#issuecomment-442965602	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0Mjk2NTYwMg==	shoyer 1217238	2018-11-29T19:38:34Z	2018-11-29T19:38:34Z	MEMBER	It occurs to me that for the case of "multiple single indexes" along the same dimension there is no good way to use them simultaneously for indexing/reindexing. We should explicitly raise if you try to do this. I guess we have a few options for automatic alignment with multiple single indexes, too: 1. We could only support "exact" indexing 2. We could require that aligning each index separately gives the same result (2) seems least restrictive and is probably the right choice. One advantage of not having `MultiIndex` objects as variables is that the serialization story gets simpler. The rule becomes "multi-indexes don't get saved". What should the default behavior of `set_index(['x', 'y'])` without an explicit `kind` argument be? - Should this mean individual indexes or a combined MultiIndex? The latter might be more surprising but is arguably more useful. It would make sense if the model is that `set_index()` always creates a single index object. - We could potentially automatically pick an index type using simple heuristics. For example, if the arguments are 1D, you get a MultiIndex by default. If the arguments have two or more dimensions, you get a KDTree.	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442956167	https://github.com/pydata/xarray/issues/1603#issuecomment-442956167	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0Mjk1NjE2Nw==	shoyer 1217238	2018-11-29T19:10:14Z	2018-11-29T19:10:14Z	MEMBER	Looking at the reported issues related to multi-indexes in xarray, I have the same feeling. Simply reusing pandas.MultiIndex in xarray where slightly different semantics are generally expected has shown to be painful. It seems easier to have our own baked solution and deal with differences during xarray<-> pandas conversion if needed. I think the pandas.MultiIndex is a pretty solid data structure on a fundamental level, it just has some weird semantics for some indexing edge cases. Whether or not we write an xarray.MultiIndex structure, we can achieve most of what we want with a thin layer over `pandas.MultiIndex`. If a variable for each multi-coordinate index is "just" for data schema consistency, then why not show all those indexes in a separate section of the repr? Yes, I like this! Generally I like @benbovy's entire proposal :). @fujiisoup can you clarify the use-cases you have for a MultiIndex as a variable? Am I right in thinking the Multi-indexes is only a helpful note to users, rather than conveying anything about how data is accessed? From a data perspective, the only thing having an Index and/or MultiIndex should change is that the data is immutable. But by necessity the nature of the index will determine which indexing operations are possible/efficient. For example, if you want to do nearest-neighbor indexing with multiple coordinates you'll need a KDTree. We should not be afraid to raise errors if an indexing operation can't be done efficiently. With regards to reindexing: I don't think this needs any special handling versus normal indexing (`sel()`). The rules basically fall out of those for normal indexing, except we handle missing values differently (by filling with NaN). 
Another issue: how do we do automatic alignment with multiple indexes? Let me suggest a straw-man proposal: We always align indexed coordinates. If a coordinate is used in different types of indexes (e.g., a base `Index` in one argument and a `MultiIndex` level in another), we can either: 1. create a `MultiIndex` with the variable on the fly (this could be slightly expensive), or 2. fall back to only supporting "exact" indexing	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442907394	https://github.com/pydata/xarray/issues/1603#issuecomment-442907394	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0MjkwNzM5NA==	benbovy 4160723	2018-11-29T16:49:12Z	2018-11-29T17:18:10Z	MEMBER	ds.sel(multi=list_of_pairs) can probably be replaced by ds.sel(x=..., y=...), but how about reindex along MultiIndex? Indeed I haven't really thought about `reindex` and alignment in my suggestion above. How do you currently `reindex` along a multi-index dimension? Contrary to `.sel`, `ds.reindex(multi=list_of_pairs)` doesn't seem to work (the list of n-length tuples being interpreted as a ~~n-dim~~ 2-d array). The only way I've found to make it work is to pass another `pandas.MultiIndex`. Wouldn't it be rather confusing if we choose to go with our own implementation of MultiIndex for xarray instead of `pandas.MultiIndex`? Wouldn't it be possible to easily support `ds.reindex(x=..., y=...)` within the new data model proposed here? Am I right in thinking the Multi-indexes is only a helpful note to users, rather than conveying anything about how data is accessed? This is a good question. A related question: apart from the `ds.sel(multi=list_of_pairs)` and `ds.reindex(multi=list_of_pairs)` use cases discussed so far, are there other reasons for having a variable for a multi-index? I think we can do much of this before adding the ability to set custom indexes, which would be cool but further from where we are, I think. I agree, although whether or not we will eventually support custom indexes might influence the design choices that we have to make now, IMO.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442906486	https://github.com/pydata/xarray/issues/1603#issuecomment-442906486	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0MjkwNjQ4Ng==	max-sixty 5635139	2018-11-29T16:46:52Z	2018-11-29T16:46:52Z	MEMBER	And broadening out further: Default behavior: all 1-dimensional coordinates each have their own, single index (pandas.Index), unless explicitly stated. This is basically how I think of indexes - as a performant lookup data structure, rather than a feature of the schema. An RDBMS is a good analogy there. Now, maybe there's enough overlap between the data access and the data schema that we should let them couple - e.g. would you want to be able to run `.sel` on any coord, even 2D? While it's possible in concept, it could guide users to inefficient operations. We probably don't need to answer this question to proceed, but I'd be interested whether others see indexes as a property of the schema / I'm missing something.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442902327	https://github.com/pydata/xarray/issues/1603#issuecomment-442902327	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0MjkwMjMyNw==	max-sixty 5635139	2018-11-29T16:36:20Z	2018-11-29T16:36:20Z	MEMBER	I broadly agree with @benbovy's proposal. One question that I think is worth being clear on is what additional contracts multiple indexes on a dimension have over individual indexes. e.g. re: `Coordinates: * level_1 (x) object 'a' 'a' 'b' 'b' * level_2 (x) int64 1 2 1 2 Multi-indexes: pandas.MultiIndex [level_1, level_2]` Am I right in thinking the `Multi-indexes` is only a helpful note to users, rather than conveying anything about how data is accessed? @fujiisoup poses a good case of this question: ds.sel(multi=list_of_pairs) can probably be replaced by ds.sel(x=..., y=...), but how about reindex along MultiIndex? (and separately, I think we can do much of this before adding the ability to set custom indexes, which would be cool but further from where we are, I think)	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442809859	https://github.com/pydata/xarray/issues/1603#issuecomment-442809859	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0MjgwOTg1OQ==	fujiisoup 6815844	2018-11-29T12:05:03Z	2018-11-29T12:05:03Z	MEMBER	I am late for the party (but still only have time to write a short comment). I am a big fan of MultiIndex and like @shoyer 's idea. `ds.sel(multi=list_of_pairs)` can probably be replaced by `ds.sel(x=..., y=...)`, but how about `reindex` along MultiIndex? I have encountered its use cases several times. I also think it would be nice to have MultiIndex as a variable.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442797084	https://github.com/pydata/xarray/issues/1603#issuecomment-442797084	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0Mjc5NzA4NA==	benbovy 4160723	2018-11-29T11:15:17Z	2018-11-29T11:15:17Z	MEMBER	we will definitely have to make some intentional deviations from the behavior of pandas Looking at the reported issues related to multi-indexes in xarray, I have the same feeling. Simply reusing `pandas.MultiIndex` in xarray where slightly different semantics are generally expected has shown to be painful. It seems easier to have our own baked solution and deal with differences during xarray<-> pandas conversion if needed. If we re-design indexes so that we allow 3rd-party indexes, maybe we could support both and let the user choose the one (xarray or pandas baked) that best suits their needs? Regarding MultiIndex as part of the data schema vs an implementation detail, if we support extending indexes (and already given the different kinds of multi-coordinate indexes: MultiIndex, KDTree, etc.), then I think that it should be transparent to the user. However, I don't really see why a multi-coordinate index should have its own variable (with tuples of values). I don't want to speak for others, but IMHO `ds.sel(multi=list_of_pairs)` is rather an edge case and I'm not sure if we really need to support it. Using `ds.sel(x=..., y=...)` with DataArray objects is certainly more code to write, but this form of indexing is very powerful and it might not be a bad idea to encourage it. If a variable for each multi-coordinate index is "just" for data schema consistency, then why not show all those indexes in a separate section of the repr? 
For example: `Coordinates: * level_1 (x) object 'a' 'a' 'b' 'b' * level_2 (x) int64 1 2 1 2 Multi-indexes: pandas.MultiIndex [level_1, level_2]` It is equally transparent, not more verbose, and it is clear that multi-indexes are not part of the coordinates (in fact there is no need for "virtual" coordinates either, nor to name the index). I don't think single indexes should be shown here as it would result in duplicated, uninformative lines. More generally, here is how I would see indexes handled in xarray (I might be missing important aspects, though): Default behavior: all 1-dimensional coordinates each have their own, single index (`pandas.Index`), unless explicitly stated. Explicit API is used for setting new, possibly multi-coordinate indexes. Note the absence of keyword argument below to specify the variables: This is actually more consistent with the pandas API but this would be a breaking change and I don't know what a smooth transition could look like. `set_index(['x', 'y'], kind='multiindex') # xarray built-in index` `set_index(['x', 'y'], kind='kdtree') # xarray built-in index` `set_index('x', kind=ASingleIndexWrapperClass) # 3rd-party index` If a coordinate is removed from the Dataset or if its index is reset or changed: If the coordinate had a single index, no problem. If the coordinate was part of a multi-coordinate index: a new index is built from all remaining coordinates that were also part of the original index, if it is supported. Otherwise, the original index is removed and the default behavior (single `pandas.Index`) is reset for all those remaining coordinates.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442725856	https://github.com/pydata/xarray/issues/1603#issuecomment-442725856	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0MjcyNTg1Ng==	max-sixty 5635139	2018-11-29T06:52:49Z	2018-11-29T06:52:49Z	MEMBER	Let me make a tentative proposal: we should model a MultiIndex in xarray as exactly equivalent to a sparse multi-dimensional array, except with missing elements modeled implicitly (by omission) instead of explicitly (with NaN). 💯- that very much resonates! And it leaves the implementation flexible if we want to iterate. I'll try to think of some dissenting cases to the proposal / helpful responses to the above.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442710536	https://github.com/pydata/xarray/issues/1603#issuecomment-442710536	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0MjcxMDUzNg==	shoyer 1217238	2018-11-29T05:23:33Z	2018-11-29T05:25:48Z	MEMBER	There's no need to support indexing like `ds.sel(multi=list_of_pairs)`. Indexing like `ds.sel(x=..., y=...)` solves the same use case and looks nicer. This needs an important caveat: it's only true that you use `ds.sel(x=..., y=...)` to emulate `ds.sel(multi=list_of_pairs)` if you do explicit vectorized indexing like in @max-sixty's example above (https://github.com/pydata/xarray/issues/1603#issuecomment-442636798). It would be nice to preserve a way to select a list of particular points that didn't require constructing explicit DataArray objects as the indexers. (But maybe this is a somewhat niche use-case and it isn't worth the trouble.) Let me make a tentative proposal: we should model a MultiIndex in xarray as exactly equivalent to a sparse multi-dimensional array, except with missing elements modeled implicitly (by omission) instead of explicitly (with NaN). If we do this, I think MultiIndex semantics could be defined to be identical to those of separable Index objects. One challenge is that we will definitely have to make some intentional deviations from the behavior of pandas, at least when dealing with array indexing of a MultiIndex level. Pandas has some strange behaviors with array indexing of a MultiIndex level, and I'm honestly not sure if they are bugs or features: - It ignores missing labels (https://github.com/pandas-dev/pandas/issues/15452) - It drops duplicate labels (https://github.com/pandas-dev/pandas/issues/19414) Fortunately, the MultiIndex data model is not that complicated, and it is quite straightforward to remap indexing results from sub-Index levels onto integer codes. 
I suspect we will find it easier to rewrite some of these routines than to change pandas, both because pandas may not agree with different semantics and because the pandas indexing code is an unholy mess. For example, we can reproduce the above issues: ```python import pandas as pd index = pd.MultiIndex.from_arrays([['a', 'b', 'c']]) print(index.get_locs((['a', 'a'],))) # [0] print(index.get_locs((['a', 'd'],))) # [0] ``` We actually want something more like: ```python def get_locs(index, key): return index.get_indexer(pd.MultiIndex.from_product(key)) print(get_locs(index, (['a', 'a'],))) # [0, 0] print(get_locs(index, (['a', 'd'],))) # [0, -1] ```	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442680467	https://github.com/pydata/xarray/issues/1603#issuecomment-442680467	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0MjY4MDQ2Nw==	shoyer 1217238	2018-11-29T02:15:48Z	2018-11-29T02:19:06Z	MEMBER	That said, I still don't know how to use public MultiIndex methods for this. Neither `index.get_loc_level([1, 2], level=1)` nor `index.get_loc((slice(None), [1, 2]))` work. The answer is the `index.get_locs()` method: `index.get_locs([slice(None), [1, 2]])` works. It's painfully slow for large numbers of points due to a Python loop over each point, but presumably that could be optimized: `x = np.arange(10000) index = pd.MultiIndex.from_arrays([x]) %timeit index.get_locs((x,)) # 1.31 s per loop %timeit index.levels[0].get_indexer(x) # 93 µs per loop`	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442581754	https://github.com/pydata/xarray/issues/1603#issuecomment-442581754	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0MjU4MTc1NA==	shoyer 1217238	2018-11-28T19:51:42Z	2018-11-29T00:48:53Z	MEMBER	I've been thinking about this a little more in the context of starting on the implementation (in #2195). In particular, I no longer agree with this "Separate indexers without a MultiIndex should be prohibited" from my original proposal. The problem is that the semantics of a MultiIndex are not quite the same as separate indexes, and I don't think all use-cases are well solved by always using a MultiIndex. ~~For example, I don't think it's possible to do point-wise indexing along anything other than the first level of a MultiIndex.~~ (note: this is not true, see https://github.com/pydata/xarray/issues/1603#issuecomment-442662561) Instead, I think we should make the model transparent by retaining an xarray variable for the MultiIndex, and provide APIs for explicitly converting index types. e.g., for the repr with a MultiIndex: `Coordinates: * x (x) MultiIndex[level_1, level_2] * level_1 (x) object 'a' 'a' 'b' 'b' * level_2 (x) int64 1 2 1 2` and without a MultiIndex: `Coordinates: * level_1 (x) object 'a' 'a' 'b' 'b' * level_2 (x) int64 1 2 1 2` The main way in which this could get confusing is if you explicitly mutate the Dataset to remove some but not all of the variables corresponding to the MultiIndex (e.g., `x` but not `level_1` or vice versa). We have a few potential options here: 1. Don't worry about it: if you mutate objects, you can potentially end up in slightly confusing internal states. If you care about whether `level_1` uses a pandas.Index or pandas.MultiIndex, you can find out for sure by checking `ds.indexes['level_1']`. 2. Prohibit it in our data model: either (a) raise an error if you try to manually delete a single variable or (b) automatically delete all associated variables, too. 
Encourage using various explicit APIs that return new objects with a new index. 3. Use a different indicator than `*` for marking "indirect" indexes, so it's more obvious if some coordinates get removed, e.g., `Coordinates: x (x) MultiIndex[level_1, level_2] + level_1 (x) object 'a' 'a' 'b' 'b' + level_2 (x) int64 1 2 1 2` The different indicator might make sense regardless but I am also partial to "Prohibit it in our data model." The main downside is that this adds a little more complexity to the logic for determining indexes resulting from an operation (namely, verifying that all MultiIndex levels still correspond to coordinates).	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442662561	https://github.com/pydata/xarray/issues/1603#issuecomment-442662561	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0MjY2MjU2MQ==	shoyer 1217238	2018-11-29T00:48:12Z	2018-11-29T00:48:28Z	MEMBER	For example, I don't think it's possible to do point-wise indexing along anything other than the first level of a MultiIndex. This is clearly not true, since it works in pandas: ```python import pandas as pd index = pd.MultiIndex.from_product([list('ab'),[1,2]]) series = pd.Series(range(4), index) print(series.loc[:, [1, 2]]) ``` That said, I still don't know how to use public `MultiIndex` methods for this. Neither `index.get_loc_level([1, 2], level=1)` nor `index.get_loc((slice(None), [1, 2]))` work.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442661526	https://github.com/pydata/xarray/issues/1603#issuecomment-442661526	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0MjY2MTUyNg==	shoyer 1217238	2018-11-29T00:42:39Z	2018-11-29T00:42:39Z	MEMBER	@max-sixty I like your schema vs. implementation breakdown. In general, I agree with you that it would be nice to have MultiIndex as an implementation detail rather than part of xarray's schema. But I'm not entirely sure that's feasible. Let's try to list out the pros/cons. Consider a MultiIndex 'multi' with levels 'x' and 'y': - Advantages of MultiIndex as part of the data schema: - There is an explicit coordinate (of tuples) corresponding to MultiIndex values, which can be returned from `ds.coords['multi']`. This is inherently not that useful compared to the separable variables, but is a cleaner solution than creating `ds.coords['multi']` as a "virtual" variable on the fly (which we would need for backwards compatibility). - We don't need to do full "normalization" when multiple indexes along the same dimension are encountered, e.g., in an operation that combines two different indexes, we would simply put both on the result instead of building a MultiIndex (which would require allocating a whole new array of integer codes). - The nature of the MultiIndex is more transparent as part of the data model. For example, if `x` and `y` are numeric, it could make sense to use either a MultiIndex or KDTree for indexing. Explicit APIs (e.g., `set_multiindex` and `set_kdtree`) would allow users a high level of control. - For advanced use-cases, it is potentially easier to work around the limitations of a MultiIndex, e.g., the way that some operations require lex-sorted-ness. - Advantages of MultiIndex as an implementation detail: - Simpler data model (for users). There are few good use cases for multiple indexes that aren't a MultiIndex. 
- Easier to do automatic alignment: we know that indexes will always have the same normalized form (in a MultiIndex). Otherwise, we would have to do this on the fly, or request that users explicitly set up compatible indexes. - More flexibility for xarray: we can potentially swap out indexing without changing the user-facing API. We might have something like a "hybrid" MultiIndex/KDTree that chooses the appropriate index based on the requested operation. - We don't need to create an explicit array of tuples for the MultiIndex variable (but we could still have a variable corresponding to a MultiIndex and only construct the `.data` array in a "lazy" fashion). - There's no need to name extraneous variables that only exist for the sake of a MultiIndex. - There's no need to support indexing like `ds.sel(multi=list_of_pairs)`. Indexing like `ds.sel(x=..., y=...)` solves the same use case and looks nicer. That said, this would be a minor backwards compatibility break (this currently works in xarray). P.S. I haven't made much progress on this yet so there's definitely still time to figure out the right decision -- thanks for your engagement on this!	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442636798	https://github.com/pydata/xarray/issues/1603#issuecomment-442636798	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDQ0MjYzNjc5OA==	max-sixty 5635139	2018-11-28T22:54:26Z	2018-11-28T22:54:26Z	MEMBER	Potentially this is too much 'stepping back' now we're at the implementation stage - my perception is that @shoyer is leading this without much support, so weighting having some additional viewpoints, some questions: Is a MultiIndex a feature of the schema or the implementation? I had thought of an MI being an implementation detail in code, rather than in the data schema. We use it as a container for all the indexes along a dimension, rather than representing any properties about the data it contains. One exception to that would be if we wanted multiple groups of indexes along the same dimension, for example: ``` Coordinates: * xa (x) MultiIndex[level_a_1, level_a_2] * level_a_1 (x) object 'a' 'a' 'b' 'b' * level_a_2 (x) int64 1 2 1 2 xb (x) MultiIndex[level_b_1, level_b_2] level_b_1 (x) object 'a' 'a' 'b' 'b' level_b_2 (x) int64 1 2 1 2 ``` But is that common / required? MultiIndex as an implementation detail If it's an implementation detail, is there a benefit to investing in allowing both separate and MIs? While it may not be possible to do pointwise indexing with the current implementation of MI, am I mistaken that it's not an API issue, assuming we pass in index names? 
e.g.: ```python [ins] In [22]: da = xr.DataArray(np.arange(12).reshape((3, 4)), dims=['x', 'y'], coords=dict(x=list('abc'), y=pd.MultiIndex.from_product([list('ab'),[1,2]]))) [ins] In [23]: da Out[23]: <xarray.DataArray (x: 3, y: 4)> array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]]) Coordinates: * x (x) <U1 'a' 'b' 'c' * y (y) MultiIndex - y_level_0 (y) object 'a' 'a' 'b' 'b' - y_level_1 (y) int64 1 2 1 2 [ins] In [26]: da.sel(x=xr.DataArray(['a','c'],dims=['z']), y_level_0=xr.DataArray(['a','b'],dims=['z']), y_level_1=xr.DataArray([1,1],dims=['z'])) Out[80]: # hypothetical <xarray.DataArray (z: 2)> array([ 0, 10]) Dimensions without coordinates: z ``` If that's the case, could we instead force all indexes along a dimension to be in a MI, tolerate the short-term constraints of the current MI implementation, and where needed build out additional features? That would (ideally) leave us uncoupled to MIs - if we built a better in-memory data structure, we could transition. The contract would be around the cases above. -- ...and as mentioned above, these are intended as questions rather than high-confidence views.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
392833478	https://github.com/pydata/xarray/issues/1603#issuecomment-392833478	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDM5MjgzMzQ3OA==	shoyer 1217238	2018-05-29T16:04:27Z	2018-05-29T16:04:27Z	MEMBER	Sure, this is as good a time as any. But we'll probably need to finish this refactoring before it makes sense to implement anything. On Tue, May 29, 2018 at 8:59 AM Alistair Miles wrote: Ok, cool. Was wondering if now was the right time to revisit that, alongside the work proposed in this PR. Happy to participate in that discussion, still interested in implementing some alternative index classes. On Tue, 29 May 2018, 15:45 Stephan Hoyer wrote: Yes, the index API still needs to be determined. But I think we want to support something like that. On Tue, May 29, 2018 at 1:20 AM Alistair Miles wrote: I see this mentions an Index API, is that still to be decided? On Tue, 29 May 2018, 05:28 Stephan Hoyer wrote: I started thinking about how to do this incrementally, and it occurs to me that a good place to start would be to write some of the utility functions we'll need for this: Normalizing and creating default indexes in the Dataset/DataArray constructor. Combining indexes from all xarray objects that are inputs for an operation into indexes for the outputs. Extracting MultiIndex objects from arguments into Dataset/DataArray and expanding them into multiple variables. I drafted up docstrings for each of these functions and did a little bit of work starting to think through implementations in #2195 https://github.com/pydata/xarray/pull/2195. So this would be a great place for others to help out. Each of these could be separate PRs.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
392831984	https://github.com/pydata/xarray/issues/1603#issuecomment-392831984	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDM5MjgzMTk4NA==	alimanfoo 703554	2018-05-29T15:59:46Z	2018-05-29T15:59:46Z	CONTRIBUTOR	Ok, cool. Was wondering if now was the right time to revisit that, alongside the work proposed in this PR. Happy to participate in that discussion, still interested in implementing some alternative index classes. On Tue, 29 May 2018, 15:45 Stephan Hoyer wrote: Yes, the index API still needs to be determined. But I think we want to support something like that. On Tue, May 29, 2018 at 1:20 AM Alistair Miles wrote: I see this mentions an Index API, is that still to be decided? On Tue, 29 May 2018, 05:28 Stephan Hoyer wrote: I started thinking about how to do this incrementally, and it occurs to me that a good place to start would be to write some of the utility functions we'll need for this: Normalizing and creating default indexes in the Dataset/DataArray constructor. Combining indexes from all xarray objects that are inputs for an operation into indexes for the outputs. Extracting MultiIndex objects from arguments into Dataset/DataArray and expanding them into multiple variables. I drafted up docstrings for each of these functions and did a little bit of work starting to think through implementations in #2195 https://github.com/pydata/xarray/pull/2195. So this would be a great place for others to help out. Each of these could be separate PRs.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
392803210	https://github.com/pydata/xarray/issues/1603#issuecomment-392803210	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDM5MjgwMzIxMA==	shoyer 1217238	2018-05-29T14:45:12Z	2018-05-29T14:45:12Z	MEMBER	Yes, the index API still needs to be determined. But I think we want to support something like that.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
392692996	https://github.com/pydata/xarray/issues/1603#issuecomment-392692996	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDM5MjY5Mjk5Ng==	alimanfoo 703554	2018-05-29T08:20:22Z	2018-05-29T08:20:22Z	CONTRIBUTOR	I see this mentions an Index API, is that still to be decided?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
392649605	https://github.com/pydata/xarray/issues/1603#issuecomment-392649605	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDM5MjY0OTYwNQ==	shoyer 1217238	2018-05-29T04:28:45Z	2018-05-29T04:28:45Z	MEMBER	I started thinking about how to do this incrementally, and it occurs to me that a good place to start would be to write some of the utility functions we'll need for this: 1. Normalizing and creating default `indexes` in the `Dataset`/`DataArray` constructor. 2. Combining indexes from all xarray objects that are inputs for an operation into indexes for the outputs. 3. Extracting MultiIndex objects from arguments into Dataset/DataArray and expanding them into multiple variables. I drafted up docstrings for each of these functions and did a little bit of work starting to think through implementations in https://github.com/pydata/xarray/pull/2195. So this would be a great place for others to help out. Each of these could be separate PRs.	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
379905457	https://github.com/pydata/xarray/issues/1603#issuecomment-379905457	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDM3OTkwNTQ1Nw==	shoyer 1217238	2018-04-09T21:52:02Z	2018-04-11T04:34:43Z	MEMBER	I've been thinking about getting started on this. Here are my current thoughts on the right design approach. Data model `Dataset.indexes` and `DataArray.indexes` My current thinking is that `indexes` should simply be a dictionary mapping from coordinate and/or dimension names to `pandas.Index` objects. Mapping from label-based to integer-based then becomes simply a matter of looking up the appropriate indexes for each coordinate/dimension (i.e., the keyword argument names in `.sel()`), and using the corresponding index(es) to transform label-based indexers into integer indexers. If multiple coordinates are part of the same index, they should point to the same `MultiIndex`/`KDTree` object. The MultiIndex would be responsible for resolving the combined indexing operation along the coordinate dimension(s). By default, `indexes` is populated with an Index/MultiIndex for each dimension of all indexes along that dimension. Additional indexes may be set manually, e.g., using `set_index()`. Indexes keyed by a dimension name are used for axis-positional indexing with `.loc` and for alignment with `reindex`/`align`. However, if the index is a MultiIndex with a level name matching a coordinate, then only that level will be used for indexing/alignment. In other words: the coordinate name corresponding to indexing request takes precedence, but if it isn't found, we use all indexes along the dimension. Separate indexers without a MultiIndex should be prohibited It should be impossible to express inconsistent and/or confusing states in xarray's data model. This sort of inconsistency (e.g., levels not being stored directly in `Dataset.variables`) is the major source of our issues with the current MultiIndex data model. 
I'm particularly concerned about clearly showing the difference between coordinates that are part of a `MultiIndex` and coordinates that are separately indexed. I suspect we could make indexing operations nearly equivalent from a user perspective, but there would likely remain small differences that would be a source of confusion and bugs. Preserving indexes in the form in which they are created is also not really an option, because there are lots of xarray operations that would probably normalize indexes into standard forms, such as groupby, stack/unstack and to/from_pandas. The simplest option is to prohibit one of these cases entirely, either: 1. Always group repeated indexes along a dimension into a MultiIndex, or 2. Never use `pandas.MultiIndex` (keep separate indexes for each coordinate). From xarray's perspective, it would certainly be cleaner to prohibit MultiIndex. The level order dependent behavior of MultiIndex is not the best fit for xarray's data model, and could be challenging to keep in sync with coordinate order on xarray objects. We would need to ensure that coordinate/level order remains consistent in all operations, or at least ensure that coordinates are always printed in order of their appearance in MultiIndex levels. (We generally preserve coordinate order already, but well behaved programs using xarray currently don't need to rely on this behavior.) That said, always using MultiIndexes for multiple indexes along the same dimension has its own clear advantages. First, it's consistent with pandas, which makes it easier to transition data back and forth. Second, simultaneous indexing operations across MultiIndex levels would be difficult to express efficiently without a MultiIndex. This is probably the right choice for xarray. We could potentially allow for non-consolidated indexes (not part of a MultiIndex) when using the advanced API (e.g., directly setting the `indexes` parameter). But we'll save this for later. 
Functionality Index variables Every MultiIndex level must have a corresponding xarray.Variable object in coordinates on each Dataset/DataArray on which they appear. These objects may reference the same `pandas.Index`/`pandas.MultiIndex` object used for indexing, but must have immutable data (e.g., `flags.writeable = False` in NumPy). For now, I expect to reuse the existing `IndexVariable` class. Now that levels are xarray.Variable objects, there will no longer be a `Variable` object in `Dataset._variables`/`DataArray._coords` corresponding to a `pandas.MultiIndex`. However, we will continue to create a "virtual variable" upon indexing consisting of a dtype=object array of MultiIndex values, as a fallback if there is no coordinate matching a dimension name. Mapping indexes into pandas Another concern is how to map all of the new possible indexing states into pandas: ``` case 1 (one indexed variable, same name as dimension): time (time) case 2 (one indexed variable, different name from dimension): year (time) case 3 (multiple indexed variables, one has same name as dimension): time (time) year (time) case 4 (multiple indexed variables, all have different names from dimension): year (time) month (time) ``` For consistency with current behavior, case 1 should correspond to a standard `pandas.Index` and case 4 should correspond to a `pandas.MultiIndex`. But what about the intermediate cases 2 and 3, which are currently prohibited by xarray's data model? I think we should use the rule that all indexed variables are consolidated into a single Index in pandas. If there are multiple indexed variables (case 3 or 4), this would be a MultiIndex; otherwise (case 1 or 2), this would be a standard Index. This has a virtue of speed and simplicity: we can simply reuse the existing Index or MultiIndex object from `indexes`. The other option would be to prohibit cases 2 and 3 (like we currently do), because we will not be able to map them into pandas and back faithfully. 
I think this would be a mistake, because indexes on multiple levels would be useful for xarray, even if one level corresponds to the dimension name. Indexes for unstack With the introduction of more flexible and optional index levels, it may not always make sense to `unstack()` every index coordinate. We should support optionally specifying levels to unstack, possibly with an API mirroring `stack()`, e.g., perhaps `.unstack(dim_name=['level_0', 'level_1'])` to unstack coordinates `level_0` and `level_1` from dimension `dim_name`.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
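The consolidation rule proposed in the comment above (all indexed variables along a dimension map to a single pandas index) can be sketched in plain Python. This is only an illustration of the four cases as described; the function name `pandas_index_kind` and its return format are hypothetical and not part of xarray or pandas.

```python
def pandas_index_kind(dim, indexed_vars):
    """Classify the indexed variables along `dim` into cases 1-4.

    Hypothetical helper: returns (case, kind), where kind is the pandas
    index type the proposal would map to ("Index" or "MultiIndex").
    """
    if not indexed_vars:
        raise ValueError("at least one indexed variable is required")
    has_dim_name = dim in indexed_vars
    if len(indexed_vars) == 1:
        # A single indexed variable maps to a standard Index (cases 1 and 2).
        case = 1 if has_dim_name else 2
        kind = "Index"
    else:
        # Several indexed variables are consolidated (cases 3 and 4).
        case = 3 if has_dim_name else 4
        kind = "MultiIndex"
    return case, kind


print(pandas_index_kind("time", ["time"]))           # case 1
print(pandas_index_kind("time", ["year"]))           # case 2
print(pandas_index_kind("time", ["time", "year"]))   # case 3
print(pandas_index_kind("time", ["year", "month"]))  # case 4
```

Under this reading, only the number of indexed variables decides between `Index` and `MultiIndex`; whether one of them shares the dimension name only distinguishes the case label.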
380323532	https://github.com/pydata/xarray/issues/1603#issuecomment-380323532	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDM4MDMyMzUzMg==	max-sixty 5635139	2018-04-11T04:28:53Z	2018-04-11T04:28:53Z	MEMBER	Overall, I agree with the proposed conclusion. And appreciate the level of thoughtfulness and clarity. I'm happy to help with some of the implementation if we can split this up.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
379937531	https://github.com/pydata/xarray/issues/1603#issuecomment-379937531	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDM3OTkzNzUzMQ==	shoyer 1217238	2018-04-10T00:42:19Z	2018-04-10T00:42:19Z	MEMBER	@fujiisoup Yes, we certainly could add a "N-dimensional index", even if it has no function other than a placeholder to mark a variable as an index. This would let us restore index state after selecting/concatenating along a dimension. However, I'm not sure it would be a satisfactory solution. If we keep these indexes around like coordinates, we could end up with scalar coordinates from different dimensions. Then it's still not clear how they should stack up in the final result -- we would have the same issue we currently have with concatenating coordinates. The other concern is that the existence and behavior of scalar/N-dimensional indexes could be surprising. What does it mean to index an N-dimensional index? This operation probably cannot be supported in a sensible way, or at least not without significant effort.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
379920389	https://github.com/pydata/xarray/issues/1603#issuecomment-379920389	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDM3OTkyMDM4OQ==	fujiisoup 6815844	2018-04-09T23:03:03Z	2018-04-09T23:04:01Z	MEMBER	@shoyer, thank you for detailing. I am thinking about how we can establish the following `selecting-concatenating` behavior with a MultiIndex(-like) coordinate with our new Indexes machinery, `xr.concat([da.isel(x=i) for i in range(len(da['x']))], dim='x')` Personally, I think it would be nice if we could recover the original Index structure. We may need to track Indexes objects even when the corresponding dimension becomes one dimensional? But a scalar index sounds strange... Or, we may give up restoring the original coordinate structure during the above action, but still keep them as ordinary coordinates.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
340012824	https://github.com/pydata/xarray/issues/1603#issuecomment-340012824	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDM0MDAxMjgyNA==	shoyer 1217238	2017-10-27T15:59:51Z	2017-10-27T15:59:51Z	MEMBER	@jjpr-mit can you explain your use case a little more? What sort of order dependent queries do you want to do? The ones that come to mind for me are range-based queries, e.g., `[('bar', 1) : ('foo', 9)]`. I think it is still relatively easy to ensure a unique ordering between levels, based on the order of coordinate variables in the xarray dataset. A bigger challenge is that for efficiency, these sorts of queries depend critically on having an actual MultiIndex. This means that if indexes for each of the levels arise from different arguments that were merged together, we might need to "merge" the separate indexes into a joint MultiIndex. This could potentially be slightly expensive.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
340005903	https://github.com/pydata/xarray/issues/1603#issuecomment-340005903	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDM0MDAwNTkwMw==	jjpr-mit 25231875	2017-10-27T15:34:42Z	2017-10-27T15:34:42Z	NONE	Will the new API preserve the order of the levels? One of the features that's necessary for `MultiIndex` to be truly hierarchical is that there is a defined order to the levels.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
338622746	https://github.com/pydata/xarray/issues/1603#issuecomment-338622746	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDMzODYyMjc0Ng==	alimanfoo 703554	2017-10-23T10:56:40Z	2017-10-23T10:56:40Z	CONTRIBUTOR	Just to say I'm interested in how MultiIndexes are handled also. In our use case, we have two variables conventionally named CHROM (chromosome) and POS (position) which together describe a location in a genome. I want to combine both variables into a multi-index so I can, e.g., select all data from some data variable for chromosome X between positions 100,000-200,000. For all our data variables, this genome location multi-index would be used to index the first dimension.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
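The genome-location use case in the comment above can be illustrated directly with pandas, which is what a `MultiIndex`-backed xarray index would delegate to. This is a minimal sketch with made-up values; the level names `CHROM` and `POS` come from the comment, while the series name `calls` and the data are assumptions for illustration.

```python
import pandas as pd

# A lexsorted two-level index: chromosome name, then position.
index = pd.MultiIndex.from_arrays(
    [["X", "X", "X", "Y"], [50_000, 150_000, 180_000, 120_000]],
    names=["CHROM", "POS"],
)
calls = pd.Series([0.1, 0.2, 0.3, 0.4], index=index)

# Range query across both levels at once: chromosome X, positions
# 100,000-200,000. Tuple slicing requires the index to be sorted.
selected = calls.loc[("X", 100_000):("X", 200_000)]
print(selected)
```

The slice bounds do not need to be present in the index; on a sorted `MultiIndex` pandas resolves them by position, which is why such range queries depend on having an actual `MultiIndex` rather than two separate indexes.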
336496995	https://github.com/pydata/xarray/issues/1603#issuecomment-336496995	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDMzNjQ5Njk5NQ==	shoyer 1217238	2017-10-13T16:09:23Z	2017-10-13T16:09:38Z	MEMBER	I am wondering what the advantageous cases which are realized with this Index concept are. The other advantage is that it solves many of the issues with the current `MultiIndex` implementation. Making MultiIndex levels their own variables considerably simplifies the data model, and means that many features (including serialization) should "just work". In principle, this data model would allow for two mostly equivalent indexing schemes: MultiIndex[time, space] vs two indexes Index[time] and Index[space]. I like the latter one, as it is easier to understand even for non-pandas users. I agree, but there are probably some advantages to using a MultiIndex internally. For example, it allows for looking up on multiple levels at the same time. What does the actual implementation look like? xr.Dataset.indexes will be an OrderedDict that maps from variable's name to its associated dimension? Actual instance of Index will be one of xr.Dataset.variables? I think we could get away with making `xr.Dataset.indexes` simply a dict, with keys given by index names and values given by a `pandas.Index` instance. We should enforce that `Index.name` or `MultiIndex.names` corresponds to coordinate variables. For KDTree, this means we'll have to write our own wrapper `KDTreeIndex` that adds a `names` property, but we would probably need to add special methods like `get_indexer` anyways.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
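The constraint mentioned above, that `Index.name` or `MultiIndex.names` should correspond to coordinate variables, could be checked roughly as follows. This is a sketch under stated assumptions: `check_index_names` is a hypothetical helper, not an xarray function, and `indexes` is modeled as a plain dict of pandas indexes as the comment suggests.

```python
import pandas as pd


def check_index_names(indexes, coord_names):
    """Raise ValueError if an index refers to a name that is not a coordinate.

    Hypothetical helper. `indexes` maps index names to pandas.Index objects;
    `coord_names` is the set of coordinate variable names on the object.
    """
    for key, index in indexes.items():
        # `.names` works for both Index (one name) and MultiIndex (one per level).
        missing = [name for name in index.names if name not in coord_names]
        if missing:
            raise ValueError(f"index {key!r} refers to non-coordinates: {missing}")


indexes = {
    "exp_time": pd.MultiIndex.from_arrays(
        [[0, 0, 1], [0.0, 0.1, 0.0]], names=["experiment", "time"]
    ),
}
check_index_names(indexes, {"experiment", "time"})  # passes silently
```

A wrapper such as the `KDTreeIndex` mentioned in the comment would need to expose a compatible `names` attribute for a check like this to apply to it as well.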
336381864	https://github.com/pydata/xarray/issues/1603#issuecomment-336381864	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDMzNjM4MTg2NA==	fujiisoup 6815844	2017-10-13T08:09:25Z	2017-10-13T08:09:25Z	MEMBER	Thanks for the details. (Sorry for my late response. It took me a long time to understand what it looks like.) I am wondering what the advantageous cases which are realized with this `Index` concept are. As far as I understand, it will enable more flexible indexing, e.g. more than one index can be associated with one dimension and we can select from these coordinate values very flexibly. It will naturally integrate more advanced indexes such as `KDTree`. Are these correct? Probably the most elegant rule would again be to check all indexed variables for exact matches. That sounds reasonable. In principle, this data model would allow for two mostly equivalent indexing schemes: MultiIndex[time, space] vs two indexes Index[time] and Index[space]. I like the latter one, as it is easier to understand even for non-pandas users. What does the actual implementation look like? `xr.Dataset.indexes` will be an `OrderedDict` that maps from variable's name to its associated dimension? Actual instance of Index will be one of `xr.Dataset.variables`?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
334229444	https://github.com/pydata/xarray/issues/1603#issuecomment-334229444	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDMzNDIyOTQ0NA==	shoyer 1217238	2017-10-04T17:27:44Z	2017-10-04T17:27:44Z	MEMBER	Use cases of the independent Index and dims Would it be general cases where dimension and index are independent? (It is the case only for MultiIndex and KDtree)? We would still assign default indexes (using a normal `pandas.Index`) when you assign a 1D coordinate with matching name and dimension. But in general, yes, it seems like you should be able to make an index even for variables that aren't dimensions, including for a 1D variable whose name does not match a dimension. The rule would be that any coordinates can be part of an index. Another aspect to consider is how to handle alignment when you have indexes along non-dimension coordinates. Probably the most elegant rule would again be to check all indexed variables for exact matches. Directly assigning indexes rather than using this default or `set_index()` would be an advanced feature, not recommended for everyday use. The main use case is routines which create a new xarray object based on an existing one, and want to re-use old indexes. For performance reasons, we probably do not want to actually check the values of manually assigned indexes, although we should verify that the shape matches. (We would have a clear disclaimer that if you manually assign an index with mismatched values the behavior is not well defined.) In principle, this data model would allow for two mostly equivalent indexing schemes: `MultiIndex[time, space]` vs two indexes `Index[time]` and `Index[space]`. We would need to figure out how to propagate and compare indexes like this. (I suppose if the coordinate values match, the result could have the union of all indexes from input arguments.) 
MultiIndex implementation In MultiIndex case, will a xarray object store a MultiIndex object and also the level variables as Variable objects (there will be some duplicates)? Yes, this is a little unfortunate. We could potentially make a custom wrapper for use in `IndexVariable._data` on the level variables that lazily computes values from the MultiIndex (similar to our `LazilyIndexedArray` class), but I'm not certain yet that this is necessary. If indexes[dim] returns multiple Variables, which realizes a MultiIndex-like structure without pd.MultiIndex, indexes would be very different from dim, because a single dimension can have multiple indexes. Every entry in `indexes` should be a single `pandas.Index` or subclass, including `MultiIndex` (possibly eventually allowing for index-like objects such as something based on a KDTree).	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
334125888	https://github.com/pydata/xarray/issues/1603#issuecomment-334125888	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDMzNDEyNTg4OA==	fujiisoup 6815844	2017-10-04T11:25:14Z	2017-10-04T12:43:59Z	MEMBER	@shoyer, could you add more details of this idea? I think I do not yet fully understand the practical difference between `dim` and `index`. Use cases of the independent `Index` and `dims` Would it be general cases where `dimension` and `index` are independent? (It is the case only for `MultiIndex` and `KDtree`)? `MultiIndex` implementation In `MultiIndex` case, will a xarray object store a `MultiIndex` object and also the level variables as `Variable` objects (there will be some duplicates)? If `indexes[dim]` returns multiple `Variable`s, which realizes a `MultiIndex`-like structure without `pd.MultiIndex`, `indexes` would be very different from `dim`, because a single dimension can have multiple `indexes`.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
334091075	https://github.com/pydata/xarray/issues/1603#issuecomment-334091075	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDMzNDA5MTA3NQ==	benbovy 4160723	2017-10-04T08:52:08Z	2017-10-04T08:52:08Z	MEMBER	I think that promoting "Indexes" to a first-class concept is indeed a very good idea, at both internal and public levels, even if at the latter level it would be another concept for users (it should be already familiar for pandas users, though). IMHO the "coordinate" and "index" concepts are different enough to consider them separately. I like the proposed repr for `Dataset.indexes`. I wouldn't mind if it is not included in `Dataset.__repr__`, considering that multi-indexes, kdtree, etc. only represent a few use cases. In too many cases it could result in a long, uninformative list of simple `pandas.Index`. I have to think a bit more about the details but I like the idea.	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
334048571	https://github.com/pydata/xarray/issues/1603#issuecomment-334048571	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDMzNDA0ODU3MQ==	shoyer 1217238	2017-10-04T04:45:07Z	2017-10-04T04:45:07Z	MEMBER	CC @benbovy @fmaussion	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
334045987	https://github.com/pydata/xarray/issues/1603#issuecomment-334045987	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDMzNDA0NTk4Nw==	shoyer 1217238	2017-10-04T04:19:55Z	2017-10-04T04:20:25Z	MEMBER	Does your proposal mean that Dataset will keep an additional attribute indexes, and indexes[dim] gives a pd.Index (or pd.MultiIndex, KDTree)? Yes, exactly. We actually already have an attribute that works like this, but it's currently computed lazily, from either `Dataset._variables` or `DataArray._coords`.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
334041813	https://github.com/pydata/xarray/issues/1603#issuecomment-334041813	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDMzNDA0MTgxMw==	shoyer 1217238	2017-10-04T03:40:13Z	2017-10-04T04:15:39Z	MEMBER	I sometimes find it helpful to think about what the right `repr()` looks like, and then work backwards from there to the right data model. For example, we might imagine that "Indexes" are no longer coordinates, but instead their own entry in the repr: `<xarray.Dataset (exp_time: 5)> Coordinates: * experiment (exp_time) int64 0 0 0 1 1 * time (exp_time) float64 0.0 0.1 0.2 0.0 0.15 Indexes: exp_time: pandas.MultiIndex[experiment, time]` "Indexes" might not even need to be part of the main `Dataset.__repr__`, but it would certainly be the repr for `Dataset.indexes`. Other entries could include: `time: pandas.Datetime64Index[time] space: scipy.spatial.KDTree[latitude, longitude]` In this model: We would promote "Indexes" to a first-class concept in the xarray data model: (a) The levels of a MultiIndex would have corresponding `Variable` objects and be found in `coords`. (b) In contrast, the `MultiIndex` would not have a corresponding `Variable` object or be part of `coords`, though it could still be returned upon `__getitem__` access (computed on demand from `.indexes`). (c) Dataset and DataArray would gain an `indexes` argument in their constructors, which could be used for passing indexes on to new xarray objects. Coordinates marked with `*` are part of an index. They can't be modified, unless all corresponding indexes are removed. Indexes would still be propagated, like coordinates.	{ "total_count": 5, "+1": 5, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
334043044	https://github.com/pydata/xarray/issues/1603#issuecomment-334043044	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDMzNDA0MzA0NA==	fujiisoup 6815844	2017-10-04T03:51:57Z	2017-10-04T03:51:57Z	MEMBER	I think we currently assume `variables[dim]` is an Index. Does your proposal mean that `Dataset` will keep an additional attribute `indexes`, and `indexes[dim]` gives a `pd.Index` (or `pd.MultiIndex`, `KDTree`)? It sounds like a much cleaner data model.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
334030279	https://github.com/pydata/xarray/issues/1603#issuecomment-334030279	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDMzNDAzMDI3OQ==	shoyer 1217238	2017-10-04T02:03:39Z	2017-10-04T02:03:39Z	MEMBER	One API design challenge here is that I think we still want an explicit notion of "indexed" variables. We could possibly allow operations like `.sel()` on non-indexed variables, but they would be slower, because we would not want to create expensive hash-tables (i.e., `pandas.Index`) in a non-transparent fashion.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
334029215	https://github.com/pydata/xarray/issues/1603#issuecomment-334029215	https://api.github.com/repos/pydata/xarray/issues/1603	MDEyOklzc3VlQ29tbWVudDMzNDAyOTIxNQ==	fujiisoup 6815844	2017-10-04T01:55:02Z	2017-10-04T01:55:02Z	MEMBER	I'm using `MultiIndex` a lot, but I noticed that it is just a workaround to index along multiple kinds of coordinates. Consider the following example, ```python In [1]: import numpy as np ...: import xarray as xr ...: da = xr.DataArray(np.arange(5), dims=['x'], ...: coords={'experiment': ('x', [0, 0, 0, 1, 1]), ...: 'time': ('x', [0.0, 0.1, 0.2, 0.0, 0.15])}) ...: In [2]: da Out[2]: <xarray.DataArray (x: 5)> array([0, 1, 2, 3, 4]) Coordinates: experiment (x) int64 0 0 0 1 1 time (x) float64 0.0 0.1 0.2 0.0 0.15 Dimensions without coordinates: x ``` I want to do something like this `python da.sel(experiment=0).sel(time=0.1)` but this is not possible. MultiIndexing enables this, `python In [2]: da = da.set_index(exp_time=['experiment', 'time']) ...: da ...: Out[2]: <xarray.DataArray (x: 5)> array([0, 1, 2, 3, 4]) Coordinates: * exp_time (exp_time) MultiIndex - experiment (exp_time) int64 0 0 0 1 1 - time (exp_time) float64 0.0 0.1 0.2 0.0 0.15 Dimensions without coordinates: x` If we could make a selection from a non-index coordinate, `MultiIndex` would not be necessary for this case. I think there should be other important use cases of `MultiIndex`. I would be happy if anyone could list them in this issue.	{ "total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);

