issue_comments

24 rows where issue = 262642978 and user = 1217238 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
523240818 https://github.com/pydata/xarray/issues/1603#issuecomment-523240818 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDUyMzI0MDgxOA== shoyer 1217238 2019-08-21T00:00:43Z 2021-03-03T16:46:25Z MEMBER

Explicitly propagating indexes requires going through most of xarray's source code and auditing each time we create a Dataset or DataArray object with low-level operations. We have some pretty decent testing functions for this in the form of xarray.testing._assert_internal_invariants, so this is now a pretty mechanical process -- you know it's working if you're now setting indexes explicitly and xarray's test suite passes.

Here's our current progress:
- [x] most of dataset.py
- [x] alignment.py
- [x] merge.py (#3234)
- [ ] concat.py
- [x] dataarray.py (#3519, #3481)
- [ ] computation.py
- [ ] groupby.py
- [ ] resample.py
- [ ] rolling.py
- [ ] everything else!

{
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
549179102 https://github.com/pydata/xarray/issues/1603#issuecomment-549179102 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDU0OTE3OTEwMg== shoyer 1217238 2019-11-03T21:12:25Z 2019-11-03T21:12:25Z MEMBER

I'm not working on any of these right now. You might start with a few of the dataarray.py methods (no need to do them all at once) to get a sense of what piping these arguments around looks like. I suspect you could get quite a few of these working just by handling indexes in _to_temp_dataset/_from_temp_dataset.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
450702503 https://github.com/pydata/xarray/issues/1603#issuecomment-450702503 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDQ1MDcwMjUwMw== shoyer 1217238 2019-01-01T00:54:27Z 2019-01-01T00:54:27Z MEMBER

I'm starting to make these changes incrementally -- the first step is in https://github.com/pydata/xarray/pull/2639.

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
444204957 https://github.com/pydata/xarray/issues/1603#issuecomment-444204957 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDQ0NDIwNDk1Nw== shoyer 1217238 2018-12-04T18:25:33Z 2018-12-04T18:25:33Z MEMBER

Sorry for maybe asking this again but I'm a bit confused now: is there any good reason of supporting "multiple single indexes" along the same dimension?

After all, perhaps better defaults would be to set indexes (pandas.Index) only for 1-d coordinates matching dimension names, like it is the case now.

If you want a different behavior, then you need to use .set_index(), which would raise if it results in multiple single indexes along a dimension. We could also add a new indexes argument to the Dataset / DataArray constructors to save some typing (and avoid the creation of in-memory pandas.Index for very long coordinates if an out-of-core alternative is later supported).

I discussed this a little bit above in https://github.com/pydata/xarray/issues/1603#issuecomment-442661526, under "MultiIndex as part of the data schema".

I agree that the default behavior should still be to create automatic indexes only for 1d coordinates matching dimension names. But we still will have (rare?) cases where "multiple single indexes" could arise from combining arguments with different indexes.

For example, suppose the station dimension has an index for station_name in one dataset and city in another. Should the result be:
- A MultiIndex with levels station_name and city? This would be most useful for future operations.
- Two individual indexes for station_name and city? This would be the cheapest result to construct.
- An error? This is arguably too strict, because there are no conflicts in either of the indexes.

I guess the error is probably the best idea.
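To make the MultiIndex option concrete, here is a minimal pandas-only sketch of combining two single indexes along a shared station dimension. The station and city labels are made up for illustration, and this is not xarray's implementation.

```python
import pandas as pd

# Hypothetical labels along a shared "station" dimension of length 3.
station_name = pd.Index(["KSFO", "KJFK", "KORD"])
city = pd.Index(["San Francisco", "New York", "Chicago"])

# Build one MultiIndex with both levels. This allocates new integer codes,
# so it costs more than keeping the two indexes separate.
combined = pd.MultiIndex.from_arrays(
    [station_name, city], names=["station_name", "city"]
)

# Either level can still be used for label-based lookup.
by_name = combined.get_level_values("station_name").get_loc("KJFK")
by_city = combined.get_level_values("city").get_loc("Chicago")
```

Keeping the two indexes separate would skip the `from_arrays` step entirely, which is why that option is the cheapest to construct.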

Where does come from array([0, 1])? I wouldn't have been surprised if a KeyError was raised instead. Perhaps this specific case was initially for backward compatibility when the "dimensions without indexes" feature has been introduced, but it was a long time ago and I'm not sure this is still necessary.

This is indeed the historical genesis, but I agree that this is confusing and we should deprecate/remove it.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
443044579 https://github.com/pydata/xarray/issues/1603#issuecomment-443044579 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDQ0MzA0NDU3OQ== shoyer 1217238 2018-11-30T00:24:39Z 2018-11-30T00:24:39Z MEMBER

I wonder if we should also change the default value of the append argument in set_index() to append=None, which means something like "append if creating a MultiIndex". For most users, keeping a single MultiIndex is the most usable way to use multiple indexes along a dimension, and our default behavior should reflect that.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442965602 https://github.com/pydata/xarray/issues/1603#issuecomment-442965602 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDQ0Mjk2NTYwMg== shoyer 1217238 2018-11-29T19:38:34Z 2018-11-29T19:38:34Z MEMBER

It occurs to me that for the case of "multiple single indexes" along the same dimension there is no good way to use them simultaneously for indexing/reindexing. We should explicitly raise if you try to do this.

I guess we have a few options for automatic alignment with multiple single indexes, too:
1. We could only support "exact" indexing
2. We could require that aligning each index separately gives the same result

(2) seems least restrictive and is probably the right choice.
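A rough sketch of what option (2) could look like with plain pandas indexes follows. The helper name `align_positions` and the dict-of-indexes layout are assumptions made for illustration, not an existing xarray API.

```python
import numpy as np
import pandas as pd


def align_positions(left, right):
    """Return, for each element of `left`, its position in `right`.

    `left` and `right` map index names to pandas.Index objects that all lie
    along the same dimension. Raises if the separate indexes disagree.
    """
    # Align each index separately: -1 marks labels missing from `right`.
    indexers = [right[name].get_indexer(left[name]) for name in left]
    first = indexers[0]
    for other in indexers[1:]:
        if not np.array_equal(first, other):
            raise ValueError("indexes along this dimension align inconsistently")
    return first


left = {"station_name": pd.Index(["a", "b", "c"]), "city": pd.Index(["x", "y", "z"])}
right = {"station_name": pd.Index(["c", "a", "b"]), "city": pd.Index(["z", "x", "y"])}
positions = align_positions(left, right)
```

Here both indexes imply the same reordering, so alignment succeeds; if `city` in `right` were left in its original order, the helper would raise.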


One advantage of not having MultiIndex objects as variables is that the serialization story gets simpler. The rule becomes "multi-indexes don't get saved".


What should the default behavior of set_index(['x', 'y']) without an explicit kind argument be?
- Should this mean individual indexes or a combined MultiIndex? The latter might be more surprising but is arguably more useful. It would make sense if the model is that set_index() always creates a single index object.
- We could potentially automatically pick an index type using simple heuristics. For example, if the arguments are 1D, you get a MultiIndex by default. If the arguments have two or more dimensions, you get a KDTree.
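The heuristic in the second bullet might be sketched as below. The function name `pick_index_kind` and the string return values are hypothetical, and returning a plain "Index" for a single 1D argument is an added assumption.

```python
import numpy as np


def pick_index_kind(*coords):
    """Guess an index type from the dimensionality of the coordinate arrays."""
    arrays = [np.asarray(c) for c in coords]
    if all(a.ndim == 1 for a in arrays):
        # 1D arguments: a MultiIndex when there are several (assumption: a
        # plain Index when there is only one).
        return "MultiIndex" if len(arrays) > 1 else "Index"
    # Any argument with two or more dimensions: fall back to a KDTree.
    return "KDTree"


kind_1d = pick_index_kind([1, 2, 3], [4, 5, 6])
kind_2d = pick_index_kind(np.zeros((2, 2)), np.ones((2, 2)))
```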

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442956167 https://github.com/pydata/xarray/issues/1603#issuecomment-442956167 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDQ0Mjk1NjE2Nw== shoyer 1217238 2018-11-29T19:10:14Z 2018-11-29T19:10:14Z MEMBER

Looking at the reported issues related to multi-indexes in xarray, I have the same feeling. Simply reusing pandas.MultiIndex in xarray where slightly different semantics are generally expected has shown to be painful. It seems easier to have our own baked solution and deal with differences during xarray<-> pandas conversion if needed.

I think the pandas.MultiIndex is a pretty solid data structure on a fundamental level, it just has some weird semantics for some indexing edge cases. Whether or not we write xarray.MultiIndex structure, we can achieve most of what we want with a thin layer over pandas.MultiIndex.

If a variable for each multi-coordinate index is "just" for data schema consistency, then why not showing all those indexes in a separate section of the repr?

Yes, I like this! Generally I like @benbovy's entire proposal :).

@fujiisoup can you clarify the use-cases you have for a MultiIndex as a variable?

Am I right in thinking the Multi-indexes is only a helpful note to users, rather than conveying anything about how data is accessed?

From a data perspective, the only thing having an Index and/or MultiIndex should change is that the data is immutable.

But by necessity the nature of the index will determine which indexing operations are possible/efficient. For example, if you want to do nearest-neighbor indexing with multiple coordinates you'll need a KDTree. We should not be afraid to raise errors if an indexing operation can't be done efficiently.


With regards to reindexing: I don't think this needs any special handling versus normal indexing (sel()). The rules basically fall out of those for normal indexing, except we handle missing values differently (by filling with NaN).

Another issue: how do we do automatic alignment with multiple indexes? Let me suggest a straw-man proposal: We always align indexed coordinates. If a coordinate is used in different types of indexes (e.g., a base Index in one argument and a MultiIndex level in another), we can either:
1. create a MultiIndex with the variable on the fly (this could be slightly expensive), or
2. fall back to only supporting "exact" indexing

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442710536 https://github.com/pydata/xarray/issues/1603#issuecomment-442710536 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDQ0MjcxMDUzNg== shoyer 1217238 2018-11-29T05:23:33Z 2018-11-29T05:25:48Z MEMBER

There's no need to support indexing like ds.sel(multi=list_of_pairs). Indexing like ds.sel(x=..., y=...) solves the same use case and looks nicer.

This needs an important caveat: it's only true that you use ds.sel(x=..., y=...) to emulate ds.sel(multi=list_of_pairs) if you do explicit vectorized indexing like in @max-sixty's example above (https://github.com/pydata/xarray/issues/1603#issuecomment-442636798). It would be nice to preserve a way to select a list of particular points that didn't require constructing explicit DataArray objects as the indexers. (But maybe this is a somewhat niche use-case and it isn't worth the trouble.)

Let me make a tentative proposal: we should model a MultiIndex in xarray as exactly equivalent to a sparse multi-dimensional array, except with missing elements modeled implicitly (by omission) instead of explicitly (with NaN). If we do this, I think MultiIndex semantics could be defined to be identical to those of separable Index objects.
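As an illustration of this equivalence, the pandas-only sketch below holds the same data in both forms: implicitly, where missing elements are simply absent from the MultiIndex, and explicitly, where they are filled with NaN. The values are made up.

```python
import pandas as pd

# Implicit form: the (b, 2) element is missing by omission.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1)], names=["x", "y"]
)
implicit = pd.Series([10.0, 20.0, 30.0], index=index)

# Explicit form: every (x, y) combination is present, missing ones are NaN.
full = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["x", "y"])
explicit = implicit.reindex(full)

# Dropping the NaN entries recovers the implicit form.
recovered = explicit.dropna()
```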

One challenge is that we will definitely have to make some intentional deviations from the behavior of pandas, at least when dealing with array indexing of a MultiIndex level. Pandas has some strange behaviors with array indexing of a MultiIndex level, and I'm honestly not sure if they are bugs or features:
- It ignores missing labels (https://github.com/pandas-dev/pandas/issues/15452)
- It drops duplicate labels (https://github.com/pandas-dev/pandas/issues/19414)

Fortunately, the MultiIndex data model is not that complicated, and it is quite straightforward to remap indexing results from sub-Index levels onto integer codes. I suspect we will find it easier to rewrite some of these routines than to change pandas, both because pandas may not agree with different semantics and because the pandas indexing code is an unholy mess.

For example, we can reproduce the above issues:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays([['a', 'b', 'c']])
print(index.get_locs((['a', 'a'],)))  # [0]
print(index.get_locs((['a', 'd'],)))  # [0]
```

We actually want something more like:

```python
def get_locs(index, key):
    return index.get_indexer(pd.MultiIndex.from_product(key))

print(get_locs(index, (['a', 'a'],)))  # [0, 0]
print(get_locs(index, (['a', 'd'],)))  # [0, -1]
```
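The remapping from a single level onto integer codes mentioned above could look roughly like this. The helper `locs_for_level` is hypothetical, and it deliberately implements mask-style selection (every row whose level value is among the requested labels, in positional order), not the duplicate-preserving semantics of `get_indexer`.

```python
import numpy as np
import pandas as pd


def locs_for_level(index, labels, level):
    """Positions in a MultiIndex whose value at `level` is one of `labels`."""
    # Look the labels up in the (small) level index; -1 means "not found".
    wanted = index.levels[level].get_indexer(labels)
    wanted = wanted[wanted >= 0]
    # Compare against the integer codes instead of looping over each point.
    mask = np.isin(index.codes[level], wanted)
    return np.flatnonzero(mask)


index = pd.MultiIndex.from_product([list("ab"), [1, 2, 3]])
locs = locs_for_level(index, [1, 3], level=1)
```

Because the lookup happens once on the level and the rest is vectorized over the codes, this avoids a Python loop over individual points.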

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442680467 https://github.com/pydata/xarray/issues/1603#issuecomment-442680467 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDQ0MjY4MDQ2Nw== shoyer 1217238 2018-11-29T02:15:48Z 2018-11-29T02:19:06Z MEMBER

That said, I still don't know how to use public MultiIndex methods for this. Neither index.get_loc_level([1, 2], level=1) nor index.get_loc((slice(None), [1, 2])) work.

The answer is the index.get_locs() method: index.get_locs([slice(None), [1, 2]]) works.

It's painfully slow for large numbers of points due to a Python loop over each point, but presumably that could be optimized:

```python
# Run in IPython (uses the %timeit magic).
import numpy as np
import pandas as pd

x = np.arange(10000)
index = pd.MultiIndex.from_arrays([x])
%timeit index.get_locs((x,))  # 1.31 s per loop
%timeit index.levels[0].get_indexer(x)  # 93 µs per loop
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442581754 https://github.com/pydata/xarray/issues/1603#issuecomment-442581754 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDQ0MjU4MTc1NA== shoyer 1217238 2018-11-28T19:51:42Z 2018-11-29T00:48:53Z MEMBER

I've been thinking about this a little more in the context of starting on the implementation (in #2195).

In particular, I no longer agree with this "Separate indexers without a MultiIndex should be prohibited" from my original proposal. The problem is that the semantics of a MultiIndex are not quite the same as separate indexes, and I don't think all use-cases are well solved by always using a MultiIndex. ~~For example, I don't think it's possible to do point-wise indexing along anything other than the first level of a MultiIndex.~~ (note: this is not true, see https://github.com/pydata/xarray/issues/1603#issuecomment-442662561)

Instead, I think we should make the model transparent by retaining an xarray variable for the MultiIndex, and provide APIs for explicitly converting index types.

e.g., for the repr with a MultiIndex:

```
Coordinates:
  * x        (x) MultiIndex[level_1, level_2]
  * level_1  (x) object 'a' 'a' 'b' 'b'
  * level_2  (x) int64 1 2 1 2
```

and without a MultiIndex:

```
Coordinates:
  * level_1  (x) object 'a' 'a' 'b' 'b'
  * level_2  (x) int64 1 2 1 2
```

The main way in which this could get confusing is if you explicitly mutate the Dataset to remove some but not all of the variables corresponding to the MultiIndex (e.g., x but not level_1 or vice versa). We have a few potential options here:
1. Don't worry about it: if you mutate objects, you can potentially end up in slightly confusing internal states. If you care about whether level_1 uses a pandas.Index or pandas.MultiIndex, you can find out for sure by checking ds.indexes['level_1'].
2. Prohibit it in our data model: either (a) raise an error if you try to manually delete a single variable or (b) automatically delete all associated variables, too. Encourage using various explicit APIs that return new objects with a new index.
3. Use a different indicator than * for marking "indirect" indexes, so it's more obvious if some coordinates get removed, e.g.,

```
Coordinates:
  * x        (x) MultiIndex[level_1, level_2]
  + level_1  (x) object 'a' 'a' 'b' 'b'
  + level_2  (x) int64 1 2 1 2
```

The different indicator might make sense regardless but I am also partial to "Prohibit it in our data model." The main downside is that this adds a little more complexity to the logic for determining indexes resulting from an operation (namely, verifying that all MultiIndex levels still correspond to coordinates).
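The extra verification step could be as small as the sketch below. `check_levels_are_coords` is a hypothetical helper operating on a set of coordinate names and a dict of pandas indexes; it is not part of xarray.

```python
import pandas as pd


def check_levels_are_coords(coord_names, indexes):
    """Raise if any MultiIndex level no longer has a matching coordinate."""
    for index in indexes.values():
        if isinstance(index, pd.MultiIndex):
            missing = [name for name in index.names if name not in coord_names]
            if missing:
                raise ValueError(f"MultiIndex levels without coordinates: {missing}")


multi = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2]], names=["level_1", "level_2"]
)
indexes = {"x": multi}

# Consistent state: both levels are still coordinates, so this passes silently.
check_levels_are_coords({"x", "level_1", "level_2"}, indexes)
```

Calling the same check after dropping only `level_1` from the coordinate names would raise, which is the "prohibit it" behavior described above.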

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442662561 https://github.com/pydata/xarray/issues/1603#issuecomment-442662561 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDQ0MjY2MjU2MQ== shoyer 1217238 2018-11-29T00:48:12Z 2018-11-29T00:48:28Z MEMBER

For example, I don't think it's possible to do point-wise indexing along anything other than the first level of a MultiIndex.

This is clearly not true, since it works in pandas:

```python
import pandas as pd

index = pd.MultiIndex.from_product([list('ab'), [1, 2]])
series = pd.Series(range(4), index)
print(series.loc[:, [1, 2]])
```

That said, I still don't know how to use public MultiIndex methods for this. Neither index.get_loc_level([1, 2], level=1) nor index.get_loc((slice(None), [1, 2])) work.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442661526 https://github.com/pydata/xarray/issues/1603#issuecomment-442661526 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDQ0MjY2MTUyNg== shoyer 1217238 2018-11-29T00:42:39Z 2018-11-29T00:42:39Z MEMBER

@max-sixty I like your schema vs. implementation breakdown. In general, I agree with you that it would be nice to have MultiIndex as an implementation detail rather than part of xarray's schema. But I'm not entirely sure that's feasible.

Let's try to list out the pros/cons. Consider a MultiIndex 'multi' with levels 'x' and 'y':
- Advantages of MultiIndex as part of the data schema:
  - There is an explicit coordinate (of tuples) corresponding to MultiIndex values, which can be returned from ds.coords['multi']. This is inherently not that useful compared to the separable variables, but is a cleaner solution than creating ds.coords['multi'] as a "virtual" variable on the fly (which we would need for backwards compatibility).
  - We don't need to do full "normalization" when multiple indexes along the same dimension are encountered, e.g., in an operation that combines two different indexes, we would simply put both on the result instead of building a MultiIndex (which would require allocating a whole new array of integer codes).
  - The nature of the MultiIndex is more transparent as part of the data model. For example, if x and y are numeric, it could make sense to use either a MultiIndex or KDTree for indexing. Explicit APIs (e.g., set_multiindex and set_kdtree) would allow users a high level of control.
  - For advanced use-cases, it is potentially easier to work around the limitations of a MultiIndex, e.g., the way that some operations require lex-sorted-ness.
- Advantages of MultiIndex as an implementation detail:
  - Simpler data model (for users). There are few good use cases for multiple indexes that aren't a MultiIndex.
  - Easier to do automatic alignment: we know that indexes will always have the same normalized form (in a MultiIndex). Otherwise, we would have to do this on the fly, or request that users explicitly set up compatible indexes.
  - More flexibility for xarray: we can potentially swap out indexing without changing the user-facing API. We might have something like a "hybrid" MultiIndex/KDTree that chooses the appropriate index based on the requested operation.
  - We don't need to create an explicit array of tuples for the MultiIndex variable (but we could still have a variable corresponding to a MultiIndex and only construct the .data array in a "lazy" fashion).
  - There's no need to name extraneous variables that only exist for the sake of a MultiIndex.
  - There's no need to support indexing like ds.sel(multi=list_of_pairs). Indexing like ds.sel(x=..., y=...) solves the same use case and looks nicer. That said, this would be a minor backwards compatibility break (this currently works in xarray).

P.S. I haven't made much progress on this yet so there's definitely still time to figure out the right decision -- thanks for your engagement on this!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
392833478 https://github.com/pydata/xarray/issues/1603#issuecomment-392833478 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDM5MjgzMzQ3OA== shoyer 1217238 2018-05-29T16:04:27Z 2018-05-29T16:04:27Z MEMBER

Sure, this is as good a time as any. But we'll probably need to finish this refactoring before it makes sense to implement anything.

On Tue, May 29, 2018 at 8:59 AM Alistair Miles notifications@github.com wrote:

Ok, cool. Was wondering if now was right time to revisit that, alongside the work proposed in this PR. Happy to participate in that discussion, still interested in implementing some alternative index classes.

On Tue, 29 May 2018, 15:45 Stephan Hoyer, notifications@github.com wrote:

Yes, the index API still needs to be determined. But I think we want to support something like that.

On Tue, May 29, 2018 at 1:20 AM Alistair Miles notifications@github.com wrote:

I see this mentions an Index API, is that still to be decided?

On Tue, 29 May 2018, 05:28 Stephan Hoyer, notifications@github.com wrote:

I started thinking about how to do this incrementally, and it occurs to me that a good place to start would be to write some of the utility functions we'll need for this:

  1. Normalizing and creating default indexes in the Dataset/DataArray constructor.
  2. Combining indexes from all xarray objects that are inputs for an operation into indexes for the outputs.
  3. Extracting MultiIndex objects from arguments into Dataset/DataArray and expanding them into multiple variables.

I drafted up docstrings for each of these functions and did a little bit of work starting to think through implementations in #2195 https://github.com/pydata/xarray/pull/2195. So this would be a great place for others to help out. Each of these could be separate PRs.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
392803210 https://github.com/pydata/xarray/issues/1603#issuecomment-392803210 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDM5MjgwMzIxMA== shoyer 1217238 2018-05-29T14:45:12Z 2018-05-29T14:45:12Z MEMBER

Yes, the index API still needs to be determined. But I think we want to support something like that.

On Tue, May 29, 2018 at 1:20 AM Alistair Miles notifications@github.com wrote:

I see this mentions an Index API, is that still to be decided?

On Tue, 29 May 2018, 05:28 Stephan Hoyer, notifications@github.com wrote:

I started thinking about how to do this incrementally, and it occurs to me that a good place to start would be to write some of the utility functions we'll need for this:

  1. Normalizing and creating default indexes in the Dataset/DataArray constructor.
  2. Combining indexes from all xarray objects that are inputs for an operation into indexes for the outputs.
  3. Extracting MultiIndex objects from arguments into Dataset/DataArray and expanding them into multiple variables.

I drafted up docstrings for each of these functions and did a little bit of work starting to think through implementations in #2195 https://github.com/pydata/xarray/pull/2195. So this would be a great place for others to help out. Each of these could be separate PRs.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
392649605 https://github.com/pydata/xarray/issues/1603#issuecomment-392649605 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDM5MjY0OTYwNQ== shoyer 1217238 2018-05-29T04:28:45Z 2018-05-29T04:28:45Z MEMBER

I started thinking about how to do this incrementally, and it occurs to me that a good place to start would be to write some of the utility functions we'll need for this:
1. Normalizing and creating default indexes in the Dataset/DataArray constructor.
2. Combining indexes from all xarray objects that are inputs for an operation into indexes for the outputs.
3. Extracting MultiIndex objects from arguments into Dataset/DataArray and expanding them into multiple variables.

I drafted up docstrings for each of these functions and did a little bit of work starting to think through implementations in https://github.com/pydata/xarray/pull/2195. So this would be a great place for others to help out. Each of these could be separate PRs.
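As a rough idea of what the third utility might do, the sketch below expands a pandas.MultiIndex into one array of values per level. The function name `expand_multiindex` is hypothetical and does not reflect the docstrings drafted in the pull request.

```python
import pandas as pd


def expand_multiindex(index):
    """Map each level name of a MultiIndex to its per-element values."""
    return {name: index.get_level_values(name) for name in index.names}


index = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2]], names=["level_1", "level_2"]
)
levels = expand_multiindex(index)
```

Each value in `levels` has the same length as the original MultiIndex, so it could back a separate coordinate variable along the same dimension.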

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
379905457 https://github.com/pydata/xarray/issues/1603#issuecomment-379905457 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDM3OTkwNTQ1Nw== shoyer 1217238 2018-04-09T21:52:02Z 2018-04-11T04:34:43Z MEMBER

I've been thinking about getting started on this. Here are my current thoughts on the right design approach.


Data model

Dataset.indexes and DataArray.indexes

My current thinking is that indexes should simply be a dictionary mapping from coordinate and/or dimension names to pandas.Index objects. Mapping from label-based to integer-based then becomes simply a matter of looking up the appropriate indexes for each coordinate/dimension (i.e., the keyword argument names in .sel()), and using the corresponding index(es) to transform label-based indexers into integer indexers.

If multiple coordinates are part of the same index, they should point to the same MultiIndex/KDTree object. The MultiIndex would be responsible for resolving the combined indexing operation along the coordinate dimension(s).
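A minimal sketch of this lookup, using only pandas, is shown below. The `indexes` dict layout follows the description above, but the helper `labels_to_positions` and its return format (keyed by the tuple of coordinate names that share an index) are assumptions for illustration.

```python
import pandas as pd


def labels_to_positions(indexes, labels):
    """Translate {coordinate name: label} into integer positions."""
    # Group the requested names by the index object they point to.
    groups = {}
    for name in labels:
        index = indexes[name]
        groups.setdefault(id(index), (index, []))[1].append(name)

    positions = {}
    for index, names in groups.values():
        if isinstance(index, pd.MultiIndex):
            # The shared MultiIndex resolves the combined request; levels
            # that were not requested are left unconstrained.
            key = tuple(labels.get(level, slice(None)) for level in index.names)
            positions[tuple(names)] = index.get_locs(key)
        else:
            positions[tuple(names)] = index.get_loc(labels[names[0]])
    return positions


time = pd.Index([10, 20, 30], name="time")
multi = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2]], names=["level_1", "level_2"]
)
# level_1 and level_2 point to the same MultiIndex object.
indexes = {"time": time, "level_1": multi, "level_2": multi}
positions = labels_to_positions(indexes, {"time": 20, "level_1": "b", "level_2": 1})
```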

By default, indexes is populated with an Index/MultiIndex for each dimension of all indexes along that dimension. Additional indexes may be set manually, e.g., using set_index().

Indexes keyed by a dimension name are used for axis-positional indexing with .loc and for alignment with reindex/align. However, if the index is a MultiIndex with a level name matching a coordinate, then only that level will be used for indexing/alignment. In other words: the coordinate name corresponding to the indexing request takes precedence, but if it isn't found, we use all indexes along the dimension.

Separate indexers without a MultiIndex should be prohibited

It should be impossible to express inconsistent and/or confusing states in xarray's data model. This sort of inconsistency (e.g., levels not being stored directly in Dataset.variables) is the major source of our issues with the current MultiIndex data model.

I'm particularly concerned about clearly showing the difference between coordinates that are part of a MultiIndex and coordinates that are separately indexed. I suspect we could make indexing operations nearly equivalent from a user perspective, but there would likely remain small differences that would be a source of confusion and bugs. Preserving indexes in the form in which they are created is also not really an option, because there are lots of xarray operations that would probably normalize indexes into standard forms, such as groupby, stack/unstack and to/from_pandas.

The simplest option is to prohibit one of these cases entirely, either:
1. Always group repeated indexes along a dimension into a MultiIndex, or
2. Never use pandas.MultiIndex (keep separate indexes for each coordinate).

From xarray's perspective, it would certainly be cleaner to prohibit MultiIndex. The level order dependent behavior of MultiIndex is not the best fit for xarray's data model, and could be challenging to keep in sync with coordinate order on xarray objects. We would need to ensure that coordinate/level order remains consistent in all operations, or at least ensure that coordinates are always printed in order of their appearance in MultiIndex levels. (We generally preserve coordinate order already, but well behaved programs using xarray currently don't need to rely on this behavior.)

That said, always using MultiIndexes for multiple indexes along the same dimension has its own clear advantages. First, it's consistent with pandas, which makes it easier to transition data back and forth. Second, simultaneous indexing operations across MultiIndex levels would be difficult to express efficiently without a MultiIndex. This is probably the right choice for xarray.

We could potentially allow for non-consolidated indexes (not part of a MultiIndex) when using the advanced API (e.g., directly setting the indexes parameter). But we'll save this for later.

Functionality

Index variables

Every MultiIndex level must have a corresponding xarray.Variable object in coordinates on each Dataset/DataArray on which they appear. These objects may reference the same pandas.Index/pandas.MultiIndex object used for indexing, but must have immutable data (e.g., flags.writeable = False in NumPy). For now, I expect to reuse the existing IndexVariable class.
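For reference, this is how NumPy's read-only flag behaves on a plain array. The example uses only NumPy and says nothing about how IndexVariable enforces immutability internally.

```python
import numpy as np

level_values = np.array([1, 2, 1, 2])
# Mark the buffer as read-only.
level_values.flags.writeable = False

try:
    level_values[0] = 99
    mutated = True
except ValueError:
    # NumPy refuses assignment to a read-only array.
    mutated = False
```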

Now that levels are xarray.Variable objects, there will no longer be a Variable object in Dataset._variables/DataArray._coords corresponding to a pandas.MultiIndex. However, we will continue to create a "virtual variable" upon indexing consisting of a dtype=object array of MultiIndex values, as a fallback if there is no coordinate matching a dimension name.

Mapping indexes into pandas

Another concern is how to map all of the new possible indexing states into pandas:

```

case 1 (one indexed variable, same name as dimension):

  • time (time)

case 2 (one indexed variable, different name from dimension):

  • year (time)

case 3 (multiple indexed variables, one has same name as dimension):

  • time (time)
  • year (time)

case 4 (multiple indexed variables, all have different names from dimension):

  • year (time)
  • month (time)
```

For consistency with current behavior, case 1 should correspond to a standard pandas.Index and case 4 should correspond to a pandas.MultiIndex. But what about the intermediate cases 2 and 3, which are currently prohibited by xarray's data model?

I think we should use the rule that all indexed variables are consolidated into a single Index in pandas. If there are multiple indexed variables (cases 3 or 4), this would be a MultiIndex; otherwise (cases 1 or 2), this would be a standard Index. This has a virtue of speed and simplicity: we can simply reuse the existing Index or MultiIndex object from indexes.
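A rough sketch of this consolidation rule, written against plain pandas: one indexed variable gives a standard Index, several give a MultiIndex. The helper `consolidate_indexed_variables` is hypothetical and builds the index from raw values, whereas the proposal above would simply reuse the object already stored in indexes.

```python
import pandas as pd


def consolidate_indexed_variables(indexed):
    """Hypothetical helper: map {variable name: values} along one dimension
    to a single pandas index."""
    names = list(indexed)
    if not names:
        raise ValueError("at least one indexed variable is required")
    if len(names) == 1:
        # A single indexed variable maps to a standard Index.
        return pd.Index(indexed[names[0]], name=names[0])
    # Several indexed variables map to a MultiIndex, one level per variable.
    return pd.MultiIndex.from_arrays(
        [indexed[name] for name in names], names=names
    )


# One indexed variable whose name differs from the dimension name.
single = consolidate_indexed_variables({"year": [2000, 2001]})
# Two indexed variables along the same dimension.
multi = consolidate_indexed_variables(
    {"year": [2000, 2000], "month": [1, 2]}
)
```

Note that the dimension name plays no role in the result; only the number of indexed variables matters.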

The other option would be to prohibit cases 2 and 3 (like we currently do), because we will not be able to map them into pandas and back faithfully. I think this would be a mistake, because indexes on multiple levels would be useful for xarray, even if one level corresponds to the dimension name.

Indexes for unstack

With the introduction of more flexible and optional index levels, it may not always make sense to unstack() every index coordinate. We should support optionally specifying levels to unstack, possibly with an API mirroring stack(), e.g., perhaps .unstack(dim_name=['level_0', 'level_1']) to unstack coordinates level_0 and level_1 from dimension dim_name.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
379937531 https://github.com/pydata/xarray/issues/1603#issuecomment-379937531 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDM3OTkzNzUzMQ== shoyer 1217238 2018-04-10T00:42:19Z 2018-04-10T00:42:19Z MEMBER

@fujiisoup Yes, we certainly could add a "N-dimensional index", even if it has no function other than a placeholder to mark a variable as an index. This would let us restore index state after selecting/concatenating along a dimension.

However, I'm not sure it would be a satisfactory solution. If we keep these indexes around like coordinates, we could end up with scalar coordinates from different dimensions. Then it's still not clear how they should stack up in the final result -- we would have the same issue we currently have with concatenating coordinates.

The other concern is that the existence and behavior of scalar/N-dimensional indexes could be surprising. What does it mean to index an N-dimensional index? This operation probably cannot be supported in a sensible way, or at least not without significant effort.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
340012824 https://github.com/pydata/xarray/issues/1603#issuecomment-340012824 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDM0MDAxMjgyNA== shoyer 1217238 2017-10-27T15:59:51Z 2017-10-27T15:59:51Z MEMBER

@jjpr-mit can you explain your use case a little more? What sort of order-dependent queries do you want to do? The ones that come to mind for me are range-based queries, e.g., [('bar', 1) : ('foo', 9)].

I think it is still relatively easy to ensure a unique ordering between levels, based on the order of coordinate variables in the xarray dataset.

A bigger challenge is that for efficiency, these sorts of queries depend critically on having an actual MultiIndex. This means that if indexes for each of the levels arise from different arguments that were merged together, we might need to "merge" the separate indexes into a joint MultiIndex. This could potentially be slightly expensive.
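As a small pandas-only illustration of both points (a sketch, not xarray code): two separate level arrays are merged into a joint MultiIndex, which then supports a range query across levels. The level names and values are invented for the example.

```python
import pandas as pd

# Two level arrays that might have arrived as separate per-coordinate indexes.
letters = ["bar", "bar", "baz", "foo", "foo"]
numbers = [1, 5, 3, 2, 9]

# "Merging" them into one MultiIndex requires a pass over both arrays.
joint = pd.MultiIndex.from_arrays([letters, numbers], names=["letter", "number"])

# Range slicing by tuple needs a lexicographically sorted index.
series = pd.Series(range(len(joint)), index=joint).sort_index()

# Label-based range query across both levels; both endpoints are included.
selected = series.loc[("bar", 5):("foo", 2)]
```

Here `selected` holds the entries for ('bar', 5), ('baz', 3) and ('foo', 2). The sort step is part of why building the joint index is not free.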

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
336496995 https://github.com/pydata/xarray/issues/1603#issuecomment-336496995 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDMzNjQ5Njk5NQ== shoyer 1217238 2017-10-13T16:09:23Z 2017-10-13T16:09:38Z MEMBER

I am wondering what the advantageous cases which are realized with this Index concept are.

The other advantage is that it solves many of the issues with the current MultiIndex implementation. Making MultiIndex levels their own variables considerably simplifies the data model, and means that many features (including serialization) should "just work".

In principle, this data model would allow for two mostly equivalent indexing schemes: MultiIndex[time, space] vs two indexes Index[time] and Index[space]. I like the latter one, as it is easier to understand even for non-pandas users.

I agree, but there are probably some advantages to using a MultiIndex internally. For example, it allows for looking up on multiple levels at the same time.

What does the actual implementation look like? xr.Dataset.indexes will be an OrderedDict that maps from variable's name to its associated dimension? Actual instance of Index will be one of xr.Dataset.variables?

I think we could get away with making xr.Dataset.indexes simply a dict, with keys given by index names and values given by a pandas.Index instance. We should enforce that Index.name or MultiIndex.names corresponds to coordinate variables.
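The name check could look roughly like the following. This is a sketch under the assumptions stated in the paragraph above; `check_index_names` is a hypothetical helper, not an existing xarray function. It relies on the fact that both pandas.Index and pandas.MultiIndex expose a `names` attribute.

```python
import pandas as pd


def check_index_names(indexes, coord_names):
    """Hypothetical validation: every index name must be a coordinate name."""
    for key, index in indexes.items():
        # Index.names has one entry; MultiIndex.names has one per level.
        missing = [name for name in index.names if name not in coord_names]
        if missing:
            raise ValueError(
                f"index {key!r} refers to unknown coordinates: {missing}"
            )


coords = {"experiment", "time"}
indexes = {
    "exp_time": pd.MultiIndex.from_arrays(
        [[0, 0, 1], [0.0, 0.1, 0.0]], names=["experiment", "time"]
    )
}
check_index_names(indexes, coords)  # passes silently

try:
    check_index_names({"x": pd.Index([1, 2], name="x")}, coords)
    rejected = False
except ValueError:
    rejected = True
```

An unnamed index (name None) would also be rejected by this check, since None is not a coordinate name.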

For KDTree, this means we'll have to write our own wrapper KDTreeIndex that adds a names property, but we would probably need to add special methods like get_indexer anyways.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
334229444 https://github.com/pydata/xarray/issues/1603#issuecomment-334229444 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDMzNDIyOTQ0NA== shoyer 1217238 2017-10-04T17:27:44Z 2017-10-04T17:27:44Z MEMBER
  1. Use cases of the independent Index and dims Would it be general cases where dimension and index are independent? (It is the case only for MultiIndex and KDtree)?

We would still assign default indexes (using a normal pandas.Index) when you assign a 1D coordinate with matching name and dimension. But in general, yes, it seems like you should be able to make an index even for variables that aren't dimensions, including for a 1D variable whose name does not match a dimension. The rule would be that any coordinates can be part of an index.

Another aspect to consider is how to handle alignment when you have indexes along non-dimension coordinates. Probably the most elegant rule would again be to check all indexed variables for exact matches.

Directly assigning indexes rather than using this default or set_index() would be an advanced feature, not recommended for everyday use. The main use case is routines which create a new xarray object based on an existing one, and want to re-use old indexes.

For performance reasons, we probably do not want to actually check the values of manually assigned indexes, although we should verify that the shape matches. (We would have a clear disclaimer that if you manually assign an index with mismatched values the behavior is not well defined.)
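A minimal sketch of that trade-off: the length is checked, the values are deliberately not compared. `assign_index_unchecked` is a hypothetical name used only for illustration.

```python
import numpy as np
import pandas as pd


def assign_index_unchecked(coord_values, index):
    """Hypothetical: accept a manually supplied index if its length matches
    the 1D coordinate. Values are NOT compared, to keep assignment cheap."""
    coord_values = np.asarray(coord_values)
    if coord_values.ndim != 1 or len(index) != coord_values.shape[0]:
        raise ValueError("index length does not match the coordinate shape")
    return index


# Matching length is accepted, even though the values disagree;
# the resulting behavior is what the disclaimer calls "not well defined".
accepted = assign_index_unchecked([1, 2, 3], pd.Index([10, 20, 30]))

try:
    assign_index_unchecked([1, 2, 3], pd.Index([10, 20]))
    shape_error = False
except ValueError:
    shape_error = True
```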

In principle, this data model would allow for two mostly equivalent indexing schemes: MultiIndex[time, space] vs two indexes Index[time] and Index[space]. We would need to figure out how to propagate and compare indexes like this. (I suppose if the coordinate values match, the result could have the union of all indexes from input arguments.)

  1. MultiIndex implementation In MultiIndex case, will a xarray object store a MultiIndex object and also the level variables as Variable objects (there will be some duplicates)?

Yes, this is a little unfortunate. We could potentially make a custom wrapper for use in IndexVariable._data on the level variables that lazily computes values from the MultiIndex (similar to our LazilyIndexedArray class), but I'm not certain yet that this is necessary.
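Such a wrapper might look something like this. It is only a sketch: `LazyLevelValues` is a hypothetical class, not part of xarray, and it assumes that supporting NumPy's `__array__` protocol is enough for the illustration.

```python
import numpy as np
import pandas as pd


class LazyLevelValues:
    """Hypothetical array-like that stores only a reference to a MultiIndex
    and materializes one level's values when converted to an array."""

    def __init__(self, index, level):
        self._index = index
        self._level = level

    @property
    def shape(self):
        return (len(self._index),)

    def __array__(self, dtype=None, copy=None):
        # Values are computed from the MultiIndex only at this point.
        values = self._index.get_level_values(self._level)
        return np.asarray(values, dtype=dtype)


mi = pd.MultiIndex.from_arrays(
    [[0, 0, 1], [0.0, 0.1, 0.0]], names=["experiment", "time"]
)
lazy = LazyLevelValues(mi, "experiment")
materialized = np.asarray(lazy)
```

Until `np.asarray` is called, no per-level copy of the data exists, which avoids the duplication mentioned in the question.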

If indexes[dim] returns multiple Variables, which realizes a MultiIndex-like structure without pd.MultiIndex, indexes would be very different from dim, because a single dimension can have multiple indexes.

Every entry in indexes should be a single pandas.Index or subclass, including MultiIndex (possibly eventually allowing for index-like objects such as something based on a KDTree).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
334048571 https://github.com/pydata/xarray/issues/1603#issuecomment-334048571 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDMzNDA0ODU3MQ== shoyer 1217238 2017-10-04T04:45:07Z 2017-10-04T04:45:07Z MEMBER

CC @benbovy @fmaussion

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
334045987 https://github.com/pydata/xarray/issues/1603#issuecomment-334045987 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDMzNDA0NTk4Nw== shoyer 1217238 2017-10-04T04:19:55Z 2017-10-04T04:20:25Z MEMBER

Does your proposal means that Dataset will keep an additional attribute indexes, and indexes[dim] gives a pd.Index (or pd.MultiIndex, KDTree)?

Yes, exactly. We actually already have an attribute that works like this, but it's currently computed lazily, from either Dataset._variables or DataArray._coords.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
334041813 https://github.com/pydata/xarray/issues/1603#issuecomment-334041813 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDMzNDA0MTgxMw== shoyer 1217238 2017-10-04T03:40:13Z 2017-10-04T04:15:39Z MEMBER

I sometimes find it helpful to think about what the right repr() looks like, and then work backwards from there to the right data model.

For example, we might imagine that "Indexes" are no longer coordinates, but instead their own entry in the repr:

    <xarray.Dataset (exp_time: 5)>
    Coordinates:
      * experiment  (exp_time) int64 0 0 0 1 1
      * time        (exp_time) float64 0.0 0.1 0.2 0.0 0.15
    Indexes:
        exp_time: pandas.MultiIndex[experiment, time]

"Indexes" might not even need to be part of the main Dataset.__repr__, but it would certainly be the repr for Dataset.indexes. Other entries could include:

        time: pandas.Datetime64Index[time]
        space: scipy.spatial.KDTree[latitude, longitude]

In this model:

  1. We would promote "Indexes" to a first-class concept in the xarray data model:
     (a) The levels of a MultiIndex would have corresponding Variable objects and be found in coords.
     (b) In contrast, the MultiIndex would not have a corresponding Variable object or be part of coords, though it could still be returned upon __getitem__ access (computed on demand from .indexes).
     (c) Dataset and DataArray would gain an indexes argument in their constructors, which could be used for passing indexes on to new xarray objects.
  2. Coordinates marked with * are part of an index. They can't be modified, unless all corresponding indexes are removed.
  3. Indexes would still be propagated, like coordinates.
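
Working backwards from the repr, the "Indexes:" section could be produced from a plain dict of pandas indexes. The following is only a sketch; `format_indexes` is a hypothetical helper and the exact layout is an assumption.

```python
import pandas as pd


def format_indexes(indexes):
    """Hypothetical formatter for an 'Indexes:' repr section."""
    lines = ["Indexes:"]
    for dim, index in indexes.items():
        kind = type(index).__name__
        # One name for a plain Index, one per level for a MultiIndex.
        names = ", ".join(str(name) for name in index.names)
        lines.append(f"    {dim}: pandas.{kind}[{names}]")
    return "\n".join(lines)


mi = pd.MultiIndex.from_arrays(
    [[0, 0, 0, 1, 1], [0.0, 0.1, 0.2, 0.0, 0.15]],
    names=["experiment", "time"],
)
text = format_indexes({"exp_time": mi})
```

Printing `text` gives an "Indexes:" header followed by the line `exp_time: pandas.MultiIndex[experiment, time]`.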
{
    "total_count": 5,
    "+1": 5,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
334030279 https://github.com/pydata/xarray/issues/1603#issuecomment-334030279 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDMzNDAzMDI3OQ== shoyer 1217238 2017-10-04T02:03:39Z 2017-10-04T02:03:39Z MEMBER

One API design challenge here is that I think we still want an explicit notion of "indexed" variables. We could possibly allow operations like .sel() on non-indexed variables, but they would be slower, because we would not want to create expensive hash-tables (i.e., pandas.Index) in a non-transparent fashion.
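The difference can be illustrated with plain NumPy and pandas (a sketch only; how a non-indexed .sel() would actually be implemented is an open question here). An indexed lookup goes through a pandas.Index, while a non-indexed lookup falls back to scanning the values.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30, 40])

# Indexed variable: pandas.Index builds its lookup table once and reuses it.
index = pd.Index(values)
indexed_position = index.get_loc(30)

# Non-indexed variable: every lookup is a linear scan over the values.
scan_position = int(np.flatnonzero(values == 30)[0])
```

Both approaches find the same position; the scan avoids the up-front cost of building an index but pays a full pass on every lookup.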

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);