issue_comments: 442661526

html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/1603#issuecomment-442661526 https://api.github.com/repos/pydata/xarray/issues/1603 442661526 MDEyOklzc3VlQ29tbWVudDQ0MjY2MTUyNg== 1217238 2018-11-29T00:42:39Z 2018-11-29T00:42:39Z MEMBER

@max-sixty I like your schema vs. implementation breakdown. In general, I agree with you that it would be nice to have MultiIndex as an implementation detail rather than part of xarray's schema. But I'm not entirely sure that's feasible.

Let's try to list out the pros/cons. Consider a MultiIndex 'multi' with levels 'x' and 'y':

- Advantages of MultiIndex as part of the data schema:
    - There is an explicit coordinate (of tuples) corresponding to MultiIndex values, which can be returned from ds.coords['multi']. This is inherently not that useful compared to the separable variables, but is a cleaner solution than creating ds.coords['multi'] as a "virtual" variable on the fly (which we would need for backwards compatibility).
    - We don't need to do full "normalization" when multiple indexes along the same dimension are encountered, e.g., in an operation that combines two different indexes, we would simply put both on the result instead of building a MultiIndex (which would require allocating a whole new array of integer codes).
    - The nature of the MultiIndex is more transparent as part of the data model. For example, if x and y are numeric, it could make sense to use either a MultiIndex or KDTree for indexing. Explicit APIs (e.g., set_multiindex and set_kdtree) would allow users a high level of control.
    - For advanced use-cases, it is potentially easier to work around the limitations of a MultiIndex, e.g., the way that some operations require lex-sorted-ness.
- Advantages of MultiIndex as an implementation detail:
    - Simpler data model (for users). There are few good use cases for multiple indexes that aren't a MultiIndex.
    - Easier to do automatic alignment: we know that indexes will always have the same normalized form (in a MultiIndex). Otherwise, we would have to do this on the fly, or request that users explicitly set up compatible indexes.
    - More flexibility for xarray: we can potentially swap out indexing without changing the user-facing API. We might have something like a "hybrid" MultiIndex/KDTree that chooses the appropriate index based on the requested operation.
    - We don't need to create an explicit array of tuples for the MultiIndex variable (but we could still have a variable corresponding to a MultiIndex and only construct the .data array in a "lazy" fashion).
    - There's no need to name extraneous variables that only exist for the sake of a MultiIndex.
    - There's no need to support indexing like ds.sel(multi=list_of_pairs). Indexing like ds.sel(x=..., y=...) solves the same use case and looks nicer. That said, this would be a minor backwards compatibility break (this currently works in xarray).
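To make the "normalization" cost mentioned above a bit more concrete, here is a minimal pure-Python sketch of the idea that a MultiIndex stores, per level, the unique values plus one integer code per element. The helper names `factorize` and `build_multiindex` are hypothetical and only for illustration; this is not how pandas or xarray actually implement it, just the general shape of the work involved.

```python
def factorize(values):
    """Return (levels, codes): the sorted unique values, and for each
    input value its integer position within those unique values."""
    levels = sorted(set(values))
    position = {value: i for i, value in enumerate(levels)}
    codes = [position[value] for value in values]
    return levels, codes


def build_multiindex(**named_columns):
    """Hypothetical helper: normalize several same-length coordinate
    columns into per-level (levels, codes) pairs."""
    lengths = {len(column) for column in named_columns.values()}
    if len(lengths) != 1:
        raise ValueError("all levels must have the same length")
    # One new list of integer codes is allocated per level; this is the
    # extra allocation that keeping the indexes separate would avoid.
    return {name: factorize(column) for name, column in named_columns.items()}


x = ["a", "a", "b", "b"]
y = [10, 20, 10, 20]
multi = build_multiindex(x=x, y=y)

x_levels, x_codes = multi["x"]  # (["a", "b"], [0, 0, 1, 1])
y_levels, y_codes = multi["y"]  # ([10, 20], [0, 1, 0, 1])
```

Each code list is as long as the dimension itself, so combining two existing indexes into one normalized form scales with the number of elements rather than the number of unique labels.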

P.S. I haven't made much progress on this yet so there's definitely still time to figure out the right decision -- thanks for your engagement on this!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  262642978