id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
2247914876,I_kwDOAMm_X86F_HV8,8950,ENH: Make `_to_dataframe` faster for extension array columns after `pandas` fix,43999641,open,0,,,0,2024-04-17T10:10:37Z,2024-04-28T20:03:23Z,,CONTRIBUTOR,,,,"### What is your issue? Once https://github.com/pandas-dev/pandas/issues/57676 is completed, we should be able to do the joins in the `_to_dataframe` method faster (we need to be able to handle the singleton case, which is the issue with pandas): https://github.com/pydata/xarray/blob/239309f881ba0d7e02280147bc443e6e286e6a63/xarray/core/dataset.py#L7170-L7177 see discussion [here](https://github.com/pydata/xarray/pull/8723/files#r1506275296) ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8950/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
2125478394,PR_kwDOAMm_X85mZIzr,8723,(feat): Support for `pandas` `ExtensionArray`,43999641,closed,0,,,23,2024-02-08T15:38:18Z,2024-04-18T12:52:06Z,2024-04-18T12:52:03Z,CONTRIBUTOR,,0,pydata/xarray/pulls/8723," Some outstanding points/decisions brought up by this PR: - [ ] Confirm type promotion rules and write them out. As it stands now, if everything is of the same extension array type, it is passed onwards and otherwise is converted to numpy. (related: https://github.com/pydata/xarray/pull/8714) ~- [ ] Acceptance of `plum` as a dispatch method. Without it, the behavior should fall back to the previous one (cast to `numpy` types). I am a big fan of dispatching and think it could serve as a model going forward for making support of other data types/arrays more feasible. 
The other option, I think, would be to just use the underlying `array` of the `ExtensionDuckArray` class to decide and then have some central registry that serves as the basis for a decorator (like the API for accessors via `_CachedAccessor`). That being said, the current defaults are quite good so this is a marginal feature, in all likelihood.~ - [ ] Do we allow just pandas `ExtensionArray` directly or can we also allow `Series`? Possibly missing something else! Let me know! Checklist: - [x] Closes #8463 and Closes #5287 - [x] Tests added - [x] User visible changes (including notable bug fixes) are documented in `whats-new.rst` - [ ] New functions/methods are listed in `api.rst` ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8723/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
1999657332,I_kwDOAMm_X853MFl0,8463,Categorical Array,43999641,closed,0,,,19,2023-11-17T17:57:12Z,2024-04-18T12:52:04Z,2024-04-18T12:52:04Z,CONTRIBUTOR,,,,"### Is your feature request related to a problem? We are looking to improve compatibility between `AnnData` and `xarray` (see https://github.com/scverse/anndata/issues/744), and so categoricals are naturally on our roadmap. Thus, I think some sort of standard-use categoricals array would be desirable. It seems something similar has come up with [netCDF](https://github.com/pydata/xarray/issues/8144), although my knowledge is limited so this issue may be more distinct than I am aware. So what comes of this issue may kill two birds with one stone, or it may work towards some common solution that can at least help both use-cases (`AnnData` and `netCDF` `ENUM`). ### Describe the solution you'd like The goal would be a standard-use categorical data type `xarray` container of some sort. I'm not sure what form this can take. 
We have something functional [here](https://github.com/scverse/anndata/blob/3a428f4ba9b0df0981e9ec73607ac5b00ed0d32f/anndata/experimental/backed/_lazy_arrays.py#L34-L107) that inherits from `ExplicitlyIndexedNDArrayMixin` and returns `pandas.CategoricalDtype`. So let's say this implementation would be at least a conceptual starting point to work from (it also seems not dissimilar to what is done [here](https://github.com/pydata/xarray/blob/b6eaf436f7b120f5b6b3934892061af1e9ad89fe/xarray/coding/variables.py#L115-L144) for new CF types). Some issues: 1. I have no idea what a standard ""return type"" for an `xarray` categorical array should be (i.e., `numpy` with the categories applied, `pandas`, something custom etc.). So I'm not sure if using `pandas.CategoricalDtype` type is acceptable as I do in the linked implementation. Relatedly.... 2. I don't think using `pandas.CategoricalDtype` really helps with [the already existing CF Enum need](https://github.com/pydata/xarray/issues/8144) if you want to have the return type be some sort of `numpy` array (although again, not sure about the return type). As I understand it, though, the whole point of categoricals is to use `integers` as the base type and then only show ""strings"" outwardly i.e., printing, the API for equality operations, accessors etc., while the internals are based on integers. So I'm not really sure `numpy` is even an option here. Maybe we roll our own solution? 3. I am not sure this is the right level at which to implement this (maybe it should be a `Variable`? I don't think so, but I am just a beginner here 😄 ) It seems you may want, in addition to the array container, some sort of i/o functionality for this feature (so maybe some on-disk specification?). ### Describe alternatives you've considered I think there is some route via `VariableCoder` as hinted [here](https://github.com/pydata/xarray/issues/8144#issuecomment-1712413924) i.e., using `encode`/`decode`. 
This would probably be more general purpose as we could encode directly to other data types if using `pandas` is not desirable. Maybe this would be a way to support both `netCDF` and returning a `pandas.CategoricalDtype` (again, not sure what the `netCDF` return type should be for `ENUM`). ### Additional context So just for reference, the current behavior of `to_xarray` with `pandas.CategoricalDtype` is `object` `dtype` from `numpy`: ```python import pandas as pd df = pd.DataFrame({'cat': ['a', 'b', 'a', 'b', 'c']}) df['cat'] = df['cat'].astype('category') df.to_xarray()['cat'] # # array(['a', 'b', 'a', 'b', 'c'], dtype=object) # Coordinates: # * index (index) int64 0 1 2 3 4 ``` And as stated in the `netCDF` issue, for that use-case, the information about `ENUM` is lost (from what I can read). Apologies if I'm missing something here! Feedback welcome! Sorry if this is a bit chaotic, just trying to cover my bases.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8463/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue