id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
2247914876,I_kwDOAMm_X86F_HV8,8950,ENH: Make `_to_dataframe` faster for extension array columns after `pandas` fix,43999641,open,0,,,0,2024-04-17T10:10:37Z,2024-04-28T20:03:23Z,,CONTRIBUTOR,,,,"### What is your issue?
Once https://github.com/pandas-dev/pandas/issues/57676 is completed, we should be able to do the joins in the `_to_dataframe` method faster (we need to be able to handle the singleton case, which is the issue with pandas): https://github.com/pydata/xarray/blob/239309f881ba0d7e02280147bc443e6e286e6a63/xarray/core/dataset.py#L7170-L7177
See the discussion [here](https://github.com/pydata/xarray/pull/8723/files#r1506275296).","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8950/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1999657332,I_kwDOAMm_X853MFl0,8463,Categorical Array,43999641,closed,0,,,19,2023-11-17T17:57:12Z,2024-04-18T12:52:04Z,2024-04-18T12:52:04Z,CONTRIBUTOR,,,,"### Is your feature request related to a problem?
We are looking to improve compatibility between `AnnData` and `xarray` (see https://github.com/scverse/anndata/issues/744), and so categoricals are naturally on our roadmap. Thus, I think some sort of standard-use categorical array would be desirable. It seems something similar has come up with [netCDF](https://github.com/pydata/xarray/issues/8144), although my knowledge is limited, so this issue may be more distinct than I am aware. So what comes of this issue may kill two birds with one stone, or it may work towards some common solution that can at least help both use-cases (`AnnData` and `netCDF` `ENUM`).
### Describe the solution you'd like
The goal would be a standard-use categorical data type `xarray` container of some sort. I'm not sure what form this can take. We have something functional [here](https://github.com/scverse/anndata/blob/3a428f4ba9b0df0981e9ec73607ac5b00ed0d32f/anndata/experimental/backed/_lazy_arrays.py#L34-L107) that inherits from `ExplicitlyIndexedNDArrayMixin` and returns `pandas.CategoricalDtype`. So let's say this implementation would be at least a conceptual starting point to work from (it also seems not dissimilar to what is done [here](https://github.com/pydata/xarray/blob/b6eaf436f7b120f5b6b3934892061af1e9ad89fe/xarray/coding/variables.py#L115-L144) for new CF types).
Some issues:
1. I have no idea what a standard ""return type"" for an `xarray` categorical array should be (i.e., `numpy` with the categories applied, `pandas`, something custom, etc.). So I'm not sure if using `pandas.CategoricalDtype` is acceptable, as I do in the linked implementation. Relatedly...
2. I don't think using `pandas.CategoricalDtype` really helps with [the already existing CF Enum need](https://github.com/pydata/xarray/issues/8144) if you want to have the return type be some sort of `numpy` array (although again, not sure about the return type). As I understand it, though, the whole point of categoricals is to use `integers` as the base type and then only show ""strings"" outwardly, i.e., printing, the API for equality operations, accessors, etc., while the internals are based on integers. So I'm not really sure `numpy` is even an option here. Maybe we roll our own solution?
3. I am not sure this is the right level at which to implement this (maybe it should be a `Variable`? I don't think so, but I am just a beginner here 😄). It seems you may want, in addition to the array container, some sort of i/o functionality for this feature (so maybe some on-disk specification?).
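To make point 2 a bit more concrete, here is a minimal sketch of the integer-codes-plus-labels layout, using `pandas.Categorical` purely as an illustration. The variable names are made up for this example, and none of this is meant as a proposed `xarray` API or as a description of the linked implementation:

```python
import pandas as pd

# Illustrative only: how pandas stores a categorical internally.
cat = pd.Categorical(['a', 'b', 'a', 'b', 'c'])

# The base storage is a small integer array of positions...
codes = cat.codes
# ...plus a lookup table holding each distinct label once.
categories = cat.categories

# Outward-facing operations (here, equality) are expressed in terms of labels.
is_a = cat == 'a'

# Decoding back to labels is a plain integer-indexing step, which is roughly
# what a numpy-based return type would have to do eagerly.
decoded = categories.to_numpy()[codes]

print(codes, list(categories), is_a, decoded)
```

So a `numpy`-only return type would presumably mean either exposing the raw integer codes or materializing the decoded object array, which is the trade-off I am unsure about.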
### Describe alternatives you've considered
I think there is some route via `VariableCoder` as hinted [here](https://github.com/pydata/xarray/issues/8144#issuecomment-1712413924) i.e., using `encode`/`decode`. This would probably be more general purpose as we could encode directly to other data types if using `pandas` is not desirable. Maybe this would be a way to support both `netCDF` and returning a `pandas.CategoricalDtype` (again, not sure what the `netCDF` return type should be for `ENUM`).
### Additional context
So just for reference, the current behavior of `to_xarray` with `pandas.CategoricalDtype` is `object` `dtype` from `numpy`:
```python
import pandas as pd
df = pd.DataFrame({'cat': ['a', 'b', 'a', 'b', 'c']})
df['cat'] = df['cat'].astype('category')
df.to_xarray()['cat']
#
# array(['a', 'b', 'a', 'b', 'c'], dtype=object)
# Coordinates:
# * index (index) int64 0 1 2 3 4
```
And as stated in the `netCDF` issue, for that use-case, the information about `ENUM` is lost (from what I can read).
Apologies if I'm missing something here! Feedback welcome! Sorry if this is a bit chaotic, just trying to cover my bases.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8463/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue