id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
2247914876,I_kwDOAMm_X86F_HV8,8950,ENH: Make `_to_dataframe` faster for extension array columns after `pandas` fix,43999641,open,0,,,0,2024-04-17T10:10:37Z,2024-04-28T20:03:23Z,,CONTRIBUTOR,,,,"### What is your issue?

Once https://github.com/pandas-dev/pandas/issues/57676 is completed, we should be able to do the joins in the `_to_dataframe` method faster (we need to be able to handle the singleton case, which is the issue with pandas): https://github.com/pydata/xarray/blob/239309f881ba0d7e02280147bc443e6e286e6a63/xarray/core/dataset.py#L7170-L7177

see discussion [here](https://github.com/pydata/xarray/pull/8723/files#r1506275296)

","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8950/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1999657332,I_kwDOAMm_X853MFl0,8463,Categorical Array,43999641,closed,0,,,19,2023-11-17T17:57:12Z,2024-04-18T12:52:04Z,2024-04-18T12:52:04Z,CONTRIBUTOR,,,,"### Is your feature request related to a problem?

We are looking to improve compatibility between `AnnData` and `xarray` (see https://github.com/scverse/anndata/issues/744), and so categoricals are naturally on our roadmap.  Thus, I think some sort of standard-use categorical array would be desirable.  It seems something similar has come up with [netCDF](https://github.com/pydata/xarray/issues/8144), although my knowledge is limited so this issue may be more distinct than I am aware.  So what comes of this issue may kill two birds with one stone, or it may work towards some common solution that can at least help both use-cases (`AnnData` and `netCDF` `ENUM`).

### Describe the solution you'd like

The goal would be a standard-use categorical data type `xarray` container of some sort.  I'm not sure what form this can take.

We have something functional [here](https://github.com/scverse/anndata/blob/3a428f4ba9b0df0981e9ec73607ac5b00ed0d32f/anndata/experimental/backed/_lazy_arrays.py#L34-L107) that inherits from `ExplicitlyIndexedNDArrayMixin` and returns `pandas.CategoricalDtype`.  So let's say this implementation would be at least a conceptual starting point to work from (it also seems not dissimilar to what is done [here](https://github.com/pydata/xarray/blob/b6eaf436f7b120f5b6b3934892061af1e9ad89fe/xarray/coding/variables.py#L115-L144) for new CF types).

Some issues:
1. I have no idea what a standard ""return type"" for an `xarray` categorical array should be (i.e., `numpy` with the categories applied, `pandas`, something custom etc.).  So I'm not sure if using `pandas.CategoricalDtype` is acceptable, as I do in the linked implementation.  Relatedly....
2. I don't think using `pandas.CategoricalDtype` really helps with [the already existing CF Enum need](https://github.com/pydata/xarray/issues/8144) if you want to have the return type be some sort of `numpy` array (although again, not sure about the return type).  As I understand it, though, the whole point of categoricals is to use `integers` as the base type and then only show ""strings"" outwardly i.e., printing, the API for equality operations, accessors etc., while the internals are based on integers.  So I'm not really sure `numpy` is even an option here.  Maybe we roll our own solution?
3. I am not sure this is the right level at which to implement this (maybe it should be a `Variable`?  I don't think so, but I am just a beginner here 😄 )

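To make the integers-inside, strings-outside point in item 2 concrete, here is a minimal sketch using only `pandas.Categorical` (nothing `xarray`-specific, and not a proposal for the actual container):

```python
import pandas as pd

# A categorical stores small integer codes plus a lookup table of categories.
cat = pd.Categorical(['a', 'b', 'a', 'b', 'c'])

codes = cat.codes            # integer codes, one per element
categories = cat.categories  # the unique labels the codes point into

# Outwardly the values behave like strings, e.g. for equality checks.
is_a = cat == 'a'

# The labels can be rebuilt from the codes and the lookup table.
labels = categories.take(codes)
```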
It seems you may want, in addition to the array container, some sort of i/o functionality for this feature (so maybe some on-disk specification?).

### Describe alternatives you've considered

I think there is some route via `VariableCoder` as hinted [here](https://github.com/pydata/xarray/issues/8144#issuecomment-1712413924) i.e., using `encode`/`decode`.  This would probably be more general purpose as we could encode directly to other data types if using `pandas` is not desirable.  Maybe this would be a way to support both `netCDF` and returning a `pandas.CategoricalDtype` (again, not sure what the `netCDF` return type should be for `ENUM`).

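As a rough illustration of the `encode`/`decode` idea, here is a sketch with plain functions. The names `encode_categorical` and `decode_categorical` and the `attrs` layout are hypothetical and are not part of the `xarray` `VariableCoder` API; the point is only that integer codes plus a small metadata dict are enough to round-trip the data:

```python
import numpy as np
import pandas as pd


def encode_categorical(values):
    # Hypothetical helper: split values into integer codes and metadata.
    cat = pd.Categorical(values)
    attrs = {'categories': list(cat.categories), 'ordered': bool(cat.ordered)}
    return np.asarray(cat.codes), attrs


def decode_categorical(codes, attrs):
    # Hypothetical helper: rebuild the categorical from codes and metadata.
    return pd.Categorical.from_codes(
        codes, categories=attrs['categories'], ordered=attrs['ordered']
    )


codes, attrs = encode_categorical(['a', 'b', 'a', 'b', 'c'])
roundtrip = decode_categorical(codes, attrs)
```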
### Additional context

So just for reference, the current behavior of `to_xarray` with `pandas.CategoricalDtype` is `object` `dtype` from `numpy`:

```python
import pandas as pd
df = pd.DataFrame({'cat': ['a', 'b', 'a', 'b', 'c']})
df['cat'] = df['cat'].astype('category')
df.to_xarray()['cat']
# <xarray.DataArray 'cat' (index: 5)>
# array(['a', 'b', 'a', 'b', 'c'], dtype=object)
# Coordinates:
#   * index    (index) int64 0 1 2 3 4
```

And as stated in the `netCDF` issue, for that use-case, the information about `ENUM` is lost (from what I can read).  

Apologies if I'm missing something here!  Feedback welcome! Sorry if this is a bit chaotic, just trying to cover my bases.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8463/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue