id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
1913983402,I_kwDOAMm_X85yFRGq,8233,numbagg & flox,5635139,closed,0,,,13,2023-09-26T17:33:32Z,2023-10-15T07:48:56Z,2023-10-09T15:40:29Z,MEMBER,,,,"### What is your issue?

I've been doing some work recently on our old friend [numbagg](https://github.com/numbagg/numbagg), improving the ewm routines & adding some more. I'm keen to get numbagg back in shape, doing the things that it does best, and trimming anything it doesn't.

I notice that it has [grouped calcs](https://github.com/numbagg/numbagg/blob/main/numbagg/grouped.py). Am I correct to think that [flox](https://github.com/xarray-contrib/flox) does this better? I haven't been up with the latest. flox looks like it's particularly focused on dask arrays, whereas [numpy_groupies](https://github.com/ml31415/numpy-groupies), one of the inspirations for this, was applicable to numpy arrays too.

At least from the xarray perspective, are we OK to deprecate these numbagg functions, and direct folks to flox?","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8233/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
365973662,MDU6SXNzdWUzNjU5NzM2NjI=,2459,Stack + to_array before to_xarray is much faster that a simple to_xarray,5635139,closed,0,,,13,2018-10-02T16:13:26Z,2020-07-02T20:39:01Z,2020-07-02T20:39:01Z,MEMBER,,,,"I was seeing some slow performance around `to_xarray()` on MultiIndexed series, and found that unstacking one of the dimensions before running `to_xarray()`, and then restacking with `to_array()` was ~30x faster. This time difference is consistent with larger data sizes.
To reproduce:

Create a series with a MultiIndex, ensuring the MultiIndex isn't a simple product:

```python
import numpy as np
import pandas as pd

s = pd.Series(
    np.random.rand(100000),
    index=pd.MultiIndex.from_product([
        list('abcdefhijk'),
        list('abcdefhijk'),
        # pd.date_range replaces pd.DatetimeIndex(start=...), which newer pandas no longer accepts
        pd.date_range(start='2000-01-01', periods=1000, freq='B'),
    ]))

# Keep every third row so the index is no longer a full product
cropped = s[::3]
cropped.index = pd.MultiIndex.from_tuples(cropped.index, names=list('xyz'))

cropped.head()

# x  y  z
# a  a  2000-01-03    0.993989
#       2000-01-06    0.850518
#       2000-01-11    0.068944
#       2000-01-14    0.237197
#       2000-01-19    0.784254
# dtype: float64
```

Two approaches for getting this into xarray:

1 - Simple `.to_xarray()`:

```python
current_method = cropped.to_xarray()

array([[[0.993989,      nan, ...,      nan, 0.721663],
        [     nan,      nan, ..., 0.58224 ,      nan],
        ...,
        [     nan, 0.369382, ...,      nan,      nan],
        [0.98558 ,      nan, ...,      nan, 0.403732]],

       [[     nan,      nan, ..., 0.493711,      nan],
        [     nan, 0.126761, ...,      nan,      nan],
        ...,
        [0.976758,      nan, ...,      nan, 0.816612],
        [     nan,      nan, ..., 0.982128,      nan]],

       ...,

       [[     nan, 0.971525, ...,      nan,      nan],
        [0.146774,      nan, ...,      nan, 0.419806],
        ...,
        [     nan,      nan, ..., 0.700764,      nan],
        [     nan, 0.502058, ...,      nan,      nan]],

       [[0.246768,      nan, ...,      nan, 0.079266],
        [     nan,      nan, ..., 0.802297,      nan],
        ...,
        [     nan, 0.636698, ...,      nan,      nan],
        [0.025195,      nan, ...,      nan, 0.629305]]])
Coordinates:
  * x        (x) object 'a' 'b' 'c' 'd' 'e' 'f' 'h' 'i' 'j' 'k'
  * y        (y) object 'a' 'b' 'c' 'd' 'e' 'f' 'h' 'i' 'j' 'k'
  * z        (z) datetime64[ns] 2000-01-03 2000-01-04 ... 
2003-10-30 2003-10-31
```

This takes *536 ms*

2 - unstack in pandas first, and then use `to_array` to do the equivalent of a restack:

```
proposed_version = (
    cropped
    .unstack('y')
    .to_xarray()
    .to_array('y')
)
```

This takes *17.3 ms*

To confirm these are identical:

```
# current_method is the result of the plain cropped.to_xarray() call in approach 1
proposed_version_adj = (
    proposed_version
    .assign_coords(y=proposed_version['y'].astype(object))
    .transpose(*current_method.dims)
)

proposed_version_adj.equals(current_method)
# True
```

#### Problem description

A default operation is much slower than a (potentially) equivalent operation that's not the default.

I need to look more at what's causing the issues. I think it's to do with the `.reindex(full_idx)`, but I'm unclear why it's so much faster in the alternative route, and whether there's a fix that would make the default path fast.

#### Output of ``xr.show_versions()``
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.93-linuxkit-aufs
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.utf8
LOCALE: None.None

xarray: 0.10.9
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: 1.2.1
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.2.3
cartopy: 0.16.0
seaborn: 0.9.0
setuptools: 40.4.3
pip: 18.0
conda: None
pytest: 3.8.1
IPython: 5.8.0
sphinx: None
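
#### Pandas-only sketch of the two routes

For what it's worth, here is a pandas-only sketch of the two routes on a much smaller series. It is not xarray's implementation: the sizes, the `full` index and the variable names are assumptions made for illustration. It only checks that reindexing the long series onto the full product, and unstacking `y` before reindexing the smaller `(x, z)` index, give the same dense array, which would suggest the gap above is one of cost rather than result.

```python
import numpy as np
import pandas as pd

# Small stand-in for the series above (3 x 3 x 20 instead of 10 x 10 x 1000).
full = pd.MultiIndex.from_product(
    [list('abc'), list('abc'), pd.date_range('2000-01-03', periods=20, freq='B')],
    names=list('xyz'),
)
s = pd.Series(np.random.default_rng(0).random(len(full)), index=full)
cropped = s.iloc[::7]  # 26 of 180 rows, so no longer a full product

# Route 1: reindex the long series onto the full product, then reshape to (x, y, z).
dense_direct = cropped.reindex(full).to_numpy().reshape(3, 3, 20)

# Route 2: unstack y first, reindex the smaller (x, z) index, then restack.
xz = pd.MultiIndex.from_product([full.levels[0], full.levels[2]], names=['x', 'z'])
wide = cropped.unstack('y').reindex(index=xz, columns=full.levels[1])
dense_unstacked = wide.to_numpy().reshape(3, 20, 3).transpose(0, 2, 1)

# NaNs mark the missing combinations and are treated as equal here.
same = np.array_equal(dense_direct, dense_unstacked, equal_nan=True)
```

`same` being true says nothing about timing. Whether the `.reindex(full_idx)` step is really where the time goes inside xarray would still need profiling.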
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2459/reactions"", ""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue 115210260,MDU6SXNzdWUxMTUyMTAyNjA=,645,Display of PeriodIndex,5635139,closed,0,,,13,2015-11-05T05:01:22Z,2015-12-30T05:59:05Z,2015-12-30T05:59:05Z,MEMBER,,,,"Not the greatest issue but: While coordinates that are given as `PeriodIndex`es are stored in that form, their `Int` representation is shown in the `DataArray` repr, which adds a frequent additional step to see what dates we're dealing with. Or correct me if I'm making some basic mistake. ``` python In [23]: data_array = xray.DataArray( data=pd.Series(np.random.rand(20), index=pd.period_range(start='2000', periods=20, name='Date')) ) data_array Out[23]: array([ 0.95861189, 0.3607297 , 0.9890032 , 0.77674314, 0.39461886, 0.98425749, 0.79044973, 0.81376587, 0.07091318, 0.02757213, 0.87366025, 0.0496346 , 0.45433931, 0.3339866 , 0.67261248, 0.91684965, 0.60889737, 0.33469611, 0.94966724, 0.50328461]) Coordinates: * Date (Date) int64 10957 10958 10959 10960 10961 10962 10963 10964 ... In [25]: data_array.to_series() Out[25]: Date 2000-01-01 0.958612 2000-01-02 0.360730 2000-01-03 0.989003 2000-01-04 0.776743 2000-01-05 0.394619 2000-01-06 0.984257 2000-01-07 0.790450 2000-01-08 0.813766 2000-01-09 0.070913 2000-01-10 0.027572 2000-01-11 0.873660 2000-01-12 0.049635 2000-01-13 0.454339 2000-01-14 0.333987 2000-01-15 0.672612 2000-01-16 0.916850 2000-01-17 0.608897 2000-01-18 0.334696 2000-01-19 0.949667 2000-01-20 0.503285 Freq: D, dtype: float64 ``` ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/645/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue