issues

3 rows where comments = 13, repo = 13221727, and user = 5635139, sorted by updated_at descending


Issue #8233: numbagg & flox
id: 1913983402 · node_id: I_kwDOAMm_X85yFRGq · user: max-sixty (5635139) · state: closed · comments: 13
created_at: 2023-09-26T17:33:32Z · updated_at: 2023-10-15T07:48:56Z · closed_at: 2023-10-09T15:40:29Z · author_association: MEMBER

What is your issue?

I've been doing some work recently on our old friend numbagg, improving the ewm routines & adding some more.

I'm keen to get numbagg back in shape, doing the things that it does best, and trimming anything it doesn't. I notice that it has grouped calcs. Am I correct to think that flox does this better? I haven't been up with the latest. flox looks like it's particularly focused on dask arrays, whereas numpy_groupies, one of the inspirations for this, was applicable to numpy arrays too.

At least from the xarray perspective, are we OK to deprecate these numbagg functions, and direct folks to flox?
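For context, the kind of grouped calculation in question — reducing values by group label — can be sketched in plain numpy, in the bincount style that numpy_groupies (and later flox) build on. This is an illustrative sketch only, not numbagg's or flox's actual API:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
labels = np.array([0, 1, 0, 1, 2])  # group label per element

# Grouped sum and mean via bincount: one pass over the data,
# no Python-level loop over groups.
sums = np.bincount(labels, weights=values)   # per-group sums
counts = np.bincount(labels)                 # per-group sizes
means = sums / counts                        # per-group means
```

The libraries differ in how they scale this up (flox targets dask arrays in particular), but the core idea — a label array driving a vectorized reduction — is the same.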

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/8233/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: xarray (13221727) · type: issue
Issue #2459: Stack + to_array before to_xarray is much faster than a simple to_xarray
id: 365973662 · node_id: MDU6SXNzdWUzNjU5NzM2NjI= · user: max-sixty (5635139) · state: closed · comments: 13
created_at: 2018-10-02T16:13:26Z · updated_at: 2020-07-02T20:39:01Z · closed_at: 2020-07-02T20:39:01Z · author_association: MEMBER

I was seeing some slow performance from to_xarray() on MultiIndexed series, and found that unstacking one of the dimensions before running to_xarray(), then restacking with to_array(), was ~30x faster. The time difference holds at larger data sizes.

To reproduce:

Create a series with a MultiIndex, ensuring the MultiIndex isn't a simple product:

```python
import numpy as np
import pandas as pd

s = pd.Series(
    np.random.rand(100000),
    index=pd.MultiIndex.from_product([
        list('abcdefhijk'),
        list('abcdefhijk'),
        pd.DatetimeIndex(start='2000-01-01', periods=1000, freq='B'),
    ]))

cropped = s[::3]
cropped.index = pd.MultiIndex.from_tuples(cropped.index, names=list('xyz'))

cropped.head()
```

```
x  y  z
a  a  2000-01-03    0.993989
      2000-01-06    0.850518
      2000-01-11    0.068944
      2000-01-14    0.237197
      2000-01-19    0.784254
dtype: float64
```

Two approaches for getting this into xarray.

1 - Simple .to_xarray():

```python
current_version = cropped.to_xarray()

current_version
```

```
<xarray.DataArray (x: 10, y: 10, z: 1000)>
array([[[0.993989,      nan, ...,      nan, 0.721663],
        [     nan,      nan, ..., 0.58224 ,      nan],
        ...,
        [     nan, 0.369382, ...,      nan,      nan],
        [0.98558 ,      nan, ...,      nan, 0.403732]],

       [[     nan,      nan, ..., 0.493711,      nan],
        [     nan, 0.126761, ...,      nan,      nan],
        ...,
        [0.976758,      nan, ...,      nan, 0.816612],
        [     nan,      nan, ..., 0.982128,      nan]],

       ...,

       [[     nan, 0.971525, ...,      nan,      nan],
        [0.146774,      nan, ...,      nan, 0.419806],
        ...,
        [     nan,      nan, ..., 0.700764,      nan],
        [     nan, 0.502058, ...,      nan,      nan]],

       [[0.246768,      nan, ...,      nan, 0.079266],
        [     nan,      nan, ..., 0.802297,      nan],
        ...,
        [     nan, 0.636698, ...,      nan,      nan],
        [0.025195,      nan, ...,      nan, 0.629305]]])
Coordinates:
  * x        (x) object 'a' 'b' 'c' 'd' 'e' 'f' 'h' 'i' 'j' 'k'
  * y        (y) object 'a' 'b' 'c' 'd' 'e' 'f' 'h' 'i' 'j' 'k'
  * z        (z) datetime64[ns] 2000-01-03 2000-01-04 ... 2003-10-30 2003-10-31
```

This takes 536 ms

2 - Unstack in pandas first, then use to_array to do the equivalent of a restack:

```python
proposed_version = (
    cropped
    .unstack('y')
    .to_xarray()
    .to_array('y')
)
```

This takes 17.3 ms

To confirm these are identical:

```python
proposed_version_adj = (
    proposed_version
    .assign_coords(y=proposed_version['y'].astype(object))
    .transpose(*current_version.dims)
)

proposed_version_adj.equals(current_version)
# True
```

Problem description

A default operation is much slower than a (potentially) equivalent operation that's not the default.

I need to look more at what's causing the issues. I think it's to do with the .reindex(full_idx), but I'm unclear why it's so much faster in the alternative route, and whether there's a fix that we can make to make the default path fast.
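To make that suspicion concrete, here is a tiny sketch (my reading of the slow path, not xarray's actual code) of what the default route effectively does: reindex the series onto the full cartesian product of the index levels, filling missing combinations with NaN, before reshaping into a dense array:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([list('ab'), [0, 1, 2]], names=['x', 'y'])
s = pd.Series(np.arange(6.0), index=idx)[::2]  # drop rows so it's not a full product

# Reindex onto the full product of the levels; missing combinations
# become NaN. On a large, sparse MultiIndex this reindex is the
# expensive step.
full_idx = pd.MultiIndex.from_product(s.index.levels, names=s.index.names)
dense = s.reindex(full_idx)
grid = dense.to_numpy().reshape(2, 3)  # the dense (x, y) array
```

The unstack-then-to_array route avoids building the full product index in one go, which may be why it is so much faster, but that's exactly the part that needs confirming.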

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.93-linuxkit-aufs
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.utf8
LOCALE: None.None

xarray: 0.10.9
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: 1.2.1
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.2.3
cartopy: 0.16.0
seaborn: 0.9.0
setuptools: 40.4.3
pip: 18.0
conda: None
pytest: 3.8.1
IPython: 5.8.0
sphinx: None
```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2459/reactions",
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: xarray (13221727) · type: issue
Issue #645: Display of PeriodIndex
id: 115210260 · node_id: MDU6SXNzdWUxMTUyMTAyNjA= · user: max-sixty (5635139) · state: closed · comments: 13
created_at: 2015-11-05T05:01:22Z · updated_at: 2015-12-30T05:59:05Z · closed_at: 2015-12-30T05:59:05Z · author_association: MEMBER

Not the greatest issue, but: while coordinates given as PeriodIndexes are stored in that form, their integer representation is shown in the DataArray repr, which adds an extra step every time we want to see which dates we're dealing with.

Or correct me if I'm making some basic mistake.

``` python
In [23]: data_array = xray.DataArray(
    ...:     data=pd.Series(
    ...:         np.random.rand(20),
    ...:         index=pd.period_range(start='2000', periods=20, name='Date')))

In [23]: data_array
Out[23]:
<xray.DataArray (Date: 20)>
array([ 0.95861189,  0.3607297 ,  0.9890032 ,  0.77674314,  0.39461886,
        0.98425749,  0.79044973,  0.81376587,  0.07091318,  0.02757213,
        0.87366025,  0.0496346 ,  0.45433931,  0.3339866 ,  0.67261248,
        0.91684965,  0.60889737,  0.33469611,  0.94966724,  0.50328461])
Coordinates:
  * Date     (Date) int64 10957 10958 10959 10960 10961 10962 10963 10964 ...

In [25]: data_array.to_series()
Out[25]:
Date
2000-01-01    0.958612
2000-01-02    0.360730
2000-01-03    0.989003
2000-01-04    0.776743
2000-01-05    0.394619
2000-01-06    0.984257
2000-01-07    0.790450
2000-01-08    0.813766
2000-01-09    0.070913
2000-01-10    0.027572
2000-01-11    0.873660
2000-01-12    0.049635
2000-01-13    0.454339
2000-01-14    0.333987
2000-01-15    0.672612
2000-01-16    0.916850
2000-01-17    0.608897
2000-01-18    0.334696
2000-01-19    0.949667
2000-01-20    0.503285
Freq: D, dtype: float64
```
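One workaround sketch (my suggestion, not from the issue thread): convert the PeriodIndex to timestamps before handing the series over, so the coordinate shows dates rather than int64 ordinals. The snippet uses plain pandas, with freq='D' made explicit:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(20),
              index=pd.period_range(start='2000-01-01', periods=20,
                                    freq='D', name='Date'))

# PeriodIndex stores periods as int64 ordinals under the hood, which is
# what leaks into the DataArray repr. Converting to timestamps first
# keeps the dates readable (a workaround, not a fix for the repr):
s_dates = s.copy()
s_dates.index = s.index.to_timestamp()
```

Feeding s_dates rather than s into the DataArray constructor should then display a datetime64 coordinate.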

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/645/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: xarray (13221727) · type: issue


The [issues] table schema:

```sql
CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
```
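The filter described at the top of the page (comments = 13, repo = 13221727, user = 5635139, sorted by updated_at descending) corresponds to a plain SQL query against this schema. A minimal sketch using Python's sqlite3 with a trimmed version of the table (columns the query doesn't touch are omitted):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE issues (
        id INTEGER PRIMARY KEY,
        comments INTEGER,
        user INTEGER,
        repo INTEGER,
        updated_at TEXT
    )""")

# The three rows shown on this page, plus one row the filter should exclude.
conn.executemany(
    "INSERT INTO issues VALUES (?, ?, ?, ?, ?)",
    [(1913983402, 13, 5635139, 13221727, '2023-10-15T07:48:56Z'),
     (365973662, 13, 5635139, 13221727, '2020-07-02T20:39:01Z'),
     (115210260, 13, 5635139, 13221727, '2015-12-30T05:59:05Z'),
     (42, 7, 5635139, 13221727, '2024-01-01T00:00:00Z')])

# ISO 8601 timestamps sort lexicographically, so ORDER BY on the TEXT
# column gives correct chronological order.
rows = conn.execute(
    "SELECT id FROM issues "
    "WHERE comments = 13 AND repo = 13221727 AND user = 5635139 "
    "ORDER BY updated_at DESC").fetchall()
```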
Powered by Datasette · Queries took 2400.212ms · About: xarray-datasette