issue_comments

13 rows where issue = 365973662 (pydata/xarray issue #2459: "Stack + to_array before to_xarray is much faster that a simple to_xarray"), sorted by updated_at descending

kefirbandi (CONTRIBUTOR) · 2020-06-30T19:53:46Z · https://github.com/pydata/xarray/issues/2459#issuecomment-652009055

> I've reimplemented `from_dataframe` to make use of it in #4184, and it indeed makes things much, much faster! The original example in this thread is now 40x faster.

Very good news! Thanks for implementing it!

shoyer (MEMBER) · 2020-06-27T20:50:45Z · https://github.com/pydata/xarray/issues/2459#issuecomment-650624827

> However, a much faster solution was through a numpy array. The code below is based on an idea of Igor Raush

Thanks for sharing! This is a great tip indeed.

I've reimplemented `from_dataframe` to make use of it in https://github.com/pydata/xarray/pull/4184, and it indeed makes things much, much faster! The original example in this thread is now 40x faster.

brey (NONE) · 2020-06-24T09:55:00Z · https://github.com/pydata/xarray/issues/2459#issuecomment-648721465

Hi All. I stumbled across the same issue trying to convert a 5000-column dataframe to xarray (it was never going to happen...). I found a workaround and I am posting the test below. Hope it helps.

```python
import xarray as xr
import pandas as pd
import numpy as np

xr.__version__
# '0.15.1'

pd.__version__
# '1.0.5'

df = pd.DataFrame(np.random.randn(200, 500))

%%time
one = df.to_xarray()
# CPU times: user 29.6 s, sys: 60.4 ms, total: 29.6 s
# Wall time: 29.7 s

%%time
dic = {}
for name in df.columns:
    dic.update({name: (['index'], df[name].values)})
two = xr.Dataset(dic, coords={'index': ('index', df.index.values)})
# CPU times: user 17.6 ms, sys: 158 µs, total: 17.8 ms
# Wall time: 17.8 ms

one.equals(two)
# True
```

kefirbandi (CONTRIBUTOR) · 2020-02-29T20:27:20Z · https://github.com/pydata/xarray/issues/2459#issuecomment-592991059

I know this is not a recent thread, but I found no resolution, and we ran into the same issue recently. In our case we had a pandas series of roughly 15 million entries with a 3-level multi-index, which had to be converted to an xarray.DataArray. `.to_xarray` took almost 2 minutes. Unstack + to_array took it down to roughly 3 seconds, provided the last level of the multi-index was unstacked.

However, a much faster solution was through a numpy array. The code below is based on an idea of Igor Raush:

```python
import numpy as np
import xarray as xr

# In this case df is a dataframe with a single column, or a series
arr = np.full(df.index.levshape, np.nan)
arr[tuple(df.index.codes)] = df.values.flat
da = xr.DataArray(arr, dims=df.index.names,
                  coords=dict(zip(df.index.names, df.index.levels)))
```
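For reference, the unstack-based route mentioned above might look like the following sketch, where `s` is a hypothetical series with a 3-level multi-index named 'x', 'y', 'z' (illustrative only, not code from this thread):

```python
# Unstack the last level into columns, convert the resulting 2-D frame,
# then fold the columns back in as a dimension. to_xarray then only has
# to reindex over two levels rather than the full 3-D product.
da = s.unstack('z').to_xarray().to_array(dim='z')
```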

tqfjo (NONE) · 2020-02-15T04:31:54Z · https://github.com/pydata/xarray/issues/2459#issuecomment-586552823

@crusaderky Thanks for the pointer to `xarray.DataArray(df)` -- that makes my life a ton easier.

That said, if it helps anyone to know, I did just want a DataArray, but figured there was no alternative to first running the rather singular `to_xarray`. I also still find the runtime surprising, though I know nothing about xarray's internals.

crusaderky (MEMBER) · 2020-02-14T07:50:08Z (edited 2020-02-14T07:50:47Z) · https://github.com/pydata/xarray/issues/2459#issuecomment-586139738

@tqfjo Unrelated. You're comparing the creation of a dataset with 2 variables against the creation of one with 3000; unsurprisingly, the latter takes ~1500x as long. If your dataset doesn't functionally contain 3000 variables but just a single two-dimensional variable, use `xarray.DataArray(df)`.
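A minimal sketch of the distinction (hypothetical shapes; names are illustrative, not from the thread):

```python
import numpy as np
import pandas as pd
import xarray as xr

df = pd.DataFrame(np.random.randn(2, 3000))

ds = df.to_xarray()    # Dataset with 3000 independent 1-D variables: slow
da = xr.DataArray(df)  # single 2-D DataArray over ('dim_0', 'dim_1'): fast
```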

Reactions: 👍 1
tqfjo (NONE) · 2020-02-14T02:25:25Z · https://github.com/pydata/xarray/issues/2459#issuecomment-586066908

I've run into this twice. This time I'm seeing a difference of very roughly 100x or more just using a transpose -- I can't test or time it properly right now, but this is what it looks like:

```
ipdb> df
x         a    b  ...    c    d
y         0    0  ...    7    7
z                 ...
0  0.000000  0.0  ...  0.0  0.0
1 -0.000416  0.0  ...  0.0  0.0

[2 rows x 2932 columns]
ipdb> df.to_xarray()

ipdb> df.T.to_xarray()
<Finishes instantly>
```

max-sixty (MEMBER) · 2018-10-03T13:55:40Z (edited 2018-10-03T16:13:41Z) · https://github.com/pydata/xarray/issues/2459#issuecomment-426646669

My working hypothesis is that pandas has a set of fast routines in C, such that it can stack without reindexing to the full index. The routines only work in 1-2 dimensions.

So without some hackery (i.e. converting multi-dimensional arrays to pandas' size and back), the current implementation is reasonable*. Next step would be to write our own routines that can operate on multiple dimensions (numbagg!).

Is that consistent with others' views, particularly those who know this area well?

* One small fix that would improve performance of `series.to_xarray()` only is the comment above. Lmk if you think it's worth making that change.

shoyer (MEMBER) · 2018-10-03T15:57:28Z · https://github.com/pydata/xarray/issues/2459#issuecomment-426691841

@max-sixty nevermind, you seem to have already discovered that :)

shoyer (MEMBER) · 2018-10-03T15:50:32Z · https://github.com/pydata/xarray/issues/2459#issuecomment-426689282

The vast majority of the time in xarray's current implementation seems to be spent in `DataFrame.reindex()`, but I see no reason why this operation needs to be so slow. I expect we could probably optimize this significantly on the pandas side.

See these results from line-profiler:

```
In [8]: %lprun -f xarray.Dataset.from_dataframe cropped.to_xarray()
Timer unit: 1e-06 s

Total time: 0.727191 s
File: /Users/shoyer/dev/xarray/xarray/core/dataset.py
Function: from_dataframe at line 3094

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  3094                                           @classmethod
  3095                                           def from_dataframe(cls, dataframe):
  3096                                               """Convert a pandas.DataFrame into an xarray.Dataset
  3097
  3098                                               Each column will be converted into an independent variable in the
  3099                                               Dataset. If the dataframe's index is a MultiIndex, it will be expanded
  3100                                               into a tensor product of one-dimensional indices (filling in missing
  3101                                               values with NaN). This method will produce a Dataset very similar to
  3102                                               that on which the 'to_dataframe' method was called, except with
  3103                                               possibly redundant dimensions (since all dataset variables will have
  3104                                               the same dimensionality).
  3105                                               """
  3106                                               # TODO: Add an option to remove dimensions along which the variables
  3107                                               # are constant, to enable consistent serialization to/from a dataframe,
  3108                                               # even if some variables have different dimensionality.
  3109
  3110         1        352.0    352.0      0.0      if not dataframe.columns.is_unique:
  3111                                                   raise ValueError(
  3112                                                       'cannot convert DataFrame with non-unique columns')
  3113
  3114         1          3.0      3.0      0.0      idx = dataframe.index
  3115         1        356.0    356.0      0.0      obj = cls()
  3116
  3117         1          2.0      2.0      0.0      if isinstance(idx, pd.MultiIndex):
  3118                                                   # it's a multi-index
  3119                                                   # expand the DataFrame to include the product of all levels
  3120         1       4524.0   4524.0      0.6          full_idx = pd.MultiIndex.from_product(idx.levels, names=idx.names)
  3121         1     717008.0 717008.0     98.6          dataframe = dataframe.reindex(full_idx)
  3122         1          3.0      3.0      0.0          dims = [name if name is not None else 'level_%i' % n
  3123         1         20.0     20.0      0.0                  for n, name in enumerate(idx.names)]
  3124         4          9.0      2.2      0.0          for dim, lev in zip(dims, idx.levels):
  3125         3       2973.0    991.0      0.4              obj[dim] = (dim, lev)
  3126         1         37.0     37.0      0.0          shape = [lev.size for lev in idx.levels]
  3127                                               else:
  3128                                                   dims = (idx.name if idx.name is not None else 'index',)
  3129                                                   obj[dims[0]] = (dims, idx)
  3130                                                   shape = -1
  3131
  3132         2        350.0    175.0      0.0      for name, series in iteritems(dataframe):
  3133         1         33.0     33.0      0.0          data = np.asarray(series).reshape(shape)
  3134         1       1520.0   1520.0      0.2          obj[name] = (dims, data)
  3135         1          1.0      1.0      0.0      return obj
```
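(For anyone reproducing this: the profile above comes from the line_profiler IPython extension. A minimal sketch, assuming the package is installed and `cropped` is the example frame from earlier in the thread:)

```python
# Assumes: pip install line_profiler
%load_ext line_profiler

import xarray
%lprun -f xarray.Dataset.from_dataframe cropped.to_xarray()
```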

max-sixty (MEMBER) · 2018-10-03T01:30:07Z · https://github.com/pydata/xarray/issues/2459#issuecomment-426483497

It's 3x faster to unstack & stack all-but-one level vs. reindexing over a filled-out index (and I think it always produces the same result).

Our current code takes the slow path.

I could make that change, but that strongly feels like I don't understand the root cause. I haven't spent much time with reshaping code - lmk if anyone has ideas.

```python
idx = cropped.index
full_idx = pd.MultiIndex.from_product(idx.levels, names=idx.names)

reindexed = cropped.reindex(full_idx)

%timeit reindexed = cropped.reindex(full_idx)
# 1 loop, best of 3: 278 ms per loop

%%timeit
stack_unstack = (
    cropped
    .unstack(list('yz'))
    .stack(list('yz'), dropna=False)
)
# 10 loops, best of 3: 80.8 ms per loop

stack_unstack.equals(reindexed)
# True
```

max-sixty (MEMBER) · 2018-10-02T19:57:20Z · https://github.com/pydata/xarray/issues/2459#issuecomment-426408924

When I stepped through, it was by and large all taken up by https://github.com/pydata/xarray/blob/master/xarray/core/dataset.py#L3121 (the `reindex` call in the profile above). That's where the boxing & unboxing of the datetimes comes from.

I haven't yet discovered how the alternative path avoids this work. If anyone has priors please lmk!

shoyer (MEMBER) · 2018-10-02T19:20:04Z · https://github.com/pydata/xarray/issues/2459#issuecomment-426398031

Here are the top entries I see with `%prun cropped.to_xarray()`:

```
         308597 function calls (308454 primitive calls) in 0.651 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   100000    0.255    0.000    0.275    0.000 datetimes.py:606(<lambda>)
        1    0.165    0.165    0.165    0.165 {built-in method pandas._libs.lib.is_datetime_with_singletz_array}
        1    0.071    0.071    0.634    0.634 {method 'get_indexer' of 'pandas._libs.index.BaseMultiIndexCodesEngine' objects}
        1    0.054    0.054    0.054    0.054 {pandas._libs.lib.fast_zip}
        1    0.029    0.029    0.304    0.304 {pandas._libs.lib.map_infer}
   100009    0.011    0.000    0.011    0.000 datetimelike.py:232(freq)
        9    0.010    0.001    0.010    0.001 {pandas._libs.lib.infer_dtype}
   100021    0.010    0.000    0.010    0.000 datetimes.py:684(tz)
        1    0.009    0.009    0.009    0.009 {built-in method pandas._libs.tslib.array_to_datetime}
        2    0.008    0.004    0.008    0.004 {method 'get_indexer' of 'pandas._libs.index.IndexEngine' objects}
        1    0.008    0.008    0.651    0.651 dataarray.py:1827(from_series)
    66/65    0.005    0.000    0.005    0.000 {built-in method numpy.core.multiarray.array}
    24/22    0.001    0.000    0.362    0.016 base.py:677(_values)
       17    0.001    0.000    0.001    0.000 {built-in method numpy.core.multiarray.empty}
    19/18    0.001    0.000    0.189    0.010 base.py:4914(_ensure_index)
        5    0.001    0.000    0.001    0.000 {method 'repeat' of 'numpy.ndarray' objects}
        2    0.001    0.000    0.001    0.000 {method 'tolist' of 'numpy.ndarray' objects}
        2    0.001    0.000    0.001    0.000 {pandas._libs.algos.take_1d_object_object}
        4    0.001    0.000    0.001    0.000 {pandas._libs.algos.take_1d_int64_int64}
     1846    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
       16    0.001    0.000    0.001    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.001    0.001    0.001    0.001 {method 'get_indexer' of 'pandas._libs.index.DatetimeEngine' objects}
```

There seems to be a suspiciously large amount of effort applying a function to individual datetime objects.
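(That pattern is consistent with per-element Timestamp boxing. A minimal illustration of the cost gap, not from the original thread:)

```python
import pandas as pd

idx = pd.date_range("2000-01-01", periods=100_000, freq="min")

%timeit idx.values           # stays in the vectorized datetime64 representation
%timeit idx.to_pydatetime()  # boxes every element into a Python datetime object
```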


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);