issue_comments

13 rows where issue = 365973662 (pydata/xarray issue #2459: "Stack + to_array before to_xarray is much faster that a simple to_xarray"), sorted by updated_at descending

kefirbandi (CONTRIBUTOR) · 2020-06-30T19:53:46Z · https://github.com/pydata/xarray/issues/2459#issuecomment-652009055

> I've reimplemented `from_dataframe` to make use of it in #4184, and it indeed makes things much, much faster! The original example in this thread is now 40x faster.

Very good news! Thanks for implementing it!

shoyer (MEMBER) · 2020-06-27T20:50:45Z · https://github.com/pydata/xarray/issues/2459#issuecomment-650624827

> However, a much faster solution was through a numpy array. The code below is based on an idea of Igor Raush

Thanks for sharing! This is a great tip indeed.

I've reimplemented `from_dataframe` to make use of it in https://github.com/pydata/xarray/pull/4184, and it indeed makes things much, much faster! The original example in this thread is now 40x faster.

brey (NONE) · 2020-06-24T09:55:00Z · https://github.com/pydata/xarray/issues/2459#issuecomment-648721465

Hi All. I stumbled across the same issue trying to convert a 5000-column dataframe to xarray (it was never going to happen...). I found a workaround and I am posting the test below. Hope it helps.

```python
import xarray as xr
import pandas as pd
import numpy as np

xr.__version__
# '0.15.1'

pd.__version__
# '1.0.5'

df = pd.DataFrame(np.random.randn(200, 500))

%%time
one = df.to_xarray()
# CPU times: user 29.6 s, sys: 60.4 ms, total: 29.6 s
# Wall time: 29.7 s

%%time
dic = {}
for name in df.columns:
    dic.update({name: (['index'], df[name].values)})
two = xr.Dataset(dic, coords={'index': ('index', df.index.values)})
# CPU times: user 17.6 ms, sys: 158 µs, total: 17.8 ms
# Wall time: 17.8 ms

one.equals(two)
# True
```

kefirbandi (CONTRIBUTOR) · 2020-02-29T20:27:20Z · https://github.com/pydata/xarray/issues/2459#issuecomment-592991059

I know this is not a recent thread, but I found no resolution, and we ran into the same issue recently. In our case we had a pandas series of roughly 15 million entries with a 3-level multi-index, which had to be converted to an xarray.DataArray. `.to_xarray` took almost 2 minutes. Unstack + to_array took it down to roughly 3 seconds, provided the last level of the multi-index was unstacked.

However, a much faster solution was through a numpy array. The code below is based on an idea of Igor Raush:

```python
import numpy as np
import xarray as xr

# In this case df is a dataframe with a single column, or a series
arr = np.full(df.index.levshape, np.nan)
arr[tuple(df.index.codes)] = df.values.flat
da = xr.DataArray(arr, dims=df.index.names,
                  coords=dict(zip(df.index.names, df.index.levels)))
```
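For reference, the unstack-based route mentioned above might look like the following sketch, where `s` is a hypothetical series with a 3-level multi-index named 'x', 'y', 'z' (illustrative only, not code from this thread):

```python
# Unstack the last level into columns, convert the resulting 2-D frame,
# then fold the columns back in as a dimension. to_xarray then only has
# to reindex over two levels rather than the full 3-D product.
da = s.unstack('z').to_xarray().to_array(dim='z')
```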

tqfjo (NONE) · 2020-02-15T04:31:54Z · https://github.com/pydata/xarray/issues/2459#issuecomment-586552823

@crusaderky Thanks for the pointer to `xarray.DataArray(df)` -- that makes my life a ton easier.

That said, if it helps anyone to know, I did just want a DataArray, but figured there was no alternative to first running the rather singular `to_xarray`. I also still find the runtime surprising, though I know nothing about xarray's internals.

crusaderky (MEMBER) · 2020-02-14T07:50:08Z (edited 2020-02-14T07:50:47Z) · https://github.com/pydata/xarray/issues/2459#issuecomment-586139738

@tqfjo Unrelated. You're comparing the creation of a dataset with 2 variables against the creation of one with 3000; unsurprisingly, the latter takes ~1500x as long. If your dataset doesn't functionally contain 3000 variables but just a single two-dimensional variable, use `xarray.DataArray(df)`.
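A minimal sketch of the distinction (hypothetical shapes; names are illustrative, not from the thread):

```python
import numpy as np
import pandas as pd
import xarray as xr

df = pd.DataFrame(np.random.randn(2, 3000))

ds = df.to_xarray()    # Dataset with 3000 independent 1-D variables: slow
da = xr.DataArray(df)  # single 2-D DataArray over ('dim_0', 'dim_1'): fast
```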

Reactions: 👍 1
tqfjo (NONE) · 2020-02-14T02:25:25Z · https://github.com/pydata/xarray/issues/2459#issuecomment-586066908

I've run into this twice. This time I'm seeing a difference of very roughly 100x or more just using a transpose -- I can't test or time it properly right now, but this is what it looks like:

```
ipdb> df
x         a    b  ...    c    d
y         0    0  ...    7    7
z                 ...
0  0.000000  0.0  ...  0.0  0.0
1 -0.000416  0.0  ...  0.0  0.0

[2 rows x 2932 columns]
ipdb> df.to_xarray()

ipdb> df.T.to_xarray()
<Finishes instantly>
```

max-sixty (MEMBER) · 2018-10-03T13:55:40Z (edited 2018-10-03T16:13:41Z) · https://github.com/pydata/xarray/issues/2459#issuecomment-426646669

My working hypothesis is that pandas has a set of fast routines in C, such that it can stack without reindexing to the full index. The routines only work in 1-2 dimensions.

So without some hackery (i.e. converting multi-dimensional arrays to pandas' size and back), the current implementation is reasonable*. Next step would be to write our own routines that can operate on multiple dimensions (numbagg!).

Is that consistent with others' views, particularly those who know this area well?

* One small fix that would improve performance of `series.to_xarray()` only is the comment above. Lmk if you think it's worth making that change.

shoyer (MEMBER) · 2018-10-03T15:57:28Z · https://github.com/pydata/xarray/issues/2459#issuecomment-426691841

@max-sixty nevermind, you seem to have already discovered that :)

shoyer (MEMBER) · 2018-10-03T15:50:32Z · https://github.com/pydata/xarray/issues/2459#issuecomment-426689282

The vast majority of the time in xarray's current implementation seems to be spent in `DataFrame.reindex()`, but I see no reason why this operation needs to be so slow. I expect we could probably optimize this significantly on the pandas side.

See these results from line-profiler:

```
In [8]: %lprun -f xarray.Dataset.from_dataframe cropped.to_xarray()
Timer unit: 1e-06 s

Total time: 0.727191 s
File: /Users/shoyer/dev/xarray/xarray/core/dataset.py
Function: from_dataframe at line 3094

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  3094                                           @classmethod
  3095                                           def from_dataframe(cls, dataframe):
  3096                                               """Convert a pandas.DataFrame into an xarray.Dataset
  3097
  3098                                               Each column will be converted into an independent variable in the
  3099                                               Dataset. If the dataframe's index is a MultiIndex, it will be expanded
  3100                                               into a tensor product of one-dimensional indices (filling in missing
  3101                                               values with NaN). This method will produce a Dataset very similar to
  3102                                               that on which the 'to_dataframe' method was called, except with
  3103                                               possibly redundant dimensions (since all dataset variables will have
  3104                                               the same dimensionality).
  3105                                               """
  3106                                               # TODO: Add an option to remove dimensions along which the variables
  3107                                               # are constant, to enable consistent serialization to/from a dataframe,
  3108                                               # even if some variables have different dimensionality.
  3109
  3110         1        352.0    352.0      0.0      if not dataframe.columns.is_unique:
  3111                                                   raise ValueError(
  3112                                                       'cannot convert DataFrame with non-unique columns')
  3113
  3114         1          3.0      3.0      0.0      idx = dataframe.index
  3115         1        356.0    356.0      0.0      obj = cls()
  3116
  3117         1          2.0      2.0      0.0      if isinstance(idx, pd.MultiIndex):
  3118                                                   # it's a multi-index
  3119                                                   # expand the DataFrame to include the product of all levels
  3120         1       4524.0   4524.0      0.6          full_idx = pd.MultiIndex.from_product(idx.levels, names=idx.names)
  3121         1     717008.0 717008.0     98.6          dataframe = dataframe.reindex(full_idx)
  3122         1          3.0      3.0      0.0          dims = [name if name is not None else 'level_%i' % n
  3123         1         20.0     20.0      0.0                  for n, name in enumerate(idx.names)]
  3124         4          9.0      2.2      0.0          for dim, lev in zip(dims, idx.levels):
  3125         3       2973.0    991.0      0.4              obj[dim] = (dim, lev)
  3126         1         37.0     37.0      0.0          shape = [lev.size for lev in idx.levels]
  3127                                               else:
  3128                                                   dims = (idx.name if idx.name is not None else 'index',)
  3129                                                   obj[dims[0]] = (dims, idx)
  3130                                                   shape = -1
  3131
  3132         2        350.0    175.0      0.0      for name, series in iteritems(dataframe):
  3133         1         33.0     33.0      0.0          data = np.asarray(series).reshape(shape)
  3134         1       1520.0   1520.0      0.2          obj[name] = (dims, data)
  3135         1          1.0      1.0      0.0      return obj
```
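(For anyone reproducing this: the profile above comes from the line_profiler IPython extension. A minimal sketch, assuming the package is installed and `cropped` is the example frame from earlier in the thread:)

```python
# Assumes: pip install line_profiler
%load_ext line_profiler

import xarray
%lprun -f xarray.Dataset.from_dataframe cropped.to_xarray()
```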

max-sixty (MEMBER) · 2018-10-03T01:30:07Z · https://github.com/pydata/xarray/issues/2459#issuecomment-426483497

It's 3x faster to unstack & stack all-but-one level vs. reindexing over a filled-out index (and I think it always produces the same result).

Our current code takes the slow path.

I could make that change, but that strongly feels like I don't understand the root cause. I haven't spent much time with reshaping code - lmk if anyone has ideas.

```python
idx = cropped.index
full_idx = pd.MultiIndex.from_product(idx.levels, names=idx.names)

reindexed = cropped.reindex(full_idx)

%timeit reindexed = cropped.reindex(full_idx)
# 1 loop, best of 3: 278 ms per loop

%%timeit
stack_unstack = (
    cropped
    .unstack(list('yz'))
    .stack(list('yz'), dropna=False)
)
# 10 loops, best of 3: 80.8 ms per loop

stack_unstack.equals(reindexed)
# True
```

max-sixty (MEMBER) · 2018-10-02T19:57:20Z · https://github.com/pydata/xarray/issues/2459#issuecomment-426408924

When I stepped through, it was by and large all taken up by https://github.com/pydata/xarray/blob/master/xarray/core/dataset.py#L3121 (the `reindex` call in the profile above). That's where the boxing & unboxing of the datetimes comes from.

I haven't yet discovered how the alternative path avoids this work. If anyone has priors please lmk!

shoyer (MEMBER) · 2018-10-02T19:20:04Z · https://github.com/pydata/xarray/issues/2459#issuecomment-426398031

Here are the top entries I see with `%prun cropped.to_xarray()`:

```
         308597 function calls (308454 primitive calls) in 0.651 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   100000    0.255    0.000    0.275    0.000 datetimes.py:606(<lambda>)
        1    0.165    0.165    0.165    0.165 {built-in method pandas._libs.lib.is_datetime_with_singletz_array}
        1    0.071    0.071    0.634    0.634 {method 'get_indexer' of 'pandas._libs.index.BaseMultiIndexCodesEngine' objects}
        1    0.054    0.054    0.054    0.054 {pandas._libs.lib.fast_zip}
        1    0.029    0.029    0.304    0.304 {pandas._libs.lib.map_infer}
   100009    0.011    0.000    0.011    0.000 datetimelike.py:232(freq)
        9    0.010    0.001    0.010    0.001 {pandas._libs.lib.infer_dtype}
   100021    0.010    0.000    0.010    0.000 datetimes.py:684(tz)
        1    0.009    0.009    0.009    0.009 {built-in method pandas._libs.tslib.array_to_datetime}
        2    0.008    0.004    0.008    0.004 {method 'get_indexer' of 'pandas._libs.index.IndexEngine' objects}
        1    0.008    0.008    0.651    0.651 dataarray.py:1827(from_series)
    66/65    0.005    0.000    0.005    0.000 {built-in method numpy.core.multiarray.array}
    24/22    0.001    0.000    0.362    0.016 base.py:677(_values)
       17    0.001    0.000    0.001    0.000 {built-in method numpy.core.multiarray.empty}
    19/18    0.001    0.000    0.189    0.010 base.py:4914(_ensure_index)
        5    0.001    0.000    0.001    0.000 {method 'repeat' of 'numpy.ndarray' objects}
        2    0.001    0.000    0.001    0.000 {method 'tolist' of 'numpy.ndarray' objects}
        2    0.001    0.000    0.001    0.000 {pandas._libs.algos.take_1d_object_object}
        4    0.001    0.000    0.001    0.000 {pandas._libs.algos.take_1d_int64_int64}
     1846    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
       16    0.001    0.000    0.001    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.001    0.001    0.001    0.001 {method 'get_indexer' of 'pandas._libs.index.DatetimeEngine' objects}
```

There seems to be a suspiciously large amount of effort applying a function to individual datetime objects.
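(That pattern is consistent with per-element Timestamp boxing. A minimal illustration of the cost gap, not from the original thread:)

```python
import pandas as pd

idx = pd.date_range("2000-01-01", periods=100_000, freq="min")

%timeit idx.values           # stays in the vectorized datetime64 representation
%timeit idx.to_pydatetime()  # boxes every element into a Python datetime object
```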


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);