github: issue_comments: 4 rows where author_association = "MEMBER", issue = 365973662 and user = 1217238 sorted by updated

4 rows where author_association = "MEMBER", issue = 365973662 and user = 1217238 sorted by updated_at descending

Search:

descending

id	html_url	issue_url	node_id	user	created_at	updated_at ▲	author_association	body	reactions	issue
650624827	https://github.com/pydata/xarray/issues/2459#issuecomment-650624827	https://api.github.com/repos/pydata/xarray/issues/2459	MDEyOklzc3VlQ29tbWVudDY1MDYyNDgyNw==	shoyer 1217238	2020-06-27T20:50:45Z	2020-06-27T20:50:45Z	MEMBER	However a much faster solution was through numpy array. The below code is based on the idea of Igor Raush Thanks for sharing! This is a great tip indeed. I've reimplemented `from_dataframe` to make use of in https://github.com/pydata/xarray/pull/4184, and it indeed makes things much, much faster! The original example in this thread is now 40x faster.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Stack + to_array before to_xarray is much faster that a simple to_xarray 365973662
426691841	https://github.com/pydata/xarray/issues/2459#issuecomment-426691841	https://api.github.com/repos/pydata/xarray/issues/2459	MDEyOklzc3VlQ29tbWVudDQyNjY5MTg0MQ==	shoyer 1217238	2018-10-03T15:57:28Z	2018-10-03T15:57:28Z	MEMBER	@max-sixty nevermind, you seem to have already discovered that :)	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Stack + to_array before to_xarray is much faster that a simple to_xarray 365973662
426689282	https://github.com/pydata/xarray/issues/2459#issuecomment-426689282	https://api.github.com/repos/pydata/xarray/issues/2459	MDEyOklzc3VlQ29tbWVudDQyNjY4OTI4Mg==	shoyer 1217238	2018-10-03T15:50:32Z	2018-10-03T15:50:32Z	MEMBER	The vast majority of the time in xarray's current implementation seems to be spent in `DataFrame.reindex()`, but I see no reason why this operations needs to be so slow. I expect we could probably optimize this significantly on the pandas side. See these results from line-profiler: ``` In [8]: %lprun -f xarray.Dataset.from_dataframe cropped.to_xarray() Timer unit: 1e-06 s Total time: 0.727191 s File: /Users/shoyer/dev/xarray/xarray/core/dataset.py Function: from_dataframe at line 3094 Line # Hits Time Per Hit % Time Line Contents 3094 @classmethod 3095 def from_dataframe(cls, dataframe): 3096 """Convert a pandas.DataFrame into an xarray.Dataset 3097 3098 Each column will be converted into an independent variable in the 3099 Dataset. If the dataframe's index is a MultiIndex, it will be expanded 3100 into a tensor product of one-dimensional indices (filling in missing 3101 values with NaN). This method will produce a Dataset very similar to 3102 that on which the 'to_dataframe' method was called, except with 3103 possibly redundant dimensions (since all dataset variables will have 3104 the same dimensionality). 3105 """ 3106 # TODO: Add an option to remove dimensions along which the variables 3107 # are constant, to enable consistent serialization to/from a dataframe, 3108 # even if some variables have different dimensionality. 3109 3110 1 352.0 352.0 0.0 if not dataframe.columns.is_unique: 3111 raise ValueError( 3112 'cannot convert DataFrame with non-unique columns') 3113 3114 1 3.0 3.0 0.0 idx = dataframe.index 3115 1 356.0 356.0 0.0 obj = cls() 3116 3117 1 2.0 2.0 0.0 if isinstance(idx, pd.MultiIndex): 3118 # it's a multi-index 3119 # expand the DataFrame to include the product of all levels 3120 1 4524.0 4524.0 0.6 full_idx = pd.MultiIndex.from_product(idx.levels, names=idx.names) 3121 1 717008.0 717008.0 98.6 dataframe = dataframe.reindex(full_idx) 3122 1 3.0 3.0 0.0 dims = [name if name is not None else 'level_%i' % n 3123 1 20.0 20.0 0.0 for n, name in enumerate(idx.names)] 3124 4 9.0 2.2 0.0 for dim, lev in zip(dims, idx.levels): 3125 3 2973.0 991.0 0.4 obj[dim] = (dim, lev) 3126 1 37.0 37.0 0.0 shape = [lev.size for lev in idx.levels] 3127 else: 3128 dims = (idx.name if idx.name is not None else 'index',) 3129 obj[dims[0]] = (dims, idx) 3130 shape = -1 3131 3132 2 350.0 175.0 0.0 for name, series in iteritems(dataframe): 3133 1 33.0 33.0 0.0 data = np.asarray(series).reshape(shape) 3134 1 1520.0 1520.0 0.2 obj[name] = (dims, data) 3135 1 1.0 1.0 0.0 return obj ```	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Stack + to_array before to_xarray is much faster that a simple to_xarray 365973662
426398031	https://github.com/pydata/xarray/issues/2459#issuecomment-426398031	https://api.github.com/repos/pydata/xarray/issues/2459	MDEyOklzc3VlQ29tbWVudDQyNjM5ODAzMQ==	shoyer 1217238	2018-10-02T19:20:04Z	2018-10-02T19:20:04Z	MEMBER	Here are the top entries I see with `%prun cropped.to_xarray()`: ``` 308597 function calls (308454 primitive calls) in 0.651 seconds Ordered by: internal time ncalls tottime 100000 1 1 1 1 100009 9 100021 1 2 1 66/65 24/22 17 19/18 5 2 2 4 1846 16 1 ``` percall cumtime percall filename:lineno(function) 0.255 0.000 0.275 0.000 datetimes.py:606(<lambda>) 0.165 0.165 0.165 0.165 {built-in method pandas._libs.lib.is_datetime_with_singletz_array} 0.071 0.071 0.634 0.634 {method 'get_indexer' of 'pandas._libs.index.BaseMultiIndexCodesEngine' objects} 0.054 0.054 0.054 0.054 {pandas._libs.lib.fast_zip} 0.029 0.029 0.304 0.304 {pandas._libs.lib.map_infer} 0.011 0.000 0.011 0.000 datetimelike.py:232(freq) 0.010 0.001 0.010 0.001 {pandas._libs.lib.infer_dtype} 0.010 0.000 0.010 0.000 datetimes.py:684(tz) 0.009 0.009 0.009 0.009 {built-in method pandas._libs.tslib.array_to_datetime} 0.008 0.004 0.008 0.004 {method 'get_indexer' of 'pandas._libs.index.IndexEngine' objects} 0.008 0.008 0.651 0.651 dataarray.py:1827(from_series) 0.005 0.000 0.005 0.000 {built-in method numpy.core.multiarray.array} 0.001 0.000 0.362 0.016 base.py:677(_values) 0.001 0.000 0.001 0.000 {built-in method numpy.core.multiarray.empty} 0.001 0.000 0.189 0.010 base.py:4914(_ensure_index) 0.001 0.000 0.001 0.000 {method 'repeat' of 'numpy.ndarray' objects} 0.001 0.000 0.001 0.000 {method 'tolist' of 'numpy.ndarray' objects} 0.001 0.000 0.001 0.000 {pandas._libs.algos.take_1d_object_object} 0.001 0.000 0.001 0.000 {pandas._libs.algos.take_1d_int64_int64} 0.001 0.000 0.001 0.000 {built-in method builtins.isinstance} 0.001 0.000 0.001 0.000 {method 'reduce' of 'numpy.ufunc' objects} 0.001 0.001 0.001 0.001 {method 'get_indexer' of 'pandas._libs.index.DatetimeEngine' objects} There seems to be a suspiciously large amount of effort applying a function to individual datetime objects.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	Stack + to_array before to_xarray is much faster that a simple to_xarray 365973662

Advanced export

JSON shape: default, array, newline-delimited, object

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);

issue_comments

4 rows where author_association = "MEMBER", issue = 365973662 and user = 1217238 sorted by updated_at descending

Line # Hits Time Per Hit % Time Line Contents

Advanced export