html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/2459#issuecomment-650624827,https://api.github.com/repos/pydata/xarray/issues/2459,650624827,MDEyOklzc3VlQ29tbWVudDY1MDYyNDgyNw==,1217238,2020-06-27T20:50:45Z,2020-06-27T20:50:45Z,MEMBER,"> However a much faster solution was through numpy array. The below code is based on the [idea of Igor Raush](https://stackoverflow.com/a/35049899)
Thanks for sharing! This is a great tip indeed.
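Roughly, the trick is to allocate a full-size array up front and scatter the existing values into it using the integer codes of the MultiIndex, instead of reindexing onto the full product of the levels. A minimal sketch of the idea (assuming a Series indexed by a `pandas.MultiIndex`; the helper name is made up and this is not the code from the PR):
```
import numpy as np
import pandas as pd

def multiindex_series_to_array(series):
    # assumes series.index is a pandas.MultiIndex with no -1 codes
    # (i.e. no NaN labels in the index)
    idx = series.index
    shape = tuple(len(level) for level in idx.levels)
    # start from an all-NaN array covering the full product of the levels
    full = np.full(shape, np.nan)
    # idx.codes holds, per level, the integer position of each row's label,
    # so a single fancy-indexing assignment fills in the observed values
    full[tuple(idx.codes)] = series.to_numpy()
    return full
```
For non-float data you would pick a suitable fill value and dtype instead of `np.nan`.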
I've reimplemented `from_dataframe` to make use of it in https://github.com/pydata/xarray/pull/4184, and it indeed makes things much, much faster! The original example in this thread is now 40x faster.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,365973662
https://github.com/pydata/xarray/issues/2459#issuecomment-426691841,https://api.github.com/repos/pydata/xarray/issues/2459,426691841,MDEyOklzc3VlQ29tbWVudDQyNjY5MTg0MQ==,1217238,2018-10-03T15:57:28Z,2018-10-03T15:57:28Z,MEMBER,"@max-sixty nevermind, you seem to have already discovered that :)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,365973662
https://github.com/pydata/xarray/issues/2459#issuecomment-426689282,https://api.github.com/repos/pydata/xarray/issues/2459,426689282,MDEyOklzc3VlQ29tbWVudDQyNjY4OTI4Mg==,1217238,2018-10-03T15:50:32Z,2018-10-03T15:50:32Z,MEMBER,"The vast majority of the time in xarray's current implementation seems to be spent in `DataFrame.reindex()`, but I see no reason why this operation needs to be so slow. I expect we could probably optimize this significantly on the pandas side.
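For context, `cropped` was defined earlier in the thread and is not shown here; a hypothetical stand-in with a similar structure (a two-level MultiIndex with some combinations missing) exercises the same code path, and the `reindex` onto the full product is the expensive line:
```
import numpy as np
import pandas as pd

# hypothetical stand-in for the `cropped` frame used earlier in this thread
times = pd.date_range('2000-01-01', periods=1000, freq='h')
stations = ['station_%d' % i for i in range(200)]
index = pd.MultiIndex.from_product([times, stations], names=['time', 'station'])
cropped = pd.DataFrame({'value': np.random.rand(len(index))}, index=index)
# drop half the rows so reindexing onto the full product has gaps to fill
cropped = cropped.sample(frac=0.5).sort_index()

# this is the line that dominates the profile below
full_idx = pd.MultiIndex.from_product(cropped.index.levels,
                                      names=cropped.index.names)
expanded = cropped.reindex(full_idx)
```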
See these results from line-profiler:
```
In [8]: %lprun -f xarray.Dataset.from_dataframe cropped.to_xarray()
Timer unit: 1e-06 s
Total time: 0.727191 s
File: /Users/shoyer/dev/xarray/xarray/core/dataset.py
Function: from_dataframe at line 3094
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  3094                                               @classmethod
  3095                                               def from_dataframe(cls, dataframe):
  3096                                                   """"""Convert a pandas.DataFrame into an xarray.Dataset
  3097
  3098                                                   Each column will be converted into an independent variable in the
  3099                                                   Dataset. If the dataframe's index is a MultiIndex, it will be expanded
  3100                                                   into a tensor product of one-dimensional indices (filling in missing
  3101                                                   values with NaN). This method will produce a Dataset very similar to
  3102                                                   that on which the 'to_dataframe' method was called, except with
  3103                                                   possibly redundant dimensions (since all dataset variables will have
  3104                                                   the same dimensionality).
  3105                                                   """"""
  3106                                                   # TODO: Add an option to remove dimensions along which the variables
  3107                                                   # are constant, to enable consistent serialization to/from a dataframe,
  3108                                                   # even if some variables have different dimensionality.
  3109
  3110         1        352.0    352.0      0.0          if not dataframe.columns.is_unique:
  3111                                                       raise ValueError(
  3112                                                           'cannot convert DataFrame with non-unique columns')
  3113
  3114         1          3.0      3.0      0.0          idx = dataframe.index
  3115         1        356.0    356.0      0.0          obj = cls()
  3116
  3117         1          2.0      2.0      0.0          if isinstance(idx, pd.MultiIndex):
  3118                                                       # it's a multi-index
  3119                                                       # expand the DataFrame to include the product of all levels
  3120         1       4524.0   4524.0      0.6              full_idx = pd.MultiIndex.from_product(idx.levels, names=idx.names)
  3121         1     717008.0 717008.0     98.6              dataframe = dataframe.reindex(full_idx)
  3122         1          3.0      3.0      0.0              dims = [name if name is not None else 'level_%i' % n
  3123         1         20.0     20.0      0.0                      for n, name in enumerate(idx.names)]
  3124         4          9.0      2.2      0.0              for dim, lev in zip(dims, idx.levels):
  3125         3       2973.0    991.0      0.4                  obj[dim] = (dim, lev)
  3126         1         37.0     37.0      0.0              shape = [lev.size for lev in idx.levels]
  3127                                                   else:
  3128                                                       dims = (idx.name if idx.name is not None else 'index',)
  3129                                                       obj[dims[0]] = (dims, idx)
  3130                                                       shape = -1
  3131
  3132         2        350.0    175.0      0.0          for name, series in iteritems(dataframe):
  3133         1         33.0     33.0      0.0              data = np.asarray(series).reshape(shape)
  3134         1       1520.0   1520.0      0.2              obj[name] = (dims, data)
  3135         1          1.0      1.0      0.0          return obj
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,365973662
https://github.com/pydata/xarray/issues/2459#issuecomment-426398031,https://api.github.com/repos/pydata/xarray/issues/2459,426398031,MDEyOklzc3VlQ29tbWVudDQyNjM5ODAzMQ==,1217238,2018-10-02T19:20:04Z,2018-10-02T19:20:04Z,MEMBER,"Here are the top entries I see with `%prun cropped.to_xarray()`:
```
308597 function calls (308454 primitive calls) in 0.651 seconds
Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   100000    0.255    0.000    0.275    0.000 datetimes.py:606()
        1    0.165    0.165    0.165    0.165 {built-in method pandas._libs.lib.is_datetime_with_singletz_array}
        1    0.071    0.071    0.634    0.634 {method 'get_indexer' of 'pandas._libs.index.BaseMultiIndexCodesEngine' objects}
        1    0.054    0.054    0.054    0.054 {pandas._libs.lib.fast_zip}
        1    0.029    0.029    0.304    0.304 {pandas._libs.lib.map_infer}
   100009    0.011    0.000    0.011    0.000 datetimelike.py:232(freq)
        9    0.010    0.001    0.010    0.001 {pandas._libs.lib.infer_dtype}
   100021    0.010    0.000    0.010    0.000 datetimes.py:684(tz)
        1    0.009    0.009    0.009    0.009 {built-in method pandas._libs.tslib.array_to_datetime}
        2    0.008    0.004    0.008    0.004 {method 'get_indexer' of 'pandas._libs.index.IndexEngine' objects}
        1    0.008    0.008    0.651    0.651 dataarray.py:1827(from_series)
    66/65    0.005    0.000    0.005    0.000 {built-in method numpy.core.multiarray.array}
    24/22    0.001    0.000    0.362    0.016 base.py:677(_values)
       17    0.001    0.000    0.001    0.000 {built-in method numpy.core.multiarray.empty}
    19/18    0.001    0.000    0.189    0.010 base.py:4914(_ensure_index)
        5    0.001    0.000    0.001    0.000 {method 'repeat' of 'numpy.ndarray' objects}
        2    0.001    0.000    0.001    0.000 {method 'tolist' of 'numpy.ndarray' objects}
        2    0.001    0.000    0.001    0.000 {pandas._libs.algos.take_1d_object_object}
        4    0.001    0.000    0.001    0.000 {pandas._libs.algos.take_1d_int64_int64}
     1846    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
       16    0.001    0.000    0.001    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.001    0.001    0.001    0.001 {method 'get_indexer' of 'pandas._libs.index.DatetimeEngine' objects}
```
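Most of the top entries (the 100000-call `datetimes.py:606` line, the `freq` and `tz` accessors) point at per-element handling of the datetime values. A rough, hypothetical micro-benchmark, not from this thread, comparing a per-element pass over Timestamps with staying in datetime64:
```
import numpy as np
import pandas as pd
from timeit import timeit

idx = pd.date_range('2000-01-01', periods=100000, freq='min')

# per-element: materialize a Timestamp for every value and convert it back
slow = timeit(lambda: np.array([ts.to_datetime64() for ts in idx]), number=1)
# vectorized: stay in the underlying datetime64 representation
fast = timeit(lambda: np.asarray(idx), number=1)
print(slow, fast)  # the per-element loop is typically orders of magnitude slower
```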
There seems to be a suspiciously large amount of effort applying a function to individual datetime objects.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,365973662