html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/4789#issuecomment-766983346,https://api.github.com/repos/pydata/xarray/issues/4789,766983346,MDEyOklzc3VlQ29tbWVudDc2Njk4MzM0Ng==,14808389,2021-01-25T17:33:19Z,2021-01-26T21:59:20Z,MEMBER,"that seems to be the main issue. With
```diff
diff --git a/xarray/core/formatting.py b/xarray/core/formatting.py
index 282620e3..f825ed85 100644
--- a/xarray/core/formatting.py
+++ b/xarray/core/formatting.py
@@ -300,9 +300,11 @@ def _summarize_coord_multiindex(coord, col_width, marker):
 
 
 def _summarize_coord_levels(coord, col_width, marker=""-""):
+    indices = list(range(10)) + list(range(-10, 0))
+    subset = coord[indices]
     return ""\n"".join(
         summarize_variable(
-            lname, coord.get_level_variable(lname), col_width, marker=marker
+            lname, subset.get_level_variable(lname), col_width, marker=marker
         )
         for lname in coord.level_names
     )
```
I get a speed up of about 180x (for `xr.DataArray(pd.Series(25_000_000, index=idx))`, not sure if the speed-up is as significant for bigger arrays). 
We should probably make the shape of `indices` depend on `col_width`, though.","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,782943813
https://github.com/pydata/xarray/issues/4789#issuecomment-767186998,https://api.github.com/repos/pydata/xarray/issues/4789,767186998,MDEyOklzc3VlQ29tbWVudDc2NzE4Njk5OA==,5635139,2021-01-25T23:50:56Z,2021-01-25T23:50:56Z,MEMBER,"Yes great, I think that would be a great cut-through solution!","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,782943813
https://github.com/pydata/xarray/issues/4789#issuecomment-766518276,https://api.github.com/repos/pydata/xarray/issues/4789,766518276,MDEyOklzc3VlQ29tbWVudDc2NjUxODI3Ng==,5635139,2021-01-25T03:36:23Z,2021-01-25T03:36:23Z,MEMBER,"The rabbit hole went deeper than I expected. I need to sign off now, but leaving what I have in case someone else has some insight. Essentially, we call `get_level_variable` on the coord in `formatting.py`, which calls `get_level_values` into pandas. This is really slow on large MultiIndexes! I think it's recreating the whole index. I got as deep as `algos.take_1d`. I think we can probably do something smarter to only call this on the first & last items in the MultiIndex. 
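
A minimal sketch of that idea, assuming pandas and NumPy are available. The helper name `edge_level_values` and the `n_edge` parameter are hypothetical (not part of xarray or pandas); the sketch takes only the first and last few entries of the MultiIndex before calling `get_level_values`, so the values for the full level are never materialized:

```python
import numpy as np
import pandas as pd


def edge_level_values(index, level, n_edge=10):
    # Hypothetical helper, for illustration only: return the level values
    # for just the first and last n_edge entries of a MultiIndex.
    n = len(index)
    if n > 2 * n_edge:
        # Integer positions of the leading and trailing entries.
        positions = np.concatenate([np.arange(n_edge), np.arange(n - n_edge, n)])
        # take() builds a small MultiIndex from those positions only.
        index = index.take(positions)
    # get_level_values now only has to expand 2 * n_edge codes (or fewer).
    return index.get_level_values(level)


idx = pd.MultiIndex.from_product([range(1000), range(100)], names=['a', 'b'])
edge = edge_level_values(idx, 'a')
print(len(edge), edge[0], edge[-1])
```

Whether the subsetting belongs in `formatting.py` or inside `get_level_variable` is a separate design question that this sketch does not settle.
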
For reference, here's the output of line_profiler, a good profiler for figuring this sort of thing out:
```
%lprun -f formatting._summarize_coord_levels -f IndexVariable.get_level_variable -f pd.MultiIndex.get_level_values -f pd.MultiIndex._get_level_values coords_repr(da.coords)

Total time: 1.91029 s
File: /Users/maximilian/workspace/xarray/xarray/core/formatting.py
Function: _summarize_coord_levels at line 302

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   302                                           def _summarize_coord_levels(coord, col_width, marker=""-""):
   303         2    1910185.0 955092.5    100.0      return ""\n"".join(
   304                                                   summarize_variable(
   305                                                       lname, coord.get_level_variable(lname), col_width, marker=marker
   306                                                   )
   307         1        102.0    102.0      0.0          for lname in coord.level_names
   308                                               )

Total time: 1.81777 s
File: /Users/maximilian/workspace/xarray/xarray/core/variable.py
Function: get_level_variable at line 2687

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  2687                                               def get_level_variable(self, level):
  2688                                                   """"""Return a new IndexVariable from a given MultiIndex level.""""""
  2689         2        303.0    151.5      0.0          if self.level_names is None:
  2690                                                       raise ValueError(""IndexVariable %r has no MultiIndex"" % self.name)
  2691         2        216.0    108.0      0.0          index = self.to_index()
  2692         2    1817254.0 908627.0    100.0          return type(self)(self.dims, index.get_level_values(level))

Total time: 1.81709 s
File: /usr/local/lib/python3.9/site-packages/pandas/core/indexes/multi.py
Function: _get_level_values at line 1617

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1617                                               def _get_level_values(self, level, unique=False):
  1618                                                   """"""
  1619                                                   Return vector of label values for requested level,
  1620                                                   equal to the length of the index
  1621
  1622                                                   **this is an internal method**
  1623
  1624                                                   Parameters
  1625                                                   ----------
  1626                                                   level : int level
  1627                                                   unique : bool, default False
  1628                                                       if True, drop duplicated values
  1629
  1630                                                   Returns
  1631                                                   -------
  1632                                                   values : ndarray
  1633                                                   """"""
  1634         2         47.0     23.5      0.0          lev = self.levels[level]
  1635         2          5.0      2.5      0.0          level_codes = self.codes[level]
  1636         2          2.0      1.0      0.0          name = self._names[level]
  1637         2          1.0      0.5      0.0          if unique:
  1638                                                       level_codes = algos.unique(level_codes)
  1639         2    1816971.0 908485.5    100.0          filled = algos.take_1d(lev._values, level_codes, fill_value=lev._na_value)
  1640         2         60.0     30.0      0.0          return lev._shallow_copy(filled, name=name)

Total time: 1.81712 s
File: /usr/local/lib/python3.9/site-packages/pandas/core/indexes/multi.py
Function: get_level_values at line 1642

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1642                                               def get_level_values(self, level):
  1643                                                   """"""
  1644                                                   Return vector of label values for requested level.
  1645
  1646                                                   Length of returned vector is equal to the length of the index.
  1647
  1648                                                   Parameters
  1649                                                   ----------
  1650                                                   level : int or str
  1651                                                       ``level`` is either the integer position of the level in the
  1652                                                       MultiIndex, or the name of the level.
  1653
  1654                                                   Returns
  1655                                                   -------
  1656                                                   values : Index
  1657                                                       Values is a level of this MultiIndex converted to
  1658                                                       a single :class:`Index` (or subclass thereof).
  1659
  1660                                                   Examples
  1661                                                   --------
  1662                                                   Create a MultiIndex:
  1663
  1664                                                   >>> mi = pd.MultiIndex.from_arrays((list('abc'), list('def')))
  1665                                                   >>> mi.names = ['level_1', 'level_2']
  1666
  1667                                                   Get level values by supplying level as either integer or name:
  1668
  1669                                                   >>> mi.get_level_values(0)
  1670                                                   Index(['a', 'b', 'c'], dtype='object', name='level_1')
  1671                                                   >>> mi.get_level_values('level_2')
  1672                                                   Index(['d', 'e', 'f'], dtype='object', name='level_2')
  1673                                                   """"""
  1674         2         11.0      5.5      0.0          level = self._get_level_number(level)
  1675         2    1817107.0 908553.5    100.0          values = self._get_level_values(level)
  1676         2          2.0      1.0      0.0          return values
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,782943813
https://github.com/pydata/xarray/issues/4789#issuecomment-766504338,https://api.github.com/repos/pydata/xarray/issues/4789,766504338,MDEyOklzc3VlQ29tbWVudDc2NjUwNDMzOA==,5635139,2021-01-25T02:46:40Z,2021-01-25T02:46:40Z,MEMBER,"One quick observation is that it's related to the MultiIndex: if we swap out the index for `idx = pd.Index(range(100_000_000))`, the time drops from 1.8 s to 812 µs.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,782943813
https://github.com/pydata/xarray/issues/4789#issuecomment-758373462,https://api.github.com/repos/pydata/xarray/issues/4789,758373462,MDEyOklzc3VlQ29tbWVudDc1ODM3MzQ2Mg==,1197350,2021-01-12T03:36:26Z,2021-01-12T03:36:26Z,MEMBER,I uncovered this issue with Dask's SVG in its `_repr_html` function: https://github.com/dask/dask/issues/6670. The fix made a big difference in repr size. Possibly related?,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,782943813