
issue_comments


5 rows where issue = 782943813 sorted by updated_at descending


id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
766983346 https://github.com/pydata/xarray/issues/4789#issuecomment-766983346 https://api.github.com/repos/pydata/xarray/issues/4789 MDEyOklzc3VlQ29tbWVudDc2Njk4MzM0Ng== keewis 14808389 2021-01-25T17:33:19Z 2021-01-26T21:59:20Z MEMBER

that seems to be the main issue. With

```diff
diff --git a/xarray/core/formatting.py b/xarray/core/formatting.py
index 282620e3..f825ed85 100644
--- a/xarray/core/formatting.py
+++ b/xarray/core/formatting.py
@@ -300,9 +300,11 @@ def _summarize_coord_multiindex(coord, col_width, marker):
 
 
 def _summarize_coord_levels(coord, col_width, marker="-"):
+    indices = list(range(10)) + list(range(-10, 0))
+    subset = coord[indices]
     return "\n".join(
         summarize_variable(
-            lname, coord.get_level_variable(lname), col_width, marker=marker
+            lname, subset.get_level_variable(lname), col_width, marker=marker
         )
         for lname in coord.level_names
     )
```

I get a speed-up of about 180x (for `xr.DataArray(pd.Series(25_000_000, index=idx))`; not sure if the speed-up is as significant for bigger arrays). We should probably make the shape of `indices` depend on `col_width`, though.
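One thing to watch with hard-coded positions like the ones in the diff: a coordinate with fewer than 20 elements would presumably yield duplicated or out-of-bounds positions. The following is only a rough sketch of a guard for that case. `head_tail_positions` and `n_each` are hypothetical names, not part of xarray, and how `n_each` should be derived from `col_width` is left open.

```python
def head_tail_positions(size, n_each=10):
    """Return positions of the first and last ``n_each`` items of a sequence.

    Hypothetical helper. If the sequence has at most ``2 * n_each`` items,
    every position is returned exactly once, so nothing is duplicated or
    out of bounds.
    """
    if size < 0 or n_each < 0:
        raise ValueError("size and n_each must be non-negative")
    if size <= 2 * n_each:
        # Short sequence: show everything.
        return list(range(size))
    # Long sequence: first n_each positions, then the last n_each positions.
    return list(range(n_each)) + list(range(size - n_each, size))


print(head_tail_positions(5))       # [0, 1, 2, 3, 4]
print(head_tail_positions(100, 3))  # [0, 1, 2, 97, 98, 99]
```

The positions are all non-negative, so they can be passed to any positional indexer without relying on negative-index semantics.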

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Poor performance of repr of large arrays, particularly jupyter repr 782943813
767186998 https://github.com/pydata/xarray/issues/4789#issuecomment-767186998 https://api.github.com/repos/pydata/xarray/issues/4789 MDEyOklzc3VlQ29tbWVudDc2NzE4Njk5OA== max-sixty 5635139 2021-01-25T23:50:56Z 2021-01-25T23:50:56Z MEMBER

Yes great, I think that would be a great cut-through solution!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Poor performance of repr of large arrays, particularly jupyter repr 782943813
766518276 https://github.com/pydata/xarray/issues/4789#issuecomment-766518276 https://api.github.com/repos/pydata/xarray/issues/4789 MDEyOklzc3VlQ29tbWVudDc2NjUxODI3Ng== max-sixty 5635139 2021-01-25T03:36:23Z 2021-01-25T03:36:23Z MEMBER

The rabbit hole went deeper than I expected. I need to sign off now, but leaving what I have in case someone else has some insight.

Essentially, we call `get_level_variable` on the coord in `formatting.py`, which calls `get_level_values` into pandas. This is really slow on large MultiIndexes! I think it's recreating the whole index. I got as deep as `algos.take_1d`.

I think we can probably do something smarter to only call this on the first & last items in the MultiIndex.
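A minimal sketch of that idea, using only pandas and NumPy: subset the MultiIndex to its first and last few entries, then ask for the level values of the subset. `level_values_head_tail` and `n_each` are hypothetical names for illustration, not an existing pandas or xarray API.

```python
import numpy as np
import pandas as pd


def level_values_head_tail(index, level, n_each=10):
    """Level values for only the first and last ``n_each`` entries.

    Hypothetical helper: it subsets the MultiIndex first, so that
    ``get_level_values`` only has to materialise a handful of labels
    instead of one label per row.
    """
    size = len(index)
    if size <= 2 * n_each:
        # Short index: keep every entry exactly once.
        positions = np.arange(size)
    else:
        positions = np.concatenate(
            [np.arange(n_each), np.arange(size - n_each, size)]
        )
    return index.take(positions).get_level_values(level)


idx = pd.MultiIndex.from_product(
    [list("abcd"), range(1_000)], names=["letter", "number"]
)
preview = level_values_head_tail(idx, "letter", n_each=3)
print(list(preview))  # ['a', 'a', 'a', 'd', 'd', 'd']
```

Whether this is fast enough in practice would need to be checked with a profiler on a large index; the sketch only shows the shape of the change.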

For reference, here's the output of line_profiler, a good profiler for figuring this sort of thing out:

```
%lprun -f formatting._summarize_coord_levels -f IndexVariable.get_level_variable -f pd.MultiIndex.get_level_values -f pd.MultiIndex._get_level_values coords_repr(da.coords)

Total time: 1.91029 s
File: /Users/maximilian/workspace/xarray/xarray/core/formatting.py
Function: _summarize_coord_levels at line 302

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   302                                           def _summarize_coord_levels(coord, col_width, marker="-"):
   303         2    1910185.0 955092.5    100.0      return "\n".join(
   304                                                   summarize_variable(
   305                                                       lname, coord.get_level_variable(lname), col_width, marker=marker
   306                                                   )
   307         1        102.0    102.0      0.0          for lname in coord.level_names
   308                                               )

Total time: 1.81777 s
File: /Users/maximilian/workspace/xarray/xarray/core/variable.py
Function: get_level_variable at line 2687

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  2687                                               def get_level_variable(self, level):
  2688                                                   """Return a new IndexVariable from a given MultiIndex level."""
  2689         2        303.0    151.5      0.0          if self.level_names is None:
  2690                                                       raise ValueError("IndexVariable %r has no MultiIndex" % self.name)
  2691         2        216.0    108.0      0.0          index = self.to_index()
  2692         2    1817254.0 908627.0    100.0          return type(self)(self.dims, index.get_level_values(level))

Total time: 1.81709 s
File: /usr/local/lib/python3.9/site-packages/pandas/core/indexes/multi.py
Function: _get_level_values at line 1617

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1617                                               def _get_level_values(self, level, unique=False):
  1618                                                   """
  1619                                                   Return vector of label values for requested level,
  1620                                                   equal to the length of the index
  1621
  1622                                                   this is an internal method
  1623
  1624                                                   Parameters
  1625                                                   ----------
  1626                                                   level : int level
  1627                                                   unique : bool, default False
  1628                                                       if True, drop duplicated values
  1629
  1630                                                   Returns
  1631                                                   -------
  1632                                                   values : ndarray
  1633                                                   """
  1634         2         47.0     23.5      0.0          lev = self.levels[level]
  1635         2          5.0      2.5      0.0          level_codes = self.codes[level]
  1636         2          2.0      1.0      0.0          name = self._names[level]
  1637         2          1.0      0.5      0.0          if unique:
  1638                                                       level_codes = algos.unique(level_codes)
  1639         2    1816971.0 908485.5    100.0          filled = algos.take_1d(lev._values, level_codes, fill_value=lev._na_value)
  1640         2         60.0     30.0      0.0          return lev._shallow_copy(filled, name=name)

Total time: 1.81712 s
File: /usr/local/lib/python3.9/site-packages/pandas/core/indexes/multi.py
Function: get_level_values at line 1642

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1642                                               def get_level_values(self, level):
  1643                                                   """
  1644                                                   Return vector of label values for requested level.
  1645
  1646                                                   Length of returned vector is equal to the length of the index.
  1647
  1648                                                   Parameters
  1649                                                   ----------
  1650                                                   level : int or str
  1651                                                       level is either the integer position of the level in the
  1652                                                       MultiIndex, or the name of the level.
  1653
  1654                                                   Returns
  1655                                                   -------
  1656                                                   values : Index
  1657                                                       Values is a level of this MultiIndex converted to
  1658                                                       a single :class:`Index` (or subclass thereof).
  1659
  1660                                                   Examples
  1661                                                   --------
  1662                                                   Create a MultiIndex:
  1663
  1664                                                   >>> mi = pd.MultiIndex.from_arrays((list('abc'), list('def')))
  1665                                                   >>> mi.names = ['level_1', 'level_2']
  1666
  1667                                                   Get level values by supplying level as either integer or name:
  1668
  1669                                                   >>> mi.get_level_values(0)
  1670                                                   Index(['a', 'b', 'c'], dtype='object', name='level_1')
  1671                                                   >>> mi.get_level_values('level_2')
  1672                                                   Index(['d', 'e', 'f'], dtype='object', name='level_2')
  1673                                                   """
  1674         2         11.0      5.5      0.0          level = self._get_level_number(level)
  1675         2    1817107.0 908553.5    100.0          values = self._get_level_values(level)
  1676         2          2.0      1.0      0.0          return values
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Poor performance of repr of large arrays, particularly jupyter repr 782943813
766504338 https://github.com/pydata/xarray/issues/4789#issuecomment-766504338 https://api.github.com/repos/pydata/xarray/issues/4789 MDEyOklzc3VlQ29tbWVudDc2NjUwNDMzOA== max-sixty 5635139 2021-01-25T02:46:40Z 2021-01-25T02:46:40Z MEMBER

One quick observation is that it's related to the MultiIndex: if we swap out the index for `idx = pd.Index(range(100_000_000))`, the time drops from 1.8 s to 812 µs.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Poor performance of repr of large arrays, particularly jupyter repr 782943813
758373462 https://github.com/pydata/xarray/issues/4789#issuecomment-758373462 https://api.github.com/repos/pydata/xarray/issues/4789 MDEyOklzc3VlQ29tbWVudDc1ODM3MzQ2Mg== rabernat 1197350 2021-01-12T03:36:26Z 2021-01-12T03:36:26Z MEMBER

I uncovered this issue with Dask's SVG in its `_repr_html_` function: https://github.com/dask/dask/issues/6670. The fix made a big difference in repr size. Possibly related?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Poor performance of repr of large arrays, particularly jupyter repr 782943813


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);