issue_comments


10 rows where issue = 572875480 sorted by updated_at descending


user 3

  • max-sixty 5
  • seth-p 4
  • josephnowak 1

author_association 2

  • CONTRIBUTOR 5
  • MEMBER 5

issue 1

  • {DataArray,Dataset}.rank() should support an optional list of dimensions · 10
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
973623524 https://github.com/pydata/xarray/issues/3810#issuecomment-973623524 https://api.github.com/repos/pydata/xarray/issues/3810 IC_kwDOAMm_X846CFDk josephnowak 25071375 2021-11-19T01:00:11Z 2021-11-19T15:09:10Z CONTRIBUTOR

Is it possible to add an option for controlling what happens when there is a tie in the rank? (If you want, I can create a separate issue for this.)

I think this can be done using SciPy's rankdata function instead of bottleneck's rank (though adding a method option on top of the bottleneck package should also be possible).
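
For reference, a quick illustration of how SciPy's tie-handling methods differ on a small array (a sketch, independent of the xarray code below):

```py
from scipy.stats import rankdata

x = [1, 2, 2, 3]
for method in ('average', 'min', 'max', 'dense', 'ordinal'):
    # e.g. 'average' gives [1, 2.5, 2.5, 4], 'dense' gives [1, 2, 2, 3],
    # and 'ordinal' breaks the tie by position, giving [1, 2, 3, 4].
    print(method, rankdata(x, method=method))
```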

Small example:

```py
import dask.array
import xarray
from scipy.stats import rankdata

arr = xarray.DataArray(
    dask.array.random.random((11, 10), chunks=(3, 2)),
    coords={'a': list(range(11)), 'b': list(range(10))}
)

def rank(x: xarray.DataArray, dim: str, method: str):
    # This option generates fewer tasks, I don't know why
    axis = x.dims.index(dim)
    return xarray.DataArray(
        dask.array.apply_along_axis(
            rankdata,
            axis,
            x.data,
            dtype=float,
            shape=(x.sizes[dim], ),
            method=method
        ),
        coords=x.coords,
        dims=x.dims
    )

def rank2(x: xarray.DataArray, dim: str, method: str):
    axis = x.dims.index(dim)
    return xarray.apply_ufunc(
        rankdata,
        x.chunk({dim: x.sizes[dim]}),
        dask='parallelized',
        kwargs={'method': method, 'axis': axis},
        meta=x.data._meta
    )

arr_rank1 = rank(arr, 'a', 'ordinal')
arr_rank2 = rank2(arr, 'a', 'ordinal')

assert arr_rank1.equals(arr_rank2)
```

Probably this can work for ranking arrays with NaN values:

```py
import numpy as np
from scipy.stats import rankdata

def _nanrankdata1(a, method):
    # Rank only the non-NaN entries; NaN positions stay NaN in the output.
    y = np.empty(a.shape, dtype=np.float64)
    y.fill(np.nan)
    idx = ~np.isnan(a)
    y[idx] = rankdata(a[idx], method=method)
    return y
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  {DataArray,Dataset}.rank() should support an optional list of dimensions 572875480
592738965 https://github.com/pydata/xarray/issues/3810#issuecomment-592738965 https://api.github.com/repos/pydata/xarray/issues/3810 MDEyOklzc3VlQ29tbWVudDU5MjczODk2NQ== max-sixty 5635139 2020-02-28T21:33:35Z 2020-02-28T21:33:35Z MEMBER

Yeah, unfortunately I'm fairly confident about this; have a go with moderately large arrays for sum and you'll quickly see the performance cliff.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  {DataArray,Dataset}.rank() should support an optional list of dimensions 572875480
592737661 https://github.com/pydata/xarray/issues/3810#issuecomment-592737661 https://api.github.com/repos/pydata/xarray/issues/3810 MDEyOklzc3VlQ29tbWVudDU5MjczNzY2MQ== seth-p 7441788 2020-02-28T21:29:58Z 2020-02-28T21:31:31Z CONTRIBUTOR

Note that with the apply_ufunc implementation we're only reshaping dims-sized ndarrays, not (necessarily) the whole DataArray, so maybe it's not too bad? It might be better to first sort dims to be in the same order as self.dims, i.e. `dims = [dim_ for dim_ in self.dims if dim_ in dims]`. But I'm just speculating.
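
A minimal sketch of that reordering idea combined with the apply_ufunc approach from this thread (the helper name rank_over is illustrative, not an existing xarray function):

```py
import bottleneck
import xarray as xr

def rank_over(da: xr.DataArray, dims):
    # Reorder the requested dims to match the array's own dimension order
    # before passing them as core dimensions.
    ordered = [d for d in da.dims if d in dims]
    return xr.apply_ufunc(
        lambda x: bottleneck.rankdata(x).reshape(x.shape),
        da,
        input_core_dims=[ordered],
        output_core_dims=[ordered],
        vectorize=True,
    ).transpose(*da.dims)
```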

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  {DataArray,Dataset}.rank() should support an optional list of dimensions 572875480
592721162 https://github.com/pydata/xarray/issues/3810#issuecomment-592721162 https://api.github.com/repos/pydata/xarray/issues/3810 MDEyOklzc3VlQ29tbWVudDU5MjcyMTE2Mg== max-sixty 5635139 2020-02-28T20:47:33Z 2020-02-28T20:47:33Z MEMBER

Great -- that's cool and a good implementation of apply_ufunc. As above, we wouldn't want to replace rank with that given the reshaping (we'd need a function that computes over multiple dimensions).

We could use something similar for groupbys though?
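
For reference, a rough sketch of what rank-within-groupby can look like today with GroupBy.map (the 'label' coordinate and the data here are made up for illustration):

```py
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.random.rand(6),
    dims='time',
    coords={'time': list(range(6)),
            'label': ('time', ['a', 'a', 'b', 'b', 'b', 'a'])},
)

# Rank along 'time' separately within each group of 'label'.
ranked = da.groupby('label').map(lambda g: g.rank('time'))
```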

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  {DataArray,Dataset}.rank() should support an optional list of dimensions 572875480
592715925 https://github.com/pydata/xarray/issues/3810#issuecomment-592715925 https://api.github.com/repos/pydata/xarray/issues/3810 MDEyOklzc3VlQ29tbWVudDU5MjcxNTkyNQ== seth-p 7441788 2020-02-28T20:33:43Z 2020-02-28T20:35:57Z CONTRIBUTOR

A few minor tweaks needed:

```
In [20]: import bottleneck

In [21]: xr.apply_ufunc(
    ...:     lambda x: bottleneck.rankdata(x).reshape(x.shape),
    ...:     d,
    ...:     input_core_dims=[['xyz', 'abc']],
    ...:     output_core_dims=[['xyz', 'abc']],
    ...:     vectorize=True
    ...: ).transpose(*d.dims)
Out[21]:
<xarray.DataArray (abc: 4, xyz: 3)>
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.],
       [ 7.,  8.,  9.],
       [10., 11., 12.]])
Dimensions without coordinates: abc, xyz
```

Despite what the docs say, `bottleneck.{nan}rankdata(a)` returns a 1-dimensional ndarray, not an array with the same shape as `a`.
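
A quick check of that behaviour (a sketch; the shapes are the point):

```py
import bottleneck
import numpy as np

a = np.array([[3.0, 1.0], [2.0, 4.0]])

flat = bottleneck.rankdata(a)    # per the observation above: 1-D ranks of the flattened array
ranks = flat.reshape(a.shape)    # restore the original 2-D shape, as in the lambda above

print(flat.shape, ranks.shape)   # (4,) (2, 2)
```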

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  {DataArray,Dataset}.rank() should support an optional list of dimensions 572875480
592708353 https://github.com/pydata/xarray/issues/3810#issuecomment-592708353 https://api.github.com/repos/pydata/xarray/issues/3810 MDEyOklzc3VlQ29tbWVudDU5MjcwODM1Mw== max-sixty 5635139 2020-02-28T20:13:51Z 2020-02-28T20:13:51Z MEMBER

Could you try running that?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  {DataArray,Dataset}.rank() should support an optional list of dimensions 572875480
592672463 https://github.com/pydata/xarray/issues/3810#issuecomment-592672463 https://api.github.com/repos/pydata/xarray/issues/3810 MDEyOklzc3VlQ29tbWVudDU5MjY3MjQ2Mw== seth-p 7441788 2020-02-28T18:51:18Z 2020-02-28T18:52:29Z CONTRIBUTOR

What's wrong with the following? (Still need to deal with pct and keep_attrs.)

```
apply_ufunc(
    bottleneck.{nan}rankdata,
    self,
    input_core_dims=[dims],
    output_core_dims=[dims],
    vectorize=True
)
```

Per https://kwgoodman.github.io/bottleneck-doc/reference.html#bottleneck.rankdata, "The default (axis=None) is to rank the elements of the flattened array."

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  {DataArray,Dataset}.rank() should support an optional list of dimensions 572875480
592665711 https://github.com/pydata/xarray/issues/3810#issuecomment-592665711 https://api.github.com/repos/pydata/xarray/issues/3810 MDEyOklzc3VlQ29tbWVudDU5MjY2NTcxMQ== max-sixty 5635139 2020-02-28T18:34:44Z 2020-02-28T18:34:44Z MEMBER

Yes, we can always reshape as a way of running numerical operations over multiple dimensions. But reshaping can be an expensive operation, so doing it as part of a numerical operation can cause surprises. (If you're interested, try running a sum over multiple dimensions and compare it to a reshape plus a sum over the single reshaped dimension; see the sketch below.)

Instead, users can do this themselves, giving them context and control.

Reshaping is OK to do in groupby though (I think), so adding rank to groupby would be one way of accomplishing this.
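
A minimal sketch of the sum comparison suggested above (sizes and timings are illustrative only):

```py
import numpy as np
import xarray as xr
from timeit import timeit

da = xr.DataArray(np.random.rand(200, 200, 200), dims=('x', 'y', 'z'))

# Sum over two dimensions directly vs. stacking them into one dimension
# (a reshape) and summing over the single stacked dimension.
direct = timeit(lambda: da.sum(['x', 'y']), number=20)
stacked = timeit(lambda: da.stack(xy=['x', 'y']).sum('xy'), number=20)

print(f"direct: {direct:.3f}s  stack+sum: {stacked:.3f}s")
```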

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  {DataArray,Dataset}.rank() should support an optional list of dimensions 572875480
592654794 https://github.com/pydata/xarray/issues/3810#issuecomment-592654794 https://api.github.com/repos/pydata/xarray/issues/3810 MDEyOklzc3VlQ29tbWVudDU5MjY1NDc5NA== seth-p 7441788 2020-02-28T18:06:57Z 2020-02-28T18:06:57Z CONTRIBUTOR

Assuming dims is a non-empty list of dimensions, the following code seems to work:

```
temp_dim = '__temp_dim__'
return da.stack(**{temp_dim: dims}).\
    rank(temp_dim, pct=pct, keep_attrs=keep_attrs).\
    unstack(temp_dim).transpose(*da.dims).\
    drop_vars([dim_ for dim_ in dims if dim_ not in da.coords])
```
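
Wrapped up as a standalone helper with a small usage example (the name rank_multi is illustrative; this just packages the snippet above):

```py
import numpy as np
import xarray as xr

def rank_multi(da: xr.DataArray, dims, pct=False, keep_attrs=False):
    # Stack the requested dims into one temporary dimension, rank along it,
    # then unstack and restore the original dimension order.
    temp_dim = '__temp_dim__'
    return (
        da.stack(**{temp_dim: dims})
        .rank(temp_dim, pct=pct, keep_attrs=keep_attrs)
        .unstack(temp_dim)
        .transpose(*da.dims)
        .drop_vars([d for d in dims if d not in da.coords])
    )

da = xr.DataArray(np.random.rand(3, 4), dims=('x', 'y'))
ranked = rank_multi(da, ['x', 'y'])  # joint rank over both dimensions
```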

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  {DataArray,Dataset}.rank() should support an optional list of dimensions 572875480
592645335 https://github.com/pydata/xarray/issues/3810#issuecomment-592645335 https://api.github.com/repos/pydata/xarray/issues/3810 MDEyOklzc3VlQ29tbWVudDU5MjY0NTMzNQ== max-sixty 5635139 2020-02-28T17:43:05Z 2020-02-28T17:43:05Z MEMBER

This would be great. The underlying numerical library we use, bottleneck, doesn't support multiple dimensions. If there were another option, or someone wanted to write one in numbagg, that would be a welcome addition.
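
For context, a plain NumPy/SciPy sketch of the semantics such a function would need (it still reshapes internally, which is exactly the cost discussed above, so it is illustrative rather than a candidate implementation):

```py
import numpy as np
from scipy.stats import rankdata

def rank_over_axes(a: np.ndarray, axes) -> np.ndarray:
    # Move the requested axes to the end, rank their values jointly,
    # then restore the original axis order.
    dest = list(range(-len(axes), 0))
    moved = np.moveaxis(a, axes, dest)
    flat = moved.reshape(moved.shape[:-len(axes)] + (-1,))
    ranks = rankdata(flat, axis=-1).reshape(moved.shape)
    return np.moveaxis(ranks, dest, axes)

a = np.random.rand(3, 4, 5)
r = rank_over_axes(a, (1, 2))  # rank jointly over the last two axes
```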

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  {DataArray,Dataset}.rank() should support an optional list of dimensions 572875480


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);