issue_comments


13 rows where issue = 713834297 sorted by updated_at descending


Commenters (4): heerad (5), max-sixty (4), shoyer (2), mathause (2)

Author association (2): MEMBER (8), NONE (5)
Issue: Allow skipna in .dot() (13 comments)
heerad (NONE) · 2020-10-20T22:17:08Z (edited 2020-10-20T22:21:14Z)
https://github.com/pydata/xarray/issues/4482#issuecomment-713172015

On the topic of fillna(), I'm seeing an odd unrelated issue that I don't have an explanation for.

I have a dataarray x that I'm able to call x.compute() on.

When I do x.fillna(0).compute(), I get the following error:

KeyError: ('where-3a3[...long hex string]', 100, 0, 0, 4)

Stack trace shows it's failing on a get_dependencies(dsk, key, task, as_list) call from a cull(dsk, keys) call in dask/optimization.py. get_dependencies itself is defined in dask/core.py.

I have no idea how to reproduce this simply... If it helps narrow things down, x is a dask array, one of the dimensions is a datetime64, and all others are strings. I've tried using both the default engine and netcdf4 when loading with open_mfdataset.

heerad (NONE) · 2020-10-14T15:21:29Z (edited 2020-10-14T15:21:55Z)
https://github.com/pydata/xarray/issues/4482#issuecomment-708474940

Adding on: whatever the solution is that avoids blowing up memory, especially when used with construct, it would be useful for it to cover both fillna(0) and notnull(). One common use case is a weighted mean that normalizes by the sum of weights corresponding only to non-null entries, as here: https://github.com/pydata/xarray/blob/333e8dba55f0165ccadf18f2aaaee9257a4d716b/xarray/core/weighted.py#L169
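The masked-normalization logic described here can be sketched as follows. This is a hypothetical helper (`weighted_mean_skipna` is not xarray's actual weighted implementation), just an illustration of normalizing only by the weights at non-null entries:

```python
import numpy as np
import xarray as xr

# Hypothetical helper (not xarray's actual weighted code): normalize only
# by the weights attached to non-null entries of x.
def weighted_mean_skipna(x, weights):
    weights = weights.where(x.notnull())        # drop weights at NaN positions
    numerator = (x * weights).sum(skipna=True)  # weighted sum, ignoring NaNs
    denominator = weights.sum(skipna=True)      # only the weights actually used
    return numerator / denominator

x = xr.DataArray([1.0, np.nan, 3.0], dims="d")
w = xr.DataArray([1.0, 1.0, 2.0], dims="d")
print(float(weighted_mean_skipna(x, w)))  # (1*1 + 3*2) / (1 + 2) ≈ 2.333
```

Without the `weights.where(x.notnull())` mask, the denominator would include the weight at the NaN position and bias the mean low.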

max-sixty (MEMBER) · 2020-10-14T00:50:40Z
https://github.com/pydata/xarray/issues/4482#issuecomment-708088129

Right — that makes sense now. Given that .fillna(0) creates a copy, when we're doing stride tricks in the form of construct, that copy can be huge.

So I think there's a spectrum of implementations of skipna, of which two are:

  • a convenient alias for .fillna(0), like @mathause's example above, IIUC
  • ensuring a copy isn't made, which may require diving into np.einsum (or open to alternatives)

The second would be required for @heerad's case above.

heerad (NONE) · 2020-10-12T20:31:26Z (edited 2020-10-12T21:05:24Z)
https://github.com/pydata/xarray/issues/4482#issuecomment-707331260

See below. I temporarily write some files to netcdf then recombine them lazily using open_mfdataset.

The issue seems to present itself more consistently when my x is a constructed rolling window, and especially when it's a rolling window over a stacked dimension, as in the example below.

I used the memory_profiler package and associated notebook extension (%%memit cell magic) to do memory profiling.

```python
import os

import numpy as np
import xarray as xr

N = 1000
N_per_file = 10
M = 100
K = 10
window_size = 150

tmp_dir = 'tmp'
os.mkdir(tmp_dir)

# save many netcdf files, later to be concatted into a dask.delayed dataset
for i in range(0, N, N_per_file):
    # 3 dimensions:
    # d1 is the dim we're splitting our files/chunking along
    # d2 is a common dim among all files/chunks
    # d3 is a common dim among all files/chunks, where the first half is 0
    #    and the second half is nan
    x_i = xr.DataArray([[[0]*(K//2) + [np.nan]*(K//2)]*M]*N_per_file,
        [('d1', [x for x in range(i, i+N_per_file)]),
         ('d2', [x for x in range(M)]),
         ('d3', [x for x in range(K)])])

    x_i.to_dataset(name='vals').to_netcdf('{}/file_{}.nc'.format(tmp_dir, i))

# open lazily (combine='nested' is required alongside concat_dim in newer xarray)
x = xr.open_mfdataset('{}/*.nc'.format(tmp_dir), parallel=True,
                      combine='nested', concat_dim='d1').vals

# a rolling window along a stacked dimension
x_windows = x.stack(d13=['d1', 'd3']).rolling(d13=window_size).construct('window')

# we'll dot x_windows with y along the window dimension
y = xr.DataArray([1]*window_size, dims='window')

# incremental memory: 1.94 MiB
x_windows.dot(y).compute()

# incremental memory: 20.00 MiB
x_windows.notnull().dot(y).compute()

# incremental memory: 182.13 MiB
x_windows.fillna(0.).dot(y).compute()

# incremental memory: 211.52 MiB
x_windows.weighted(y).mean('window', skipna=True).compute()
```

shoyer (MEMBER) · 2020-10-12T18:08:55Z
https://github.com/pydata/xarray/issues/4482#issuecomment-707270259

> I'm happy to live with a memory copy for now with fillna and notnull, but allocating the full, un-chunked array into memory is a showstopper. Is there a different workaround that I can use in the meantime?

This is surprising behavior, and definitely sounds like a bug!

If you could put together a minimal test case for reproducing the issue, we could look into it. It's hard to say what a work-around would be without knowing the source of the issue.

heerad (NONE) · 2020-10-12T17:01:54Z (edited 2020-10-12T17:16:07Z)
https://github.com/pydata/xarray/issues/4482#issuecomment-707238146

Adding on here, even if fillna were to create a memory copy, we'd only expect memory usage to double. However, in my case with dask-based chunking (via parallel=True in open_mfdataset) I'm seeing the memory blow up multiple times that (10x+) until all available memory is eaten up.

This is happening with x.fillna(0).dot(y) as well as x.notnull().dot(y) and x.weighted(y).sum(skipna=True), where x is the chunked array. This suggests that dask-based chunking isn't being carried through the fillna and notnull ops, and that the entire non-chunked arrays are being computed.

More evidence in favor: if I do (x*y).sum(skipna=True) I get the following error:

MemoryError: Unable to allocate [xxx] GiB for an array with shape [un-chunked array shape] and data type float64

I'm happy to live with a memory copy for now with fillna and notnull, but allocating the full, un-chunked array into memory is a showstopper. Is there a different workaround that I can use in the meantime?
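One way to sanity-check the chunking hypothesis (a sketch assuming dask is installed, not a reproduction of the blow-up reported here): on a plainly chunked array, fillna and notnull are element-wise and should preserve the dask chunk structure rather than materializing the whole array.

```python
import numpy as np
import xarray as xr

# Element-wise ops on a chunked array should keep the chunk structure.
x = xr.DataArray(np.arange(20.0).reshape(4, 5), dims=("a", "b")).chunk({"a": 2})
print(x.data.chunks)            # ((2, 2), (5,))
print(x.fillna(0).data.chunks)  # same chunking
print(x.notnull().data.chunks)  # same chunking
```

If a case like the one above showed the chunks collapsing to a single block, that would localize the bug.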

max-sixty (MEMBER) · 2020-10-09T23:29:23Z
https://github.com/pydata/xarray/issues/4482#issuecomment-706442785

Maybe on very small arrays it's quicker to do a product than a copy? As the array scales, it's surely not — dot product is O(n^3) or similar. I would be interested to see a repro...

Reactions: 👍 1
mathause (MEMBER) · 2020-10-09T12:01:37Z
https://github.com/pydata/xarray/issues/4482#issuecomment-706140256

fillna is basically where(notnull(data), data, other). I think what takes the longest is the where part - possibly making a memory copy(?).
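The equivalence stated here can be checked directly with a minimal sketch:

```python
import numpy as np
import xarray as xr

# fillna(value) behaves like where(notnull(data), data, value).
x = xr.DataArray([1.0, np.nan, 3.0], dims="d")
via_fillna = x.fillna(0)
via_where = xr.where(x.notnull(), x, 0)
print(via_fillna.equals(via_where))  # True
```

Both paths produce a fresh array of the same shape as x, which is where the copy cost comes from.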

max-sixty (MEMBER) · 2020-10-07T01:12:09Z
https://github.com/pydata/xarray/issues/4482#issuecomment-704634447

Any idea why x.fillna(0.) is so much more expensive than x.dot(y)?

For some functions, where fillna(0.) gives a different result from skipna=True, I could see it being worth writing new routines. But for the dot product it's definitionally the same. And fillna seems a much simpler operation than dot...

shoyer (MEMBER) · 2020-10-06T06:38:51Z
https://github.com/pydata/xarray/issues/4482#issuecomment-704064370

I agree this would be welcome! Even if it isn't much faster than the options already shown here, at least we could point users to the best option we know of.

I suspect achieving the full speed of dot() with skip-NA support is impossible, but we can probably do much better. I might start by prototyping something in Numba, just to get a sense of what is achievable with a low-level approach. But keep in mind that functions like np.dot and np.einsum ("GEMM") are a few of the most highly optimized routines in numerical computing.
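This is not the Numba prototype described above — just a NumPy sketch of the semantics a skip-NA dot must provide (the name `dot_skipna_sketch` is hypothetical). Note np.nansum still materializes the x * y temporary, so it does not solve the memory problem; a fused low-level kernel would avoid that.

```python
import numpy as np

# Skip-NA contraction semantics: NaN entries contribute nothing to the sum.
def dot_skipna_sketch(x, y, axis=-1):
    return np.nansum(x * y, axis=axis)

x = np.array([1.0, np.nan, 3.0])
y = np.array([2.0, 2.0, 2.0])
print(dot_skipna_sketch(x, y))  # 8.0
```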

mathause (MEMBER) · 2020-10-02T21:27:33Z
https://github.com/pydata/xarray/issues/4482#issuecomment-702965721

Yes, that would be very helpful. This is used for the weighted operations:

https://github.com/pydata/xarray/blob/333e8dba55f0165ccadf18f2aaaee9257a4d716b/xarray/core/weighted.py#L129-L135

and it would be great if this could be done upstream. However, dot is implemented using np.einsum which is quite a gnarly beast.
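For reference, the contraction that dot performs reduces to an einsum call along these lines (a minimal 1-D illustration, not xarray's actual subscript-building code, which handles arbitrary dimension names):

```python
import numpy as np

# Dot of two 1-D arrays written as an einsum contraction over index i.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(np.einsum("i,i->", a, b))  # 32.0
```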

heerad (NONE) · 2020-10-02T20:20:53Z (edited 2020-10-02T20:32:32Z)
https://github.com/pydata/xarray/issues/4482#issuecomment-702939943

Great, looks like I missed that option. Thanks.

For reference, x.fillna(0).dot(y) takes 18 seconds in that same example, so a little better.

max-sixty (MEMBER) · 2020-10-02T20:13:37Z
https://github.com/pydata/xarray/issues/4482#issuecomment-702937094

I agree this would be a welcome option.

As a workaround, you could fillna with zero?
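The suggested workaround in a minimal sketch: a plain dot propagates NaN, while filling with zero first drops the NaN terms from the sum.

```python
import numpy as np
import xarray as xr

x = xr.DataArray([1.0, np.nan, 3.0], dims="d")
y = xr.DataArray([2.0, 2.0, 2.0], dims="d")
print(float(x.dot(y)))            # nan — the NaN term poisons the sum
print(float(x.fillna(0).dot(y)))  # 8.0 — the NaN term contributes zero
```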


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · About: xarray-datasette