issue_comments


13 rows where issue = 713834297 sorted by updated_at descending


Commenters (4): heerad (5), max-sixty (4), shoyer (2), mathause (2)

Author association (2): MEMBER (8), NONE (5)
Issue: Allow skipna in .dot() (13 comments)
heerad (NONE) · 2020-10-20T22:17:08Z (edited 2020-10-20T22:21:14Z)
https://github.com/pydata/xarray/issues/4482#issuecomment-713172015

On the topic of fillna(), I'm seeing an odd unrelated issue that I don't have an explanation for.

I have a dataarray x that I'm able to call x.compute() on.

When I do x.fillna(0).compute(), I get the following error:

KeyError: ('where-3a3[...long hex string]', 100, 0, 0, 4)

Stack trace shows it's failing on a get_dependencies(dsk, key, task, as_list) call from a cull(dsk, keys) call in dask/optimization.py. get_dependencies itself is defined in dask/core.py.

I have no idea how to reproduce this simply... If it helps narrow things down, x is a dask array, one of the dimensions is a datetime64, and all others are strings. I've tried using both the default engine and netcdf4 when loading with open_mfdataset.

heerad (NONE) · 2020-10-14T15:21:29Z (edited 2020-10-14T15:21:55Z)
https://github.com/pydata/xarray/issues/4482#issuecomment-708474940

Adding on: whatever the solution is that avoids blowing up memory, especially when used with construct, it would be useful for it to cover both fillna(0) and notnull(). One common use case is a weighted mean that normalizes by the sum of weights corresponding only to non-null entries, as here: https://github.com/pydata/xarray/blob/333e8dba55f0165ccadf18f2aaaee9257a4d716b/xarray/core/weighted.py#L169
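The masked-normalization logic described here can be sketched as follows. This is a hypothetical helper (`weighted_mean_skipna` is not xarray's actual weighted implementation), just an illustration of normalizing only by the weights at non-null entries:

```python
import numpy as np
import xarray as xr

# Hypothetical helper (not xarray's actual weighted code): normalize only
# by the weights attached to non-null entries of x.
def weighted_mean_skipna(x, weights):
    weights = weights.where(x.notnull())        # drop weights at NaN positions
    numerator = (x * weights).sum(skipna=True)  # weighted sum, ignoring NaNs
    denominator = weights.sum(skipna=True)      # only the weights actually used
    return numerator / denominator

x = xr.DataArray([1.0, np.nan, 3.0], dims="d")
w = xr.DataArray([1.0, 1.0, 2.0], dims="d")
print(float(weighted_mean_skipna(x, w)))  # (1*1 + 3*2) / (1 + 2) ≈ 2.333
```

Without the `weights.where(x.notnull())` mask, the denominator would include the weight at the NaN position and bias the mean low.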

max-sixty (MEMBER) · 2020-10-14T00:50:40Z
https://github.com/pydata/xarray/issues/4482#issuecomment-708088129

Right — that makes sense now. Given that .fillna(0) creates a copy, when we're doing stride tricks in the form of construct, that copy can be huge.

So I think there's a spectrum of implementations of skipna, of which two are:

  • a convenient alias for .fillna(0), like @mathause's example above, IIUC
  • ensuring a copy isn't made, which may require diving into np.einsum (or open to alternatives)

The second would be required for @heerad's case above.

heerad (NONE) · 2020-10-12T20:31:26Z (edited 2020-10-12T21:05:24Z)
https://github.com/pydata/xarray/issues/4482#issuecomment-707331260

See below. I temporarily write some files to netcdf then recombine them lazily using open_mfdataset.

The issue seems to present itself more consistently when my x is a constructed rolling window, and especially when it's a rolling window over a stacked dimension, as in the example below.

I used the memory_profiler package and associated notebook extension (%%memit cell magic) to do memory profiling.

```python
import os

import numpy as np
import xarray as xr

N = 1000
N_per_file = 10
M = 100
K = 10
window_size = 150

tmp_dir = 'tmp'
os.mkdir(tmp_dir)

# save many netcdf files, later to be concatted into a dask.delayed dataset
for i in range(0, N, N_per_file):
    # 3 dimensions:
    # d1 is the dim we're splitting our files/chunking along
    # d2 is a common dim among all files/chunks
    # d3 is a common dim among all files/chunks, where the first half is 0
    #    and the second half is nan
    x_i = xr.DataArray([[[0]*(K//2) + [np.nan]*(K//2)]*M]*N_per_file,
        [('d1', [x for x in range(i, i+N_per_file)]),
         ('d2', [x for x in range(M)]),
         ('d3', [x for x in range(K)])])

    x_i.to_dataset(name='vals').to_netcdf('{}/file_{}.nc'.format(tmp_dir, i))

# open lazily (combine='nested' is required alongside concat_dim in newer xarray)
x = xr.open_mfdataset('{}/*.nc'.format(tmp_dir), parallel=True,
                      combine='nested', concat_dim='d1').vals

# a rolling window along a stacked dimension
x_windows = x.stack(d13=['d1', 'd3']).rolling(d13=window_size).construct('window')

# we'll dot x_windows with y along the window dimension
y = xr.DataArray([1]*window_size, dims='window')

# incremental memory: 1.94 MiB
x_windows.dot(y).compute()

# incremental memory: 20.00 MiB
x_windows.notnull().dot(y).compute()

# incremental memory: 182.13 MiB
x_windows.fillna(0.).dot(y).compute()

# incremental memory: 211.52 MiB
x_windows.weighted(y).mean('window', skipna=True).compute()
```

shoyer (MEMBER) · 2020-10-12T18:08:55Z
https://github.com/pydata/xarray/issues/4482#issuecomment-707270259

> I'm happy to live with a memory copy for now with fillna and notnull, but allocating the full, un-chunked array into memory is a showstopper. Is there a different workaround that I can use in the meantime?

This is surprising behavior, and definitely sounds like a bug!

If you could put together a minimal test case for reproducing the issue, we could look into it. It's hard to say what a work-around would be without knowing the source of the issue.

heerad (NONE) · 2020-10-12T17:01:54Z (edited 2020-10-12T17:16:07Z)
https://github.com/pydata/xarray/issues/4482#issuecomment-707238146

Adding on here, even if fillna were to create a memory copy, we'd only expect memory usage to double. However, in my case with dask-based chunking (via parallel=True in open_mfdataset) I'm seeing the memory blow up multiple times that (10x+) until all available memory is eaten up.

This is happening with x.fillna(0).dot(y) as well as x.notnull().dot(y) and x.weighted(y).sum(skipna=True), where x is the chunked array. This suggests that dask-based chunking isn't being carried through the fillna and notnull ops, and that the entire non-chunked arrays are being computed.

More evidence in favor: if I do (x*y).sum(skipna=True) I get the following error:

MemoryError: Unable to allocate [xxx] GiB for an array with shape [un-chunked array shape] and data type float64

I'm happy to live with a memory copy for now with fillna and notnull, but allocating the full, un-chunked array into memory is a showstopper. Is there a different workaround that I can use in the meantime?
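One way to sanity-check the chunking hypothesis (a sketch assuming dask is installed, not a reproduction of the blow-up reported here): on a plainly chunked array, fillna and notnull are element-wise and should preserve the dask chunk structure rather than materializing the whole array.

```python
import numpy as np
import xarray as xr

# Element-wise ops on a chunked array should keep the chunk structure.
x = xr.DataArray(np.arange(20.0).reshape(4, 5), dims=("a", "b")).chunk({"a": 2})
print(x.data.chunks)            # ((2, 2), (5,))
print(x.fillna(0).data.chunks)  # same chunking
print(x.notnull().data.chunks)  # same chunking
```

If a case like the one above showed the chunks collapsing to a single block, that would localize the bug.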

max-sixty (MEMBER) · 2020-10-09T23:29:23Z
https://github.com/pydata/xarray/issues/4482#issuecomment-706442785

Maybe on very small arrays it's quicker to do a product than a copy? As the array scales, it's surely not — dot product is O(n^3) or similar. I would be interested to see a repro...

Reactions: 👍 1
mathause (MEMBER) · 2020-10-09T12:01:37Z
https://github.com/pydata/xarray/issues/4482#issuecomment-706140256

fillna is basically where(notnull(data), data, other). I think what takes the longest is the where part - possibly making a memory copy(?).
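The equivalence stated here can be checked directly with a minimal sketch:

```python
import numpy as np
import xarray as xr

# fillna(value) behaves like where(notnull(data), data, value).
x = xr.DataArray([1.0, np.nan, 3.0], dims="d")
via_fillna = x.fillna(0)
via_where = xr.where(x.notnull(), x, 0)
print(via_fillna.equals(via_where))  # True
```

Both paths produce a fresh array of the same shape as x, which is where the copy cost comes from.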

max-sixty (MEMBER) · 2020-10-07T01:12:09Z
https://github.com/pydata/xarray/issues/4482#issuecomment-704634447

Any idea why x.fillna(0.) is so much more expensive than x.dot(y)?

For some functions, where fillna(0.) gives a different result from skipna=True, I could see it being worth writing new routines. But for the dot product it's definitionally the same. And fillna seems a much simpler operation than dot...

shoyer (MEMBER) · 2020-10-06T06:38:51Z
https://github.com/pydata/xarray/issues/4482#issuecomment-704064370

I agree this would be welcome! Even if it isn't much faster than the options already shown here, at least we could point users to the best option we know of.

I suspect achieving the full speed of dot() with skip-NA support is impossible, but we can probably do much better. I might start by prototyping something in Numba, just to get a sense of what is achievable with a low-level approach. But keep in mind that functions like np.dot and np.einsum ("GEMM") are a few of the most highly optimized routines in numerical computing.
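This is not the Numba prototype described above — just a NumPy sketch of the semantics a skip-NA dot must provide (the name `dot_skipna_sketch` is hypothetical). Note np.nansum still materializes the x * y temporary, so it does not solve the memory problem; a fused low-level kernel would avoid that.

```python
import numpy as np

# Skip-NA contraction semantics: NaN entries contribute nothing to the sum.
def dot_skipna_sketch(x, y, axis=-1):
    return np.nansum(x * y, axis=axis)

x = np.array([1.0, np.nan, 3.0])
y = np.array([2.0, 2.0, 2.0])
print(dot_skipna_sketch(x, y))  # 8.0
```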

mathause (MEMBER) · 2020-10-02T21:27:33Z
https://github.com/pydata/xarray/issues/4482#issuecomment-702965721

Yes, that would be very helpful. This is used for the weighted operations:

https://github.com/pydata/xarray/blob/333e8dba55f0165ccadf18f2aaaee9257a4d716b/xarray/core/weighted.py#L129-L135

and it would be great if this could be done upstream. However, dot is implemented using np.einsum which is quite a gnarly beast.
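For reference, the contraction that dot performs reduces to an einsum call along these lines (a minimal 1-D illustration, not xarray's actual subscript-building code, which handles arbitrary dimension names):

```python
import numpy as np

# Dot of two 1-D arrays written as an einsum contraction over index i.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(np.einsum("i,i->", a, b))  # 32.0
```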

heerad (NONE) · 2020-10-02T20:20:53Z (edited 2020-10-02T20:32:32Z)
https://github.com/pydata/xarray/issues/4482#issuecomment-702939943

Great, looks like I missed that option. Thanks.

For reference, x.fillna(0).dot(y) takes 18 seconds in that same example, so a little better.

max-sixty (MEMBER) · 2020-10-02T20:13:37Z
https://github.com/pydata/xarray/issues/4482#issuecomment-702937094

I agree this would be a welcome option.

As a workaround, you could fillna with zero?
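The suggested workaround in a minimal sketch: a plain dot propagates NaN, while filling with zero first drops the NaN terms from the sum.

```python
import numpy as np
import xarray as xr

x = xr.DataArray([1.0, np.nan, 3.0], dims="d")
y = xr.DataArray([2.0, 2.0, 2.0], dims="d")
print(float(x.dot(y)))            # nan — the NaN term poisons the sum
print(float(x.fillna(0).dot(y)))  # 8.0 — the NaN term contributes zero
```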


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · About: xarray-datasette