html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/4482#issuecomment-713172015,https://api.github.com/repos/pydata/xarray/issues/4482,713172015,MDEyOklzc3VlQ29tbWVudDcxMzE3MjAxNQ==,2560426,2020-10-20T22:17:08Z,2020-10-20T22:21:14Z,NONE,"On the topic of fillna(), I'm seeing an odd unrelated issue that I don't have an explanation for.
I have a dataarray `x` that I'm able to call `x.compute()` on.
When I do `x.fillna(0).compute()`, I get the following error:
```
KeyError: ('where-3a3[...long hex string]', 100, 0, 0, 4)
```
Stack trace shows it's failing on a `get_dependencies(dsk, key, task, as_list)` call from a `cull(dsk, keys)` call in dask/optimization.py. `get_dependencies` itself is defined in dask/core.py.
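For context, here is a minimal sketch of how that failure mode can arise (toy graph, not my real one; the key name is a made-up stand-in): `get_dependencies` does `dsk[key]`, so `cull` raises `KeyError` when asked for a key that is absent from the graph.

```
from dask.optimization import cull

# toy graph; the real one comes from open_mfdataset + fillna
dsk = {'a': 1, 'b': (lambda v: v + 1, 'a')}

# requesting a key that is not in the graph reproduces the same
# failure mode: get_dependencies does dsk[key] and raises KeyError
err = None
try:
    cull(dsk, ['where-deadbeef'])  # hypothetical stand-in for the real key
except KeyError as exc:
    err = exc
print(type(err).__name__, err)
```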
I have no idea how to reproduce this simply... If it helps narrow things down, `x` is a dask array, one of its dimensions is a datetime64, and all the others are strings. I've tried both the default engine and `netcdf4` when loading with `open_mfdataset`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,713834297
https://github.com/pydata/xarray/issues/4482#issuecomment-708474940,https://api.github.com/repos/pydata/xarray/issues/4482,708474940,MDEyOklzc3VlQ29tbWVudDcwODQ3NDk0MA==,2560426,2020-10-14T15:21:29Z,2020-10-14T15:21:55Z,NONE,"Adding on: whatever the solution that avoids blowing up memory turns out to be, especially when used with `construct`, it would be useful to have it implemented for both `fillna(0)` and `notnull()`. One common use case is taking a weighted mean that normalizes by the sum of only those weights corresponding to non-null entries, as here: https://github.com/pydata/xarray/blob/333e8dba55f0165ccadf18f2aaaee9257a4d716b/xarray/core/weighted.py#L169","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,713834297
https://github.com/pydata/xarray/issues/4482#issuecomment-707331260,https://api.github.com/repos/pydata/xarray/issues/4482,707331260,MDEyOklzc3VlQ29tbWVudDcwNzMzMTI2MA==,2560426,2020-10-12T20:31:26Z,2020-10-12T21:05:24Z,NONE,"See below. I temporarily write some files to netcdf then recombine them lazily using `open_mfdataset`.
The issue seems to present itself more consistently when my `x` is a constructed rolling window, and especially when it's a rolling window over a stacked dimension, as in the example below.
I used the `memory_profiler` package and associated notebook extension (`%%memit` cell magic) to do memory profiling.
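If you don't have the notebook extension handy, a rough stand-in for `%%memit` using only the stdlib `tracemalloc` looks like this (it traces Python-level allocations, so the numbers won't match `%%memit`'s RSS-based figures exactly):

```
import tracemalloc

def measure_peak_mib(fn):
    # run fn and report the peak traced allocation in MiB
    tracemalloc.start()
    result = fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak / 2**20

res, peak_mib = measure_peak_mib(lambda: [0.0] * 1_000_000)
print('peak MiB:', round(peak_mib, 2))
```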
```
import numpy as np
import xarray as xr
import os
N = 1000
N_per_file = 10
M = 100
K = 10
window_size = 150
tmp_dir = 'tmp'
os.mkdir(tmp_dir)
# save many netcdf files, later to be concatted into a dask.delayed dataset
for i in range(0, N, N_per_file):
    # 3 dimensions:
    # d1 is the dim we're splitting our files/chunking along
    # d2 is a common dim among all files/chunks
    # d3 is a common dim among all files/chunks, where the first half is 0 and the second half is nan
    x_i = xr.DataArray([[[0]*(K//2) + [np.nan]*(K//2)]*M]*N_per_file,
                       [('d1', list(range(i, i + N_per_file))),
                        ('d2', list(range(M))),
                        ('d3', list(range(K)))])
    x_i.to_dataset(name='vals').to_netcdf('{}/file_{}.nc'.format(tmp_dir, i))
# open lazily
x = xr.open_mfdataset('{}/*.nc'.format(tmp_dir), parallel=True, combine='nested', concat_dim='d1').vals
# a rolling window along a stacked dimension
x_windows = x.stack(d13=['d1', 'd3']).rolling(d13=window_size).construct('window')
# we'll dot x_windows with y along the window dimension
y = xr.DataArray([1]*window_size, dims='window')
# incremental memory: 1.94 MiB
x_windows.dot(y).compute()
# incremental memory: 20.00 MiB
x_windows.notnull().dot(y).compute()
# incremental memory: 182.13 MiB
x_windows.fillna(0.).dot(y).compute()
# incremental memory: 211.52 MiB
x_windows.weighted(y).mean('window', skipna=True).compute()
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,713834297
https://github.com/pydata/xarray/issues/4482#issuecomment-707238146,https://api.github.com/repos/pydata/xarray/issues/4482,707238146,MDEyOklzc3VlQ29tbWVudDcwNzIzODE0Ng==,2560426,2020-10-12T17:01:54Z,2020-10-12T17:16:07Z,NONE,"Adding on here: even if `fillna` were to create a memory copy, we'd only expect memory usage to double. In my case with dask-based chunking (via `parallel=True` in `open_mfdataset`), however, I'm seeing memory blow up to many times that (10x+) until all available memory is eaten up.
This happens with `x.fillna(0).dot(y)` as well as `x.notnull().dot(y)` and `x.weighted(y).sum(skipna=True)`; `x` is the chunked array. This suggests that dask-based chunking isn't carried through the `fillna` and `notnull` ops, and that the full, un-chunked arrays are being materialized.
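As a quick sanity check on that theory, here's a sketch with a toy array (not my real data) showing how to inspect whether chunking survives these ops; if it does, `.data.chunks` on the result should match the input's chunks.

```
import numpy as np
import xarray as xr

data = np.arange(20.0).reshape(4, 5)
data[0, 0] = np.nan
x = xr.DataArray(data, dims=('t', 's')).chunk({'t': 2})

# both results should still be lazy dask arrays with the input's chunks
print(x.data.chunks)  # ((2, 2), (5,))
print(x.fillna(0).data.chunks)
print(x.notnull().data.chunks)
```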
More evidence in favor: if I do `(x*y).sum(skipna=True)` I get the following error:
```
MemoryError: Unable to allocate [xxx] GiB for an array with shape [un-chunked array shape] and data type float64
```
I'm happy to live with a memory copy for now with `fillna` and `notnull`, but allocating the full, un-chunked array into memory is a showstopper. Is there a different workaround that I can use in the meantime?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,713834297
https://github.com/pydata/xarray/issues/4482#issuecomment-702939943,https://api.github.com/repos/pydata/xarray/issues/4482,702939943,MDEyOklzc3VlQ29tbWVudDcwMjkzOTk0Mw==,2560426,2020-10-02T20:20:53Z,2020-10-02T20:32:32Z,NONE,"Great, looks like I missed that option. Thanks.
For reference, `x.fillna(0).dot(y)` takes 18 seconds in that same example, so a little better.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,713834297