html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/4482#issuecomment-713172015,https://api.github.com/repos/pydata/xarray/issues/4482,713172015,MDEyOklzc3VlQ29tbWVudDcxMzE3MjAxNQ==,2560426,2020-10-20T22:17:08Z,2020-10-20T22:21:14Z,NONE,"On the topic of fillna(), I'm seeing an odd, seemingly unrelated issue that I don't have an explanation for. I have a DataArray `x` that I'm able to call `x.compute()` on. When I do `x.fillna(0).compute()`, I get the following error:

```
KeyError: ('where-3a3[...long hex string]', 100, 0, 0, 4)
```

The stack trace shows it failing in a `get_dependencies(dsk, key, task, as_list)` call made from a `cull(dsk, keys)` call in dask/optimization.py; `get_dependencies` itself is defined in dask/core.py. I have no idea how to reproduce this simply... If it helps narrow things down, `x` is a dask array, one of its dimensions is a datetime64, and all the others are strings. I've tried both the default engine and `netcdf4` when loading with `open_mfdataset`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,713834297
https://github.com/pydata/xarray/issues/4482#issuecomment-708474940,https://api.github.com/repos/pydata/xarray/issues/4482,708474940,MDEyOklzc3VlQ29tbWVudDcwODQ3NDk0MA==,2560426,2020-10-14T15:21:29Z,2020-10-14T15:21:55Z,NONE,"Adding on: whatever the solution that avoids blowing up memory turns out to be, especially when used with `construct`, it would be useful to have it implemented for both `fillna(0)` and `notnull()`.
One common use case is taking a weighted mean that normalizes by the sum of the weights corresponding only to non-null entries, as here: https://github.com/pydata/xarray/blob/333e8dba55f0165ccadf18f2aaaee9257a4d716b/xarray/core/weighted.py#L169","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,713834297
https://github.com/pydata/xarray/issues/4482#issuecomment-707331260,https://api.github.com/repos/pydata/xarray/issues/4482,707331260,MDEyOklzc3VlQ29tbWVudDcwNzMzMTI2MA==,2560426,2020-10-12T20:31:26Z,2020-10-12T21:05:24Z,NONE,"See below. I temporarily write some files to netcdf, then recombine them lazily using `open_mfdataset`. The issue seems to present itself more consistently when my `x` is a constructed rolling window, and especially when it's a rolling window over a stacked dimension, as below. I used the `memory_profiler` package and its associated notebook extension (the `%%memit` cell magic) to do the memory profiling.
```
import numpy as np
import xarray as xr
import os

N = 1000
N_per_file = 10
M = 100
K = 10
window_size = 150
tmp_dir = 'tmp'
os.mkdir(tmp_dir)

# save many netcdf files, later to be concatted into a dask.delayed dataset
for i in range(0, N, N_per_file):
    # 3 dimensions:
    # d1 is the dim we're splitting our files/chunking along
    # d2 is a common dim among all files/chunks
    # d3 is a common dim among all files/chunks, where the first half is 0 and the second half is nan
    x_i = xr.DataArray(
        [[[0]*(K//2) + [np.nan]*(K//2)]*M]*N_per_file,
        [('d1', [x for x in range(i, i+N_per_file)]),
         ('d2', [x for x in range(M)]),
         ('d3', [x for x in range(K)])])
    x_i.to_dataset(name='vals').to_netcdf('{}/file_{}.nc'.format(tmp_dir, i))

# open lazily
x = xr.open_mfdataset('{}/*.nc'.format(tmp_dir), parallel=True, concat_dim='d1').vals

# a rolling window along a stacked dimension
x_windows = x.stack(d13=['d1', 'd3']).rolling(d13=window_size).construct('window')

# we'll dot x_windows with y along the window dimension
y = xr.DataArray([1]*window_size, dims='window')

# incremental memory: 1.94 MiB
x_windows.dot(y).compute()

# incremental memory: 20.00 MiB
x_windows.notnull().dot(y).compute()

# incremental memory: 182.13 MiB
x_windows.fillna(0.).dot(y).compute()

# incremental memory: 211.52 MiB
x_windows.weighted(y).mean('window', skipna=True).compute()
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,713834297
https://github.com/pydata/xarray/issues/4482#issuecomment-707238146,https://api.github.com/repos/pydata/xarray/issues/4482,707238146,MDEyOklzc3VlQ29tbWVudDcwNzIzODE0Ng==,2560426,2020-10-12T17:01:54Z,2020-10-12T17:16:07Z,NONE,"Adding on here: even if `fillna` were to create a memory copy, we'd only expect memory usage to double.
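For concreteness, here is what those ops boil down to, as a plain-NumPy sketch with made-up toy values (no xarray or dask involved, just pinning down the arithmetic being requested):

```
import numpy as np

# x: data containing NaNs; y: weights along the last axis (toy values)
x = np.array([[1.0, 2.0, np.nan],
              [np.nan, 4.0, 5.0]])
y = np.array([0.5, 1.0, 2.0])

# fillna(0).dot(y): NaNs contribute 0 to the dot product
num = np.where(np.isnan(x), 0.0, x) @ y

# notnull().dot(y): sum of the weights over the non-null entries only
den = (~np.isnan(x)) @ y

# weighted(y).mean(skipna=True) normalizes by den, not by y.sum()
wmean = num / den
```

Both intermediates are needed, which is why the weighted mean hits the `fillna(0)` and the `notnull()` code paths at once.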
However, in my case with dask-based chunking (via `parallel=True` in `open_mfdataset`), I'm seeing memory blow up to many times that (10x+) until all available memory is eaten up. This is happening with `x.fillna(0).dot(y)` as well as with `x.notnull().dot(y)` and `x.weighted(y).sum(skipna=True)`, where `x` is the chunked array. This suggests that the chunking isn't being carried through the `fillna` and `notnull` ops, and that the entire un-chunked arrays are being computed. More evidence in favor: if I do `(x*y).sum(skipna=True)` I get the following error:

```
MemoryError: Unable to allocate [xxx] GiB for an array with shape [un-chunked array shape] and data type float64
```

I'm happy to live with a memory copy for now for `fillna` and `notnull`, but allocating the full, un-chunked array in memory is a showstopper. Is there a different workaround I can use in the meantime?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,713834297
https://github.com/pydata/xarray/issues/4482#issuecomment-702939943,https://api.github.com/repos/pydata/xarray/issues/4482,702939943,MDEyOklzc3VlQ29tbWVudDcwMjkzOTk0Mw==,2560426,2020-10-02T20:20:53Z,2020-10-02T20:32:32Z,NONE,"Great, looks like I missed that option. Thanks. For reference, `x.fillna(0).dot(y)` takes 18 seconds in that same example, so a little better.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,713834297