html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/3961#issuecomment-778841149,https://api.github.com/repos/pydata/xarray/issues/3961,778841149,MDEyOklzc3VlQ29tbWVudDc3ODg0MTE0OQ==,2560426,2021-02-14T21:01:21Z,2021-02-14T21:01:21Z,NONE,"> Or alternatively you can try to set sleep between openings.
To clarify, do you mean adding a sleep of e.g. 1 second prior to your `preprocess` function (and setting `preprocess` to just sleep then `return ds` if you're not doing any preprocessing)? Or, are you instead sleeping before the entire `open_mfdataset` call?
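If it's the former, I'm imagining something like this minimal sketch (the 1-second delay is an assumption on my part, and the `open_mfdataset` call is only indicated in a comment):

```python
import time

def preprocess(ds):
    # hypothetical: sleep before each file is handled, then pass the dataset through unchanged
    time.sleep(1)
    return ds

# then: xr.open_mfdataset(paths, preprocess=preprocess)
```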
Is this solution only addressing the issue of opening the same ds multiple times within a python process, or would it also address multiple processes opening the same ds?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,597657663
https://github.com/pydata/xarray/issues/3961#issuecomment-778838527,https://api.github.com/repos/pydata/xarray/issues/3961,778838527,MDEyOklzc3VlQ29tbWVudDc3ODgzODUyNw==,2560426,2021-02-14T20:40:38Z,2021-02-14T20:40:38Z,NONE,"Also seeing this as of version 0.16.1.
In some cases, I need `lock=False`; otherwise I'll run into hung processes a certain percentage of the time. `ds.load()` prior to `to_netcdf()` does not solve the problem.
In other cases, I need `lock=None`; otherwise I'll consistently get `RuntimeError: NetCDF: Not a valid ID`.
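For context, the retry approach I have in mind is roughly this sketch (the helper name and retry parameters are my own; the `open_mfdataset` call is only indicated in a comment):

```python
import time

def with_retry(open_fn, n_tries=5, delay=1.0):
    # hypothetical workaround: retry an open that sporadically fails under lock=False
    last_err = None
    for _ in range(n_tries):
        try:
            return open_fn()
        except RuntimeError as err:
            last_err = err
            time.sleep(delay)
    raise last_err

# usage sketch: ds = with_retry(lambda: xr.open_mfdataset(paths, lock=False))
```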
Is the current recommended solution to set `lock=False` and retry until success? Or, is it to keep `lock=None` and use `zarr` instead? @dcherian ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,597657663
https://github.com/pydata/xarray/issues/4482#issuecomment-713172015,https://api.github.com/repos/pydata/xarray/issues/4482,713172015,MDEyOklzc3VlQ29tbWVudDcxMzE3MjAxNQ==,2560426,2020-10-20T22:17:08Z,2020-10-20T22:21:14Z,NONE,"On the topic of fillna(), I'm seeing an odd unrelated issue that I don't have an explanation for.
I have a dataarray `x` that I'm able to call `x.compute()` on.
When I do `x.fillna(0).compute()`, I get the following error:
```
KeyError: ('where-3a3[...long hex string]', 100, 0, 0, 4)
```
Stack trace shows it's failing on a `get_dependencies(dsk, key, task, as_list)` call from a `cull(dsk, keys)` call in dask/optimization.py. `get_dependencies` itself is defined in dask/core.py.
I have no idea how to reproduce this simply... If it helps narrow things down, `x` is a dask array, one of the dimensions is a datetime64, and all the others are strings. I've tried using both the default engine and `netcdf4` when loading with `open_mfdataset`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,713834297
https://github.com/pydata/xarray/issues/4482#issuecomment-708474940,https://api.github.com/repos/pydata/xarray/issues/4482,708474940,MDEyOklzc3VlQ29tbWVudDcwODQ3NDk0MA==,2560426,2020-10-14T15:21:29Z,2020-10-14T15:21:55Z,NONE,"Adding on: whatever the eventual fix for the memory blow-up is, especially when used with `construct`, it would be useful for it to cover both `fillna(0)` and `notnull()`. One common use-case is taking a weighted mean that normalizes by the sum of the weights corresponding only to non-null entries, as here: https://github.com/pydata/xarray/blob/333e8dba55f0165ccadf18f2aaaee9257a4d716b/xarray/core/weighted.py#L169","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,713834297
https://github.com/pydata/xarray/issues/4482#issuecomment-707331260,https://api.github.com/repos/pydata/xarray/issues/4482,707331260,MDEyOklzc3VlQ29tbWVudDcwNzMzMTI2MA==,2560426,2020-10-12T20:31:26Z,2020-10-12T21:05:24Z,NONE,"See below. I temporarily write some files to netcdf then recombine them lazily using `open_mfdataset`.
The issue seems to present itself more consistently when my `x` is a constructed rolling window, and especially when it's a rolling window over a stacked dimension, as in the example below.
I used the `memory_profiler` package and associated notebook extension (`%%memit` cell magic) to do memory profiling.
```
import numpy as np
import xarray as xr
import os
N = 1000
N_per_file = 10
M = 100
K = 10
window_size = 150
tmp_dir = 'tmp'
os.mkdir(tmp_dir)
# save many netcdf files, later to be concatted into a dask.delayed dataset
for i in range(0, N, N_per_file):
    # 3 dimensions:
    # d1 is the dim we're splitting our files/chunking along
    # d2 is a common dim among all files/chunks
    # d3 is a common dim among all files/chunks, where the first half is 0 and the second half is nan
    x_i = xr.DataArray([[[0]*(K//2) + [np.nan]*(K//2)]*M]*N_per_file,
                       [('d1', [x for x in range(i, i+N_per_file)]),
                        ('d2', [x for x in range(M)]),
                        ('d3', [x for x in range(K)])])
    x_i.to_dataset(name='vals').to_netcdf('{}/file_{}.nc'.format(tmp_dir, i))
# open lazily
x = xr.open_mfdataset('{}/*.nc'.format(tmp_dir), parallel=True, combine='nested', concat_dim='d1').vals
# a rolling window along a stacked dimension
x_windows = x.stack(d13=['d1', 'd3']).rolling(d13=window_size).construct('window')
# we'll dot x_windows with y along the window dimension
y = xr.DataArray([1]*window_size, dims='window')
# incremental memory: 1.94 MiB
x_windows.dot(y).compute()
# incremental memory: 20.00 MiB
x_windows.notnull().dot(y).compute()
# incremental memory: 182.13 MiB
x_windows.fillna(0.).dot(y).compute()
# incremental memory: 211.52 MiB
x_windows.weighted(y).mean('window', skipna=True).compute()
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,713834297
https://github.com/pydata/xarray/issues/4482#issuecomment-707238146,https://api.github.com/repos/pydata/xarray/issues/4482,707238146,MDEyOklzc3VlQ29tbWVudDcwNzIzODE0Ng==,2560426,2020-10-12T17:01:54Z,2020-10-12T17:16:07Z,NONE,"Adding on here: even if `fillna` were to create a memory copy, we'd only expect memory usage to double. However, in my case with dask-based chunking (via `parallel=True` in `open_mfdataset`), I'm seeing memory blow up to many times that (10x+) until all available memory is eaten up.
This happens with `x.fillna(0).dot(y)` as well as `x.notnull().dot(y)` and `x.weighted(y).sum(skipna=True)`; `x` is the chunked array. This suggests that the dask-based chunking isn't propagating into the `fillna` and `notnull` ops, and the entire arrays are being computed un-chunked.
More evidence in favor: if I do `(x*y).sum(skipna=True)` I get the following error:
```
MemoryError: Unable to allocate [xxx] GiB for an array with shape [un-chunked array shape] and data type float64
```
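In case it helps, the only chunk-preserving fill I can think of is dropping down to the dask layer myself (rough sketch; assumes the data underlying `x` is a dask array, and uses a toy array here in its place):

```python
import numpy as np
import dask.array as da

# toy stand-in for the chunked data underlying `x`
arr = da.full((1000, 1000), np.nan, chunks=(100, 100))
# np.nan_to_num replaces NaN with 0 one block at a time, preserving the chunking
filled = arr.map_blocks(np.nan_to_num)
```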
I'm happy to live with a memory copy for now with `fillna` and `notnull`, but allocating the full, un-chunked array into memory is a showstopper. Is there a different workaround that I can use in the meantime?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,713834297
https://github.com/pydata/xarray/issues/4482#issuecomment-702939943,https://api.github.com/repos/pydata/xarray/issues/4482,702939943,MDEyOklzc3VlQ29tbWVudDcwMjkzOTk0Mw==,2560426,2020-10-02T20:20:53Z,2020-10-02T20:32:32Z,NONE,"Great, looks like I missed that option. Thanks.
For reference, `x.fillna(0).dot(y)` takes 18 seconds in that same example, so a little better.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,713834297
https://github.com/pydata/xarray/issues/4474#issuecomment-702346076,https://api.github.com/repos/pydata/xarray/issues/4474,702346076,MDEyOklzc3VlQ29tbWVudDcwMjM0NjA3Ng==,2560426,2020-10-01T19:20:50Z,2020-10-01T19:23:31Z,NONE,"Looks like it's all in here: https://github.com/pydata/xarray/blob/6d8ac11ca0a785a6fe176eeca9b735c321a35527/xarray/core/dask_array_ops.py
And it's used here: https://github.com/pydata/xarray/blob/6d8ac11ca0a785a6fe176eeca9b735c321a35527/xarray/core/rolling.py#L299","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,712052219
https://github.com/pydata/xarray/issues/4474#issuecomment-702331156,https://api.github.com/repos/pydata/xarray/issues/4474,702331156,MDEyOklzc3VlQ29tbWVudDcwMjMzMTE1Ng==,2560426,2020-10-01T18:52:18Z,2020-10-01T18:52:18Z,NONE,"Yes, see http://xarray.pydata.org/en/stable/computation.html#rolling-window-operations.
`rolling` works with dask, but `rolling_exp` does not.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,712052219
https://github.com/pydata/xarray/issues/4475#issuecomment-702307334,https://api.github.com/repos/pydata/xarray/issues/4475,702307334,MDEyOklzc3VlQ29tbWVudDcwMjMwNzMzNA==,2560426,2020-10-01T18:07:55Z,2020-10-01T18:07:55Z,NONE,"Sounds good, I'll do this in the meantime. Still quite interested in `save_mfdataset` dealing with these lower level details, if possible. The ideal case would be loading with `load_mfdataset`, defining some ops lazily, then piping that directly to `save_mfdataset`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,712189206
https://github.com/pydata/xarray/issues/4475#issuecomment-702265883,https://api.github.com/repos/pydata/xarray/issues/4475,702265883,MDEyOklzc3VlQ29tbWVudDcwMjI2NTg4Mw==,2560426,2020-10-01T16:52:59Z,2020-10-01T16:52:59Z,NONE,"Multiple threads (the default), because it's recommended ""for numeric code that releases the GIL (like NumPy, Pandas, Scikit-Learn, Numba, …)"" according to the dask docs.
I guess I could do multi-threaded for the compute part (everything up to the definition of `ds`), then multi-process for the write part, but doesn't that then require me to load everything into memory before writing?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,712189206
https://github.com/pydata/xarray/issues/4474#issuecomment-702181324,https://api.github.com/repos/pydata/xarray/issues/4474,702181324,MDEyOklzc3VlQ29tbWVudDcwMjE4MTMyNA==,2560426,2020-10-01T14:39:01Z,2020-10-01T14:39:01Z,NONE,"Great! This will be a common use-case for me, and I imagine others who are doing any sort of time series computation on large datasets.","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,712052219
https://github.com/pydata/xarray/issues/4475#issuecomment-702178407,https://api.github.com/repos/pydata/xarray/issues/4475,702178407,MDEyOklzc3VlQ29tbWVudDcwMjE3ODQwNw==,2560426,2020-10-01T14:34:28Z,2020-10-01T14:34:28Z,NONE,"Thank you, this works for me. However, it's quite slow and seems to scale faster than linearly as the length of `datasets` increases (the number of groups in the `groupby`).
Could it be connected to https://github.com/pydata/xarray/issues/2912#issuecomment-485497398 where they suggest to use `save_mfdataset` instead of `to_netcdf`? If so, there's a stronger case for supporting delayed objects in `save_mfdataset` as you said.
Appreciate the help!","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,712189206
https://github.com/pydata/xarray/issues/4475#issuecomment-701676076,https://api.github.com/repos/pydata/xarray/issues/4475,701676076,MDEyOklzc3VlQ29tbWVudDcwMTY3NjA3Ng==,2560426,2020-09-30T22:17:24Z,2020-09-30T22:17:24Z,NONE,"Unfortunately that doesn't work:
`TypeError: save_mfdataset only supports writing Dataset objects, received type `","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,712189206