
issue_comments


14 rows where user = 2560426 sorted by updated_at descending


issue 4

  • Allow skipna in .dot() 5
  • Preprocess function for save_mfdataset 4
  • Implement rolling_exp for dask arrays 3
  • Hangs while saving netcdf file opened using xr.open_mfdataset with lock=None 2

user 1

  • heerad · 14 ✖

author_association 1

  • NONE 14
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
778841149 https://github.com/pydata/xarray/issues/3961#issuecomment-778841149 https://api.github.com/repos/pydata/xarray/issues/3961 MDEyOklzc3VlQ29tbWVudDc3ODg0MTE0OQ== heerad 2560426 2021-02-14T21:01:21Z 2021-02-14T21:01:21Z NONE

> Or alternatively you can try to set sleep between openings.

To clarify, do you mean adding a sleep of e.g. 1 second prior to your preprocess function (and setting preprocess to just sleep then return ds if you're not doing any preprocessing)? Or, are you instead sleeping before the entire open_mfdataset call?

Is this solution only addressing the issue of opening the same ds multiple times within a python process, or would it also address multiple processes opening the same ds?
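One reading of the first option is a preprocess callable that only sleeps and then returns the dataset unchanged. A minimal sketch of that reading follows; `make_sleeping_preprocess`, `delay_seconds`, and `inner` are hypothetical names, the one-second delay is just the example value from the question, and whether this avoids the hang is not established here.

```python
import time


def make_sleeping_preprocess(delay_seconds=1.0, inner=None):
    """Build a preprocess callable that pauses before handling each dataset.

    `inner` is an optional real preprocessing function (hypothetical name);
    when it is None, the dataset is returned unchanged after the pause.
    """
    def preprocess(ds):
        time.sleep(delay_seconds)  # pause once per opened file
        return ds if inner is None else inner(ds)

    return preprocess


# Intended use (not executed here; `paths` is a placeholder):
# ds = xr.open_mfdataset(paths, preprocess=make_sleeping_preprocess(1.0))
```

The second option in the question would instead be a single `time.sleep` call placed before `open_mfdataset`, which pauses once in total rather than once per file.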

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Hangs while saving netcdf file opened using xr.open_mfdataset with lock=None 597657663
778838527 https://github.com/pydata/xarray/issues/3961#issuecomment-778838527 https://api.github.com/repos/pydata/xarray/issues/3961 MDEyOklzc3VlQ29tbWVudDc3ODgzODUyNw== heerad 2560426 2021-02-14T20:40:38Z 2021-02-14T20:40:38Z NONE

Also seeing this as of version 0.16.1.

In some cases, I need lock=False otherwise I'll run into hung processes a certain percentage of the time. ds.load() prior to to_netcdf() does not solve the problem.

In other cases, I need lock=None otherwise I'll consistently get RuntimeError: NetCDF: Not a valid ID.

Is the current recommended solution to set lock=False and retry until success? Or, is it to keep lock=None and use zarr instead? @dcherian
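For illustration, "retry until success" could look like the sketch below. `retry_on_runtime_error`, `attempts`, and `delay_seconds` are hypothetical names, and this is not a recommendation from the xarray maintainers. A loop like this can only recover from a raised `RuntimeError`; a hung process never returns control, so that case would need a timeout around the call instead.

```python
import time


def retry_on_runtime_error(action, attempts=3, delay_seconds=0.0):
    """Call `action()` until it succeeds or `attempts` calls have failed.

    Only RuntimeError is retried; any other exception propagates at once.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except RuntimeError as err:
            last_error = err
            time.sleep(delay_seconds)  # brief pause before the next try
    raise last_error


# Intended use (not executed here; `paths` and `out_path` are placeholders):
# retry_on_runtime_error(
#     lambda: xr.open_mfdataset(paths, lock=False).to_netcdf(out_path)
# )
```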

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Hangs while saving netcdf file opened using xr.open_mfdataset with lock=None 597657663
713172015 https://github.com/pydata/xarray/issues/4482#issuecomment-713172015 https://api.github.com/repos/pydata/xarray/issues/4482 MDEyOklzc3VlQ29tbWVudDcxMzE3MjAxNQ== heerad 2560426 2020-10-20T22:17:08Z 2020-10-20T22:21:14Z NONE

On the topic of fillna(), I'm seeing an odd unrelated issue that I don't have an explanation for.

I have a dataarray x that I'm able to call x.compute() on.

When I do x.fillna(0).compute(), I get the following error:

KeyError: ('where-3a3[...long hex string]', 100, 0, 0, 4)

Stack trace shows it's failing on a get_dependencies(dsk, key, task, as_list) call from a cull(dsk, keys) call in dask/optimization.py. get_dependencies itself is defined in dask/core.py.

I have no idea how to reproduce this simply... If it helps narrow things down, x is a dask array, one of the dimensions is a datetime64, and all the others are strings. I've tried using both the default engine and netcdf4 when loading with open_mfdataset.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow skipna in .dot() 713834297
708474940 https://github.com/pydata/xarray/issues/4482#issuecomment-708474940 https://api.github.com/repos/pydata/xarray/issues/4482 MDEyOklzc3VlQ29tbWVudDcwODQ3NDk0MA== heerad 2560426 2020-10-14T15:21:29Z 2020-10-14T15:21:55Z NONE

Adding on: whatever solution ends up avoiding the memory blow-up, especially when used with construct, it would be useful to implement it for both fillna(0) and notnull(). One common use case is taking a weighted mean that normalizes by the sum of the weights corresponding only to non-null entries, as here: https://github.com/pydata/xarray/blob/333e8dba55f0165ccadf18f2aaaee9257a4d716b/xarray/core/weighted.py#L169
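The arithmetic behind that use case can be sketched in plain Python. This is not xarray's implementation and `weighted_mean_skipna` is a hypothetical name. The numerator corresponds to `fillna(0)` dotted with the weights, and the denominator to `notnull()` dotted with the weights.

```python
import math


def weighted_mean_skipna(values, weights):
    """Weighted mean that ignores NaN entries.

    The denominator sums only the weights whose value is non-null,
    so missing entries do not drag the mean toward zero.
    """
    weighted_sum = 0.0   # plays the role of fillna(0).dot(weights)
    weight_total = 0.0   # plays the role of notnull().dot(weights)
    for value, weight in zip(values, weights):
        if math.isnan(value):
            continue
        weighted_sum += value * weight
        weight_total += weight
    return weighted_sum / weight_total


# Values 1 and 3 with weights 1 and 2; the NaN entry and its weight are skipped.
mean = weighted_mean_skipna([1.0, float("nan"), 3.0], [1.0, 1.0, 2.0])
```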

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow skipna in .dot() 713834297
707331260 https://github.com/pydata/xarray/issues/4482#issuecomment-707331260 https://api.github.com/repos/pydata/xarray/issues/4482 MDEyOklzc3VlQ29tbWVudDcwNzMzMTI2MA== heerad 2560426 2020-10-12T20:31:26Z 2020-10-12T21:05:24Z NONE

See below. I temporarily write some files to netcdf then recombine them lazily using open_mfdataset.

The issue seems to present itself more consistently when my x is a constructed rolling window, and especially when it's a rolling window over a stacked dimension, as below.

I used the memory_profiler package and associated notebook extension (%%memit cell magic) to do memory profiling.

```
import numpy as np
import xarray as xr
import os

N = 1000
N_per_file = 10
M = 100
K = 10
window_size = 150

tmp_dir = 'tmp'
os.mkdir(tmp_dir)

# save many netcdf files, later to be concatted into a dask.delayed dataset
for i in range(0, N, N_per_file):

    # 3 dimensions:
    # d1 is the dim we're splitting our files/chunking along
    # d2 is a common dim among all files/chunks
    # d3 is a common dim among all files/chunks, where the first half is 0 and the second half is nan
    x_i = xr.DataArray([[[0]*(K//2) + [np.nan]*(K//2)]*M]*N_per_file,
        [('d1', [x for x in range(i, i+N_per_file)]),
         ('d2', [x for x in range(M)]),
         ('d3', [x for x in range(K)])])

    x_i.to_dataset(name='vals').to_netcdf('{}/file_{}.nc'.format(tmp_dir,i))

# open lazily
x = xr.open_mfdataset('{}/*.nc'.format(tmp_dir), parallel=True, concat_dim='d1').vals

# a rolling window along a stacked dimension
x_windows = x.stack(d13=['d1', 'd3']).rolling(d13=window_size).construct('window')

# we'll dot x_windows with y along the window dimension
y = xr.DataArray([1]*window_size, dims='window')

# incremental memory: 1.94 MiB
x_windows.dot(y).compute()

# incremental memory: 20.00 MiB
x_windows.notnull().dot(y).compute()

# incremental memory: 182.13 MiB
x_windows.fillna(0.).dot(y).compute()

# incremental memory: 211.52 MiB
x_windows.weighted(y).mean('window', skipna=True).compute()
```
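For scale, here is a rough estimate of how large the constructed window array would be if it were ever fully materialized as float64, using the parameters from the example. `dense_nbytes` is a hypothetical helper. The estimate assumes a dense float64 array with no views or sharing, so it is an upper bound rather than a measurement.

```python
def dense_nbytes(shape, itemsize=8):
    """Bytes needed for a dense array of `shape` (itemsize=8 for float64)."""
    count = 1
    for length in shape:
        count *= length
    return count * itemsize


# Parameters from the example: d1 has N entries, d2 has M, d3 has K.
N, M, K, window_size = 1000, 100, 10, 150

# Stacking d1 and d3 gives a dimension of length N * K; construct('window')
# then adds a trailing dimension of length window_size.
window_shape = (M, N * K, window_size)

nbytes = dense_nbytes(window_shape)
gib = nbytes / 2**30
print(f"{nbytes} bytes, about {gib:.2f} GiB")
```

The input data itself is only N * M * K float64 values (8,000,000 bytes), so the windowed form is 150 times larger if it is not kept lazy.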

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow skipna in .dot() 713834297
707238146 https://github.com/pydata/xarray/issues/4482#issuecomment-707238146 https://api.github.com/repos/pydata/xarray/issues/4482 MDEyOklzc3VlQ29tbWVudDcwNzIzODE0Ng== heerad 2560426 2020-10-12T17:01:54Z 2020-10-12T17:16:07Z NONE

Adding on here, even if fillna were to create a memory copy, we'd only expect memory usage to double. However, in my case with dask-based chunking (via parallel=True in open_mfdataset) I'm seeing memory usage blow up to many times that (10x or more) until all available memory is eaten up.

This is happening with x.fillna(0).dot(y) as well as x.notnull().dot(y) and x.weighted(y).sum(skipna=True). x is the array that's chunked. This suggests that dask-based chunking isn't following through into the fillna and notnull ops, and the entire non-chunked arrays are being computed.

More evidence in favor: if I do (x*y).sum(skipna=True) I get the following error:

MemoryError: Unable to allocate [xxx] GiB for an array with shape [un-chunked array shape] and data type float64

I'm happy to live with a memory copy for now with fillna and notnull, but allocating the full, un-chunked array into memory is a showstopper. Is there a different workaround that I can use in the meantime?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow skipna in .dot() 713834297
702939943 https://github.com/pydata/xarray/issues/4482#issuecomment-702939943 https://api.github.com/repos/pydata/xarray/issues/4482 MDEyOklzc3VlQ29tbWVudDcwMjkzOTk0Mw== heerad 2560426 2020-10-02T20:20:53Z 2020-10-02T20:32:32Z NONE

Great, looks like I missed that option. Thanks.

For reference, x.fillna(0).dot(y) takes 18 seconds in that same example, so a little better.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow skipna in .dot() 713834297
702346076 https://github.com/pydata/xarray/issues/4474#issuecomment-702346076 https://api.github.com/repos/pydata/xarray/issues/4474 MDEyOklzc3VlQ29tbWVudDcwMjM0NjA3Ng== heerad 2560426 2020-10-01T19:20:50Z 2020-10-01T19:23:31Z NONE

Looks like it's all in here: https://github.com/pydata/xarray/blob/6d8ac11ca0a785a6fe176eeca9b735c321a35527/xarray/core/dask_array_ops.py

And it's used here: https://github.com/pydata/xarray/blob/6d8ac11ca0a785a6fe176eeca9b735c321a35527/xarray/core/rolling.py#L299

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Implement rolling_exp for dask arrays 712052219
702331156 https://github.com/pydata/xarray/issues/4474#issuecomment-702331156 https://api.github.com/repos/pydata/xarray/issues/4474 MDEyOklzc3VlQ29tbWVudDcwMjMzMTE1Ng== heerad 2560426 2020-10-01T18:52:18Z 2020-10-01T18:52:18Z NONE

Yes, see http://xarray.pydata.org/en/stable/computation.html#rolling-window-operations.

rolling works with dask, but rolling_exp does not.
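For context, an exponentially weighted moving mean carries running state from each element to the next, unlike a fixed-size window. The sketch below shows the "adjusted" form of that recurrence in plain Python. It is not xarray's `rolling_exp` implementation, and `exp_weighted_mean` and `alpha` are names chosen here for illustration.

```python
def exp_weighted_mean(values, alpha):
    """Exponentially weighted moving mean ("adjusted" form).

    Element t is sum((1 - alpha)**i * x[t - i]) / sum((1 - alpha)**i)
    over i = 0..t, computed with two running accumulators.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    decay = 1.0 - alpha
    numerator = 0.0
    denominator = 0.0
    result = []
    for value in values:
        # Each step depends on the accumulators from the previous step.
        numerator = value + decay * numerator
        denominator = 1.0 + decay * denominator
        result.append(numerator / denominator)
    return result


smoothed = exp_weighted_mean([1.0, 2.0, 3.0], alpha=0.5)
```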

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Implement rolling_exp for dask arrays 712052219
702307334 https://github.com/pydata/xarray/issues/4475#issuecomment-702307334 https://api.github.com/repos/pydata/xarray/issues/4475 MDEyOklzc3VlQ29tbWVudDcwMjMwNzMzNA== heerad 2560426 2020-10-01T18:07:55Z 2020-10-01T18:07:55Z NONE

Sounds good, I'll do this in the meantime. Still quite interested in save_mfdataset dealing with these lower-level details, if possible. The ideal case would be loading with open_mfdataset, defining some ops lazily, then piping that directly to save_mfdataset.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preprocess function for save_mfdataset 712189206
702265883 https://github.com/pydata/xarray/issues/4475#issuecomment-702265883 https://api.github.com/repos/pydata/xarray/issues/4475 MDEyOklzc3VlQ29tbWVudDcwMjI2NTg4Mw== heerad 2560426 2020-10-01T16:52:59Z 2020-10-01T16:52:59Z NONE

Multiple threads (the default), because it's recommended "for numeric code that releases the GIL (like NumPy, Pandas, Scikit-Learn, Numba, …)" according to the dask docs.

I guess I could do multi-threaded for the compute part (everything up to the definition of ds), then multi-process for the write part, but doesn't that then require me to load everything into memory before writing?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preprocess function for save_mfdataset 712189206
702181324 https://github.com/pydata/xarray/issues/4474#issuecomment-702181324 https://api.github.com/repos/pydata/xarray/issues/4474 MDEyOklzc3VlQ29tbWVudDcwMjE4MTMyNA== heerad 2560426 2020-10-01T14:39:01Z 2020-10-01T14:39:01Z NONE

Great! This will be a common use-case for me, and I imagine others who are doing any sort of time series computation on large datasets.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Implement rolling_exp for dask arrays 712052219
702178407 https://github.com/pydata/xarray/issues/4475#issuecomment-702178407 https://api.github.com/repos/pydata/xarray/issues/4475 MDEyOklzc3VlQ29tbWVudDcwMjE3ODQwNw== heerad 2560426 2020-10-01T14:34:28Z 2020-10-01T14:34:28Z NONE

Thank you, this works for me. However, it's quite slow and seems to scale faster than linearly as the length of datasets increases (the number of groups in the groupby).

Could it be connected to https://github.com/pydata/xarray/issues/2912#issuecomment-485497398 where they suggest to use save_mfdataset instead of to_netcdf? If so, there's a stronger case for supporting delayed objects in save_mfdataset as you said.

Appreciate the help!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preprocess function for save_mfdataset 712189206
701676076 https://github.com/pydata/xarray/issues/4475#issuecomment-701676076 https://api.github.com/repos/pydata/xarray/issues/4475 MDEyOklzc3VlQ29tbWVudDcwMTY3NjA3Ng== heerad 2560426 2020-09-30T22:17:24Z 2020-09-30T22:17:24Z NONE

Unfortunately that doesn't work:

TypeError: save_mfdataset only supports writing Dataset objects, received type <class 'dask.delayed.Delayed'>

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Preprocess function for save_mfdataset 712189206


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);