issue_comments


3 rows where issue = 659142789 and user = 8098361 sorted by updated_at descending


id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
662868741 https://github.com/pydata/xarray/issues/4236#issuecomment-662868741 https://api.github.com/repos/pydata/xarray/issues/4236 MDEyOklzc3VlQ29tbWVudDY2Mjg2ODc0MQ== prs247au 8098361 2020-07-23T07:53:23Z 2020-07-23T07:53:23Z NONE

My minimal `functools.partial` example has some weird behaviour.

```
import xarray as xr
from functools import partial
from pathlib import Path


def preprocessing(doys, ds):
    # Keep only the time steps whose day of year falls in [doys[0], doys[1]).
    print(doys)
    ds = ds.sel(time=((ds['time.dayofyear'] >= doys[0])
                      & (ds['time.dayofyear'] < doys[1])))
    return ds


def get_data_set(doys, parallel=True):
    # `files` is the module-level list built in the __main__ block below.
    ds = xr.open_mfdataset(
        files,
        combine='nested',
        concat_dim='time',
        parallel=parallel,
        preprocess=partial(preprocessing, doys)
    )
    return ds


if __name__ == '__main__':
    pth = "/path/to/data"
    day_of_year_range = (100, 140)
    files = list(Path(pth).rglob('*.nc'))
    ds = get_data_set(day_of_year_range, parallel=False)
    print(ds)
```

If I run with `parallel=True` the Python kernel crashes, or I get something like

```
File "netCDF4\_netCDF4.pyx", line 2344, in netCDF4._netCDF4.Dataset.__init__
File "netCDF4\_netCDF4.pyx", line 1789, in netCDF4._netCDF4._get_vars
File "netCDF4\_netCDF4.pyx", line 1887, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: Can't open HDF5 attribute
```

If `parallel=False` (same set of input files) everything is OK. Passing a new day of year range works, it's all good.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow passing args to preprocess function in open_mfdataset 659142789
662806773 https://github.com/pydata/xarray/issues/4236#issuecomment-662806773 https://api.github.com/repos/pydata/xarray/issues/4236 MDEyOklzc3VlQ29tbWVudDY2MjgwNjc3Mw== prs247au 8098361 2020-07-23T03:56:09Z 2020-07-23T03:56:09Z NONE

Thanks for the suggestion of `functools.partial`. I have (amazingly) never used it before, so it's great to learn new things. If it's a way of 'fixing' existing args to a function that requires more arguments than you want to pass it (the `sum(x, y)` => `sum2 = partial(sum, y=2)` => `sum2(x)` sort of example), then at first glance isn't this the opposite of what I want to do, i.e. to pass more args to the callback? I suspect I'm approaching this the wrong way though, going from your last paragraph above. I'm just playing with a minimal sample now.
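A minimal sketch of how `functools.partial` can work in the direction asked about here: the extra argument is bound up front, and the result still looks like a one-argument callback to whoever calls it. The `run_callback` helper and the plain dictionaries standing in for datasets are hypothetical; they only mimic a caller that invokes `preprocess(ds)` with a single argument, and nothing here uses xarray.

```python
from functools import partial


def preprocessing(doys, ds):
    """Keep only the entries whose day of year falls in [doys[0], doys[1])."""
    start, stop = doys
    return {day: value for day, value in ds.items() if start <= day < stop}


def run_callback(callback, datasets):
    """Hypothetical stand-in for a caller that passes exactly one argument."""
    return [callback(ds) for ds in datasets]


# Toy "datasets": day of year -> value (illustrative data only).
datasets = [{90: "a", 100: "b", 139: "c"}, {140: "d", 120: "e"}]

# Bind the extra argument now; the resulting callable takes only `ds`.
select_days = partial(preprocessing, (100, 140))

result = run_callback(select_days, datasets)
print(result)
```

So the callback effectively receives more information than the caller passes in, because the extra argument travels inside the `partial` object.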

Otherwise, I do agree with you about when args would need to be passed, i.e. individual file processing that can't be done outside. Obviously if you don't need args, don't pass any. While I see now that my use case doesn't need that, there still might be others that do, though this might be rare (later I'll need to add a dimension for each file with a value that varies between files, but luckily I can extract that from the filename). I was imagining additional args working something like the way the `schedule` module handles `Job` callbacks.

```
import schedule
schedule.Job.do?

Signature: schedule.Job.do(self, job_func, *args, **kwargs)
Docstring:
Specifies the job_func that should be called every time the job runs.

Any additional arguments are passed on to job_func when the job runs.

:param job_func: The function to be scheduled
:return: The invoked job instance
File: d:\anaconda3\lib\site-packages\schedule\__init__.py
Type: function
```

My original intent was to cut down the data I was loading from large files by managing that through the preprocess callback. But this is where I readily admit to not knowing how xarray handles things under the covers, which means I do things the wrong (sub-optimal?) way. I'm not the only one struggling with what is optimal though; see "Unexpected behaviour when chunking with multiple netcdf files in xarray/dask".
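The `schedule`-style forwarding described above can be sketched in a few lines. This is only an illustration of the pattern, assuming a hypothetical `apply_preprocess` helper; it is not how `open_mfdataset` is implemented, and the callback and data are made up.

```python
def apply_preprocess(items, preprocess, *args, **kwargs):
    """Hypothetical helper: call preprocess(item, *args, **kwargs) per item."""
    return [preprocess(item, *args, **kwargs) for item in items]


def tag_with_source(values, source, scale=1):
    """Example callback that needs more than the item itself."""
    return {"source": source, "values": [v * scale for v in values]}


# Extra positional and keyword arguments are forwarded to the callback.
tagged = apply_preprocess([[1, 2], [3]], tag_with_source, "run-A", scale=10)
print(tagged)
```

The trade-off with this design is that the helper's own keyword arguments and the callback's keyword arguments share one namespace, which is one reason a library might prefer callers to bind arguments themselves.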

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow passing args to preprocess function in open_mfdataset 659142789
660459866 https://github.com/pydata/xarray/issues/4236#issuecomment-660459866 https://api.github.com/repos/pydata/xarray/issues/4236 MDEyOklzc3VlQ29tbWVudDY2MDQ1OTg2Ng== prs247au 8098361 2020-07-18T10:03:58Z 2020-07-18T10:03:58Z NONE

I've cleaned up some code so hopefully it shows my two methods more clearly;

Current method

```
# Set some day of year globals
DOY1 = 1; DOY2 = 31

def select_time(ds):
    # METHOD 1: Derive start/end date from external ipy widget values
    #   Problem: Doesn't work with kwarg parallel=True (pickling error)
    #   Unknown: if the widget values here will actually change when widgets are changed
    year_min, year_max = ds.time.dt.year.min(), ds.time.dt.year.max()
    start_date = pd.Timestamp(dateparse(str(int(year_min)) + mmddW.value))
    end_date = pd.Timestamp(start_date + timedelta(days=daysW.value))
    # Test using fixed values to create start/end dates...this works with pickling
    # start_date = pd.Timestamp(dateparse(str(int(year_min)) + '0101'))
    # end_date = pd.Timestamp(start_date + timedelta(days=30))
    ds = ds.sel(time=slice(start_date, end_date))

    # METHOD 2: Select time range based on day of year, where DOY1,DOY2 are
    # globals set outside this function. Does pickle, so works with parallel option.
    #   Problem: DOY1, DOY2 don't update here when changed externally after
    #   function declaration
    ds = ds.sel(time=((ds['time.dayofyear']>=DOY1) & (ds['time.dayofyear']<=DOY2)))
    return ds

ds = xr.open_mfdataset(
    files,
    chunks={'lat': 50, 'lon': 50},
    combine='nested',
    concat_dim='time',
    preprocess=select_time,
    parallel=True
)
```

I can appreciate the pickling error for Method 1 is actually because of the reference to the (global) ipy widgets `mmddW` & `daysW`. After all, why should it be expected to pickle those? Interesting that it's only a problem for the parallel option though.

I don't fully understand, but can also appreciate, Method 2 only references DOY1/2 when they're declared and seems to be static thereafter even if DOY1/2 are modified.

Both methods are variations on a theme: I'm trying to use globals in the preprocess function as an alternative to passing extra args. The broader question is whether extra arguments could be a useful feature to have.
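For comparison, here is a small sketch of how the two approaches differ in plain Python within a single process: a global read inside a function is looked up each time the function is called, whereas a value bound with `functools.partial` is fixed when the `partial` object is created. The function names are hypothetical, and this says nothing about what happens once `parallel=True` serializes the callback and runs it elsewhere.

```python
from functools import partial

DOY1 = 1  # module-level global, as in Method 2


def first_day_from_global():
    # Looks up DOY1 in the module namespace every time it is called.
    return DOY1


def first_day_from_arg(doy):
    # Receives the value explicitly instead of reading a global.
    return doy


# partial stores the value DOY1 has right now (1).
frozen = partial(first_day_from_arg, DOY1)

DOY1 = 50  # rebind the global afterwards

seen_by_global_lookup = first_day_from_global()
seen_by_partial = frozen()
print(seen_by_global_lookup, seen_by_partial)
```

With explicit binding, changing the range means building a new `partial` for each call rather than mutating shared state.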

Another solution

I think the actual solution to my problem is to forget about preprocessing, since nothing is loaded at that stage:

```
ds = xr.open_mfdataset(
    files, combine='nested', concat_dim='time', parallel=True
)

ds = ds.sel(time=((ds['time.dayofyear']>=DOY1) & (ds['time.dayofyear']<=DOY2)))
ds = ds.chunk({'time': -1, 'lat': 50, 'lon': 50}).persist()
```

Doing everything after the `open_mfdataset` call seems to work more efficiently. This sort of thing is still counterintuitive to me. Loading less would seem better from the outset, but the after-the-fact processing seems to take care of this problem.

Still, it's a side-step around the arg passing issue.

> Before I think about this further - could your problem be solved using functools.partial?

I've never used `functools.partial`. From my reading, it seems this is used to wrap functions and fix certain arguments so you can call the wrapper with fewer args. I don't know how to use it to help my current situation.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow passing args to preprocess function in open_mfdataset 659142789


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);