
issue_comments


6 rows where issue = 659142789 (Allow passing args to preprocess function in open_mfdataset), sorted by updated_at descending



id html_url issue_url node_id user created_at updated_at author_association body
662868741 https://github.com/pydata/xarray/issues/4236#issuecomment-662868741 https://api.github.com/repos/pydata/xarray/issues/4236 MDEyOklzc3VlQ29tbWVudDY2Mjg2ODc0MQ== prs247au 8098361 2020-07-23T07:53:23Z 2020-07-23T07:53:23Z NONE

My minimal functools.partial example has some weird behaviour.

```
import xarray as xr
from functools import partial
from pathlib import Path


def preprocessing(doys, ds):
    print(doys)
    ds = ds.sel(time=((ds['time.dayofyear'] >= doys[0])
                      & (ds['time.dayofyear'] < doys[1])))
    return ds


def get_data_set(doys, parallel=True):
    # `files` is looked up as a module-level global at call time
    ds = xr.open_mfdataset(
        files,
        combine='nested',
        concat_dim='time',
        parallel=parallel,
        preprocess=partial(preprocessing, doys)
    )
    return ds


if __name__ == '__main__':
    pth = "/path/to/data"
    day_of_year_range = (100, 140)
    files = list(Path(pth).rglob('*.nc'))
    ds = get_data_set(day_of_year_range, parallel=False)
    print(ds)
```

If I run with `parallel=True` the python kernel crashes, or I get something like

```
File "netCDF4\_netCDF4.pyx", line 2344, in netCDF4._netCDF4.Dataset.__init__
File "netCDF4\_netCDF4.pyx", line 1789, in netCDF4._netCDF4._get_vars
File "netCDF4\_netCDF4.pyx", line 1887, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: Can't open HDF5 attribute
```

If `parallel=False` (same set of input files) everything is OK. Passing a new day of year range works; it's all good.

662806773 https://github.com/pydata/xarray/issues/4236#issuecomment-662806773 https://api.github.com/repos/pydata/xarray/issues/4236 MDEyOklzc3VlQ29tbWVudDY2MjgwNjc3Mw== prs247au 8098361 2020-07-23T03:56:09Z 2020-07-23T03:56:09Z NONE

Thanks for the suggestion of functools.partial. I have (amazingly) never used it before, so it's great to learn new things. If it's a way of 'fixing' existing args to a function that requires more arguments than you want to pass it -- the `sum(x, y)` => `sum2 = partial(sum, 2)` => `sum2(x)` sort of example -- then at first glance isn't this the opposite of what I want to do? i.e. to pass more args to the callback. I suspect I'm approaching this the wrong way though, going from your last paragraph above. I'm just playing with a minimal sample now.

Otherwise, I do agree with you about when args would need to be passed, i.e. individual file processing that can't be done outside. Obviously if you don't need args, don't pass any. While I see now that my use case doesn't need them, there still might be others that do, though this might be rare (later I'll need to add a dimension for each file with a value that varies between files, but luckily I can extract that from the filename). I was imagining additional args working something like the way the schedule module handles Job callbacks:

```
import schedule
schedule.Job.do?
Signature: schedule.Job.do(self, job_func, *args, **kwargs)
Docstring:
Specifies the job_func that should be called every time the job runs.

Any additional arguments are passed on to job_func when the job runs.

:param job_func: The function to be scheduled
:return: The invoked job instance
File:      d:\anaconda3\lib\site-packages\schedule\__init__.py
Type:      function
```

My original intent was cutting down the data I was loading from large files by managing that through the preprocess callback. But this is where I readily admit not knowing how xarray handles things under the covers, which means I do things the wrong (sub-optimal?) way. I'm not the only one struggling with what is optimal though: Unexpected behaviour when chunking with multiple netcdf files in xarray/dask
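For context, a rough sketch of that schedule pattern in use (the `fetch` function and its arguments are made up for illustration):

```
import schedule

def fetch(region, retries=3):
    print(f"fetching {region} (retries={retries})")

# Extra positional/keyword args after the callback are forwarded to it
# on every run -- the pattern being suggested for preprocess above.
schedule.every(10).minutes.do(fetch, "au", retries=5)
```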

661852012 https://github.com/pydata/xarray/issues/4236#issuecomment-661852012 https://api.github.com/repos/pydata/xarray/issues/4236 MDEyOklzc3VlQ29tbWVudDY2MTg1MjAxMg== TomNicholas 35968931 2020-07-21T13:11:00Z 2020-07-21T13:11:00Z MEMBER

> I think the actual solution to my problem is to forget about preprocessing.

I'm glad you've found an alternative way to solve your problem!

> Still, it's a side-step around the arg passing issue.

So, please tell me if you disagree, but I see it like this: the only time that you would need to be able to pass arguments in to preprocess is if you need to perform an operation within preprocess (i.e. not simply before or after open_mfdataset) that requires a different argument for each file, but when that argument cannot be derived from each file individually.

If you need to pass in global arguments to preprocess, you can use functools.partial to define the preprocess function as having those arguments already set, and if you need only knowledge about the file being currently opened, then that's the use case preprocess is intended for. I can see that there might be other cases where you can't do either of the above, but how often do they actually occur?
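A minimal sketch of the functools.partial route being described here, assuming a preprocess function that needs two global bounds (`trim_days`, the bounds, and the glob pattern are illustrative, not from this thread):

```
from functools import partial
import xarray as xr

def trim_days(ds, lo, hi):
    # keep only timestamps whose day of year falls in [lo, hi)
    doy = ds['time.dayofyear']
    return ds.sel(time=(doy >= lo) & (doy < hi))

# lo/hi are baked in up front, so open_mfdataset sees the usual
# one-argument preprocess callable
ds = xr.open_mfdataset(
    'data/*.nc',
    combine='nested',
    concat_dim='time',
    preprocess=partial(trim_days, lo=100, hi=140),
)
```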

660459866 https://github.com/pydata/xarray/issues/4236#issuecomment-660459866 https://api.github.com/repos/pydata/xarray/issues/4236 MDEyOklzc3VlQ29tbWVudDY2MDQ1OTg2Ng== prs247au 8098361 2020-07-18T10:03:58Z 2020-07-18T10:03:58Z NONE

I've cleaned up some code so hopefully it shows my two methods more clearly:

Current method

```
import xarray as xr
import pandas as pd
from datetime import timedelta
from dateutil.parser import parse as dateparse  # assuming dateutil for dateparse

# mmddW and daysW are ipywidgets defined elsewhere; `files` is a list of NetCDF paths

# Set some day of year globals
DOY1 = 1; DOY2 = 31

def select_time(ds):
    # METHOD 1: Derive start/end date from external ipy widget values
    #   Problem: Doesn't work with kwarg parallel=True (pickling error)
    #   Unknown: if the widget values here will actually change when widgets are changed
    year_min, year_max = ds.time.dt.year.min(), ds.time.dt.year.max()
    start_date = pd.Timestamp(dateparse(str(int(year_min)) + mmddW.value))
    end_date = pd.Timestamp(start_date + timedelta(days=daysW.value))
    # Test using fixed values to create start/end dates...this works with pickling
    # start_date = pd.Timestamp(dateparse(str(int(year_min)) + '0101'))
    # end_date = pd.Timestamp(start_date + timedelta(days=30))
    ds = ds.sel(time=slice(start_date, end_date))

    # METHOD 2: Select time range based on day of year, where DOY1, DOY2 are
    #   globals set outside this function. Does pickle, so works with parallel option.
    #   Problem: DOY1, DOY2 don't update here when changed externally after
    #   function declaration
    ds = ds.sel(time=((ds['time.dayofyear'] >= DOY1) & (ds['time.dayofyear'] <= DOY2)))
    return ds

ds = xr.open_mfdataset(
    files,
    chunks={'lat': 50, 'lon': 50},
    combine='nested',
    concat_dim='time',
    preprocess=select_time,
    parallel=True
)
```

I can appreciate that the pickling error for Method 1 is actually because of the reference to the (global) ipy widgets mmddW & daysW. After all, why should it be expected to pickle those? Interesting that it's only a problem for the parallel option though.

I don't fully understand it, but I can also appreciate that Method 2 only references DOY1/DOY2 when the function is declared and seems to be static thereafter, even if DOY1/DOY2 are modified.

Both methods are variations on a theme: I'm trying to use globals in the preprocess function as an alternative to passing extra args. The broader question is whether extra arguments could be a useful feature to have; a purely hypothetical sketch follows.
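Mirroring the schedule.Job.do signature quoted above, such a feature might look like the sketch below. Neither `preprocess_args` nor anything like it exists in xarray; this only illustrates the shape of the request:

```
# HYPOTHETICAL -- not real xarray API; only illustrates the requested feature.
ds = xr.open_mfdataset(
    files,
    combine='nested',
    concat_dim='time',
    preprocess=select_time,
    preprocess_args=(1, 31),  # would be forwarded as select_time(ds, 1, 31)
    parallel=True,
)
```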

Another solution

I think the actual solution to my problem is to forget about preprocessing. Since nothing is loaded at that stage:

```
ds = xr.open_mfdataset(
    files,
    combine='nested',
    concat_dim='time',
    parallel=True
)

ds = ds.sel(time=((ds['time.dayofyear'] >= DOY1) & (ds['time.dayofyear'] <= DOY2)))
ds = ds.chunk({'time': -1, 'lat': 50, 'lon': 50}).persist()
```

Doing everything after the `open_mfdataset` call seems to work more efficiently. This sort of thing is still counter-intuitive to me: loading less from the outset would seem better, but the after-the-fact processing seems to take care of this problem.

Still, it's a side-step around the arg passing issue.

> Before I think about this further - could your problem be solved using functools.partial?

I've never used functools.partial. From my reading it seems it's used to wrap functions and fix certain arguments so you can call the wrapper with fewer args. I don't know how to use it to help my current situation.
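For reference, a minimal illustration of what functools.partial does (the textbook int-with-a-fixed-base case, not anything from this thread):

```
from functools import partial

# int('1010', base=2) parses a binary string; partial pins base=2
parse_binary = partial(int, base=2)

print(parse_binary('1010'))  # 10
print(parse_binary('1111'))  # 15
```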

660181397 https://github.com/pydata/xarray/issues/4236#issuecomment-660181397 https://api.github.com/repos/pydata/xarray/issues/4236 MDEyOklzc3VlQ29tbWVudDY2MDE4MTM5Nw== dcherian 2448579 2020-07-17T15:47:46Z 2020-07-17T15:47:46Z MEMBER

> I'm using other functions like dateparse, or timedelta inside the preprocess function to calculate the dayofyear (which itself is derived from an ipywidget).

`ds.time.dt.dayofyear` should do this for you: https://xarray.pydata.org/en/stable/time-series.html#datetime-components
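A small sketch of that accessor on a toy dataset (the variable name and day-of-year bounds are arbitrary):

```
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range('2020-01-01', periods=365, freq='D')
ds = xr.Dataset({'t2m': ('time', np.arange(365))}, coords={'time': times})

# day of year comes straight off the .dt accessor; no dateparse/timedelta needed
doy = ds.time.dt.dayofyear
print(ds.sel(time=(doy >= 100) & (doy < 140)).time.size)  # 40
```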

660077641 https://github.com/pydata/xarray/issues/4236#issuecomment-660077641 https://api.github.com/repos/pydata/xarray/issues/4236 MDEyOklzc3VlQ29tbWVudDY2MDA3NzY0MQ== TomNicholas 35968931 2020-07-17T12:22:59Z 2020-07-17T12:22:59Z MEMBER

Before I think about this further - could your problem be solved using functools.partial?



CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);