html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/4236#issuecomment-662868741,https://api.github.com/repos/pydata/xarray/issues/4236,662868741,MDEyOklzc3VlQ29tbWVudDY2Mjg2ODc0MQ==,8098361,2020-07-23T07:53:23Z,2020-07-23T07:53:23Z,NONE,"My minimal `functools.partial` example has some weird behaviour:
```
import xarray as xr
from functools import partial
from pathlib import Path

def preprocessing(doys, ds):
#    print(doys)
    ds = ds.sel(time=((ds['time.dayofyear'] >= doys[0])
                & (ds['time.dayofyear'] < doys[1])))
    return ds

def get_data_set(doys, parallel=True):
    ds = xr.open_mfdataset(
        files,
        combine='nested',
        concat_dim='time',
        parallel=parallel,
        preprocess=partial(preprocessing, doys)  # bind doys; xarray then calls preprocess(ds)
    )
    return ds

if __name__ == '__main__':
    pth = ""/path/to/data""
    day_of_year_range = (100, 140)
    files = list(Path(pth).rglob('*.nc'))
    ds = get_data_set(day_of_year_range, parallel=False)
    print(ds)
```
If I run with `parallel=True`, the Python kernel crashes, or I get something like:
```
  File ""netCDF4\_netCDF4.pyx"", line 2344, in netCDF4._netCDF4.Dataset.__init__
  File ""netCDF4\_netCDF4.pyx"", line 1789, in netCDF4._netCDF4._get_vars
  File ""netCDF4\_netCDF4.pyx"", line 1887, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: Can't open HDF5 attribute
```
With `parallel=False` (same set of input files) everything is OK; passing a new day-of-year range works as expected.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,659142789
https://github.com/pydata/xarray/issues/4236#issuecomment-662806773,https://api.github.com/repos/pydata/xarray/issues/4236,662806773,MDEyOklzc3VlQ29tbWVudDY2MjgwNjc3Mw==,8098361,2020-07-23T03:56:09Z,2020-07-23T03:56:09Z,NONE,"Thanks for the suggestion of `functools.partial`. I have (amazingly) never used it before, so it's great to learn new things. If it's a way of 'fixing' existing args to a function that requires more arguments than you want to pass it -- the `add(x, y)` => `add2 = partial(add, y=2)` => `add2(x)` sort of example -- then at first glance isn't this the opposite of what I want to do, i.e. to pass _more_ args to the callback? I suspect I'm approaching this the wrong way, though, going by your last paragraph above. I'm just playing with a minimal sample now.
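Here's my (possibly naive) reading of it, as a minimal sketch -- `add` is just a made-up example function:
```
from functools import partial

def add(x, y):
    return x + y

add2 = partial(add, y=2)  # 'fix' y=2 up front
print(add2(3))            # 5

# The same trick seems to work for callbacks: bind the extra args first,
# so the result matches the one-argument signature the caller expects.
def preprocessing(doys, ds):
    return ds  # placeholder body

preprocess = partial(preprocessing, (100, 140))  # preprocess(ds) now works
```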

Otherwise, I do agree with you about when args would need to be passed, i.e. for individual file processing that can't be done outside. Obviously, if you don't need args, don't pass any. While I see now that my use case doesn't need them, there still might be others that do, though this might be rare (later I'll need to add a dimension to each file with a value that varies between files, but luckily I can extract that from the filename). I was imagining additional args working something like the way the `schedule` module handles `Job` callbacks (see the sketch after the docstring below):
```
import schedule
schedule.Job.do?
Signature: schedule.Job.do(self, job_func, *args, **kwargs)
Docstring:
Specifies the job_func that should be called every time the
job runs.

Any additional arguments are passed on to job_func when
the job runs.

:param job_func: The function to be scheduled
:return: The invoked job instance
File:      d:\anaconda3\lib\site-packages\schedule\__init__.py
Type:      function
```
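In other words, something like this hypothetical shape (to be clear, this is **not** xarray's real API, just a sketch of what I was imagining; `open_one` and `combine` are trivial placeholders):
```
def open_one(path):
    return {'path': path}   # placeholder for opening a single file

def combine(datasets):
    return datasets          # placeholder for the combine step

def open_mfdataset_sketch(paths, preprocess=None, preprocess_args=(),
                          preprocess_kwargs=None):
    # Hypothetical signature only: forward extra args to the callback per file.
    datasets = [open_one(p) for p in paths]
    if preprocess is not None:
        kw = preprocess_kwargs or {}
        datasets = [preprocess(ds, *preprocess_args, **kw) for ds in datasets]
    return combine(datasets)
```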
My original intent was to cut down the data I was loading from large files by managing that through the `preprocess` callback. But here I readily admit to not knowing how xarray handles things under the covers, which means I do things the wrong (sub-optimal?) way. I'm not the only one struggling with what is optimal, though:
[Unexpected behaviour when chunking with multiple netcdf files in xarray/dask](https://stackoverflow.com/questions/62932044/unexpected-behaviour-when-chunking-with-multiple-netcdf-files-in-xarray-dask)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,659142789
https://github.com/pydata/xarray/issues/4236#issuecomment-660459866,https://api.github.com/repos/pydata/xarray/issues/4236,660459866,MDEyOklzc3VlQ29tbWVudDY2MDQ1OTg2Ng==,8098361,2020-07-18T10:03:58Z,2020-07-18T10:03:58Z,NONE,"I've cleaned up some code so that it hopefully shows my two methods more clearly:
### Current method
```
import pandas as pd
import xarray as xr
from datetime import timedelta
from dateutil.parser import parse as dateparse  # assuming dateutil's parser here

# mmddW and daysW are ipywidgets defined elsewhere in the notebook.
# Set some day-of-year globals
DOY1 = 1
DOY2 = 31

def select_time(ds):
    # METHOD 1: Derive start/end date from external ipy widget values
    #   Problem: Doesn't work with kwarg parallel=True (pickling error)
    #   Unknown: if the widget values here will actually change when widgets are changed
    year_min, year_max = ds.time.dt.year.min(), ds.time.dt.year.max()
    start_date = pd.Timestamp(dateparse(str(int(year_min)) + mmddW.value))
    end_date = pd.Timestamp(start_date + timedelta(days=daysW.value))
    # Test using fixed values to create start/end dates...this works with pickling
    #    start_date = pd.Timestamp(dateparse(str(int(year_min)) + '0101'))
    #    end_date = pd.Timestamp(start_date + timedelta(days=30))
    ds = ds.sel(time=slice(start_date, end_date))
    
    # METHOD 2: Select time range based on day of year, where DOY1,DOY2 are
    # globals set outside this function. Does pickle, so works with parallel option.
    #   Problem: DOY1, DOY2 don't update here when changed externally after
    #   function declaration
    ds = ds.sel(time=((ds['time.dayofyear'] >= DOY1) & (ds['time.dayofyear'] <= DOY2)))
    return ds

ds = xr.open_mfdataset(
    files,
    chunks={'lat': 50, 'lon':50},
    combine='nested', concat_dim='time',
    preprocess=select_time,
    parallel=True
)
```
I can appreciate that the pickling error for Method 1 is actually caused by the references to the (global) ipywidgets `mmddW` and `daysW`; after all, why should it be expected to pickle those? It's interesting that this is only a problem with the parallel option, though.
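Here's a minimal sketch of my understanding of why (assumptions: `FakeWidget` is a stand-in for a real widget's unpicklable internals, and dask serialises notebook-defined functions with `cloudpickle`, which captures referenced globals by value):
```
import threading
import cloudpickle

class FakeWidget:
    # Stand-in for an ipywidgets widget: it holds unpicklable state.
    def __init__(self):
        self.value = '0101'
        self._lock = threading.Lock()

mmddW = FakeWidget()

def select_time(ds):
    return mmddW.value  # references the global widget

# Pickling the function by value drags the widget (and its lock) along:
cloudpickle.dumps(select_time)  # TypeError: cannot pickle '_thread.lock' object
```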

I don't fully understand, but can also appreciate, that Method 2 only picks up the values of `DOY1`/`DOY2` once and seems to be static thereafter, even if `DOY1`/`DOY2` are modified externally.
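In plain Python the function would see the updated globals (they're looked up at call time), so the static behaviour looks more like a definition-time snapshot to me. A minimal sketch of the difference, with no xarray involved:
```
DOY1 = 1

def live(ds=None):
    return DOY1                # global looked up at each call

def snapshot(ds=None, doy1=DOY1):
    return doy1                # default evaluated once, when def ran

DOY1 = 100
print(live())      # 100 -- sees the updated global
print(snapshot())  # 1   -- kept the value from definition time
```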

Both methods are variations on a theme: I'm trying to use globals in the `preprocess` function as an alternative to passing extra args.
The broader question is whether extra arguments would be a useful feature to have.

### Another solution
I think the actual solution to my problem is to forget about preprocessing. Since nothing is actually loaded at that stage, I can do everything after the open:
```
ds = xr.open_mfdataset(
    files,
    combine='nested', concat_dim='time',
    parallel=True
)

ds = ds.sel(time=((ds['time.dayofyear'] >= DOY1) & (ds['time.dayofyear'] <= DOY2)))
ds = ds.chunk({'time': -1, 'lat': 50, 'lon': 50}).persist()
```
Doing everything after the `open_mfdataset` call seems to work more efficiently. This is still counter-intuitive to me: loading less from the outset would seem better, but because the open is lazy, the after-the-fact selection takes care of the problem.

Still, it side-steps the arg-passing issue.

> Before I think about this further - could your problem be solved using `functools.partial`?

I've never used `functools.partial`. From my reading, it seems to be used to wrap a function and fix certain arguments so you can call the wrapper with fewer args. I don't know how to use it to help my current situation.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,659142789