id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
374279704,MDU6SXNzdWUzNzQyNzk3MDQ=,2514,interpolate_na with limit argument changes size of chunks,102827,closed,0,,,8,2018-10-26T08:31:35Z,2021-03-26T19:50:50Z,2021-03-26T19:50:50Z,CONTRIBUTOR,,,,"#### Code Sample, a copy-pastable example if possible

```python
import pandas as pd
import xarray as xr
import numpy as np

t = pd.date_range(start='2018-01-01', end='2018-02-01', freq='H')
foo = np.sin(np.arange(len(t)))
bar = np.cos(np.arange(len(t)))

# Insert one missing value into each variable
foo[1] = np.nan
bar[2] = np.nan

ds_test = xr.Dataset(data_vars={'foo': ('time', foo), 'bar': ('time', bar)},
                     coords={'time': t}).chunk()

print(ds_test)

print(""\n\n### After `.interpolate_na(dim='time')`\n"")
print(ds_test.interpolate_na(dim='time'))

print(""\n\n### After `.interpolate_na(dim='time', limit=5)`\n"")
print(ds_test.interpolate_na(dim='time', limit=5))

print(""\n\n### After `.interpolate_na(dim='time', limit=20)`\n"")
print(ds_test.interpolate_na(dim='time', limit=20))
```

Output of the above code. Note the different chunk sizes, depending on the value of `limit`:

```
Dimensions: (time: 745)
Coordinates:
  * time (time) datetime64[ns] 2018-01-01 2018-01-01T01:00:00 ... 2018-02-01
Data variables:
    foo (time) float64 dask.array
    bar (time) float64 dask.array

### After `.interpolate_na(dim='time')`

Dimensions: (time: 745)
Coordinates:
  * time (time) datetime64[ns] 2018-01-01 2018-01-01T01:00:00 ... 2018-02-01
Data variables:
    foo (time) float64 dask.array
    bar (time) float64 dask.array

### After `.interpolate_na(dim='time', limit=5)`

Dimensions: (time: 745)
Coordinates:
  * time (time) datetime64[ns] 2018-01-01 2018-01-01T01:00:00 ... 2018-02-01
Data variables:
    foo (time) float64 dask.array
    bar (time) float64 dask.array

### After `.interpolate_na(dim='time', limit=20)`

Dimensions: (time: 745)
Coordinates:
  * time (time) datetime64[ns] 2018-01-01 2018-01-01T01:00:00 ... 2018-02-01
Data variables:
    foo (time) float64 dask.array
    bar (time) float64 dask.array
```

#### Problem description

Using `xarray.DataArray.interpolate_na()` with the `limit` kwarg changes the chunksize of the resulting `dask.arrays`.

#### Expected Output

The chunksize should not change. Very small chunks, which result from typical small values of `limit`, are not optimal for the performance of `dask`. Also, things like `.rolling()` will fail if the chunksize is smaller than the window length of the rolling window.

#### Output of ``xr.show_versions()``
INSTALLED VERSIONS ------------------ commit: None python: 2.7.15.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: de_DE.UTF-8 LOCALE: None.None xarray: 0.10.9 pandas: 0.23.3 numpy: 1.13.3 scipy: 1.0.0 netCDF4: 1.4.1 h5netcdf: 0.5.0 h5py: 2.8.0 Nio: None zarr: None cftime: 1.0.1 PseudonetCDF: None rasterio: None iris: None bottleneck: 1.2.1 cyordereddict: 1.0.0 dask: 0.19.4 distributed: 1.23.3 matplotlib: 2.2.2 cartopy: 0.16.0 seaborn: 0.8.1 setuptools: 38.5.2 pip: 9.0.1 conda: 4.5.11 pytest: 3.4.2 IPython: 5.5.0 sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2514/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
376162232,MDExOlB1bGxSZXF1ZXN0MjI3NDQzNTI3,2532,[WIP] Fix problem with wrong chunksizes when using rolling_window on dask.array,102827,closed,0,,,2,2018-10-31T21:12:03Z,2021-03-26T19:50:50Z,2021-03-26T19:50:50Z,CONTRIBUTOR,,0,pydata/xarray/pulls/2532," - [ ] Closes #2514
 - [ ] Closes #2531
 - [ ] Tests added (for all bug fixes or enhancements)
 - [ ] Fully documented, including `whats-new.rst` for all changes

## Short summary

The two rolling-window functions for `dask.array`
* [dask_rolling_wrapper](https://github.com/pydata/xarray/blob/b622c5e7da928524ef949d9e389f6c7f38644494/xarray/core/dask_array_ops.py#L23)
* [rolling_window](https://github.com/pydata/xarray/blob/b622c5e7da928524ef949d9e389f6c7f38644494/xarray/core/dask_array_ops.py#L43)

will be fixed to preserve `dask.array` chunksizes.

## Long summary

The specific initial problem with chunksizes and `interpolate_na()` in #2514 is caused by the padding done in

https://github.com/pydata/xarray/blob/5940100761478604080523ebb1291ecff90e779e/xarray/core/dask_array_ops.py#L74-L85

which adds a small array with a small chunk to the initial array.

There is another related problem where `DataArray.rolling()` changes the size and distribution of `dask.array` chunks which stems from this code

https://github.com/pydata/xarray/blob/b622c5e7da928524ef949d9e389f6c7f38644494/xarray/core/dask_array_ops.py#L23

For some (historic) reason there are these two rolling-window functions for `dask`. Both need to be fixed to preserve chunksize of a `dask.array` in all cases.
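
The effect of the padding on chunk sizes can be illustrated without `dask` by tracking only the tuple of chunk lengths along one axis. This is a simplified model under stated assumptions, not the xarray code: `chunks_after_front_padding` and `merge_small_leading_chunk` are hypothetical helper names, and the merge step is just one conceivable repair, not necessarily the fix this PR will use.

```python
def chunks_after_front_padding(chunks, pad_width):
    # Concatenating a pad block in front of a chunked array adds the pad
    # as its own (small) chunk; the existing chunks are left untouched.
    return (pad_width,) + tuple(chunks)


def merge_small_leading_chunk(chunks):
    # Hypothetical repair (an assumption for illustration): fold the small
    # leading pad chunk into the chunk that follows it.
    if len(chunks) < 2:
        return tuple(chunks)
    return (chunks[0] + chunks[1],) + tuple(chunks[2:])


original = (100, 100, 100, 100, 100, 100, 100, 45)
padded = chunks_after_front_padding(original, pad_width=5)
merged = merge_small_leading_chunk(padded)

print(padded)  # a small 5-element chunk now leads the chunk tuple
print(merged)  # the pad has been absorbed into the first regular chunk
```

The model only shows why a pad of width `limit` surfaces as a tiny extra chunk; restoring the exact original chunk layout additionally requires trimming the pad again after the windowed operation.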
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2532/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
376154741,MDU6SXNzdWUzNzYxNTQ3NDE=,2531,DataArray.rolling() does not preserve chunksizes in some cases,102827,closed,0,,,2,2018-10-31T20:50:33Z,2021-03-26T19:50:49Z,2021-03-26T19:50:49Z,CONTRIBUTOR,,,,"This issue was found and discussed in the related issue #2514. I am opening a separate issue for clarity.

#### Code Sample, a copy-pastable example if possible

```python
import pandas as pd
import numpy as np
import xarray as xr

t = pd.date_range(start='2018-01-01', end='2018-02-01', freq='H')
bar = np.sin(np.arange(len(t)))
baz = np.cos(np.arange(len(t)))

da_test = xr.DataArray(data=np.stack([bar, baz]),
                       coords={'time': t, 'sensor': ['one', 'two']},
                       dims=('sensor', 'time'))

# Same chunking and window length, different reduction methods
print(da_test.chunk({'time': 100}).rolling(time=60).mean().chunks)
print(da_test.chunk({'time': 100}).rolling(time=60).count().chunks)
```

```
Output for `mean`: ((2,), (745,))
Output for `count`: ((2,), (100, 100, 100, 100, 100, 100, 100, 45))
Desired Output: ((2,), (100, 100, 100, 100, 100, 100, 100, 45))
```

#### Problem description

`DataArray.rolling()` does not always preserve the chunksizes; whether it does apparently depends on the applied method.

#### Output of ``xr.show_versions()``
INSTALLED VERSIONS ------------------ commit: None python: 2.7.15.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: de_DE.UTF-8 LOCALE: None.None xarray: 0.10.9 pandas: 0.23.3 numpy: 1.13.3 scipy: 1.0.0 netCDF4: 1.4.1 h5netcdf: 0.5.0 h5py: 2.8.0 Nio: None zarr: None cftime: 1.0.1 PseudonetCDF: None rasterio: None iris: None bottleneck: 1.2.1 cyordereddict: 1.0.0 dask: 0.19.4 distributed: 1.23.3 matplotlib: 2.2.2 cartopy: 0.16.0 seaborn: 0.8.1 setuptools: 38.5.2 pip: 9.0.1 conda: 4.5.11 pytest: 3.4.2 IPython: 5.5.0 sphinx: None
","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/2531/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue
229807027,MDExOlB1bGxSZXF1ZXN0MTIxMzc5NjAw,1414,Speed up `decode_cf_datetime`,102827,closed,0,,,12,2017-05-18T21:15:40Z,2017-07-26T07:40:24Z,2017-07-25T17:42:52Z,CONTRIBUTOR,,0,pydata/xarray/pulls/1414," - [x] Closes #1399
 - [x] Tests added / passed
 - [x] Passes ``git diff upstream/master | flake8 --diff``
 - [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API

Instead of casting the input numeric dates to float, they are now cast to nanoseconds as int64, which makes `pd.to_timedelta()` work much faster (x100 speedup on my machine).

On my machine all existing tests for `conventions.py` pass. Overflows should be handled by [these two already existing lines](https://github.com/cchwala/xarray/commit/d7d7c01f3e2f14c38c44e62f648b30474469b078#diff-d94eba38daa73be812c57c756f01f0daR158) since everything in the valid range of `pd.to_datetime` should be safe.","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1414/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
226549366,MDU6SXNzdWUyMjY1NDkzNjY=,1399,`decode_cf_datetime()` slow because `pd.to_timedelta()` is slow if floats are passed,102827,closed,0,,,6,2017-05-05T11:48:00Z,2017-07-25T17:42:52Z,2017-07-25T17:42:52Z,CONTRIBUTOR,,,,"Hi, `decode_cf_datetime` is slowed down because it [always passes floats](https://github.com/pydata/xarray/blob/master/xarray/conventions.py#L129) to [`pd.to_timedelta`](https://github.com/pydata/xarray/blob/master/xarray/conventions.py#L154), while `pd.to_timedelta` is much faster when working on integers.
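
To make concrete what handing integers instead of floats would involve, here is a minimal pure-Python sketch that scales raw float offsets to whole nanoseconds. It is only an illustration under assumptions: `NS_PER_UNIT` and `to_int_nanoseconds` are hypothetical names, the unit keys are chosen for this example, and this is not the code in `conventions.py`.

```python
# Nanoseconds per fixed-length time unit. Calendar units such as months
# are left out on purpose because their length in nanoseconds varies.
NS_PER_UNIT = {
    'ns': 1,
    'us': 1000,
    'ms': 1000000,
    's': 1000000000,
    'min': 60 * 1000000000,
    'h': 3600 * 1000000000,
}


def to_int_nanoseconds(values, unit):
    # Scale raw numeric offsets to whole nanoseconds and return Python ints,
    # rounding to the nearest nanosecond.
    factor = NS_PER_UNIT[unit]
    return [int(round(v * factor)) for v in values]


offsets_in_hours = [0.0, 0.5, 1.25]
ns_offsets = to_int_nanoseconds(offsets_in_hours, 'h')
print(ns_offsets)  # [0, 1800000000000, 4500000000000]
```

The resulting integers could then be passed on as nanosecond counts; very large offsets would still need an overflow check against the int64 range.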
[Here](https://gist.github.com/cchwala/157b87d4e413b560f8ad8555a330b937#file-timing_for_timedelta64_and_pandas_timedelta-ipynb) is a notebook that shows the differences. Working with integers is approx. one order of magnitude faster.

Hence, it would be great to automatically convert raw float time values to integer nanoseconds where possible (likely limited to resolutions below days or hours, to avoid having to cope with units whose length in nanoseconds varies, e.g. months).

As an alternative, maybe avoid forcing the cast to floats and indicate in the docstring that the raw values should be integers to speed up the conversion.

This could possibly also be resolved in `pd.to_timedelta`, but I assume it will be more complicated to deal with all the edge cases there. ","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/1399/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,completed,13221727,issue