issues


6 rows where user = 102827 sorted by updated_at descending


type 2

  • issue 4
  • pull 2

state 2

  • closed 5
  • open 1

repo 1

  • xarray 6
id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
289342234 MDU6SXNzdWUyODkzNDIyMzQ= 1836 HDF5 error when working with compressed NetCDF files and the dask multiprocessing scheduler cchwala 102827 open 0     5 2018-01-17T17:05:56Z 2022-06-21T14:50:02Z   CONTRIBUTOR      

Code Sample, a copy-pastable example if possible

```python
import xarray as xr
import numpy as np
import dask.multiprocessing

# Generate dummy data and build xarray dataset
mat = np.random.rand(10, 90, 90)
ds = xr.Dataset(data_vars={'foo': (('time', 'x', 'y'), mat)})

# Write dataset to netcdf without compression
ds.to_netcdf('dummy_data_3d.nc')

# Write with zlib compression
ds.to_netcdf('dummy_data_3d_with_compression.nc',
             encoding={'foo': {'zlib': True}})

# Write data as int16 with scale factor applied
ds.to_netcdf('dummy_data_3d_with_scale_factor.nc',
             encoding={'foo': {'dtype': 'int16',
                               'scale_factor': 0.01,
                               '_FillValue': -9999}})

# Load data from netCDF files
ds_vanilla = xr.open_dataset('dummy_data_3d.nc', chunks={'time': 1})
ds_scaled = xr.open_dataset('dummy_data_3d_with_scale_factor.nc', chunks={'time': 1})
ds_compressed = xr.open_dataset('dummy_data_3d_with_compression.nc', chunks={'time': 1})

# Do computation using dask's multiprocessing scheduler
foo = ds_vanilla.foo.mean(dim=['x', 'y']).compute(get=dask.multiprocessing.get)
foo = ds_scaled.foo.mean(dim=['x', 'y']).compute(get=dask.multiprocessing.get)
foo = ds_compressed.foo.mean(dim=['x', 'y']).compute(get=dask.multiprocessing.get)

# The last line fails
```

Problem description

If NetCDF files are compressed (which is often the case) and opened with chunking enabled to use them with dask, computations using the multiprocessing scheduler fail. The above code shows this in a short example. The last line fails with a long HDF5 error log:

```
HDF5-DIAG: Error detected in HDF5 (1.10.1) thread 140736213758912:
  #000: H5Dio.c line 171 in H5Dread(): can't read data
    major: Dataset
    minor: Read failed
  #001: H5Dio.c line 544 in H5D__read(): can't read data
    major: Dataset
    minor: Read failed
  #002: H5Dchunk.c line 2022 in H5D__chunk_read(): error looking up chunk address
    major: Dataset
    minor: Can't get value
  #003: H5Dchunk.c line 2768 in H5D__chunk_lookup(): can't query chunk address
    major: Dataset
    minor: Can't get value
  #004: H5Dbtree.c line 1047 in H5D__btree_idx_get_addr(): can't get chunk info
    major: Dataset
    minor: Can't get value
  #005: H5B.c line 341 in H5B_find(): unable to load B-tree node
    major: B-Tree node
    minor: Unable to protect metadata
  #006: H5AC.c line 1763 in H5AC_protect(): H5C_protect() failed
    major: Object cache
    minor: Unable to protect metadata
  #007: H5C.c line 2561 in H5C_protect(): can't load entry
    major: Object cache
    minor: Unable to load metadata into cache
  #008: H5C.c line 6877 in H5C_load_entry(): Can't deserialize image
    major: Object cache
    minor: Unable to load metadata into cache
  #009: H5Bcache.c line 181 in H5B__cache_deserialize(): wrong B-tree signature
    major: B-Tree node
    minor: Bad value
Traceback (most recent call last):
  File "hdf5_bug_minimal_working_example.py", line 27, in <module>
    foo = ds_compressed.foo.mean(dim=['x', 'y']).compute(get=dask.multiprocessing.get)
  File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/xarray/core/dataarray.py", line 658, in compute
    return new.load(**kwargs)
  File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/xarray/core/dataarray.py", line 632, in load
    ds = self._to_temp_dataset().load(**kwargs)
  File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/xarray/core/dataset.py", line 491, in load
    evaluated_data = da.compute(*lazy_data.values(), **kwargs)
  File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/base.py", line 333, in compute
    results = get(dsk, keys, **kwargs)
  File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/multiprocessing.py", line 177, in get
    raise_exception=reraise, **kwargs)
  File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/local.py", line 521, in get_async
    raise_exception(exc, tb)
  File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/local.py", line 290, in execute_task
    result = _execute_task(task, data)
  File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/local.py", line 270, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]
  File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/local.py", line 270, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]
  File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/local.py", line 267, in _execute_task
    return [_execute_task(a, cache) for a in arg]
  File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/local.py", line 271, in _execute_task
    return func(*args2)
  File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/dask/array/core.py", line 72, in getter
    c = np.asarray(c)
  File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/numpy/core/numeric.py", line 531, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/xarray/core/indexing.py", line 538, in __array__
    return np.asarray(self.array, dtype=dtype)
  File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/numpy/core/numeric.py", line 531, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/xarray/core/indexing.py", line 505, in __array__
    return np.asarray(array[self.key], dtype=None)
  File "/Users/chwala-c/miniconda2/lib/python2.7/site-packages/xarray/backends/netCDF4_.py", line 61, in __getitem__
    data = getitem(self.get_array(), key)
  File "netCDF4/_netCDF4.pyx", line 3961, in netCDF4._netCDF4.Variable.__getitem__
  File "netCDF4/_netCDF4.pyx", line 4798, in netCDF4._netCDF4.Variable._get
  File "netCDF4/_netCDF4.pyx", line 1638, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: HDF error
```

A possible workaround, if the dataset fits into memory, is to use `ds = ds.persist()`. I could split up my dataset to accomplish this, but the beauty of xarray and dask gets lost a little when doing this...
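
For reference, a minimal sketch of that persist-based workaround, assuming the compressed dataset does fit into memory (same file and old `get=` API as in the example above):

```python
import dask.multiprocessing
import xarray as xr

# Open the compressed file lazily, then pull the data into memory with
# persist(), so the HDF5 reads no longer happen inside the
# multiprocessing workers.
ds_compressed = xr.open_dataset('dummy_data_3d_with_compression.nc',
                                chunks={'time': 1})
ds_compressed = ds_compressed.persist()

# The reduction itself can then still run on the multiprocessing scheduler.
foo = ds_compressed.foo.mean(dim=['x', 'y']).compute(get=dask.multiprocessing.get)
```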

Output of xr.show_versions()

``` INSTALLED VERSIONS ------------------ commit: None python: 2.7.14.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: de_DE.UTF-8 LOCALE: None.None xarray: 0.10.0 pandas: 0.21.0 numpy: 1.13.3 scipy: 1.0.0 netCDF4: 1.3.1 h5netcdf: 0.5.0 Nio: None bottleneck: 1.2.1 cyordereddict: 1.0.0 dask: 0.16.0 matplotlib: 2.1.0 cartopy: None seaborn: 0.8.1 setuptools: 36.7.2 pip: 9.0.1 conda: 4.3.29 pytest: 3.2.5 IPython: 5.5.0 sphinx: None ```
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1836/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 issue
374279704 MDU6SXNzdWUzNzQyNzk3MDQ= 2514 interpolate_na with limit argument changes size of chunks cchwala 102827 closed 0     8 2018-10-26T08:31:35Z 2021-03-26T19:50:50Z 2021-03-26T19:50:50Z CONTRIBUTOR      

Code Sample, a copy-pastable example if possible

```python
import pandas as pd
import xarray as xr
import numpy as np

t = pd.date_range(start='2018-01-01', end='2018-02-01', freq='H')
foo = np.sin(np.arange(len(t)))
bar = np.cos(np.arange(len(t)))

foo[1] = np.NaN
bar[2] = np.NaN

ds_test = xr.Dataset(data_vars={'foo': ('time', foo),
                                'bar': ('time', bar)},
                     coords={'time': t}).chunk()

print(ds_test)
print("\n\n### After .interpolate_na(dim='time')\n")
print(ds_test.interpolate_na(dim='time'))
print("\n\n### After .interpolate_na(dim='time', limit=5)\n")
print(ds_test.interpolate_na(dim='time', limit=5))
print("\n\n### After .interpolate_na(dim='time', limit=20)\n")
print(ds_test.interpolate_na(dim='time', limit=20))
```

Output of the above code. Note the different chunk sizes, depending on the value of `limit`:

```
<xarray.Dataset>
Dimensions:  (time: 745)
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-01T01:00:00 ... 2018-02-01
Data variables:
    foo      (time) float64 dask.array<shape=(745,), chunksize=(745,)>
    bar      (time) float64 dask.array<shape=(745,), chunksize=(745,)>


### After .interpolate_na(dim='time')

<xarray.Dataset>
Dimensions:  (time: 745)
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-01T01:00:00 ... 2018-02-01
Data variables:
    foo      (time) float64 dask.array<shape=(745,), chunksize=(745,)>
    bar      (time) float64 dask.array<shape=(745,), chunksize=(745,)>


### After .interpolate_na(dim='time', limit=5)

<xarray.Dataset>
Dimensions:  (time: 745)
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-01T01:00:00 ... 2018-02-01
Data variables:
    foo      (time) float64 dask.array<shape=(745,), chunksize=(3,)>
    bar      (time) float64 dask.array<shape=(745,), chunksize=(3,)>


### After .interpolate_na(dim='time', limit=20)

<xarray.Dataset>
Dimensions:  (time: 745)
Coordinates:
  * time     (time) datetime64[ns] 2018-01-01 2018-01-01T01:00:00 ... 2018-02-01
Data variables:
    foo      (time) float64 dask.array<shape=(745,), chunksize=(10,)>
    bar      (time) float64 dask.array<shape=(745,), chunksize=(10,)>
```

Problem description

Using xarray.DataArray.interpolate_na() with the limit kwarg changes the chunk sizes of the resulting dask arrays.

Expected Output

The chunk sizes should not change. The very small chunks that result from typical small values of limit are not optimal for dask performance. Also, things like .rolling() will fail if the chunk size is smaller than the window length of the rolling window.
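
As a stop-gap, a sketch of a possible workaround (my own suggestion, not part of the original report): rechunk the result back to the original layout after the interpolation.

```python
import numpy as np
import pandas as pd
import xarray as xr

t = pd.date_range(start='2018-01-01', end='2018-02-01', freq='H')
foo = np.sin(np.arange(len(t)))
foo[1] = np.NaN

ds = xr.Dataset(data_vars={'foo': ('time', foo)}, coords={'time': t}).chunk()

# interpolate_na(..., limit=5) currently leaves many tiny chunks; rechunking
# afterwards restores a single chunk along 'time'.
ds_filled = ds.interpolate_na(dim='time', limit=5).chunk({'time': len(t)})
print(ds_filled.foo.data.chunks)  # ((745,),)
```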

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.15.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: de_DE.UTF-8 LOCALE: None.None xarray: 0.10.9 pandas: 0.23.3 numpy: 1.13.3 scipy: 1.0.0 netCDF4: 1.4.1 h5netcdf: 0.5.0 h5py: 2.8.0 Nio: None zarr: None cftime: 1.0.1 PseudonetCDF: None rasterio: None iris: None bottleneck: 1.2.1 cyordereddict: 1.0.0 dask: 0.19.4 distributed: 1.23.3 matplotlib: 2.2.2 cartopy: 0.16.0 seaborn: 0.8.1 setuptools: 38.5.2 pip: 9.0.1 conda: 4.5.11 pytest: 3.4.2 IPython: 5.5.0 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2514/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
376162232 MDExOlB1bGxSZXF1ZXN0MjI3NDQzNTI3 2532 [WIP] Fix problem with wrong chunksizes when using rolling_window on dask.array cchwala 102827 closed 0     2 2018-10-31T21:12:03Z 2021-03-26T19:50:50Z 2021-03-26T19:50:50Z CONTRIBUTOR   0 pydata/xarray/pulls/2532
  • [ ] Closes #2514
  • [ ] Closes #2531
  • [ ] Tests added (for all bug fixes or enhancements)
  • [ ] Fully documented, including whats-new.rst for all changes

Short summary

The two rolling-window functions for dask.array

  • dask_rolling_wrapper
  • rolling_window

will be fixed to preserve dask.array chunksizes.

Long summary

The specific initial problem with chunksizes and interpolate_na() in #2514 is caused by the padding done in

https://github.com/pydata/xarray/blob/5940100761478604080523ebb1291ecff90e779e/xarray/core/dask_array_ops.py#L74-L85

which adds a small array with a small chunk to the initial array.
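
For illustration only, a standalone dask sketch (not the xarray code linked above) of how concatenating a short padding array introduces the extra tiny chunk:

```python
import dask.array as da

x = da.ones(745, chunks=100)   # chunks: (100, 100, 100, 100, 100, 100, 100, 45)
pad = da.zeros(5, chunks=5)    # small padding array with its own tiny chunk

padded = da.concatenate([pad, x])
print(padded.chunks)           # ((5, 100, 100, 100, 100, 100, 100, 100, 45),)
```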

There is another related problem where DataArray.rolling() changes the size and distribution of dask.array chunks which stems from this code

https://github.com/pydata/xarray/blob/b622c5e7da928524ef949d9e389f6c7f38644494/xarray/core/dask_array_ops.py#L23

For some (historical) reason there are two rolling-window functions for dask. Both need to be fixed so that they preserve the chunk sizes of a dask.array in all cases.
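
As a rough sketch of the general idea only (not necessarily how this PR implements the fix), the padding can be merged back into the first original chunk by rechunking the concatenated result:

```python
import dask.array as da

def pad_start_preserving_chunks(x, pad_width):
    """Pad a 1-d dask array at the start without leaving a tiny extra chunk."""
    pad = da.zeros(pad_width, dtype=x.dtype, chunks=pad_width)
    padded = da.concatenate([pad, x])
    # Merge the padding into the first original chunk instead of keeping it separate.
    chunks = x.chunks[0]
    new_chunks = (chunks[0] + pad_width,) + chunks[1:]
    return padded.rechunk((new_chunks,))

x = da.ones(745, chunks=100)
print(pad_start_preserving_chunks(x, 5).chunks)  # ((105, 100, 100, 100, 100, 100, 100, 45),)
```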

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2532/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
376154741 MDU6SXNzdWUzNzYxNTQ3NDE= 2531 DataArray.rolling() does not preserve chunksizes in some cases cchwala 102827 closed 0     2 2018-10-31T20:50:33Z 2021-03-26T19:50:49Z 2021-03-26T19:50:49Z CONTRIBUTOR      

This issue was found and discussed in the related issue #2514.

I am opening a separate issue for clarity.

Code Sample, a copy-pastable example if possible

```python
import pandas as pd
import numpy as np
import xarray as xr

t = pd.date_range(start='2018-01-01', end='2018-02-01', freq='H')
bar = np.sin(np.arange(len(t)))
baz = np.cos(np.arange(len(t)))

da_test = xr.DataArray(data=np.stack([bar, baz]),
                       coords={'time': t, 'sensor': ['one', 'two']},
                       dims=('sensor', 'time'))

print(da_test.chunk({'time': 100}).rolling(time=60).mean().chunks)
# Output for mean:  ((2,), (745,))

print(da_test.chunk({'time': 100}).rolling(time=60).count().chunks)
# Output for count: ((2,), (100, 100, 100, 100, 100, 100, 100, 45))

# Desired Output:   ((2,), (100, 100, 100, 100, 100, 100, 100, 45))
```

Problem description

DataArray.rolling() does not preserve the chunk sizes of the underlying dask array; whether they are preserved apparently depends on the applied method.
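
A possible interim workaround, shown here as my own sketch rather than anything from the issue: rechunk the rolled result back to the input's chunk layout.

```python
import numpy as np
import pandas as pd
import xarray as xr

t = pd.date_range(start='2018-01-01', end='2018-02-01', freq='H')
da_test = xr.DataArray(data=np.stack([np.sin(np.arange(len(t))),
                                      np.cos(np.arange(len(t)))]),
                       coords={'time': t, 'sensor': ['one', 'two']},
                       dims=('sensor', 'time'))

rolled = da_test.chunk({'time': 100}).rolling(time=60).mean()
# .mean() collapses the time axis into one big chunk; rechunking restores
# the intended layout.
rolled = rolled.chunk({'time': 100})
print(rolled.chunks)  # ((2,), (100, 100, 100, 100, 100, 100, 100, 45))
```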

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.15.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: de_DE.UTF-8 LOCALE: None.None xarray: 0.10.9 pandas: 0.23.3 numpy: 1.13.3 scipy: 1.0.0 netCDF4: 1.4.1 h5netcdf: 0.5.0 h5py: 2.8.0 Nio: None zarr: None cftime: 1.0.1 PseudonetCDF: None rasterio: None iris: None bottleneck: 1.2.1 cyordereddict: 1.0.0 dask: 0.19.4 distributed: 1.23.3 matplotlib: 2.2.2 cartopy: 0.16.0 seaborn: 0.8.1 setuptools: 38.5.2 pip: 9.0.1 conda: 4.5.11 pytest: 3.4.2 IPython: 5.5.0 sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2531/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue
229807027 MDExOlB1bGxSZXF1ZXN0MTIxMzc5NjAw 1414 Speed up `decode_cf_datetime` cchwala 102827 closed 0     12 2017-05-18T21:15:40Z 2017-07-26T07:40:24Z 2017-07-25T17:42:52Z CONTRIBUTOR   0 pydata/xarray/pulls/1414
  • [x] Closes #1399
  • [x] Tests added / passed
  • [x] Passes git diff upstream/master | flake8 --diff
  • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API

Instead of casting the input numeric dates to float, they are now cast to int64 nanoseconds, which makes pd.to_timedelta() work much faster (x100 speedup on my machine).
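
Roughly, the idea looks like this (a simplified sketch with made-up example values, not the actual code in the PR):

```python
import numpy as np
import pandas as pd

# Raw numeric dates as they might come out of a NetCDF file,
# e.g. "hours since 2000-01-01".
num_dates = np.array([0.0, 1.0, 2.5, 24.0])

# Slow path: hand the floats directly to pd.to_timedelta.
deltas_float = pd.to_timedelta(num_dates, unit='h')

# Fast path: convert to integer nanoseconds first, then let
# pd.to_timedelta operate on int64 input.
ns_per_hour = 3600 * 10**9
deltas_int = pd.to_timedelta((num_dates * ns_per_hour).astype('int64'), unit='ns')

assert (deltas_float == deltas_int).all()
```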

On my machine all existing tests for conventions.py pass. Overflows should be handled by these two already-existing lines, since everything in the valid range of pd.to_datetime should be safe.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1414/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
    xarray 13221727 pull
226549366 MDU6SXNzdWUyMjY1NDkzNjY= 1399 `decode_cf_datetime()` slow because `pd.to_timedelta()` is slow if floats are passed cchwala 102827 closed 0     6 2017-05-05T11:48:00Z 2017-07-25T17:42:52Z 2017-07-25T17:42:52Z CONTRIBUTOR      

Hi, decode_cf_datetime is slowed down because it always passes floats to pd.to_timedelta, while pd.to_timedelta is much faster when working on integers.

Here is a notebook that shows the differences. Working with integers is approx. one order of magnitude faster.
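
A minimal timing sketch along the lines of that notebook (my own sketch; exact numbers will vary by machine and pandas version):

```python
import timeit

import numpy as np
import pandas as pd

n = 1000000
float_seconds = np.arange(n, dtype='float64')
int_nanoseconds = (float_seconds * 10**9).astype('int64')

t_float = timeit.timeit(lambda: pd.to_timedelta(float_seconds, unit='s'), number=10)
t_int = timeit.timeit(lambda: pd.to_timedelta(int_nanoseconds, unit='ns'), number=10)

print('float seconds   :', t_float)
print('int nanoseconds :', t_int)
```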

Hence, it would be great to automatically convert the raw time value floats to integer nanoseconds where possible (likely limited to resolutions below days or hours, to avoid coping with the differing numbers of nanoseconds in e.g. different months).

As an alternative, maybe avoid forcing the cast to floats and indicate in the docstring that the raw values should be integers to speed up the conversion.

This could possibly also be resolved in pd.to_timedelta, but I assume it would be more complicated to deal with all the edge cases there.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1399/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);