github: issues: 4 rows where comments = 1, state = "open" and user = 20629530 sorted by updated

4 rows where comments = 1, state = "open" and user = 20629530 sorted by updated_at descending

Search:

descending

id	node_id	number	title	user	state	comments	created_at	updated_at ▲	author_association	draft	pull_request	body	reactions	repo	type
991544027	MDExOlB1bGxSZXF1ZXN0NzI5OTkzMTE0	5781	Add encodings to save_mfdataset	aulemahal 20629530	open	1	2021-09-08T21:24:13Z	2022-10-06T21:44:18Z	CONTRIBUTOR	0	pydata/xarray/pulls/5781	[ ] Closes #xxxx [x] Tests added [x] Passes `pre-commit run --all-files` [x] User visible changes (including notable bug fixes) are documented in `whats-new.rst` [ ] New functions/methods are listed in `api.rst` Simply add a `encodings` argument to `save_mfdataset`. As for the other args, it expects a list of dictionaries, with encoding information to be passed to `to_netcdf` for each dataset. Added a minimal test, simply to see if the argument was taken into account.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/5781/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	pull
906175200	MDExOlB1bGxSZXF1ZXN0NjU3MjA1NTM2	5402	`dt.to_pytimedelta` to allow arithmetic with cftime objects	aulemahal 20629530	open	1	2021-05-28T22:48:50Z	2022-06-09T14:50:16Z	CONTRIBUTOR	0	pydata/xarray/pulls/5402	[ ] Closes #xxxx [x] Tests added [x] Passes `pre-commit run --all-files` [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst` [ ] New functions/methods are listed in `api.rst` When playing with cftime objects a problem I encountered many times is that I can sub two arrays and them add it back to another. Subtracting to cftime datetime arrays result in an array of `np.timedelta64`. And when trying to add it back to another cftime array, we get a `UFuncTypeError` because the two arrays have incompatible dtypes : '<m8[ns]' and 'O'. Example: ```python import xarray as xr da = xr.DataArray(xr.cftime_range('1900-01-01', freq='D', periods=10), dims=('time',)) An array of timedelta64[ns] dt = da - da[0] da[-1] + dt # Fails ``` However, if the two arrays were of 'O' dtype, then the subtraction would be made by `cftime` which supports `datetime.timedelta` objects. This solution here adds a `to_pytimedelta` to the `TimedeltaAccessor`, mirroring the name of the similar function on `pd.Series.dt`. It uses a monkeypatching workaround to prevent xarray to case the array back into numpy objects. The user still has to check if the data is in cftime or numpy to adapt the operation (calling `dt.to_pytimedelta` or not), but custom workaround were always overly complicated for such a simple problem, this helps. Also, this doesn't work with dask arrays because loading a dask array triggers the variable constructor and thus recasts the array of `datetime.timedelta` to `numpy.timedelta[64]`. I realize I maybe should have opened an issue before, but I had this idea and it all rushed along.	{ "url": "https://api.github.com/repos/pydata/xarray/issues/5402/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	pull
969079775	MDU6SXNzdWU5NjkwNzk3NzU=	5701	Performance issues using map_blocks with cftime indexes.	aulemahal 20629530	open	1	2021-08-12T15:47:29Z	2022-04-19T02:44:37Z	CONTRIBUTOR			What happened: When using `map_blocks` on an object that is dask-backed and has a `CFTimeIndex` coordinate, the construction step (not computation done) is very slow. I've seen up to 100x slower than an equivalent object with a numpy datetime index. What you expected to happen: I would understand a performance difference since numpy/pandas objects are usually more optimized than cftime/xarray objects, but the difference is quite large here. Minimal Complete Verifiable Example: Here is a MCVE that I ran in a jupyter notebook. Performance is basically measured by execution time (wall time). I included the current workaround I have for my usecase. ```python import numpy as np import pandas as pd import xarray as xr import dask.array as da from dask.distributed import Client c = Client(n_workers=1, threads_per_worker=8) Test Data Nt = 10_000 Nx = Ny = 100 chks = (Nt, 10, 10) A = xr.DataArray( da.zeros((Nt, Ny, Nx), chunks=chks), dims=('time', 'y', 'x'), coords={'time': pd.date_range('1900-01-01', freq='D', periods=Nt), 'x': np.arange(Nx), 'y': np.arange(Ny) }, name='data' ) Copy of a, but with a cftime coordinate B = A.copy() B['time'] = xr.cftime_range('1900-01-01', freq='D', periods=Nt, calendar='noleap') A dumb function to apply def func(data): return data + data Test 1 : numpy-backed time coordinate %time outA = A.map_blocks(func, template=A) # %time outA.load(); Res on my machine: CPU times: user 130 ms, sys: 6.87 ms, total: 136 ms Wall time: 127 ms CPU times: user 3.01 s, sys: 8.09 s, total: 11.1 s Wall time: 13.4 s Test 2 : cftime-backed time coordinate %time outB = B.map_blocks(func, template=B) %time outB.load(); Res on my machine CPU times: user 4.42 s, sys: 219 ms, total: 4.64 s Wall time: 4.48 s CPU times: user 13.2 s, sys: 3.1 s, total: 16.3 s Wall time: 26 s Workaround in my code def func_cf(data): data['time'] = xr.decode_cf(data.coords.to_dataset()).time return data + data def map_blocks_cf(func, data): data2 = data.copy() data2['time'] = xr.conventions.encode_cf_variable(data.time) return data2.map_blocks(func, template=data) Test 3 : cftime time coordinate with encoding-decoding %time outB2 = map_blocks_cf(func_cf, B) %time outB2.load(); Res CPU times: user 536 ms, sys: 10.5 ms, total: 546 ms Wall time: 528 ms CPU times: user 9.57 s, sys: 2.23 s, total: 11.8 s Wall time: 21.7 s ``` Anything else we need to know?: After exploration I found 2 culprits for this slowness. I used `%%prun` to profile the construction phase of `map_blocks` and found that in the second case (cftime time coordinate): In `map_blocks` calls to `dask.base.tokenize` take the most time. Precisely, tokenizing a numpy ndarray of O dtype goes through the pickling process of the array. This is already quite slow and cftime objects take even more time to pickle. See Unidata/cftime#253 for the corresponding issue. Most of the construction phase execution time is spent pickling the same datetime array at least once per chunk. Second, but only significant when the time coordinate is very large (55000 in my use case). `CFTimeIndex.__new__` is called more than twice as many times as there are chunks. And within the object creation there is this line : https://github.com/pydata/xarray/blob/3956b73a7792f41e4410349f2c40b9a9a80decd2/xarray/coding/cftimeindex.py#L228 The larger the array, the more time is spent in this iteration. Changing the example above to use `Nt = 50_000`, the code spent a total of 25 s in `dask.base.tokenize` calls and 5 s in `CFTimeIndex.__new__` calls. My workaround is not the best, but it was easy to code without touching xarray. The encoding of the time coordinate changes it to an integer array, which is super fast to tokenize. And the speed up of the construction phase is because there is only one call to `encode_cf_variable` compared to `N_chunks` calls to the pickling, As shown above, I have not seen a slowdown in the computation phase. I think this is mostly because the added `decode_cf` calls are done in parallel, but there might be other reason I do not understand. I do not know for sure how/why this tokenization works, but I guess the best improvment in xarray could be to: - Look into the inputs of `map_blocks` and spot cftime-backed coordinates - Convert those coordinates to a ndarray of a basic dtype. - At the moment of tokenization of the time coordinates, do a switheroo and pass the converted arrays instead. I have no idea if that would work, but if it does that would be the best speed-up I think. Environment: Output of <tt>xr.show_versions()</tt> INSTALLED VERSIONS ------------------ commit: None python: 3.9.6 \| packaged by conda-forge \| (default, Jul 11 2021, 03:39:48) [GCC 9.3.0] python-bits: 64 OS: Linux OS-release: 3.10.0-514.2.2.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_CA.UTF-8 LOCALE: ('en_CA', 'UTF-8') libhdf5: 1.10.6 libnetcdf: 4.7.4 xarray: 0.19.1.dev18+g4bb9d9c.d20210810 pandas: 1.3.1 numpy: 1.21.1 scipy: 1.7.0 netCDF4: 1.5.6 pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.8.3 cftime: 1.5.0 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: 1.3.2 dask: 2021.07.1 distributed: 2021.07.1 matplotlib: 3.4.2 cartopy: 0.19.0 seaborn: None numbagg: None pint: 0.17 setuptools: 49.6.0.post20210108 pip: 21.2.1 conda: None pytest: None IPython: 7.25.0 sphinx: None	{ "url": "https://api.github.com/repos/pydata/xarray/issues/5701/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue
1035607476	I_kwDOAMm_X849uh20	5897	ds.mean bugs with cftime objects	aulemahal 20629530	open	1	2021-10-25T21:55:12Z	2021-10-27T14:51:07Z	CONTRIBUTOR			What happened: Given a dataset that has a variable with cftime objects along dimension A, averaging (`mean`) leads to buggy behaviour: Averaging over 'A' drops the variable instead of averaging it. Averaging over any other dimension will fail if that variable is on the dask backend. What you expected to happen: I expected the average to fail in the case of a dask-backed cftime variable, given that this code exists: https://github.com/pydata/xarray/blob/fdabf3bea5c750939a4a2ae60f80ed34a6aebd58/xarray/core/duck_array_ops.py#L562-L572 And I expected the average to work (not drop the var) in the case of the numpy backend. I expected the fact that dask is used to be irrelevant to the result. I expected the mean to conserve the cftime variable as-is since it doesn't include the averaged dimension. Minimal Complete Verifiable Example: ```python Put your MCVE code here import xarray as xr ds = xr.Dataset({ 'var1': (('time',), xr.cftime_range('2021-10-31', periods=10, freq='D')), 'var2': (('x',), list(range(10))) }) var1 contains cftime objects var2 contains integers They do not share dims ds.mean('time') # var1 has disappeared instead of being averaged ds.mean('x') # Everything ok dsc = ds.chunk({}) dsc.mean('time') # var1 has disappeared. I would expected this line to fail. dsc.mean('x') # Raises NotImplementedError. I would expect this line to run flawlessly. ``` Anything else we need to know?: A culprit is #5393, but maybe the bug is older? I think the change introduced there causes the issue (2) above. In `duck_array_ops.py` the mean operation is declared `numeric_only`, which is kinda incoherent with the implementation allowing means of datetime objects. This setting causes my (1) above. Environment: Output of <tt>xr.show_versions()</tt> INSTALLED VERSIONS ------------------ commit: fdabf3bea5c750939a4a2ae60f80ed34a6aebd58 python: 3.9.7 \| packaged by conda-forge \| (default, Sep 29 2021, 19:20:46) [GCC 9.4.0] python-bits: 64 OS: Linux OS-release: 5.14.12-arch1-1 machine: x86_64 processor: byteorder: little LC_ALL: None LANG: fr_CA.utf8 LOCALE: ('fr_CA', 'UTF-8') libhdf5: 1.12.1 libnetcdf: 4.8.1 xarray: 0.19.1.dev89+gfdabf3be pandas: 1.3.4 numpy: 1.21.3 scipy: 1.7.1 netCDF4: 1.5.7 pydap: installed h5netcdf: 0.11.0 h5py: 3.4.0 Nio: None zarr: 2.10.1 cftime: 1.5.1 nc_time_axis: 1.4.0 PseudoNetCDF: installed rasterio: 1.2.10 cfgrib: 0.9.9.1 iris: 3.1.0 bottleneck: 1.3.2 dask: 2021.10.0 distributed: 2021.10.0 matplotlib: 3.4.3 cartopy: 0.20.1 seaborn: 0.11.2 numbagg: 0.2.1 fsspec: 2021.10.1 cupy: None pint: 0.17 sparse: 0.13.0 setuptools: 58.2.0 pip: 21.3.1 conda: None pytest: 6.2.5 IPython: 7.28.0 sphinx: None	{ "url": "https://api.github.com/repos/pydata/xarray/issues/5897/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	xarray 13221727	issue

Advanced export

JSON shape: default, array, newline-delimited, object

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);

issues

4 rows where comments = 1, state = "open" and user = 20629530 sorted by updated_at descending

An array of timedelta64[ns]

Test Data

Copy of a, but with a cftime coordinate

A dumb function to apply

Test 1 : numpy-backed time coordinate

Res on my machine:

CPU times: user 130 ms, sys: 6.87 ms, total: 136 ms

Wall time: 127 ms

CPU times: user 3.01 s, sys: 8.09 s, total: 11.1 s

Wall time: 13.4 s

Test 2 : cftime-backed time coordinate

Res on my machine

CPU times: user 4.42 s, sys: 219 ms, total: 4.64 s

Wall time: 4.48 s

CPU times: user 13.2 s, sys: 3.1 s, total: 16.3 s

Wall time: 26 s

Workaround in my code

Test 3 : cftime time coordinate with encoding-decoding

Res

CPU times: user 536 ms, sys: 10.5 ms, total: 546 ms

Wall time: 528 ms

CPU times: user 9.57 s, sys: 2.23 s, total: 11.8 s

Wall time: 21.7 s

Put your MCVE code here

var1 contains cftime objects

var2 contains integers

They do not share dims

Advanced export