issues

3 rows where user = 5179430 sorted by updated_at descending

Facets: type (issue 2, pull 1) · state (closed 3) · repo (xarray 3)

id: 1563270549 · node_id: PR_kwDOAMm_X85I2_Fl · number: 7494
title: Update contains_cftime_datetimes to avoid loading entire variable array
user: agoodm (5179430) · state: closed · locked: 0 · comments: 8
created_at: 2023-01-30T21:54:35Z · updated_at: 2023-03-07T16:22:24Z · closed_at: 2023-03-07T16:10:30Z
author_association: CONTRIBUTOR · draft: 0 · pull_request: pydata/xarray/pulls/7494

  • [x] Closes #7484
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst

This PR greatly improves the performance of opening datasets with large arrays of object dtype (typically string arrays), since contains_cftime_datetimes was triggering a read of the entire array from the file just to check its very first element.

@Illviljan, continuing our discussion from the issue thread: I did try to pass var._data to _contains_cftime_datetimes, but I had a lot of trouble finding a general way to index the first array element. The best I could do was var._data.array.get_array(), and I don't think get_array is implemented for every backend, so for now I am keeping my original proposed solution.

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7494/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: xarray (13221727) · type: pull

id: 1561508426 · node_id: I_kwDOAMm_X85dErpK · number: 7484
title: Opening datasets with large object dtype arrays is very slow
user: agoodm (5179430) · state: closed · locked: 0 · comments: 3
created_at: 2023-01-29T23:31:51Z · updated_at: 2023-03-07T16:10:31Z · closed_at: 2023-03-07T16:10:31Z
author_association: CONTRIBUTOR

What is your issue?

Opening a dataset with a very large array of object dtype is much slower than it should be. I initially noticed this when working with a dataset spanned by around 24000 netCDF files. I have been using kerchunk references to load them with a consolidated metadata key, so I was expecting the open to be fairly quick, but it actually took several minutes. I realized that all this time was spent on one variable consisting of strings: after dropping it, the whole dataset opens in seconds. Sharing that dataset would be difficult, so instead I will illustrate the problem with a simple, easy-to-reproduce example using the latest released versions of xarray and zarr:

```python
import numpy as np
import xarray as xr

str_array = np.arange(100000000).astype(str)
ds = xr.DataArray(dims=('x',), data=str_array).to_dataset(name='str_array')
ds['str_array'] = ds.str_array.astype('O')  # needs to actually be object dtype to show the problem
ds.to_zarr('str_array.zarr')

%time xr.open_zarr('str_array.zarr/')
# CPU times: user 8.24 s, sys: 5.23 s, total: 13.5 s
# Wall time: 12.9 s
```
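For reference, the "drop the offending variable at open time" workaround mentioned above can be spelled as follows; this snippet is only an illustration applied to the reproduction store, not part of the original report:

```python
import xarray as xr

# Skip the slow object-dtype variable entirely when opening the store
ds_fast = xr.open_zarr('str_array.zarr/', drop_variables=['str_array'])
```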

I did some digging and found that pretty much all the time was spent on the check being done by contains_cftime_datetimes in https://github.com/pydata/xarray/blob/d385e2063a6b5919e1fe9dd3e27a24bc7117137e/xarray/core/common.py#L1793

This operation is not lazy and ends up requiring every single chunk for this variable to be opened, all for the sake of checking the very first element in the entire array. A quick fix I tried is updating contains_cftime_datetimes to do the following:

```python
def contains_cftime_datetimes(var) -> bool:
    """Check if an xarray.Variable contains cftime.datetime objects"""
    if var.dtype == np.dtype("O") and var.size > 0:
        ndims = len(var.shape)
        first_idx = np.zeros(ndims, dtype='int32')
        array = var[*first_idx].data
        return _contains_cftime_datetimes(array)
    else:
        return False
```

This drastically reduced the time to open the dataset, as expected:

```python
%time xr.open_zarr('str_array.zarr/')
# CPU times: user 384 ms, sys: 502 ms, total: 887 ms
# Wall time: 985 ms
```
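(Aside added for illustration, not part of the original issue: in the fix above, for a 3-D variable first_idx is [0, 0, 0], so var[*first_idx] selects only the single leading element. The starred-subscript spelling requires Python 3.11; an equivalent that works on older versions is shown below.)

```python
first_idx = (0,) * len(var.shape)  # e.g. (0, 0, 0) for a 3-D variable
array = var[first_idx].data        # pulls in only the first element, not the whole array
```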

I would like to make a PR with this change, but I realize that it could affect every backend, and although I have been using xarray for many years this would be my first contribution, so I would like to briefly discuss it in case there are better ways to address the issue. Thanks!

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7484/reactions",
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: xarray (13221727) · type: issue

id: 351000813 · node_id: MDU6SXNzdWUzNTEwMDA4MTM= · number: 2370
title: Inconsistent results when calculating sums on float32 arrays w/ bottleneck installed
user: agoodm (5179430) · state: closed · locked: 0 · comments: 6
created_at: 2018-08-15T23:18:41Z · updated_at: 2020-08-17T00:07:12Z · closed_at: 2020-08-17T00:07:12Z
author_association: CONTRIBUTOR

Code Sample, a copy-pastable example if possible

Data file used is here: test.nc.zip. Output from each statement is commented out.

```python
import xarray as xr

ds = xr.open_dataset('test.nc')

ds.cold_rad_cnts.min()
# 13038.

ds.cold_rad_cnts.max()
# 13143.

ds.cold_rad_cnts.mean()
# 12640.583984

ds.cold_rad_cnts.std()
# 455.035156

ds.cold_rad_cnts.sum()
# 4.472997e+10
```

Problem description

As you can see above, the mean falls outside the range of the data, and the standard deviation is nearly two orders of magnitude higher than it should be. This is because a significant loss of precision is occurring when using bottleneck's nansum() on data with a float32 dtype. I demonstrated this effect here: https://github.com/kwgoodman/bottleneck/issues/193.
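A minimal numpy-only sketch of the mechanism (an illustrative addition, not part of the original report; it mimics a sequential float32 accumulation with cumsum rather than calling bottleneck itself):

```python
import numpy as np

data = np.ones(2**25, dtype=np.float32)   # true sum is 33,554,432

# numpy's sum/nansum uses pairwise summation, so the float32 result stays exact here
print(np.nansum(data))                     # 33554432.0

# a naive left-to-right float32 accumulation (cumsum is sequential) stalls at 2**24,
# because adding 1.0 can no longer change a float32 of that magnitude
print(data.cumsum(dtype=np.float32)[-1])   # 16777216.0
```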

Naturally, this means that converting the data to float64 or any int dtype gives the correct result, as does using numpy's built-in functions instead or uninstalling bottleneck. An example is shown below.

Expected Output

```python
In [8]: import numpy as np

In [9]: np.nansum(ds.cold_rad_cnts)
Out[9]: 46357123000.0

In [10]: np.nanmean(ds.cold_rad_cnts)
Out[10]: 13100.413

In [11]: np.nanstd(ds.cold_rad_cnts)
Out[11]: 8.158843
```
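A minimal sketch of the dtype-conversion workaround mentioned above (illustrative addition, not from the original report):

```python
import xarray as xr

ds = xr.open_dataset('test.nc')
# Casting to float64 before reducing avoids the float32 accumulation error,
# whether or not bottleneck is installed
ds.cold_rad_cnts.astype('float64').sum()
```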

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.0
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.2
distributed: 1.22.1
matplotlib: None
cartopy: None
seaborn: None
setuptools: 40.0.0
pip: 10.0.1
conda: None
pytest: None
IPython: 6.5.0
sphinx: None

Unfortunately this will probably not be fixed downstream anytime soon, so I think it would be nice if xarray provided some sort of automatic workaround rather than requiring me to remember to manually convert my data whenever it is float32. Making float64 the default (as discussed in #2304) would be nice, but it might also be good to at least emit a warning whenever bottleneck's nansum() is used on float32 arrays.
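A minimal sketch of the warning idea, added only to illustrate the suggestion; the wrapper name is hypothetical and this is not xarray or bottleneck code:

```python
import warnings

import bottleneck as bn
import numpy as np

def nansum_warning_on_float32(arr):
    # Hypothetical wrapper: warn before delegating to bottleneck on float32 input
    arr = np.asarray(arr)
    if arr.dtype == np.float32:
        warnings.warn("bottleneck nansum on float32 data may lose precision; "
                      "consider casting to float64 first.")
    return bn.nansum(arr)
```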

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/2370/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: xarray (13221727) · type: issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
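
The filtered view above ("3 rows where user = 5179430 sorted by updated_at descending") presumably corresponds to a query along these lines against this schema (shown only as an illustration):

```sql
select *
from [issues]
where [user] = 5179430
order by [updated_at] desc;
```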