issues
3 rows where user = 5179430 sorted by updated_at descending
id: 1563270549
node_id: PR_kwDOAMm_X85I2_Fl
number: 7494
title: Update contains_cftime_datetimes to avoid loading entire variable array
user: agoodm 5179430
state: closed
locked: 0
comments: 8
created_at: 2023-01-30T21:54:35Z
updated_at: 2023-03-07T16:22:24Z
closed_at: 2023-03-07T16:10:30Z
author_association: CONTRIBUTOR
draft: 0
pull_request: pydata/xarray/pulls/7494
repo: xarray 13221727
type: pull
reactions: { "url": "https://api.github.com/repos/pydata/xarray/issues/7494/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }
body:

This PR greatly improves the performance of opening datasets with large arrays of object type (typically string arrays), since …

@Illviljan, continuing our discussion from the issue thread, I did try to pass in …
id: 1561508426
node_id: I_kwDOAMm_X85dErpK
number: 7484
title: Opening datasets with large object dtype arrays is very slow
user: agoodm 5179430
state: closed
locked: 0
comments: 3
created_at: 2023-01-29T23:31:51Z
updated_at: 2023-03-07T16:10:31Z
closed_at: 2023-03-07T16:10:31Z
author_association: CONTRIBUTOR
state_reason: completed
repo: xarray 13221727
type: issue
reactions: { "url": "https://api.github.com/repos/pydata/xarray/issues/7484/reactions", "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 1, "rocket": 0, "eyes": 0 }
body:

What is your issue?

Opening a dataset with a very large array with object dtype is much slower than it should be. I initially noticed this when working with a dataset spanned by around 24000 netCDF files. I have been using kerchunk references to load them with a consolidated metadata key, so I was expecting it to be fairly quick to open, but it actually took several minutes. I realized that all of this time was spent on one variable consisting of strings: when dropping it, the whole dataset opens in seconds. Sharing that dataset would be a bit difficult, so instead I will illustrate the problem with a simple, easy-to-reproduce example using the latest released versions of xarray and zarr:

```python
import numpy as np
import xarray as xr

str_array = np.arange(100000000).astype(str)
ds = xr.DataArray(dims=('x',), data=str_array).to_dataset(name='str_array')
ds['str_array'] = ds.str_array.astype('O')  # Needs to actually be object dtype to show the problem
ds.to_zarr('str_array.zarr')

%time xr.open_zarr('str_array.zarr/')
CPU times: user 8.24 s, sys: 5.23 s, total: 13.5 s
Wall time: 12.9 s
```

I did some digging and found that pretty much all the time was spent on the check done by `contains_cftime_datetimes`. This operation is not lazy and ends up requiring every single chunk of this variable to be opened, all for the sake of checking the very first element in the entire array. A quick fix I tried is updating the check so that it inspects only the first element instead of loading the whole array, which drastically reduced the time to open the dataset, as expected.

I would like to make a PR with this change, but I realize that it could affect every backend, and although I have been using xarray for many years this would be my first contribution, so I would like to briefly discuss it in case there are better ways to address the issue. Thanks!
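For illustration only, here is a minimal sketch of the sampling idea described in this issue. It is not xarray's actual implementation, and the helper name `first_element_is_cftime` is made up: the point is that inspecting a single element of a possibly lazy (dask-backed) array touches at most one chunk instead of all of them.

```python
import numpy as np


def first_element_is_cftime(array) -> bool:
    """Hypothetical helper (not xarray's real code): test only the first
    element of an object-dtype array for cftime datetimes."""
    try:
        import cftime
    except ImportError:
        return False
    if array.dtype != object or array.size == 0:
        return False
    # Index a single element; on a dask-backed array this selection stays
    # lazy, and computing it loads at most one chunk.
    sample = array[(0,) * array.ndim]
    sample = np.asarray(sample).item()
    return isinstance(sample, cftime.datetime)
```

For a 100-million-element object array like the reproducer above, a check of this shape reads one chunk rather than forcing the whole variable into memory.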
id: 351000813
node_id: MDU6SXNzdWUzNTEwMDA4MTM=
number: 2370
title: Inconsistent results when calculating sums on float32 arrays w/ bottleneck installed
user: agoodm 5179430
state: closed
locked: 0
comments: 6
created_at: 2018-08-15T23:18:41Z
updated_at: 2020-08-17T00:07:12Z
closed_at: 2020-08-17T00:07:12Z
author_association: CONTRIBUTOR
state_reason: completed
repo: xarray 13221727
type: issue
reactions: { "url": "https://api.github.com/repos/pydata/xarray/issues/2370/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }
body:

Code Sample, a copy-pastable example if possible

Data file used is here: test.nc.zip

Output from each statement is commented out.

```python
import xarray as xr

ds = xr.open_dataset('test.nc')

ds.cold_rad_cnts.min()
# 13038.

ds.cold_rad_cnts.max()
# 13143.

ds.cold_rad_cnts.mean()
# 12640.583984

ds.cold_rad_cnts.std()
# 455.035156

ds.cold_rad_cnts.sum()
# 4.472997e+10
```

Problem description

As you can see above, the mean falls outside the range of the data, and the standard deviation is nearly two orders of magnitude higher than it should be. This is because a significant loss of precision occurs when using bottleneck's float32 reductions, which accumulate in single precision. Naturally, this means that converting the data to float64 gives the expected results.

Expected Output

```python
In [8]: import numpy as np

In [9]: np.nansum(ds.cold_rad_cnts)
Out[9]: 46357123000.0

In [10]: np.nanmean(ds.cold_rad_cnts)
Out[10]: 13100.413

In [11]: np.nanstd(ds.cold_rad_cnts)
Out[11]: 8.158843
```

Output of `xr.show_versions()` …
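For context on the numerics in this issue, here is a small self-contained demonstration of why a sequential float32 accumulation drifts while NumPy's pairwise summation stays accurate. The array below is synthetic (values chosen to be roughly the magnitude of `cold_rad_cnts`), not the issue's data:

```python
import numpy as np

# ~3.5 million float32 values around 13100, similar in scale to the data
# in the issue (these values are invented for the demonstration).
x = np.full(3_500_000, 13100.0, dtype=np.float32)

# Sequential accumulation in float32, which is what a naive single-pass
# reduction does: once the running total is large, each new 13100.0 is
# partially rounded away and the result drifts from the true sum.
naive = x.cumsum(dtype=np.float32)[-1]

# np.sum uses pairwise summation, which keeps intermediate partial sums
# small and stays close to the true value even in float32.
pairwise = x.sum(dtype=np.float32)

# Accumulating in float64 sidesteps the problem entirely.
reference = x.sum(dtype=np.float64)

print(naive, pairwise, reference)
# The sequential float32 total is typically off by far more than the
# pairwise one, mirroring the xarray/bottleneck vs. np.nansum mismatch.
```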
CREATE TABLE [issues] (
    [id] INTEGER PRIMARY KEY,
    [node_id] TEXT,
    [number] INTEGER,
    [title] TEXT,
    [user] INTEGER REFERENCES [users]([id]),
    [state] TEXT,
    [locked] INTEGER,
    [assignee] INTEGER REFERENCES [users]([id]),
    [milestone] INTEGER REFERENCES [milestones]([id]),
    [comments] INTEGER,
    [created_at] TEXT,
    [updated_at] TEXT,
    [closed_at] TEXT,
    [author_association] TEXT,
    [active_lock_reason] TEXT,
    [draft] INTEGER,
    [pull_request] TEXT,
    [body] TEXT,
    [reactions] TEXT,
    [performed_via_github_app] TEXT,
    [state_reason] TEXT,
    [repo] INTEGER REFERENCES [repos]([id]),
    [type] TEXT
);
CREATE INDEX [idx_issues_repo] ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone] ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee] ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user] ON [issues] ([user]);
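As a usage sketch, the filter behind this page ("user = 5179430, sorted by updated_at descending") maps onto a straightforward query against this schema. The database path `github.db` below is an assumption, not something given on this page:

```python
import sqlite3

# Hypothetical local copy of the issues database this page is served from.
conn = sqlite3.connect("github.db")

rows = conn.execute(
    """
    SELECT id, number, title, state, updated_at
    FROM issues
    WHERE [user] = ?
    ORDER BY updated_at DESC
    """,
    (5179430,),
).fetchall()

for id_, number, title, state, updated_at in rows:
    print(id_, number, state, updated_at, title)

conn.close()
```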