issues


8 rows (5 issues, 3 pull requests) in pydata/xarray where state = "open" and user = 39069044 (slevang), sorted by updated_at descending

#6799 `interp` performance with chunked dimensions · issue · slevang · open · 9 comments · created 2022-07-17 · updated 2024-04-26

What is your issue?

I'm trying to perform 2D interpolation on a large 3D array that is heavily chunked along the interpolation dimensions and not the third dimension. The application could be extracting a timeseries from a reanalysis dataset chunked in space but not time, to compare to observed station data with more precise coordinates.

I use the advanced interpolation method as described in the documentation, with the interpolation coordinates specified by DataArrays sharing a common dimension, like so:

```python
%load_ext memory_profiler
import numpy as np
import dask.array as da
import xarray as xr

# Synthetic dataset chunked in the two interpolation dimensions
nt = 40000
nx = 200
ny = 200
ds = xr.Dataset(
    data_vars = {
        'foo': (('t', 'x', 'y'), da.random.random(size=(nt, nx, ny), chunks=(-1, 10, 10))),
    },
    coords = {
        't': np.linspace(0, 1, nt),
        'x': np.linspace(0, 1, nx),
        'y': np.linspace(0, 1, ny),
    },
)

# Interpolate to some random 2D locations
ni = 10
xx = xr.DataArray(np.random.random(ni), dims='z', name='x')
yy = xr.DataArray(np.random.random(ni), dims='z', name='y')
interpolated = ds.foo.interp(x=xx, y=yy)
%memit interpolated.compute()
```

With just 10 interpolation points, this example calculation uses about 1.5 * ds.nbytes of memory, and saturates around 2 * ds.nbytes by about 100 interpolation points.

This could definitely work better, as each interpolated point usually only requires a single chunk of the input dataset, and at most 4 if it is right on the corner of a chunk. For example we can instead do it in a loop and get very reasonable memory usage, but this isn't very scalable:

```python
interpolated = []
for n in range(ni):
    interpolated.append(ds.foo.interp(x=xx.isel(z=n), y=yy.isel(z=n)))
interpolated = xr.concat(interpolated, dim='z')
%memit interpolated.compute()
```

I tried adding a `.chunk({'z': 1})` to the interpolation coordinates but this doesn't help. We can also do `.sel(x=xx, y=yy, method='nearest')` with very good performance, as shown below.
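For reference, the nearest-neighbor path mentioned above is just a vectorized selection; a minimal sketch reusing `ds`, `xx`, and `yy` from the example:

```python
# Pointwise nearest-neighbor selection: each target point only needs
# the chunk containing it, so memory usage stays modest.
nearest = ds.foo.sel(x=xx, y=yy, method='nearest')
%memit nearest.compute()
```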

Any tips to make this calculation work better with existing options, or ideas for how we might improve the `interp` method to handle this case? Given the performance behavior, I'm guessing we may be doing sequential interpolation over the dimensions: basically an `interp1d` call across all the `xx` points, and from there another to the `yy` points, which for even a small number of points would require nearly all chunks to be loaded. But I haven't explored the code enough yet to understand the details.

#8904 Handle extra indexes for zarr region writes · pull request · slevang · open · 8 comments · created 2024-04-02 · updated 2024-04-03
  • [x] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst

Small follow up to #8877. If we're going to drop the indices anyway for region writes, we may as well not raise if they are still in the dataset. This makes the user experience of region writes simpler:

```python
ds = xr.tutorial.open_dataset("air_temperature")
ds.to_zarr("test.zarr")
region = {"time": slice(0, 10)}

# This fails unless we remember to ds.drop_vars(["lat", "lon"])
ds.isel(**region).to_zarr("test.zarr", region=region)
```

I find this annoying because I often have a dataset with a bunch of unrelated indexes and have to remember which ones to drop, or use some verbose set logic (sketched below). I thought #8877 might have already done this, but not quite. By just reordering the point at which we drop indices, we can now skip this step. We still raise if data vars are passed that don't overlap with the region.
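For illustration, the verbose set logic this previously forced on the user might look like the following sketch (reusing `ds` and `region` from the example above; the `drop` variable is just for the example):

```python
# Drop every coordinate whose dimensions don't intersect the region
# being written, so the region write doesn't raise.
drop = [name for name, coord in ds.coords.items()
        if not set(coord.dims) & set(region)]
ds.isel(**region).drop_vars(drop).to_zarr("test.zarr", region=region)
```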

cc @dcherian

Reactions: 👍 1

#8725 `broadcast_like()` doesn't copy chunking structure · issue · slevang · open · 2 comments · created 2024-02-09 · updated 2024-03-26

What is your issue?

```python
import dask.array
import xarray as xr

da1 = xr.DataArray(dask.array.ones((3, 3), chunks=(1, 1)), dims=["x", "y"])
da2 = xr.DataArray(dask.array.ones((3,), chunks=(1,)), dims=["x"])

da2.broadcast_like(da1).chunksizes
# Frozen({'x': (1, 1, 1), 'y': (3,)})
```

I was surprised not to find any other issues around this. It feels like a major limitation of the method for a lot of use cases. Is there an easy hack around this?
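One possible hack, as a sketch: re-chunk the broadcast result against the template's chunk sizes, so the broadcast dimensions pick up the template's chunk structure:

```python
# Broadcast, then reapply da1's chunking; without the .chunk() call
# the new 'y' dimension comes back as a single chunk.
result = da2.broadcast_like(da1).chunk(da1.chunksizes)
result.chunksizes
# Frozen({'x': (1, 1, 1), 'y': (1, 1, 1)})
```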

Reactions: 👍 1

#8129 Sort the values of an nD array · issue · slevang · open · 11 comments · created 2023-08-31 · updated 2023-09-01

Is your feature request related to a problem?

As far as I know, there is no straightforward API in xarray to do what np.sort or pandas.sort_values does. We have DataArray.sortby("x"), which will sort the array according to the coordinate itself. But if instead you want to sort the values of the array to be monotonic, you're on your own. There are probably a lot of ways we could do this, but I ended up with the couple line solution below after a little trial and error.

Describe the solution you'd like

Would there be interest in implementing a Dataset/DataArray.sort_values(dim="x") method?

Note: this 1D example is not really relevant; see the 2D version and more obvious implementation in comments below for what I really want.

```python
def sort_values(self, dim: str):
    sort_idx = self.argsort(axis=self.get_axis_num(dim)).drop_vars(dim)
    return self.isel({dim: sort_idx}).drop_vars(dim).assign_coords({dim: self[dim]})
```

The goal is to handle arrays whose values we want to make monotonic, like so:

```python
da = xr.DataArray([1, 3, 2, 4], coords={"x": [1, 2, 3, 4]})
da.sort_values("x")

<xarray.DataArray (x: 4)>
array([1, 2, 3, 4])
Coordinates:
  * x        (x) int64 1 2 3 4
```

This is in addition to sortby, which can deal with an array that is merely unordered according to its coordinate:

```python
da = xr.DataArray([1, 3, 2, 4], coords={"x": [1, 3, 2, 4]})
da.sortby("x")

<xarray.DataArray (x: 4)>
array([1, 2, 3, 4])
Coordinates:
  * x        (x) int64 1 2 3 4
```

Describe alternatives you've considered

I don't know if argsort is dask-enabled (the docs just point to the numpy function). Is there a smarter way to implement this with apply_ufunc or something else? I assume chunking along the sort dimension would be problematic.
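One possibility, as a minimal sketch: wrap np.sort in apply_ufunc over the target dimension. This parallelizes over the other dimensions with dask, but assumes the array is not chunked along `dim` (the function name here is just illustrative):

```python
import numpy as np
import xarray as xr

def sort_values(obj, dim):
    # Sort values along `dim`; dask-friendly via apply_ufunc as long
    # as the array is not chunked along the sorted dimension.
    sorted_ = xr.apply_ufunc(
        np.sort,
        obj,
        input_core_dims=[[dim]],
        output_core_dims=[[dim]],
        kwargs={"axis": -1},
        dask="parallelized",
        output_dtypes=[obj.dtype],
    )
    # apply_ufunc drops coordinates along core dims; reattach `dim`.
    return sorted_.assign_coords({dim: obj[dim]})
```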

Additional context

Some past related threads on this topic:

  • https://github.com/pydata/xarray/issues/3957
  • https://stackoverflow.com/questions/64518239/sorting-dataset-along-axis-with-dask

Reactions: 👍 1

#7130 Passing keyword arguments to external functions · issue · slevang · open · 3 comments · created 2022-10-05 · updated 2023-03-26

What is your issue?

Follow on from #6891 and #6978 to discuss how we could homogenize the passing of keyword arguments to wrapped external functions across xarray methods.

There are quite a few methods like this where we are ultimately passing data to numpy, scipy, or some other library and want the option to send variable length kwargs to that underlying function. There are two different ways of doing this today:

  1. xarray method accepts flexible **kwargs so these can be written directly in the xarray call
  2. xarray method accepts a single dict kwargs (sometimes named differently) and passes these along in expanded form via **kwargs

I could only find a few examples of the latter:

  • Dataset.interp, which takes kwargs
  • Dataset.curvefit, which takes kwargs (although the docstring is wrong here)
  • xr.apply_ufunc, which takes kwargs passed to func and dask_gufunc_kwargs passed to dask.array.apply_gufunc
  • xr.open_dataset, which takes either **kwargs or backend_kwargs and merges the two

Allowing direct passage with **kwargs seems nice from a user perspective. But this could occasionally be problematic, for example in the Dataset.interp case, where the method also accepts the kwarg form of coords via **coords_kwargs. There are many methods like this that use **indexers_kwargs or **chunks_kwargs with either_dict_or_kwargs but don't happen to wrap external functions.
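To make the collision concrete with Dataset.interp, which uses the dict style today (a sketch, assuming some dataset `ds` with a numeric `x` coordinate):

```python
# Pattern 2 (current interp API): keyword arguments destined for the
# scipy interpolator travel in a dict, because bare keywords are
# claimed by **coords_kwargs as coordinate shorthand.
ds.interp(x=[0.25, 0.75], kwargs={"fill_value": "extrapolate"})

# A hypothetical pattern-1 signature would accept them directly,
#   ds.interp(x=[0.25, 0.75], fill_value="extrapolate")
# but then a coordinate that happens to be named "fill_value" could
# no longer be passed through **coords_kwargs.
```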

#7522 Differences in `to_netcdf` for dask and numpy backed arrays · issue · slevang · open · 7 comments · created 2023-02-11 · updated 2023-03-01

What is your issue?

I make use of fsspec to quickly open netcdf files in the cloud and pull out slices of data without needing to read the entire file. Quick and dirty is just `ds = xr.open_dataset(fs.open("gs://..."))`.

This works great, in that a many-GB file can be lazy-loaded as a dataset in a few hundred milliseconds, by only parsing the netcdf headers with under-the-hood byte range requests. But only if the netcdf is written from dask-backed arrays. Somehow, writing from numpy-backed arrays produces a different netcdf that requires reading deeper into the file to parse as a dataset.

I spent some time digging into the backends and see xarray is ultimately passing off the store write to dask.array here. A look at ncdump and Dataset.encoding didn't reveal any obvious differences between these files, but there is clearly something. Anyone know why the straight xarray store methods would produce a different netcdf structure, despite the underlying data and encoding being identical?

This should work as an MCVE:

```python
import os
import string
import fsspec
import numpy as np
import xarray as xr

fs = fsspec.filesystem("gs")
bucket = "gs://<your-bucket>"

# create a ~160MB dataset with 20 variables
variables = {
    v: (["x", "y"], np.random.random(size=(1000, 1000)))
    for v in string.ascii_letters[:20]
}
ds = xr.Dataset(variables)

# Save one version from numpy backed arrays and one from dask backed arrays
ds.compute().to_netcdf("numpy.nc")
ds.chunk().to_netcdf("dask.nc")

# Copy these to a bucket of your choice
fs.put("numpy.nc", bucket)
fs.put("dask.nc", bucket)
```

Then time reading in these files as datasets with fsspec:

```python
%timeit xr.open_dataset(fs.open(os.path.join(bucket, "numpy.nc")))
# 2.15 s ± 40.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```python
%timeit xr.open_dataset(fs.open(os.path.join(bucket, "dask.nc")))
# 187 ms ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Reactions: 👀 1

#6978 fix passing of curvefit kwargs · pull request · slevang · open · 5 comments · created 2022-09-01 · updated 2022-10-11
  • [x] Closes #6891
  • [x] Tests added
  • [x] User visible changes (including notable bug fixes) are documented in whats-new.rst

#5933 Reimplement `.polyfit()` with `apply_ufunc` · pull request · slevang · open · 6 comments · created 2021-11-03 · updated 2022-10-06
  • [x] Closes #4554
  • [x] Closes #5629
  • [x] Closes #5644
  • [ ] Tests added
  • [x] Passes pre-commit run --all-files
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst

Reimplement polyfit using apply_ufunc rather than dask.array.linalg.lstsq. This should solve a number of issues with memory usage and chunking that were reported on the current version of polyfit. The main downside is that variables chunked along the fitting dimension cannot be handled with this approach.
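A rough sketch of the general shape of the approach (not the PR's actual code; `polyfit_coeffs` is an illustrative name, and there is no nan-skipping or full/cov handling here):

```python
import numpy as np
import xarray as xr

def polyfit_coeffs(da, dim, deg):
    x = da[dim].values

    def _fit(y):
        # y arrives with `dim` as the last axis; np.polyfit fits 2D
        # input column-wise, so flatten the broadcast dims to columns.
        shape = y.shape
        cols = y.reshape(-1, shape[-1]).T      # (n, npoints)
        coeffs = np.polyfit(x, cols, deg)      # (deg + 1, npoints)
        return coeffs.T.reshape(shape[:-1] + (deg + 1,))

    return xr.apply_ufunc(
        _fit,
        da,
        input_core_dims=[[dim]],
        output_core_dims=[["degree"]],
        dask="parallelized",  # fails if `da` is chunked along `dim`
        dask_gufunc_kwargs={"output_sizes": {"degree": deg + 1}},
        output_dtypes=[np.float64],
    ).assign_coords(degree=np.arange(deg, -1, -1))
```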

There is a bunch of fiddly code here for handling the differing outputs from np.polyfit depending on the values of the full and cov args. Depending on the performance implications, we could simplify some by keeping these in apply_ufunc and dropping later. Much of this parsing would still be required though, because the only way to get the covariances is to set cov=True, full=False.

A few minor departures from the previous implementation:

  1. The rank and singular_values diagnostic variables returned by np.polyfit are now returned on a pointwise basis, since these can change depending on skipped nans. np.polyfit also returns the rcond used for each fit, which I've included here.
  2. As mentioned above, this breaks fitting done along a chunked dimension. To avoid regression, we could set allow_rechunk=True and warn about memory implications.
  3. Changed the default to skipna=True, since the previous behavior seemed to be a limitation of the computational method.
  4. For consistency with the previous version, I included a transpose operation to put degree as the first dimension. This is arbitrary though, and actually the opposite of the ordering curvefit returns. We could match curvefit, but that would be breaking for polyfit.

No new tests have been added since the previous suite was fairly comprehensive. Would be great to get some performance reports on real-world data such as the climate model detrending application in #5629.

Reactions: 👍 1 ❤️ 1
