issues


4 rows where state = "open", type = "issue" and user = 13662783 sorted by updated_at descending

Issue #6377 · [FEATURE]: Add a replace method · opened by Huite (13662783) · open · 8 comments · created 2022-03-18T11:46:37Z · updated 2023-06-25T07:52:46Z · CONTRIBUTOR · xarray

Is your feature request related to a problem?

If I have a DataArray of values:

```python
import xarray as xr

da = xr.DataArray([0, 1, 2, 3, 4, 5])
```

And I'd like to replace to_replace=[1, 3, 5] with value=[10, 30, 50], but there is no da.replace(to_replace, value) method to do this.

There's no easy way to do this like there is in pandas (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html).
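For comparison, a minimal illustration of the pandas behaviour I mean (using Series.replace, which shares the DataFrame.replace interface):

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])
print(s.replace([1, 3, 5], [10, 30, 50]).tolist())
# [0, 10, 2, 30, 4, 50]
```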

(Apologies if I've missed related issues, searching for "replace" gives many hits as the word is obviously used quite often.)

Describe the solution you'd like

```python
da = xr.DataArray([0, 1, 2, 3, 4, 5])
replaced = da.replace([1, 3, 5], [10, 30, 50])
print(replaced)
```

```
<xarray.DataArray (dim_0: 6)>
array([ 0, 10,  2, 30,  4, 50])
Dimensions without coordinates: dim_0
```

I've had a try at a relatively efficient implementation below. I'm wondering whether it's a worthwhile addition to xarray?

Describe alternatives you've considered

Ignoring issues such as dealing with NaNs, chunks, etc., a simple dict lookup:

```python
import numpy as np

def dict_replace(da, to_replace, value):
    d = {k: v for k, v in zip(to_replace, value)}
    out = np.vectorize(lambda x: d.get(x, x))(da.values)
    return da.copy(data=out)
```

Alternatively, leveraging pandas:

```python
import pandas as pd

def pandas_replace(da, to_replace, value):
    df = pd.DataFrame()
    df["values"] = da.values.ravel()
    df["values"].replace(to_replace, value, inplace=True)
    return da.copy(data=df["values"].values.reshape(da.shape))
```

But I also tried my hand at a custom implementation, letting np.unique do the heavy lifting:

```python
def custom_replace(da, to_replace, value):
    # Use np.unique to create an inverse index
    flat = da.values.ravel()
    uniques, index = np.unique(flat, return_inverse=True)
    replaceable = np.isin(flat, to_replace)

    # Create a replacement array in which there is a 1:1 relation between
    # uniques and the replacement values, so that we can use the inverse index
    # to select replacement values.
    valid = np.isin(to_replace, uniques, assume_unique=True)
    # Remove to_replace values that are not present in da. If no overlap
    # exists between to_replace and the values in da, just return a copy.
    if not valid.any():
        return da.copy()
    to_replace = to_replace[valid]
    value = value[valid]

    replacement = np.zeros_like(uniques)
    replacement[np.searchsorted(uniques, to_replace)] = value

    out = flat.copy()
    out[replaceable] = replacement[index[replaceable]]
    return da.copy(data=out.reshape(da.shape))
```

Such an approach seems like it's consistently the fastest:

```python
da = xr.DataArray(np.random.randint(0, 100, 100_000))
to_replace = np.random.choice(np.arange(100), 10, replace=False)
value = to_replace * 200

test1 = custom_replace(da, to_replace, value)
test2 = pandas_replace(da, to_replace, value)
test3 = dict_replace(da, to_replace, value)

assert test1.equals(test2)
assert test1.equals(test3)

# 6.93 ms ± 295 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit custom_replace(da, to_replace, value)

# 9.37 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pandas_replace(da, to_replace, value)

# 26.8 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit dict_replace(da, to_replace, value)
```

With the advantage growing as the number of values involved increases:

```python
da = xr.DataArray(np.random.randint(0, 10_000, 100_000))
to_replace = np.random.choice(np.arange(10_000), 10_000, replace=False)
value = to_replace * 200

test1 = custom_replace(da, to_replace, value)
test2 = pandas_replace(da, to_replace, value)
test3 = dict_replace(da, to_replace, value)

assert test1.equals(test2)
assert test1.equals(test3)

# 21.6 ms ± 990 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit custom_replace(da, to_replace, value)

# 3.12 s ± 574 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pandas_replace(da, to_replace, value)

# 42.7 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit dict_replace(da, to_replace, value)
```

In my real-life example, a DataArray of approx. 110 000 elements with 60 000 values to replace, the custom one takes 33 ms, the dict one takes 135 ms, while pandas takes 26 s (!).

Additional context

In all cases, we need to deal with NaNs, check the input, etc.:

```python
from typing import Any

import numpy as np
import xarray as xr


def replace(da: xr.DataArray, to_replace: Any, value: Any):
    from xarray.core.utils import is_scalar

    if is_scalar(to_replace):
        if not is_scalar(value):
            raise TypeError("if to_replace is scalar, then value must be a scalar")
        if np.isnan(to_replace):
            return da.fillna(value)
        else:
            return da.where(da != to_replace, other=value)
    else:
        to_replace = np.asarray(to_replace)
        if to_replace.ndim != 1:
            raise ValueError("to_replace must be 1D or scalar")
        if is_scalar(value):
            value = np.full_like(to_replace, value)
        else:
            value = np.asarray(value)
            if to_replace.shape != value.shape:
                raise ValueError(
                    f"Replacement arrays must match in shape. "
                    f"Expecting {to_replace.shape} got {value.shape} "
                )

    _, counts = np.unique(to_replace, return_counts=True)
    if (counts > 1).any():
        raise ValueError("to_replace contains duplicates")

    # Replace NaN values separately, as they will show up as separate values
    # from numpy.unique.
    isnan = np.isnan(to_replace)
    if isnan.any():
        i = np.nonzero(isnan)[0]
        da = da.fillna(value[i])

    # Use np.unique to create an inverse index
    flat = da.values.ravel()
    uniques, index = np.unique(flat, return_inverse=True)
    replaceable = np.isin(flat, to_replace)

    # Create a replacement array in which there is a 1:1 relation between
    # uniques and the replacement values, so that we can use the inverse index
    # to select replacement values.
    valid = np.isin(to_replace, uniques, assume_unique=True)
    # Remove to_replace values that are not present in da. If no overlap
    # exists between to_replace and the values in da, just return a copy.
    if not valid.any():
        return da.copy()
    to_replace = to_replace[valid]
    value = value[valid]

    replacement = np.zeros_like(uniques)
    replacement[np.searchsorted(uniques, to_replace)] = value

    out = flat.copy()
    out[replaceable] = replacement[index[replaceable]]
    return da.copy(data=out.reshape(da.shape))
```
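As a quick sanity check of the expected behaviour (a small example of my own, assuming the replace function above is defined):

```python
import xarray as xr

da = xr.DataArray([0, 1, 2, 3, 4, 5])
print(replace(da, [1, 3, 5], [10, 30, 50]).values)
# expected: [ 0 10  2 30  4 50]
```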

I think it should be easy to let it operate on the numpy arrays so that e.g. apply_ufunc will work. The primary issue is whether the values can be sorted; in such a case the dict lookup might be an okay fallback? I've had a peek at the pandas implementation, but didn't become much wiser.
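To illustrate the apply_ufunc route, a rough sketch (purely illustrative: the name replace_via_apply_ufunc is made up, and the simple dict-lookup core stands in for any of the implementations above):

```python
import numpy as np
import xarray as xr

def _replace_numpy(values, to_replace, value):
    # Stand-in core operating on bare numpy arrays; any of the
    # implementations above could be dropped in here instead.
    lookup = dict(zip(to_replace, value))
    return np.vectorize(lambda x: lookup.get(x, x))(values)

def replace_via_apply_ufunc(da, to_replace, value):
    return xr.apply_ufunc(
        _replace_numpy,
        da,
        kwargs={"to_replace": np.asarray(to_replace), "value": np.asarray(value)},
        dask="parallelized",          # element-wise, so blockwise application is fine
        output_dtypes=[da.dtype],
    )
```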

Anyway, for your consideration! I'd be happy to submit a PR.

Reactions: 👍 9

Issue #4076 · Zarr ZipStore versus DirectoryStore: ZipStore requires .close() · opened by Huite (13662783) · open · 4 comments · created 2020-05-18T19:58:21Z · updated 2022-04-28T22:37:48Z · CONTRIBUTOR · xarray

I was saving my dataset into a ZipStore -- apparently successfully -- but then I couldn't reopen it.

The issue appears to be that a regular DirectoryStore behaves a little differently: it doesn't need to be closed, while a ZipStore does.

(I'm not sure how this relates to #2586, the remarks there don't appear to be applicable anymore.)

MCVE Code Sample

This errors:

```python
import xarray as xr
import zarr

# works as expected
ds = xr.Dataset({'foo': [2,3,4], 'bar': ('x', [1, 2]), 'baz': 3.14})
ds.to_zarr(zarr.DirectoryStore("test.zarr"))
print(xr.open_zarr(zarr.DirectoryStore("test.zarr")))

# errors with ValueError "group not found at path ''"
ds.to_zarr(zarr.ZipStore("test.zip"))
print(xr.open_zarr(zarr.ZipStore("test.zip")))
```

Calling close, or using a with block, does the trick:

```python
store = zarr.ZipStore("test2.zip")
ds.to_zarr(store)
store.close()
print(xr.open_zarr(zarr.ZipStore("test2.zip")))

with zarr.ZipStore("test3.zip") as store:
    ds.to_zarr(store)
print(xr.open_zarr(zarr.ZipStore("test3.zip")))
```

Expected Output

I think it would be preferable to close the ZipStore in this case. But I might be missing something?

Problem Description

Because to_zarr works in this situation with a DirectoryStore, it's easy to assume a ZipStore will work similarly. However, I couldn't get it to read my data back in this case.
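For what it's worth, the same behaviour seems reproducible with zarr alone, which suggests the zip's central directory is only written when the store is closed (a minimal sketch of my own, assuming zarr's v2 ZipStore API; "demo.zip" is just an example file name):

```python
import zarr

# Write a tiny group into a ZipStore without closing it first.
store = zarr.ZipStore("demo.zip", mode="w")
root = zarr.group(store=store)
root.zeros("x", shape=(3,))

# Reading the archive back before close() fails; after close() it works.
store.close()
print(zarr.open_group(zarr.ZipStore("demo.zip", mode="r")).tree())
```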

Versions

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 21:48:41) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
libhdf5: 1.10.5
libnetcdf: 4.7.3

xarray: 0.15.2.dev41+g8415eefa.d20200419
pandas: 0.25.3
numpy: 1.17.5
scipy: 1.3.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.2
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.14.0+23.gbea4c9a2
distributed: 2.14.0
matplotlib: 3.1.2
cartopy: None
seaborn: 0.10.0
numbagg: None
pint: None
setuptools: 46.1.3.post20200325
pip: 20.0.2
conda: None
pytest: 5.3.4
IPython: 7.13.0
```

Reactions: none

Issue #2587 · DataArray constructor still coerces to np.datetime64[ns], not cftime in 0.11.0 · opened by Huite (13662783) · open · 3 comments · created 2018-12-02T20:34:36Z · updated 2022-04-18T16:06:12Z · CONTRIBUTOR · xarray

Code Sample

```python
import xarray as xr
import numpy as np
from datetime import datetime

time = [np.datetime64(datetime.strptime("10000101", "%Y%m%d"))]
print(time[0])
print(np.dtype(time[0]))

da = xr.DataArray(time, ("time",), {"time": time})
print(da)
```

Results in:

```
1000-01-01T00:00:00.000000
datetime64[us]

<xarray.DataArray (time: 1)>
array(['2169-02-08T23:09:07.419103232'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2169-02-08T23:09:07.419103232
```

Problem description

I was happy to see cftime as default in the release notes for 0.11.0:

> Xarray will now always use cftime.datetime objects, rather than by default trying to coerce them into np.datetime64[ns] objects. A CFTimeIndex will be used for indexing along time coordinates in these cases.

However, it seems that the DataArray constructor does not use cftime (yet?), and coerces to np.datetime64[ns] here: https://github.com/pydata/xarray/blob/0d6056e8816e3d367a64f36c7f1a5c4e1ce4ed4e/xarray/core/variable.py#L183-L189

Expected Output

I think I'd expect cftime.datetime in this case as well. Some coercion happens anyway as pandas timestamps are turned into np.datetime64[ns].

(But perhaps this was already on your radar, and am I just a little too eager!)
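As an aside, a workaround for now is to build the coordinate from cftime objects directly, which keeps the values out of the datetime64[ns] range entirely (a sketch of my own, assuming cftime's DatetimeGregorian class):

```python
import xarray as xr
import cftime

# Construct the time values as cftime objects instead of np.datetime64.
time = [cftime.DatetimeGregorian(1000, 1, 1)]
da = xr.DataArray(time, ("time",), {"time": time})

# The values stay cftime objects (object dtype) rather than being coerced
# to np.datetime64[ns], and the time coordinate is indexed with a CFTimeIndex.
print(da.time.dtype)
```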

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

xarray: 0.11.0
pandas: 0.23.3
numpy: 1.15.3
scipy: 1.1.0
netCDF4: 1.3.1
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: None
cftime: 1.0.0
PseudonetCDF: None
rasterio: 1.0.0
iris: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.19.2
distributed: 1.23.2
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: 0.9.0
setuptools: 40.5.0
pip: 18.1
conda: None
pytest: 3.6.3
IPython: 6.4.0
sphinx: 1.7.5
```

Reactions: none

Issue #2947 · xr.merge always sorts indexes ascending · opened by Huite (13662783) · open · 2 comments · created 2019-05-07T17:06:06Z · updated 2019-05-07T21:07:26Z · CONTRIBUTOR · xarray

Code Sample, a copy-pastable example if possible

```python
import xarray as xr
import numpy as np

nrow, ncol = (4, 5)
dx, dy = (1.0, -1.0)
xmins = (0.0, 3.0, 3.0, 0.0)
xmaxs = (5.0, 8.0, 8.0, 5.0)
ymins = (0.0, 2.0, 0.0, 2.0)
ymaxs = (4.0, 6.0, 4.0, 6.0)
data = np.ones((nrow, ncol), dtype=np.float64)

das = []
for xmin, xmax, ymin, ymax in zip(xmins, xmaxs, ymins, ymaxs):
    kwargs = dict(
        name="example",
        dims=("y", "x"),
        coords={"y": np.arange(ymax, ymin, dy), "x": np.arange(xmin, xmax, dx)},
    )
    das.append(xr.DataArray(data, **kwargs))

xr.merge(das)

# This won't flip the coordinate:
xr.merge([das[0]])
```

Problem description

Let's say I have a number of geospatial grids that I'd like to merge (for example, loaded with xr.open_rasterio). To quote https://www.perrygeo.com/python-affine-transforms.html

> The typical geospatial coordinate reference system is defined on a cartesian plane with the 0,0 origin in the bottom left and X and Y increasing as you go up and to the right. But raster data, coming from its image processing origins, uses a different referencing system to access pixels. We refer to rows and columns with the 0,0 origin in the upper left and rows increase as you move down while the columns increase as you go right. Still a cartesian plane but not the same one.

xr.merge will always return the result with ascending coordinates, which creates some issues / confusion later on if you try to write it back to a GDAL format, for example (I've been scratching my head for some time looking at upside-down .tifs).

Expected Output

I think the expected output for these geospatial grids is that, if you provide only DataArrays with a positive dx and a negative dy, the merged result comes out with a positive dx and a negative dy as well.

When the DataArrays to merge are mixed in coordinate direction (some with ascending, some with descending coordinate values), defaulting to an ascending sort seems sensible.
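(As a side note, a workaround that works today is to re-sort the merged result afterwards; a small sketch of my own, reusing das and xr from the example above:)

```python
# Merge first, then restore the descending y coordinate expected by
# GDAL-style rasters.
merged = xr.merge(das)
merged = merged.sortby("y", ascending=False)
```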

A suggestion

I saw that the sort occurs here, in pandas, and that there's an is_monotonic_decreasing property on pandas.core.indexes.base.Index.

I think this could work (it solves my issue at least), in xarray.core.alignment:

```python
index = joiner(matching_indexes)
if all(
    matching_index.is_monotonic_decreasing
    for matching_index in matching_indexes
):
    index = index[::-1]
joined_indexes[dim] = index
```

But I lack the knowledge to say whether this plays nice in all cases. And does index[::-1] return a view or a copy? (And does it matter?)

For reference, this is what it looks like now:

```python
if (any(not matching_indexes[0].equals(other)
        for other in matching_indexes[1:])
        or dim in unlabeled_dim_sizes):
    if join == 'exact':
        raise ValueError(
            'indexes along dimension {!r} are not equal'
            .format(dim))
    index = joiner(matching_indexes)
    joined_indexes[dim] = index
else:
    index = matching_indexes[0]
```

It's also worth highlighting that the else branch causes, arguably, some inconsistency: if the indexes are equal, no reversal occurs.

Reactions: none

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);