issues

9 rows where repo = 13221727, state = "closed" and user = 13662783 sorted by updated_at descending

id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
1987770706 I_kwDOAMm_X852evlS 8440 nD integer indexing on dask data is very slow Huite 13662783 closed 0     2 2023-11-10T14:47:08Z 2023-11-12T04:56:23Z 2023-11-12T04:56:22Z CONTRIBUTOR      

What happened?

I ran into a situation where I was indexing into some chunked netCDF data with a 2D integer array. This indexing operation is extremely slow. Using a flat 1D index instead is as fast as expected.

What did you expect to happen?

I would expect indexing on dask data to be very quick since the work is delayed, and indeed it is in the 1D case. However, the 2D case is very slow -- slower than actually doing all the work with numpy arrays!

Minimal Complete Verifiable Example

```python
import dask.array
import numpy as np
import xarray as xr

# %%

da = xr.DataArray(
    data=np.random.rand(100, 1_000_000),
    dims=("time", "x"),
)
dask_da = xr.DataArray(
    data=dask.array.from_array(da.to_numpy(), chunks=(1, 1_000_000)),
    dims=("time", "x"),
)

indexer = np.random.randint(0, 1_000_000, size=100_000)
indexer2d = xr.DataArray(
    data=indexer.reshape((4, -1)),
    dims=("a", "b"),
)

# %%

%timeit da.isel(x=indexer)         # 162 ms
%timeit da.isel(x=indexer2d)       # 164 ms
%timeit dask_da.isel(x=indexer)    # 5.3 ms
%timeit dask_da.isel(x=indexer2d)  # 860 ms according to timeit, but 6 to 14 (!) seconds in interactive use
```
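A minimal workaround sketch (assuming flattening and reshaping is acceptable; this is an illustration, not a fix in xarray): index with the flattened 1D array, which stays fast, and reshape the result afterwards.

```python
# Workaround sketch (assumption, not an xarray-provided API): flatten the 2D
# indexer, do cheap 1D fancy indexing, then reshape the underlying data.
flat = indexer2d.values.ravel()
fast = dask_da.isel(x=flat)  # quick, like the 1D case above
result = xr.DataArray(
    data=fast.data.reshape(dask_da.sizes["time"], *indexer2d.shape),
    dims=("time", "a", "b"),
)
```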

MVCE confirmation

  • [x] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [x] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [x] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:34:57) [MSC v.1936 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 151 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('English_Netherlands', '1252')
libhdf5: 1.14.2
libnetcdf: 4.9.2
xarray: 2023.10.2.dev31+ge5d163a8.d20231110
pandas: 2.1.2
numpy: 1.26.0
scipy: 1.11.3
netCDF4: 1.6.5
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.3
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.10.1
distributed: 2023.10.1
matplotlib: 3.8.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.10.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.1.2
pip: 23.2.1
conda: 23.3.1
pytest: None
mypy: None
IPython: 8.17.2
sphinx: None
```
  completed xarray 13221727 issue
793245791 MDU6SXNzdWU3OTMyNDU3OTE= 4842 nbytes does not return the true size for sparse variables Huite 13662783 closed 0     2 2021-01-25T10:17:56Z 2022-07-22T17:25:33Z 2022-07-22T17:25:33Z CONTRIBUTOR      

This wasn't entirely surprising to me, but nbytes currently doesn't return the right value for sparse data -- at least, I think nbytes should show the actual size in memory?

Since it uses size here:

https://github.com/pydata/xarray/blob/a0c71c1508f34345ad7eef244cdbbe224e031c1b/xarray/core/variable.py#L349

Rather than something like data.nnz, which of course only exists for sparse arrays... I'm not sure if there's a sparse flag or something, or whether you'd have to do a typecheck?
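For illustration, a rough sketch of such a check (an assumption, not the actual xarray code):

```python
# Sketch only: prefer the wrapped array's own nbytes when it provides one
# (sparse.COO does), and fall back to size * itemsize otherwise.
@property
def nbytes(self):
    if hasattr(self._data, "nbytes"):
        return self._data.nbytes
    return self.size * self.dtype.itemsize
```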

Minimal Complete Verifiable Example:

```python
import pandas as pd
import numpy as np
import xarray as xr

df = pd.DataFrame()
df["x"] = np.repeat(np.random.rand(10_000), 10)
df["y"] = np.repeat(np.random.rand(10_000), 10)
df["time"] = np.tile(pd.date_range("2000-01-01", "2000-03-10", freq="W"), 10_000)
df["rate"] = 10.0
df = df.set_index(["time", "y", "x"])

sparse_ds = xr.Dataset.from_dataframe(df, sparse=True)
print(sparse_ds["rate"].nbytes)
```

```python
8000000000
```

Anything else we need to know?:

Environment:

Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.9 (default, Aug 31 2020, 17:10:11) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
libhdf5: 1.10.5
libnetcdf: 4.7.3
xarray: 0.16.1
pandas: 1.1.2
numpy: 1.19.1
scipy: 1.5.2
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.2
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.27.0
distributed: 2.30.1
matplotlib: 3.3.1
cartopy: None
seaborn: 0.11.0
numbagg: None
pint: None
setuptools: 49.6.0.post20201009
pip: 20.3.3
conda: None
pytest: 6.1.0
IPython: 7.19.0
sphinx: 3.2.1
```
  completed xarray 13221727 issue
656165548 MDU6SXNzdWU2NTYxNjU1NDg= 4222 Extension proposal: extension for UGRID and unstructured mesh utils Huite 13662783 closed 0     5 2020-07-13T21:44:12Z 2022-04-22T18:19:53Z 2022-04-18T15:37:35Z CONTRIBUTOR      

Disclaimer: I'm not 100% sure this is necessarily the right place to discuss this. I think some of the things I propose could be broadly useful. At any rate, I appreciate feedback!

EDIT: to clarify, I don't think my original wording was entirely clear: I'm not suggesting adding this to xarray, but I'm thinking about a dataset accessor or a wrapping object in a separate package.

Introduction

Xarray is a great tool for structured data, but there isn't anything quite like it for unstructured meshes of convex cells. I'm coming from a hydrological background, and I believe many of my colleagues working with unstructured meshes could greatly benefit from xarray.

In terms of data models and formats, there are many unstructured mesh formats; I think UGRID is the most widely shared (outside of specific model formats) and it supports many aspects (e.g. data on faces, nodes, edges). Although Gridded exists and provides a number of features, I really need something that consumes and produces xarray Datasets (I don't want to go without e.g. dask, or deal with the netcdf4 library myself).

My most immediate need is some utilities for 2D unstructured meshes (y, x), and layered meshes (layer, y, x). This is also what UGRID is mostly used for now, as there don't seem to be many examples of the 3D unstructured mesh definitions in the wild. Sticking to 2D simplifies things somewhat, so I'll ignore the third dimension for now -- but I don't see any reason why one couldn't generalize many ideas to 3D.

I currently use xarray a great deal with structured data, in dealing with "regular" netCDFs. In my thinking, it's possible to "mirror" a few methods which by and large behave similarly to xarray, using the UGRID data model for the mesh topology.

A simple UGRID example

To clarify what I'm talking about, this is what UGRID looks like in xarray: a Dataset with a dummy variable which describes in which variables the mesh topology is stored, in this case for a triangle and a quad.

```python
ds = xr.Dataset()
ds["mesh2d"] = xr.DataArray(
    data=0,
    attrs={
        "cf_role": "mesh_topology",
        "long_name": "Topology data of 2D mesh",
        "topology_dimension": 2,
        "node_coordinates": "node_x node_y",
        "face_node_connectivity": "face_nodes",
    },
)
ds = ds.assign_coords(
    node_x=xr.DataArray(
        data=np.array([0.0, 10.0, 10.0, 0.0, 20.0, 20.0]),
        dims=["node"],
    )
)
ds = ds.assign_coords(
    node_y=xr.DataArray(
        data=np.array([0.0, 0.0, 10.0, 10.0, 0.0, 10.0]),
        dims=["node"],
    )
)
ds["face_nodes"] = xr.DataArray(
    data=np.array([
        [0, 1, 2, 3],
        [1, 4, 5, -1],
    ]),
    coords={
        "face_x": ("face", np.array([5.0, 15.0])),
        "face_y": ("face", np.array([5.0, 5.0])),
    },
    dims=["face", "nmax_face"],
    attrs={
        "cf_role": "face_node_connectivity",
        "long_name": "Vertex nodes of mesh faces (counterclockwise)",
        "start_index": 0,
        "_FillValue": -1,
    },
)
```

It's only the topology that is special cased. A time dependent variable looks the same as it would otherwise, except it would have dimensions ("time", "face") rather than ("time", "y", "x").
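For instance (an illustrative sketch with made-up values), a variable with two timesteps on the two faces above would simply be:

```python
# Illustrative only: data on ("time", "face") instead of ("time", "y", "x").
ds["waterlevel"] = xr.DataArray(
    data=np.random.rand(2, 2),
    dims=("time", "face"),
)
```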

UGRID netCDFs can in principle be read and written without issue. I would prefer to work with standard xarray Datasets like this if possible, because it means there's no need to touch any IO methods, as well as maintaining the full flexibility of xarray.

Some methods

I'm thinking of a few methods which behave similarly to the xarray methods, except that they know about the mesh topology (in x and y) and the dependencies between the different variables. So, you could do:

```python
import xugrid  # name of extension

selection_ds = ds.xugrid.sel(node_x=slice(-30.0, 30.0))
```

And it would slice based on the x coordinate of the nodes, but update all the other variables to produce a consistent mesh (and renumber the face_nodes appropriately). xugrid.isel would behave in the same way.

xugrid.plot would e.g. perform triangulation automatically (if the mesh isn't triangular already), and create a matplotlib.tri.Triangulation object, and so forth.
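For a purely triangular mesh, that could boil down to something like this sketch (an assumption, not an existing implementation; quads would have to be split first):

```python
import matplotlib.tri

# Sketch: build a Triangulation directly from the UGRID node coordinates and
# the face_node_connectivity (assumes all faces are triangles).
triangulation = matplotlib.tri.Triangulation(
    x=ds["node_x"].values,
    y=ds["node_y"].values,
    triangles=ds["face_nodes"].values,
)
```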

Regridding

I think one of the major needs for dealing with unstructured grids is being able to map data from one mesh to another, go from (GIS) raster data to unstructured, or go from unstructured model output to GIS raster data. For my use case, I often require area weighted methods, which require computing the overlap between source and destination meshes. I haven't really found a suitable package; e.g. xESMF looks very useful, but doesn't have quite the methods I'd want. Also troubling is that I can't really get it installed on Windows.

Existing GIS routines for vector data are far too slow (they have to take too many special cases into account, I think); see some benchmarks here.

I think numba provides an interesting opportunity here, not just for ease of development and distribution, but also because, thanks to JIT-compilation, the user can provide their own regridding methods. E.g.:

```python
import numba as nb
import numpy as np


def median(values, weights):
    return np.nanpercentile(values, 50)


class Regridder:
    def __init__(self, source, destination):
        # compute and store weights
        ...

    # Use a closure to avoid numba's function overhead:
    # https://numba.pydata.org/numba-doc/latest/user/faq.html#can-i-pass-a-function-as-an-argument-to-a-jitted-function
    def _make_regrid(self, method):
        jitted_func = nb.njit(method)

        def regrid(src_indices, dst_indices, data, weights):
            out = np.full(dst_indices.size, np.nan)
            for src_index, dst_index in zip(src_indices, dst_indices):  # or something
                # accumulate values
                out[dst_index] = jitted_func(values)
            return out

        return regrid

    def regrid(self, source_data, method):
        self._regrid = self._make_regrid(method)
        # Cache the method in the object or something to avoid repeat compilation
        # provide default methods (e.g. min or max) as string or enumerators
        return self._regrid(self.src_indices, self.dst_indices, source_data, self.weights)
```

The best way to find overlapping cells seems to be via a cell tree, as implemented here, and requires searching for bounding boxes rather than points. I've ported the CellTree2D to numba, and added a bounding box search.

The serial version is a little bit slower than C++ for point searches, but with numba's prange (and enough processors) searches are a few times faster. (In retrospect, I could've just re-used the C++ tree rather than porting it, but numba does have some benefits...)

Interpolation on unstructured grids can probably be handled by pyinterp, and can be made available via the regridder. (Scipy interpolate takes too much memory in my experience for large meshes.)

API proposal

```python
class Regridder:  # Basically same API as xESMF
    def __init__(self, source: xr.Dataset, destination: xr.Dataset) -> None:
        pass

    def regrid(self, source_data: xr.DataArray, method: Union[str, Callable]) -> xr.Dataset:
        pass


xugrid.triangulate(self) -> xr.Dataset:
xugrid.rasterize(self, resolution: Mapping) -> xr.Dataset:
xugrid.reproject(self, dst_crs, src_crs) -> xr.Dataset:  # just offload all coordinates to pyproj?

xugrid.sel(self, indexers: Mapping) -> xr.Dataset:
xugrid.isel(self, indexers: Mapping) -> xr.Dataset:
xugrid.plot(self) -> None:  # calls triangulate if needed
xugrid.plot.imshow(self) -> None:  # maybe calls rasterize first, provides a quicker peek

xugrid.slice_vertical(self, linestring) -> xr.Dataset:  # traces one or more rays through the mesh

xugrid.check_mesh(self) -> None:  # check for convexity, all indices present, etc.
xugrid.connectivity_matrix(self) -> scipy.sparse?
xugrid.reverse_cuthill_mckee(self) -> xr.Dataset:  # use scipy.sparse.csgraph
xugrid.binary_erosion(self, iterations: int) -> xr.DataArray:
xugrid.binary_dilation(self, iterations: int) -> xr.DataArray:

xugrid.meshfix(self)  # call pymeshfix
xugrid.to_pyvista(self) -> pyvista.UnstructuredGrid:  # create a pyvista (vtk) mesh
```
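A hypothetical usage sketch of that API (the dataset names and the "mean" method are illustrative assumptions):

```python
# Hypothetical usage; source_ds and destination_ds stand in for two UGRID datasets.
regridder = Regridder(source_ds, destination_ds)
averaged = regridder.regrid(source_ds["rate"], method="mean")

# Or with a user-defined reduction that gets JIT-compiled, as sketched above:
medians = regridder.regrid(source_ds["rate"], method=median)
```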

Most of these methods require dealing with numpy arrays (or numpy arrays representing sparse arrays), so I think numba is very suitable for the algorithms (if they don't already exist).

For most methods, 3D meshes would be possible too. It requires some changes in e.g. the CellTree, and plotting would first require a slice_horizontal or something to get a plan view image.

Some methods might not be much needed; I'm basing this on my impression of what I currently use xarray for, in combination with some other tools that assume structured data.

Resources

  • Cell tree https://github.com/NOAA-ORR-ERD/cell_tree2d
  • Numba KD tree: https://github.com/jackd/numba-neighbors
  • Interpolation: https://github.com/CNES/pangeo-pyinterp
  • UGRID conventions: https://ugrid-conventions.github.io/ugrid-conventions/
  • Gridded: https://github.com/NOAA-ORR-ERD/gridded
  • Triangulating polygons (could be useful for vector data -> mesh): https://pypi.org/project/mapbox-earcut/
  • Pymeshfix: https://github.com/pyvista/pymeshfix

(Have to check for non-permissive / copyleft licenses...)

Thoughts?

Does this sound useful? Does it belong somewhere else / does it already exist?

(I'm aware that a number of these, e.g. the tree structures, are available in vtk as well, but the Python API unfortunately doesn't expose them; having a numba version also makes changes a lot easier.)

Something that popped up: some of the algorithms benefit from persistent data structures (e.g. a cell tree or connectivity matrix); can these be maintained via an xarray extension?
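As a sketch of how that could work (an assumption on my part; CellTree2D is a hypothetical stand-in for the numba port mentioned above), a registered accessor can hold state for as long as the Dataset object lives:

```python
import xarray as xr


@xr.register_dataset_accessor("xugrid")
class UgridAccessor:
    def __init__(self, obj):
        self._obj = obj
        self._celltree = None  # built lazily, then reused

    @property
    def celltree(self):
        if self._celltree is None:
            # CellTree2D is hypothetical here: the numba port of cell_tree2d.
            self._celltree = CellTree2D(
                self._obj["node_x"].values,
                self._obj["node_y"].values,
                self._obj["face_nodes"].values,
            )
        return self._celltree
```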

I'm assuming mesh topologies that fit within memory; that seems challenging enough for now...

  completed xarray 13221727 issue
601364609 MDExOlB1bGxSZXF1ZXN0NDA0NjM0Mjk0 3979 full_like: error on non-scalar fill_value Huite 13662783 closed 0     2 2020-04-16T19:18:50Z 2020-04-24T07:15:49Z 2020-04-24T07:15:44Z CONTRIBUTOR   0 pydata/xarray/pulls/3979
  • [x] Closes #3977
  • [x] Tests added
  • [x] Passes isort -rc . && black . && mypy . && flake8
  • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API

@dcherian's suggestion in #3977 seemed straightforward enough for me to have a try.

Two thoughts:

  • Does the np.isscalar check belong in the outer function, or the inner? The inner function is only called by the outer one.
  • Bugfix, or arguably a breaking change?

    xarray 13221727 pull
601224688 MDU6SXNzdWU2MDEyMjQ2ODg= 3977 xr.full_like (often) fails when other is chunked and fill_value is non-scalar Huite 13662783 closed 0     1 2020-04-16T16:23:59Z 2020-04-24T07:15:44Z 2020-04-24T07:15:44Z CONTRIBUTOR      

I've been running into some issues when using xr.full_like, when my other.data is a chunked dask array, and the fill_value is a numpy array.

Now, I just checked, and full_like mentions only a scalar in the signature. However, this is a very convenient way to get all the coordinates and dimensions attached to an array like this, so it feels like desirable functionality. And as I mention below, both numpy and dask function similarly, taking much more than just scalars. https://xarray.pydata.org/en/stable/generated/xarray.full_like.html

MCVE Code Sample

```python
x = [1, 2, 3, 4]
y = [1, 2, 3]
da1 = xr.DataArray(dask.array.ones((3, 4), chunks=(1, 4)), {"y": y, "x": x}, ("y", "x"))
da2 = xr.full_like(da1, np.ones((3, 4)))
print(da2.values)
```

This results in an error:

```
ValueError: could not broadcast input array from shape (1,3) into shape (1,4)
```

Expected Output

Expected is a DataArray with the dimensions and coords of other, and the numpy array of fill_value as its data.

Problem Description

The issue lies here: https://github.com/pydata/xarray/blob/2c77eb531b6689f9f1d2adbde0d8bf852f1f7362/xarray/core/common.py#L1420-L1436

Calling dask.array.full with the given number of chunks results in it trying to apply the fill_value for every individual chunk.

As one would expect, if I set fill_value to the size of a single chunk it doesn't error:

```python
da2 = xr.full_like(da1, np.ones((1, 4)))
print(da2.values)
```

It does fail on a similarly chunked dask array (since it's applying it for every chunk):

```python
da2 = xr.full_like(da1, dask.array.ones((3, 4)))
print(da2.values)
```

The most obvious solution would be to force it down the np.full_like route, since all the values already exist in memory anyway. So maybe another type check does the trick. However, full() accepts quite a variety of arguments for the fill value (scalars, numpy arrays, lists, tuples, ranges). The dask docs mention only a scalar in the signature for dask.array.full: https://docs.dask.org/en/latest/array-api.html#dask.array.full As does numpy.full: https://docs.scipy.org/doc/numpy/reference/generated/numpy.full.html

However, in all cases, they still broadcast automatically...

```python
a = np.full((2, 2), [1, 2])
# array([[1, 2],
#        [1, 2]])
```

So kind of undefined behavior of a blocked full?
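For illustration, a rough sketch of the kind of type check meant above (an assumption, not actual xarray code):

```python
import dask.array
import numpy as np

def _full_like_data(other_data, fill_value):
    # Sketch only (hypothetical helper): keep the dask route for scalar fill
    # values, and broadcast with numpy otherwise, since the values already
    # exist in memory anyway.
    if isinstance(other_data, dask.array.Array) and np.isscalar(fill_value):
        return dask.array.full_like(other_data, fill_value)
    return np.broadcast_to(fill_value, other_data.shape).copy()
```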

Versions

Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 21:48:41) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None
libhdf5: 1.10.5
libnetcdf: 4.7.3
xarray: 0.15.1
pandas: 0.25.3
numpy: 1.17.5
scipy: 1.3.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.2
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.9.2
distributed: 2.10.0
matplotlib: 3.1.2
cartopy: None
seaborn: 0.10.0
numbagg: None
setuptools: 46.1.3.post20200325
pip: 20.0.2
conda: None
pytest: 5.3.4
IPython: 7.13.0
sphinx: 2.3.1
```
  completed xarray 13221727 issue
545762025 MDU6SXNzdWU1NDU3NjIwMjU= 3664 Regression 0.14.0 -> 0.14.1: __setattr__ no longer updates matching index Huite 13662783 closed 0     2 2020-01-06T14:42:50Z 2020-01-06T16:03:16Z 2020-01-06T15:37:17Z CONTRIBUTOR      

MCVE Code Sample

This came up in the following (simplified) case:

```python
import numpy as np
import xarray as xr

x = [1, 2, 3, 4]
data = [1., 2., 3., 4.]
dims = ["x"]
coords = {"x": x}
da = xr.DataArray(data, coords, dims)
da1 = da.sel(x=[1, 3])
da2 = da.sel(x=[2, 4])
da2["x"].values = da1["x"].values
da3 = da2 - da1
print(da3)
```

In 0.14.1, this results in an empty DataArray da3:

```python
<xarray.DataArray (x: 0)>
array([], dtype=float64)
Coordinates:
  * x        (x) int64
```

In 0.14.0 (and before), there are no issues:

```python
<xarray.DataArray (x: 2)>
array([1., 1.])
Coordinates:
  * x        (x) int32 1 3
```

Interestingly, the values of da2["x"] are okay; print(da2["x"]) gives: array([1, 3])

However, the index has not been updated; print(da2.indexes["x"]) gives: Int64Index([2, 4], dtype='int64', name='x')

Expected Output

I'm expecting output as in 0.14.0 and before. I've briefly stepped through the code for da2["x"].values = da1["x"].values; I don't see any changes there between 0.14.0 and 0.14.1, so I'm guessing it's due to some change in the indexes. Apparently the index no longer uses the same array?

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.5 (default, Oct 31 2019, 15:18:51) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None
libhdf5: 1.10.5
libnetcdf: 4.7.3
xarray: 0.14.1
pandas: 0.25.3
numpy: 1.17.3
scipy: 1.3.2
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.1
cfgrib: None
iris: None
bottleneck: None
dask: 2.6.0
distributed: 2.6.0
matplotlib: 3.1.1
cartopy: None
seaborn: None
numbagg: None
setuptools: 41.6.0.post20191030
pip: 19.3.1
conda: None
pytest: None
IPython: 7.9.0
sphinx: 2.2.1
```
  completed xarray 13221727 issue
354298235 MDU6SXNzdWUzNTQyOTgyMzU= 2383 groupby().apply() on variable with NaNs raises IndexError Huite 13662783 closed 0     1 2018-08-27T12:27:06Z 2019-10-28T23:46:41Z 2019-10-28T23:46:41Z CONTRIBUTOR      

Code Sample

```python
import xarray as xr
import numpy as np


def standardize(x):
    return (x - x.mean()) / x.std()


ds = xr.Dataset()
ds["variable"] = xr.DataArray(
    np.random.rand(4, 3, 5),
    {"lat": np.arange(4), "lon": np.arange(3), "time": np.arange(5)},
    ("lat", "lon", "time"),
)

ds["id"] = xr.DataArray(
    np.arange(12.0).reshape((4, 3)),
    {"lat": np.arange(4), "lon": np.arange(3)},
    ("lat", "lon"),
)

ds["id"].values[0, 0] = np.nan

ds.groupby("id").apply(standardize)
```

Problem description

This results in an IndexError. This is mildly confusing; it took me a little while to figure out the NaNs were to blame. I'm guessing the NaN doesn't get filtered out everywhere.

The traceback: ```


IndexError Traceback (most recent call last) <ipython-input-2-267ba57bc264> in <module>() 15 ds["id"].values[0,0] = np.nan 16 ---> 17 ds.groupby("id").apply(standardize)

C:\Miniconda3\envs\main\lib\site-packages\xarray\core\groupby.py in apply(self, func, kwargs) 607 kwargs.pop('shortcut', None) # ignore shortcut if set (for now) 608 applied = (func(ds, kwargs) for ds in self._iter_grouped()) --> 609 return self._combine(applied) 610 611 def _combine(self, applied):

C:\Miniconda3\envs\main\lib\site-packages\xarray\core\groupby.py in _combine(self, applied) 614 coord, dim, positions = self._infer_concat_args(applied_example) 615 combined = concat(applied, dim) --> 616 combined = _maybe_reorder(combined, dim, positions) 617 if coord is not None: 618 combined[coord.name] = coord

C:\Miniconda3\envs\main\lib\site-packages\xarray\core\groupby.py in _maybe_reorder(xarray_obj, dim, positions) 428 429 def _maybe_reorder(xarray_obj, dim, positions): --> 430 order = _inverse_permutation_indices(positions) 431 432 if order is None:

C:\Miniconda3\envs\main\lib\site-packages\xarray\core\groupby.py in _inverse_permutation_indices(positions) 109 positions = [np.arange(sl.start, sl.stop, sl.step) for sl in positions] 110 --> 111 indices = nputils.inverse_permutation(np.concatenate(positions)) 112 return indices 113

C:\Miniconda3\envs\main\lib\site-packages\xarray\core\nputils.py in inverse_permutation(indices) 52 # use intp instead of int64 because of windows :( 53 inverse_permutation = np.empty(len(indices), dtype=np.intp) ---> 54 inverse_permutation[indices] = np.arange(len(indices), dtype=np.intp) 55 return inverse_permutation 56

IndexError: index 11 is out of bounds for axis 0 with size 11

```

Expected Output

My assumption was that it would throw out the values that fall within the NaN group, like pandas:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame()
df["var"] = np.random.rand(10)
df["id"] = np.arange(10)
df["id"].iloc[0:2] = np.nan
df.groupby("id").mean()
```

Out:

```
           var
id
2.0   0.565366
3.0   0.744443
4.0   0.190983
5.0   0.196922
6.0   0.377282
7.0   0.141419
8.0   0.957526
9.0   0.207360
```
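Until then, a minimal workaround sketch (assuming a sentinel group value is acceptable; this is an illustration, not part of the report):

```python
# Give the NaN cell a sentinel group id so every cell belongs to a real group;
# the -1.0 group can then be discarded from the result by hand.
ds["id"] = ds["id"].fillna(-1.0)
result = ds.groupby("id").apply(standardize)
```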

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 45 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
xarray: 0.10.8
pandas: 0.23.3
numpy: 1.15.0
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.2
distributed: 1.22.0
matplotlib: 2.2.2
cartopy: None
seaborn: None
setuptools: 40.0.0
pip: 18.0
conda: None
pytest: 3.7.1
IPython: 6.4.0
sphinx: 1.7.5
```
  completed xarray 13221727 issue
510505541 MDExOlB1bGxSZXF1ZXN0MzMwODY3ODI0 3428 Address multiplication DeprecationWarning in rasterio backend Huite 13662783 closed 0     2 2019-10-22T08:32:25Z 2019-10-22T18:45:32Z 2019-10-22T18:45:24Z CONTRIBUTOR   0 pydata/xarray/pulls/3428

Very minor change to address this DeprecationWarning:

```
xarray\backends\rasterio_.py:260: DeprecationWarning: Right multiplication will be prohibited in version 3.0
  x, _ = (np.arange(nx) + 0.5, np.zeros(nx) + 0.5) * riods.transform
```
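The change amounts to switching from right to left multiplication by the affine transform, roughly (a sketch of the intent; the exact diff is in the PR):

```python
# Before: right multiplication, deprecated by affine
x, _ = (np.arange(nx) + 0.5, np.zeros(nx) + 0.5) * riods.transform
# After: left multiplication
x, _ = riods.transform * (np.arange(nx) + 0.5, np.zeros(nx) + 0.5)
```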

    xarray 13221727 pull
365854029 MDU6SXNzdWUzNjU4NTQwMjk= 2454 Regression from 0.10.8 to 0.10.9, "shape mismatch" IndexError in reindex_like() after open_rasterio().squeeze() Huite 13662783 closed 0     1 2018-10-02T11:28:16Z 2018-10-08T18:17:02Z 2018-10-08T18:17:02Z CONTRIBUTOR      

Code Sample

```python
import xarray as xr
import numpy as np
import rasterio
from affine import Affine


def write_tif(path, epsg):
    nrow = 5
    ncol = 8
    values = 10.0 * np.random.rand(nrow, ncol)

    profile = dict()
    profile["crs"] = rasterio.crs.CRS.from_epsg(epsg)
    profile["transform"] = Affine.translation(0.0, 0.0)
    profile["driver"] = "GTiff"
    profile["height"] = nrow
    profile["width"] = ncol
    profile["count"] = 1
    profile["dtype"] = np.float64

    with rasterio.Env():
        with rasterio.open(path, "w", **profile) as dst:
            dst.write(values, 1)


write_tif("basic.tif", epsg=28992)
da = xr.open_rasterio("basic.tif").squeeze("band", drop=True)
likevalues = np.empty((10, 16))
likecoords = {"y": np.arange(0.25, 5.0, 0.5), "x": np.arange(0.25, 8.0, 0.5)}
likedims = ("y", "x")
like = xr.DataArray(likevalues, likecoords, likedims)

newda = da.reindex_like(like)
```

Problem description

The code above executes without issues in xarray version 0.10.8. However, it results in a "shape-mismatch" error in 0.10.9:

``` Traceback (most recent call last):

File "<ipython-input-57-f099072e9750>", line 8, in <module> newda = da.reindex_like(like)

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\xarray\core\dataarray.py", line 919, in reindex_like **indexers)

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\xarray\core\dataarray.py", line 969, in reindex indexers=indexers, method=method, tolerance=tolerance, copy=copy)

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\xarray\core\dataset.py", line 1924, in reindex tolerance, copy=copy)

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\xarray\core\alignment.py", line 377, in reindex_variables new_var = var._getitem_with_mask(key)

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\xarray\core\variable.py", line 655, in _getitem_with_mask data = duck_array_ops.where(mask, fill_value, data)

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\xarray\core\duck_array_ops.py", line 183, in where return _where(condition, *as_shared_dtype([x, y]))

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\xarray\core\duck_array_ops.py", line 115, in as_shared_dtype arrays = [asarray(x) for x in scalars_or_arrays]

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\xarray\core\duck_array_ops.py", line 115, in <listcomp> arrays = [asarray(x) for x in scalars_or_arrays]

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\xarray\core\duck_array_ops.py", line 110, in asarray return data if isinstance(data, dask_array_type) else np.asarray(data)

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\numpy\core\numeric.py", line 492, in asarray return array(a, dtype, copy=False, order=order)

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\xarray\core\indexing.py", line 624, in array self._ensure_cached()

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\xarray\core\indexing.py", line 621, in _ensure_cached self.array = NumpyIndexingAdapter(np.asarray(self.array))

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\numpy\core\numeric.py", line 492, in asarray return array(a, dtype, copy=False, order=order)

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\xarray\core\indexing.py", line 602, in array return np.asarray(self.array, dtype=dtype)

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\numpy\core\numeric.py", line 492, in asarray return array(a, dtype, copy=False, order=order)

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\xarray\core\indexing.py", line 508, in array return np.asarray(array[self.key], dtype=None)

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\xarray\backends\rasterio_.py", line 119, in getitem key, self.shape, indexing.IndexingSupport.OUTER, self._getitem)

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\xarray\core\indexing.py", line 776, in explicit_indexing_adapter result = raw_indexing_method(raw_key.tuple)

File "d:\bootsma\miniconda3\envs\main\lib\site-packages\xarray\backends\rasterio_.py", line 115, in _getitem return out[np_inds]

IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (10,) (16,)

```

I only get it to occur when loading a raster using open_rasterio(). Defining da directly does not lead to issues, using either numpy or dask.array.

```python
values = 10.0 * np.random.rand(1, 5, 8)
coords = {"band": [1.0], "y": np.arange(0.5, 5.0, 1.0), "x": np.arange(0.5, 8.0, 1.0)}
dims = ("band", "y", "x")
da = xr.DataArray(values, coords, dims)
da = da.squeeze("band", drop=True)
```

```python
import dask.array

values = 10.0 * dask.array.random.random((1, 5, 8), (1, 5, 8))
coords = {"band": [1.0], "y": np.arange(0.5, 5.0, 1.0), "x": np.arange(0.5, 8.0, 1.0)}
dims = ("band", "y", "x")
da = xr.DataArray(values, coords, dims)
da = da.squeeze("band", drop=True)
```

The "shape mismatch" makes me think that maybe the squeeze() isn't executed properly or in time. This works: ```Python write_tif("basic.tif", epsg=28992) da = xr.open_rasterio("basic.tif") likevalues = np.empty((1, 10, 16)) likecoords = {"band":[1], "y": np.arange(0.25, 5.0, 0.5), "x": np.arange(0.25, 8.0, 0.5)} likedims = ("band", "y", "x") like = xr.DataArray(likevalues, likecoords, likedims)

newda = da.reindex_like(like) ```

So it looks to me like some lazy evaluation is going awry when open_rasterio is involved. Anything which forces a compute() before reindex_like seems to make the problem go away, like calling da in the console, print(da) after opening the .tif, etc.

```Python write_tif("basic.tif", epsg=28992) da = xr.open_rasterio("basic.tif").compute().squeeze("band", drop=True) likevalues = np.empty((10, 16)) likecoords = {"y": np.arange(0.25, 5.0, 0.5), "x": np.arange(0.25, 8.0, 0.5)} likedims = ("y", "x") like = xr.DataArray(likevalues, likecoords, likedims)

newda = da.reindex_like(like) ```

Interestingly, I can't easily inspect it, because inspecting makes the problem go away!

Expected Output

A successfully reindexed DataArray.

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None
xarray: 0.10.9
pandas: 0.23.3
numpy: 1.14.5
scipy: 1.1.0
netCDF4: 1.3.1
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: None
cftime: 1.0.0
PseudonetCDF: None
rasterio: 1.0.0
iris: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.1
distributed: 1.22.0
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: 0.9.0
setuptools: 40.0.0
pip: 18.0
conda: None
pytest: 3.6.3
IPython: 6.4.0
sphinx: 1.7.5
```
  completed xarray 13221727 issue
