
issues: 1358960570


id: 1358960570
node_id: I_kwDOAMm_X85RABe6
number: 6976
title: dataset.sel inconsistent results when argument is a list or a slice.
user: 12818667
state: open
locked: 0
comments: 5
created_at: 2022-09-01T14:40:34Z
updated_at: 2022-09-02T14:13:27Z
author_association: NONE

What happened?

I am not sure if what I report is a bug; however, it is certainly not what I expect from a careful reading of the documentation, and I wonder if it is leading to some issues I describe below.

I am working with a large dataset produced by merging the output of runs made by different MPI processes. There are two coordinates, ("trajectory","obs"). All of the "obs" in the dataset are in order, but the "trajectory" coordinate is not in order. I made a smaller dataset that illustrates the issue below by reducing the number of "obs" from 250 to 2; this dataset can be found at http://oxbow.sr.unh.edu/data/smallExample.zarr.zip . This dataset looks like:

    Dimensions:     (trajectory: 39363539, obs: 2)
    Coordinates:
      * obs         (obs) int32 0 1
      * trajectory  (trajectory) int64 100 210 227 ... 39363210 39363255 39363379
    Data variables:
        age         (trajectory, obs) float32 dask.array<chunksize=(50000, 2), meta=np.ndarray>
        lat         (trajectory, obs) float64 dask.array<chunksize=(50000, 2), meta=np.ndarray>
        lon         (trajectory, obs) float64 dask.array<chunksize=(50000, 2), meta=np.ndarray>
        time        (trajectory, obs) datetime64[ns] dask.array<chunksize=(40625, 2), meta=np.ndarray>
        z           (trajectory, obs) float64 dask.array<chunksize=(50000, 2), meta=np.ndarray>

Note that the trajectory coordinate is not in order; this is due to how the problem is partitioned into MPI jobs.

If I want an ordered set of trajectories, say trajectories [1,2,3,4,5,6,7,8,9,10], and I do this with subSet=dataIn.sel(trajectory=arange(1,11)), I get what I would expect from the documentation, the first through 10th trajectories, in order:

    <xarray.Dataset>
    Dimensions:     (trajectory: 10, obs: 2)
    Coordinates:
      * obs         (obs) int32 0 1
      * trajectory  (trajectory) int64 1 2 3 4 5 6 7 8 9 10
    Data variables:
        age         (trajectory, obs) float32 dask.array<chunksize=(1, 2), meta=np.ndarray>
        ....

But if I use the slice operator to specify what I want, I get something very different: dataIn.sel(trajectory=slice(1,11)) returns 2567339 trajectories, starting with the location of trajectory coordinate 1 in dataIn and extending to the location of trajectory coordinate 11 in dataIn:

    <xarray.Dataset>
    Dimensions:     (trajectory: 2567339, obs: 2)
    Coordinates:
      * obs         (obs) int32 0 1
      * trajectory  (trajectory) int64 1 27 57 59 ... 39363486 39363495 39363528 11
    Data variables:
        age         (trajectory, obs) float32 dask.array<chunksize=(17944, 2), meta=np.ndarray>
        ...

This is not what I expect. As I understand the documentation, .sel should work in coordinate space, and I would expect dataIn.sel(trajectory=slice(1,11)) and subSet=dataIn.sel(trajectory=arange(1,11)) to return the same thing. If I am wrong in this interpretation, perhaps a documentation update would be helpful.
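The difference can be reproduced without the real data. The following is a minimal sketch, assuming a tiny synthetic dataset whose "trajectory" labels are deliberately stored out of order; the labels and the "age" values are made up for illustration and are not taken from smallExample.zarr. It compares selection by a list of labels, selection by a label slice on the unsorted coordinate, and the same slice after sorting.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the real data: six trajectories stored out of order.
ds = xr.Dataset(
    {"age": ("trajectory", np.arange(6.0))},
    coords={"trajectory": [5, 1, 7, 3, 11, 2]},
)

# A list (or array) of labels looks up each label individually.
by_list = ds.sel(trajectory=[1, 2, 3])

# A slice on an unsorted coordinate spans the stored positions of the
# two end labels, so it picks up whatever lies between them.
by_slice = ds.sel(trajectory=slice(1, 11))

# After sorting, the same slice selects by value; both end labels are included.
by_sorted_slice = ds.sortby("trajectory").sel(trajectory=slice(1, 11))

print(by_list.trajectory.values)
print(by_slice.trajectory.values)
print(by_sorted_slice.trajectory.values)
```

On this toy dataset the slice on the unsorted coordinate returns the labels stored between 1 and 11, which matches the pattern seen above. Note also that a label slice includes its end label, so even on a sorted coordinate slice(1,11) would include trajectory 11, while arange(1,11) stops at 10.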

I have had all sorts of issues with the full dataset, including .to_zarr(dataset, compute=False) failing to return a delayed object because it used all the memory, .sortby(['trajectory']) failing from memory exhaustion, etc. I wonder if the issue reported here could be at the root of many of these problems? On a side note, the failure of .sortby(['trajectory']) makes re-ordering the dataset difficult, and I would be happy to hear any suggestions on that front.

What did you expect to happen?

See above for full descriptions

Minimal Complete Verifiable Example

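```Python
import numpy as np
import xarray as xr

# get data from http://oxbow.sr.unh.edu/data/smallExample.zarr.zip

dataIn = xr.open_zarr('smallExample.zarr')
print(dataIn.sel(trajectory=np.arange(1, 11)))  # select by an array of labels
print(dataIn.sel(trajectory=slice(1, 11)))      # select by a label slice
```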

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2022.6.0
pandas: 1.4.3
numpy: 1.23.2
scipy: 1.9.0
netCDF4: 1.6.0
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.12.0
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.8.1
distributed: 2022.8.1
matplotlib: 3.5.3
cartopy: 0.20.3
seaborn: None
numbagg: None
fsspec: 2022.7.1
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.2.0
pip: 22.2.2
conda: None
pytest: 7.1.2
IPython: 8.4.0
sphinx: None
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6976/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: 13221727
type: issue
