Issue #6858: Inconsistent results due to dask.Cache

State: closed (completed) · Opened: 2022-08-01 · Closed: 2024-02-26 · Comments: 1

What happened?

In my workflow, I was opening a dataset from a netCDF file, modifying its contents, saving the result as a different file, and then repeating the process with the same input file but a different modification.

It looks like dask's `Cache` is messing up the results by reusing cached information when it shouldn't.

This also leads to different results for `data_array.values.sum()` and `data_array.sum().values`.
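For context, here is a minimal sketch of the mechanism I suspect (my assumption, not a confirmed diagnosis): once a `dask.cache.Cache` is registered, intermediate task results are cached by task key, so two graphs that tokenize to the same keys can end up sharing results:

```Python
import dask.array as da
from dask.cache import Cache

cache = Cache(1e9)
cache.register()

# Two arrays built from identical arguments tokenize to the same task
# keys, so the second compute may be served from the opportunistic
# cache instead of being recomputed.
x = da.ones((4, 3), chunks=(4, 3))
y = da.ones((4, 3), chunks=(4, 3))
print(x.sum().compute())  # computed and cached
print(y.sum().compute())  # same keys as x.sum(): may hit the cache
```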

What did you expect to happen?

I expect that when I open a file from disk, accessing its values returns what is actually stored in the file.

Instead, data values that should have been discarded when the file was loaded again are being reused.
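As a possible workaround sketch (assuming the corruption comes from `ds.foo.values += 1.0` mutating, in place, a buffer that the cache still references), building a new lazy graph instead of mutating computed values seems safer:

```Python
import xarray as xr

with xr.open_dataset("./dummy_dataset.nc", chunks={}) as ds:
    # assign() builds a new lazy variable instead of mutating the
    # computed numpy buffer in place, so any cached chunk stays intact.
    modified = ds.assign(foo=ds.foo + 1.0)
    modified.to_netcdf("./dummy_dataset.modified.nc")
```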

Minimal Complete Verifiable Example

```Python
import numpy as np
import xarray as xr
import dask
from dask.cache import Cache

cache = Cache(2e9)
cache.register()

print(f"Xarray Version: {xr.__version__=}")
print(f"Dask version: {dask.__version__=}")


def file_opener(file_path):
    """
    Open a file and return a dataset object.
    :param file_path:
    :return:
    """
    return xr.open_dataset(file_path, chunks={})


def main():
    # Define names for the files
    file_path = "./dummy_dataset.nc"
    modified_file_path = file_path.replace(".nc", ".modified.nc")

    # Create a synthetic dataset and save it into a file
    data = np.zeros(shape=(4, 3))
    foo = xr.DataArray(data, dims=['time', 'space'])
    ds = foo.to_dataset(name='foo')
    ds.to_netcdf(file_path)

    # Open the file and compute the sum of the values
    for i in range(3):
        print(f"Iteration:{i}")
        print("\tOriginal")
        with file_opener(file_path) as ds:
            print(f"\t\t{ds.foo.sum().values=:2f}")
            print(f"\t\t{ds.foo.values.sum()=:2f}")

            # Modify the dataset and save it as a different file
            ds.foo.values += 1.0
            ds.to_netcdf(modified_file_path)

        # Print the sums of the (now modified) dataset
        print("Modified")
        print(f"\t\t{ds.foo.sum().values=:2f}")
        print(f"\t\t{ds.foo.values.sum()=:2f}")


if __name__ == "__main__":
    main()
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [x] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

```Python
The output of running the provided MCVE:

Xarray Version: xr.__version__='2022.3.0'
Dask version: dask.__version__='2022.05.2'
Iteration:0
    Original
        ds.foo.sum().values=0.000000
        ds.foo.values.sum()=0.000000
Modified
        ds.foo.sum().values=12.000000
        ds.foo.values.sum()=12.000000
Iteration:1
    Original
        ds.foo.sum().values=0.000000
        ds.foo.values.sum()=12.000000
Modified
        ds.foo.sum().values=24.000000
        ds.foo.values.sum()=24.000000
Iteration:2
    Original
        ds.foo.sum().values=0.000000
        ds.foo.values.sum()=24.000000
Modified
        ds.foo.sum().values=36.000000
        ds.foo.values.sum()=36.000000
```
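If the cache really is the culprit, deregistering it should make both access paths agree again. A quick diagnostic sketch, reusing `cache` and `file_opener` from the example above (and assuming `unregister()`, which `Cache` inherits from dask's `Callback`, fully removes it):

```Python
cache.unregister()  # stop serving results from the opportunistic cache
with file_opener("./dummy_dataset.nc") as ds:
    # With no cache registered, both access paths should agree with
    # the file contents again.
    print(f"{ds.foo.sum().values=:2f}")
    print(f"{ds.foo.values.sum()=:2f}")
```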

Anything else we need to know?

I couldn't run the example in the Binder notebook because `Cache` from `dask.cache` couldn't be imported there. However, I could reproduce the results in a clean environment after installing the following dependencies:

  • xarray
  • dask
  • cachey
  • netCDF4

Environment

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-109-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.22.4
scipy: 1.8.1
netCDF4: 1.5.8
pydap: None
h5netcdf: 1.0.0
h5py: 3.7.0
Nio: None
zarr: None
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.05.2
distributed: 2022.5.2
matplotlib: 3.5.2
cartopy: 0.20.2
seaborn: 0.11.2
numbagg: None
fsspec: 2022.5.0
cupy: None
pint: 0.19.2
sparse: None
setuptools: 57.4.0
pip: 22.1.2
conda: None
pytest: 7.1.2
IPython: None
sphinx: None
```