issues: 1324446752
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1324446752 | I_kwDOAMm_X85O8XQg | 6858 | Inconsistent results due to dask.Cache | 34112954 | closed | 0 | 1 | 2022-08-01T13:38:35Z | 2024-02-26T06:52:07Z | 2024-02-26T06:52:06Z | NONE |

### What happened?
In my workflow, I was opening a dataset from a netCDF file, modifying its contents, saving the dataset as a different file, and then repeating the process with the same file but a different modification. It looks like dask's Cache is corrupting the results by reusing cached data when it shouldn't. That also leads to different results for `ds.foo.sum().values` and `ds.foo.values.sum()` on the same dataset.

### What did you expect to happen?
I expect that when I open a file from disk and access its values, those values correspond to what is actually in the file. Instead, data values that should have been discarded when the file was loaded again are being reused.

### Minimal Complete Verifiable Example
```Python
import numpy as np
import xarray as xr
import dask
from dask.cache import Cache

cache = Cache(2e9)
cache.register()

print(f"Xarray Version: {xr.__version__=}")
print(f"Dask version: {dask.__version__=}")


def file_opener(file_path):
    """
    Open a file and return a dataset object.

    :param file_path:
    :return:
    """
    return xr.open_dataset(file_path, chunks={})


def main():
    # Define names for the files
    file_path = "./dummy_dataset.nc"
    modified_file_path = file_path.replace(".nc", ".modified.nc")


if __name__ == "__main__":
    main()
```

### MVCE confirmation
### Relevant log output
```Python
The output of running the provided MVCE:
Xarray Version: xr.__version__='2022.3.0'
Dask version: dask.__version__='2022.05.2'
Iteration:0
Original ds.foo.sum().values=0.000000 ds.foo.values.sum()=0.000000
Modified ds.foo.sum().values=12.000000 ds.foo.values.sum()=12.000000
Iteration:1
Original ds.foo.sum().values=0.000000 ds.foo.values.sum()=12.000000
Modified ds.foo.sum().values=24.000000 ds.foo.values.sum()=24.000000
Iteration:2
Original ds.foo.sum().values=0.000000 ds.foo.values.sum()=24.000000
Modified ds.foo.sum().values=36.000000 ds.foo.values.sum()=36.000000
```

### Anything else we need to know?
I couldn't run the example in the Binder notebook because it imports `Cache` from `dask.cache`, which couldn't be loaded there. However, I could reproduce the results in a clean environment after installing the following dependencies:

- xarray
- dask
- cachey
- netCDF4

### Environment
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46)
[GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-109-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4
xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.22.4
scipy: 1.8.1
netCDF4: 1.5.8
pydap: None
h5netcdf: 1.0.0
h5py: 3.7.0
Nio: None
zarr: None
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.05.2
distributed: 2022.5.2
matplotlib: 3.5.2
cartopy: 0.20.2
seaborn: 0.11.2
numbagg: None
fsspec: 2022.5.0
cupy: None
pint: 0.19.2
sparse: None
setuptools: 57.4.0
pip: 22.1.2
conda: None
pytest: 7.1.2
IPython: None
sphinx: None
|
{ "url": "https://api.github.com/repos/pydata/xarray/issues/6858/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | 13221727 | issue |