issues: 853473276

id: 853473276
node_id: MDU6SXNzdWU4NTM0NzMyNzY=
number: 5132
title: Backend caching should not use a relative path
user: 367900
state: closed
locked: 0
assignee: (empty)
milestone: (empty)
comments: 4
created_at: 2021-04-08T13:27:03Z
updated_at: 2021-04-15T12:12:26Z
closed_at: 2021-04-15T12:12:26Z
author_association: CONTRIBUTOR
active_lock_reason: (empty)
draft: (empty)
pull_request: (empty)
body: (below, followed by the reactions, performed_via_github_app, state_reason, repo and type columns)

Datasets opened from disk are cached with a key based (amongst other things) on their filename. If files with the same name exist in different directories and each is opened by relative path after changing directory, the cache keys collide because the filename is identical, so the first dataset opened is always returned.

Minimal Complete Verifiable Example:

```python
import os
from pathlib import Path
import tempfile

import numpy as np
import xarray as xr

with tempfile.TemporaryDirectory() as d:
    base = Path(d).resolve()

    # Create some data in separate directories but with same filename.
    (base / "zeros").mkdir()
    z_fn = base / "zeros" / "data.nc"
    xr.DataArray(np.zeros((5, 5), dtype=int)).to_netcdf(z_fn)
    (base / "ones").mkdir()
    o_fn = base / "ones" / "data.nc"
    xr.DataArray(np.ones((5, 5), dtype=int)).to_netcdf(o_fn)

    # Open with the absolute path and check we get what we expect.
    z_abs = xr.open_dataarray(z_fn)
    o_abs = xr.open_dataarray(o_fn)
    assert (z_abs == 0).all(), "zeros with absolute path incorrect"
    assert (o_abs == 1).all(), "ones with absolute path incorrect"

    # Open with relative path from base directory.
    os.chdir(base)
    z_base = xr.open_dataarray("zeros/data.nc")
    o_base = xr.open_dataarray("ones/data.nc")
    assert (z_base == 0).all(), "zeros with relative path from base incorrect"
    assert (o_base == 1).all(), "ones with relative path from base incorrect"

    # Open from containing directory.
    os.chdir(base / "zeros")
    z_local = xr.open_dataarray("data.nc")
    os.chdir(base / "ones")
    o_local = xr.open_dataarray("data.nc")
    assert (z_local == 0).all(), "zeros opened from containing dir incorrect"
    assert (o_local == 1).all(), "ones opened from containing dir incorrect"
```

What happened: On master, the final assertion is triggered as the cache returns the zeros array instead of the ones.

What you expected to happen: No assertion.

Anything else we need to know?: This was introduced in 50d97e9d. I found this by running the above test script (saved as cache_bug.py) in a Git bisect session:

```console
$ git bisect start master v0.16.2
Bisecting: 88 revisions left to test after this (roughly 7 steps)
[d555172c7d069ca9cf7a9a32bfd5f422be133861] Allow swap_dims to take kwargs (#4841)
$ git bisect run python cache_bug.py
...
50d97e9d35bac783850827fa66ff5eb768e62905 is the first bad commit
...
```

I then manually confirmed this by running the script on 50d97e9d and its parent.

The caching is performed by xarray.backends.file_manager.CachingFileManager. The obvious solution would be to use pathlib / os.path (whichever is preferred in xarray) to convert the paths to absolute before caching. For example, changing the default netCDF4 backend from

https://github.com/pydata/xarray/blob/e56905889c836c736152b11a7e6117a229715975/xarray/backends/netCDF4_.py#L375-L377

to

```python
manager = CachingFileManager(
    netCDF4.Dataset, os.path.abspath(filename), mode=mode, kwargs=kwargs
)
```

fixes this for me. I guess this should be done (if needed) by each backend to keep CachingFileManager as general as possible.
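To illustrate why converting to an absolute path helps, here is a minimal standard-library sketch. `ToyFileCache` and `read_from_each_dir` are hypothetical names made up for this illustration; they are not xarray's CachingFileManager and only mimic the idea of a cache keyed on the path it is given.

```python
import os
import tempfile
from pathlib import Path


class ToyFileCache:
    """Hypothetical stand-in for a file cache keyed on the path it is given."""

    def __init__(self, normalize):
        # Function applied to the path before it is used as the cache key.
        self._normalize = normalize
        self._cache = {}

    def read(self, filename):
        key = self._normalize(filename)
        if key not in self._cache:
            # Cache miss: read from disk, relative to the current directory.
            self._cache[key] = Path(filename).read_text()
        return self._cache[key]


def read_from_each_dir(cache, base):
    """Change into each directory in turn and read "data.txt" by relative path."""
    results = []
    for name in ("zeros", "ones"):
        os.chdir(base / name)
        results.append(cache.read("data.txt"))
    return results


original_cwd = os.getcwd()
with tempfile.TemporaryDirectory() as d:
    base = Path(d).resolve()
    for name in ("zeros", "ones"):
        (base / name).mkdir()
        (base / name / "data.txt").write_text(name)
    try:
        # Key is the path exactly as given: both reads share the key "data.txt".
        relative_result = read_from_each_dir(ToyFileCache(str), base)
        # Key is the absolute path: the two files get distinct keys.
        absolute_result = read_from_each_dir(ToyFileCache(os.path.abspath), base)
    finally:
        # Leave the temporary directory before it is deleted.
        os.chdir(original_cwd)

print(relative_result)  # ['zeros', 'zeros']
print(absolute_result)  # ['zeros', 'ones']
```

With the unnormalized key the second read returns the first file's contents, which mirrors the failing final assertion in the example script. Normalizing with os.path.abspath at the call site, rather than inside the cache, keeps the cache itself agnostic about what its keys mean.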

If my analysis and proposed solution seem correct, I'm happy to work up a pull request with these fixes and some regression tests.

If you're wondering about the use case where I bumped into this problem: we're using Click for a CLI, and using its test helpers. One of these (isolated_filesystem) creates and changes into an empty temporary directory before running the CLI function under test, so we can use open_dataset("output.nc") to load the CLI output for checking. Since it does this in the same process, using a parametrized test function means the first created file is always loaded for checking. Took a while to track down what was happening!
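The situation can be sketched with the standard library alone. `isolated_dir` below is a hypothetical stand-in for a helper like isolated_filesystem (it is not Click's implementation), and `output_path_for_run` is an invented name for one parametrized test case. The sketch only shows that the same relative name refers to a different file on each run. Given that the absolute-path opens in the example script behave correctly, resolving the path before opening would presumably sidestep the collision on the caller's side.

```python
import contextlib
import os
import tempfile
from pathlib import Path


@contextlib.contextmanager
def isolated_dir():
    """Hypothetical stand-in for an isolated-filesystem test helper."""
    previous = os.getcwd()
    with tempfile.TemporaryDirectory() as d:
        os.chdir(d)
        try:
            yield Path(d).resolve()
        finally:
            # Return to the previous directory before the temporary one is removed.
            os.chdir(previous)


def output_path_for_run(value):
    """Simulate one parametrized test case writing and locating "output.nc"."""
    with isolated_dir():
        Path("output.nc").write_text(str(value))
        # Resolve while still inside the isolated directory.
        return Path("output.nc").resolve()


first = output_path_for_run(1)
second = output_path_for_run(2)
print(first.name == second.name)  # True: same relative name in both runs
print(first == second)            # False: different files once resolved
```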

Environment:

Output of xr.show_versions():

INSTALLED VERSIONS
------------------
commit: ec4e8b5f279e28588eee8ff43a328ca6c2f89f01
python: 3.9.2 (default, Feb 20 2021, 18:40:11) [GCC 10.2.0]
python-bits: 64
OS: Linux
OS-release: 5.11.11-arch1-1
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_NZ.UTF-8
LOCALE: en_NZ.UTF-8
libhdf5: 1.12.0
libnetcdf: 4.7.4
xarray: 0.17.0
pandas: 1.2.3
numpy: 1.20.1
scipy: 1.6.2
netCDF4: 1.5.6
pydap: None
h5netcdf: 0.9.0
h5py: 3.1.0
Nio: None
zarr: None
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.1
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.03.0
distributed: 2021.03.0
matplotlib: 3.4.1
cartopy: 0.18.0
seaborn: None
numbagg: None
pint: None
setuptools: 54.2.0
pip: 20.3.1
conda: None
pytest: 6.2.3
IPython: 7.22.0
sphinx: 3.5.2
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/5132/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
performed_via_github_app: (empty)
state_reason: completed
repo: 13221727
type: issue
