
issues


2 rows where state = "open" and user = 367900 sorted by updated_at descending


Issue #5148: Handling of non-string dimension names

id 856900805 · node_id MDU6SXNzdWU4NTY5MDA4MDU= · opened by bcbnz (367900) · state: open · 5 comments · created 2021-04-13T12:13:44Z · updated 2022-04-09T01:36:19Z · author_association: CONTRIBUTOR · repo: xarray

While working on a pull request (#5149) for #5146 I came across an inconsistency in allowed dimension names. If I try to create a DataArray with a non-string dimension, I get a TypeError:

```pycon
>>> import numpy as np
>>> import xarray as xr
>>> da = xr.DataArray(np.ones((5, 5)), dims=[1, "y"])
...
TypeError: dimension 1 is not a string
```

But creating it with a string and renaming it works:

```pycon
>>> da = xr.DataArray(np.ones((5, 5)), dims=["x", "y"]).rename(x=1)
>>> da
<xarray.DataArray (1: 5, y: 5)>
array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])
Dimensions without coordinates: 1, y
```

I can create a dataset via this renaming, but computing its repr fails: `xarray.core.utils.SortedKeysDict` tries to sort the dimension names and cannot compare the string dimension to the int dimension:

```pycon
>>> import xarray as xr
>>> ds = xr.Dataset({"test": xr.DataArray(np.ones((5, 5)), dims=["x", "y"]).rename(x=1)})
>>> ds
...
~/software/external/xarray/xarray/core/formatting.py in dataset_repr(ds)
    519
    520     dims_start = pretty_print("Dimensions:", col_width)
--> 521     summary.append("{}({})".format(dims_start, dim_summary(ds)))
    522
    523     if ds.coords:

~/software/external/xarray/xarray/core/formatting.py in dim_summary(obj)
    422
    423 def dim_summary(obj):
--> 424     elements = [f"{k}: {v}" for k, v in obj.sizes.items()]
    425     return ", ".join(elements)
    426

~/software/external/xarray/xarray/core/formatting.py in <listcomp>(.0)
    422
    423 def dim_summary(obj):
--> 424     elements = [f"{k}: {v}" for k, v in obj.sizes.items()]
    425     return ", ".join(elements)
    426

/usr/lib/python3.9/_collections_abc.py in __iter__(self)
    847
    848     def __iter__(self):
--> 849         for key in self._mapping:
    850             yield (key, self._mapping[key])
    851

~/software/external/xarray/xarray/core/utils.py in __iter__(self)
    437
    438     def __iter__(self) -> Iterator[K]:
--> 439         return iter(self.mapping)
    440
    441     def __len__(self) -> int:

~/software/external/xarray/xarray/core/utils.py in __iter__(self)
    504     def __iter__(self) -> Iterator[K]:
    505         # see #4571 for the reason of the type ignore
--> 506         return iter(sorted(self.mapping))  # type: ignore[type-var]
    507
    508     def __len__(self) -> int:

TypeError: '<' not supported between instances of 'str' and 'int'
```

The same thing happens if I call `rename` on the dataset rather than the array it is initialised with.
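The failure reduces to plain Python: str and int do not define an ordering against each other, so any `sorted()` call over mixed dimension names must raise. A minimal standalone sketch (no xarray needed):

```python
# sorted() needs pairwise "<" comparisons between the keys; comparing an
# int dimension name with a str one raises TypeError, exactly as in the
# SortedKeysDict frame above.
try:
    sorted(["y", 1])
except TypeError as exc:
    print(exc)  # '<' not supported between instances of 'int' and 'str'
```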

If the initialiser requires dimension names to be strings, and other code (including the HTML formatter I was looking at when I found this) assumes that they are, then rename and any other method that can alter dimension names should also enforce the string requirement.
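One way to close the gap would be to apply the initialiser's check in the renaming methods as well. A hypothetical sketch (`validate_dims` is an illustrative name, not xarray API), reusing the error message the initialiser already produces:

```python
def validate_dims(dims):
    """Hypothetical helper: raise the same error the DataArray
    initialiser raises for non-string dimension names."""
    for dim in dims:
        if not isinstance(dim, str):
            raise TypeError(f"dimension {dim} is not a string")


validate_dims(["x", "y"])  # passes silently
# validate_dims([1, "y"]) would raise: TypeError: dimension 1 is not a string
```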

Environment:

Output of `xr.show_versions()`:

```
INSTALLED VERSIONS
------------------
commit: 851d85b9203b49039237b447b3707b270d613db5
python: 3.9.2 (default, Feb 20 2021, 18:40:11) [GCC 10.2.0]
python-bits: 64
OS: Linux
OS-release: 5.11.13-arch1-1
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_NZ.UTF-8
LOCALE: en_NZ.UTF-8
libhdf5: 1.12.0
libnetcdf: 4.7.4
xarray: 0.17.0
pandas: 1.2.3
numpy: 1.20.1
scipy: 1.6.2
netCDF4: 1.5.6
pydap: None
h5netcdf: 0.10.0
h5py: 3.2.1
Nio: None
zarr: None
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.2
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.03.0
distributed: 2021.03.0
matplotlib: 3.4.1
cartopy: 0.18.0
seaborn: 0.11.1
numbagg: None
pint: None
setuptools: 54.2.0
pip: 20.3.1
conda: None
pytest: 6.2.3
IPython: 7.22.0
sphinx: 3.5.4
```
Issue #3781: to_netcdf() doesn't work with multiprocessing scheduler

id 567678992 · node_id MDU6SXNzdWU1Njc2Nzg5OTI= · opened by bcbnz (367900) · state: open · 4 comments · created 2020-02-19T16:28:22Z · updated 2021-09-25T16:02:41Z · author_association: CONTRIBUTOR · repo: xarray

If I create a chunked, lazily-computed array, writing it to disk with to_netcdf() computes and writes it with the threading and distributed schedulers, but not with the multiprocessing scheduler. The only reference I've found when searching for the exception message is this StackOverflow question.

MCVE Code Sample

```python
import dask
import dask.distributed
import numpy as np
import xarray as xr

if __name__ == "__main__":
    # Simple worker function.
    def inner(ds):
        if sum(ds.dims.values()) == 0:
            return ds
        return ds**2

    # Some random data to work with.
    ds = xr.Dataset(
        {"test": (("a", "b"), np.random.uniform(size=(1000, 1000)))},
        {"a": np.arange(1000), "b": np.arange(1000)},
    )

    # Chunk it and apply the worker to each chunk.
    ds_chunked = ds.chunk({"a": 100, "b": 200})
    ds_squared = ds_chunked.map_blocks(inner)

    # Thread pool scheduler can compute while writing.
    dask.config.set(scheduler="threads")
    print("Writing thread pool test to disk.")
    ds_squared.to_netcdf("test-threads.nc")

    # Local cluster with distributed works too.
    c = dask.distributed.Client()
    dask.config.set(scheduler=c)
    print("Writing local cluster test to disk.")
    ds_squared.to_netcdf("test-localcluster.nc")

    # Process pool scheduler can compute.
    dask.config.set(scheduler="processes")
    print("Computing with process pool scheduler.")
    ds_squared.compute()

    # But it cannot compute while writing.
    print("Trying to write process pool test to disk.")
    ds_squared.to_netcdf("test-process.nc")
```

Expected Output

Complete netCDF files should be created from all three schedulers.

Problem Description

The thread pool and distributed local cluster schedulers result in a complete output. The process pool scheduler fails when trying to write (note that test-process.nc is created with the header and coordinate information, but no actual data is written). The traceback is:

```pytb
Traceback (most recent call last):
  File "bug.py", line 54, in <module>
    ds_squared.to_netcdf("test-process.nc")
  File "/usr/lib/python3.8/site-packages/xarray/core/dataset.py", line 1535, in to_netcdf
    return to_netcdf(
  File "/usr/lib/python3.8/site-packages/xarray/backends/api.py", line 1097, in to_netcdf
    writes = writer.sync(compute=compute)
  File "/usr/lib/python3.8/site-packages/xarray/backends/common.py", line 198, in sync
    delayed_store = da.store(
  File "/usr/lib/python3.8/site-packages/dask/array/core.py", line 923, in store
    result.compute(**kwargs)
  File "/usr/lib/python3.8/site-packages/dask/base.py", line 165, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/lib/python3.8/site-packages/dask/base.py", line 436, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/lib/python3.8/site-packages/dask/multiprocessing.py", line 212, in get
    result = get_async(
  File "/usr/lib/python3.8/site-packages/dask/local.py", line 494, in get_async
    fire_task()
  File "/usr/lib/python3.8/site-packages/dask/local.py", line 460, in fire_task
    dumps((dsk[key], data)),
  File "/usr/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 62, in dumps
    cp.dump(obj)
  File "/usr/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 538, in dump
    return Pickler.dump(self, obj)
  File "/usr/lib/python3.8/multiprocessing/synchronize.py", line 101, in __getstate__
    context.assert_spawning(self)
  File "/usr/lib/python3.8/multiprocessing/context.py", line 363, in assert_spawning
    raise err
RuntimeError: Lock objects should only be shared between processes through inheritance
```

With a bit of editing of the system multiprocessing module I was able to determine that the lock being reported by this exception was the first lock created. I then added a breakpoint to the Lock constructor to get a traceback of what was creating it:

| File                 | Line | Function                  |
|----------------------|------|---------------------------|
| core/dataset.py      | 1535 | Dataset.to_netcdf         |
| backends/api.py      | 1071 | to_netcdf                 |
| backends/netCDF4_.py | 350  | open                      |
| backends/locks.py    | 114  | get_write_lock            |
| backends/locks.py    | 39   | _get_multiprocessing_lock |

This last function creates the offending `multiprocessing.Lock()` object. Note that six Locks are constructed in total, so it's possible that the later-created ones would also cause an issue.
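The exception itself can be reproduced without dask or xarray: a `multiprocessing.Lock` refuses to be pickled outside of process spawning, which is exactly what the scheduler's `dumps()` call attempts. A minimal sketch:

```python
import multiprocessing
import pickle

# multiprocessing locks implement __getstate__ via assert_spawning, so
# pickling one anywhere other than during process creation raises the
# RuntimeError seen in the traceback above.
lock = multiprocessing.Lock()
try:
    pickle.dumps(lock)
except RuntimeError as exc:
    print(exc)  # Lock objects should only be shared between processes through inheritance
```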

The h5netcdf backend has the same problem with Lock. However, the SciPy backend raises a NotImplementedError for this:

```python
ds_squared.to_netcdf("test-process.nc", engine="scipy")
```

```pytb
Traceback (most recent call last):
  File "bug.py", line 54, in <module>
    ds_squared.to_netcdf("test-process.nc", engine="scipy")
  File "/usr/lib/python3.8/site-packages/xarray/core/dataset.py", line 1535, in to_netcdf
    return to_netcdf(
  File "/usr/lib/python3.8/site-packages/xarray/backends/api.py", line 1056, in to_netcdf
    raise NotImplementedError(
NotImplementedError: Writing netCDF files with the scipy backend is not currently supported with dask's multiprocessing scheduler
```

I'm not sure how simple it would be to get this working with the multiprocessing scheduler, or how vital it is given that the distributed scheduler works. If nothing else, it would be good to raise the same NotImplementedError as the SciPy backend does.
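If the NotImplementedError route were taken, the guard could mirror the SciPy backend's message. A hypothetical sketch (`check_multiprocessing_scheduler` is an illustrative name, not xarray API):

```python
def check_multiprocessing_scheduler(scheduler, engine):
    """Hypothetical guard: refuse to write with dask's multiprocessing
    scheduler, mirroring the SciPy backend's message quoted above."""
    if scheduler == "processes":
        raise NotImplementedError(
            f"Writing netCDF files with the {engine} backend is not "
            "currently supported with dask's multiprocessing scheduler"
        )


check_multiprocessing_scheduler("threads", "netCDF4")  # no error
```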

Output of `xr.show_versions()`:

```
commit: None
python: 3.8.1 (default, Jan 22 2020, 06:38:00) [GCC 9.2.0]
python-bits: 64
OS: Linux
OS-release: 5.5.4-arch1-1
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_NZ.UTF-8
LOCALE: en_NZ.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.3
xarray: 0.15.0
pandas: 1.0.1
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.7.4
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.1.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.10.1
distributed: 2.10.0
matplotlib: 3.1.3
cartopy: 0.17.0
seaborn: None
numbagg: None
setuptools: 45.2.0
pip: 19.3
conda: None
pytest: 5.3.5
IPython: 7.12.0
sphinx: 2.4.2
```


```sql
CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
```
Powered by Datasette · Queries took 8000.942ms · About: xarray-datasette