
issues


2 rows where repo = 13221727, type = "issue" and user = 3309802 sorted by updated_at descending

Row 1 of 2 · issue #6432: Improve UX/documentation for loading data in cloud storage
id 1188946146 · node_id I_kwDOAMm_X85G3eDi · user gjoseph92 (3309802) · state open · locked 0 · comments 0 · created_at 2022-03-31T22:39:39Z · updated_at 2022-04-04T15:47:04Z · author_association NONE · repo xarray (13221727) · type issue

What is your issue?

I recently tried to use xarray to open some netCDF files stored in a bucket, and was surprised by how hard it was to figure out the right incantation to make this work.

The fact that passing an fsspec URL (like `"s3://bucket/path/data.zarr"`) to `open_dataset` "just works" for zarr is a little misleading, since it makes you think you could do something similar for other types of files. However, this doesn't work for netCDF, GRIB, and I assume most others.
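For concreteness, the zarr case that "just works" looks something like this (a sketch; the store URL is a hypothetical placeholder):

```python
import xarray as xr

# Hypothetical zarr store URL: with the zarr engine, xarray resolves the
# "s3://" string through fsspec internally, so a plain URL works.
# (Credentials, if needed, would go in backend_kwargs/storage_options.)
ds = xr.open_dataset("s3://bucket/path/data.zarr", engine="zarr")
```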

It turns out `h5netcdf` does work if you pass an fsspec file-like object (not sure if other engines support this as well?). But to add to the confusion, you can't pass the `fsspec.OpenFile` you get from `fsspec.open`; you have to pass a concrete type like `S3File`, `GCSFile`, etc.:

```python
import xarray as xr
import fsspec

url = "s3://noaa-nwm-retrospective-2-1-pds/model_output/1979/197902010100.CHRTOUT_DOMAIN1.comp"  # a netCDF file in s3
```

You can't use the URL as a string directly:

```python
xr.open_dataset(url, engine='h5netcdf')
```
```
KeyError                                  Traceback (most recent call last)
...
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = 's3://noaa-nwm-retrospective-2-1-pds/model_output/1979/197902010100.CHRTOUT_DOMAIN1.comp', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
```

Ok, what about `fsspec.open`?

```python
f = fsspec.open(url)
f
# <OpenFile 'noaa-nwm-retrospective-2-1-pds/model_output/1979/197902010100.CHRTOUT_DOMAIN1.comp'>
xr.open_dataset(f, engine='h5netcdf')
```
```
AttributeError                            Traceback (most recent call last)
...

File ~/miniconda3/envs/xarray-buckets/lib/python3.10/site-packages/xarray/backends/common.py:23, in _normalize_path(path)
     21 def _normalize_path(path):
     22     if isinstance(path, os.PathLike):
---> 23         path = os.fspath(path)
     25     if isinstance(path, str) and not is_remote_uri(path):
     26         path = os.path.abspath(os.path.expanduser(path))

File ~/miniconda3/envs/xarray-buckets/lib/python3.10/site-packages/fsspec/core.py:98, in OpenFile.fspath(self)
     96 def fspath(self):
     97     # may raise if cannot be resolved to local file
---> 98     return self.open().fspath()

AttributeError: 'S3File' object has no attribute 'fspath'
```

But if you somehow know that an `fsspec.OpenFile` isn't actually a file-like object, and you double-`open` it, then it works! (xref https://github.com/pydata/xarray/pull/5879#issuecomment-1085091126)

```python
s3f = f.open()
s3f
# <File-like object S3FileSystem, noaa-nwm-retrospective-2-1-pds/model_output/1979/197902010100.CHRTOUT_DOMAIN1.comp>
xr.open_dataset(s3f, engine='h5netcdf')
```
```
<xarray.Dataset>
Dimensions:         (time: 1, reference_time: 1, feature_id: 2776738)
Coordinates:
  * time            (time) datetime64[ns] 1979-02-01T01:00:00
  * reference_time  (reference_time) datetime64[ns] 1979-02-01
  * feature_id      (feature_id) int32 101 179 181 ... 1180001803 1180001804
    latitude        (feature_id) float32 ...
    longitude       (feature_id) float32 ...
...
```

(And even then, you have to know to use the `h5netcdf` engine, and not `netcdf4` or `scipy`.)
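In short, the working pattern above condenses to something like the following sketch (hedged: `anon=True` for anonymous access to the public bucket is an assumption; the snippets above don't show credentials):

```python
import fsspec
import xarray as xr

url = "s3://noaa-nwm-retrospective-2-1-pds/model_output/1979/197902010100.CHRTOUT_DOMAIN1.comp"

f = fsspec.open(url, anon=True)  # an fsspec.OpenFile, not yet a file-like object
s3f = f.open()                   # the concrete S3File that h5netcdf can read
ds = xr.open_dataset(s3f, engine="h5netcdf")
```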


Some things that might be nice:

1. Explicit documentation on working with data in cloud storage, perhaps broken down by file type/engine (xref https://github.com/pydata/xarray/issues/2712). It might be nice to have a table/quick reference of which engines support reading from cloud storage, and how to pass in the URL (string? fsspec file object?).
2. An informative error linking to these docs when opening fails and `is_remote_uri(filename_or_obj)` is true.
3. Either make `fsspec.OpenFile` objects work, so you don't have to do the double-open, or raise an informative error when one is passed in telling you what to do instead (a rough sketch follows this list).
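A rough sketch of what option 3 could look like (a hypothetical helper, not actual xarray code):

```python
import fsspec

def _coerce_openfile(obj):
    # Hypothetical helper: accept an fsspec.OpenFile by unwrapping it
    # into the concrete file-like object (S3File, GCSFile, ...) that
    # engines like h5netcdf expect, so users don't need the double-open.
    if isinstance(obj, fsspec.core.OpenFile):
        return obj.open()
    return obj
```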

As more and more data becomes available in cloud storage, newcomers to xarray will increasingly be looking to use it with remote data. xarray already supports this in some cases, which is great! With a few tweaks to docs and error messages, I think we could turn an experience that took me multiple hours of debugging and reading the source into an easy 30-second experience for new users.

cc @martindurant @phobson

reactions:

```json
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6432/reactions",
    "total_count": 7,
    "+1": 7,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
```
Row 2 of 2 · issue #6433: Rename/reword `parallel=True` option to `open_mfdataset`
id 1188965542 · node_id I_kwDOAMm_X85G3iym · user gjoseph92 (3309802) · state open · locked 0 · comments 2 · created_at 2022-03-31T22:52:09Z · updated_at 2022-04-01T11:15:39Z · author_association NONE · repo xarray (13221727) · type issue

What is your issue?

Based on its name, I was surprised to find that `open_mfdataset(..., parallel=True)` computed the whole dataset eagerly, whereas `parallel=False` just returned it in dask form. (I generally think of "dask" as related to "parallel".)

I guess the docs do technically say this, but it's a bit hard to parse:

> If True, the open and preprocess steps of this function will be performed in parallel using dask.delayed. Default is False.

The docstring could maybe instead say: "If False (default), the data is returned in dask form. If True, it will be computed immediately (using dask), then returned in NumPy form."

More intuitive to me would be renaming the argument to `compute=False`. Or even deprecating the argument entirely and adding a `load_mfdataset` function, in the same way that `load_dataset` is the eager version of `open_dataset`.
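For illustration, such a `load_mfdataset` might look something like this (a hypothetical sketch by analogy with `load_dataset`, not part of xarray's API):

```python
import xarray as xr

def load_mfdataset(paths, **kwargs):
    # Hypothetical eager counterpart to open_mfdataset: open lazily,
    # load into memory, then close the underlying file handles.
    with xr.open_mfdataset(paths, **kwargs) as ds:
        return ds.load()
```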

reactions:

```json
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/6433/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
```

The underlying table schema:

```sql
CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
```
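For reference, the filter described at the top of this page (2 rows where repo = 13221727, type = "issue" and user = 3309802, sorted by updated_at descending) corresponds to a query along these lines (a sketch; `github.db` as a local copy of this database is an assumption):

```python
import sqlite3

# "github.db" is a hypothetical local export of this Datasette database.
conn = sqlite3.connect("github.db")
rows = conn.execute(
    """
    select id, number, title, state, updated_at
    from issues
    where repo = ? and type = 'issue' and "user" = ?
    order by updated_at desc
    """,
    (13221727, 3309802),
).fetchall()
print(rows)
```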
Powered by Datasette · About: xarray-datasette