
issue_comments


11 rows where user = 57705593 sorted by updated_at descending


Columns: id · html_url · issue_url · node_id · user (login and id) · created_at · updated_at (sorted descending) · author_association · body · reactions · performed_via_github_app · issue
1287864880 https://github.com/pydata/xarray/issues/7186#issuecomment-1287864880 https://api.github.com/repos/pydata/xarray/issues/7186 IC_kwDOAMm_X85Mw0Iw tovogt 57705593 2022-10-22T17:37:05Z 2022-10-22T17:37:41Z CONTRIBUTOR

The reason for this behavior is that the netcdf4 python package automatically determines the type of the attribute (NC_CHAR or NC_STRING) by attempting a conversion to ASCII: https://github.com/Unidata/netcdf4-python/issues/529 However, if the value is a byte string, no conversion is done. So, the easiest solution would be to manually encode as utf-8 and then pass the byte string to netcdf4. Unfortunately, xarray doesn't support byte strings as attribute values even though this is a valid data type for the netcdf4 engine: https://github.com/pydata/xarray/blob/6cb97f645475bddf2f3b1e1a5f24f0f9de690683/xarray/backends/api.py#L175 In the long term, I would suggest adding bytes as a supported type in that list on xarray's side.
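The type-selection behavior described above can be sketched in plain Python (a rough illustration of the decision logic, not the actual netcdf4-python code; `pick_nc_type` is a hypothetical name):

```python
def pick_nc_type(value):
    """Illustrate how netcdf4-python chooses between NC_CHAR and NC_STRING.

    Byte strings are passed through without conversion; unicode strings
    that can be encoded as ASCII become NC_CHAR, all others NC_STRING.
    """
    if isinstance(value, bytes):
        # no conversion is attempted for byte strings
        return "NC_CHAR"
    try:
        value.encode("ascii")
        return "NC_CHAR"
    except UnicodeEncodeError:
        return "NC_STRING"

print(pick_nc_type("plain ascii"))           # NC_CHAR
print(pick_nc_type("hää"))                   # NC_STRING
print(pick_nc_type("hää".encode("utf-8")))   # NC_CHAR: bytes pass through
```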

A quick workaround for you might be to encode the string as utf-8 and convert it to a numpy array, since xarray accepts numpy arrays as attribute values and netcdf4 automatically extracts the data if the array contains only a single item:

```python
ds["x"].attrs["third_str"] = np.array("hää".encode("utf-8"))
```
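To see why this delivers a byte string to netcdf4, note that a zero-dimensional numpy array built from bytes stores exactly those bytes (a minimal check, assuming numpy is available):

```python
import numpy as np

# The utf-8 encoding of "hää" wrapped in a 0-d numpy array;
# .item() recovers the original byte string unchanged.
attr = np.array("hää".encode("utf-8"))
assert attr.ndim == 0
assert attr.item() == b"h\xc3\xa4\xc3\xa4"
```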

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
  netCDF4: support byte strings as attribute values 1414669747
1069096876 https://github.com/pydata/xarray/issues/6310#issuecomment-1069096876 https://api.github.com/repos/pydata/xarray/issues/6310 IC_kwDOAMm_X84_uR-s tovogt 57705593 2022-03-16T12:55:28Z 2022-03-16T12:55:28Z CONTRIBUTOR

Yes, I tested this locally, and here is the PR for that suggestion: #6366

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Only auxiliary coordinates are listed in nc variable attribute 1154014066
1069089232 https://github.com/pydata/xarray/issues/6310#issuecomment-1069089232 https://api.github.com/repos/pydata/xarray/issues/6310 IC_kwDOAMm_X84_uQHQ tovogt 57705593 2022-03-16T12:46:33Z 2022-03-16T12:46:33Z CONTRIBUTOR

No, in this case "coordinates" in encoding will resolve to False, so that's not the problematic step.

I think the problem is that the author of line 773 assumed that pop_to would return the value of attrs['coordinates'] if encoding['coordinates'] is not set. That is wrong, and the line should be fixed to look like this instead:

```python
coords_str = pop_to(encoding, attrs, "coordinates") or attrs.get("coordinates")
```
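To make the failure mode concrete, here is a minimal stand-in for the internal pop_to helper (a sketch under the assumption that it moves a key from one dict to another and returns the moved value, or None when the key is absent):

```python
def pop_to(source, dest, key):
    # Hypothetical reimplementation for illustration only: move `key`
    # from `source` into `dest`, returning the moved value (or None).
    value = source.pop(key, None)
    if value is not None:
        dest[key] = value
    return value

encoding = {}
attrs = {"coordinates": "time lon lat"}

# The original line returns None because "coordinates" is not in encoding:
assert pop_to(encoding, attrs, "coordinates") is None

# The suggested fix falls back to the value already present in attrs:
coords_str = pop_to(encoding, attrs, "coordinates") or attrs.get("coordinates")
assert coords_str == "time lon lat"
```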

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Only auxiliary coordinates are listed in nc variable attribute 1154014066
1069073130 https://github.com/pydata/xarray/issues/6310#issuecomment-1069073130 https://api.github.com/repos/pydata/xarray/issues/6310 IC_kwDOAMm_X84_uMLq tovogt 57705593 2022-03-16T12:27:14Z 2022-03-16T12:27:14Z CONTRIBUTOR

@DWesl Thanks for digging into the details of the CF conventions! I read your post in the sense that solution (1) is the one to choose.

I have one more question though before we close this:

When setting "coordinates" as a variable attribute instead of as an encoding, the value of that variable attribute is overwritten when writing to a NetCDF file, see this example:

```python
import xarray as xr

ds = xr.Dataset(
    {"values": ('time', [0.0, 0.1])},
    coords={
        'time': ('time', [0, 1]),
        'lat': ('time', [5, 4]),
        'lon': ('time', [10, 12]),
    })
ds['values'].attrs['coordinates'] = "time lon lat"
ds.to_netcdf("test.nc")
```

```shell
$ ncdump -h test.nc
netcdf test {
dimensions:
	time = 2 ;
variables:
	double values(time) ;
		values:_FillValue = NaN ;
		values:coordinates = "lon lat" ;
	int64 time(time) ;
	int64 lat(time) ;
	int64 lon(time) ;
}
```

Is this intended, and if so, what is the reason?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Only auxiliary coordinates are listed in nc variable attribute 1154014066
1068869457 https://github.com/pydata/xarray/issues/6310#issuecomment-1068869457 https://api.github.com/repos/pydata/xarray/issues/6310 IC_kwDOAMm_X84_tadR tovogt 57705593 2022-03-16T08:39:02Z 2022-03-16T08:40:08Z CONTRIBUTOR

Hey @dcherian,

Thanks for your consideration! I just started to create a new branch with my suggested changes, but then I noticed that setting the coordinates attribute to include non-auxiliary dimensions by default makes a lot of unit tests fail. That made me think about this again: in fact, always setting the coordinates attribute is permissible according to the conventions. However, most users will probably be confused by the clutter, and we might run into some discussions about this in the future.

In my original post, the example can easily be fixed to reflect the coordinate attributes in the conventions (https://cfconventions.org/cf-conventions/cf-conventions.html#_single_trajectory) by enforcing the attribute value in the encoding:

```python
import xarray as xr

ds = xr.Dataset(
    {"values": ('time', [0.0, 0.1])},
    coords={
        'time': ('time', [0, 1]),
        'lat': ('time', [5, 4]),
        'lon': ('time', [10, 12]),
    })
ds['values'].encoding['coordinates'] = "time lon lat"
ds.to_netcdf("test.nc")
```

```shell
$ ncdump -h test.nc
netcdf test {
dimensions:
	time = 2 ;
variables:
	double values(time) ;
		values:_FillValue = NaN ;
		values:coordinates = "time lon lat" ;
	int64 time(time) ;
	int64 lat(time) ;
	int64 lon(time) ;
}
```

From here, there are several ways to proceed:

1. Stick to the current logic, which might be non-conformant with the CF conventions in the case of "Discrete Sampling Geometries". However, users can manually fix this by setting the coordinates in encoding.
2. Enforce always listing all non-auxiliary coordinates in the variable attribute, which is conformant with the conventions, but generates a lot of clutter.
3. Implement logic to recognize cases where a dataset is a "Discrete Sampling Geometry" and only then list the non-auxiliary coordinates in the variable attribute. This is a bit tricky, and I don't have the time to implement it, I'm afraid.

I actually suggest choosing solution (1) and closing this issue as "won't fix". What do you think?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Only auxiliary coordinates are listed in nc variable attribute 1154014066
1054195542 https://github.com/pydata/xarray/issues/6310#issuecomment-1054195542 https://api.github.com/repos/pydata/xarray/issues/6310 IC_kwDOAMm_X84-1b9W tovogt 57705593 2022-02-28T12:13:46Z 2022-02-28T12:33:07Z CONTRIBUTOR

If you are interested in a fix, I would modify the following section: https://github.com/pydata/xarray/blob/613a8fda4f07181fbc41d6ff2296fec3726fd351/xarray/conventions.py#L726-L744 The variable_coordinates should be initialized for each variable v with the names in v.dims. Currently, the dimension names are explicitly excluded, so it seems this behavior was originally implemented on purpose, because the main use of the coordinates attribute according to the CF conventions is to list the auxiliary coordinates. However, the conventions also say: "Note that it is permissible, but optional, to list coordinate variables as well as auxiliary coordinate variables in the coordinates attribute." (https://cfconventions.org/cf-conventions/cf-conventions.html#coordinate-system)

So, this issue is about including the non-auxiliary coordinates by default because it's permissible in general, and it is even required for so-called "Discrete Sampling Geometries".
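The proposed change can be sketched abstractly (illustrative names only, not the actual conventions.py code): seed each variable's coordinate list with its own dimension names instead of excluding them.

```python
def coordinates_attr(var_dims, dataset_coords):
    """Build the value of the 'coordinates' attribute for one variable.

    Suggested behavior: start from the variable's own dimension names
    (currently excluded) and then add the auxiliary coordinates.
    """
    names = [c for c in dataset_coords if c in var_dims]       # dimension coordinates
    names += [c for c in dataset_coords if c not in var_dims]  # auxiliary coordinates
    return " ".join(names)

# Trajectory-style example from the issue: time is a dimension coordinate,
# lat/lon are auxiliary coordinates defined on the time dimension.
print(coordinates_attr(("time",), ["time", "lat", "lon"]))  # "time lat lon"
```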

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Only auxiliary coordinates are listed in nc variable attribute 1154014066
1028730657 https://github.com/pydata/xarray/issues/6174#issuecomment-1028730657 https://api.github.com/repos/pydata/xarray/issues/6174 IC_kwDOAMm_X849US8h tovogt 57705593 2022-02-03T08:39:45Z 2022-02-03T08:41:16Z CONTRIBUTOR

> Have you seen xarray.save_mfdataset?
>
> In principle, it was designed for exactly this sort of thing.

Thanks for the hint! Unfortunately, the docstring already says that "it is no different than calling to_netcdf repeatedly". And I explained in my OP that this would cause repeated file open/close operations, which is the whole point of this issue.

Furthermore, when using save_mfdataset with my setup, it complains: `ValueError: cannot use mode='w' when writing multiple datasets to the same path` But when using mode='a' instead, it complains that the file doesn't exist.

However, it might still be the way to go API-wise. So, when talking about the solution of this issue, we could aim at fixing save_mfdataset:

1. Writing to the same file should use a single open/close operation.
2. Support mode='w' (or mode='w+') when writing several datasets to the same path.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [FEATURE]: Read from/write to several NetCDF4 groups with a single file open/close operation 1108138101
1019879801 https://github.com/pydata/xarray/issues/6174#issuecomment-1019879801 https://api.github.com/repos/pydata/xarray/issues/6174 IC_kwDOAMm_X848yiF5 tovogt 57705593 2022-01-24T09:16:40Z 2022-01-24T09:16:40Z CONTRIBUTOR

> That's good at least! Do you have any suggestions for where the docs should be improved? PRs are of course always welcome too

Here is my PR for the docstring improvements: https://github.com/pydata/xarray/pull/6187

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [FEATURE]: Read from/write to several NetCDF4 groups with a single file open/close operation 1108138101
1019849836 https://github.com/pydata/xarray/issues/6174#issuecomment-1019849836 https://api.github.com/repos/pydata/xarray/issues/6174 IC_kwDOAMm_X848yaxs tovogt 57705593 2022-01-24T08:43:36Z 2022-01-24T08:43:36Z CONTRIBUTOR

It's not at all tricky to implement the listing of groups in a NetCDF4 file, at least not for the "netcdf4" engine. The code for that is in my OP above:

```python
def _xr_nc4_groups_from_store(store):
    """List all groups contained in the given NetCDF4 data store

    Parameters
    ----------
    store : xarray.backend.NetCDF4DataStore

    Returns
    -------
    list of str
    """
    def iter_groups(ds, prefix=""):
        groups = [""]
        for group_name, group_ds in ds.groups.items():
            groups.extend([f"{prefix}{group_name}{subgroup}"
                           for subgroup in iter_groups(group_ds, prefix="/")])
        return groups

    with store._manager.acquire_context(False) as root:
        return iter_groups(root)
```
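As a quick sanity check of the recursive walk, the inner iter_groups logic can be exercised with a tiny stand-in object that mimics only the .groups mapping of a netCDF4 Dataset (FakeGroup is purely illustrative):

```python
class FakeGroup:
    """Stand-in for a netCDF4 group: only the .groups mapping is mimicked."""
    def __init__(self, groups=None):
        self.groups = groups or {}

def iter_groups(ds, prefix=""):
    # Same recursion as in the comment above: "" denotes the current group,
    # children are joined with "/" separators.
    groups = [""]
    for group_name, group_ds in ds.groups.items():
        groups.extend([f"{prefix}{group_name}{subgroup}"
                       for subgroup in iter_groups(group_ds, prefix="/")])
    return groups

root = FakeGroup({
    "a": FakeGroup({"x": FakeGroup()}),
    "b": FakeGroup(),
})
print(iter_groups(root))  # ['', 'a', 'a/x', 'b']
```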

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [FEATURE]: Read from/write to several NetCDF4 groups with a single file open/close operation 1108138101
1018257806 https://github.com/pydata/xarray/issues/6174#issuecomment-1018257806 https://api.github.com/repos/pydata/xarray/issues/6174 IC_kwDOAMm_X848sWGO tovogt 57705593 2022-01-21T07:40:55Z 2022-01-21T07:46:06Z CONTRIBUTOR

When I first posted this issue, I thought the best solution was to just implement my proposed helper functions as part of the official xarray API. I don't think our project would add DataTree as a new dependency just for this, as long as we have a very easy and viable solution of our own.

But now I have a new idea. At first, I noticed that open_dataset won't actually close the file handle, but will reuse it later if needed. So, at least there is no performance problem with the current read setup.

For writing, there should be an option in to_netcdf that ensures that xarray does not close the file handle. xarray already uses a CachingFileManager to open NetCDF4 files: https://github.com/pydata/xarray/blob/0ffb0f42282a1b67c4950e90e1e4ecd146307aa8/xarray/backends/netCDF4_.py#L379-L381 That means the manager already ensures that the same file handle is re-used in subsequent to_netcdf operations on the same file, unless it's closed in the meantime. Closing is managed here: https://github.com/pydata/xarray/blob/0ffb0f42282a1b67c4950e90e1e4ecd146307aa8/xarray/backends/api.py#L1072-L1094 It's a bit opaque when closing is actually triggered in practice, especially if you only look at the current docstrings. I found that, in fact, setting compute=False in to_netcdf will prevent the closing until you explicitly call compute on the returned object:

```python
for name, ds in zip(ds_names, ds_list):
    delayed = ds.to_netcdf(path, group=name, compute=False)
    delayed.compute()
```

If this were communicated more transparently in the docstrings, it would bring us a big step closer to the solution of this issue :slightly_smiling_face:

Apart from that, there is only one problem left: getting a full list of all groups contained in a NetCDF4 file so that we can read them all in. In DataTree, you fall back to using the NetCDF4 (or h5netcdf) API directly for that purpose: _get_nc_dataset_class and _iter_nc_groups. That's not the worst solution. However, I would insist that xarray should be able to do this. Maybe we need an open_datasets_from_groups function for that, or rather a function list_datasets. But it should somehow be solvable within the xarray API without requiring a two-year debate about the management and representation of hierarchical data structures.
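The single-open/close idea behind deferring the writes can be illustrated with a stdlib-only sketch (all names are illustrative, not xarray API): queued writes share one open handle and are flushed together instead of reopening the file per group.

```python
import os
import tempfile

class DelayedMultiWrite:
    """Toy model of deferring writes so one open/close covers all groups."""
    def __init__(self, path):
        self.path = path
        self._pending = []  # (group, payload) pairs queued for writing

    def add(self, group, payload):
        self._pending.append((group, payload))

    def compute(self):
        # One open/close operation for all queued groups.
        with open(self.path, "w") as fh:
            for group, payload in self._pending:
                fh.write(f"{group}: {payload}\n")

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
writer = DelayedMultiWrite(path)
for name, payload in [("group1", "a"), ("group2", "b")]:
    writer.add(name, payload)
writer.compute()

with open(path) as fh:
    lines = fh.read().splitlines()
print(lines)  # ['group1: a', 'group2: b']
```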

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [FEATURE]: Read from/write to several NetCDF4 groups with a single file open/close operation 1108138101
1017298572 https://github.com/pydata/xarray/issues/6174#issuecomment-1017298572 https://api.github.com/repos/pydata/xarray/issues/6174 IC_kwDOAMm_X848or6M tovogt 57705593 2022-01-20T09:53:16Z 2022-01-20T09:53:32Z CONTRIBUTOR

Thanks for your quick response, Tom!

I'm sure that DataTree is a really neat solution for most people working with hierarchically structured data.

In my case, we are talking about a very unusual application of the NetCDF4 groups feature: We store literally thousands of very small NetCDF datasets in a single file. A file containing 3000 datasets is typically not larger than 100 MB.

With that setup, the I/O performance is critical. Opening and closing the file on each group read/write is very, very bad. On our cluster this means that writing that 100 MB file takes 10 hours with your DataTree implementation, and 30 minutes with my helper functions. For reading, the effect is smaller, but still noticeable.

So, my request is really about the I/O performance, and I don't need a full-fledged hierarchical data management API in xarray for that.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [FEATURE]: Read from/write to several NetCDF4 groups with a single file open/close operation 1108138101

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · About: xarray-datasette