issue_comments

11 rows where author_association = "MEMBER" and issue = 187859705 sorted by updated_at descending

user 3

  • shoyer 6
  • benbovy 4
  • rabernat 1

issue 1

  • Dataset groups · 11

author_association 1

  • MEMBER · 11
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
873227866 https://github.com/pydata/xarray/issues/1092#issuecomment-873227866 https://api.github.com/repos/pydata/xarray/issues/1092 MDEyOklzc3VlQ29tbWVudDg3MzIyNzg2Ng== shoyer 1217238 2021-07-02T19:56:49Z 2021-07-02T19:56:49Z MEMBER

There's a parallel discussion of hierarchical storage going on over in https://github.com/pydata/xarray/issues/4118. I'm going to close this issue in favor of the other one just to keep the ongoing discussion in one place.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset groups 187859705
290256766 https://github.com/pydata/xarray/issues/1092#issuecomment-290256766 https://api.github.com/repos/pydata/xarray/issues/1092 MDEyOklzc3VlQ29tbWVudDI5MDI1Njc2Ng== benbovy 4160723 2017-03-29T23:26:50Z 2017-03-29T23:26:50Z MEMBER

How to handle dimensions and coordinate names when assigning groups is clearly one of the important design decisions here. It's obvious that data variables should be grouped but less clear how to handle dimensions/coordinates.

I would be +1 for allowing tuples for data variable names but not for dimension/coordinate names. It does indeed look like using tuples for the latter would be a greater source of confusion and would add too much complexity for little (or no real?) benefit.

I'd be fine with raising an error when loading a netCDF4 file that has groups with conflicting dimensions, or when assigning an incompatible Dataset as a new group (e.g., ds['flux'] = incompatible_ds).

For groups that share common dimensions/coordinates with some differences, a data structure built on top of Dataset (like DatasetGroup or DatasetNode) would be more appropriate I think.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset groups 187859705
290250866 https://github.com/pydata/xarray/issues/1092#issuecomment-290250866 https://api.github.com/repos/pydata/xarray/issues/1092 MDEyOklzc3VlQ29tbWVudDI5MDI1MDg2Ng== shoyer 1217238 2017-03-29T22:54:17Z 2017-03-29T22:54:17Z MEMBER

I'm also having thoughts about the attribute access: if ds['flux']['poloidal'] = subset does not work, then neither does ds.flux.poloidal = subset, correct? If so, it is almost pointless to have the attribute access in the first place.

Yes, this is correct. But note that ds.flux = array is also not supported -- attribute access in xarray only works for getting, not setting. If you try it, you get an error message, e.g., AttributeError: cannot set attribute 'bar' on a 'DataArray' object. Use __setitem__ style assignment (e.g., `ds['name'] = ...`) instead to assign variables.
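The get-only behaviour described above can be sketched in plain Python. `ReadOnlyAttrs` is a hypothetical stand-in, not xarray's actual implementation: attribute lookup falls back to the stored variables, while attribute assignment raises and points the user at item assignment.

```python
class ReadOnlyAttrs:
    """Hypothetical container: attributes work for getting, not setting."""

    def __init__(self):
        # Bypass our own __setattr__ to create the internal storage.
        object.__setattr__(self, "_variables", {})

    def __getitem__(self, name):
        return self._variables[name]

    def __setitem__(self, name, value):
        self._variables[name] = value

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._variables[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        raise AttributeError(
            f"cannot set attribute {name!r}; use obj[{name!r}] = ... instead"
        )


ds = ReadOnlyAttrs()
ds["flux"] = [1.0, 2.0]
print(ds.flux)           # reads go through __getattr__
try:
    ds.flux = [3.0]      # writes are rejected
except AttributeError as err:
    print(err)
```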

A big difference between xarray and netCDF4-python datasets is that the children datasets in xarray can go have a life of their own, independent of their parent & the file it represents.

Yes, this is true. We would possibly want to make another Dataset subclass for the sub-datasets to ensure that their variables are linked to the parent, e.g., xarray.LinkedDataset. This would enable ds['flux']['poloidal'] = subset.

But I'm also not convinced this is actually worth the trouble given how easy it is to write ds['flux', 'poloidal'].

NumPy has similar issues, e.g., x[i, j] = y works but x[i][j] = y does not.
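A short NumPy illustration of that last point, with one caveat: the chained form only fails when the first index produces a copy (for example a list or boolean-mask index). With a plain integer or slice, `x[i]` is a view and the chained assignment does write through. The array shapes and index values here are arbitrary.

```python
import numpy as np

x = np.zeros((3, 3))
rows = [0, 2]            # "fancy" (list) index: x[rows] is a copy

x[rows, 1] = 1.0         # single indexing operation: writes into x
assert x[0, 1] == 1.0 and x[2, 1] == 1.0

y = np.zeros((3, 3))
y[rows][:, 1] = 1.0      # assigns into a temporary copy, y is unchanged
assert not y.any()

z = np.zeros((3, 3))
z[0][1] = 1.0            # integer index returns a view, so this does work
assert z[0, 1] == 1.0
```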

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset groups 187859705
290165548 https://github.com/pydata/xarray/issues/1092#issuecomment-290165548 https://api.github.com/repos/pydata/xarray/issues/1092 MDEyOklzc3VlQ29tbWVudDI5MDE2NTU0OA== shoyer 1217238 2017-03-29T17:38:03Z 2017-03-29T17:38:49Z MEMBER

The background flux is defined to be uniform in some coordinates, so it is lower-dimensionality than the total flux. It doesn't make sense to turn a 1-D variable into a 3-D variable just to match the others so I can put it into an array.

Yes, totally agreed, and I've encountered similar cases in my own work. These sorts of "ragged" arrays are a great use case for groups.

Accessing via ds['flux','poloidal'] is a bit confusing because ds[] is (I think) a dictionary, but supplying multiple names is suggestive of either array indexing or getting a list with two things inside, flux and poloidal. That is, the syntax doesn't reflect the semantics very well.

Yes, it's a little confusing because it looks similar to ds[['flux','poloidal']], which has a different meaning. But otherwise programmatic access starts turning into a mess of string manipulation: with tuples you can write ds['flux', subgroup] rather than ds['flux/' + subgroup].
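The distinction between the two spellings comes from Python's subscript syntax rather than from xarray. A minimal sketch, using a hypothetical `KeyEcho` class that simply returns whatever key it receives:

```python
class KeyEcho:
    """Hypothetical helper that returns the key passed to obj[...]."""

    def __getitem__(self, key):
        return key


ds = KeyEcho()
subgroup = "poloidal"

# Comma-separated subscripts arrive as one tuple...
assert ds["flux", subgroup] == ("flux", "poloidal")
# ...identical to passing the tuple explicitly.
assert ds["flux", subgroup] == ds[("flux", subgroup)]
# A list is a different key type, which xarray uses to select several variables.
assert ds[["flux", subgroup]] == ["flux", "poloidal"]
# The string-based alternative needs manual joining.
assert ds["flux/" + subgroup] == "flux/poloidal"
```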

If I am at the console, and I start typing ds.flux and use the tab-completion, does that end up creating a new dataset just so I can see what is inside ds.flux? Is that an expensive operation?

Yes, it would create a new dataset, which could take ~1 ms. That's slow for inner loops (though we could add caching to help), but plenty fast for interactive use.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset groups 187859705
290142369 https://github.com/pydata/xarray/issues/1092#issuecomment-290142369 https://api.github.com/repos/pydata/xarray/issues/1092 MDEyOklzc3VlQ29tbWVudDI5MDE0MjM2OQ== shoyer 1217238 2017-03-29T16:18:12Z 2017-03-29T16:18:12Z MEMBER

@shoyer do you have an idea on how it would work with serialization to netCDF?

With netCDF4, we could potentially just use groups. Or we could use some sort of naming convention for strings, e.g., joining together the parts of the tuple with `.`.
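As a rough sketch of the string-convention option, tuple names could be flattened to strings on write and split again on read. The helper names are assumptions for illustration and the separator is a parameter; the round trip is only safe if the separator never appears inside a name part.

```python
def encode_name(parts, sep="."):
    """Join a tuple name such as ('flux', 'poloidal') into one string."""
    if isinstance(parts, str):
        parts = (parts,)
    for part in parts:
        if sep in part:
            raise ValueError(f"{part!r} contains the separator {sep!r}")
    return sep.join(parts)


def decode_name(name, sep="."):
    """Split a stored string back into a tuple name."""
    return tuple(name.split(sep))


stored = encode_name(("flux", "poloidal", "perturbed"))
assert stored == "flux.poloidal.perturbed"
assert decode_name(stored) == ("flux", "poloidal", "perturbed")
```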

One challenge here is that unless we also let dimensions be group specific, not every netCDF4 file with groups corresponds to a valid xarray Dataset: you can have conflicting sizes on dimensions for netCDF4 files in different groups.
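The validity condition can be sketched as a check over per-group dimension sizes. The input format (a dict mapping group name to a `{dimension: size}` dict) and the function name are hypothetical, chosen only to illustrate the conflict.

```python
def find_dimension_conflicts(group_dims):
    """Return {dim: {size: [groups]}} for dimensions whose size differs between groups."""
    seen = {}
    for group, dims in group_dims.items():
        for dim, size in dims.items():
            seen.setdefault(dim, {}).setdefault(size, []).append(group)
    return {dim: sizes for dim, sizes in seen.items() if len(sizes) > 1}


groups = {
    "flux": {"time": 100, "radius": 32},
    "background": {"time": 10},      # same dimension name, different size
}
conflicts = find_dimension_conflicts(groups)
assert conflicts == {"time": {100: ["flux"], 10: ["background"]}}
```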

In principle, it could be OK to use tuples for dimension names, but we already have lots of logic that distinguishes between single and multiple dimensions by looking for non-strings or tuples. So you would probably have to write ds.sum(dim=[('flux', 'time')]) if you wanted to sum over the 'time' dimension of the 'flux' group. https://github.com/pydata/xarray/issues/1231 would help here (e.g., to enable ds.sel({('flux', 'time'): time})), but cases like ds.sum(dim=('flux', 'time')) would still be a source of confusion.
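The ambiguity can be illustrated with a simplified stand-in for that single-versus-multiple logic. `normalize_dims` is hypothetical and much cruder than xarray's real handling; it only shows why a bare tuple would be read as two dimensions.

```python
def normalize_dims(dim):
    """Treat a string as one dimension and any other iterable as several."""
    if isinstance(dim, str):
        return [dim]
    return list(dim)


# A bare tuple is interpreted as two separate dimensions...
assert normalize_dims(("flux", "time")) == ["flux", "time"]
# ...so a tuple-valued name has to be wrapped in a list to count as one.
assert normalize_dims([("flux", "time")]) == [("flux", "time")]
assert normalize_dims("time") == ["time"]
```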

How to handle dimensions and coordinate names when assigning groups is clearly one of the important design decisions here. It's obvious that data variables should be grouped but less clear how to handle dimensions/coordinates.

We would also have to decide how to display groups in the repr of the flat dataset...

Some sort of further indentation seems natural, possibly with truncation like ... for cases when the number of variables is very large (>10), e.g., Data variables: flux poloidal perturbed
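One possible rendering, sketched in plain Python: `format_groups` is a hypothetical helper that indents each level of a tuple name and truncates long sibling lists with `...`. The layout and the cutoff are assumptions, not a settled design.

```python
def format_groups(names, max_items=10, indent="    "):
    """Render tuple names as an indented tree, truncating long levels."""
    # Build a nested dict from the tuple names.
    tree = {}
    for name in names:
        node = tree
        for part in name:
            node = node.setdefault(part, {})

    lines = ["Data variables:"]

    def walk(node, depth):
        for i, (key, child) in enumerate(node.items()):
            if i >= max_items:
                lines.append(indent * depth + "...")
                break
            lines.append(indent * depth + key)
            walk(child, depth + 1)

    walk(tree, 1)
    return "\n".join(lines)


print(format_groups([
    ("flux", "poloidal", "perturbed"),
    ("flux", "toroidal"),
    ("density",),
]))
```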

This is another case where an HTML repr could be powerful, allowing for clearer visual links and potentially interactive expanding/contracting of the tree.

Would the domain for this just be to simulate the tree-like structure that NetCDF permits, or could it extend to multiple datasets on disk?

From xarray's perspective, there isn't really a distinction between multiple files and groups in one netCDF file -- it's just a matter of creating a Dataset with data organized in a different way. Presumably we could write helper methods for converting a dimension into a group level (and vice-versa).

But it's worth noting that there are still limitations to opening large numbers of files in a single dataset, even with groups, because xarray reads all the metadata for every variable into memory at once, and that metadata is copied in every xarray operation. For this reason, you will still probably want a different data structure (convertible into an xarray.Dataset) when navigating very large datasets like CMIP, which consists of many thousands of files.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset groups 187859705
290130241 https://github.com/pydata/xarray/issues/1092#issuecomment-290130241 https://api.github.com/repos/pydata/xarray/issues/1092 MDEyOklzc3VlQ29tbWVudDI5MDEzMDI0MQ== benbovy 4160723 2017-03-29T15:38:46Z 2017-03-29T15:38:46Z MEMBER

@darothen you might be interested in the discussion we had here, although it doesn't solve anything related to selection across similar Dataset objects.

I think that the collection of Dataset objects with like-dimensions that you suggest is indeed different than the tree-like structure within a dataset that is proposed here (the latter still using a unique set of dimensions and coordinates).

Both approaches may co-exist, though. I can imagine the case where we have (1) a set of, e.g., grid-search or monte-carlo model runs and (2) for each model run we have diagnostic variables defined in different places on the grid (e.g., nodes, edges...). The tuple-defined groups within a Dataset are useful for (2) and the collection of Dataset objects is useful for (1).

As pointed out by @shoyer, such a collection of Dataset objects might be (preferably) implemented outside of xarray.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset groups 187859705
290065632 https://github.com/pydata/xarray/issues/1092#issuecomment-290065632 https://api.github.com/repos/pydata/xarray/issues/1092 MDEyOklzc3VlQ29tbWVudDI5MDA2NTYzMg== benbovy 4160723 2017-03-29T11:48:12Z 2017-03-29T11:48:12Z MEMBER

Just want to say that I'm very enthusiastic about this!

Like @lamorton, I also find myself having a lot of variables with names containing the name(s) of their "group(s)".

My initial idea was also to keep flat datasets and add some logic to get/set groups, but it wasn't very clear and well explained.

One important reason to keep the tree-like structure within a dataset is that it provides some assurance to the recipient of the dataset that all the variables 'belong' in the same coordinate space.

Makes perfect sense!

I also find the idea of using tuples very clever! @shoyer do you have an idea on how it would work with serialization to netCDF? We would also have to decide how to display groups in the repr of the flat dataset...

@lamorton @shoyer unless you want to open a PR, I'd be willing to start working on this.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset groups 187859705
289923078 https://github.com/pydata/xarray/issues/1092#issuecomment-289923078 https://api.github.com/repos/pydata/xarray/issues/1092 MDEyOklzc3VlQ29tbWVudDI4OTkyMzA3OA== shoyer 1217238 2017-03-28T22:22:31Z 2017-03-28T22:24:31Z MEMBER

@lamorton Thanks for explaining the use case here. This makes more sense to me now. I like your idea of groups as syntactic sugar around flat datasets with named keys.

With an appropriate naming convention, we might even be able to put this into xarray.Dataset proper. Tuples like ('flux', 'poloidal', 'perturbed') would be more appropriate than a string based convention, because they are easier to use programmatically.

  • In Python syntax, ds[x, y] is equivalent to ds[(x, y)]. Thus ds['flux', 'poloidal', 'perturbed'] works to pull out a named variable.
  • If no variable with the name flux is found, ds['flux'] or ds.flux would return a Dataset with all variables with names given by tuples starting with 'flux', removing the prefix 'flux' from each name (e.g., ('poloidal', 'perturbed') would be a variable in ds.flux). This means that ds.flux.poloidal.perturbed and ds['flux']['poloidal']['perturbed'] should automatically work.
  • ds['flux', 'poloidal', 'perturbed'] = data_array would assign the variable ('flux', 'poloidal', 'perturbed'), and implicitly create the 'flux' group (which in turn contains the 'poloidal' group). Note that it's not possible to make assignments like ds['flux']['poloidal']['perturbed'] = data_array work, so we should discourage this syntax.
  • ds['flux'] = poloidal_ds would become valid, and work by assigning all variables in poloidal_ds into ds by prefixing their names with 'flux'.
  • Similarly, nested arguments could also be supported in the Dataset constructor, e.g., xarray.Dataset({'flux': {'poloidal': {'perturbed': data_array}}}) becomes syntactic sugar for xarray.Dataset({('flux', 'poloidal', 'perturbed'): data_array}).
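
The bullet points above can be sketched with a plain dictionary keyed by tuples. `GroupedVars` is a hypothetical toy, not xarray code: it stores everything flat, returns a prefix-stripped copy for a group lookup, and flattens nested mappings on construction and assignment.

```python
class GroupedVars:
    """Toy flat store with tuple names; groups are derived from name prefixes."""

    def __init__(self, variables=None):
        self._vars = {}
        for key, value in (variables or {}).items():
            self[key] = value

    @staticmethod
    def _as_tuple(key):
        return key if isinstance(key, tuple) else (key,)

    def __setitem__(self, key, value):
        key = self._as_tuple(key)
        if isinstance(value, GroupedVars):
            value = value._vars
        if isinstance(value, dict):
            # Assigning a group: prefix every contained name with `key`.
            for sub, item in value.items():
                self[key + self._as_tuple(sub)] = item
        else:
            self._vars[key] = value

    def __getitem__(self, key):
        key = self._as_tuple(key)
        if key in self._vars:
            return self._vars[key]
        # No variable with that exact name: collect the group and strip the prefix.
        n = len(key)
        group = {k[n:]: v for k, v in self._vars.items() if k[:n] == key}
        if not group:
            raise KeyError(key)
        return GroupedVars(group)

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


ds = GroupedVars({"flux": {"poloidal": {"perturbed": 1.0}}})
assert ds["flux", "poloidal", "perturbed"] == 1.0
assert ds["flux"]["poloidal"]["perturbed"] == 1.0
assert ds.flux.poloidal.perturbed == 1.0

# Chained assignment writes into a temporary copy, so the parent is unchanged.
ds["flux"]["toroidal"] = 2.0
assert ("flux", "toroidal") not in ds._vars
ds["flux", "toroidal"] = 2.0
assert ds.flux.toroidal == 2.0
```

Because the group lookup builds a new object each time, this toy has the same limitation noted in the list: only the single-subscript tuple form can assign into the parent.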
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset groups 187859705
259390660 https://github.com/pydata/xarray/issues/1092#issuecomment-259390660 https://api.github.com/repos/pydata/xarray/issues/1092 MDEyOklzc3VlQ29tbWVudDI1OTM5MDY2MA== benbovy 4160723 2016-11-09T11:15:01Z 2016-11-09T11:24:51Z MEMBER

For example, how do groups get updated when you slice, aggregate or concatenate datasets?

Yep, once again I haven't thought about all the implications this would have! This would indeed add a lot of complexity in the end.

I'll try to follow your suggestion of building another data structure, for example (correct me if this is the wrong approach too) a DatasetGroup class, which would be very similar to netCDF4.Group or h5py.Group but would here contain a single Dataset.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset groups 187859705
259208431 https://github.com/pydata/xarray/issues/1092#issuecomment-259208431 https://api.github.com/repos/pydata/xarray/issues/1092 MDEyOklzc3VlQ29tbWVudDI1OTIwODQzMQ== rabernat 1197350 2016-11-08T17:51:00Z 2016-11-08T17:51:00Z MEMBER

This suggestion has some significant overlap with the data store / data discovery discussion from last weekend:

https://aospy.hackpad.com/Data-StorageDiscovery-Design-Document-fM6LgfwrJ2K

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset groups 187859705
259206339 https://github.com/pydata/xarray/issues/1092#issuecomment-259206339 https://api.github.com/repos/pydata/xarray/issues/1092 MDEyOklzc3VlQ29tbWVudDI1OTIwNjMzOQ== shoyer 1217238 2016-11-08T17:43:22Z 2016-11-08T17:43:22Z MEMBER

I am reluctant to add the additional complexity of groups directly into the xarray.Dataset data model. For example, how do groups get updated when you slice, aggregate or concatenate datasets? The rules for coordinates are already pretty complex.

I would rather see this living in another data structure built on top of xarray.Dataset, either in xarray or in a separate library.

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset groups 187859705

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);