issue_comments

3 rows where issue = 187859705 and user = 23484003 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
290224441 https://github.com/pydata/xarray/issues/1092#issuecomment-290224441 https://api.github.com/repos/pydata/xarray/issues/1092 MDEyOklzc3VlQ29tbWVudDI5MDIyNDQ0MQ== lamorton 23484003 2017-03-29T21:00:42Z 2017-03-29T21:04:08Z NONE

@shoyer I see your point about the string manipulation. On the other hand, this is exactly how h5py and netCDF4-python implement the group/subgroup access syntax: just like a filepath.

I'm also having thoughts about the attribute access: if ds['flux']['poloidal'] = subset does not work, then neither does ds.flux.poloidal = subset, correct? If so, it is almost pointless to have the attribute access in the first place. I suppose that is the price to pay for merely making it appear as though there is attribute-access.
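The concern about chained assignment can be illustrated with a toy class. This is only a sketch: `FlatStore` is a hypothetical stand-in, not xarray code, and it assumes that group access builds a fresh object on every lookup, which may not match the proposed implementation.

```python
class FlatStore:
    """Hypothetical stand-in for a dataset whose groups are built on demand."""

    def __init__(self):
        self._data = {}  # flat mapping, e.g. {"flux.poloidal": value}

    def __getitem__(self, prefix):
        # Build a *new* dict holding every entry stored under the prefix.
        start = prefix + "."
        return {k[len(start):]: v for k, v in self._data.items()
                if k.startswith(start)}

    def __setitem__(self, key, value):
        self._data[key] = value


ds = FlatStore()
ds["flux.toroidal"] = 1.0

group = ds["flux"]        # a temporary copy, not a view
group["poloidal"] = 2.0   # only the copy is modified

lost = "flux.poloidal" in ds._data   # False: the parent never saw the write
ds["flux.poloidal"] = 2.0            # writing through the parent does work
kept = "flux.poloidal" in ds._data   # True
```

Under that assumption, ds.flux.poloidal = subset would silently assign into a throwaway object in the same way.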

For my own understanding, I tried to translate between xarray and netCDF4-python:
- nc.Variable <--> xr.Variable
- nc.????? <--> xr.DataArray (netCDF doesn't distinguish vars/coords, so no analog is possible)
- nc.Group <--> xr.NestableDataset
- nc.Dataset <--> xr.NestableDataset

From the netCDF4-python documentation:

Groups define a hierarchical namespace within a netCDF file. They are analogous to directories in a unix filesystem. Each Group behaves like a Dataset within a Dataset, and can contain it's own variables, dimensions and attributes (and other Groups). Group inherits from Dataset, so all the Dataset class methods and variables are available to a Group instance (except the close method).

It appears that the only things special about an nc.Dataset as compared to an nc.Group are:
1. The file access is tied to the nc.Dataset.
2. The nc.Dataset group has children but no parent.

A big difference between xarray and netCDF4-python datasets is that the children datasets in xarray can go on to have a life of their own, independent of their parent & the file it represents. It makes sense to me to have just a single xarray type (a modified version of xarray.Dataset) to deal with both of these cases.

The nc.Group instances have an attribute groups that lists all the subgroups. So one option I suppose would be to follow that route and actually have Datasets that contain other datasets alongside everything else.
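A minimal sketch of that option, using a single node type for both the root and its subgroups. The `Node` class, its `path` property, and `detach` are hypothetical names for illustration; only the `groups` attribute mirrors netCDF4-python.

```python
class Node:
    """Hypothetical nestable dataset: one type for both root and subgroups."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent   # None marks the root (the nc.Dataset role)
        self.variables = {}
        self.groups = {}       # child nodes keyed by name, as in nc.Group.groups
        if parent is not None:
            parent.groups[name] = self

    @property
    def path(self):
        # Filepath-like address, in the spirit of h5py / netCDF4-python.
        if self.parent is None:
            return "/"
        return self.parent.path.rstrip("/") + "/" + self.name

    def detach(self):
        """Cut this node loose so it becomes a root with a life of its own."""
        if self.parent is not None:
            del self.parent.groups[self.name]
            self.parent = None
        return self


root = Node("root")
flux = Node("flux", root)
poloidal = Node("poloidal", flux)

path_before = poloidal.path   # '/flux/poloidal'
flux.detach()                 # flux is now its own root
path_after = poloidal.path    # '/poloidal'
```

Because a detached child is the same type as the root, nothing else has to change when a subgroup outlives its parent or its file.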

As an aside, it seems that ragged arrays are now supported in netCDF4-python: VLen.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset groups 187859705
290159834 https://github.com/pydata/xarray/issues/1092#issuecomment-290159834 https://api.github.com/repos/pydata/xarray/issues/1092 MDEyOklzc3VlQ29tbWVudDI5MDE1OTgzNA== lamorton 23484003 2017-03-29T17:18:23Z 2017-03-29T17:19:19Z NONE

@darothen: Hmm, are your coordinate grids identical for each simulation (ie, any(ds1.x != ds2.x) evaluates as false)?
- If so, then it really does make sense to do what you described and create new dimensions for the experimental factors, on top of the spatial dimensions of the simulations.
- If not, but the length of all the dimensions is the same, one could still keep all the simulations in the same dataset; one would just need to index the coordinates with the experimental factors as well.
- Finally, if the shape of the coordinate arrays varies with the experimental factor (for instance, doing convergence studies with finer meshes), that violates the xarray data model for having a single set of dimensions, each of which has a fixed length throughout the dataset, in order to enable smart broadcasting by dimension name. If (and only if) the dimensions are changing length, it would be better to keep a collection of datasets in some other type of data structure.
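The three cases above can be written as a small decision function. This is a rough sketch in plain Python: `combine_strategy` is a hypothetical helper, and coordinates are represented as dicts of lists rather than xarray objects.

```python
def combine_strategy(coord_sets):
    """Pick a way to combine simulations.

    coord_sets: one dict per simulation, mapping a dimension name to the
    list of coordinate values along that dimension.
    """
    first = coord_sets[0]
    # Case 1: every simulation lives on exactly the same grid.
    if all(coords == first for coords in coord_sets):
        return "stack along new dimensions"
    # Case 2: grids differ in values but every dimension has the same length.
    shapes = [{dim: len(vals) for dim, vals in coords.items()}
              for coords in coord_sets]
    if all(shape == shapes[0] for shape in shapes):
        return "one dataset, coordinates indexed by factor"
    # Case 3: dimension lengths change, so one dataset cannot hold them all.
    return "separate datasets in another container"


same = [{"x": [0, 1, 2]}, {"x": [0, 1, 2]}]
shifted = [{"x": [0, 1, 2]}, {"x": [5, 6, 7]}]
refined = [{"x": [0, 1, 2]}, {"x": [0, 0.5, 1, 1.5, 2]}]

results = [combine_strategy(c) for c in (same, shifted, refined)]
```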

It might work for my case to convert my 'tags' to indexes for new dimensions (i.e., ds.sel(quantity='flux', direction='poloidal', variation='perturbed')). However, there are two issues:
1. The background flux is defined to be uniform in some coordinates, so it is lower-dimensionality than the total flux. It doesn't make sense to turn a 1-D variable into a 3-D variable just to match the others so I can put it into an array. This goes especially for scalars and metadata that really should not be turned into arrays, but do belong with the subsets.
2. During my processing sequence, I may want to add something like ds.flux.helical.background. In order to do this, however, I'd be forced to define the 'perturbed' and 'total' helical fluxes at that time. But often I don't want or need to compute these.

There is still a good reason to have a flexible data model for lumping more heterogeneous collections together under some headings, with the potential for recursion. I suppose my question is, what is the most natural data model & corresponding access syntax?
- Attribute-style access is convenient and idiomatic; it implies a tree-like structure. This probably makes the most sense.
- An alternative data model would be sets with subsets, which could be accessed by something similar to ds.sel but accepting set names as *args rather than **kwargs. Then requesting members of some set could return a dataset with those members, and the new dataset would lack the membership flag for variables, much the way slicing reduces dimensionality. In fact, one could even keep a record of the applied set requests much like point axes. A variable's key in data_vars would essentially just be a list/tuple of sets of which it is a member. Assignment would be tricky because it could create new sets, and the membership of existing elements in a new set would probably require user intervention to clarify...
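The set-based alternative could look roughly like the following. It is a sketch only: `select` is a hypothetical function, a plain dict stands in for data_vars, and each key is a frozenset of the sets a variable belongs to.

```python
def select(data_vars, *set_names):
    """Return the members of all named sets, dropping those membership flags."""
    wanted = frozenset(set_names)
    # Keep variables that belong to every requested set, and remove the
    # requested names from the key, much as slicing drops a dimension.
    return {key - wanted: value
            for key, value in data_vars.items()
            if wanted <= key}


data_vars = {
    frozenset({"flux", "poloidal", "perturbed"}): "A",
    frozenset({"flux", "poloidal", "background"}): "B",
    frozenset({"field", "toroidal", "background"}): "C",
}

sub = select(data_vars, "flux", "poloidal")
# sub has two entries, keyed by {'perturbed'} and {'background'}
```

The sketch ignores the hard parts raised above: it would silently overwrite entries if two reduced keys collided, and it says nothing about assignment.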

@shoyer: Your approach is quite clever, and 'smells' much better than parsing strings. I do have two quibbles though.
- Accessing via ds['flux','poloidal'] is a bit confusing because ds[] is (I think) a dictionary, but supplying multiple names is suggestive of either array indexing or getting a list with two things inside, flux and poloidal. That is, the syntax doesn't reflect the semantics very well.
- If I am at the console, and I start typing ds.flux and use the tab-completion, does that end up creating a new dataset just so I can see what is inside ds.flux? Is that an expensive operation?
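For reference, this is what Python itself hands to `__getitem__` in each spelling. `KeyEcho` is a throwaway class written for this illustration; it says nothing about how xarray interprets the key.

```python
class KeyEcho:
    """Return whatever key the subscript syntax passes in."""

    def __getitem__(self, key):
        return key


echo = KeyEcho()
single = echo["flux"]                # the string 'flux'
pair = echo["flux", "poloidal"]      # the tuple ('flux', 'poloidal')
listed = echo[["flux", "poloidal"]]  # the list ['flux', 'poloidal']
```

So a comma-separated subscript always arrives as one tuple, and it is up to the container to decide whether that means a path, a multi-axis index, or a list of names.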

[Edited for formatting]

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset groups 187859705
289916013 https://github.com/pydata/xarray/issues/1092#issuecomment-289916013 https://api.github.com/repos/pydata/xarray/issues/1092 MDEyOklzc3VlQ29tbWVudDI4OTkxNjAxMw== lamorton 23484003 2017-03-28T21:51:30Z 2017-03-28T21:51:30Z NONE

One important reason to keep the tree-like structure within a dataset is that it provides some assurance to the recipient of the dataset that all the variables 'belong' in the same coordinate space. Constructing a tree (from a nested dictionary, say) whose leaves are datasets or dataArrays doesn't guarantee that the coordinates/dimensions in all the leaves are compatible, whereas a tree within the dataset does make a guarantee about the leaves.

As far as motivation for making trees, I find myself with several dozen variable names such as ds.fluxPoloidalPerturbation and ds.fieldToroidalBackground and various permutations, so it would be logical to be able to write ds.flux.poloidal and get a sub-dataset that contains dataArrays named perturbation and background.

As far as implementation, the DataGroup could really just be syntactic sugar around a flat dataset that is hidden from the user, and has keys like 'flux.poloidal.perturbed', so that dg.flux.poloidal.perturbed would be an alias to dg.__hiddenDataset__['flux.poloidal.perturbed'], and dg.flux.poloidal would be an alias to dg.__hiddenDataset__[['flux.poloidal.perturbed','flux.poloidal.background']]. It seems like it would require mucking with dg.__getattr__, dg.__setattr__, and dg.__dir__ at a minimum to get it off the ground, but by making the tree virtual, one avoids the difficulties with slicing, etc. The return type of dg.__getattr__ should be another DataGroup as long as there are branches in the output, but it should fall back to a Dataset when there are only leaves.
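A minimal sketch of that idea, with assumptions stated up front: a plain dict stands in for the hidden flat dataset, the attribute names `_flat` and `_prefix` are made up for this example, and every group lookup returns a DataGroup view rather than falling back to a Dataset for leaf-only groups.

```python
class DataGroup:
    """Virtual tree over a flat mapping with dotted keys (illustrative only)."""

    def __init__(self, flat=None, prefix=""):
        # Bypass our own __setattr__ when storing internal state.
        object.__setattr__(self, "_flat", {} if flat is None else flat)
        object.__setattr__(self, "_prefix", prefix)

    def _children(self):
        # First path segment of every key stored under this group's prefix.
        return {key[len(self._prefix):].split(".")[0]
                for key in self._flat if key.startswith(self._prefix)}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        full = self._prefix + name
        if full in self._flat:                        # a leaf
            return self._flat[full]
        if any(k.startswith(full + ".") for k in self._flat):
            return DataGroup(self._flat, full + ".")  # a branch: cheap view
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # Writes go straight to the shared flat mapping.
        self._flat[self._prefix + name] = value

    def __dir__(self):
        # Tab completion only scans key names; no data is copied.
        return sorted(self._children())


dg = DataGroup({"flux.poloidal.perturbed": 1.0,
                "field.toroidal.background": 0.5})
dg.flux.poloidal.background = 2.0

top = dir(dg)                   # ['field', 'flux']
leaves = dir(dg.flux.poloidal)  # ['background', 'perturbed']
value = dg.flux.poloidal.perturbed
```

Since every view shares the same flat mapping, assignment through a nested view reaches the underlying storage, and listing a group's contents costs one pass over the key names.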

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset groups 187859705

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
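
The filter described at the top of this page can be reproduced against the schema with Python's built-in sqlite3 module. This is a sketch under assumptions: the SELECT statement is a reconstruction from the page's description rather than the exact query Datasette ran, only a few columns are filled in (using the ids and timestamps shown above), and foreign-key enforcement is switched off because the users and issues tables are not created here.

```python
import sqlite3

SCHEMA = """
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
"""

conn = sqlite3.connect(":memory:")
# The referenced users/issues tables do not exist in this sketch.
conn.execute("PRAGMA foreign_keys = OFF")
conn.executescript(SCHEMA)

# (id, user, created_at, updated_at, issue), inserted out of order on purpose.
rows = [
    (289916013, 23484003, "2017-03-28T21:51:30Z", "2017-03-28T21:51:30Z", 187859705),
    (290224441, 23484003, "2017-03-29T21:00:42Z", "2017-03-29T21:04:08Z", 187859705),
    (290159834, 23484003, "2017-03-29T17:18:23Z", "2017-03-29T17:19:19Z", 187859705),
]
conn.executemany(
    "INSERT INTO issue_comments (id, [user], created_at, updated_at, issue) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)

# ISO 8601 timestamps sort correctly as plain text.
query = (
    "SELECT id FROM issue_comments "
    "WHERE issue = ? AND [user] = ? "
    "ORDER BY updated_at DESC"
)
ids = [row[0] for row in conn.execute(query, (187859705, 23484003))]
# ids == [290224441, 290159834, 289916013]
```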
Powered by Datasette · Queries took 12.838ms · About: xarray-datasette