issue_comments

5 rows where user = 7611856 sorted by updated_at descending

issue 5

  • Dataset groups 1
  • Feature Request: Hierarchical storage and processing in xarray 1
  • disallow boolean coordinates? 1
  • Make creating a MultiIndex in stack optional 1
  • Conversion to pandas for zero-dimensional Data(Set|Array) 1

user 1

  • martinitus · 5 ✖

author_association 1

  • NONE 5
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
879685814 https://github.com/pydata/xarray/issues/5598#issuecomment-879685814 https://api.github.com/repos/pydata/xarray/issues/5598 MDEyOklzc3VlQ29tbWVudDg3OTY4NTgxNA== martinitus 7611856 2021-07-14T08:05:37Z 2021-07-14T08:05:37Z NONE

I'm not a pandas expert, but maybe one can create a dummy index that enforces the size=1 constraint. E.g. an index which only supports one value (e.g. None or zero). That could potentially be used to fix the round-trip.

Also potentially related: #5202 (also contains discussions about the multiindex/dataset handling)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Conversion to pandas for zero-dimensional Data(Set|Array) 943112510
876397215 https://github.com/pydata/xarray/issues/4118#issuecomment-876397215 https://api.github.com/repos/pydata/xarray/issues/4118 MDEyOklzc3VlQ29tbWVudDg3NjM5NzIxNQ== martinitus 7611856 2021-07-08T12:27:58Z 2021-07-08T12:27:58Z NONE

As a user who (so far) does not use any netCDF or HDF5 features of xarray I obviously would not like to have an otherwise potentially useful feature blocked by restrictions imposed by netCDF or HDF5 ;-).

That said - I think @tacaswell's comment about round trips is very reasonable and such invariants should be maintained! It would be extremely confusing for users if netcdf -> xarray -> netcdf is not a "no-op". The same obviously holds true for any other storage format. As a user I would generally expect something like the following:

    a1 = xarray.load("foo.myformat")
    xarray.save(a1, "bar.myformat")
    a2 = xarray.load("bar.myformat")
    assert a1 == a2, "Why should they not be exactly equal?!?"

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Feature Request: Hierarchical storage and processing in xarray 628719058
868324949 https://github.com/pydata/xarray/issues/1092#issuecomment-868324949 https://api.github.com/repos/pydata/xarray/issues/1092 MDEyOklzc3VlQ29tbWVudDg2ODMyNDk0OQ== martinitus 7611856 2021-06-25T08:36:03Z 2021-06-25T08:45:23Z NONE

Hey Folks, I stumbled over this discussion having a similar use case as described in some comments above: A DataSet with a bunch of arrays called count_a, test_count_a, train_count_a, count_b, ... , controlled_test_mean, controlled_train_mean, ... controlled_test_sigma, ... Obviously a hierarchical structure would help to arrange this.

However, one point I didn't see in the discussion is the following:

Hierarchical structures often force a user to come up with some arbitrary order of hierarchy levels. The classical example is document filing: do you put your health insurance documents under /insurance/health/2021, 2021/health/insurance, ...?

One solution to that is a tagging of documents instead of putting them into a hierarchy. This would give the full flexibility to retrieve any flat DataSet out of a TaggedDataSet by specifying the set of tags that the individual DataArrays must be listed under.

Back to the above example, one could think of stuff like:

```python
# get a flat view (DataSet-like object) on all arrays of tagged that have the "count" tag
ds: DataSet(View) = tagged.tag_select("count")
bar1 = ds.mean(dim="foo")

# get a flat view (DataSet-like object) on all arrays of tagged that have the "train" and "controlled" tags
bar2 = tagged.tag_select("train", "controlled").mean(dim="foo")  # order of arguments to tag_select is irrelevant!
```

I hope it is clear what I mean. I know that there are e.g. some awesome file system plugins (he has incredibly nice high level documentation on the topic) that use such a data model.

Just wanted to add that aspect to the discussion even if it might collide with the hierarchical approach!

One side note: If every array in the tagged container has exactly one tag, and tags do not repeat, then the whole thing should be semantically identical to a DataSet, because every tag_select will yield a single DataArray. I.e. it might be possible to integrate such functionality directly into DataSet!?
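To make the tagging idea a bit more concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: `TaggedDataSet`, `add` and the list-of-floats "arrays" are stand-ins invented for illustration, and only the name `tag_select` is taken from the example above. It is not an existing xarray API.

```python
from typing import Dict, FrozenSet, List


class TaggedDataSet:
    """Hypothetical container: every array is stored together with a set of tags."""

    def __init__(self) -> None:
        self._arrays: Dict[str, List[float]] = {}
        self._tags: Dict[str, FrozenSet[str]] = {}

    def add(self, name: str, array: List[float], *tags: str) -> None:
        """Register an array under a name and an arbitrary number of tags."""
        self._arrays[name] = array
        self._tags[name] = frozenset(tags)

    def tag_select(self, *tags: str) -> Dict[str, List[float]]:
        """Return a flat name -> array mapping of all arrays carrying every requested tag."""
        wanted = frozenset(tags)
        return {
            name: array
            for name, array in self._arrays.items()
            if wanted <= self._tags[name]  # subset test, so argument order cannot matter
        }


tagged = TaggedDataSet()
tagged.add("train_count_a", [1.0, 2.0], "train", "count")
tagged.add("test_count_a", [3.0, 4.0], "test", "count")
tagged.add("controlled_train_mean", [0.5, 0.7], "controlled", "train", "mean")

counts = tagged.tag_select("count")
controlled_train = tagged.tag_select("train", "controlled")
```

Because the requested tags are compared as a set, `tag_select("train", "controlled")` and `tag_select("controlled", "train")` return the same flat mapping.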

Regards,

Martin

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Dataset groups 187859705
855822204 https://github.com/pydata/xarray/issues/5202#issuecomment-855822204 https://api.github.com/repos/pydata/xarray/issues/5202 MDEyOklzc3VlQ29tbWVudDg1NTgyMjIwNA== martinitus 7611856 2021-06-07T10:49:49Z 2021-06-07T10:49:49Z NONE

Besides the CPU requirements, IMHO, the memory consumption is even worse.

Imagine you want to hold a 1000x1000x1000 int64 array. That would be ~ 7.5 GB and still fits into RAM on most machines. Let's assume float coordinates for all three axes. Their memory consumption of 3000*8 bytes is negligible.

Now if you stack that, you end up with three additional 7.5GB arrays. With higher dimensions the situation gets even worse.
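The arithmetic behind these numbers can be checked with a few lines of plain Python. The sketch assumes 8 bytes per element for both the int64 data and the float coordinates, and that stacking materializes one full-length coordinate array per dimension, which is the premise of this comment rather than a measured result.

```python
shape = (1000, 1000, 1000)
itemsize = 8  # bytes per int64 or float64 element (assumption stated above)

n_elements = 1
for n in shape:
    n_elements *= n

GIB = 2**30

data_bytes = n_elements * itemsize    # the data array itself
coord_bytes = sum(shape) * itemsize   # three 1-D coordinate arrays
stacked_coord_bytes = len(shape) * n_elements * itemsize  # one full-length coordinate per dimension

print(f"data:                {data_bytes / GIB:.2f} GiB")
print(f"unstacked coords:    {coord_bytes} bytes")
print(f"stacked coords:      {stacked_coord_bytes / GIB:.2f} GiB")
```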

That said, while it generally should be possible to create the coordinates of the stacked array on the fly, I don't have a solution for it.

Side note: I stumbled over that when combining xarray with pytorch, where I want to evaluate a model on a large cartesian grid. For that I stacked the array and batched the stacked coordinates to feed them to pytorch, which makes the iteration over the cartesian space really nice and smooth in code.
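As a rough illustration of what "on the fly" could look like for the batching use case, the following sketch generates the points of a Cartesian grid lazily with `itertools.product` and hands them out in batches. It is plain Python, not an xarray or pytorch feature; `coordinate_batches` is a hypothetical helper, and in practice each batch would still have to be converted into a tensor before being fed to a model.

```python
from itertools import islice, product
from typing import Iterator, List, Sequence, Tuple


def coordinate_batches(
    axes: Sequence[Sequence[float]], batch_size: int
) -> Iterator[List[Tuple[float, ...]]]:
    """Yield batches of grid points without ever building the full grid in memory."""
    points = product(*axes)  # lazy iterator, the last axis varies fastest
    while True:
        batch = list(islice(points, batch_size))
        if not batch:
            return
        yield batch


x = [0.0, 0.5, 1.0]
y = [10.0, 20.0]
batches = list(coordinate_batches([x, y], batch_size=4))
```

Only the 1-D axes and one batch at a time are held in memory, so the cost no longer grows with the product of the dimension sizes.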

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Make creating a MultiIndex in stack optional 864249974
810959744 https://github.com/pydata/xarray/issues/4892#issuecomment-810959744 https://api.github.com/repos/pydata/xarray/issues/4892 MDEyOklzc3VlQ29tbWVudDgxMDk1OTc0NA== martinitus 7611856 2021-03-31T10:30:49Z 2021-03-31T10:30:49Z NONE

I don't know the internals of delegation between .sel and .isel. But from the user side I would expect that boolean indexing requires me to use .isel naturally. I mean, I have to provide a boolean mask that fits the shape of the array, i.e. it is naturally index based and should only be used with .isel irrespective of the coordinate types.

While that would probably be a breaking change for some people, I think it makes a quite complicated topic slightly easier to document, and makes it easier to figure out intentions in written code.
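The distinction can be sketched in plain Python without any xarray machinery. The two helpers below are hypothetical stand-ins: `select_by_mask` plays the role of positional (`.isel`-style) selection and `select_by_label` the role of label-based (`.sel`-style) selection. They are not how xarray implements either method.

```python
from typing import List, Sequence


def select_by_mask(values: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Positional selection: keep values[i] wherever mask[i] is True."""
    if len(mask) != len(values):
        raise ValueError("mask must have the same length as the data")
    return [v for v, keep in zip(values, mask) if keep]


def select_by_label(
    values: Sequence[float], labels: Sequence[bool], label: bool
) -> List[float]:
    """Label selection: keep the values whose coordinate label equals `label`."""
    return [v for v, lab in zip(values, labels) if lab == label]


values = [1.5, 2.5]
labels = [False, True]  # a boolean coordinate

by_mask = select_by_mask(values, [True, False])   # keeps position 0
by_label = select_by_label(values, labels, True)  # keeps the entry labelled True
```

With a boolean coordinate the same `True`/`False` values can mean either "keep this position" or "look up this label", and the two readings select different elements, which is why keeping masks on the positional side makes the intent unambiguous.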

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  disallow boolean coordinates? 806218687

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 14.58ms · About: xarray-datasette