
issue_comments


14 rows where issue = 1475567394 sorted by updated_at descending


Issue: Avoid loading entire dataset by getting the nbytes in an array (1475567394)
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1474176353 https://github.com/pydata/xarray/pull/7356#issuecomment-1474176353 https://api.github.com/repos/pydata/xarray/issues/7356 IC_kwDOAMm_X85X3iVh dcherian 2448579 2023-03-17T17:30:51Z 2023-03-17T17:31:22Z MEMBER

Because we have lazy data reading functionality:

```python
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")
var = ds.air.variable

print(type(var._data))              # memory cached array
print(type(var._data.array.array))  # ah, that's wrapping a lazy array; no data read in yet
print(var._data.size)               # can access size
print(type(var._data.array.array))  # still a lazy array
```

`.data` forces a disk load:

```python
print(type(var.data))               # oops, disk load
print(type(var._data))              # still a memory cached array
print(type(var._data.array.array))  # but that's now wrapping numpy data in memory
```

Output:

```
<class 'xarray.core.indexing.MemoryCachedArray'>
<class 'xarray.core.indexing.LazilyIndexedArray'>
3869000
<class 'xarray.core.indexing.LazilyIndexedArray'>
<class 'numpy.ndarray'>
<class 'xarray.core.indexing.MemoryCachedArray'>
<class 'numpy.ndarray'>
```

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Avoid loading entire dataset by getting the nbytes in an array 1475567394
1474149056 https://github.com/pydata/xarray/pull/7356#issuecomment-1474149056 https://api.github.com/repos/pydata/xarray/issues/7356 IC_kwDOAMm_X85X3brA TomNicholas 35968931 2023-03-17T17:10:44Z 2023-03-17T17:10:44Z MEMBER

This came up in the xarray office hours today, and I'm confused about why this PR made any difference to the behavior at all. The `.data` property just points to `._data`, so why would it matter which one we check?
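The distinction can be sketched with a minimal stand-in (this `LazyArray` class is hypothetical, not xarray's actual wrapper): metadata access through the wrapper leaves the values untouched, while coercing to an ndarray, which is roughly what `.data` does, forces the load.

```python
import numpy as np

# Hypothetical stand-in for xarray's lazy-loading wrapper: it knows its
# shape/dtype without materializing values, and records whether values
# were ever read.
class LazyArray:
    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self.loaded = False

    @property
    def size(self):
        return int(np.prod(self.shape))

    def __array__(self, dtype=None):
        # Simulates the disk read that coercion to ndarray triggers.
        self.loaded = True
        return np.zeros(self.shape, dtype=self.dtype)

lazy = LazyArray((1000, 1000), "float64")

# Metadata access (the `._data` route) does not load anything:
nbytes = lazy.size * lazy.dtype.itemsize
assert not lazy.loaded

# Coercing to ndarray (the `.data` route) forces the load:
arr = np.asarray(lazy)
assert lazy.loaded
```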

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
1362322800 https://github.com/pydata/xarray/pull/7356#issuecomment-1362322800 https://api.github.com/repos/pydata/xarray/issues/7356 IC_kwDOAMm_X85RM2Vw hmaarrfk 90008 2022-12-22T02:40:59Z 2022-12-22T02:40:59Z CONTRIBUTOR

Any chance of a release? This is quite breaking for large datasets that can only be held out of memory.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
1346924547 https://github.com/pydata/xarray/pull/7356#issuecomment-1346924547 https://api.github.com/repos/pydata/xarray/issues/7356 IC_kwDOAMm_X85QSHAD hmaarrfk 90008 2022-12-12T17:27:47Z 2022-12-12T17:27:47Z CONTRIBUTOR

👍🏾

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
1339624818 https://github.com/pydata/xarray/pull/7356#issuecomment-1339624818 https://api.github.com/repos/pydata/xarray/issues/7356 IC_kwDOAMm_X85P2Q1y hmaarrfk 90008 2022-12-06T16:19:19Z 2022-12-06T16:19:19Z CONTRIBUTOR

Yes, without chunks or anything.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
1339624418 https://github.com/pydata/xarray/pull/7356#issuecomment-1339624418 https://api.github.com/repos/pydata/xarray/issues/7356 IC_kwDOAMm_X85P2Qvi hmaarrfk 90008 2022-12-06T16:18:59Z 2022-12-06T16:18:59Z CONTRIBUTOR

Very smart test!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
1339575144 https://github.com/pydata/xarray/pull/7356#issuecomment-1339575144 https://api.github.com/repos/pydata/xarray/issues/7356 IC_kwDOAMm_X85P2Eto Illviljan 14371165 2022-12-06T15:44:01Z 2022-12-06T15:44:01Z MEMBER

I'm not really opposed to this change; `shape` and `dtype` use `self._data` as well.

Without using `chunks={}` in `open_dataset`? I just find it a little odd that it's not a duck array. What type is `self._data`?

This test just looked so similar to the tests in #6797. I think you can do a similar lazy test, taking inspiration from: https://github.com/pydata/xarray/blob/ed60c6ccd3d6725cd91190b8796af4355f3085c2/xarray/tests/test_formatting.py#L715-L727

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
1339457617 https://github.com/pydata/xarray/pull/7356#issuecomment-1339457617 https://api.github.com/repos/pydata/xarray/issues/7356 IC_kwDOAMm_X85P1oBR hmaarrfk 90008 2022-12-06T14:18:11Z 2022-12-06T14:18:11Z CONTRIBUTOR

The data is loaded from a NetCDF store through `open_dataset`.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
1339452942 https://github.com/pydata/xarray/pull/7356#issuecomment-1339452942 https://api.github.com/repos/pydata/xarray/issues/7356 IC_kwDOAMm_X85P1m4O hmaarrfk 90008 2022-12-06T14:14:57Z 2022-12-06T14:14:57Z CONTRIBUTOR

No explicit test was added to ensure that the data wasn't loaded. I've just experienced this bug enough (we would accidentally load 100 GB files in our code base) that I knew exactly how to fix it.

If you want, I can add a test to ensure that future optimizations to `nbytes` do not trigger a data load.

I was hoping the one-line fix would be a shoo-in.
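Such a guard test might look like the following standalone sketch (the `InaccessibleArray` class here is hypothetical; in xarray's real suite one would wrap the array in an `xr.Variable`). Any attempt to read values raises, so the test fails loudly if `nbytes` ever loads data again.

```python
import numpy as np

# Hypothetical array whose values must never be read: metadata is free,
# but materialization raises.
class InaccessibleArray:
    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = np.dtype(dtype)

    @property
    def size(self):
        return int(np.prod(self.shape))

    @property
    def nbytes(self):
        # Computed purely from metadata; no values touched.
        return self.size * self.dtype.itemsize

    def __array__(self, dtype=None):
        raise RuntimeError("values were loaded from disk")

def test_nbytes_does_not_load():
    arr = InaccessibleArray((100, 100), "float32")
    # Reading metadata must succeed without touching values...
    assert arr.nbytes == 100 * 100 * 4
    # ...while value access is exactly what we guard against:
    try:
        np.asarray(arr)
    except RuntimeError:
        pass
    else:
        raise AssertionError("expected the load attempt to raise")

test_nbytes_does_not_load()
```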

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
1339423992 https://github.com/pydata/xarray/pull/7356#issuecomment-1339423992 https://api.github.com/repos/pydata/xarray/issues/7356 IC_kwDOAMm_X85P1fz4 Illviljan 14371165 2022-12-06T13:53:03Z 2022-12-06T13:53:03Z MEMBER

Is that test targeting your issue with RAM crashing the laptop? Shouldn't there be some check that the values were not loaded?

How did you import your data? `self.data` looks like this: https://github.com/pydata/xarray/blob/ed60c6ccd3d6725cd91190b8796af4355f3085c2/xarray/core/variable.py#L420-L435

I was expecting your data to be a duck array?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
1336731702 https://github.com/pydata/xarray/pull/7356#issuecomment-1336731702 https://api.github.com/repos/pydata/xarray/issues/7356 IC_kwDOAMm_X85PrOg2 hmaarrfk 90008 2022-12-05T04:20:08Z 2022-12-05T04:20:08Z CONTRIBUTOR

It seems that checking `hasattr` on the `_data` variable achieves both purposes.
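In sketch form, the property then reads roughly like this (a simplified, hypothetical rendition; the real property lives in `xarray/core/variable.py` and differs in detail):

```python
import numpy as np

class VariableSketch:
    """Hypothetical stand-in for xarray's Variable, for illustration only."""

    def __init__(self, data):
        self._data = data  # lazy wrapper, duck array, or plain ndarray

    @property
    def nbytes(self):
        if hasattr(self._data, "nbytes"):
            # Ask the wrapped object directly; lazy wrappers and duck
            # arrays (e.g. sparse) report nbytes without loading values.
            return self._data.nbytes
        # Fallback: compute from metadata alone, still no load.
        return self._data.size * self._data.dtype.itemsize

v = VariableSketch(np.arange(10, dtype="int64"))
assert v.nbytes == 80
```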

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
1336711830 https://github.com/pydata/xarray/pull/7356#issuecomment-1336711830 https://api.github.com/repos/pydata/xarray/issues/7356 IC_kwDOAMm_X85PrJqW hmaarrfk 90008 2022-12-05T03:58:50Z 2022-12-05T03:58:50Z CONTRIBUTOR

I think that, at the very least, the current implementation works as well as the old one for arrays defined by the sparse package.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
1336700669 https://github.com/pydata/xarray/pull/7356#issuecomment-1336700669 https://api.github.com/repos/pydata/xarray/issues/7356 IC_kwDOAMm_X85PrG79 hmaarrfk 90008 2022-12-05T03:36:31Z 2022-12-05T03:36:31Z CONTRIBUTOR

Looking into the history a little more, I seem to be proposing to revert: https://github.com/pydata/xarray/commit/60f8c3d3488d377b0b21009422c6121e1c8f1f70

I think this is important since many users have arrays that are larger than memory. I found this bug when trying to access the number of bytes in a 16 GB dataset that I was trying to load on my wimpy laptop. Not fun to start swapping. I suspect others are hitting this too.

xref: https://github.com/pydata/xarray/pull/6797, https://github.com/pydata/xarray/issues/4842

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
1336696899 https://github.com/pydata/xarray/pull/7356#issuecomment-1336696899 https://api.github.com/repos/pydata/xarray/issues/7356 IC_kwDOAMm_X85PrGBD hmaarrfk 90008 2022-12-05T03:30:31Z 2022-12-05T03:30:31Z CONTRIBUTOR

I personally do not even think the `hasattr` check is really that useful. You might as well use `size` and `itemsize`.
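For what it's worth, on a plain dense ndarray the two routes agree, so the metadata-only computation is already exact there:

```python
import numpy as np

# size * itemsize equals nbytes for a dense ndarray, so computing from
# metadata alone gives the same answer without any hasattr check.
a = np.zeros((3, 4), dtype="float32")
assert a.size * a.dtype.itemsize == a.nbytes == 48
```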

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 13.635ms · About: xarray-datasette