issue_comments


3 rows where issue = 614275938 and user = 6628425 sorted by updated_at descending

id: 744103639 · user: spencerkclark (6628425) · author_association: MEMBER
created_at: 2020-12-14T00:50:46Z · updated_at: 2020-12-14T00:50:46Z
html_url: https://github.com/pydata/xarray/issues/4045#issuecomment-744103639

@half-adder I've verified that #4684 fixes your initial issue. Note, however, that outside of the time you referenced, your Dataset contained times that required nanosecond precision, e.g.:

```python
data.time.isel(animal=0, timepoint=0, pair=-1, wavelength=0)
<xarray.DataArray 'time' ()>
array('2017-02-22T16:24:14.722999999', dtype='datetime64[ns]')
Coordinates:
    wavelength         <U3 '410'
    strain             object 'HD233'
    stage_x            float64 1.64e+04
    stage_y            float64 -429.0
    stage_z            float64 2.155e+04
    bin_x              float64 4.0
    bin_y              float64 4.0
    exposure           float64 90.0
    mvmt-anterior      uint8 0
    mvmt-posterior     uint8 0
    mvmt-sides_of_tip  uint8 0
    mvmt-tip           uint8 0
    experiment_id      object '2017_02_22-HD233_SAY47'
    time               datetime64[ns] 2017-02-22T16:24:14.722999999
    animal_            uint64 0
```
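
To check which times in a dataset actually require nanosecond precision, one option is to compare each value against a microsecond-truncated copy. A minimal sketch, assuming a `datetime64[ns]` array; the helper name is hypothetical, not part of xarray's API:

```python
import numpy as np

def needs_nanoseconds(times):
    """Boolean mask of datetime64[ns] values that lose precision at microsecond resolution."""
    ns = np.asarray(times, dtype="datetime64[ns]")
    # Casting to a coarser unit truncates; casting back exposes any lost digits.
    truncated = ns.astype("datetime64[us]").astype("datetime64[ns]")
    return ns != truncated
```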

So in order for these times to be round-tripped exactly, you will need to override the dataset's original units, using nanoseconds instead of microseconds. This was not possible before, but it is now with #4684.

```python
data.time.encoding["units"] = "nanoseconds since 1900-01-01"
```

With #4684 you could also simply delete the original units; xarray will then automatically choose units that allow the datetimes to be serialized as int64 values (and hence round-tripped exactly).

```python
del data.time.encoding["units"]
```
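
A quick way to confirm the fix is to write the dataset out and compare the decoded times with the originals. A minimal roundtrip sketch, assuming an xarray version that includes #4684 and a hypothetical output file name:

```python
import numpy as np
import xarray as xr

times = np.array(["2017-02-22T16:24:14.722999999"], dtype="datetime64[ns]")
ds = xr.Dataset({"time": ("t", times)})  # no "units" set in encoding

ds.to_netcdf("roundtrip.nc")
with xr.open_dataset("roundtrip.nc") as roundtripped:
    # With units left unset, xarray picks units that permit exact int64 encoding.
    np.testing.assert_array_equal(ds.time.values, roundtripped.time.values)
```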

reactions:
{
    "total_count": 3,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Millisecond precision is lost on datetime64 during IO roundtrip (614275938)
id: 735789517 · user: spencerkclark (6628425) · author_association: MEMBER
created_at: 2020-11-30T13:35:26Z · updated_at: 2020-11-30T13:40:50Z
html_url: https://github.com/pydata/xarray/issues/4045#issuecomment-735789517

> Internally, datetime64[ns] is simply an 8-byte int. Why on earth would it be serialized in a lossy way as a float64?...

The short answer is that CF conventions allow for dates to be encoded with floating point values, so we encounter that in data that xarray ingests from other sources (i.e. files that were not even produced with Python, let alone xarray). If we didn't have to worry about roundtripping files that followed those conventions, I agree we would just encode everything with nanosecond units as int64 values.
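
The loss comes from float64 itself: it has a 53-bit significand, while nanosecond offsets from a 1900 epoch for modern dates are on the order of 10^18, far beyond 2^53 ≈ 9 × 10^15. A small illustration (the offset value here is made up):

```python
import numpy as np

ns_offset = 3_696_769_628_732_000_123  # hypothetical nanoseconds since 1900-01-01
as_float = np.float64(ns_offset)       # rounds to the nearest representable double
print(int(as_float) == ns_offset)      # False: the low-order digits are lost
```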

> This is a huge issue, as anyone using nanosecond-precision timestamps with xarray would unknowingly and silently read wrong data after deserializing.

Yes, I can see why this would be quite frustrating. In principle we should be able to handle this (contributions are welcome); it just has not been a priority up to this point. In my experience xarray's current encoding and decoding methods for standard calendar times work well up to at least second precision.

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Millisecond precision is lost on datetime64 during IO roundtrip (614275938)
id: 626257580 · user: spencerkclark (6628425) · author_association: MEMBER
created_at: 2020-05-10T01:15:53Z · updated_at: 2020-05-10T01:15:53Z
html_url: https://github.com/pydata/xarray/issues/4045#issuecomment-626257580

Thanks for the report @half-adder.

This is indeed related to times being encoded as floats, but it is not cftime-related (the times here are not encoded using cftime; we only use cftime for non-standard calendars and for dates outside nanosecond-resolution bounds).

Here's a minimal working example that illustrates the issue with the current logic in `coding.times.encode_cf_datetime`:

```python
In [1]: import numpy as np; import pandas as pd

In [2]: times = pd.DatetimeIndex([np.datetime64("2017-02-22T16:27:08.732000000")])

In [3]: reference = pd.Timestamp("1900-01-01")

In [4]: units = np.timedelta64(1, "us")

In [5]: (times - reference).values[0]
Out[5]: numpy.timedelta64(3696769628732000000,'ns')

In [6]: ((times - reference) / units).values[0]
Out[6]: 3696769628732000.5
```

In principle, we should be able to represent the difference between this date and the reference date as an integer number of microseconds, but timedelta division produces a float. We currently try to cast these floats to integers when possible, but that's not always safe to do, e.g. in the case above.
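The "cast to integers when possible" step amounts to a check like the following sketch (illustrative only; xarray's actual helper may differ):

```python
import numpy as np

def cast_to_int_if_safe(num):
    """Cast float-encoded times to int64 only if no value would change."""
    as_int = np.asarray(num).astype(np.int64)
    if (as_int == num).all():
        return as_int  # every value was integer-valued, so the cast is exact
    return num  # e.g. 3696769628732000.5 above: keep the (inexact) floats
```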

It would be great to make roundtripping times, particularly standard calendar datetimes like these, more robust. It's possible we could now leverage floor division (i.e. //) of timedeltas within NumPy for this (assuming we first check that the unit-conversion divisor exactly divides each timedelta; if it doesn't, we'd fall back to using floats):

```python
In [7]: ((times - reference) // units).values[0]
Out[7]: 3696769628732000
```

These precision issues can be tricky, however, so we'd need to think things through carefully. Even if we fixed this on the encoding side, values are still converted to floats during decoding, so we'd need to make a change there too.
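
The floor-division strategy described above could look something like this (a minimal sketch of the proposed encoding path; the function name is hypothetical):

```python
import numpy as np

def encode_deltas(deltas, unit):
    """Integer-divide timedelta64 values by `unit` when exact, else fall back to floats."""
    deltas = np.asarray(deltas)
    if (deltas % unit == np.timedelta64(0, "ns")).all():
        return deltas // unit  # exact int64 result
    return deltas / unit       # lossy float64 fallback

times = np.array(["2017-02-22T16:27:08.732000000"], dtype="datetime64[ns]")
reference = np.datetime64("1900-01-01", "ns")
print(encode_deltas(times - reference, np.timedelta64(1, "us")))  # [3696769628732000]
```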

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Millisecond precision is lost on datetime64 during IO roundtrip (614275938)


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);