issue_comments: 695786924

html_url: https://github.com/pydata/xarray/issues/4422#issuecomment-695786924
issue_url: https://api.github.com/repos/pydata/xarray/issues/4422
id: 695786924
node_id: MDEyOklzc3VlQ29tbWVudDY5NTc4NjkyNA==
user: 6628425
created_at: 2020-09-20T13:23:58Z
updated_at: 2020-09-20T13:23:58Z
author_association: MEMBER

Description of the problem

I believe the issue here stems from the units attribute in the original dataset:

```
In [1]: import xarray as xr

In [2]: url = "https://nomads.ncep.noaa.gov/dods/gfs_0p25_1hr/gfs20200920/gfs_0p25_1hr_00z"

In [3]: ds = xr.open_dataset(url, decode_times=False)

In [4]: ds.time.attrs["units"]
Out[4]: 'days since 1-1-1 00:00:0.0'
```

This is an unusual format -- ordinarily we'd expect zero-padded year, month, and day values. Pandas misinterprets this and parses the reference date as 2001-01-01:

```
In [5]: import pandas as pd

In [6]: pd.Timestamp("1-1-1 00:00:0.0")
Out[6]: Timestamp('2001-01-01 00:00:00')
```

Of course, with time values on the order of 700,000 and units of days, this results in dates outside the nanosecond-precision range of the np.datetime64 dtype and raises an error; xarray catches this error and falls back to cftime to decode the dates. cftime parses the reference date properly, so in the end the dates are decoded correctly (good!).
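To make the overflow concrete, here is a minimal sketch (the 737,000-day offset is illustrative, roughly the number of days from year 1 to 2020, and not taken from the dataset):

```python
import pandas as pd

# datetime64[ns] can only represent roughly 1677-09-21 through 2262-04-11
print(pd.Timestamp.min, pd.Timestamp.max)

# ~737,000 days since year 1 lands in 2020, but measured from the
# misparsed 2001-01-01 reference it would land near year 4019 --
# far outside the nanosecond range, so decoding with pandas fails
# and xarray falls back to cftime.
try:
    pd.Timestamp("2001-01-01") + pd.Timedelta(days=737_000)
    decoded_with_pandas = True
except (OverflowError, ValueError):  # pandas out-of-bounds errors subclass ValueError
    decoded_with_pandas = False
print("pandas decode succeeded?", decoded_with_pandas)
```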

There's a catch though. When saving the dates back out to a file, the odd units remain in the encoding of the time variable. When parsing the reference date, xarray again first tries using pandas. This time, there's nothing that stops xarray from proceeding, because we are no longer bound by integer overflow (taking the difference between a date in 2020 and a date in 2001 is perfectly valid for nanosecond-precision dates). So encoding succeeds, and we no longer need to try with cftime.
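The encoding-side arithmetic can be checked directly (a sketch, using one representative timestamp from the dataset):

```python
import pandas as pd

# On the encoding side, pandas measures the 2020 dates against the
# misparsed 'days since 2001-01-01' reference. That difference is
# small, so nothing overflows and the cftime fallback is never reached.
reference = pd.Timestamp("1-1-1 00:00:0.0")   # parsed as 2001-01-01
delta = pd.Timestamp("2020-09-20") - reference
print(delta.days)  # ~7202 days -- well within nanosecond precision
```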

```
In [7]: ds = xr.decode_cf(ds)

In [8]: subset = ds["ugrd10m"].isel(time=slice(0, 8))

In [9]: subset.to_netcdf("test.nc")

In [10]: recovered = xr.open_dataset("test.nc", decode_times=False)

In [11]: recovered.time.attrs["units"]
Out[11]: 'days since 2001-01-01'
```

Thus when we read the data the first time (from the remote dataset), decoding happens with cftime, and when we read the written file, decoding happens with pandas (encoding was also different for the two files). This is the reason for the difference in values.

Workaround

To get an accurate round-trip I would recommend overwriting the units attribute with something that pandas parses correctly:

```
In [12]: ds = xr.open_dataset(url, decode_times=False)

In [13]: ds.time.attrs["units"] = "days since 0001-01-01"

In [14]: ds = xr.decode_cf(ds)

In [15]: subset = ds["ugrd10m"].isel(time=slice(0, 8))

In [16]: subset.to_netcdf("test.nc")

In [17]: recovered = xr.open_dataset("test.nc")

In [18]: recovered.time
Out[18]:
<xarray.DataArray 'time' (time: 8)>
array(['2020-09-20T00:00:00.000000000', '2020-09-20T00:59:59.999997000',
       '2020-09-20T02:00:00.000003000', '2020-09-20T03:00:00.000000000',
       '2020-09-20T03:59:59.999997000', '2020-09-20T05:00:00.000003000',
       '2020-09-20T06:00:00.000000000', '2020-09-20T06:59:59.999997000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2020-09-20 ... 2020-09-20T06:59:59.999997
Attributes:
    grads_dim:      t
    grads_mapping:  linear
    grads_size:     121
    grads_min:      00z20sep2020
    grads_step:     1hr
    long_name:      time
    minimum:        00z20sep2020
    maximum:        00z25sep2020
    resolution:     0.041666668
```

reactions: total_count 0
issue: 701062999