issue_comments: 695786924
html_url: https://github.com/pydata/xarray/issues/4422#issuecomment-695786924
issue_url: https://api.github.com/repos/pydata/xarray/issues/4422
id: 695786924
node_id: MDEyOklzc3VlQ29tbWVudDY5NTc4NjkyNA==
user: 6628425
created_at: 2020-09-20T13:23:58Z
updated_at: 2020-09-20T13:23:58Z
author_association: MEMBER

### Description of the problem

I believe the issue here stems from the `units` attribute in the original dataset:

```
In [1]: import xarray as xr

In [2]: url = "https://nomads.ncep.noaa.gov/dods/gfs_0p25_1hr/gfs20200920/gfs_0p25_1hr_00z"

In [3]: ds = xr.open_dataset(url, decode_times=False)

In [4]: ds.time.attrs["units"]
Out[4]: 'days since 1-1-1 00:00:0.0'
```

This is an unusual format -- ordinarily we'd expect zero-padded year, month, and day values. Pandas misinterprets this and parses the reference date as 2001-01-01:

```
In [5]: import pandas as pd

In [6]: pd.Timestamp("1-1-1 00:00:0.0")
Out[6]: Timestamp('2001-01-01 00:00:00')
```

Of course, with time values on the order of 700000 and units of days, this results in dates outside the nanosecond-precision range, so xarray falls back to decoding the times with cftime.

There's a catch though. When saving the dates back out to a file, the odd units remain in the encoding of the time variable. When parsing the reference date, xarray again first tries using pandas. This time, there's nothing that stops xarray from proceeding, because we are no longer bound by integer overflow (taking the difference between a date in 2020 and a date in 2001 is perfectly valid for nanosecond-precision dates). So encoding succeeds, and we no longer need to try with cftime.

```
In [7]: ds = xr.decode_cf(ds)

In [8]: subset = ds["ugrd10m"].isel(time=slice(0, 8))

In [9]: subset.to_netcdf("test.nc")

In [10]: recovered = xr.open_dataset("test.nc", decode_times=False)

In [11]: recovered.time.attrs["units"]
Out[11]: 'days since 2001-01-01'
```

Thus when we read the file the first time, decoding happens with cftime, and when we read the file the second time, decoding happens with pandas (encoding was also different for the two files).
This is the reason for the difference in values.

### Workaround

To get an accurate round-trip I would recommend overwriting the units attribute with something that pandas parses correctly:

```
In [12]: ds = xr.open_dataset(url, decode_times=False)

In [13]: ds.time.attrs["units"] = "days since 0001-01-01"

In [14]: ds = xr.decode_cf(ds)

In [15]: subset = ds["ugrd10m"].isel(time=slice(0, 8))

In [16]: subset.to_netcdf("test.nc")

In [17]: recovered = xr.open_dataset("test.nc")

In [18]: recovered.time
Out[18]:
<xarray.DataArray 'time' (time: 8)>
array(['2020-09-20T00:00:00.000000000', '2020-09-20T00:59:59.999997000',
       '2020-09-20T02:00:00.000003000', '2020-09-20T03:00:00.000000000',
       '2020-09-20T03:59:59.999997000', '2020-09-20T05:00:00.000003000',
       '2020-09-20T06:00:00.000000000', '2020-09-20T06:59:59.999997000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2020-09-20 ... 2020-09-20T06:59:59.999997
Attributes:
    grads_dim:      t
    grads_mapping:  linear
    grads_size:     121
    grads_min:      00z20sep2020
    grads_step:     1hr
    long_name:      time
    minimum:        00z20sep2020
    maximum:        00z25sep2020
    resolution:     0.041666668
```
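As a sanity check on why the second read never needs cftime: once the file carries the rewritten epoch `days since 2001-01-01`, the offsets to 2020 dates are small and comfortably within nanosecond-precision bounds, so the pandas code path succeeds. A minimal illustration:

```python
import pandas as pd

# The difference between a 2020 date and the misparsed 2001-01-01 epoch
# is only a few thousand days -- no overflow, so pandas decodes it fine.
delta = pd.Timestamp("2020-09-20") - pd.Timestamp("2001-01-01")
print(delta.days)  # 7202
```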
reactions: { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }
issue: 701062999