issue_comments: 695786924

html_url: https://github.com/pydata/xarray/issues/4422#issuecomment-695786924
issue_url: https://api.github.com/repos/pydata/xarray/issues/4422
id: 695786924
node_id: MDEyOklzc3VlQ29tbWVudDY5NTc4NjkyNA==
user: 6628425
created_at: 2020-09-20T13:23:58Z
updated_at: 2020-09-20T13:23:58Z
author_association: MEMBER

Description of the problem

I believe the issue here stems from the units attribute in the original dataset:

```
In [1]: import xarray as xr

In [2]: url = "https://nomads.ncep.noaa.gov/dods/gfs_0p25_1hr/gfs20200920/gfs_0p25_1hr_00z"

In [3]: ds = xr.open_dataset(url, decode_times=False)

In [4]: ds.time.attrs["units"]
Out[4]: 'days since 1-1-1 00:00:0.0'
```

This is an unusual format -- ordinarily we'd expect zero-padded year, month, and day values. Pandas misinterprets this and parses the reference date as 2001-01-01:

```
In [5]: import pandas as pd

In [6]: pd.Timestamp("1-1-1 00:00:0.0")
Out[6]: Timestamp('2001-01-01 00:00:00')
```

Of course, with time values on the order of 700,000 and units of days, this results in dates outside the nanosecond-precision range of the np.datetime64 dtype and raises an error; xarray catches this error and falls back to cftime to decode the dates. cftime parses the reference date properly, so in the end the dates are decoded correctly (good!).
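To make the overflow concrete, here is a minimal sketch (the 737,000-day offset is illustrative, roughly the number of days from year 1 to 2020, and not taken from the dataset):

```python
import pandas as pd

# datetime64[ns] can only represent roughly 1677-09-21 through 2262-04-11
print(pd.Timestamp.min, pd.Timestamp.max)

# ~737,000 days since year 1 lands in 2020, but measured from the
# misparsed 2001-01-01 reference it would land near year 4019 --
# far outside the nanosecond range, so decoding with pandas fails
# and xarray falls back to cftime.
try:
    pd.Timestamp("2001-01-01") + pd.Timedelta(days=737_000)
    decoded_with_pandas = True
except (OverflowError, ValueError):  # pandas out-of-bounds errors subclass ValueError
    decoded_with_pandas = False
print("pandas decode succeeded?", decoded_with_pandas)
```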

There's a catch though. When saving the dates back out to a file, the odd units remain in the encoding of the time variable. When parsing the reference date, xarray again first tries using pandas. This time, there's nothing that stops xarray from proceeding, because we are no longer bound by integer overflow (taking the difference between a date in 2020 and a date in 2001 is perfectly valid for nanosecond-precision dates). So encoding succeeds, and we no longer need to try with cftime.
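The encoding-side arithmetic can be checked directly (a sketch, using one representative timestamp from the dataset):

```python
import pandas as pd

# On the encoding side, pandas measures the 2020 dates against the
# misparsed 'days since 2001-01-01' reference. That difference is
# small, so nothing overflows and the cftime fallback is never reached.
reference = pd.Timestamp("1-1-1 00:00:0.0")   # parsed as 2001-01-01
delta = pd.Timestamp("2020-09-20") - reference
print(delta.days)  # ~7202 days -- well within nanosecond precision
```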

```
In [7]: ds = xr.decode_cf(ds)

In [8]: subset = ds["ugrd10m"].isel(time=slice(0, 8))

In [9]: subset.to_netcdf("test.nc")

In [10]: recovered = xr.open_dataset("test.nc", decode_times=False)

In [11]: recovered.time.attrs["units"]
Out[11]: 'days since 2001-01-01'
```

Thus when we read the data the first time (from the remote dataset), decoding happens with cftime, and when we read the written file, decoding happens with pandas (encoding was also different for the two files). This is the reason for the difference in values.

Workaround

To get an accurate round-trip I would recommend overwriting the units attribute with something that pandas parses correctly:

```
In [12]: ds = xr.open_dataset(url, decode_times=False)

In [13]: ds.time.attrs["units"] = "days since 0001-01-01"

In [14]: ds = xr.decode_cf(ds)

In [15]: subset = ds["ugrd10m"].isel(time=slice(0, 8))

In [16]: subset.to_netcdf("test.nc")

In [17]: recovered = xr.open_dataset("test.nc")

In [18]: recovered.time
Out[18]:
<xarray.DataArray 'time' (time: 8)>
array(['2020-09-20T00:00:00.000000000', '2020-09-20T00:59:59.999997000',
       '2020-09-20T02:00:00.000003000', '2020-09-20T03:00:00.000000000',
       '2020-09-20T03:59:59.999997000', '2020-09-20T05:00:00.000003000',
       '2020-09-20T06:00:00.000000000', '2020-09-20T06:59:59.999997000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2020-09-20 ... 2020-09-20T06:59:59.999997
Attributes:
    grads_dim:      t
    grads_mapping:  linear
    grads_size:     121
    grads_min:      00z20sep2020
    grads_step:     1hr
    long_name:      time
    minimum:        00z20sep2020
    maximum:        00z25sep2020
    resolution:     0.041666668
```

reactions: total_count 0
issue: 701062999