html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/4738#issuecomment-998992357,https://api.github.com/repos/pydata/xarray/issues/4738,998992357,IC_kwDOAMm_X847i2nl,13301940,2021-12-21T18:14:15Z,2021-12-21T18:14:15Z,MEMBER,"Okay... I think the following comment is still valid:
> The issue appears to be caused by the coordinates, which are used in `__dask_tokenize__`
Whether tokenization is deterministic appears to depend on whether the dataset/dataarray contains **non-dimension coordinates** or **dimension coordinates**.
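For reference, the two coordinate kinds can be told apart without the tutorial data. This is a minimal, hypothetical dataset (not the `rasm` one): only dimension coordinates are backed by a pandas index.

```python
import numpy as np
import xarray as xr

# Hypothetical minimal dataset illustrating the two coordinate kinds:
# "x" is a dimension coordinate (backed by a pandas index), while
# "xc" is a non-dimension coordinate (a plain array).
ds = xr.Dataset(
    {"Tair": (("y", "x"), np.zeros((2, 3)))},
    coords={
        "x": [10, 20, 30],                    # dimension coordinate
        "xc": (("y", "x"), np.ones((2, 3))),  # non-dimension coordinate
    },
)
print(list(ds.indexes))  # only dimension coordinates carry an index
```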
```python
In [2]: ds = xr.tutorial.open_dataset('rasm')
```
```python
In [39]: a = ds.isel(time=0)
In [40]: a
Out[40]:
&lt;xarray.Dataset&gt;
Dimensions:  (y: 205, x: 275)
Coordinates:
    time     object 1980-09-16 12:00:00
    xc       (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
    yc       (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Data variables:
    Tair     (y, x) float64 ...
In [41]: dask.base.tokenize(a) == dask.base.tokenize(a)
Out[41]: True
```
```python
In [42]: b = ds.isel(y=0)
In [43]: b
Out[43]:
&lt;xarray.Dataset&gt;
Dimensions:  (time: 36, x: 275)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
    xc       (x) float64 189.2 189.4 189.6 189.7 ... 293.5 293.8 294.0 294.3
    yc       (x) float64 16.53 16.78 17.02 17.27 ... 27.61 27.36 27.12 26.87
Dimensions without coordinates: x
Data variables:
    Tair     (time, x) float64 ...
In [44]: dask.base.tokenize(b) == dask.base.tokenize(b)
Out[44]: False
```
**This looks like a bug in my opinion...** ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,775502974
https://github.com/pydata/xarray/issues/4738#issuecomment-998948715,https://api.github.com/repos/pydata/xarray/issues/4738,998948715,IC_kwDOAMm_X847ir9r,13301940,2021-12-21T17:06:51Z,2021-12-21T17:11:47Z,MEMBER,"> The issue appears to be caused by the coordinates, which are used in `__dask_tokenize__`
I tried running the reproducer above and things seem to be working fine. I can't for the life of me understand why I got non-deterministic behavior four hours ago :(
```python
In [1]: import dask, xarray as xr
In [2]: ds = xr.tutorial.open_dataset('rasm')
In [3]: dask.base.tokenize(ds) == dask.base.tokenize(ds)
Out[3]: True
In [4]: dask.base.tokenize(ds.Tair._coords) == dask.base.tokenize(ds.Tair._coords)
Out[4]: True
```
```python
In [5]: xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 20:33:18)
[Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 20.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1
xarray: 0.20.1
pandas: 1.3.4
numpy: 1.20.3
scipy: 1.7.3
netCDF4: 1.5.8
pydap: None
h5netcdf: 0.11.0
h5py: 3.6.0
Nio: None
zarr: 2.10.3
cftime: 1.5.1.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.2
distributed: 2021.11.2
matplotlib: 3.5.0
cartopy: None
seaborn: None
numbagg: None
fsspec: 2021.11.1
cupy: None
pint: 0.18
sparse: None
setuptools: 59.4.0
pip: 21.3.1
conda: None
pytest: None
IPython: 7.30.0
sphinx: 4.3.1
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,775502974
https://github.com/pydata/xarray/issues/4738#issuecomment-998764799,https://api.github.com/repos/pydata/xarray/issues/4738,998764799,IC_kwDOAMm_X847h_D_,13301940,2021-12-21T13:08:21Z,2021-12-21T13:09:01Z,MEMBER,"
> @andersy005 if you can rely on dask always being present, `dask.base.tokenize(xarray_object)` will do what you want.
@dcherian, I just realized that `dask.base.tokenize` doesn't return a deterministic token for xarray objects:
```python
In [2]: import dask, xarray as xr
In [3]: ds = xr.tutorial.open_dataset('rasm')
In [4]: dask.base.tokenize(ds) == dask.base.tokenize(ds)
Out[4]: False
In [5]: dask.base.tokenize(ds) == dask.base.tokenize(ds)
Out[5]: False
```
The issue appears to be caused by the coordinates, which are used in `__dask_tokenize__`:
https://github.com/pydata/xarray/blob/dbc02d4e51fe404e8b61656f2089efadbf99de28/xarray/core/dataarray.py#L870-L873
```python
In [8]: dask.base.tokenize(ds.Tair.data) == dask.base.tokenize(ds.Tair.data)
Out[8]: True
```
```python
In [16]: dask.base.tokenize(ds.Tair._coords) == dask.base.tokenize(ds.Tair._coords)
Out[16]: False
```
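One way to narrow this down further is to tokenize each coordinate variable twice and compare. This is a sketch with a hypothetical stand-in dataset (a 2-D non-dimension coordinate mirroring the structure of `ds.Tair`), since the behavior may depend on versions:

```python
import dask.base
import numpy as np
import xarray as xr

# Hypothetical stand-in, not the 'rasm' tutorial data: a DataArray
# with a 2-D non-dimension coordinate. Tokenize each coordinate
# variable twice to see which one, if any, is non-deterministic.
da = xr.DataArray(
    np.zeros((3, 4)),
    dims=("y", "x"),
    coords={"xc": (("y", "x"), np.arange(12.0).reshape(3, 4))},
    name="Tair",
)
for name, var in da._coords.items():
    print(name, dask.base.tokenize(var) == dask.base.tokenize(var))
```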
Is this the expected behavior, or am I missing something?
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,775502974
https://github.com/pydata/xarray/issues/4738#issuecomment-757504237,https://api.github.com/repos/pydata/xarray/issues/4738,757504237,MDEyOklzc3VlQ29tbWVudDc1NzUwNDIzNw==,13301940,2021-01-10T16:34:20Z,2021-01-10T16:34:20Z,MEMBER,"> @andersy005 if you can rely on dask always being present, `dask.base.tokenize(xarray_object)` will do what you want.
👍🏽 `dask.base.tokenize()` achieves what I need for my use case.
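(For completeness, a minimal sketch of that use case, with a hypothetical in-memory dataset rather than my actual data: the token serves as a cache key.)

```python
import dask.base
import numpy as np
import xarray as xr

# Hypothetical cache keyed by the dask token of an xarray object.
ds = xr.Dataset({"a": ("x", np.arange(4))})
cache = {}

def cached_result(obj):
    key = dask.base.tokenize(obj)  # deterministic string token
    if key not in cache:
        cache[key] = obj.sum()  # stand-in for an expensive computation
    return cache[key]

first = cached_result(ds)
second = cached_result(ds)  # same token, served from the cache
```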
> I asked because this isn't an operation I've used directly on pandas objects in the past. I'm not opposed, but my suggestion would be to write a separate utility function, e.g., in xarray.util (similar to what is in pandas) rather than making it a method on xarray objects themselves.
Given the simplicity of `dask.base.tokenize()`, I am now wondering whether it's even worth having a utility function in `xarray.util` for computing a deterministic token (~hash) for an xarray object. I'm happy to work on this if there's interest from other folks; otherwise, I will close this issue. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,775502974
https://github.com/pydata/xarray/issues/4738#issuecomment-752154350,https://api.github.com/repos/pydata/xarray/issues/4738,752154350,MDEyOklzc3VlQ29tbWVudDc1MjE1NDM1MA==,13301940,2020-12-29T16:47:03Z,2020-12-29T16:47:03Z,MEMBER,"Pandas has a built-in utility function `pd.util.hash_pandas_object`:
```python
In [1]: import pandas as pd
In [3]: df = pd.DataFrame({'A': [4, 5, 6, 7], 'B': [10, 20, 30, 40], 'C': [100, 50, -30, -50]})
In [4]: df
Out[4]:
   A   B    C
0  4  10  100
1  5  20   50
2  6  30  -30
3  7  40  -50
In [6]: row_hashes = pd.util.hash_pandas_object(df)
In [7]: row_hashes
Out[7]:
0    14190898035981950066
1    16858535338008670510
2     1055569624497948892
3     5944630256416341839
dtype: uint64
```
Combining the returned value of `hash_pandas_object()` with Python's `hashlib` gives something one can work with:
```python
In [8]: import hashlib
In [10]: hashlib.sha1(row_hashes.values).hexdigest() # Compute overall hash of all rows.
Out[10]: '1e1244d9b0489e1f479271f147025956d4994f67'
```
Regarding dask, I have no idea :) cc @TomAugspurger
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,775502974