html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/3686#issuecomment-576422784,https://api.github.com/repos/pydata/xarray/issues/3686,576422784,MDEyOklzc3VlQ29tbWVudDU3NjQyMjc4NA==,15016780,2020-01-20T20:35:47Z,2020-01-20T20:35:47Z,NONE,Closing as using `mask_and_scale=False` produced precise results,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,548475127
https://github.com/pydata/xarray/issues/3686#issuecomment-573458081,https://api.github.com/repos/pydata/xarray/issues/3686,573458081,MDEyOklzc3VlQ29tbWVudDU3MzQ1ODA4MQ==,15016780,2020-01-12T21:17:11Z,2020-01-12T21:17:11Z,NONE,"Thanks @rabernat. I would like to use [assert_allclose](http://xarray.pydata.org/en/stable/generated/xarray.testing.assert_allclose.html) to test the output, but at first pass it seems that might be prohibitively slow for large datasets. Do you recommend sampling, or other good testing strategies (e.g. asserting that the xarray datasets are equal to some precision)?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,548475127
https://github.com/pydata/xarray/issues/3686#issuecomment-573444233,https://api.github.com/repos/pydata/xarray/issues/3686,573444233,MDEyOklzc3VlQ29tbWVudDU3MzQ0NDIzMw==,15016780,2020-01-12T18:37:59Z,2020-01-12T18:37:59Z,NONE,"@dmedv Thanks for this. It all makes sense to me and I see the same results; however, I wasn't able to ""convert back"" using `scale_factor` and `add_offset`:

```
from netCDF4 import Dataset

d = Dataset(fileObjs[0])
v = d.variables['analysed_sst']

print(""Result with mask_and_scale=True"")
ds_unchunked = xr.open_dataset(fileObjs[0])
print(ds_unchunked.analysed_sst.sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)

print(""Result with mask_and_scale=False"")
ds_unchunked = xr.open_dataset(fileObjs[0], mask_and_scale=False)
scaled = ds_unchunked.analysed_sst * v.scale_factor + v.add_offset
print(scaled.sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)
```

^^ That returns a different result from what I expect. I wonder if this is because the `_FillValue` is not handled when converting back.
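As a sanity check on that theory, here is a minimal sketch of what I mean by handling the fill value manually - it assumes the raw variable still carries `_FillValue`, `scale_factor` and `add_offset` in its attributes (which `mask_and_scale=False` should preserve), and it is only an illustration, not a verified fix:

```
import xarray as xr

# open without decoding so the packed values and their attributes are untouched
ds_raw = xr.open_dataset(fileObjs[0], mask_and_scale=False)
raw = ds_raw.analysed_sst

# mask the packed fill value first, then unpack with scale/offset
fill = raw.attrs['_FillValue']
unpacked = raw.where(raw != fill) * raw.attrs['scale_factor'] + raw.attrs['add_offset']

print(unpacked.sel(lat=slice(20,50), lon=slice(-170,-110)).mean().values)
```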
_However_ this led me to another seemingly related issue: https://github.com/pydata/xarray/issues/2304

Loss of precision seems to be the key here, so coercing the `float32`s to `float64`s appears to give the same results from both chunked and unchunked versions - but still not the value I expect:

```
print(""results from unchunked dataset"")
ds_unchunked = xr.open_mfdataset(fileObjs, combine='by_coords')
ds_unchunked['analysed_sst'] = ds_unchunked['analysed_sst'].astype(np.float64)
print(ds_unchunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)

print(f""results from chunked dataset using {chunks}"")
ds_chunked = xr.open_mfdataset(fileObjs, chunks=chunks, combine='by_coords')
ds_chunked['analysed_sst'] = ds_chunked['analysed_sst'].astype(np.float64)
print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)

print(""results from chunked dataset using 'auto'"")
ds_chunked = xr.open_mfdataset(fileObjs, chunks={'time': 'auto', 'lat': 'auto', 'lon': 'auto'}, combine='by_coords')
ds_chunked['analysed_sst'] = ds_chunked['analysed_sst'].astype(np.float64)
print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)
```

returns:

```
results from unchunked dataset
290.1375818862207
results from chunked dataset using {'time': 1, 'lat': 1799, 'lon': 3600}
290.1375818862207
results from chunked dataset using 'auto'
290.1375818862207
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,548475127
https://github.com/pydata/xarray/issues/3306#issuecomment-531617569,https://api.github.com/repos/pydata/xarray/issues/3306,531617569,MDEyOklzc3VlQ29tbWVudDUzMTYxNzU2OQ==,15016780,2019-09-16T01:22:09Z,2019-09-16T01:22:09Z,NONE,"Thanks @rabernat. I tried what you suggested (with a small subset, since the source files are quite large) and it seems to work on smaller subsets, writing locally. That leads me to suspect that running the same process on larger datasets might be overloading memory, but I can't confirm the root cause yet. This isn't blocking my current strategy, so closing for now.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,493058488
https://github.com/pydata/xarray/issues/3306#issuecomment-531493820,https://api.github.com/repos/pydata/xarray/issues/3306,531493820,MDEyOklzc3VlQ29tbWVudDUzMTQ5MzgyMA==,15016780,2019-09-14T16:34:56Z,2019-09-14T16:34:56Z,NONE,I recall this also happening when storing locally but I can't reproduce that at the moment since the Kubernetes cluster I am using now is not a Pangeo hub and is not set up to use EFS.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,493058488
https://github.com/pydata/xarray/issues/3306#issuecomment-531486715,https://api.github.com/repos/pydata/xarray/issues/3306,531486715,MDEyOklzc3VlQ29tbWVudDUzMTQ4NjcxNQ==,15016780,2019-09-14T15:03:04Z,2019-09-14T15:03:04Z,NONE,"@rabernat good points. One thing I'm not sure how to make reproducible is calling a remote file store, since I think it usually requires writing to a write-protected cloud storage provider. Any tips on this? I have what should be an otherwise working example here: https://gist.github.com/abarciauskas-bgse/d0aac2ae9bf0b06f52a577d0a6251b2d - let me know if this is an ok format to share for reproducing the issue.
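In the meantime, one idea for a credential-free version of the example (just a sketch with a synthetic stand-in dataset, not the actual gist code) would be to write to a local Zarr directory store and only swap in the S3 mapper at the end:

```
import numpy as np
import xarray as xr

# stand-in dataset; in the real example this would come from open_mfdataset on the OPeNDAP URLs
ds = xr.Dataset({'analysed_sst': (('time', 'lat', 'lon'), np.random.rand(2, 10, 10))})

# a plain local directory store needs no credentials, so anyone can run it;
# replacing the path with the fsspec/S3 mapper reproduces the remote case
ds.to_zarr('local_test.zarr', mode='w')
```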
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,493058488 https://github.com/pydata/xarray/issues/3306#issuecomment-531435069,https://api.github.com/repos/pydata/xarray/issues/3306,531435069,MDEyOklzc3VlQ29tbWVudDUzMTQzNTA2OQ==,15016780,2019-09-14T01:42:22Z,2019-09-14T01:42:22Z,NONE,"Update: I've made some progress on determining the source of this issue. It seems related to the source dataset's variables. When I use 2 opendap urls with 4 parameterized variables things work fine Using 2 urls like: ``` https://podaac-opendap.jpl.nasa.gov:443/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc?time[0:1:0],lat[0:1:17998],lon[0:1:35999],analysed_sst[0:1:0][0:1:17998][0:1:35999],analysis_error[0:1:0][0:1:17998][0:1:35999],mask[0:1:0][0:1:17998][0:1:35999],sea_ice_fraction[0:1:0][0:1:17998][0:1:35999] ``` I get back a dataset : ``` Dimensions: (lat: 17999, lon: 36000, time: 2) Coordinates: * lat (lat) float32 -89.99 -89.98 -89.97 ... 89.97 89.98 89.99 * lon (lon) float32 -179.99 -179.98 -179.97 ... 179.99 180.0 * time (time) datetime64[ns] 2018-04-22T09:00:00 2018-04-23T09:00:00 Data variables: analysed_sst (time, lat, lon) float32 dask.array analysis_error (time, lat, lon) float32 dask.array Attributes: Conventions: CF-1.5 title: Daily MUR SST, Final product ``` however if I omit the parameterized data variables using urls such as: ``` https://podaac-opendap.jpl.nasa.gov:443/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc ``` I get back an additional variable: ``` Dimensions: (lat: 17999, lon: 36000, time: 2) Coordinates: * lat (lat) float32 -89.99 -89.98 -89.97 ... 89.97 89.98 89.99 * lon (lon) float32 -179.99 -179.98 -179.97 ... 179.99 180.0 * time (time) datetime64[ns] 2018-04-22T09:00:00 2018-04-23T09:00:00 Data variables: analysed_sst (time, lat, lon) float32 dask.array analysis_error (time, lat, lon) float32 dask.array mask (time, lat, lon) float32 dask.array sea_ice_fraction (time, lat, lon) float32 dask.array dt_1km_data (time, lat, lon) timedelta64[ns] dask.array Attributes: Conventions: CF-1.5 title: Daily MUR SST, Final product ``` In the first case (with the parameterized variables) I achieve the expected result (data is stored on S3). In the second case (no parameterized variables), `store` store is never included in the graph the workers seem to stall. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,493058488