
issue_comments


7 rows where issue = 548475127 sorted by updated_at descending


Facets:

  • user (3 values): dmedv (3), abarciauskas-bgse (3), rabernat (1)
  • author_association (2 values): NONE (6), MEMBER (1)
  • issue (1 value): Different data values from xarray open_mfdataset when using chunks (7)
Columns: id · html_url · issue_url · node_id · user · created_at · updated_at (sorted descending) · author_association · body · reactions · performed_via_github_app · issue
id: 576422784 · user: abarciauskas-bgse (15016780) · author_association: NONE · created_at: 2020-01-20T20:35:47Z · updated_at: 2020-01-20T20:35:47Z
html_url: https://github.com/pydata/xarray/issues/3686#issuecomment-576422784 · issue_url: https://api.github.com/repos/pydata/xarray/issues/3686 · node_id: MDEyOklzc3VlQ29tbWVudDU3NjQyMjc4NA==

Closing, as using `mask_and_scale=False` produced precise results.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Different data values from xarray open_mfdataset when using chunks (548475127)
id: 573458081 · user: abarciauskas-bgse (15016780) · author_association: NONE · created_at: 2020-01-12T21:17:11Z · updated_at: 2020-01-12T21:17:11Z
html_url: https://github.com/pydata/xarray/issues/3686#issuecomment-573458081 · issue_url: https://api.github.com/repos/pydata/xarray/issues/3686 · node_id: MDEyOklzc3VlQ29tbWVudDU3MzQ1ODA4MQ==

Thanks @rabernat. I would like to use `assert_allclose` to test the output, but at first pass it seems that might be prohibitively slow for large datasets. Do you recommend sampling or other good testing strategies (e.g. to assert that the xarray datasets are equal to some precision)? (One possible sampling approach is sketched after this record.)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Different data values from xarray open_mfdataset when using chunks (548475127)
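
One possible approach to the sampling question above, as a minimal sketch (not from the original thread): compare a strided subsample of the two datasets, or compare reduced statistics with an explicit tolerance. It assumes `fileObjs` and the chunk sizes mentioned elsewhere in this thread; the stride and tolerances are arbitrary illustrative choices.

```
import numpy as np
import xarray as xr

# Assumed context from this thread: fileObjs is a list of netCDF paths and
# analysed_sst is the variable being compared.
ds_a = xr.open_mfdataset(fileObjs, combine='by_coords')
ds_b = xr.open_mfdataset(fileObjs, chunks={'time': 1, 'lat': 1799, 'lon': 3600},
                         combine='by_coords')

# Option 1: compare a strided subsample instead of the full arrays, which keeps
# memory and compute bounded for large datasets.
sample_a = ds_a.analysed_sst.isel(lat=slice(None, None, 50), lon=slice(None, None, 50))
sample_b = ds_b.analysed_sst.isel(lat=slice(None, None, 50), lon=slice(None, None, 50))
xr.testing.assert_allclose(sample_a, sample_b, rtol=1e-5)

# Option 2: compare reduced statistics with an explicit tolerance.
np.testing.assert_allclose(ds_a.analysed_sst.mean().values,
                           ds_b.analysed_sst.mean().values, rtol=1e-5)
```
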
id: 573455625 · user: dmedv (3922329) · author_association: NONE · created_at: 2020-01-12T20:48:20Z · updated_at: 2020-01-12T20:51:01Z
html_url: https://github.com/pydata/xarray/issues/3686#issuecomment-573455625 · issue_url: https://api.github.com/repos/pydata/xarray/issues/3686 · node_id: MDEyOklzc3VlQ29tbWVudDU3MzQ1NTYyNQ==

Actually, there is no need to separate them. One can simply do something like this to apply the mask: `ds.analysed_sst.where(ds.analysed_sst != fill_value).mean() * scale_factor + offset`. It's not a bug; if we set `mask_and_scale=False`, it's simply left up to us to apply the mask manually. (A fuller sketch follows this record.)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Different data values from xarray open_mfdataset when using chunks (548475127)
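
A minimal sketch expanding the one-liner above, assuming the file layout discussed in this thread (an `analysed_sst` variable carrying CF packing attributes) and that `fileObjs` is defined as in the other comments. That the packing attributes stay in `.attrs` when decoding is disabled is an assumption based on CF conventions, not verified against these specific files.

```
import xarray as xr

# Assumed: fileObjs[0] is one of the netCDF files discussed in this thread.
ds = xr.open_dataset(fileObjs[0], mask_and_scale=False)
sst = ds.analysed_sst

# With mask_and_scale=False the CF packing attributes are expected to remain
# on the variable's attrs rather than being consumed during decoding.
fill_value = sst.attrs['_FillValue']
scale_factor = sst.attrs['scale_factor']
add_offset = sst.attrs['add_offset']

# Mask the fill value (where() promotes to float and inserts NaN), reduce,
# then apply scale/offset to the single reduced value.
masked_mean = sst.where(sst != fill_value).mean()
print(masked_mean.values * scale_factor + add_offset)
```
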
id: 573455048 · user: rabernat (1197350) · author_association: MEMBER · created_at: 2020-01-12T20:41:53Z · updated_at: 2020-01-12T20:41:53Z
html_url: https://github.com/pydata/xarray/issues/3686#issuecomment-573455048 · issue_url: https://api.github.com/repos/pydata/xarray/issues/3686 · node_id: MDEyOklzc3VlQ29tbWVudDU3MzQ1NTA0OA==

Thanks for the useful issue @abarciauskas-bgse and valuable test @dmedv.

I believe this is fundamentally a Dask issue. In general, Dask's algorithms do not guarantee numerically identical results for different chunk sizes. Roundoff errors accrue slightly differently depending on how the array is split up, and these errors are usually acceptable to users. For example, for 290.13754 vs. 290.13757 the difference is in the eighth significant digit, about 1 part in 10,000,000. Since there are only 65,536 possible 16-bit integer values (the original data type in the netCDF file), this seems more than adequate precision to me.

Calling `.mean()` on a dask array is not the same as a checksum. As with all numerical calculations, equality should be verified with a precision appropriate to the data type and algorithm, e.g. using `assert_allclose`. (A small illustration follows this record.)

There appears to be a second issue here related to fill values, but I haven't quite grasped whether we think there is a bug.

> I think it would be nice if it were possible to control the mask application in `open_dataset` separately from scale/offset.

There may be a reason why these operations are coupled. Would have to look more closely at the code to know for sure.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Different data values from xarray open_mfdataset when using chunks (548475127)
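
A small, self-contained illustration of the two points above (synthetic data, not from the original thread): reductions over different dask chunkings are only equal up to float32 roundoff, so the check should be tolerance-based rather than exact. Whether and by how much the two means differ depends on the data and the dask version.

```
import numpy as np
import dask.array as da

# Synthetic float32 data standing in for analysed_sst.
rng = np.random.default_rng(0)
x = rng.random(10_000_000).astype(np.float32) + 290.0

# Same data, two different chunkings: the reduction tree differs, so roundoff
# can accumulate differently.
mean_small_chunks = da.from_array(x, chunks=10_000).mean().compute()
mean_big_chunks = da.from_array(x, chunks=5_000_000).mean().compute()

print(mean_small_chunks, mean_big_chunks)    # may differ in the last digits
print(mean_small_chunks == mean_big_chunks)  # exact equality is not guaranteed

# Tolerance-based comparison, as suggested above.
np.testing.assert_allclose(mean_small_chunks, mean_big_chunks, rtol=1e-5)
```
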
id: 573451230 · user: dmedv (3922329) · author_association: NONE · created_at: 2020-01-12T19:59:31Z · updated_at: 2020-01-12T20:25:16Z
html_url: https://github.com/pydata/xarray/issues/3686#issuecomment-573451230 · issue_url: https://api.github.com/repos/pydata/xarray/issues/3686 · node_id: MDEyOklzc3VlQ29tbWVudDU3MzQ1MTIzMA==

@abarciauskas-bgse Yes, indeed, I forgot about `_FillValue`. That would mess up the mean calculation with `mask_and_scale=False`. I think it would be nice if it were possible to control the mask application in `open_dataset` separately from scale/offset.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Different data values from xarray open_mfdataset when using chunks (548475127)
id: 573444233 · user: abarciauskas-bgse (15016780) · author_association: NONE · created_at: 2020-01-12T18:37:59Z · updated_at: 2020-01-12T18:37:59Z
html_url: https://github.com/pydata/xarray/issues/3686#issuecomment-573444233 · issue_url: https://api.github.com/repos/pydata/xarray/issues/3686 · node_id: MDEyOklzc3VlQ29tbWVudDU3MzQ0NDIzMw==

@dmedv Thanks for this, it all makes sense to me and I see the same results; however, I wasn't able to "convert back" using scale_factor and add_offset:

```
from netCDF4 import Dataset

d = Dataset(fileObjs[0])
v = d.variables['analysed_sst']

print("Result with mask_and_scale=True")
ds_unchunked = xr.open_dataset(fileObjs[0])
print(ds_unchunked.analysed_sst.sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)

print("Result with mask_and_scale=False")
ds_unchunked = xr.open_dataset(fileObjs[0], mask_and_scale=False)
scaled = ds_unchunked.analysed_sst * v.scale_factor + v.add_offset
scaled.sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values
```

^^ That returns a different result than what I expect. I wonder if this is because of the `_FillValue` missing from trying to convert back.

However this led me to another seemingly related issue: https://github.com/pydata/xarray/issues/2304

Loss of precision seems to be the key here, so coercing the float32s to float64s appears to get the same results from both chunked and unchunked versions - but still not

```
print("results from unchunked dataset")
ds_unchunked = xr.open_mfdataset(fileObjs, combine='by_coords')
ds_unchunked['analysed_sst'] = ds_unchunked['analysed_sst'].astype(np.float64)
print(ds_unchunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)

print(f"results from chunked dataset using {chunks}")
ds_chunked = xr.open_mfdataset(fileObjs, chunks=chunks, combine='by_coords')
ds_chunked['analysed_sst'] = ds_chunked['analysed_sst'].astype(np.float64)
print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)

print("results from chunked dataset using 'auto'")
ds_chunked = xr.open_mfdataset(fileObjs, chunks={'time': 'auto', 'lat': 'auto', 'lon': 'auto'}, combine='by_coords')
ds_chunked['analysed_sst'] = ds_chunked['analysed_sst'].astype(np.float64)
print(ds_chunked.analysed_sst[1,:,:].sel(lat=slice(20,50),lon=slice(-170,-110)).mean().values)
```

returns:

```
results from unchunked dataset
290.1375818862207
results from chunked dataset using {'time': 1, 'lat': 1799, 'lon': 3600}
290.1375818862207
results from chunked dataset using 'auto'
290.1375818862207
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Different data values from xarray open_mfdataset when using chunks (548475127)
id: 573380688 · user: dmedv (3922329) · author_association: NONE · created_at: 2020-01-12T04:18:43Z · updated_at: 2020-01-12T04:27:23Z
html_url: https://github.com/pydata/xarray/issues/3686#issuecomment-573380688 · issue_url: https://api.github.com/repos/pydata/xarray/issues/3686 · node_id: MDEyOklzc3VlQ29tbWVudDU3MzM4MDY4OA==

Actually, that's true not just for `open_mfdataset`, but even for `open_dataset` with a single file. I've tried it with one of those files from PO.DAAC and got similar results: slightly different values depending on the chunking strategy.

Just a guess, but I think the problem here is that the calculations are done in floating-point arithmetic (probably float32), and you get accumulated precision errors depending on the number of chunks.

Internally in the netCDF file the analysed_sst values are stored as int16, with real-valued scale and offset attributes, so the correct way to calculate the mean would be to do it on the original int16 data and then apply scale and offset to the result. Automatic scaling is on by default (i.e. it replaces the original array values with new scaled values), but you can turn it off in `open_dataset` with the `mask_and_scale=False` option: http://xarray.pydata.org/en/stable/generated/xarray.open_dataset.html. I tried doing this, and then I got identical results for the chunked and unchunked versions. You can pass this option to `open_mfdataset` as well via `**kwargs`. (A sketch of this workflow follows this record.)

I'm basically just starting to use xarray myself, so please someone correct me if any of the above is wrong.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Different data values from xarray open_mfdataset when using chunks (548475127)
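
A sketch of the workflow described above, under the same assumptions as the rest of the thread (`fileObjs` and `chunks` defined as in the earlier comments, `analysed_sst` carrying CF packing attributes in `.attrs` once decoding is disabled). `mask_and_scale=False` is simply forwarded by `open_mfdataset` to the underlying `open_dataset` calls; the fill value still has to be masked by hand, as noted elsewhere in this thread.

```
import xarray as xr

# Assumed context: fileObjs and chunks are defined as in the earlier comments.
ds = xr.open_mfdataset(fileObjs, chunks=chunks, combine='by_coords',
                       mask_and_scale=False)
sst = ds.analysed_sst

# Undecoded variable: packed integer values plus CF packing attributes.
fill_value = sst.attrs['_FillValue']
scale_factor = sst.attrs['scale_factor']
add_offset = sst.attrs['add_offset']

# Mask the fill value, reduce over the region used in this thread, then unpack
# the single reduced value with scale and offset.
region = sst.where(sst != fill_value).sel(lat=slice(20, 50), lon=slice(-170, -110))
print(region.mean().values * scale_factor + add_offset)
```
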


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
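
For reference, the query behind this page ("7 rows where issue = 548475127 sorted by updated_at descending") can be reproduced against the underlying SQLite database. The database filename below is a placeholder assumption; the table and column names come from the schema above.

```
import sqlite3

# Placeholder path; substitute the actual SQLite file backing this Datasette.
conn = sqlite3.connect("github.db")
conn.row_factory = sqlite3.Row

rows = conn.execute(
    """
    SELECT id, [user], created_at, updated_at, author_association, body
    FROM issue_comments
    WHERE issue = 548475127
    ORDER BY updated_at DESC
    """
).fetchall()

for row in rows:
    print(row["id"], row["updated_at"], row["author_association"])
```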