issue_comments

5 rows where author_association = "CONTRIBUTOR" and user = 1554921, sorted by updated_at descending

issue (4 values)
  • Multi-dimensional coordinate mixup when writing to netCDF (2)
  • Fix multidimensional coordinates (1)
  • to_netcdf(compute=False) can be slow (1)
  • Writing Datasets to netCDF4 with "inconsistent" chunks (1)

user (1 value)
  • neishm (5)

author_association (1 value)
  • CONTRIBUTOR (5)

id: 400825442
html_url: https://github.com/pydata/xarray/issues/2254#issuecomment-400825442
issue_url: https://api.github.com/repos/pydata/xarray/issues/2254
node_id: MDEyOklzc3VlQ29tbWVudDQwMDgyNTQ0Mg==
user: neishm (1554921)
created_at: 2018-06-27T20:53:27Z
updated_at: 2018-06-27T20:53:27Z
author_association: CONTRIBUTOR
body:

So yes, it looks like we could fix this by checking chunks on each array independently like you suggest. There's no reason why all dask arrays need to have the same chunking for storing with to_netcdf().

I could throw together a pull request if that's all that's involved.

> This is because you need to indicate chunks for variables separately, via encoding: http://xarray.pydata.org/en/stable/io.html#writing-encoded-data

Thanks! I was able to write chunked output to the netCDF file by adding chunksizes to the encoding attribute of the variables. I found I also had to specify original_shape as a workaround for #2198.
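To make that concrete, here is a minimal sketch of the approach described above. The dataset, variable name, shapes, and chunk sizes are invented for illustration; only the chunksizes and original_shape encoding keys come from the comment itself.

```python
import dask.array as da
import xarray as xr

# Hypothetical dataset with one dask-chunked variable; names and sizes are
# illustrative only.
ds = xr.Dataset({"foo": (("y", "x"), da.zeros((100, 200), chunks=(50, 100)))})

encoding = {
    "foo": {
        "chunksizes": (50, 100),       # per-variable netCDF4 chunking
        "original_shape": (100, 200),  # workaround mentioned above for #2198
    }
}
ds.to_netcdf("chunked.nc", encoding=encoding)
```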

reactions:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Writing Datasets to netCDF4 with "inconsistent" chunks (336273865)

id: 399495668
html_url: https://github.com/pydata/xarray/issues/2242#issuecomment-399495668
issue_url: https://api.github.com/repos/pydata/xarray/issues/2242
node_id: MDEyOklzc3VlQ29tbWVudDM5OTQ5NTY2OA==
user: neishm (1554921)
created_at: 2018-06-22T16:10:45Z
updated_at: 2018-06-22T16:10:45Z
author_association: CONTRIBUTOR
body:

True, I would expect some performance hit due to writing chunk-by-chunk; however, that same performance hit is present in both of the test cases.

In addition to the snippet @shoyer mentioned, I found that xarray also intentionally uses autoclose=True when writing chunks to netCDF: https://github.com/pydata/xarray/blob/73b476e4db6631b2203954dd5b138cb650e4fb8c/xarray/backends/netCDF4_.py#L45-L48

However, ensure_open only uses autoclose if the file isn't already open:

https://github.com/pydata/xarray/blob/73b476e4db6631b2203954dd5b138cb650e4fb8c/xarray/backends/common.py#L496-L503

So if the file is already open before getting to BaseNetCDF4Array.__setitem__, it will remain open. If the file isn't yet open, it will be opened, but then immediately closed after writing the chunk. I suspect this is what's happening in the delayed version - the starting state of NetCDF4DataStore._isopen is False for some reason, and so it is doomed to re-close itself for each chunk processed.

If I remove the autoclose=True from BaseNetCDF4Array.__setitem__, the file remains open and performance is comparable between the two tests.
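For illustration only, a minimal sketch of the two write patterns being contrasted, written against the netCDF4 library directly rather than xarray's actual BaseNetCDF4Array code path; the file names, variable, and chunk layout are made up.

```python
import numpy as np
import netCDF4

def create(fname, n=1000):
    # Create a simple file with one 1-D variable to write into.
    with netCDF4.Dataset(fname, "w") as nc:
        nc.createDimension("x", n)
        nc.createVariable("foo", "f8", ("x",))

chunks = [(i, np.full(10, float(i))) for i in range(0, 1000, 10)]

# Pattern 1: reopen (and close) the file for every chunk, analogous to the
# autoclose behaviour when the file starts out closed.
create("reopen.nc")
for start, data in chunks:
    with netCDF4.Dataset("reopen.nc", "a") as nc:
        nc.variables["foo"][start:start + len(data)] = data

# Pattern 2: keep the file open across all chunk writes.
create("keep_open.nc")
with netCDF4.Dataset("keep_open.nc", "a") as nc:
    for start, data in chunks:
        nc.variables["foo"][start:start + len(data)] = data
```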

reactions:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: to_netcdf(compute=False) can be slow (334633212)

id: 350292555
html_url: https://github.com/pydata/xarray/issues/1763#issuecomment-350292555
issue_url: https://api.github.com/repos/pydata/xarray/issues/1763
node_id: MDEyOklzc3VlQ29tbWVudDM1MDI5MjU1NQ==
user: neishm (1554921)
created_at: 2017-12-08T15:34:01Z
updated_at: 2017-12-08T15:34:01Z
author_association: CONTRIBUTOR
body:

I think I've duplicated the logic from _construct_dataarray into _encode_coordinates. Test cases are passing, and my actual files are writing out properly. Hopefully nothing else got broken along the way.

reactions:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Multi-dimensional coordinate mixup when writing to netCDF (279832457)

id: 350090601
html_url: https://github.com/pydata/xarray/pull/1768#issuecomment-350090601
issue_url: https://api.github.com/repos/pydata/xarray/issues/1768
node_id: MDEyOklzc3VlQ29tbWVudDM1MDA5MDYwMQ==
user: neishm (1554921)
created_at: 2017-12-07T20:51:27Z
updated_at: 2017-12-07T20:51:27Z
author_association: CONTRIBUTOR
body:

No fix yet, just added a test case.

reactions:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Fix multidimensional coordinates (280274296)

id: 350015214
html_url: https://github.com/pydata/xarray/issues/1763#issuecomment-350015214
issue_url: https://api.github.com/repos/pydata/xarray/issues/1763
node_id: MDEyOklzc3VlQ29tbWVudDM1MDAxNTIxNA==
user: neishm (1554921)
created_at: 2017-12-07T16:11:55Z
updated_at: 2017-12-07T16:11:55Z
author_association: CONTRIBUTOR
body:

I can try putting together a pull request, hopefully without breaking any existing use cases. I just tested switching the any condition to all in the above code, and it does fix my one test case...

...However, it breaks other cases, such as if there's another axis in the data (such as a time axis). I think the all condition would require "time" to be one of the dimensions of the coordinates.

Here's an updated test case:

```python
import xarray as xr
import numpy as np

zeros1 = np.zeros((1,5,3))
zeros2 = np.zeros((1,6,3))
zeros3 = np.zeros((1,5,4))
d = xr.Dataset({
    'lon1': (['x1','y1'], zeros1.squeeze(0), {}),
    'lon2': (['x2','y1'], zeros2.squeeze(0), {}),
    'lon3': (['x1','y2'], zeros3.squeeze(0), {}),
    'lat1': (['x1','y1'], zeros1.squeeze(0), {}),
    'lat2': (['x2','y1'], zeros2.squeeze(0), {}),
    'lat3': (['x1','y2'], zeros3.squeeze(0), {}),
    'foo1': (['time','x1','y1'], zeros1, {'coordinates': 'lon1 lat1'}),
    'foo2': (['time','x2','y1'], zeros2, {'coordinates': 'lon2 lat2'}),
    'foo3': (['time','x1','y2'], zeros3, {'coordinates': 'lon3 lat3'}),
    'time': ('time', [0.], {'units': 'hours since 2017-01-01'}),
})
d = xr.conventions.decode_cf(d)
```

The resulting Dataset:

```
<xarray.Dataset>
Dimensions:  (time: 1, x1: 5, x2: 6, y1: 3, y2: 4)
Coordinates:
    lat1     (x1, y1) float64 ...
  * time     (time) datetime64[ns] 2017-01-01
    lat3     (x1, y2) float64 ...
    lat2     (x2, y1) float64 ...
    lon1     (x1, y1) float64 ...
    lon3     (x1, y2) float64 ...
    lon2     (x2, y1) float64 ...
Dimensions without coordinates: x1, x2, y1, y2
Data variables:
    foo1     (time, x1, y1) float64 ...
    foo2     (time, x2, y1) float64 ...
    foo3     (time, x1, y2) float64 ...
```

saved to netCDF using

```python
d.to_netcdf("test.nc")
```

With the any condition, I have too many coordinates:

```
~$ ncdump -h test.nc
netcdf test {
dimensions:
    x1 = 5 ;
    y1 = 3 ;
    time = 1 ;
    y2 = 4 ;
    x2 = 6 ;
variables:
    ...
    double foo1(time, x1, y1) ;
        foo1:_FillValue = NaN ;
        foo1:coordinates = "lat1 lat3 lat2 lon1 lon3 lon2" ;
    double foo2(time, x2, y1) ;
        foo2:_FillValue = NaN ;
        foo2:coordinates = "lon1 lon2 lat1 lat2" ;
    double foo3(time, x1, y2) ;
        foo3:_FillValue = NaN ;
        foo3:coordinates = "lon1 lon3 lat1 lat3" ;
    ...
}
```

With the all condition, I don't get any variable coordinates (they're dumped into the global attributes):

```
~$ ncdump -h test.nc
netcdf test {
dimensions:
    x1 = 5 ;
    y1 = 3 ;
    time = 1 ;
    y2 = 4 ;
    x2 = 6 ;
variables:
    ...
    double foo1(time, x1, y1) ;
        foo1:_FillValue = NaN ;
    double foo2(time, x2, y1) ;
        foo2:_FillValue = NaN ;
    double foo3(time, x1, y2) ;
        foo3:_FillValue = NaN ;

// global attributes:
        :_NCProperties = "version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.8.18" ;
        :coordinates = "lat1 lat3 lat2 lon1 lon3 lon2" ;
}
```

So the update may be a bit trickier to get right. I know the DataArray objects (foo1, foo2, foo3) already have the right coordinates associated with them before writing to netCDF, so maybe the logic in _encode_coordinates could be changed to use v.coords somehow? I'll see if I can get something working for my test cases...
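A tiny illustration (not the actual _encode_coordinates code) of why both conditions misbehave, using the dimensions from the test case above:

```python
foo1_dims = ("time", "x1", "y1")   # dims of the data variable foo1
lon1_dims = ("x1", "y1")           # the coordinate that should attach to foo1
lon2_dims = ("x2", "y1")           # a coordinate that belongs to foo2

# "any" condition: a single shared dimension is enough, so lon2 wrongly
# matches foo1 as well.
print(any(d in lon2_dims for d in foo1_dims))   # True  -> too many coordinates

# "all" condition: every dim of foo1, including "time", must appear in the
# coordinate's dims, so even lon1 fails and foo1 ends up with no coordinates.
print(all(d in lon1_dims for d in foo1_dims))   # False -> coordinates dropped
```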

reactions:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Multi-dimensional coordinate mixup when writing to netCDF (279832457)

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
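A minimal sketch of querying this table directly for the filtered view above, using Python's sqlite3 module; the database filename github.db is an assumption.

```python
import sqlite3

# Assumed database file; adjust to the actual Datasette database path.
conn = sqlite3.connect("github.db")
conn.row_factory = sqlite3.Row

# Same filter as this page: CONTRIBUTOR comments by user 1554921 (neishm),
# newest first.
rows = conn.execute(
    """
    SELECT id, issue_url, created_at, updated_at, body
    FROM issue_comments
    WHERE author_association = ? AND [user] = ?
    ORDER BY updated_at DESC
    """,
    ("CONTRIBUTOR", 1554921),
).fetchall()

for row in rows:
    print(row["id"], row["updated_at"], row["issue_url"])
```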