issue_comments

10 rows where author_association = "NONE" and user = 9655353 sorted by updated_at descending

issue 5

  • Unexpected decoded time in xarray >= 0.10.1 4
  • selecting a point from an mfdataset 2
  • If a NetCDF file is chunked on disk, open it with compatible dask chunks 2
  • ValueError not raised when doing difference of two non-intersecting datasets 1
  • groupby() fails with a stack trace when Dask 0.15.3 is used 1

user 1

  • JanisGailis · 10 ✖

author_association 1

  • NONE · 10 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
374956179 https://github.com/pydata/xarray/issues/2002#issuecomment-374956179 https://api.github.com/repos/pydata/xarray/issues/2002 MDEyOklzc3VlQ29tbWVudDM3NDk1NjE3OQ== JanisGailis 9655353 2018-03-21T14:27:57Z 2018-03-21T14:27:57Z NONE

You have just reproduced the issue.

The correct datetime values are in the filenames. So, you open two files, one from 1991-09-01T12:00:00.00 and the other from 1991-09-02T12:00:00.00, but the decoded time dimension becomes: array(['1981-01-01T00:00:00.564166656', '1980-12-31T23:59:58.707073024'], dtype='datetime64[ns]')

Which is exactly the problem I'm facing.
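A quick sanity check of how far off the decoded values are can be sketched as follows. It assumes the time variable is encoded as seconds since 1981-01-01, which the decoded values clustering around that date suggest but which is not confirmed here; the names `epoch`, `expected`, and `reported` are illustrative only.

```python
from datetime import datetime

# Assumption: the reference epoch is 1981-01-01 (inferred from the decoded values).
epoch = datetime(1981, 1, 1)

# Timestamps taken from the filenames.
expected = [datetime(1991, 9, 1, 12), datetime(1991, 9, 2, 12)]

# Values xarray reported, truncated here to microsecond precision.
reported = [
    datetime.fromisoformat("1981-01-01T00:00:00.564166"),
    datetime.fromisoformat("1980-12-31T23:59:58.707073"),
]

# Offsets that a correct decode would have to read from the file.
offsets = [(t - epoch).total_seconds() for t in expected]

# Gap between what was expected and what was decoded.
errors = [e - r for e, r in zip(expected, reported)]

print(offsets)
print([err.days for err in errors])
```

The decoded values sit within a couple of seconds of the assumed epoch, so the error is roughly the full offset rather than a small rounding difference.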

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Unexpected decoded time in xarray >= 0.10.1 307224717
374951019 https://github.com/pydata/xarray/issues/2002#issuecomment-374951019 https://api.github.com/repos/pydata/xarray/issues/2002 MDEyOklzc3VlQ29tbWVudDM3NDk1MTAxOQ== JanisGailis 9655353 2018-03-21T14:13:35Z 2018-03-21T14:13:35Z NONE

Thanks for looking into this!

I did try 0.10.2, same result as 0.10.1.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Unexpected decoded time in xarray >= 0.10.1 307224717
374922820 https://github.com/pydata/xarray/issues/2002#issuecomment-374922820 https://api.github.com/repos/pydata/xarray/issues/2002 MDEyOklzc3VlQ29tbWVudDM3NDkyMjgyMA== JanisGailis 9655353 2018-03-21T12:39:40Z 2018-03-21T12:39:40Z NONE

Actual data can be retrieved from here: ftp://anon-ftp.ceda.ac.uk/neodc/esacci/sst/data/lt/Analysis/L4/v01.1

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Unexpected decoded time in xarray >= 0.10.1 307224717
374920483 https://github.com/pydata/xarray/issues/2002#issuecomment-374920483 https://api.github.com/repos/pydata/xarray/issues/2002 MDEyOklzc3VlQ29tbWVudDM3NDkyMDQ4Mw== JanisGailis 9655353 2018-03-21T12:29:28Z 2018-03-21T12:29:28Z NONE

#1859 seems to be related.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Unexpected decoded time in xarray >= 0.10.1 307224717
332174498 https://github.com/pydata/xarray/issues/1592#issuecomment-332174498 https://api.github.com/repos/pydata/xarray/issues/1592 MDEyOklzc3VlQ29tbWVudDMzMjE3NDQ5OA== JanisGailis 9655353 2017-09-26T11:52:17Z 2017-09-26T11:52:17Z NONE

Yes, must be the same issue. Thanks for fixing it! Feel free to close this as a duplicate.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  groupby() fails with a stack trace when Dask 0.15.3 is used 260569191
307070835 https://github.com/pydata/xarray/issues/1440#issuecomment-307070835 https://api.github.com/repos/pydata/xarray/issues/1440 MDEyOklzc3VlQ29tbWVudDMwNzA3MDgzNQ== JanisGailis 9655353 2017-06-08T10:59:45Z 2017-06-08T10:59:45Z NONE

I quite like the approach you're suggesting! What I dislike most about our current approach is that a single netCDF chunk can quite possibly fall into multiple dask chunks, and we don't control for that in any way. I'd happily swap our approach out for the more general one you suggest.

This does of course beg for input regarding the API constraints, as in, would it be a good idea to add more kwargs for chunk size threshold and edge ratio to open functions?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  If a NetCDF file is chunked on disk, open it with compatible dask chunks 233350060
306850272 https://github.com/pydata/xarray/issues/1396#issuecomment-306850272 https://api.github.com/repos/pydata/xarray/issues/1396 MDEyOklzc3VlQ29tbWVudDMwNjg1MDI3Mg== JanisGailis 9655353 2017-06-07T16:30:04Z 2017-06-07T16:50:40Z NONE

That's great to know! I think there's no need to try my 'solution' then, maybe only out of pure interest.

It would of course be interesting to know why a 'custom' chunked dataset was apparently not affected by the bug, and whether that was indeed the case.

EDIT: I read the discussion on dask github and the xarray mailing list. It's probably because when explicit chunking is used, the chunks are not aliased and fusing works as expected.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  selecting a point from an mfdataset 225774140
306838587 https://github.com/pydata/xarray/issues/1396#issuecomment-306838587 https://api.github.com/repos/pydata/xarray/issues/1396 MDEyOklzc3VlQ29tbWVudDMwNjgzODU4Nw== JanisGailis 9655353 2017-06-07T15:51:34Z 2017-06-07T15:53:06Z NONE

We had similar performance issues with xarray+dask, which we solved by using a chunking heuristic when opening a dataset. You can read about it in #1440. Now, in our case the data really wouldn't fit in memory, which is clearly not the case in your gist. Anyway, I thought I'd play around with your gist and see if chunking can make a difference.

I couldn't use your example directly, as the data it generates in memory is too large for the dev VM I'm on. So I changed the generated file size to (12, 1000, 2000). The essence of your gist remained, though: the time series extraction would take ~25 seconds, versus ~800 ms using extract_point_xarray().

So, I thought I'd try our 'chunking heuristic' on the generated test datasets. Simply split the dataset in 2x2 chunks along spatial dimensions. So:

```python
ds = xr.open_mfdataset(all_files, decode_cf=False, chunks={'time':12, 'x':1000, 'y':500})
```

To my surprise:

```python
# time extracting a timeseries of a single point
y, x = 200, 300
with ProgressBar():
    %time ts = ds.data[:, y, x].load()
```

results in

```
[########################################] | 100% Completed | 0.7s
CPU times: user 124 ms, sys: 268 ms, total: 392 ms
Wall time: 826 ms
```

I'm not entirely sure what's happening, as the file obviously fits in memory just fine because the looping thing works well. Maybe it's fine when you loop through them one by one, but the single file chunk turns out to be too large when dask wants to parallelize the whole thing. I really have no idea.

I'd be very intrigued to see if you can get a similar result by doing a simple 2x2xtime chunking. By the way, chunks={'x':1000, 'y':500, 'time':1} produces similar results with some overhead. Extraction took ~1.5 seconds.
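To make the two chunking choices concrete, here is a small sketch that counts how many dask chunks each file is split into. It assumes the dimension order is (time, y, x), as the indexing `ds.data[:, y, x]` suggests, so the (12, 1000, 2000) shape maps to time=12, y=1000, x=2000; `blocks_per_file` is a hypothetical helper, not an xarray or dask function.

```python
import math

# Assumed per-file shape, with dimension order inferred from ds.data[:, y, x].
shape = {"time": 12, "y": 1000, "x": 2000}


def blocks_per_file(shape, chunks):
    """Return the number of chunks along each dimension and the total count."""
    per_dim = {dim: math.ceil(size / chunks[dim]) for dim, size in shape.items()}
    return per_dim, math.prod(per_dim.values())


# 2x2 spatial split, all time steps in one chunk.
spatial_only = blocks_per_file(shape, {"time": 12, "x": 1000, "y": 500})

# Same spatial split, but one time step per chunk.
per_timestep = blocks_per_file(shape, {"time": 1, "x": 1000, "y": 500})

print(spatial_only)
print(per_timestep)
```

Under these assumptions the second variant produces twelve times as many chunks per file, which is consistent with the extra scheduling overhead mentioned above.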

EDIT:

```python
print(xr.__version__)
print(dask.__version__)
```

```
0.9.5
0.14.1
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  selecting a point from an mfdataset 225774140
306814837 https://github.com/pydata/xarray/issues/1440#issuecomment-306814837 https://api.github.com/repos/pydata/xarray/issues/1440 MDEyOklzc3VlQ29tbWVudDMwNjgxNDgzNw== JanisGailis 9655353 2017-06-07T14:37:14Z 2017-06-07T14:37:14Z NONE

We had a similar issue some time ago. We use xr.open_mfdataset to open long time series of data, where each time slice is a single file. In this case each file becomes a single dask chunk, which is appropriate for most data we have to work with (ESA CCI datasets).

We encountered a problem, however, with a few datasets that had very significant compression levels, such that a single file would fit in memory, but not a few of them, on a consumer-ish laptop. So, the machine would quickly run out of memory when working with the opened dataset.

As we have to be able to open all ESA CCI datasets 'automatically', manually denoting the chunk sizes was not an option, so we explored a few ways to do this. Aligning the chunk sizes with NetCDF chunking was not a great idea because of the reason shoyer mentions above. The chunk sizes for some datasets would be too small, and the bottleneck moves from memory consumption to the number of read/write operations.

We eventually figured (with help from shoyer :)) that the chunks should be small enough to fit in memory on an average user's laptop, yet as big as possible to maximize the number of NetCDF chunks falling nicely into the dask chunk. The shape of the dask chunk can also matter for maximizing the number of NetCDF chunks that fall nicely in. We figured it's a good guess to divide both lat and lon dimensions by the same divisor, as that's also how NetCDF is often chunked.

So, we open the first file, determine its 'uncompressed' size and then figure out if we should chunk it as 1, 2x2, 3x3, etc. It's far from a perfect solution, but it works in our case. Here's how we have implemented this: https://github.com/CCI-Tools/cate-core/blob/master/cate/core/ds.py#L506
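The divisor search described above could look roughly like the sketch below. This is not the cate implementation linked above; the function names, the `max_chunk_bytes` threshold, and the `max_divisor` cap are all hypothetical and chosen only for illustration.

```python
import math


def choose_divisor(uncompressed_bytes, max_chunk_bytes, max_divisor=64):
    """Pick the smallest n so that an n x n lat/lon split keeps each chunk
    at or below max_chunk_bytes (hypothetical threshold)."""
    for n in range(1, max_divisor + 1):
        if uncompressed_bytes / (n * n) <= max_chunk_bytes:
            return n
    # Fall back to the finest split we are willing to try.
    return max_divisor


def spatial_chunks(lat_size, lon_size, n):
    """Divide both spatial dimensions by the same divisor n."""
    return {"lat": math.ceil(lat_size / n), "lon": math.ceil(lon_size / n)}


# Example: a 3600 x 7200 grid of float64 values in one time slice.
size_bytes = 3600 * 7200 * 8
n = choose_divisor(size_bytes, max_chunk_bytes=50 * 1024**2)
print(n, spatial_chunks(3600, 7200, n))
```

The resulting dictionary could then be passed as the `chunks` argument when opening the files, together with whatever time chunking is appropriate.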

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  If a NetCDF file is chunked on disk, open it with compatible dask chunks 233350060
288646219 https://github.com/pydata/xarray/issues/1316#issuecomment-288646219 https://api.github.com/repos/pydata/xarray/issues/1316 MDEyOklzc3VlQ29tbWVudDI4ODY0NjIxOQ== JanisGailis 9655353 2017-03-23T08:14:37Z 2017-03-23T08:14:37Z NONE

Thanks!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  ValueError not raised when doing difference of two non-intersecting datasets 216010508

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 16.511ms · About: xarray-datasette