
issue_comments


18 rows where user = 102827 sorted by updated_at descending




issue 7

  • Speed up `decode_cf_datetime` 7
  • interpolate_na with limit argument changes size of chunks 4
  • `decode_cf_datetime()` slow because `pd.to_timedelta()` is slow if floats are passed 2
  • HDF5 error when working with compressed NetCDF files and the dask multiprocessing scheduler 2
  • DataArray.rolling() does not preserve chunksizes in some cases 1
  • [WIP] Fix problem with wrong chunksizes when using rolling_window on dask.array 1
  • Improving documentation on `apply_ufunc` 1

user 1

  • cchwala · 18

author_association 1

  • CONTRIBUTOR 18
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
896827548 https://github.com/pydata/xarray/issues/2808#issuecomment-896827548 https://api.github.com/repos/pydata/xarray/issues/2808 IC_kwDOAMm_X841dICc cchwala 102827 2021-08-11T13:28:08Z 2021-08-11T13:28:08Z CONTRIBUTOR

Thanks @keewis for linking the new tutorial. It helped me a lot in figuring out how to use apply_ufunc for my 1D case. The fact that the tutorial shows the "typical" error messages you get when trying to use it makes it really nice to follow.
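
A minimal sketch of the kind of 1D apply_ufunc usage referred to here; the function and dimension names are illustrative and not taken from the tutorial or this comment:

```python
# Minimal sketch: wrap a 1D NumPy function with xr.apply_ufunc.
# detrend_1d and the dimension names are illustrative assumptions.
import numpy as np
import xarray as xr

def detrend_1d(arr):
    """Toy 1D function: remove the mean along the last axis."""
    return arr - arr.mean(axis=-1, keepdims=True)

da = xr.DataArray(np.random.rand(3, 100), dims=("sensor", "time"))

result = xr.apply_ufunc(
    detrend_1d,
    da,
    input_core_dims=[["time"]],   # the 1D function operates along "time"
    output_core_dims=[["time"]],  # and returns an array that still has "time"
)
print(result.dims)  # ('sensor', 'time')
```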

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Improving documentation on `apply_ufunc` 420584430
434966059 https://github.com/pydata/xarray/pull/2532#issuecomment-434966059 https://api.github.com/repos/pydata/xarray/issues/2532 MDEyOklzc3VlQ29tbWVudDQzNDk2NjA1OQ== cchwala 102827 2018-11-01T08:13:48Z 2018-11-01T08:13:48Z CONTRIBUTOR

Yes, tests are still failing. The PR is WIP. I just wanted to open the PR now to have the discussion here instead of in the issues.

I will work on fixing the code to pass all current tests. I will also check how the rechunking affects performance.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  [WIP] Fix problem with wrong chunksizes when using rolling_window on dask.array 376162232
433454137 https://github.com/pydata/xarray/issues/2514#issuecomment-433454137 https://api.github.com/repos/pydata/xarray/issues/2514 MDEyOklzc3VlQ29tbWVudDQzMzQ1NDEzNw== cchwala 102827 2018-10-26T15:49:20Z 2018-10-31T21:14:48Z CONTRIBUTOR

EDIT: The issue described in this post has now been split out into #2531

I think I have a fix, but wanted to write some failing tests before committing the changes. While doing this, I discovered that DataArray.rolling() also does not preserve the chunk sizes, apparently depending on the applied method.

```python
import pandas as pd
import numpy as np
import xarray as xr

t = pd.date_range(start='2018-01-01', end='2018-02-01', freq='H')
bar = np.sin(np.arange(len(t)))
baz = np.cos(np.arange(len(t)))

da_test = xr.DataArray(data=np.stack([bar, baz]),
                       coords={'time': t, 'sensor': ['one', 'two']},
                       dims=('sensor', 'time'))

print(da_test.chunk({'time': 100}).rolling(time=60).mean().chunks)
print(da_test.chunk({'time': 100}).rolling(time=60).count().chunks)
```

Output for mean: `((2,), (745,))`
Output for count: `((2,), (100, 100, 100, 100, 100, 100, 100, 45))`
Desired output: `((2,), (100, 100, 100, 100, 100, 100, 100, 45))`

My fix solves my initial problem, but, if done correctly, it should perhaps also solve this bug.

Any idea why this depends on whether .mean() or .count() is used?

I have already pushed some WIP changes. Should I open a PR already, even though most of the new tests still fail?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  interpolate_na with limit argument changes size of chunks 374279704
434843563 https://github.com/pydata/xarray/issues/2531#issuecomment-434843563 https://api.github.com/repos/pydata/xarray/issues/2531 MDEyOklzc3VlQ29tbWVudDQzNDg0MzU2Mw== cchwala 102827 2018-10-31T20:52:49Z 2018-10-31T20:52:49Z CONTRIBUTOR

The cause has been explained by @fujiisoup here https://github.com/pydata/xarray/issues/2514#issuecomment-433528586

Nice catch!

For some historical reasons, mean and some other reduction methods use bottleneck by default, while count does not.

mean goes through this function (xarray/core/dask_array_ops.py, line 23 at b622c5e):

def dask_rolling_wrapper(moving_func, a, window, min_count=None, axis=-1):

It looks like there is another bug in this function.
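
A minimal sketch of a possible user-side workaround (not the fix from this PR): rechunk the result of the rolling reduction back to the intended chunk size.

```python
# Sketch of a workaround (assumption, not the PR's fix): explicitly restore the
# chunking after a rolling reduction that collapses it into a single chunk.
import numpy as np
import pandas as pd
import xarray as xr

t = pd.date_range(start='2018-01-01', end='2018-02-01', freq='H')
da = xr.DataArray(np.sin(np.arange(len(t))), coords={'time': t}, dims='time').chunk({'time': 100})

rolled = da.rolling(time=60).mean()   # may come back as one chunk along "time"
rolled = rolled.chunk({'time': 100})  # rechunk to the intended chunk size
print(rolled.chunks)
```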

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DataArray.rolling() does not preserve chunksizes in some cases 376154741
433992180 https://github.com/pydata/xarray/issues/2514#issuecomment-433992180 https://api.github.com/repos/pydata/xarray/issues/2514 MDEyOklzc3VlQ29tbWVudDQzMzk5MjE4MA== cchwala 102827 2018-10-29T17:01:12Z 2018-10-29T17:01:12Z CONTRIBUTOR

@dcherian Okay. A WIP PR will follow, but might take some days.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  interpolate_na with limit argument changes size of chunks 374279704
433369567 https://github.com/pydata/xarray/issues/2514#issuecomment-433369567 https://api.github.com/repos/pydata/xarray/issues/2514 MDEyOklzc3VlQ29tbWVudDQzMzM2OTU2Nw== cchwala 102827 2018-10-26T10:53:32Z 2018-10-26T10:53:32Z CONTRIBUTOR

Thanks @fujiisoup for the quick response and the pointers. I will have a look and report back if a PR is within my capabilities or not.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  interpolate_na with limit argument changes size of chunks 374279704
433346685 https://github.com/pydata/xarray/issues/2514#issuecomment-433346685 https://api.github.com/repos/pydata/xarray/issues/2514 MDEyOklzc3VlQ29tbWVudDQzMzM0NjY4NQ== cchwala 102827 2018-10-26T09:27:19Z 2018-10-26T09:27:19Z CONTRIBUTOR

The problem seems to occur here

https://github.com/pydata/xarray/blob/5940100761478604080523ebb1291ecff90e779e/xarray/core/missing.py#L368-L376

because of the usage of .construct(). A quick try without it shows that the chunk size is then preserved.

Hence, .construct() might need a fix to deal correctly with the chunks of dask arrays.
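
A minimal illustration (not the missing.py code linked above) of how one can check whether rolling(...).construct() preserves the dask chunking of its input:

```python
# Sketch: compare the dask chunking of a DataArray before and after
# rolling(...).construct(). Names and sizes here are illustrative.
import numpy as np
import pandas as pd
import xarray as xr

t = pd.date_range(start='2018-01-01', end='2018-02-01', freq='H')
da = xr.DataArray(np.arange(len(t), dtype=float), coords={'time': t}, dims='time').chunk({'time': 100})

print(da.chunks)                                   # chunking of the input along "time"
windowed = da.rolling(time=3).construct('window')  # adds a new "window" dimension
print(windowed.chunks)                             # compare the "time" chunking with the input's
```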

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  interpolate_na with limit argument changes size of chunks 374279704
361532119 https://github.com/pydata/xarray/issues/1836#issuecomment-361532119 https://api.github.com/repos/pydata/xarray/issues/1836 MDEyOklzc3VlQ29tbWVudDM2MTUzMjExOQ== cchwala 102827 2018-01-30T09:32:26Z 2018-01-30T09:32:26Z CONTRIBUTOR

Thanks @jhamman for looking into this.

Currently I am fine with using persist(), since I can break my analysis workflow down into time periods whose data fits into RAM on a large machine. As I have written, the distributed scheduler failed for me because of #1464, but I would like to use it in the future. From other discussions on the dask schedulers (here or on SO), using the distributed scheduler seems to be the general recommendation anyway.

In summary, I am fine with my current workaround. I do not think that solving this issue has a high priority, particularly as the distributed scheduler is further improved. The main annoyance was tracking down the problem described in my first post. Hence, maybe the limitations of the schedulers could be described a bit better in the documentation. Would you like a PR on this?
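
A minimal sketch of the persist() workaround described above; the file name, chunking, and time period are made up for illustration:

```python
# Sketch of the workaround (assumed workflow, not from the issue): open the data
# lazily, persist a time slice that fits into RAM, then reduce it.
import xarray as xr

ds = xr.open_dataset('data.nc', chunks={'time': 1000})              # lazy, dask-backed
subset = ds.sel(time=slice('2017-01-01', '2017-03-31')).persist()   # materialize this period once
result = subset.mean('time').compute()
```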

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  HDF5 error when working with compressed NetCDF files and the dask multiprocessing scheduler 289342234
358445479 https://github.com/pydata/xarray/issues/1836#issuecomment-358445479 https://api.github.com/repos/pydata/xarray/issues/1836 MDEyOklzc3VlQ29tbWVudDM1ODQ0NTQ3OQ== cchwala 102827 2018-01-17T21:07:43Z 2018-01-17T21:07:43Z CONTRIBUTOR

Thanks for the quick answer.

The problem is that my actual use case also involves writing an xarray.Dataset back via to_netcdf(). I left this out of the example above to isolate the problem. With the distributed scheduler and to_netcdf(), I ran into issue #1464. As far as I can see, this might be fixed "soon" (#1793).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  HDF5 error when working with compressed NetCDF files and the dask multiprocessing scheduler 289342234
317786250 https://github.com/pydata/xarray/pull/1414#issuecomment-317786250 https://api.github.com/repos/pydata/xarray/issues/1414 MDEyOklzc3VlQ29tbWVudDMxNzc4NjI1MA== cchwala 102827 2017-07-25T16:03:46Z 2017-07-25T16:03:46Z CONTRIBUTOR

@jhamman @shoyer This should be ready to merge.

Should I open an xarray issue about the bug in pandas.to_timedelta(), or is it enough to have the issue I submitted for pandas? I think the bug will be resolved in xarray once it is resolved in pandas, because then the overflow check here should catch the cases I discovered.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up `decode_cf_datetime` 229807027
316963228 https://github.com/pydata/xarray/pull/1414#issuecomment-316963228 https://api.github.com/repos/pydata/xarray/issues/1414 MDEyOklzc3VlQ29tbWVudDMxNjk2MzIyOA== cchwala 102827 2017-07-21T10:10:54Z 2017-07-21T10:10:54Z CONTRIBUTOR

hmm... it's still complicated. To avoid the NaTs in my code, I tried to extend the current overflow check so that it switches to _decode_datetime_with_netcdf4() earlier. This was my attempt:

```python
(pd.to_timedelta(flat_num_dates.min(), delta) - pd.to_timedelta(1, 'd') + ref_date)
(pd.to_timedelta(flat_num_dates.max(), delta) + pd.to_timedelta(1, 'd') + ref_date)
```

But unfortunately, as shown in my notebook above, pandas.to_timedelta() has a bug and does not detect the overflow in those esoteric cases that I have identified... I have filed this issue, pandas-dev/pandas/issues/17037, because it should be solved there.

Since I do not think this will be fixed soon (I would gladly look at it, but have no time and probably not enough knowledge about the pandas core stuff), I am not sure what to do.

Do you want to merge this PR, knowing that there still is the overflow issue that was in the code before? Or should I continue to try to fix the current overflow bug in this PR?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up `decode_cf_datetime` 229807027
315643209 https://github.com/pydata/xarray/pull/1414#issuecomment-315643209 https://api.github.com/repos/pydata/xarray/issues/1414 MDEyOklzc3VlQ29tbWVudDMxNTY0MzIwOQ== cchwala 102827 2017-07-16T22:41:50Z 2017-07-16T22:41:50Z CONTRIBUTOR

...but wait. The NaTs that my code produces beyond the int64 overflow should be valid dates, produced using _decode_datetime_with_netcdf4, right?

Hence, I should still add a check for NaT results and then fall back to the netCDF version.
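
A minimal sketch of that NaT-check-and-fallback idea; this is an assumption about the intended logic, not the PR's actual code, and the two decoder callables are hypothetical stand-ins for the pandas-based and netCDF4-based paths:

```python
# Sketch (assumption, not the PR code): if the fast decoding path produced NaT,
# redo the decoding with the netCDF4-based path.
import numpy as np

def decode_with_fallback(flat_num_dates, units, calendar, fast_decode, netcdf4_decode):
    # fast_decode / netcdf4_decode: hypothetical callables for the two decoding paths
    dates = fast_decode(flat_num_dates, units, calendar)
    if np.isnat(dates).any():
        # at least one value overflowed the fast path; fall back for the whole array
        dates = netcdf4_decode(flat_num_dates, units, calendar)
    return dates
```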

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up `decode_cf_datetime` 229807027
315637844 https://github.com/pydata/xarray/pull/1414#issuecomment-315637844 https://api.github.com/repos/pydata/xarray/issues/1414 MDEyOklzc3VlQ29tbWVudDMxNTYzNzg0NA== cchwala 102827 2017-07-16T21:15:04Z 2017-07-16T21:34:12Z CONTRIBUTOR

@jhamman - I found some differences between the old code in master and my code when decoding values close to the np.datetime64 overflow. My code produces NaT where the old code returned some date.

First, I wanted to test and fix that. However, I may have found that the old implementation did not behave correctly when crossing the "overflow" line just slightly.

I have summed that up in a notebook here.

My conclusion would be that the code in this PR is not only faster but also more correct than the old one. However, since it is quite late in the evening and my head needs some rest, I would like to get a second (or third) opinion...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up `decode_cf_datetime` 229807027
315322859 https://github.com/pydata/xarray/pull/1414#issuecomment-315322859 https://api.github.com/repos/pydata/xarray/issues/1414 MDEyOklzc3VlQ29tbWVudDMxNTMyMjg1OQ== cchwala 102827 2017-07-14T10:05:04Z 2017-07-14T10:05:04Z CONTRIBUTOR

@jhamman - Sorry, I was away from the office (and everything related to work) for more than a month and had to catch up on a lot of things. I will sum up my stuff and post it here, hopefully after today's lunch break.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up `decode_cf_datetime` 229807027
305469383 https://github.com/pydata/xarray/pull/1414#issuecomment-305469383 https://api.github.com/repos/pydata/xarray/issues/1414 MDEyOklzc3VlQ29tbWVudDMwNTQ2OTM4Mw== cchwala 102827 2017-06-01T11:43:27Z 2017-06-01T11:43:27Z CONTRIBUTOR

Just a short notice: sorry for the delay. I am still working on this PR, but I am too busy right now to finish the overflow testing. I think I have found some edge cases which have to be handled. I will provide more details soon.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up `decode_cf_datetime` 229807027
302943727 https://github.com/pydata/xarray/pull/1414#issuecomment-302943727 https://api.github.com/repos/pydata/xarray/issues/1414 MDEyOklzc3VlQ29tbWVudDMwMjk0MzcyNw== cchwala 102827 2017-05-21T15:28:15Z 2017-05-21T15:28:15Z CONTRIBUTOR

Thanks @shoyer and @jhamman for the feedback. I will change things accordingly.

Concerning tests, I will think again about additional checks for the correct handling of overflow. I must admit that I am not 100% sure that every case is handled correctly by the current code and checked by the current tests. I will have to think about it a little when I find time within the next few days...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up `decode_cf_datetime` 229807027
300072972 https://github.com/pydata/xarray/issues/1399#issuecomment-300072972 https://api.github.com/repos/pydata/xarray/issues/1399 MDEyOklzc3VlQ29tbWVudDMwMDA3Mjk3Mg== cchwala 102827 2017-05-09T06:26:36Z 2017-05-09T06:26:36Z CONTRIBUTOR

Okay. I will try to come up with a PR within the next days.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `decode_cf_datetime()` slow because `pd.to_timedelta()` is slow if floats are passed 226549366
299819380 https://github.com/pydata/xarray/issues/1399#issuecomment-299819380 https://api.github.com/repos/pydata/xarray/issues/1399 MDEyOklzc3VlQ29tbWVudDI5OTgxOTM4MA== cchwala 102827 2017-05-08T09:32:58Z 2017-05-08T09:32:58Z CONTRIBUTOR

Hmm... The "nanosecond"-issue seems to need a fix very much at the foundation. As long as pandas and xarray rely on datetime64[ns] you cannot avoid nanoseconds, right? pd.to_datetime() forces the conversion to nanoscends even if you pass integers but for a time unit different to ns. This does not make me as nervous as Fabien since my data is always quite recent, but I see that this is far from ideal for a tool for climate scientists.

An intermediate fix (@shoyer, do you actually want one?) that I could think of for the performance issue right now would be to do the conversion to datetime64[ns] depending on the time unit, e.g.

  • multiply raw values (most likely floats) with number of nanoseconds in time unit for units smaller then days (or hours?) and use these values as integers in pd.to_datetime()
  • else, fall back to using netCDF4/netcdftime for months and years (as suggested by shoyer) casting the raw values to floats

The only thing that bothers me is that I am not sure if the "number of nanoseconds" is always the same in every day or hour in the view of datetime64, due to leap seconds or other particularities.

@shoyer: Does this sound reasonable or did I forget to take into account any side effects?
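
A rough sketch of the first bullet above (my reading of the idea, not xarray's implementation; the function and dictionary names are made up):

```python
# Sketch (assumption, not xarray code): scale float offsets in a sub-day unit to
# integer nanoseconds and let pd.to_datetime() work on integers relative to ref_date.
import numpy as np
import pandas as pd

NS_PER_UNIT = {
    'seconds': 1_000_000_000,
    'minutes': 60 * 1_000_000_000,
    'hours': 3_600 * 1_000_000_000,
    'days': 86_400 * 1_000_000_000,
}

def decode_small_unit(raw_values, unit, ref_date):
    """raw_values: float offsets since ref_date, expressed in `unit` (a key of NS_PER_UNIT)."""
    ns = (np.asarray(raw_values, dtype='float64') * NS_PER_UNIT[unit]).round().astype('int64')
    return pd.to_datetime(ns, unit='ns', origin=pd.Timestamp(ref_date))

print(decode_small_unit([0.0, 1.5, 24.0], 'hours', '2000-01-01'))
```

As far as I know, numpy's datetime64 follows POSIX-style time and ignores leap seconds, so a day or hour always corresponds to a fixed number of nanoseconds in this representation.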

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `decode_cf_datetime()` slow because `pd.to_timedelta()` is slow if floats are passed 226549366


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);