issue_comments


9 rows where author_association = "MEMBER" and issue = 245624267 sorted by updated_at descending


Issue: lazily load dask arrays to dask data frames by calling to_dask_dataframe (#1489)

Commenters: shoyer (6), jhamman (2), mrocklin (1)
shoyer (MEMBER) · 2017-10-28T00:21:48Z · https://github.com/pydata/xarray/pull/1489#issuecomment-340125534

@jmunroe Thanks for your help here! I'm going to merge this now and take care of my remaining clean-up requests in a follow-on PR.

shoyer (MEMBER) · 2017-10-27T07:28:02Z · https://github.com/pydata/xarray/pull/1489#issuecomment-339894999

Just pushed a couple of commits, which should resolve the failures on Windows. It was typical int32 vs int64 NumPy on Windows nonsense.
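
The underlying issue is generic: NumPy's default integer is 32-bit on Windows (the C long) but 64-bit on Linux and macOS. A minimal illustration of the usual fix, pinning the dtype explicitly (not the actual commits from this PR):

```python
import numpy as np

# np.arange with no dtype uses the platform default integer:
# int32 on Windows, int64 on Linux/macOS. Pinning the dtype
# makes comparisons and test expectations platform-independent.
idx = np.arange(5, dtype=np.int64)
print(idx.dtype)
```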

shoyer (MEMBER) · 2017-10-21T18:49:57Z · https://github.com/pydata/xarray/pull/1489#issuecomment-338424196

@mrocklin are you saying that it's easier to properly rechunk data on the xarray side (as arrays) before converting to dask dataframes? That does make sense -- we have some nice structure (as multi-dimensional arrays) that is lost once the data gets put in a DataFrame.

In this case, I suppose we really should add a keyword argument like dims_order to to_dask_dataframe() that lets the user choose how they want to order dimensions on the result.

Initially, I was concerned about the resulting dask graphs when flattening out arrays in the wrong order. Although that would have bad performance implications if you need to stream the data from disk, I see now the total number of chunks no longer blows up, thanks to @pitrou's impressive rewrite of dask.array.reshape().

mrocklin (MEMBER) · 2017-10-21T12:47:34Z · https://github.com/pydata/xarray/pull/1489#issuecomment-338392039

I think that you would want to rechunk the dask.array so that its chunks align with the output divisions of the dask.dataframe. For example, if you have a 2d array and are partitioning along the x-axis, then you will want to rechunk the array so that there is no chunking along the y-axis. In this case set_index will also be free, because your data is already aligned and you already know (I think) the division values.

shoyer (MEMBER) · 2017-10-21T06:33:27Z · https://github.com/pydata/xarray/pull/1489#issuecomment-338368158

@jcrist @mrocklin @jhamman do any of you have opinions on my latest design question above about the order of elements in dask dataframes? Is it as important as I suspect to keep chunking/divisions consistent when converting from arrays to dataframes?

jhamman (MEMBER) · 2017-10-09T22:29:45Z · https://github.com/pydata/xarray/pull/1489#issuecomment-335307599

@jmunroe - can we help move this forward? I'd like to see this get into v0.10 if possible.

jhamman (MEMBER) · 2017-09-05T20:55:25Z · https://github.com/pydata/xarray/pull/1489#issuecomment-327300551

@jmunroe -

I added the PR checklist back to the top of this issue. The most pressing to-do item is getting some documentation written for this.

  • The method will need to be added to api.rst
  • We need a note briefly describing this feature in whats-new.rst
  • We'll want to show an example of how this method can be used (either in the working with pandas or the dask doc sections)
shoyer (MEMBER) · 2017-08-10T06:22:02Z · https://github.com/pydata/xarray/pull/1489#issuecomment-321461998

@jmunroe This is great functionality -- thanks for your work on this!

One concern: if possible, I would like to avoid adding explicit dask graph building code in xarray. It looks like the canonical way to transform a list of dask/numpy arrays into a dask dataframe is to make use of dask.dataframe.from_array along with dask.dataframe.concat:

```
In [34]: import numpy as np

In [35]: import dask.dataframe as dd

In [36]: import dask.array as da

In [37]: x = da.from_array(np.arange(5), 2)

In [38]: y = da.from_array(np.linspace(-np.pi, np.pi, 5), 2)

# notice that dtype is preserved properly

In [39]: dd.concat([dd.from_array(x), dd.from_array(y)], axis=1)
Out[39]:
Dask DataFrame Structure:
                   0        1
npartitions=2
0              int64  float64
2                ...      ...
4                ...      ...
Dask Name: concat-indexed, 26 tasks
```

Can you look into refactoring your code to make use of these?

shoyer (MEMBER) · 2017-07-27T02:58:35Z · https://github.com/pydata/xarray/pull/1489#issuecomment-318244646

Given that dask dataframes don't support MultiIndexes (among many other features), I have a hard time seeing them as a drop-in replacement for pandas.DataFrame. So maybe it would make sense to make this a separate method, e.g., to_dask_dataframe()?

We could also use a new method as an opportunity to slightly change the API, by not setting an index automatically. This lets us handle N-dimensional data while side-stepping the issue of MultiIndex support -- I don't think this would be very useful when limited to 1D arrays, and dask MultiIndex support seems to be a ways away (https://github.com/dask/dask/issues/1493). Also, set_index() in dask shuffles data, so it can be somewhat expensive.


Table schema:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
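
As a check on the schema, the row listing at the top of the page corresponds to a filter on author_association and issue. A self-contained sketch using Python's sqlite3, with the foreign-key clauses dropped and a single fabricated row (illustrative values only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER
);
""")
conn.execute(
    "INSERT INTO issue_comments (id, user, author_association, issue, updated_at, body)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    (340125534, 1217238, "MEMBER", 245624267, "2017-10-28T00:21:48Z", "..."),
)

# The query behind the page's row listing.
rows = conn.execute(
    "SELECT id FROM issue_comments"
    " WHERE author_association = 'MEMBER' AND issue = 245624267"
    " ORDER BY updated_at DESC"
).fetchall()
print(rows)
```
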
Powered by Datasette · About: xarray-datasette