issues: 196541604

id: 196541604
node_id: MDU6SXNzdWUxOTY1NDE2MDQ=
number: 1173
title: Some queries
user: 7300413
state: closed
locked: 0
comments: 11
created_at: 2016-12-19T22:53:32Z
updated_at: 2019-01-13T06:27:38Z
closed_at: 2019-01-13T06:00:22Z
author_association: NONE

body:

Hello @shoyer @pwolfram @mrocklin @rabernat,

I was trying to write a design/requirements doc with reference to the Columbia meetup, and I had a few queries on which I wanted your input (basically to ask whether they make sense or not!):

  1. If you serialize a labeled n-d data array using netCDF or HDF5, it gets written into a single file, which is not really a good option if you want to eventually do distributed processing of the data. Things like HDFS/Lustre can split files, but that is not really what we want. How do you think this issue could be solved within the xarray+dask framework?
  2. Is it a matter of adding some code to the dataset.to_netcdf() method, or of adding a new method that would split the DataArray (based on some user-supplied guidelines) into multiple files?
  3. Or does it make more sense to add a new serialization format like Zarr? (A sketch of what options 2 and 3 might look like follows this list.)
  4. Continuing along similar lines, how does xarray+dask currently decide how to distribute the workload between dask workers? Are there any heuristics to handle data locality? Or does experience say that network I/O is fast enough that this is not an issue? I'm asking because of this article by Matt: http://blaze.pydata.org/blog/2015/10/28/distributed-hdfs/
  5. If this is desirable, how would one go about implementing it?
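
To make questions 2 and 3 concrete, here is a minimal sketch of what the two write paths could look like with the current xarray API; it is an illustration, not a statement of how the feature should be implemented. The variable and dimension names ("temp", "time", "x", "y"), the chunk size, and the output paths are invented for this example. xarray.save_mfdataset is an existing helper that writes a list of datasets to multiple netCDF files, and Dataset.to_zarr (added to xarray after this issue was opened; it requires the zarr package) writes a chunked Zarr store.

import numpy as np
import xarray as xr

# A small labelled 3-d dataset, chunked along "time" so dask can operate on
# it lazily. Names and sizes here are purely illustrative.
ds = xr.Dataset(
    {"temp": (("time", "x", "y"), np.random.rand(100, 50, 50))},
    coords={"time": np.arange(100)},
).chunk({"time": 10})

# Question 2: split the dataset along "time" and write one netCDF file per
# slice with the existing xarray.save_mfdataset helper.
slices = [ds.isel(time=slice(i, i + 10)) for i in range(0, ds.sizes["time"], 10)]
paths = [f"part-{i:03d}.nc" for i in range(len(slices))]
xr.save_mfdataset(slices, paths)

# The pieces can later be reopened lazily as one dask-backed dataset.
reopened = xr.open_mfdataset("part-*.nc", combine="by_coords")

# Question 3: write the same dataset to a chunked Zarr store instead of
# many netCDF files (requires the zarr package).
ds.to_zarr("example.zarr", mode="w")

Either layout keeps the on-disk chunks aligned with the dask chunks, which is what lets separate workers read separate pieces in parallel.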
reactions:
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1173/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed
repo: 13221727
type: issue

Links from other tables

  • 2 rows from issues_id in issues_labels
  • 11 rows from issue in issue_comments