issue_comments

3 rows where issue = 345715825 and user = 1197350 sorted by updated_at descending

id: 408928221 · user: rabernat (1197350) · author_association: MEMBER · created_at: 2018-07-30T16:37:05Z · updated_at: 2018-07-30T16:37:23Z
html_url: https://github.com/pydata/xarray/issues/2329#issuecomment-408928221
issue_url: https://api.github.com/repos/pydata/xarray/issues/2329
node_id: MDEyOklzc3VlQ29tbWVudDQwODkyODIyMQ==

Can you forget about zarr for a moment and just do a reduction on your dataset? For example:

```python
ds.sum().load()
```

Keep the same chunk arguments you are currently using. This will help us understand if the problem is with reading the files.
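
A minimal sketch of that read-only test (the file path and chunk sizes here are placeholders, not the original poster's values):

```python
import xarray as xr

# Placeholder path and chunk sizes -- keep whatever chunking is already in use.
ds = xr.open_dataset("big_file.nc", chunks={"time": 100})

# Reduce everything and force computation. If this alone is slow, the
# bottleneck is reading the netCDF data, not writing to zarr.
ds.sum().load()
```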

Is it your intention to chunk the files contiguously in time? Depending on the underlying structure of the data within the netCDF file, this could amount to a complete transposition of the data, which could be very slow / expensive. This could have some parallels with #2004.

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Out-of-core processing with dask not working properly? (345715825)
id: 408925488 · user: rabernat (1197350) · author_association: MEMBER · created_at: 2018-07-30T16:28:31Z · updated_at: 2018-07-30T16:28:31Z
html_url: https://github.com/pydata/xarray/issues/2329#issuecomment-408925488
issue_url: https://api.github.com/repos/pydata/xarray/issues/2329
node_id: MDEyOklzc3VlQ29tbWVudDQwODkyNTQ4OA==

> I was somehow expecting that each worker will read a chunk and then write it to zarr, streamlined.

Yes, this is what we want!

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Out-of-core processing with dask not working properly? (345715825)
id: 408860643 · user: rabernat (1197350) · author_association: MEMBER · created_at: 2018-07-30T13:20:59Z · updated_at: 2018-07-30T13:20:59Z
html_url: https://github.com/pydata/xarray/issues/2329#issuecomment-408860643
issue_url: https://api.github.com/repos/pydata/xarray/issues/2329
node_id: MDEyOklzc3VlQ29tbWVudDQwODg2MDY0Mw==

@lrntct - this sounds like a reasonable way to use zarr. We routinely do this sort of transcoding and it works reasonably well. Unfortunately something clearly isn't working right in your case. These things can be hard to debug, but we will try to help you.

You might want to start by reviewing the guide I wrote for Pangeo on preparing zarr datasets.

It would also be good to see a bit more detail. You posted a function netcdf2zarr that converts a single netcdf file to a single zarr file. How are you invoking that function? Are you trying to create one zarr store for each netCDF file? How many netCDF files are there? If there are many (e.g. one per timestep), my recommendation is to create only one zarr store for the whole dataset. Open the netcdf files using open_mfdataset.
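
A minimal sketch of that single-store approach (the glob pattern, chunk sizes, and output path are hypothetical):

```python
import xarray as xr

# Open all netCDF files as one combined dataset; hypothetical glob and chunks.
ds = xr.open_mfdataset("data/*.nc", chunks={"time": 100})

# Write the whole dataset to a single zarr store.
ds.to_zarr("dataset.zarr")
```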

If instead you have just one big netCDF file as in the example you posted above, I think I see your problem: you are calling .chunk() after calling open_dataset(), rather than calling open_dataset(nc_path, chunks=chunks). This probably means that you are loading the whole dataset in a single task and then re-chunking. That could be the source of the inefficiency.
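
A sketch of the difference, using a hypothetical nc_path and chunks dict:

```python
import xarray as xr

nc_path = "big_file.nc"    # hypothetical path
chunks = {"time": 100}     # hypothetical chunk sizes

# Probably the problem: the whole dataset is read in a single task, then re-chunked.
ds_slow = xr.open_dataset(nc_path).chunk(chunks)

# Preferred: chunking applied at open time, so each chunk maps to its own read task.
ds_fast = xr.open_dataset(nc_path, chunks=chunks)
```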

More ideas:
- explicitly specify the chunks (rather than using 'auto')
- eliminate the negative number in your chunk sizes
- make sure you really need clevel=9
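
A hedged sketch of those three suggestions together (dimension names, chunk sizes, file paths, and compressor settings are illustrative only):

```python
import xarray as xr
import zarr

# Explicit, positive chunk sizes for every dimension -- no 'auto', no -1.
chunks = {"time": 100, "lat": 180, "lon": 360}

ds = xr.open_dataset("big_file.nc", chunks=chunks)  # hypothetical path

# A lighter compression level than clevel=9; heavy compression can dominate write time.
encoding = {var: {"compressor": zarr.Blosc(cname="zstd", clevel=3)}
            for var in ds.data_vars}

ds.to_zarr("dataset.zarr", encoding=encoding)
```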

Another useful piece of advice would be to use the dask distributed dashboard to monitor what is happening under the hood. You can do this by running

```python
from dask.distributed import Client
client = Client()
client
```

In a notebook, this should provide you a link to the scheduler dashboard. Once you call ds.to_zarr(), watch the task stream in the dashboard to see what is happening.

Hopefully these ideas can help you move forward.

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: Out-of-core processing with dask not working properly? (345715825)

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);