issue_comments

5 rows where author_association = "MEMBER" and issue = 344621749 sorted by updated_at descending


id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1488891109 https://github.com/pydata/xarray/issues/2314#issuecomment-1488891109 https://api.github.com/repos/pydata/xarray/issues/2314 IC_kwDOAMm_X85Yvqzl dcherian 2448579 2023-03-29T16:01:05Z 2023-03-29T16:01:05Z MEMBER

We've deleted the internal rasterio backend in favor of rioxarray. If this issue is still relevant, please migrate the discussion to the rioxarray repo

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Chunked processing across multiple raster (geoTIF) files 344621749
417413527 https://github.com/pydata/xarray/issues/2314#issuecomment-417413527 https://api.github.com/repos/pydata/xarray/issues/2314 MDEyOklzc3VlQ29tbWVudDQxNzQxMzUyNw== shoyer 1217238 2018-08-30T18:04:29Z 2018-08-30T18:04:29Z MEMBER

I see now that you are using dask-distributed, but I guess there are still too many intermediate outputs here to do a single rechunk operation.

The crude but effective way to solve this problem would be to loop over spatial tiles, using an indexing operation to pull out only a limited extent, run the calculation on each tile, and then reassemble the tiles at the end. To see if this will work, you might try computing a single time-series on your merged dataset before calling .chunk(), e.g., merged.isel(x=0, y=0).compute().
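The tile loop described above could be sketched roughly as follows. This is only an illustration using the standard library: `iter_tiles`, the tile size, and `my_function` are hypothetical names that do not appear in the thread, and the xarray calls are shown as comments because they depend on the `merged` dataset from the original question.

```python
from itertools import product


def iter_tiles(nx, ny, tile_size):
    """Yield dicts of slices that cover an nx-by-ny grid in square tiles.

    Edge tiles are clipped so that every pixel is covered exactly once.
    """
    for x0, y0 in product(range(0, nx, tile_size), range(0, ny, tile_size)):
        yield {
            "x": slice(x0, min(x0 + tile_size, nx)),
            "y": slice(y0, min(y0 + tile_size, ny)),
        }


# Hypothetical usage with the `merged` dataset from the thread
# (my_function and the tile size of 512 are placeholders):
#
#   results = [
#       my_function(merged.isel(**tile).compute())
#       for tile in iter_tiles(merged.sizes["x"], merged.sizes["y"], 512)
#   ]
#
# The per-tile results would then be stitched back together,
# for example with xr.combine_by_coords(results).

tiles = list(iter_tiles(10, 7, 4))  # small grid, just to show the tiling
print(len(tiles))
```

Each tile only touches a limited spatial extent, so peak memory is bounded by the tile size times the number of time steps rather than by the full array.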

In theory, I think using chunks in open_rasterio should achieve exactly what you want here (assuming the geotiffs are tiled), but as you note it makes for a giant task graph. To balance this tradeoff, I might try picking a very large initial chunksize, e.g., xr.open_rasterio(x, chunks={'x': 3500, 'y': 3500}). This would effectively split the "rechunk" operation into 9 entirely independent parts.
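The figure of 9 independent parts follows from the chunk arithmetic. A small sketch, assuming a raster of roughly 10,000 × 10,000 pixels (the actual image size is not stated in this comment, so that shape is an assumption, and `n_chunks` is a hypothetical helper):

```python
import math


def n_chunks(shape, chunks):
    """Number of chunks along each dimension for a given chunk size."""
    return {dim: math.ceil(shape[dim] / chunks[dim]) for dim in chunks}


raster_shape = {"x": 10000, "y": 10000}  # assumed size, not from the thread
chunks = {"x": 3500, "y": 3500}          # chunk size suggested in the comment

per_dim = n_chunks(raster_shape, chunks)
total = math.prod(per_dim.values())
print(per_dim, total)
```

Any raster between 7,001 and 10,500 pixels on each side gives 3 chunks per dimension, hence 3 × 3 = 9 spatial blocks that can be rechunked independently.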

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Chunked processing across multiple raster (geoTIF) files 344621749
417412405 https://github.com/pydata/xarray/issues/2314#issuecomment-417412405 https://api.github.com/repos/pydata/xarray/issues/2314 MDEyOklzc3VlQ29tbWVudDQxNzQxMjQwNQ== scottyhq 3924836 2018-08-30T18:01:02Z 2018-08-30T18:01:02Z MEMBER

As @darothen mentioned, the first thing is to check that the geotiffs themselves are tiled (otherwise I'm guessing that open_rasterio() will open the entire thing). You can do this with:

```python
import rasterio

with rasterio.open('image_001.tif') as src:
    print(src.profile)
```
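To read the printed profile, the entries of interest are `tiled`, `blockxsize`, and `blockysize`. The helper below is a hypothetical sketch that works on any dict-like profile; the two example dicts are made up for illustration and are not output from the files in this issue.

```python
def describe_tiling(profile):
    """Summarize the block layout from a rasterio-style profile mapping."""
    if profile.get("tiled", False):
        return "tiled, blocks of {} x {} pixels".format(
            profile.get("blockysize"), profile.get("blockxsize")
        )
    return "not tiled (striped layout)"


# Made-up profiles, for illustration only
tiled_profile = {"driver": "GTiff", "tiled": True,
                 "blockxsize": 512, "blockysize": 512}
striped_profile = {"driver": "GTiff", "tiled": False,
                   "blockxsize": 7000, "blockysize": 1}

print(describe_tiling(tiled_profile))
print(describe_tiling(striped_profile))
```

With a real file, `describe_tiling(src.profile)` could be called inside the `with` block shown above.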

Here is the mentioned example notebook which works for tiled geotiffs stored on google cloud: https://github.com/scottyhq/pangeo-example-notebooks/tree/binderfy

You can use the 'launch binder' button to run it with a pangeo dask-kubernetes cluster, or just read through the landsat8-cog-ndvi.ipynb notebook.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Chunked processing across multiple raster (geoTIF) files 344621749
417404832 https://github.com/pydata/xarray/issues/2314#issuecomment-417404832 https://api.github.com/repos/pydata/xarray/issues/2314 MDEyOklzc3VlQ29tbWVudDQxNzQwNDgzMg== shoyer 1217238 2018-08-30T17:38:40Z 2018-08-30T17:42:00Z MEMBER

I think the explicit chunk() call is the source of your woes here. That creates a bunch of tasks to reshard your data that require loading the entire array into memory. If you're using dask-distributed, I think the large intermediate outputs would get cached to disk, but this fails if you're using the simpler multithreaded scheduler.

~~If you drop the line that calls .chunk() and manually index your array to pull out a single time-series before calling map_blocks, does that work properly? e.g., something like merged.isel(x=0, y=0).data.map_blocks(myfunction)~~ (never mind, this is probably not a great idea)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Chunked processing across multiple raster (geoTIF) files 344621749
417135276 https://github.com/pydata/xarray/issues/2314#issuecomment-417135276 https://api.github.com/repos/pydata/xarray/issues/2314 MDEyOklzc3VlQ29tbWVudDQxNzEzNTI3Ng== jhamman 2443309 2018-08-29T23:04:10Z 2018-08-29T23:04:10Z MEMBER

pinging @scottyhq and @darothen who have both been exploring similar use cases here. I think you all met at the recent pangeo meeting.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Chunked processing across multiple raster (geoTIF) files 344621749


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);