issue_comments

11 rows where issue = 344621749 sorted by updated_at descending
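
The rows shown here correspond to a query along these lines (a sketch of the SQL Datasette would run for this view):

```sql
select *
from issue_comments
where issue = 344621749
order by updated_at desc;
```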

user (9)

  • shoyer 2
  • darothen 2
  • jhamman 1
  • dcherian 1
  • lmadaus 1
  • gjoseph92 1
  • scottyhq 1
  • pblankenau2 1
  • shaprann 1

author_association (2)

  • NONE 6
  • MEMBER 5

issue (1)

  • Chunked processing across multiple raster (geoTIF) files · 11
Comments (newest first):

dcherian (MEMBER) · 2023-03-29T16:01:05Z
https://github.com/pydata/xarray/issues/2314#issuecomment-1488891109

We've deleted the internal rasterio backend in favor of rioxarray. If this issue is still relevant, please migrate the discussion to the rioxarray repo.
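
For context, the rioxarray equivalent of the old `xr.open_rasterio` call looks roughly like this (a minimal sketch; the filename and chunk sizes are placeholders):

```python
# Minimal sketch of the rioxarray replacement for the removed backend;
# "image_001.tif" and the chunk sizes are illustrative placeholders.
import rioxarray

da = rioxarray.open_rasterio("image_001.tif", chunks={"x": 256, "y": 256})
```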

gjoseph92 (NONE) · 2022-03-31T21:15:59Z
https://github.com/pydata/xarray/issues/2314#issuecomment-1085125053

Just noticed this issue; people needing to do this sort of thing might want to look at stackstac (especially playing with the `chunksize=` parameter) or odc-stac for loading the data. The graph will be cleaner than what you'd get from `xr.concat([xr.open_rasterio(...) for ...])`.

> still appears to "over-eagerly" load more than just what is being worked on

FYI, this is basically expected behavior for distributed; see:
  • https://github.com/dask/distributed/issues/5223
  • https://github.com/dask/distributed/issues/5555
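
A minimal sketch of the stackstac route (the STAC endpoint, collection, and search parameters are illustrative assumptions, not taken from this thread):

```python
# Illustrative sketch: build one lazy DataArray from STAC items with stackstac.
# The catalog URL, collection, bbox, and date range are placeholder assumptions.
import pystac_client
import stackstac

catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
items = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-105.3, 40.0, -105.1, 40.2],
    datetime="2022-06-01/2022-06-30",
).item_collection()

# One lazy (time, band, y, x) array; chunksize= controls the dask chunking
# and yields a much cleaner graph than concatenating per-file opens.
da = stackstac.stack(items, chunksize=2048)
```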

darothen (NONE) · 2020-07-30T14:52:50Z
https://github.com/pydata/xarray/issues/2314#issuecomment-666422864

Hi @shaprann, I haven't revisited this exact workflow recently, but one really good option (if you can manage the intermediate storage cost) would be to use new tools like http://github.com/pangeo-data/rechunker to pre-process and prepare your data archive prior to analysis.
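
A minimal sketch of that workflow, assuming the GeoTIFFs have already been consolidated into a single (time, y, x) Zarr array (all paths and chunk shapes are placeholders):

```python
# Sketch of the rechunker workflow; store paths and chunk shapes are
# placeholders, and the source is assumed to be a (time, y, x) Zarr array.
import zarr
from rechunker import rechunk

source = zarr.open("source.zarr")  # e.g. one chunk per input GeoTIFF time slice

plan = rechunk(
    source,
    target_chunks=(100, 256, 256),   # time-series-friendly (time, y, x) chunks
    max_mem="1GB",
    target_store="rechunked.zarr",
    temp_store="rechunk-temp.zarr",
)
plan.execute()  # executes on dask by default
```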

shaprann (NONE) · 2020-07-29T23:12:37Z
https://github.com/pydata/xarray/issues/2314#issuecomment-665976915

This particular use case is extremely common when working with spatio-temporal data. Can anyone suggest a good workaround for this?

pblankenau2 (NONE) · 2020-02-20T20:09:31Z
https://github.com/pydata/xarray/issues/2314#issuecomment-589284788

Has there been any progress on this issue? I am bumping into the same problem.

lmadaus (NONE) · 2018-08-30T18:22:31Z
https://github.com/pydata/xarray/issues/2314#issuecomment-417419321

Thanks for all the suggestions!

An update from when I originally posted this: in line with @shoyer's, @darothen's, and @scottyhq's comments, we've tested the code with both cloud-optimized geoTIFs and regular geoTIFs, and it does perform better with the cloud-optimized form, though it still appears to "over-eagerly" load more than just what is being worked on. With the cloud-optimized form, performance is much better when we specify a chunking strategy in the initial open_rasterio that aligns with the files' internal chunk sizes, e.g. `rasterlist = [xr.open_rasterio(x, chunks={'x': 256, 'y': 256}) for x in filelist]` vs. `rasterlist = [xr.open_rasterio(x, chunks={'x': None, 'y': None}) for x in filelist]`.

The result is a larger task graph (and much more time spent building the task graph) but fewer cases where we run into memory problems. There still appears to be a lot more memory used than I expect, but I am actively exploring options.

We've also noticed better performance using a k8s Dask cluster distributed across multiple "independent" workers as opposed to using a LocalCluster on a single large machine. As in, with the distributed cluster the "myfunction" (fit) operation starts happening on chunks well before the entire dataset is loaded, whereas in the LocalCluster it still tends not to begin until all chunks have been loaded in. Not exactly sure what would cause that...

I'm intrigued by @shoyer 's last suggestion of an "intermediate" chunking step. Will test that and potentially the manual iteration over the tiles. Thanks for all the suggestions and thoughts!

shoyer (MEMBER) · 2018-08-30T18:04:29Z
https://github.com/pydata/xarray/issues/2314#issuecomment-417413527

I see now that you are using dask-distributed, but I guess there are still too many intermediate outputs here to do a single rechunk operation.

The crude but effective way to solve this problem would be to loop over spatial tiles, using an indexing operation to pull out only a limited extent, compute the calculation on each tile, and then reassemble the tiles at the end. To see if this will work, you might try computing a single time-series on your merged dataset before calling `.chunk()`, e.g., `merged.isel(x=0, y=0).compute()`.

In theory, I think using `chunks` in `open_rasterio` should achieve exactly what you want here (assuming the geotiffs are tiled), but as you note it makes for a giant task graph. To balance this tradeoff, I might try picking a very large initial chunksize, e.g., `xr.open_rasterio(x, chunks={'x': 3500, 'y': 3500})`. This would effectively split the "rechunk" operation into 9 entirely independent parts.
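
A rough sketch of the tile-loop workaround described above (`filelist`, `myfunction`, and the tile size are stand-ins from the thread, not a tested recipe):

```python
# Rough sketch of looping over spatial tiles; `filelist` and `myfunction`
# are stand-ins from the thread, and the era-appropriate xr.open_rasterio
# API is used (this has since moved to rioxarray).
import xarray as xr

rasters = [xr.open_rasterio(f, chunks={'x': 3500, 'y': 3500}) for f in filelist]
merged = xr.concat(rasters, dim='time')

tile = 3500
results = []
for x0 in range(0, merged.sizes['x'], tile):
    for y0 in range(0, merged.sizes['y'], tile):
        # Pull out one spatial tile, compute it eagerly, and keep the result
        block = merged.isel(x=slice(x0, x0 + tile), y=slice(y0, y0 + tile))
        results.append(myfunction(block).compute())

# Reassemble the tiles at the end, e.g. with xr.combine_by_coords(results)
```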

scottyhq (MEMBER) · 2018-08-30T18:01:02Z
https://github.com/pydata/xarray/issues/2314#issuecomment-417412405

As @darothen mentioned, the first thing is to check that the geotiffs themselves are tiled (otherwise I'm guessing that open_rasterio() will read the entire thing). You can do this with:

```python
import rasterio

with rasterio.open('image_001.tif') as src:
    print(src.profile)
```
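
If the file is tiled, the printed profile should include entries along the lines of 'tiled': True with 'blockxsize' and 'blockysize'; an untiled file will show 'tiled': False.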

Here is the example notebook mentioned above, which works for tiled geotiffs stored on Google Cloud: https://github.com/scottyhq/pangeo-example-notebooks/tree/binderfy

You can use the 'launch binder' button to run it with a pangeo dask-kubernetes cluster, or just read through the landsat8-cog-ndvi.ipynb notebook.

shoyer (MEMBER) · 2018-08-30T17:38:40Z (edited 2018-08-30T17:42:00Z)
https://github.com/pydata/xarray/issues/2314#issuecomment-417404832

I think the explicit `chunk()` call is the source of your woes here. That creates a bunch of tasks to reshard your data that require loading the entire array into memory. If you're using dask-distributed, I think the large intermediate outputs would get cached to disk, but this fails if you're using the simpler multithreaded scheduler.

~~If you drop the line that calls `.chunk()` and manually index your array to pull out a single time-series before calling `map_blocks`, does that work properly? e.g., something like `merged.isel(x=0, y=0).data.map_blocks(myfunction)`~~ (nevermind, this is probably not a great idea)

darothen (NONE) · 2018-08-30T03:09:41Z
https://github.com/pydata/xarray/issues/2314#issuecomment-417175383

Can you provide a gdalinfo dump of one of the GeoTiffs? I'm still working on some documentation for use cases with cloud-optimized GeoTiffs to supplement @scottyhq's fantastic example notebook. One of the wrinkles I'm tracking down and trying to document is when exactly the GDAL->rasterio->dask->xarray pipeline eagerly loads the entire file versus when it defers reading or reads subsets of files. So far, it seems that if the GeoTiff is appropriately chunked ahead of time (when it's written to disk), things basically work "automagically."
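
(For reference, gdalinfo reports the internal layout on each band line: a tiled GeoTiff shows square blocks such as Block=512x512, whereas an untiled one shows full-width strips such as Block=7000x1.)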

jhamman (MEMBER) · 2018-08-29T23:04:10Z
https://github.com/pydata/xarray/issues/2314#issuecomment-417135276

Pinging @scottyhq and @darothen, who have both been exploring similar use cases here. I think you all met at the recent Pangeo meeting.


Table schema:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);