Chunked processing across multiple raster (geoTIF) files

xarray issue #2314 · opened 2018-07-25 by user 2489879 · closed as completed 2023-03-29 · 11 comments

Looking for guidance on best practices, or on forcing dask/xarray to work the way I want it to :)

Task: I have hundreds of geoTIF files (500 for this test case), each containing a single 2-d x-y array (~10000 x 10000). I need to read these all in and apply, at each x-y point, a function whose argument is the 1-d array of values from that same x-y point across all the files. The function is an optimization fit at each point, so it's not easily vectorizable, but each point's calculation is independent.
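For concreteness, here is a minimal sketch of what such a per-pixel fit might look like. The exponential-decay model, the implied time axis, and the parameter choices are stand-ins assumed for illustration; the only contract that matters for the code below is the shape: take a (band, y, x) block, return a (y, x) array.

```python
import numpy as np
from scipy.optimize import curve_fit

def _decay(t, a, k):
    # Hypothetical stand-in model: exponential decay.
    return a * np.exp(-k * t)

def myfunction(block):
    """Fit a model independently at every (y, x) pixel of a (band, y, x) block.

    Returns a 2-d (y, x) array, which matches drop_axis=0 in map_blocks below.
    """
    nbands, ny, nx = block.shape
    t = np.arange(nbands, dtype=np.float64)  # one sample per input file
    out = np.full((ny, nx), np.nan, dtype=np.float32)
    for j in range(ny):
        for i in range(nx):
            series = block[:, j, i]  # 1-d series for this pixel across all files
            try:
                popt, _ = curve_fit(_decay, t, series, p0=(series[0], 0.1))
                out[j, i] = popt[1]  # keep one fitted parameter per pixel
            except RuntimeError:
                pass  # leave NaN where the fit does not converge
    return out
```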

Because each point may be treated independently, the idea is to be able to split the dataset into chunks in the x-y directions and allow Dask to cycle through and fork out the tasks of applying the function to each chunk in turn. Here's how I'm currently trying to implement this:

```python
import xarray as xr
import numpy as np

# Get list of files, set chunksize
filelist = ['image_001.tif', 'image_002.tif', ... 'image_500.tif']
chunksize = 300  # Have explored a range of chunksizes from 250-1000

# Open rasters individually in the absence of an open_mfdataset equivalent
# for rasterio. Alternatively tried chunking or not in the x and y dimensions:
# chunking at this stage blows up the task graph and eats up memory quickly,
# but see expected behavior below for the reasoning behind trying to chunk here.
rasterlist = [xr.open_rasterio(x, chunks={'x': chunksize, 'y': chunksize}) for x in filelist]
# rasterlist = [xr.open_rasterio(x, chunks={'x': None, 'y': None}) for x in filelist]

# Concatenate along the band dimension
merged = xr.concat(rasterlist, dim='band')

# Rechunk
merged = merged.chunk({'band': None, 'x': chunksize, 'y': chunksize})

# Use dask map_blocks to apply a function to each chunk; the compute() call
# triggers execution.
output = merged.data.map_blocks(myfunction, dtype=np.float32, drop_axis=0)
output = output.compute()
```

Problem description

What happens now is that all the raster datasets appear to be read into memory in full, immediately, and without regard to the final chunk alignment needed for the calculation. Then there is a lot of memory usage and communication overhead as the chunks get realigned in the concatenation and merge-rechunk operations before myfunction is applied. This blows up the memory requirements, exceeding 300 GB of RAM on a 48-worker machine (the dataset on disk is ~100 GB). [Screenshot: Dask dashboard right before workers start cutting out with memory-limit / can't-dump-to-disk errors.]
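One way to watch this blow-up before triggering the compute is to inspect the chunk and task counts on the dask array (a quick check, assuming the collection protocol available since dask 0.18):

```python
import numpy as np

print(merged.data.numblocks)                # chunk grid along (band, y, x)
print(int(np.prod(merged.data.numblocks)))  # total number of chunks
print(len(merged.data.__dask_graph__()))    # total tasks in the graph
```

Because the concat produces 500 single-band chunks that the rechunk then has to stitch together, the task count scales roughly with files × spatial chunks, which is consistent with the overhead described above.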

Expected Output

What I'm looking for in this is an out-of-core computation where a single worker stays with a single chunk of the data all the way through to the end. Each "task" would then be a sequence of:

- Open each of the raster input files and extract only the spatial chunk we were assigned for this task
- Concatenate these for the computation
- Perform the computation on this chunk
- Return the output array for this chunk to be aggregated back together by Dask
- Release data from memory; move on to the next spatial chunk

I could manually sort this all out with a bunch of dask.delayed tasks, keeping track of the indices/locations of each chunk as they are returned so they can be "reassembled" later (a sketch of that approach follows below). However, is there an easier way to express this operation directly through xarray/dask via some ordering of commands I haven't found yet? Does the netCDF open_mfdataset function handle this sort of pattern for multiple netCDF files in a way that could be adapted for rasterio?
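For reference, a minimal sketch of that manual dask.delayed route, reusing the names from the setup above (filelist, chunksize, myfunction); the windowed rasterio reads and the tiling bookkeeping are my assumptions, not code from the issue:

```python
import dask
import dask.array as da
import numpy as np
import rasterio
from rasterio.windows import Window

def _read_window(path, window):
    # Read band 1 of one file, restricted to a single spatial window.
    with rasterio.open(path) as src:
        return src.read(1, window=window)

@dask.delayed
def process_chunk(paths, window):
    # Stack this window from every file into a (band, y, x) block and fit it.
    stack = np.stack([_read_window(p, window) for p in paths])
    return myfunction(stack)

with rasterio.open(filelist[0]) as src:
    nrows, ncols = src.height, src.width

# One delayed task per spatial tile; each task reads, fits, and releases
# only its own chunk. da.block reassembles the 2-d results by position.
blocks = []
for r in range(0, nrows, chunksize):
    row = []
    for c in range(0, ncols, chunksize):
        h, w = min(chunksize, nrows - r), min(chunksize, ncols - c)
        tile = process_chunk(filelist, Window(c, r, w, h))
        row.append(da.from_delayed(tile, shape=(h, w), dtype=np.float32))
    blocks.append(row)

output = da.block(blocks).compute()
```

Because each task opens the files itself and reads only its own window, the peak memory per worker is one (nbands, chunksize, chunksize) block plus its output, independent of the total dataset size.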

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.87-linuxkit-aufs
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-
LOCALE: None.None

xarray: 0.10.7
pandas: 0.23.2
numpy: 1.14.5
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: 2.2.0
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.1
distributed: 1.22.0
matplotlib: 2.2.2
cartopy: None
seaborn: None
setuptools: 36.5.0.post20170921
pip: 9.0.1
conda: 4.5.5
pytest: None
IPython: 6.4.0
sphinx: None