issues: 517799069

id: 517799069
node_id: MDU6SXNzdWU1MTc3OTkwNjk=
number: 3486
title: Should performance be equivalent when opening with chunks or re-chunking a dataset?
user: 7799184
state: open
locked: 0
comments: 2
created_at: 2019-11-05T14:14:58Z
updated_at: 2021-08-31T15:28:04Z
author_association: CONTRIBUTOR

I was wondering if the chunking behaviour would be expected to be equivalent under two different use cases:

(1) when opening a dataset using the chunks option; (2) when re-chunking an existing dataset using the Dataset.chunk method.

I'm interested in performance for slicing across different dimensions. In my case the performance is quite different between the two; please see the example below:

Open dataset with one single chunk along station dimension (fast for slicing one time)

```
In [1]: import xarray as xr

In [2]: dset = xr.open_dataset(
   ...:     "/source/wavespectra/tests/sample_files/spec20170101T00_spec.nc",
   ...:     chunks={"station": None}
   ...: )

In [3]: dset
Out[3]:
<xarray.Dataset>
Dimensions:    (direction: 24, frequency: 25, station: 14048, time: 249)
Coordinates:
  * time       (time) datetime64[ns] 2017-01-01 ... 2017-02-01
  * station    (station) float64 1.0 2.0 3.0 ... 1.405e+04 1.405e+04
  * frequency  (frequency) float32 0.04118 0.045298003 ... 0.40561208
  * direction  (direction) float32 90.0 75.0 60.0 45.0 ... 135.0 120.0 105.0
Data variables:
    longitude  (time, station) float32 dask.array<chunksize=(249, 14048), meta=np.ndarray>
    latitude   (time, station) float32 dask.array<chunksize=(249, 14048), meta=np.ndarray>
    efth       (time, station, frequency, direction) float32 dask.array<chunksize=(249, 14048, 25, 24), meta=np.ndarray>

In [4]: %time lats = dset.latitude.isel(time=0).values
CPU times: user 171 ms, sys: 49.2 ms, total: 220 ms
Wall time: 219 ms
```

Open dataset with many size=1 chunks along station dimension (fast for slicing one station, slow for slicing one time)

```
In [5]: dset = xr.open_dataset(
   ...:     "/source/wavespectra/tests/sample_files/spec20170101T00_spec.nc",
   ...:     chunks={"station": 1}
   ...: )

In [6]: dset
Out[6]:
<xarray.Dataset>
Dimensions:    (direction: 24, frequency: 25, station: 14048, time: 249)
Coordinates:
  * time       (time) datetime64[ns] 2017-01-01 ... 2017-02-01
  * station    (station) float64 1.0 2.0 3.0 ... 1.405e+04 1.405e+04
  * frequency  (frequency) float32 0.04118 0.045298003 ... 0.40561208
  * direction  (direction) float32 90.0 75.0 60.0 45.0 ... 135.0 120.0 105.0
Data variables:
    longitude  (time, station) float32 dask.array<chunksize=(249, 1), meta=np.ndarray>
    latitude   (time, station) float32 dask.array<chunksize=(249, 1), meta=np.ndarray>
    efth       (time, station, frequency, direction) float32 dask.array<chunksize=(249, 1, 25, 24), meta=np.ndarray>

In [7]: %time lats = dset.latitude.isel(time=0).values
CPU times: user 13.1 s, sys: 1.94 s, total: 15 s
Wall time: 11.1 s
```
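
To make the cost difference between the two layouts concrete, here is a rough sketch that counts how many chunks a given slice overlaps. It is a simplified model, not anything xarray or dask exposes: `chunks_touched` is a hypothetical helper, and it assumes the cost of a slice scales with the number of chunks that must be read.

```python
from math import ceil


def chunks_touched(shape, chunks, selection):
    """Count the chunks overlapped by a selection (hypothetical helper).

    shape, chunks: tuples of dimension lengths and chunk lengths.
    selection: tuple holding an int to pick one element along a
    dimension, or None to take the whole dimension.
    """
    total = 1
    for length, chunk, sel in zip(shape, chunks, selection):
        if sel is None:
            # Taking the whole dimension touches every chunk along it.
            total *= ceil(length / chunk)
        # An integer index falls inside exactly one chunk (factor of 1).
    return total


shape = (249, 14048)  # (time, station), as in the dataset above

# latitude.isel(time=0) with one single station chunk
print(chunks_touched(shape, (249, 14048), (0, None)))  # 1

# the same slice with size-1 station chunks
print(chunks_touched(shape, (249, 1), (0, None)))  # 14048

# latitude.isel(station=0) with size-1 station chunks
print(chunks_touched(shape, (249, 1), (None, 0)))  # 1
```

Under this model, slicing one time step from the size-1 layout has to visit every one of the 14048 station chunks, which is in line with the much longer wall time seen above.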

Try rechunking station into one single chunk (still slow to slice one time)

```
In [8]: dset = dset.chunk({"station": None})

In [8]: dset
Out[8]:
<xarray.Dataset>
Dimensions:    (direction: 24, frequency: 25, station: 14048, time: 249)
Coordinates:
  * time       (time) datetime64[ns] 2017-01-01 ... 2017-02-01
  * station    (station) float64 1.0 2.0 3.0 ... 1.405e+04 1.405e+04
  * frequency  (frequency) float32 0.04118 0.045298003 ... 0.40561208
  * direction  (direction) float32 90.0 75.0 60.0 45.0 ... 135.0 120.0 105.0
Data variables:
    longitude  (time, station) float32 dask.array<chunksize=(249, 14048), meta=np.ndarray>
    latitude   (time, station) float32 dask.array<chunksize=(249, 14048), meta=np.ndarray>
    efth       (time, station, frequency, direction) float32 dask.array<chunksize=(249, 14048, 25, 24), meta=np.ndarray>

In [9]: %time lats = dset.latitude.isel(time=0).values
CPU times: user 9.06 s, sys: 1.13 s, total: 10.2 s
Wall time: 7.7 s
```
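
One possible explanation, offered as an assumption rather than a confirmed diagnosis: re-chunking an already-chunked dataset is lazy, so the merged chunk is built on top of the original per-chunk read tasks instead of replacing them. The toy task graph below illustrates that idea in plain Python. The functions and task keys are hypothetical and do not reflect dask's real graph format.

```python
def load_tasks(n_station_chunks):
    """One hypothetical 'read from file' task per station chunk."""
    return {("load", i): () for i in range(n_station_chunks)}


def rechunk_to_single(graph):
    """Add one merge task that depends on every existing load task."""
    merged = dict(graph)
    merged[("merge", 0)] = tuple(key for key in graph if key[0] == "load")
    return merged


def tasks_needed(graph, target):
    """Collect the target and every task it transitively depends on."""
    needed, stack = set(), [target]
    while stack:
        key = stack.pop()
        if key not in needed:
            needed.add(key)
            stack.extend(graph[key])
    return needed


# Opened with chunks={"station": None}: one read task covers the slice.
opened_single = load_tasks(1)
print(len(tasks_needed(opened_single, ("load", 0))))  # 1

# Opened with chunks={"station": 1}, then re-chunked into one chunk:
# the merge task still pulls in all 14048 original read tasks.
rechunked = rechunk_to_single(load_tasks(14048))
print(len(tasks_needed(rechunked, ("merge", 0))))  # 14049
```

If this model holds, the reported chunksize after `Dataset.chunk` matches the first case, but the work needed to produce a slice stays close to the second case.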

reactions:
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3486/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: 13221727
type: issue
