issues: 1322491028
id | node_id | number | title | user | state | locked | assignee | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | draft | pull_request | body | reactions | performed_via_github_app | state_reason | repo | type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1322491028 | I_kwDOAMm_X85O05yU | 6850 | Slow lazy performance on cloud data | 31974425 | closed | 0 | 3 | 2022-07-29T17:05:31Z | 2022-09-12T18:39:05Z | 2022-09-12T18:39:04Z | NONE | Hi, I am not sure if this is the place to raise my issue, but I'd appreciate any help! I am trying to do a more complicated calculation with CESM cloud data (on the pangeo cloud deployment) and am running into an issue with a simpler calculation that is part of the workflow. When taking the derivative, the cell takes a very long time to run at the differencing step, even though this step is lazy and is not computing anything. It should run quickly, but as you can see from the screenshot, the cell takes a long time to run. It shows the runtime is ~20s, but the wall time is much longer (~2min). This becomes a serious issue when trying to take the derivative of multiple variables as part of a larger workflow. @jbusecke and I replicated the differencing problem on a randomized dask dataset and, as you can see, that cell runs much faster. Below I have pasted reproducible code that isolates the problem. I am not sure how to proceed on fixing this slow performance and would appreciate your help, thanks!
```
import xarray as xr
import numpy as np
import dask.array as dsa
import pop_tools
from xgcm import Grid
import xgcm
from intake import open_catalog

# Dask sample dataset
test_values = dsa.random.random((14695, 2400, 3600), chunks=(1, 2400, 3600))
da_sample = xr.DataArray(test_values, dims=['time', 'x', 'y'])
da_sample_u = xr.DataArray(test_values, dims=['time', 'x_u', 'y_u'])
ds_sample = xr.Dataset(data_vars=dict(test_values=da_sample, u=da_sample_u))
%timeit ds_sample.pad({'nlon':(2,2)}).diff('nlon')

# Original dataset
url = "https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean/CESM_POP.yaml"
cat = open_catalog(url)
ds = cat["CESM_POP_hires_control"].to_dask()
ds = ds.drop([d for d in ds.dims if d in ds.coords])
%timeit ds.pad({'nlon':(2,2)}).diff('nlon')
```
|
{ "url": "https://api.github.com/repos/pydata/xarray/issues/6850/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
completed | 13221727 | issue |