
issues: 560860376


id: 560860376
node_id: MDU6SXNzdWU1NjA4NjAzNzY=
number: 3755
title: Performance problem when doing computation between two arrays with discontinuous indexes
user: 33070178
state: open
locked: 0
assignee: (empty)
milestone: (empty)
comments: 0
created_at: 2020-02-06T08:44:40Z
updated_at: 2020-12-03T18:16:02Z
closed_at: (empty)
author_association: NONE
active_lock_reason: (empty)
draft: (empty)
pull_request: (empty)
body:

MCVE Code Sample

```python
import xarray as xr
import numpy as np

# Creating array
ds = xr.Dataset()
ds["longitude"] = np.arange(0.1, 2000.1, 1)
ds["latitude"] = range(3000)
ds["step"] = range(50)
ds["field"] = (("step", "latitude", "longitude"), np.random.randn(50, 3000, 2000))
ds.to_netcdf("big_array.nc")

# Create another array
ds = xr.Dataset()
ds["longitude"] = np.arange(500.1, 600.1, 1)  # Coordinate are a continuous subset of the first array
ds["latitude"] = np.arange(510, 660)
ds["id"] = range(50)
ds["field"] = (("longitude", "latitude", "id"), np.random.randn(100, 150, 50))
ds.to_netcdf("slicing.nc")

# Create another array with "discontinuity" in longitude dimension
ds = xr.Dataset()
ds["longitude"] = list(np.arange(500.1, 598.1, 1)) + [622.1, 640.1]
ds["latitude"] = range(510, 660)
ds["id"] = range(10)
ds["mask"] = (("longitude", "latitude", "id"), np.random.randn(100, 150, 10))
ds.to_netcdf("no_slicing.nc")

# Load the Three arrays
da = xr.open_dataset("big_array.nc")
db = xr.open_dataset("slicing.nc").isel(id=0)
dc = xr.open_dataarray("no_slicing.nc").isel(id=0)

%timeit da*db
# 32.3 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit da*dc
# 2.13 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


def slicing_operation(dc, da):
    """
    Slicing when knowing that dc is a subpart of da
    """
    import operator
    min_lat = np.max([dc.latitude.values.min(), da.latitude.values.min()])
    max_lat = np.min([dc.latitude.values.max(), da.latitude.values.max()])
    index_lat_field = operator.and_(da.latitude >= min_lat, da.latitude <= max_lat)
    min_lon = np.max([dc.longitude.values.min(), da.longitude.values.min()])
    max_lon = np.min([dc.longitude.values.max(), da.longitude.values.max()])
    index_lon_field = operator.and_(da.longitude >= min_lon, da.longitude <= max_lon)

    # Extending dc such that it covers all longitude of da
    # Creating an empty array.
    dempty = xr.Dataset()
    dempty["longitude"] = da.longitude.isel(longitude=index_lon_field).values
    dempty["latitude"] = da.latitude.isel(latitude=index_lat_field).values
    dc_ext = dc.broadcast_like(dempty)
    # Performing multiplication and align
    res, _ = xr.align((dc_ext * da), dc)
    return res


%timeit slicing_operation(dc,da)
# 43.9 ms ± 1.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Checking if we get the same results
res_0 = dc * da
res = slicing_operation(dc, da)
print(abs(res_0 - res).max())
```

Problem Description

A performance problem occurs when performing an operation between da and dc. Computing da * db takes ~32 ms. Although db and dc have the same shape, the operation dc * da takes ~2.13 s (roughly 60 times more CPU time). This seems linked to the fact that dc includes discontinuous indexes of da.
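As a rough way to confirm that one coordinate is a contiguous run of another while a second one is not, the positions of the smaller coordinate's labels inside the larger one can be compared. The sketch below is only a diagnostic on the coordinate values; it says nothing about how xarray aligns arrays internally. The helper names `positions_in` and `is_contiguous` are hypothetical, and integer labels are used here (instead of the 0.1-offset floats of the report) to keep exact label matching unambiguous.

```python
import numpy as np


def positions_in(big_labels, small_labels):
    """Return the integer position of each small label inside big_labels.

    Hypothetical helper: assumes big_labels is sorted ascending and that
    every small label is present in it (raises KeyError otherwise).
    """
    big_labels = np.asarray(big_labels)
    small_labels = np.asarray(small_labels)
    pos = np.searchsorted(big_labels, small_labels)
    # Clip so that labels beyond the last entry do not raise IndexError
    # before the presence check below.
    clipped = np.clip(pos, 0, big_labels.size - 1)
    if not np.array_equal(big_labels[clipped], small_labels):
        raise KeyError("some labels are missing from big_labels")
    return clipped


def is_contiguous(positions):
    """True when the positions form one unbroken run (step of exactly 1)."""
    return bool(np.all(np.diff(positions) == 1))


# Integer stand-ins for the longitude coordinates of da, db and dc.
big_lon = np.arange(2000)
contiguous_lon = np.arange(500, 600)
gapped_lon = np.array(list(range(500, 598)) + [622, 640])

contiguous_pos = positions_in(big_lon, contiguous_lon)
gapped_pos = positions_in(big_lon, gapped_lon)

print(is_contiguous(contiguous_pos))
print(is_contiguous(gapped_pos))
```

A contiguous run of positions can be expressed as a single slice, whereas a gapped set needs an explicit list of positions, which is the distinction the report suspects is behind the timing difference.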

Indeed, when extending dc (so that its longitudes are contiguous with respect to da), the computation takes ~44 ms. It seems to me that when doing da * dc, the da dataset is first fully loaded into memory, whereas only the part involved in the computation is loaded when doing da * db. Indeed, running %time da.load() gives the following result:
```
CPU times: user 0 ns, sys: 1.88 s, total: 1.88 s
Wall time: 1.89 s
```
On top of that, da * db scales with the number of id values selected, while this is not the case for da * dc.
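A shorter variant of the same workaround idea is to crop the large array to the bounding box of the small one with label-based slices, then let the default inner-join alignment drop the labels that fall in the gap. This is a minimal sketch on small in-memory arrays, not a measurement of the on-disk case. The helper name `multiply_on_bounding_box` is hypothetical, and the approach assumes the large array's coordinates are sorted in ascending order.

```python
import numpy as np
import xarray as xr


def multiply_on_bounding_box(big, small, dims=("latitude", "longitude")):
    """Crop `big` to the coordinate range spanned by `small`, then multiply.

    Hypothetical helper: assumes the coordinates of `big` are sorted
    ascending along every dimension listed in `dims`.
    """
    window = {
        dim: slice(small[dim].values.min(), small[dim].values.max())
        for dim in dims
    }
    # Label-based slices include both endpoints and select one contiguous block.
    cropped = big.sel(window)
    # Arithmetic aligns with an inner join by default, so labels of `cropped`
    # that are absent from `small` are dropped from the result.
    return cropped * small


# Small in-memory stand-ins for the arrays in the report.
big = xr.DataArray(
    np.arange(20.0).reshape(4, 5),
    coords={"latitude": [0, 1, 2, 3], "longitude": [0.1, 1.1, 2.1, 3.1, 4.1]},
    dims=("latitude", "longitude"),
)
small = xr.DataArray(
    np.ones((2, 2)),
    coords={"latitude": [1, 2], "longitude": [1.1, 3.1]},  # gap at 2.1
    dims=("latitude", "longitude"),
)

result = multiply_on_bounding_box(big, small)
direct = big * small
print(result.equals(direct))
```

Whether this recovers the fast path on file-backed data would still need to be timed against the original files.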

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:33:48) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-957.27.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.3

xarray: 0.14.1
pandas: 0.25.3
numpy: 1.17.4
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: 0.9.7.6
iris: None
bottleneck: 1.3.1
dask: 2.9.1
distributed: 2.9.1
matplotlib: 3.1.2
cartopy: None
seaborn: 0.9.0
numbagg: None
setuptools: 44.0.0.post20200106
pip: 19.3.1
conda: None
pytest: 5.3.2
IPython: 7.11.1
sphinx: 2.3.1
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/3755/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
repo: 13221727
type: issue
