pydata/xarray issue #1985: Load a small subset of data from a big dataset takes forever

  • State: closed (completed)
  • User: 22245117 · Author association: CONTRIBUTOR
  • Comments: 8
  • Created: 2018-03-13T04:27:58Z
  • Closed: 2019-01-13T01:46:08Z

Code Sample

```python
import xgcm


def cut_dataset(ds2cut,
                varList=['Temp', 'S', 'Eta', 'U', 'V', 'W'],
                lonRange=[-180, 180],
                latRange=[-90, 90],
                depthRange=[0, float("inf")],
                timeRange=['2007-09-01T00', '2008-08-31T18'],
                timeFreq='1D',
                sampMethod='mean',
                interpC=True,
                saveNetCDF=False):
    """Cut the dataset."""

    # Copy dataset
    ds = ds2cut.copy(deep=True)

    # Choose variables: keep the requested ones, plus 'time' and every
    # variable that has no 'time' dimension (coordinates, grid metrics)
    varList_tmp = list(varList)
    for varName in ds.variables:
        if all(x != 'time' for x in ds[varName].dims) or (varName == 'time'):
            varList_tmp.append(varName)
    toDrop = list(set(ds.variables) - set(varList_tmp))
    ds = ds.drop(toDrop)

    # Cut dataset: first along the staggered (outer) coordinates,
    # then along the cell-center coordinates they enclose
    ds = ds.sel(time=slice(min(timeRange),  max(timeRange)),
                Xp1=slice(min(lonRange),   max(lonRange)),
                Yp1=slice(min(latRange),   max(latRange)),
                Zp1=slice(min(depthRange), max(depthRange)))
    ds = ds.sel(X=slice(min(ds['Xp1'].values), max(ds['Xp1'].values)),
                Y=slice(min(ds['Yp1'].values), max(ds['Yp1'].values)),
                Z=slice(min(ds['Zp1'].values), max(ds['Zp1'].values)),
                Zu=slice(min(ds['Zp1'].values), max(ds['Zp1'].values)),
                Zl=slice(min(ds['Zp1'].values), max(ds['Zp1'].values)))

    # Resample in time
    if sampMethod == 'snapshot':
        ds = ds.resample(time=timeFreq).first(skipna=False)
    elif sampMethod == 'mean':
        ds = ds.resample(time=timeFreq).mean()

    # Create grid
    grid = xgcm.Grid(ds, periodic=False)

    # Interpolate staggered variables to cell centers
    # (the first letter of a dimension name, e.g. 'X' from 'Xp1', is the xgcm axis)
    if interpC:
        for varName in varList:
            for dim in ds[varName].dims:
                if len(dim) > 1 and dim != 'time':
                    ds[varName] = grid.interp(ds[varName], axis=dim[0])

    # Remove variables defined on dimensions no longer used by varList
    allDims = []
    for varName in varList:
        for dim in ds[varName].dims:
            allDims.append(dim)
    toDrop = []
    for varName in ds.variables:
        if len(list(set(ds[varName].dims) - set(allDims))) > 0:
            toDrop.append(varName)
    ds = ds.drop(toDrop)

    # Save to NetCDF
    if saveNetCDF:
        ds.to_netcdf(saveNetCDF)

    return ds, grid
```

3D test

```python
ds_cut, grid_cut = cut_dataset(ds, varList=['Eta'],
                               latRange=[65, 65.5], depthRange=[0, 2],
                               timeRange=['2007-11-15T00', '2007-11-16T00'],
                               timeFreq='1D', sampMethod='mean',
                               interpC=False, saveNetCDF='3Dvariable.nc')
```

4D test

```python
ds_cut, grid_cut = cut_dataset(ds, varList=['Temp'],
                               lonRange=[-30, -29.5], latRange=[65, 65.5],
                               depthRange=[0, 2],
                               timeRange=['2007-11-15T00', '2007-11-16T00'],
                               timeFreq='1D', sampMethod='mean',
                               interpC=False, saveNetCDF='4Dvariable.nc')
```
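To make the 3D-vs-4D comparison concrete, the two calls can be timed directly. This is a minimal sketch (the timing harness and labels are mine, not part of the original report), reusing the same `ds` as above:

```python
import time

# Hypothetical timing comparison of the two cases reported above.
cases = [
    ('3D (Eta)',  dict(varList=['Eta'], saveNetCDF='3Dvariable.nc')),
    ('4D (Temp)', dict(varList=['Temp'], lonRange=[-30, -29.5],
                       saveNetCDF='4Dvariable.nc')),
]
for label, kwargs in cases:
    start = time.time()
    cut_dataset(ds, latRange=[65, 65.5], depthRange=[0, 2],
                timeRange=['2007-11-15T00', '2007-11-16T00'],
                timeFreq='1D', sampMethod='mean', interpC=False, **kwargs)
    print('{}: {:.1f} s'.format(label, time.time() - start))
```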

Problem description

I'm working with a big dataset, but most of the time I only need a small subset of it. My idea was to open and concatenate everything with open_mfdataset, and then extract subsets using the indexing routines. This approach works very well when I extract 3D variables (just lon, lat, and time), but it breaks down when I try to extract 4D variables (lon, lat, time, and depth): nothing actually fails, but to_netcdf takes forever. When I open a smaller dataset from the very beginning (say, just November), I am able to extract 4D variables as well. When I load the sub-dataset after using the indexing routines, does xarray need to read the whole original 4D variable? If yes, I should probably change my approach and open only a subset of the data from the very beginning. If no, am I doing something wrong?
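For context, one way to see whether a selection is still lazy, and what dask would actually have to read before writing, is to inspect the chunks of the subset. This is a minimal sketch, not from the original report: the file glob `output_*.nc` and the output name `subset.nc` are hypothetical, and the variable/dimension names (`Temp`, `X`, `Y`) are borrowed from the snippet above:

```python
import xarray as xr

# Open lazily and concatenate; data stays on disk as dask chunks.
ds = xr.open_mfdataset('output_*.nc')

# Label-based selection is lazy too: nothing is read yet.
subset = ds['Temp'].sel(X=slice(-30, -29.5), Y=slice(65, 65.5))

# The chunks show what dask must read to materialize the subset. If each
# chunk spans a whole file, writing even a tiny subset forces reading
# far more data than the subset itself contains.
print(subset.data.chunks)

# Loading the (small) subset into memory first keeps the write cheap.
subset.load().to_netcdf('subset.nc')
```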

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.18.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

xarray: 0.10.1
pandas: 0.20.1
numpy: 1.11.0
scipy: 0.17.1
netCDF4: 1.3.1
h5netcdf: 0.5.0
h5py: 2.7.1
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.15.2
distributed: 1.18.1
matplotlib: 1.5.1
cartopy: 0.16.0
seaborn: None
setuptools: 27.2.0
pip: 8.1.2
conda: 4.4.11
pytest: 2.9.1
IPython: 4.2.0
sphinx: 1.4.1
```