issues

1 row where user = 7504461 sorted by updated_at descending

id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association active_lock_reason draft pull_request body reactions performed_via_github_app state_reason repo type
166593563 MDU6SXNzdWUxNjY1OTM1NjM= 912 Speed up operations with xarray dataset saulomeirelles 7504461 closed 0     12 2016-07-20T14:21:40Z 2016-12-29T01:07:52Z 2016-12-29T01:07:52Z NONE      

Hi all,

I've recently been having a hard time manipulating an xarray dataset. I'm not sure whether I am making some awkward mistake, but simple operations are taking an unacceptable amount of time.

Here is a piece of my code:

ncfile = glob('*conc_size_12m.nc')
ds = xray.open_dataset(ncfile[0])
ds

<xarray.Dataset>
Dimensions:          (burst: 2485, duration: 2400, z: 160)
Coordinates:
    zdist            (z) float64 0.01014 0.02027 0.03041 0.04054 0.05068 ...
    burst_nr         (burst) float64 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 ...
    time             (duration, burst) datetime64[ns] 2014-09-16T07:00:00 ...
  * burst            (burst) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
  * duration         (duration) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
  * z                (z) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...
Data variables:
    conc_profs       (duration, z, burst) float32 3.99138e-05 4.23636e-05 ...
    burst_duration   (duration) float64 0.0 0.1246 0.2493 0.3739 0.4985 ...
    grainSize_profs  (duration, z, burst) float32 200.0 200.0 200.0 200.0 ...

ds.nbytes * (2 ** -30)

7.15415246784687
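The reported size can be cross-checked by hand from the dimensions and dtypes in the repr. The following is a rough sketch, assuming `nbytes` is simply element count times item size summed over every variable and coordinate; the `variables` table is transcribed from the repr by hand, not read from the file.

```python
from math import prod

# Dimension sizes taken from the Dataset repr.
dims = {"burst": 2485, "duration": 2400, "z": 160}

# (dimension names, bytes per element) for each variable and coordinate,
# transcribed from the repr: float32 is 4 bytes, every other dtype is 8.
variables = {
    "zdist": (("z",), 8),
    "burst_nr": (("burst",), 8),
    "time": (("duration", "burst"), 8),
    "burst": (("burst",), 8),
    "duration": (("duration",), 8),
    "z": (("z",), 8),
    "conc_profs": (("duration", "z", "burst"), 4),
    "burst_duration": (("duration",), 8),
    "grainSize_profs": (("duration", "z", "burst"), 4),
}


def nbytes(dim_names, itemsize):
    """Size in bytes of one dense array with the given dimensions."""
    return prod(dims[d] for d in dim_names) * itemsize


total_bytes = sum(nbytes(d, s) for d, s in variables.values())
total_gib = total_bytes * 2 ** -30
print(total_bytes, total_gib)
```

Under that assumption, the two float32 profile arrays account for nearly all of the roughly 7.15 GiB; the coordinates add under 50 MB.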

%time conc_avg = ds.conc_profs.chunk(2400).mean(('z','duration'))

CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 9.84 ms

%time conc_avg.load()

%time conc_avg = ds.conc_profs.isel(burst=0).mean(('z','duration'))

CPU times: user 708 ms, sys: 2.87 s, total: 3.58 s
Wall time: 1min 56s

If I work with chunks, it is impossible to load back the array in a reasonable amount of time (I waited for more than 30 min).

Looping over the burst dimension takes about 2 minutes per iteration, which is also quite unreasonable.
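One possible contributor to the two-minute timing is the storage layout. The following is a back-of-the-envelope sketch, assuming `conc_profs` is stored contiguously in C order (last dimension, `burst`, varying fastest) with no on-disk chunking or compression. The post does not say how the file was written, so this is an illustration only. Under that assumption, `isel(burst=0)` touches one 4-byte value out of every 2485.

```python
# Shape and item size of conc_profs, from the repr: (duration, z, burst) float32.
shape = (2400, 160, 2485)
itemsize = 4

# C-order strides in bytes: the last axis is contiguous, and each earlier
# axis steps over everything to its right.
strides = []
step = itemsize
for n in reversed(shape):
    strides.insert(0, step)
    step *= n
strides = tuple(strides)

# Selecting a single burst reads duration * z scattered values.
values_per_burst = shape[0] * shape[1]
gap_bytes = strides[1]                      # distance between consecutive reads
span_bytes = values_per_burst * gap_bytes   # extent of the file touched

# Extrapolate the measured 1 min 56 s per burst to a loop over every burst.
seconds_per_burst = 116
total_hours = shape[2] * seconds_per_burst / 3600

print(strides, values_per_burst, gap_bytes, span_bytes, round(total_hours, 1))
```

If the layout assumption holds, each single-burst slice consists of 384,000 small reads spread across the whole 3.8 GB variable, and looping over all 2485 bursts at the measured rate would take about 80 hours. Writing the file with `burst` as the leading dimension, or reading contiguous blocks along `duration`, might reduce this, though that is untested here.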

I was wondering if the problem could stem from the creation of my dataset which I saved into this 7+GB netCDF file. Could that be the case?

I am working on a Linux machine with an Intel Core i5, which is supposed to handle these manipulations with no hiccups. I use the IOOS environment to run xarray (version '0.7.1').

Can someone give me some advice on how to optimize my script?

I am happy to supply more details if needed.

Cheers,

{
    "url": "https://api.github.com/repos/pydata/xarray/issues/912/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed xarray 13221727 issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);