issue_comments

12 rows where issue = 166593563 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
269566487 https://github.com/pydata/xarray/issues/912#issuecomment-269566487 https://api.github.com/repos/pydata/xarray/issues/912 MDEyOklzc3VlQ29tbWVudDI2OTU2NjQ4Nw== jhamman 2443309 2016-12-29T01:07:52Z 2016-12-29T01:07:52Z MEMBER

@saulomeirelles - Hopefully, you were able to work through this issue. If not, feel free to reopen.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up operations with xarray dataset 166593563
234056046 https://github.com/pydata/xarray/issues/912#issuecomment-234056046 https://api.github.com/repos/pydata/xarray/issues/912 MDEyOklzc3VlQ29tbWVudDIzNDA1NjA0Ng== shoyer 1217238 2016-07-20T19:29:55Z 2016-07-20T19:29:55Z MEMBER

Just looking at a task manager while a task executes can give you a sense of what's going on. Dask also has some diagnostics that may be helpful: http://dask.pydata.org/en/latest/diagnostics.html

On Wed, Jul 20, 2016 at 11:44 AM Saulo Meirelles notifications@github.com wrote:

No, not really. I got no error message whatsoever. Is there any test I can do to tackle this?

Sent from Smartphone. Please forgive typos.

On Jul 20, 2016 8:41 PM, "Stephan Hoyer" notifications@github.com wrote:

I decided to wait for .load() to do the job but the kernel dies after a while.

Are you running out of memory? Can you tell what's going on? This is a little surprising to me.
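
As a rough illustration of the Dask diagnostics linked in this comment, the sketch below wraps a small chunked reduction in `ProgressBar` and `Profiler` from `dask.diagnostics`. The array shape, chunk sizes, and variable names are invented for the example and are not taken from the thread; the callbacks shown here apply to dask's local schedulers.

```python
import dask.array as da
from dask.diagnostics import Profiler, ProgressBar

# Hypothetical stand-in data: a small chunked array, not the poster's dataset.
x = da.random.random((240, 16, 50), chunks=(240, 16, 10))
lazy_mean = x.mean(axis=(0, 1))  # builds a task graph only, nothing runs yet

# ProgressBar prints progress to stdout; Profiler records one entry per task.
with Profiler() as prof, ProgressBar():
    result = lazy_mean.compute()

print(result.shape)
print(len(prof.results), "tasks recorded")
```

If the progress bar stalls on a real dataset while memory use climbs in the task manager, that would point toward chunks that are too large to hold in memory.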


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up operations with xarray dataset 166593563
234043292 https://github.com/pydata/xarray/issues/912#issuecomment-234043292 https://api.github.com/repos/pydata/xarray/issues/912 MDEyOklzc3VlQ29tbWVudDIzNDA0MzI5Mg== saulomeirelles 7504461 2016-07-20T18:44:53Z 2016-07-20T18:44:53Z NONE

No, not really. I got no error message whatsoever. Is there any test I can do to tackle this?

Sent from Smartphone. Please forgive typos.

On Jul 20, 2016 8:41 PM, "Stephan Hoyer" notifications@github.com wrote:

I decided to wait for .load() to do the job but the kernel dies after a while.

Are you running out of memory? Can you tell what's going on? This is a little surprising to me.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up operations with xarray dataset 166593563
234042142 https://github.com/pydata/xarray/issues/912#issuecomment-234042142 https://api.github.com/repos/pydata/xarray/issues/912 MDEyOklzc3VlQ29tbWVudDIzNDA0MjE0Mg== shoyer 1217238 2016-07-20T18:41:17Z 2016-07-20T18:41:17Z MEMBER

I decided to wait for .load() to do the job but the kernel dies after a while.

Are you running out of memory? Can you tell what's going on? This is a little surprising to me.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up operations with xarray dataset 166593563
234035910 https://github.com/pydata/xarray/issues/912#issuecomment-234035910 https://api.github.com/repos/pydata/xarray/issues/912 MDEyOklzc3VlQ29tbWVudDIzNDAzNTkxMA== saulomeirelles 7504461 2016-07-20T18:20:24Z 2016-07-20T18:20:24Z NONE

True.

I decided to wait for .load() to do the job but the kernel dies after a while.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up operations with xarray dataset 166593563
234026185 https://github.com/pydata/xarray/issues/912#issuecomment-234026185 https://api.github.com/repos/pydata/xarray/issues/912 MDEyOklzc3VlQ29tbWVudDIzNDAyNjE4NQ== shoyer 1217238 2016-07-20T17:47:45Z 2016-07-20T17:47:45Z MEMBER

It's worth noting that conc_avg = ds.conc_profs.chunk({'burst': 10}).mean(('z','duration')) doesn't actually do any computation -- that's why it's so fast. It just sets up the computation graph. No computation happens until you write .load().
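
A minimal sketch of that lazy behaviour, assuming a small random array in place of the real `conc_profs` variable (the sizes and the names `data`, `conc`, and `was_lazy` are made up for illustration). It also shows `.values`, which is one way to get a plain numpy array once the data is loaded.

```python
import dask.array as da
import numpy as np
import xarray as xr

# Hypothetical small array reusing the dimension names from the thread.
data = np.random.rand(24, 16, 50)
conc = xr.DataArray(data, dims=("duration", "z", "burst"))

# Lazy step: this only records the reduction in a dask graph.
conc_avg = conc.chunk({"burst": 10}).mean(("z", "duration"))
was_lazy = isinstance(conc_avg.data, da.Array)
print("dask-backed before load:", was_lazy)

# Eager step: .load() runs the graph and keeps the result in memory.
result = conc_avg.load()
values = result.values  # plain numpy array
print(type(values).__name__, values.shape)
```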

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up operations with xarray dataset 166593563
234022793 https://github.com/pydata/xarray/issues/912#issuecomment-234022793 https://api.github.com/repos/pydata/xarray/issues/912 MDEyOklzc3VlQ29tbWVudDIzNDAyMjc5Mw== saulomeirelles 7504461 2016-07-20T17:36:02Z 2016-07-20T17:36:17Z NONE

Thanks, @shoyer !

Setting smaller chunks helps; however, my issue is getting the data back.

This is fine:

%time conc_avg = ds.conc_profs.chunk({'burst': 10}).mean(('z','duration'))

CPU times: user 24 ms, sys: 0 ns, total: 24 ms Wall time: 23.8 ms

But this:

%time result = conc_avg.load()

takes an insane amount of time, which intrigues me because it is just a vector with 2845 points.

Is there another way to tackle this without dask, such as using a for-loop?

If dask is the way to go, what would be the quickest way to convert the result to a numpy array?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up operations with xarray dataset 166593563
233998757 https://github.com/pydata/xarray/issues/912#issuecomment-233998757 https://api.github.com/repos/pydata/xarray/issues/912 MDEyOklzc3VlQ29tbWVudDIzMzk5ODc1Nw== shoyer 1217238 2016-07-20T16:11:27Z 2016-07-20T16:11:27Z MEMBER

When you write ds.conc_profs.chunk(2400), it sets up the data to be loaded in a giant chunk, almost the entire file at once. Even if you use .isel() afterwards, dask does not always manage to subset the data from the initial chunk. (Sometimes it does succeed, which makes this a little confusing.)

You will probably be more successful if you try something like ds.conc_profs.chunk({'burst': 10}) instead, which keeps the intermediate chunks to a reasonable size.
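
To put rough numbers on the two chunking choices, here is a back-of-the-envelope sketch. It assumes the array shape from the allocation code posted elsewhere in this thread (2400 x 160 x 2485, float64) and assumes that an integer passed to `.chunk()` is applied to every dimension; `shape`, `itemsize`, and `chunk_bytes` are hypothetical names used only here.

```python
# Shape assumed from the allocation code posted in this thread
# (ts = 2400, zs = 160, 2485 mat-files); float64 is 8 bytes per value.
shape = {"duration": 2400, "z": 160, "burst": 2485}
itemsize = 8


def chunk_bytes(chunks):
    """Bytes in one full chunk; dimensions not listed stay in a single chunk."""
    n = 1
    for dim, size in shape.items():
        n *= min(chunks.get(dim, size), size)
    return n * itemsize


# chunk(2400): the integer is applied to every dimension (assumption above).
big = chunk_bytes({dim: 2400 for dim in shape})
# chunk({'burst': 10}): only 'burst' is split.
small = chunk_bytes({"burst": 10})

print(f"chunk(2400):          {big / 1e9:.2f} GB per chunk")
print(f"chunk({{'burst': 10}}): {small / 1e6:.2f} MB per chunk")
```

Under these assumptions a single chunk drops from about 7.37 GB to about 30.72 MB.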

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up operations with xarray dataset 166593563
233998071 https://github.com/pydata/xarray/issues/912#issuecomment-233998071 https://api.github.com/repos/pydata/xarray/issues/912 MDEyOklzc3VlQ29tbWVudDIzMzk5ODA3MQ== saulomeirelles 7504461 2016-07-20T16:08:57Z 2016-07-20T16:08:57Z NONE

I've tried to create individual nc-files and then read them all using open_mfdataset, but I got an error about opening too many files, which has been reported here before.

The glob is just a (bad) habit because I normally read multiple files. O_0

Cheers,

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up operations with xarray dataset 166593563
233996527 https://github.com/pydata/xarray/issues/912#issuecomment-233996527 https://api.github.com/repos/pydata/xarray/issues/912 MDEyOklzc3VlQ29tbWVudDIzMzk5NjUyNw== shoyer 1217238 2016-07-20T16:03:30Z 2016-07-20T16:03:30Z MEMBER

Thanks for describing that -- I misread your initial description and thought you were using open_mfdataset rather than open_dataset (the glob threw me off!). The source of these files shouldn't matter once you have it in a netCDF file.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up operations with xarray dataset 166593563
233995495 https://github.com/pydata/xarray/issues/912#issuecomment-233995495 https://api.github.com/repos/pydata/xarray/issues/912 MDEyOklzc3VlQ29tbWVudDIzMzk5NTQ5NQ== saulomeirelles 7504461 2016-07-20T16:00:02Z 2016-07-20T16:00:02Z NONE

The input files are 2485 nested mat-files that come out from a measurement device. I read them in Python ( loadmat(matfile) ) and turn them into numpy arrays like this:

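```
from glob import glob
import numpy as np

# extract_number is the poster's own helper (not shown in this comment);
# it is used here as the sort key for the file names.
matfiles = sorted(glob('*sed.mat'), key=extract_number)

if matfiles:

    ts = 2400  # length of the 'duration' dimension
    zs = 160   # length of the 'z' dimension

    # Preallocate one slot (or one column) per mat-file.
    Burst        = np.empty(len(matfiles))
    Time         = np.empty((ts,len(matfiles)), dtype='datetime64[s]')
    ConcProf     = np.empty((ts,zs,len(matfiles)), dtype='float64')
    GsizeProf    = np.empty((ts,zs,len(matfiles)), dtype='float64')

```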

Afterwards, I populate the matrices in a loop:

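```
def f(i):
    # Dist is read later at module level when building the Dataset,
    # so it must not stay local to this function.
    global Dist
    # Fill slot i of each preallocated array from the i-th mat-file.
    Dist, Burst[i], Time[:,i], ConcProf[:,:,i], GsizeProf[:,:,i] = getABSpars(matfiles[i])

```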

where

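```
import datetime as dt
from pandas import date_range

# loadmat is the reader the poster uses for the nested mat-files
# (its import is not shown in this comment).
def getABSpars(matfile):

    ndata = loadmat(matfile)

    Dist  = ndata['r']

    t_dic = ndata['BurstInfo']['StartTime']

    try:
        t_dt  = dt.datetime.strptime(t_dic, '%d-%b-%Y %H:%M:%S')
    except ValueError:
        # Start time has no time-of-day part: assume midnight.
        t_dic = t_dic + ' 00:00:00'
        t_dt  = dt.datetime.strptime(t_dic, '%d-%b-%Y %H:%M:%S')

    # One timestamp per mass profile; 'L' is the pandas alias for milliseconds.
    t_range   = date_range( t_dt,
                periods = ndata['MassProfiles'].shape[1],
                freq    = ndata['BurstInfo']['MassProfileInterval']+'L')

    Burst         = int(ndata['BurstInfo']['BurstNumber'])
    Time          = t_range
    ConcProf      = np.asarray(ndata['MassProfiles'] ).T
    GsizeProf     = np.asarray(ndata['SizeProfiles']*1e6).T

    return Dist, Burst, Time, ConcProf, GsizeProf

```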

Using the multiprocessing package:

pool = ThreadPool(4)
pool.map(f, range(len(matfiles)))
pool.close()

Finally I create the xarray dataset and then save into a nc-file:

```
ds = xray.Dataset(
    {
        'conc_profs'      : (['duration', 'z', 'burst'], ConcProf),
        'grainSize_profs' : (['duration', 'z', 'burst'], GsizeProf),
        'burst_duration'  : (['duration'], np.linspace(0, 299, Time.shape[0])),
    },
    coords = {
        'time'     : (['duration', 'burst'], Time),
        'zdist'    : (['z'], Dist),
        'burst_nr' : (['burst'], Burst),
    }
)

ds.to_netcdf('ABS_conc_size_12m.nc', mode='w')

```

It takes me around 1 h to generate the nc-file.

Could this be the reason for my headaches?

Thanks!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up operations with xarray dataset 166593563
233991357 https://github.com/pydata/xarray/issues/912#issuecomment-233991357 https://api.github.com/repos/pydata/xarray/issues/912 MDEyOklzc3VlQ29tbWVudDIzMzk5MTM1Nw== shoyer 1217238 2016-07-20T15:46:50Z 2016-07-20T15:46:50Z MEMBER

What do the original input files look like, before you join them together? This may be a case where the dask.array task scheduler does very poorly.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Speed up operations with xarray dataset 166593563

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);