issue_comments


7 rows where issue = 372848074 and user = 1872600 sorted by updated_at descending


id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
510144707 https://github.com/pydata/xarray/issues/2501#issuecomment-510144707 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUxMDE0NDcwNw== rsignell-usgs 1872600 2019-07-10T16:59:12Z 2019-07-11T11:47:02Z NONE

@TomAugspurger, I sat down here at SciPy with @rabernat and he instantly realized that we needed to drop the feature_id coordinate to prevent open_mfdataset from trying to harmonize that coordinate across all the chunks.

So if I use this code, the open_mfdataset command finishes:

    def drop_coords(ds):
        ds = ds.drop(['reference_time','feature_id'])
        return ds.reset_coords(drop=True)

and I can then add back in the dropped coordinate values at the end:

    dsets = [xr.open_dataset(f) for f in files[:3]]
    ds.coords['feature_id'] = dsets[0].coords['feature_id']
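The drop-then-restore pattern described above can be tried on small in-memory data. The following is a minimal sketch, not the code that was run in this thread: `make_hourly_file` is a hypothetical helper that fabricates three-feature stand-ins for the hourly files, `xr.concat` stands in for `open_mfdataset`, and `drop_vars` is used as the current spelling of the `ds.drop([...])` call.

```python
import numpy as np
import xarray as xr


def make_hourly_file(hour):
    """Hypothetical stand-in for one hourly file: 1 time step, 3 features."""
    t0 = np.array(["2009-01-01"], dtype="datetime64[ns]")
    return xr.Dataset(
        {"streamflow": (("time", "feature_id"), np.full((1, 3), float(hour)))},
        coords={
            "time": ("time", t0 + np.timedelta64(hour, "h")),
            "feature_id": ("feature_id", np.array([101, 179, 181], dtype="int32")),
            "reference_time": ("reference_time", t0),
        },
    )


def drop_coords(ds):
    # Remove the coordinates that would otherwise be compared across files,
    # then drop any remaining non-index coordinates.
    ds = ds.drop_vars(["reference_time", "feature_id"])
    return ds.reset_coords(drop=True)


dsets = [make_hourly_file(h) for h in range(3)]

# Concatenate along time without any feature_id coordinate to reconcile.
combined = xr.concat([drop_coords(d) for d in dsets], dim="time")

# Restore the coordinate once, taking it from the first file.
combined.coords["feature_id"] = dsets[0].coords["feature_id"]
```

This relies on every file carrying the same `feature_id` values in the same order; if that does not hold, restoring the coordinate from the first file would silently mislabel the data.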

I'm now running into memory issues when I write the zarr data -- but I should raise that as a new issue, right?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
509379294 https://github.com/pydata/xarray/issues/2501#issuecomment-509379294 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUwOTM3OTI5NA== rsignell-usgs 1872600 2019-07-08T20:28:48Z 2019-07-08T20:29:20Z NONE

@TomAugspurger, I thought @rabernat's suggestion of implementing

    def drop_coords(ds):
        return ds.reset_coords(drop=True)

would avoid this checking. Did I understand or implement this incorrectly?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
509341467 https://github.com/pydata/xarray/issues/2501#issuecomment-509341467 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUwOTM0MTQ2Nw== rsignell-usgs 1872600 2019-07-08T18:34:02Z 2019-07-08T18:34:02Z NONE

@rabernat, to answer your question, if I open just two files:

    ds = xr.open_mfdataset(files[:2], preprocess=drop_coords, autoclose=True, parallel=True)

the resulting dataset is:

    <xarray.Dataset>
    Dimensions:         (feature_id: 2729077, reference_time: 1, time: 2)
    Coordinates:
      * reference_time  (reference_time) datetime64[ns] 2009-01-01
      * feature_id      (feature_id) int32 101 179 181 ... 1180001803 1180001804
      * time            (time) datetime64[ns] 2009-01-01 2009-01-01T01:00:00
    Data variables:
        streamflow      (time, feature_id) float64 dask.array<shape=(2, 2729077), chunksize=(1, 2729077)>
        q_lateral       (time, feature_id) float64 dask.array<shape=(2, 2729077), chunksize=(1, 2729077)>
        velocity        (time, feature_id) float64 dask.array<shape=(2, 2729077), chunksize=(1, 2729077)>
        qSfcLatRunoff   (time, feature_id) float64 dask.array<shape=(2, 2729077), chunksize=(1, 2729077)>
        qBucket         (time, feature_id) float64 dask.array<shape=(2, 2729077), chunksize=(1, 2729077)>
        qBtmVertRunoff  (time, feature_id) float64 dask.array<shape=(2, 2729077), chunksize=(1, 2729077)>
    Attributes:
        featureType:                timeSeries
        proj4:                      +proj=longlat +datum=NAD83 +no_defs
        model_initialization_time:  2009-01-01_00:00:00
        station_dimension:          feature_id
        model_output_valid_time:    2009-01-01_00:00:00
        stream_order_output:        1
        cdm_datatype:               Station
        esri_pe_string:             GEOGCS[GCS_North_American_1983,DATUM[D_North_...
        Conventions:                CF-1.6
        model_version:              NWM 1.2
        dev_OVRTSWCRT:              1
        dev_NOAH_TIMESTEP:          3600
        dev_channel_only:           0
        dev_channelBucket_only:     0
        dev:                        dev_ prefix indicates development/internal me...
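The dimensions and dtypes in the repr above give a rough sense of scale. The arithmetic below is a back-of-the-envelope sketch, not a measurement: it assumes uncompressed in-memory sizes, takes the 8760-file count from the other comments in this thread, and the "all copies at once" figure is only an upper bound on what comparing the coordinate across files could cost.

```python
N_FEATURE = 2_729_077   # feature_id length, from the repr
N_FILES = 8_760         # hourly files for 2009, from the other comments
N_VARS = 6              # data variables listed in the repr

# feature_id is int32: 4 bytes per value, repeated in every file.
coord_bytes_per_file = N_FEATURE * 4
coord_bytes_all_files = coord_bytes_per_file * N_FILES

# Each data variable holds one float64 time step per file: 8 bytes per value.
var_bytes_per_file = N_FEATURE * 8
data_bytes_per_file = var_bytes_per_file * N_VARS
data_bytes_total = data_bytes_per_file * N_FILES

GIB = 2**30
print(f"feature_id per file:        {coord_bytes_per_file / 2**20:.1f} MiB")
print(f"feature_id, all files:      {coord_bytes_all_files / GIB:.1f} GiB")
print(f"data variables per file:    {data_bytes_per_file / 2**20:.1f} MiB")
print(f"data variables, full year:  {data_bytes_total / GIB:.1f} GiB")
```

Under these assumptions a single copy of `feature_id` is about 10 MiB, but 8760 copies add up to roughly 89 GiB, which is consistent with dropping that coordinate before combining being what let `open_mfdataset` finish.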

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
509340139 https://github.com/pydata/xarray/issues/2501#issuecomment-509340139 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUwOTM0MDEzOQ== rsignell-usgs 1872600 2019-07-08T18:30:18Z 2019-07-08T18:30:18Z NONE

@TomAugspurger, okay, I just ran the above code again and here's what happens:

The open_mfdataset proceeds nicely on my 8 workers with 40 cores, eventually completing the 8760 open_dataset tasks in about 10 minutes. One interesting thing is that the number of tasks keeps dropping as time goes on. Not sure why that would be. The memory usage on the workers seems okay during this process.

Then, despite the tasks showing on the dashboard being completed, the open_mfdataset command does not complete, but nothing has died, and I'm not sure what's happening. I check top and get this:

then after about 10 more minutes, I get these warnings:

and then the errors:

    distributed.client - WARNING - Couldn't gather 17520 keys, rescheduling {'getattr-fd038834-befa-4a9b-b78f-51f9aa2b28e5': ('tcp://127.0.0.1:45640',), 'drop_coords-39be9e52-59de-4e1f-b6d8-27e7d931b5af': ('tcp://127.0.0.1:55881',), 'drop_coords-8bd07037-9ca4-4f97-83fb-8b02d7ad0333': ('tcp://127.0.0.1:56164',), 'drop_coords-ca3dd72b-e5af-4099-b593-89dc97717718': ('tcp://127.0.0.1:59961',), 'getattr-c0af8992-e928-4d42-9e64-340303143454': ('tcp://127.0.0.1:42989',), 'drop_coords-8cdfe5fb-7a29-4606-8692-efa747be5bc1': ('tcp://127.0.0.1:35445',), 'getattr-03669206-0d26-46a1-988d-690fe830e52f': ...

Full error listing here: https://gist.github.com/rsignell-usgs/3b7101966b8c6d05f48a0e01695f35d6

Does this help? I'd be happy to screenshare if that would be useful.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
509282831 https://github.com/pydata/xarray/issues/2501#issuecomment-509282831 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUwOTI4MjgzMQ== rsignell-usgs 1872600 2019-07-08T15:51:23Z 2019-07-08T15:51:23Z NONE

@TomAugspurger, I'm back from vacation now and ready to attack this again. Any updates on your end?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
506475819 https://github.com/pydata/xarray/issues/2501#issuecomment-506475819 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDUwNjQ3NTgxOQ== rsignell-usgs 1872600 2019-06-27T19:16:28Z 2019-06-27T19:24:31Z NONE

I tried this, and either I didn't apply it right, or it didn't work. The memory use kept growing until the process died. My code to process the 8760 netcdf files with open_mfdataset looks like this:

```python
import xarray as xr
from dask.distributed import Client, progress, LocalCluster

cluster = LocalCluster()
client = Client(cluster)

import pandas as pd

dates = pd.date_range(start='2009-01-01 00:00',end='2009-12-31 23:00', freq='1h')
files = ['./nc/{}/{}.CHRTOUT_DOMAIN1.comp'.format(date.strftime('%Y'),date.strftime('%Y%m%d%H%M')) for date in dates]

def drop_coords(ds):
    return ds.reset_coords(drop=True)

ds = xr.open_mfdataset(files, preprocess=drop_coords, autoclose=True, parallel=True)
ds1 = ds.chunk(chunks={'time':168, 'feature_id':209929})

import numcodecs
numcodecs.blosc.use_threads = False
ds1.to_zarr('zarr/2009', mode='w', consolidated=True)
```
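The file list and chunk sizes in the script above can be sanity-checked without touching any data. This is a sketch using only the standard library in place of `pd.date_range`; the `feature_id` length of 2729077 is taken from the dataset repr posted elsewhere in this thread, and the chunk size assumes uncompressed float64 values.

```python
import math
from datetime import datetime, timedelta

# Rebuild the hourly file list for 2009 with the same path pattern.
start = datetime(2009, 1, 1, 0)
end = datetime(2009, 12, 31, 23)
n_hours = int((end - start) / timedelta(hours=1)) + 1
files = [
    "./nc/{}/{}.CHRTOUT_DOMAIN1.comp".format(d.strftime("%Y"), d.strftime("%Y%m%d%H%M"))
    for d in (start + timedelta(hours=i) for i in range(n_hours))
]

# Chunk layout requested by ds.chunk(chunks={'time': 168, 'feature_id': 209929}).
n_feature = 2_729_077
time_chunk, feature_chunk = 168, 209_929
n_time_chunks = math.ceil(len(files) / time_chunk)
n_feature_chunks = math.ceil(n_feature / feature_chunk)
last_time_chunk = len(files) - (n_time_chunks - 1) * time_chunk
chunk_bytes = time_chunk * feature_chunk * 8  # float64

print(len(files), files[0], files[-1])
print(n_time_chunks, n_feature_chunks, last_time_chunk)
print(f"{chunk_bytes / 2**20:.0f} MiB per full chunk, per variable")
```

The feature dimension divides evenly into 13 chunks, while the time dimension yields 52 full weekly chunks plus one 24-hour remainder. Each full output chunk is about 269 MiB per variable and draws on 168 separate input files, so the rechunk step has to bring data from many files together before any one chunk can be written.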

I transferred the netcdf files from AWS S3 to my local disk to run this, using this command:

    rclone sync --include '*.CHRTOUT_DOMAIN1.comp' aws-east:nwm-archive/2009 . --checksum --fast-list --transfers 16

@TomAugspurger, if you could take a look, that would be great, and if you have any ideas of how to make this example simpler/more easily reproducible, please let me know.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074
497381301 https://github.com/pydata/xarray/issues/2501#issuecomment-497381301 https://api.github.com/repos/pydata/xarray/issues/2501 MDEyOklzc3VlQ29tbWVudDQ5NzM4MTMwMQ== rsignell-usgs 1872600 2019-05-30T15:55:56Z 2019-05-30T15:58:48Z NONE

I'm hitting some memory issues with using open_mfdataset with a cluster also.

Specifically, I'm trying to open 8760 NetCDF files with an 8-node, 40-CPU LocalCluster.

When I issue:

    ds = xr.open_mfdataset(files, parallel=True)

all looks good on the Dask dashboard, and the tasks complete with no errors in about 4 minutes.

Then 4 more minutes go by before I get a bunch of errors like:

    distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
    distributed.nanny - WARNING - Worker process 26054 was killed by unknown signal
    distributed.nanny - WARNING - Restarting worker

and my cell doesn't complete.

Any suggestions?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  open_mfdataset usage and limitations. 372848074


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);