issue_comments


6 rows where issue = 504497403 sorted by updated_at descending


user 4

  • crusaderky 2
  • sipposip 2
  • shoyer 1
  • dcherian 1

author_association 2

  • MEMBER 4
  • NONE 2

issue 1

  • add option to open_mfdataset for not using dask · 6
sipposip (NONE) · 2019-10-10T09:11:31Z · https://github.com/pydata/xarray/issues/3386#issuecomment-540477057

@dcherian a dump of a single file:

```
ncdump -hs era5_mean_sea_level_pressure_2002.nc
netcdf era5_mean_sea_level_pressure_2002 {
dimensions:
	longitude = 1440 ;
	latitude = 721 ;
	time = 8760 ;
variables:
	float longitude(longitude) ;
		longitude:units = "degrees_east" ;
		longitude:long_name = "longitude" ;
	float latitude(latitude) ;
		latitude:units = "degrees_north" ;
		latitude:long_name = "latitude" ;
	int time(time) ;
		time:units = "hours since 1900-01-01 00:00:00.0" ;
		time:long_name = "time" ;
		time:calendar = "gregorian" ;
	short msl(time, latitude, longitude) ;
		msl:scale_factor = 0.23025422306319 ;
		msl:add_offset = 99003.8223728885 ;
		msl:_FillValue = -32767s ;
		msl:missing_value = -32767s ;
		msl:units = "Pa" ;
		msl:long_name = "Mean sea level pressure" ;
		msl:standard_name = "air_pressure_at_mean_sea_level" ;

// global attributes:
		:Conventions = "CF-1.6" ;
		:history = "2019-10-03 16:05:54 GMT by grib_to_netcdf-2.10.0: /opt/ecmwf/eccodes/bin/grib_to_netcdf -o /cache/data5/adaptor.mars.internal-1570117777.9045198-23871-11-c8564b6f-4db5-48d8-beab-ba9fef91d4e8.nc /cache/tmp/c8564b6f-4db5-48d8-beab-ba9fef91d4e8-adaptor.mars.internal-1570117777.905033-23871-3-tmp.grib" ;
		:_Format = "64-bit offset" ;
}
```

@shoyer: thanks for the tip. I think that simply adding more data-loading threads is indeed the best solution.

crusaderky (MEMBER) · 2019-10-10T09:05:21Z · https://github.com/pydata/xarray/issues/3386#issuecomment-540474492

@sipposip if your dask graph is resolved straight after the load from disk, you can try disabling the dask optimizer to see if you can squeeze some milliseconds out of load(). You can look up the setting syntax in the dask documentation.
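For illustration, a minimal sketch of what that might look like; the exact config key is an assumption here (`optimization.fuse.active` toggles dask's task-fusion pass in recent releases, so check the docs for your version):

```python
import dask
import xarray as xr

# Open lazily, one chunk per time step, as in the setup discussed above.
ds = xr.open_mfdataset("era5_*.nc", chunks={"time": 1})

# Assumed config key: with fusion off, the graph runs as-built,
# skipping the optimization pass before execution.
with dask.config.set({"optimization.fuse.active": False}):
    ds.load()
```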

shoyer (MEMBER) · 2019-10-09T21:28:48Z · https://github.com/pydata/xarray/issues/3386#issuecomment-540208420

netCDF4.MFDataset works on a much more restricted set of netCDF files than xarray.open_mfdataset. I'm not surprised it's a little bit faster, but I'm not sure it's worth the maintenance burden of supporting this separate code path. Making a fully featured version of open_mfdataset without dask would be challenging.

Can you simply add more threads in TensorFlow/Keras for loading the data? My other suggestion is to pre-shuffle the data on disk, so you don't need random access inside your training loop.
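A rough sketch of the first suggestion, assuming TF 2.x where `Model.fit` accepts a `keras.utils.Sequence` together with a `workers` argument (older Keras used `fit_generator` for this); the class and variable names are hypothetical:

```python
import tensorflow as tf

# Hypothetical Sequence that pulls batches out of an already-opened
# xarray.Dataset; Keras can then prefetch batches with worker threads.
class NetCDFBatches(tf.keras.utils.Sequence):
    def __init__(self, ds, batch_size):
        self.ds = ds
        self.batch_size = batch_size

    def __len__(self):
        return self.ds.sizes["time"] // self.batch_size

    def __getitem__(self, i):
        sl = slice(i * self.batch_size, (i + 1) * self.batch_size)
        x = self.ds["msl"].isel(time=sl).values  # reads only this slice
        y = x  # placeholder target; a real pipeline would return labels
        return x, y

# model.fit(NetCDFBatches(ds, 32), workers=4, use_multiprocessing=False)
```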

dcherian (MEMBER) · 2019-10-09T14:43:29Z · https://github.com/pydata/xarray/issues/3386#issuecomment-540033550

It would be useful to see what a single file looks like and what the combined dataset looks like. open_mfdataset can sometimes require some tuning to get good performance.
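For instance (a sketch; the file names are placeholders based on the dump above, and `combine="nested"`/`concat_dim` assume xarray >= 0.13):

```python
import xarray as xr

# Inspect one file: dimensions, dtypes, and encoding often explain
# why the combined open is slow.
single = xr.open_dataset("era5_mean_sea_level_pressure_2002.nc")
print(single)

# Then inspect the combined dataset; chunk sizes and the concat options
# are the usual tuning knobs.
combined = xr.open_mfdataset(
    "era5_mean_sea_level_pressure_*.nc",
    combine="nested",
    concat_dim="time",
    chunks={"time": 1},
)
print(combined)
```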

sipposip (NONE) · 2019-10-09T09:20:06Z · https://github.com/pydata/xarray/issues/3386#issuecomment-539916279

Setting dask.config.set(scheduler="synchronous") globally indeed resolved the threading issues, thanks. However, loading and preprocessing a single time slice of data is ~40% slower with dask and open_mfdataset (with chunks={'time': 1}) than with netCDF4.MFDataset. Is this expected/a known issue? If not, I can try to create a minimal reproducible example.
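Such an example might look roughly like the timing sketch below; the file pattern is an assumption, and the `msl` variable comes from the ERA5 dump shown earlier on this page:

```python
import glob
import timeit

import netCDF4
import xarray as xr

files = sorted(glob.glob("era5_*.nc"))  # assumed file pattern

# netCDF4.MFDataset: one lazily indexed variable spanning all files.
mf = netCDF4.MFDataset(files)
t_nc = timeit.timeit(lambda: mf["msl"][0, :, :], number=100)

# xarray + dask: one dask chunk per time step.
ds = xr.open_mfdataset(files, chunks={"time": 1})
t_xr = timeit.timeit(lambda: ds["msl"].isel(time=0).load(), number=100)

print(f"netCDF4.MFDataset: {t_nc:.2f}s, xarray/dask: {t_xr:.2f}s")
```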

crusaderky (MEMBER) · 2019-10-09T08:58:21Z · https://github.com/pydata/xarray/issues/3386#issuecomment-539907822

@sipposip xarray doesn't use netCDF4.MFDataset, but netCDF4.Dataset, which is wrapped by dask arrays that are then concatenated.

Opening each file separately with open_dataset, and then concatenating them with xr.concat does not work, as this loads the data into memory.

This is by design, for the reason above: NetCDF/HDF5 lazy loading means that data is loaded into a numpy.ndarray on the first operation performed on it, and concatenation is such an operation.
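To make the contrast concrete (a sketch; the file list is hypothetical):

```python
import glob
import xarray as xr

files = sorted(glob.glob("era5_*.nc"))

# Eager: concatenation is an operation, so the lazily opened arrays are
# materialized into numpy before being joined.
eager = xr.concat([xr.open_dataset(f) for f in files], dim="time")

# Lazy: each file is wrapped in dask arrays first, so the concatenation
# stays symbolic until .load() or .compute().
lazy = xr.open_mfdataset(files)
```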

I'm aware that threads within threads, threads within processes, and processes within threads cause a world of pain in the form of random deadlocks; I've been there myself. You can completely disable dask threads process-wide:

```python
dask.config.set(scheduler="synchronous")
...
ds.load()
```

or as a context manager:

```python
with dask.config.set(scheduler="synchronous"):
    ds.load()
```

or for a single operation:

```python
ds.load(scheduler="synchronous")
```

Does this address your issue?


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);