
issue_comments


17 rows where author_association = "MEMBER" and issue = 304589831 sorted by updated_at descending


Issue: Parallel open_mfdataset (304589831)

Commenters:
  • jhamman (13)
  • rabernat (2)
  • shoyer (2)
382487555 · jhamman · 2018-04-18T18:38:47Z · https://github.com/pydata/xarray/pull/1983#issuecomment-382487555

With my last commits here, this feature is completely optional and defaults to the current behavior. I cleaned up the tests a bit further and am now ready to merge this. Barring any objections, I'll merge this on Friday.

382157273 · jhamman · 2018-04-17T21:41:03Z · https://github.com/pydata/xarray/pull/1983#issuecomment-382157273

I think that makes sense for now. We need to experiment with this a bit more but I don't see a problem merging the basic workflow we have now (with a minor change to the default behavior).

382154051 · shoyer · 2018-04-17T21:30:53Z · https://github.com/pydata/xarray/pull/1983#issuecomment-382154051

It sounds like the right resolution for now would be to leave the default as parallel=False and leave this as an optional feature.
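
For context, the opt-in usage being described would look something like this minimal sketch (the file pattern is a hypothetical placeholder):

```python
import xarray as xr

# parallel defaults to False; opting in parallelizes the per-file opens
ds = xr.open_mfdataset('data/*.nc', parallel=True)
```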

382146851 · jhamman · 2018-04-17T21:08:29Z · https://github.com/pydata/xarray/pull/1983#issuecomment-382146851

@NicWayand - Thanks for giving this a go. Some thoughts on your problem...

I have been using this feature for the past few days and have been seeing a speedup on datasets with many files, along the lines of what I showed above. But I am running my tests on perhaps the ideal architecture for this (parallel shared filesystem, fast interconnect, etc.), so I think there are many reasons/cases where this won't work as well.

381277673 · jhamman · 2018-04-13T22:42:59Z · https://github.com/pydata/xarray/pull/1983#issuecomment-381277673

@rabernat - I got the tests passing here again. If you can make the time to try your example/test again, it would be great to figure out what wasn't working before.

380257320 · jhamman · 2018-04-10T21:44:28Z (updated 2018-04-10T21:45:02Z) · https://github.com/pydata/xarray/pull/1983#issuecomment-380257320

@rabernat - I just pushed a few more commits here. Can I ask two questions:

When using the distributed scheduler, what configuration are you using? Can you try:

- autoclose=True (in open_mfdataset)
- processes=True (in the client)

If this turns out to be a corner case with the distributed scheduler, I can add an integration test for that specific use case.
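
For readers following along, the suggested configuration would look roughly like this sketch (the file pattern is hypothetical; autoclose was an open_mfdataset option at the time):

```python
import xarray as xr
from dask.distributed import Client

# process-based workers rather than threads
client = Client(processes=True)

# close each file after reading to limit the number of open handles
ds = xr.open_mfdataset('data/*.nc', parallel=True, autoclose=True)
```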

380150362 · jhamman · 2018-04-10T15:49:06Z · https://github.com/pydata/xarray/pull/1983#issuecomment-380150362

@rabernat - my last commit(s) seem to have broken the CI so I'll need to revisit this.

380121937 · rabernat · 2018-04-10T14:32:25Z · https://github.com/pydata/xarray/pull/1983#issuecomment-380121937

I recently tried this branch with my data server and got an error.

I opened a dataset this way:

```python
# works fine with parallel=False
ds = xr.open_mfdataset(os.path.join(ddir, 'V1_1.204.nc'),
                       decode_cf=False, parallel=True)
```

and got the following error.

```
distributed.utils - ERROR - NetCDF: HDF error
Traceback (most recent call last):
  File "/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/distributed/utils.py", line 237, in f
    result[0] = yield make_coro()
  File "/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/distributed/client.py", line 1356, in _gather
    traceback)
  File "/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/home/rpa/.conda/envs/dask_distributed/lib/python3.5/site-packages/dask/compatibility.py", line 48, in apply
    return func(*args, **kwargs)
  File "/home/rpa/xarray/xarray/backends/api.py", line 318, in open_dataset
    return maybe_decode_store(store, lock)
  File "/home/rpa/xarray/xarray/backends/api.py", line 238, in maybe_decode_store
    drop_variables=drop_variables)
  File "/home/rpa/xarray/xarray/conventions.py", line 594, in decode_cf
    vars, attrs = obj.load()
  File "/home/rpa/xarray/xarray/backends/common.py", line 217, in load
    for k, v in self.get_variables().items())
  File "/home/rpa/xarray/xarray/backends/netCDF4_.py", line 319, in get_variables
    iteritems(self.ds.variables))
  File "/home/rpa/xarray/xarray/core/utils.py", line 308, in FrozenOrderedDict
    return Frozen(OrderedDict(*args, **kwargs))
  File "/home/rpa/xarray/xarray/backends/netCDF4_.py", line 318, in <genexpr>
    for k, v in
  File "/home/rpa/xarray/xarray/backends/netCDF4_.py", line 311, in open_store_variable
    encoding['original_shape'] = var.shape
  File "netCDF4/_netCDF4.pyx", line 3381, in netCDF4._netCDF4.Variable.shape.__get__ (netCDF4/_netCDF4.c:34388)
  File "netCDF4/_netCDF4.pyx", line 2759, in netCDF4._netCDF4.Dimension.__len__ (netCDF4/_netCDF4.c:27006)
RuntimeError: NetCDF: HDF error
```

Without the distributed scheduler (but with parallel=True), I get no error, but the command never returns, and eventually I have to restart the kernel.

Any idea what could be going on? (Sorry for the non-reproducible bug report...I figured some trials "in the field" might be useful.)

379323343 · jhamman · 2018-04-06T17:33:45Z · https://github.com/pydata/xarray/pull/1983#issuecomment-379323343

All the tests are passing here? Any final objectors?

Reactions: 👍 1
379306351 · jhamman · 2018-04-06T16:29:15Z · https://github.com/pydata/xarray/pull/1983#issuecomment-379306351

I imagine there will be a small performance cost when the number of files is small. That cost is probably lost in the noise of most I/O operations.

379305062 · rabernat · 2018-04-06T16:24:22Z · https://github.com/pydata/xarray/pull/1983#issuecomment-379305062

Can we imagine cases where it might actually degrade performance?

379304351 · shoyer · 2018-04-06T16:21:51Z · https://github.com/pydata/xarray/pull/1983#issuecomment-379304351

My reason for suggesting parallel=True as the default when using distributed is to turn this feature on in the cases where we can expect it will probably improve performance.
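
A hypothetical sketch of how such a default could be detected (not the actual PR logic): enable parallel opens only when a dask.distributed client is active.

```python
def default_parallel():
    """Return True only when a dask.distributed client is active."""
    try:
        import distributed
        distributed.get_client()  # raises ValueError if no client is running
        return True
    except (ImportError, ValueError):
        return False
```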

379303753 · jhamman · 2018-04-06T16:19:35Z · https://github.com/pydata/xarray/pull/1983#issuecomment-379303753

> I'm curious about the logic of defaulting to parallel when using distributed.

I'm not tied to the behavior. It was suggested by @shoyer a while back. Perhaps we try this and evaluate how it works in the wild?

376689828 · jhamman · 2018-03-27T21:59:35Z · https://github.com/pydata/xarray/pull/1983#issuecomment-376689828

> Have you tested this with both a local system and an HPC cluster?

I have. See below for a simple example using this feature on Cheyenne.

```python
In [1]: import xarray as xr
   ...: import glob

In [2]: pattern = '/glade/u/home/jhamman/workdir/LOCA_daily/met_data/CESM1-BGC/16th/rcp45/r1i1p1/*/*nc'

In [3]: len(glob.glob(pattern))
Out[3]: 285

In [4]: %time ds = xr.open_mfdataset(pattern)
CPU times: user 15.5 s, sys: 2.62 s, total: 18.1 s
Wall time: 42.4 s

In [5]: ds.close()

In [6]: %time ds = xr.open_mfdataset(pattern, parallel=True)
CPU times: user 18.4 s, sys: 5.28 s, total: 23.6 s
Wall time: 30.7 s

In [7]: ds.close()

In [8]: from dask.distributed import Client

In [9]: client = Client()

In [10]: client
Out[10]: <Client: scheduler='tcp://127.0.0.1:39853' processes=72 cores=72>

In [11]: %time ds = xr.open_mfdataset(pattern, parallel=True, autoclose=True)
CPU times: user 10.8 s, sys: 808 ms, total: 11.6 s
Wall time: 12.4 s
```

375799794 · jhamman · 2018-03-23T21:12:33Z · https://github.com/pydata/xarray/pull/1983#issuecomment-375799794

> I'm tempted to just skip this test there but thought I should ask for help first...

I've skipped the offending test on AppVeyor for now. Objectors, speak up please. I don't have a Windows machine to test on, and iterating via AppVeyor is not something a sane person does 😉.

373245814 · jhamman · 2018-03-15T03:05:08Z · https://github.com/pydata/xarray/pull/1983#issuecomment-373245814

If anyone understands Windows file handling with Python, I'm all ears as to why this is failing on AppVeyor. I'm tempted to just skip this test there but thought I should ask for help first...

372807932 · jhamman · 2018-03-13T20:30:49Z · https://github.com/pydata/xarray/pull/1983#issuecomment-372807932

@shoyer - I updated this to use dask.delayed. I actually like it more because I only have to call compute once. Thanks for the suggestion.
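
As a rough illustration of the dask.delayed approach described here (a sketch, not the actual PR code; the path pattern and concat dimension are assumptions):

```python
import glob

import dask
import xarray as xr

paths = sorted(glob.glob('data/*.nc'))  # hypothetical file set

# Build one delayed open per file...
delayed_datasets = [dask.delayed(xr.open_dataset)(p) for p in paths]

# ...then realize them all with a single compute() call.
datasets = dask.compute(*delayed_datasets)

combined = xr.concat(datasets, dim='time')  # assumes files share a 'time' dim
```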


Table schema

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
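
For reference, the filter shown at the top of this page corresponds to a query like the following (a sketch using Python's sqlite3 module against a hypothetical local copy of the database):

```python
import sqlite3

conn = sqlite3.connect('github.db')  # hypothetical local database file
rows = conn.execute(
    """
    SELECT id, user, created_at, updated_at, body
    FROM issue_comments
    WHERE author_association = 'MEMBER' AND issue = 304589831
    ORDER BY updated_at DESC
    """
).fetchall()
print(len(rows))  # 17 rows for this issue at the time of the export
```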