issue_comments

5 rows where author_association = "MEMBER" and issue = 1333650265 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1210976795 https://github.com/pydata/xarray/issues/6904#issuecomment-1210976795 https://api.github.com/repos/pydata/xarray/issues/6904 IC_kwDOAMm_X85ILgob shoyer 1217238 2022-08-10T16:43:36Z 2022-08-10T16:43:36Z MEMBER

You might look into different multiprocessing modes: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

It may also be that the NetCDF or HDF5 libraries were simply not written in a way that can support multi-processing. This would not surprise me.

> BTW, is there any advantage or difference in terms of CPU and memory consumption between opening the file only once and letting every process open it? I'm asking because I thought opening it in every process was just plain stupid, but it seems to perform exactly the same, so maybe I'm just creating a problem where there is none.

I agree, maybe this isn't worth the trouble. I have not seen it done successfully before.
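
For reference, the start methods linked above can be chosen per context instead of globally. The following is a minimal sketch using only the standard library. The names `describe_worker` and `run_with_start_method` are hypothetical, and the worker only reports its process ID as a stand-in for real work on a dataset.

```python
import multiprocessing as mp
import os
from concurrent.futures import ProcessPoolExecutor


def describe_worker(task_id):
    # Stand-in for real work such as a lookup on an opened dataset.
    return task_id, os.getpid()


def run_with_start_method(method, task_ids):
    # get_context() returns a context bound to one start method
    # ("spawn", "fork" or "forkserver", depending on the platform).
    ctx = mp.get_context(method)
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as pool:
        return list(pool.map(describe_worker, task_ids))


if __name__ == "__main__":
    print("available:", mp.get_all_start_methods())
    for task_id, pid in run_with_start_method("spawn", range(4)):
        print(f"task {task_id} ran in process {pid}")
```

Swapping `"spawn"` for another entry of `mp.get_all_start_methods()` is one way to check whether the behavior depends on how the worker processes are created.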

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `sel` behaving randomly when applying to a dataset with multiprocessing 1333650265
1210255676 https://github.com/pydata/xarray/issues/6904#issuecomment-1210255676 https://api.github.com/repos/pydata/xarray/issues/6904 IC_kwDOAMm_X85IIwk8 shoyer 1217238 2022-08-10T07:10:41Z 2022-08-10T07:10:41Z MEMBER

> Will that work in the same way if I still use process_map, which uses concurrent.futures under the hood?

Yes it should, as long as you're using multi-processing under the covers.

If you do multi-threading, then you would want to use threading.Lock(). But I believe we already apply a thread lock by default.
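
As an illustration of the multi-threading case, here is a minimal sketch of guarding a shared resource with `threading.Lock()`. The names `guarded_read`, `read_all`, and `counter` are hypothetical, and the guarded block is a placeholder for whatever access is not thread-safe.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

counter = {"reads": 0}
read_lock = threading.Lock()


def guarded_read(index):
    # Only one thread at a time enters this block; a real use would
    # wrap the non-thread-safe file access here instead.
    with read_lock:
        counter["reads"] += 1
        return index * 2


def read_all(indices):
    # pool.map() returns results in the order of the inputs.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(guarded_read, indices))


if __name__ == "__main__":
    print(read_all(range(5)))
```

A `threading.Lock` only coordinates threads inside one process, which is why the multi-processing case needs a lock from `multiprocessing` instead.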

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `sel` behaving randomly when applying to a dataset with multiprocessing 1333650265
1210233503 https://github.com/pydata/xarray/issues/6904#issuecomment-1210233503 https://api.github.com/repos/pydata/xarray/issues/6904 IC_kwDOAMm_X85IIrKf shoyer 1217238 2022-08-10T06:45:06Z 2022-08-10T06:45:06Z MEMBER

Can you try explicitly passing a multiprocessing lock into the open_dataset() constructor? Something like:

    import xarray
    from multiprocessing import Lock

    ds = xarray.open_dataset(file, lock=Lock())

(We automatically select appropriate locks if using Dask, but I'm not sure how we would do that more generally...)
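
A multiprocessing lock only coordinates processes that hold the same lock object, so one way to share it is to create it once in the parent and hand it to every worker at start-up. The sketch below shows that pattern with the standard library only. The names `init_worker`, `guarded_lookup`, and `run` are hypothetical, the lookup is a placeholder for a file-backed selection, and whether this resolves the reported behavior is not established here.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

_shared_lock = None  # set once in each worker by init_worker()


def init_worker(lock):
    # Runs once per worker process and stores the lock it was given.
    global _shared_lock
    _shared_lock = lock


def guarded_lookup(key):
    # Placeholder for a file-backed lookup; only one process at a time
    # executes the body of the with-block.
    with _shared_lock:
        return key, key * key


def run(keys):
    ctx = mp.get_context("spawn")
    lock = ctx.Lock()  # created once in the parent, shared by all workers
    with ProcessPoolExecutor(
        max_workers=2,
        mp_context=ctx,
        initializer=init_worker,
        initargs=(lock,),
    ) as pool:
        return list(pool.map(guarded_lookup, keys))


if __name__ == "__main__":
    print(run([1, 2, 3]))
```

The lock is passed through `initargs` rather than as a task argument because multiprocessing locks can be handed to a process when it is created but cannot be pickled into individual tasks.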

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `sel` behaving randomly when applying to a dataset with multiprocessing 1333650265
1210216148 https://github.com/pydata/xarray/issues/6904#issuecomment-1210216148 https://api.github.com/repos/pydata/xarray/issues/6904 IC_kwDOAMm_X85IIm7U max-sixty 5635139 2022-08-10T06:24:54Z 2022-08-10T06:24:54Z MEMBER

Re nearest, does it replicate with exact lookups?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `sel` behaving randomly when applying to a dataset with multiprocessing 1333650265
1209921400 https://github.com/pydata/xarray/issues/6904#issuecomment-1209921400 https://api.github.com/repos/pydata/xarray/issues/6904 IC_kwDOAMm_X85IHe94 max-sixty 5635139 2022-08-09T21:39:21Z 2022-08-09T21:39:21Z MEMBER

That sounds quite unfriendly!

A couple of questions to reduce the size of the example, without providing any answers yet unfortunately:

  • Is process_map from tqdm? Do you get the same behavior from the standard multiprocessing?
  • What if we remove method=nearest?
  • Is the file a single netCDF file?
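
One way to approach the first question is a reduced example that runs the same calls serially and in worker processes, then lists any differences. This is a sketch using `concurrent.futures` from the standard library. The names `lookup`, `mismatches`, and `run_parallel` are hypothetical, and `lookup` is a placeholder for the real selection on the dataset.

```python
from concurrent.futures import ProcessPoolExecutor


def lookup(key):
    # Placeholder for the real call, e.g. a selection on the dataset.
    return key * 10


def mismatches(keys, expected, actual):
    # Collect (key, serial result, parallel result) where the two differ.
    return [(k, e, a) for k, e, a in zip(keys, expected, actual) if e != a]


def run_parallel(keys):
    keys = list(keys)
    expected = [lookup(k) for k in keys]        # serial reference
    with ProcessPoolExecutor(max_workers=2) as pool:
        actual = list(pool.map(lookup, keys))   # same calls in workers
    return mismatches(keys, expected, actual)


if __name__ == "__main__":
    print(run_parallel(range(8)))
```

An empty list means the parallel run agreed with the serial one for every key.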
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  `sel` behaving randomly when applying to a dataset with multiprocessing 1333650265

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 242.681ms · About: xarray-datasette