issue_comments

12 rows where issue = 1333650265 sorted by updated_at descending

shoyer (MEMBER) · 2022-08-10T16:43:36Z · https://github.com/pydata/xarray/issues/6904#issuecomment-1210976795

You might look into the different multiprocessing start methods: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
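
For instance, a minimal sketch (file and variable names are placeholders, not from this issue) of forcing the "spawn" start method so each worker opens its own file handle instead of inheriting forked netCDF/HDF5 state:

```python
# Sketch: "spawn" starts each worker in a fresh interpreter, so no open
# netCDF/HDF5 handles are inherited from the parent via fork.
import multiprocessing as mp

import xarray as xr

def compute(i):
    # Placeholder names: each worker opens (and closes) its own handle.
    with xr.open_dataset("input.nc") as ds:
        point = ds.isel(lat=0, lon=0)
        return i, point.t_2m_max.mean().item()

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(compute, range(5)))
```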

It may also be that the NetCDF or HDF5 libraries were simply not written in a way that can support multi-processing. This would not surprise me.

> BTW, is there any advantage or difference in terms of CPU and memory consumption between opening the file only once and letting every process open it? I'm asking because I thought opening it in every process was just plain stupid, but it seems to perform exactly the same, so maybe I'm just creating a problem where there is none.

I agree, maybe this isn't worth the trouble. I have not seen it done successfully before.

guidocioni (NONE) · 2022-08-10T09:07:00Z · https://github.com/pydata/xarray/issues/6904#issuecomment-1210383450

This is the most minimal working example I could come up with. You can try opening any netCDF file that you have. I tested on a small one and it didn't reproduce the error, so it is definitely only happening with large datasets, when the arrays are not loaded into memory. Unfortunately, since you need a large file, I cannot really attach one here.

```python
import xarray as xr
from tqdm.contrib.concurrent import process_map
import pprint

def main():
    global ds
    ds = xr.open_dataset('input.nc')
    it = range(0, 5)
    results = []
    for i in it:
        results.append(compute(i))
    print("------------Serial results-----------------")
    pprint.pprint(results)
    results = process_map(compute, it, max_workers=6, chunksize=1, disable=True)
    print("------------Parallel results-----------------")
    pprint.pprint(results)

def compute(station):
    ds_point = ds.isel(lat=0, lon=0)
    return (station,
            ds_point.t_2m_max.mean().item(),
            ds_point.t_2m_min.mean().item(),
            ds_point.lon.min().item(),
            ds_point.lat.min().item())

if __name__ == "__main__":
    main()
```

guidocioni (NONE) · 2022-08-10T08:38:31Z · https://github.com/pydata/xarray/issues/6904#issuecomment-1210349031

> Re nearest, does it replicate with exact lookups?

Ok, it seems to fail with exact lookups as well o.O This is extremely weird.

I'm using

```python
def compute():
    ds_point = ds.isel(lat=0, lon=0)
    return (ds_point.t_2m_med.mean().item(),
            ds_point.t_2m_min.mean().item(),
            ds_point.lon.min().item(),
            ds_point.lat.min().item())
```

Result for the serial version:

```python
[(10.469047546386719, 6.5044121742248535, 6.0, 48.0),
 (10.469047546386719, 6.5044121742248535, 6.0, 48.0),
 (10.469047546386719, 6.5044121742248535, 6.0, 48.0),
 (10.469047546386719, 6.5044121742248535, 6.0, 48.0),
 (10.469047546386719, 6.5044121742248535, 6.0, 48.0)]
```

As you would expect, all values are the same.

And for the parallel version, with EXACTLY the same code:

```python
[(7.968084812164307, 6.948009967803955, 6.0, 48.0),
 (7.825599193572998, 6.995675563812256, 6.0, 48.0),
 (8.894186019897461, 6.849221706390381, 6.0, 48.0),
 (8.901763916015625, 6.69615364074707, 6.0, 48.0),
 (9.164983749389648, 6.484694480895996, 6.0, 48.0)]
```

guidocioni (NONE) · 2022-08-10T08:32:13Z · https://github.com/pydata/xarray/issues/6904#issuecomment-1210341456

> ```python
> ds = xarray.open_dataset(file, lock=Lock())
> ```

That causes an error:

```
Error 11: Resource temporarily unavailable
```

Here is the complete traceback:

```
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/var/models/miniconda3/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/var/models/miniconda3/lib/python3.8/concurrent/futures/process.py", line 198, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/var/models/miniconda3/lib/python3.8/concurrent/futures/process.py", line 198, in <listcomp>
    return [fn(*args) for args in chunk]
  File "test_sel_bug.py", line 58, in compute_clima
    return station, ds_point.t_2m_med.mean().item(), ds_point.t_2m_min.mean().item(), ds_point.lon.min().item(), ds_point.lat.min().item()
  File "/var/models/miniconda3/lib/python3.8/site-packages/xarray/core/common.py", line 58, in wrapped_func
    return self.reduce(func, dim, axis, skipna=skipna, **kwargs)
  File "/var/models/miniconda3/lib/python3.8/site-packages/xarray/core/dataarray.py", line 2696, in reduce
    var = self.variable.reduce(func, dim, axis, keep_attrs, keepdims, **kwargs)
  File "/var/models/miniconda3/lib/python3.8/site-packages/xarray/core/variable.py", line 1806, in reduce
    data = func(self.data, **kwargs)
  File "/var/models/miniconda3/lib/python3.8/site-packages/xarray/core/variable.py", line 339, in data
    return self.values
  File "/var/models/miniconda3/lib/python3.8/site-packages/xarray/core/variable.py", line 512, in values
    return _as_array_or_item(self._data)
  File "/var/models/miniconda3/lib/python3.8/site-packages/xarray/core/variable.py", line 252, in _as_array_or_item
    data = np.asarray(data)
  File "/var/models/miniconda3/lib/python3.8/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/var/models/miniconda3/lib/python3.8/site-packages/xarray/core/indexing.py", line 552, in __array__
    self._ensure_cached()
  File "/var/models/miniconda3/lib/python3.8/site-packages/xarray/core/indexing.py", line 549, in _ensure_cached
    self.array = NumpyIndexingAdapter(np.asarray(self.array))
  File "/var/models/miniconda3/lib/python3.8/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/var/models/miniconda3/lib/python3.8/site-packages/xarray/core/indexing.py", line 522, in __array__
    return np.asarray(self.array, dtype=dtype)
  File "/var/models/miniconda3/lib/python3.8/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/var/models/miniconda3/lib/python3.8/site-packages/xarray/core/indexing.py", line 423, in __array__
    return np.asarray(array[self.key], dtype=None)
  File "/var/models/miniconda3/lib/python3.8/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/var/models/miniconda3/lib/python3.8/site-packages/xarray/coding/variables.py", line 70, in __array__
    return self.func(self.array)
  File "/var/models/miniconda3/lib/python3.8/site-packages/xarray/coding/variables.py", line 137, in _apply_mask
    data = np.asarray(data, dtype=dtype)
  File "/var/models/miniconda3/lib/python3.8/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/var/models/miniconda3/lib/python3.8/site-packages/xarray/core/indexing.py", line 423, in __array__
    return np.asarray(array[self.key], dtype=None)
  File "/var/models/miniconda3/lib/python3.8/site-packages/xarray/backends/netCDF4_.py", line 93, in __getitem__
    return indexing.explicit_indexing_adapter(
  File "/var/models/miniconda3/lib/python3.8/site-packages/xarray/core/indexing.py", line 712, in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)
  File "/var/models/miniconda3/lib/python3.8/site-packages/xarray/backends/netCDF4_.py", line 106, in _getitem
    array = getitem(original_array, key)
  File "src/netCDF4/_netCDF4.pyx", line 4420, in netCDF4._netCDF4.Variable.__getitem__
  File "src/netCDF4/_netCDF4.pyx", line 5363, in netCDF4._netCDF4.Variable._get
  File "src/netCDF4/_netCDF4.pyx", line 1950, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: Resource temporarily unavailable
"""
```

I think we may be heading in the right direction.

guidocioni (NONE) · 2022-08-10T07:41:20Z · https://github.com/pydata/xarray/issues/6904#issuecomment-1210285626

> > Will that work in the same way if I still use process_map, which uses concurrent.futures under the hood?
>
> Yes it should, as long as you're using multi-processing under the covers.
>
> If you do multi-threading, then you would want to use threading.Lock(). But I believe we already apply a thread lock by default.

Mmm ok, I'll try and let you know.

BTW, is there any advantage or difference in terms of CPU and memory consumption between opening the file only once and letting every process open it? I'm asking because I thought opening it in every process was just plain stupid, but it seems to perform exactly the same, so maybe I'm just creating a problem where there is none.

shoyer (MEMBER) · 2022-08-10T07:10:41Z · https://github.com/pydata/xarray/issues/6904#issuecomment-1210255676

> Will that work in the same way if I still use process_map, which uses concurrent.futures under the hood?

Yes it should, as long as you're using multi-processing under the covers.

If you do multi-threading, then you would want to use threading.Lock(). But I believe we already apply a thread lock by default.
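
A minimal sketch of the multi-threaded variant (file and variable names are placeholders, mirroring the example further down the page), where that default thread lock applies:

```python
# Sketch: threads share one process and one open handle; xarray's default
# thread lock serializes the actual reads from the netCDF file.
from concurrent.futures import ThreadPoolExecutor

import xarray as xr

ds = xr.open_dataset("input.nc")  # placeholder file name

def compute(i):
    point = ds.isel(lat=0, lon=0)
    return i, point.t_2m_max.mean().item()

with ThreadPoolExecutor(max_workers=6) as ex:
    print(list(ex.map(compute, range(5))))
```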

guidocioni (NONE) · 2022-08-10T06:51:18Z · https://github.com/pydata/xarray/issues/6904#issuecomment-1210238864

> Can you try explicitly passing in a multiprocessing lock into the open_dataset() constructor? Something like:
>
> ```python
> from multiprocessing import Lock
> ds = xarray.open_dataset(file, lock=Lock())
> ```
>
> (We automatically select appropriate locks if using Dask, but I'm not sure how we would do that more generally...)

Ok, that's a good shot. Will that work in the same way if I still use process_map, which uses concurrent.futures under the hood?

shoyer (MEMBER) · 2022-08-10T06:45:06Z · https://github.com/pydata/xarray/issues/6904#issuecomment-1210233503

Can you try explicitly passing in a multiprocessing lock into the open_dataset() constructor? Something like:

```python
from multiprocessing import Lock
ds = xarray.open_dataset(file, lock=Lock())
```

(We automatically select appropriate locks if using Dask, but I'm not sure how we would do that more generally...)

guidocioni (NONE) · 2022-08-10T06:30:06Z · https://github.com/pydata/xarray/issues/6904#issuecomment-1210220238

> Re nearest, does it replicate with exact lookups?

I haven't tried yet because it doesn't really match my use case. One idea I had was to build the list of points before starting the loop, creating an iterator of slices from the xarray, and then passing that to the loop (see the sketch below). But I would end up reading more data than necessary, because I don't process all cases.

Another thing I've noticed is that if the list of items is smaller than the chunksize, everything's fine, probably because it reverts to the serial case, as only one worker is processing.
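
A sketch of that pre-slicing idea (hypothetical station indices and helper function, reusing the placeholder names from the example above): load each point in the parent process, then hand the in-memory objects to the workers so no file handle is shared:

```python
# Sketch: slice and load each point in the parent process, then map over
# in-memory Datasets so the workers never touch the netCDF file themselves.
from tqdm.contrib.concurrent import process_map

import xarray as xr

def compute_point(ds_point):
    # ds_point is already loaded; no file access happens in the worker.
    return ds_point.t_2m_max.mean().item(), ds_point.t_2m_min.mean().item()

if __name__ == "__main__":
    ds = xr.open_dataset("input.nc")     # placeholder file name
    stations = [(0, 0), (0, 1), (1, 0)]  # placeholder (lat, lon) indices
    points = [ds.isel(lat=la, lon=lo).load() for la, lo in stations]
    print(process_map(compute_point, points, max_workers=6, chunksize=1))
```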

max-sixty (MEMBER) · 2022-08-10T06:24:54Z · https://github.com/pydata/xarray/issues/6904#issuecomment-1210216148

Re nearest, does it replicate with exact lookups?

guidocioni (NONE) · 2022-08-10T05:23:13Z (edited 2022-08-10T05:24:24Z) · https://github.com/pydata/xarray/issues/6904#issuecomment-1210174583

> That sounds quite unfriendly!
>
> A couple of questions to reduce the size of the example, without providing any answers yet unfortunately:
>
> • Is process_map from tqdm? Do you get the same behavior from the standard multiprocessing?

Yep, and yep (believe me, I've tried everything in desperation 😄)

> • What if we remove method=nearest?

Which method should I use then? I need the closest point.

> • Is the file a single netCDF file?

Yep.

I can try to make a minimal example; however, in order to reproduce the issue, I think it's necessary to open a large dataset (see the sketch below for fabricating one).
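
Failing that, a sketch for fabricating a large enough netCDF locally (sizes and coordinates are made up; the variable names mirror the example above), so nothing needs to be attached:

```python
# Sketch: ~200 MB per float32 variable, large enough that xarray keeps the
# reads lazy instead of caching everything in memory up front.
import numpy as np
import xarray as xr

shape = (5000, 100, 100)  # (time, lat, lon), arbitrary
data = {
    name: (("time", "lat", "lon"), np.random.rand(*shape).astype("float32"))
    for name in ("t_2m_max", "t_2m_min")
}
coords = {
    "lat": np.linspace(40.0, 55.0, shape[1]),
    "lon": np.linspace(0.0, 20.0, shape[2]),
}
xr.Dataset(data, coords=coords).to_netcdf("input.nc")
```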

max-sixty (MEMBER) · 2022-08-09T21:39:21Z · https://github.com/pydata/xarray/issues/6904#issuecomment-1209921400

That sounds quite unfriendly!

A couple of questions to reduce the size of the example, without providing any answers yet unfortunately:

  • Is process_map from tqdm? Do you get the same behavior from the standard multiprocessing? (see the sketch after this list)
  • What if we remove method=nearest?
  • Is the file a single netCDF file?
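
For the first question, a minimal sketch of the standard-library check (file and variable names are placeholders, mirroring the example above):

```python
# Sketch: swap tqdm's process_map for multiprocessing.Pool to rule tqdm out.
from multiprocessing import Pool

import xarray as xr

ds = xr.open_dataset("input.nc")  # placeholder file name

def compute(i):
    point = ds.isel(lat=0, lon=0)
    return i, point.t_2m_max.mean().item()

if __name__ == "__main__":
    with Pool(processes=6) as pool:
        print(pool.map(compute, range(5), chunksize=1))
```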
