github: issues: 3 rows where repo = 13221727 and user = 10595679 sorted by updated

3 rows where repo = 13221727 and user = 10595679 sorted by updated_at descending

Search:

descending

id	node_id	number	title	user	state	comments	created_at	updated_at ▲	closed_at	author_association	draft	pull_request	body	reactions	state_reason	repo	type
425320466	MDU6SXNzdWU0MjUzMjA0NjY=	2852	Allow grouping by dask variables	jmichel-otb 10595679	open	10	2019-03-26T09:55:19Z	2022-04-18T15:45:41Z		CONTRIBUTOR			Code Sample, a copy-pastable example if possible I am using `xarray` in combination to `dask distributed` on a cluster, so a mimimal code sample demonstrating my problem is not easy to come up with. Problem description Here is what I observe: 1. In my environment, `dask distributed` is correctly set-up with auto-scaling. I can verify this by loading data into `xarray` and using aggregation functions like `mean()`. This triggers auto-scaling and the dask dashboard shows that the processing is spread accross slave nodes. I have the following `xarray` dataset called `geoms_ds`: ``` <xarray.Dataset> Dimensions: (x: 10980, y: 10980) Coordinates: y (y) float64 4.9e+06 4.9e+06 4.9e+06 ... 4.79e+06 4.79e+06 4.79e+06 x (x) float64 3e+05 3e+05 3e+05 ... 4.098e+05 4.098e+05 4.098e+05 Data variables: label (y, x) uint16 dask.array<shape=(10980, 10980), chunksize=(200, 10980)> ``` Which I load with the following code sample: `python import xarray as xr geoms = xr.open_rasterio('test_rasterization_T31TCJ_uint16.tif',chunks={'band': 1, 'x': 10980, 'y': 200}) geoms_squeez = geoms.isel(band=0).squeeze().drop(labels='band') geoms_ds = geoms_squeez.to_dataset(name='label')` This `array` holds a finite number of integer values denoting groups (or classes if you like). I would like to perform statistics on groups (with additional variables) such as the mean value of a given variable for each group for instance. I can do this perfectly for a single group using `.where(label=xxx).mean('variable')`, this behaves as expected, triggering auto-scaling and dask graph of task. The problem is that I have a lot of groups (or classes) and looping through all of them and apply `where()` is not very efficient. From my reading of `xarray` documentation, `groupby` is what I need, to perform stats on all groups at once. When I try to use `geoms_ds.groupby('label').size()` for instance, here is what I observe: Grouping is not lazy, it is evaluated immediately, Grouping is not performed through dask distributed, only the master node is working, on a single thread, The grouping operation takes a large amount of time and eats a large amount of memory (nearly 30 Gb, which is a lot more than what is required to store the full dataset in memory) Most of the time, the grouping fail with the following errors and warnings: distributed.utils_perf - WARNING - full garbage collections took 52% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 47% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 48% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 50% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 53% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 56% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 56% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 57% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 57% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 57% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 57% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 58% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 58% CPU time recently (threshold: 10%) distributed.utils_perf - WARNING - full garbage collections took 59% CPU time recently (threshold: 10%) WARNING:dask_jobqueue.core:Worker tcp://10.135.39.92:51747 restart in Job 2758934. This can be due to memory issue. distributed.utils - ERROR - 'tcp://10.135.39.92:51747' Traceback (most recent call last): File "/work/logiciels/projets/eolab/conda/eolab/lib/python3.6/site-packages/distributed/utils.py", line 648, in log_errors yield File "/work/logiciels/projets/eolab/conda/eolab/lib/python3.6/site-packages/distributed/scheduler.py", line 1360, in add_worker yield self.handle_worker(comm=comm, worker=address) File "/work/logiciels/projets/eolab/conda/eolab/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run value = future.result() File "/work/logiciels/projets/eolab/conda/eolab/lib/python3.6/site-packages/tornado/gen.py", line 326, in wrapper yielded = next(result) File "/work/logiciels/projets/eolab/conda/eolab/lib/python3.6/site-packages/distributed/scheduler.py", line 2220, in handle_worker worker_comm = self.stream_comms[worker] KeyError: ... Which I assume comes from the fact that the process is killed by pbs for excessive memory usage. Expected Output I would except the following: * Single call to `groupby`lazily evaluated, * Evaluation of aggregation function performed through `dask distributed` * The dataset is not so large, even on a single master thread the computation should end well in reasonable time. Output of `xr.show_versions()` NSTALLED VERSIONS ------------------ commit: None python: 3.6.7 \| packaged by conda-forge \| (default, Nov 21 2018, 03:09:43) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 3.10.0-327.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.6.2 xarray: 0.11.3 pandas: 0.24.1 numpy: 1.16.1 scipy: 1.2.0 netCDF4: 1.4.2 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.0.3.4 PseudonetCDF: None rasterio: 1.0.15 cfgrib: None iris: None bottleneck: None cyordereddict: None dask: 1.1.1 distributed: 1.25.3 matplotlib: 3.0.2 cartopy: 0.17.0 seaborn: 0.9.0 setuptools: 40.7.1 pip: 19.0.1 conda: None pytest: None IPython: 7.1.1 sphinx: None	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2852/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	issue
428374352	MDExOlB1bGxSZXF1ZXN0MjY2NzU2OTEw	2865	BUG: Fix #2864 by adding the missing vrt parameters	jmichel-otb 10595679	closed	5	2019-04-02T18:22:07Z	2019-04-11T16:24:17Z	2019-04-11T16:24:13Z	CONTRIBUTOR	0	pydata/xarray/pulls/2865	[x] Closes #2864 [x] Tests added [x] Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2865/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		xarray 13221727	pull
428300345	MDU6SXNzdWU0MjgzMDAzNDU=	2864	Bug in WarpedVRT support of open_rasterio()	jmichel-otb 10595679	closed	3	2019-04-02T15:37:08Z	2019-04-11T16:24:13Z	2019-04-11T16:24:13Z	CONTRIBUTOR			Code Sample, a copy-pastable example if possible Using the following data: https://gitlab.orfeo-toolbox.org/orfeotoolbox/otb/blob/develop/Data/Input/QB_Toulouse_Ortho_XS.tif ```python import rasterio as rio from rasterio.crs import CRS from rasterio.warp import calculate_default_transform,aligned_target from rasterio.enums import Resampling from rasterio.vrt import WarpedVRT import xarray as xr path = 'QB_Toulouse_Ortho_XS.tif' Read input metadata with rasterio with rio.open(path) as src: print('Input file CRS is {}'.format(src.crs)) print('Input file shape is {}'.format(src.shape)) print('Input file transform is {}'.format(src.transform)) # Create a different CRS dst_crs = CRS.from_epsg(2154) # Compute a transform that will resample to dst_crs and change resolution to dst_crs transform, width, height = calculate_default_transform( src.crs, dst_crs, src.width, src.height,resolution=20, src.bounds) # Fill vrt options as shown in https://rasterio.readthedocs.io/en/stable/topics/virtual-warping.html vrt_options = { 'resampling': Resampling.cubic, 'crs': dst_crs, 'transform': transform, 'height': height, 'width': width } # Create a WarpedVRT using the vrt_options with WarpedVRT(src,vrt_options) as vrt: print('VRT shape is {}'.format(vrt.shape)) # Open VRT with xarray ds = xr.open_rasterio(vrt) # Shape does not match vrt shape: print(ds) `Output:` $ python test_rio_vrt.py Input file CRS is EPSG:32631 Input file shape is (500, 500) Input file transform is \| 0.60, 0.00, 374149.98\| \| 0.00,-0.60, 4829183.99\| \| 0.00, 0.00, 1.00\| VRT shape is (16, 16) <xarray.DataArray (band: 4, y: 500, x: 500)> [1000000 values with dtype=int16] Coordinates: band (band) int64 1 2 3 4 * y (y) float64 6.28e+06 6.28e+06 6.28e+06 ... 6.279e+06 6.279e+06 * x (x) float64 5.741e+05 5.741e+05 5.741e+05 ... 5.744e+05 5.744e+05 Attributes: transform: (0.6003151072155879, 0.0, 574068.2261249251, 0.0, -0.6003151... crs: EPSG:2154 res: (0.6003151072155879, 0.6003151072155879) is_tiled: 0 nodatavals: (nan, nan, nan, nan) ``` Problem description In the above example, `xarray.open_rasterio()` is asked to read a `WarpedVRT` created with `rasterio`. This `WarpedVRT` has custom `transform`, `width` and `height`, which is a very common use case of `WarpedVRT` where you upsample or downsample your data on the fly during read. Las, `open_rasterio()` ignores those custom attributes, which result in a wrong `DataArray` shape: `VRT shape is (16, 16) <xarray.DataArray (band: 4, y: 500, x: 500)>` Expected Output Correct output should be: `VRT shape is (16, 16) <xarray.DataArray (band: 4, y: 16, x: 16)>` Which is fairly easy to obtain by modifying the following lines in `open_rasterio()`: https://github.com/pydata/xarray/blob/0c73a380745c4792ab440eb020f78f203897abe5/xarray/backends/rasterio_.py#L222 With the following: `python vrt_params = dict(crs=vrt.crs.to_string(), resampling=vrt.resampling, src_nodata=vrt.src_nodata, dst_nodata=vrt.dst_nodata, tolerance=vrt.tolerance, # Edit transform=vrt.transform, width=vrt.width, height=vrt.height, # End edit warp_extras=vrt.warp_extras)` I can provide this patch in a pull request if needed. Output of `xr.show_versions()` >>> xr.show_versions() INSTALLED VERSIONS ------------------ commit: None python: 3.6.7 \| packaged by conda-forge \| (default, Nov 21 2018, 03:09:43) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 3.10.0-862.2.3.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: fr_FR.utf8 LOCALE: fr_FR.UTF-8 libhdf5: 1.10.4 libnetcdf: 4.6.2 xarray: 0.11.3 pandas: 0.24.1 numpy: 1.16.1 scipy: 1.2.0 netCDF4: 1.4.2 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.0.3.4 PseudonetCDF: None rasterio: 1.0.17 cfgrib: None iris: None bottleneck: None cyordereddict: None dask: 1.1.1 distributed: 1.25.3 matplotlib: 3.0.2 cartopy: 0.17.0 seaborn: 0.9.0 setuptools: 40.7.3 pip: 19.0.1 conda: None pytest: 4.2.0 IPython: 7.1.1 sphinx: None	{ "url": "https://api.github.com/repos/pydata/xarray/issues/2864/reactions", "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	completed	xarray 13221727	issue

Advanced export

JSON shape: default, array, newline-delimited, object

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);

issues

3 rows where repo = 13221727 and user = 10595679 sorted by updated_at descending

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of `xr.show_versions()`

Code Sample, a copy-pastable example if possible

Read input metadata with rasterio

Problem description

Expected Output

Output of `xr.show_versions()`

Advanced export

issues

3 rows where repo = 13221727 and user = 10595679 sorted by updated_at descending

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of xr.show_versions()

Code Sample, a copy-pastable example if possible

Read input metadata with rasterio

Problem description

Expected Output

Output of xr.show_versions()

Advanced export

Output of `xr.show_versions()`

Output of `xr.show_versions()`