issues

2 rows where repo = 13221727 and user = 10678620 sorted by updated_at descending

Row 1

id: 1646350377
node_id: PR_kwDOAMm_X85NMgVL
number: 7698
title: Use read1 instead of read to get magic number
user: groutr (10678620)
state: open
locked: 0
comments: 5
created_at: 2023-03-29T18:57:23Z
updated_at: 2023-04-18T22:36:30Z
author_association: FIRST_TIME_CONTRIBUTOR
draft: 0
pull_request: pydata/xarray/pulls/7698
repo: xarray (13221727)
type: pull

body:

Addresses #7697.

I changed the isinstance check because neither read nor read1 is provided by IOBase. Only RawIOBase and BufferedIOBase provide read and read1, respectively.

I think that there is little benefit to using .tell(). I suggest the following:

```python
filename_or_obj.seek(0)
magic_number = filename_or_obj.read1(count)
filename_or_obj.seek(0)
```
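A minimal sketch of the dispatch described above (the helper name `read_magic_number` and the `count` default are illustrative stand-ins, not the actual xarray code):

```python
import io

def read_magic_number(filename_or_obj, count=8):
    # IOBase itself defines neither read nor read1; dispatch on the
    # subclass that actually provides the method we need.
    filename_or_obj.seek(0)
    if isinstance(filename_or_obj, io.BufferedIOBase):
        magic = filename_or_obj.read1(count)  # at most one raw read
    elif isinstance(filename_or_obj, io.RawIOBase):
        magic = filename_or_obj.read(count)
    else:
        raise TypeError("expected a binary file-like object")
    filename_or_obj.seek(0)
    return magic
```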

reactions:
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7698/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Row 2

id: 1646267547
node_id: I_kwDOAMm_X85iIAyb
number: 7697
title: open_mfdataset very slow
user: groutr (10678620)
state: open
locked: 0
comments: 6
created_at: 2023-03-29T17:55:45Z
updated_at: 2023-03-29T21:21:24Z
author_association: NONE
repo: xarray (13221727)
type: issue

body:

What happened?

I am trying to open an mfdataset consisting of over 4400 files. The call completes in 342.735s on my machine. After running a profiler, I discovered that most of that time is spent reading the first 8 bytes of each file. However, on my filesystem, my system resource monitor suggests that the entire file is being read (with a sustained 40-50 MB/s of read I/O for most of that time).

I traced the bottleneck to https://github.com/pydata/xarray/blob/96030d4825159c6259736155084f38ee8db5c9bb/xarray/core/utils.py#L662. According to my profile, 264.381s (77%) of the execution time is spent on this line.

I isolated the essence of this code by reading the first 8 bytes of each file:

```python
for f in files:
    with open(f, 'rb') as fh:
        if fh.tell() != 0:
            fh.seek(0)
        magic = fh.read(8)
        fh.seek(0)
```

Profiling this on my directory of netcdf files took 137.587s (not sure why this was faster than 264s; caching, maybe?). Changing the fh.read(8) to fh.read1(8), the execution time dropped to 1.52s.
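A rough harness for that comparison (a sketch; `time_magic_reads` and the file list are illustrative, not part of the original report):

```python
import time

def time_magic_reads(files, use_read1=False):
    # Time reading the first 8 bytes of every file, via either
    # BufferedReader.read() or BufferedReader.read1().
    start = time.perf_counter()
    for f in files:
        with open(f, 'rb') as fh:
            magic = fh.read1(8) if use_read1 else fh.read(8)
    return time.perf_counter() - start
```

read1(8) issues at most one read on the underlying raw stream, whereas read(8) may fill the whole internal buffer (and, on some shared filesystems, trigger aggressive readahead), which would be consistent with the timings above.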

What did you expect to happen?

I expected open_mfdataset to be quicker.

Minimal Complete Verifiable Example

```Python
import xarray as xr
import pathlib

files = [... <list of 4400 filenames> ...]

# This takes almost 6 minutes to finish.
D = xr.open_mfdataset(files, compat='override', coords='minimal', data_vars='minimal')
```

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

I cannot share the netcdf files. I believe this issue to be isolated, and possibly triggered by the shared filesystems found on supercomputers.

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.80.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, None)
libhdf5: 1.12.2
libnetcdf: 4.9.1

xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.1
netCDF4: 1.6.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.6
cfgrib: None
iris: None
bottleneck: None
dask: 2023.3.1
distributed: None
matplotlib: 3.7.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.3.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.5.1
pip: 23.0.1
conda: None
pytest: None
mypy: None
IPython: 8.11.0
sphinx: None
reactions:
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/7697/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}

Table schema

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
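For reference, a minimal sketch of querying this table for the rows shown on this page (the database filename github.db is an assumption; the filter matches the "2 rows where ..." description above):

```python
import sqlite3

# Hypothetical database file; the actual Datasette database name may differ.
conn = sqlite3.connect("github.db")
rows = conn.execute(
    """
    SELECT id, number, title, state, type
    FROM issues
    WHERE repo = 13221727 AND [user] = 10678620
    ORDER BY updated_at DESC
    """
).fetchall()
for row in rows:
    print(row)
conn.close()
```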