
issue_comments


10 rows where issue = 224553135 and user = 1197350 sorted by updated_at descending


id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
1043038150 https://github.com/pydata/xarray/issues/1385#issuecomment-1043038150 https://api.github.com/repos/pydata/xarray/issues/1385 IC_kwDOAMm_X84-K3_G rabernat 1197350 2022-02-17T14:57:03Z 2022-02-17T14:57:03Z MEMBER

See deeper dive in https://github.com/pydata/xarray/discussions/6284

  slow performance with open_mfdataset 224553135
1043016100 https://github.com/pydata/xarray/issues/1385#issuecomment-1043016100 https://api.github.com/repos/pydata/xarray/issues/1385 IC_kwDOAMm_X84-Kymk rabernat 1197350 2022-02-17T14:36:23Z 2022-02-17T14:36:23Z MEMBER

Ah ok so if that is your goal, decode_times=False should be enough to solve it.

There is a problem with the time encoding in this file. The units (days since 1950-01-01T00:00:00Z) are not compatible with the values (738457.04166667, etc.). That would place your measurements sometime in the year 3971. This is part of the problem, but not the whole story.
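The arithmetic can be checked with the standard library, using the value quoted above:

```python
import datetime

# One of the raw time values, with the file's stated epoch.
value = 738457.04166667          # "days since 1950-01-01"
epoch = datetime.datetime(1950, 1, 1)
decoded = epoch + datetime.timedelta(days=value)
print(decoded.year)  # 3971
```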

1043001146 https://github.com/pydata/xarray/issues/1385#issuecomment-1043001146 https://api.github.com/repos/pydata/xarray/issues/1385 IC_kwDOAMm_X84-Ku86 rabernat 1197350 2022-02-17T14:21:45Z 2022-02-17T14:22:23Z MEMBER

(I could post to a web server if there's any reason to prefer that.)

In general that would be a little more convenient than google drive, because then we could download the file from python (rather than having a manual step). This would allow us to share a fully copy-pasteable code snippet to reproduce the issue. But don't worry about that for now.

First, I'd note that your issue is not really related to open_mfdataset at all, since it is reproduced just using open_dataset. The core problem is that you have ~15M timesteps, and it is taking forever to decode the times out of them. It's fast when you do decode_times=False because the data aren't actually being read. I'm going to make a post over in discussions to dig a bit deeper into this. StackOverflow isn't monitored too regularly by this community.
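That decoding step can be reproduced in isolation; a minimal sketch on a tiny synthetic in-memory dataset (made-up values, not the file from this thread):

```python
import numpy as np
import xarray as xr

# Tiny synthetic stand-in for a CF-encoded time coordinate
# (made-up values, not the file from this thread).
time = xr.Variable("time", np.array([0.5, 1.5, 2.5]),
                   attrs={"units": "days since 2000-01-01"})
ds = xr.Dataset({"sst": ("time", np.zeros(3))}, coords={"time": time})

decoded = xr.decode_cf(ds)     # converts every float into a timestamp
print(decoded.time.values[0])  # numpy datetime64 for 2000-01-01T12:00:00
```

Scaled up to ~15M timesteps, this per-value conversion is the step that dominates the open time.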

1042937825 https://github.com/pydata/xarray/issues/1385#issuecomment-1042937825 https://api.github.com/repos/pydata/xarray/issues/1385 IC_kwDOAMm_X84-Kffh rabernat 1197350 2022-02-17T13:14:50Z 2022-02-17T13:14:50Z MEMBER

Hi Tom! 👋

So much has evolved about xarray since this original issue was posted. However, we continue to use it as a catchall for people looking to speed up open_mfdataset. I saw your stackoverflow post. Any chance you could post a link to the actual file in question?

561920115 https://github.com/pydata/xarray/issues/1385#issuecomment-561920115 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDU2MTkyMDExNQ== rabernat 1197350 2019-12-05T01:09:25Z 2019-12-05T01:09:25Z MEMBER

In your twitter thread you said:

> Do any of my xarray/dask folks know why open_mfdataset takes such a significant amount of time compared to looping over a list of files? Each file corresponds to a new time, just wanting to open multiple times at once...

The general reason for this is usually that open_mfdataset performs coordinate compatibility checks when it concatenates the files. It's useful to actually read the code of open_mfdataset to see how it works.

First, all the files are opened individually https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L900-L903

You can recreate this step outside of xarray yourself by doing something like:

```python
from glob import glob
datasets = [xr.open_dataset(fname, chunks={}) for fname in glob('*.nc')]
```

Once each dataset is open, xarray calls out to one of its combine functions. This logic has gotten more complex over the years as different options have been introduced, but the gist is this: https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L947-L952

You can reproduce this step outside of xarray, e.g.:

```python
ds = xr.concat(datasets, dim='time')
```

At that point, various checks will kick in to be sure that the coordinates in the different datasets are compatible. Performing these checks requires the data to be read eagerly, which can be a source of slow performance.
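Those compatibility checks can be exercised without any files at all; a minimal sketch using two synthetic in-memory datasets (made-up values):

```python
import numpy as np
import xarray as xr

# Two tiny in-memory datasets standing in for two files on disk.
ds1 = xr.Dataset({"sst": ("time", np.zeros(2))}, coords={"time": [0, 1]})
ds2 = xr.Dataset({"sst": ("time", np.ones(2))}, coords={"time": [2, 3]})

# concat aligns the coordinates and checks their compatibility,
# just as the combine step inside open_mfdataset does.
combined = xr.concat([ds1, ds2], dim="time")
print(combined.sizes["time"])  # 4
```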

Without seeing more details about your files, it's hard to know exactly where the issue lies. A good place to start is to simply drop all coordinates from your data as a preprocessing step.

```python
def drop_all_coords(ds):
    return ds.reset_coords(drop=True)

xr.open_mfdataset('*.nc', combine='by_coords', preprocess=drop_all_coords)
```

If you observe a big speedup, this points at coordinate compatibility checks as the culprit. From there you can experiment with the various options for open_mfdataset, such as coords='minimal', compat='override', etc.
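As a sketch of what compat='override' changes, using two synthetic in-memory datasets whose scalar lat coordinate disagrees (made-up values):

```python
import xarray as xr

# Two synthetic datasets whose scalar 'lat' coordinate disagrees.
ds1 = xr.Dataset({"sst": ("time", [0.0, 1.0])}, coords={"time": [0, 1], "lat": 10.0})
ds2 = xr.Dataset({"sst": ("time", [2.0, 3.0])}, coords={"time": [2, 3], "lat": 10.5})

# compat='override' skips the equality check and keeps 'lat' from the first
# dataset; with coords='minimal' and the default compat='equals', the
# differing values would raise an error instead.
combined = xr.concat([ds1, ds2], dim="time", coords="minimal", compat="override")
print(float(combined.lat))  # 10.0
```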

Once you post your file details, we can provide more concrete suggestions.

{
    "total_count": 6,
    "+1": 6,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
561915767 https://github.com/pydata/xarray/issues/1385#issuecomment-561915767 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDU2MTkxNTc2Nw== rabernat 1197350 2019-12-05T00:52:06Z 2019-12-05T00:52:06Z MEMBER

@keltonhalbert - I'm sorry you're frustrated by this issue. It's hard to provide a general answer to "why is open_mfdataset slow?" without seeing the data in question. I'll try to provide some best practices and recommendations here. In the meantime, could you please post the xarray repr of two of your files? To be explicit:

```python
ds1 = xr.open_dataset('file1.nc')
print(ds1)
ds2 = xr.open_dataset('file2.nc')
print(ds2)
```

This will help us debug.

463369751 https://github.com/pydata/xarray/issues/1385#issuecomment-463369751 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDQ2MzM2OTc1MQ== rabernat 1197350 2019-02-13T21:04:03Z 2019-02-13T21:04:03Z MEMBER

What if you do xr.open_mfdataset(fname, decode_times=False)?

371891466 https://github.com/pydata/xarray/issues/1385#issuecomment-371891466 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDM3MTg5MTQ2Ng== rabernat 1197350 2018-03-09T17:53:15Z 2018-03-09T17:53:15Z MEMBER

Calling `ds = xr.decode_cf(ds, decode_times=False)` on the dataset returns instantly. However, the variable data is wrapped in the adaptors, effectively destroying the chunks:

```python
>>> ds.SST.variable._data
LazilyIndexedArray(array=DaskIndexingAdapter(array=dask.array<_apply_mask, shape=(16401, 2400, 3600), dtype=float32, chunksize=(1, 2400, 3600)>), key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None))))
```

Calling getitem on this array triggers the whole dask array to be computed, which would take forever and would completely blow out the notebook memory. This is because of #1372, which would be fixed by #1725.

This has actually become a major showstopper for me. I need to work with this dataset in decoded form.

Versions

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.12.62-60.64.8-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.1
pandas: 0.22.0
numpy: 1.13.3
scipy: 1.0.0
netCDF4: 1.3.1
h5netcdf: 0.5.0
h5py: 2.7.1
Nio: None
zarr: 2.2.0a2.dev176
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.1
distributed: 1.21.3
matplotlib: 2.1.2
cartopy: 0.15.1
seaborn: 0.8.1
setuptools: 38.4.0
pip: 9.0.1
conda: None
pytest: 3.3.2
IPython: 6.2.1
```
370064483 https://github.com/pydata/xarray/issues/1385#issuecomment-370064483 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDM3MDA2NDQ4Mw== rabernat 1197350 2018-03-02T21:57:26Z 2018-03-02T21:57:26Z MEMBER

An update on this long-standing issue.

I have learned that open_mfdataset can be blazingly fast if decode_cf=False but extremely slow with decode_cf=True.

As an example, I am loading a POP dataset on Cheyenne. Anyone with access can try this example.

```python
base_dir = '/glade/scratch/rpa/'
prefix = 'BRCP85C5CN_ne120_t12_pop62.c13b17.asdphys.001'
code = 'pop.h.nday1.SST'
glob_pattern = os.path.join(base_dir, prefix, '%s.%s.*.nc' % (prefix, code))

def non_time_coords(ds):
    return [v for v in ds.data_vars if 'time' not in ds[v].dims]

def drop_non_essential_vars_pop(ds):
    return ds.drop(non_time_coords(ds))

# this runs almost instantly
ds = xr.open_mfdataset(glob_pattern, decode_times=False, chunks={'time': 1},
                       preprocess=drop_non_essential_vars_pop, decode_cf=False)
```

And returns this:

```
<xarray.Dataset>
Dimensions:     (d2: 2, nlat: 2400, nlon: 3600, time: 16401, z_t: 62, z_t_150m: 15, z_w: 62, z_w_bot: 62, z_w_top: 62)
Coordinates:
  * z_w_top     (z_w_top) float32 0.0 1000.0 2000.0 3000.0 4000.0 5000.0 ...
  * z_t         (z_t) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0 ...
  * z_w         (z_w) float32 0.0 1000.0 2000.0 3000.0 4000.0 5000.0 6000.0 ...
  * z_t_150m    (z_t_150m) float32 500.0 1500.0 2500.0 3500.0 4500.0 5500.0 ...
  * z_w_bot     (z_w_bot) float32 1000.0 2000.0 3000.0 4000.0 5000.0 6000.0 ...
  * time        (time) float64 7.322e+05 7.322e+05 7.322e+05 7.322e+05 ...
Dimensions without coordinates: d2, nlat, nlon
Data variables:
    time_bound  (time, d2) float64 dask.array<shape=(16401, 2), chunksize=(1, 2)>
    SST         (time, nlat, nlon) float32 dask.array<shape=(16401, 2400, 3600), chunksize=(1, 2400, 3600)>
Attributes:
    nsteps_total:  480
    tavg_sum:      64800.0
    title:         BRCP85C5CN_ne120_t12_pop62.c13b17.asdphys.001
    start_time:    This dataset was created on 2016-03-14 at 05:32:30.3
    Conventions:   CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netcdf/CF-curren...
    source:        CCSM POP2, the CCSM Ocean Component
    cell_methods:  cell_methods = time: mean ==> the variable values are aver...
    calendar:      All years have exactly 365 days.
    history:       none
    contents:      Diagnostic and Prognostic Variables
    revision:      $Id: tavg.F90 56176 2013-12-20 18:35:46Z mlevy@ucar.edu $
```

This is roughly 45 years of daily data, one file per year.

Instead, if I just change decode_cf=True (the default), it takes forever. I can monitor what is happening via the distributed dashboard. It looks like this:

[screenshot of the distributed dashboard, showing a long stream of open_dataset tasks]

There are many more of these open_dataset tasks than there are files (45), so I can only presume there are 16401 individual tasks (one for each timestep), each of which takes about 1 s in serial.

This is a real failure of lazy decoding. Maybe it can be fixed by #1725, possibly related to #1372.

cc Pangeo folks: @jhamman, @mrocklin

{
    "total_count": 2,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 2,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
297494539 https://github.com/pydata/xarray/issues/1385#issuecomment-297494539 https://api.github.com/repos/pydata/xarray/issues/1385 MDEyOklzc3VlQ29tbWVudDI5NzQ5NDUzOQ== rabernat 1197350 2017-04-26T18:07:03Z 2017-04-26T18:07:03Z MEMBER

cc: @geosciz, who is helping with this project.



CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
Powered by Datasette · Queries took 963.846ms · About: xarray-datasette