github: issue_comments: 19 rows where issue = 288184220 sorted by updated_at descending

id	html_url	issue_url	node_id	user	created_at	updated_at	author_association	body	reactions	issue
768627652	https://github.com/pydata/xarray/issues/1823#issuecomment-768627652	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDc2ODYyNzY1Mg==	dcherian 2448579	2021-01-27T22:43:59Z	2021-01-27T22:43:59Z	MEMBER	That's 34k 3MB files! I suggest combining to 1k 100MB files, that would work a lot better.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
768600657	https://github.com/pydata/xarray/issues/1823#issuecomment-768600657	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDc2ODYwMDY1Nw==	Hossein-Madadi 9200184	2021-01-27T21:51:24Z	2021-01-27T21:52:11Z	CONTRIBUTOR	PS @rabernat `%%time ds = xr.open_mfdataset("/glade/p/cesm/community/ASD-HIGH-RES-CESM1/hybrid_v5_rel04_BC5_ne120_t12_pop62/ocn/proc/tseries/monthly/*.nc", parallel=True, coords="minimal", data_vars="minimal", compat='override')` This completes in 40 seconds with 10 workers on cheyenne. @dcherian, thanks for your solution. In my experience with 34013 NetCDF files, I could open 117 GiB in 13 min 14 s. Can I decrease this time?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
768460310	https://github.com/pydata/xarray/issues/1823#issuecomment-768460310	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDc2ODQ2MDMxMA==	dcherian 2448579	2021-01-27T17:50:09Z	2021-01-27T17:50:09Z	MEMBER	Let's close this since there is an opt-in mostly-fast path. I've added an item to #4648 to cover adding an asv benchmark for mfdataset.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
531945252	https://github.com/pydata/xarray/issues/1823#issuecomment-531945252	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDUzMTk0NTI1Mg==	jbusecke 14314623	2019-09-16T20:29:35Z	2019-09-16T20:29:35Z	CONTRIBUTOR	Wooooow. Thanks. I'll have to give this a whirl soon.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
531913598	https://github.com/pydata/xarray/issues/1823#issuecomment-531913598	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDUzMTkxMzU5OA==	dcherian 2448579	2019-09-16T19:03:47Z	2019-09-16T19:03:47Z	MEMBER	PS @rabernat `%%time ds = xr.open_mfdataset("/glade/p/cesm/community/ASD-HIGH-RES-CESM1/hybrid_v5_rel04_BC5_ne120_t12_pop62/ocn/proc/tseries/monthly/*.nc", parallel=True, coords="minimal", data_vars="minimal", compat='override')` This completes in 40 seconds with 10 workers on cheyenne.	{ "total_count": 3, "+1": 0, "-1": 0, "laugh": 0, "hooray": 1, "confused": 0, "heart": 0, "rocket": 2, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
531912893	https://github.com/pydata/xarray/issues/1823#issuecomment-531912893	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDUzMTkxMjg5Mw==	dcherian 2448579	2019-09-16T19:01:57Z	2019-09-16T19:01:57Z	MEMBER	=) @TomNicholas PRs welcome!	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
531905844	https://github.com/pydata/xarray/issues/1823#issuecomment-531905844	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDUzMTkwNTg0NA==	TomNicholas 35968931	2019-09-16T18:43:52Z	2019-09-16T18:43:52Z	MEMBER	This is big if true! But surely to close an issue raised by complaints about speed, we should really have some new asv speed tests?	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
531816800	https://github.com/pydata/xarray/issues/1823#issuecomment-531816800	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDUzMTgxNjgwMA==	dcherian 2448579	2019-09-16T15:00:16Z	2019-09-16T15:00:16Z	MEMBER	YES! (well, almost) The PR lets you skip compatibility checks. The magic spell is `xr.open_mfdataset(..., data_vars="minimal", coords="minimal", compat="override")` You can skip index comparison by adding `join="override"`. What's left is extremely large indexes and lazy index / coordinate loading, but we have #2039 open for that. I will rename that issue. If you have time, can you test it out?	{ "total_count": 2, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 1, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
531813935	https://github.com/pydata/xarray/issues/1823#issuecomment-531813935	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDUzMTgxMzkzNQ==	rabernat 1197350	2019-09-16T14:53:57Z	2019-09-16T14:53:57Z	MEMBER	Is this issue really closed?!? 🎉🎂🏆🥇	{ "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 1, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
489135792	https://github.com/pydata/xarray/issues/1823#issuecomment-489135792	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDQ4OTEzNTc5Mg==	dcherian 2448579	2019-05-03T15:29:14Z	2019-05-03T15:40:27Z	MEMBER	One common use-case is files with large numbers of `concat_dim`-invariant non-dimensional co-ordinates. This is easy to speed up by dropping those variables from all but the first file. e.g. https://github.com/pangeo-data/esgf2xarray/blob/6a5e4df0d329c2f23b403cbfbb65f0f1dfa98d52/esgf2zarr/aggregate.py#L107-L110 `python # keep only coordinates from first ensemble member to simplify merge first = member_dsets_aligned[0] rest = [mds.reset_coords(drop=True) for mds in member_dsets_aligned[1:]] objs_to_concat = [first] + rest` Similarly https://github.com/NCAR/intake-esm/blob/e86a8e8a80ce0fd4198665dbef3ba46af264b5ea/intake_esm/aggregate.py#L53-L57 `python def merge_vars_two_datasets(ds1, ds2): """ Merge two datasets, dropping all variables from second dataset that already exist in the first dataset's coordinates. """` See also #2039 (second code block) One way to do this might be to add a `master_file` kwarg to `open_mfdataset`. This would imply `coords='minimal', join='exact'` (I think; `prealigned=True` in some other proposals) and would drop non-dimensional coordinates from all but the first file and then call concat. As bonus it would assign attributes from the `master_file` to the merged dataset (for which I think there are open issues) : this functionality exists in `netCDF4.MFDataset` so that's a plus. EDIT: #2039 (third code block) is also a possibility. This might look like `python xr.open_mfdataset('files*.nc', master_file='first', concat_dim='time')` in which case the first file is read; all coords that are not `concat_dim` become `drop_variables` for an `open_dataset` call that reads the remaining files. We then merge with the first dataset and assign attrs. 
EDIT2: `master_file` combines two different functionalities here: specifying a "template file" and a file to choose attributes from. So maybe we need two kwargs: `template_file` and `attrs_from`?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
489101053	https://github.com/pydata/xarray/issues/1823#issuecomment-489101053	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDQ4OTEwMTA1Mw==	rabernat 1197350	2019-05-03T13:47:12Z	2019-05-03T13:47:12Z	MEMBER	So I think it is quite important to consider this issue together with #2697. An XML specification called NcML already exists which tells software how to put together multiple netCDF files into a single virtual netCDF. We should leverage this existing spec as much as possible. A realistic use case for me is that I have, say, 1000 files of high-res model output, each with large coordinate variables, all generated from the same model run. For files for which we know a priori that certain coordinates (dimension coordinates or otherwise) are identical, we could save a lot of disk reads (the slow part of `open_mfdataset`) by never reading those coordinates at all. Enabling this would require a pretty low-level change in xarray. For example, we couldn't even rely on `open_dataset` in its current form to open files, because `open_dataset` eagerly loads all dimension coordinates into indexes. One way forward might be to create a new Store class. For a catalog of tricks I use to optimize opening these sorts of big, complex, multi-file datasets (e.g. CMIP), check out https://github.com/pangeo-data/esgf2xarray/blob/master/esgf2zarr/aggregate.py	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
489064553	https://github.com/pydata/xarray/issues/1823#issuecomment-489064553	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDQ4OTA2NDU1Mw==	j08lue 3404817	2019-05-03T11:26:06Z	2019-05-03T11:36:44Z	CONTRIBUTOR	The original issue of this thread is that you sometimes might want to disable alignment checks for coordinates other than the `concat_dim` and only check for same dimensions and dimension shapes. When you `xr.merge` with `join='exact'`, it still checks for alignment (see https://github.com/pydata/xarray/pull/1330#issuecomment-302711852), but does not join the coordinates if they are not aligned. This behavior (not joining) is also included in what @rabernat envisioned here, but his suggestion goes beyond that: you don't even load coordinate values from all but the first dataset and just blindly trust that they are aligned. So `xr.open_mfdataset(join='exact', coords='minimal')` does not fix this issue here, I think.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
489027263	https://github.com/pydata/xarray/issues/1823#issuecomment-489027263	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDQ4OTAyNzI2Mw==	TomNicholas 35968931	2019-05-03T09:25:00Z	2019-05-03T09:25:00Z	MEMBER	@dcherian I'm sorry, I'm very interested in this but after reading the issues I'm still not clear on what's being proposed: What exactly is the bottleneck? Is it reading the coords from all the files? Is it loading the coord values into memory? Is it performing the alignment checks on those coords once they're in memory? Is it performing alignment checks on the dimensions? Is this suggestion relevant to datasets that don't have any coords? Which of these steps would a `join='exact'` option omit? A related optimization would be to allow the user to pass coords='minimal' (or other concat coords options) via open_mfdataset. But this is already an option to `open_mfdataset`?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
488440840	https://github.com/pydata/xarray/issues/1823#issuecomment-488440840	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDQ4ODQ0MDg0MA==	dcherian 2448579	2019-05-01T21:42:01Z	2019-05-01T21:45:38Z	MEMBER	I am currently motivated to fix this. Over in https://github.com/pydata/xarray/pull/1413#issuecomment-302843502 @rabernat mentioned allowing the user to pass join='exact' via open_mfdataset. A related optimization would be to allow the user to pass coords='minimal' (or other concat coords options) via open_mfdataset. @shoyer suggested calling decode_cf later here, though perhaps this won't help too much: https://github.com/pydata/xarray/issues/1385#issuecomment-439263419 Is this all that we can do on the xarray side?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
373123959	https://github.com/pydata/xarray/issues/1823#issuecomment-373123959	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDM3MzEyMzk1OQ==	jbusecke 14314623	2018-03-14T18:16:38Z	2018-03-14T18:16:38Z	CONTRIBUTOR	Awesome, thanks for the clarification. I just looked at #1981 and it seems indeed very elegant (in fact I just now used this approach to parallelize printing of movie frames!) Thanks for that!	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
372862174	https://github.com/pydata/xarray/issues/1823#issuecomment-372862174	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDM3Mjg2MjE3NA==	jhamman 2443309	2018-03-14T00:13:34Z	2018-03-14T00:13:34Z	MEMBER	@jbusecke - No. These options are not mutually exclusive. The parallel open is, in my opinion, the lowest hanging fruit so that's why I started there. There are other improvements that we can tackle incrementally.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
372856076	https://github.com/pydata/xarray/issues/1823#issuecomment-372856076	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDM3Mjg1NjA3Ng==	jbusecke 14314623	2018-03-13T23:40:54Z	2018-03-13T23:40:54Z	CONTRIBUTOR	Would these two options be necessarily mutually exclusive? I think parallelizing the read in sounds amazing. But isn't there some merit in skipping some of the checks altogether, if the user is sure about the structure of the data contained in the many files? I am often working with the aforementioned type of data (many files either contain a new timestep or a different variable, but most of the dimensions/coordinates are the same). In some cases I am finding that reading the data "lazily" consumes a significant amount of the time in my workflow. I am unsure how hard this would be to achieve, and perhaps it is not worth it after all. Just putting out a few ideas, while I wait for my `xr.open_mfdataset` to finish :-)	{ "total_count": 1, "+1": 0, "-1": 0, "laugh": 1, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
359069753	https://github.com/pydata/xarray/issues/1823#issuecomment-359069753	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDM1OTA2OTc1Mw==	jbusecke 14314623	2018-01-19T19:45:00Z	2018-01-19T19:45:00Z	CONTRIBUTOR	I did not really find an elegant solution. What I did was just specify all dims and coords as `drop_variables` and then update those from a master file with `ds.update(ds_master)` Perhaps this could be generalized in a sense, by reading all coords and dims just from the first file.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
357336022	https://github.com/pydata/xarray/issues/1823#issuecomment-357336022	https://api.github.com/repos/pydata/xarray/issues/1823	MDEyOklzc3VlQ29tbWVudDM1NzMzNjAyMg==	jhamman 2443309	2018-01-12T19:46:12Z	2018-01-12T19:46:12Z	MEMBER	@rabernat - Depending on the structure of the dataset, another possibility that would speed up some `open_mfdataset` tasks substantially is to implement the step of opening each file and getting its metadata in some parallel way (dask/joblib/etc.) and either returning just the dataset schema or a picklable version of the dataset itself. I think this will only be able to work with `autoclose=True` but it could be quite useful when working with many files.	{ "total_count": 3, "+1": 3, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	We need a fast path for open_mfdataset 288184220
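
The file-size arithmetic behind the latest exchange in the thread (34013 NetCDF files totalling 117 GiB, with the suggestion to combine them into roughly 1k files of 100 MB) can be checked with a short sketch. The file count and total size are taken from the comments above; reading "100 MB" as 100 * 10**6 bytes is an assumption, and the helper name `consolidation_plan` is hypothetical, not part of xarray.

```python
import math

GIB = 1024 ** 3  # bytes per GiB
MB = 10 ** 6     # bytes per MB (decimal units; an assumption about what was meant)


def consolidation_plan(n_files, total_bytes, target_bytes):
    """Return (mean bytes per file, number of target files, source files per target)."""
    mean_size = total_bytes / n_files
    # Number of output files needed if each holds at most target_bytes.
    n_targets = math.ceil(total_bytes / target_bytes)
    # Source files that must be grouped into each output file.
    files_per_target = math.ceil(n_files / n_targets)
    return mean_size, n_targets, files_per_target


mean_size, n_targets, files_per_target = consolidation_plan(
    n_files=34013, total_bytes=117 * GIB, target_bytes=100 * MB
)
print(f"mean file size:   {mean_size / 1024 ** 2:.1f} MiB")
print(f"target files:     {n_targets}")
print(f"files per target: {files_per_target}")
```

Under these assumptions the mean file is about 3.5 MiB, which matches the "34k 3MB files" remark, and grouping a few dozen source files per output file lands in the neighbourhood of the suggested 1k files.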

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
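
As a rough illustration of how the view at the top of this page maps onto the schema, the following sketch recreates the table in an in-memory SQLite database and runs the equivalent query (rows for one issue, sorted by `updated_at` descending). Only three rows are loaded, with ids, user ids, and timestamps copied from the table above; the comment bodies are left out, and the variable names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same columns as the schema above. SQLite does not enforce the REFERENCES
# clauses unless foreign keys are switched on, so the missing users/issues
# tables are not a problem for this sketch.
conn.executescript(
    """
    CREATE TABLE [issue_comments] (
       [html_url] TEXT,
       [issue_url] TEXT,
       [id] INTEGER PRIMARY KEY,
       [node_id] TEXT,
       [user] INTEGER REFERENCES [users]([id]),
       [created_at] TEXT,
       [updated_at] TEXT,
       [author_association] TEXT,
       [body] TEXT,
       [reactions] TEXT,
       [performed_via_github_app] TEXT,
       [issue] INTEGER REFERENCES [issues]([id])
    );
    CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]);
    """
)

# (id, user, created_at, updated_at, author_association, issue)
sample_rows = [
    (357336022, 2443309, "2018-01-12T19:46:12Z", "2018-01-12T19:46:12Z", "MEMBER", 288184220),
    (768600657, 9200184, "2021-01-27T21:51:24Z", "2021-01-27T21:52:11Z", "CONTRIBUTOR", 288184220),
    (768627652, 2448579, "2021-01-27T22:43:59Z", "2021-01-27T22:43:59Z", "MEMBER", 288184220),
]
conn.executemany(
    "INSERT INTO issue_comments "
    "(id, user, created_at, updated_at, author_association, issue) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    sample_rows,
)

# ISO 8601 UTC timestamps stored as TEXT sort chronologically as plain strings.
rows = conn.execute(
    "SELECT id, updated_at FROM issue_comments "
    "WHERE issue = ? ORDER BY updated_at DESC",
    (288184220,),
).fetchall()
ordered_ids = [row[0] for row in rows]
print(ordered_ids)
```

The `WHERE issue = ?` filter is the lookup that `idx_issue_comments_issue` exists to serve; sorting on the text timestamps works here only because every value uses the same UTC `...Z` format.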