
issue_comments: 520741706


html_url: https://github.com/pydata/xarray/issues/3213#issuecomment-520741706
issue_url: https://api.github.com/repos/pydata/xarray/issues/3213
id: 520741706
node_id: MDEyOklzc3VlQ29tbWVudDUyMDc0MTcwNg==
user: 1634164
created_at: 2019-08-13T08:31:30Z
updated_at: 2019-08-13T08:31:30Z
author_association: NONE

This is very exciting! In energy-economic research (unlike, e.g., earth systems research), data are almost always sparse, so first-class sparse support will be broadly useful.

I'm leaving a comment here (since this seems to be a meta-issue; please link from wherever else, if needed) with two example use-cases. For the moment, #3206 seems to cover them, so I can't name any specific additional features.

  1. MESSAGEix is an energy-systems optimization model framework, formulated as a linear program.
    • Some variables have many dimensions; for instance, the input coefficient for a technology has the dimensions (node_loc, technology, year_vintage, year_active, mode, node_origin, commodity, level, time, time_origin).
    • In the global version of our model, the technology dimension has over 400 labels.
    • Often two or more dimensions are tied, e.g. technology='coal power plant' will only take input from (commodity='coal', level='primary energy'); all other combinations of (commodity, level) are empty for this technology.
    • So, these data are inherently sparse.
    • For modeling research, specifying quantities in this way is a good design because (a) it is intuitive to researchers in this domain, and (b) the optimization model is solved using various LP solvers via GAMS, which automatically prune zero rows in the resulting matrices.
    • When we were developing a dask/DAG-based system for post-processing model results, we wanted to use xarray, but had some quantities with tens of millions of elements that were less than 1% full. Here is some test code that triggered MemoryErrors using xarray. We chose to fall back on using a pd.Series subclass that mocks xarray methods.
  2. In transportation research, stock models of vehicle fleets are often used.

    • These models always have at least two time dimensions: cohort (the time period in which a vehicle was sold) and period(s) in which it is used (and thus consumes fuel, etc.).
    • Since a vehicle sold in 2020 can't be used in 2015, these data are always triangular w.r.t. these two dimensions. (The dimensions year_vintage and year_active in example #1 above have the same relationship.)
    • Once multiplied by other dimensions (technology; fuel; size or shape or market segment; embodied materials; different variables; model runs across various scenarios or input assumptions) the overhead of dense arrays can become problematic.
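A rough sketch of the sparsity described above, assuming invented dimension sizes (only the ~400 technology labels and the "<1% full" figure come from the comment; everything else is illustrative). It uses a plain MultiIndexed pd.Series, in the spirit of the fallback mentioned in example 1, to hold only the populated triangular (cohort, period) cells, and then estimates the footprint a fully dense array would need:

```python
import numpy as np
import pandas as pd

# Triangular (cohort, period) pairs from the vehicle-stock example:
# a vehicle sold in `cohort` can only be used in periods >= cohort.
years = list(range(2000, 2051, 5))  # 11 hypothetical five-year periods
index = pd.MultiIndex.from_tuples(
    [(c, p) for c in years for p in years if p >= c],
    names=["cohort", "period"],
)
# Sparse-style storage: the Series holds only the valid cells.
stock = pd.Series(1.0, index=index)

dense_cells = len(years) ** 2
print(len(stock), dense_cells)  # 66 populated cells vs. 121 dense cells
# Density is ~0.55 with two dimensions; each additional dense dimension
# multiplies the dense size but not necessarily the populated count,
# so density falls quickly.

# Scale sketch for a many-dimensional coefficient (all sizes except
# technology's ~400 labels are hypothetical):
dims = {"node_loc": 12, "technology": 400, "year_vintage": 28,
        "year_active": 28, "mode": 2, "commodity": 10}
dense_elements = int(np.prod(list(dims.values()), dtype=np.int64))
print(f"{dense_elements:,} float64 cells = "
      f"{dense_elements * 8 / 2**20:,.0f} MiB dense")
# → tens of millions of cells; at <1% density, under ~1 million carry data.
```

This is only a back-of-envelope illustration of why a dense xarray representation hits MemoryErrors while a sparse one stays small.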
reactions: none (all counts 0)
issue: 479942077