github: issue_comments: 54 rows where issue = 479942077 sorted by updated

54 rows where issue = 479942077 sorted by updated_at descending

Search:

descending

id	html_url	issue_url	node_id	user	created_at	updated_at ▲	author_association	body	reactions	issue
1544952425	https://github.com/pydata/xarray/issues/3213#issuecomment-1544952425	https://api.github.com/repos/pydata/xarray/issues/3213	IC_kwDOAMm_X85cFhpp	jbbutler 41593244	2023-05-12T01:01:21Z	2023-05-12T01:01:21Z	NONE	Thank you all so much for the feedback and resources! I agree (1) testing the limits of xArray's API compatibility with sparse and (2) developing some documentation for what is/isn't supported are great places to start, so I'll get on that while I think about the other I/O issues (serialization, etc.)	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
1534724554	https://github.com/pydata/xarray/issues/3213#issuecomment-1534724554	https://api.github.com/repos/pydata/xarray/issues/3213	IC_kwDOAMm_X85begnK	rabernat 1197350	2023-05-04T12:51:59Z	2023-05-04T12:51:59Z	MEMBER	I suspect (but don't know, as I'm just a user of xarray, not a developer) that it's also not thoroughly tested. Existing sparse testing is here: https://github.com/pydata/xarray/blob/main/xarray/tests/test_sparse.py We would welcome enhancements to this!	{ "total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
1534695467	https://github.com/pydata/xarray/issues/3213#issuecomment-1534695467	https://api.github.com/repos/pydata/xarray/issues/3213	IC_kwDOAMm_X85beZgr	khaeru 1634164	2023-05-04T12:31:22Z	2023-05-04T12:31:22Z	NONE	That's a totally valid scope limitation for the sparse package, and I understand the motivation. I'm just saying that the principle of least astonishment is not being followed: the user cannot at the moment read either the xarray or sparse docs and know which portions of the xarray API will work when giving `…, sparse=True`, and which instead require a deliberate choice to densify, or see examples of how best to mix the two. It would be helpful to clarify—that's all.	{ "total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
1534238962	https://github.com/pydata/xarray/issues/3213#issuecomment-1534238962	https://api.github.com/repos/pydata/xarray/issues/3213	IC_kwDOAMm_X85bcqDy	hameerabbasi 2190658	2023-05-04T07:47:04Z	2023-05-04T07:47:04Z	MEMBER	Speaking a bit to things like `cumprod`, it's hard to support those natively with sparse data structures in many cases (at least as things stand in the current Numba framework). While that doesn't apply in the case of `cumprod`, PyData/Sparse also has a policy that if the best algorithm available is a dense one, we simply raise an error, and the user should densify explicitly to avoid filling all available RAM or getting obscure `MemoryError`s.	{ "total_count": 3, "+1": 3, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
1534231523	https://github.com/pydata/xarray/issues/3213#issuecomment-1534231523	https://api.github.com/repos/pydata/xarray/issues/3213	IC_kwDOAMm_X85bcoPj	khaeru 1634164	2023-05-04T07:40:26Z	2023-05-04T07:40:26Z	NONE	@jbbutler please also see this comment et seq. https://github.com/pydata/sparse/issues/1#issuecomment-792342987 and related pydata/sparse#438. To add to @rabernat's point about sparse support being "not well documented", I suspect (but don't know, as I'm just a user of xarray, not a developer) that it's also not thoroughly tested. I expected to be able to use e.g. `DataArray.cumprod` when the underlying data was sparse, but could not. IMHO, I/O to/from sparse-backed objects is less valuable if only a small subset of xarray functionality is available on those objects. Perhaps explicitly testing/confirming which parts of the API do/do not currently work with sparse would support the improvements to the docs that Ryan mentioned, and reveal the work remaining to provide full(er) support.	{ "total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
1534001190	https://github.com/pydata/xarray/issues/3213#issuecomment-1534001190	https://api.github.com/repos/pydata/xarray/issues/3213	IC_kwDOAMm_X85bbwAm	rabernat 1197350	2023-05-04T02:36:57Z	2023-05-04T02:36:57Z	MEMBER	Hi @jdbutler and welcome! We would welcome this sort of contribution eagerly. I would characterize our current support of sparse arrays as really just a proof of concept. When to use sparse and how to do it effectively is not well documented. Simply adding more documentation around the already-supported use cases would be a great place to start IMO. My own exploration of this are described in this Pangeo post. The use case is regridding. It touches on quite a few of the points you're interested in, in particular the integration with geodataframe. Along similar lines, @dcherian has been working on using opt_einsum together with sparse in https://github.com/pangeo-data/xESMF/issues/222#issuecomment-1524041837 and https://github.com/pydata/xarray/issues/7764. I'd also suggest catching up on what @martinfleis is doing with vector data cubes in xvec. (See also Pangeo post on this topic.) Of the three topics you enumerated, I'm most interested in the serialization one. However, I'd rather see serialization of sparse arrays prototyped in Zarr, as its much more conducive to experimentation than NetCDF (which requires writing C to do anything custom). I would recommend exploring serialization from a sparse array in memory to a sparse format on disk via a custom codec. Zarr recently added support for a `meta_array` parameter that determines what array type is materialized by the codec pipeline (see https://github.com/zarr-developers/zarr-python/pull/1131). The use case there was loading data direct to GPU. In a way sparse is similar--it's an array container that is not numpy or dask.	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
1533842816	https://github.com/pydata/xarray/issues/3213#issuecomment-1533842816	https://api.github.com/repos/pydata/xarray/issues/3213	IC_kwDOAMm_X85bbJWA	jbbutler 41593244	2023-05-03T22:40:32Z	2023-05-03T22:40:32Z	NONE	Hi all! As part of a research project, I'm looking to contribute to xArray's sparse capabilities, with an emphasis on sparse support for use-cases in the geosciences. I'm wondering if anyone in the geosciences (or adjacent disciplines!) has encountered problems with xArray's current level of sparse support, and what kinds of improvements they'd like to see to address those issues. From playing around, it seems the current strategy of backing DataArrays with COO sparse arrays takes care of a lot of use cases, but I have the following ideas that may (or may not) be useful to implement further: Functions/methods to convert from geopandas GeoDataframes of vector data to rasterized, potentially sparse ndarrays in an xArray Dataset/DataArray (reverse direction too); this is related to the issue of converting from sparse arrays back to multi-indexed pandas objects at the top of this issue (which I believe has yet to be solved) Loading sparse data from a netcdf() file directly into a Dataset/Array backed by sparse ndarray(s) (seems like the only way to get sparse backings is to either unstack or call '.from_dataframe()/series()' with the sparse flag set to True?) Support for other sparse array conventions (for ex, GCXS in the sparse package for better memory efficiency; I can't find any improvements to make on the current COO backing in terms of supported arithmetic operations, merges/joins, etc.) I'd appreciate any feedback on these ideas, as well as any other things that would be nice to have implemented!	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
1014462537	https://github.com/pydata/xarray/issues/3213#issuecomment-1014462537	https://api.github.com/repos/pydata/xarray/issues/3213	IC_kwDOAMm_X848d3hJ	Material-Scientist 40465719	2022-01-17T12:20:18Z	2022-01-17T12:20:18Z	NONE	I know. But having sparse data I can treat as if it were dense allows me to unstack without running out of memory, and then ffill & downsample the data in chunks: It would be nice if xarray automatically converted the data from sparse back to dense for doing operations on the chunks just like pandas does. The picture shows that I'm already using nbytes to determine the size.	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
1014383681	https://github.com/pydata/xarray/issues/3213#issuecomment-1014383681	https://api.github.com/repos/pydata/xarray/issues/3213	IC_kwDOAMm_X848dkRB	hameerabbasi 2190658	2022-01-17T10:48:48Z	2022-01-17T10:48:48Z	MEMBER	For `ffill` specifically, you would get a dense array out anyway, so there's no point to keeping it sparse, unless one did something like run-length-encoding or similar. As for the `size` issue, PyData/Sparse provides the `nbytes` attribute which could be helpful in determining size.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
1013887301	https://github.com/pydata/xarray/issues/3213#issuecomment-1013887301	https://api.github.com/repos/pydata/xarray/issues/3213	IC_kwDOAMm_X848brFF	Material-Scientist 40465719	2022-01-16T14:35:29Z	2022-01-16T14:40:13Z	NONE	I would prefer to retain the dense representation, but with tricks to keep the data of sparse type in memory. Look at the following example with pandas multiindex & sparse dtype: The dense data uses ~40 MB of memory, while the dense representation with sparse dtypes uses only ~0.5 kB of memory! And while you can import dataframes with the sparse=True keyword, the size seems to be displayed inaccurately (both are the same size?), and we cannot examine the data like we can with pandas multiindex + sparse dtype: Besides, a lot of operations are not available on sparse xarray data variables (i.e. if I wanted to group by price level for ffill & downsampling): So, it would be nice if xarray adopted pandas’ approach of unstacking sparse data. In the end, you could extract all the non-NaN values and write them to a sparse storage format, such as TileDB sparse arrays. cc: @stavrospapadopoulos	{ "total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
634558423	https://github.com/pydata/xarray/issues/3213#issuecomment-634558423	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDYzNDU1ODQyMw==	SimonHeybrock 12912489	2020-05-27T10:00:25Z	2021-10-15T04:38:25Z	NONE	@pnsaevik If the approach we adopt in scipp could be ported to xarray you would be able to to something like (assuming that the ragged array representation you have in mind is "list of lists"): ```python data = my_load_netcdf(...) # list of lists assume 'x' is the dimension of the nested lists bin_edges = sc.Variable(dims=['x'], values=[0.1,0.3,0.5,0.7,0.9]) realigned = sc.realign(data, {'x':bin_edges}) filtered = realigned['x', 1:3].copy() my_store_netcdf(filtered.unaligned, ...) ``` Basically, we have slicing for the "realigned" wrapper. It performs a filter operation when copied. Edit 2021: Above example is very outdated, we have cleaned up the mechanism, see https://scipp.github.io/user-guide/binned-data/binned-data.html.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
632536798	https://github.com/pydata/xarray/issues/3213#issuecomment-632536798	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDYzMjUzNjc5OA==	SimonHeybrock 12912489	2020-05-22T07:20:35Z	2021-10-15T04:36:17Z	NONE	I am not familiar with the details of the various applications people in this discussion have, but here is an approach we are taking, trying to solve variations of the problem "data scattered in multi-dimensional space" or irregular time-series data. See https://scipp.github.io/user-guide/binned-data/binned-data.html for an illustrated description. The basic idea is to keep data in a linear representation and wrap it in a "realigned" wrapper. One reason for this development was to provide a pathway to use dask with our type of data (independent time series at a large number of points in space, with chunking along the "time-series", which is not a dimension since every time series has a different length). With the linked approach we could use dask to distribute the linear underlying representation, keeping the lightweight realigned wrapper on all workers. We are still in early experimentation with this (the dask part is not actually in development yet). It probably has performance issues if more than "millions" of points are realigned --- our case is millions of time series with thousands/millions of time points in each, but the two do not mix (not both are realigned, and if they are it is independently), so we do not run into the performance issue in most cases. In principle I could imagine this non-destructive realignment approach could be mapped to xarray, so it may be of interest to people here.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
943518935	https://github.com/pydata/xarray/issues/3213#issuecomment-943518935	https://api.github.com/repos/pydata/xarray/issues/3213	IC_kwDOAMm_X844PPTX	scottgigante-immunai 84813314	2021-10-14T16:26:21Z	2021-10-14T16:26:21Z	NONE	Thanks so much! Appreciate it.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
943517731	https://github.com/pydata/xarray/issues/3213#issuecomment-943517731	https://api.github.com/repos/pydata/xarray/issues/3213	IC_kwDOAMm_X844PPAj	keewis 14808389	2021-10-14T16:25:04Z	2021-10-14T16:25:04Z	MEMBER	that's mostly an oversight, I think. However, to be really useful we'd need to get a `sparse-xarray` library which makes working with `sparse` and `xarray` easier (going from dense to sparse or the reverse still requires something like `da.copy(data=sparse.COO.from_numpy(da.data))`, which is not user-friendly). Anyways, the docs you're looking for is working with numpy-like arrays, even though there's no explicit mention of `sparse` there, either.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
943504365	https://github.com/pydata/xarray/issues/3213#issuecomment-943504365	https://api.github.com/repos/pydata/xarray/issues/3213	IC_kwDOAMm_X844PLvt	scottgigante-immunai 84813314	2021-10-14T16:10:10Z	2021-10-14T16:10:10Z	NONE	According to test_sparse.py it looks like XArray already supports sparse, even though the XArray docs doesn't mention this support. Can we expect `sparse` to gain first-class support in future? Or is the absence of mention in the docs a hint we shouldn't rely on this functionality?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
634551055	https://github.com/pydata/xarray/issues/3213#issuecomment-634551055	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDYzNDU1MTA1NQ==	pnsaevik 12728107	2020-05-27T09:44:55Z	2020-05-27T09:44:55Z	NONE	Thanks for looking into sparse arrays for xarray. I have a use case I believe would be common: Load a netCDF file written using ragged array representation Extract a slice in either coordinate direction Store back into netCDF At least I would love such a functionality...	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
615772303	https://github.com/pydata/xarray/issues/3213#issuecomment-615772303	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDYxNTc3MjMwMw==	hameerabbasi 2190658	2020-04-18T08:41:39Z	2020-04-18T08:41:39Z	MEMBER	Hi. Yes, it’d be nice if we had a meta issue I could then open separate issues for for sllearn implementations. Performance is not ideal, and I realise that. However I’m working on a more generic solution to performance as I type.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
615501070	https://github.com/pydata/xarray/issues/3213#issuecomment-615501070	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDYxNTUwMTA3MA==	mrocklin 306380	2020-04-17T23:08:18Z	2020-04-17T23:08:18Z	MEMBER	@amueller have you all connected with @hameerabbasi ? I'm not surprised to hear that there are performance issues with pydata/sparse relative to scipy.sparse, but Hameer has historically been pretty open to working to resolve issues quickly. I'm not sure if there is already an ongoing conversation between the two groups, but I'd recommend replacing "we've chosen not to use pydata/sparse because it isn't feature complete enough for us" with "in order for us to use pydata/sparse we would need the following features".	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
615500990	https://github.com/pydata/xarray/issues/3213#issuecomment-615500990	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDYxNTUwMDk5MA==	amueller 449558	2020-04-17T23:07:57Z	2020-04-17T23:07:57Z	NONE	@shoyer thanks! Mostly spitballing here, but it's interesting to know that 2) would be the bigger problem in your opinion, I had assumed 1) would be the main issue. That raises the question whether it's easier to wrap `scipy.sparse` in a duck array, or to make `pydata/sparse` a viable solution for sklearn.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
615499609	https://github.com/pydata/xarray/issues/3213#issuecomment-615499609	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDYxNTQ5OTYwOQ==	shoyer 1217238	2020-04-17T23:01:15Z	2020-04-17T23:01:15Z	MEMBER	Wrapping `scipy.sparse` in xarray would present two challenges: It only supports 2D arrays, which feels awkward for a library focused on N-dimensional data. There is no existing "duck array" compatibility layer (i.e., `__array_function__`) that makes scipy.sparse matrices work like NumPy arrays (in fact, they actually are designed to mimic the deprecated np.matrix). (2) is the biggest challenge. I don't want to maintain that compatibility layer inside xarray, but if it existed we would be happy to try using it. pydata/sparse solves both these problems, though again indeed it only has quite limited data structures.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
615497160	https://github.com/pydata/xarray/issues/3213#issuecomment-615497160	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDYxNTQ5NzE2MA==	amueller 449558	2020-04-17T22:51:09Z	2020-04-17T22:51:09Z	NONE	Small comment from #3981: sklearn has just started running benchmarks, but it looks like pydata/sparse is not feature complete enough for us to use. We might be interested in having scipy.sparse support in xarray. There are two problems with scipy.sparse for us as far as I can see (this is very preliminary): it only has COO, which is not good for us, and ideally we'd want to avoid memory copies whenever we want to use xarray, and I think going from scipy.sparse to pydata/sparse will involve memory copies, even if pydata/sparse adds other formats.	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
597825416	https://github.com/pydata/xarray/issues/3213#issuecomment-597825416	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDU5NzgyNTQxNg==	fmfreeze 18172466	2020-03-11T19:29:31Z	2020-03-11T19:29:31Z	NONE	Concatenating multiple lazy, differently sized xr.DataArrays - each wrapping a sparse.COO by xr.apply_ufunc(sparse.COO, ds, dask='parallelized') as @crusaderky suggested - results again in an xr.DataArray, whose wrapped dask array chunks are mapped to numpy arrays: `<xarray.DataArray 'myDataset' (cycle: 10, time: 8000000)> dask.array<concatenate, shape=(10, 8000000), dtype=float64, chunksize=(1, 5273216), chunktype=numpy.ndarray> Coordinates: * time (time) float64 0.0 5e-07 1e-06 1.5e-06 2e-06 ... 4.0 4.0 4.0 4.0 * cycle (cycle) int64 1 2 3 4 5 6 7 8 9 10` But also when mapping the resulting, concatenated DataArray to sparse.COO afterwards, my main goal - scalable serialization of a lazy xarray - cannot be achieved. So one suggestion to @shoyer original question: It would be great, if sparse, but still lazy DataArrays/Datasets could be serialized without the data-overhead itself. Currently, that seems to work only for DataArrays which are merged/aligned by DataArrays of the same shape.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
592476821	https://github.com/pydata/xarray/issues/3213#issuecomment-592476821	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDU5MjQ3NjgyMQ==	crusaderky 6213168	2020-02-28T11:39:50Z	2020-02-28T11:39:50Z	MEMBER	xr.apply_ufunc(sparse.COO, ds, dask='parallelized')	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
591388766	https://github.com/pydata/xarray/issues/3213#issuecomment-591388766	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDU5MTM4ODc2Ng==	fmfreeze 18172466	2020-02-26T11:54:40Z	2020-02-26T11:54:40Z	NONE	Thank you @crusaderky, unfortunately some obstacles appeared using your loading technique. As thousands of .h5 files are the datasource for my use case and they have various - and sometimes different paths to - datasets, using the xarray.open_mfdatasets(...) function seems not to be possible straight forward. But: 1) I have a routine merging all .h5 datasets into corresponding dask arrays, wrapping dense numpy arrays implicitly 2) I "manually" slice out a part of the the huge lazy dask array and wrap that into an xarray.DataArray/Dataset 3) But applying xr.apply_ufunc(sparse.COO, ds, dask='allowed') on that slice then results in an NotImplementedError: Format not supported for conversion. Supplied type is <class 'dask.array.core.Array'>, see help(sparse.as_coo) for supported formats. (I am not sure, if this is the right place to discuss, so I would be thankful for a response on SO in that case: https://stackoverflow.com/questions/60117268/how-to-make-use-of-xarrays-sparse-functionality-when-combining-differently-size)	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
587564478	https://github.com/pydata/xarray/issues/3213#issuecomment-587564478	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDU4NzU2NDQ3OA==	crusaderky 6213168	2020-02-18T16:58:25Z	2020-02-18T16:58:25Z	MEMBER	you just need to load up your NetCDF files with xarray.open_mfdataset. This will give you an xarray.Dataset, that wraps around one dask.array.Array per variable, that wrap around one numpy.ndarray (DENSE array) per dask chunk. convert to sparse with xarray.apply_ufunc(sparse.COO, ds). This will give you an xarray.Dataset, that wraps around one dask.array.Array per variable, that wrap around one sparse.COO (SPARSE array) per dask chunk. use xarray.merge or whatever to align and merge you may want to rechunk at this point to obtain less, larger chunks. You can estimate your chunk size in bytes if you know your data density (read my previous email). Do whatever other calculations you want. All operations will produce in output the same data type as point 2. To go back to dense, invoke xarray.apply_ufunc(lambda x: x.todense(), ds) to go back to the format as in (1). This step is only necessary if you have something that won't accept/recognize sparse arrays directly in input; namely, writing to a NetCDF dataset. If your data has not been reduced enough, you may need to rechunk into smaller chunks first in order to fit into your RAM constraints. Regards On Tue, 18 Feb 2020 at 13:56, fmfreeze notifications@github.com wrote: Thank you @crusaderky https://github.com/crusaderky for your input. I understand and agree with your statements for sparse data files. My approach is different, because within my (hdf5) data files on disc, I have no sparse datasets at all. But as I combine two differently sampled xarray dataset (initialized by h5py > dask > xarray) with xarrays built-in top-level function "xarray.merge()" (resp. xarray.combine_by_coords()), the resulting dataset is sparse. Generally that is nice behaviour, because two differently sampled datasets get aligned along a coordinate/dimension, and the gaps are filled by NaNs. Nevertheless, thos NaN "gaps" seem to need memory for every single NaN. That is what should be avoided. Maybe by implementing a redundant pointer to the same memory adress for each NaN? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/issues/3213?email_source=notifications&email_token=ABPM4MFWF22BFFYDHV6BS2DRDPSHXA5CNFSM4ILGYGP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEMCBWHQ#issuecomment-587471646, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPM4MHIUWDYX6ZFKRRBIJLRDPSHXANCNFSM4ILGYGPQ .	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
587471646	https://github.com/pydata/xarray/issues/3213#issuecomment-587471646	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDU4NzQ3MTY0Ng==	fmfreeze 18172466	2020-02-18T13:56:09Z	2020-02-18T13:56:51Z	NONE	Thank you @crusaderky for your input. I understand and agree with your statements for sparse data files. My approach is different, because within my (hdf5) data files on disc, I have no sparse datasets at all. But as I combine two differently sampled xarray dataset (initialized by h5py > dask > xarray) with xarrays built-in top-level function "xarray.merge()" (resp. xarray.combine_by_coords()), the resulting dataset is sparse. Generally that is nice behaviour, because two differently sampled datasets get aligned along a coordinate/dimension, and the gaps are filled by NaNs. Nevertheless, those NaN "gaps" seem to need memory for every single NaN. That is what should be avoided. Maybe by implementing a redundant pointer to the same memory adress for each NaN?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
585997533	https://github.com/pydata/xarray/issues/3213#issuecomment-585997533	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDU4NTk5NzUzMw==	crusaderky 6213168	2020-02-13T22:12:37Z	2020-02-13T22:12:37Z	MEMBER	Hi fmfreeze, > Dask integration enables xarray to scale to big data, only as long as the data has no sparse character. Do you agree on that formulation or am I missing something fundamental? I don't agree. To my understanding xarray->dask->sparse works very well (save bugs), as long as your data density (the percentage of non-default points) is roughly constant across dask chunks. If it isn't, then you'll have some chunks that consume substantially more RAM and CPU to compute than others. This can be mitigated, if you know in advance where you are going to have more samples, by setting uneven dask chunk sizes. For example, if you have a one-dimensional array of 100k points and you know in advance that the density of non-default samples follows a gaussian or triangular distribution, then it may be wise to have very large chunks at the tails and then get them progressively smaller towards the center, e.g. (30k, 12k, 5k, 2k, 1k, 1k, 2k, 5k, 10k, 30k). Of course, there are use cases where you're going to have unpredictable hotspots; I'm afraid that in those the only thing you can do is size your chunks for the worst case and end up oversplitting everywhere else. Regards Guido On Thu, 13 Feb 2020 at 10:55, fmfreeze notifications@github.com wrote: Thank you all for making xarray and its tight development with dask so great! As @shoyer https://github.com/shoyer mentioned Yes, it would be useful (eventually) to have lazy loading of sparse arrays from disk, like we want we currently do for dense arrays. This would indeed require knowing that the indices are sorted. I am wondering, if creating a lazy & sparse xarray Dataset/DataArray is already possible? Especially when creating the sparse part at runtime, and loading only the data part: Assume two differently sampled - and lazy dask - DataArrays are merged/combined along a coordinate axis into a Dataset. Then the smaller (= less dense) DataVariable is filled with NaNs. As far as I experienced the current behaviour is, that each NaN value requires memory. That issue might be formulated this way: Dask integration enables xarray to scale to big data, only as long as the data has no sparse character. Do you agree on that formulation or am I missing something fundamental? A code example reproducing that issue is described here: https://stackoverflow.com/q/60117268/9657367 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/issues/3213?email_source=notifications&email_token=ABPM4MBFBIH7EK4PPAWHRH3RCURJLA5CNFSM4ILGYGP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOELUJNRQ#issuecomment-585668294, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPM4MCMRVUZXQSDYCAIP3LRCURJLANCNFSM4ILGYGPQ .	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
585668294	https://github.com/pydata/xarray/issues/3213#issuecomment-585668294	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDU4NTY2ODI5NA==	fmfreeze 18172466	2020-02-13T10:55:15Z	2020-02-13T10:55:15Z	NONE	Thank you all for making xarray and its tight development with dask so great! As @shoyer mentioned Yes, it would be useful (eventually) to have lazy loading of sparse arrays from disk, like we want we currently do for dense arrays. This would indeed require knowing that the indices are sorted. I am wondering, if creating a lazy & sparse xarray Dataset/DataArray is already possible? Especially when creating the sparse part at runtime, and loading only the data part: Assume two differently sampled - and lazy dask - DataArrays are merged/combined along a coordinate axis into a Dataset. Then the smaller (= less dense) DataVariable is filled with NaNs. As far as I experienced the current behaviour is, that each NaN value requires memory. That issue might be formulated this way: *Dask integration enables xarray to scale to big data, only* as long as the data has no sparse character**. Do you agree on that formulation or am I missing something fundamental? A code example reproducing that issue is described here: https://stackoverflow.com/q/60117268/9657367	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
551134122	https://github.com/pydata/xarray/issues/3213#issuecomment-551134122	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDU1MTEzNDEyMg==	dcherian 2448579	2019-11-07T15:40:12Z	2019-11-07T15:40:12Z	MEMBER	the `coords, data_vars, join ,compat` kwargs in that example are passed down to concat and merge, as appropriate. We do need more documentation on that ....	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
551132924	https://github.com/pydata/xarray/issues/3213#issuecomment-551132924	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDU1MTEzMjkyNA==	k-a-mendoza 4605410	2019-11-07T15:37:21Z	2019-11-07T15:37:21Z	NONE	@dcherian These examples seem focused on merging from disk, whereas the use-case I'm running into is joining data produced by computation in ram. I'll try updating my xarray installation and see where that gets me.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
551125982	https://github.com/pydata/xarray/issues/3213#issuecomment-551125982	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDU1MTEyNTk4Mg==	dcherian 2448579	2019-11-07T15:23:56Z	2019-11-07T15:23:56Z	MEMBER	@El-minadero a lot of that overhead may be fixed on master and more recent xarray versions. https://xarray.pydata.org/en/stable/io.html#reading-multi-file-datasets has some tips on quickly concatenating / merging datasets. It depends on the datasets you are joining...	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
551090042	https://github.com/pydata/xarray/issues/3213#issuecomment-551090042	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDU1MTA5MDA0Mg==	k-a-mendoza 4605410	2019-11-07T13:57:46Z	2019-11-07T13:57:46Z	NONE	@oliverhiggs Ive also noticed a huge computational overhead when joining xarray datasets where the result would be sparse. Something like a minute of computation time to join two 10GB datasets, even when there are no overlapping indices. I'm not sure if a sparse representation would help but its possible we'd get a reduced memory footprint and a faster merge/concat time with this kind of support.	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
550966385	https://github.com/pydata/xarray/issues/3213#issuecomment-550966385	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDU1MDk2NjM4NQ==	oliverhiggs 5311739	2019-11-07T07:57:17Z	2019-11-07T07:57:17Z	NONE	Thanks for rolling out support for sparse arrays! I think it would be great to have a `sparse` argument in the `concat` function. This will be useful with `join='outer'`, as the resultant DataArray/Dataset can become quite sparse. As an example, I have a use case where I want to concatenate (across a new dimension) a number of DataArrays with date indexes covering different date ranges. When I use `concat` the memory used by the resultant array is much greater than that used by the original list of DataArrays due to the large number of NA values that appear in it.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
546058673	https://github.com/pydata/xarray/issues/3213#issuecomment-546058673	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDU0NjA1ODY3Mw==	k-a-mendoza 4605410	2019-10-24T19:05:23Z	2019-10-24T19:05:23Z	NONE	So how would one change an existing dataset or dataarray to using a sparse representation? something like `xr.Dataset.from_dataframe(data_set.to_dataframe(),sparse=True)`? seems a little silly to have to convert it to a pandas dataframe and back	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
527771975	https://github.com/pydata/xarray/issues/3213#issuecomment-527771975	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyNzc3MTk3NQ==	p-d-moore 47371188	2019-09-04T07:05:37Z	2019-09-04T07:05:37Z	NONE	Thanks @crusaderky, appreciated. Might as as well suggest it there.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
527766483	https://github.com/pydata/xarray/issues/3213#issuecomment-527766483	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyNzc2NjQ4Mw==	crusaderky 6213168	2019-09-04T06:46:08Z	2019-09-04T06:46:08Z	MEMBER	@p-d-moore what you say makes sense but it is well outside of the domain of xarray. What you're describing is basically a new sparse class, substantially more sophisticated than COO, and should be proposed in the sparse board, not here. After it's implemented in sparse, xarray will be able to wrap around it.	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
527762609	https://github.com/pydata/xarray/issues/3213#issuecomment-527762609	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyNzc2MjYwOQ==	p-d-moore 47371188	2019-09-04T06:32:21Z	2019-09-04T06:32:21Z	NONE	I would like to add a request for sparse xarrays: Support ffill and bfill operations along ordered dimensions (such as datetime coordinates) while maintaining the sparse level of data density. The challenge to overcome is that performing ffill operations on sparse data quickly creates data that is no longer "sparse" in practice and makes dealing with the data challenging. My suggested implementation (and the way I have previously done this in another programming environment) is to represent the data as rows of contiguous regions with a single (non-sparse) value rather than rows of single points. The contiguous dimensions could be defined as any dimensions that are "ordered" such as datetime coordinates. That is, the data then is represented as a list of values + coordinate ranges rather than a list of values + coordinates. The idea is that you can easily compute operations like ffill without changing the sparsity of the matrix, and thus support typical aggregating functions you might like to apply to the data before you collapse the data and convert to a non-sparse form (e.g. perform a lag difference of the most recent value with the most recent value 20 days ago, or do a cross-sectional mean on the data along a certain dimension, using the most recent data at each given point in time). These types of operations can be more useful when the data is "fuller" such as after a forward fill, but often not useful when the data is very sparsely populated (as the cross-sectional operations are unlikely to hit the sparse data among the different dimensions). Care must be taken to avoid "collisions" between sparse blocks of data, that is, avoiding that the list of sparse blocks accidentally overlap. The implementation can get tricky but I believe the goal to be worthwhile. I am happy to expand on the request if the idea is not well expressed.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
526748987	https://github.com/pydata/xarray/issues/3213#issuecomment-526748987	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyNjc0ODk4Nw==	shoyer 1217238	2019-08-30T21:01:55Z	2019-08-30T21:01:55Z	MEMBER	You will need to install NumPy 1.17 or set the env variable before importing NumPy. On Fri, Aug 30, 2019 at 1:57 PM firdaus janoos notifications@github.com wrote: Thanks. That solved that error but introduced another one. Specifically - this is my dataframe [image: image] https://user-images.githubusercontent.com/923438/64050831-2d061280-cb47-11e9-915b-01fe42eadefe.png and this is the error that I get with sparse=True [image: image] https://user-images.githubusercontent.com/923438/64049668-91bf6e00-cb43-11e9-921f-1a044f3446a9.png [image: image] https://user-images.githubusercontent.com/923438/64050631-a94c2600-cb46-11e9-8653-9820b445bc86.png My numpy version is definitely about 1.16 [image: image] https://user-images.githubusercontent.com/923438/64050648-b701ab80-cb46-11e9-8dac-aaf2bf9e260d.png I also set this os.environ["NUMPY_EXPERIMENTAL_ARRAY_FUNCTION"]='1' just in case Furthermore, I don't get this error when I don't set sparse=True ( I just get OOM errors but that's another matter) ... — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/issues/3213?email_source=notifications&email_token=AAJJFVTN37AMEA6ROS7YT2LQHGCVHA5CNFSM4ILGYGP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5SYQ6Q#issuecomment-526747770, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJJFVWM2V4H3V6BJMJ7IQDQHGCVHANCNFSM4ILGYGPQ .	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
526747770	https://github.com/pydata/xarray/issues/3213#issuecomment-526747770	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyNjc0Nzc3MA==	fjanoos 923438	2019-08-30T20:57:54Z	2019-08-30T20:57:54Z	NONE	Thanks. That solved that error but introduced another one. Specifically - this is my dataframe and this is the error that I get with `sparse=True` My numpy version is definitely about 1.16 I also set this `os.environ["NUMPY_EXPERIMENTAL_ARRAY_FUNCTION"]='1'` just in case Furthermore, I don't get this error when I don't set `sparse=True` ( I just get OOM errors but that's another matter) ...	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
526736529	https://github.com/pydata/xarray/issues/3213#issuecomment-526736529	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyNjczNjUyOQ==	dcherian 2448579	2019-08-30T20:21:28Z	2019-08-30T20:21:28Z	MEMBER	`conda install -c conda-forge sparse` Basically you need to install https://sparse.pydata.org/en/latest/ using either pip or conda.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
526733257	https://github.com/pydata/xarray/issues/3213#issuecomment-526733257	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyNjczMzI1Nw==	fjanoos 923438	2019-08-30T20:10:43Z	2019-08-30T20:10:43Z	NONE	I cloned the master branch and installed it using 'python setup.py develop'. When I try to use the sparse data loading functionality as per `python oo = xa.Dataset.from_dataframe( my_df, sparse=True )` I get the following error: ``` ModuleNotFoundError Traceback (most recent call last) <ipython-input-9-fce0ca6bc4c2> in <module> ----> 1 oo = xa.Dataset.from_dataframe( poly_df.iloc[:10000], sparse=True ) /mnt/local/xarray/xarray/core/dataset.py in from_dataframe(cls, dataframe, sparse) 4040 4041 if sparse: -> 4042 obj._set_sparse_data_from_dataframe(dataframe, dims, shape) 4043 else: 4044 obj._set_numpy_data_from_dataframe(dataframe, dims, shape) /mnt/local/xarray/xarray/core/dataset.py in _set_sparse_data_from_dataframe(self, dataframe, dims, shape) 3936 self, dataframe: pd.DataFrame, dims: tuple, shape: Tuple[int, ...] 3937 ) -> None: -> 3938 from sparse import COO 3939 3940 idx = dataframe.index ModuleNotFoundError: No module named 'sparse' ``` Any suggestions on what I need to do ?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
526718101	https://github.com/pydata/xarray/issues/3213#issuecomment-526718101	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyNjcxODEwMQ==	shoyer 1217238	2019-08-30T19:19:13Z	2019-08-30T19:19:13Z	MEMBER	We have a new "sparse=True" option in xarray.Dataset.from_dataframe for exactly this use case. Pandas's to_xarray() method just calls this method, so it would make sense to forward keyword arguments, too. On Fri, Aug 30, 2019 at 11:53 AM firdaus janoos notifications@github.com wrote: Would it be possible that pd.{Series, DataFrame}.to_xarray() automatically creates a sparse dataarray - or we have a flag in to_xarray which allows controlling for this. I have a very sparse dataframe and everytime I try to convert it to xarray I blow out my memory. Keeping it sparse but logically as a DataArray would be fantastic — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/issues/3213?email_source=notifications&email_token=AAJJFVTD2IWWPE6RTSWPVLDQHFUDTA5CNFSM4ILGYGP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5SPPNI#issuecomment-526710709, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJJFVWEOGEBGXV62QFYU6DQHFUDTANCNFSM4ILGYGPQ .	{ "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 1, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
526710709	https://github.com/pydata/xarray/issues/3213#issuecomment-526710709	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyNjcxMDcwOQ==	fjanoos 923438	2019-08-30T18:53:44Z	2019-08-30T18:53:44Z	NONE	Would it be possible that pd.{Series, DataFrame}.to_xarray() automatically creates a sparse dataarray - or we have a flag in to_xarray which allows controlling for this. I have a very sparse dataframe and everytime I try to convert it to xarray I blow out my memory. Keeping it sparse but logically as a DataArray would be fantastic.	{ "total_count": 2, "+1": 2, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
524104485	https://github.com/pydata/xarray/issues/3213#issuecomment-524104485	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyNDEwNDQ4NQ==	darothen 4992424	2019-08-22T22:39:21Z	2019-08-22T22:39:21Z	NONE	Tagging @jeliashi for visibility/collaboration	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
524088419	https://github.com/pydata/xarray/issues/3213#issuecomment-524088419	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyNDA4ODQxOQ==	normanb 277229	2019-08-22T21:40:13Z	2019-08-22T21:40:13Z	CONTRIBUTOR	I would be interested in trying to add PDAL (https://pdal.io/) support, is this something of interest in a similar way to the support for rasterio for dense arrays?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
521691465	https://github.com/pydata/xarray/issues/3213#issuecomment-521691465	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyMTY5MTQ2NQ==	shoyer 1217238	2019-08-15T15:50:42Z	2019-08-15T15:50:42Z	MEMBER	Yes, it would be useful (eventually) to have lazy loading of sparse arrays from disk, like we want we currently do for dense arrays. This would indeed require knowing that the indices are sorted.	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
521596825	https://github.com/pydata/xarray/issues/3213#issuecomment-521596825	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyMTU5NjgyNQ==	ivirshup 8238804	2019-08-15T10:34:30Z	2019-08-15T10:34:30Z	NONE	That's fair. I just think it would be useful to have an assurance that indices are sorted you read them. I don't see how to express this within the CF specs while still looking like a COO array though.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
521530770	https://github.com/pydata/xarray/issues/3213#issuecomment-521530770	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyMTUzMDc3MA==	ivirshup 8238804	2019-08-15T06:28:24Z	2019-08-15T07:33:04Z	NONE	Would it be feasible to use the contiguous ragged array spec or the gathering based compression when the COO coordinates are sorted? I think this could be very helpful for read efficiency, though I'm not sure if random writes were a requirement here.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
521533999	https://github.com/pydata/xarray/issues/3213#issuecomment-521533999	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyMTUzMzk5OQ==	shoyer 1217238	2019-08-15T06:42:44Z	2019-08-15T06:42:44Z	MEMBER	I like the indexed ragged array representation because it maps directly into sparse’s COO format. I’m sure other formats would be possible, but they would also likely be harder to implement.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
521301555	https://github.com/pydata/xarray/issues/3213#issuecomment-521301555	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyMTMwMTU1NQ==	shoyer 1217238	2019-08-14T15:42:58Z	2019-08-14T15:42:58Z	MEMBER	netCDF has a pretty low-level base spec, with conventions left to higher level docs like CF conventions. Fortunately, there does seems to be a CF convention that would be a good fit for for sparse data in COO format, namely the indexed ragged array representation (example, note the `instance_dimension` attribute). That's probably the right thing to use for sparse arrays in xarray.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
521224538	https://github.com/pydata/xarray/issues/3213#issuecomment-521224538	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyMTIyNDUzOA==	crusaderky 6213168	2019-08-14T12:25:39Z	2019-08-14T12:25:39Z	MEMBER	As for NetCDF, instead of a bespoke xarray-only convention, wouldn't it be much better to push a spec extension upstream?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
521223609	https://github.com/pydata/xarray/issues/3213#issuecomment-521223609	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyMTIyMzYwOQ==	crusaderky 6213168	2019-08-14T12:22:37Z	2019-08-14T12:22:37Z	MEMBER	As already mentioned in #3206, `unstack(sparse=True)` would be extremely useful.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
521221473	https://github.com/pydata/xarray/issues/3213#issuecomment-521221473	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyMTIyMTQ3Mw==	crusaderky 6213168	2019-08-14T12:15:39Z	2019-08-14T12:20:59Z	MEMBER	+1 for the introduction of to_sparse() / to_dense(), but let's please avoid the mistakes that were done with chunk(). DataArray.chunk() is extremely frustrating when you have non-index coords and, 9 times out of 10, you only want to chunk the data and you have to go through the horrid `python a = DataArray(a.data.chunk(), dims=a.dims, coords=a.coords, attrs=a.attrs, name=a.name)` Exactly the same issue would apply to to_sparse(). Possibly we could define them as ```python class DataArray: def to_sparse( self, data: bool = True, coords: Union[Iterable[Hashable], bool] = False ) class Dataset: def to_sparse( self, data_vars: Union[Iterable[Hashable], bool] = True, coords: Union[Iterable[Hashable], bool] = False ) ``` same for to_dense() and chunk() (the latter would require a DeprecationWarning for a few release before switching the default for coords from True to False - only to be triggered in presence of dask-backed coords).	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077
520741706	https://github.com/pydata/xarray/issues/3213#issuecomment-520741706	https://api.github.com/repos/pydata/xarray/issues/3213	MDEyOklzc3VlQ29tbWVudDUyMDc0MTcwNg==	khaeru 1634164	2019-08-13T08:31:30Z	2019-08-13T08:31:30Z	NONE	This is very exciting! In energy-economic research (unlike, e.g., earth systems research), data are almost always sparse, so first-class sparse support will be broadly useful. I'm leaving a comment here (since this seems to be a meta-issue; please link from wherever else, if needed) with two example use-cases. For the moment, #3206 seems to cover them, so I can't name any specific additional features. MESSAGEix is an energy systems optimization model framework, formulated as a linear program. Some variables have many dimensions, for instance, the input coefficient for a technology has the dimensions `(node_loc, technology, year_vintage, year_active, mode, node_origin, commodity, level, time, time_origin)`. In the global version of our model, the `technology` dimension has over 400 labels. Often two or more dimensions are tied, eg `technology='coal power plant'` will only take input from `(commodity='coal', level='primary energy')`; all other combinations of `(commodity, level)` are empty for this `technology`. So, this data is inherently sparse. For modeling research, specifying quantities in this way is a good design because (a) it is intuitive to researchers in this domain, and (b) the optimization model is solved using various LP solvers via GAMS, which automatically prune zero rows in the resulting matrices. When we were developing a dask/DAG-based system for model results post-processing, we wanted to use xarray, but had some quantities with tens of millions of elements that were less than 1% full. Here is some test code that triggered MemoryErrors using xarray. We chose to fall back on using a pd.Series subclass that mocks xarray methods. In transportation research, stock models of vehicle fleets are often used. These models always have at least two time dimensions: `cohort` (the time period in which a vehicle was sold) and `period`(s) in which it is used (and thus consumes fuel, etc.). Since a vehicle sold in 2020 can't be used in 2015, these data are always triangular w.r.t. these two dimensions. (The dimensions `year_vintage` and `year_active` in example #1 above have the same relationship.) Once multiplied by other dimensions (technology; fuel; size or shape or market segment; embodied materials; different variables; model runs across various scenarios or input assumptions) the overhead of dense arrays can become problematic.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	How should xarray use/support sparse arrays? 479942077

Advanced export

JSON shape: default, array, newline-delimited, object

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);

issue_comments

54 rows where issue = 479942077 sorted by updated_at descending

assume 'x' is the dimension of the nested lists

```

Advanced export