issue_comments: 1216123818

html_url	issue_url	id	node_id	user	created_at	updated_at	author_association	body	reactions	performed_via_github_app	issue
https://github.com/pydata/xarray/issues/4285#issuecomment-1216123818	https://api.github.com/repos/pydata/xarray/issues/4285	1216123818	IC_kwDOAMm_X85IfJOq	12912489	2022-08-16T04:15:24Z	2022-08-16T04:15:24Z	NONE	5. Neutron scattering data Scipp is an xarray-like labelled data structure for neutron scattering experiment data. On their FAQ Q titled "Why is xarray not enough", one of the things they quote is Support for event data, a particular form of sparse data. More concretely, this is essentially a 1-D (or N-D) array of random-length lists, with very small list entries. This type of data arises in time-resolved detection of neutrons in pixelated detectors. Would a `RaggedArray` class that's wrappable in xarray help with this? (cc @SimonHeybrock) Partially, but the bigger challenge may be the related algorithms, e.g., for getting data into this layout, and for switching to other ragged layouts. For context, one of the main reasons for our data layout is the ability to make cuts/slices quickly. We frequently deal with 2-D, 3-D, and 4-D data. For example, a 3-D case may be the momentum transfer $\vec Q$ in a scattering process, with a "record" for every detected neutron. Desired final resolution may exceed 1000 per dimension (of the 3 components of $\vec Q$). On top of this there may be additional dimensions relating to environment parameters of the sample under study, such as temperature, pressure, or strain. This would lead to bin counts that cannot be handled easily (in single-node memory). A naive solution could be to simply work with something like `pandas.DataFrame`, with columns for the components of $\vec Q$ as well as the sample environment parameters. Those could then be used for grouping/histogramming to the desired 2-D cuts or slices. However, since many such slices are frequently required, this can quickly become inefficient (though there are certainly cases where it would work well, providing a simpler solution than scipp).
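As a rough illustration of the naive `pandas.DataFrame` approach described above (this is not scipp code; the column names, value ranges, synthetic data, and the helper `cut_qx_qz` are all assumptions made up for this sketch), a single 2-D cut could look something like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n_events = 100_000

# Synthetic event list: one row per detected neutron (values are made up).
events = pd.DataFrame({
    "Qx": rng.uniform(-1.0, 1.0, n_events),
    "Qy": rng.uniform(-1.0, 1.0, n_events),
    "Qz": rng.uniform(-1.0, 1.0, n_events),
    "temperature": rng.choice([10.0, 50.0, 300.0], n_events),
})


def cut_qx_qz(df, qy_range, temperature, bins=1000):
    """Histogram events into a (Qx, Qz) image for one Qy slab and one temperature."""
    lo, hi = qy_range
    # Each cut scans every row of the event table to build the selection mask.
    mask = (df["Qy"] >= lo) & (df["Qy"] < hi) & (df["temperature"] == temperature)
    selected = df.loc[mask]
    counts, _, _ = np.histogram2d(
        selected["Qx"].to_numpy(),
        selected["Qz"].to_numpy(),
        bins=bins,
        range=[[-1.0, 1.0], [-1.0, 1.0]],
    )
    return counts, int(mask.sum())


counts, n_selected = cut_qx_qz(events, qy_range=(-0.02, 0.02), temperature=50.0)
```

Every call filters the full event table before histogramming, so the cost of each additional slice scales with the total number of events rather than with the number of events that actually fall inside the slice.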
Scipp's ragged data can be considered a "partial sorting", to build a sort of "index". Based on all this we can then, e.g., quickly compute high-resolution cuts. Say we are in 3-D (Qx, Qy, Qz). We would not have bin sizes that match the final resolution required by the science. Instead we could use 50x50x50 bins. Then we can very quickly produce a high-res 2-D plot (say 1000x1000 in Qx and Qz, or whatever), since our binned data format reduces the data/memory you have to load and consider by a factor of up to 50 (in this example).	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }		667864088