Comment on pydata/xarray#2780 by user 22566757 (CONTRIBUTOR), 2020-05-24: https://github.com/pydata/xarray/issues/2780#issuecomment-633296515

For the example given, this means finding `largest = max(abs(ds.min()), abs(ds.max()))` and then picking the first integer dtype wide enough to hold it: a check like `[np.iinfo("i{bytes:d}".format(bytes=2 ** i)).max >= largest for i in range(4)]` tests widths of 1, 2, 4, and 8 bytes. The function below wraps this up; I would tend to apply it at array creation time rather than at save time, so you get the benefits in memory as well as on disk.

For the character/string variables, the smallest representation varies a bit more: a fixed-width encoding (`dtype="S6"`) will probably be smaller if all the strings are about the same size, while variable-width strings are probably smaller if there are many short strings and only a few long ones. If you happen to know that a given field is a five-character identifier or a one-character status code, you can again set those types at creation time (which I think makes dask happier when it comes time to save), while free-form survey responses will likely be better as variable-length strings. It may be possible to use the distribution of string lengths (perhaps via `numpy.char.str_len`) to check whether most of the strings are at least 90% as long as the longest (see the sketch below), but it's probably simpler to just test both.
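As a rough illustration of that heuristic, here is a minimal sketch; the helper name `fixed_width_is_efficient` and the 90% cutoffs are my own choices for illustration, not something from the discussion above:

```python
import numpy as np

def fixed_width_is_efficient(strings, threshold=0.9):
    """Guess whether a fixed-width string dtype would waste little space.

    Returns True when at least ``threshold`` of the strings are at least
    90% as long as the longest one, in which case padding everything to a
    fixed width costs little compared to variable-width storage.
    """
    # Element-wise string lengths for the whole array.
    lengths = np.char.str_len(np.asarray(strings, dtype="U"))
    # Fraction of strings close to the maximum length.
    return (lengths >= 0.9 * lengths.max()).mean() >= threshold
```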

Doing this correctly for floating-point types would be difficult, but I think that's outside the scope of this issue.

Hopefully this gives you something to work with.

```python
import numpy as np


def dtype_for_int_array(arry: "array of integers") -> np.dtype:
    """Find the smallest integer dtype that will encode arry.

    Parameters
    ----------
    arry : array of integers
        The array to compress

    Returns
    -------
    smallest : dtype
        The smallest dtype that will represent arry
    """
    # Widest magnitude the dtype must be able to represent.
    largest = max(abs(arry.min()), abs(arry.max()))
    # Try widths of 1, 2, 4, and 8 bytes; take the first one that fits.
    typecode = "i{bytes:d}".format(
        bytes=2 ** np.nonzero([
            np.iinfo("i{bytes:d}".format(bytes=2 ** i)).max >= largest
            for i in range(4)
        ])[0][0]
    )
    return np.dtype(typecode)
```
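A quick usage example (a sketch; the expected outputs follow from the width check above):

```python
>>> import numpy as np
>>> dtype_for_int_array(np.array([0, 200]))
dtype('int16')
>>> dtype_for_int_array(np.array([-40000, 40000]))
dtype('int32')
```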

Looking at `df.memory_usage()` will explain why I do this early. If I extend your example with this new function, I see the following:

```python
>>> df_small = df.copy()
>>> for col in df_small:
...     df_small[col] = df_small[col].astype(
...         dtype_for_int_array(df_small[col])
...         if df_small[col].dtype.kind == "i"
...         else "S1"
...     )
...
>>> df_small.memory_usage()
Index        80
a        100000
b        100000
c        100000
d        100000
e        800000
dtype: int64
>>> df.memory_usage()
Index        80
a        800000
b        800000
c        800000
d        800000
e        800000
dtype: int64
```

It looks like pandas always uses the `object` dtype for string columns, so the numbers in that column likely reflect the size of an array of pointers rather than the strings themselves. Xarray lets you use a dtype of `"S1"` or `"U1"`, but I haven't found an equivalent of the `memory_usage` method.
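As a rough substitute, something like this reports per-variable sizes (a sketch assuming the `nbytes` property that xarray variables expose; I haven't checked this across versions):

```python
>>> {name: var.nbytes for name, var in ds.variables.items()}
```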
