Comment on pydata/xarray#2780 by user 22566757 (CONTRIBUTOR), 2020-05-24: https://github.com/pydata/xarray/issues/2780#issuecomment-633296515

For the example given, this means finding `largest = max(abs(ds.min()), abs(ds.max()))` and then picking the first integer dtype wide enough to hold it: a check like `[np.iinfo("i{bytes:d}".format(bytes=2 ** i)).max >= largest for i in range(4)]` tests widths of 1, 2, 4, and 8 bytes. The function below wraps this up; I would tend to apply it at array creation time rather than at save time, so you get the benefits in memory as well as on disk.

For the character/string variables, the smallest representation varies a bit more: a fixed-width encoding (`dtype="S6"`) will probably be smaller if all the strings are about the same size, while variable-width strings are probably smaller if there are many short strings and only a few long ones. If you happen to know that a given field is a five-character identifier or a one-character status code, you can again set those types at creation time (which I think makes dask happier when it comes time to save), while free-form survey responses will likely be better as variable-length strings. It may be possible to use the distribution of string lengths (perhaps via `numpy.char.str_len`) to check whether most of the strings are at least 90% as long as the longest (see the sketch below), but it's probably simpler to just test both.
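As a rough illustration of that heuristic, here is a minimal sketch; the helper name `fixed_width_is_efficient` and the 90% cutoffs are my own choices for illustration, not something from the discussion above:

```python
import numpy as np

def fixed_width_is_efficient(strings, threshold=0.9):
    """Guess whether a fixed-width string dtype would waste little space.

    Returns True when at least ``threshold`` of the strings are at least
    90% as long as the longest one, in which case padding everything to a
    fixed width costs little compared to variable-width storage.
    """
    # Element-wise string lengths for the whole array.
    lengths = np.char.str_len(np.asarray(strings, dtype="U"))
    # Fraction of strings close to the maximum length.
    return (lengths >= 0.9 * lengths.max()).mean() >= threshold
```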

Doing this correctly for floating-point types would be difficult, but I think that's outside the scope of this issue.

Hopefully this gives you something to work with.

```python
import numpy as np


def dtype_for_int_array(arry: "array of integers") -> np.dtype:
    """Find the smallest integer dtype that will encode arry.

    Parameters
    ----------
    arry : array of integers
        The array to compress

    Returns
    -------
    smallest : dtype
        The smallest dtype that will represent arry
    """
    # Widest magnitude the dtype must be able to represent.
    largest = max(abs(arry.min()), abs(arry.max()))
    # Try widths of 1, 2, 4, and 8 bytes; take the first one that fits.
    typecode = "i{bytes:d}".format(
        bytes=2 ** np.nonzero([
            np.iinfo("i{bytes:d}".format(bytes=2 ** i)).max >= largest
            for i in range(4)
        ])[0][0]
    )
    return np.dtype(typecode)
```
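A quick usage example (a sketch; the expected outputs follow from the width check above):

```python
>>> import numpy as np
>>> dtype_for_int_array(np.array([0, 200]))
dtype('int16')
>>> dtype_for_int_array(np.array([-40000, 40000]))
dtype('int32')
```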

Looking at `df.memory_usage()` will explain why I do this early. If I extend your example with this new function, I see the following:

```python
>>> df_small = df.copy()
>>> for col in df_small:
...     df_small[col] = df_small[col].astype(
...         dtype_for_int_array(df_small[col])
...         if df_small[col].dtype.kind == "i"
...         else "S1"
...     )
...
>>> df_small.memory_usage()
Index        80
a        100000
b        100000
c        100000
d        100000
e        800000
dtype: int64
>>> df.memory_usage()
Index        80
a        800000
b        800000
c        800000
d        800000
e        800000
dtype: int64
```

It looks like pandas always uses the `object` dtype for string columns, so the numbers in that column likely reflect the size of an array of pointers rather than the strings themselves. Xarray lets you use a dtype of `"S1"` or `"U1"`, but I haven't found an equivalent of the `memory_usage` method.
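As a rough substitute, something like this reports per-variable sizes (a sketch assuming the `nbytes` property that xarray variables expose; I haven't checked this across versions):

```python
>>> {name: var.nbytes for name, var in ds.variables.items()}
```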
