issue_comments: 602508864

html_url issue_url id node_id user created_at updated_at author_association body reactions performed_via_github_app issue
https://github.com/pydata/xarray/issues/3007#issuecomment-602508864 https://api.github.com/repos/pydata/xarray/issues/3007 602508864 MDEyOklzc3VlQ29tbWVudDYwMjUwODg2NA== 8833517 2020-03-23T10:27:27Z 2020-03-23T10:27:27Z CONTRIBUTOR

I recently had a similar issue and found out the cause: when converting a dataframe with a MultiIndex to an xarray object, xarray allocates memory for all possible combinations of the coordinates. In this particular case, you have 5 unique values each for latitude and longitude in your five rows, which means there are 5*5=25 possible combinations of lat/long values. All combinations that do not occur in the dataframe are then filled in as NaN.

Let me illustrate by recreating just your data on latitude, longitude, wind_surface and hurs:

```python
In [3]: data = [
   ...:     [34.511383, 16.467664, 29.658546, 70.481293],
   ...:     [34.515558, 16.723973, 30.896049, 71.356644],
   ...:     [34.517359, 16.852138, 31.514799, 71.708603],
   ...:     [34.518970, 16.980310, 32.105423, 72.023773],
   ...:     [34.520391, 17.108487, 32.724174, 72.106110],
   ...: ]

In [4]: df = pd.DataFrame(data=data, columns=['lat', 'long', 'wind_surface', 'hurs']).set_index(['lat', 'long'])

In [5]: df
Out[5]:
                     wind_surface       hurs
lat       long
34.511383 16.467664     29.658546  70.481293
34.515558 16.723973     30.896049  71.356644
34.517359 16.852138     31.514799  71.708603
34.518970 16.980310     32.105423  72.023773
34.520391 17.108487     32.724174  72.106110
```

But for xarray, this means it will end up creating a 5x5 array, in which only the 5 values along the diagonal are populated. This is very clearly visible when showing just the DataArray for a single column:

```python
In [6]: df.to_xarray()['wind_surface']
Out[6]:
<xarray.DataArray 'wind_surface' (lat: 5, long: 5)>
array([[29.658546,       nan,       nan,       nan,       nan],
       [      nan, 30.896049,       nan,       nan,       nan],
       [      nan,       nan, 31.514799,       nan,       nan],
       [      nan,       nan,       nan, 32.105423,       nan],
       [      nan,       nan,       nan,       nan, 32.724174]])
Coordinates:
  * lat      (lat) float64 34.51 34.52 34.52 34.52 34.52
  * long     (long) float64 16.47 16.72 16.85 16.98 17.11
```

However, since to_xarray() returns a Dataset, each DataArray (i.e. each column of the dataframe) is summarized as a flattened 1D preview, which makes it seem like a lot of data is just 'missing':

```python
In [7]: df.to_xarray()
Out[7]:
<xarray.Dataset>
Dimensions:       (lat: 5, long: 5)
Coordinates:
  * lat           (lat) float64 34.51 34.52 34.52 34.52 34.52
  * long          (long) float64 16.47 16.72 16.85 16.98 17.11
Data variables:
    wind_surface  (lat, long) float64 29.66 nan nan nan ... nan nan nan 32.72
    hurs          (lat, long) float64 70.48 nan nan nan ... nan nan nan 72.11
```
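If you want to know in advance how large the grid will be, you can compare the number of rows with the product of the unique values per index level. This is only a minimal sketch using the same five rows; the variable names (`level_sizes`, `n_cells`, `fill_fraction`) are my own and not part of pandas or xarray.

```python
import math

import pandas as pd

# Same five rows as in the example from this issue.
data = [
    [34.511383, 16.467664, 29.658546, 70.481293],
    [34.515558, 16.723973, 30.896049, 71.356644],
    [34.517359, 16.852138, 31.514799, 71.708603],
    [34.518970, 16.980310, 32.105423, 72.023773],
    [34.520391, 17.108487, 32.724174, 72.106110],
]
df = pd.DataFrame(
    data, columns=["lat", "long", "wind_surface", "hurs"]
).set_index(["lat", "long"])

# Number of rows that actually carry data.
n_rows = len(df)

# Number of unique values in each index level (lat, long).
level_sizes = [
    df.index.get_level_values(name).nunique() for name in df.index.names
]

# The dense grid has one cell per combination of index values.
n_cells = math.prod(level_sizes)

# Fraction of grid cells that will hold real data; the rest become NaN.
fill_fraction = n_rows / n_cells

print(f"rows: {n_rows}, grid cells per variable: {n_cells}, "
      f"filled: {fill_fraction:.0%}")
```

For larger dataframes with many distinct lat/long pairs, the number of grid cells grows with the product of the level sizes, so a low fill fraction is a hint that the conversion will mostly allocate NaN.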

So it works as intended, but can throw you for a loop if you don't realize it's creating an array the size of all possible index combinations.
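If the dense grid is not what you want, two options that I believe work are sketched below, assuming xarray is installed alongside pandas. The first keeps the points on a single dimension by not putting lat/long into the index before converting. The second converts to the grid and then drops the all-NaN rows again on the way back. The names `ds_points`, `ds_grid` and `roundtrip` are just illustrative, and whether either option fits depends on what you want to do with the data afterwards.

```python
import pandas as pd  # DataFrame.to_xarray() additionally requires xarray

data = [
    [34.511383, 16.467664, 29.658546, 70.481293],
    [34.515558, 16.723973, 30.896049, 71.356644],
    [34.517359, 16.852138, 31.514799, 71.708603],
    [34.518970, 16.980310, 32.105423, 72.023773],
    [34.520391, 17.108487, 32.724174, 72.106110],
]
df = pd.DataFrame(data, columns=["lat", "long", "wind_surface", "hurs"])

# Option 1: leave lat/long as ordinary columns, so the Dataset gets a single
# dimension (named "index" for an unnamed pandas index) with 5 points.
# set_coords() then marks lat/long as coordinates of those points.
ds_points = df.to_xarray().set_coords(["lat", "long"])
print(ds_points["wind_surface"].shape)

# Option 2: build the dense 5x5 grid as in the comment above, then go back
# to a dataframe and drop the combinations where every variable is NaN.
ds_grid = df.set_index(["lat", "long"]).to_xarray()
roundtrip = ds_grid.to_dataframe().dropna(how="all")
print(ds_grid["wind_surface"].size, len(roundtrip))
```

Option 1 avoids allocating the NaN-filled grid at all, while option 2 is mainly useful for checking that no data was lost in the conversion.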

@shoyer can you close this issue?

{
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  454073421