issues

3 rows where state = "closed", type = "issue" and user = 1797906 sorted by updated_at descending

id: 291332965 · node_id: MDU6SXNzdWUyOTEzMzI5NjU= · number: 1854
title: Drop coordinates on loading large dataset.
user: jamesstidard (1797906) · state: closed · locked: 0 · comments: 22 · author_association: NONE
created_at: 2018-01-24T19:35:46Z · updated_at: 2020-02-15T14:49:53Z · closed_at: 2020-02-15T14:49:53Z

body:

I've been struggling for quite a while to load a large dataset, so I thought it best to ask, as I think I'm missing a trick. I've also looked through the existing issues; there are a fair few questions that seemed promising, but none of them quite covered this.

I have a number of *.nc files with variables across the coordinates latitude, longitude and time. Each file has the data for all the latitudes and longitudes of the world over some period of time - about two months.

The goal is to go through that data and get the full history of a single latitude/longitude coordinate - instead of the data for all latitudes and longitudes over short periods.

This is my current few lines of script:

```python
import numpy as np
import xarray as xr

# 127 is normally the size of the time dimension in each file
ds = xr.open_mfdataset('path/to/ncs/*.nc', chunks={'time': 127})
recs = ds.sel(latitude=10, longitude=10).to_dataframe().to_records()
np.savez('location.npz', recs)
```

However, this blows out the memory on my machine on the open_mfdataset call when I use the full dataset. I've tried a bunch of different ways of chunking the data (like 'latitude': 1, 'longitude': 1) but have not been able to get past this stage.

I was wondering if there's a way to either determine a good chunk size, or maybe tell open_mfdataset to only keep values from the lat/lng coordinates I care about (the coords kwarg looked like it could've been it).
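For example, I imagine something along these lines might work, if the `preprocess` argument to open_mfdataset can be used to subset each file before it is combined (an untested sketch - the helper name and the lat/lng point are just placeholders):

```python
import numpy as np
import xarray as xr

def select_point(ds):
    # keep only the single grid point of interest from each file,
    # so only that point's time series is ever combined and held in memory
    return ds.sel(latitude=10, longitude=10, method='nearest')

ds = xr.open_mfdataset('path/to/ncs/*.nc', preprocess=select_point)
recs = ds.to_dataframe().to_records()
np.savez('location.npz', recs)
```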

I'm using version 0.10.0 of xarray

Would very much appreciate any help.

reactions:
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1854/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: xarray (13221727) · type: issue
id: 257400162 · node_id: MDU6SXNzdWUyNTc0MDAxNjI= · number: 1572
title: Modifying data set resulting in much larger file size
user: jamesstidard (1797906) · state: closed · locked: 0 · comments: 7 · author_association: NONE
created_at: 2017-09-13T14:24:06Z · updated_at: 2017-09-18T08:59:24Z · closed_at: 2017-09-13T17:12:28Z

body:

I'm loading a 130MB nc file and applying a where mask to it to remove a significant proportion of the floating-point values - replacing them with NaN. However, when I go to save this file it has increased to over 500MB. If I load the original dataset and immediately save it, the file stays roughly the same size.

Here's how I'm applying the mask:

```python
import os
import xarray as xr

fp = 'ERA20c/swh_2010_01_05_05.nc'
ds = xr.open_dataset(fp)

# mask out everything at or below 50N, replacing those values with NaN
ds = ds.where(ds.latitude > 50)

# write an unmodified copy and the masked copy to compare file sizes
head, ext = os.path.splitext(fp)
xr.open_dataset(fp).to_netcdf('{}-duplicate{}'.format(head, ext))
ds.to_netcdf('{}-masked{}'.format(head, ext))
```

Is there a way to reduce the file size of the masked dataset? I'd expect it to be roughly the same size or smaller.
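For what it's worth, I did wonder whether explicitly asking for compression when writing would help - something along these lines, continuing from the script above (an untested sketch; I'm assuming the variable in the file is called `swh`, going by the filename):

```python
# untested sketch: re-save the masked dataset with compression enabled and
# single precision; 'swh' is assumed to be the variable name in the file
encoding = {'swh': {'zlib': True, 'complevel': 4, 'dtype': 'float32'}}
ds.to_netcdf('{}-masked-compressed{}'.format(head, ext), encoding=encoding)
```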

Thanks.

reactions:
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1572/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: xarray (13221727) · type: issue
id: 255997962 · node_id: MDU6SXNzdWUyNTU5OTc5NjI= · number: 1561
title: exit code 137 when using xarray.open_mfdataset
user: jamesstidard (1797906) · state: closed · locked: 0 · comments: 3 · author_association: NONE
created_at: 2017-09-07T16:31:50Z · updated_at: 2017-09-13T14:16:07Z · closed_at: 2017-09-13T14:16:06Z

body:

While using xarray.open_mfdataset I get exit code 137 (SIGKILL, signal 9) killing my process. I do not get this while using a subset of the data, though. I'm also providing a chunks argument.

Does anyone know what might be causing this? Could it be that the computer is completely running out of memory (RAM + SWAP + HDD)? I'm unsure what's causing it, as I get no stack trace, just the SIGKILL.
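In case it's useful, this is roughly how I've been trying to confirm whether memory is the culprit (a sketch; `psutil` is a third-party package I'm assuming is installed, and the glob and chunks are placeholders):

```python
import psutil  # third-party; assumed installed
import xarray as xr

def report_memory(label):
    # print how much RAM is still available at this point
    mem = psutil.virtual_memory()
    print('{}: {:.1f} GB available of {:.1f} GB'.format(
        label, mem.available / 1e9, mem.total / 1e9))

report_memory('before open')
ds = xr.open_mfdataset('path/to/ncs/*.nc', chunks={'time': 127})
report_memory('after open')  # opening is lazy, so this step alone should stay modest
```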

Thanks.

reactions:
{
    "url": "https://api.github.com/repos/pydata/xarray/issues/1561/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
state_reason: completed · repo: xarray (13221727) · type: issue

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [active_lock_reason] TEXT,
   [draft] INTEGER,
   [pull_request] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [state_reason] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
);
CREATE INDEX [idx_issues_repo]
    ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
    ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
    ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
    ON [issues] ([user]);
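The filter shown at the top of this page (state = "closed", type = "issue", user = 1797906, sorted by updated_at descending) can be run directly against a local copy of the database with Python's sqlite3 - a sketch, with the database filename `github.db` assumed:

```python
import sqlite3

# 'github.db' is an assumed filename for a local copy of this Datasette database
conn = sqlite3.connect('github.db')
rows = conn.execute(
    """
    SELECT number, title, updated_at
    FROM issues
    WHERE state = 'closed' AND type = 'issue' AND [user] = 1797906
    ORDER BY updated_at DESC
    """
).fetchall()
for number, title, updated_at in rows:
    print(number, title, updated_at)
conn.close()
```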