issue_comments

16 rows where issue = 336458472 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
449081085 https://github.com/pydata/xarray/issues/2256#issuecomment-449081085 https://api.github.com/repos/pydata/xarray/issues/2256 MDEyOklzc3VlQ29tbWVudDQ0OTA4MTA4NQ== rabernat 1197350 2018-12-20T17:49:13Z 2018-12-20T17:49:13Z MEMBER

I'm going to close this. Please feel free to reopen if more discussion is needed.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray to zarr 336458472
401745899 https://github.com/pydata/xarray/issues/2256#issuecomment-401745899 https://api.github.com/repos/pydata/xarray/issues/2256 MDEyOklzc3VlQ29tbWVudDQwMTc0NTg5OQ== NickMortimer 4338975 2018-07-02T10:03:36Z 2018-07-02T10:03:36Z NONE

As an update: chunking could still be improved, but I've crunched over 800 floats into the structure with 140k profiles. Even though the levels are expanded to 3000 (way overkill), the space on disk is 1/3 of the original size, and it could be less than 1/4 if chunking were set nicely to prevent super-small files. I can now access any profile by index, so I might be happy!
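A minimal sketch of what "chunking set nicely" could look like, assuming a combined dataset `ds` with dimensions (N_PROF, N_LEVELS); the variable names and chunk shape below are placeholders, not the poster's actual settings:

```python
# Hypothetical sketch: control zarr chunk sizes from xarray via `encoding`,
# so each chunk file holds many profiles instead of one tiny file per write.
import xarray as xr

ds = xr.open_dataset('combined_profiles.nc')  # assumed combined dataset

encoding = {var: {'chunks': (1000, 3000)}     # ~1000 profiles x 3000 levels per chunk
            for var in ('TEMP', 'PSAL', 'PRES')}
ds.to_zarr('argo.zarr', mode='w', encoding=encoding)
```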

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray to zarr 336458472
401728326 https://github.com/pydata/xarray/issues/2256#issuecomment-401728326 https://api.github.com/repos/pydata/xarray/issues/2256 MDEyOklzc3VlQ29tbWVudDQwMTcyODMyNg== NickMortimer 4338975 2018-07-02T09:18:08Z 2018-07-02T09:19:18Z NONE

@rabernat thanks so far for all the help. If pickle is not the way forward, then I need to resize the casts so they all have the same dimensions. I came up with the following code:

```python
def expand_levels(dataset, maxlevel=1500):
    # Pad every profile out to `maxlevel` levels by appending NaNs.
    newds = xr.Dataset()
    blankstack = np.empty((dataset.N_PROF.size, maxlevel - dataset.N_LEVELS.size))
    blankstack[:] = np.nan
    newds['N_PROF'] = dataset.N_PROF.values
    newds['N_LEVELS'] = np.arange(maxlevel).astype('int64')
    newds['N_PARAM'] = dataset.N_PARAM
    newds['N_CALIB'] = dataset.N_CALIB
    for varname, da in dataset.data_vars.items():
        if 'N_PROF' in da.dims:
            if 'N_LEVELS' in da.dims:
                newds[varname] = xr.DataArray(np.hstack((da.values, blankstack)),
                                              dims=da.dims, name=da.name, attrs=da.attrs)
            elif 'N_HISTORY' not in da.dims:
                newds[varname] = da
    newds.attrs = dataset.attrs
    return newds


def append_to_zarr(dataset, zarrfile):
    # Append each variable's values onto the corresponding zarr array.
    for varname, da in dataset.data_vars.items():
        zarrfile[varname].append(da.values)


files = list(glob.iglob(r'D:\argo\csiro**_prof.nc', recursive=True))
expand_levels(xr.open_dataset(files[0]), 3000).to_zarr(r'D:\argo\argo.zarr', mode='w')
za = zarr.open(r'D:\argo\argo.zarr', mode='w+')
for f in files[1:]:
    print(f)
    append_to_zarr(expand_levels(xr.open_dataset(f), 3000), za)

```

This basically appends NaNs to the end of the profiles to get them all the same length, then appends them into the zarr structure. This is very experimental; I just wanted to see how appending them all into big arrays would work. It might be better to save resized netCDF files and then open them all at once and do a single to_zarr?
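A sketch of that alternative, reusing `expand_levels` from above and letting `open_mfdataset` do the concatenation; the path is a placeholder, and the `combine`/`concat_dim` argument names follow recent xarray versions (2018-era versions took `concat_dim` alone):

```python
# Hypothetical sketch of the "open them all at once" route: pad each file on
# load via `preprocess`, concatenate along the profile dimension, then write
# one zarr store in a single to_zarr call.
import glob
import xarray as xr

files = sorted(glob.iglob(r'path/to/floats/**/*_prof.nc', recursive=True))  # placeholder path

combined = xr.open_mfdataset(
    files,
    preprocess=lambda ds: expand_levels(ds, 3000),  # pad levels; drops N_HISTORY vars too
    combine='nested',
    concat_dim='N_PROF',
)
combined.to_zarr(r'path/to/argo.zarr', mode='w')
```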

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray to zarr 336458472
401195638 https://github.com/pydata/xarray/issues/2256#issuecomment-401195638 https://api.github.com/repos/pydata/xarray/issues/2256 MDEyOklzc3VlQ29tbWVudDQwMTE5NTYzOA== NickMortimer 4338975 2018-06-28T22:46:32Z 2018-06-28T22:47:09Z NONE

Yes, I agree zarr is best for large arrays etc.; that's kind of why I ended up on the array-of-xarray-objects idea. I guess that was sort of creating an object store in zarr. What I'd like to offer is a simple set of analytical tools based on Jupyter allowing for easy processing of float data, getting away from the download-and-process pattern. I'm still trying to find the best way to do this, as Argo data does not fall neatly into any one system because of its lack of homogeneity.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray to zarr 336458472
401087741 https://github.com/pydata/xarray/issues/2256#issuecomment-401087741 https://api.github.com/repos/pydata/xarray/issues/2256 MDEyOklzc3VlQ29tbWVudDQwMTA4Nzc0MQ== rabernat 1197350 2018-06-28T16:07:02Z 2018-06-28T16:07:02Z MEMBER

Zarr is most useful for very large, homogeneous arrays. The argo data are not that large, and are inhomogeneous. So I'm not sure zarr will really help you out that much here.

In your original post, you said you were doing "cloud processing", but later you referred to a cluster filesystem. Do you plan to put this data in object storage?
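For context, a minimal sketch of what writing the zarr store to object storage could look like, using gcsfs as one example; the project, bucket, and file names are placeholders, and `get_mapper` assumes a recent fsspec-based gcsfs:

```python
# Hypothetical sketch: write a zarr store directly to Google Cloud Storage by
# handing xarray a mutable-mapping store backed by gcsfs.
import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem(project='my-project')   # placeholder project / credentials
store = fs.get_mapper('my-bucket/argo.zarr')     # placeholder bucket path
ds = xr.open_dataset('a_profile_file.nc')        # assumed input
ds.to_zarr(store, mode='w')
```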

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray to zarr 336458472
400910725 https://github.com/pydata/xarray/issues/2256#issuecomment-400910725 https://api.github.com/repos/pydata/xarray/issues/2256 MDEyOklzc3VlQ29tbWVudDQwMDkxMDcyNQ== NickMortimer 4338975 2018-06-28T04:56:48Z 2018-06-28T04:57:33Z NONE

@jhamman Ah, thanks for that, it looks interesting. Is there a way of specifying that in .to_zarr()?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray to zarr 336458472
400909861 https://github.com/pydata/xarray/issues/2256#issuecomment-400909861 https://api.github.com/repos/pydata/xarray/issues/2256 MDEyOklzc3VlQ29tbWVudDQwMDkwOTg2MQ== jhamman 2443309 2018-06-28T04:49:57Z 2018-06-28T04:49:57Z MEMBER

If the proliferation of small files is a concern, you may find a different zarr store appealing. The default in Xarray is a DirectoryStore but you don't have to use that: http://zarr.readthedocs.io/en/latest/api/storage.html
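A minimal sketch of passing a non-default store to to_zarr, using a zarr 2.x ZipStore as one option that avoids many small chunk files; the file names here are placeholders:

```python
# Hypothetical sketch: write the dataset into a single zip file instead of a
# directory of many small chunk files, by passing a zarr.ZipStore to to_zarr.
import xarray as xr
import zarr

ds = xr.open_dataset('a_profile_file.nc')         # assumed input
store = zarr.ZipStore('argo.zarr.zip', mode='w')   # placeholder output path
ds.to_zarr(store)
store.close()                                      # ZipStore must be closed to finalize the zip
```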

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray to zarr 336458472
400909462 https://github.com/pydata/xarray/issues/2256#issuecomment-400909462 https://api.github.com/repos/pydata/xarray/issues/2256 MDEyOklzc3VlQ29tbWVudDQwMDkwOTQ2Mg== NickMortimer 4338975 2018-06-28T04:46:26Z 2018-06-28T04:46:26Z NONE

> I am still confused about what you are trying to achieve. What do you mean by "cache"? Is your goal to compress the data so that it uses less space on disk? Or is it to provide a more "analysis ready" format?

I'd like to have both ;)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray to zarr 336458472
400908763 https://github.com/pydata/xarray/issues/2256#issuecomment-400908763 https://api.github.com/repos/pydata/xarray/issues/2256 MDEyOklzc3VlQ29tbWVudDQwMDkwODc2Mw== NickMortimer 4338975 2018-06-28T04:40:29Z 2018-06-28T04:40:29Z NONE

No worries, at the moment I'm in play mode; pretty much everything is new to me!

OK, the aim of this little setup is to be able to do things like compare floats with those nearby, or create a climatology for a local area from Argo profiles; for example, produce a report for every operational Argo float each cycle and feed that to some kind of AI/ML system to detect bad data in near real time.

So initially I need a platform with which I can easily data-mine the historical floats. With the pickle solution, the entire data set can be accessed with a very small footprint.

Why zarr?

I seem to remember reading that reading/writing to/from HDF5 was limited when compression was turned on. Plus I like the way zarr does things; it looks a lot more fault tolerant.

Keep asking the questions; they are very valuable.

Are you going to the Pangeo meeting?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray to zarr 336458472
400906996 https://github.com/pydata/xarray/issues/2256#issuecomment-400906996 https://api.github.com/repos/pydata/xarray/issues/2256 MDEyOklzc3VlQ29tbWVudDQwMDkwNjk5Ng== rabernat 1197350 2018-06-28T04:27:38Z 2018-06-28T04:27:38Z MEMBER

Thanks for the extra info!

I am still confused about what you are trying to achieve. What do you mean by "cache"? Is your goal to compress the data so that it uses less space on disk? Or is it to provide a more "analysis ready" format?

In other words, why do you feel you need to transform this data to zarr? Why not just work directly with the netcdf files?

Sorry to keep asking questions rather than providing any answers! Just trying to understand your goals...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray to zarr 336458472
400906158 https://github.com/pydata/xarray/issues/2256#issuecomment-400906158 https://api.github.com/repos/pydata/xarray/issues/2256 MDEyOklzc3VlQ29tbWVudDQwMDkwNjE1OA== NickMortimer 4338975 2018-06-28T04:20:28Z 2018-06-28T04:20:28Z NONE

With the pickle solution I end up with 31 files in 3 folders, with a size on disk of 1.2 MB, storing 250 profiles of a single float.

I'm new to GitHub and open source! Thanks for the time and the edit!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray to zarr 336458472
400905950 https://github.com/pydata/xarray/issues/2256#issuecomment-400905950 https://api.github.com/repos/pydata/xarray/issues/2256 MDEyOklzc3VlQ29tbWVudDQwMDkwNTk1MA== rabernat 1197350 2018-06-28T04:18:56Z 2018-06-28T04:18:56Z MEMBER

FYI, I edited your comment to place the output in code blocks (triple ``` before and after) so it is more readable.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray to zarr 336458472
400905262 https://github.com/pydata/xarray/issues/2256#issuecomment-400905262 https://api.github.com/repos/pydata/xarray/issues/2256 MDEyOklzc3VlQ29tbWVudDQwMDkwNTI2Mg== NickMortimer 4338975 2018-06-28T04:12:47Z 2018-06-28T04:18:07Z NONE

Yes, I agree with you. I started out with ds.to_zarr for each file; the problem was that each property of the cycle, e.g. lat and long, ended up in its own file. One float with 250 cycles ended up with over 70,000 small files on my file system, and because of cluster size they occupied over 100 MB of hard disk. As there are over 4000 floats, lots of small files are not going to be viable.

`cycles[int(ds.CYCLE_NUMBER.values[0])-1]=ds`

Yep, this line is funny. CYCLE_NUMBER increments with each cycle and starts at 1. Sometimes a cycle might be delayed and added at a later date, so I did not want to assume that the list of files had been sorted into the order of the float cycles; instead I want to build an array of cycles in order. Also, if a file is replaced by a newer version, then I want it to overwrite the profile in the array.

```
<xarray.Dataset>
Dimensions:                       (N_CALIB: 1, N_HISTORY: 9, N_LEVELS: 69, N_PARAM: 3, N_PROF: 1)
Dimensions without coordinates: N_CALIB, N_HISTORY, N_LEVELS, N_PARAM, N_PROF
Data variables:
    DATA_TYPE                     object ...
    FORMAT_VERSION                object ...
    HANDBOOK_VERSION              object ...
    REFERENCE_DATE_TIME           object ...
    DATE_CREATION                 object ...
    DATE_UPDATE                   object ...
    PLATFORM_NUMBER               (N_PROF) object ...
    PROJECT_NAME                  (N_PROF) object ...
    PI_NAME                       (N_PROF) object ...
    STATION_PARAMETERS            (N_PROF, N_PARAM) object ...
    CYCLE_NUMBER                  (N_PROF) float64 ...
    DIRECTION                     (N_PROF) object ...
    DATA_CENTRE                   (N_PROF) object ...
    DC_REFERENCE                  (N_PROF) object ...
    DATA_STATE_INDICATOR          (N_PROF) object ...
    DATA_MODE                     (N_PROF) object ...
    PLATFORM_TYPE                 (N_PROF) object ...
    FLOAT_SERIAL_NO               (N_PROF) object ...
    FIRMWARE_VERSION              (N_PROF) object ...
    WMO_INST_TYPE                 (N_PROF) object ...
    JULD                          (N_PROF) datetime64[ns] ...
    JULD_QC                       (N_PROF) object ...
    JULD_LOCATION                 (N_PROF) datetime64[ns] ...
    LATITUDE                      (N_PROF) float64 ...
    LONGITUDE                     (N_PROF) float64 ...
    POSITION_QC                   (N_PROF) object ...
    POSITIONING_SYSTEM            (N_PROF) object ...
    PROFILE_PRES_QC               (N_PROF) object ...
    PROFILE_TEMP_QC               (N_PROF) object ...
    PROFILE_PSAL_QC               (N_PROF) object ...
    VERTICAL_SAMPLING_SCHEME      (N_PROF) object ...
    CONFIG_MISSION_NUMBER         (N_PROF) float64 ...
    PRES                          (N_PROF, N_LEVELS) float32 ...
    PRES_QC                       (N_PROF, N_LEVELS) object ...
    PRES_ADJUSTED                 (N_PROF, N_LEVELS) float32 ...
    PRES_ADJUSTED_QC              (N_PROF, N_LEVELS) object ...
    TEMP                          (N_PROF, N_LEVELS) float32 ...
    TEMP_QC                       (N_PROF, N_LEVELS) object ...
    TEMP_ADJUSTED                 (N_PROF, N_LEVELS) float32 ...
    TEMP_ADJUSTED_QC              (N_PROF, N_LEVELS) object ...
    PSAL                          (N_PROF, N_LEVELS) float32 ...
    PSAL_QC                       (N_PROF, N_LEVELS) object ...
    PSAL_ADJUSTED                 (N_PROF, N_LEVELS) float32 ...
    PSAL_ADJUSTED_QC              (N_PROF, N_LEVELS) object ...
    PRES_ADJUSTED_ERROR           (N_PROF, N_LEVELS) float32 ...
    TEMP_ADJUSTED_ERROR           (N_PROF, N_LEVELS) float32 ...
    PSAL_ADJUSTED_ERROR           (N_PROF, N_LEVELS) float32 ...
    PARAMETER                     (N_PROF, N_CALIB, N_PARAM) object ...
    SCIENTIFIC_CALIB_EQUATION     (N_PROF, N_CALIB, N_PARAM) object ...
    SCIENTIFIC_CALIB_COEFFICIENT  (N_PROF, N_CALIB, N_PARAM) object ...
    SCIENTIFIC_CALIB_COMMENT      (N_PROF, N_CALIB, N_PARAM) object ...
    SCIENTIFIC_CALIB_DATE         (N_PROF, N_CALIB, N_PARAM) object ...
    HISTORY_INSTITUTION           (N_HISTORY, N_PROF) object ...
    HISTORY_STEP                  (N_HISTORY, N_PROF) object ...
    HISTORY_SOFTWARE              (N_HISTORY, N_PROF) object ...
    HISTORY_SOFTWARE_RELEASE      (N_HISTORY, N_PROF) object ...
    HISTORY_REFERENCE             (N_HISTORY, N_PROF) object ...
    HISTORY_DATE                  (N_HISTORY, N_PROF) object ...
    HISTORY_ACTION                (N_HISTORY, N_PROF) object ...
    HISTORY_PARAMETER             (N_HISTORY, N_PROF) object ...
    HISTORY_START_PRES            (N_HISTORY, N_PROF) float32 ...
    HISTORY_STOP_PRES             (N_HISTORY, N_PROF) float32 ...
    HISTORY_PREVIOUS_VALUE        (N_HISTORY, N_PROF) float32 ...
    HISTORY_QCTEST                (N_HISTORY, N_PROF) object ...
Attributes:
    title:                Argo float vertical profile
    institution:          CSIRO
    source:               Argo float
    history:              2013-07-30T09:13:35Z creation;2014-08-18T19:33:14Z ...
    references:           http://www.argodatamgt.org/Documentation
    user_manual_version:  3.1
    Conventions:          Argo-3.1 CF-1.6
    featureType:          trajectoryProfile
```

A single float file ends up as 194 small files in 68 directories: total size 30.4 KB (31,223 bytes), but size on disk 776 KB (794,624 bytes).

I have tried

```python
ds = xr.open_mfdataset(r"C:\Users\mor582\Documents\projects\argo\D1901324\*_*.nc")
```

but it fails with: `ValueError: arguments without labels along dimension 'N_HISTORY' cannot be aligned because they have different dimension sizes: {9, 11, 6}`
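One way around that alignment error (a sketch, not from the thread) is to drop the history variables before combining, since N_HISTORY differs between files; `drop_dims` and the `combine`/`concat_dim` argument names below assume a newer xarray than the one in use here:

```python
# Hypothetical sketch: remove the variably-sized N_HISTORY dimension from each
# file before open_mfdataset tries to align them along N_PROF.
import xarray as xr

def drop_history(ds):
    # Drop every variable that uses the N_HISTORY dimension.
    return ds.drop_dims('N_HISTORY')

ds = xr.open_mfdataset(
    r"C:\Users\mor582\Documents\projects\argo\D1901324\*_*.nc",
    preprocess=drop_history,
    combine='nested',
    concat_dim='N_PROF',
)
```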

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray to zarr 336458472
400902555 https://github.com/pydata/xarray/issues/2256#issuecomment-400902555 https://api.github.com/repos/pydata/xarray/issues/2256 MDEyOklzc3VlQ29tbWVudDQwMDkwMjU1NQ== rabernat 1197350 2018-06-28T03:51:34Z 2018-06-28T03:51:34Z MEMBER

Can you clarify what you are trying to achieve with the transformations?

Why not do something like this?

```python
for file in filenames:
    ds = xr.open_dataset(file)
    ds.to_zarr(file + '.zarr')
```

I'm particularly confused by this line: `cycles[int(ds.CYCLE_NUMBER.values[0])-1]=ds`. Could it be that you are describing the "straight pickle to zarr array" workflow you referred to in your earlier post? This is definitely an unconventional and not recommended way to interface xarray with zarr. It would be better to use the built-in .to_zarr() function. We can help you debug why that isn't working well, but we need more information.

Specifically:

Could you please post the repr of a single netcdf dataset from this collection, i.e.

```python
ds = xr.open_dataset('file.nc')
print(ds)
```

Then could you call ds.to_zarr() and describe the contents of the resulting zarr store in more detail? (For example, could you list the directories within the store?)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray to zarr 336458472
400901163 https://github.com/pydata/xarray/issues/2256#issuecomment-400901163 https://api.github.com/repos/pydata/xarray/issues/2256 MDEyOklzc3VlQ29tbWVudDQwMDkwMTE2Mw== NickMortimer 4338975 2018-06-28T03:41:10Z 2018-06-28T03:41:10Z NONE

Thanks, yep, my goal is to provide a simple online notebook that can be used to process/QA/QC Argo float data. I'd like to create a system that works intuitively with the current file structure and not build a database of values on top of it.

Here's a first go with some code:

```python
def processfloat(floatpath, zarrpath):
    root = zarr.open(zarrpath, mode='a')
    filenames = glob.glob(floatpath)

    for file in filenames:
        ds = xr.open_dataset(file)
        platform = ds.PLATFORM_NUMBER.values[0].strip()
        float = root.get(platform)
        if float == None:
            float = root.create_group(platform)
        cycles = float.get('cycles')
        if cycles == None:
            cycles = float.zeros('cycles', shape=1, chunks=10, dtype=object,
                                 object_codec=numcodecs.Pickle())
        while len(cycles) < ds.CYCLE_NUMBER.values[0]:
            cycles.append([0])
        cycles[int(ds.CYCLE_NUMBER.values[0])-1] = ds

    summary = float.zeros('summary', shape=1, chunks=10, dtype=object,
                          object_codec=numcodecs.Pickle())
    summary[0] = pd.DataFrame(list(map(lambda x: {'latitude': x.LATITUDE.values[0],
                                                  'longitude': x.LONGITUDE.values[0],
                                                  'time': x.JULD.values[0],
                                                  'platform': platform,
                                                  'cycle': x.CYCLE_NUMBER.values[0]}, cycles)))
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray to zarr 336458472
400899756 https://github.com/pydata/xarray/issues/2256#issuecomment-400899756 https://api.github.com/repos/pydata/xarray/issues/2256 MDEyOklzc3VlQ29tbWVudDQwMDg5OTc1Ng== rabernat 1197350 2018-06-28T03:31:34Z 2018-06-28T03:31:34Z MEMBER

I think this effort should be of great interest to a lot of computational oceanographers. I have worked a lot with both Argo data and zarr, but never yet tried to combine them.

I would recommend reading this guide if you have not done so already: http://pangeo-data.org/data.html#guide-to-preparing-cloud-optimized-data

Then could you post the xarray repr of one of the netcdf files you are working with here? i.e.

```python
ds = xr.open_dataset('file.nc')
print(ds)
```

And then finally post the full code you are using to read, transform, and output the zarr data.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray to zarr 336458472

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);