
issue_comments


66 rows where author_association = "CONTRIBUTOR" and user = 703554 sorted by updated_at descending


issue 21

  • WIP: Zarr backend 16
  • xarray/zarr cloud demo 7
  • Adds Dataset.query() method, analogous to pandas DataFrame.query() 7
  • HDF5 backend for xray 5
  • zarr as persistent store for xarray 4
  • Explicit indexes in xarray's data-model (Future of MultiIndex) 4
  • initial implementation of support for NetCDF groups 3
  • Low memory/out-of-core index? 3
  • Unnamed dimensions 3
  • Fancy indexing a Dataset with dask DataArray causes excessive memory usage 2
  • Explaining xarray in a single picture 2
  • Support multi-dimensional grouped operations and group_over 1
  • WIP: New DataStore / Encoder / Decoder API for review 1
  • Zarr consolidated 1
  • Zarr loading from ZipStore gives error on default arguments 1
  • Allow nested dictionaries in the Zarr backend (#3517) 1
  • DOC: from examples to tutorials 1
  • Errors using to_zarr for an s3 store 1
  • Wrap "Dimensions" onto multiple lines in xarray.Dataset repr? 1
  • Fancy indexing a Dataset with dask DataArray triggers multiple computes 1
  • Slow performance of concat() 1

user 1

  • alimanfoo · 66

author_association 1

  • CONTRIBUTOR · 66
Columns: id · html_url · issue_url · node_id · user · created_at · updated_at ▲ (sort) · author_association · body · reactions · performed_via_github_app · issue
1544199022 https://github.com/pydata/xarray/issues/7833#issuecomment-1544199022 https://api.github.com/repos/pydata/xarray/issues/7833 IC_kwDOAMm_X85cCptu alimanfoo 703554 2023-05-11T15:26:52Z 2023-05-11T15:26:52Z CONTRIBUTOR

Awesome, thanks @kmuehlbauer and @Illviljan 🙏🏻

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Slow performance of concat() 1704950804
1190061811 https://github.com/pydata/xarray/issues/3564#issuecomment-1190061811 https://api.github.com/repos/pydata/xarray/issues/3564 IC_kwDOAMm_X85G7ubz alimanfoo 703554 2022-07-20T09:44:40Z 2022-07-20T09:44:40Z CONTRIBUTOR

Hi folks,

Just to mention that we've created a short tutorial on xarray which is meant as a gentle intro for folks coming from the malaria genetics field, who mostly have never heard of xarray before. We illustrate xarray first using outputs from a geostatistical model of how insecticide-treated bednets are used in Africa. We then give a couple of brief examples of how we use xarray for genomic data. There are video walkthroughs in French and English:

https://anopheles-genomic-surveillance.github.io/workshop-5/module-1-xarray.html

Please feel free to link to this in the xarray tutorial site if you'd like to :)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  DOC: from examples to tutorials 527323165
1190057727 https://github.com/pydata/xarray/issues/6771#issuecomment-1190057727 https://api.github.com/repos/pydata/xarray/issues/6771 IC_kwDOAMm_X85G7tb_ alimanfoo 703554 2022-07-20T09:40:41Z 2022-07-20T09:41:07Z CONTRIBUTOR

Hi @dcherian,

> We are currently reworking https://tutorial.xarray.dev/intro.html and would love to either add your material or link to it if you're creating a consolidated collection of genetics-related material. xref (#3564). We don't have a "domain-specific" section yet but are planning to create one after SciPy.

FWIW we've created a short tutorial on xarray which is meant as a gentle intro for folks coming from the malaria genetics field. We illustrate xarray first using outputs from a geostatistical model of how insecticide-treated bednets are used in Africa. We then give a couple of brief examples of how we use xarray for genomic data. There are video walkthroughs in French and English:

https://anopheles-genomic-surveillance.github.io/workshop-5/module-1-xarray.html

Please feel free to link to this in the xarray tutorial site if you'd like to :)

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
  Explaining xarray in a single picture 1300534066
1190052947 https://github.com/pydata/xarray/issues/6771#issuecomment-1190052947 https://api.github.com/repos/pydata/xarray/issues/6771 IC_kwDOAMm_X85G7sRT alimanfoo 703554 2022-07-20T09:36:10Z 2022-07-20T09:36:10Z CONTRIBUTOR

Hi @TomNicholas,

> I would've thought that latitude and longitude would be 1-dimensional coordinate variables, yet they are drawn as 2-D arrays?

> I think that if you assume that the axes of your grid data align with the cardinal directions (East-West / North-South) then you would expect latitude and longitude to be 1D, but if they don't align then the coordinates would need to be 2D (i.e. if x and y are merely arbitrary lines along the Earth's surface).
>
> I agree with you though that 2D lat/lon grids are unnecessarily confusing, especially for non-geoscience users.

Interesting, I hadn't considered that. Definitely a bit mind-bending though for us non-geoscientists :)

I like the second diagram you showed more (it's also a neater version of the labelled one I made here). I think it's debatable whether elevation and land_cover constitute coordinates or data variables, but I have no strong opinion on that.

> As for improvements, I think it would be clearer to at least use the second image over the first, and perhaps we could improve it further.

SGTM. FWIW on the second diagram I would use "dimensions" instead of "indexes". Getting dimensions first then helps to explain how you can use a coordinate variable to index a dimension.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explaining xarray in a single picture 1300534066
1054526670 https://github.com/pydata/xarray/issues/324#issuecomment-1054526670 https://api.github.com/repos/pydata/xarray/issues/324 IC_kwDOAMm_X84-2szO alimanfoo 703554 2022-02-28T18:10:02Z 2022-02-28T18:10:02Z CONTRIBUTOR

Still relevant, would like to be able to group by multiple variables along a single dimension.

{
    "total_count": 6,
    "+1": 6,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Support multi-dimensional grouped operations and group_over 58117200
802732278 https://github.com/pydata/xarray/issues/4663#issuecomment-802732278 https://api.github.com/repos/pydata/xarray/issues/4663 MDEyOklzc3VlQ29tbWVudDgwMjczMjI3OA== alimanfoo 703554 2021-03-19T10:44:31Z 2021-03-19T10:44:31Z CONTRIBUTOR

Thanks @dcherian.

Just to add that if we make progress with supporting indexing with dask arrays then at some point I think we'll hit a separate issue, which is that xarray will require that the chunk sizes of the indexed arrays are computed, but currently calling the dask array method compute_chunk_sizes() is inefficient for n-d arrays. Raised here: https://github.com/dask/dask/issues/7416

In case anyone needs a workaround for indexing a dataset with a 1d boolean dask array, I'm currently using this hacked implementation of a compress() style function that operates on an xarray dataset, which includes more efficient computation of chunk sizes.
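To make the chunk-size issue concrete, here is a minimal sketch (mine, not from the original comment) of how boolean fancy indexing leaves a dask array with unknown chunk sizes until compute_chunk_sizes() is called:

```python
import dask.array as da
import numpy as np

# Boolean fancy indexing leaves the result with unknown (NaN) chunk sizes.
x = da.arange(10, chunks=5)
mask = da.from_array(np.array([True, False] * 5), chunks=5)
y = x[mask]
print(y.chunks)  # ((nan, nan),) -- lengths unknown until computed

# compute_chunk_sizes() evaluates the mask to resolve the chunk lengths,
# which is the step that is currently inefficient for n-d arrays.
y.compute_chunk_sizes()
print(y.chunks)  # ((3, 2),)
```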

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fancy indexing a Dataset with dask DataArray triggers multiple computes 759709924
802101178 https://github.com/pydata/xarray/issues/5054#issuecomment-802101178 https://api.github.com/repos/pydata/xarray/issues/5054 MDEyOklzc3VlQ29tbWVudDgwMjEwMTE3OA== alimanfoo 703554 2021-03-18T16:45:51Z 2021-03-18T16:58:44Z CONTRIBUTOR

FWIW my use case actually only needs indexing a single dimension, i.e., something equivalent to the numpy (or dask.array) compress function. This can be hacked for xarray datasets in a fairly straightforward way:

```python
import numpy as np
import xarray as xr


def _compress_dataarray(a, indexer, dim):
    data = a.data
    try:
        axis = a.dims.index(dim)
    except ValueError:
        v = data
    else:
        # rely on __array_function__ to handle dispatching to dask if
        # data is a dask array
        v = np.compress(indexer, a.data, axis=axis)
        if hasattr(v, 'compute_chunk_sizes'):
            # needed to know dim lengths
            v.compute_chunk_sizes()
    return v


def compress_dataset(ds, indexer, dim):
    if isinstance(indexer, str):
        indexer = ds[indexer].data

    coords = dict()
    for k in ds.coords:
        a = ds[k]
        v = _compress_dataarray(a, indexer, dim)
        coords[k] = (a.dims, v)

    data_vars = dict()
    for k in ds.data_vars:
        a = ds[k]
        v = _compress_dataarray(a, indexer, dim)
        data_vars[k] = (a.dims, v)

    attrs = ds.attrs.copy()

    return xr.Dataset(data_vars=data_vars, coords=coords, attrs=attrs)
```
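For illustration, a hedged usage sketch of the helper above on a toy dataset (the variable and dimension names here are made up):

```python
import numpy as np
import xarray as xr

# Toy dataset: drop rows of the "variants" dimension with a boolean mask,
# using the compress_dataset() helper sketched above.
ds = xr.Dataset(
    data_vars={"calls": (("variants", "samples"), np.arange(12).reshape(6, 2))},
    coords={"pos": ("variants", np.array([10, 20, 30, 40, 50, 60]))},
)
keep = np.array([True, False, True, True, False, True])
subset = compress_dataset(ds, keep, dim="variants")
print(subset.sizes)  # variants: 4, samples: 2
```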

Given the complexity of fancy indexing in general, I wonder if it's worth contemplating implementing a Dataset.compress() method as a first step.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fancy indexing a Dataset with dask DataArray causes excessive memory usage 834972299
802096873 https://github.com/pydata/xarray/issues/5054#issuecomment-802096873 https://api.github.com/repos/pydata/xarray/issues/5054 MDEyOklzc3VlQ29tbWVudDgwMjA5Njg3Mw== alimanfoo 703554 2021-03-18T16:39:59Z 2021-03-18T16:39:59Z CONTRIBUTOR

Thanks @dcherian.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Fancy indexing a Dataset with dask DataArray causes excessive memory usage 834972299
800504527 https://github.com/pydata/xarray/pull/4984#issuecomment-800504527 https://api.github.com/repos/pydata/xarray/issues/4984 MDEyOklzc3VlQ29tbWVudDgwMDUwNDUyNw== alimanfoo 703554 2021-03-16T18:28:09Z 2021-03-16T18:28:09Z CONTRIBUTOR

Yay, first xarray PR :partying_face:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Adds Dataset.query() method, analogous to pandas DataFrame.query() 819911891
800317378 https://github.com/pydata/xarray/pull/4984#issuecomment-800317378 https://api.github.com/repos/pydata/xarray/issues/4984 MDEyOklzc3VlQ29tbWVudDgwMDMxNzM3OA== alimanfoo 703554 2021-03-16T14:40:45Z 2021-03-16T14:40:45Z CONTRIBUTOR

> Could we add a very small test for the DataArray? Given the coverage on Dataset, it should mostly just test that the method works.

No problem, some DataArray tests are there.

> Any thoughts from others before we merge?

Good to go from my side.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Adds Dataset.query() method, analogous to pandas DataFrame.query() 819911891
800176868 https://github.com/pydata/xarray/pull/4984#issuecomment-800176868 https://api.github.com/repos/pydata/xarray/issues/4984 MDEyOklzc3VlQ29tbWVudDgwMDE3Njg2OA== alimanfoo 703554 2021-03-16T11:24:42Z 2021-03-16T11:24:42Z CONTRIBUTOR

Hi @max-sixty,

> It looks like we need a requires_numexpr decorator on the tests — would you be OK to add that?

Sure, done.

> Could we add a simple method to DataArray which converts to a Dataset, calls the functions, and converts back too? (there are lots of examples already of this, let me know any issues)

Done.

> And we should add the methods to api.rst, and a whatsnew entry if possible.

Done.

Let me know if there's anything else. Looking forward to using this :smile:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Adds Dataset.query() method, analogous to pandas DataFrame.query() 819911891
798993998 https://github.com/pydata/xarray/pull/4984#issuecomment-798993998 https://api.github.com/repos/pydata/xarray/issues/4984 MDEyOklzc3VlQ29tbWVudDc5ODk5Mzk5OA== alimanfoo 703554 2021-03-14T22:44:49Z 2021-03-14T22:44:49Z CONTRIBUTOR

> Currently the test runs over an array of two dimensions — x & y. Would pd.query work if there were also a z dimension?

No worries, yes any number of dimensions can be queried. I've added tests showing three dimensions can be queried.

As an aside, in writing these tests I came upon a probable upstream bug in pandas, reported as https://github.com/pandas-dev/pandas/issues/40436. I don't think it affects this PR though, and it has low impact, as only the "python" query parser is affected and most people will use the default "pandas" query parser.
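For illustration, a minimal sketch of the API this PR adds (the variable names are made up; see the PR's tests for the authoritative examples):

```python
import numpy as np
import xarray as xr

# Filter along a dimension with a pandas-style query string evaluated
# against the dataset's 1-D variables.
ds = xr.Dataset({"a": ("x", np.arange(10)), "b": ("x", np.linspace(0, 1, 10))})
out = ds.query(x="a > 5")  # keep points along x where variable a > 5
print(out["a"].values)     # [6 7 8 9]
```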

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Adds Dataset.query() method, analogous to pandas DataFrame.query() 819911891
797668635 https://github.com/pydata/xarray/pull/4984#issuecomment-797668635 https://api.github.com/repos/pydata/xarray/issues/4984 MDEyOklzc3VlQ29tbWVudDc5NzY2ODYzNQ== alimanfoo 703554 2021-03-12T18:16:15Z 2021-03-12T18:16:15Z CONTRIBUTOR

Just to mention I've added tests to verify this works with variables backed by dask arrays. Also added explicit tests of different eval engine and query parser options. And added a docstring.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Adds Dataset.query() method, analogous to pandas DataFrame.query() 819911891
797636489 https://github.com/pydata/xarray/pull/4984#issuecomment-797636489 https://api.github.com/repos/pydata/xarray/issues/4984 MDEyOklzc3VlQ29tbWVudDc5NzYzNjQ4OQ== alimanfoo 703554 2021-03-12T17:21:29Z 2021-03-12T17:21:29Z CONTRIBUTOR

Hi @max-sixty, no problem. Re this...

> Does the pd.eval work with more than two dimensions?

...not quite sure what you mean, could you elaborate?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Adds Dataset.query() method, analogous to pandas DataFrame.query() 819911891
788828644 https://github.com/pydata/xarray/pull/4984#issuecomment-788828644 https://api.github.com/repos/pydata/xarray/issues/4984 MDEyOklzc3VlQ29tbWVudDc4ODgyODY0NA== alimanfoo 703554 2021-03-02T11:10:20Z 2021-03-02T11:10:20Z CONTRIBUTOR

Hi folks, thought I'd put up a proof of concept PR here for further discussion. Any advice/suggestions about if/how to take this forward would be very welcome.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Adds Dataset.query() method, analogous to pandas DataFrame.query() 819911891
631075010 https://github.com/pydata/xarray/issues/4079#issuecomment-631075010 https://api.github.com/repos/pydata/xarray/issues/4079 MDEyOklzc3VlQ29tbWVudDYzMTA3NTAxMA== alimanfoo 703554 2020-05-19T20:50:26Z 2020-05-19T20:50:51Z CONTRIBUTOR

> In the specific example from your notebook, where do the dimension lengths __variants/BaseCounts_dim1, __variants/MLEAC_dim1 and __variants/MLEAF_dim1 come from?
>
> BaseCounts_dim1 is length 4, so maybe that corresponds to DNA bases ATGC?

In this specific example, I do actually know where these dimension lengths come from. In fact I should've used the shared dimension alt_alleles instead of __variants/MLEAC_dim1 and __variants/MLEAF_dim1. And yes BaseCounts_dim1 does correspond to DNA bases.

But two points.

First, I don't care about these dimensions. The only dimensions I care about and will use are variants, samples and ploidy.

Second, more important, this kind of data can come from a number of different sources, each of which includes a different set of arrays with different names and semantics. While there are some common arrays and naming conventions where I can guess what the dimensions mean, in general I can't know all of those up front and bake them in as special cases.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Unnamed dimensions 621078539
631071623 https://github.com/pydata/xarray/issues/4081#issuecomment-631071623 https://api.github.com/repos/pydata/xarray/issues/4081 MDEyOklzc3VlQ29tbWVudDYzMTA3MTYyMw== alimanfoo 703554 2020-05-19T20:43:07Z 2020-05-19T20:43:07Z CONTRIBUTOR

Thanks @shoyer for raising this, would be nice to wrap the dimensions, I'd vote for one per line.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Wrap "Dimensions" onto multiple lines in xarray.Dataset repr? 621123222
630924754 https://github.com/pydata/xarray/issues/4079#issuecomment-630924754 https://api.github.com/repos/pydata/xarray/issues/4079 MDEyOklzc3VlQ29tbWVudDYzMDkyNDc1NA== alimanfoo 703554 2020-05-19T16:14:27Z 2020-05-19T16:14:27Z CONTRIBUTOR

Thanks @shoyer.

For reference, I'm exploring putting some genome variation data into xarray; there's an initial experiment and discussion here.

In general I will have some arrays where I won't know what some of the dimensions mean, and so cannot give them a meaningful name.

No worries if this is hard, was just wondering if it was supported already.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Unnamed dimensions 621078539
630913851 https://github.com/pydata/xarray/issues/4079#issuecomment-630913851 https://api.github.com/repos/pydata/xarray/issues/4079 MDEyOklzc3VlQ29tbWVudDYzMDkxMzg1MQ== alimanfoo 703554 2020-05-19T15:55:54Z 2020-05-19T15:55:54Z CONTRIBUTOR

Thanks so much @rabernat for the quick response.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Unnamed dimensions 621078539
605179227 https://github.com/pydata/xarray/issues/3831#issuecomment-605179227 https://api.github.com/repos/pydata/xarray/issues/3831 MDEyOklzc3VlQ29tbWVudDYwNTE3OTIyNw== alimanfoo 703554 2020-03-27T18:10:05Z 2020-03-27T18:10:05Z CONTRIBUTOR

Just to say having some kind of stack integration tests is a marvellous idea. Another example of an issue that's very hard to pin down is https://github.com/zarr-developers/zarr-python/issues/528.

Btw we have also run into issues with fsspec caching directory listings and not invalidating the cache when store changes are made, although I haven't checked with latest master. We have a lot of workarounds in our code where we reopen everything after we've made changes to a store. Probably an area where some more digging and careful testing may be needed.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Errors using to_zarr for an s3 store 576337745
554463832 https://github.com/pydata/xarray/pull/3526#issuecomment-554463832 https://api.github.com/repos/pydata/xarray/issues/3526 MDEyOklzc3VlQ29tbWVudDU1NDQ2MzgzMg== alimanfoo 703554 2019-11-15T17:57:42Z 2019-11-15T17:57:42Z CONTRIBUTOR

FWIW in the Zarr Python implementation I don't think we do any special encoding or decoding of attribute values. Whatever value is given gets serialised using the built-in json.dumps. I believe this means that if someone provides a dict as an attribute value, it will be serialised as a JSON object and deserialised back to a dict, although this is not something we currently test for.

From the zarr v2 spec point of view I think anything goes in the .zattrs file, as long as .zattrs is a JSON object at the root.
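A minimal sketch of the round-trip described above (the file name is illustrative, and as noted, this behaviour isn't covered by a test):

```python
import zarr

# Attribute values pass straight through json.dumps/json.loads, so a dict
# should round-trip as a JSON object.
g = zarr.open_group("example.zarr", mode="w")
g.attrs["nested"] = {"units": "m", "scale": [1, 2, 3]}
print(dict(g.attrs))  # {'nested': {'units': 'm', 'scale': [1, 2, 3]}}
```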

Hth.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Allow nested dictionaries in the Zarr backend (#3517) 522519084
455374760 https://github.com/pydata/xarray/issues/2586#issuecomment-455374760 https://api.github.com/repos/pydata/xarray/issues/2586 MDEyOklzc3VlQ29tbWVudDQ1NTM3NDc2MA== alimanfoo 703554 2019-01-17T23:49:07Z 2019-01-17T23:49:07Z CONTRIBUTOR

> IMO, zarr needs some kind of "resolver" mechanism that takes a string and decides what kind of store it represents. For example, if the path ends with .zip, then it should know it's a zip store, if it starts with gs://, it should know it's a google cloud store, etc.

Some very limited support for this is there already, e.g., if the string ends with '.zip' then a zip store will be used, but there's no support for dispatching to cloud stores via a URL-like protocol. There's an open issue for that: https://github.com/zarr-developers/zarr/issues/214
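For illustration, a hypothetical sketch of such a resolver (resolve_store and its rules are made up for this example; they are not zarr API):

```python
import zarr

def resolve_store(path):
    # Hypothetical resolver mapping a path/URL string to a zarr store.
    # Only the '.zip' rule reflects behaviour zarr had at the time; the
    # 'gs://' branch is the missing piece tracked in zarr issue #214.
    if path.endswith(".zip"):
        return zarr.ZipStore(path)
    if path.startswith("gs://"):
        raise NotImplementedError("cloud store dispatch not implemented")
    return zarr.DirectoryStore(path)
```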

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr loading from ZipStore gives error on default arguments 386515973
444187219 https://github.com/pydata/xarray/issues/1603#issuecomment-444187219 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDQ0NDE4NzIxOQ== alimanfoo 703554 2018-12-04T17:33:34Z 2018-12-04T17:33:34Z CONTRIBUTOR

> I think that one big source of confusion has been so far mixing coordinates/variables and indexes. These are really two separate concepts, and the indexes refactoring should address that IMHO.
>
> For example, I think that da[some_name] should never return indexes but only coordinates (and/or data variables for Dataset). That would be much simpler.

Can't claim to be following every detail here, but this sounds very sensible to me FWIW.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
442801741 https://github.com/pydata/xarray/pull/2559#issuecomment-442801741 https://api.github.com/repos/pydata/xarray/issues/2559 MDEyOklzc3VlQ29tbWVudDQ0MjgwMTc0MQ== alimanfoo 703554 2018-11-29T11:33:33Z 2018-11-29T11:33:33Z CONTRIBUTOR

Great to see this. On the API, FWIW I'd vote for using the same keyword (consolidated) in both, less burden on the user to remember what to use.

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Zarr consolidated 382497709
392831984 https://github.com/pydata/xarray/issues/1603#issuecomment-392831984 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDM5MjgzMTk4NA== alimanfoo 703554 2018-05-29T15:59:46Z 2018-05-29T15:59:46Z CONTRIBUTOR

Ok, cool. Was wondering if now was the right time to revisit that, alongside the work proposed in this PR. Happy to participate in that discussion, still interested in implementing some alternative index classes.

On Tue, 29 May 2018, 15:45 Stephan Hoyer, notifications@github.com wrote:

Yes, the index API still needs to be determined. But I think we want to support something like that.

On Tue, May 29, 2018 at 1:20 AM Alistair Miles notifications@github.com wrote:

I see this mentions an Index API, is that still to be decided?

On Tue, 29 May 2018, 05:28 Stephan Hoyer, notifications@github.com wrote:

I started thinking about how to do this incrementally, and it occurs to me that a good place to start would be to write some of the utility functions we'll need for this:

  1. Normalizing and creating default indexes in the Dataset/DataArray constructor.
  2. Combining indexes from all xarray objects that are inputs for an operation into indexes for the outputs.
  3. Extracting MultiIndex objects from arguments into Dataset/DataArray and expanding them into multiple variables.

I drafted up docstrings for each of these functions and did a little bit of work starting to think through implementations in #2195 https://github.com/pydata/xarray/pull/2195. So this would be a great place for others to help out. Each of these could be separate PRs.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
392692996 https://github.com/pydata/xarray/issues/1603#issuecomment-392692996 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDM5MjY5Mjk5Ng== alimanfoo 703554 2018-05-29T08:20:22Z 2018-05-29T08:20:22Z CONTRIBUTOR

I see this mentions an Index API, is that still to be decided?

On Tue, 29 May 2018, 05:28 Stephan Hoyer, notifications@github.com wrote:

I started thinking about how to do this incrementally, and it occurs to me that a good place to start would be to write some of the utility functions we'll need for this:

  1. Normalizing and creating default indexes in the Dataset/DataArray constructor.
  2. Combining indexes from all xarray objects that are inputs for an operation into indexes for the outputs.
  3. Extracting MultiIndex objects from arguments into Dataset/DataArray and expanding them into multiple variables.

I drafted up docstrings for each of these functions and did a little bit of work starting to think through implementations in #2195 https://github.com/pydata/xarray/pull/2195. So this would be a great place for others to help out. Each of these could be separate PRs.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
371626776 https://github.com/pydata/xarray/issues/1974#issuecomment-371626776 https://api.github.com/repos/pydata/xarray/issues/1974 MDEyOklzc3VlQ29tbWVudDM3MTYyNjc3Ng== alimanfoo 703554 2018-03-08T21:15:04Z 2018-03-08T21:15:04Z CONTRIBUTOR

It worked! Thanks again, pangeo.pydata.org is super cool.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray/zarr cloud demo 303270676
371603679 https://github.com/pydata/xarray/issues/1974#issuecomment-371603679 https://api.github.com/repos/pydata/xarray/issues/1974 MDEyOklzc3VlQ29tbWVudDM3MTYwMzY3OQ== alimanfoo 703554 2018-03-08T19:52:01Z 2018-03-08T19:52:01Z CONTRIBUTOR

I have it running! Will try to start the talk with it.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray/zarr cloud demo 303270676
371561259 https://github.com/pydata/xarray/issues/1974#issuecomment-371561259 https://api.github.com/repos/pydata/xarray/issues/1974 MDEyOklzc3VlQ29tbWVudDM3MTU2MTI1OQ== alimanfoo 703554 2018-03-08T17:30:21Z 2018-03-08T17:30:21Z CONTRIBUTOR

Actually just realising @rabernat and @mrocklin you guys already demoed all of this to ESIP back in January (really nice talk btw). So maybe I don't need to repeat.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray/zarr cloud demo 303270676
371558334 https://github.com/pydata/xarray/issues/1974#issuecomment-371558334 https://api.github.com/repos/pydata/xarray/issues/1974 MDEyOklzc3VlQ29tbWVudDM3MTU1ODMzNA== alimanfoo 703554 2018-03-08T17:21:08Z 2018-03-08T17:21:08Z CONTRIBUTOR

Thanks @mrocklin.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray/zarr cloud demo 303270676
371544386 https://github.com/pydata/xarray/issues/1974#issuecomment-371544386 https://api.github.com/repos/pydata/xarray/issues/1974 MDEyOklzc3VlQ29tbWVudDM3MTU0NDM4Ng== alimanfoo 703554 2018-03-08T16:38:48Z 2018-03-08T16:38:48Z CONTRIBUTOR

Ha, Murphy's law. Shame because the combination of jupyterlab interface, launching a kubernetes cluster, and being able to click through to the Dask dashboard looks futuristic cool :-) I was really looking forward to seeing all my jobs spinning through the Dask dashboard as they work. I actually have a pretty packed talk already so don't absolutely need to include this, but if it does come back in time I'll slot it in. Talk starts 8pm GMT so still a few hours yet...

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray/zarr cloud demo 303270676
371538819 https://github.com/pydata/xarray/issues/1974#issuecomment-371538819 https://api.github.com/repos/pydata/xarray/issues/1974 MDEyOklzc3VlQ29tbWVudDM3MTUzODgxOQ== alimanfoo 703554 2018-03-08T16:22:16Z 2018-03-08T16:22:16Z CONTRIBUTOR

Just tried to run the xarray-data notebook from within pangeo.pydata.org jupyterlab. When I run this command:

gcsmap = gcsfs.mapping.GCSMap('pangeo-data/newman-met-ensemble')

...it hangs there indefinitely. If I keyboard interrupt it bottoms out here:

/opt/conda/lib/python3.6/site-packages/urllib3/util/connection.py in create_connection(address, timeout, source_address, socket_options)
     71     if source_address:
     72         sock.bind(source_address)
---> 73     sock.connect(sa)
     74     return sock
     75

...suggesting it is not able to make a connection. Am I doing something wrong?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray/zarr cloud demo 303270676
371299755 https://github.com/pydata/xarray/issues/1974#issuecomment-371299755 https://api.github.com/repos/pydata/xarray/issues/1974 MDEyOklzc3VlQ29tbWVudDM3MTI5OTc1NQ== alimanfoo 703554 2018-03-07T21:58:49Z 2018-03-07T21:58:49Z CONTRIBUTOR

Wonderful, thanks both!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  xarray/zarr cloud demo 303270676
350375750 https://github.com/pydata/xarray/pull/1528#issuecomment-350375750 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM1MDM3NTc1MA== alimanfoo 703554 2017-12-08T21:24:45Z 2017-12-08T22:27:47Z CONTRIBUTOR

Just to confirm, if writes are aligned with chunk boundaries in the destination array then no locking is required.

Also if you're going to be moving large datasets into cloud storage and doing distributed computing then it may be worth investigating compressors and compressor options, as a good compression ratio may make a big difference where network bandwidth is the limiting factor. I would suggest using the Blosc compressor with cname='zstd'. I would also suggest using shuffle; the Blosc codec in the latest numcodecs has an AUTOSHUFFLE option, so byte shuffle is used for arrays with >1 byte item size and bit shuffle is used for arrays with 1 byte item size. I would also experiment with compression level (clevel) to see how speed balances against compression ratio. E.g., Blosc(cname='zstd', clevel=5, shuffle=Blosc.AUTOSHUFFLE) may be a good starting point. The default compressor, Blosc(cname='lz4', ...), is more optimised for fast local storage, so speed is very good but compression ratio is moderate; this may not be best for distributed computing.
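As a concrete sketch of that starting point (the array shape, chunks and dtype are arbitrary):

```python
import numpy as np
import zarr
from numcodecs import Blosc

# zstd via Blosc with automatic shuffle selection, clevel=5 as a starting
# point to compare against other compression levels.
compressor = Blosc(cname="zstd", clevel=5, shuffle=Blosc.AUTOSHUFFLE)
z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype="i4",
               compressor=compressor)
z[:] = np.random.randint(0, 100, size=z.shape)
print(z.info)  # reports the storage ratio, useful when tuning clevel
```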

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
350379064 https://github.com/pydata/xarray/pull/1528#issuecomment-350379064 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM1MDM3OTA2NA== alimanfoo 703554 2017-12-08T21:40:40Z 2017-12-08T22:27:35Z CONTRIBUTOR

Some examples of compressor benchmarking here may be useful http://alimanfoo.github.io/2016/09/21/genotype-compression-benchmark.html

The specific conclusions probably won't apply to your data but some of the code and ideas may be useful. Since writing that article I added Zstd and LZ4 compressors in numcodecs so those may also be worth trying in addition to Blosc with various configurations. (Blosc breaks up each chunk into blocks which enables multithreaded compression/decompression but can also reduce compression ratio over the same compressor library used without Blosc. I.e., Blosc(cname='zstd', clevel=1) will behave differently from Zstd(level=1) even though the same underlying compression library (Zstandard) is being used.)
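A minimal sketch of the two configurations being contrasted (the data here is synthetic, so the exact sizes are only indicative):

```python
from numcodecs import Blosc, Zstd

# The same underlying library (Zstandard) configured two ways: Blosc splits
# each chunk into blocks (enabling multithreading, sometimes at some cost
# in ratio), while Zstd compresses the chunk as a single buffer.
blosc_zstd = Blosc(cname="zstd", clevel=1)
raw_zstd = Zstd(level=1)

data = bytes(range(256)) * 4096
print(len(blosc_zstd.encode(data)), len(raw_zstd.encode(data)))
```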

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
348839453 https://github.com/pydata/xarray/pull/1528#issuecomment-348839453 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0ODgzOTQ1Mw== alimanfoo 703554 2017-12-04T01:40:57Z 2017-12-04T01:40:57Z CONTRIBUTOR

I know you're not including string support in this PR, but for interest, there are a couple of changes coming into zarr via https://github.com/alimanfoo/zarr/pull/212 that may be relevant in future.

It should now be impossible to generate a segfault via a badly configured object array. It is also now much harder to badly configure an object array. When creating an object array, an object codec should be provided via the object_codec parameter. There are now three codecs in numcodecs that can be used for variable length text strings: MsgPack, Pickle and JSON (new). Examples notebook here. In that notebook I also ran some simple benchmarks and MsgPack comes out well, but JSON isn't too shabby either.
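For illustration, a sketch of the object_codec parameter described above (assuming zarr >= 2.2 with numcodecs; the values are made up):

```python
import numpy as np
import zarr
from numcodecs import MsgPack

# Variable-length strings stored with an explicit object codec.
texts = np.array(["foo", "bar", "baz quux"], dtype=object)
z = zarr.array(texts, dtype=object, object_codec=MsgPack())
print(z[:])  # ['foo' 'bar' 'baz quux']
```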

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
348183062 https://github.com/pydata/xarray/pull/1087#issuecomment-348183062 https://api.github.com/repos/pydata/xarray/issues/1087 MDEyOklzc3VlQ29tbWVudDM0ODE4MzA2Mg== alimanfoo 703554 2017-11-30T13:07:53Z 2017-11-30T13:07:53Z CONTRIBUTOR

FWIW for the filters, if it would be possible to use the numcodecs Codec API http://numcodecs.readthedocs.io/en/latest/abc.html then that could be beneficial beyond xarray, as any work you put into developing filters could then be used elsewhere (e.g., in zarr).

On Thu, Nov 30, 2017 at 12:05 PM, Stephan Hoyer notifications@github.com wrote:

OK, I'm going to try to reboot this and finish it up in the form of an API that we'll be happy with going forward. I just discovered two more xarray backends over the past two days (in Unidata's Siphon and something @alexamici https://github.com/alexamici and colleagues are writing to read GRIB files), so clearly the demand is here.

One additional change I'd like to make is try to rewrite the encoding/decoding functions for variables into a series of invertible coding filters that can potentially be chained together in a flexible way (this is somewhat inspired by zarr). This will allow different backends to mix/match filters as necessary, depending on their particular needs. I'll start on that in another PR.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: New DataStore / Encoder / Decoder API for review 187625917
347385269 https://github.com/pydata/xarray/pull/1528#issuecomment-347385269 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NzM4NTI2OQ== alimanfoo 703554 2017-11-28T01:36:29Z 2017-11-28T01:49:24Z CONTRIBUTOR

FWIW I think the best option at the moment is to make sure you add either a Pickle or MsgPack filter for any zarr array with an object dtype.

BTW I was thinking that zarr should automatically add one of these filters any time someone creates an array with an object dtype, to avoid them hitting the pointer issue. If you have any thoughts on the best solution, drop them here: https://github.com/alimanfoo/zarr/issues/208

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347381734 https://github.com/pydata/xarray/pull/1528#issuecomment-347381734 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NzM4MTczNA== alimanfoo 703554 2017-11-28T01:16:07Z 2017-11-28T01:16:07Z CONTRIBUTOR

When still in the original interpreter session, all the objects still exist in memory, so all the pointers stored in the array are still valid. Restart the session and the objects are gone and the pointers are invalid.

On Tue, Nov 28, 2017 at 1:14 AM, Alistair Miles alimanfoo@googlemail.com wrote:

Try exiting and restarting the interpreter, then running:

zgs = zarr.open_group(store='zarr_directory')
zgs.x[:]

On Tue, Nov 28, 2017 at 1:10 AM, Ryan Abernathey notifications@github.com wrote:

zarr needs a filter that can encode and pack the strings into a single buffer, except in the special case where the data are being stored in-memory

@alimanfoo https://github.com/alimanfoo: the following also seems to work with a directory store

values = np.array([b'ab', b'cdef', np.nan], dtype=object)
zgs = zarr.open_group(store='zarr_directory')
zgs.create('x', shape=values.shape, dtype=values.dtype)
zgs.x[:] = values

This seems to contradict your statement above. What am I missing?


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347381500 https://github.com/pydata/xarray/pull/1528#issuecomment-347381500 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NzM4MTUwMA== alimanfoo 703554 2017-11-28T01:14:42Z 2017-11-28T01:14:42Z CONTRIBUTOR

Try exiting and restarting the interpreter, then running:

zgs = zarr.open_group(store='zarr_directory')
zgs.x[:]

On Tue, Nov 28, 2017 at 1:10 AM, Ryan Abernathey notifications@github.com wrote:

zarr needs a filter that can encode and pack the strings into a single buffer, except in the special case where the data are being stored in-memory

@alimanfoo https://github.com/alimanfoo: the following also seems to work with a directory store

values = np.array([b'ab', b'cdef', np.nan], dtype=object)
zgs = zarr.open_group(store='zarr_directory')
zgs.create('x', shape=values.shape, dtype=values.dtype)
zgs.x[:] = values

This seems to contradict your statement above. What am I missing?


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347363503 https://github.com/pydata/xarray/pull/1528#issuecomment-347363503 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NzM2MzUwMw== alimanfoo 703554 2017-11-27T23:27:41Z 2017-11-27T23:27:41Z CONTRIBUTOR

For variable length strings (or any array with an object dtype) zarr needs a filter that can encode and pack the strings into a single buffer, except in the special case where the data are being stored in-memory (as in your first example). The filter has to be specified manually, some examples here: http://zarr.readthedocs.io/en/master/tutorial.html#string-arrays. There are two codecs currently in numcodecs that can do this, one is Pickle, the other is MsgPack. I haven't done any benchmarking of data size or encoding speed, but MsgPack may be preferable because it's more portable.

There was some discussion a while back about creating a codec that handles variable-length strings by encoding via UTF8 then concatenating encoded bytes and lengths or offsets, IIRC similar to Arrow, and maybe even creating a special "text" dtype that inserts this filter automatically so you don't have to add it manually. But there hasn't been a strong motivation so far.
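To illustrate what such a filter does, a minimal sketch using the MsgPack codec directly (values made up):

```python
import numpy as np
from numcodecs import MsgPack

# Pack an object array of strings into a single encoded buffer, then
# recover it -- the encode/pack step described above.
codec = MsgPack()
values = np.array(["ab", "cdef", "ghij"], dtype=object)
buf = codec.encode(values)  # one contiguous buffer
print(codec.decode(buf))    # ['ab' 'cdef' 'ghij']
```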

On Mon, Nov 27, 2017 at 10:32 PM, Stephan Hoyer notifications@github.com wrote:

Overall, I find the conventions module to be a bit unwieldy. There is a lot of stuff in there, not all of which is related to CF conventions. It would be useful to separate the actual conventions from the encoding / decoding needed for different backends.

Agreed!

I wonder why zarr doesn't have a UTF-8 variable length string type ( alimanfoo/zarr#206 https://github.com/alimanfoo/zarr/issues/206) -- that would feel like the obvious first choice for encoding this data.

That said, xarray should be able to use fixed-length bytes just fine, doing UTF-8 encoding/decoding on the fly.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345619509 https://github.com/pydata/xarray/pull/1528#issuecomment-345619509 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTYxOTUwOQ== alimanfoo 703554 2017-11-20T08:07:44Z 2017-11-20T08:07:44Z CONTRIBUTOR

Fantastic!

On Monday, November 20, 2017, Matthew Rocklin notifications@github.com wrote:

That is, indeed, quite exciting. Also exciting is that I was able to look at and compute on your data easily.

In [1]: import zarr

In [2]: import gcsfs

In [3]: fs = gcsfs.GCSFileSystem(project='pangeo-181919')

In [4]: gcsmap = gcsfs.mapping.GCSMap('zarr_store_test', gcs=fs, check=True, create=False)

In [5]: import xarray as xr

In [6]: ds_gcs = xr.open_zarr(gcsmap, mode='r')

In [7]: ds_gcs
Out[7]:
<xarray.Dataset>
Dimensions:  (x: 200, y: 100)
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
    bar      (x) float64 dask.array<shape=(200,), chunksize=(40,)>
    foo      (y, x) float32 dask.array<shape=(100, 200), chunksize=(50, 40)>
Attributes:
    array_atr:  [1, 2]
    some_attr:  copana

In [8]: ds_gcs.sum()
Out[8]:
<xarray.Dataset>
Dimensions:  ()
Data variables:
    bar      float64 dask.array<shape=(), chunksize=()>
    foo      float32 dask.array<shape=(), chunksize=()>

In [9]: ds_gcs.sum().compute()
Out[9]:
<xarray.Dataset>
Dimensions:  ()
Data variables:
    bar      float64 0.0
    foo      float32 20000.0


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345080945 https://github.com/pydata/xarray/pull/1528#issuecomment-345080945 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTA4MDk0NQ== alimanfoo 703554 2017-11-16T22:18:04Z 2017-11-16T22:18:04Z CONTRIBUTOR

Re different zarr storage backends, the main options are a plain dict, DirectoryStore, ZipStore, and a new DBMStore class just merged, which enables storage in any DBM-style database (e.g., Berkeley DB). ZipStore has some constraints because of how zip files work: you can't really replace an entry in a zip file, which means anything that writes the same array chunk more than once will generate warnings. Dask's S3Map should also work; I haven't tried it, and it's obviously not ideal for unit tests, but I'd be interested if you get any experience with it.

Re different combinations of zarr and dask chunks, it can be thread safe even if chunks are not aligned; you just need to pass a synchronizer when instantiating the array or group. Zarr has a ThreadSynchronizer class which can be used for thread-based parallelism. If a synchronizer is provided, it is used to lock each chunk individually during write operations. More info here.

Re fill values, zarr has a native concept of a fill value for each array, with the fill value stored as part of the array metadata. Array metadata are stored as JSON, and I recently merged a fix so that bytes fill values can be used (via base64 encoding). I believe the netcdf way is to store the fill value separately as the value of a "_FillValue" attribute? You could do this with zarr, but user attributes are also JSON, so you would need to do your own encoding/decoding. But if possible I'd suggest using the native zarr fill_value support, as it handles bytes fill value encoding and also checks to ensure fill values are valid wrt the array dtype.
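Pulling those three points together, a hedged sketch (the store path, shapes and fill value are illustrative):

```python
import zarr

# A directory store, a thread synchronizer so unaligned writes lock each
# chunk individually, and a native fill_value kept in the array metadata.
store = zarr.DirectoryStore("example_sync.zarr")
z = zarr.open_array(store, mode="w", shape=(100, 100), chunks=(10, 10),
                    dtype="f8", fill_value=-9999.0,
                    synchronizer=zarr.ThreadSynchronizer())
print(z.fill_value)  # -9999.0
```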

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
339897936 https://github.com/pydata/xarray/pull/1528#issuecomment-339897936 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzOTg5NzkzNg== alimanfoo 703554 2017-10-27T07:42:34Z 2017-10-27T07:42:34Z CONTRIBUTOR

Suggest testing against GitHub master; there are a few other issues I'd like to work through before the next release.

On Thu, 26 Oct 2017 at 23:07, Ryan Abernathey notifications@github.com wrote:

Fantastic! Are you planning a release any time soon? If not we can set up to test against the github master.


On Oct 26, 2017, at 5:04 PM, Alistair Miles notifications@github.com wrote:

Just to say, support for 0d arrays, and for arrays with one or more zero-length dimensions, is in zarr master.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
339800443 https://github.com/pydata/xarray/pull/1528#issuecomment-339800443 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzOTgwMDQ0Mw== alimanfoo 703554 2017-10-26T21:04:17Z 2017-10-26T21:04:17Z CONTRIBUTOR

Just to say, support for 0d arrays, and for arrays with one or more zero-length dimensions, is in zarr master.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
338786761 https://github.com/pydata/xarray/issues/1650#issuecomment-338786761 https://api.github.com/repos/pydata/xarray/issues/1650 MDEyOklzc3VlQ29tbWVudDMzODc4Njc2MQ== alimanfoo 703554 2017-10-23T20:29:41Z 2017-10-23T20:29:41Z CONTRIBUTOR

Index API sounds good.

Also I was just looking at dask.dataframe indexing, where .loc is implemented using information about index values at the boundaries of each partition (chunk). Not sure whether xarray should use the same strategy for chunked datasets, but it is another approach to avoid loading indexes into memory.
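A hypothetical sketch of that strategy (the division values are made up; this is not dask or xarray API):

```python
import bisect

# Keep only the index values at partition boundaries in memory, then use
# binary search to decide which chunk to load for a .loc-style lookup.
divisions = [0, 1_000_000, 2_000_000, 3_000_000]  # one value per chunk edge

def partition_for(value):
    # Index of the chunk whose [divisions[i], divisions[i + 1]) range
    # contains value.
    return bisect.bisect_right(divisions, value) - 1

print(partition_for(1_500_000))  # 1 -> only the second chunk is loaded
```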

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Low memory/out-of-core index? 267628781
338687376 https://github.com/pydata/xarray/issues/1650#issuecomment-338687376 https://api.github.com/repos/pydata/xarray/issues/1650 MDEyOklzc3VlQ29tbWVudDMzODY4NzM3Ng== alimanfoo 703554 2017-10-23T14:58:59Z 2017-10-23T14:58:59Z CONTRIBUTOR

It looks like #1017 is about having no index at all. I want indexes, but I want to avoid loading all coordinate values into memory.

On Mon, Oct 23, 2017 at 1:47 PM, Fabien Maussion notifications@github.com wrote:

Has anyone considered implementing an index for monotonic data that does not require loading all values into main memory?

But this is already the case? #1017 https://github.com/pydata/xarray/pull/1017

With on file datasets I think it is sufficient to drop_variables when opening the dataset in order not to parse the coordinates:

ds = xr.open_dataset(f, drop_variables=['lon', 'lat'])


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Low memory/out-of-core index? 267628781
338627454 https://github.com/pydata/xarray/issues/1650#issuecomment-338627454 https://api.github.com/repos/pydata/xarray/issues/1650 MDEyOklzc3VlQ29tbWVudDMzODYyNzQ1NA== alimanfoo 703554 2017-10-23T11:19:30Z 2017-10-23T11:19:30Z CONTRIBUTOR

Just to add a further thought, which is that the upper levels of the binary search tree could be cached to get faster performance for repeated searches.
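For illustration, a hypothetical sketch of that caching idea (block size and data are made up):

```python
import bisect
import numpy as np

# Cache a coarse in-memory sample of a large sorted on-disk coordinate
# (the "upper levels" of the search tree); each query then touches only
# one block of the full array.
coord = np.arange(0, 10_000_000, 3)  # stands in for an on-disk sorted array
block = 100_000
sample = coord[::block]              # the cached upper levels

def searchsorted_out_of_core(value):
    i = max(bisect.bisect_right(sample, value) - 1, 0)  # cheap, in-memory
    lo, hi = i * block, min((i + 1) * block, len(coord))
    return lo + np.searchsorted(coord[lo:hi], value)    # one block read

print(searchsorted_out_of_core(12_345))  # 4115
```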

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Low memory/out-of-core index? 267628781
338622746 https://github.com/pydata/xarray/issues/1603#issuecomment-338622746 https://api.github.com/repos/pydata/xarray/issues/1603 MDEyOklzc3VlQ29tbWVudDMzODYyMjc0Ng== alimanfoo 703554 2017-10-23T10:56:40Z 2017-10-23T10:56:40Z CONTRIBUTOR

Just to say I'm interested in how MultiIndexes are handled also. In our use case, we have two variables conventionally named CHROM (chromosome) and POS (position) which together describe a location in a genome. I want to combine both variables into a multi-index so I can, e.g., select all data from some data variable for chromosome X between positions 100,000-200,000. For all our data variables, this genome location multi-index would be used to index the first dimension.
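For illustration, a sketch of that use case assuming xarray's MultiIndex support via set_index (variable names follow the comment; the data is made up):

```python
import numpy as np
import xarray as xr

# CHROM + POS combined into a MultiIndex on the first dimension.
ds = xr.Dataset(
    {"DP": ("variants", np.array([10, 20, 30, 40]))},
    coords={
        "CHROM": ("variants", np.array(["X", "X", "X", "2R"])),
        "POS": ("variants", np.array([50_000, 150_000, 180_000, 120_000])),
    },
)
ds = ds.set_index(variants=["CHROM", "POS"])

# Scalar selection on the CHROM level, then a window on POS.
chrom_x = ds.sel(CHROM="X")
region = chrom_x.where(
    (chrom_x.POS >= 100_000) & (chrom_x.POS <= 200_000), drop=True
)
print(region["DP"].values)  # positions 150,000 and 180,000 -> DP 20, 30
```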

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  Explicit indexes in xarray's data-model (Future of MultiIndex) 262642978
338459385 https://github.com/pydata/xarray/issues/66#issuecomment-338459385 https://api.github.com/repos/pydata/xarray/issues/66 MDEyOklzc3VlQ29tbWVudDMzODQ1OTM4NQ== alimanfoo 703554 2017-10-22T08:02:29Z 2017-10-22T08:02:29Z CONTRIBUTOR

Just to say thanks for the work on this, I've been looking at the h5netcdf code recently to understand better how dimensions are plumbed in netcdf4. I'm exploring refactoring all my data model classes in scikit-allel to build on xarray; I think the time is right, especially if xarray gets a Zarr backend too.

On Sun, 22 Oct 2017 at 02:01, Stephan Hoyer notifications@github.com wrote:

Closed #66 https://github.com/pydata/xarray/issues/66.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  HDF5 backend for xray 29453809
335186616 https://github.com/pydata/xarray/pull/1528#issuecomment-335186616 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzNTE4NjYxNg== alimanfoo 703554 2017-10-09T15:07:29Z 2017-10-09T17:23:21Z CONTRIBUTOR

I'm on paternity leave for the next 2 weeks, then will be catching up for a couple of weeks I expect. May be able to merge straightforward PRs but will have limited bandwidth.

{
    "total_count": 3,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 3,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
335030993 https://github.com/pydata/xarray/pull/1528#issuecomment-335030993 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzNTAzMDk5Mw== alimanfoo 703554 2017-10-08T19:17:27Z 2017-10-08T23:37:47Z CONTRIBUTOR

FWIW I think some JSON encoders for attributes would ultimately be a useful addition to zarr, but I won't be able to put any effort into zarr in the next month, so workarounds in xarray sound like a good idea for now.
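
A sketch of the kind of workaround meant here (the helper name is hypothetical): coerce numpy values to JSON-safe Python types before writing them into zarr attributes, since zarr serialises attributes as JSON:

```python
import numpy as np

def json_safe(value):
    """Recursively convert numpy values to plain Python equivalents."""
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, np.generic):   # numpy scalars, e.g. np.int64
        return value.item()
    if isinstance(value, (list, tuple)):
        return [json_safe(v) for v in value]
    if isinstance(value, dict):
        return {k: json_safe(v) for k, v in value.items()}
    return value
```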

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325813339 https://github.com/pydata/xarray/pull/1528#issuecomment-325813339 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTgxMzMzOQ== alimanfoo 703554 2017-08-29T21:43:48Z 2017-08-29T21:43:48Z CONTRIBUTOR

On Tuesday, August 29, 2017, Ryan Abernathey notifications@github.com wrote:

@alimanfoo https://github.com/alimanfoo: when do you anticipate the 2.2 zarr release to happen? Will the API change significantly? If so, I will wait for that to move forward here.

Zarr 2.2 will hopefully happen some time in the next 2 months, but it will be fully backwards-compatible, no breaking API changes.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325729013 https://github.com/pydata/xarray/pull/1528#issuecomment-325729013 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTcyOTAxMw== alimanfoo 703554 2017-08-29T17:02:41Z 2017-08-29T17:02:41Z CONTRIBUTOR

FWIW all filter (codec) classes have been migrated from zarr to a separate package called numcodecs and will be imported from there in the next (2.2) zarr release. Here is FixedScaleOffset. The implementation is basic numpy, so there is probably some room for optimization.
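
For illustration, here is how FixedScaleOffset can be used from numcodecs (the parameter values are arbitrary): floats near 1000 are stored as single bytes with two decimal places of precision:

```python
import numpy as np
from numcodecs import FixedScaleOffset

codec = FixedScaleOffset(offset=1000, scale=100, dtype='f8', astype='u1')
x = np.linspace(1000, 1001, 11)
enc = codec.encode(x)    # round((x - 1000) * 100) stored as uint8
dec = codec.decode(enc)  # back to float64, quantised to 0.01
```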

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325727280 https://github.com/pydata/xarray/pull/1528#issuecomment-325727280 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTcyNzI4MA== alimanfoo 703554 2017-08-29T16:56:55Z 2017-08-29T16:56:55Z CONTRIBUTOR

Following this with interest.

Regarding autoclose, just to confirm that zarr doesn't really have any notion of whether something is open or closed. When using the DirectoryStore storage class (the most common use case, I imagine), all files are automatically closed and nothing is kept open. There are some storage classes (e.g., ZipStore) that do require an explicit close call to finalise the file on disk if you have been writing data, but I think you can ignore this in xarray and leave it up to the user to manage themselves.
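
A short illustration of that distinction, assuming the zarr 2.x storage API:

```python
import numpy as np
import zarr

# DirectoryStore: files are opened and closed per operation,
# so there is nothing to "close".
store = zarr.DirectoryStore('example.zarr')
root = zarr.group(store=store, overwrite=True)
root.create_dataset('x', data=np.arange(10), chunks=(5,))

# ZipStore: the zip file must be finalised with an explicit close
# (or by using the store as a context manager) after writing.
zstore = zarr.ZipStore('example.zip', mode='w')
root = zarr.group(store=zstore)
root.create_dataset('x', data=np.arange(10), chunks=(5,))
zstore.close()
```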

Out of interest, @shoyer do you still think there would be value in writing a wrapper for zarr analogous to h5netcdf? Or does this PR provide all the necessary functionality?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
282031922 https://github.com/pydata/xarray/issues/1223#issuecomment-282031922 https://api.github.com/repos/pydata/xarray/issues/1223 MDEyOklzc3VlQ29tbWVudDI4MjAzMTkyMg== alimanfoo 703554 2017-02-23T15:55:38Z 2017-02-23T15:55:38Z CONTRIBUTOR

FWIW I think it would be better in xarray or a separate package, at least at the moment, just because I don't have a lot of time right now for OSS and need to keep Zarr as lean as possible.

On Thursday, February 23, 2017, Martin Durant notifications@github.com wrote:

@alimanfoo https://github.com/alimanfoo , do you think this work would make more sense as part of zarr rather than as part of xarray?


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  zarr as persistent store for xarray 202260275
281829618 https://github.com/pydata/xarray/issues/1223#issuecomment-281829618 https://api.github.com/repos/pydata/xarray/issues/1223 MDEyOklzc3VlQ29tbWVudDI4MTgyOTYxOA== alimanfoo 703554 2017-02-22T22:43:52Z 2017-02-22T22:43:52Z CONTRIBUTOR

Yep, that looks good. I was wondering about the xarray_to_zarr() function?

On Wednesday, February 22, 2017, Martin Durant notifications@github.com wrote:

@alimanfoo https://github.com/alimanfoo , in the new dataset save function, I do exactly [as you suggest](https://gist.github.com/martindurant/06a1e98c91f0033c4649a48a2f943390#file-zarr_xarr-py-L168), with everything getting put as a dict into the main zarr group attributes, with special attribute names "attrs" for the data-set root, "coords" for the set of coordinate objects and "variables" for the set of variables objects (all of these have their own attributes in xarray).


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  zarr as persistent store for xarray 202260275
281496902 https://github.com/pydata/xarray/issues/1223#issuecomment-281496902 https://api.github.com/repos/pydata/xarray/issues/1223 MDEyOklzc3VlQ29tbWVudDI4MTQ5NjkwMg== alimanfoo 703554 2017-02-21T22:05:39Z 2017-02-21T22:05:39Z CONTRIBUTOR

Just to say this is looking neat.

For storing an xarray.DataArray, do you think it would be possible to do away with pickling all the metadata and storing it in the .xarray resource? Specifically, I'm wondering whether it could all be stored as attributes on the Zarr array, with some conventions for special xarray attribute names. I'm guessing there must be conventions for storing all this metadata as attributes in an HDF5 (netCDF) file; it would potentially be nice to mirror those as much as possible.
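
A minimal sketch of the suggestion (the dimension attribute shown, _ARRAY_DIMENSIONS, is the name xarray's zarr backend eventually standardised on):

```python
import numpy as np
import zarr

root = zarr.group()  # in-memory group, purely for illustration
temp = root.create_dataset('temperature', data=np.zeros((4, 3)))

# Record dimension names and other metadata as plain zarr attributes,
# rather than pickling them into a separate resource.
temp.attrs['_ARRAY_DIMENSIONS'] = ['time', 'space']
temp.attrs['units'] = 'K'
```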

On Sat, Feb 11, 2017 at 10:56 PM, Martin Durant notifications@github.com wrote:

I have developed my example a little to sidestep the subclassing you suggest, which seemed tricky to implement.

Please see https://gist.github.com/martindurant/06a1e98c91f0033c4649a48a2f943390 (dataset_to/from_zarr functions)

I can use the zarr groups structure to mirror at least typical use of xarrays: variables, coordinates and sets of attributes on each. I have tested this with s3 too, stealing a little code from dask to show the idea.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  zarr as persistent store for xarray 202260275
274214755 https://github.com/pydata/xarray/issues/1223#issuecomment-274214755 https://api.github.com/repos/pydata/xarray/issues/1223 MDEyOklzc3VlQ29tbWVudDI3NDIxNDc1NQ== alimanfoo 703554 2017-01-21T00:24:27Z 2017-01-21T00:24:27Z CONTRIBUTOR

Happy to help if there's anything to do on the zarr side.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  zarr as persistent store for xarray 202260275
90813596 https://github.com/pydata/xarray/issues/66#issuecomment-90813596 https://api.github.com/repos/pydata/xarray/issues/66 MDEyOklzc3VlQ29tbWVudDkwODEzNTk2 alimanfoo 703554 2015-04-08T06:04:53Z 2015-04-08T06:04:53Z CONTRIBUTOR

Thanks Stephan, I'll take a look.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  HDF5 backend for xray 29453809
43385302 https://github.com/pydata/xarray/pull/127#issuecomment-43385302 https://api.github.com/repos/pydata/xarray/issues/127 MDEyOklzc3VlQ29tbWVudDQzMzg1MzAy alimanfoo 703554 2014-05-16T22:16:01Z 2014-05-16T22:16:01Z CONTRIBUTOR

No worries, glad to contribute.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  initial implementation of support for NetCDF groups 33396232
43059199 https://github.com/pydata/xarray/pull/127#issuecomment-43059199 https://api.github.com/repos/pydata/xarray/issues/127 MDEyOklzc3VlQ29tbWVudDQzMDU5MTk5 alimanfoo 703554 2014-05-14T09:20:01Z 2014-05-14T09:20:01Z CONTRIBUTOR

I've added a test to check for an error when a group is not found. I also changed the implementation of the group access function to avoid recursion; it seemed simpler.
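
A sketch of the non-recursive lookup, assuming the netCDF4-python API (the function name is hypothetical): walk the path one component at a time via the .groups mapping:

```python
def _nc4_group(ds, group):
    """Look up a nested group from a path such as '/foo/bar'."""
    if group in (None, '', '/'):
        return ds
    current = ds
    for part in group.strip('/').split('/'):
        # netCDF4 Dataset/Group objects expose subgroups via .groups
        try:
            current = current.groups[part]
        except KeyError:
            raise IOError('group not found: %s' % group)
    return current
```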

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  initial implementation of support for NetCDF groups 33396232
43024743 https://github.com/pydata/xarray/pull/127#issuecomment-43024743 https://api.github.com/repos/pydata/xarray/issues/127 MDEyOklzc3VlQ29tbWVudDQzMDI0NzQz alimanfoo 703554 2014-05-13T23:11:07Z 2014-05-13T23:11:07Z CONTRIBUTOR

Thanks for the comments, all makes good sense.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  initial implementation of support for NetCDF groups 33396232
42869488 https://github.com/pydata/xarray/issues/66#issuecomment-42869488 https://api.github.com/repos/pydata/xarray/issues/66 MDEyOklzc3VlQ29tbWVudDQyODY5NDg4 alimanfoo 703554 2014-05-12T18:29:57Z 2014-05-12T18:29:57Z CONTRIBUTOR

One other detail: I have an HDF5 group for each conceptual dataset, but then variables may be organised into subgroups. It would be nice if this could be accommodated, e.g., when opening an HDF5 group as an xray dataset, assume the dataset contains all variables in the group, with any subgroups searched recursively. Again, apologies, I don't know whether this is allowed in NetCDF4; I will do the research.
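
A sketch of that recursive search using h5py's visititems, which walks a group and all its subgroups (the function name is hypothetical):

```python
import h5py

def collect_variables(group):
    """Map variable name -> dataset for everything under `group`,
    including datasets nested in subgroups."""
    variables = {}

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            variables[name] = obj   # name includes any subgroup path

    group.visititems(visit)
    return variables
```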

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  HDF5 backend for xray 29453809
42840763 https://github.com/pydata/xarray/issues/66#issuecomment-42840763 https://api.github.com/repos/pydata/xarray/issues/66 MDEyOklzc3VlQ29tbWVudDQyODQwNzYz alimanfoo 703554 2014-05-12T14:45:57Z 2014-05-12T14:45:57Z CONTRIBUTOR

Thanks @akleeman for the info, much appreciated.

A couple of other points I thought maybe worth mentioning if you're considering wrapping h5py.

First, I've been using lzf as the compression filter in my HDF5 files. I believe h5py bundles the source for lzf. I don't know whether lzf would be supported when accessing the files through the Python netCDF API.

Second, I have a situation where I have multiple datasets, each of which is stored in a separate group, and each of which has two dimensions (genome position and biological sample). The genome position scale is different for each dataset (there's one dataset per chromosome); however, the biological sample scale is actually common to all of the datasets. So at the moment I have a variable in the root group with the "samples" dimension scale, and each dataset group has its own "position" dimension scale. You can represent all this with HDF5 dimension scales, but I've no idea whether this is accommodated by NetCDF4 or could fit into the xray model. I could work around it by copying the samples variable into each dataset, but I thought I'd mention this pattern as something to be aware of.
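
A sketch of that layout using h5py's dimension scales API (make_scale requires a reasonably recent h5py; names and sizes are illustrative). The shared samples scale lives in the root group and is attached to every per-chromosome dataset, alongside each group's own position scale:

```python
import numpy as np
import h5py

with h5py.File('calls.h5', 'w') as f:
    samples = f.create_dataset('samples', data=np.array([b's1', b's2', b's3']))
    samples.make_scale('samples')

    for chrom, n_pos in [('2L', 100), ('X', 80)]:
        grp = f.create_group(chrom)
        pos = grp.create_dataset('position', data=np.arange(n_pos))
        pos.make_scale('position')

        gt = grp.create_dataset('genotype', shape=(n_pos, 3), dtype='i1')
        gt.dims[0].attach_scale(pos)
        gt.dims[1].attach_scale(samples)  # shared across all groups
```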

On Mon, May 12, 2014 at 3:04 PM, akleeman notifications@github.com wrote:

@alimanfoo https://github.com/alimanfoo

Glad you're enjoying xray!

From your description it sounds like it should be relatively simple for you to get xray working with your dataset. NetCDF4 is a subset of HDF5, and simply adding dimension scales should get you most of the way there.

Re: groups, each xray.Dataset corresponds to one HDF5 group. So while xray doesn't currently support groups, you could split your HDF5 dataset into separate files for each group and load those files using xray. Alternatively (if you feel ambitious) it shouldn't be too hard to get xray's NetCDF4DataStore (backends.netCDF4_.py) to work with groups, allowing you to do something like:

dataset = xray.open_dataset('multiple_groups.h5', group='/one_group')

This http://netcdf4-python.googlecode.com/svn/trunk/docs/netCDF4-module.html gives some good examples of how groups work within netCDF4.

Also, as @shoyer https://github.com/shoyer mentioned, it might make sense to modify xray so that NetCDF4 support is obtained by wrapping h5py instead of netCDF4 which might make your life even easier.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  HDF5 backend for xray 29453809
42805550 https://github.com/pydata/xarray/issues/66#issuecomment-42805550 https://api.github.com/repos/pydata/xarray/issues/66 MDEyOklzc3VlQ29tbWVudDQyODA1NTUw alimanfoo 703554 2014-05-12T08:08:37Z 2014-05-12T08:08:37Z CONTRIBUTOR

I'm really enjoying working with xray; it's so nice to be able to think of my axes as named and labeled dimensions, no more remembering which axis is which!

I'm not sure if this is relevant to this specific issue, but I am working for the most part with HDF5 files created using h5py. I'm only just learning about NetCDF-4, but I have datasets that comprise a number of 1D and 2D variables with shared dimensions, so I think my data is already very close to the right model. I have a couple of questions:

(1) If I have multiple datasets within an HDF5 file, each within a separate group, can I access those through xray?

(2) What would I need to add to my HDF5 to make it fully compliant with the xray/NetCDF4 model? Is it just a question of creating and attaching dimension scales or would I need to do something else as well?

Thanks in advance.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  HDF5 backend for xray 29453809

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);