issue_comments


30 rows where author_association = "CONTRIBUTOR" and issue = 253136694 sorted by updated_at descending


user 2

  • alimanfoo 16
  • martindurant 14

issue 1

  • WIP: Zarr backend · 30 ✖

author_association 1

  • CONTRIBUTOR · 30 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions performed_via_github_app issue
365412033 https://github.com/pydata/xarray/pull/1528#issuecomment-365412033 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM2NTQxMjAzMw== martindurant 6042212 2018-02-13T21:35:03Z 2018-02-13T21:35:03Z CONTRIBUTOR

Yeah, ideally when adding a variable like

    ds['myvar'] = xr.DataArray(data=da.zeros(..., chunks=(..)), dims=['l', 'b', 'v'])
    ds.to_zarr(mapping)

we should be able to apply an optimization strategy in which the zarr array is created without filling in all those unnecessary zeros. This seems doable.

On the other hand, implementing

    ds.myvar[slice, slice, slice] = some data
    ds.to_zarr(mapping)

(which cannot be done currently with dask arrays at all), in such a way that only partitions with data get updated - this seems really hard.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
364817111 https://github.com/pydata/xarray/pull/1528#issuecomment-364817111 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM2NDgxNzExMQ== martindurant 6042212 2018-02-12T02:43:43Z 2018-02-12T03:47:48Z CONTRIBUTOR

OK, so the way to do this in pure zarr appears to be to simply create the appropriate zarr array and set its dimensions attribute:

    ds = xr.Dataset(coords={'b': np.arange(-4, 6, 0.005),
                            'l': np.arange(150, 72, -0.005),
                            'v': np.arange(58722.24288, -164706.4225401, -8.2446e2)})
    ds.to_zarr(mapping)
    g = zarr.open_group(mapping)
    arr = g.zeros(..., shape like l, b, v)  # pseudocode: zero-filled array shaped like (l, b, v)
    arr.attrs['_ARRAY_DIMENSIONS'] = ['l', 'b', 'v']

xr.open_zarr(mapping) now shows the new array, without having to materialize any data into it, and arr can be written to piecemeal - without the convenience of the coordinate mapping, of course.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
364804697 https://github.com/pydata/xarray/pull/1528#issuecomment-364804697 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM2NDgwNDY5Nw== martindurant 6042212 2018-02-12T00:19:55Z 2018-02-12T00:19:55Z CONTRIBUTOR

It might be enough, in this case, to provide some helper function in zarr to create and fetch arrays that will show up as variables in xarray - this need not be specific to being used via dask. I am assuming with the work done in this PR, that there is an unambiguous way to determine if a zarr group can be interpreted as an xarray dataset, and that zarr then knows how to add things that look like variables (which generally in the zarr case don't involve writing any actual data until the parts of the array are filled in).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
364803984 https://github.com/pydata/xarray/pull/1528#issuecomment-364803984 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM2NDgwMzk4NA== martindurant 6042212 2018-02-12T00:12:36Z 2018-02-12T00:12:36Z CONTRIBUTOR

@jhamman, that partially solves what I mean. I can probably turn my data into dask arrays with some difficulty, but really I was hoping for something like the following:

    ds = xr.Dataset(coords={'b': np.arange(-4, 6, 0.005),
                            'l': np.arange(150, 72, -0.005),
                            'v': np.arange(58722.24288, -164706.4225401, -8.2446e2)})
    arr = ds.create_new_zero_array(dims=['l', 'b', 'v'])  # wished-for API
    arr[0:10, :, :] = 1

and expect to be able to set the values of the new variable in the same way that you can with the equivalent zarr array. I can probably get around this by setting the values with da.zeros, finding the zarr array in the dataset, and then setting its values.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
364801073 https://github.com/pydata/xarray/pull/1528#issuecomment-364801073 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM2NDgwMTA3Mw== martindurant 6042212 2018-02-11T23:35:34Z 2018-02-11T23:35:34Z CONTRIBUTOR

Question: how would one build a zarr-xarray dataset?

With zarr you can open an array that contains no data, and use set-slice notation to fill in the values (which is what dask's store essentially does).

If I have some pre-known coordinates and bigger-than-memory data arrays, how would I go about getting the values into the zarr structure? If this can't be done directly with the xarray interface, is there a way to call zarr's open/create/zeros such that the corresponding array will appear as a variable when the same dataset is opened with xarray?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
350375750 https://github.com/pydata/xarray/pull/1528#issuecomment-350375750 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM1MDM3NTc1MA== alimanfoo 703554 2017-12-08T21:24:45Z 2017-12-08T22:27:47Z CONTRIBUTOR

Just to confirm, if writes are aligned with chunk boundaries in the destination array then no locking is required.
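
The alignment condition above can be checked mechanically. The following is a pure-Python sketch, not part of zarr or dask: the helper name `is_chunk_aligned` and the half-open `(start, stop)` convention are assumptions made for illustration.

```python
def is_chunk_aligned(selection, chunks, shape):
    """Return True if every (start, stop) range covers whole chunks only.

    selection: one half-open (start, stop) pair per dimension
    chunks:    chunk length per dimension of the destination array
    shape:     full shape of the destination array
    """
    for (start, stop), chunk, size in zip(selection, chunks, shape):
        if start % chunk != 0:
            return False
        # The stop must sit on a chunk boundary, unless it is the array edge
        # (the last chunk along a dimension may be partial).
        if stop % chunk != 0 and stop != size:
            return False
    return True


# A block that maps onto whole destination chunks: no lock needed.
aligned = is_chunk_aligned([(0, 50), (40, 80)], chunks=(50, 40), shape=(100, 200))
# A block that straddles a chunk boundary: two writers could touch one chunk.
straddling = is_chunk_aligned([(0, 50), (30, 70)], chunks=(50, 40), shape=(100, 200))
print(aligned, straddling)
```

When every writer's region passes this check and the regions do not overlap, no two writers ever touch the same chunk, which is why locking becomes unnecessary.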

Also, if you're going to be moving large datasets into cloud storage and doing distributed computing, it may be worth investigating compressors and compressor options, as a good compression ratio may make a big difference where network bandwidth is the limiting factor. I would suggest using the Blosc compressor with cname='zstd'. I would also suggest using shuffle: the Blosc codec in the latest numcodecs has an AUTOSHUFFLE option, so byte shuffle is used for arrays with >1 byte item size and bit shuffle is used for arrays with 1 byte item size. I would also experiment with compression level (clevel) to see how speed balances against compression ratio. E.g., Blosc(cname='zstd', clevel=5, shuffle=Blosc.AUTOSHUFFLE) may be a good starting point. The default compressor, Blosc(cname='lz4', ...), is more optimised for fast local storage, so speed is very good but compression ratio is moderate; this may not be best for distributed computing.
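
To see why byte shuffle tends to help on multi-byte numeric data, here is a small stdlib-only sketch. It uses zlib as a stand-in compressor and a hand-written shuffle, so it is not Blosc and the sizes it prints say nothing about Blosc or Zstandard performance; the function names are hypothetical.

```python
import struct
import zlib


def byte_shuffle(buf, itemsize):
    """Group byte 0 of every item, then byte 1 of every item, and so on."""
    return b"".join(buf[i::itemsize] for i in range(itemsize))


def byte_unshuffle(buf, itemsize):
    """Invert byte_shuffle."""
    n_items = len(buf) // itemsize
    out = bytearray(len(buf))
    for i in range(itemsize):
        out[i::itemsize] = buf[i * n_items:(i + 1) * n_items]
    return bytes(out)


# 10,000 slowly increasing 4-byte integers, a stand-in for smooth numeric data.
values = list(range(10_000))
raw = struct.pack("<%di" % len(values), *values)

plain = zlib.compress(raw, 5)
shuffled = zlib.compress(byte_shuffle(raw, 4), 5)
print("raw:", len(raw), "plain:", len(plain), "shuffled:", len(shuffled))
```

After shuffling, the high-order bytes (mostly constant) sit next to each other, so the compressor sees long runs instead of a varying byte every fourth position.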

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
350379064 https://github.com/pydata/xarray/pull/1528#issuecomment-350379064 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM1MDM3OTA2NA== alimanfoo 703554 2017-12-08T21:40:40Z 2017-12-08T22:27:35Z CONTRIBUTOR

Some examples of compressor benchmarking here may be useful http://alimanfoo.github.io/2016/09/21/genotype-compression-benchmark.html

The specific conclusions probably won't apply to your data but some of the code and ideas may be useful. Since writing that article I added Zstd and LZ4 compressors in numcodecs so those may also be worth trying in addition to Blosc with various configurations. (Blosc breaks up each chunk into blocks which enables multithreaded compression/decompression but can also reduce compression ratio over the same compressor library used without Blosc. I.e., Blosc(cname='zstd', clevel=1) will behave differently from Zstd(level=1) even though the same underlying compression library (Zstandard) is being used.)
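
The effect of splitting a chunk into independently compressed blocks can be illustrated with the standard library. This is only a sketch of the general trade-off, using zlib rather than Blosc or Zstandard; the block size and function names are assumptions chosen for the demonstration.

```python
import zlib


def compress_whole(buf, level=1):
    """Compress the chunk as a single stream."""
    return zlib.compress(buf, level)


def compress_in_blocks(buf, block_size, level=1):
    """Compress fixed-size blocks independently of one another."""
    return [
        zlib.compress(buf[i:i + block_size], level)
        for i in range(0, len(buf), block_size)
    ]


# A repetitive 64 KiB "chunk": a 256-byte pattern repeated 256 times.
chunk = bytes(range(256)) * 256

whole = compress_whole(chunk)
blocks = compress_in_blocks(chunk, block_size=1024)
blockwise_total = sum(len(b) for b in blocks)

print("one stream:", len(whole), "bytes")
print(len(blocks), "blocks:", blockwise_total, "bytes")

# Each block decompresses on its own, which is what allows parallel decoding.
roundtrip = b"".join(zlib.decompress(b) for b in blocks)
```

Each block has to rediscover the repeating pattern from scratch, so the blockwise total is larger than the single stream even though the compressor and level are identical.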

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
348839453 https://github.com/pydata/xarray/pull/1528#issuecomment-348839453 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0ODgzOTQ1Mw== alimanfoo 703554 2017-12-04T01:40:57Z 2017-12-04T01:40:57Z CONTRIBUTOR

I know you're not including string support in this PR, but for interest, there are a couple of changes coming into zarr via https://github.com/alimanfoo/zarr/pull/212 that may be relevant in future.

It should now be impossible to generate a segfault via a badly configured object array. It is also now much harder to badly configure an object array. When creating an object array, an object codec should be provided via the object_codec parameter. There are now three codecs in numcodecs that can be used for variable length text strings: MsgPack, Pickle and JSON (new). Examples notebook here. In that notebook I also ran some simple benchmarks and MsgPack comes out well, but JSON isn't too shabby either.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347385269 https://github.com/pydata/xarray/pull/1528#issuecomment-347385269 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NzM4NTI2OQ== alimanfoo 703554 2017-11-28T01:36:29Z 2017-11-28T01:49:24Z CONTRIBUTOR

FWIW I think the best option at the moment is to make sure you add either a Pickle or a MsgPack filter for any zarr array with an object dtype.

BTW I was thinking that zarr should automatically add one of these filters any time someone creates an array with an object dtype, to avoid them hitting the pointer issue. If you have any thoughts on best solution drop them here: https://github.com/alimanfoo/zarr/issues/208

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347381734 https://github.com/pydata/xarray/pull/1528#issuecomment-347381734 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NzM4MTczNA== alimanfoo 703554 2017-11-28T01:16:07Z 2017-11-28T01:16:07Z CONTRIBUTOR

When still in the original interpreter session, all the objects still exist in memory, so all the pointers stored in the array are still valid. Restart the session and the objects are gone and the pointers are invalid.

On Tue, Nov 28, 2017 at 1:14 AM, Alistair Miles alimanfoo@googlemail.com wrote:

Try exiting and restarting the interpreter, then running:

    zgs = zarr.open_group(store='zarr_directory')
    zgs.x[:]

On Tue, Nov 28, 2017 at 1:10 AM, Ryan Abernathey <notifications@github.com> wrote:

zarr needs a filter that can encode and pack the strings into a single buffer, except in the special case where the data are being stored in-memory

@alimanfoo https://github.com/alimanfoo: the following also seems to work with directory store

    values = np.array([b'ab', b'cdef', np.nan], dtype=object)
    zgs = zarr.open_group(store='zarr_directory')
    zgs.create('x', shape=values.shape, dtype=values.dtype)
    zgs.x[:] = values

This seems to contradict your statement above. What am I missing?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/pull/1528#issuecomment-347380750, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq8QnNQ7bI5GRyHsUUSQAgusymx8eJnks5s611rgaJpZM4PDrlp .

-- Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health http://cggh.org Big Data Institute Building Old Road Campus Roosevelt Drive Oxford OX3 7LF United Kingdom Phone: +44 (0)1865 743596 Email: alimanfoo@googlemail.com Web: http://alimanfoo.github.io/ Twitter: https://twitter.com/alimanfoo


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347381500 https://github.com/pydata/xarray/pull/1528#issuecomment-347381500 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NzM4MTUwMA== alimanfoo 703554 2017-11-28T01:14:42Z 2017-11-28T01:14:42Z CONTRIBUTOR

Try exiting and restarting the interpreter, then running:

    zgs = zarr.open_group(store='zarr_directory')
    zgs.x[:]

On Tue, Nov 28, 2017 at 1:10 AM, Ryan Abernathey notifications@github.com wrote:

zarr needs a filter that can encode and pack the strings into a single buffer, except in the special case where the data are being stored in-memory

@alimanfoo https://github.com/alimanfoo: the following also seems to work with directory store

    values = np.array([b'ab', b'cdef', np.nan], dtype=object)
    zgs = zarr.open_group(store='zarr_directory')
    zgs.create('x', shape=values.shape, dtype=values.dtype)
    zgs.x[:] = values

This seems to contradict your statement above. What am I missing?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/pull/1528#issuecomment-347380750, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq8QnNQ7bI5GRyHsUUSQAgusymx8eJnks5s611rgaJpZM4PDrlp .


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
347363503 https://github.com/pydata/xarray/pull/1528#issuecomment-347363503 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NzM2MzUwMw== alimanfoo 703554 2017-11-27T23:27:41Z 2017-11-27T23:27:41Z CONTRIBUTOR

For variable length strings (or any array with an object dtype) zarr needs a filter that can encode and pack the strings into a single buffer, except in the special case where the data are being stored in-memory (as in your first example). The filter has to be specified manually, some examples here: http://zarr.readthedocs.io/en/master/tutorial.html#string-arrays. There are two codecs currently in numcodecs that can do this, one is Pickle, the other is MsgPack. I haven't done any benchmarking of data size or encoding speed, but MsgPack may be preferable because it's more portable.

There was some discussion a while back about creating a codec that handles variable-length strings by encoding via UTF8 then concatenating encoded bytes and lengths or offsets, IIRC similar to Arrow, and maybe even creating a special "text" dtype that inserts this filter automatically so you don't have to add it manually. But there hasn't been a strong motivation so far.
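
As a rough illustration of the "encode via UTF-8, then concatenate bytes and offsets" idea, here is a stdlib-only sketch. The byte layout (item count, offset table, data) and the function names are invented for this example; it is not the format used by any numcodecs codec or by Arrow.

```python
import struct


def encode_strings(strings):
    """Pack variable-length strings into one buffer: count, offsets, UTF-8 bytes."""
    encoded = [s.encode("utf-8") for s in strings]
    offsets = [0]
    for item in encoded:
        offsets.append(offsets[-1] + len(item))
    header = struct.pack("<I", len(encoded))
    table = struct.pack("<%dI" % len(offsets), *offsets)
    return header + table + b"".join(encoded)


def decode_strings(buf):
    """Invert encode_strings."""
    (count,) = struct.unpack_from("<I", buf, 0)
    offsets = struct.unpack_from("<%dI" % (count + 1), buf, 4)
    data_start = 4 + 4 * (count + 1)
    return [
        buf[data_start + offsets[i]:data_start + offsets[i + 1]].decode("utf-8")
        for i in range(count)
    ]


values = ["ab", "cdef", "", "naïve"]
packed = encode_strings(values)
restored = decode_strings(packed)
print(len(packed), restored)
```

The point is that the packed buffer contains the string data itself rather than pointers to Python objects, so it remains valid after the interpreter that wrote it has exited.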

On Mon, Nov 27, 2017 at 10:32 PM, Stephan Hoyer notifications@github.com wrote:

Overall, I find the conventions module to be a bit unwieldy. There is a lot of stuff in there, not all of which is related to CF conventions. It would be useful to separate the actual conventions from the encoding / decoding needed for different backends.

Agreed!

I wonder why zarr doesn't have a UTF-8 variable length string type ( alimanfoo/zarr#206 https://github.com/alimanfoo/zarr/issues/206) -- that would feel like the obvious first choice for encoding this data.

That said, xarray should be able to use fixed-length bytes just fine, doing UTF-8 encoding/decoding on the fly.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/pull/1528#issuecomment-347351224, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq8QkLTQUuspLhiXYR2_WMW8Hg9LFziks5s6ziTgaJpZM4PDrlp .


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345770374 https://github.com/pydata/xarray/pull/1528#issuecomment-345770374 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTc3MDM3NA== martindurant 6042212 2017-11-20T17:37:01Z 2017-11-20T17:37:01Z CONTRIBUTOR

This is, of course, by design :) I imagine there is much that could be done to optimise performance, but for fewer, larger chunks, it should be pretty good.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345619509 https://github.com/pydata/xarray/pull/1528#issuecomment-345619509 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTYxOTUwOQ== alimanfoo 703554 2017-11-20T08:07:44Z 2017-11-20T08:07:44Z CONTRIBUTOR

Fantastic!

On Monday, November 20, 2017, Matthew Rocklin notifications@github.com wrote:

That is, indeed, quite exciting. Also exciting is that I was able to look at and compute on your data easily.

In [1]: import zarr

In [2]: import gcsfs

In [3]: fs = gcsfs.GCSFileSystem(project='pangeo-181919')

In [4]: gcsmap = gcsfs.mapping.GCSMap('zarr_store_test', gcs=fs, check=True, create=False)

In [5]: import xarray as xr

In [6]: ds_gcs = xr.open_zarr(gcsmap, mode='r')

In [7]: ds_gcs
Out[7]:
<xarray.Dataset>
Dimensions:  (x: 200, y: 100)
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
    bar      (x) float64 dask.array<shape=(200,), chunksize=(40,)>
    foo      (y, x) float32 dask.array<shape=(100, 200), chunksize=(50, 40)>
Attributes:
    array_atr:  [1, 2]
    some_attr:  copana

In [8]: ds_gcs.sum()
Out[8]:
<xarray.Dataset>
Dimensions:  ()
Data variables:
    bar      float64 dask.array<shape=(), chunksize=()>
    foo      float32 dask.array<shape=(), chunksize=()>

In [9]: ds_gcs.sum().compute()
Out[9]:
<xarray.Dataset>
Dimensions:  ()
Data variables:
    bar      float64 0.0
    foo      float32 20000.0

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/pull/1528#issuecomment-345575240, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq8Quu1UYM4BO3i_KzMkXGnN-g-TFczks5s4OO5gaJpZM4PDrlp .


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345104440 https://github.com/pydata/xarray/pull/1528#issuecomment-345104440 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTEwNDQ0MA== martindurant 6042212 2017-11-17T00:10:19Z 2017-11-17T00:10:19Z CONTRIBUTOR

hdfs3 also has a MutableMapping for HDFS. I did not succeed in getting one into azure-datalake-store, but it would not be hard to make. In this way, zarr can become a pretty general array cloud storage mechanism.
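
What makes this general is that a store only has to behave like a mapping from string keys to bytes. A minimal sketch of such a mapping, using only the standard library, might look like the following. The class `PrefixStore` is hypothetical, the plain dict stands in for a remote object store, and the key names are illustrative only.

```python
from collections.abc import MutableMapping


class PrefixStore(MutableMapping):
    """Expose the keys under `prefix` in a backing mapping as a store of bytes."""

    def __init__(self, backend, prefix):
        self._backend = backend          # any dict-like object, here a plain dict
        self._prefix = prefix.rstrip("/") + "/"

    def __getitem__(self, key):
        return self._backend[self._prefix + key]

    def __setitem__(self, key, value):
        self._backend[self._prefix + key] = bytes(value)

    def __delitem__(self, key):
        del self._backend[self._prefix + key]

    def __iter__(self):
        n = len(self._prefix)
        return (k[n:] for k in self._backend if k.startswith(self._prefix))

    def __len__(self):
        return sum(1 for _ in self)


bucket = {}                               # stand-in for a remote object store
store = PrefixStore(bucket, "zarr_store_test")
store["foo/.zarray"] = b"{}"              # a metadata document
store["foo/0.0"] = b"\x00" * 16           # one chunk of data
print(sorted(store))
```

Because only `__getitem__`, `__setitem__`, `__delitem__`, `__iter__` and `__len__` are required, wrapping a new storage service is mostly a matter of translating these five operations.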

{
    "total_count": 2,
    "+1": 2,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
345080945 https://github.com/pydata/xarray/pull/1528#issuecomment-345080945 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDM0NTA4MDk0NQ== alimanfoo 703554 2017-11-16T22:18:04Z 2017-11-16T22:18:04Z CONTRIBUTOR

Re different zarr storage backends, main options are plain dict, DirectoryStore, ZipStore, and there's a new DBMStore class just merged which enables storage in any DBM-style database (e.g., Berkeley DB). ZipStore has some constraints because of how zip files work, you can't really replace an entry in a zip file which means anything that writes the same array chunk more than once will generate warnings. Dask's S3Map should also work, I haven't tried it and obviously not ideal for unit tests but I'd be interested if you get any experience with it.
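
The zip constraint can be reproduced with Python's own `zipfile` module, independent of zarr. In this sketch the entry name `foo/0.0` is just an illustrative chunk key; writing it twice appends a second entry instead of replacing the first, and `zipfile` emits a warning.

```python
import io
import warnings
import zipfile

buf = io.BytesIO()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with zipfile.ZipFile(buf, mode="w") as zf:
        zf.writestr("foo/0.0", b"first version of the chunk")
        # Writing the same key again cannot replace the first entry;
        # the archive simply gains a second entry with the same name.
        zf.writestr("foo/0.0", b"second version of the chunk")

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()

messages = [str(w.message) for w in caught]
print(names)
print(messages)
```

This is why a workload that rewrites chunks (for example, unaligned writes that touch one chunk several times) is a poor fit for a zip-backed store.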

Re different combinations of zarr and dask chunks, it can be thread safe even if chunks are not aligned, just need to pass a synchronizer when instantiating the array or group. Zarr has a ThreadSynchronizer class which can be used for thread-based parallelism. If a synchronizer is provided, it is used to lock each chunk individually during write operations. More info here.
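
The idea of locking each chunk individually can be sketched with the standard library. This is not zarr's ThreadSynchronizer implementation; the class `ChunkLocks`, the dict-based store, and the chunk key `"0.0"` are assumptions for illustration.

```python
import threading
from collections import defaultdict


class ChunkLocks:
    """Hand out one lock per chunk key, creating locks on first use."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = defaultdict(threading.Lock)

    def __getitem__(self, chunk_key):
        with self._guard:                 # protect the dict of locks itself
            return self._locks[chunk_key]


store = {"0.0": [0] * 8}                  # one chunk shared by all writers
locks = ChunkLocks()


def write_element(chunk_key, index, value):
    """Read-modify-write of a whole chunk, as a partial-chunk write requires."""
    with locks[chunk_key]:
        chunk = list(store[chunk_key])    # read (and notionally decompress)
        chunk[index] = value              # modify one element
        store[chunk_key] = chunk          # write the whole chunk back


threads = [
    threading.Thread(target=write_element, args=("0.0", i, i + 1))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(store["0.0"])
```

Without the per-chunk lock, two threads could both read the old chunk and one update would be lost when the second writes its copy back; writers to different chunks never contend.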

Re fill values, zarr has a native concept of fill value for each array, with the fill value stored as part of the array metadata. Array metadata are stored as JSON, and I recently merged a fix so that bytes fill values can be used (via base64 encoding). I believe the netcdf way is to store the fill value separately as the value of the "_FillValue" attribute? You could do this with zarr, but user attributes are also JSON and so you would need to do your own encoding/decoding. But if possible I'd suggest using the native zarr fill_value support, as it handles bytes fill value encoding and also checks to ensure fill values are valid wrt the array dtype.
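
JSON cannot hold raw bytes, which is why a bytes fill value needs an encoding step such as base64. The following stdlib sketch shows the round trip; the helper names and the `is_bytes_dtype` flag are hypothetical, and the metadata dictionary is a simplified stand-in rather than a complete zarr metadata document.

```python
import base64
import json


def encode_fill_value(fill_value):
    """Make a fill value JSON-serialisable; bytes go through base64."""
    if isinstance(fill_value, bytes):
        return base64.standard_b64encode(fill_value).decode("ascii")
    return fill_value


def decode_fill_value(encoded, is_bytes_dtype):
    """Invert encode_fill_value, given whether the array dtype holds bytes."""
    if is_bytes_dtype and encoded is not None:
        return base64.standard_b64decode(encoded)
    return encoded


original = b"\x00\xffab"
metadata = json.dumps({"dtype": "|S4", "fill_value": encode_fill_value(original)})
restored = decode_fill_value(json.loads(metadata)["fill_value"], is_bytes_dtype=True)
print(metadata)
```

The same kind of manual encoding would be needed to keep a bytes-valued "_FillValue" in user attributes, which is the extra work the comment above warns about.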

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
339897936 https://github.com/pydata/xarray/pull/1528#issuecomment-339897936 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzOTg5NzkzNg== alimanfoo 703554 2017-10-27T07:42:34Z 2017-10-27T07:42:34Z CONTRIBUTOR

Suggest testing against GitHub master, there are a few other issues I'd like to work through before next release.

On Thu, 26 Oct 2017 at 23:07, Ryan Abernathey notifications@github.com wrote:

Fantastic! Are you planning a release any time soon? If not we can set up to test against the github master.

Sent from my iPhone

On Oct 26, 2017, at 5:04 PM, Alistair Miles notifications@github.com wrote:

Just to say, support for 0d arrays, and for arrays with one or more zero-length dimensions, is in zarr master.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/pydata/xarray/pull/1528#issuecomment-339815147, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq8QtP5kta-H9Y90Puv9BHig7krEI0Wks5swQKQgaJpZM4PDrlp .


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
339800443 https://github.com/pydata/xarray/pull/1528#issuecomment-339800443 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzOTgwMDQ0Mw== alimanfoo 703554 2017-10-26T21:04:17Z 2017-10-26T21:04:17Z CONTRIBUTOR

Just to say, support for 0d arrays, and for arrays with one or more zero-length dimensions, is in zarr master.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
335186616 https://github.com/pydata/xarray/pull/1528#issuecomment-335186616 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzNTE4NjYxNg== alimanfoo 703554 2017-10-09T15:07:29Z 2017-10-09T17:23:21Z CONTRIBUTOR

I'm on paternity leave for the next 2 weeks, then will be catching up for a couple of weeks I expect. May be able to merge straightforward PRs but will have limited bandwidth.

{
    "total_count": 3,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 3,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
335030993 https://github.com/pydata/xarray/pull/1528#issuecomment-335030993 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzNTAzMDk5Mw== alimanfoo 703554 2017-10-08T19:17:27Z 2017-10-08T23:37:47Z CONTRIBUTOR

FWIW I think some JSON encoders for attributes would ultimately be a useful addition to zarr, but I won't be able to put any effort into zarr in the next month, so workarounds in xarray sounds like a good idea for now.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
333400272 https://github.com/pydata/xarray/pull/1528#issuecomment-333400272 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMzMzQwMDI3Mg== martindurant 6042212 2017-10-01T19:26:22Z 2017-10-01T19:26:22Z CONTRIBUTOR

I have not done anything, I'm afraid, since posting my commit, the content of which is just an example of how you might pass parameters down to zarr, and a test-case which shows that the basic data is round-tripping properly, but actually the dataset does not come back with the same structure as it started off. We can loop back and decide where to go from here.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
327901739 https://github.com/pydata/xarray/pull/1528#issuecomment-327901739 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNzkwMTczOQ== martindurant 6042212 2017-09-07T19:36:15Z 2017-09-07T19:36:15Z CONTRIBUTOR

@shoyer , is https://github.com/martindurant/xarray/commit/6c1fb6b76ebba862a1c5831210ce026160da0065 a reasonable start ?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
327833777 https://github.com/pydata/xarray/pull/1528#issuecomment-327833777 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNzgzMzc3Nw== martindurant 6042212 2017-09-07T15:23:31Z 2017-09-07T15:23:31Z CONTRIBUTOR

@rabernat , is there anything I can do to help push this along?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325813339 https://github.com/pydata/xarray/pull/1528#issuecomment-325813339 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTgxMzMzOQ== alimanfoo 703554 2017-08-29T21:43:48Z 2017-08-29T21:43:48Z CONTRIBUTOR

On Tuesday, August 29, 2017, Ryan Abernathey notifications@github.com wrote:

@alimanfoo https://github.com/alimanfoo: when do you anticipate the 2.2 zarr release to happen? Will the API change significantly? If so, I will wait for that to move forward here.

Zarr 2.2 will hopefully happen some time in the next 2 months, but it will be fully backwards-compatible, no breaking API changes.


{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325729013 https://github.com/pydata/xarray/pull/1528#issuecomment-325729013 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTcyOTAxMw== alimanfoo 703554 2017-08-29T17:02:41Z 2017-08-29T17:02:41Z CONTRIBUTOR

FWIW all filter (codec) classes have been migrated from zarr to a separate package called numcodecs and will be imported from there in the next (2.2) zarr release. Here is FixedScaleOffset. Implementation is basic numpy, probably some room for optimization.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325728378 https://github.com/pydata/xarray/pull/1528#issuecomment-325728378 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTcyODM3OA== martindurant 6042212 2017-08-29T17:00:29Z 2017-08-29T17:00:29Z CONTRIBUTOR

A further rather big advantage in zarr that I'm not aware of in cdf/hdf (I may be wrong) is not just null values, but not having a given block be written to disc at all if it only contains null data. This probably meshes perfectly well with most users' understanding of missing data/fill value.
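
The "unwritten block" behaviour can be modelled in a few lines. This toy class is an illustration only, not zarr's implementation: it handles 1-D contiguous slices, keeps chunks in a dict, and synthesises any chunk that was never written from the fill value.

```python
class ToyChunkedArray:
    """1-D array of numbers stored as fixed-length chunks in a dict."""

    def __init__(self, length, chunk_len, fill_value=0):
        self.length = length
        self.chunk_len = chunk_len
        self.fill_value = fill_value
        self.store = {}                   # chunk index -> list of values

    def _chunk(self, idx):
        # A chunk that was never written is synthesised from the fill value.
        return self.store.get(idx, [self.fill_value] * self.chunk_len)

    def __setitem__(self, key, values):
        start, stop, _ = key.indices(self.length)   # step is ignored in this toy
        for pos, value in zip(range(start, stop), values):
            idx, offset = divmod(pos, self.chunk_len)
            chunk = list(self._chunk(idx))
            chunk[offset] = value
            self.store[idx] = chunk

    def __getitem__(self, key):
        start, stop, _ = key.indices(self.length)
        return [self._chunk(pos // self.chunk_len)[pos % self.chunk_len]
                for pos in range(start, stop)]


arr = ToyChunkedArray(length=1_000_000, chunk_len=1000, fill_value=-1)
arr[2000:2003] = [7, 8, 9]               # touches only chunk 2
print(len(arr.store))
print(arr[1998:2004])
```

A million-element array costs one stored chunk here, and reads outside that chunk still return the fill value, which matches the intuition of "missing data" described above.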

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325727354 https://github.com/pydata/xarray/pull/1528#issuecomment-325727354 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTcyNzM1NA== martindurant 6042212 2017-08-29T16:57:10Z 2017-08-29T16:57:10Z CONTRIBUTOR

Worth pointing out here, that the zarr filter-set is extensible (I suppose hdf5 is too, but I don't think this is ever done in practice), but I don't think it makes any particular claims to performance.

I think both of the options above are reasonable, and there is no particular reason to exclude either: a zarr variable could look to xarray like floats but actually be stored as ints (i.e., arguments are passed to zarr), or it could look like ints which xarray expects to inflate to floats (i.e., stored as an attribute). I mean, if a user stores a float variable, but includes kwargs to zarr for scale/filter (or any other filter arguments), we should make no attempt to interrupt that.

The only question is, if the user wishes to apply scale/offset in xarray, which is their most likely intention? I would guess the latter, compute in xarray and use attributes, since xarray users probably don't know about zarr and its filters.
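
For the attribute-based option, the arithmetic is simple enough to sketch in pure Python. The attribute names `scale_factor` and `add_offset` follow the CF conventions and are an assumption here, since the comment does not name them; the function names and the sentinel value are likewise illustrative.

```python
def encode_cf(values, scale_factor, add_offset, fill_value=-32768):
    """Pack floats as integers; None marks a missing value."""
    return [
        fill_value if v is None else round((v - add_offset) / scale_factor)
        for v in values
    ]


def decode_cf(packed, scale_factor, add_offset, fill_value=-32768):
    """Unpack integers using unpacked = packed * scale_factor + add_offset."""
    return [
        None if p == fill_value else p * scale_factor + add_offset
        for p in packed
    ]


temps = [271.35, 288.15, None, 301.02]
attrs = {"scale_factor": 0.01, "add_offset": 273.15}   # stored as attributes

packed = encode_cf(temps, **attrs)
restored = decode_cf(packed, **attrs)
print(packed)
print(restored)
```

In this arrangement the store holds plain integers plus two attributes, and the inflation to floats happens in the reading library rather than in a storage-level filter.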

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325727280 https://github.com/pydata/xarray/pull/1528#issuecomment-325727280 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTcyNzI4MA== alimanfoo 703554 2017-08-29T16:56:55Z 2017-08-29T16:56:55Z CONTRIBUTOR

Following this with interest.

Regarding autoclose, just to confirm that zarr doesn't really have any notion of whether something is open or closed. When using the DirectoryStore storage class (most common use case I imagine), all files are automatically closed, nothing is kept open. There are some storage classes (e.g., ZipStore) that do require an explicit close call to finalise the file on disk if you have been writing data, but I think you can ignore this in xarray and leave it up to the user to manage this themselves.

Out of interest, @shoyer do you still think there would be value in writing a wrapper for zarr analogous to h5netcdf? Or does this PR provide all the necessary functionality?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325390391 https://github.com/pydata/xarray/pull/1528#issuecomment-325390391 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTM5MDM5MQ== martindurant 6042212 2017-08-28T15:41:08Z 2017-08-28T15:41:08Z CONTRIBUTOR

@rabernat: on actually looking through your code :) Happy to see you doing exactly what I felt I was not knowledgeable enough to do, and poking at xarray's guts. If I can help in any way, please let me know, although I don't have a lot of spare hours right now.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694
325220001 https://github.com/pydata/xarray/pull/1528#issuecomment-325220001 https://api.github.com/repos/pydata/xarray/issues/1528 MDEyOklzc3VlQ29tbWVudDMyNTIyMDAwMQ== martindurant 6042212 2017-08-27T19:46:31Z 2017-08-27T19:46:31Z CONTRIBUTOR

Sorry that I let this slide - there was not a huge upswell of interest around what I had done, and I was not ready to dive into xarray internals. Could you comment more on the difference between your approach and mine? Is the aim to reduce the number of metadata files hanging around? zarr has made an effort with the groups interface to parallel netCDF, which is, after all, what xarray essentially expects of all its data sources.

As in this comment I have come to the realisation that although nice to/from zarr methods can be made relatively easily, they will not get traction unless they can be put within a class that mimics the existing xarray infrastructure, i.e., the user would never know, except that magically they have extra encoding/compression options, the file-path can be an S3 URL (say), and dask parallel computation suddenly works on a cluster and/or with out-of-core processing. That would raise some eyebrows!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  WIP: Zarr backend 253136694


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);