html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/issues/7446#issuecomment-1396560033,https://api.github.com/repos/pydata/xarray/issues/7446,1396560033,IC_kwDOAMm_X85TPdCh,56827,2023-01-19T07:44:30Z,2023-01-19T07:44:30Z,NONE,"On Tue, Jan 17, 2023 at 5:23 PM Ryan Abernathey ***@***.***> wrote:

> Hi @gauteh ! This is very cool! Thanks for sharing. I'm really excited about the way that Rust can be used to optimize different parts of our stack.
>
> A couple of questions:
>
> - Can your reader read over HTTP / S3 protocol? Or is it just local files?

It is built to do this, but I haven't implemented it yet. I initially wrote it for an OpenDAP server (dars: https://github.com/gauteh/dars), where the plan is to also support files stored in the cloud. The hidefix reader can read from any interface that supports ReadAt or Read + Seek. It would probably be beneficial to index the files beforehand. I submitted a patch to HDF5 that allows it to iterate over the chunks quickly, so indexing a 5-6 GB file takes only a couple of hundred ms; because of that I no longer store the index for local files. It is still faster than native HDF5 even when the indexing is included.

> - Do you know about kerchunk? The approach you described:
>
>   > The reader works by indexing the chunks of a dataset so that chunks can be accessed independently.
>
>   ...is identical to the approach taken by kerchunk (although the implementation is different). I'm curious what specification you use to store your indexes. Could we make your implementation interoperable with kerchunk, such that a kerchunk reference specification could be read by your reader? It would be great to reach for some degree of alignment here.

The index is serializable using the Rust serde system, so it can be stored in any format supported by that.
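To make the chunk-index idea concrete, here is a hypothetical pure-Python sketch (not hidefix's actual index format or API): the index maps each chunk's coordinates to a byte offset and size, so any chunk can be fetched independently from anything that supports seek() + read() — a local file, an HTTP range request, or an S3 object.

```python
import io
import struct

# Hypothetical chunk index (illustration only, not hidefix's on-disk format):
# chunk coordinates -> (byte offset, byte size).
index = {
    (0, 0): (0, 8),
    (0, 1): (8, 8),
}

# A fake "file": two chunks of two little-endian f32 values each.
buf = io.BytesIO(struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))

def read_chunk(fh, coords):
    """Read one chunk independently of all others via the index."""
    offset, size = index[coords]
    fh.seek(offset)
    return struct.unpack("<2f", fh.read(size))

print(read_chunk(buf, (0, 1)))  # → (3.0, 4.0)
```

Because each lookup is just an (offset, size) pair, chunks can also be read concurrently without touching the HDF5 metadata again — which is what makes the indexed reader fast over both local and remote interfaces.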
A fair amount of effort went into making the deserialization _zero-copy_: the e.g. 10 MB index for a 5-6 GB file can be read very quickly, since the read buffers are memory-mapped directly onto the structures and require very little actual deserialization. I don't have a specific format at the moment, but I have used bincode a lot in e.g. dars.

> - Do you know about hdf5-coro - http://icesat2sliderule.org/h5coro/ - they have similar goals, but focused on cloud-based access

> > I hope this can be of general interest, and if it would be of interest to move the hidefix xarray backend into xarray that would be very cool.
>
> This is definitely of general interest! However, it is not necessary to add a new backend directly into xarray. We support entry points which allow packages to implement their own readers, as you have apparently already discovered: https://docs.xarray.dev/en/stable/internals/how-to-add-new-backend.html
>
> Installing your package should be enough to enable the new engine.
>
> We would, however, welcome a documentation PR that describes how to use this package on the I/O page.

Great, the package should already register itself with xarray.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1536004355
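The zero-copy deserialization described above can be sketched in Python using a memoryview (a hypothetical illustration of the concept, not the actual serde/bincode layout): fields are decoded directly out of the original buffer, and slices remain views onto the same memory rather than copies.

```python
import struct

# Illustration only: a fake serialized index record consisting of a
# 16-byte little-endian (offset, size) header followed by payload bytes.
raw = struct.pack("<QQ", 4096, 1024) + b"\x00" * 16

view = memoryview(raw)                           # no copy of the underlying bytes
offset, size = struct.unpack_from("<QQ", view)   # decode fields straight from the view
payload = view[16:32]                            # slicing a memoryview is also zero-copy

print(offset, size)  # → 4096 1024
```

Mapping the read buffer straight onto the index structures like this is why loading even a multi-megabyte index adds very little overhead: the expensive step is the I/O, not the decoding.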