issue_comments: 510816294


html_url: https://github.com/pydata/xarray/issues/3096#issuecomment-510816294
issue_url: https://api.github.com/repos/pydata/xarray/issues/3096
id: 510816294
node_id: MDEyOklzc3VlQ29tbWVudDUxMDgxNjI5NA==
user: 18643609
created_at: 2019-07-12T09:19:41Z
updated_at: 2019-07-12T09:19:41Z
author_association: NONE
performed_via_github_app:

body:

Hi @VincentDehaye. Thanks for being an early adopter! We really appreciate your feedback. I'm sorry it didn't work as expected. We are in really new territory with this feature.

I'm a bit confused about why you are using the multiprocessing module here. The recommended way of parallelizing xarray operations is via the built-in dask support. There is no guarantee that multiprocessing the way you are doing it will work correctly. When we talk about parallel append, we are always talking about dask.
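For illustration, a minimal sketch of what dask-backed xarray looks like (the file name, variable name, and chunk sizes are placeholders, not taken from the issue):

```python
import xarray as xr

# Opening with `chunks=` gives dask-backed (lazy) arrays instead of
# loading the values into memory.
ds = xr.open_dataset("input.nc", chunks={"time": 100})

# Operations stay lazy and build a dask task graph...
monthly_mean = ds["temperature"].groupby("time.month").mean()

# ...which is only executed, in parallel, when you compute or write.
result = monthly_mean.compute()
```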

Your MCVE is not especially helpful for debugging because the two key functions (make_xarray_dataset and upload_to_s3) are not shown. Could you try simplifying your example a bit? I know that is hard when the cloud is involved, but try to let us see more of what is happening under the hood.

If you are creating a dataset for the first time, you probably don't want append. You want to do

```python
ds = xr.open_mfdataset(all_the_source_files)
ds.to_zarr(s3fs_target)
```

If you are using a dask cluster, this will automatically parallelize everything.
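As a rough, untested sketch of that workflow end to end (the bucket name, file pattern, and paths are made up for illustration, and it assumes s3fs is used for the target store):

```python
import s3fs
import xarray as xr

# Open all source files lazily as one dask-backed dataset.
ds = xr.open_mfdataset("source_files/*.nc")

# Wrap the S3 target as a mutable mapping that zarr can write into.
fs = s3fs.S3FileSystem()  # picks up AWS credentials from the environment
store = fs.get_mapper("my-bucket/my-dataset.zarr")

# Write the whole dataset; with a dask cluster attached, chunks are
# written in parallel by the workers.
ds.to_zarr(store, mode="w")
```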

Hi @rabernat, thank you for your quick answer. I edited my MCVE so that you can reproduce the error (as long as you have access to an S3 bucket). I had actually forgotten about open_mfdataset; that's why I was doing it this way. However, in the future I would still like to be able to have standalone workers, because bandwidth quickly becomes a bottleneck for me (both when downloading the source files and when uploading to the cloud), so I would like to split the tasks across different machines.
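One way to get that kind of standalone worker without multiprocessing is a dask.distributed cluster; a hypothetical sketch (the scheduler address, file pattern, and bucket path are placeholders):

```python
from dask.distributed import Client
import s3fs
import xarray as xr

# Connect to a scheduler; the workers it manages can run on other
# machines, so downloads and uploads are spread across their network links.
client = Client("tcp://scheduler-address:8786")

ds = xr.open_mfdataset("source_files/*.nc")  # tasks run on the cluster
store = s3fs.S3FileSystem().get_mapper("my-bucket/my-dataset.zarr")
ds.to_zarr(store, mode="w")  # each chunk is uploaded by whichever worker computed it
```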

Regarding open_mfdataset(), I checked the code and realized that under the hood it only calls open_dataset() multiple times. I was worried it would load the values (and not only the metadata) into memory, but I checked on one file and it apparently does not. Can you confirm this? In that case I could probably open my whole dataset at once, which would be very convenient. After reading your issue #1385, I also need to check that my case works fine with decode_cf=False. I ran into some trouble with the append on a time dimension but found a workaround; I will probably open another issue to document it.
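For the time-dimension append mentioned above, a minimal sketch of the pattern this feature supports (the paths are placeholders, and it assumes both batches share all non-time coordinates):

```python
import xarray as xr

# The first batch creates the zarr store.
first = xr.open_mfdataset("batch_0/*.nc")
first.to_zarr("target.zarr", mode="w")

# Later batches are appended along the time dimension.
nxt = xr.open_mfdataset("batch_1/*.nc")
nxt.to_zarr("target.zarr", mode="a", append_dim="time")
```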

reactions:
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
issue: 466994138