issue_comments
10 rows where issue = 1197117301 sorted by updated_at descending
Issue: Writing a a dataset to .zarr in a loop makes all the data NaNs (10 comments)
id | html_url | issue_url | node_id | user | created_at | updated_at | author_association | body | reactions | performed_via_github_app | issue |
1099643203 | https://github.com/pydata/xarray/issues/6456#issuecomment-1099643203 | https://api.github.com/repos/pydata/xarray/issues/6456 | IC_kwDOAMm_X85BizlD | max-sixty 5635139 | 2022-04-14T21:31:37Z | 2022-04-14T21:31:37Z | MEMBER |
Right, you changed the example after I responded
Something surprising is indeed going on here. To focus on the surprising part:

```python
print(ds3.low_dim.values)
ds3.to_zarr('zarr_bug.zarr', mode='w')
print(ds3.low_dim.values)
```

returns:

Similarly:

```python
In [50]: ds3.low_dim.count().compute()
Out[50]:
<xarray.DataArray 'low_dim' ()>
array(1000000)

In [51]: ds3.to_zarr('zarr_bug.zarr', mode='w')
Out[51]: <xarray.backends.zarr.ZarrStore at 0x16a27c6d0>

In [55]: ds3.low_dim.count().compute()
Out[55]:
<xarray.DataArray 'low_dim' ()>
array(500000)
```

So it's changing the result in memory just from writing to the Zarr store. I'm not sure what the cause is. We can still massively reduce the size of this example — it's currently doing pickling, got a bunch of repeated code, etc. Does it work without the pickling? What if … |
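One thing worth checking (not established in this thread) is whether `ds3` still holds lazy references to `zarr_bug.zarr`, so that `mode='w'` clears the arrays before those chunks are read back during the write. Below is a self-contained sketch of that pattern; the path, sizes, and variable names are invented for illustration:

```python
import numpy as np
import xarray as xr

path = "lazy_overwrite_demo.zarr"  # invented path, not the store from the thread

# Write a small chunked dataset, then re-open it lazily (requires dask).
ds = xr.Dataset({"low_dim": (("x",), np.arange(10.0))}).chunk({"x": 5})
ds.to_zarr(path, mode="w")
lazy = xr.open_zarr(path)

print(int(lazy.low_dim.count()))  # 10: the data is still in the store

# Overwrite the same store while `lazy` still references it. mode='w'
# recreates the arrays first, so the lazy chunks may come back as fill
# values (NaN) instead of the original data.
lazy.to_zarr(path, mode="w")
print(int(lazy.low_dim.count()))  # may now be less than 10

# Loading into memory before the write avoids reading from the store
# that is being replaced:
# lazy.load().to_zarr(path, mode="w")
```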
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Writing a a dataset to .zarr in a loop makes all the data NaNs 1197117301 | |
1098856530 | https://github.com/pydata/xarray/issues/6456#issuecomment-1098856530 | https://api.github.com/repos/pydata/xarray/issues/6456 | IC_kwDOAMm_X85BfzhS | tbloch1 34276374 | 2022-04-14T08:37:11Z | 2022-04-14T08:37:11Z | NONE | @delgadom thanks! This did help with my actual code, and I've now done my processing. But this bug report was more about the fact that overwriting was converting data to NaNs (in two different ways depending on the code apparently). In my case there is no longer any need to do the overwriting, but this doesn't seem like the expected behaviour of overwriting, and I'm sure there are some valid reasons to overwrite data - hence me opening the bug report. If overwriting is supposed to convert data to NaNs then I guess we could close this issue, but I'm not sure that's intended? |
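If an in-place overwrite is still needed while the dataset lazily references the same store, one possible workaround (not suggested in the thread; the scratch path below is invented) is to write the combined result to a fresh path and swap it in afterwards:

```python
import shutil
import xarray as xr

src = "zarr_bug.zarr"       # the store the lazy dataset was opened from
tmp = "zarr_bug.tmp.zarr"   # invented scratch path

ds3 = xr.open_zarr(src)     # plus whatever new data gets concatenated in

# Writing to a different path means the chunks that still point at `src`
# are read before anything there is touched.
ds3.to_zarr(tmp, mode="w")

# Only then replace the old store with the new one.
shutil.rmtree(src)
shutil.move(tmp, src)
```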
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Writing a a dataset to .zarr in a loop makes all the data NaNs 1197117301 | |
1098574761 | https://github.com/pydata/xarray/issues/6456#issuecomment-1098574761 | https://api.github.com/repos/pydata/xarray/issues/6456 | IC_kwDOAMm_X85Beuup | delgadom 3698640 | 2022-04-13T23:34:16Z | 2022-04-13T23:34:48Z | CONTRIBUTOR |
when I said "you're overwriting the file every iteration" I meant to put the emphasis on overwiting. by using See the docs on
This interpretation of mode is consistent across all of python - see the docs for python builtins: open So I think changing your writes to |
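For reference, a minimal sketch of overwriting versus appending; this example is not from the thread, the store path and toy datasets are invented, and it assumes an xarray version that supports the `append_dim` option of `to_zarr`:

```python
import numpy as np
import xarray as xr

store = "append_vs_overwrite.zarr"  # invented path for this sketch

ds_a = xr.Dataset({"low_dim": (("fname", "res_dim"), np.random.rand(2, 3))},
                  coords={"fname": ["a", "b"], "res_dim": [0, 1, 2]})
ds_b = xr.Dataset({"low_dim": (("fname", "res_dim"), np.random.rand(2, 3))},
                  coords={"fname": ["c", "d"], "res_dim": [0, 1, 2]})

# mode='w': the second call replaces the whole store, so only ds_b is left.
ds_a.to_zarr(store, mode="w")
ds_b.to_zarr(store, mode="w")

# append_dim: the second call grows the store along 'fname',
# leaving the data written by the first call in place.
ds_a.to_zarr(store, mode="w")
ds_b.to_zarr(store, append_dim="fname")
```

With the appending form, each iteration of a loop only writes the newly added slice instead of rewriting everything read so far.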
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Writing a a dataset to .zarr in a loop makes all the data NaNs 1197117301 | |
1096382964 | https://github.com/pydata/xarray/issues/6456#issuecomment-1096382964 | https://api.github.com/repos/pydata/xarray/issues/6456 | IC_kwDOAMm_X85BWXn0 | tbloch1 34276374 | 2022-04-12T08:47:55Z | 2022-04-12T08:48:48Z | NONE | @max-sixty could you explain which bit isn't working for you? The initial example I shared works fine in colab for me, so that might be a you problem. The second one required specifying the chunks when making the datasets (I've edited above). Here's a link to the colab (which has both examples). It's worth noting that the way in which the dataset is broken does seem to be slightly different in each of these examples - in the former example all data becomes NaN, in the latter example only the initially saved data becomes NaN. |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Writing a a dataset to .zarr in a loop makes all the data NaNs 1197117301 | |
1094583214 | https://github.com/pydata/xarray/issues/6456#issuecomment-1094583214 | https://api.github.com/repos/pydata/xarray/issues/6456 | IC_kwDOAMm_X85BPgOu | tbloch1 34276374 | 2022-04-11T06:01:44Z | 2022-04-12T08:48:13Z | NONE | @max-sixty - I've tried to slim it down below (no loop, and only one save). From the print statements, it's clear that the data is intact before overwriting the .zarr store, but not afterwards.

```python
import pandas as pd
import numpy as np
import glob
import xarray as xr
from tqdm import tqdm

# Creating pkl files
[pd.DataFrame(np.random.randint(0,10, (1000,500))).astype(object).to_pickle('df{}.pkl'.format(i)) for i in range(4)]

fnames = glob.glob('*.pkl')

df1 = pd.read_pickle(fnames[0])
df1.columns = np.arange(0,500).astype(object)  # the real pkl files contain all objects
df1.index = np.arange(0,1000).astype(object)
df1 = df1.astype(np.float32)

ds = xr.DataArray(df1.values, dims=['fname', 'res_dim'],
                  coords={'fname': df1.index.values, 'res_dim': df1.columns.values})
ds = ds.to_dataset(name='low_dim').chunk({'fname': 500, 'res_dim': 1})
ds.to_zarr('zarr_bug.zarr', mode='w')

ds1 = xr.open_zarr('zarr_bug.zarr', decode_coords="all")

df2 = pd.read_pickle(fnames[1])
df2.columns = np.arange(0,500).astype(object)
df2.index = np.arange(0,1000).astype(object)
df2 = df2.astype(np.float32)

ds2 = xr.DataArray(df2.values, dims=['fname', 'res_dim'],
                   coords={'fname': df2.index.values, 'res_dim': df2.columns.values})
ds2 = ds2.to_dataset(name='low_dim').chunk({'fname': 500, 'res_dim': 1})

ds3 = xr.concat([ds1, ds2], dim='fname')
ds3['fname'] = ds3.fname.astype(str)

print(ds3.low_dim.values)
ds3.to_zarr('zarr_bug.zarr', mode='w')
print(ds3.low_dim.values)
```

The output:
|
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Writing a a dataset to .zarr in a loop makes all the data NaNs 1197117301 | |
1095585081 | https://github.com/pydata/xarray/issues/6456#issuecomment-1095585081 | https://api.github.com/repos/pydata/xarray/issues/6456 | IC_kwDOAMm_X85BTU05 | max-sixty 5635139 | 2022-04-11T21:29:27Z | 2022-04-11T21:29:27Z | MEMBER | @tbloch1 it doesn't copy into someone else's python atm — that's the "C" part of MCVE... |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Writing a a dataset to .zarr in a loop makes all the data NaNs 1197117301 | |
1094587632 | https://github.com/pydata/xarray/issues/6456#issuecomment-1094587632 | https://api.github.com/repos/pydata/xarray/issues/6456 | IC_kwDOAMm_X85BPhTw | tbloch1 34276374 | 2022-04-11T06:07:06Z | 2022-04-11T10:42:51Z | NONE | @delgadom - In the example it's saving every iteration, but in my actual code it's much less frequent. I figured there was probably a better way to achieve the same thing, but it still doesn't seem like the expected behaviour, which is why I thought I should raise the issue here. The files are just sequentially named (as in my example), but the indices of the resulting dataframes are a bunch of unique strings (file-paths, not dates). |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Writing a a dataset to .zarr in a loop makes all the data NaNs 1197117301 | |
1094412198 | https://github.com/pydata/xarray/issues/6456#issuecomment-1094412198 | https://api.github.com/repos/pydata/xarray/issues/6456 | IC_kwDOAMm_X85BO2em | max-sixty 5635139 | 2022-04-10T23:46:53Z | 2022-04-10T23:46:53Z | MEMBER |
Or GH Discussions! But it would need a smaller MCVE |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Writing a a dataset to .zarr in a loop makes all the data NaNs 1197117301 | |
1094411214 | https://github.com/pydata/xarray/issues/6456#issuecomment-1094411214 | https://api.github.com/repos/pydata/xarray/issues/6456 | IC_kwDOAMm_X85BO2PO | delgadom 3698640 | 2022-04-10T23:40:49Z | 2022-04-10T23:40:49Z | CONTRIBUTOR | @tbloch1 following up on Max's suggestion - it looks like you might be overwriting the file with every iteration. See the docs on `ds.to_zarr` (in particular the `mode` argument). To me, this doesn't seem likely to be a bug, but is more of a usage question. Have you tried asking on stackoverflow with the xarray tag? |
{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Writing a a dataset to .zarr in a loop makes all the data NaNs 1197117301 | |
1093253883 | https://github.com/pydata/xarray/issues/6456#issuecomment-1093253883 | https://api.github.com/repos/pydata/xarray/issues/6456 | IC_kwDOAMm_X85BKbr7 | max-sixty 5635139 | 2022-04-08T19:05:12Z | 2022-04-08T19:05:12Z | MEMBER | Hi @tbloch1 — thanks for the issue. So I understand — is this loading the existing dataset, adding a slice, and then writing the whole result? Have you considered using …? For the example — would it be possible to slim that down a bit further? Does it happen with one read & write after the initial one? |
{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
Writing a a dataset to .zarr in a loop makes all the data NaNs 1197117301 |
```sql
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [performed_via_github_app] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
);
CREATE INDEX [idx_issue_comments_issue] ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user] ON [issue_comments] ([user]);
```
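A sketch of reproducing this page's query with Python's sqlite3 module; the database filename `github.db` is an assumption about where the export lives:

```python
import sqlite3

# Assumed filename of the SQLite export that contains issue_comments.
conn = sqlite3.connect("github.db")

rows = conn.execute(
    """
    SELECT id, user, created_at, updated_at, author_association, body
    FROM issue_comments
    WHERE issue = 1197117301
    ORDER BY updated_at DESC
    """
).fetchall()

for comment_id, user, created, updated, assoc, body in rows:
    # Print a one-line summary of each comment.
    print(comment_id, user, updated, assoc, body[:60].replace("\n", " "))

conn.close()
```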