html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,performed_via_github_app,issue
https://github.com/pydata/xarray/pull/7080#issuecomment-1259913775,https://api.github.com/repos/pydata/xarray/issues/7080,1259913775,IC_kwDOAMm_X85LGMIv,22566757,2022-09-27T18:47:00Z,2022-09-27T18:47:00Z,CONTRIBUTOR,"I think the current default for two-dimensional plots is to try to re-use an existing axis if neither `row` nor `col` are set (implied by documentation for `ax` argument in [the description of the `DataArray.plot` descriptor](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.plot.html#xarray.DataArray.plot)). Would this PR change that behavior, and should that be documented?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1385143758
https://github.com/pydata/xarray/issues/7076#issuecomment-1259894192,https://api.github.com/repos/pydata/xarray/issues/7076,1259894192,IC_kwDOAMm_X85LGHWw,22566757,2022-09-27T18:28:26Z,2022-09-27T18:28:26Z,CONTRIBUTOR,"Fix confirmed, thank you.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1384465119
https://github.com/pydata/xarray/issues/6439#issuecomment-1115251379,https://api.github.com/repos/pydata/xarray/issues/6439,1115251379,IC_kwDOAMm_X85CeWKz,22566757,2022-05-02T19:00:47Z,2022-05-02T19:05:19Z,CONTRIBUTOR,"Oh, right, you suggested that [a bit ago](https://github.com/pydata/xarray/issues/6439#issuecomment-1089182651).
When I check out `upstream/main` in my local XArray repository root and run the example, it completes without error. When I fix the example to use the correct dimension, the implicit print on the last line shows nearly the same result as `unstacked_diag` from a few lines earlier.
Still not sure what fixed this, but since it's working, I don't care so much. I will wait for this to show up in a release. Thank you!","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1192449540
https://github.com/pydata/xarray/issues/6439#issuecomment-1115127711,https://api.github.com/repos/pydata/xarray/issues/6439,1115127711,IC_kwDOAMm_X85Cd3-f,22566757,2022-05-02T17:04:13Z,2022-05-02T17:45:56Z,CONTRIBUTOR,"> Just a tip: You don't need any stacking for that. Just use an indexer with a new dim:
>
I am aware that I can extract the diagonal of the arrays by using the same index for each argument of `isel`. That is, in fact, how I extracted the diagonals in each case above (look for `diag_index` to find the examples).
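For concreteness, a minimal sketch of that pattern (made-up array; `diag_index` is the shared indexer over a new dimension):
```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.arange(16).reshape(4, 4), dims=(""x"", ""y""))
diag_index = xr.DataArray(np.arange(4), dims=""diag"")  # new dimension
diag = arr.isel(x=diag_index, y=diag_index)  # one-dimensional result along ""diag""
```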
The bit that interests me is unstacking the relevant dimension, because the data in the original case comes to me with, effectively, a stacked dimension, and I would like to turn it back into an unstacked dimension because that is what I am used to using `pcolormesh` to plot.
That is to say, skipping the unstacking rather defeats the purpose of what I am trying to do, unless you have suggestions for how to create a two-dimensional plot (one using something like `contourf` or `pcolormesh`) of a one-dimensional `Dataset`, or a series of two-dimensional plots from a two-dimensional `Dataset`.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1192449540
https://github.com/pydata/xarray/issues/2780#issuecomment-1112552981,https://api.github.com/repos/pydata/xarray/issues/2780,1112552981,IC_kwDOAMm_X85CUDYV,22566757,2022-04-28T18:57:26Z,2022-04-28T19:01:34Z,CONTRIBUTOR,"I found a way to get the sample dataset to save to a smaller netCDF:
```python
import os

import numpy as np
import numpy.testing as np_tst
import pandas as pd
import xarray as xr

##################################################
# Original example
# Create pandas DataFrame
df = pd.DataFrame(
    np.random.randint(low=0, high=10, size=(100000, 5)),
    columns=[""a"", ""b"", ""c"", ""d"", ""e""],
)
# Make 'e' a column of strings
df[""e""] = df[""e""].astype(str)
# Make 'f' a column of floats
DIGITS = 1
df[""f""] = np.around(10 ** DIGITS * np.random.random(size=df.shape[0]), DIGITS)
# Save to csv
df.to_csv(""df.csv"")
# Convert to an xarray's Dataset
ds = xr.Dataset.from_dataframe(df)
# Save NetCDF file
ds.to_netcdf(""ds.nc"")

##################################################
# Additions
def dtype_for_int_array(arry: ""array of integers"") -> np.dtype:
    """"""Find the smallest integer dtype that will encode arry.

    Parameters
    ----------
    arry : array of integers
        The array to compress

    Returns
    -------
    smallest : dtype
        The smallest dtype that will represent arry
    """"""
    largest = max(abs(arry.min()), abs(arry.max()))
    typecode = ""i{bytes:d}"".format(
        bytes=2
        ** np.nonzero(
            [
                np.iinfo(""i{bytes:d}"".format(bytes=2**i)).max >= largest
                for i in range(4)
            ]
        )[0][0]
    )
    return np.dtype(typecode)


def dtype_for_str_array(
    arry: ""xr.DataArray of strings"", for_disk: bool = True
) -> np.dtype:
    """"""Find a good string dtype for encoding arry.

    Parameters
    ----------
    arry : xr.DataArray of strings
        The array to compress
    for_disk : bool
        True if meant for encoding argument of to_netcdf()
        False if meant for in-memory datasets

    Returns
    -------
    smallest : dtype
        The smallest dtype that will represent arry
    """"""
    lengths = arry.str.len()
    largest = int(lengths.max())
    if not for_disk:
        # Variant for in-memory datasets
        # Makes dask happier about strings
        typecode = ""S{bytes:d}"".format(bytes=largest)
    else:
        # Variant for on-disk datasets
        # 0.2 and 0.6 are both guesses
        # If there's ""a lot"" of strings ""much shorter than"" the longest
        # use vlen str where available
        # otherwise use a string concatenation dimension
        if lengths.quantile(0.2) < 0.6 * largest:
            typecode = ""O""
        else:
            typecode = ""S1""
    return np.dtype(typecode)


# Set up encoding for saving to netCDF
encoding = {}
for name, var in ds.items():
    encoding[name] = {}
    var_kind = var.dtype.kind
    # Perhaps we should assume ""u"" means people know what they're
    # doing
    if var_kind in (""u"", ""i""):
        dtype = dtype_for_int_array(var)
        if var_kind == ""u"":
            # Keep unsigned data unsigned, e.g. ""<i2"" becomes ""<u2""
            dtype = np.dtype(dtype.str.replace(""i"", ""u""))
    elif var_kind == ""f"":
        finfo = np.finfo(var.dtype)
        abs_var = np.abs(var)
        dynamic_range = abs_var.max() / abs_var[abs_var > 0].min()
        if dynamic_range > 10**finfo.precision:
            # Dynamic range too high for quantization
            dtype = var.dtype
        else:
            # set scale_factor and add_offset for quantization
            # Also figure out what dtype compresses best
            var_min = var.min()
            var_range = var.max() - var_min
            mid_range = var_min + var_range / 2
            # Rescale to -1 to 1
            values_to_compress = (var - mid_range) / (0.5 * var_range)
            # for digits in range(finfo.precision):
            for digits in (2, 4, 9, 18):
                if np.allclose(
                    values_to_compress,
                    np.around(values_to_compress, digits),
                    rtol=10.0 ** -finfo.precision,
                ):
                    # Convert digits to integer dtype
                    # digits <= 2 to i1
                    # digits <= 4 to i2
                    # digits <= 9 to i4
                    # digits <= 18 to i8
                    if digits <= 2:
                        dtype = np.dtype(""i1"")
                    elif digits <= 4:
                        dtype = np.dtype(""i2"")
                    elif digits <= 9:
                        dtype = np.dtype(""i4"")
                    else:
                        dtype = np.dtype(""i8"")
                    if dtype.itemsize >= var.dtype.itemsize:
                        # Quantization does not save space
                        dtype = var.dtype
                    else:
                        # Quantization saves space
                        storage_iinfo = np.iinfo(dtype)
                        encoding[name][""add_offset""] = mid_range.values
                        encoding[name][""scale_factor""] = (
                            2 * var_range / storage_iinfo.max
                        ).values
                        encoding[name][""_FillValue""] = storage_iinfo.min
                    break
            else:
                # Quantization would lose information
                dtype = var.dtype
    elif var_kind == ""O"":
        dtype = dtype_for_str_array(var)
    else:
        dtype = var.dtype
    encoding[name][""dtype""] = dtype
ds.to_netcdf(""ds_encoded.nc"", encoding=encoding)

# Display results
stat_csv = os.stat(""df.csv"")
stat_nc = os.stat(""ds.nc"")
stat_enc = os.stat(""ds_encoded.nc"")
sizes = pd.Series(
    index=[""CSV"", ""default netCDF"", ""encoded netCDF""],
    data=[stats.st_size for stats in [stat_csv, stat_nc, stat_enc]],
    name=""File sizes"",
)
print(""File sizes (kB):"", np.right_shift(sizes, 10), sep=""\n"", end=""\n\n"")
print(""Sizes relative to CSV:"", sizes / sizes.iloc[0], sep=""\n"", end=""\n\n"")

# Check that I didn't break the floats
from_disk = xr.open_dataset(""ds_encoded.nc"")
np_tst.assert_allclose(ds[""f""], from_disk[""f""], rtol=10**-DIGITS, atol=10**-DIGITS)
```
```bash
$ python xarray_auto_small_output.py && ls -sSh *.csv *.nc
File sizes (kB):
CSV                1942
default netCDF    10161
encoded netCDF     1375
Name: File sizes, dtype: int64
Sizes relative to CSV:
CSV               1.000000
default netCDF    5.230366
encoded netCDF    0.708063
Name: File sizes, dtype: float64
10M ds.nc 1.9M df.csv 1.4M ds_encoded.nc
```
I added a column of floats with one digit before and after the decimal point to the example dataset, because why not.
Does this satisfy your use-case?
Should I turn the giant loop into a function to go into xarray somewhere? If so, I should probably tie the float handling in with [the new `least_significant_digit` feature in netCDF4-python](https://unidata.github.io/netcdf4-python/#efficient-compression-of-netcdf-variables) so the data gets read in the same way it was before getting written out.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,412180435
https://github.com/pydata/xarray/issues/6310#issuecomment-1069092987,https://api.github.com/repos/pydata/xarray/issues/6310,1069092987,IC_kwDOAMm_X84_uRB7,22566757,2022-03-16T12:50:50Z,2022-03-16T12:50:50Z,CONTRIBUTOR,That could work. Are you set up to check that? Either a full repository checkout or an XArray installation you can edit would work.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1154014066
https://github.com/pydata/xarray/issues/6310#issuecomment-1069084130,https://api.github.com/repos/pydata/xarray/issues/6310,1069084130,IC_kwDOAMm_X84_uO3i,22566757,2022-03-16T12:40:20Z,2022-03-16T12:40:20Z,CONTRIBUTOR,"Given this:
https://github.com/pydata/xarray/blob/613a8fda4f07181fbc41d6ff2296fec3726fd351/xarray/conventions.py#L782-L783
I think that should be working. This:
https://github.com/pydata/xarray/blob/613a8fda4f07181fbc41d6ff2296fec3726fd351/xarray/conventions.py#L770-L779
explicitly says it should, and is probably the part where things go wrong, but it should be going wrong the same way for `encoding` and `attrs`.
I think
https://github.com/pydata/xarray/blob/613a8fda4f07181fbc41d6ff2296fec3726fd351/xarray/conventions.py#L758-L768
may need to be split into two conditionals, one for `attrs` and one for `encoding`. I'm not sure how to get the `continue` behavior while allowing the code to work for both `attrs` and `encoding` without code duplication.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1154014066
https://github.com/pydata/xarray/issues/6310#issuecomment-1069064616,https://api.github.com/repos/pydata/xarray/issues/6310,1069064616,IC_kwDOAMm_X84_uKGo,22566757,2022-03-16T12:17:37Z,2022-03-16T12:17:37Z,CONTRIBUTOR,"I tried to find what the CF conventions say about including dimension coordinates (I'm using [the name from scitools-iris](https://scitools-iris.readthedocs.io/en/stable/userguide/iris_cubes.html#coordinates) rather than ""coordinate variable"" as used in the CF conventions to keep myself from getting confused) in the `coordinates` attribute. From what I can tell, the whole document is consistent with usually excluding dimension coordinates from the `coordinates` attribute. Most of the [Discrete Sampling Geometry examples in appendix H](https://cfconventions.org/cf-conventions/cf-conventions.html#appendix-examples-discrete-geometries) seem to include the dimension coordinates in the `coordinates` attributes, [though at least one example](https://cfconventions.org/cf-conventions/cf-conventions.html#_orthogonal_multidimensional_array_representation_of_time_series) leaves the dimension coordinates implied rather than explicit.
From what I remember, XArray is based on the netCDF data model, rather than the CF data model, so initializing `variable_coordinates[var_name] = set(variable.dims)` will do the wrong thing if the dataset doesn't set one or more of its dimension coordinates ([example H.2](https://cfconventions.org/cf-conventions/cf-conventions.html#_orthogonal_multidimensional_array_representation_of_time_series) has variables with dimensions `(""station"", ""time"")`, but no variable named `station`. [Section 4.5](https://cfconventions.org/cf-conventions/cf-conventions.html#discrete-axis) makes this practice explicit). You could work around this by leaving the initialization as it stands but dropping the `if coordinate_name not in variable.dims` condition on including `coordinate_name` as part of the `coordinates` attribute.
> 1. Stick to the current logic which might be non-conformal with the CF conventions in case of ""Discrete Sampling Geometries"". However, users can manually fix this by setting the coordinates in encoding.
Based on this, I think applying solution one from the previous post when writing a dataset will always be consistent with CF, but assuming that netCDF files XArray reads into datasets will always follow this pattern would be a problem. I suspect there are already tests for reading netCDF files with dimension coordinates included in `coordinates` attributes, but I haven't checked.
> 3. Implement a logic to recognize cases where a dataset is a ""Discrete Sampling Geometry"" and only then list the non-auxiliary coordinates in the variable attribute. This is a bit tricky, and I don't have the time to implement this, I'm afraid.
If you want to try solution three, almost all Discrete Sampling Geometry files [must have a global attribute called `featureType`](https://cfconventions.org/cf-conventions/cf-conventions.html#featureType). Since that attribute is recommended for all Discrete Sampling Geometry files, you could declare that the presence of that attribute defines a Discrete Sampling Geometry file for XArray. However, I don't see any place that says including dimension coordinates in the `coordinates` attribute is required, even for Discrete Sampling Geometry files, and a few places that explicitly say dimension coordinates can be omitted from the `coordinates` attribute, even for Discrete Sampling Geometry files.
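If that detection rule were wanted, it could be as small as this sketch (hypothetical function, not existing XArray code):
```python
def looks_like_dsg(ds: ""xr.Dataset"") -> bool:
    # CF recommends a global featureType attribute for Discrete Sampling Geometry files
    return ""featureType"" in ds.attrs
```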
The references from CF on whether dimension coordinates can be included in the `coordinates` attribute:
The fifth paragraph of [CF section five](https://cfconventions.org/cf-conventions/cf-conventions.html#coordinate-system) says:
> If the longitude, latitude, vertical or time coordinate is multi-valued, varies in only one dimension, and varies independently of other spatiotemporal coordinates, it is not permitted to store it as an auxiliary coordinate variable.
I *think* this is saying that if you can represent a coordinate using just one dimension, you shouldn't use two (that is, avoid using `np.tile(np.arange(10), (3, 1))` as a longitude coordinate). The other interpretation is that dimension coordinates must not be included in the `coordinates` attribute, which seems unlikely given that three lines later it says:
> Note that it is permissible, but optional, to list coordinate variables as well as auxiliary coordinate variables in the coordinates attribute.
The first paragraph of the [section on Discrete sampling geometries](https://cfconventions.org/cf-conventions/cf-conventions.html#discrete-sampling-geometries):
> Every element of every feature must be unambiguously associated with its space and time coordinates and with the feature that contains it. The coordinates attribute must be attached to every data variable to indicate the spatiotemporal coordinate variables that are needed to geo-locate the data.
I think dimension coordinates are explicit enough to count as ""unambiguously associated"", even without inclusion in the `coordinates` attribute, since they share a name with one of the dimensions of the Discrete Sampling Geometry data variables. This seems to be made explicit in the fourth paragraph:
> Auxiliary coordinate variables containing the nominal and the precise positions should be listed in the relevant coordinates attributes of data variables. In orthogonal representations the nominal positions could be coordinate variables, which do not need to be listed in the coordinates attribute, rather than auxiliary coordinate variables.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,1154014066
https://github.com/pydata/xarray/issues/5510#issuecomment-866314326,https://api.github.com/repos/pydata/xarray/issues/5510,866314326,MDEyOklzc3VlQ29tbWVudDg2NjMxNDMyNg==,22566757,2021-06-22T20:33:44Z,2021-06-22T20:37:53Z,CONTRIBUTOR,"~`encoding` is where that information is stored between reading a dataset in from disk and saving it back out again.~ `_encode_coordinates` can take a default value from either of `encoding` or `attrs`, but a falsy value will be overwritten. Setting `.attrs[""coordinates""] = "" ""` should work.
```python
>>> import numpy as np, xarray as xr
>>> data = xr.DataArray(np.random.randn(2, 3), dims=(""x"", ""y""), coords={""x"": [10, 20]})
>>> ds = xr.Dataset({""foo"": data, ""bar"": (""x"", [1, 2]), ""fake"": 10})
>>> ds = ds.assign_coords({""reftime"":np.array(""2004-11-01T00:00:00"", dtype=np.datetime64)})
>>> ds = ds.assign({""test"": 1})
>>> ds.test.encoding[""coordinates""] = "" ""
>>> ds.to_netcdf(""file.nc"")
```
```bash
$ ncdump -h file.nc
netcdf file {
dimensions:
    x = 2 ;
    y = 3 ;
variables:
    int64 x(x) ;
    double foo(x, y) ;
        foo:_FillValue = NaN ;
        foo:coordinates = ""reftime"" ;
    int64 bar(x) ;
        bar:coordinates = ""reftime"" ;
    int64 fake ;
        fake:coordinates = ""reftime"" ;
    int64 reftime ;
        reftime:units = ""days since 2004-11-01 00:00:00"" ;
        reftime:calendar = ""proleptic_gregorian"" ;
    int64 test ;
        test:coordinates = "" "" ;
}
```
As mentioned above, the XArray data model associates coordinates with dimensions rather than with variables, so any time you read the dataset back in again, the `test` variable will gain `reftime` as a coordinate, because the dimensions of `reftime` (`()`) are a subset of the dimensions of `test` (also `()`).
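A quick way to see that behavior, reusing the `file.nc` written above:
```python
>>> ds_back = xr.open_dataset(""file.nc"")
>>> ds_back[""test""].coords  # reftime reappears as a coordinate of test
```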
Not producing a `coordinates` attribute for variables mentioned in another variable's `bounds` attribute (or a few other attributes, for that matter) would be entirely doable within the function linked above, and should be straightforward if you want to make a PR for that.
Making `realization` and the bounds show up in `ds.coords` rather than `ds.data_vars` may also skip setting the `coordinates` attribute, though I'm less sure of that. It would, however, add `realization` to the `coordinates` attributes of every other `data_var` unless you overrode that, which may not be what you want.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,927336712
https://github.com/pydata/xarray/pull/2844#issuecomment-778717611,https://api.github.com/repos/pydata/xarray/issues/2844,778717611,MDEyOklzc3VlQ29tbWVudDc3ODcxNzYxMQ==,22566757,2021-02-14T03:35:55Z,2021-02-14T03:35:55Z,CONTRIBUTOR,"> ~Does anyone know why the `xr.open_dataset(....)` call is echoed in the warning message. Is this intentional? Cc @dcherian @DWesl~
>
It seems you've already figured this out, but for anyone else with this question, the repeat of the call on that file is part of the warning that the file does not have all the variables the attributes refer to. You can fix this by recreating the file with the listed variables added (`areacella`), or by deleting the attribute that references them (`cell_measures`) from the variables. You can also ignore the warning using the machinery in the `warnings` module.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/pull/2844#issuecomment-778629061,https://api.github.com/repos/pydata/xarray/issues/2844,778629061,MDEyOklzc3VlQ29tbWVudDc3ODYyOTA2MQ==,22566757,2021-02-13T14:46:25Z,2021-02-13T14:46:25Z,CONTRIBUTOR,I think this looks good.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/pull/2844#issuecomment-761842344,https://api.github.com/repos/pydata/xarray/issues/2844,761842344,MDEyOklzc3VlQ29tbWVudDc2MTg0MjM0NA==,22566757,2021-01-17T16:48:39Z,2021-01-17T16:48:39Z,CONTRIBUTOR,Looks good to me. I was wondering where those docstrings were.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/issues/4121#issuecomment-670778071,https://api.github.com/repos/pydata/xarray/issues/4121,670778071,MDEyOklzc3VlQ29tbWVudDY3MDc3ODA3MQ==,22566757,2020-08-07T23:07:14Z,2020-08-17T13:09:27Z,CONTRIBUTOR,"#2844 used to move these variables to `ds.coords` rather than `ds.data_vars`, and allowed saving of `ancillary_variables` via the `encoding` attribute. It was decided to drop that since `ancillary_variables` are linked to variables rather than dimensions like most of the other CF attributes.
The specific behavior mentioned in the original post (describing `ancillary_variables` in the output) might work better in [`cf-xarray`'s `ds.cf.describe` method](https://cf-xarray.readthedocs.io/en/latest/examples/introduction.html#What-attributes-have-been-discovered?).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,630573329
https://github.com/pydata/xarray/pull/2844#issuecomment-670996109,https://api.github.com/repos/pydata/xarray/issues/2844,670996109,MDEyOklzc3VlQ29tbWVudDY3MDk5NjEwOQ==,22566757,2020-08-09T02:17:07Z,2020-08-09T16:36:12Z,CONTRIBUTOR,"That's two people with that view so I made the change.
Again, I feel that the quality flags are essentially meaningless on their own and useful primarily in the context of their associated variables, much like the items currently put in XArray's `coords` (which, admittedly, is limited to the variables CF identifies as dimension or auxiliary coordinates at the moment). In my view they should remain associated with the relevant variable even if it is extracted into a `DataArray`. Since all of the other people who have opinions on the matter seem to disagree with me, I changed the code to preserve the present behavior with regard to `ancillary_variables`. I can always monkey-patch it back in if it really bothers me, add a `Dataset.__getitem__` wrapper to `xarray-contrib/cf-xarray` so that the `ancillary_variables` stay associated when I pull variables out, or move back to `SciTools/iris`.
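For the record, the wrapper could be as small as this sketch (hypothetical helper, not cf-xarray's API; CF defines `ancillary_variables` as a whitespace-separated list of variable names):
```python
def getitem_with_ancillary(ds: ""xr.Dataset"", name: str) -> ""xr.Dataset"":
    # Keep the requested variable plus any ancillary variables it names
    extras = ds[name].attrs.get(""ancillary_variables"", """").split()
    return ds[[name] + [v for v in extras if v in ds]]
```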
On a related note, I should probably check whether this breaks conversion to an `iris.Cube`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/pull/2844#issuecomment-670816691,https://api.github.com/repos/pydata/xarray/issues/2844,670816691,MDEyOklzc3VlQ29tbWVudDY3MDgxNjY5MQ==,22566757,2020-08-08T03:25:17Z,2020-08-08T03:53:39Z,CONTRIBUTOR,"You are correct; `ancillary_variables` is neither `grid_mapping` nor `bounds`.
My personal view is that the quality information should stay with the variable it describes unless explicitly dropped; I think your view is that quality information can always be extracted from the original dataset, and that no variable should carry quality information for a different variable. At this point it would be simple to remove `ancillary_variables` from the attributes processed by this PR. There was a suggestion earlier of adding a `decode_aux_vars` argument to control the new behavior as a means of avoiding back-compatibility breaks like this one. I will leave that as a question for the maintainers; there is also some related discussion at #4215.
I should point out that a similar situation arises for `grid_mapping`; `ds.coords[""time""]` will include the `grid_mapping` variable in its coordinates.
In contrast, `ds.coords[""x""]` will not include the bounds for the `x` variable, since it has more dimensions than `ds.coords[""x""]`","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/pull/2844#issuecomment-670765806,https://api.github.com/repos/pydata/xarray/issues/2844,670765806,MDEyOklzc3VlQ29tbWVudDY3MDc2NTgwNg==,22566757,2020-08-07T22:29:20Z,2020-08-07T22:29:20Z,CONTRIBUTOR,"The MinimumVersionsPolicy error appears to be a series of internal `conda` errors, and is probably unrelated.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/pull/2844#issuecomment-670730008,https://api.github.com/repos/pydata/xarray/issues/2844,670730008,MDEyOklzc3VlQ29tbWVudDY3MDczMDAwOA==,22566757,2020-08-07T22:02:47Z,2020-08-07T22:02:47Z,CONTRIBUTOR,pydata/xarray-data#19,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/pull/2844#issuecomment-667744209,https://api.github.com/repos/pydata/xarray/issues/2844,667744209,MDEyOklzc3VlQ29tbWVudDY2Nzc0NDIwOQ==,22566757,2020-08-03T00:14:24Z,2020-08-03T19:42:02Z,CONTRIBUTOR,"The `rasm` dataset has coordinates `xc` and `yc`, which reference bounds `xv` and `yv` respectively, which I do not see in the variable list with `decode_coords=False`. It would appear that pydata/xarray-data#4 did not include the bounds in the updated dataset when adding coordinates to `rasm.nc`, so this warning is correct. I do not know that file, so I'm probably not the best person to add bounds. Should I wait for an update to `pydata/xarray-data`, or should I ask sphinx to ignore the warning?
Another option is to just delete the `bounds` attributes of `xc` and `yc` in `rasm.nc`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/pull/2844#issuecomment-667737160,https://api.github.com/repos/pydata/xarray/issues/2844,667737160,MDEyOklzc3VlQ29tbWVudDY2NzczNzE2MA==,22566757,2020-08-02T23:13:26Z,2020-08-02T23:13:26Z,CONTRIBUTOR,"The example the doc build doesn't like:
```python
ds = xr.tutorial.load_dataset(""rasm"")
ds.to_zarr(""rasm.zarr"", mode=""w"")
import zarr
zgroup = zarr.open(""rasm.zarr"")
print(zgroup.tree())
dict(zgroup[""Tair""].attrs)
```
I'll need to look into the `rasm` dataset to figure out why there is a warning now.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/issues/4215#issuecomment-658329779,https://api.github.com/repos/pydata/xarray/issues/4215,658329779,MDEyOklzc3VlQ29tbWVudDY1ODMyOTc3OQ==,22566757,2020-07-14T18:07:05Z,2020-07-14T18:07:05Z,CONTRIBUTOR,"`formula_terms` is another attribute with variable names, although it requires a bit more parsing.
> > Question: Should we allow `decode_coords` to control whether variables mentioned in these attributes are set as coordinate variables?
>
> I don't think this is necessary. It's easy to explicitly set or reset coordinates afterwards if desired.
Is that ""putting the variables in these attributes in `coords` is out of scope for XArray"" or ""putting the variables in these attributes in `coords` is out of scope for `decode_coords`"" or something else?
> I would say no however to ancillary_variables, since those are not really about coordinates and instead about linked data variables (like uncertainties).
I tend to think of uncertainties and status flags as important for the interpretation of the associated variables that should stay with the data variables unless a decision is explicitly made to drop them. On the other hand, since XArray seems to associate coordinates with dimensions rather than with variables, I can see why this might be less than desirable. This argument would also apply to `grid_mapping`.
> > My one concern with #2844 is clarifying the role of `encoding` vs. `attrs`.
>
> I think we should probably ensure that xarray always propagates `encoding` exactly like how it propagates `attrs`.
Should this be part of #2844 or should preserving `encoding` be a separate PR?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,654889988
https://github.com/pydata/xarray/pull/2844#issuecomment-497948836,https://api.github.com/repos/pydata/xarray/issues/2844,497948836,MDEyOklzc3VlQ29tbWVudDQ5Nzk0ODgzNg==,22566757,2019-06-01T14:20:08Z,2020-07-14T15:55:17Z,CONTRIBUTOR,"On 5/31/2019 11:50 AM, dcherian wrote:
> It isn't just MetPy though. I'm sure there's existing code relying on adding `grid_mapping` and `bounds` to `attrs` in order to write CF-compliant files. So there's a (potentially big) backward compatibility issue. This becomes worse if in the future we keep interpreting more CF attributes and moving them to `encoding` :/.
At present, the proper, CF-compliant way to do this is to have both `grid_mapping` and `bounds` variables in `data_vars`, and maintain the attributes yourself, including making sure the variables get copied into the result after relevant `ds[var_name]` and `ds.sel(axis=bounds)` operations.

If you decide to move these variables to `coords`, the `bounds` variables will still get dropped on any subsetting operation, including those where the relevant axis was retained, the `grid_mapping` variables will be included in the result of all subsetting operations (including pulling out, for example, a time coordinate), and both will be included in some `coordinates` attribute when written to disk, breaking CF compliance.

This PR only really addresses getting these variables in `coords` initially and keeping them out of the global `coordinates` attribute when writing to disk.
> > Since I'm doing this primarily to get `grid_mapping` and `bounds` variables out of `ds.data_vars`.
>
> I'm +1 on this but I wonder whether saving them in `attrs` and using that information when encoding coordinates would be the more pragmatic choice.
You have a point about `grid_mapping`, but applying the MetPy approach of saving the information in another, more directly useful format (`cartopy.Projection` instances) immediately after loading the file would be a way around that.

For `bounds`, I think `pd.PeriodIndex` would be the most natural representation for time, and `pd.IntervalIndex` for most other 1-D cases, but that still leaves `bounds` for two-or-more-dimensional coordinates. That's a design choice I'll leave to the maintainers.
> We could define `encoding` as containing a specified set of CF attributes that control on-disk representation such as `units`, `scale_factor`, `contiguous` etc. and leaving everything else in `attrs`. A full list of attributes that belong in `encoding` could be in the docs so that downstream packages can fully depend on this behaviour.
>
> Currently I see `coordinates` is interpreted and moved to `encoding`. In the above proposal, this would be left in `attrs` but its value would still be interpreted if `decode_coords=True`.
>
> What do you think?
At present, `set(ds[var_name].attrs[""coordinates""].split())` and `set(ds[var_name].coords) - set(ds[var_name].indexes[dim_name])` would be identical, since the `coordinates` attribute is essentially computed from the second expression on write.

Do you have a use case in mind where you need specifically the list of CF auxiliary coordinates, or is that just an example of something that would change under the new proposal? I assume `units` would be moved to `encoding` only for `datetime64[ns]` and `timedelta64[ns]` variables.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/pull/2844#issuecomment-644405067,https://api.github.com/repos/pydata/xarray/issues/2844,644405067,MDEyOklzc3VlQ29tbWVudDY0NDQwNTA2Nw==,22566757,2020-06-15T21:40:49Z,2020-06-15T21:40:49Z,CONTRIBUTOR,"This PR currently puts `grid_mapping` and `bounds` in `encoding` once it is done with them. Is that where XArray wants to put them, or should they be somewhere else?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/issues/2780#issuecomment-633296515,https://api.github.com/repos/pydata/xarray/issues/2780,633296515,MDEyOklzc3VlQ29tbWVudDYzMzI5NjUxNQ==,22566757,2020-05-24T20:45:43Z,2020-05-24T20:45:43Z,CONTRIBUTOR,"For the example given, this would mean finding `largest = max(abs(ds.min()), abs(ds.max()))` and finding the first integer dtype wide enough to write that: `[np.iinfo(""i{bytes:d}"".format(bytes=2 ** i)).max >= largest for i in range(4)]` would help there. The function below should help with this; I would tend to use this at array creation time rather than at save time so you get these benefits in memory as well as on disk.
For the character/string variables, the smallest representation varies a bit more: a fixed-width encoding (`dtype=S6`) will probably be smaller if all the strings are about the same size, while variable-width strings are probably smaller if there are many short strings and only a few long strings. If you happen to know that a given field is a five-character identifier or a one-character status code, you can again set these types to be used in memory (which I think makes dask happier when it comes time to save), while free-form survey responses will likely be better as a variable-length string. It may be possible to use the distribution of string lengths (perhaps using [numpy.char.str_len](https://numpy.org/doc/stable/reference/generated/numpy.char.str_len.html)) to see whether most of the strings are at least 90% as long as the longest, but it's probably simpler to test.
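Something like this rough sketch could check that (the 90% threshold and the function name are made up):
```python
import numpy as np

def fixed_width_is_fine(strings: ""np.ndarray of str"") -> bool:
    # True when most strings are nearly as long as the longest one
    lengths = np.char.str_len(strings.astype(""U""))
    return np.median(lengths) >= 0.9 * lengths.max()
```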
Doing this correctly for floating-point types would be difficult, but I think that's outside the scope of this issue.
Hopefully this gives you something to work with.
```python
import numpy as np


def dtype_for_int_array(arry: ""array of integers"") -> np.dtype:
    """"""Find the smallest integer dtype that will encode arry.

    Parameters
    ----------
    arry : array of integers
        The array to compress

    Returns
    -------
    smallest : dtype
        The smallest dtype that will represent arry
    """"""
    largest = max(abs(arry.min()), abs(arry.max()))
    typecode = ""i{bytes:d}"".format(
        bytes=2 ** np.nonzero([
            np.iinfo(""i{bytes:d}"".format(bytes=2 ** i)).max >= largest
            for i in range(4)
        ])[0][0]
    )
    return np.dtype(typecode)
```
Looking at [`df.memory_usage()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.memory_usage.html) will explain why I do this early.
If I extend your example with this new function, I see the following:
```python
>>> df_small = df.copy()
>>> for col in df_small:
...     df_small[col] = df_small[col].astype(
...         dtype_for_int_array(df_small[col]) if df_small[col].dtype.kind == ""i"" else ""S1""
...     )
...
>>> df_small.memory_usage()
Index         80
a         100000
b         100000
c         100000
d         100000
e         800000
dtype: int64
>>> df.memory_usage()
Index         80
a         800000
b         800000
c         800000
d         800000
e         800000
dtype: int64
```
It looks like pandas always uses object dtype for string arrays, so the numbers in that column likely reflect the size of an array of pointers. XArray lets you use a dtype of ""S1"" or ""U1"", but I haven't found the equivalent of the `memory_usage` method.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,412180435
https://github.com/pydata/xarray/pull/2844#issuecomment-633253434,https://api.github.com/repos/pydata/xarray/issues/2844,633253434,MDEyOklzc3VlQ29tbWVudDYzMzI1MzQzNA==,22566757,2020-05-24T16:09:04Z,2020-05-24T16:09:04Z,CONTRIBUTOR,"Should I change this to put `grid_mapping` and `bounds` back in `attrs`, or should I leave them in `encoding`?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/issues/4068#issuecomment-633251217,https://api.github.com/repos/pydata/xarray/issues/4068,633251217,MDEyOklzc3VlQ29tbWVudDYzMzI1MTIxNw==,22566757,2020-05-24T15:53:10Z,2020-05-24T15:53:10Z,CONTRIBUTOR,"For others reading this issue, the h5netcdf workaround was discussed in #3297, with further discussion on supporting complex numbers in netCDF in cf-convention/cf-conventions#204.
The short version: `engine=""h5netcdf"", invalid_netcdf=True` will save these files, but the netCDF-C library doesn't understand the result. Reading with `engine=""h5netcdf""` may be able to round-trip these files, but I haven't checked that.
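An untested sketch of that round-trip attempt:
```python
import numpy as np
import xarray as xr

ds = xr.Dataset({""z"": ((""x"",), np.array([1 + 2j, 3 - 4j]))})
ds.to_netcdf(""complex.nc"", engine=""h5netcdf"", invalid_netcdf=True)
roundtrip = xr.open_dataset(""complex.nc"", engine=""h5netcdf"")  # may or may not work
```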
There is a longer discussion of why netCDF-C doesn't understand these files at Unidata/netcdf-c#267. That specific issue is for booleans, but complex numbers are likely the same.","{""total_count"": 1, ""+1"": 1, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,619347681
https://github.com/pydata/xarray/pull/2844#issuecomment-597375929,https://api.github.com/repos/pydata/xarray/issues/2844,597375929,MDEyOklzc3VlQ29tbWVudDU5NzM3NTkyOQ==,22566757,2020-03-10T23:54:41Z,2020-03-10T23:54:41Z,CONTRIBUTOR,"I think the choice is between `attrs` and `encoding`, not both.
If it helps lean your decision one way or the other, `attrs` tends to stay associated with `Dataset`s through more operations than `encoding`, so `parse_cf()` would have to be called fairly soon after opening if the information ends up in `encoding`, while putting it in `attrs` gives users a bit more time for that.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/issues/3689#issuecomment-587466776,https://api.github.com/repos/pydata/xarray/issues/3689,587466776,MDEyOklzc3VlQ29tbWVudDU4NzQ2Njc3Ng==,22566757,2020-02-18T13:44:27Z,2020-02-18T13:44:27Z,CONTRIBUTOR,`bounds` and `grid_mapping`?,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,548607657
https://github.com/pydata/xarray/pull/2844#issuecomment-587466093,https://api.github.com/repos/pydata/xarray/issues/2844,587466093,MDEyOklzc3VlQ29tbWVudDU4NzQ2NjA5Mw==,22566757,2020-02-18T13:43:12Z,2020-02-18T13:43:12Z,CONTRIBUTOR,"The test failures seem to all be due to recent changes in `cftime`/`CFTimeIndex`, which I haven't touched.
Is sticking the `grid_mapping` and `bounds` attributes in `encoding` good, or should I put them back in `attrs`?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/pull/2844#issuecomment-586273656,https://api.github.com/repos/pydata/xarray/issues/2844,586273656,MDEyOklzc3VlQ29tbWVudDU4NjI3MzY1Ng==,22566757,2020-02-14T12:47:06Z,2020-02-14T12:47:06Z,CONTRIBUTOR,"I just noticed [pandas.PeriodIndex](https://pandas.pydata.org/docs/user_guide/timeseries.html#time-span-representation) would be an alternative to [pandas.IntervalIndex](https://pandas.pydata.org/docs/reference/api/pandas.IntervalIndex.html#pandas.IntervalIndex) for time data, if it matters little for such data which side of the interval is closed.
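To make that concrete (made-up ranges):
```python
import pandas as pd

# each element of an IntervalIndex carries its own lower and upper bound
intervals = pd.interval_range(start=0.0, end=1.5, periods=3, closed=""left"")
# each element of a PeriodIndex is a time span rather than an instant
periods = pd.period_range(""2020-01"", periods=3, freq=""M"")
```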
Is there an interest in using these for 1D coordinates with bounds? I think `ds.groupby_bins()` already returns an `IntervalIndex`.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/pull/3724#issuecomment-586261327,https://api.github.com/repos/pydata/xarray/issues/3724,586261327,MDEyOklzc3VlQ29tbWVudDU4NjI2MTMyNw==,22566757,2020-02-14T12:07:21Z,2020-02-14T12:07:21Z,CONTRIBUTOR,"Not yet, at least:
https://github.com/pydata/xarray/network/dependents
GitHub points my projects using XArray at
https://github.com/thadncs/https-github.com-pydata-xarray
rather than this repository. There seem to be a decent number of repositories there:
https://github.com/thadncs/https-github.com-pydata-xarray/network/dependents
I have no idea why GitHub shifted them, nor what to do about it.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,555752381
https://github.com/pydata/xarray/pull/2844#issuecomment-497566742,https://api.github.com/repos/pydata/xarray/issues/2844,497566742,MDEyOklzc3VlQ29tbWVudDQ5NzU2Njc0Mg==,22566757,2019-05-31T04:00:17Z,2019-05-31T04:00:17Z,CONTRIBUTOR,"Switched to use `in` rather than `is not None`.
Re: `grid_mapping` in `.encoding` not `.attrs`
[MetPy assumes `grid_mapping` will be in `.attrs`](https://github.com/Unidata/MetPy/blob/master/metpy/xarray.py#L200). Since the [xarray documentation mentions this capability](http://xarray.pydata.org/en/latest/weather-climate.html#cf-compliant-coordinate-variables), should I be making concurrent changes to MetPy to allow this to continue?
If so, would it be sufficient to change their `.attrs` references to `.encoding` and mentioning in both sets of documentation that the user should call `ds.metpy.parse_cf()` immediately after loading to ensure the information is available for MetPy to use? I don't entirely understand the accessor API.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/pull/2844#issuecomment-497558317,https://api.github.com/repos/pydata/xarray/issues/2844,497558317,MDEyOklzc3VlQ29tbWVudDQ5NzU1ODMxNw==,22566757,2019-05-31T03:04:06Z,2019-05-31T03:13:53Z,CONTRIBUTOR,"This is briefly mentioned above, in
https://github.com/pydata/xarray/pull/2844#discussion_r270595609
The rationale was that everywhere else xarray uses CF attributes for something, the original values of those attributes are recorded in `var.encoding`, not `var.attrs`, and consistency across a code base is a good thing. Since I'm doing this primarily to get `grid_mapping` and `bounds` variables out of `ds.data_vars`, I don't have strong opinions on the subject.
If you feel strongly to the contrary, there's an idea at the top of this thread for getting `bounds` information encoded in terms xarray already uses in some cases (`Dataset.groupby_bins()`), and the diffs for this PR should help you figure out what needs changing to support this.
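For instance (toy data), the bin coordinate already comes back interval-valued:
```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(10.0), dims=""x"", coords={""x"": np.arange(10.0)})
binned = da.groupby_bins(""x"", bins=[0, 3, 6, 9]).mean()
print(binned[""x_bins""])  # pandas Interval objects carrying the bounds
```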
For `grid_mapping` there's
http://xarray.pydata.org/en/latest/weather-climate.html#cf-compliant-coordinate-variables
which is enough for my uses.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/pull/2844#issuecomment-478290053,https://api.github.com/repos/pydata/xarray/issues/2844,478290053,MDEyOklzc3VlQ29tbWVudDQ3ODI5MDA1Mw==,22566757,2019-03-30T21:17:17Z,2019-03-30T21:17:17Z,CONTRIBUTOR,"I can shift this to use encoding only, but I'm having trouble figuring out where that code would go.
Would the preferred path be to create VariableCoder classes for each and add them to encode_cf_variable, then add tests to xarray.tests.test_coding?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/pull/2843#issuecomment-478248763,https://api.github.com/repos/pydata/xarray/issues/2843,478248763,MDEyOklzc3VlQ29tbWVudDQ3ODI0ODc2Mw==,22566757,2019-03-30T14:04:12Z,2019-03-30T14:04:12Z,CONTRIBUTOR,"I just checked and can't find that section of the documentation now, so that seems to be consistent.
I suppose that's a vote for ""be sure to check current behavior before submitting old packages"".
I'll change my code to this new method then.
Thanks","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424262546
https://github.com/pydata/xarray/pull/2844#issuecomment-476586154,https://api.github.com/repos/pydata/xarray/issues/2844,476586154,MDEyOklzc3VlQ29tbWVudDQ3NjU4NjE1NA==,22566757,2019-03-26T11:31:05Z,2019-03-26T11:31:05Z,CONTRIBUTOR,"Related to #1475 and #2288, but this is just keeping the metadata consistent where already present, not extending the data model to include bounds, cells, or projections. I should add a test to ensure saving still works if the bounds are lost when pulling out variables.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,424265093
https://github.com/pydata/xarray/pull/814#issuecomment-309484883,https://api.github.com/repos/pydata/xarray/issues/814,309484883,MDEyOklzc3VlQ29tbWVudDMwOTQ4NDg4Mw==,22566757,2017-06-19T15:58:27Z,2017-06-19T15:58:27Z,CONTRIBUTOR,"If you're still looking for the old tests, it looks like they disappeared in the last merge commit, [f48de5](https://github.com/pydata/xarray/pull/814/commits/f48de5a3ea916b80a7bfc070c3c1c3549e931189#diff-48ab4ba033ad06981f566a2e2f561f5aL1819).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,145140657