id,node_id,number,title,user,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,performed_via_github_app,state_reason,repo,type
2038622503,I_kwDOAMm_X855gukn,8548,Shaping the future of Backends,43316012,open,0,,,3,2023-12-12T22:08:50Z,2023-12-15T17:14:59Z,,COLLABORATOR,,,,"### What is your issue?

Backends in xarray are used to read and write files (or objects in general) and transform them into useful xarray Datasets. This issue collects ideas on how to continuously improve them.

# Current state

Along the reading and writing process there are many implicit and explicit configuration possibilities. There are many backend-specific options and many encoder- and decoder-specific options. Most of them are currently difficult or even impossible to discover.

There is the infamous `open_dataset` method, which can do everything, but there are also some specialized methods like `open_zarr` or `to_netcdf`. The only really formalized way to extend xarray's capabilities is via the `BackendEntrypoint`, and currently only for reading files. This has proven to work, and things are going so well that people are discussing getting rid of the special reading methods (#7495). A major critique in that thread is, again, the discoverability of configuration options.

## Problems

To name a few:

- Discoverability of configuration options is poor
- No distinction between backend and encoding options
- New options are simply added as another keyword argument to `open_dataset`
- No writing support for backends

## What already improved

- Adding URL and description attributes to the backends (#7000, #7200)
- Adding static typing
- Allowing creation of backend instances with their respective options (#8520)

# The future

After listing all the problems, let's see how we can improve the situation and make backends an all-round solution for reading and writing all kinds of files.

## What happens behind the scenes

In general, reading and writing Datasets in xarray is a three-step process:

```
          [done by backend.open_dataset]
Dataset < chunking < decoding < opening_in_store < file
Dataset > validating > encoding > storing_in_store > file
```

Arguably, chunking and decoding, as well as validating and encoding, could each be combined into a single logical step in the pipeline. This view should help decide how to set up a future architecture for backends.

You can see that there is a common middle object in this process: an in-memory representation of the file on disk that sits between en-/decoding and the abstract store. This is actually an `xarray.Dataset` and is internally called a ""backend dataset"".

## `write_dataset` method

A quite natural extension of backends would be to implement a `write_dataset` method (name pending). This would allow backends to fulfill the complete right side of the pipeline.

## Transformer class

For lack of a common word for a class that handles ""encoding"" and ""decoding"", I will call it a transformer here.

The process of en- and decoding is currently ""hardcoded"" into the respective `open_dataset` and `to_netcdf` methods. One could imagine introducing the concept of a common class that handles both. This class could handle the implemented CF or netCDF encoding conventions, but it would also allow users to define their own storing conventions (why not create a custom transformer that adds indexes based on variable attributes?).
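To make this concrete, here is a minimal sketch of what such a transformer interface could look like. Everything below (the `Transformer` base class and the toy `ScaleOffsetTransformer`) is made up for illustration; it is not an existing xarray API, and the real CF machinery is considerably more involved:

```python
import xarray as xr


class Transformer:
    # Hypothetical base class: one object owns both directions of the pipeline.
    def decode(self, backend_ds: xr.Dataset) -> xr.Dataset:
        # backend dataset -> user-facing dataset
        raise NotImplementedError

    def encode(self, ds: xr.Dataset) -> xr.Dataset:
        # user-facing dataset -> backend dataset
        raise NotImplementedError


class ScaleOffsetTransformer(Transformer):
    # Toy stand-in for CF-style scale_factor/add_offset packing.
    def decode(self, backend_ds: xr.Dataset) -> xr.Dataset:
        out = backend_ds.copy()
        for name in out.data_vars:
            scale = out[name].attrs.pop(""scale_factor"", 1)
            offset = out[name].attrs.pop(""add_offset"", 0)
            with xr.set_options(keep_attrs=True):
                out[name] = out[name] * scale + offset
        return out

    def encode(self, ds: xr.Dataset) -> xr.Dataset:
        out = ds.copy()
        for name in out.data_vars:
            scale = out[name].encoding.get(""scale_factor"", 1)
            offset = out[name].encoding.get(""add_offset"", 0)
            with xr.set_options(keep_attrs=True):
                # A real encoder would also cast back to the packed dtype.
                out[name] = (out[name] - offset) / scale
            out[name].attrs.update(scale_factor=scale, add_offset=offset)
        return out
```

In this scheme `open_dataset` would call `transformer.decode` on the backend dataset, and a future `write_dataset` would call `transformer.encode` before handing the result to the store.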
The possibilities are endless, and an interface that fulfills all the requirements still has to be found. This would homogenize the reading and writing process to

```
Dataset <> Transformer <> Backend <> file
```

As a bonus, this would improve the discoverability of the decoding options (then transformer arguments). The new interface could then be

```python
backend = Netcdf4BackendEntrypoint(group=""data"")
decoder = CFTransformer(cftime=True)

ds = xr.open_dataset(""file.nc"", engine=backend, decoder=decoder)
```

while of course still allowing all options to be passed simply as kwargs (since this is still the easiest way of telling beginners how to open files).

The final improvement here would be to add additional entrypoints for these transformers ;)

# Disclaimer

This issue is just a bunch of rough ideas that require quite some refinement, and some might even turn out to be nonsense. So let's have an exciting discussion about these things :) If you have something to add to the above points, I will include your ideas as well. This is meant as a collection of ideas on how to improve our backends :)","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8548/reactions"", ""total_count"": 5, ""+1"": 5, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
1928972239,PR_kwDOAMm_X85cC_Wb,8276,Give NamedArray Generic dimension type,43316012,open,0,,,3,2023-10-05T20:02:56Z,2023-10-16T13:41:45Z,,COLLABORATOR,,1,pydata/xarray/pulls/8276,"- [x] Towards #8199
- [ ] Tests added
- [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst`
- [ ] New functions/methods are listed in `api.rst`

This aims at making the dimension type a generic parameter. I thought I would start with NamedArray when testing this out because it is much less interconnected. (A typing sketch of the idea follows below, after the last issue.)","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/8276/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,pull
1221885425,I_kwDOAMm_X85I1H3x,6549,Improved Dataset broadcasting,43316012,open,0,,,3,2022-04-30T17:51:37Z,2022-05-01T14:37:43Z,,COLLABORATOR,,,,"### Is your feature request related to a problem?

I am a bit puzzled about how xarray broadcasts Datasets. It seems to always add all dimensions to all variables. Is this what you want in general? See this example:

```python
import xarray as xr

da = xr.DataArray([[1, 2, 3]], dims=(""x"", ""y""))
# <xarray.DataArray (x: 1, y: 3)>
# array([[1, 2, 3]])

ds = xr.Dataset({""a"": (""x"", [1]), ""b"": (""z"", [2, 3])})
# <xarray.Dataset>
# Dimensions:  (x: 1, z: 2)
# Dimensions without coordinates: x, z
# Data variables:
#     a        (x) int32 1
#     b        (z) int32 2 3

ds.broadcast_like(da)
# returns:
# <xarray.Dataset>
# Dimensions:  (x: 1, y: 3, z: 2)
# Dimensions without coordinates: x, y, z
# Data variables:
#     a        (x, y, z) int32 1 1 1 1 1 1
#     b        (x, y, z) int32 2 3 2 3 2 3

# I think it should return:
# <xarray.Dataset>
# Dimensions:  (x: 1, y: 3, z: 2)
# Dimensions without coordinates: x, y, z
# Data variables:
#     a        (x, y) int32 1 1 1  # notice here: without the ""z"" dim
#     b        (x, y, z) int32 2 3 2 3 2 3
```

### Describe the solution you'd like

I would like broadcasting to behave the same way as e.g. a simple addition. In the upper example `da + ds` produces the dimensions that I want.

### Describe alternatives you've considered

`ds + xr.zeros_like(da)` works, but seems more like a ""dirty hack"".
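For illustration, a small helper with the behavior I have in mind could look like this (a sketch built on the public API; `broadcast_vars_like` is a made-up name, not an existing xarray function):

```python
import xarray as xr


def broadcast_vars_like(ds: xr.Dataset, template: xr.DataArray) -> xr.Dataset:
    # Broadcast each data variable against ``template`` individually, so a
    # variable never picks up dimensions that only *other* variables have.
    return ds.map(lambda var: var.broadcast_like(template), keep_attrs=True)


da = xr.DataArray([[1, 2, 3]], dims=(""x"", ""y""))
ds = xr.Dataset({""a"": (""x"", [1]), ""b"": (""z"", [2, 3])})

broadcast_vars_like(ds, da)
# ""a"" gains only ""y"" (no spurious ""z""); ""b"" gains ""x"" and ""y"" next to ""z""
```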
### Additional context

Maybe one can add an option to broadcasting that controls this behavior?","{""url"": ""https://api.github.com/repos/pydata/xarray/issues/6549/reactions"", ""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",,,13221727,issue
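As an illustration of the generic dimension type explored in PR 8276 above, here is a minimal, self-contained sketch of a dimension-generic array wrapper. The `DimT` TypeVar and this tiny `NamedArray` are illustrative only, not xarray's actual implementation:

```python
from __future__ import annotations

from collections.abc import Hashable
from typing import Generic, TypeVar

import numpy as np

# The dimension-name type becomes a type parameter instead of being
# hardcoded to str/Hashable.
DimT = TypeVar("DimT", bound=Hashable)


class NamedArray(Generic[DimT]):
    def __init__(self, dims: tuple[DimT, ...], data: np.ndarray) -> None:
        if len(dims) != data.ndim:
            raise ValueError("number of dims must match data.ndim")
        self.dims = dims
        self.data = data

    def get_axis_num(self, dim: DimT) -> int:
        # Static checkers can now flag lookups with a wrongly typed dim.
        return self.dims.index(dim)


# A str-dimmed array is now a distinct static type from, say, an
# enum-dimmed one:
arr: NamedArray[str] = NamedArray(("x", "y"), np.zeros((1, 3)))
arr.get_axis_num("y")  # OK; arr.get_axis_num(0) would be flagged by mypy
```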