-
Notifications
You must be signed in to change notification settings - Fork 139
Add support for opening structured dtypes as void for zarr driver
#272
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Ok, I imported this locally and am looking at a few things.
|
…ussion_r2663351949` and extend test coverage
…ussion_r2663353574` and extend test coverage
My bad, I don't know how I missed the compilation issue for the spec test. I've fixed the issue and resolved your comments. Thanks for taking a look and getting back to this so fast! |
…ges/BASE..a42b6f511375dc1bd402ad525519f57210546735#r2666111665` Enforce schema validation for one-of `field` and `open_as_void`
…ussion_r2666195572` Use synthesized open_as_void field
|
Sorry for the delay responding to the latest round of feedback. I believe everything raised so far has been addressed. |
| /// \param partial_metadata Constraints in the form of partial zarr metadata. | ||
| /// \param selected_field The field to which `schema` applies. | ||
| /// \param schema Schema constraints for the `selected_field`. | ||
| /// \param open_as_void If true, opens the array as raw bytes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
selected_field and open_as_void are mutually exclusive in GetNewMetadata.
At least document that & add an error return/test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also for GetSpecRankAndFieldInfo
| fill_value = (*metadata.fill_value)[field_index]; | ||
| // For void access, synthesize a byte-level fill value | ||
| if (open_as_void) { | ||
| const Index nbytes = metadata.dtype->bytes_per_outer_element; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect this to get the fill_value from the metadata. More like:
size_t field_index = 0; // open_as_void has a single field.
if (!open_as_void) {
TENSORSTORE_ASSIGN_OR_RETURN(
field_index,
GetFieldIndex(*metadata.dtype, selected_field));
}
fill_value = (*metadata.fill_value)[field_index];
That might require CreateVoidMetadata to set a proper fill value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes CreateVoidMetadata should set a proper fill value.
| #include "absl/status/status.h" | ||
| #include "absl/strings/cord.h" | ||
| #include <nlohmann/json_fwd.hpp> | ||
| #include "riegeli/bytes/cord_reader.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing BUILD deps for these.
| /// \file | ||
| /// Support for encoding/decoding zarr "dtype" specifications. | ||
| /// See: https://zarr.readthedocs.io/en/stable/spec/v2.html | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is missing some includes (*it already was, but now it's missing more).
#include <optional>
#include <string>
#include <string_view>
#include <vector>
#include "absl/status/status.h"
#include "tensorstore/index.h"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the BUILD deps for the new non-std library includes as well.
| friend void to_json(::nlohmann::json& out, // NOLINT | ||
| const ZarrDType& dtype); | ||
|
|
||
| /// Lazily-computed cache for GetVoidField(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we need this cached? Seems like just returning a Field is reasonable, as the CreateVoidMetadata is already going to copy it into vector, and it's not expensive to recreate when opening a new zarr volume.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See jbms@ below; he's in favor of keeping this cached but guarding it.
| // since we're treating the data as raw bytes regardless of the actual dtype. | ||
| // Shape is allowed to differ (handled by base class for resizing). | ||
| // Other fields like compressor, order, chunks must still match. | ||
| if (existing_metadata.dtype.bytes_per_outer_element != |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could just rely on the normal validate but applied to the void metadata.
| // contain only the void field, allowing standard encode/decode to work. | ||
| // CreateVoidMetadata uses the same chunks and bytes_per_outer_element as | ||
| // the original validated metadata, so it should never fail. | ||
| initializer.metadata = CreateVoidMetadata(original_metadata).value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably would be better to have the original metadata cache the pointer to the void metadata and initialize it on first access.
| field_index, | ||
| GetFieldIndex(metadata.dtype, spec().selected_field)); | ||
| } | ||
| TENSORSTORE_RETURN_IF_ERROR( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to make sure to validate the schema against the void metadata, not the regular metadata. Note: The partial_metadata validation is still against the regular metadata, though.
|
|
||
| // Set fill_value for the single void field. | ||
| // Empty/null fill value is handled by GetChunkGridSpecification. | ||
| metadata->fill_value.resize(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to have a proper fill value here.
| } | ||
|
|
||
| const ZarrDType::Field* ZarrDType::GetVoidField() const { | ||
| if (!void_field_cache_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cache isn't thread safe which is a problem. However, instead of caching void_field instead could just cache the entire void metadata (and do it thread safely).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually looking more closely I see that it will be inconvenient to not have this cache, since we need a stable pointer to the void field. So you should keep the cache but just make it thread safe, e.g. using absl::call_once.
Supersedes #264
Removes
bool open_as_void = falsedefault in C++ function declarations and definitions.Uses a derived DataCache implementation similar to strategy used in
neuroglancer_precomputeddriver forUnshardedDataCacheandShardedDataCache