-
Notifications
You must be signed in to change notification settings - Fork 462
GH-534: Clarify versioning and V2 #535
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/main/thrift/parquet.thrift
Outdated
| * | ||
| * N.B. this page header is not necessarily strictly better then DataPageHeader. | ||
| * Page indexes already require that rows are aligned on page boundaries, and compressing | ||
| * repetition and definition levels can still be effective in some cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying this is deprecated? Why do we need this comment? It's not clear what you're trying to achieve here. (Nit: prefer not to use abbreviations like N.B.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rephrased, no I'm wasn't just saying it is not necessarily a strict improvement.
| /** Version of this file | ||
| * | ||
| * Deprecated. Readers should determine if they support reading based on | ||
| * specific metadata (e.g. encoding enum) rather then relying on this field | ||
| * to make this determination. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree with this. I don't think we should abandon versioning, but rather be more explicit about breaking changes and what is included with version update. Regardless, this needs more discussion with the community and a clear path forward for how we support breaking changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rephrased this to explain rationale. I agree it needs discussion in the community but I think based on previous conversations this is probably the consensus path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than potentially (implicitly) changing the spec, I suggest we add some langague to help people understand how to interpret this field and the tradeoffs between using different versions.
For example, maybe we could say something like
As of December 2025, there is no agreed upon set of features that constitute version 2, so for maximum compatibility, writers should populate "1" for version and accept "1" and "2" interchangeably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rephrased along these lines.
Encodings.md
Outdated
| Some Parquet implementations distinguish encodings as "v1" and "v2". From | ||
| a specification perspective this distinction is considered meaningless. Writers may use any | ||
| encoding with both data page v1 and data page v2. Readers should lazily evaluate if they can | ||
| read a file (e.g. only error when required to a read a page with an unknown encoding). | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we're redefining what version means to be scoped only to encodings and then saying that it's not necessary. It seems like we want to either separate encodings from versioning (e.g. any encoding that is understood by a client should be considered supported regardless of when it was introduced) or be more explicit about associating new encodings with a version (along with other possible breaking structural/representational changes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rephrased to remove version entirely as it might cause confusion.
|
After thinking this through a little more, I think we should more clearly define what each "versioned identifier" means and clearly articulate under what conditions it would change. For example: Magic Number
|
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change wording for version.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. THanks @emkornfield
I do think we should make sure that @danielcweeks 's concerns about version number are addressed before merging this
src/main/thrift/parquet.thrift
Outdated
| * in some scenarios). Page indexes require pages start and end at row boundaries regardless of which | ||
| * page header is used. | ||
| * | ||
| * As of December 2025, most known Parquet readers can read pages using this header. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could refer readers to https://parquet.apache.org/docs/file-format/implementationstatus/ for the most up to date information
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can also change the content of
- New page format allowing reading levels without decompressing the data
To another word to make it clearer that there is no agreed upon expectation that all writers will eventually use this new header. Perhaps:
- Alternate page format allowing reading levels without decompressing the data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rehrased along these lines.
| /** Version of this file | ||
| * | ||
| * Deprecated. Readers should determine if they support reading based on | ||
| * specific metadata (e.g. encoding enum) rather then relying on this field | ||
| * to make this determination. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than potentially (implicitly) changing the spec, I suggest we add some langague to help people understand how to interpret this field and the tradeoffs between using different versions.
For example, maybe we could say something like
As of December 2025, there is no agreed upon set of features that constitute version 2, so for maximum compatibility, writers should populate "1" for version and accept "1" and "2" interchangeably.
I discussed the changes with @danielcweeks offline and I think he is OK with the current state. @danielcweeks please let me know if I misunderstood. |
|
@alamb thanks for the review, I believe I addressed your concerns. |
alkis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for putting this together @emkornfield!
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @emkornfield and @danielcweeks
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
src/main/thrift/parquet.thrift
Outdated
| * https://parquet.apache.org/docs/file-format/implementationstatus/ tracks the implementation of readers and | ||
| * writers that support this page format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not seem useful to provide this link here as it would apply to most other Parquet features as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed. @alamb suggested adding it but I agree we generally don't reference it.
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for the most part, just one suggestion
|
|
||
| /** | ||
| * New page format allowing reading levels without decompressing the data | ||
| * Alternate page format allowing reading levels without decompressing the data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would perhaps keep "New" or "More recent".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Alternate probably makes the most sense here. It is not exactly "new" any more.
How about "Alternative page format, introduced to the specification after DataPageHeader, ...."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the naming of "V2" somewhat implies that it came after DataPageHeader
What I think is confusing about the word "new" is that (to me) it implies an expectation that this header will eventually replace the existing one, which I don't think there is consensus on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the naming of "V2" somewhat implies that it came after DataPageHeader
Right, IIUC the concern from Antoine is to maybe note that this wasn't part of the original spec?
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 in general, two small suggestions
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
|
I also added a links to the related mailing list threads to the description on this PR:
|
|
Vote passed. Going to merge. @pitrou @alamb we can maybe take another pass at the V2 data page header in another PR if there are serious concerns here. |
Rationale for this change
Questions still arise on the state of V2, this tries to clarify my current understanding.
What changes are included in this PR?
Spec clarification. Will start a discussion on the mailing list for word-smithing and consensus
Do these changes have PoC implementations?
No
Closes Clarify state of versioning and V2 #534
related mailing list thread https://lists.apache.org/thread/0bdyyb7qobrxx94x8v7t5z7g2ksnpyr2
vote thread on mailing list: https://lists.apache.org/thread/1nf596dq7fpc0zp35bt17zy7srvo5wt8