Contributor

@emkornfield emkornfield commented Dec 2, 2025

Rationale for this change

Questions still arise about the state of V2; this PR tries to clarify my current understanding.

  1. All encodings can be used any place
  2. Version in the footer is meaningless and readers should determine readability by features.

What changes are included in this PR?

Spec clarification. Will start a discussion on the mailing list for word-smithing and consensus

Do these changes have PoC implementations?

No

@emkornfield emkornfield marked this pull request as draft December 2, 2025 18:18
Comment on lines 718 to 721
*
* N.B. this page header is not necessarily strictly better than DataPageHeader.
* Page indexes already require that rows are aligned on page boundaries, and compressing
* repetition and definition levels can still be effective in some cases.

Are you saying this is deprecated? Why do we need this comment? It's not clear what you're trying to achieve here. (Nit: prefer not to use abbreviations like N.B.)

Contributor Author

Rephrased. No, I wasn't; I was just saying it is not necessarily a strict improvement.

Comment on lines 1262 to 1267
/** Version of this file
*
* Deprecated. Readers should determine if they support reading based on
* specific metadata (e.g. encoding enum) rather than relying on this field
* to make this determination.
*/

I disagree with this. I don't think we should abandon versioning, but rather be more explicit about breaking changes and what is included with version update. Regardless, this needs more discussion with the community and a clear path forward for how we support breaking changes.

Contributor Author

I rephrased this to explain rationale. I agree it needs discussion in the community but I think based on previous conversations this is probably the consensus path.

Contributor

Rather than potentially (implicitly) changing the spec, I suggest we add some language to help people understand how to interpret this field and the tradeoffs between using different versions.

For example, maybe we could say something like

As of December 2025, there is no agreed upon set of features that constitute version 2, so for maximum compatibility, writers should populate "1" for version and accept "1" and "2" interchangeably.

Contributor Author

rephrased along these lines.
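
As an illustration of the wording suggested in this thread (write "1", accept "1" and "2" interchangeably), here is a minimal Python sketch. `FooterSketch`, `make_footer`, and `version_is_acceptable` are hypothetical names and not part of any Parquet library. Rejecting values other than 1 and 2 is an assumption of this sketch; the thread does not say how readers should treat them.

```python
from dataclasses import dataclass

# Value a writer populates for maximum compatibility, per the suggested wording.
WRITER_DEFAULT_VERSION = 1

# Values a reader treats interchangeably, per the suggested wording.
ACCEPTED_VERSIONS = frozenset({1, 2})


@dataclass
class FooterSketch:
    """Hypothetical stand-in for the footer metadata; only the version field is modeled."""
    version: int


def make_footer() -> FooterSketch:
    """Writer side: always populate version 1."""
    return FooterSketch(version=WRITER_DEFAULT_VERSION)


def version_is_acceptable(footer: FooterSketch) -> bool:
    """Reader side: accept 1 and 2 without distinguishing between them.

    Assumption: any other value is rejected. The thread leaves this case open.
    """
    return footer.version in ACCEPTED_VERSIONS
```

Under this sketch the version field never decides which features a file uses; that decision is left to the per-feature metadata discussed elsewhere in the thread.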

Encodings.md Outdated
Comment on lines 25 to 29
Some Parquet implementations distinguish encodings as "v1" and "v2". From
a specification perspective this distinction is considered meaningless. Writers may use any
encoding with both data page v1 and data page v2. Readers should lazily evaluate if they can
read a file (e.g. only error when required to read a page with an unknown encoding).

I feel like we're redefining what version means to be scoped only to encodings and then saying that it's not necessary. It seems like we want to either separate encodings from versioning (e.g. any encoding that is understood by a client should be considered supported regardless of when it was introduced) or be more explicit about associating new encodings with a version (along with other possible breaking structural/representational changes).

Contributor Author

I rephrased to remove version entirely as it might cause confusion.
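
The "lazily evaluate" guidance in the Encodings.md hunk under review could look roughly like the Python sketch below. `PageSketch`, `read_pages`, `UnsupportedEncodingError`, and the particular set of supported encodings are all hypothetical and chosen for illustration; a real reader would decode the payload rather than return it unchanged.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Encodings this hypothetical reader knows how to decode (illustrative subset).
SUPPORTED_ENCODINGS = frozenset(
    {"PLAIN", "RLE", "RLE_DICTIONARY", "DELTA_BINARY_PACKED"}
)


class UnsupportedEncodingError(Exception):
    """Raised only when a page that must be read uses an unknown encoding."""


@dataclass
class PageSketch:
    """Hypothetical stand-in for a data page: its encoding name and raw payload."""
    encoding: str
    payload: bytes


def read_pages(pages: Iterable[PageSketch]) -> Iterator[bytes]:
    """Yield page payloads one at a time, checking the encoding per page.

    Because this is a generator, an unknown encoding raises only when the
    caller actually reaches that page, not when the file is opened.
    """
    for page in pages:
        if page.encoding not in SUPPORTED_ENCODINGS:
            raise UnsupportedEncodingError(
                f"cannot decode page with encoding {page.encoding!r}"
            )
        yield page.payload  # Decoding is omitted in this sketch.
```

A caller that only needs the first page of a column, or that prunes the offending pages, never sees the error. That is the behaviour the hunk asks readers to aim for.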

danielcweeks commented Dec 4, 2025

After thinking this through a little more, I think we should more clearly define what each "versioned identifier" means and clearly articulate under what conditions it would change. For example:

Magic Number PAR1:

  • What it means: Indicates that footer is still thrift compatible with Parquet V1 format and is expected to be parseable by any existing Parquet V1 client.
  • When it changes: the footer is changed in an incompatible way with the thrift definition or the footer is substantively changed in a way that older clients should not even attempt to read the file.
  • What it solves: helps determine whether a file/footer is parquet/corrupt/other or substantively different between versions.
  • Example: replacing the footer with FlatBuffers representation in an incompatible way.

Footer Version Number

  • What does it mean: largely redundant with PAR1
  • When it changes: (same as magic number)
  • What it solves: if the footer is stored somewhere outside of the file (e.g. a cache or supplied via some other mechanism)
  • Example: the footer is stored in a high-performance cache and keyed by path for faster pruning. No magic number is available for the reader.

What changes have been made without version updates:

  1. [backward incompatible] New compression codecs: brotli, zstandard, etc.
  2. [backward incompatible] New data types: variant, geo types, etc.
  3. [backward compatible] Addition of Page indexes
  4. [backward compatible] Updates to Column Stats

Given the incompatible changes like addition of codecs without a version change, it's confusing as to why the addition of encodings would require a version change. I do think there are valid reasons to update the version, but we just haven't had an example of it yet (or we've been hesitant to leverage the version because of the impact on readers).
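
To make the "parquet/corrupt/other" role of the magic number concrete, here is a small Python sketch. It relies on the layout of unencrypted Parquet files: the 4-byte magic `PAR1` at the start and end, with a 4-byte little-endian footer length immediately before the trailing magic. `footer_length` is a hypothetical helper, not an API of any Parquet library, and it takes the whole file as bytes for simplicity.

```python
import struct

MAGIC = b"PAR1"
# Leading magic (4) + footer length (4) + trailing magic (4).
MIN_FILE_SIZE = 12


def footer_length(data: bytes) -> int:
    """Return the footer length in bytes, or raise ValueError if the
    bytes do not look like a PAR1 Parquet file."""
    if len(data) < MIN_FILE_SIZE:
        raise ValueError("too short to be a Parquet file")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")
    # The footer length sits just before the trailing magic, little-endian.
    (length,) = struct.unpack("<I", data[-8:-4])
    if length > len(data) - MIN_FILE_SIZE:
        raise ValueError("footer length exceeds file size")
    return length
```

A footer held in an external cache has no surrounding magic bytes to check, which is the case the comment above gives for the footer version number.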

Contributor Author

@emkornfield emkornfield left a comment

change wording for version.

@emkornfield emkornfield marked this pull request as ready for review December 5, 2025 22:50
Contributor

@alamb alamb left a comment

Looks good to me. Thanks @emkornfield

I do think we should make sure that @danielcweeks 's concerns about version number are addressed before merging this

* in some scenarios). Page indexes require pages start and end at row boundaries regardless of which
* page header is used.
*
* As of December 2025, most known Parquet readers can read pages using this header.
Contributor

Perhaps we could refer readers to https://parquet.apache.org/docs/file-format/implementationstatus/ for the most up to date information

Contributor

Maybe we can also change the content of

  • New page format allowing reading levels without decompressing the data

To another word to make it clearer that there is no agreed upon expectation that all writers will eventually use this new header. Perhaps:

  • Alternate page format allowing reading levels without decompressing the data

Contributor Author

Rephrased along these lines.

Contributor Author

> I do think we should make sure that @danielcweeks 's concerns about version number are addressed before merging this

I discussed the changes with @danielcweeks offline and I think he is OK with the current state. @danielcweeks please let me know if I misunderstood.

Contributor Author

@alamb thanks for the review, I believe I addressed your concerns.

Contributor

@alkis alkis left a comment

Thank you for putting this together @emkornfield!

Contributor

@alamb alamb left a comment

Thanks again @emkornfield and @danielcweeks

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Comment on lines 723 to 724
* https://parquet.apache.org/docs/file-format/implementationstatus/ tracks the implementation of readers and
* writers that support this page format.
Member

It does not seem useful to provide this link here as it would apply to most other Parquet features as well.

Contributor Author

Removed. @alamb suggested adding it but I agree we generally don't reference it.

Member

@pitrou pitrou left a comment

LGTM for the most part, just one suggestion


/**
- * New page format allowing reading levels without decompressing the data
+ * Alternate page format allowing reading levels without decompressing the data
Member

I would perhaps keep "New" or "More recent".

Contributor Author

I think Alternate probably makes the most sense here. It is not exactly "new" any more.

How about "Alternative page format, introduced to the specification after DataPageHeader, ...."

Contributor

the naming of "V2" somewhat implies that it came after DataPageHeader

What I think is confusing about the word "new" is that (to me) it implies an expectation that this header will eventually replace the existing one, which I don't think there is consensus on

Contributor Author

> the naming of "V2" somewhat implies that it came after DataPageHeader

Right, IIUC the concern from Antoine is to maybe note that this wasn't part of the original spec?

Member

@pitrou pitrou left a comment

+1 in general, two small suggestions

Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Contributor

alamb commented Dec 19, 2025

I also added links to the related mailing list threads to the description on this PR:

Contributor Author

emkornfield commented Dec 19, 2025

Vote passed. Going to merge. @pitrou @alamb we can maybe take another pass at the V2 data page header in another PR if there are serious concerns here.

@emkornfield emkornfield merged commit 4b1c72c into apache:master Dec 19, 2025
4 checks passed

Development

Successfully merging this pull request may close these issues.

Clarify state of versioning and V2

8 participants