PARQUET-2249: Introduce IEEE 754 total order & NaN-counts #514

JFinis · 2025-08-09T12:46:32Z

This commit is a combination of the following PRs:

Introduce IEEE 754 total order PARQUET-2249: Introduce IEEE 754 total order #221
Add nan_count to handle NaNs in statistics PARQUET-2249: Add nan_count to handle NaNs in statistics #196

Both these PRs try to solve the same problems; read the description of the respective PRs for explanation.

This PR is the result of an extended discussion in which it was repeatedly brought up that another possible solution to the problem could be the combination of the two approaches. Please refer to this discussion on the mailing list and in the two PRs for details. the mailing list discussion can be found here:
https://lists.apache.org/thread/lzh0dvrvnsy8kvflvl61nfbn6f9js81s

The contents of this PR are basically a straightforward combination of the two approaches:

IEEE total order is introduced as a new order for floating point types
nan_count and nan_counts fields are added

Legacy writers may not write nan_count(s) fields,
so readers have to handle them being absent. Also, legacy writers may have included NaNs into min/max bounds, so readers also have to handle that.

As there are no legacy writers writing IEEE total order, nan_count(s) are defined to be mandatory if this order is used, so readers can assume their presense when this order is used.

This commit removes nan_pages from the ColumnIndex, which the nan_counts PR mandated. We don't need them anymore, as IEEE total order solves this: As this is a new order and there are thus no legacy writers and readers, we have the freedom to define that for only-NaN pages using this order, we can actually write NaNs into min & max bounds in this case, and readers can assume that isNaN(min) signals an only-NaN page.

Note: If this PR is chosen to be adopted, I will update the commit message to state the problem without just pointing onto the other PRs. But until then, I rather link to them instead of repeating myself, in case this PR isn't chosen anyway.

Rationale for this change

What changes are included in this PR?

Do these changes have PoC implementations?

This commit is a combination of the following PRs: * Introduce IEEE 754 total order apache#221 * Add nan_count to handle NaNs in statistics apache#196 Both these PRs try to solve the same problems; read the description of the respective PRs for explanation. This PR is the result of an extended discussion in which it was repeatedly brought up that another possible solution to the problem could be the combination of the two approaches. Please refer to this discussion on the mailing list and in the two PRs for details. the mailing list discussion can be found here: https://lists.apache.org/thread/lzh0dvrvnsy8kvflvl61nfbn6f9js81s The contents of this PR are basically a straightforward combination of the two approaches: * IEEE total order is introduced as a new order for floating point types * nan_count and nan_counts fields are added Legacy writers may not write nan_count(s) fields, so readers have to handle them being absent. Also, legacy writers may have included NaNs into min/max bounds, so readers also have to handle that. As there are no legacy writers writing IEEE total order, nan_count(s) are defined to be mandatory if this order is used, so readers can assume their presense when this order is used. This commit removes `nan_pages` from the ColumnIndex, which the nan_counts PR mandated. We don't need them anymore, as IEEE total order solves this: As this is a new order and there are thus no legacy writers and readers, we have the freedom to define that for only-NaN pages using this order, we can actually write NaNs into min & max bounds in this case, and readers can assume that isNaN(min) signals an only-NaN page.

alamb

I went through the proposal and I left some stylistic comments which I think might help clarity, but are not required.

Thank you @JFinis for pushing this along

alamb · 2025-08-11T10:48:40Z

README.md

+sort orders for logical and primitve types and also special orders for types
+where the common sort order is not unambiguously defined (e.g., NaN ordering
+for floating point types). The details are documented in the


There is a spelling mistake (primitve). Also here is a proposal to make this more concise

Suggested change

sort orders for logical and primitve types and also special orders for types

where the common sort order is not unambiguously defined (e.g., NaN ordering

for floating point types). The details are documented in the

sort orders for logical and primitive types and also special orders for

types with potentially ambiguous semantics (e.g., NaN ordering

for floating point types). The details are documented in the

alamb · 2025-08-11T12:42:59Z

src/main/thrift/parquet.thrift

+   /** 
+    * count of NaN values in the column; only present if physical type is FLOAT
+    * or DOUBLE, or logical type is FLOAT16. 
+    * Readers MUST distinguish between nan_count not being present and nan_count == 0.
+    * If nan_count is not present, readers MUST NOT assume nan_count == 0.
+    */
+   9: optional i64 nan_count;


I would find this easier to understand if it were more concise and didn't have a double negative. Here is a suggestion:

Suggested change

/**

* count of NaN values in the column; only present if physical type is FLOAT

* or DOUBLE, or logical type is FLOAT16.

* Readers MUST distinguish between nan_count not being present and nan_count == 0.

* If nan_count is not present, readers MUST NOT assume nan_count == 0.

*/

9: optional i64 nan_count;

/**

* count of NaN values in the column; only present if physical type is FLOAT

* or DOUBLE, or logical type is FLOAT16.

* If this field is not present, readers must assume NaNs may be present

* (MUST NOT assume nan_count == 0).

*/

9: optional i64 nan_count;

alamb · 2025-08-11T12:45:41Z

src/main/thrift/parquet.thrift

   * 
   *     When writing statistics the following rules should be followed:
-   *     - NaNs should not be written to min or max statistics fields.
+   *     - It is suggested to always set the nan_count field for floating


The heading says "the following rules should be followed" but the first one says "it is suggested" which seems to imply it is optional. I recommend removing the "suggested"

Even 'should' is still optional per RFC2119. If this isn't optional, 'must' or 'shall' is the correct term. But this is a problem in more places in this document, so I take it you're not following that RFC.

I think the requirement should be to set nan_count, so I agree that this should not be a suggestion.

This whole document uses "should", where "shall" or "must" would actually be more appropriate. Therefore, I was going with the general speak of this document for consistency. I totally agree though; it always bothered me a lot that the spec always uses "should", making it sounds as if all of this is just a mere suggestion and implementors can do what they want.

That being said, if we want to change this, I would rather do it in one PR that updates the whole document to stay consistent.

alamb · 2025-08-11T12:47:24Z

src/main/thrift/parquet.thrift

+   *     - NaNs should not be written to min or max statistics fields except
+   *       in the column index, where min_values and max_values are not optional
+   *       so a NaN value must be written if all non-null values in a page
+   *       are NaN.


Suggested change

* - NaNs should not be written to min or max statistics fields except

* in the column index, where min_values and max_values are not optional

* so a NaN value must be written if all non-null values in a page

* are NaN.

* - NaNs should not be written to min or max statistics fields except

* in the column index when a page contains only NaN values. In this

* case, since min_values and max_values are required, a NaN value

* must be written.

alamb · 2025-08-11T12:48:36Z

src/main/thrift/parquet.thrift

+   * - Writing the nan_count field is mandatory when using this ordering,
+   *   especialy also if it is zero.


I if the field is mandatory, the "especially" part makes no sense

Suggested change

* - Writing the nan_count field is mandatory when using this ordering,

* especialy also if it is zero.

* - Writing the nan_count field is mandatory when using this ordering.

alamb · 2025-08-11T12:50:20Z

src/main/thrift/parquet.thrift

+   *   still included NaN in min_values and max_values even if the page had
+   *   non-NaN values. To mitigate this, IEEE754_TOTAL_ORDER is recommended.
+   * - If the order of this column is IEEE754_TOTAL_ORDER, then min_values[i]
+   * * If IEEE754_TOTAL_ORDER is used for the column and all non-null values


I think the heading - bullet is different:

Suggested change

* * If IEEE754_TOTAL_ORDER is used for the column and all non-null values

* - If IEEE754_TOTAL_ORDER is used for the column and all non-null values

alamb · 2025-08-11T12:50:59Z

src/main/thrift/parquet.thrift

+    * might not be NaN values in any page, as NaNs should not be included
+    * in min_values or max_values.


Suggested change

* might not be NaN values in any page, as NaNs should not be included

* in min_values or max_values.

* might not be NaN values in any page, as NaNs are not included

* in min_values or max_values.

This looks imprecise to me. NaNs are included in min_values/max_values when IEEE754TotalOrder is used. So perhaps we should simply say that If this field is not present, readers MUST assume that there might be NaN values in any page?

orlp · 2025-08-11T14:00:53Z

src/main/thrift/parquet.thrift

+   * must be followed:
+   * - Writing the nan_count field is mandatory when using this ordering,
+   *   especialy also if it is zero.
+   * - NaNs should not be written to min or max statistics fields except


I think the spec should specify exactly what should be written, not just rule out what shouldn't be. I assume you meant to specify here that the smallest/largest non-NaN value must be written instead?

I also think this rule (assuming I understood you correctly) is a bit overcomplicated? Not sure why we'd make the rules different depending or not whether it is a column index. May I propose the following instead which applies regardless of whether it is a column index?

When writing statistics for columns with this order, the following rules must be followed:

Min or max statistics if written must contain the smallest or largest non-NaN value respectively. If there are exclusively NaNs they must contain the smallest and largest NaN values respectively instead, as defined by the IEEE 754 total order.

Sure, we can also keep it the same for the column index and statistics here. My initial idea was to keep the behavior close to what we currently have where possible. But I agree that we can alos instead make it different from the current behavior and therefore more consistent between column index and statistics.

rdblue · 2025-08-12T20:55:52Z

src/main/thrift/parquet.thrift

 struct DataPageHeader {
  /**
-   * Number of values, including NULLs, in this data page.
+   * Number of values, including nulls, in this data page.


Can you please remove these unnecessary changes? It pollutes git blame and is no more clear.

rdblue · 2025-08-12T20:56:58Z

src/main/thrift/parquet.thrift

   *
-   * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
+   * (*) Because the precise sorting order is ambiguous for floating


I think it should be clear that this is referring to the TypeDefinedOrder for floating points.

rdblue · 2025-08-12T21:04:04Z

src/main/thrift/parquet.thrift

+   * anything else. It also defines an order between different bit representations
+   * of the same value.
+   *
+   * The formal definition is as follows:


Do we need to copy this here? Having a version here that duplicates the IEEE-754 version makes me worry that they will get out of sync or have errors due to copying. We should make it clear that the IEEE-754 version is correct if the two ever differ.

I don't fully understand what lines your comment is referring to. What is the part you see duplicated?

rdblue · 2025-08-12T21:10:00Z

src/main/thrift/parquet.thrift

+   * * If IEEE754_TOTAL_ORDER is used for the column and all non-null values
+   *   of a page are NaN, then min_values[i] and max_values[i] must be set to
+   *   the smallest and largest NaN value contained in the page, as defined
+   *   by the IEEE 754 total order.


Why does this require actual bound for NaN rather than using a standard NaN value?

My intentaion was here that if we already use an order that orders NaNs in a well defined manner, we can also use it. I agree we could also settle for "any NaN values is okay in this case", but that is somewhat going against the spirit of IEEE total order.

FWIW it is all these strange special cases that are why I still prefer the total order PR - nan_counts sound simple but in practice there be dragons.

I agree with @tustvold, I don't think the added complexity of this proposal is worth it given the meager added benefit. This proposal still means that query engines that do use IEEE total ordering will be unable to filter due to the lack of knowledge of the sign of the NaNs seen. This is the flip side of total ordering not quite working for engines that lump all NaNs together. So the only real benefit to this proposal is to not "poison" the statistics.

rdblue · 2025-08-12T21:10:12Z

src/main/thrift/parquet.thrift

   * null counts are 0.
   */
  5: optional list<i64> null_counts
-


Nit: unnecessary whitespace change.

wgtmac · 2025-08-13T08:31:56Z

src/main/thrift/parquet.thrift

   *
-   * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
+   * (*) Because the precise sorting order is ambiguous for floating


Suggested change

* (*) Because the precise sorting order is ambiguous for floating

* (*) Because the TypeDefinedOrder is ambiguous for floating

Perhaps explicitly use TypeDefinedOrder or TYPE_ORDER here?

wgtmac · 2025-08-13T08:34:24Z

src/main/thrift/parquet.thrift

+   *     - If the nan_count field is set, a reader can compute
+   *       nan_count + null_count == num_values to deduce whether all non-null
+   *       values are NaN.
+   *     - When looking for NaN values, min and max should be ignored.


Moving these to line 1113 as it has duplicate content.

wgtmac · 2025-08-13T08:56:39Z

src/main/thrift/parquet.thrift

+   * For columns of physical type FLOAT or DOUBLE, or logical type FLOAT16,
+   * NaN values are not to be included in these bounds. If all non-null values
+   * of a page are NaN, then a writer must do the following:
+   * - If the order of this column is TypeDefinedOrder, then no column index


Suggested change

* - If the order of this column is TypeDefinedOrder, then no column index

* - If the order of this column is TYPE_ORDER, then no column index

Just to be consistent

wgtmac · 2025-08-13T09:00:18Z

src/main/thrift/parquet.thrift

+   * - If the order of this column is IEEE754_TOTAL_ORDER, then min_values[i]
+   * * If IEEE754_TOTAL_ORDER is used for the column and all non-null values
+   *   of a page are NaN, then min_values[i] and max_values[i] must be set to
+   *   the smallest and largest NaN value contained in the page, as defined
+   *   by the IEEE 754 total order.


Suggested change

* - If the order of this column is IEEE754_TOTAL_ORDER, then min_values[i]

* * If IEEE754_TOTAL_ORDER is used for the column and all non-null values

* of a page are NaN, then min_values[i] and max_values[i] must be set to

* the smallest and largest NaN value contained in the page, as defined

* by the IEEE 754 total order.

* - If the order of this column is IEEE754_TOTAL_ORDER, then min_values[i]

* and max_values[i] must be set to the smallest and largest NaN value

* contained in the page, as defined by the IEEE 754 total order.

wgtmac · 2025-08-13T09:05:48Z

src/main/thrift/parquet.thrift

+    * might not be NaN values in any page, as NaNs should not be included
+    * in min_values or max_values.


This looks imprecise to me. NaNs are included in min_values/max_values when IEEE754TotalOrder is used. So perhaps we should simply say that If this field is not present, readers MUST assume that there might be NaN values in any page?

wgtmac · 2025-08-13T09:10:02Z

src/main/thrift/parquet.thrift

+    * might not be NaN values in any page, as NaNs should not be included
+    * in min_values or max_values.
+    */
+   8: optional list<i64> nan_counts


I remember that there was concern with introducing nan_counts to column index which is not applicable to all types. If we don't receive any objection during this round of discussion, can we conclude that we have consensus on this?

That's the trade-off for nan_counts in general: You add fields to thrift that are only relevant to floating point types. Let's see if there is any objection on this.

JFinis · 2025-08-13T09:31:48Z

@tustvold @crepererum @westonpace I would especially like to get your opions on this, as you were proponents of total order for simplicity. Does this PR still cover what you wanted to get out of this?

@mapleFU @etseidl @gszadovszky @pitrou FYI, as you were part of the initial discussion on the previous PRs.

tustvold

Am I understanding correctly that the change to add nan_counts means that NaNs are now excluded from the min/max statistics? I believe this creates a new problem for engines that use total ordering as the sign of the NaNs is not known, and therefore there are scenarios where predicates can't be pushed down where they could be in the absence of nan_counts.

My 2 cents is that there probably is no perfect solution here, adding nan_counts is worse for some engines and better for some others, and as such my personal preference would still be to go with the simpler solution, that has existing implementations and can ship today.

The argument for nan_counts appears to be that the presence of a NaN has the effect of poisoning the statistics. I don't really follow why NaNs are special in this regard, any extreme value has the same effect, if people want to effectively push down such predicates they should be sorting so as to cluster data points of similar magnitude. It has been said that people coming from numpy use NaN instead of Null, but as was said on the mailing list, I don't think the onus should be on parquet to handle this, rather people should convert NaNs to Nulls if that is the semantic they want.

That being said if people are motivated to push ahead with this hybrid solution, I don't really feel strongly. Ultimately the issue is fixing the fact predicates can't be pushed down on floats in general, how that applies to NaNs IMO is kind of irrelevant to the vast majority of workloads. This has dragged on for 2 years, we'd almost reached consensus and I can't help feeling we've been startled by one dissenting opinion...

orlp · 2025-08-13T10:13:42Z

@tustvold

Am I understanding correctly that the change to add nan_counts means that NaNs are now excluded from the min/max statistics? I believe this creates a new problem for engines that use total ordering as the sign of the NaNs is not known, and therefore there are scenarios where predicates can't be pushed down where they could be in the absence of nan_counts.

No, the predicates still can be pushed down, regardless of the semantics of your engine.

If your engine uses total ordering with predicate col > c you skip a page when the statistics indicate col_max <= c && nan_count == 0. If your engine uses partial ordering you simply check if col_max <= c.

My 2 cents is that there probably is no perfect solution here, adding nan_counts is worse for some engines and better for some others

It's not worse for any engine as far as I can see, a strict improvement for everyone in filtering capabilities.

I don't really follow why NaNs are special in this regard, any extreme value has the same effect

NaNs aren't extreme values for engines without total ordering, they are simply unordered.

tustvold · 2025-08-13T10:26:29Z

I seem to remember from a previous discussion around the need to have signed NaNs to allow for predicate pushdown, but I can't remember the example.

Regardless my broader point still stands that we are adding a non-trivial amount of complexity, there is a lot of subtlety both at read and write time, to yield an improvement that to be brutally honest I am not really sure will benefit all that many real workloads. It's all tradeoffs, if people want to and are motivated to proceed with the hybrid approach I don't feel strongly, but I personally prefer the simpler option that is more likely to be implemented correctly across the many many parquet implementations. The proposal here is strictly more complex than the existing specification which implementations still implement incorrectly.

orlp · 2025-08-13T10:33:54Z

I seem to remember from a previous discussion around the need to have signed NaNs to allow for predicate pushdown, but I can't remember the example.

Signed NaN counts are only useful to avoid treating a -NaN as a false positive for the col > c case for total ordering engines. To give a concrete example, if the page contains [-NaN, -3, 42] and the predicate is col > 100 then this page could be skipped by such an engine with statistics total_min = -NaN, total_max = 42 but not with min = -3, max = 42, nan_count = 1, because that singular nan could be positive NaN which would match +NaN > 100. Thus if signed nan counts are unavailable any NaN regardless of sign will have to be treated as an extremal value by those engines.

But only having a single NaN count doesn't block predicate pushdown whatsoever, and since you're fine with NaN poisoning anyway I don't think you'd care about the above -NaN edge case.

JFinis · 2025-08-13T10:45:23Z

I seem to remember from a previous discussion around the need to have signed NaNs to allow for predicate pushdown, but I can't remember the example.

Regardless my broader point still stands that we are adding a non-trivial amount of complexity, there is a lot of subtlety both at read and write time, to yield an improvement that to be brutally honest I am not really sure will benefit all that many real workloads. It's all tradeoffs, if people want to and are motivated to proceed with the hybrid approach I don't feel strongly, but I personally prefer the simpler option that is more likely to be implemented correctly across the many many parquet implementations. The proposal here is strictly more complex than the existing specification which implementations still implement incorrectly.

Gut feeling wise, I agree with your assessment. I feel like I personally can implement this correctly in our engine, but I have now spent years of my life thinking about this, so I guess I'm not representative for the average Parquet maintainer. I also feel that this has such a high chance of being implemented incorrectly that the more straightforward solution of just IEEE total ordering would be preferrable. My gut feels like the risk of added implementation complexity and thus chance for wrong implementations is higher than the advantage of not having "NaN poisoning" of statistics.

alamb · 2025-08-15T13:27:51Z

I have filed a ticket to track trying to implement this in the arrow-rs implementation of Parquet, which will hopefully help us quantify how complicated this proposal is to implement:

[Parquet] Prototype: PARQUET-2249: Introduce IEEE 754 total order & NaN-counts #514 arrow-rs#8156

wgtmac · 2025-09-04T07:01:55Z

@tustvold @crepererum @westonpace I would especially like to get your opions on this, as you were proponents of total order for simplicity. Does this PR still cover what you wanted to get out of this?

@mapleFU @etseidl @gszadovszky @pitrou FYI, as you were part of the initial discussion on the previous PRs.

I think we have waited enough time for feedback. Do you want to start a vote with two proposals on the dev ML, @JFinis?

alamb approved these changes Aug 11, 2025

View reviewed changes

This was referenced Aug 11, 2025

PARQUET-2249: Add nan_count to handle NaNs in statistics #196

Open

PARQUET-2249: Introduce IEEE 754 total order #221

Open

orlp reviewed Aug 11, 2025

View reviewed changes

rdblue reviewed Aug 12, 2025

View reviewed changes

src/main/thrift/parquet.thrift

* null counts are 0.

*/

5: optional list<i64> null_counts

Copy link

Contributor

rdblue Aug 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: unnecessary whitespace change.

wgtmac reviewed Aug 13, 2025

View reviewed changes

tustvold reviewed Aug 13, 2025

View reviewed changes

This was referenced Aug 15, 2025

[Parquet] Prototype: PARQUET-2249: Introduce IEEE 754 total order & NaN-counts #514 apache/arrow-rs#8156

Open

[Not for Merge] PoC implementation of PARQUET-2249: Introduce IEEE 754 total order apache/arrow-rs#7408

Draft

PingLiuPing mentioned this pull request Oct 9, 2025

feat: Add NaN statistics to parquet writer facebookincubator/velox#14725

Open

		* - Writing the nan_count field is mandatory when using this ordering,
		* especialy also if it is zero.

	* * If IEEE754_TOTAL_ORDER is used for the column and all non-null values
	* - If IEEE754_TOTAL_ORDER is used for the column and all non-null values

		* might not be NaN values in any page, as NaNs should not be included
		* in min_values or max_values.

	* (*) Because the precise sorting order is ambiguous for floating
	* (*) Because the TypeDefinedOrder is ambiguous for floating

	* - If the order of this column is TypeDefinedOrder, then no column index
	* - If the order of this column is TYPE_ORDER, then no column index

PARQUET-2249: Introduce IEEE 754 total order & NaN-counts #514

Are you sure you want to change the base?

PARQUET-2249: Introduce IEEE 754 total order & NaN-counts #514

Uh oh!

Conversation

JFinis commented Aug 9, 2025

Rationale for this change

What changes are included in this PR?

Do these changes have PoC implementations?

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

orlp Aug 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JFinis commented Aug 13, 2025

Uh oh!

tustvold left a comment

Choose a reason for hiding this comment

Uh oh!

orlp Aug 11, 2025 •

edited

Loading

orlp commented Aug 13, 2025 •

edited

Loading

tustvold commented Aug 13, 2025 •

edited

Loading