-
Notifications
You must be signed in to change notification settings - Fork 462
spec: remove the longitude wraparound of the geometry type bbox #526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
@wgtmac please review when you have time. Thank you! |
wgtmac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 Thanks @jiayuasu for the clarification!
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what it's worth I am not sure this change is useful...the existing language has existed in the released spec for many months and all implementations interpret X statistics (geometry or geography) according to the previous language. Implementing wraparound bounding is not hard (took me about an hour, is 200 lines of Rust: https://github.com/apache/arrow-rs/blob/760b7b6077559e0e73a01771a5b423da535fc96a/parquet-geospatial/src/bounding.rs#L29-L125 ). All of that code has to exist anyway for Geography bounding so it's not saving anybody any work.
If this is a change we think is necessary, it should (1) go through the Parquet mailing list so that implementors are aware of it/can weigh in and (2) specify what implementations should do if they encounter statistics like this.
I don't think there's any technical reason this needs to be linked to Iceberg (and no reason that the Parquet format needs to inherit the constraints of the Java Iceberg implementation, which I believe is what is driving this particular conversation). Indeed, the ability for Parquet implementations to read and write wraparound bounds will allow options for Iceberg V4 that don't have to wait for changes to dozens of Parquet implementations.
|
@paleolimbot Iceberg uses the underlying Parquet statistics for its own statistics. So if we don't sync the Iceberg change to Parquet, how would Iceberg side handle the bbox wraparound if the underlying Parquet file has one? |
|
I think it's fair that Parquet files with wraparound bounds can't be used in Iceberg V3 tables. Iceberg is one of many, many uses of a Parquet file. |
|
Ok, I think what Dewey said makes sense. I don’t believe we necessarily need to update the Parquet spec, since the Iceberg side can simply avoid using wraparound longitude for geometry bounding boxes. Of course, we’re not Parquet PMC members, so if the Parquet PMC feels strongly about it, we can start a vote or continue iterating on the PR. |
|
I agree with @paleolimbot that Parquet does not have to be bound by Iceberg, especially when there are Parquet implementations with wraparound support. However my concern is that Iceberg may come up with a different solution later which diverges from the Parquet spec. |
I think it's fine to come up with another solution...I don't think there's any solution would change the existing meaning of |
|
It seems that the two proposals from https://lists.apache.org/thread/x9ll3rhg26mngm10cjn74w66ov23grmm may add additional attributes to the geospatial logical types. I'm not sure it may break the assumption of current Parquet implementations. BTW, do you want to port the wraparound logic from arrow-rs to arrow-cpp? @paleolimbot |
Yes! I had hoped for Arrow 22 but given the feature freeze tomorrow I think that's unlikely (I can definitely ensure this and the null count are updated for Arrow 23).
I don't think either of those options intends to make changes to the Parquet geometry type that would break the assumptions of existing readers (I suppose the second one adds an option to the geography enum, which would require readers to update their enums). From the Parquet end of things the first option is just a way to signal to Parquet via writer options that In any case, I don't think these are planning to assign alternative meaning to |
nit: I did the feature freeze for Arrow v22 ~1 hour ago (was scheduled yesterday but gave a little more time). We are still fixing blockers though. |
Rationale for this change
Sync the recent Iceberg spec change to Parquet: apache/iceberg#14250
Geometry type calculation should always be cartesian. A reader / writer should be able to read / write geometry data without understanding CRS.
What changes are included in this PR?
Do these changes have PoC implementations?
No