
Conversation

@heilhead
Collaborator

@heilhead commented Dec 9, 2025

Description

Refactors the parquet-based analytics exporters to add more configuration options, as well as a serde-based exporter (using the serde_arrow crate).

There are some minor API changes in how these serializers are instantiated and configured, but the default configuration hasn't changed.

The crate now offers two features (both disabled by default):

  • parquet-native: Serializes the data using the native parquet RecordWriter<T> implementation (often derived using the ParquetRecordWriter macro).
  • parquet-serde: Generates the parquet schema using serde_arrow and serializes the data using serde. Note that this requires both Serialize and DeserializeOwned to be implemented on the exported data, which breaks a common use case of using &'static str in the data exports. I suggest using something like ArcStr to cover all string export cases (owned, shared and static); see the sketch after this list.
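
To illustrate the difference, here is a minimal sketch of what an exported type might look like under each feature. The type and field names are hypothetical, and arcstr's serde feature is assumed for ArcStr:

// `parquet-native`: schema and serialization come from the derived RecordWriter.
#[derive(parquet_derive::ParquetRecordWriter)]
struct NativeExportRow {
    project_id: String,
    success: bool,
    created_at_ms: i64,
}

// `parquet-serde`: schema and serialization go through serde, so both
// Serialize and DeserializeOwned are required; arcstr::ArcStr covers
// owned, shared and static strings.
#[derive(serde::Serialize, serde::Deserialize)]
struct SerdeExportRow {
    project_id: arcstr::ArcStr,
    success: bool,
    created_at_ms: i64,
}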

For easier migration from the native to the serde serializer, and for use in integration testing, both serializers now offer a schema() function that returns the schema generated for the exported type. There's also a schema_from_str() function that parses a string schema for verification. See this crate's parquet_schema integration test for an example.

Note that this crate no longer uses the forked versions of parquet and parquet_derive. When using the native serializer, the version of parquet used by the consumer should match the one used in this crate, while parquet_derive can be of any version, e.g.:

# Dependencies of a downstream consumer
[dependencies]
parquet = { version = "57.1", default-features = false, features = [
  "flate2-rust_backened",
] }
parquet_derive = { git = "https://github.com/WalletConnect/arrow-rs.git", rev = "e862f32" }

How Has This Been Tested?

Existing tests, with a few new ones to cover serialization and schema matching between multiple implementations of parquet serializers.

Due Diligence

  • Breaking change
  • Requires a documentation update
  • Requires an e2e/integration test update

@heilhead marked this pull request as ready for review December 10, 2025 08:16
@mario-reown

I think I'm missing some context: why are we doing this? We should keep our code as simple as possible, and having multiple feature-gated options for exporting to parquet goes in the opposite direction.

@heilhead
Collaborator Author

As it turned out, serde_arrow has a couple of limitations which would complicate the migration. This PR offers an easier path to migrate from the native to the serde exporter. Eventually, we'd remove the native exporter, as it's somewhat cumbersome to use (to the point where we had to fork it to make it usable).

When migrating to the serde_arrow exporter, one needs to do the following:

  • Ensure the exported data schema hasn't changed.
  • Refactor some of the exported types so that they implement DeserializeOwned.

With the changes in this PR, both options are available, which makes the migration easier; a rough sketch of the second step is shown below.
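
For example, the second step typically boils down to something like this (hypothetical type; arcstr's serde feature assumed):

// Before the migration: fine for the native derive, but `&'static str`
// cannot satisfy DeserializeOwned, which the serde exporter requires.
#[derive(parquet_derive::ParquetRecordWriter)]
struct EventRowBefore {
    source: &'static str,
    count: i64,
}

// After the migration: the same logical type with ArcStr instead, which
// still covers static strings and lets the type derive both Serialize
// and Deserialize.
#[derive(serde::Serialize, serde::Deserialize)]
struct EventRowAfter {
    source: arcstr::ArcStr,
    count: i64,
}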

@mario-reown

Eventually, we'd remove the native exporter, as it's somewhat cumbersome to use (to the point where we had to fork it to make it usable)

Isn't the fork changing only 3 lines of code, and doesn't it only exist because we use Arc<str>? It's not great, but it doesn't make arrow-rs unusable.

arrow-rs is the official library from Apache and the people who made parquet and arrow. It is widely adopted and tested, and it is the recommended library for Rust.
serde_arrow looks like a hobby project at best.

In general, I'm against this kind of change unless there is a very good reason for it.

@heilhead
Collaborator Author

Isn't the fork changing only 3 lines of code, and doesn't it only exist because we use Arc<str>?

Yes. Supporting other types (e.g. Cow<'_, str>) is even more of a PITA with parquet and its derive macro.

It's not great, but it doesn't make arrow-rs unusable.

I'm not talking about arrow-rs in general, but specifically about the derive macros in parquet_derive. They implement only the bare minimum of serialization and don't even support many other native (std) types.

arrow-rs is the official library from Apache and the people who made parquet and arrow. It is widely adopted and tested, and it is the recommended library for Rust.
serde_arrow looks like a hobby project at best.

The serde_arrow implementation uses all of the same arrow-rs primitives to export data. The only differences are in how the parquet schema is generated for the exported types and in how the data is fed into the arrow-native RecordBatch.
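
For reference, a minimal sketch of that path, assuming serde_arrow's documented schema-tracing API and the stock parquet ArrowWriter (the row type and file name are made up, dependency/feature wiring is omitted, and errors are collapsed into Box<dyn Error>):

use std::{error::Error, fs::File};

use parquet::arrow::ArrowWriter;
use serde_arrow::schema::{SchemaLike, TracingOptions};

#[derive(serde::Serialize, serde::Deserialize)]
struct ExportRow {
    project_id: arcstr::ArcStr, // arcstr's serde feature assumed
    success: bool,
    created_at_ms: i64,
}

fn export(rows: &[ExportRow]) -> Result<(), Box<dyn Error>> {
    // serde_arrow traces the arrow schema from the type's serde
    // implementation (this is where DeserializeOwned comes into play)...
    let fields =
        Vec::<arrow::datatypes::FieldRef>::from_type::<ExportRow>(TracingOptions::default())?;

    // ...and feeds the serialized rows into an arrow-native RecordBatch.
    let batch = serde_arrow::to_record_batch(&fields, &rows)?;

    // From here on it's plain arrow-rs/parquet, the same machinery the
    // native RecordWriter path ends up writing through.
    let file = File::create("analytics.parquet")?;
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;

    Ok(())
}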

@xDarksome
Member

So the blocker for using native parquet is the lack of Arc<str> support?

Do we really need the struct on which we derive ParquetRecordWriter (I presume) to have those? Can't we just borrow &str and put that there, which is supported?

@heilhead
Collaborator Author

So the blocker for using native parquet is the lack of Arc<str> support?

Not strictly speaking a blocker, since we have a fork with Arc<str> support. Also, the upstream PR was approved (though merging it may not be easy), so I guess we don't really need anything else for the time being.

Do we really need the struct on which we derive ParquetRecordWriter (I presume) to have those? Can't we just borrow &str and put that there, which is supported?

Not unless we create another layer to borrow &str from owned values. The complication here is that parquet is written most efficiently when the data is batched together, and batching borrowed data that comes from different sources is always problematic, even setting aside that the actual export is a blocking operation which should not be executed on an async thread pool.
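
To make the batching point concrete, here is a rough sketch of the usual shape of such a pipeline, assuming a tokio runtime and an owned row type; everything below is hypothetical and simplified (errors, flush-on-shutdown and the actual parquet write are elided):

const BATCH_SIZE: usize = 1024;

struct ExportRow {
    project_id: arcstr::ArcStr, // owned/shared data, no borrows to keep alive
    created_at_ms: i64,
}

// Stand-in for the actual (blocking) parquet export.
fn export(_rows: &[ExportRow]) { /* ... */ }

async fn collector(mut rx: tokio::sync::mpsc::Receiver<ExportRow>) {
    let mut batch = Vec::with_capacity(BATCH_SIZE);

    while let Some(row) = rx.recv().await {
        batch.push(row);

        if batch.len() >= BATCH_SIZE {
            let rows = std::mem::replace(&mut batch, Vec::with_capacity(BATCH_SIZE));

            // The write itself is blocking, so it stays off the async pool.
            // With borrowed &str rows, every producer of `rows` would have to
            // outlive this point, which is what makes batching borrowed data hard.
            let _ = tokio::task::spawn_blocking(move || export(&rows));
        }
    }
}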

Anyhow, since there's no appetite for replacing ParquetRecordWriter with serde_arrow, I'm closing this PR. Maybe we'll come back to this later when we need to modify our data exports with some other unsupported stuff.

@heilhead closed this Dec 11, 2025
