Conversation


@bdeggleston bdeggleston requested a review from aweisberg December 2, 2025 18:06
@aweisberg aweisberg left a comment:

Still reviewing, just noticed this.

@aweisberg aweisberg self-requested a review December 2, 2025 22:14
@aweisberg aweisberg left a comment:

Finished review of the implementation code. Still need to finish reviewing the test code.

For range reads you 100% need to adapt InteropTokenRangeTest to test this splitting code. There are a bunch of nasty corner cases when dealing with range reads and what ends up being generated as the bounds of the read command, and that test covers most of them.

I also think that property-based tests would be better for the things that are unit testable, and maybe even dtests. It would be nice to create a partitioner that has a token range of, say, -5 to +5 but is otherwise identical to a real partitioner and hashes integer values to themselves. Then property tests could enumerate the entire state space very quickly.
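A rough, purely illustrative sketch of that idea (this is not Cassandra's IPartitioner, and all names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy partitioner over tokens -5..+5 that "hashes" an
// integer key to itself, so a property-style test can enumerate every
// migrated-range / query-range combination exhaustively.
public class TinyPartitionerSketch
{
    static final int MIN_TOKEN = -5;
    static final int MAX_TOKEN = 5;

    // Identity "hash": test data maps directly onto tokens.
    static int token(int key)
    {
        assert key >= MIN_TOKEN && key <= MAX_TOKEN : "toy token space is -5..+5";
        return key;
    }

    // Every (left, right] pair in the toy token space.
    static List<int[]> allRanges()
    {
        List<int[]> ranges = new ArrayList<>();
        for (int left = MIN_TOKEN; left <= MAX_TOKEN; left++)
            for (int right = MIN_TOKEN; right <= MAX_TOKEN; right++)
                ranges.add(new int[]{ left, right });
        return ranges;
    }

    public static void main(String[] args)
    {
        // A property test would feed each migrated-range / query-range pair to the
        // splitting code and assert the resulting commands cover exactly the queried
        // tokens, with no overlap between the tracked and untracked parts.
        for (int[] migrated : allRanges())
            for (int[] query : allRanges())
                System.out.printf("migrated=(%d,%d] query=(%d,%d]%n",
                                  migrated[0], migrated[1], query[0], query[1]);
    }
}
```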

Review thread on this hunk (old line replaced by new):

    public Future<?> applyFuture(Mutation mutation, boolean writeCommitLog, boolean updateIndexes)
    {
    -   return getMetadata().useMutationTracking()
    +   return MigrationRouter.isFullyTracked(mutation)
@aweisberg (Contributor):

If it's partially tracked and we're migrating to mutation tracking, and we don't apply the partially tracked portion as tracked, then it forces background reconciliation to reconcile all the mutations? I don't get how it makes sense to bypass the mutation tracking plumbing during migration.

I thought that writes would flow through the mutation tracking system, background reconciliation would occur, and it would be fine even if it does extra/redundant work with read repair.

Basically I think for writes there isn't a range-based form of "is it tracked"; that is only relevant for reads, which can't be correct without relying on range-based repairs. Just throw it into the mutation tracking system and call it good?

@bdeggleston (Member Author):

We don't bypass mutation tracking plumbing during migration. The use of MigrationRouter.isFullyTracked(mutation) here is making the assumption that the mutation routing logic did its job properly higher up the stack and has already produced mutations that are either fully tracked or fully untracked. Probably the right thing to do here is confirm that the mutation is fully one or the other and select the correct path accordingly.
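A minimal sketch of that confirmation, assuming a hypothetical MigrationRouter.isFullyUntracked helper next to the isFullyTracked call already in the diff (names and wiring are illustrative, not the actual change):

```java
// Sketch of the replica-side apply path; isFullyUntracked is a hypothetical
// helper, and the other calls mirror the shapes visible elsewhere in this diff.
public Future<?> applyFuture(Mutation mutation, boolean writeCommitLog, boolean updateIndexes)
{
    if (MigrationRouter.isFullyTracked(mutation))
        return applyInternalTracked(mutation, new AsyncPromise<>());

    if (MigrationRouter.isFullyUntracked(mutation)) // hypothetical helper
        return applyInternal(mutation, writeCommitLog, updateIndexes, true, true, new AsyncPromise<>());

    // Routing higher up the stack should never hand a replica a mixed mutation;
    // fail loudly rather than silently picking one path.
    throw new IllegalStateException("Mutation spans tracked and untracked ranges: " + mutation);
}
```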

@aweisberg (Contributor):

OK, so the splitting is supposed to happen at the coordinator, but that is racy since things could change when we get here? Then we surface an error to the client?

StorageProxy.mutateWithTracking is also an unused method now.

Review thread on this hunk:

        .map(table -> table.id)
        .collect(Collectors.toList());

    migrationState = migrationState.withKeyspaceMigrating(diff.after.name, tableIds, nextEpoch);
@aweisberg (Contributor):

A little sneaky that this is a reversal on whatever is already there.

@aweisberg (Contributor):

Maybe a comment or a name change to reflect the fact that it does a reversal?

@aweisberg aweisberg left a comment:

Also noticed that the check to see if the repair is in an epoch that can affect the ranges seems to be missing in MutationTrackingMigrationState.

@aweisberg aweisberg left a comment:

I don't see a happens-before edge for repair such that, when repair starts and snapshots tables, it is guaranteed there are no in-flight non-tracked mutations that might not be included in the repair.

The way to ensure this would be for nodes to refuse to apply mutations if they think the write should have been tracked, and for repair to then wait out in-flight writes after it has updated cluster metadata at each node as part of prepare.

WDYT?

@bdeggleston (Member Author):

> I don't see a happens-before edge for repair such that, when repair starts and snapshots tables, it is guaranteed there are no in-flight non-tracked mutations that might not be included in the repair.
>
> The way to ensure this would be for nodes to refuse to apply mutations if they think the write should have been tracked, and for repair to then wait out in-flight writes after it has updated cluster metadata at each node as part of prepare.
>
> WDYT?

I added a "wait for epoch" in the prepare phase handler for repair, which should serve that purpose. A repair starts, it waits for the epoch, and then no more untracked writes will be coordinated by it by the time it replies.

@bdeggleston (Member Author):

Can you go into more detail about what you’d expect to see for the interop token range test and the property tests?

In the case of the interop token range tests, there’s not really an analog for interop reads. Reads are either tracked or untracked. Untracked reads are well tested, and tracked testing / hardening is a separate issue that’s already being worked on. The interop tests don’t check reads in partially migrated clusters afaict.

In the case of property tests, we’re mainly checking a handful of state changes driven by schema changes and repair here. We’re not dealing with the sort of multiple intersecting state changes and parameter spaces that property tests are good at uncovering bugs in. I could see value in property based simulator testing of reads and writes and state changes, but I don’t think that’s possible without a lot of simulator work.

Review thread on this hunk:

                return applyInternal(mutation, writeCommitLog, updateIndexes, true, true, new AsyncPromise<>());
            case TRACKED:
                return applyInternalTracked(mutation, new AsyncPromise<>());
            case MIXED:
@aweisberg (Contributor):

This looks quite problematic because it's going to happen when migrating away from mutation tracking, and then your unlogged batches stop working?

@bdeggleston (Member Author):

Once we've gotten here we expect the logic higher up the call stack to have properly separated the mutation into tracked and untracked. That said, there is a race where things may have changed during the processing of the mutation. However, not doing this creates a possible (albeit unlikely) race where we apply an untracked mutation to a tracked keyspace, which will create a read monotonicity correctness issue.


Review thread on this hunk:

    metadata = ClusterMetadata.current();
    KeyspaceMigrationInfo migrationInfo = metadata.mutationTrackingMigrationState.getKeyspaceInfo(ksName);
    assertNotNull(migrationInfo);
@aweisberg (Contributor):

This should really assert it is as expected (tables, ranges, epoch, etc.) to make sure nothing is mangled from AlterSchema. I know a lot of this is already tested in KeyspaceMigrationInfoTest, but this does the end-to-end part with AlterSchema.
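For example, something along these lines, where tables(), ranges(), and migrationEpoch() are hypothetical accessor names standing in for whatever KeyspaceMigrationInfo actually exposes:

```java
// Illustrative assertions only; the accessor names are placeholders.
metadata = ClusterMetadata.current();
KeyspaceMigrationInfo migrationInfo = metadata.mutationTrackingMigrationState.getKeyspaceInfo(ksName);
assertNotNull(migrationInfo);
assertEquals(expectedTableIds, migrationInfo.tables());      // every table in the keyspace is covered
assertEquals(expectedRanges, migrationInfo.ranges());        // ranges not mangled by AlterSchema
assertEquals(expectedEpoch, migrationInfo.migrationEpoch()); // epoch recorded at the right point
```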

@aweisberg aweisberg commented Dec 11, 2025:

RE: InteropTokenRangeTest and testing splitting of range reads
The takeaway from InteropTokenRangeTest is that you test queries with those token bounds and GT/GTE/LT/LTE to make sure that your migration code correctly splits range reads across tracked/untracked boundaries. There are just a ton of edge cases there. The thing you are testing here is the splitting, and that the commands it outputs are valid and actually return the data that is supposed to be in the read.

InteropTokenRangeTest doesn't demonstrate different permutations of migration, but there are only a small number of tokens, so you can enumerate and test all range migration combinations.

> I added a "wait for epoch" in the prepare phase handler for repair, which should serve that purpose. A repair starts, it waits for the epoch, and then no more untracked writes will be coordinated by it by the time it replies.

That tells you that the epoch has been received, but it doesn't tell you anything about in-flight writes that haven't completed yet? For that you would do the op order thing, where you have the node update to the epoch and then create an op order to make sure all in-flight writes that don't use that epoch are retired.
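A rough sketch of that pattern using Cassandra's OpOrder; the write-path and prepare-handler wiring here is illustrative, not the actual patch:

```java
import org.apache.cassandra.utils.concurrent.OpOrder;

// Illustrative only: shows the barrier pattern described above, not the real
// write path or repair prepare handler.
public class EpochWriteBarrierSketch
{
    private final OpOrder writeOrder = new OpOrder();

    // Write path: every mutation application runs inside an OpOrder group.
    public void applySketch(Runnable applyMutation)
    {
        OpOrder.Group op = writeOrder.start();
        try
        {
            applyMutation.run();
        }
        finally
        {
            op.close();
        }
    }

    // Repair prepare: after the node has caught up to the new epoch, issue a
    // barrier and wait for every write started before it to retire. Writes
    // started afterwards already see the new epoch and get routed as tracked.
    public void awaitInFlightWrites()
    {
        OpOrder.Barrier barrier = writeOrder.newBarrier();
        barrier.issue();
        barrier.await();
    }
}
```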

There is still the separate issue that this can cause writes to fail and the errors to get surfaced to the client, which is not ideal.

RE: Hints
Right now hints will get stuck for all keyspaces if you enable mutation tracking and there are hints for a keyspace with mutation tracking. That seems problematic. Hints are host-oriented and contain a log of hints for multiple keyspaces. There doesn't seem to be a graceful cleanup of hints for keyspaces that have mutation tracking enabled so that hint delivery for other keyspaces can continue.
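A heavily hedged sketch of one possible shape for that cleanup; the dispatch-time hook and the assumption that fully tracked keyspaces no longer want hinted handoff are not established by this PR:

```java
// Hypothetical filter applied while dispatching a host's hint log: hints whose
// mutation now targets a fully tracked keyspace are skipped (background
// reconciliation is assumed to cover them), so hints for other keyspaces in
// the same log keep delivering instead of getting stuck behind them.
private boolean shouldSkipDuringDispatch(Hint hint)
{
    return MigrationRouter.isFullyTracked(hint.mutation);
}
```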
