Conversation

@gguptp
Contributor

@gguptp gguptp commented Dec 12, 2025

Purpose of the change

[FLINK-37627][BugFix][Connectors/Kinesis] Restarting from a checkpoint/savepoint which coincides with shard split causes data loss

This PR updates the following PR: #198

Today Flink does not support distributed consistency of events sent from a subtask (Task Manager) to the coordinator (Job Manager): https://issues.apache.org/jira/browse/FLINK-28639. As a result, there is a race condition that can cause a shard and its child shards to stop being processed after a job restart.

  • A checkpoint starts
  • The enumerator takes its checkpoint (the shard is in the ASSIGNED state at this point)
  • The enumerator sends the checkpoint event to the reader
  • Before the reader takes its checkpoint, a SplitFinishedEvent occurs in the reader
  • The reader takes its checkpoint
  • Right after the checkpoint completes, the job restarts

This can cause a shard lineage to be lost: after the restart the shard is in the ASSIGNED state in the enumerator but is not part of any Task Manager's state, so nothing continues to process it.
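To make the sequence above concrete, here is a minimal, self-contained sketch that replays the timeline with plain Java collections. It does not use any Flink or connector classes; `ShardSplitRaceSketch`, `orphanedShardsAfterRestore` and the shard ID `shard-0` are hypothetical names chosen for illustration only.

```java
import java.util.HashSet;
import java.util.Set;

public class ShardSplitRaceSketch {

    /**
     * Replays the race timeline and returns the shards that the restored
     * enumerator believes are assigned but that no restored reader holds.
     * All state here is a simplified stand-in, not the connector's real state.
     */
    static Set<String> orphanedShardsAfterRestore() {
        Set<String> enumeratorAssigned = new HashSet<>();
        Set<String> readerActive = new HashSet<>();

        // Before the checkpoint: the shard is assigned to a reader.
        enumeratorAssigned.add("shard-0");
        readerActive.add("shard-0");

        // The enumerator takes its checkpoint: shard-0 is recorded as ASSIGNED.
        Set<String> enumeratorSnapshot = new HashSet<>(enumeratorAssigned);

        // Before the reader checkpoint, the reader finishes the shard.
        // The finished event is in flight and is not part of any snapshot.
        readerActive.remove("shard-0");

        // The reader takes its checkpoint: shard-0 is no longer in its state.
        Set<String> readerSnapshot = new HashSet<>(readerActive);

        // The job restarts from this checkpoint. Anything assigned in the
        // enumerator snapshot but absent from the reader snapshot is orphaned.
        Set<String> orphaned = new HashSet<>(enumeratorSnapshot);
        orphaned.removeAll(readerSnapshot);
        return orphaned;
    }

    public static void main(String[] args) {
        System.out.println("Orphaned shards: " + orphanedShardsAfterRestore());
    }
}
```

In this simplified model the orphaned set contains the finished shard, which corresponds to the situation where the enumerator never learns that the shard completed and therefore never hands out its children.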
This PR changes the behaviour by also checkpointing the finished-split events received between two checkpoints and replaying those events on restore.
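A rough sketch of that idea follows, assuming a reader that buffers the IDs of splits it finished since its last checkpoint, stores that buffer in its snapshot, and re-sends the buffered IDs to the enumerator on restore. This is not the PR's actual code: `Reader`, `Enumerator`, `ReaderState`, `finishSplit`, `onSplitFinished` and `stillAssignedAfterRestore` are hypothetical stand-ins written in plain Java without Flink dependencies.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FinishedSplitReplaySketch {

    /** Hypothetical reader checkpoint: active splits plus recently finished split IDs. */
    static final class ReaderState {
        final Set<String> activeSplits;
        final List<String> finishedSplits;

        ReaderState(Set<String> activeSplits, List<String> finishedSplits) {
            // Defensive copies so later reader changes do not alter the snapshot.
            this.activeSplits = new HashSet<>(activeSplits);
            this.finishedSplits = new ArrayList<>(finishedSplits);
        }
    }

    /** Hypothetical reader that remembers which splits finished since its last checkpoint. */
    static final class Reader {
        final Set<String> active = new HashSet<>();
        final List<String> finishedSinceLastCheckpoint = new ArrayList<>();

        void finishSplit(String splitId) {
            active.remove(splitId);
            finishedSinceLastCheckpoint.add(splitId);
        }

        ReaderState snapshot() {
            ReaderState state = new ReaderState(active, finishedSinceLastCheckpoint);
            // Simplification: the buffer is cleared as soon as it is snapshotted.
            finishedSinceLastCheckpoint.clear();
            return state;
        }
    }

    /** Hypothetical enumerator that only tracks which splits are assigned. */
    static final class Enumerator {
        final Set<String> assigned = new HashSet<>();

        // Removing an absent ID is a no-op, so replaying an event twice is harmless here.
        void onSplitFinished(String splitId) {
            assigned.remove(splitId);
        }
    }

    /** Runs the race timeline with replay on restore and returns what is still assigned. */
    static Set<String> stillAssignedAfterRestore() {
        Enumerator enumerator = new Enumerator();
        Reader reader = new Reader();
        enumerator.assigned.add("shard-0");
        reader.active.add("shard-0");

        // Enumerator checkpoint first, then the split finishes, then the reader checkpoint.
        Set<String> enumeratorSnapshot = new HashSet<>(enumerator.assigned);
        reader.finishSplit("shard-0");
        ReaderState readerSnapshot = reader.snapshot();

        // Restore both sides, then replay the checkpointed finished-split events.
        Enumerator restored = new Enumerator();
        restored.assigned.addAll(enumeratorSnapshot);
        for (String splitId : readerSnapshot.finishedSplits) {
            restored.onSplitFinished(splitId);
        }
        return restored.assigned;
    }

    public static void main(String[] args) {
        System.out.println("Still assigned after restore: " + stillAssignedAfterRestore());
    }
}
```

The sketch relies on the replayed event being safe to apply more than once, since the enumerator may already have seen it before the restart. It also leaves out everything the real enumerator would do when a split finishes, such as making the child shards available for assignment.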

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing

(Please pick either of the following options)

  • Added UTs
  • I manually verified this by running the connector on a local Flink cluster that was restarted every 10 minutes. No checkpoint inconsistency was observed.

Significant changes

(Please check any boxes [x] if the answer is "yes". You can first publish the PR and check them afterwards, for convenience.)

  • Dependencies have been added or upgraded
  • Public API has been changed (Public API is any class annotated with @Public(Evolving))
  • Serializers have been changed
  • New feature has been introduced
    • If yes, how is this documented? (not applicable / docs / JavaDocs / not documented)

@ferenc-csaky ferenc-csaky left a comment

Thanks for the quick answer; the code itself LGTM. However, I'd like to ask you to keep the original commit as it was in the older PR, with the original author, and put your change on top of it in a separate commit.

@gguptp
Contributor Author

gguptp commented Dec 14, 2025

Thanks @ferenc-csaky, I have made the requested changes and brought the original author's commits into this PR.

@ferenc-csaky

Closing this; I squashed it into the original.
