Skip to content

Conversation

@JacksonYao287
Copy link
Contributor

there is a very corner case like following:

    T1: L1 is leader, push data to follower F1, F1 generate rreq1 and allocate blk for this data. let`s say the
    lsn of this data is lsn-100

    T2: leader switchs to L2, L2 send log to F1, F1 generate rreq2 and try to allocate blk but got
    no_space_left. then it will set the quiesce flag and waiting for the commit of lsn-99 (100 - 1).

    T3: leader switchs back to L1, L1 send log (lsn-100) to F1, when F1 tries to create rreq for lsn-100,  rreq1
    (which is created at T1) will be found since the rkey{server,term,dsn} is same as rreq1. so F1 will skip
    creating a new rreq. since the data for lsn-100 is already in written at T1, F1 start appending log entries
    to its log store, which will call logstore::append_log_entries and thus call localize_journal_entry_finish.

    T4: lsn-99 is committed at F1 and clear_chunk_req is called, so all the rreqs in F1 including rreq1 are
    cleared.

    T5: F1 call localize_journal_entry_finish, rreq1 is not found since it is cleared at T4, so F1 will try to
    create a new rreq for lsn-100. but now, F1 is in quiesce state, all the rreq creation will be rejected, so a
    nullptr will be returned. and cause RELEASE_ASSERT(rreq != nullptr) failure.

when we got no_space_left in log channel at follower, we set this latch_lsn, so if any lsn in this batch that is >= latch_lsn, the whole batch will be rejected. after we successfully handle no_space_left and call resume_accepting_reqs, latch_lsn will be reset to max value.

Copy link
Collaborator

@xiaoxichen xiaoxichen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general lgtm. The corner case explanation can goes into commit message.

@JacksonYao287 JacksonYao287 force-pushed the add-latch-lsn branch 5 times, most recently from 2c68776 to 10b9c93 Compare December 9, 2025 10:11
Copy link
Collaborator

@xiaoxichen xiaoxichen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also considering adding more logs before latch_lsn.store() for the sake of debugability (better to wrap it into a function ).

if (!get_pending_request_num()) break;
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
}
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt these lines can be removed as we still go through async mode for non-repl-dev cases(UT)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok , this is good suggestion, I will keep it here and add a comment to notify the later comers that if they want to use aysnc mode, they should take m_pending_callback into account

Comment on lines +2015 to +2033
// there is a very corner case like following:
// T1: L1 is leader, push data to follower F1, F1 generate rreq1 and allocate blk for this data. let`s say the
// lsn of this data is lsn-100

// T2: leader switchs to L2, L2 send log to F1, F1 generate rreq2 and try to allocate blk but got
// no_space_left. then it will set the quiesce flag and waiting for the commit of lsn-99 (100 - 1).

// T3: leader switchs back to L1, L1 send log (lsn-100) to F1, when F1 tries to create rreq for lsn-100, rreq1
// (which is created at T1) will be found since the rkey{server,term,dsn} is same as rreq1. so F1 will skip
// creating a new rreq. since the data for lsn-100 is already in written at T1, F1 start appending log entries
// to its log store, which will call logstore::append_log_entries and thus call localize_journal_entry_finish.

// T4: lsn-99 is committed at F1 and clear_chunk_req is called, so all the rreqs in F1 including rreq1 are
// cleared.

// T5: F1 call localize_journal_entry_finish, rreq1 is not found since it is cleared at T4, so F1 will try to
// create a new rreq for lsn-100. but now, F1 is in quiesce state, all the rreq creation will be rejected, so a
// nullptr will be returned. and cause RELEASE_ASSERT(rreq != nullptr) failure.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest move these lines into commit message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iet`s keep it here, and I will put it into commit message when merging

@codecov-commenter
Copy link

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 28.57143% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.65%. Comparing base (1a0cef8) to head (1cdebad).
⚠️ Report is 295 commits behind head on master.

Files with missing lines Patch % Lines
src/lib/replication/repl_dev/raft_repl_dev.cpp 29.41% 5 Missing and 7 partials ⚠️
src/include/homestore/replication/repl_dev.h 0.00% 2 Missing ⚠️
src/lib/logstore/log_dev.cpp 0.00% 0 Missing and 1 partial ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #839      +/-   ##
==========================================
- Coverage   56.51%   49.65%   -6.86%     
==========================================
  Files         108      110       +2     
  Lines       10300    11324    +1024     
  Branches     1402     5334    +3932     
==========================================
- Hits         5821     5623     -198     
+ Misses       3894     2082    -1812     
- Partials      585     3619    +3034     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@xiaoxichen xiaoxichen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@JacksonYao287 JacksonYao287 merged commit e03d114 into eBay:master Dec 15, 2025
40 of 41 checks passed
@JacksonYao287 JacksonYao287 deleted the add-latch-lsn branch December 15, 2025 11:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants