Conversation

@lukel97 (Contributor) commented Dec 16, 2025
If we have an early exit loop with non-dereferenceable loads after the exit, we currently bail:

int z;
for (int i = 0; i < N; i++) {
  if (x[i])
    break;
  z = y[i];
}

If the early exit block dominates the block containing these loads, we can predicate them with a mask like

for (int i = 0; i < N/VF; i++) {
  c[0..VF] = x[i..i+VF]
  z[0..VF] = y[i..i+VF], mask=(icmp ult step-vector, (first-active-lane c))
  if (anyof(c))
    break;
}
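
To make the mask concrete, here is a scalar model of how the predicate is derived. This is purely illustrative: earlyExitMask, C, and Mask are stand-in names, not VPlan entities.

#include <vector>

// Scalar model of the mask above: a lane stays active only if it is
// strictly before the first lane whose exit condition fired, i.e.
// icmp ult step-vector, (first-active-lane c).
std::vector<bool> earlyExitMask(const std::vector<bool> &C) {
  size_t FirstActiveLane = C.size(); // all lanes stay active if none exited
  for (size_t Lane = 0; Lane < C.size(); ++Lane)
    if (C[Lane]) {
      FirstActiveLane = Lane;
      break;
    }
  std::vector<bool> Mask(C.size());
  for (size_t Lane = 0; Lane < C.size(); ++Lane)
    Mask[Lane] = Lane < FirstActiveLane; // step-vector lane < first-active-lane
  return Mask;
}

For example, with VF=4 and c = {0,0,1,0}, first-active-lane is 2, so the mask is {1,1,0,0} and only the first two lanes execute the load of y.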

VPlanPredicator can handle predicating these blocks, but in tryToBuildVPlanWithVPRecipes we first disconnect all early exits before the masks are introduced:

// entry -> exiting -> ... -> latch
//             |
//             +-----> earlyexit
VPlanTransforms::handleEarlyExits(*Plan);
// entry -> exiting -> ... -> latch
VPlanTransforms::introduceMasksAndLinearize(*Plan);

This is needed to keep the region single entry/single exit, but it also means that there isn't any control flow by the time we want to add the masks:

exiting:
  %earlyexitcond = ...
// one successor (latch)
latch:
  %exitcond = or (anyof %earlyexitcond), %origexitcond
  br %exitcond, entry, exit

This patch adds a new EarlyExit VPInstruction to tell VPlanPredicator that the successors should be predicated, even though there is no longer any actual control flow:

exiting:
  %earlyexitcond = ...
  earlyexit (icmp ult step-vector, (first-active-lane %earlyexitcond))
// one successor (latch)
latch:
  %exitcond = or (anyof %earlyexitcond), %origexitcond
  br %exitcond, entry, exit

It's just a placeholder and is removed as soon as VPlanPredicator encounters it, but it lets the predicator compute the correct mask.

This paves the way for supporting more types of loops, since we could also support stores, divs, etc., as long as the exiting block dominates them. See the note in canUncountableExitConditionLoadBeMoved for why we can't predicate stores when they come before the exiting block.
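
As an illustration of that restriction (my own sketch of the loop shape as I understand the note, not taken from the patch): when the store precedes the exit check, forming the mask requires loading the whole exit-condition vector first, i.e. speculating the load of x above the store, which is unsafe if the two may alias.

// Hypothetical loop shape the note covers: y[i] is written before the
// exit condition is known, so the load of x[i..i+VF] would have to be
// hoisted above the store to compute the mask -- invalid if x and y
// may alias.
void store_before_exit(char *x, char *y, int n) {
  for (int i = 0; i < n; i++) {
    y[i] = 0;
    if (x[i])
      break;
  }
}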

However, the main motivation for this is to allow us to support tail folding with early exits, which I believe will be needed to make supporting fault-only-first loads simpler: #151300

In order to actually test these changes, this PR also allows non-dereferenceable loads that are properly dominated by the exiting block in LoopVectorizationLegality. In practice, though, something else usually transforms these loops to be multiple-entry, which prevents them from being vectorized.

Stacked on #172343 so we can define first-active-lane in the loop region; it is included in this PR.

If a recipe in the vector loop region doesn't have any users inside the region, it can be sunk to the nearest common dominator of its users, as sketched below.
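
A minimal sketch of the sinking rule under stated assumptions: Block, Use, and DomTree are stand-in types, and findNearestCommonDominator mirrors the VPDominatorTree call used in the patch.

#include <vector>

struct Block;
struct DomTree {
  Block *findNearestCommonDominator(Block *A, Block *B);
};
// One user of the recipe: the block it lives in, and whether that
// block is still inside the vector loop region.
struct Use {
  Block *Parent;
  bool InLoopRegion;
};

// A recipe with no in-region users can move to the nearest common
// dominator of all the blocks that use it.
Block *chooseSinkTarget(const std::vector<Use> &Users, DomTree &DT) {
  Block *Target = nullptr;
  for (const Use &U : Users) {
    if (U.InLoopRegion)
      return nullptr; // any user inside the loop region blocks sinking
    Target = Target ? DT.findNearestCommonDominator(Target, U.Parent)
                    : U.Parent;
  }
  return Target; // nullptr also when the recipe has no users at all
}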

This gives slightly more accurate cost modelling, and on RISC-V seems to cause some loops to be vectorized with a higher VF.

The underlying motivation for this is to simplify an upcoming patch by avoiding the need to manually sink a recipe to the middle block.

In RISCV/divrem.ll and X86/cost-conditional-branches.ll I added a store to the original loop to prevent the recipes from being completely sunk, so the tests still cover their original intent.

@llvmbot (Member) commented Dec 16, 2025

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-backend-risc-v

@llvm/pr-subscribers-vectorizers

Author: Luke Lau (lukel97)

Patch is 240.45 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/172454.diff

51 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp (+9)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+4)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanPredicator.cpp (+11)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+6)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+41-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll (+14-14)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/invariant-replicate-region.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-dot-product-neon.ll (+38-38)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-dot-product.ll (+33-33)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/divrem.ll (+37-25)
  • (added) llvm/test/Transforms/LoopVectorize/RISCV/early-exit.ll (+115)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/pointer-induction.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/cost-conditional-branches.ll (+54-260)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/drop-inbounds-flags-for-reverse-vector-pointer.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/induction-costs.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/invariant-load-gather.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/pr141968-instsimplifyfolder.ll (+15-95)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/pr72969.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/replicate-recipe-with-only-first-lane-used.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/replicating-load-store-costs.ll (+17-25)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/tail_loop_folding.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-dead-instructions.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-scalable-vf1.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-sink-replicate-region.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-with-uniform-ops.ll (+9-9)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll (+37-41)
  • (modified) llvm/test/Transforms/LoopVectorize/instruction-only-used-outside-of-loop.ll (+5-5)
  • (modified) llvm/test/Transforms/LoopVectorize/iv-select-cmp-decreasing.ll (+8-8)
  • (modified) llvm/test/Transforms/LoopVectorize/iv_outside_user.ll (+17-19)
  • (modified) llvm/test/Transforms/LoopVectorize/lcssa-crashes.ll (+8-8)
  • (modified) llvm/test/Transforms/LoopVectorize/no_outside_user.ll (+45-45)
  • (modified) llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization-liveout.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/optsize.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopVectorize/pointer-induction.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopVectorize/pr36983-multiple-lcssa.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/pr43166-fold-tail-by-masking.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/pr51614-fold-tail-by-masking.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/pr55167-fold-tail-live-out.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/pr66616.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/reduction-minmax-users-and-predicated.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/reduction-order.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopVectorize/reduction-small-size.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/scalable-first-order-recurrence.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/select-reduction.ll (+5-5)
  • (modified) llvm/test/Transforms/LoopVectorize/single_early_exit.ll (+111)
  • (modified) llvm/test/Transforms/LoopVectorize/single_early_exit_live_outs.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/store-reduction-results-in-tail-folded-loop.ll (+5-5)
  • (modified) llvm/test/Transforms/LoopVectorize/tail-folding-vectorization-factor-1.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing-reductions.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing.ll (+1-1)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
index 5238a5d7d7c24..67ccb094f292e 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -25,6 +25,7 @@
 #include "llvm/Analysis/TargetTransformInfo.h"
 #include "llvm/Analysis/ValueTracking.h"
 #include "llvm/Analysis/VectorUtils.h"
+#include "llvm/IR/Dominators.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/PatternMatch.h"
 #include "llvm/Transforms/Utils/SizeOpts.h"
@@ -1852,6 +1853,9 @@ bool LoopVectorizationLegality::isVectorizableEarlyExitLoop() {
 
   // Check non-dereferenceable loads if any.
   for (LoadInst *LI : NonDerefLoads) {
+    // Occurs after the early exit, so we can predicate it.
+    if (DT->properlyDominates(SingleUncountableExitingBlock, LI->getParent()))
+      continue;
     // Only support unit-stride access for now.
     int Stride = isConsecutivePtr(LI->getType(), LI->getPointerOperand());
     if (Stride != 1) {
@@ -2050,6 +2054,11 @@ bool LoopVectorizationLegality::canVectorize(bool UseVPlanNativePath) {
         else
           return false;
       }
+      // isVectorizableEarlyExitLoop will have predicated some instructions when
+      // they previously weren't. Call canVectorizeWithIfConvert again to
+      // repopulate MaskedOp with any new instructions.
+      if (!canVectorizeWithIfConvert())
+        return false;
     }
   }
 
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 7023a5d4af176..7992fb836054b 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1130,6 +1130,9 @@ class LLVM_ABI_FOR_TEST VPInstruction : public VPRecipeWithIRFlags,
     ResumeForEpilogue,
     /// Returns the value for vscale.
     VScale,
+    /// Used to tell VPlanPredicator when to predicate early exit control
+    /// flow. Removed during introduceMasksAndLinearize.
+    EarlyExit,
     OpsEnd = VScale,
   };
 
@@ -1211,6 +1214,7 @@ class LLVM_ABI_FOR_TEST VPInstruction : public VPRecipeWithIRFlags,
     case Instruction::AtomicRMW:
     case VPInstruction::BranchOnCond:
     case VPInstruction::BranchOnCount:
+    case VPInstruction::EarlyExit:
       return false;
     default:
       return true;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanPredicator.cpp b/llvm/lib/Transforms/Vectorize/VPlanPredicator.cpp
index f7e7fc29bc203..fee4f5123b0bb 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanPredicator.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanPredicator.cpp
@@ -97,6 +97,17 @@ VPValue *VPPredicator::createEdgeMask(VPBasicBlock *Src, VPBasicBlock *Dst) {
 
   VPValue *SrcMask = getBlockInMask(Src);
 
+  // If we see an EarlyExit, predicate all successors with its mask.
+  if (!Src->empty() &&
+      match(&Src->back(),
+            m_VPInstruction<VPInstruction::EarlyExit>(m_VPValue(EdgeMask)))) {
+    assert(Src->getSingleSuccessor() && "EarlyExit has multiple successors?");
+    if (SrcMask)
+      EdgeMask = Builder.createLogicalAnd(SrcMask, EdgeMask);
+    Src->back().eraseFromParent();
+    return setEdgeMask(Src, Dst, EdgeMask);
+  }
+
   // If there's a single successor, there's no terminator recipe.
   if (Src->getNumSuccessors() == 1)
     return setEdgeMask(Src, Dst, SrcMask);
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index fb7aaec4e93a9..862042b5b5c98 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -433,6 +433,7 @@ unsigned VPInstruction::getNumOperandsForOpcode(unsigned Opcode) {
   case Instruction::Freeze:
   case Instruction::Load:
   case VPInstruction::BranchOnCond:
+  case VPInstruction::EarlyExit:
   case VPInstruction::Broadcast:
   case VPInstruction::BuildStructVector:
   case VPInstruction::BuildVector:
@@ -498,6 +499,7 @@ bool VPInstruction::canGenerateScalarForFirstLane() const {
   case Instruction::Select:
   case VPInstruction::BranchOnCond:
   case VPInstruction::BranchOnCount:
+  case VPInstruction::EarlyExit:
   case VPInstruction::CalculateTripCountMinusVF:
   case VPInstruction::CanonicalIVIncrementForPart:
   case VPInstruction::PtrAdd:
@@ -1184,6 +1186,7 @@ bool VPInstruction::opcodeMayReadOrWriteFromMemory() const {
   case VPInstruction::AnyOf:
   case VPInstruction::BranchOnCond:
   case VPInstruction::BranchOnCount:
+  case VPInstruction::EarlyExit:
   case VPInstruction::Broadcast:
   case VPInstruction::BuildStructVector:
   case VPInstruction::BuildVector:
@@ -1328,6 +1331,9 @@ void VPInstruction::printRecipe(raw_ostream &O, const Twine &Indent,
   case VPInstruction::BranchOnCount:
     O << "branch-on-count";
     break;
+  case VPInstruction::EarlyExit:
+    O << "early-exit";
+    break;
   case VPInstruction::Broadcast:
     O << "broadcast";
     break;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 9c4f4246deb41..58621fb595f1d 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -2465,6 +2465,33 @@ static void licm(VPlan &Plan) {
       R.moveBefore(*Preheader, Preheader->end());
     }
   }
+
+  // Sink any recipes which don't have any users in the region to the nearest
+  // common dominator of its users.
+  VPDominatorTree VPDT(Plan);
+  for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(
+           vp_post_order_shallow(LoopRegion->getEntry()))) {
+    for (VPRecipeBase &R : make_early_inc_range(reverse(*VPBB))) {
+      if (cannotHoistOrSinkRecipe(R))
+        continue;
+      SmallSetVector<VPBasicBlock *, 4> UserVPBBs;
+      if (any_of(R.definedValues(), [&UserVPBBs](VPValue *V) {
+            return any_of(V->users(), [&UserVPBBs](VPUser *U) {
+              auto *UR = cast<VPRecipeBase>(U);
+              UserVPBBs.insert(UR->getParent());
+              return UR->getParent()->getEnclosingLoopRegion();
+            });
+          }))
+        continue;
+      if (UserVPBBs.empty())
+        continue;
+      VPBasicBlock *SinkVPBB = UserVPBBs.front();
+      for (auto *UserVPBB : drop_begin(UserVPBBs))
+        SinkVPBB = cast<VPBasicBlock>(
+            VPDT.findNearestCommonDominator(SinkVPBB, UserVPBB));
+      R.moveBefore(*SinkVPBB, SinkVPBB->begin());
+    }
+  }
 }
 
 void VPlanTransforms::truncateToMinimalBitwidths(
@@ -3796,7 +3823,7 @@ void VPlanTransforms::handleUncountableEarlyExit(VPBasicBlock *EarlyExitingVPBB,
       cast<VPIRPhi>(&R)->swapOperands();
   }
 
-  VPBuilder Builder(LatchVPBB->getTerminator());
+  VPBuilder Builder(EarlyExitingVPBB->getTerminator());
   VPBlockBase *TrueSucc = EarlyExitingVPBB->getSuccessors()[0];
   assert(match(EarlyExitingVPBB->getTerminator(), m_BranchOnCond()) &&
          "Terminator must be be BranchOnCond");
@@ -3806,6 +3833,19 @@ void VPlanTransforms::handleUncountableEarlyExit(VPBasicBlock *EarlyExitingVPBB,
                               ? CondOfEarlyExitingVPBB
                               : Builder.createNot(CondOfEarlyExitingVPBB);
 
+  // Create a EarlyExit instruction so VPlanPredicator knows to mask off any
+  // "exited" lanes in successor blocks.
+  VPValue *FirstActiveLane =
+      Builder.createNaryOp(VPInstruction::FirstActiveLane, {CondToEarlyExit},
+                           DebugLoc::getUnknown(), "first.active.lane");
+  VPValue *SuccMask = Builder.createICmp(
+      CmpInst::ICMP_ULT,
+      Builder.createNaryOp(VPInstruction::StepVector, {},
+                           Type::getInt64Ty(Plan.getContext())),
+      FirstActiveLane);
+  Builder.createNaryOp(VPInstruction::EarlyExit, SuccMask);
+
+  Builder.setInsertPoint(LatchVPBB->getTerminator());
   // Split the middle block and have it conditionally branch to the early exit
   // block if CondToEarlyExit.
   VPValue *IsEarlyExitTaken =
@@ -3836,9 +3876,6 @@ void VPlanTransforms::handleUncountableEarlyExit(VPBasicBlock *EarlyExitingVPBB,
     VPValue *IncomingFromEarlyExit = ExitIRI->getOperand(EarlyExitIdx);
     if (!IncomingFromEarlyExit->isLiveIn()) {
       // Update the incoming value from the early exit.
-      VPValue *FirstActiveLane = EarlyExitB.createNaryOp(
-          VPInstruction::FirstActiveLane, {CondToEarlyExit},
-          DebugLoc::getUnknown(), "first.active.lane");
       IncomingFromEarlyExit = EarlyExitB.createNaryOp(
           VPInstruction::ExtractLane, {FirstActiveLane, IncomingFromEarlyExit},
           DebugLoc::getUnknown(), "early.exit.value");
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll
index 40db6a53b49e4..46bb50dfd9799 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll
@@ -94,10 +94,10 @@ define i64 @pointer_induction_only(ptr %start, ptr %end) {
 ; CHECK-NEXT:    [[TMP0:%.*]] = sub i64 [[END1]], [[START2]]
 ; CHECK-NEXT:    [[TMP1:%.*]] = lshr i64 [[TMP0]], 2
 ; CHECK-NEXT:    [[TMP2:%.*]] = add nuw nsw i64 [[TMP1]], 1
-; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP2]], 8
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP2]], 16
 ; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK:       vector.ph:
-; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 4
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 8
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP2]], [[N_MOD_VF]]
 ; CHECK-NEXT:    [[TMP3:%.*]] = mul i64 [[N_VEC]], 4
 ; CHECK-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[START]], i64 [[TMP3]]
@@ -106,15 +106,15 @@ define i64 @pointer_induction_only(ptr %start, ptr %end) {
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 4
 ; CHECK-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[START]], i64 [[OFFSET_IDX]]
-; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr i32, ptr [[NEXT_GEP]], i64 2
-; CHECK-NEXT:    [[WIDE_LOAD4:%.*]] = load <2 x i32>, ptr [[TMP7]], align 1
-; CHECK-NEXT:    [[TMP9:%.*]] = zext <2 x i32> [[WIDE_LOAD4]] to <2 x i64>
-; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
+; CHECK-NEXT:    [[TMP5:%.*]] = getelementptr i32, ptr [[NEXT_GEP]], i64 4
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP5]], align 1
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
 ; CHECK-NEXT:    [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-NEXT:    br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
 ; CHECK:       middle.block:
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <2 x i64> [[TMP9]], i32 1
-; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <2 x i64> [[TMP9]], i32 0
+; CHECK-NEXT:    [[TMP7:%.*]] = zext <4 x i32> [[WIDE_LOAD]] to <4 x i64>
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i64> [[TMP7]], i32 3
+; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i64> [[TMP7]], i32 2
 ; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
 ; CHECK-NEXT:    br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
 ; CHECK:       scalar.ph:
@@ -163,11 +163,11 @@ define i64 @int_and_pointer_iv(ptr %start, i32 %N) {
 ; CHECK-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[START]], i64 [[OFFSET_IDX]]
 ; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr i32, ptr [[NEXT_GEP]], i64 4
 ; CHECK-NEXT:    [[WIDE_LOAD3:%.*]] = load <4 x i32>, ptr [[TMP3]], align 4
-; CHECK-NEXT:    [[TMP5:%.*]] = zext <4 x i32> [[WIDE_LOAD3]] to <4 x i64>
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
 ; CHECK-NEXT:    [[TMP8:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1000
 ; CHECK-NEXT:    br i1 [[TMP8]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
 ; CHECK:       middle.block:
+; CHECK-NEXT:    [[TMP5:%.*]] = zext <4 x i32> [[WIDE_LOAD3]] to <4 x i64>
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i64> [[TMP5]], i32 2
 ; CHECK-NEXT:    br label [[LOOP:%.*]]
 ; CHECK:       exit:
@@ -271,7 +271,7 @@ define i64 @test_ptr_ivs_and_widened_ivs(ptr %src, i32 %N) {
 ; CHECK-NEXT:    [[TMP0:%.*]] = add i32 [[N]], -1
 ; CHECK-NEXT:    [[TMP1:%.*]] = zext i32 [[TMP0]] to i64
 ; CHECK-NEXT:    [[TMP2:%.*]] = add nuw nsw i64 [[TMP1]], 1
-; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP2]], 8
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP2]], 16
 ; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK:       vector.ph:
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], 8
@@ -289,15 +289,15 @@ define i64 @test_ptr_ivs_and_widened_ivs(ptr %src, i32 %N) {
 ; CHECK-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[SRC]], i64 [[OFFSET_IDX]]
 ; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr i32, ptr [[NEXT_GEP]], i64 4
 ; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP6]], align 4
-; CHECK-NEXT:    [[TMP7:%.*]] = xor <4 x i32> [[WIDE_LOAD]], splat (i32 1)
-; CHECK-NEXT:    [[TMP8:%.*]] = zext <4 x i32> [[TMP7]] to <4 x i64>
-; CHECK-NEXT:    [[TMP9:%.*]] = zext <4 x i32> [[STEP_ADD]] to <4 x i64>
-; CHECK-NEXT:    [[TMP10:%.*]] = shl <4 x i64> [[TMP8]], [[TMP9]]
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
 ; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <4 x i32> [[STEP_ADD]], splat (i32 4)
 ; CHECK-NEXT:    [[TMP11:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
 ; CHECK-NEXT:    br i1 [[TMP11]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP11:![0-9]+]]
 ; CHECK:       middle.block:
+; CHECK-NEXT:    [[TMP9:%.*]] = xor <4 x i32> [[WIDE_LOAD]], splat (i32 1)
+; CHECK-NEXT:    [[TMP12:%.*]] = zext <4 x i32> [[TMP9]] to <4 x i64>
+; CHECK-NEXT:    [[TMP13:%.*]] = zext <4 x i32> [[STEP_ADD]] to <4 x i64>
+; CHECK-NEXT:    [[TMP10:%.*]] = shl <4 x i64> [[TMP12]], [[TMP13]]
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i64> [[TMP10]], i32 3
 ; CHECK-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i64> [[TMP10]], i32 2
 ; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/invariant-replicate-region.ll b/llvm/test/Transforms/LoopVectorize/AArch64/invariant-replicate-region.ll
index 9dfb987bd24a6..3951b88e41d55 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/invariant-replicate-region.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/invariant-replicate-region.ll
@@ -42,11 +42,11 @@ define i32 @test_invariant_replicate_region(i32 %x, i1 %c) {
 ; CHECK-NEXT:    br label %[[PRED_UREM_CONTINUE6]]
 ; CHECK:       [[PRED_UREM_CONTINUE6]]:
 ; CHECK-NEXT:    [[TMP12:%.*]] = phi <4 x i32> [ [[TMP11]], %[[PRED_UREM_CONTINUE4]] ], [ [[TMP14]], %[[PRED_UREM_IF5]] ]
-; CHECK-NEXT:    [[PREDPHI:%.*]] = select i1 [[C]], <4 x i32> [[TMP12]], <4 x i32> zeroinitializer
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
 ; CHECK-NEXT:    [[TMP16:%.*]] = icmp eq i32 [[INDEX_NEXT]], 100
 ; CHECK-NEXT:    br i1 [[TMP16]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[PREDPHI:%.*]] = select i1 [[C]], <4 x i32> [[TMP12]], <4 x i32> zeroinitializer
 ; CHECK-NEXT:    [[TMP17:%.*]] = extractelement <4 x i32> [[PREDPHI]], i32 3
 ; CHECK-NEXT:    br label %[[EXIT:.*]]
 ; CHECK:       [[EXIT]]:
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-dot-product-neon.ll b/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-dot-product-neon.ll
index b2be0e1d7a442..e80449432d29b 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-dot-product-neon.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/partial-reduce-dot-product-neon.ll
@@ -493,12 +493,12 @@ define i32 @not_dotp_not_loop_carried(ptr %a, ptr %b) {
 ; CHECK-INTERLEAVE1-NEXT:    [[WIDE_LOAD1:%.*]] = load <16 x i8>, ptr [[TMP4]], align 1
 ; CHECK-INTERLEAVE1-NEXT:    [[TMP6:%.*]] = zext <16 x i8> [[WIDE_LOAD1]] to <16 x i32>
 ; CHECK-INTERLEAVE1-NEXT:    [[TMP7]] = mul <16 x i32> [[TMP6]], [[TMP3]]
-; CHECK-INTERLEAVE1-NEXT:    [[TMP8:%.*]] = shufflevector <16 x i32> [[VECTOR_RECUR]], <16 x i32> [[TMP7]], <16 x i32> <i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30>
-; CHECK-INTERLEAVE1-NEXT:    [[TMP9:%.*]] = add <16 x i32> [[TMP7]], [[TMP8]]
 ; CHECK-INTERLEAVE1-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
 ; CHECK-INTERLEAVE1-NEXT:    [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1024
 ; CHECK-INTERLEAVE1-NEXT:    br i1 [[TMP10]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
 ; CHECK-INTERLEAVE1:       middle.block:
+; CHECK-INTERLEAVE1-NEXT:    [[TMP8:%.*]] = shufflevector <16 x i32> [[VECTOR_RECUR]], <16 x i32> [[TMP7]], <16 x i32> <i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30>
+; CHECK-INTERLEAVE1-NEXT:    [[TMP9:%.*]] = add <16 x i32> [[TMP7]], [[TMP8]]
 ; CHECK-INTERLEAVE1-NEXT:    [[TMP11:%.*]] = extractelement <16 x i32> [[TMP9]], i32 15
 ; CHECK-INTERLEAVE1-NEXT:    br label [[FOR_EXIT:%.*]]
 ; CHECK-INTERLEAVE1:       for.exit:
@@ -526,12 +526,12 @@ define i32 @not_dotp_not_loop_carried(ptr %a, ptr %b) {
 ; CHECK-INTERLEAVED-NEXT:    [[TMP12:%.*]] = zext <16 x i8> [[WIDE_LOAD3]] to <16 x i32>
 ; CHECK-INTERLEAVED-NEXT:    [[TMP7:%.*]] = mul <16 x i32> [[TMP6]], [[TMP3]]
 ; CHECK-INTERLEAVED-NEXT:    [[TMP13:%.*]] = mul <16 x i32> [[TMP12]], [[TMP8]]
-; CHECK-INTERLEAVED-NEXT:    [[TMP14:%.*]] = shufflevector <16 x i32> [[TMP7]], <16 x i32> [[TMP13]], <16 x i32> <i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30>
-; CHECK-INTERLEAVED-NEXT:    [[TMP9:%.*]] = add <16 x i32> [[TMP13]], [[TMP14]]
 ; CHECK-INTERLEAVED-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 32
 ; CHECK-INTERLEAVED-NEXT:    [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1024
 ; CHECK-INTERLEAVED-NEXT:    br i1 [[TMP10]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
 ; CHECK-INTERLEAVED:       middle.block:
+; CHECK-INTERLEAVED-NEXT:    [[TMP14:%.*]] = shufflevector <16 x i32> [[TMP7]], <16 x i32> [[TMP13]], <16 x i32> <i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30>
+; CHECK-INTERLEAVED-NEXT:    [[TMP9:%.*]] = add <16 x i32> [[TMP13]], [[TMP14]]
 ; CHECK-INTERLEAVED-NEXT:    [[TMP11:%.*]] = extractelement <16 x i32> [[TMP9]], i32 15
 ; CHECK-INTERLEAVED-NEXT:    br label [[FOR_EXIT:%.*]]
 ; CHECK-INTERLEAVED:       for.exit:
@@ -553,12 +553,12 @@ define i32 @not_dotp_not_loop_carried(ptr %a, ptr %b) {
 ; CHECK-MAXBW-NEXT:    [[WIDE_LOAD1:%.*]] = load <16 x i8>, ptr [[TMP4]], align 1
 ; CHECK-MAXBW-NEXT:    [[TMP6:%.*]] = zext <16 x i8> [[WIDE_LOAD1]] to <16 x i32>
 ; CHECK-MAXBW-NEXT:    [[TMP7]] = mul <16 x i32> [[TMP6]], [[TMP3]]
-; CHECK-MAXBW-NEXT:    [[TMP8:%.*]] = shufflevector <16 x i32> [[VECTOR_RECUR]], <16 x i32> [[TMP7]], <16 x i32> <i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30>
-; CHECK-MAXBW-NEXT:    [[TMP9:%.*]] = add <16 x i32> [[TMP7]], [[TMP8]]
 ; CHECK-MAXBW-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
 ; CHECK-MAXBW-NEXT:    [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1024
 ; CHECK-MAXBW-NEXT:    br i1 [[TMP10]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
 ; CHECK-MAXBW:       middle.block:
+; CHECK-MAXBW-NEXT:    [[TMP8:%.*]] = shufflevector <16 x i32> [[VECTOR_RECUR]], <16 x i32> [[TMP7]], <16 x i32> <i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30>
+; CHECK-MAXBW-NEXT:    [[TMP9:%.*]] = add <16 x i32> [[TMP7]], [[TMP8]]
 ; CHECK-MAXBW-NEXT:    [[TMP11:%.*]] = extractelement <16 x i32> [[TMP9]], i32 15
 ; CHECK-MAXBW-NEXT:    br label [[FOR_EXIT:%.*]]
 ; CHECK-MAXBW:    ...
[truncated]

}

define i8 @predicate_exit_block_successors(ptr %p0) {
; CHECK-LABEL: define i8 @predicate_exit_block_successors(
Contributor

Could you add a similar test to LoopVectorize/AArch64/single_early_exit.ll please? I'm hoping the code looks a lot better with masked load support. :)

Contributor Author

It's definitely a lot better with masked loads! Added an SVE test in ed7d701, and there's a RISC-V one too in RISCV/early-exit.ll
