[GVN] Support rnflow pattern matching and transform #162259

madhur13490 · 2025-10-07T10:19:06Z

This is an alternative patch to #144987, adding support in GVN.
In this patch, I added support in GVN to recognize PRE load and hoist it out of the loop. This patch leads to ~15% improvement for the rnflow benchmark in Polyhedron testsuite when tested on NVIDIA's Grace platform.

github-actions · 2025-10-07T10:21:19Z

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:

git-clang-format --diff origin/main HEAD --extensions h,cpp -- llvm/include/llvm/Transforms/Scalar/GVN.h llvm/lib/Transforms/Scalar/GVN.cpp --diff_from_common_commit

⚠️
The reproduction instructions above might return results for more than one PR
in a stack if you are using a stacked PR workflow. You can limit the results by
changing origin/main to the base branch/commit you want to compare against.
⚠️

View the diff from clang-format here.

diff --git a/llvm/include/llvm/Transforms/Scalar/GVN.h b/llvm/include/llvm/Transforms/Scalar/GVN.h
index db74c2066..ed384f930 100644
--- a/llvm/include/llvm/Transforms/Scalar/GVN.h
+++ b/llvm/include/llvm/Transforms/Scalar/GVN.h
@@ -19,10 +19,10 @@
 #include "llvm/ADT/MapVector.h"
 #include "llvm/ADT/SetVector.h"
 #include "llvm/ADT/SmallVector.h"
+#include "llvm/Analysis/LoopInfo.h"
 #include "llvm/IR/Dominators.h"
 #include "llvm/IR/InstrTypes.h"
 #include "llvm/IR/PassManager.h"
-#include "llvm/Analysis/LoopInfo.h"
 #include "llvm/IR/ValueHandle.h"
 #include "llvm/Support/Allocator.h"
 #include "llvm/Support/Compiler.h"

madhur13490 · 2025-10-07T10:23:49Z

Hi @fhahn and @nikic,
This patch requires some cosmetic changes. My concern is this is to understand if this what GVN should do?

The patch is functionally correct as the rnflow application passes. However, compared to trunk this patch regresses the app by ~10-12% as the new select instruction the pass is generating is more constly than hoisting the load. I played around SelectOptimize.cpp by turning on --disable-loop-level-heuristics. That helps a bit but still ~10% regression persists.

david-arm · 2025-10-08T09:17:37Z

The patch is functionally correct as the rnflow application passes. However, compared to trunk this patch regresses the app by ~10-12% as the new select instruction the pass is generating is more constly than hoisting the load. I played around SelectOptimize.cpp by turning on --disable-loop-level-heuristics. That helps a bit but still ~10% regression persists.

In the test you've added the original IR seems to already have a select instruction - do you also see a select in rnflow prior to GVN?

david-arm · 2025-10-08T09:19:21Z

Sorry, please ignore me! I see there is an extra select being generated. Got it. :)

madhur13490 · 2025-10-27T05:52:33Z

Ping to reviewers!

nikic · 2025-10-27T16:26:09Z

The patch is functionally correct as the rnflow application passes. However, compared to trunk this patch regresses the app by ~10-12% as the new select instruction the pass is generating is more constly than hoisting the load. I played around SelectOptimize.cpp by turning on --disable-loop-level-heuristics. That helps a bit but still ~10% regression persists.

And this is also true with GCC (without vectorization), as IIRC it was doing this transform?

I feel like I saw a comment somewhere that doing the GVN transform would improve performance slightly, but I have trouble finding it, maybe I imagined it...

nikic · 2025-10-27T16:27:38Z

I feel like I saw a comment somewhere that doing the GVN transform would improve performance slightly, but I have trouble finding it, maybe I imagined it...

Ah, I think it was this one, on the test case PR: #141556 (comment) I guess that one was an estimated number and not a measurement.

madhur13490 · 2025-10-28T04:21:55Z

I feel like I saw a comment somewhere that doing the GVN transform would improve performance slightly, but I have trouble finding it, maybe I imagined it...

Ah, I think it was this one, on the test case PR: #141556 (comment) I guess that one was an estimated number and not a measurement.

Yeah, that was an estimate, but unfortunately, in reality, it does not match (at least) on the Grace platform.

llvmbot · 2025-11-10T04:54:23Z

@llvm/pr-subscribers-llvm-transforms

Author: Madhur Amilkanthwar (madhur13490)

Changes

This is an alternative patch to #144987, adding support in GVN.

Full diff: https://github.com/llvm/llvm-project/pull/162259.diff

3 Files Affected:

(modified) llvm/include/llvm/Transforms/Scalar/GVN.h (+4)
(modified) llvm/lib/Transforms/Scalar/GVN.cpp (+121)
(added) llvm/test/Transforms/GVN/PRE/rnflow-gvn-pre.ll (+59)

diff --git a/llvm/include/llvm/Transforms/Scalar/GVN.h b/llvm/include/llvm/Transforms/Scalar/GVN.h
index bc0f108ac8260..b886140348e79 100644
--- a/llvm/include/llvm/Transforms/Scalar/GVN.h
+++ b/llvm/include/llvm/Transforms/Scalar/GVN.h
@@ -22,6 +22,7 @@
 #include "llvm/IR/Dominators.h"
 #include "llvm/IR/InstrTypes.h"
 #include "llvm/IR/PassManager.h"
+#include "llvm/Analysis/LoopInfo.h"
 #include "llvm/IR/ValueHandle.h"
 #include "llvm/Support/Allocator.h"
 #include "llvm/Support/Compiler.h"
@@ -45,6 +46,7 @@ class FunctionPass;
 class GetElementPtrInst;
 class ImplicitControlFlowTracking;
 class LoadInst;
+class SelectInst;
 class LoopInfo;
 class MemDepResult;
 class MemoryAccess;
@@ -405,6 +407,8 @@ class GVNPass : public PassInfoMixin<GVNPass> {
   void addDeadBlock(BasicBlock *BB);
   void assignValNumForDeadCode();
   void assignBlockRPONumber(Function &F);
+  
+  bool optimizeMinMaxFindingSelectPattern(SelectInst *Select);
 };
 
 /// Create a legacy GVN pass.
diff --git a/llvm/lib/Transforms/Scalar/GVN.cpp b/llvm/lib/Transforms/Scalar/GVN.cpp
index 72e1131a54a86..f817d045862cd 100644
--- a/llvm/lib/Transforms/Scalar/GVN.cpp
+++ b/llvm/lib/Transforms/Scalar/GVN.cpp
@@ -2743,6 +2743,10 @@ bool GVNPass::processInstruction(Instruction *I) {
     }
     return Changed;
   }
+  if (SelectInst *Select = dyn_cast<SelectInst>(I)) {
+    if (optimizeMinMaxFindingSelectPattern(Select))
+      return true;
+  }
 
   // Instructions with void type don't return a value, so there's
   // no point in trying to find redundancies in them.
@@ -3330,6 +3334,123 @@ void GVNPass::assignValNumForDeadCode() {
   }
 }
 
+bool GVNPass::optimizeMinMaxFindingSelectPattern(SelectInst *Select) {
+  LLVM_DEBUG(
+      dbgs()
+      << "GVN: Analyzing select instruction for minimum finding pattern\n");
+  LLVM_DEBUG(dbgs() << "GVN: Select: " << *Select << "\n");
+  Value *Condition = Select->getCondition();
+  CmpInst *Comparison = dyn_cast<CmpInst>(Condition);
+  if (!Comparison) {
+    LLVM_DEBUG(dbgs() << "GVN: Condition is not a comparison\n");
+    return false;
+  }
+
+  // Check if this is ULT comparison.
+  CmpInst::Predicate Pred = Comparison->getPredicate();
+  if (Pred != CmpInst::ICMP_SLT && Pred != CmpInst::ICMP_ULT &&
+      Pred != CmpInst::FCMP_OLT && Pred != CmpInst::FCMP_ULT) {
+    LLVM_DEBUG(dbgs() << "GVN: Not a less-than comparison, predicate: " << Pred
+                      << "\n");
+    return false;
+  }
+
+  // Check that both operands are loads.
+  Value *LHS = Comparison->getOperand(0);
+  Value *RHS = Comparison->getOperand(1);
+  if (!isa<LoadInst>(LHS) || !isa<LoadInst>(RHS)) {
+    LLVM_DEBUG(dbgs() << "GVN: Not both operands are loads\n");
+    return false;
+  }
+
+  LLVM_DEBUG(dbgs() << "GVN: Found minimum finding pattern in Block: "
+                    << Select->getParent()->getName() << "\n");
+
+  // Transform the pattern.
+  // Hoist the chain of operations for the second load to preheader.
+  // Get predecessor of the block containing the select instruction.
+  BasicBlock *BB = Select->getParent();
+
+  // Get preheader of the loop.
+  Loop *L = LI->getLoopFor(BB);
+  if (!L) {
+    LLVM_DEBUG(dbgs() << "GVN: Could not find loop\n");
+    return false;
+  }
+  BasicBlock *Preheader = L->getLoopPreheader();
+  if (!Preheader) {
+    LLVM_DEBUG(dbgs() << "GVN: Could not find loop preheader\n");
+    return false;
+  }
+
+  // Hoist the chain of operations for the second load to preheader.
+  // %90 = sext i32 %.05.i to i64
+  // %91 = getelementptr float, ptr %0, i64 %90 ; %0 + (sext i32 %85 to i64)*4
+  // %92 = getelementptr i8, ptr %91, i64 -4 ; %0 + (sext i32 %85 to i64)*4 - 4
+  // %93 = load float, ptr %92, align 4
+
+  Value *BasePtr = nullptr, *IndexVal = nullptr, *OffsetVal = nullptr;
+  IRBuilder<> Builder(Preheader->getTerminator());
+  if (match(RHS,
+            m_Load(m_GEP(m_GEP(m_Value(BasePtr), m_SExt(m_Value(IndexVal))),
+                         m_Value(OffsetVal))))) {
+    LLVM_DEBUG(dbgs() << "GVN: Found pattern: " << *RHS << "\n");
+    LLVM_DEBUG(dbgs() << "GVN: Found pattern: " << "\n");
+
+    PHINode *Phi = dyn_cast<PHINode>(IndexVal);
+    if (!Phi) {
+      LLVM_DEBUG(dbgs() << "GVN: IndexVal is not a PHI node\n");
+      return false;
+    }
+    Value *InitialMinIndex = Phi->getIncomingValueForBlock(Preheader);
+
+    // Insert PHI node at the top of this block.
+    PHINode *KnownMinPhi =
+        PHINode::Create(Builder.getFloatTy(), 2, "known_min", BB->begin());
+
+    // Build the GEP chain in the preheader.
+    // 1. hoist_0 = sext i32 to i64
+    Value *HoistedSExt =
+        Builder.CreateSExt(InitialMinIndex, Builder.getInt64Ty(), "hoist_sext");
+
+    // 2. hoist_gep1 = getelementptr float, ptr BasePtr, i64 HoistedSExt
+    Value *HoistedGEP1 = Builder.CreateGEP(Builder.getFloatTy(), BasePtr,
+                                           HoistedSExt, "hoist_gep1");
+
+    // 3. hoist_gep2 = getelementptr i8, ptr HoistedGEP1, i64 OffsetVal
+    Value *HoistedGEP2 = Builder.CreateGEP(Builder.getInt8Ty(), HoistedGEP1,
+                                           OffsetVal, "hoist_gep2");
+
+    // 4. hoisted_load = load float, ptr HoistedGEP2
+    LoadInst *NewLoad =
+        Builder.CreateLoad(Builder.getFloatTy(), HoistedGEP2, "hoisted_load");
+
+    // Replace all uses of load with new load.
+    RHS->replaceAllUsesWith(NewLoad);
+    dyn_cast<LoadInst>(RHS)->eraseFromParent();
+
+    // Replace second operand of comparison with KnownMinPhi.
+    Comparison->setOperand(1, KnownMinPhi);
+
+    // Create new select instruction for selecting the minimum value.
+    IRBuilder<> SelectBuilder(BB->getTerminator());
+    SelectInst *CurrentMinSelect =
+        dyn_cast<SelectInst>(SelectBuilder.CreateSelect(
+            Comparison, LHS, KnownMinPhi, "current_min"));
+
+    // Populate PHI node.
+    KnownMinPhi->addIncoming(NewLoad, Preheader);
+    KnownMinPhi->addIncoming(CurrentMinSelect, BB);
+    LLVM_DEBUG(dbgs() << "Transformed the code\n");
+    return true;
+  } else {
+    LLVM_DEBUG(dbgs() << "GVN: Could not find pattern: " << *RHS << "\n");
+    return false;
+  }
+  return false;
+}
+
+
 class llvm::gvn::GVNLegacyPass : public FunctionPass {
 public:
   static char ID; // Pass identification, replacement for typeid.
diff --git a/llvm/test/Transforms/GVN/PRE/rnflow-gvn-pre.ll b/llvm/test/Transforms/GVN/PRE/rnflow-gvn-pre.ll
new file mode 100644
index 0000000000000..6f17d4ab30240
--- /dev/null
+++ b/llvm/test/Transforms/GVN/PRE/rnflow-gvn-pre.ll
@@ -0,0 +1,59 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 6
+; Minimal test case containing only the .lr.ph.i basic block
+; RUN: opt -passes=gvn -S < %s | FileCheck %s
+
+define void @test_lr_ph_i(ptr %0) {
+; CHECK-LABEL: define void @test_lr_ph_i(
+; CHECK-SAME: ptr [[TMP0:%.*]]) {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[HOIST_GEP1:%.*]] = getelementptr float, ptr [[TMP0]], i64 1
+; CHECK-NEXT:    [[HOIST_GEP2:%.*]] = getelementptr i8, ptr [[HOIST_GEP1]], i64 -4
+; CHECK-NEXT:    [[HOISTED_LOAD:%.*]] = load float, ptr [[HOIST_GEP2]], align 4
+; CHECK-NEXT:    br label %[[DOTLR_PH_I:.*]]
+; CHECK:       [[_LR_PH_I:.*:]]
+; CHECK-NEXT:    [[KNOWN_MIN:%.*]] = phi float [ [[HOISTED_LOAD]], %[[ENTRY]] ], [ [[CURRENT_MIN:%.*]], %[[DOTLR_PH_I]] ]
+; CHECK-NEXT:    [[INDVARS_IV_I:%.*]] = phi i64 [ 1, %[[ENTRY]] ], [ [[INDVARS_IV_NEXT_I:%.*]], %[[DOTLR_PH_I]] ]
+; CHECK-NEXT:    [[TMP1:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[TMP10:%.*]], %[[DOTLR_PH_I]] ]
+; CHECK-NEXT:    [[DOT05_I:%.*]] = phi i32 [ 1, %[[ENTRY]] ], [ [[DOT1_I:%.*]], %[[DOTLR_PH_I]] ]
+; CHECK-NEXT:    [[INDVARS_IV_NEXT_I]] = add nsw i64 [[INDVARS_IV_I]], -1
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr float, ptr [[TMP0]], i64 [[INDVARS_IV_I]]
+; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr i8, ptr [[TMP2]], i64 -8
+; CHECK-NEXT:    [[TMP4:%.*]] = load float, ptr [[TMP3]], align 4
+; CHECK-NEXT:    [[TMP5:%.*]] = sext i32 [[DOT05_I]] to i64
+; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr float, ptr [[TMP0]], i64 [[TMP5]]
+; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr i8, ptr [[TMP6]], i64 -4
+; CHECK-NEXT:    [[TMP8:%.*]] = fcmp contract olt float [[TMP4]], [[KNOWN_MIN]]
+; CHECK-NEXT:    [[TMP9:%.*]] = trunc nsw i64 [[INDVARS_IV_NEXT_I]] to i32
+; CHECK-NEXT:    [[DOT1_I]] = select i1 [[TMP8]], i32 [[TMP9]], i32 [[DOT05_I]]
+; CHECK-NEXT:    [[TMP10]] = add nsw i64 [[TMP1]], -1
+; CHECK-NEXT:    [[TMP11:%.*]] = icmp samesign ugt i64 [[TMP1]], 1
+; CHECK-NEXT:    [[CURRENT_MIN]] = select i1 [[TMP8]], float [[TMP4]], float [[KNOWN_MIN]]
+; CHECK-NEXT:    br i1 [[TMP11]], label %[[DOTLR_PH_I]], label %[[EXIT:.*]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %.lr.ph.i
+
+.lr.ph.i:                                         ; preds = %.lr.ph.i, %entry
+  %indvars.iv.i = phi i64 [ 1, %entry ], [ %indvars.iv.next.i, %.lr.ph.i ]
+  %86 = phi i64 [ 0, %entry ], [ %96, %.lr.ph.i ]
+  %.05.i = phi i32 [ 1, %entry ], [ %.1.i, %.lr.ph.i ]
+  %indvars.iv.next.i = add nsw i64 %indvars.iv.i, -1
+  %87 = getelementptr float, ptr %0, i64 %indvars.iv.i
+  %88 = getelementptr i8, ptr %87, i64 -8 ; first load : %0 + 4 * 1 - 8
+  %89 = load float, ptr %88, align 4
+  %90 = sext i32 %.05.i to i64
+  %91 = getelementptr float, ptr %0, i64 %90 ; %0 + 4 * 1
+  %92 = getelementptr i8, ptr %91, i64 -4 ; second load : %0 + 4 * 1 - 4
+  %93 = load float, ptr %92, align 4
+  %94 = fcmp contract olt float %89, %93
+  %95 = trunc nsw i64 %indvars.iv.next.i to i32
+  %.1.i = select i1 %94, i32 %95, i32 %.05.i
+  %96 = add nsw i64 %86, -1
+  %97 = icmp samesign ugt i64 %86, 1
+  br i1 %97, label %.lr.ph.i, label %exit
+
+exit:
+  ret void
+}

madhur13490 · 2025-11-13T06:38:55Z

Hi @nikic and @david-arm
With the latest rebase, I do see a performance gain with this patch contrary to the earlier reported regression.

Either this due to recent patches in LLVM or refreshed setup at my end. It is complicated to pinpoint to the patch which might have helped this patch gain the perf. Nevertheless, as it stands today, I see 15% end to end perf gain in Polyhedron's rnflow app on AArch64 Grace platform. My colleague, @ssijaric-nv also checked it on X86, AMD EPYC Milan1 and he sees slightly less gain but indeed there is roughtly 6% gain.

I have cleaned up this patch and is ready for review. Can you please have a look?

dtcxzyw · 2025-11-13T12:59:00Z

FYI nikic is currently out of the office and will come back at the start of December.

madhur13490 · 2025-12-08T06:14:54Z

Hi @nikic, If you're back, can you please have a look?

nikic · 2025-12-08T07:47:38Z

llvm/include/llvm/Transforms/Scalar/GVN.h

 #include "llvm/IR/Dominators.h"
 #include "llvm/IR/InstrTypes.h"
 #include "llvm/IR/PassManager.h"
+#include "llvm/Analysis/LoopInfo.h"


Can forward declare Loop instead?

nikic · 2025-12-08T07:49:58Z