@hadv hadv commented Dec 9, 2025

secp256k1 AVX2/AVX-512 Proof of Concept

This PR explored using AVX2 SIMD instructions to accelerate secp256k1 field multiplication for EOA address mining.

Benchmark Results

Field Multiplication (GitHub Actions - AMD EPYC 7763)

| Implementation | Throughput | Speedup |
|---|---|---|
| Scalar (4x sequential) | 46.22 M mul/sec | 1x |
| AVX2 (4-way parallel) | 169.11 M mul/sec | 3.66x |

Point Addition (Local - Apple Silicon via Rosetta)

| Implementation | Throughput | Speedup |
|---|---|---|
| Scalar (4x sequential) | 16.83 M add/sec | 1x |
| AVX2 (4-way parallel) | 11.60 M add/sec | 0.69x (slower) |

Key Findings

  1. AVX2 field multiplication shows excellent speedup (3.66x on AMD EPYC)

  2. AVX2 point addition is slower than scalar in this PoC due to:

    • Simplified field multiplication that ignores carries (for PoC simplicity)
    • Memory layout overhead from limb-slicing
    • Need for proper 128-bit intermediate handling
  3. AVX-512 IFMA could not be benchmarked on GitHub Actions: the runners' AMD EPYC 7763 (Zen 3) does not support AVX-512
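
The layout overhead named in finding 2 is easiest to see in scalar form. The following is a minimal sketch, not the PoC's code: `fe_t`, `fe4_t`, `fe4_pack`, and `fe4_unpack` are hypothetical names, and plain C loops stand in for the AVX2 loads and shuffles. It shows that moving between four separate field elements and the limb-sliced form is a 5×4 transpose, which has to be paid whenever data enters or leaves the SIMD path.

```c
#include <assert.h>
#include <stdint.h>

#define FE_LIMBS 5   /* 5x52-bit limbs per field element */
#define LANES    4   /* four 64-bit lanes per 256-bit YMM register */

/* One field element in ordinary (array-of-structures) form. Hypothetical type. */
typedef struct { uint64_t n[FE_LIMBS]; } fe_t;

/* Four field elements, limb-sliced: v[i][k] is limb i of element k,
 * so each row v[i] corresponds to the contents of one YMM register. */
typedef struct { uint64_t v[FE_LIMBS][LANES]; } fe4_t;

/* Transpose four separate elements into limb-sliced form. */
static void fe4_pack(fe4_t *out, const fe_t in[LANES]) {
    assert(out && in);
    for (int i = 0; i < FE_LIMBS; i++)
        for (int k = 0; k < LANES; k++)
            out->v[i][k] = in[k].n[i];
}

/* Inverse transpose: recover the four separate elements. */
static void fe4_unpack(fe_t out[LANES], const fe4_t *in) {
    assert(out && in);
    for (int i = 0; i < FE_LIMBS; i++)
        for (int k = 0; k < LANES; k++)
            out[k].n[i] = in->v[i][k];
}
```

One plausible way to limit this cost, assuming the surrounding algorithm allows it, is to keep values in the `fe4_t` form across a whole point operation and transpose only at the boundaries.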

Technical Approach

  • 5×52-bit limb representation for field elements
  • Limb-slicing technique: Pack corresponding limbs from 4 field elements into 256-bit YMM registers
  • 26-bit split for multiplication: Split each 52-bit limb into two 26-bit halves, because vpmuludq multiplies only the low 32 bits of each 64-bit lane
  • Jacobian coordinates for efficient point operations
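
The 26-bit split can be illustrated for a single lane in portable C. This is a sketch under stated assumptions rather than the PoC's implementation: `mul52_split26` and `prod104_t` are hypothetical names, and each 64-bit multiply below stands in for what one lane of `vpmuludq` computes (a 32×32→64-bit product). The 104-bit result is returned as two 52-bit limbs.

```c
#include <assert.h>
#include <stdint.h>

#define MASK26 ((UINT64_C(1) << 26) - 1)
#define MASK52 ((UINT64_C(1) << 52) - 1)

/* 104-bit product held as two 52-bit limbs: value = hi * 2^52 + lo. */
typedef struct { uint64_t lo, hi; } prod104_t;

/* Multiply two 52-bit limbs using only 26x26-bit partial products. */
static prod104_t mul52_split26(uint64_t a, uint64_t b) {
    assert(a <= MASK52 && b <= MASK52);

    uint64_t a0 = a & MASK26, a1 = a >> 26;   /* a = a1 * 2^26 + a0 */
    uint64_t b0 = b & MASK26, b1 = b >> 26;   /* b = b1 * 2^26 + b0 */

    uint64_t p00 = a0 * b0;                   /* weight 2^0,  < 2^52 */
    uint64_t mid = a0 * b1 + a1 * b0;         /* weight 2^26, < 2^53 */
    uint64_t p11 = a1 * b1;                   /* weight 2^52, < 2^52 */

    /* The low 26 bits of mid belong to the low limb; the sum is < 2^53,
     * so at most one carry bit moves into the high limb. */
    uint64_t t = p00 + ((mid & MASK26) << 26);

    prod104_t r;
    r.lo = t & MASK52;
    r.hi = p11 + (mid >> 26) + (t >> 52);
    return r;
}
```

For example, `mul52_split26(MASK52, MASK52)` yields `lo == 1` and `hi == MASK52 - 1`, matching (2^52 − 1)² = (2^52 − 2)·2^52 + 1. The carry term `t >> 52` is the kind of detail that a simplified multiplication would drop.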

Files Created

poc/secp256k1-avx2/
├── field.h              # 5×52 representation
├── field_mul.h          # Scalar reference + add/sub
├── field_mul_avx2.h     # AVX2 4-way multiplication
├── field_ops_avx2.h     # AVX2 4-way add/sub/neg
├── field_mul_avx512.h   # AVX-512 IFMA (for future)
├── group.h              # Point structures + generator G
├── group_avx2.h         # AVX2 4-way point operations
├── bench.c              # Field multiplication benchmark
├── bench_point.c        # Point addition benchmark
├── Makefile
└── README.md

Conclusion

While AVX2 shows promising speedup for individual field operations, achieving end-to-end speedup for point operations requires:

  1. Proper carry propagation in field multiplication
  2. Optimized memory layout to reduce gather/scatter overhead
  3. Potentially AVX-512 IFMA for native 52×52→104 bit multiply-add
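
To make the first requirement concrete, here is a scalar sketch of one carry-propagation pass over a 5×52 field element. It is an illustration, not the PoC's code: `fe_carry` is a hypothetical name, and the pass is a weak normalization only (the result is not guaranteed to be fully reduced below p). It relies on the secp256k1 prime p = 2^256 − 2^32 − 977, so 2^256 ≡ 0x1000003D1 (mod p) and overflow above bit 256 folds back into the lowest limb.

```c
#include <assert.h>
#include <stdint.h>

#define M52    ((UINT64_C(1) << 52) - 1)
#define M48    ((UINT64_C(1) << 48) - 1)   /* top limb holds 256 - 4*52 = 48 bits */
#define FOLD_R UINT64_C(0x1000003D1)       /* 2^256 mod p = 2^32 + 977 */

/* One carry pass over t[0..4], where value = sum of t[i] * 2^(52*i).
 * Assumes every limb is below 2^63 on entry so the additions cannot wrap. */
static void fe_carry(uint64_t t[5]) {
    for (int i = 0; i < 5; i++)
        assert(t[i] < (UINT64_C(1) << 63));

    /* Fold bits at or above 2^256 back into limb 0. */
    uint64_t x = t[4] >> 48;
    t[4] &= M48;
    t[0] += x * FOLD_R;

    /* Ripple the excess above 52 bits from each limb into the next. */
    t[1] += t[0] >> 52; t[0] &= M52;
    t[2] += t[1] >> 52; t[1] &= M52;
    t[3] += t[2] >> 52; t[2] &= M52;
    t[4] += t[3] >> 52; t[3] &= M52;
}
```

In a 4-way AVX2 version, each of these statements would presumably become one shift, mask, or add on a YMM register holding the same limb of four elements; the serial dependency between limbs remains, which is part of why carries are not free.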

For EOA mining, a CUDA/GPU implementation is likely more practical, since a GPU can run thousands of point operations in parallel versus 4 (AVX2) or 8 (AVX-512) per core with SIMD.


This PR is closed as an exploratory PoC. The code remains in the branch for reference.

prpeh added 2 commits December 9, 2025 11:46
This PoC demonstrates 4-way parallel secp256k1 field multiplication using AVX2
and 8-way parallel using AVX-512 IFMA instructions.

Results on AVX2:
- Scalar (4x sequential): 43.20 M mul/sec
- AVX2 (4-way parallel): 105.83 M mul/sec
- Speedup: 2.45x

Includes GitHub Actions workflow to benchmark AVX-512 IFMA on cloud runners.

- Add field_ops_avx2.h: 4-way parallel field add/sub/neg
- Add group.h: Jacobian point structures and generator G
- Add group_avx2.h: 4-way parallel point doubling and addition
- Add bench_point.c: Point addition benchmark

Local results (Apple Silicon via Rosetta):
- Scalar: 16.83 M additions/sec
- AVX2: 11.60 M additions/sec (0.69x - needs optimization)

The AVX2 point addition is currently slower due to:
1. Simplified field multiplication (ignores carries)
2. Memory layout overhead for limb-slicing
3. Need for proper 128-bit intermediate handling

Next steps: Optimize field multiplication with proper carry propagation.
@hadv hadv closed this Dec 9, 2025
@hadv hadv deleted the feature/secp256k1-avx2-poc branch December 9, 2025 05:21
