Add secp256k1 AVX2/AVX-512 field multiplication proof of concept #32

hadv · 2025-12-09T04:53:18Z

secp256k1 AVX2/AVX-512 Proof of Concept

This PR explored using AVX2 SIMD instructions to accelerate secp256k1 field multiplication for EOA address mining.

Benchmark Results

Field Multiplication (GitHub Actions - AMD EPYC 7763)

Implementation	Throughput	Speedup
Scalar (4x sequential)	46.22 M mul/sec	1x
AVX2 (4-way parallel)	169.11 M mul/sec	3.66x

Point Addition (Local - Apple Silicon via Rosetta)

Implementation	Throughput	Speedup
Scalar (4x sequential)	16.83 M add/sec	1x
AVX2 (4-way parallel)	11.60 M add/sec	0.69x (slower)

Key Findings

- AVX2 field multiplication shows a 3.66x speedup over scalar on the AMD EPYC 7763.
- AVX2 point addition is slower than scalar (0.69x) in this PoC due to:
  - Simplified field multiplication that ignores carries (for PoC simplicity)
  - Memory layout overhead from limb-slicing
  - Need for proper 128-bit intermediate handling
- The point addition numbers were measured under Rosetta translation on Apple Silicon, so they may not reflect native x86 hardware.
- AVX-512 IFMA is not available on GitHub Actions runners: the AMD EPYC 7763 (Zen 3) does not support AVX-512.
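Since IFMA availability depends on the runner's CPU, a benchmark can check for it at run time before choosing a code path. The following is a minimal sketch, assuming GCC or Clang on x86; `kernel_kind`, `pick_kernel` and `kernel_name` are hypothetical names for illustration and are not part of the PoC.

```c
#include <stddef.h>

/* Hypothetical enum naming the three code paths discussed in this PR. */
typedef enum { KERNEL_SCALAR, KERNEL_AVX2, KERNEL_AVX512_IFMA } kernel_kind;

/* Pick the widest kernel the running CPU reports support for.
 * Uses the GCC/Clang x86 built-ins; on other targets or compilers
 * the function falls back to the scalar path. */
static kernel_kind pick_kernel(void) {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512ifma")) return KERNEL_AVX512_IFMA;
    if (__builtin_cpu_supports("avx2"))       return KERNEL_AVX2;
#endif
    return KERNEL_SCALAR;
}

/* Human-readable label for logging which path a benchmark run used. */
static const char *kernel_name(kernel_kind k) {
    switch (k) {
    case KERNEL_AVX512_IFMA: return "avx512-ifma";
    case KERNEL_AVX2:        return "avx2";
    default:                 return "scalar";
    }
}
```

Printing `kernel_name(pick_kernel())` at the start of a CI benchmark run would make it obvious when the IFMA path is being skipped.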

Technical Approach

- 5×52-bit limb representation for field elements
- Limb-slicing technique: Pack corresponding limbs from 4 field elements into 256-bit YMM registers
- 26-bit split for multiplication: Split each 52-bit limb into two 26-bit halves, because vpmuludq only multiplies 32-bit operands (32×32→64 bit)
- Jacobian coordinates for efficient point operations
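The first three items can be illustrated in portable scalar C. This is a sketch under stated assumptions, not the code from `field.h` or `field_mul_avx2.h`: the type and function names (`fe`, `fe4`, `prod104`, `fe_from_words`, `fe4_pack`, `mul52_split26`) are hypothetical, and plain arrays stand in for YMM registers so the layout and arithmetic can be checked without intrinsics.

```c
#include <assert.h>
#include <stdint.h>

#define MASK26 ((1ULL << 26) - 1)
#define MASK52 ((1ULL << 52) - 1)

/* One field element as 5 limbs of 52 bits (the top limb holds 48 bits). */
typedef struct { uint64_t n[5]; } fe;

/* Four elements, limb-sliced: n[i] holds limb i of lanes 0..3,
 * i.e. the 4 x 64-bit contents one 256-bit YMM register would carry. */
typedef struct { uint64_t n[5][4]; } fe4;

/* Split a 256-bit value (w[0] = least significant word) into 5x52 limbs. */
static void fe_from_words(fe *r, const uint64_t w[4]) {
    r->n[0] = w[0] & MASK52;
    r->n[1] = ((w[0] >> 52) | (w[1] << 12)) & MASK52;
    r->n[2] = ((w[1] >> 40) | (w[2] << 24)) & MASK52;
    r->n[3] = ((w[2] >> 28) | (w[3] << 36)) & MASK52;
    r->n[4] = w[3] >> 16;
}

/* Transpose 4 separate elements into the limb-sliced layout. */
static void fe4_pack(fe4 *out, const fe in[4]) {
    for (int i = 0; i < 5; i++)
        for (int lane = 0; lane < 4; lane++)
            out->n[i][lane] = in[lane].n[i];
}

/* 104-bit product of two 52-bit limbs, returned as two 52-bit halves. */
typedef struct { uint64_t lo52, hi52; } prod104;

/* Multiply two 52-bit limbs using only 26x26-bit partial products,
 * the operand size a 32x32->64 multiply such as vpmuludq can handle. */
static prod104 mul52_split26(uint64_t a, uint64_t b) {
    assert(a <= MASK52 && b <= MASK52);
    uint64_t a0 = a & MASK26, a1 = a >> 26;
    uint64_t b0 = b & MASK26, b1 = b >> 26;
    uint64_t p00 = a0 * b0;            /* weight 2^0  */
    uint64_t mid = a0 * b1 + a1 * b0;  /* weight 2^26, fits in 53 bits */
    uint64_t p11 = a1 * b1;            /* weight 2^52 */
    /* a*b = p00 + mid*2^26 + p11*2^52; split mid across the 2^52 boundary. */
    uint64_t low = p00 + ((mid & MASK26) << 26);
    prod104 r;
    r.lo52 = low & MASK52;
    r.hi52 = p11 + (mid >> 26) + (low >> 52);  /* includes the carry out of low */
    return r;
}
```

The carry out of `low` into `hi52` is the kind of intermediate term a simplified multiplication can drop, so a scalar reference like this is useful for cross-checking each vector lane.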

Files Created

poc/secp256k1-avx2/
├── field.h              # 5×52 representation
├── field_mul.h          # Scalar reference + add/sub
├── field_mul_avx2.h     # AVX2 4-way multiplication
├── field_ops_avx2.h     # AVX2 4-way add/sub/neg
├── field_mul_avx512.h   # AVX-512 IFMA (for future)
├── group.h              # Point structures + generator G
├── group_avx2.h         # AVX2 4-way point operations
├── bench.c              # Field multiplication benchmark
├── bench_point.c        # Point addition benchmark
├── Makefile
└── README.md

Conclusion

While AVX2 shows promising speedup for individual field operations, achieving end-to-end speedup for point operations requires:

- Proper carry propagation in field multiplication
- Optimized memory layout to reduce gather/scatter overhead
- Potentially AVX-512 IFMA for native 52×52→104 bit multiply-add (vpmadd52luq/vpmadd52huq)
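To make the first requirement concrete, here is a scalar sketch of a carry pass over a 5×52 element. It assumes the usual secp256k1 prime p = 2^256 - 2^32 - 977, so that 2^256 ≡ 0x1000003D1 (mod p). `fe52` and `fe52_carry` are hypothetical names. The pass performs only a weak reduction (limbs back in range, value congruent mod p) and is not the PoC's vector code.

```c
#include <stdint.h>

#define M52 ((1ULL << 52) - 1)
#define M48 ((1ULL << 48) - 1)
#define FOLD 0x1000003D1ULL  /* 2^256 mod p = 2^32 + 977 */

/* Field element as 5 limbs; n[0..3] nominally 52 bits, n[4] 48 bits.
 * Limbs may temporarily exceed those widths after additions. */
typedef struct { uint64_t n[5]; } fe52;

/* Weak carry pass. Assumes every input limb is well below 2^63 so the
 * additions here cannot overflow. Afterwards n[0..3] fit in 52 bits;
 * n[4] can still exceed 48 bits by the carry from n[3], so a full
 * normalization to the canonical value would need a further step. */
static void fe52_carry(fe52 *r) {
    uint64_t t0 = r->n[0], t1 = r->n[1], t2 = r->n[2],
             t3 = r->n[3], t4 = r->n[4];

    /* Bits above 2^256 wrap around: x * 2^256 == x * FOLD (mod p). */
    uint64_t x = t4 >> 48;
    t4 &= M48;
    t0 += x * FOLD;

    /* Ripple the excess of each limb into the next one. */
    t1 += t0 >> 52; t0 &= M52;
    t2 += t1 >> 52; t1 &= M52;
    t3 += t2 >> 52; t2 &= M52;
    t4 += t3 >> 52; t3 &= M52;

    r->n[0] = t0; r->n[1] = t1; r->n[2] = t2;
    r->n[3] = t3; r->n[4] = t4;
}
```

In a limb-sliced AVX2 version the same shifts, masks and adds would apply to all four lanes at once. The chain is sequential across limbs but independent across lanes.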

For EOA mining, a CUDA/GPU implementation is likely more practical, since GPUs can run thousands of point operations in parallel versus 4 to 8 lanes with SIMD.

This PR is closed as an exploratory PoC. The code remains in the branch for reference.

This PoC demonstrates 4-way parallel secp256k1 field multiplication using AVX2 and 8-way parallel using AVX-512 IFMA instructions.

Results on AVX2:
- Scalar (4x sequential): 43.20 M mul/sec
- AVX2 (4-way parallel): 105.83 M mul/sec
- Speedup: 2.45x

Includes GitHub Actions workflow to benchmark AVX-512 IFMA on cloud runners.

- Add field_ops_avx2.h: 4-way parallel field add/sub/neg
- Add group.h: Jacobian point structures and generator G
- Add group_avx2.h: 4-way parallel point doubling and addition
- Add bench_point.c: Point addition benchmark

Local results (Apple Silicon via Rosetta):
- Scalar: 16.83 M additions/sec
- AVX2: 11.60 M additions/sec (0.69x - needs optimization)

The AVX2 point addition is currently slower due to:
1. Simplified field multiplication (ignores carries)
2. Memory layout overhead for limb-slicing
3. Need for proper 128-bit intermediate handling

Next steps: Optimize field multiplication with proper carry propagation.

prpeh added 2 commits December 9, 2025 11:46

hadv closed this Dec 9, 2025

hadv deleted the feature/secp256k1-avx2-poc branch December 9, 2025 05:21


3 participants
