Accelerate zkSTARK by CPU AVX

7 min readJan 7, 2024

The SIMD is a fantastic approach to speed up ZKP proving, the eigen-zkvm takes advantage of Intel AVX2/AVX512 to achieve more than a half time-saving for MerkleHash, and more than 15% time saving for the whole proving stage.

In this blog, we’ll dive into some critical steps in our implementation and hope for your feedback.

Part 1: Overview of Goldilocks Field

Inspired by the Golden Ratio, denoted as φ, the Goldilocks Field adopts a formula: p=φ²−φ+1. By choosing φ=2³², the resulting modulus, p = 2⁶⁴- 2³² + 1, emerges as an effective solution in applications where computational efficiency is paramount. This reduced size facilitates faster computations, making it a more suitable choice for environments with limited processing capabilities. The choice of “+1” increases the number of roots of unity available in the field. This property is particularly advantageous in NTT [1], a key component in certain cryptographic protocols where fast polynomial multiplication is essential.

Part 2: Implementing Group Operations Using AVX2

AVX2 (Advanced Vector Extensions 2), an extension of the x86 instruction set, significantly enhances parallel processing capabilities. With its 256-bit wide registers in the SIMD (Single Instruction, Multiple Data) architecture, AVX2 is adept at handling multiple data elements concurrently. This ability allows AVX2 to process four 64-bit elements simultaneously, particularly beneficial for operations in the Goldilocks field, leading to a substantial increase in the efficiency of field operations. Next, we’ll explore in detail how AVX2 facilitates key operations in the Goldilocks field, emphasizing the mechanisms that maintain elements within the field’s boundaries and the acceleration of multiplication tasks.

AVX2 handles four 64-bit Goldilocks field elements parallelly

2.1 The canonicalize_s function ensures that the result of a field operation remains within the valid range of the Goldilocks field, i.e., between 0 and GOLDILOCKS_FIELD_ORDER - 1.

Suppose we have a set of four 64-bit integers that we need to canonicalize:

Input Values (in 256-bit register): [a, b, c, d], each 64-bit long.
Assume after some field operations, our values are:
a = 0xFFFFFFFF00000002 (just slightly over the field order)
b = 0xFFFFFFFF00000001 (equal to the field order)
c = 0x0000000100000000 (within the field order)
d = 0x00000000FFFFFFFF (within the field order)

* Shift Operation: The values are initially shifted by 2⁶³ due to the nature of signed arithmetic in AVX2. Let’s represent this shift abstractly as shift(value).

* Comparing and Adjusting:

(1) Comparison: Compare each shifted value with the shifted field order.

mask = _mm256_cmpgt_epi(SHIFTED_FIELD_ORDER, shift([a, b, c, d]))
The mask will be [0, 0, 0xffffffffffffffff, 0xffffffffffffffff] because a and b are no less than the field order, while c and d are.

(2) Wrapback Amount Calculation:

If the value is greater than or equal to the field order, we need to subtract the field order from it.
wrapback_amt = _mm256_andnot_si256(mask,EPSILON)
EPSILON = FIELD_ORDER_wrapping_neg(equivalent to -FIELD_ORDER mod p)
Our example, wrapback_amt of[a, b, c, d] would be [0xffffffff, 0xffffffff, 0, 0].

(3) Final Adjustment:

canonicalized_values = shift([a, b, c, d]) + wrapback_amt
This step brings a and b back into the field range by subtracting the field order from them.

After executing the canonicalize_s function, perform the shift operation again to achieve restoration.

2.2 Multiplication

Reflecting its inspiration from the Golden Ratio, the design of the Goldilocks Field lends itself to efficient reduction operations. This unique property of the field makes it highly compatible with the Fast Karatsuba Multiplication algorithm. In this context, the multiplication of two elements, x, and y, in the Goldilocks field can be expressed as the product of x=(b+aφ)and y=(d+cφ), utilizing the distributive property to simplify and accelerate the computation. This approach enhances the efficiency of multiplication operations within the field.

Part 3: Achieving Poseidon Hash with AVX2

3.1 Structure of Poseidon Hash

Overview of the construction of Poseidon.

Each round within Poseidon Hash encompasses three main components: (1) AddRoundConstants (AddC): This step involves the addition of constants to the state, ensuring that each round has a unique transformation, thereby preventing repetitive patterns that could be exploited.

(2) S-box(S): This is the non-linear transformation phase where specific functions are applied to the state to create confusion.

(3) MixLayer (MDS): The linear transformation step uses a Matrix-Diagonal-Spread (MDS) matrix to disperse the state across the entire matrix width. It’s designed to maximize the diffusion of the non-linear transformation effects from the S-box phase.

The Poseidon Hash design integrates the HADES strategy of Substitution-Permutation Network (SPN) which allows for rounds with partial S-box layers. This strategy is a hybrid that incorporates elements of the sponge construction, where the data is absorbed into the state, transformed, and then squeezed out to produce the hash output. The use of the HADES SPN design in Poseidon allows for a more efficient computation without significantly compromising the security, which is essential for modern cryptographic applications that demand both high security and performance. While the HADES strategy underpinning Poseidon Hash aims for robust security, some research suggests that the complexity of potential algebraic attacks might be underestimated [2]. The Poseidon Hash with its 8 full rounds and 22 partial rounds achieves a security level of 95 bits.

3.2 Optimization with AVX2 in Poseidon Hash

Optimizing the Poseidon Hash function with AVX2 primarily focuses on two key aspects:

(1)Vectorization: AVX2 enhances the ability to process multiple data points concurrently with a single instruction. In our specific use case, considering the Poseidon Hash utilizes 12 elements, the code is designed to process these elements using three separate AVX2 vectors (as shown in the figure below, “state0, state1, state2”) that can handle these elements more efficiently.

Take the first four elements as the result, after 12 elements have been passed through Poseidon hash

(2)Parallelism: Specific operations like pow7 (S-box layer) and other operations related to the MDS layer, such as add_avx, mmult_avx, and spmv_avx_4*12, can benefit from AVX2's parallelism.

Part 4: Comparing AVX512 and AVX2

1. Differences in Field Operations

AVX512 handles eight 64-bit Goldilocks field elements parallelly

AVX512’s wider vector registers allow the simultaneous processing of 8 elements in a single operation, effectively doubling the capacity compared to AVX2.

2. Impact on Poseidon Hash

(AVX512) Take the first eight elements as the result

In the AVX512 implementation of the Poseidon Hash, the computation is optimized to process two sets of 12 elements. For the first group of 12 elements, which are stored in the first four slots of state0, state1, and state2 (highlighted in pink in the illustration), the corresponding hash output occupies the first four slots of the result array.

Part 5: Performance Analysis

Introduction to Linear Hash and Merkle Hash

In the Linear hash configuration, the block size for the Poseidon hash is set to 4. This setup caters to processing arrays or matrices, where elements are enqueued in a row (+ column) order. To align with the hash function’s requirements, the total number of elements queued for hashing is padded to be a multiple of 8. These elements are then grouped into sets of 8, known as NHashElements. An initial state, designated as initState and typically consisting of four zeros ([0,0,0,0]), is used as a starting point. The Poseidon hash function is then applied to each NHashElements group along with the initState, iteratively updating the initState with the hash output of each group. This approach ensures a consistent and structured processing of data, optimizing the hash function for arrays or matrices and maintaining efficiency in cryptographic operations. This precise structure of queuing and processing elements is the rationale behind our code’s choice to use 12 elements as the input for the Poseidon hash.

The Merkle Hash is a pivotal operation in the FRI (Fast Reed-Solomon Interactive) protocol, where its leaf nodes consist of the commitment vector, culminating in an output that is a vector of four elements. Typically, to compute a Merkle tree’s root, one must supply the Merkle path (often referred to as ‘key’ in code) which is the sequence of nodes required to traverse from a leaf to the root, including the values of the leaf nodes and the values of each sibling node encountered along the path. The computation proceeds stepwise, combining sibling pairs using a hash function at each level to eventually arrive at the root. When calculating the Merkle Hash, the first step involves using the Linear hash to compute the commitment of the original message, thus constructing the values for the leaf nodes. Subsequently, these values are used to compute the Merkle tree root.

2-Performance Comparison

The three sets of benchmark tests for the merklehash/merkelize function use different levels of vectorization (standard, AVX2, and AVX512). Users interested in conducting these tests can use the command cargo bench merklehash.

Base Implementation: The initial benchmark results, without the use of advanced vectorization techniques, set the baseline for performance.
AVX2 Implementation: The introduction of AVX2 shows a notable decrease in execution time. The performance improvement is significant, demonstrating around a 46% reduction in time.
AVX512 Implementation: A further reduction in execution time is observed with AVX512, with an additional 30% improvement over the AVX2 results.

References

[1] https://cp4space.hatsya.com/2021/09/01/an-efficient-prime-for-number-theoretic-transforms/

[2]https://tosc.iacr.org/index.php/ToSC/article/view/9850