Prefix Sum - Scan algorithm

Oct 19, 2025 · 3 min read

Relevant for efficient Linear RNNs and State Space Models (SSMs)

Definition

Takes:
- An input array $[x_{0}, \dots, x_{n - 1}]$
- An associative operator $\oplus$
  - e.g. sum, product, min, max
Returns:
- An output array $[y_{0}, \dots, y_{n - 1}]$
  - Inclusive scan: $y_{i} = x_{0} \oplus \dots \oplus x_{i}$
    - $y_{0} = x_{0}$
  - Exclusive scan: $y_{i} = x_{0} \oplus \dots \oplus x_{i - 1}$
    - $y_{0}$ is the identity element of $\oplus$
      - e.g. 0 for sum
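As a reference point for the parallel versions, here is a minimal sequential sketch of both variants in Python. The function names `inclusive_scan` and `exclusive_scan` are illustrative choices, not from any library.

```python
from operator import add

def inclusive_scan(xs, op=add):
    """y[i] = x[0] op ... op x[i]."""
    out = []
    for x in xs:
        # The first output is x[0] itself; afterwards extend the running result.
        out.append(x if not out else op(out[-1], x))
    return out

def exclusive_scan(xs, identity, op=add):
    """y[i] = x[0] op ... op x[i-1], with y[0] = identity."""
    out = []
    acc = identity
    for x in xs:
        out.append(acc)      # emit the result *before* including x
        acc = op(acc, x)
    return out

data = [3, 1, 7, 0, 4, 1, 6, 3]
print(inclusive_scan(data))        # [3, 4, 11, 11, 15, 16, 22, 25]
print(exclusive_scan(data, 0))     # [0, 3, 4, 11, 11, 15, 16, 22]
print(inclusive_scan(data, max))   # [3, 3, 7, 7, 7, 7, 7, 7]
```

The exclusive result is the inclusive result shifted right by one position, with the identity in front.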

Parallel scan

Parallel scan requires synchronization across parallel workers
- On GPUs, it’s cheaper to synchronize threads within the same thread block than threads in different blocks.

Segmented scan

Approach: segmented scan
- Every thread block scans its own segment
- Scan the segments’ partial sums, i.e. the last value of each scanned segment (parallel across blocks)
  - This is another scan, but over the last value of each segment.
- Add the scanned partial sum of the preceding segments to every element of the next segment
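The three phases above can be sketched in plain Python. This is a sequential simulation, not GPU code: each slice stands in for the segment owned by one thread block, `itertools.accumulate` stands in for the per-block scan, and `segmented_scan` and `block_size` are hypothetical names chosen for illustration.

```python
from itertools import accumulate

def segmented_scan(xs, block_size):
    # Phase 1: each "thread block" scans its own segment independently.
    segments = [list(accumulate(xs[i:i + block_size]))
                for i in range(0, len(xs), block_size)]
    # Phase 2: scan the last value (the total) of every segment.
    block_totals = list(accumulate(seg[-1] for seg in segments))
    # Phase 3: add the scanned total of all preceding segments to each segment.
    out = list(segments[0]) if segments else []
    for b in range(1, len(segments)):
        offset = block_totals[b - 1]
        out.extend(v + offset for v in segments[b])
    return out

print(segmented_scan(list(range(1, 11)), block_size=4))
# [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
```

The first segment needs no offset, which is why phase 3 starts at the second segment.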

Parallel scan implementation within a single thread block

Kogge-Stone

Kogge-Stone example with 8 elements

We’re looking at parallel (inclusive) scan
Let’s do ONE parallel reduction tree
- We write in place in the same array
- We get the last element, and some other outputs as a byproduct
  - Blue values in the diagram are “already valid” solutions
- SOME VALUES ARE NEVER TOUCHED, e.g. $x_{2}, x_{4}, x_{6}$
Now, if we overlap and do 4 reduction trees, we get
- First iteration: for every element, add the element one step before it
- Second iteration: for every element, add the element two steps before it
- Third iteration: for every element, add the element four steps before it
We get $O(\log N)$ iterations
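A minimal Python sketch of these iterations, simulating the threads sequentially. The name `kogge_stone_scan` is illustrative; the snapshot copy is an assumption of this simulation that mimics "all threads read before anyone writes".

```python
def kogge_stone_scan(xs):
    ys = list(xs)
    n = len(ys)
    stride = 1                      # 1, 2, 4, ... elements "before it"
    while stride < n:
        prev = list(ys)             # snapshot: every thread reads old values
        for i in range(stride, n):  # conceptually one thread per element i
            ys[i] = prev[i - stride] + prev[i]
        stride *= 2
    return ys

print(kogge_stone_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [3, 4, 11, 11, 15, 16, 22, 25]
```

For the 8-element input the array evolves as `[3, 4, 8, 7, 4, 5, 7, 9]` after stride 1, `[3, 4, 11, 11, 12, 12, 11, 14]` after stride 2, and the final result after stride 4.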

Kogge-Stone Parallel (Inclusive) Scan

Use the above algorithm with one thread for each element

Runtime analysis

A parallel algorithm is work-efficient if it performs asymptotically the same amount of work as the corresponding sequential algorithm
Scan work efficiency
- Sequential scan performs about $N$ additions, i.e. $O(N)$ work
- Kogge-Stone parallel scan performs:
  - $\log_2 N$ steps, with $N - 2^{s}$ operations at step $s = 0, 1, \dots, \log_2 N - 1$
  - Total: $(N - 1) + (N - 2) + (N - 4) + \dots + (N - N/2) = N \log_2 N - (N - 1) = O(N \log N)$ operations
- The algorithm is not work-efficient
If resources are limited (fewer execution units than elements), the parallel algorithm will be slow because of this low work efficiency
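The operation count can be checked numerically. The sketch below only counts additions per step (it does not scan any data); `kogge_stone_ops` is a hypothetical helper name, and $N$ is assumed to be a power of two so that the closed form applies exactly.

```python
def kogge_stone_ops(n):
    """Count the additions performed by Kogge-Stone on n elements."""
    ops, stride = 0, 1
    while stride < n:
        ops += n - stride   # elements stride..n-1 each do one addition
        stride *= 2
    return ops

for n in (8, 1024):
    log_n = n.bit_length() - 1          # exact log2 for powers of two
    assert kogge_stone_ops(n) == n * log_n - (n - 1)
    print(n, kogge_stone_ops(n))
# 8 17
# 1024 9217
```

At $N = 1024$ that is roughly nine times the additions a sequential scan needs, which is the gap the work-efficiency argument is about.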

Implementation

Memory optimization

The input array $x$ is in global memory
- Reading from and writing to global memory on every iteration is wasteful
Load once into a shared memory buffer, and do every iteration in shared memory
Write the result back to global memory at the end

Details

You need to sync the threads between every iteration
You need to be careful about race conditions: a thread may overwrite an element before another thread has read its old value
- To fix this, ensure that all threads finish reading before any thread starts writing
  - read → __syncthreads(); → write
- A better way to do it
  - double buffering
    - read from buffer 1, write to buffer 2
    - alternate (read from buffer 2, write to buffer 1)
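Double buffering can be illustrated with a sequential Python simulation. This is a sketch, not a kernel: the two lists stand in for two shared-memory buffers, `kogge_stone_double_buffered` is an illustrative name, and the comment marks where a real kernel would presumably place its barrier.

```python
def kogge_stone_double_buffered(xs):
    n = len(xs)
    src = list(xs)    # "load once": stands in for the shared-memory buffer
    dst = [0] * n     # second buffer of the same size
    stride = 1
    while stride < n:
        for i in range(n):          # conceptually one thread per element
            if i >= stride:
                dst[i] = src[i - stride] + src[i]
            else:
                dst[i] = src[i]     # copy through, otherwise dst is stale
        # A GPU kernel would synchronize the block here (__syncthreads()).
        src, dst = dst, src         # swap roles for the next iteration
        stride *= 2
    return src                      # "write out at the end"

print(kogge_stone_double_buffered([3, 1, 7, 0, 4, 1, 6, 3]))
# [3, 4, 11, 11, 15, 16, 22, 25]
```

Because reads and writes never target the same buffer within an iteration, no thread can observe a half-updated value. The cost is twice the buffer space, and elements below the stride must be copied through explicitly.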

Brent-Kung

Brent-Kung takes more steps but is more work-efficient

Brent-Kung example with 8 elements

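A minimal Python sketch of the two Brent-Kung stages, run on 8 elements. It simulates the threads sequentially and assumes the length is a power of two; the name `brent_kung_scan` and the exact index arithmetic are one common formulation, not necessarily the one used in the original diagram.

```python
def brent_kung_scan(xs):
    ys = list(xs)
    n = len(ys)
    assert n >= 1 and n & (n - 1) == 0, "sketch assumes a power-of-two length"

    # Reduction stage: build one reduction tree in place.
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            ys[i] += ys[i - stride]
        stride *= 2

    # Post-reduction stage: push partial sums back down to the
    # positions the reduction tree never updated.
    stride = n // 4
    while stride >= 1:
        for i in range(3 * stride - 1, n, 2 * stride):
            ys[i] += ys[i - stride]
        stride //= 2
    return ys

print(brent_kung_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [3, 4, 11, 11, 15, 16, 22, 25]
```

Within a step, no element is both read and written by different iterations of the inner loop, so the in-place updates are safe in this formulation. After the reduction stage the array is `[3, 4, 7, 11, 4, 5, 6, 25]`: the last element is already final, and the post-reduction stage fixes the rest.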

Analysis

Reduction stage:
- $\log_2 N$ steps
- $N/2 + N/4 + \dots + 2 + 1 = N - 1$ operations
Post-reduction stage:
- $\log_2 N - 1$ steps
- $(2 - 1) + (4 - 1) + \dots + (N/2 - 1) = (N - 2) - (\log_2 N - 1)$ operations
Total:
- $2 \log_2 N - 1$ steps
- $2N - \log_2 N - 2 = O(N)$ operations
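These totals can be sanity-checked by counting steps and additions directly. The sketch below only counts (no data is scanned), assumes a power-of-two $N \ge 2$, and uses a hypothetical helper name `brent_kung_cost` with one common choice of index pattern for the two stages.

```python
def brent_kung_cost(n):
    """Return (steps, additions) for Brent-Kung on n elements."""
    assert n >= 2 and n & (n - 1) == 0, "sketch assumes a power of two"
    steps = ops = 0

    stride = 1                      # reduction stage
    while stride < n:
        ops += len(range(2 * stride - 1, n, 2 * stride))
        steps += 1
        stride *= 2

    stride = n // 4                 # post-reduction stage
    while stride >= 1:
        ops += len(range(3 * stride - 1, n, 2 * stride))
        steps += 1
        stride //= 2
    return steps, ops

for n in (8, 64, 1024):
    log_n = n.bit_length() - 1      # exact log2 for powers of two
    steps, ops = brent_kung_cost(n)
    assert steps == 2 * log_n - 1
    assert ops == 2 * n - log_n - 2
    print(n, steps, ops)
# 8 5 11
# 64 11 120
# 1024 19 2036
```

For $N = 8$ this is 5 steps and 11 additions, versus 3 steps and 17 additions for Kogge-Stone: more steps, less work.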
