I/O is no longer the bottleneck? (part 2)

Nov 30th, 2022

My quest to count words faster than NVMe sequential read speed has come to a close. In my previous blog post I ended up with a rather unconvincing 1.45 GB/s throughput using AVX2 instructions, even though NVMe sequential read speed on a warm cache was 12.8 GB/s. It was unconvincing not only because it was below NVMe throughput, but also because it wasn't even doing the full job of counting word frequencies: it only counted total words.

I'm happy someone showed me a related project: fastlwc. Of course someone had already implemented a truly fast word count. I tried to understand their ideas and learned two great tricks. Then I implemented them myself in a simpler way, getting equivalent performance.

The result is a very fast word count for AVX2 in just a handful of lines of C.

Trick 1: Better shifting

This idea occurred to me earlier, but somehow I did not pursue it. In the previous blog post I used the Move Byte Mask trick to obtain a mask (a 32-bit int) with the locations of whitespace in 32 characters of text. Then I used Find First Set to iterate over the whitespace locations.

This was not entirely efficient, because we had to ensure we were not counting whitespace characters directly preceded by other whitespace.

Instead, a very simple bit-shift trick can be used to turn a mask of whitespace into a mask of word boundaries. Given a string and a whitespace mask

|this  is a   sentence|

simply shift right and AND with the original mask to get a mask of whitespace characters that are directly preceded by another whitespace:

|this  is a   sentence|
|000011001011100000000| = mask
|000001100101110000000| = mask >> 1
|000001000001100000000| = mask & (mask >> 1)

So, if we flip the bits we get a mask for characters that are either non-whitespace or the first whitespace after a word. We want just the whitespace after a word, so we AND with the original whitespace mask again:

|this  is a   sentence|
|000011001011100000000| = mask
|000001000001100000000| = mask & (mask >> 1)
|000010001010000000000| = mask & ~(mask & (mask >> 1))

So, with x & ~(x & (x >> 1)) we get a mask for the first whitespaces after each word.
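Incidentally, since x & ~(x & y) == x & ~y, this expression simplifies to the shorter x & ~(x >> 1): keep the set bits whose lower neighbour is clear. A quick scalar check (my own sanity test, not code from the post; `word_ends` is a made-up helper name):

```c
#include <stdint.h>

// Sanity check of the identity (not from the post): x & ~(x & (x >> 1))
// keeps exactly the set bits of x whose lower neighbouring bit is clear,
// which is the same as the shorter x & ~(x >> 1).
static uint32_t word_ends(uint32_t x) {
    return x & ~(x & (x >> 1));
}
```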

Finally, we have to take the boundary into account. The following example

|this is              |      four words     |

should give us the masks

|this is              |      four words     |
|000010010000000000000|000000000010000010000|

The second mask should start with a 0 because the first space in that window is preceded by the trailing whitespace of the previous window; processed in isolation, the formula would wrongly mark it as a word end. So, to be entirely correct, we just have to shift in the previous whitespace mask:

boundaries = curr & ~(curr & ((curr >> 1) | (prev << 31)));
prev = curr;

And with this we don't even have to iterate over bits! We can just use popcount to count the end of words in the current window.

Note: in reality we have the characters in reverse in our register, so replace left shift with right shift and vice versa. The above is just for demonstration purposes.
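To make Trick 1 concrete without any SIMD, here is a scalar sketch (my own illustration, not code from the post; `count_words` is a made-up name and, to keep it short, only ' ' counts as whitespace). It builds a 64-bit whitespace mask per chunk, counts word-ending whitespace with popcount (using the GCC/Clang builtin), and carries the last bit between chunks just like the prev mask above:

```c
#include <stddef.h>
#include <stdint.h>

// Scalar illustration of Trick 1 (a sketch, not the post's code).
// Bit i of `mask` is set when text[off + i] is a space; in this bit
// order the "previous character" is one shift LEFT, the mirror image
// of the left-to-right demonstration above.
static size_t count_words(const char *text, size_t len) {
    size_t num_words = 0;
    uint64_t carry = 1; // pretend the text is preceded by whitespace
    for (size_t off = 0; off < len; off += 64) {
        size_t n = len - off < 64 ? len - off : 64;
        uint64_t mask = 0;
        for (size_t i = 0; i < n; i++)
            if (text[off + i] == ' ')
                mask |= (uint64_t)1 << i;
        // bit i set when text[off + i] is the first whitespace after a word
        uint64_t word_end = mask & ~((mask << 1) | carry);
        num_words += (size_t)__builtin_popcountll(word_end);
        carry = (mask >> (n - 1)) & 1; // was the chunk's last char a space?
    }
    if (len > 0 && carry == 0)
        num_words++; // the text ended in the middle of a word
    return num_words;
}
```

No iteration over individual bits: each 64-character chunk costs a couple of shifts, ANDs and one popcount.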

Trick 2: Creating masks efficiently

In my original code I used six == operations and five ORs to create a mask for 32 bytes of characters, scanning for space, tab, newline, vertical tab, form feed and carriage return characters. It looked like this:

mask  = _mm256_movemask_epi8(_mm256_cmpeq_epi8(vec, space));
mask |= _mm256_movemask_epi8(_mm256_cmpeq_epi8(vec, tab));
mask |= _mm256_movemask_epi8(_mm256_cmpeq_epi8(vec, newline));
mask |= _mm256_movemask_epi8(_mm256_cmpeq_epi8(vec, vtab));
mask |= _mm256_movemask_epi8(_mm256_cmpeq_epi8(vec, feed));
mask |= _mm256_movemask_epi8(_mm256_cmpeq_epi8(vec, carriage));

This can be reduced to one shuffle and one comparison instruction. That absolutely baffled me. To understand the trick, it's crucial to look at the ASCII values of all whitespace characters:

 ' ' == 0x20
'\t' == 0x09
'\n' == 0x0A
'\v' == 0x0B
'\f' == 0x0C
'\r' == 0x0D

One optimization could be to exploit that five characters are in the contiguous range 0x09 to 0x0D, e.g. x >= '\t' && x <= '\r' would be only two comparisons for five types of whitespace (plus one more for the space). But we can do better.
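For reference, the range idea in scalar form (my own illustration, not the post's code; `is_ws_range` is a made-up name):

```c
// Scalar version of the range idea (an illustration, not the post's code):
// '\t' (0x09) through '\r' (0x0D) are contiguous, so two comparisons cover
// five whitespace characters, and the space needs a third.
static int is_ws_range(unsigned char c) {
    return (c >= '\t' && c <= '\r') || c == ' ';
}
```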

I've only ever used shuffle instructions with a fixed shuffle vector on input data, for example to turn an array of structs into a struct of arrays. What was completely new to me was the idea of taking the shuffle vector as input and applying it to fixed data.

When you do this, you can think of the shuffle instruction as a miniature hash map or lookup table! For the particular shuffle instruction we're using, the hash function is

hash(char) = char & 0x0F

The hash map or lookup table (let's call it map) is 16 entries big, mapping as follows:

0x00 => ' '
0x01 => 0x00
0x02 => 0x00
0x03 => 0x00
0x04 => 0x00
0x05 => 0x00
0x06 => 0x00
0x07 => 0x00
0x08 => 0x00
0x09 => '\t'
0x0A => '\n'
0x0B => '\v'
0x0C => '\f'
0x0D => '\r'
0x0E => 0x00
0x0F => 0x00

So, we can detect any whitespace character through map[hash(char)] == char. One shuffle, one comparison. Great! This only works because the whitespace characters do not collide in their low nibbles.
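Here is the same trick in scalar form (my own sketch, not the post's code; `is_ws_lut` is a made-up name): a 16-entry table indexed by the low nibble, followed by an equality check, mirroring what shuffle + compare does per byte:

```c
// Scalar version of the shuffle trick (a sketch, not the post's code):
// map[c & 0x0F] plays the role of _mm256_shuffle_epi8, and the equality
// check rejects non-whitespace bytes that happen to share a low nibble
// with a whitespace character (e.g. 'I' == 0x49 hits map[9] == '\t'
// but fails the comparison).
static int is_ws_lut(unsigned char c) {
    static const unsigned char map[16] = {
        ' ', 0,    0,    0,    0,    0,    0, 0,
        0,   '\t', '\n', '\v', '\f', '\r', 0, 0
    };
    return map[c & 0x0F] == c;
}
```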

With intrinsics, it looks like this:

// actually it's two identical 16-byte maps, one per 128-bit lane
__m256i map = _mm256_set_epi64x(
    0x00000d0c0b0a0900, 0x0000000000000020,
    0x00000d0c0b0a0900, 0x0000000000000020);

// values = map[vec & 0x0F] element-wise
__m256i values = _mm256_shuffle_epi8(map, vec);

// map[vec & 0x0F] == vec element-wise
__m256i vec_mask = _mm256_cmpeq_epi8(vec, values);

int mask = _mm256_movemask_epi8(vec_mask);

And then Trick 1 can be applied to the output mask.

The AVX2 word count code

With Tricks 1 and 2 combined, we end up with only 47 lines of C code:

#define BUF_SIZE 262144

#include <stdint.h>
#include <stdio.h>
#include <ctype.h>
#include <immintrin.h>

int main() {
    char buf_unaligned[BUF_SIZE + 32];
    char *buf = (char *)(((uintptr_t)buf_unaligned + 31) & -32);
    size_t num_words = 0;
    unsigned int all_whitespace_prev = -1;

    while (1) {
        size_t num_read = fread(buf, 1, BUF_SIZE, stdin);
        const char *p = buf;

        for (; p + 32 <= buf + num_read; p += 32) {
            __m256i vec = _mm256_load_si256((__m256i *)p);
            __m256i map = _mm256_set_epi64x(0x00000d0c0b0a0900, 0x0000000000000020, 0x00000d0c0b0a0900, 0x0000000000000020);
            unsigned int all_whitespace = _mm256_movemask_epi8(_mm256_cmpeq_epi8(vec, _mm256_shuffle_epi8(map, vec)));
            unsigned int word_end = all_whitespace & ~(all_whitespace & ((all_whitespace << 1) | (all_whitespace_prev >> 31)));
            num_words += __builtin_popcount(word_end);
            all_whitespace_prev = all_whitespace;
        }

        // Scalar tail: only runs for the final, short read.
        int is_whitespace = (all_whitespace_prev >> 31) & 1;
        for (; p < buf + num_read; ++p) {
            if (!isspace((unsigned char)*p)) {
                is_whitespace = 0;
            } else {
                if (!is_whitespace) ++num_words;
                is_whitespace = 1;
            }
        }

        if (num_read < BUF_SIZE) {
            // The last word.
            if (!is_whitespace) ++num_words;
            break;
        }
    }

    printf("%zu\n", num_words);
    return 0;
}

It's quite ugly, but in a good way.


I optimized it a bit more by unrolling the vectorized loop four times. And with that I got the following performance:

Benchmark 1: ./wc-avx2 < bible-100.txt
  Time (mean ± σ):      45.0 ms ±   0.7 ms    [User: 9.3 ms, System: 35.7 ms]
  Range (min … max):    43.6 ms …  46.5 ms    66 runs
Benchmark 2: wc -w < bible-100.txt
  Time (mean ± σ):      1.730 s ±  0.011 s    [User: 1.691 s, System: 0.039 s]
  Range (min … max):    1.712 s …  1.750 s    10 runs

Finally it runs at a decent speed: 45 ms on average to process 413 MB of data, or in other words, roughly 9.2 GB/s.

Most importantly, user time (9.3 ms) is way below system time (35.7 ms). That means the counting itself happens at a speed of roughly 43 GB/s; the rest is spent reading the data. Finally I can conclude: I/O is still the bottleneck.

We're still not counting word frequencies though. Maybe I'll try that too.