I/O is no longer the bottleneck? (part 2)
Nov 30th, 2022
My quest to count words faster than NVMe sequential read speed has come to a close. In my previous blog post I ended up with a rather unconvincing 1.45 GB/s throughput using AVX2 instructions, even though NVMe sequential read speed on a warm cache was 12.8 GB/s. It was unconvincing not only cause it was below NVMe throughput, but also because it wasn't even doing the job of counting word frequencies — it's only counting total words.
I'm happy someone showed me a related project: fastlwc. Of course someone has already implemented a truly fast word count. I tried to understand their ideas and learned about two great tricks. Then I implemented them myself, but in a simpler way, getting equivalent performance.
The result is very fast word count for AVX2 in just a handful lines of C.
Trick 1: Better shifting
This idea occurred to me earlier, but somehow I did not pursue it. In the previous blog post I used Move Byte Mask tricks to obtain a mask (32 bits int
) with the locations of whitespace for 32 characters of text. Then I used Find First Set to iterate over the whitespace locations.
This was not entirely efficient, cause we had to ensure we were not counting whitespaces directly preceded by another whitespace.
Instead, a very simple bit-shift trick can be used to turn a mask of whitespace into a mask of word boundaries. Given a string and a whitespace mask
|this is a sentence| |000011001011100000000|
simply shift right
and and
to get a mask for whitespaces that are directly preceded by a whitespace:
|this is a sentence| |000011001011100000000| = mask |000001100101110000000| = mask >> 1 |000001000001100000000| = mask & (mask >> 1)
So, if we flip the bits we get a mask for characters that are non-whitespace or whitespace directly after a word. We want just whitespaces after a word, so we and
with the original whitespace mask again:
|this is a sentence| |000011001011100000000| = mask |000001000001100000000| = mask & (mask >> 1) |000010001010000000000| = mask & ~(mask & (mask >> 1))
So, with x & ~(x & (x >> 1))
we get a mask for the first whitespaces after each word.
Finally, we have to take the boundary into account. The following example
|this is | four words |
should give us the masks
|this is | four words | |000010010000000000000|000000000010000010000|
The second mask should start with a 0
because it was preceded by a blank. So, to be entirely correct, we just have to shift in the previous whitespace mask:
boundaries = curr & ~(curr & ((curr >> 1) | (prev << 31))); prev = curr;
And with this we don't even have to iterate over bits! We can just use popcount
to count the end of words in the current window.
Note: in reality we have the characters in reverse in our register, so replace left shift with right shift and vice versa. The above is just for demonstration purposes.
Trick 2: Creating masks efficiently
In my original code I used six ==
operations and five or
to create a mask for 32 bytes of characters, scanning for spaces, newlines, tabs, vertical tabs, form feed and carriage return characters. It looked like this:
mask = _mm256_movemask_epi8(_mm256_cmpeq_epi8(vec, space)); mask |= _mm256_movemask_epi8(_mm256_cmpeq_epi8(vec, tab)); mask |= _mm256_movemask_epi8(_mm256_cmpeq_epi8(vec, newline)); mask |= _mm256_movemask_epi8(_mm256_cmpeq_epi8(vec, vtab)); mask |= _mm256_movemask_epi8(_mm256_cmpeq_epi8(vec, feed)); mask |= _mm256_movemask_epi8(_mm256_cmpeq_epi8(vec, carriage));
This can be reduced to one shuffle and one comparsion instruction. That absolutely baffled me. To understand the trick, it's crucial to realize the ascii values of all whitespace characters:
' ' == 0x20 '\t' == 0x09 '\n' == 0x0A '\v' == 0x0B '\f' == 0x0C '\r' == 0x0D
One optimization could be to exploit that five characters are in a range 0x09 - 0x0D
, e.g. x >= '\t' & x <= 'r'
would only be two comparisons for five types of whitespace. But we can do better.
I've only ever used shuffle instructions with a fixed shuffle vector on input data. For example to turn an array of structs into a struct of arrays. What was completely new to me was the idea of taking the shuffle vector as input and apply it on fixed data.
When you do this, you can think of shuffle instructions as a miniature hash map or lookup table! For the partciular shuffle instruction we're using, the hash function is
hash(char) = char | 0x0F
The hash map or lookup table (let's call it map
) is 16 entries big, mapping as follows:
0x00 => ' ' 0x01 => 0x00 0x02 => 0x00 0x03 => 0x00 0x04 => 0x00 0x05 => 0x00 0x06 => 0x00 0x07 => 0x00 0x08 => 0x00 0x09 => '\t' 0x0A => '\n' 0x0B => '\v' 0x0C => '\f' 0x0D => '\r' 0x0E => 0x00 0x0F => 0x00
So, we can detect any whitespace through map[hash(char)] == char
. One shuffle, one comparsion. Great! This only works because the whitespace characters do not have collisions.
With intrinsics, it looks like this:
// actually it's two maps for the first 16 + 16 characters __m256i map = _mm256_set_epi64x( 0x00000d0c0b0a0900, 0x0000000000000020, 0x00000d0c0b0a0900, 0x0000000000000020 ); // values = map[vec] element-wise __m256i values = _mm256_shuffle_epi8(map, vec); // map[vec] == vec element-wise __m256i vec_mask = _mm256_cmpeq_epi8(vec, values)); int mask = _mm256_movemask_epi8(vec_mask);
And then Trick 1 can be applied to the ouput mask
.
The AVX2 word count code
With Trick 1 and 2 combined, we end up with only 47 lines of C code
#define BUF_SIZE 262144 #include <stdint.h> #include <stdio.h> #include <ctype.h> #include <immintrin.h> int main() { char buf_unaligned[BUF_SIZE + 32]; char *buf = (char *)(((uintptr_t)buf_unaligned + 31) & -32); size_t num_words = 0; unsigned int all_whitespace_prev = -1; while (1) { size_t num_read = fread(buf, 1, BUF_SIZE, stdin); const char *p = buf; for (; p + 32 <= buf + num_read; p += 32) { __m256i vec = _mm256_load_si256((__m256i *)p); __m256i map = _mm256_set_epi64x(0x00000d0c0b0a0900, 0x0000000000000020, 0x00000d0c0b0a0900, 0x0000000000000020); unsigned int all_whitespace = _mm256_movemask_epi8(_mm256_cmpeq_epi8(vec, _mm256_shuffle_epi8(map, vec))); unsigned int word_end = all_whitespace & ~(all_whitespace & ((all_whitespace << 1) | (all_whitespace_prev >> 31))); num_words += __builtin_popcount(word_end); all_whitespace_prev = all_whitespace; } int is_whitespace = (all_whitespace_prev >> 31) & 1; for (; p < buf + num_read; ++p) { if (!isspace(*p)) { is_whitespace = 0; continue; } if (!is_whitespace) { ++num_words; is_whitespace = 1; } } if (num_read < BUF_SIZE) { // The last word. if (!is_whitespace) ++num_words; break; } } printf("%zu\n", num_words); }
It's quite ugly, but in a good way.
Performance
I optimized it a bit more by unrolling the vectorized loop four times. And with that I got the following performance:
Benchmark 1: ./wc-avx2 < bible-100.txt Time (mean ± σ): 45.0 ms ± 0.7 ms [User: 9.3 ms, System: 35.7 ms] Range (min … max): 43.6 ms … 46.5 ms 66 runs Benchmark 2: wc -w < bible-100.txt Time (mean ± σ): 1.730 s ± 0.011 s [User: 1.691 s, System: 0.039 s] Range (min … max): 1.712 s … 1.750 s 10 runs
Finally it runs at a decent speed, 43.6ms to process 413MB of data, or in other words: 9.3 GB/s.
Most importantly, user time (9.3ms) is way below sys time (35.7ms). That means that counting actually happens at a speed of 43 GB/s. Finally I can conclude: I/O still is the bottleneck.
We're still not counting word frequencies though. Maybe I'll try that too.