I am currently working on an operating system in a new systems programming language called Zig. While working on my OS, one of the things I was curious about is how much faster an application could run on bare metal without all interrupts and overhead from OS scheduler. Thus, I thought a pseudo Bitcoin miner would be the perfect candidate to test this theory! As a quick overview, my pseudo Bitcoin miner does the following:
- Takes in 3 hardcoded arguments:
- 76 bytes of preprocessed data. In this context, we don’t really care what these bytes are.
- The “target” which is a 32-byte number.
- A “start” and “end” 4-byte number.
- Initialize a 4-byte (32-bit) number to the “start” number. We will call this the nonce.
- Concatenate the 76 byte and the nonce together to get an 80 byte string.
- Hash this 80 byte string with the SHA256 and take the result and hash it again with SHA256, i.e. SHA256(SHA256(S)) where S is the 80 byte string.
- If the resulting hash is less than the target then we successfully mined a block! We stop here.
- If not, we increment the nonce. If the nonce is less than or equal to “end” we go back to 3. Else, we stop and we failed to mine a block.
To be clear, this program does not connect to the internet (nor does my OS have network support yet), but it does do the same hashing work that a real Bitcoin miner does.
I wrote two versions of the miner, 1 to run on my OS and a Linux version to compare with. This miner had spawns multiple threads to split up the 32-bit number space of the nonce and mine each piece of this space concurrently. This miner spawns the number of hardware threads the CPU supports, minus 1, leaving one thread open for the main kernel thread.
My machine’s CPU has 4 hardware threads, so the miner spawns 3 threads on this machine:
- Thread 0 will try nonces 0x0000_0000 – 0x5555_5555 (inclusive). To be specific, 0x0000_0000 is the “start” number and 0x5555_5555 is the “end” number.
- Thread 1 will try nonces 0x5555_5556 – 0xAAAA_AAAAA.
- Thread 2 will try nonces 0xAAAA_AAAAB – 0xFFFF_FFFFF.
To get valid data, I had to make sure that I was running Linux on the same machine I was booting to my OS on. I measured the following per thread:
- The average number of CPU cycles it took to do one hash.
- The maximum number of CPU cycles it took to do one hash.
- The minimum number of CPU cycles it took to do one hash.
- The total time it took to mine a block in seconds.
CPU cycles are measured using the rdtscp AMD64 instruction. I am running on an Intel Core i5-7600k @ ~4ghz (overclocked). Which means each cycle is about 0.25 nanoseconds. It should be noted that in this example, Thread 1 is the thread that happens to mine the block. Once it finds the block, it and all other threads stop. Here are the numbers of the miner running on my OS vs Linux:
Pseudo Bitcoin Miner Performance
|Total Time To Mine Block||285 seconds||261 seconds|
|Avg # Cycles/Hash (Thread 0)||2,989 cycles||2,678 cycles|
|Max # Cycles/Hash (Thread 0)||5,015,404 cycles||65,555,356 cycles|
|Min # Cycles/Hash (Thread 0)||2,816 cycles||2,424 cycles|
|Total # Cycles (Thread 0)||1,066,361,841,588 cycles||970,554,751,530 cycles|
|Avg # Cycles/Hash (Thread 1)||2,956 cycles||2,679 cycles|
|Max # Cycles/Hash (Thread 1)||5,015,968 cycles||59,477,408 cycles|
|Min # Cycles/Hash (Thread 1)||2,818 cycles||2,430 cycles|
|Total # Cycles (Thread 1)||1,045,330,042,596 cycles||947,279,596,516 cycles|
|Avg # Cycles/Hash (Thread 2)||2,941 cycles||2,679 cycles|
|Max # Cycles/Hash (Thread 2)||5,016,996 cycles||61,743,548 cycles|
|Min # Cycles/Hash (Thread 2)||2,816 cycles||2,430 cycles|
|Total # Cycles (Thread 2)||1,035,229,379,122 cycles||951,599,787,438 cycles|
Wait, what? The miner on my OS is running bare metal with no scheduler or any other interrupts happening. So why is the Linux version successfully mining a block more than 20 seconds faster than the version on my OS? I checked the code and I made sure that the Linux and my OS version were sharing as much code as they could. So I was pretty sure it wasn’t changes in the code.
The next thing to do was to compare the assembly of the Linux to version on my OS. Let’s take a look at this piece of code which sets up the 80-byte string to hash (i.e. mine):
Now here is the disassembly of the Linux version of the same line:
To the (un)trained eye, you can immediately tell the difference. Not only are there fewer lines in the Linux version than the version for my OS, they are different instructions. The “movups”/”movaps” instructions move 16 bytes of data at a time, while the more efficient “vmovups” moves 32 bytes at a time! The movups/movaps instructions are part of the SSE instruction set (introduced in 1999) and vmovups is part of the AVX instruction set (introduced in 2011).
Note: “More efficient” is very hand-wavy here. There are subtle complex factors you need to consider (e.g. pipelining, state of the cache, etc) to evaluate to determine if an instruction is “more efficient”. But for the purposes of this article, all you need to understand is this: vmovups is actually slower than movaps/movups, but vmovups is copying twice the data. Vmovups less than 2 times slower than movups/movaps, so it’s a win in the end.
Why is it choosing a more efficient version for the Linux version than the one for my OS? Well, because when I am compiling the Linux version, I am compiling natively for my machine, which supports AVX. But, when I compile for my OS, it’s in cross-compilation mode for UEFI applications and chooses the lowest common denominator of AMD64 processors, which does not include AVX.
Something I should note before continuing is my OS is divided up into 2 parts right now. The bootloader and the kernel. The bootloader is a proper UEFI application (i.e. a PE32+/COFF binary), while the kernel is binary blob embedded as a byte array using Zig’s “@embedFile()”. The bootloader starts up and then copies the kernel into working (virtualized) memory and jumps to the kernel’s entry point.
It turns out that you need to pass the “-mcpu=native+avx2” to Zig to get it to use the AVX2 instruction set. I am choosing AVX2 (introduced in 2013) instead of AVX since my machine supports it.
So I compiled my OS and booted and……. it crashes. It turns out when the machine enters the bootloader from the BIOS/UEFI shell, it does not have AVX instructions enabled yet. So, when it encounters some AVX instructions, the CPU throws an Undefined Instruction exception. So, I added in code to enable AVX (which is flipping a bit in the XCR0 control register) in the bootloader and made sure that the bootloader is not compiled in AVX mode since that’s the place where we enable it. I guess that’d be a good reason why Zig doesn’t compile with AVX instructions for UEFI by default!
So I compiled my OS and booted and……. it works! Without further ado, here are the final stats:
Pseudo Bitcoin Miner Performance with AVX
|Total Time To Mine Block||251 seconds||261 seconds|
|Avg # Cycles/Hash (Thread 0)||2,520 cycles||2,678 cycles|
|Max # Cycles/Hash (Thread 0)||4,919,960 cycles||65,555,356 cycles|
|Min # Cycles/Hash (Thread 0)||2,436 cycles||2,424 cycles|
|Total # Cycles (Thread 0)||908,680,867,300 cycles||970,554,751,530 cycles|
|Avg # Cycles/Hash (Thread 1)||2,559 cycles||2,679 cycles|
|Max # Cycles/Hash (Thread 1)||4,920,852 cycles||59,477,408 cycles|
|Min # Cycles/Hash (Thread 1)||2,450 cycles||2,430 cycles|
|Total # Cycles (Thread 1)||904,949,266,874 cycles||947,279,596,516 cycles|
|Avg # Cycles/Hash (Thread 2)||2,542 cycles||2,679 cycles|
|Max # Cycles/Hash (Thread 2)||4,921,496 cycles||61,743,548 cycles|
|Min # Cycles/Hash (Thread 2)||2,438 cycles||2,430 cycles|
|Total # Cycles (Thread 2)||927,407,896,000 cycles||951,599,787,438 cycles|
Much better! The version for my OS runs 10 seconds faster, which is pretty cool. One caveat here is that if I run the miner more than once for both my OS and Linux, the version for my OS has about the same timings, but the Linux one becomes a lot faster due to caching, I assume. At that point the version for my OS only runs about 1 second faster. But, I will say if you are writing a Bitcoin miner, you will never be trying mining the same block twice!
TL;DR: My Bitcoin miner ran faster on Linux because I didn’t realize to enable AVX instructions for my OS.
I hope you found this dissection informative. Any questions and comments are greatly appreciated!