Bitcoin Miner on my OS vs Linux

Bitcoin Miner on my OS vs Linux
Screenshot of my OS with the Bitcoin miner running center-right. Despite it saying “Reach OS” in the screenshot, my OS is currently unnamed. When it boots up, it randomly chooses a word from a word list and appends “OS” to it. 

I am currently working on an operating system in a new systems programming language called Zig. While working on my OS, one of the things I was curious about is how much faster an application could run on bare metal without all interrupts and overhead from OS scheduler. Thus, I thought a pseudo Bitcoin miner would be the perfect candidate to test this theory! As a quick overview, my pseudo Bitcoin miner does the following:

  1. Takes in 3 hardcoded arguments:
    1. 76 bytes of preprocessed data. In this context, we don’t really care what these bytes are.
    2. The “target” which is a 32-byte number.
    3. A “start” and “end” 4-byte number.
  2. Initialize a 4-byte (32-bit) number to the “start” number. We will call this the nonce.
  3. Concatenate the 76 byte and the nonce together to get an 80 byte string.
  4. Hash this 80 byte string with the SHA256 and take the result and hash it again with SHA256, i.e. SHA256(SHA256(S)) where S is the 80 byte string.
  5. If the resulting hash is less than the target then we successfully mined a block! We stop here.
  6. If not, we increment the nonce. If the nonce is less than or equal to “end” we go back to 3. Else, we stop and we failed to mine a block.

To be clear, this program does not connect to the internet (nor does my OS have network support yet), but it does do the same hashing work that a real Bitcoin miner does.

I wrote two versions of the miner, 1 to run on my OS and a Linux version to compare with. This miner had spawns multiple threads to split up the 32-bit number space of the nonce and mine each piece of this space concurrently. This miner spawns the number of hardware threads the CPU supports, minus 1, leaving one thread open for the main kernel thread. 

My machine’s CPU has 4 hardware threads, so the miner spawns 3 threads on this machine:

  • Thread 0 will try nonces 0x0000_0000 – 0x5555_5555 (inclusive). To be specific, 0x0000_0000 is the “start” number and 0x5555_5555 is the “end” number.
  • Thread 1 will try nonces 0x5555_5556 – 0xAAAA_AAAAA. 
  • Thread 2 will try nonces 0xAAAA_AAAAB – 0xFFFF_FFFFF.

To get valid data, I had to make sure that I was running Linux on the same machine I was booting to my OS on. I measured the following per thread:

  • The average number of CPU cycles it took to do one hash.
  • The maximum number of CPU cycles it took to do one hash.
  • The minimum number of CPU cycles it took to do one hash.
  • The total time it took to mine a block in seconds.

CPU cycles are measured using the rdtscp AMD64 instruction. I am running on an Intel Core i5-7600k @ ~4ghz (overclocked). Which means each cycle is about 0.25 nanoseconds. It should be noted that in this example, Thread 1 is the thread that happens to mine the block. Once it finds the block, it and all other threads stop. Here are the numbers of the miner running on my OS vs Linux:

Pseudo Bitcoin Miner Performance

MeasurementsMy OSLinux
Total Time To Mine Block285 seconds261 seconds
Avg # Cycles/Hash (Thread 0)2,989 cycles2,678 cycles
Max # Cycles/Hash (Thread 0)5,015,404 cycles65,555,356 cycles
Min # Cycles/Hash (Thread 0)2,816 cycles2,424 cycles
Total # Cycles (Thread 0)1,066,361,841,588 cycles970,554,751,530 cycles
Avg # Cycles/Hash (Thread 1)2,956 cycles2,679 cycles 
Max # Cycles/Hash (Thread 1)5,015,968 cycles59,477,408 cycles
Min # Cycles/Hash (Thread 1)2,818 cycles2,430 cycles
Total # Cycles (Thread 1)1,045,330,042,596 cycles947,279,596,516 cycles
Avg # Cycles/Hash (Thread 2)2,941 cycles2,679 cycles
Max # Cycles/Hash (Thread 2)5,016,996 cycles61,743,548 cycles
Min # Cycles/Hash (Thread 2)2,816 cycles2,430 cycles
Total # Cycles (Thread 2)1,035,229,379,122 cycles951,599,787,438 cycles

Wait, what? The miner on my OS is running bare metal with no scheduler or any other interrupts happening. So why is the Linux version successfully mining a block more than 20 seconds faster than the version on my OS? I checked the code and I made sure that the Linux and my OS version were sharing as much code as they could. So I was pretty sure it wasn’t changes in the code.

The next thing to do was to compare the assembly of the Linux to version on my OS. Let’s take a look at this piece of code which sets up the 80-byte string to hash (i.e. mine):

The constant “current_block” is an 80-byte array where the first 76 bytes are the hardcoded preprocessed data we don’t care about. The last 4 bytes are 0s. We create a new variable called “block_header” with this 80 byte array and then fill out the last 4 bytes with the nonce we are going to mine with.

Let’s take a look at the disassembly of just the first line. First is the disassembly of the version for my OS:

Now here is the disassembly of the Linux version of the same line:

To the (un)trained eye, you can immediately tell the difference. Not only are there fewer lines in the Linux version than the version for my OS, they are different instructions. The “movups”/”movaps” instructions move 16 bytes of data at a time, while the more efficient “vmovups” moves 32 bytes at a time! The movups/movaps instructions are part of the SSE instruction set (introduced in 1999) and vmovups is part of the AVX instruction set (introduced in 2011).

Note: “More efficient” is very hand-wavy here. There are subtle complex factors you need to consider (e.g. pipelining, state of the cache, etc) to evaluate to determine if an instruction is “more efficient”. But for the purposes of this article, all you need to understand is this: vmovups is actually slower than movaps/movups, but vmovups is copying twice the data. Vmovups less than 2 times slower than movups/movaps, so it’s a win in the end.

Why is it choosing a more efficient version for the Linux version than the one for my OS? Well, because when I am compiling the Linux version, I am compiling natively for my machine, which supports AVX. But, when I compile for my OS, it’s in cross-compilation mode for UEFI applications and chooses the lowest common denominator of AMD64 processors, which does not include AVX.

Something I should note before continuing is my OS is divided up into 2 parts right now. The bootloader and the kernel. The bootloader is a proper UEFI application (i.e. a PE32+/COFF binary), while the kernel is binary blob embedded as a byte array using Zig’s “@embedFile()”. The bootloader starts up and then copies the kernel into working (virtualized) memory and jumps to the kernel’s entry point. 

It turns out that you need to pass the “-mcpu=native+avx2” to Zig to get it to use the AVX2 instruction set. I am choosing AVX2 (introduced in 2013) instead of AVX since my machine supports it.

So I compiled my OS and booted and……. it crashes. It turns out when the machine enters the bootloader from the BIOS/UEFI shell, it does not have AVX instructions enabled yet. So, when it encounters some AVX instructions, the CPU throws an Undefined Instruction exception. So, I added in code to enable AVX (which is flipping a bit in the XCR0 control register) in the bootloader and made sure that the bootloader is not compiled in AVX mode since that’s the place where we enable it. I guess that’d be a good reason why Zig doesn’t compile with AVX instructions for UEFI by default!

So I compiled my OS and booted and……. it works! Without further ado, here are the final stats:

Pseudo Bitcoin Miner Performance with AVX

MeasurementsMy OSLinux
Total Time To Mine Block251 seconds261 seconds
Avg # Cycles/Hash (Thread 0)2,520 cycles2,678 cycles
Max # Cycles/Hash (Thread 0)4,919,960 cycles65,555,356 cycles
Min # Cycles/Hash (Thread 0)2,436 cycles2,424 cycles
Total # Cycles (Thread 0)908,680,867,300 cycles970,554,751,530 cycles
Avg # Cycles/Hash (Thread 1)2,559 cycles2,679 cycles 
Max # Cycles/Hash (Thread 1)4,920,852 cycles59,477,408 cycles
Min # Cycles/Hash (Thread 1)2,450 cycles2,430 cycles
Total # Cycles (Thread 1)904,949,266,874 cycles947,279,596,516 cycles
Avg # Cycles/Hash (Thread 2)2,542 cycles2,679 cycles
Max # Cycles/Hash (Thread 2)4,921,496 cycles61,743,548 cycles
Min # Cycles/Hash (Thread 2)2,438 cycles2,430 cycles
Total # Cycles (Thread 2)927,407,896,000 cycles951,599,787,438 cycles

Much better! The version for my OS runs 10 seconds faster, which is pretty cool. One caveat here is that if I run the miner more than once for both my OS and Linux, the version for my OS has about the same timings, but the Linux one becomes a lot faster due to caching, I assume. At that point the version for my OS only runs about 1 second faster. But, I will say if you are writing a Bitcoin miner, you will never be trying mining the same block twice!

TL;DR: My Bitcoin miner ran faster on Linux because I didn’t realize to enable AVX instructions for my OS.

I hope you found this dissection informative. Any questions and comments are greatly appreciated!

If you’re interested in following development on my OS, please follow this blog and checkout and follow my Twitch and my Twitter. I try my best to stream development weekly on Twitch!