Bitcoin Miner on my OS vs Linux

Screenshot of my OS with the Bitcoin miner running center-right. Despite the screenshot saying “Reach OS”, my OS is currently unnamed: when it boots up, it randomly chooses a word from a word list and appends “OS” to it.

I am currently working on an operating system in a new systems programming language called Zig. While working on my OS, one of the things I was curious about was how much faster an application could run on bare metal, without all the interrupts and overhead of an OS scheduler. Thus, I thought a pseudo Bitcoin miner would be the perfect candidate to test this theory! As a quick overview, my pseudo Bitcoin miner does the following:

  1. Takes in 3 hardcoded arguments:
    1. 76 bytes of preprocessed data. In this context, we don’t really care what these bytes are.
    2. The “target”, which is a 32-byte number.
    3. A “start” and an “end” 4-byte number.
  2. Initialize a 4-byte (32-bit) number to the “start” number. We will call this the nonce.
  3. Concatenate the 76 bytes and the nonce together to get an 80-byte string.
  4. Hash this 80-byte string with SHA-256, then take the result and hash it again with SHA-256, i.e. SHA256(SHA256(S)) where S is the 80-byte string.
  5. If the resulting hash is less than the target, then we successfully mined a block! We stop here.
  6. If not, we increment the nonce. If the nonce is less than or equal to the “end” number, we go back to step 3. Otherwise, we stop, having failed to mine a block.
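
Putting those steps together, here is a minimal sketch of the per-thread loop in Zig. The names are mine, the std.crypto API shape varies across Zig versions, and for simplicity the hash is compared to the target as a big-endian byte string:

    const std = @import("std");
    const Sha256 = std.crypto.hash.sha2.Sha256;

    // Scans [start, end] for a nonce whose double SHA-256 beats the target.
    // Returns the winning nonce, or null if the range is exhausted.
    fn mine(data: *const [76]u8, target: *const [32]u8, start: u32, end: u32) ?u32 {
        var header: [80]u8 = undefined;
        std.mem.copy(u8, header[0..76], data);

        var nonce: u32 = start;
        while (true) : (nonce += 1) {
            // Fill in the last 4 bytes with the current nonce (byte order is an assumption here).
            std.mem.writeIntLittle(u32, header[76..80], nonce);

            // SHA256(SHA256(header))
            var pass1: [32]u8 = undefined;
            var pass2: [32]u8 = undefined;
            Sha256.hash(&header, &pass1, .{});
            Sha256.hash(&pass1, &pass2, .{});

            // "Less than the target": compared as big-endian byte strings for simplicity.
            if (std.mem.order(u8, &pass2, target) == .lt) return nonce;
            if (nonce == end) return null;
        }
    }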

To be clear, this program does not connect to the internet (nor does my OS have network support yet), but it does do the same hashing work that a real Bitcoin miner does.

I wrote two versions of the miner: one to run on my OS and a Linux version to compare against. The miner spawns multiple threads to split up the 32-bit nonce space and mine each piece of that space concurrently. It spawns one thread fewer than the number of hardware threads the CPU supports, leaving one thread open for the main kernel thread.

My machine’s CPU has 4 hardware threads, so the miner spawns 3 threads on this machine:

  • Thread 0 will try nonces 0x0000_0000 – 0x5555_5555 (inclusive). To be specific, 0x0000_0000 is the “start” number and 0x5555_5555 is the “end” number.
  • Thread 1 will try nonces 0x5555_5556 – 0xAAAA_AAAA.
  • Thread 2 will try nonces 0xAAAA_AAAB – 0xFFFF_FFFF.
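
A sketch of how those ranges can be computed (the function is mine; the thread-spawning code itself differs between my OS and the Linux build):

    // Nonce range for worker t of n; both ends are inclusive.
    fn nonceRange(t: u32, n: u32) struct { start: u32, end: u32 } {
        const chunk = @as(u32, 0xFFFF_FFFF) / n; // 0x5555_5555 when n = 3
        return .{
            .start = if (t == 0) 0 else t * chunk + 1,
            .end = if (t == n - 1) 0xFFFF_FFFF else (t + 1) * chunk,
        };
    }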

To get valid data, I had to make sure that I was running Linux on the same machine that I boot my OS on. I measured the following per thread:

  • The average number of CPU cycles it took to do one hash.
  • The maximum number of CPU cycles it took to do one hash.
  • The minimum number of CPU cycles it took to do one hash.
  • The total time it took to mine a block in seconds.

CPU cycles are measured using the rdtscp AMD64 instruction. I am running on an Intel Core i5-7600K @ ~4GHz (overclocked), which means each cycle takes about 0.25 nanoseconds.
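
For reference, a sketch of how one might wrap rdtscp in Zig (inline-assembly syntax varies across Zig versions):

    // Reads the time stamp counter; rdtscp waits for prior instructions to
    // finish before reading, making it a bit more stable than plain rdtsc.
    fn rdtscp() u64 {
        var lo: u32 = undefined;
        var hi: u32 = undefined;
        asm volatile ("rdtscp"
            : [lo] "={eax}" (lo),
              [hi] "={edx}" (hi)
            :
            : "ecx" // rdtscp also writes IA32_TSC_AUX into ecx
        );
        return (@as(u64, hi) << 32) | lo;
    }

It should be noted that in this example, Thread 1 is the thread that happens to mine the block. Once it finds the block, it and all other threads stop. Here are the numbers for the miner running on my OS vs Linux: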

Pseudo Bitcoin Miner Performance

Measurements                  My OS                     Linux
Total Time To Mine Block      285 seconds               261 seconds
Avg # Cycles/Hash (Thread 0)  2,989 cycles              2,678 cycles
Max # Cycles/Hash (Thread 0)  5,015,404 cycles          65,555,356 cycles
Min # Cycles/Hash (Thread 0)  2,816 cycles              2,424 cycles
Total # Cycles (Thread 0)     1,066,361,841,588 cycles  970,554,751,530 cycles
Avg # Cycles/Hash (Thread 1)  2,956 cycles              2,679 cycles
Max # Cycles/Hash (Thread 1)  5,015,968 cycles          59,477,408 cycles
Min # Cycles/Hash (Thread 1)  2,818 cycles              2,430 cycles
Total # Cycles (Thread 1)     1,045,330,042,596 cycles  947,279,596,516 cycles
Avg # Cycles/Hash (Thread 2)  2,941 cycles              2,679 cycles
Max # Cycles/Hash (Thread 2)  5,016,996 cycles          61,743,548 cycles
Min # Cycles/Hash (Thread 2)  2,816 cycles              2,430 cycles
Total # Cycles (Thread 2)     1,035,229,379,122 cycles  951,599,787,438 cycles

Wait, what? The miner on my OS is running on bare metal with no scheduler or any other interrupts happening. So why is the Linux version successfully mining a block more than 20 seconds faster than the version on my OS? I checked the code and made sure that the Linux version and the version for my OS shared as much code as possible, so I was fairly sure the difference wasn’t in the code.

The next thing to do was to compare the assembly of the Linux version to the version for my OS. Let’s take a look at the piece of code which sets up the 80-byte string to hash (i.e. mine):

The constant “current_block” is an 80-byte array where the first 76 bytes are the hardcoded preprocessed data we don’t care about. The last 4 bytes are 0s. We create a new variable called “block_header” from this 80-byte array and then fill in the last 4 bytes with the nonce we are going to mine with.
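
Roughly, that code looks like this (how the nonce bytes are written, and their byte order, is an assumption):

    // current_block: 80 bytes; first 76 are the preprocessed data, last 4 are 0s.
    var block_header: [80]u8 = current_block; // the "first line": a full 80-byte copy
    std.mem.writeIntLittle(u32, block_header[76..80], nonce);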

Let’s take a look at the disassembly of just the first line. First is the disassembly of the version for my OS:
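
The pattern looks like this (a hand-written illustration, not the actual listing; registers and stack offsets will differ):

    movups  current_block(%rip), %xmm0      # load 16 bytes of current_block
    movaps  %xmm0, 0x00(%rsp)               # store 16 bytes into block_header
    movups  current_block+16(%rip), %xmm0
    movaps  %xmm0, 0x10(%rsp)
    movups  current_block+32(%rip), %xmm0
    movaps  %xmm0, 0x20(%rsp)
    movups  current_block+48(%rip), %xmm0
    movaps  %xmm0, 0x30(%rsp)
    movups  current_block+64(%rip), %xmm0
    movaps  %xmm0, 0x40(%rsp)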

Now here is the disassembly of the Linux version of the same line:
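
Again as a hand-written illustration of the pattern, two 32-byte moves plus one 16-byte move cover the 80 bytes:

    vmovups current_block(%rip), %ymm0      # load 32 bytes at a time
    vmovups %ymm0, 0x00(%rsp)
    vmovups current_block+32(%rip), %ymm0
    vmovups %ymm0, 0x20(%rsp)
    vmovups current_block+64(%rip), %xmm0   # final 16 bytes via the xmm half
    vmovups %xmm0, 0x40(%rsp)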

To the (un)trained eye, you can immediately tell the difference. Not only are there fewer instructions in the Linux version than in the version for my OS, they are also different instructions. The “movups”/“movaps” instructions move 16 bytes of data at a time, while the more efficient “vmovups” moves 32 bytes at a time! The movups/movaps instructions are part of the SSE instruction set (introduced in 1999), and vmovups is part of the AVX instruction set (introduced in 2011).

Note: “more efficient” is very hand-wavy here. There are subtle, complex factors you need to consider (e.g. pipelining, the state of the cache, etc.) to determine whether an instruction is “more efficient”. But for the purposes of this article, all you need to understand is this: vmovups is actually slower than movaps/movups, but it copies twice the data per instruction. Since vmovups is less than 2 times slower than movups/movaps, it’s a win in the end.

So why does the compiler choose the more efficient instructions for the Linux version and not for the version for my OS? Well, when I compile the Linux version, I am compiling natively for my machine, which supports AVX. But when I compile for my OS, Zig is in cross-compilation mode for UEFI applications and chooses the lowest common denominator of AMD64 processors, which does not include AVX.

Something I should note before continuing is that my OS is divided into 2 parts right now: the bootloader and the kernel. The bootloader is a proper UEFI application (i.e. a PE32+/COFF binary), while the kernel is a binary blob embedded as a byte array using Zig’s “@embedFile()”. The bootloader starts up, copies the kernel into working (virtualized) memory, and jumps to the kernel’s entry point.
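
That embedding is a one-liner in Zig (the file name here is hypothetical):

    // The separately-compiled kernel image, baked in as a byte array at build time.
    const kernel_blob = @embedFile("kernel.bin");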

It turns out that you need to pass “-mcpu=native+avx2” to Zig to get it to use the AVX2 instruction set. I am choosing AVX2 (introduced in 2013) instead of AVX since my machine supports it.
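
For example, a kernel build might be invoked along these lines (a hypothetical invocation; only the -mcpu flag is the point here):

    # Target triple and file layout are illustrative.
    zig build-exe kernel.zig -target x86_64-freestanding -mcpu=native+avx2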

So I compiled my OS and booted and……. it crashes. It turns out that when the machine enters the bootloader from the BIOS/UEFI shell, it does not have AVX instructions enabled yet, so when the CPU encounters an AVX instruction, it throws an Undefined Instruction exception. I added code to enable AVX (which comes down to flipping a bit in the XCR0 control register) in the bootloader, and I made sure the bootloader itself is not compiled in AVX mode, since it is the place where AVX gets enabled. I guess that’d be a good reason why Zig doesn’t compile with AVX instructions for UEFI by default!
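
For reference, a sketch of the enable sequence (setting CR4.OSXSAVE is required before touching XCR0; Zig inline-assembly syntax varies across versions):

    fn enableAvx() void {
        asm volatile (
            \\mov %%cr4, %%rax
            \\or $0x40000, %%rax   # set CR4.OSXSAVE (bit 18); xsetbv faults without it
            \\mov %%rax, %%cr4
            \\xor %%ecx, %%ecx     # select XCR0
            \\xgetbv               # read XCR0 into edx:eax
            \\or $0x7, %%eax       # bit 0 = x87, bit 1 = SSE, bit 2 = AVX
            \\xsetbv               # write XCR0 back, enabling AVX state
            ::: "rax", "rcx", "rdx", "cc");
    }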

So I compiled my OS and booted and……. it works! Without further ado, here are the final stats:

Pseudo Bitcoin Miner Performance with AVX

Measurements                  My OS                     Linux
Total Time To Mine Block      251 seconds               261 seconds
Avg # Cycles/Hash (Thread 0)  2,520 cycles              2,678 cycles
Max # Cycles/Hash (Thread 0)  4,919,960 cycles          65,555,356 cycles
Min # Cycles/Hash (Thread 0)  2,436 cycles              2,424 cycles
Total # Cycles (Thread 0)     908,680,867,300 cycles    970,554,751,530 cycles
Avg # Cycles/Hash (Thread 1)  2,559 cycles              2,679 cycles
Max # Cycles/Hash (Thread 1)  4,920,852 cycles          59,477,408 cycles
Min # Cycles/Hash (Thread 1)  2,450 cycles              2,430 cycles
Total # Cycles (Thread 1)     904,949,266,874 cycles    947,279,596,516 cycles
Avg # Cycles/Hash (Thread 2)  2,542 cycles              2,679 cycles
Max # Cycles/Hash (Thread 2)  4,921,496 cycles          61,743,548 cycles
Min # Cycles/Hash (Thread 2)  2,438 cycles              2,430 cycles
Total # Cycles (Thread 2)     927,407,896,000 cycles    951,599,787,438 cycles

Much better! The version for my OS runs 10 seconds faster, which is pretty cool. One caveat here is that if I run the miner more than once on both my OS and Linux, the version for my OS has about the same timings, but the Linux one becomes a lot faster, due to caching I assume. At that point the version for my OS only runs about 1 second faster. But I will say that if you are writing a Bitcoin miner, you will never be mining the same block twice!

TL;DR: My Bitcoin miner ran faster on Linux because I hadn’t realized I needed to enable AVX instructions for my OS.

I hope you found this dissection informative. Any questions and comments are greatly appreciated!

If you’re interested in following development of my OS, please follow this blog and check out and follow my Twitch and my Twitter. I try my best to stream development weekly on Twitch!

GBEmu 0.0.3 Released!

I have released GBEmu 0.0.3 on GitHub and Handmade Network. This is a small update which adds the ability to resize the screen and replaces the OpenGL renderer with a Metal renderer on macOS. Resizing the screen keeps the Game Boy screen in proportion and fills the border area with white. As a result of the new Metal renderer, VSync has been re-enabled on macOS. Here is the changelog:

– Now supports arbitrary resizing of screen
– Full screen is now in proportion, and has a white border
– Replaced OpenGL renderer with Metal renderer on Mac
– Memory leak and other bug fixes

As always, comments and questions are welcomed and encouraged.  Feel free to post below in the comments or on https://gbemu.handmade.network/forums.

GBEmu 0.0.2 Released

I have released the next version of GBEmu, 0.0.2, on GitHub and Handmade Network. The major change in this release is support for a configuration file, called config.txt, which includes support for non-English keyboards as well. The file currently supports custom keyboard and gamepad mappings, as well as controlling the size of the screen. The final config.txt syntax is very similar to the syntax I posted in my previous post; the main difference is that now you can put more than one config value on one line. In addition, the macOS and Windows executables now have icons.

Screenshot of a portion of config.txt with a US keyboard:

[screenshot]

Screenshot of a portion of config.txt with a Hebrew keyboard:

[screenshot]

Now for the unfortunate part: I disabled VSync on Mac. The reason for this is that while I was working on this release, macOS Mojave came out, which deprecated OpenGL. This broke GBEmu’s rendering interface, SDL, in many ways: GBEmu would show up with a black screen, so I decided to wait for fixes to SDL before releasing this update.

Even with SDL fixes coming in, there was still a persistent black screen in the ROM debugger. This was due to the fact that I render the debugger on a different thread: all GBEmu windows are VSynced, and OpenGL on Mojave doesn’t seem to like that.

“Conveniently”, I learned that VSync on OpenGL in Mojave is completely broken anyway, so I decided to take the easy way out and disable VSync on Mac. I hope to re-enable VSync either when SDL gets its Metal backend working or when I render in Metal myself. VSync still works perfectly fine on Windows and Linux.

As always, comments and questions are welcomed and encouraged.  Feel free to post below in the comments or on https://gbemu.handmade.network/forums.

The Config File

The next update incoming for GBEmu is support for a configuration file. The file will be called “config.txt” and will live at the root of the GBEmu home folder. Initially, config.txt will support 3 things:

  1. Custom keyboard mappings of GBEmu functionality (Game Boy controls, Mute, etc).
  2. Custom controller mappings of GBEmu functionality (Game Boy controls, Mute, etc).
  3. The scale of the Game Boy screen.

The idea is to have a clearly readable and editable config file.  I arrived at a simple case-insensitive key-value syntax of the form:

<GameBoy button or option to set> = <the value>.

E.g. to assign the up button on the Game Boy to the W key, the config line would read:

Up = Key W

In addition, if a config.txt file does not exist in the GBEmu home directory, a default one is generated for you with default key bindings. From there, you can change the bindings as you wish.

Here is a sample of what the default keyboard bindings config.txt would look like:

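Something along these lines (these specific bindings are hypothetical, for illustration; only Up = Key W matches an example from earlier):

    Up = Key W
    Down = Key S
    Left = Key A
    Right = Key D
    B = Key J
    A = Key K
    Start = Key Enter
    Select = Key Backspace
    Mute = Key M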

Of course, as time goes on, more options will be supported by config.txt. If anyone has any comments or questions, I’d love to hear them.  I hope to wrap this up soon and work on the next feature (TBD) for GBEmu!


Welcome!

Hey everyone! Now that I have finally released the first version of my Game Boy emulator, GBEmu, I have decided to start blogging the ideas and design decisions that go into this project. This blog won’t strictly be for GBEmu, but it will be for the time being. My first post will be about the next non-trivial feature I am working on, which is support for a config file, so stay tuned!