Upward Port Devlog: Questions, Decisions, and Progress

Last week I decided I wanted to port Upward to the Playdate. So, now what? Well, other than creating a new Visual Studio Code project using my Playdate project template for Zig (shameless plug), the first thing I wanted to do is write out a list of questions to start with. Here are the questions I started with:

  1. What resolution is the game going to be in?
  2. What is the target framerate?
  3. Should I keep the PICO-8 API in the game code and write a small PICO-8 API implementation, or modify the game code to directly call out to the Playdate APIs?
  4. Should I use 16.16 fixed point numbers like PICO-8, or use floating point?
  5. What is the first thing I want to get on the screen?

The first question I wanted to tackle was the resolution question:

What resolution is the game going to be in?

PICO-8 has a resolution of 128×128 pixels (1:1 aspect ratio) and the Playdate has a resolution of 400×240 pixels (5:3 aspect ratio). With the aspect ratios being completely different, I had 2 choices:

  1. Keep the PICO-8 resolution, stick the game in the middle of the screen, and deal with the large border around it.
  2. Try to modify the graphics and levels to take advantage of the extra resolution.

Choice 2 is probably the ideal one, but unfortunately, it would take a significant rewrite of everything, and the whole point of this project was to keep things manageable. So choice 1 it is. Let’s see what a 128×128 view would look like on the Playdate:

Uhh, yeah, that’s a bit small. So what’s the highest resolution we can get that would still fit on the screen? Ideally, we’d like it to be a whole-number multiple of 128×128. So, 2x? Well, 2×128 = 256, which is _just_ over the Playdate’s 240-pixel vertical resolution. That’s a shame.

What now? Well, PICO-8 graphics are made up of 8×8 pixel cells, and luckily all of the main sprites of Upward are drawn to be 8×8. How high can we extend the cell resolution? We can’t do 16×16, since that would make the resolution 256×256, which is too big. We could try 15×15 (which actually ends up fitting perfectly), but I felt that making the cell size an odd number would add more work when redoing the graphics (feel free to call me out on this if I’m wrong!). Thus, the best resolution for a cell is 14×14. The new resolution of the game would be (14/8) × 128 = 224, i.e. 224×224! Let’s see what 224×224 looks like:

That’s much better! Of course there is a lot of unused screen real estate, but it won’t strain your eyes at least. With that decided, let’s move on to the next question.
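For the curious, the cell-size arithmetic above can be double-checked with a quick script (illustrative Python, not project code):

```python
PLAYDATE_HEIGHT = 240  # the Playdate's limiting (vertical) resolution
CELLS = 16             # the PICO-8 screen is a 16x16 grid of 8x8-pixel cells

# Every candidate cell size from 8 (the original) upward whose 16-cell grid fits.
fitting = [c for c in range(8, 17) if c * CELLS <= PLAYDATE_HEIGHT]

largest = max(fitting)                                # 15: 15 * 16 = 240 exactly
largest_even = max(c for c in fitting if c % 2 == 0)  # 14: the chosen cell size

print(largest, largest_even, largest_even * CELLS)  # 15 14 224
```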

What is the target framerate?

PICO-8 supports 2 target framerate modes: 30 FPS (when the _update function is defined) and 60 FPS (when _update60 is defined). The Playdate supports a max of 48-49 FPS (if you do partial frame refreshes, you can get 50 FPS, but that’s out of scope here). So, if I can support 48-49 FPS, why not?

Upward targets 30 FPS. I modified the code to run at 60 FPS (by defining the _update60 function), and it turns out the player and enemies move far too quickly to be playable. So as of now, I am targeting 30 FPS. I don’t believe it’d be too difficult to support 48-49 FPS, but it would take some gameplay-code modification.

Should I keep the PICO-8 API in the game code and write a small PICO-8 API implementation, or modify the game code to directly call out to the Playdate APIs?

I first looked to see which PICO-8 API calls were being used. Before I even started this project, I vetted the code to make sure that tricky APIs like memcpy, peek, and poke weren’t used. PICO-8 is a virtual machine with its own memory map, and for this project I wanted to avoid having to implement these memory-mapped facilities, since they are typically used for interesting graphics and sound effects. Luckily, only APIs that map well to the Playdate’s API are used. Here are some examples:

  • spr — Maps well to playdate->graphics->drawBitmap.
  • sspr — Maps well to usage of playdate->sprite->setClipRect and playdate->graphics->drawBitmap together.
  • map — Maps (pun intended) well to usage of playdate->graphics->getTableBitmap.

Since these APIs have a simple mapping, I decided to keep the PICO-8 API calls and write implementations of them, inlining when possible.
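To make the shim idea concrete, here is a toy sketch of the approach in Python (the draw_bitmap function and sprite-sheet names are hypothetical stand-ins; the real code is Zig calling the Playdate C API):

```python
# Hypothetical stand-in for playdate->graphics->drawBitmap; here it just
# records what would be drawn so the shim can be demonstrated.
drawn = []

def draw_bitmap(bitmap, x, y):
    drawn.append((bitmap, x, y))

def spr(n, x, y, sheet):
    """A PICO-8-style spr() shim: draw sprite n from the sheet at pixel (x, y)."""
    draw_bitmap(sheet[n], x, y)

# Usage: sprite 3 of a made-up sheet, drawn at (28, 42).
spr(3, 28, 42, sheet={3: "player_bitmap"})
print(drawn)  # [('player_bitmap', 28, 42)]
```

The game code keeps calling spr() exactly as it did on PICO-8; only the shim body knows about the platform underneath.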

Should I use 16.16 fixed point numbers like PICO-8, or use floating point?

PICO-8 has only one number data type: a signed 16.16 fixed-point number. However, the Playdate has an FPU, which means using 32-bit floating point is also an option. Looking at the instruction timings, fixed point seems potentially 2 times faster due to dual-issue instructions (but only in very specific situations; otherwise, they appear to be the same speed). In addition, fixed point is more “true to the code” here, and you don’t have to deal with any of the potential rounding issues floating point gives you.

But, for ease of porting I decided to use floating point, for now. Why is it easier to use floating point on the Playdate, even though PICO-8 uses fixed point? Well, consider the following PICO-8 code:

rectfill(0, 56, 127, 72,0)

This code draws a black rectangle whose top left is at (0, 56) and whose bottom right is at (127, 72). When porting using floating point, the Zig code looks like:

rectfill(0, 56, 127, 72, .BlackColor, game);

Zig automatically converts all the number literals to floating-point values. But with fixed point, the integer lives in the upper 16 bits of the 32-bit number. So, I’d have to write the code to look something like this:

rectfill(fixed(0), fixed(56), fixed(127), fixed(72), .BlackColor, game);
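For reference, here is roughly what a fixed() helper and 16.16 arithmetic look like, sketched in Python (the real project would do this in Zig on 32-bit integers):

```python
FRAC_BITS = 16

def fixed(x) -> int:
    """Convert to 16.16 fixed point: the integer part lives in the upper 16 bits."""
    return int(x * (1 << FRAC_BITS))

def to_float(f: int) -> float:
    return f / (1 << FRAC_BITS)

def fixed_mul(a: int, b: int) -> int:
    # Multiplying two 16.16 values yields 32 fractional bits; shift back down.
    return (a * b) >> FRAC_BITS

print(fixed(56))                                  # 3670016, i.e. 56 << 16
print(to_float(fixed_mul(fixed(0.5), fixed(3))))  # 1.5
```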

Kind of annoying to deal with. Plus, the fact that I am targeting 30 FPS means the win from using fixed point is probably not as big here. I am open to changing to fixed point later if I encounter issues, however. And finally, our last (and probably most important) question…

What is the first thing I want to get on the screen?

Luckily, Upward has a very simple game-state structure: a title screen and the game itself. That’s it. Even the pause screen is handled by the built-in PICO-8 pause screen, so we don’t really have to worry about that, since the Playdate also has a built-in pause screen!

With this knowledge, the first thing I want to get on the screen is the title screen. To do this, not only do I have to write the code, but I also have to modify the 8×8 sprites to be 14×14 sprites. To do this I:

  • Used PICO-8’s EXPORT command to export all of the sprites into one 128×128 (16×16 sprites) PNG file.
  • Loaded that PNG file into Photoshop and enlarged it to 224×224 pixels with “Nearest Neighbor” resample selected, so there are no “fuzzy” pixels.
  • Since 224 is not an integer multiple of 128, the sprites get a bit distorted, unfortunately. So, I went in with a 1-pixel pencil tool and did the best I could.

Progress Report

I am happy to report that I do have the title screen working! (Unfortunately, I can’t embed it here because WordPress does something weird with GIFs.) As you can see, the font is not the PICO-8 font, but I plan on changing that soon; I’ll use Panic’s Caps tool to create the font. Also, if you look very closely, the right side of the cliff and the left side of the cliff are mismatched by a couple of pixels. You can see it more clearly here:

I do hope to fix this during this project. But I feel it’s a relatively low priority, since it’s not that noticeable (at least to me; let me know if it bothers you!).

Next up is to get level 1 working!

I think that’s it for this post. As always, all questions and feedback are welcome!


What’s that? ANOTHER project?

“Upward” by Matthias Falk

My personal pastime is starting a new software project with no thought to how long it will take. My Documents/devStuff folder is filled with tons of unfinished projects. There have been a couple of times I’ve been able to stick with a project for a while (notably BoksOS and Pictoblox), but neither of those is anywhere near the finish line. The only project I feel I’ve _finished_ is TweetCartRunner. TweetCartRunner is by no means bug-free, but I am satisfied with the way it is now.

So when I was getting the urge to start yet another software project, I felt something needed to change.  This time, the project needs to be small and have rules. If I break any of these rules, I’m done working on the project.  Better to be done and unfinished than to randomly get sick of the project and then promise to “get back to it one day”.

What was the idea I had this time? Port a small game to the Playdate. So I asked on Twitter what people would like to see, and @MonstersGo suggested a PICO-8 game called “Upward” made by Matthias Falk (@heymatthias_), which looked like a perfect small game for the Playdate. I talked with Matthias, and he was totally cool with me porting this game to the Playdate.

Another thing I’d like to improve on is my writing. I _hate_ writing! But it’s a skill I feel I need to improve, since I want to be able to articulate my decisions precisely. More importantly, I believe it helps me catch bad decisions as I try to write them down: if I can’t justify a decision on paper, it’s either a bad idea or an idea that needs more research! I have to give credit to Abner Coimbre for asking where everyone can follow my progress; without that question, I may not have made a devlog part of my rules. As such, I am planning on keeping a devlog for this project.

So here are the aforementioned rules for this project:

  • I have until Jan 8, 11:59 PM ET to finish this project. However, if I feel I’m close to finishing it or “in the flow”, I will extend the deadline.
  • I will write at least one devlog post a week about what I did and the decisions I made (this blog post counts as week 1). 

Unless there are extenuating personal circumstances, if I break either of these two rules, I will end this project. That would be profoundly disappointing! So, I’ll try my hardest not to do that.

Also, here are some things about this project you might want to know:

  • After discussion with Matthias, I will release this port for free and open source the project with a non-commercial license. 
  • I will write this port in Zig (yes, I know it’d make more sense in Lua, but I always like having fine-grained control of my memory management).

I’m also gonna try to stream on https://twitch.tv/danbokser when I can.

If you have any feedback about this project or my rules, please let me know! And with that I’m going to get started!  

Bitcoin Miner on my OS vs Linux

Screenshot of my OS with the Bitcoin miner running center-right. Despite it saying “Reach OS” in the screenshot, my OS is currently unnamed. When it boots up, it randomly chooses a word from a word list and appends “OS” to it. 

I am currently working on an operating system in a new systems programming language called Zig. While working on my OS, one of the things I was curious about was how much faster an application could run on bare metal, without all the interrupts and overhead of an OS scheduler. A pseudo Bitcoin miner seemed like the perfect candidate to test this theory! As a quick overview, my pseudo Bitcoin miner does the following:

  1. Takes in 3 hardcoded arguments:
    1. 76 bytes of preprocessed data. In this context, we don’t really care what these bytes are.
    2. The “target”, which is a 32-byte number.
    3. A “start” and an “end” 4-byte number.
  2. Initializes a 4-byte (32-bit) number to the “start” number. We will call this the nonce.
  3. Concatenates the 76 bytes and the nonce together to get an 80-byte string.
  4. Hashes this 80-byte string with SHA-256, then hashes the result with SHA-256 again, i.e. SHA256(SHA256(S)) where S is the 80-byte string.
  5. If the resulting hash is less than the target, then we successfully mined a block! We stop here.
  6. If not, we increment the nonce. If the nonce is less than or equal to “end”, we go back to 3. Else, we stop, having failed to mine a block.

To be clear, this program does not connect to the internet (nor does my OS have network support yet), but it does do the same hashing work that a real Bitcoin miner does.
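The steps above can be sketched in a few lines of Python (illustrative only: the data and target here are made up, and comparing raw digest bytes lexicographically is a simplification of how Bitcoin actually interprets the hash as a number):

```python
import hashlib

def mine(prefix76: bytes, target: bytes, start: int, end: int):
    """Try every nonce in [start, end]; return the first winning nonce, or None."""
    for nonce in range(start, end + 1):
        header = prefix76 + nonce.to_bytes(4, "little")  # the 80-byte string
        digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
        if digest < target:  # "less than the target" => block mined
            return nonce
    return None

# Toy run: 76 zero bytes and a very easy target, so a nonce is found quickly.
prefix = bytes(76)
target = bytes.fromhex("00ffffff") + bytes(28)
nonce = mine(prefix, target, 0, 1_000_000)
```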

I wrote two versions of the miner: one to run on my OS and a Linux version to compare with. The miner spawns multiple threads to split up the 32-bit number space of the nonce and mine each piece of that space concurrently. It spawns the number of hardware threads the CPU supports, minus 1, leaving one thread open for the main kernel thread.

My machine’s CPU has 4 hardware threads, so the miner spawns 3 threads on this machine:

  • Thread 0 will try nonces 0x0000_0000 – 0x5555_5555 (inclusive). To be specific, 0x0000_0000 is the “start” number and 0x5555_5555 is the “end” number.
  • Thread 1 will try nonces 0x5555_5556 – 0xAAAA_AAAA.
  • Thread 2 will try nonces 0xAAAA_AAAB – 0xFFFF_FFFF.
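That splitting scheme can be expressed generically (a Python sketch, not the OS code; any leftover nonces after an even division go to the earliest threads):

```python
def split_nonce_space(num_threads: int):
    """Divide the 32-bit nonce space into contiguous, inclusive (start, end) ranges."""
    size, rem = divmod(1 << 32, num_threads)
    ranges, start = [], 0
    for i in range(num_threads):
        count = size + (1 if i < rem else 0)  # earliest threads absorb the remainder
        ranges.append((start, start + count - 1))
        start += count
    return ranges

for start, end in split_nonce_space(3):
    print(f"0x{start:08X} - 0x{end:08X}")
```

With 3 threads this yields the three ranges 0x00000000–0x55555555, 0x55555556–0xAAAAAAAA, and 0xAAAAAAAB–0xFFFFFFFF.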

To get valid data, I had to make sure that I was running Linux on the same machine I was booting my OS on. I measured the following per thread:

  • The average number of CPU cycles it took to do one hash.
  • The maximum number of CPU cycles it took to do one hash.
  • The minimum number of CPU cycles it took to do one hash.
  • The total time it took to mine a block in seconds.

CPU cycles are measured using the rdtscp AMD64 instruction. I am running on an Intel Core i5-7600K @ ~4 GHz (overclocked), which means each cycle is about 0.25 nanoseconds. It should be noted that in this example, Thread 1 is the thread that happens to mine the block. Once it finds the block, it and all the other threads stop. Here are the numbers for the miner running on my OS vs Linux:

Pseudo Bitcoin Miner Performance

Measurement                   | My OS                    | Linux
Total Time To Mine Block      | 285 seconds              | 261 seconds
Avg # Cycles/Hash (Thread 0)  | 2,989 cycles             | 2,678 cycles
Max # Cycles/Hash (Thread 0)  | 5,015,404 cycles         | 65,555,356 cycles
Min # Cycles/Hash (Thread 0)  | 2,816 cycles             | 2,424 cycles
Total # Cycles (Thread 0)     | 1,066,361,841,588 cycles | 970,554,751,530 cycles
Avg # Cycles/Hash (Thread 1)  | 2,956 cycles             | 2,679 cycles
Max # Cycles/Hash (Thread 1)  | 5,015,968 cycles         | 59,477,408 cycles
Min # Cycles/Hash (Thread 1)  | 2,818 cycles             | 2,430 cycles
Total # Cycles (Thread 1)     | 1,045,330,042,596 cycles | 947,279,596,516 cycles
Avg # Cycles/Hash (Thread 2)  | 2,941 cycles             | 2,679 cycles
Max # Cycles/Hash (Thread 2)  | 5,016,996 cycles         | 61,743,548 cycles
Min # Cycles/Hash (Thread 2)  | 2,816 cycles             | 2,430 cycles
Total # Cycles (Thread 2)     | 1,035,229,379,122 cycles | 951,599,787,438 cycles

Wait, what? The miner on my OS is running on bare metal with no scheduler or any other interrupts happening. So why is the Linux version successfully mining a block more than 20 seconds faster than the version on my OS? I checked the code and made sure that the Linux version and my OS version were sharing as much code as they could, so I was pretty sure it wasn’t a difference in the code.

The next thing to do was to compare the assembly of the Linux version to the version for my OS. Let’s take a look at the piece of code which sets up the 80-byte string to hash (i.e. mine):

The constant “current_block” is an 80-byte array where the first 76 bytes are the hardcoded preprocessed data we don’t care about; the last 4 bytes are 0s. We create a new variable called “block_header” from this 80-byte array and then fill out the last 4 bytes with the nonce we are going to mine with.

Let’s take a look at the disassembly of just the first line. First is the disassembly of the version for my OS:

Now here is the disassembly of the Linux version of the same line:

To the (un)trained eye, you can immediately tell the difference. Not only are there fewer lines in the Linux version than in the version for my OS, they are different instructions. The “movups”/“movaps” instructions move 16 bytes of data at a time, while the more efficient “vmovups” moves 32 bytes at a time! The movups/movaps instructions are part of the SSE instruction set (introduced in 1999), and vmovups is part of the AVX instruction set (introduced in 2011).

Note: “more efficient” is very hand-wavy here. There are subtle, complex factors you need to consider (e.g. pipelining, the state of the cache, etc.) to determine whether an instruction is “more efficient”. But for the purposes of this article, all you need to understand is this: vmovups is actually slower than movaps/movups, but it copies twice the data. Since vmovups is less than 2 times slower than movups/movaps, it’s a win in the end.

Why is the compiler choosing a more efficient version for Linux than for my OS? Because when I compile the Linux version, I am compiling natively for my machine, which supports AVX. But when I compile for my OS, Zig is in cross-compilation mode for UEFI applications and chooses the lowest common denominator of AMD64 processors, which does not include AVX.

Something I should note before continuing: my OS is divided into 2 parts right now, the bootloader and the kernel. The bootloader is a proper UEFI application (i.e. a PE32+/COFF binary), while the kernel is a binary blob embedded as a byte array using Zig’s “@embedFile()”. The bootloader starts up, copies the kernel into working (virtualized) memory, and jumps to the kernel’s entry point.

It turns out that you need to pass “-mcpu=native+avx2” to Zig to get it to use the AVX2 instruction set. I am choosing AVX2 (introduced in 2013) instead of AVX since my machine supports it.

So I compiled my OS, booted, and……. it crashed. It turns out that when the machine enters the bootloader from the BIOS/UEFI shell, AVX instructions are not enabled yet, so when the CPU encounters an AVX instruction, it throws an Undefined Instruction exception. So, I added code to enable AVX (which amounts to flipping a bit in the XCR0 control register) in the bootloader, and made sure that the bootloader itself is not compiled in AVX mode, since that’s the place where we enable it. I guess that’d be a good reason why Zig doesn’t compile with AVX instructions for UEFI by default!

So I compiled my OS and booted and……. it works! Without further ado, here are the final stats:

Pseudo Bitcoin Miner Performance with AVX

Measurement                   | My OS                    | Linux
Total Time To Mine Block      | 251 seconds              | 261 seconds
Avg # Cycles/Hash (Thread 0)  | 2,520 cycles             | 2,678 cycles
Max # Cycles/Hash (Thread 0)  | 4,919,960 cycles         | 65,555,356 cycles
Min # Cycles/Hash (Thread 0)  | 2,436 cycles             | 2,424 cycles
Total # Cycles (Thread 0)     | 908,680,867,300 cycles   | 970,554,751,530 cycles
Avg # Cycles/Hash (Thread 1)  | 2,559 cycles             | 2,679 cycles
Max # Cycles/Hash (Thread 1)  | 4,920,852 cycles         | 59,477,408 cycles
Min # Cycles/Hash (Thread 1)  | 2,450 cycles             | 2,430 cycles
Total # Cycles (Thread 1)     | 904,949,266,874 cycles   | 947,279,596,516 cycles
Avg # Cycles/Hash (Thread 2)  | 2,542 cycles             | 2,679 cycles
Max # Cycles/Hash (Thread 2)  | 4,921,496 cycles         | 61,743,548 cycles
Min # Cycles/Hash (Thread 2)  | 2,438 cycles             | 2,430 cycles
Total # Cycles (Thread 2)     | 927,407,896,000 cycles   | 951,599,787,438 cycles

Much better! The version for my OS runs 10 seconds faster, which is pretty cool. One caveat: if I run the miner more than once on both my OS and Linux, the version for my OS has about the same timings, but the Linux one becomes a lot faster (due to caching, I assume). At that point, the version for my OS is only about 1 second faster. But I will say: if you are writing a real Bitcoin miner, you will never be mining the same block twice!

TL;DR: My Bitcoin miner ran faster on Linux because I didn’t realize I needed to enable AVX instructions in my OS.

I hope you found this dissection informative. Any questions and comments are greatly appreciated!

If you’re interested in following the development of my OS, please follow this blog and check out my Twitch and my Twitter. I try my best to stream development weekly on Twitch!

GBEmu 0.0.3 Released!

I have released GBEmu 0.0.3 on GitHub and Handmade Network. This is a small update which adds the ability to resize the screen and replaces the OpenGL renderer with a Metal renderer on macOS. Resizing the screen keeps the Game Boy screen in proportion and fills the border area with white. As a result of the new Metal renderer, VSync has been re-enabled on macOS. Here is the changelog:

– Now supports arbitrary resizing of screen
– Full screen is now in proportion, and has a white border
– Replaced OpenGL renderer with Metal renderer on Mac
– Memory leak and other bug fixes

As always, comments and questions are welcomed and encouraged.  Feel free to post below in the comments or on https://gbemu.handmade.network/forums.

GBEmu 0.0.2 Released

I have released the next version of GBEmu, 0.0.2, on GitHub and Handmade Network. The major change in this release is support for a configuration file, called config.txt. This includes support for non-English keyboards as well. The file currently supports custom keyboard and gamepad mappings, and controlling the size of the screen. The final config.txt syntax is very similar to the syntax I posted in my previous post; the main difference is that you can now put more than one config value on a line. In addition, the macOS and Windows executables now have icons.

Screenshot of a portion of config.txt with US Keyboard:


Screenshot of a portion of config.txt with a Hebrew keyboard:


Now for the unfortunate part: I disabled VSync on Mac. The reason is that while I was working on this release, macOS Mojave came out and deprecated OpenGL. This broke GBEmu’s rendering interface, SDL, in many ways: GBEmu would show up with a black screen, so I decided to wait for SDL fixes before releasing this update.

Even with SDL fixes coming in, there was still a persistent black screen in the ROM debugger. This was because I was rendering the debugger on a different thread: all GBEmu windows are VSynced, and OpenGL on Mojave doesn’t seem to like that.

“Conveniently”, I learned that VSync on OpenGL in Mojave is completely broken anyway and, so, I decided to take the easy way out and disable VSync on Mac. I hope to re-enable VSync when either SDL gets their Metal backend working, or I manually render in Metal myself.  VSync still works perfectly fine in Windows and Linux.

As always, comments and questions are welcomed and encouraged.  Feel free to post below in the comments or on https://gbemu.handmade.network/forums.

The Config File

The next update incoming for GBEmu is support for a configuration file. The file will be called “config.txt” and will live at the root of the GBEmu home folder. Initially, config.txt will support 3 things:

  1. Custom keyboard mappings of GBEmu functionality (Game Boy controls, Mute, etc).
  2. Custom controller mappings of GBEmu functionality (Game Boy controls, Mute, etc).
  3. The scale of the Game Boy screen.

The idea is to have a clearly readable and editable config file.  I arrived at a simple case-insensitive key-value syntax of the form:

<GameBoy button or option to set> = <the value>.

E.g. to assign the up button on the Game Boy to the W key, the config line would read:

Up = Key W

In addition, if a config.txt file does not exist in the GBEmu home directory, a default one is generated for you with default key bindings. From there, you can change the bindings as you wish.
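The grammar is simple enough that a parser fits in a few lines. Here is an illustrative sketch in Python (GBEmu itself is not written in Python, and the real parser may differ; the “ScreenScale” option name is made up for the example):

```python
def parse_config(text: str) -> dict:
    """Parse case-insensitive 'key = value' lines into a dict;
    blank and malformed lines are skipped, later keys override earlier ones."""
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # no '=' on this line; ignore it
        config[key.strip().lower()] = value.strip().lower()
    return config

sample = """
Up = Key W
Down = Key S
ScreenScale = 4
"""
bindings = parse_config(sample)
# bindings == {'up': 'key w', 'down': 'key s', 'screenscale': '4'}
```

Lowercasing both keys and values on the way in is one simple way to get the case-insensitive behavior described above.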

Here is a sample of what the default keyboard bindings config.txt would look like:


Of course, as time goes on, more options will be supported by config.txt. If anyone has any comments or questions, I’d love to hear them.  I hope to wrap this up soon and work on the next feature (TBD) for GBEmu!

 

Welcome!

Hey everyone! Now that I have finally released the first version of my Game Boy emulator, GBEmu, I have decided to start blogging my ideas and design decisions that will go into this project. This blog won’t strictly be for GBEmu, but it will be for the time being. My first post will be about the next non-trivial feature I am working on which is support for a config file, so stay tuned!