Dodging a crash
In the latest Low Byte Productions video, Francis explores the BPS file format and implements a binary patch tool from scratch.
At the end of the video, the program almost runs to completion, but it fails on the final CRC check. Off camera he finds two bugs. And the video ends with him showing the bugs, fixing them and finally demonstrating the working program.
But while I was watching the video I noticed Francis making the off-by-one error in the action length when he wrote the code.
The off-by-one error in the action length means that TargetRead actions would consume 1 byte less than intended. This desynchronizes the patch data stream, and all remaining actions would be misinterpreted. I expected the corrupted actions to eventually contain a length or offset field that would fail the bound checks, and result in an error. So in the subsequent run of the program I expected it to error-out. But the anticipated crash didn’t occur and to my surprise the program made it through the actions loop and only failed on the final CRC check.
How was this possible?
Off-by-one
In order to get a bit more insight into the execution flow an action trace has been added. The trace is created by printf’ing the offset within the patch file, the type and the length of the applied actions.
The following trace shows the actions of the bugged program, so before the known bugs are fixed. The program ends with the CRC check error, just as it did in the video.
Reading source and patch file...
Read patch file (133078 bytes)
Read source file (524288 bytes)
Reading patch header...
Source size according to patch = 524288 bytes
Target size according to patch = 2097152 bytes
Checking source and patch CRCs...
Source CRC check passed.
Patch CRC check passed.
Allocating target buffer...
Metadata size = 0 bytes
Applying patch actions...
0000000B SourceRead 481
0000000D TargetRead 3
00000011 TargetCopy 7
00000013 TargetRead 0
00000014 SourceRead 3
00000015 SourceRead 2
00000016 TargetRead 2
00000019 TargetCopy 7
0000001C TargetRead 2
0000001F SourceRead 4
00000020 SourceRead 26
00000021 TargetRead 167
000000CA TargetCopy 4
000000CD TargetRead 3
000000D1 TargetCopy 7
000000D4 SourceCopy 3
000000D7 SourceRead 9
000000D8 TargetRead 0
000000D9 SourceRead 135216
Finished applying 19 patch actions.
Checking target CRC...
Target CRC did not match.
Comparing the bugged trace to the trace of the bug-free program, the actions start to diverge after the first TargetRead, as expected.
…
Applying patch actions...
0000000B SourceRead 482
0000000D TargetRead 4
00000012 SourceRead 32
00000013 TargetRead 1
00000015 SourceRead 3
00000016 TargetRead 3
0000001A SourceRead 710
0000001C TargetRead 3
00000020 SourceRead 27
00000021 TargetRead 168
000000CB SourceRead 440
000000CD TargetRead 4
000000D2 SourceRead 745
000000D4 SourceCopy 4
000000D7 SourceRead 10
000000D8 TargetRead 1
000000DA SourceRead 1057
000000DC TargetRead 1
…
000207BE TargetCopy 219
000207C3 TargetCopy 1047276
Finished applying 23842 patch actions.
Checking target CRC...
Target CRC check passed.
Writing target to file roms/patched.smc
Done. Cleaning up.
Bug #2
Another difference between the two traces is that the bug-free program applies 23842 patch actions, while the bugged program only applies 19.
This is caused by the second bug, the incorrect action loop condition. Let see what happens if we only fix this bug, and run the program again.
…
0000024B SourceRead 20
0000024C TargetCopy 28
0000024E SourceRead 3556
00000250 SourceCopy 24
SourceCopy failed
Now the program errors-out because a bound check in one of the actions failed. Finally the bugged program behaves as I anticipated.
The second bug significantly reduced the number of applied actions, and this improved the odds that all the corrupted actions that were applied had valid bound checks. And in the case of patching “Super Mario World Odyssey”, the odds where good enough that even though ‘all’ actions where applied ‘correctly’, it still failed the final CRC check.
So is there some cliché moral to this story? Sure!
You may feel clever to catch a bug, but next bug will still catch you by surprise
Addendum: Extra constraints
While the program with only the off-by-one bug will terminate with an error, the bound checks only failed after applying 216 actions from the corrupted action stream. So the program can still run quite some time before an error occurs.
I wondered if other bound checks or constraints could be added that cause the program to terminate earlier?
In the end of the video, Francis already adds two extra bound checks, but unfortunately these don’t cause earlier termination.
The only additional constraint that I could think off is that the TargetCopy action should only copy previously written bytes from the target buffer. The target buffer is filled in a linear fashion, so only the bytes till outputOffset are written. With the following additional constraint in BPS_TargetCopy() copying of uninitialized data is prevented.
if (state->targetRelativeOffset >= state->outputOffset) return false;
This extra check terminates the bugged program a lot faster. The error is triggered early enough that even with the reduced number of applied actions caused by bug #2, the CRC check is not reached.
Reading source and patch file...
Read patch file (133078 bytes)
Read source file (524288 bytes)
Reading patch header...
Source size according to patch = 524288 bytes
Target size according to patch = 2097152 bytes
Checking source and patch CRCs...
Source CRC check passed.
Patch CRC check passed.
Allocating target buffer...
Metadata size = 0 bytes
Applying patch actions...
0000000B SourceRead 481
0000000D TargetRead 3
00000011 TargetCopy 7
00000013 TargetRead 0
00000014 SourceRead 3
00000015 SourceRead 2
00000016 TargetRead 2
00000019 TargetCopy 7
TargetCopy failed
Addendum: Resynchronisation
When looking closely at the actions traces, it can be seen that they also reconverge, only to diverge again. It is kind of funny to see the weird behavior caused by the off-by-one bug, and that the desynchronized patch stream is sometimes resynchronized by misinterpreting the variable length encoded action/length number.
/ 11 \ / 14 \ / 19 \ / 1F \ / CA \ / D1 \
0B - 0D - - 13 - - 16 - - 1C - - 20 - 21 - - CD - - …
\ 12 / \ 15 / \ 1A / \ -- / \ CB / \ D2 /
The graph shows the sequences of action offsets, and how the bugged and bug-free programs constantly diverge and reconverge.