You can get the code on GitHub.
I suspect there's at least one other project implementing pretty much the same thing out on the net somewhere, but I deliberately haven't looked yet as I wanted the challenge of implementing this for myself without being put off by someone else already having done it better. All I can say is that Google doesn't seem to think the name "py8dis" (or "pydis8") is already taken.
BeebDis has been a big influence on py8dis; it's the only disassembler I've used recently, and I've tried to copy some of its command names. Apart from the fact that it's sometimes fun to reinvent the wheel, it was really just the appeal of making the disassembler programmable that motivated me to write py8dis.
As with BeebDis, py8dis is based around the idea of iteratively developing a control file as you learn more about the program being disassembled. The difference is that the control file is itself a Python program and py8dis is used as a kind of library by that Python program.
Let me illustrate with a semi-contrived example - disassembling Acorn DFS 2.26. You can get this at mdfs.net.
To start off with, we create a basic control file, dfs226.py:
Code: Select all
from commands import *
from trace6502 import *
import acorn
load(0x8000, "dfs226.orig", "f083f49d6fe66344c650d7e74249cb96")
set_output_filename("dfs226.rom")
acorn.add_standard_labels()
acorn.is_sideways_rom()
go()
set_output_filename() is optional but controls the filename given to the beebasm SAVE command; it has no effect if you're using the acme output format.
The two calls to the acorn module define some Acorn OS constants (e.g. oswrch at &FFEE) and automatically interpret the sideways ROM header respectively. If you weren't disassembling a ROM, you'd omit the call to acorn.is_sideways_rom() and replace it with something like:
Code: Select all
entry(0x2000, "start")
To keep things simple, put dfs226.py in the same directory as acorn.py, trace.py etc. (There are some notes in the GitHub README on setting PYTHONPATH to avoid needing to do this.) Running dfs226.py:
Code: Select all
python dfs226.py > dfs226.asm
The output (edited for brevity, of course) will look something like this:
Code: Select all
org &8000
guard &c000
.pydis_start
; Sideways ROM header
.rom_header
.language_entry
equb &00, &00, &00 ; 8000: ...
.service_entry
jmp service_handler ; 8003: 4c c8 be
.rom_type
equb &82 ; 8006: .
...
.l80c8
and #&0f ; 80c8: 29 0f
cmp #&0a ; 80ca: c9 0a
bcc l80d0 ; 80cc: 90 02
adc #&06 ; 80ce: 69 06
.l80d0
adc #&30 ; '0' ; 80d0: 69 30
rts ; 80d2: 60
equb &20, &e3, &80, &ca, &ca, &20, &db, &80 ; 80d3: .... ..
equb &b1, &b0, &9d, &72, &10, &e8, &c8, &60 ; 80db: ...r...`
equb &20, &e6, &80, &b1, &b0, &95, &ba, &e8 ; 80e3: .......
equb &c8, &60 ; 80eb: .`
The output includes a hex dump for each line; this can be turned off later, but to start with it's very helpful for getting the hex addresses we'll need as we incrementally add to the control file.
We decide to take a look at the service call handler as a starting point. We see that it is split up a bit; the main code starts at the "service_handler" label at &BEC8 but it does a "JMP &B1B1" fairly early on. To try to keep things straight as we puzzle the code out, let's use a more meaningful label by adding (all additions are towards the end of dfs226.py, just above "go()"):
Code: Select all
label(0xb1b1, "general_service_handler")
general_service_handler does a JSR to a JSR and it starts to feel a bit confusing. laea9 does "cmp #&09"; does the accumulator still contain the service call number at this point? We inspect the code and decide it does, which means that &09 is the service call number for *HELP. We annotate the disassembly further by adding:
Code: Select all
constant(0x09, "service_help")
expr(0xaeaa, "service_help")
The expr() call says that the byte at &AEAA is a reference to the service_help constant. If we re-run dfs226.py the output now has:
Code: Select all
.laea9
cmp #service_help ; aea9: c9 09
bne laed7 ; aeab: d0 2a
Obviously this is a fairly tedious way to introduce named constants into the assembly. At some point you will have identified all the segments of code and data in the binary, documented them in the control file and you'll then take py8dis's output and start hacking on it in a text editor as you study the code further. The idea behind expr() is just that you can do some initial addition of named constants while you're still iterating with the disassembler.
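To make the constant()/expr() idea a bit more concrete, here is a minimal toy sketch of the lookup involved. It is not how py8dis is actually implemented; the two dictionaries and the format_immediate() helper are hypothetical names made up for this illustration.

```python
# Toy model only: these tables and the helper are hypothetical, not py8dis internals.
constants = {"service_help": 0x09}      # name -> value, as declared with constant()
expressions = {0xAEAA: "service_help"}  # operand address -> name, as declared with expr()


def format_immediate(operand_addr, value):
    """Return the operand text for an immediate-mode instruction."""
    name = expressions.get(operand_addr)
    if name is None:
        # No annotation for this byte, so fall back to a plain hex literal.
        return "#&%02x" % value
    # Guard against annotating a byte with a constant of a different value.
    assert constants[name] == value, "expression does not match the byte in memory"
    return "#" + name


print("cmp " + format_immediate(0xAEAA, 0x09))  # annotated operand
print("and " + format_immediate(0x80C9, 0x0F))  # unannotated operand
```

The assertion is the useful part of the sketch: an annotation is tied to a specific address, so a mistyped address shows up as a mismatch rather than as a silently wrong name in the output.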
As we browse the output some more, we spot this weirdness:
Code: Select all
tya ; aebc: 98
beq laed3 ; aebd: f0 14
jsr l8077 ; aebf: 20 77 80
ora l5554 ; aec2: 0d 54 55
equs "BE HOST 2.30" ; aec5: 42 45 20 ...
equb &0d, &ea ; aed1: ..
.laed3
Studying the code at &8077, we work out what it's doing and add a comment to remind us:
Code: Select all
comment(0x8077,
"""Print (XXX: using l809f, which seems to be quite fancy) an inline string
terminated by a top-bit set instruction to execute after printing the string.
Carry is always clear on exit.""")
We also tell py8dis that calls to this subroutine are followed by an inline string, and give the subroutine a name:
Code: Select all
hook_subroutine(0x8077, "print_inline_l809f_top_bit_clear", stringhi_hook)
Re-running the disassembly, we now have this:
Code: Select all
tya ; aebc: 98
beq laed3 ; aebd: f0 14
jsr print_inline_l809f_top_bit_clear ; aebf: 20 77 80
equs &0d, "TUBE HOST 2.30", &0d ; aec2: 0d 54 55 ...
nop ; aed2: ea
.laed3
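The scan a hook has to perform for this kind of subroutine can be sketched in a few lines. This is only an illustration under my own assumptions, not the code behind stringhi_hook; find_inline_string_end() and the dictionary used as memory are hypothetical.

```python
def find_inline_string_end(memory, jsr_addr):
    """Return the address of the first top-bit-set byte after the JSR at jsr_addr."""
    pc = jsr_addr + 3  # a JSR instruction is three bytes long
    while (memory[pc] & 0x80) == 0:
        pc += 1
    return pc


# Bytes from &AEBF onwards, as seen in the disassembly of the *HELP code.
base = 0xAEBF
data = bytes([0x20, 0x77, 0x80, 0x0D]) + b"TUBE HOST 2.30" + bytes([0x0D, 0xEA])
memory = {base + i: b for i, b in enumerate(data)}

resume = find_inline_string_end(memory, base)
print(hex(resume))  # address where tracing would continue
```

With the bytes above the scan stops at &AED2, the NOP (&EA) that the re-run disassembly shows straight after the string.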
Staring intently at the disassembly a bit longer, we find the service call 4 (unrecognised * command) handler and realise it's dispatching to different routines for different commands via the subroutine at &8703 and a table at &861C. Rather than try to understand the code properly, we spot the suspicious-looking "lda:pha:lda:pha:rts" code at &873C which suggests the table contains addresses suitable for use with RTS (i.e. one byte before the actual addresses of the corresponding code). Looking further at the table we deduce that each entry seems to have the format:
- command name
- big-endian address of code-1; as this is a ROM the high byte of the address will always be >=&80 so this terminates the command name implicitly with a top-bit-set byte
- some sort of extra byte
We can describe the table to py8dis with a short loop in the control file:
Code: Select all
pc = 0x861c
label(pc, "command_table")
for i in range(20):
    pc = stringhi(pc)
    pc = rts_code_ptr(pc + 1, pc)
    pc += 1 # XXX: what are we skipping here?
- We call stringhi() to mark the command name as a string and to get the address of the top-bit-set byte terminating it.
- We call rts_code_ptr() to indicate that there's an RTS-style pointer with its low byte at address pc+1 and its high byte at pc.
- We skip over the byte we don't currently understand; it will get labelled as byte data by default.
Re-running the disassembly, the start of the table now looks like this:
Code: Select all
.command_table
l861d = command_table+1
equs "ACCESS" ; 861c: 41 43 43 ...
equb >(l89e6-1) ; 8622: .
equb <(l89e6-1) ; 8623: .
equb &32 ; 8624: 2
equs "BACKUP" ; 8625: 42 41 43 ...
equb >(la417-1) ; 862b: .
equb <(la417-1) ; 862c: .
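If the entry format and the "address minus one" convention seem a bit abstract, this standalone sketch decodes the first entry by hand. It doesn't use py8dis at all; parse_command_entry() is a hypothetical helper and the layout is only what we deduced above, so treat it as an assumption rather than a documented format.

```python
def parse_command_entry(table, offset):
    """Decode one entry; return (name, target address, extra byte, next offset)."""
    start = offset
    # The name runs up to the top-bit-set high byte of the address.
    while (table[offset] & 0x80) == 0:
        offset += 1
    name = table[start:offset].decode("ascii")
    hi, lo, extra = table[offset], table[offset + 1], table[offset + 2]
    # RTS adds one to the address it pulls, so the table stores target - 1.
    target = ((hi << 8) | lo) + 1
    return name, target, extra, offset + 3


# First entry of the table at &861C, reconstructed from the output above.
entry = b"ACCESS" + bytes([0x89, 0xE5, 0x32])
name, target, extra, next_offset = parse_command_entry(entry, 0)
print(name, hex(target), hex(extra))
```

The stored bytes &89, &E5 give &89E5, and adding one gives &89E6, which matches the l89e6 label py8dis generated for the *ACCESS code.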
Examining the disassembly further, we notice there's another subroutine which takes inline data:
Code: Select all
.l9436
jsr l9ad8 ; 9436: 20 d8 9a
jsr l8048 ; 9439: 20 48 80
ora (l0045),y ; 943c: 11 45
equs "scape" ; 943e: 73 63 61 ...
equb &00 ; 9443: .
Studying the code at &8048, we add a comment and this time write our own hook for the subroutine:
Code: Select all
comment(0x8048,
"""Generate an OS error using inline data. Called as either:
    jsr XXX:equb errnum, "error message", 0
to actually generate an error now, or as:
    jsr XXX:equb errnum, "partial error message", instruction...
to partially construct an error (on the stack) and continue executing
'instruction' afterwards; its opcode must have its top bit set. Carry is
always clear on exit.""")
def generate_error_hook(target, addr):
    # addr + 3 is the error number
    pc = stringhiz(addr + 4)
    if memory[pc] == 0:
        # An OS error will be generated and the subroutine won't return.
        return None
    else:
        # A partial OS error will be constructed on the stack and the subroutine
        # will transfer control to the instruction following the partial error.
        return pc
hook_subroutine(0x8048, "generate_error", generate_error_hook)
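The decision this hook makes can be tried out without py8dis using the toy version below. It assumes the scan stops at the first byte that is either zero or has its top bit set, which is my reading of what stringhiz() does; error_call_resume_address() and make_memory() are hypothetical names, and the second test case uses made-up bytes.

```python
def error_call_resume_address(memory, jsr_addr):
    """Return None if the call never returns, else the address tracing resumes at."""
    pc = jsr_addr + 4  # skip the 3-byte JSR and the error number
    while memory[pc] != 0 and (memory[pc] & 0x80) == 0:
        pc += 1
    return None if memory[pc] == 0 else pc


def make_memory(base, data):
    """Build a sparse address -> byte mapping for the example."""
    return {base + i: b for i, b in enumerate(data)}


# Complete error from the listing at &9439: jsr l8048, &11, "Escape", 0
full = make_memory(0x9439, bytes([0x20, 0x48, 0x80, 0x11]) + b"Escape" + bytes([0x00]))
# Made-up partial error, for illustration only: the message is followed by a NOP (&EA).
partial = make_memory(0x2000, bytes([0x20, 0x48, 0x80, 0x11]) + b"Disc " + bytes([0xEA]))

print(error_call_resume_address(full, 0x9439))
print(hex(error_call_resume_address(partial, 0x2000)))
```

Returning None for the zero-terminated case is what stops the tracer from treating whatever follows the error message as code.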
I'll stop here, but you can see a fuller (but by no means complete; it was just created as a test/demonstration of py8dis) version of the disassembly in examples/dfs226.py. There are also example disassemblies of ANFS 4.18 and BASIC 4r32. You can also see how commands like rts_code_ptr() are implemented in commands.py, which may be helpful in writing your own variants.
I suspect the way this all works might be a little idiosyncratic; since I wrote it and have been evolving it as I go along, it feels relatively natural to me (if a little clunky in places), but I have no idea whether anyone else will be able to get along with it. Comments, questions (except "why?"), bug reports and feature requests are welcome!
I think it's more or less inevitable that you will run into assertion failures with scary-looking backtraces as a result of the "disassembler as a library" approach; I am open to trying to make things a bit more friendly if possible, but I'd also hope that with a bit of practice these become useful for indicating what's gone wrong. Please post if you give this a try and get stuck and I'll do my best to help.