640x512 50 fps Bad Apple on a B... how?!

cmorley
Posts: 1867
Joined: Sat Jul 30, 2016 8:11 pm
Location: Oxford

640x512 50 fps Bad Apple on a B... how?!

Post by cmorley »

Edit: YouTube link - https://www.youtube.com/watch?v=D_ta5QxBSMk

Motivation:
I was talking with Kieran last year at the Leicester ABUG about 2nd processor acceleration for games & the limit of what is practicable. The ultimate I suggested was to execute code directly from the Tube memory window &FEE0-&FEFF. This is a proof of concept for that method.
A few back of the envelope calculations and it seemed like it might be plausible to do Bad Apple in a graphics mode.

Hardware:
The machine is a stock B. I'm using a PC as a 'copro' on the Tube via a USB board (the one from the 6502decode thread).
Before anyone cries foul - this is exactly what the Tube port is for... hardware acceleration which Acorn did not conceive of ahead of time. I also added 8K of extra buffer to prevent USB buffer underflow/underruns.

Initial:
I started out with only the Tube snoop board connected to the Tube. Now you might realise why every address in the Tube window can be read from in comms mode :D
Using BeebASM I created a short bit of code which plotted a string on the screen repeatedly - a hello world to see if it would work reliably. Since all bus accesses come from the Tube I padded the instructions with extra bytes for the double reads the 6502 performs (every cycle does a bus access on the NMOS 6502).
This is the actual source I used:

Code:

\ Simple example illustrating use of BeebAsm

oswrch = &FFEE
osasci = &FFE3
addr = &70

MACRO PUTS val
	LDA #val
	JSR &FFEE
	NOP \ dead bus cycle
ENDMACRO

ORG &0000         ; nominal origin: no label addresses are assembled into the code, so it does not affect the emitted bytes

.start
	SEI:NOP
	LDA #1
	STA &FEFF
	JMP &FEE0
FOR n, 1, 512
	PUTS 'c'
	PUTS 'f'
	PUTS 'm'
	PUTS ' '
	JMP &FEE0
	PUTS 'r'
	PUTS 'u'
	PUTS 'l'
	PUTS 'e'
	PUTS 'z'
	JMP &FEE0
	PUTS ' '
	PUTS '0'+(n DIV 100) MOD 10
	PUTS '0'+(n DIV 10) MOD 10
	PUTS '0'+n MOD 10	
	PUTS ' '
	JMP &FEE0
	INC &70
	LDA &70
	AND #7
	ORA #&30
	JSR &FFEE
	NOP
	PUTS 10
	PUTS 13
	JMP &FEE0
NEXT	
	RTS
\.mytext EQUS "Hello world!", 13, 0
.end

SAVE "test.bin", start, end
Every so often I JMP &FEE0 so the execution stays in the 32 byte Tube window (&FEE0-&FEFF). I copied this to the virtual COM port on the PC and JSR &FEE8... hey presto 512 repeats of my message :)
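
The window bookkeeping in that listing can be sketched as follows. This is a minimal sketch, not the tool used for the demo: the `pack` function and the `PUTS` tuple are hypothetical names, execution is assumed to start at window offset 0, and the byte counts are read off the listing (PUTS advances the PC by 5 bytes but needs 6 stream bytes because of the pad NOP after the JSR).

```python
# Sketch only: a greedy packer that keeps the 6502 program counter inside
# the 32-byte Tube window by inserting JMP &FEE0 when needed.
WINDOW_SIZE = 32   # &FEE0..&FEFF
JMP_LEN = 3        # JMP abs is three bytes

# (name, pc_bytes, stream_bytes): pc_bytes is how far the PC advances,
# stream_bytes is how many bytes the Tube side must supply (pads included).
PUTS = ("PUTS", 5, 6)   # LDA #imm + JSR abs, plus one pad byte after the JSR


def pack(instrs):
    """Insert a JMP back to the window start before the PC would run off the end."""
    out, offset = [], 0
    for name, pc_bytes, stream_bytes in instrs:
        # Leave room for the JMP itself after this instruction.
        if offset + pc_bytes + JMP_LEN > WINDOW_SIZE:
            out.append(("JMP &FEE0", JMP_LEN, 3))
            offset = 0
        out.append((name, pc_bytes, stream_bytes))
        offset += pc_bytes
    return out


packed = pack([PUTS] * 10)
names = [n for n, _, _ in packed]
stream_len = sum(s for _, _, s in packed)
```

With ten PUTS in a row the sketch inserts one JMP after the fifth, which matches the grouping of at most five PUTS between jumps in the listing.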

More ambition:
It looked like the technique was going to work so I started work on the Bad Apple video...

The bulk of the work was writing the instruction stream compiler. It runs on a PC and I wrote it in C#. The program is task based and each task creates a stream of timed actions. The instruction compiler/packer then tries to create an instruction sequence which keeps looping in the Tube memory area and completes the actions by the time the task has set.

While I was developing/testing the 6502 instruction scheduler I found the USB chip couldn't keep up with the data rate (2MB/s) with its piddly 1K internal buffer. I used the Tube snoop and 6502decode to work out it was underflowing the buffer. I had to hook up extra hardware :( to get some more buffering. So I used my 2nd processor prototype as it had the necessary memory and already had a level shifted and debugged Tube interface.

Initially I tested the scheduler writing blocks of text to mode 7. Then I moved on to the audio. The scheduler stays in sync with the CPU and counts cycle stretching too. PCM playback of the Bad Apple audio stream sounded best although it is really quiet :(
Once the instruction packer was working it was simple to add the audio. A task just requests actions periodically and leaves it up to the instruction packer to make them happen!
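
As an illustration of a task that "requests actions periodically", here is a minimal sketch. The generator name and the exact 22,000 Hz rate are assumptions (the post only says 22kHz), and cycle stretching is ignored.

```python
CPU_HZ = 2_000_000      # 2 MHz 6502 clock, from the thread
SAMPLE_HZ = 22_000      # assumed; the post only says "22kHz"


def audio_actions(n_samples, start_cycle=0):
    """Yield (deadline_cycle, sample_index) pairs, one per audio sample."""
    for i in range(n_samples):
        # Computing each deadline from i avoids cumulative rounding drift.
        yield start_cycle + (i * CPU_HZ) // SAMPLE_HZ, i


deadlines = [t for t, _ in audio_actions(4)]
```

Under these assumptions a sample falls due roughly every 91 CPU cycles, and the packer is free to fill the gaps with screen writes.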

Next step was some form of video output. I created a boot loader which syncs with the video field, then always launches the main instruction stream with a known time delay after the vsync on the odd field. I tested this and scan line counting by setting blocks of colour on the screen by altering the palette. The video on the BBC uses the same clock as the processor so perfect cycle counting by my instruction packer means I can issue arbitrary code at arbitrary times... want something 10 scanlines down? Set the action deadline to 10*64us.
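
The scanline arithmetic is simple enough to sketch. The function name is hypothetical, and the sketch ignores the cycle stretching that the real scheduler is said to account for.

```python
LINE_US = 64          # one PAL scanline, as in the post
CYCLES_PER_US = 2     # 2 MHz CPU clock


def deadline_after_vsync(scanline):
    """Return (microseconds, cpu_cycles) from the sync point to a scanline."""
    us = scanline * LINE_US
    return us, us * CYCLES_PER_US


ten_lines = deadline_after_vsync(10)
```

So "10 scanlines down" is 640us, or 1280 CPU cycles after the sync point.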

After more instruction packer debugging I tried adding the video. First I set a task to dump a frame every few seconds in mode 0. Digital audio playing by now of course. Next 5 fps full frames. Next calculate the difference between adjacent fields... and only send the changes.
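
The "only send the changes" step could look something like this. It is a sketch assuming each field is held as a flat byte buffer in screen-memory order; `field_diff` is a hypothetical name, not taken from the original C# tool.

```python
def field_diff(prev, new):
    """Return [(offset, value)] for every byte that differs between two fields."""
    if len(prev) != len(new):
        raise ValueError("fields must be the same size")
    return [(i, b) for i, (a, b) in enumerate(zip(prev, new)) if a != b]


prev = bytes([0x00, 0xFF, 0x0F, 0x00])
new = bytes([0x00, 0xF0, 0x0F, 0x01])
changes = field_diff(prev, new)
```

Each `(offset, value)` pair then becomes one screen write for the packer to schedule.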

By counting microseconds & having a known raster position I schedule the screen updates as actions with an earliest time and latest time. This 'chases the raster' as the image is composed. It tries to write the next field data just behind the display of the current field. In frames where there is high motion (lots of bytes to stuff) the raster starts outrunning the fill window... but it has to catch and overtake to get a visual artefact. In a low motion portion of the video the byte stuffing catches the raster up again and sits just behind it.

This worked unbelievably well the first go and I dumped my list of optimisations I had yet to implement in the bin! (e.g. use iny,dey,inx,dex to save a byte for less looping). A free aspect of this is I have to update the video at 50fps so I get double vertical resolution for free! Hence 640x512 in mode 0.

The audio was running at 22kHz... it runs at 44kHz too but doesn't sound any better on the BBC's limited hardware.... so I saved the cycles.

Polish:
I tried adding dithering in the video task. This made some portions of the video look significantly better (star field when she's on the broomstick) but some bits worse (some of the shadows). 2x2 ordered dither was the best... I tried 4x4 but that removed too much detail.
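
For readers unfamiliar with it, a 2x2 ordered dither can be sketched as below. The post does not say which threshold matrix was used, so the standard 2x2 Bayer matrix here is an assumption, as are the function name and the 0..255 grey scale.

```python
# Standard 2x2 Bayer matrix (an assumption; the post does not name the matrix).
BAYER_2X2 = [[0, 2],
             [3, 1]]


def dither_2x2(gray_rows):
    """gray_rows: rows of 0..255 grey values. Returns rows of 0/1 pixels."""
    out = []
    for y, row in enumerate(gray_rows):
        out_row = []
        for x, g in enumerate(row):
            # Each position in the 2x2 tile has its own threshold.
            threshold = (BAYER_2X2[y % 2][x % 2] + 0.5) * 255 / 4
            out_row.append(1 if g > threshold else 0)
        out.append(out_row)
    return out


mid_grey = dither_2x2([[128, 128], [128, 128]])
```

A mid grey comes out as a one-pixel checkerboard, which is the sort of fine pattern that flickers on an interlaced display.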

There is a problem with the ordered dither and a TV in mode 0 though. Mode 0 dot pitch is higher than PAL... so some of the patterns become a bit flickery. I'll try it on a CUB monitor at some point and see how that looks... it didn't look good on Tricky's TV at ABUG and an LCD TV's temporal filters make a hash of decoding it. The flicker is what anyone who used high res modes on a TV (e.g. Amiga) BITD will have forgotten occurred!

I gave up at that point because it is just a proof of concept.

Simples:
It is just that easy :lol:
Seriously though the smarts is in the instruction scheduler... what you do with that is simple. I could have added copper bars... vertical copper bars!... all sorts... a task just needs to specify an action, timestamp and priority and the scheduler will do it. The scheduler tries to pack similar operations together within the constraints of higher priority actions and the small loop window (JMP &FEE0). If it overshoots it unwinds and runs the high priority action. It pads absolute time actions so they end _on_ the timestamp specified (used for 6522/sound chip) and tries to do as much of the other actions as it can (with the windowed time constraints). So the packer can even do useful work updating the screen in the sound chip 8us write enable time - instead of NOPs. It uses A, X & Y with an LRU policy to do memory writes - which reduces the number of LDA/LDX/LDY immediate loads. As a last resort if it really has nothing to do it pads the stream with NOPs or more usefully JMP &FEE0.
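
The register reuse idea can be sketched in a few lines. This is only an illustration of an LRU policy over A, X and Y, not the real scheduler: `emit_writes` is a hypothetical name, and timing, priorities and the Tube window are all ignored.

```python
STORE = {"A": "STA", "X": "STX", "Y": "STY"}
LOAD = {"A": "LDA", "X": "LDX", "Y": "LDY"}


def emit_writes(writes):
    """writes: [(addr, value)]. Returns a list of assembly strings."""
    held = {}                 # register -> value it currently holds
    lru = ["A", "X", "Y"]     # least recently used first
    out = []
    for addr, value in writes:
        # Reuse a register that already holds the value, if any.
        reg = next((r for r in lru if held.get(r) == value), None)
        if reg is None:
            reg = lru[0]      # evict the least recently used register
            held[reg] = value
            out.append(f"{LOAD[reg]} #&{value:02X}")
        out.append(f"{STORE[reg]} &{addr:04X}")
        lru.remove(reg)
        lru.append(reg)       # mark as most recently used
    return out


code = emit_writes([(0x3000, 0xFF), (0x3001, 0x00), (0x3008, 0xFF),
                    (0x3009, 0x0F), (0x3010, 0xAA)])
loads = sum(1 for line in code if line.startswith("LD"))
```

In this example the third write reuses the &FF already sitting in A, so five writes need only four immediate loads.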

Reveal:
I didn't tell Kieran about it ahead of time.. and waited for Sunday for the reveal... #-o

End Result:
640x512 50 fps Bad Apple on a B with 22kHz digital audio. That's full PAL resolution. LCD TVs make a pig's ear of the dithered shading in mode 0 for some of the patterns but hey, it's a computer from 1981.
It could be improved for CRT - but I don't have one here to test on... it could be improved for LCD too... perhaps an error propagation dither would do better... but as I'd sunk enough time into it I stopped at a working 2x2 ordered dither.
Last edited by cmorley on Thu Jun 14, 2018 9:49 am, edited 2 times in total.
BigEd
Posts: 6261
Joined: Sun Jan 24, 2010 10:24 am
Location: West Country

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by BigEd »

Spectacular! Wish I'd been around to see it. Any chance of a video?
1024MAK
Posts: 12783
Joined: Mon Apr 18, 2011 5:46 pm
Location: Looking forward to summer in Somerset, UK...

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by 1024MAK »

I must say, it was impressive. Chris demoed it just as I was about to leave, I think everyone who was still there was crowded around watching it. The addition of the 'grey' definitely made it special.

So well done Chris =D>

Mark
tricky
Posts: 7697
Joined: Tue Jun 21, 2011 9:25 am

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by tricky »

Could the playback have been done bitd with some simple counters and 74 series logic and a large ROM?
A counter that increments when a read from the second 16 bytes happens after the previous one being from the first 16 bytes. This would be used to address the ROM for reads for the first 16 bytes. A second set would be used for the other 16 bytes of the interface. The high bit of the interface address coding which counter to use.
I know you couldn't fit the whole video in and the encoding would need doing up front.

I was also thinking of some other optimisations that you don't need:
TXS, TSX.
ASL, LSR, ROL, ROR.
Branch instead of jmp as you can know the state of the flags, or just use V.
cmorley
Posts: 1867
Joined: Sat Jul 30, 2016 8:11 pm
Location: Oxford

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by cmorley »

tricky wrote: Wed Jun 06, 2018 8:39 pm Could the playback have been done bitd with some simple counters and 74 series logic and a large ROM?
A counter that increments when a read from the second 16 bytes happens after the previous one being from the first 16 bytes. This would be used to address the ROM for reads for the first 16 bytes. A second set would be used for the other 16 bytes of the interface. The high bit of the interface address coding which counter to use.
I know you couldn't fit the whole video in and the encoding would need doing up front.

I was also thinking of some other optimisations that you don't need:
TXS, TSX.
ASL, LSR, ROL, ROR.
Branch instead of jmp as you can know the state of the flags, or just use V.
Yes it sure could. It is just a stream of 6502 code. I don't use the addresses on the Tube even, it just stuffs a byte when one is asked for by the CPU. Since the code generator tracks the memory location it will always know the CPU execution point. The 2MHz CPU clock needs (first order approximation) 2MB/s to keep it fed. 120MB/min would be a _big_ BITD ROM but feasible if not affordable.
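
The first-order figures are easy to reproduce. In this sketch the one byte per cycle rate is the approximation from the post; cycles spent writing to screen memory fetch nothing from the Tube, so the real rate should be somewhat lower. The function name is hypothetical.

```python
CPU_HZ = 2_000_000   # 2 MHz CPU clock


def stream_bytes(seconds, bytes_per_cycle=1):
    """First-order stream size: one byte supplied for every CPU cycle."""
    return int(seconds * CPU_HZ * bytes_per_cycle)


per_second = stream_bytes(1)
per_minute = stream_bytes(60)
```

That gives 2,000,000 bytes per second and 120,000,000 bytes per minute, matching the 2MB/s and 120MB/min quoted.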

TSX, TXS... humm I didn't think of that one.
tricky
Posts: 7697
Joined: Tue Jun 21, 2011 9:25 am

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by tricky »

I use it in my self test OS to save X as it doesn't rely on any working ram.
cmorley
Posts: 1867
Joined: Sat Jul 30, 2016 8:11 pm
Location: Oxford

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by cmorley »

BigEd wrote: Wed Jun 06, 2018 6:45 pm Spectacular! Wish I'd been around to see it. Any chance of a video?
I'll see what I can do about a video... my phone camera is low res video unfortunately. I don't have a PC capture card either.
BigEd
Posts: 6261
Joined: Sun Jan 24, 2010 10:24 am
Location: West Country

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by BigEd »

No worries, I'm sure there'll be a second chance to see it.
kieranhj
Posts: 1103
Joined: Sat Sep 19, 2015 11:11 pm
Location: Farnham, Surrey, UK

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by kieranhj »

Amazing work! I love the idea of someone plugging in a massive cartridge back in the 1980s and playing the Bad Apple video like this. It'd undoubtedly have been both physically massive and massively expensive as well..!!

Shall we all chip in and buy Chris a decent phone to record a video on? ;) We all love retro Chris but you can't survive with an 8-bit phone in 2018. :D
Bitshifters Collective | Retro Code & Demos for BBC Micro & Acorn computers | https://bitshifters.github.io/
Rich Talbot-Watkins
Posts: 2054
Joined: Thu Jan 13, 2005 5:20 pm
Location: Palma, Mallorca

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by Rich Talbot-Watkins »

Amazing the technology you can get these days.

Image
cmorley
Posts: 1867
Joined: Sat Jul 30, 2016 8:11 pm
Location: Oxford

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by cmorley »

The code stream doesn't have to be precomputed... it can be generated on the fly. Small bursts could be used to move sprites for example from a pi or fpga co-pro.

I am sure it will be possible to get a video. Either one here or at another ABUG.
cmorley
Posts: 1867
Joined: Sat Jul 30, 2016 8:11 pm
Location: Oxford

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by cmorley »

A mate came over and filmed it on his new fangled smartphone thingy.

BBC Tube Bad Apple

There are a few moire effects not visible in real life & you can see the hash that the LCD makes of some of the dithering... but it works. If you hear mumbling in the background that is us talking in the other room... sorry!
BigEd
Posts: 6261
Joined: Sun Jan 24, 2010 10:24 am
Location: West Country

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by BigEd »

Wow!
Elminster
Posts: 4315
Joined: Wed Jun 20, 2012 9:09 am
Location: Essex, UK

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by Elminster »

That is officially crazy. I don’t believe it but just in case put me down for two. I am not sure what exactly it is but I need one.

Even if it is just a cable from a pc to a bbc driving it via the tube, which I am sort of guessing it is from the blurb.
marcusjambler
Posts: 1147
Joined: Mon May 22, 2017 12:20 pm
Location: Bradford

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by marcusjambler »

=D> seriously impressive =D>
kieranhj
Posts: 1103
Joined: Sat Sep 19, 2015 11:11 pm
Location: Farnham, Surrey, UK

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by kieranhj »

Witchcraft, I tell you! :D

I'm still slightly confused about one detail. The CPU is still only running at 2MHz so, even though you have a perfectly precomputed instruction stream, you've still only got 39936 cycles/frame so can write a maximum of 9984 bytes/frame which is far less than the MODE 0 screen size. There are some frames in the Bad Apple video that invert every pixel in a single frame so would require a complete screen fill (not to mention keeping the sampled audio running.) Are you just dropping frames here or spreading the work across multiple frames somehow?
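
The numbers in that question can be checked with a quick sketch. The 4 cycles per byte figure is the one implied by 39936 / 9984, and 20K is the MODE 0 screen size; the variable names are only for illustration.

```python
CYCLES_PER_FRAME = 39936         # figure quoted in the post (312 lines x 128 cycles)
CYCLES_PER_BYTE = 4              # implied by 39936 / 9984
MODE0_SCREEN_BYTES = 20 * 1024   # MODE 0 uses 20K of screen memory

max_bytes = CYCLES_PER_FRAME // CYCLES_PER_BYTE
frames_for_full_screen = MODE0_SCREEN_BYTES / max_bytes
```

On these figures a complete screen rewrite needs a little over two frames' worth of CPU time, which is the gap the question is about.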
cmorley
Posts: 1867
Joined: Sat Jul 30, 2016 8:11 pm
Location: Oxford

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by cmorley »

kieranhj wrote: Wed Jun 13, 2018 11:22 am Witchcraft, I tell you! :D

I'm still slightly confused about one detail. The CPU is still only running at 2MHz so, even though you have a perfectly precomputed instruction stream, you've still only got 39936 cycles/frame so can write a maximum of 9984 bytes/frame which is far less than the MODE 0 screen size. There are some frames in the Bad Apple video that invert every pixel in a single frame so would require a complete screen fill (not to mention keeping the sampled audio running.) Are you just dropping frames here or spreading the work across multiple frames somehow?
The codegen cycle counts and is synchronised with VSYNC (& odd field) when execution is passed to it. It generates field differences then packages them up into horizontal strips. Since we are synced to the raster I set the "not before" time and the "finish by" timestamp for each strip based on the raster time. This means that the strip can only be updated after the raster has emitted that screen region for the previous frame and should try to complete the strip before the raster would emit this frame.
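
One way to picture the "not before" and "finish by" timestamps is the sketch below. It assumes scanline 0 coincides with the start of the field (vertical blanking offsets are ignored) and a nominal 20,000us field; `strip_window` is a hypothetical name.

```python
LINE_US = 64
FIELD_US = 20_000   # one field at 50 fields/s (nominal)


def strip_window(field_start_us, first_line, last_line):
    """Return (not_before_us, finish_by_us) for a strip of scanlines.

    field_start_us is when the raster begins the field currently on display.
    """
    # The strip may be rewritten once the raster has passed its last line...
    not_before = field_start_us + (last_line + 1) * LINE_US
    # ...and should be done before the raster reaches its first line next field.
    finish_by = field_start_us + FIELD_US + first_line * LINE_US
    return not_before, finish_by


window = strip_window(0, 8, 15)
```

For a strip covering lines 8 to 15 this leaves a window of nearly a full field in which the update can be scheduled.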

So... in periods of low motion, the strip emitter catches up with the raster and my code starts packing NOPs to prevent it overtaking. The raster is lagging the strip emitter by nearly a full field (~1/50s).

In periods of fast motion the strip emitter can't update at the raster pace. You would see an artefact if the raster catches the strip emitter (i.e. it hasn't written the next field data before the display sends it to the TV/monitor). The strip emitter has a huge head start though. The raster starts catching up with the strip emitter... but the strip emitter is still advancing too so we have (much) more than 1 field of time to update the screen.

If you simplify the maths a little and say the raster goes twice as fast as the slowed strip emitter then you have 2 full fields to draw the screen. i.e. you can draw a full screen with _no_ artefact.
Not quite true in reality (overheads) but close...
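
The simplified maths can be written down as a sketch: with the raster at speed 1 and the emitter at some fraction of that, the gap closes at the difference of the two speeds. The function name and the one-field head start default are assumptions for illustration.

```python
def fields_until_caught(emitter_speed, head_start_fields=1.0):
    """Fields of time before the raster catches a slower strip emitter.

    emitter_speed is a fraction of the raster speed (raster = 1.0).
    """
    if not 0 <= emitter_speed < 1:
        raise ValueError("emitter must be slower than the raster to be caught")
    return head_start_fields / (1.0 - emitter_speed)


t = fields_until_caught(0.5)
```

With the emitter at half speed and a one-field head start, `t` is 2.0, i.e. the two full fields mentioned above.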

So there may be some fields where the previous field data is shown but these will get updated by the next field & there are no visible artefacts or tearing - so I don't think it happens much. There is tearing if I allow the strip emitter to overtake the raster so I know what to look for.

Lastly (essay is getting long!) the original source is 30fps... I found the raw original 960x720 video. So to increase the framerate to 50fps duplicate frames are inserted... this means that quite regularly the emitter has almost no work to do for a field (only the dithering) so regularly gets a chance to catch right up to the raster again.
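
How often a duplicate turns up can be seen with a small sketch. The mapping here (each output field takes the most recent source frame) is an assumption; the post only says that duplicate frames are inserted.

```python
SRC_FPS, OUT_FPS = 30, 50


def source_frame(field_index):
    """Index of the 30fps source frame shown in a given 50fps output field."""
    return field_index * SRC_FPS // OUT_FPS


mapping = [source_frame(i) for i in range(10)]
# Output fields that repeat the previous field's source frame.
repeats = [i for i in range(1, 10) if mapping[i] == mapping[i - 1]]
```

Under this mapping two fields in every five repeat the previous source frame, which gives the emitter regular quiet fields.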

For those that want a (not too terrible) analogy:
Think of two cyclists say on a velodrome track.

The raster cyclist (R) goes at a constant speed and every 10m shouts out the colour of the track. The strip emitter cyclist (P) is desperately painting the track ahead of the raster cyclist R. The P cyclist goes half the speed of R if they are painting but super fast if the track is already the right colour and they don't need to paint it... so if the track is mostly the right colour then P catches up to the back of R. P is not allowed to overtake or R will see & shout out the wrong colour that lap.

If the painter P has to paint loads of track then they will slow but still keep moving... Since they go at about half the pace of R (it can be shown that) R will take 2 full laps to catch P. (geometry or algebra etc... Google type interview question!) If R overtakes P then R will see the wrong colour. But P has painted the track in time (just) - now the track is the correct colour and P speeds up and catches all the way up to the back of R again.

So P managed to paint the entire track without R seeing the wrong colour even though P cycles at half R's speed when painting.
oss003
Posts: 3849
Joined: Tue Jul 14, 2009 12:57 pm
Location: Netherlands

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by oss003 »

Wow, never thought that this was possible .... great job =D>
I thought playing Bad Apple at 20 frames per second on my Atom was fast .....

Greetings
Kees
tricky
Posts: 7697
Joined: Tue Jun 21, 2011 9:25 am

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by tricky »

For the full frame swap, you could just update the palette (offscreen) ;)
cmorley
Posts: 1867
Joined: Sat Jul 30, 2016 8:11 pm
Location: Oxford

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by cmorley »

tricky wrote: Wed Jun 13, 2018 3:23 pm For the full frame swap, you could just update the palette (offscreen) ;)
I had that in my list of optimisations I didn't need in the end. Also multiple palette swaps during the frame... again not needed.
tricky
Posts: 7697
Joined: Tue Jun 21, 2011 9:25 am

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by tricky »

I know it doesn't matter here, but you could detect that you were going to run out of time and leave any bytes only changing 1 bit until the next frame, prioritising ones that have several similar neighbors.

Ps prioritising ones means picking them first for deferring.
cmorley
Posts: 1867
Joined: Sat Jul 30, 2016 8:11 pm
Location: Oxford

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by cmorley »

It's a shame it won't run directly from the buffer in the FT232H (in CPU mode) otherwise anyone with a board from eBay would be able to run it on their machine. I had to add some more buffering with extra hardware. No reason why it couldn't be made to run from a pi-tube direct I suppose.
hoglet
Posts: 12665
Joined: Sat Oct 13, 2012 7:21 pm
Location: Bristol

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by hoglet »

cmorley wrote: Wed Jun 13, 2018 5:39 pm No reason why it couldn't be made to run from a pi-tube direct I suppose.
I don't think it would be easy though.... in fact I would go as far as saying it would be very tricky.

The main difficulty is preventing the cache misses the Pi will experience from interrupting/crashing the instruction stream.

At 2MHz the Pi only has ~350ns to respond to a read request on the 6502 bus. A single cache miss will blow the timing budget. The original version of PiTubeDirect used the ARM core only, and once the memory footprint of the Co Pro exceeded the cache size, the timings became quite unpredictable.

In the later versions, Dominic was able to port the 6502 bus handling code to run on the GPU, leaving the ARM core free to emulate the Co Processor at its leisure. The two work together by exchanging data via an "unused" block of 8 I/O registers. By avoiding using main memory (between the ARM and the GPU) we guarantee the GPU never experiences any main memory contention, and its response time is predictable.

We are not completely sure what the intended use of these I/O registers is. I only found them after writing some code that did an exhaustive search. As far as I remember, there is no region larger than 8 words that behaves transparently (i.e. it appears to operate just like memory).

FYI, the registers we re-purpose are:
- MS_MBOX_0 (0x7e0000a0)
...
- MS_MBOX_7 (0x7e0000bc)

So just possibly there is enough space there (32-bytes) to fill the whole 0xFEE0-0xFEFF window.

I think the hardest problem would be somehow synchronising the ARM core filling these registers with the 6502 executing them. One additional flag (in an I/O register) would need to be toggled by the GPU when it wants the ARM to supply the next 4 words (kind of a double buffering arrangement). The ARM would have ~8us to respond, which gives some margin for cache misses.
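
The ~8us figure follows from the register sizes, as this sketch shows. It assumes the 6502 fetches roughly one byte per 2 MHz cycle, which is the first-order rate used elsewhere in the thread; the variable names are only for illustration.

```python
REGISTERS = 8              # MS_MBOX_0 .. MS_MBOX_7
BYTES_PER_REGISTER = 4     # 32-bit I/O registers
CPU_BYTES_PER_US = 2       # 2 MHz, roughly one byte fetched per cycle

window_bytes = REGISTERS * BYTES_PER_REGISTER
half_bytes = window_bytes // 2                    # the "next 4 words"
refill_budget_us = half_bytes / CPU_BYTES_PER_US  # time to consume one half
```

Eight registers give exactly the 32 bytes of the window, and while the 6502 works through one 16-byte half the ARM has about 8us to refill the other.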

How big is the uncompressed instruction stream? I'm guessing 300-400MB (for 3 mins 30 sec video)? This is small enough that it would fit in RAM on the Pi, which is handy.

Dave
cmorley
Posts: 1867
Joined: Sat Jul 30, 2016 8:11 pm
Location: Oxford

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by cmorley »

There is the benefit of course that you don't need to care about the address... just reply on any and every Tube read. Perhaps this helps with the low level register requirement? To the first order it needs 2MB/s (120MB/min) to feed the 6502. Is there any DMA that could help - like on the embedded cortex Mn chips?
hoglet
Posts: 12665
Joined: Sat Oct 13, 2012 7:21 pm
Location: Bristol

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by hoglet »

There is DMA, but I'm not sure how it would help here.

I think as a first step it would be worth coding this as a tight ARM-only loop, and see how it's affected by cache misses. We might just get away with it, and possibly there might be pre-fetching that would mask the true DRAM latency. This would be quite a small amount of work to try.

Failing that, trying essentially the same code, but on the GPU.

Is there any chance you could upload an example of the instruction stream somewhere?

I know I could easily make up a test one, but it would be fun to develop with the real thing...

Dave
Last edited by hoglet on Mon Jun 18, 2018 7:47 pm, edited 1 time in total.
cmorley
Posts: 1867
Joined: Sat Jul 30, 2016 8:11 pm
Location: Oxford

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by cmorley »

hoglet wrote: Wed Jun 13, 2018 7:31 pm Is there any chance you could upload an example of the instruction stream somewhere?

I know I could easily make up a test one, but it would be fun to develop with the real thing....
I could render 10s or 20s of video into an instruction stream say. I'll PM you a link.
hoglet
Posts: 12665
Joined: Sat Oct 13, 2012 7:21 pm
Location: Bristol

Re: 640x512 50 fps Bad Apple on a B... how?!

Post by hoglet »

cmorley wrote: Wed Jun 13, 2018 7:37 pm I could render 10s or 20s of video into an instruction stream say. I'll PM you a link.
Thanks, I might have a play with this tomorrow.
Last edited by hoglet on Mon Jun 18, 2018 7:46 pm, edited 1 time in total.