Wave Runner Demo Details

VectorEyes · Post by **VectorEyes** » Wed Jul 10, 2019 12:33 am

Introduction

In a similar manner to Kieran's Twisted Brain write-up from last year, I'm going to describe how Wave Runner works under the hood, as well as providing some details about its genesis and how it all came together.

The demo itself can be found here: https://bitshifters.github.io/posts/pro ... unner.html

I'm hoping to get all the parts written over the next two weeks, time permitting. I'll block out a number of posts for the stuff I want to talk about, and fill them in as and when they're ready. Several demo effects (as well as a lot of the build system and all kinds of other things!) were done by Tom Seddon and with luck he'll have time to describe his work, so I'll block out parts for those as well.

Rough plan is to do a high-level introduction to the demo framework, talk a bit about Stable Raster, NOP Slides and Clockslides, then move onto a description of each of the effects, and finish up with some closing thoughts.

So, without further ado... Let's start with a Framework Overview.

VectorEyes · Post by **VectorEyes** » Wed Jul 10, 2019 12:34 am

The Demo Framework

Wave Runner is heavily influenced by, and shares some code with, Twisted Brain (hereafter known as 'TB'). There will be places in this write-up where it's easier to refer to TB and describe how Wave Runner is different than to describe how Wave Runner works in detail.

Similarities to TB include:

- Both demos have a 'Render' function which runs while the Video ULA is scanning out the visible portion of the frame, followed by a number of 'Update' functions which do music decompression and playback, run 'Update' code for the current effect, and do 'Scripting' (deciding what other code to run each frame).
- The music playback is very similar: Exomizer decompression produces up to 11 bytes per frame, which are sent to the SN chip immediately after the 'Render' function completes.

Significant differences include:

- Wave Runner uses fully Stable Raster (cycle-accurate timing with respect to the Video ULA output). It achieves this by use of a NOP slide, of which more later.
- The 'Effect Render' function starts approximately 192 cycles (ie 1.5 scanlines) *before* the start of the visible frame. This is to allow the render function to do any 'preparation' necessary before the effect starts rendering.
- Wave Runner runs with interrupts enabled (although only the System Via Timer1 is enabled). This has some positive ramifications as described shortly.
- TB used Exomizer 'streaming' compression to decrunch the music stream, and PuCrunch to decompress images and data. WR uses two separate Exomizer decompressors, one in 'Streaming' mode (for the music) and one in 'Targeted' mode (for all other decompression).
- WR has the ability to run code 'in the background'. It is interrupted once per frame to run the entirety of the Render/Update loop, but then returns to a loop which can be doing useful stuff like Exomizer decompression or clearing the screen, that runs until the T1 interrupt triggers the next Render/Update loop.
- The music player was heavily optimised for the Master 128's 65C02 by HexWab.

An overview of the Framework

The demo is split into several systems:

- The main 'Render/Update loop': Triggered once per frame, just before the Video ULA starts scanning out the visible frame. Responsible for calling the current effect's Render and Update functions, as well as ticking all the other systems.

- The 'Background Processing' loop: Runs all the time except when interrupted by the Render/Update loop. Responsible for Targeted Exomizer decompression and screen clearing.

- The 'Effect System': Maintains a big table of render/update/startup/shutdown functions for each effect, and is responsible for calling them appropriately to transition between effects. Also manages Sideways RAM banks and Shadow/Main memory state for each effect.

- The 'Task System': Runs up to 6 additional functions per frame. Each task has access to a small block of data containing its arguments. The system can run tasks for a specified number of frames, or until the task function marks itself as complete.

- The 'Timeline'. This reads a stream of bytes in memory and interprets it as instructions such as 'Wait for 60 frames then spawn this task' or 'Wait until the current decrunch has completed and then kick off another decrunch', etc. Timeline points can be relative to the start of the demo, the start of the effect, the last timeline point, or can wait for various 'flags' to be set. Each Effect has its own timeline and some have several timelines used at different points.

- The 'VGM Player'. Decrunches bytes of music data and sends them to the sound chip.
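To make the Task System description a little more concrete, here is a rough Python model of it. This is only a sketch of the behaviour described above (up to 6 tasks, each with a small argument block, running for a fixed number of frames or until the task says it's done). The class and method names are invented for illustration and the real thing is 65C02 code with fixed-size tables, not Python objects.

```python
MAX_TASKS = 6  # the write-up says up to 6 tasks can run per frame


class Task:
    def __init__(self, func, args, frames=None):
        self.func = func      # called once per frame; a truthy return means "complete"
        self.args = args      # small block of per-task data
        self.frames = frames  # None = run until func reports completion


class TaskSystem:
    """Hypothetical model of the per-frame task runner."""

    def __init__(self):
        self.tasks = []

    def spawn(self, func, args, frames=None):
        """Add a task; returns False if all task slots are in use."""
        if len(self.tasks) >= MAX_TASKS:
            return False
        self.tasks.append(Task(func, args, frames))
        return True

    def tick(self):
        """Run every active task once and drop the ones that have finished."""
        still_running = []
        for task in self.tasks:
            done = bool(task.func(task.args))
            if task.frames is not None:
                task.frames -= 1
                done = done or task.frames <= 0
            if not done:
                still_running.append(task)
        self.tasks = still_running
```

In this model the Render/Update loop would call `tick()` once per frame, and the Timeline would call `spawn()`.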

Memory map

&0000 - &00FF : Zero page. All kinds of stuff that is referred to frequently by the code, e.g. timers tracking how long it's been since the start of the demo, the current effect, and the last 'timeline point', a 32-byte 'effect workspace' that each effect can use for whatever it likes, small buffers needed for the Exomizer decompressors, etc.

&0100 - &01FF : 6502 Stack, but also contains a 156-byte table used by the Targeted Exo Decompressor.

&0300 - &0FFF : 3328 byte buffer used by streaming Exo3 decompressor (for music).

&1000 - &1FFF : All the demo framework code, plus several tables of sine values at various amplitudes.

&2000 - &2FFF : 'Effect workspace'. Each effect is free to put whatever code or data it wants here.

&3000 - &7FFF : Screen memory. (The demo runs in a mixture of MODE1 and MODE2, both of which require the full 20k). The demo will often display 'Main' memory while writing a new image to 'Shadow' or vice versa.

&8000 - &BFFF : Sideways RAM banks x 4. Three banks contain the code and data for all the effects, plus the Exo-compressed images. The fourth bank contains the first 16k of the compressed music.

&C000 - &DFFF : HAZEL, which contains the rest of the compressed music, and right at the end another 156-byte workspace used by the Streaming Exo decompressor.

&E000 - &FFFF : OS ROM, interrupt handling routines etc.

Notes on memory map:

Exomizer provides a trade-off between the amount of 'workspace' needed at runtime and the compression ratio. By specifying a larger workspace during the compression step, you can reduce the size of the compressed data. For the music ("Synergy Main Menu" by Scavenger) we were lucky in that using a workspace size of 3328 bytes compresses the music data into 24411 bytes. This fits into one SWR Bank plus most of HAZEL, leaving space for an additional 156 bytes right at the end of HAZEL (used for another small Exo-based workspace) with just 9 bytes free! The 3328-byte workspace fits between &300 and the demo framework code at &1000.
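For anyone who wants to check the '9 bytes free' figure, the arithmetic can be reproduced in a few lines of Python. This uses only the sizes and addresses quoted in the memory map above; the variable names are mine.

```python
# Sizes quoted in the memory map and notes above.
SWR_BANK = 0xC000 - 0x8000   # one 16K Sideways RAM bank
HAZEL = 0xE000 - 0xC000      # 8K of HAZEL
MUSIC = 24411                # Exo-compressed music stream, in bytes
EXO_TABLE = 156              # streaming decompressor's table at the end of HAZEL

# Whatever does not fit in the SWR bank spills into HAZEL.
music_in_hazel = MUSIC - SWR_BANK
free_bytes = HAZEL - music_in_hazel - EXO_TABLE

# The streaming buffer in the memory map runs from &0300 up to the framework code at &1000.
stream_buffer = 0x1000 - 0x0300
```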

ANDY is not used. It's reserved for future demos when we really start to run out of space.

Similarly to TB, we keep HAZEL active all the time (the demo never uses the OS VDU routines and keeps that part of the OS ROM paged out) and the streaming music decompressor runs down through SWR bank 3 and straight into HAZEL.

The Render/Update loop

Here's what happens in the IRQ Handler that's triggered by System Via Timer1. (Note many details omitted for clarity!):

- (Housekeeping code that caches X and Y so we can return from the IRQ properly. A is already cached in &FC.)
- Correct for interrupt jitter to achieve stable raster (see section on NOP slides).
- Set up SWR and main/shadow state for the current effect.
- Run 'Render' function for current effect.
- Run music player.
- Tick the Timeline System. (This may lead to a transition to the next effect, because all effect transitions are triggered by the effect timelines).
- Tick the Task System, which will tick all active Tasks.
- Run 'Update' function for current effect.
- Deliberately waste several scanlines' worth of cycles. Reserving cycles gives us a crude measure of how close to 'CPU capacity' the demo is.
- (Update the various counters that increment once per frame).
- Restore Shadow/Main state and SWR bank to those needed for the Background Processing.
- Restore X, Y and A, and RTI.

The Background Processing loop continuously does the following:

- Check if the "Clear Screen Requested" flag is non-zero. If so, jump to the code that handles screen-clearing.
- Check if the "Exomizer Decrunch Requested" flag is non-zero. If so, jump to the code that does Exo decompression.

Overall, the system is designed to let you run timing-critical rendering code synced to the raster beam, but to also run code 'once per frame at some point' or 'in the background as fast as possible'.

This diagram correlates when the different bits of the framework are running with the CRTC cycle:

VectorEyes · Post by **VectorEyes** » Wed Jul 10, 2019 12:34 am

NOP Slides and Clockslides

In several places the demo needs to delay by an exact number of cycles, but the cycle count is continually changing and only known at runtime. The techniques necessary to do this are out there on the Web (e.g. https://www.pagetable.com/?p=669) but for those who are interested and/or haven't seen them before I'll go over them briefly.

NOP Slides: When you need to delay by 2N cycles.

The Wave Runner code synchronises itself to the vertical sync interrupt using techniques already described in the Twisted Brain writeup. This gets you an IRQ handler that is called every frame at a known offset from the vsync, but with a few cycles of jitter. (This jitter is caused by several things, most notably the fact that when an interrupt fires, the CPU must wait for the current instruction to finish -- which could take between 1 and 7 cycles -- before servicing the interrupt. Combined with other effects such as cycle stretching when reading the VIAs, in effect you can have up to 8 cycles of jitter.)

To correct for this, you do the following:

Read Timer1 Low.
Extract the lowest 3 bits and invert them. (This gives a value from 0 to 7 where 0 means 'Timer value was large, so correct with a long delay' and 7 means 'Timer value was small, so correct with a short delay'. Remember the counter is counting down, not up!).
Write the value into the second byte of a Branch instruction, ie the branch offset.
Branch into a series of repeating NOPs.

The code that does this in Wave Runner looks like this:

Code: Select all

.aboutToReadT1
lda sysViaStart + viaReg_T1CounterLow \read T1L, clear interrupt, also sync to 1MHz due to cycle stretching
.t1lInAReadyToSlide
; Extract lowest 3 bits, use result to control a NOP slide. This corrects for timer jitter and provides stable raster.
and #7
eor #7
sta branch+1
.branch
bpl branch \always
.slide
; Note: this slide delays (CPU cycles) by TWICE the 'input' to the slide, which is
; what we want because the T1 counter is 1MHz, but the CPU runs at 2MHz.
nop:nop:nop:nop
nop:nop:cmp &3

Because the 1MHz VIA timers operate at half the speed of the CPU, and NOPS take two cycles, this has the effect of introducing a delay which exactly counteracts the jitter.

Credit goes to Hexwab for detailing this technique (in much more detail!) here.

(At this point I have to admit that I have no idea why I put a CMP &3 at the end. It's an easy way to use 3 cycles instead of two, and I suspect it was because at some point I needed to delay for an extra cycle. It might look like I've missed one NOP -- there are only 6 NOPS, but the branch values range from 0 to 7 -- so the code might branch to the "&3" byte of the final CMP, and treat it as an instruction. But on the 65C02, opcode 03 is a one-cycle NOP, which means the jitter correction still works!)
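If you want to convince yourself that the slide really does give a 2-cycle step for every branch offset (including the odd-looking last one), here is a small Python model of it. It is not the demo code, just a byte-level walk through the slide using standard 65C02 cycle counts; the function names are made up for the example.

```python
# Cycle costs and lengths of the 65C02 opcodes that can be reached in the slide.
CYCLES = {0xEA: 2,   # NOP
          0xC5: 3,   # CMP zero-page
          0x03: 1}   # one-byte, one-cycle NOP on the 65C02
LENGTH = {0xEA: 1, 0xC5: 2, 0x03: 1}

# nop:nop:nop:nop:nop:nop:cmp &3
SLIDE = [0xEA] * 6 + [0xC5, 0x03]


def slide_cycles(offset):
    """Cycles spent in the slide when the branch lands `offset` bytes in."""
    pc, total = offset, 0
    while pc < len(SLIDE):
        op = SLIDE[pc]
        total += CYCLES[op]
        pc += LENGTH[op]
    return total


def correction(t1_low):
    """Branch offset produced by `and #7 : eor #7`, and the delay it gives."""
    offset = (t1_low & 7) ^ 7
    return offset, slide_cycles(offset)
```

Each step down in the timer's low three bits means the handler arrived two CPU cycles later, and the slide gets two cycles shorter to match, so the total comes out the same every time.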

One detail that the original article mentions, but which took me ages to appreciate the importance of: the number of cycles between the interrupt firing and reading Timer1 Low is crucial. You need to carefully set up the code so that the Timer1 read is at just the right point within an 8-cycle repeating loop.

So when you want to delay by 2N cycles, use a NOP slide. But what if you want to delay in 1-cycle increments, instead of two?

ClockSlides: When you need to delay by N (+ constant)

The concept of a clockslide is similar to a NOP slide, but by changing the 'control' value you can change how many cycles to waste at one-cycle granularity.

Here's a clockslide that expects a value between 0 and 13 in A, and introduces a delay of between 15 and 2 cycles (not including the cycles for the STA and the BRA):

Code: Select all

STA slide+1
.slide
BRA slide
cmp #&C9 : cmp #&C9 : cmp #&C9 : cmp #&C9 : cmp #&C9 : cmp #&C9 : cmp &EA

The way this works is as follows:

If A is 0, it executes 6 x "CMP #&C9" (CMP immediate, 12 cycles) plus one "CMP &EA" (CMP zero-page, 3 cycles), total: 15
If A is 1, it branches to the second (comparison value) byte of the first CMP... which is &C9... which is the opcode for CMP immediate! So it executes 6 x CMP immediate again (12 cycles; the last one swallows the opcode of the final CMP as its operand), but this time at the end, it treats the "&EA" as an instruction which is... NOP (2 cycles). Total: 14.
If A is 2, it branches two bytes forward, executes 5 x "CMP #&C9" (10 cycles) plus the final "CMP &EA" (3 cycles). Total: 13.
... and the pattern repeats all the way down to:
If A is 13, it branches straight to the final &EA (NOP) : 2 cycles.

By changing the number of NOPS, you can introduce variable delays up to the limit of the branch instruction.
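The same kind of byte-level model works for the clockslide. The sketch below (Python, not demo code; names invented for the example) walks the 14 bytes of the slide from each possible branch offset and adds up the cycles, using the standard cycle counts for CMP immediate, CMP zero-page and NOP.

```python
# Opcodes reachable in the clockslide: cycles and instruction lengths.
CYCLES = {0xC9: 2,   # CMP #imm
          0xC5: 3,   # CMP zero-page
          0xEA: 2}   # NOP
LENGTH = {0xC9: 2, 0xC5: 2, 0xEA: 1}

# cmp #&C9 (six times) followed by cmp &EA
CLOCKSLIDE = [0xC9, 0xC9] * 6 + [0xC5, 0xEA]


def clockslide_cycles(a):
    """Cycles spent in the slide when A (the branch offset) is `a`."""
    pc, total = a, 0
    while pc < len(CLOCKSLIDE):
        op = CLOCKSLIDE[pc]
        total += CYCLES[op]
        pc += LENGTH[op]
    return total
```

In this model the delay works out as 15 minus A for every A from 0 to 13, which is the one-cycle granularity described above.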

Interestingly, I started using these techniques before I became aware of the 1-cycle NOPS provided by the 65C02. I think there may be some interesting possibilities for using NOP1s in these 'slide' techniques that have yet to be explored.

VectorEyes · Post by **VectorEyes** » Wed Jul 10, 2019 12:34 am

Double Sine Wave Effect

Introduction

This effect uses stable raster to render a superposition of two sine waves. Each wave can have its left/right movement speed and vertical scale adjusted independently, and by choosing values carefully a variety of interesting patterns can be created.

When the effect starts, the whole screen is filled with the value &F, and the ULA is set to MODE1. This means that by changing the palette register for just one entry (the one that maps logical colour %1111 to a physical colour) you can change the black 'background' colour. As the effect progresses, various images are decrunched to the screen, but the images are all set up so the right hand side (where the wave effect takes place) stays filled with &F, and all of the palette changes that alter the look of the images (which only appear on the left) leave logical colour &F set to black.

The upshot is that you can draw an animated wave using all 8 colours on the right of the screen, while displaying any MODE1 image on the left (as long as the image has a black background!)

The effect uses a 256-entry sine table whose values vary between 0 and 14. During frame update, 16-bit additions are performed to step two pointers 'through' the table, to provide new 'start values' for the two waves. To draw the 'final' wave at frame render, the two waves start at the 'start value' and for each scanline, they step through the sine table (16-bit addition again) and take the high byte of the result as an index into the table. Two sine values (varying between the values 0 and 14) are thus retrieved from the table, and summed together (giving a possible range of 0-28).

This value between 0 and 28 is used to select one of 29 hand-crafted functions. Each of these functions essentially does:

Wait(first)
Write to palette register to change logical colour &F to a colour. (See below for how the colour is chosen!)
Wait(second)
Write to palette to change colour &F back to black.

... where wait(first) and wait(second) always sum to the same value.
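Here is a rough Python model of the per-scanline lookup, in case the '16-bit add, take the high byte' idea is easier to follow as code. The sine table is generated on the fly and is only a stand-in (the demo's real table may be shaped differently), and the constant `TOTAL_WAIT`, the function name and the step values you pass in are all assumptions for the example.

```python
import math

# Hypothetical 256-entry sine table with values 0..14.
SINE = [round(7 + 7 * math.sin(2 * math.pi * i / 256)) for i in range(256)]
TOTAL_WAIT = 28  # wait(first) + wait(second) is the same on every line


def render_frame(start_a, step_a, start_b, step_b, lines=256):
    """Return (first_wait, second_wait) per scanline for two 8.8 fixed-point pointers."""
    pos_a, pos_b = start_a, start_b
    waits = []
    for _ in range(lines):
        pos_a = (pos_a + step_a) & 0xFFFF            # 16-bit add, wraps like the 6502 version
        pos_b = (pos_b + step_b) & 0xFFFF
        index = SINE[pos_a >> 8] + SINE[pos_b >> 8]  # high bytes index the table
        waits.append((index, TOTAL_WAIT - index))    # index picks one of the 29 functions
    return waits
```

The frame update would then just nudge `start_a` and `start_b` along by their own 16-bit speeds before the next call.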

For example, the first pattern in the effect is composed of this sine wave...

[Attachment: Vid1Scaled.gif]

... added to this sine wave...

[Attachment: Vid2Scaled.gif]

... to give this result:

[Attachment: Vid12CombScaled.gif]

Adding Colour

However, there is an additional complication. The effect was originally monochrome (black/white). This meant that to achieve the "wait(delay)/write palette/wait(inversed delay)/write palette" behaviour, all you needed was two clockslides with the 'set to white' in between and the 'set to black' at the end.

But when I added colour, I used the tried-and-tested '16-bit add, then use high byte as an index' technique to grab colour values from another 256-entry table. By choosing different step speeds, it is possible to create different colour movement patterns. All of the moving colours in the sine wave are generated from the same colour table, arranged something like this:

This meant that in addition to the above, the code is also doing (per line):

16-bit add to step through the colour table.
Use high byte to index into colour table and retrieve palette entry.

... and the code that does this is interleaved among the 'wait' and 'write palette' instructions. That is why there are 29 different functions. Each one does the same thing, but the order and timing of operations changes for each one to ensure the two palette writes are at the right time.

For instance, here's the function that swaps the palette as early as possible, ie '0 cycles of delay':

Code: Select all

.delay_0
    NOP
    ; First part of a 16-bit add: low byte of (colour index per line + colour scale)
    LDA sineEffects_ColourIndexPerLineLow ; 3
    ADC sineEffects_ColourScaleLow ; 3
    STA sineEffects_ColourIndexPerLineLow ; 3

    ; At this point, we've added the low byte, we have carry flag set appropriately... so we can load the 'current'
    ; high byte, store it to palette reg, and then get on with adding the high addend to it.
    LDX sineEffects_ColourIndexPerLineHigh ; 3
    LDA colourTable,X    ;4

    ; Additional wait before store to palette register.
    WAIT_16
    STA &FE21

    ; then need another 17 cycles before the store of black colour (ie 15 before the LDA #im (black colour))
    TXA ; 2 -- put index-per-line-high back into A
    ADC sineEffects_ColourScaleHigh ; 3
    STA sineEffects_ColourIndexPerLineHigh ; 3
    WAIT_3
    lda #mainColToBlack \ 2
    sta &FE21 \ 4
    JMP thinSinReturn

And here's the one that swaps as late as possible (28 cycles later, compared to delay_0):

Code: Select all

.delay_28
    LDX sineEffects_ColourIndexPerLineHigh ; 3
    LDA colourTable,X    ;4
    STA &FE21

    NOP ; 2

    LDA sineEffects_ColourIndexPerLineLow ; 3
    ADC sineEffects_ColourScaleLow ; 3
    STA sineEffects_ColourIndexPerLineLow ; 3

    TXA ; 2 -- put index-per-line-high back into A
    ADC sineEffects_ColourScaleHigh ; 3
    STA sineEffects_ColourIndexPerLineHigh ; 3
    WAIT_19
    lda #mainColToBlack \ 2
    sta &FE21 \ 4
    JMP thinSinReturn

(The WAIT_XX macros insert a series of NOPs plus possibly an additional 1-cycle NOP to achieve the desired wait time).
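As an illustration of what such a macro expands to, here is a small Python sketch that builds the opcode list for a given cycle count. It assumes the one-cycle NOP used is opcode &03 (the one mentioned in the NOP slide section); the real macros may pick a different one-cycle opcode or order things differently.

```python
NOP = 0xEA    # ordinary NOP, 2 cycles
NOP1 = 0x03   # one-cycle NOP on the 65C02 (assumed choice of opcode)


def wait_opcodes(cycles):
    """Opcode bytes that burn exactly `cycles` cycles: 2-cycle NOPs plus an optional 1-cycle NOP."""
    if cycles < 0:
        raise ValueError("cannot wait a negative number of cycles")
    ops = [NOP] * (cycles // 2)
    if cycles % 2:
        ops.append(NOP1)
    return ops


def cycles_of(ops):
    """Total cycles taken by a list produced by wait_opcodes()."""
    return sum(2 if op == NOP else 1 for op in ops)
```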

(As an aside... I think it would be interesting to explore dynamically generating this sort of code at runtime instead of creating it by hand!)

Here's another example of how adding two simple sine waves gives an interesting effect, this time with added colour. This wave (note it's moving, just very slowly):...

[Attachment: Vid5Scaled.gif]

... plus this one (which is almost the same, just a bit faster and with a very slightly different scale):...

[Attachment: Vid6Scaled.gif]

... combines to form this result:

[Attachment: Vid56CombScaled.gif]

All of this, of course, has to run in exactly 128 cycles per scanline! In actual fact there are some cycles spare, because the WAIT_XX macros are 'dead' cycles that could be put to use somehow. I considered various possibilities but didn't have time to try them out.

Fading up and down

The 'fade waves up/down' effect (which is used to change between patterns) is done by patching the code that loads from the sine table to refer to a variety of different tables which were pre-generated for different amplitudes. Essentially the effect render code is redirected to a succession of sine tables over the course of a few seconds, to fade the amplitude down from 14 to 0, then swap the values that control the wave pattern to new values, then interpolate the amplitude back from 0 to 14.

Fading colours in/out

The initial fade from white to coloured, and the final fade from coloured to black, is done by spawning tasks which copy values from predefined tables of colours (palette entries) to the 'actual' colour table. The indices to copy each frame are chosen from a table of random numbers (the numbers 0-255 in random order) which is how we get the nice random-looking 'fade in' and the 'fade out' at the end.
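A minimal Python sketch of that kind of fade task is below. The shuffled table is generated with Python's `random` module purely as a stand-in for the demo's pre-built table, and the number of entries copied per frame (`per_frame`) is a made-up value, since the write-up doesn't say how many are copied each frame.

```python
import random


def make_order(seed=1):
    """The numbers 0-255 in a shuffled order (stand-in for the demo's random table)."""
    order = list(range(256))
    random.Random(seed).shuffle(order)
    return order


def fade_step(dest, source, order, frame, per_frame=4):
    """Copy `per_frame` colour entries for this frame; returns True once the fade is complete."""
    begin = frame * per_frame
    for index in order[begin:begin + per_frame]:
        dest[index] = source[index]
    return begin + per_frame >= len(order)
```

Because every index appears exactly once in the shuffled table, the destination ends up as an exact copy of the source, but the entries change in a scattered, random-looking order along the way.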

VectorEyes · Post by **VectorEyes** » Wed Jul 10, 2019 12:35 am

Logo Dissolve Effect

Overview

This effect -- the first one in the demo -- is based on vertical rupture, not stable raster. The effect reprograms the CRTC each scanline to choose which line of the logo to render (and also changes the palette to control the logo colour) but it does not 'draw' images by palette-swapping.

As is common with vrup-based techniques, the 'source' image is very different to the image rendered on-screen. In this case, the source image data consists of each unique line from the Bitshifters logo, repeated eight times. The eight-line offset is necessary because the CRTC can only address lines whose addresses start at an eight-byte alignment. (When setting CRTC addresses, you divide the 'actual' address by eight). In actual fact, because we reset the CRTC start address each scanline, only the first line of every eight is ever displayed on screen, and seven out of every eight lines could be set to anything at all without the effect looking different.

The 'unique lines' image was generated from the original Bitshifters logo, using a C# command-line tool written specifically for the task. I extracted one 'Bitshifters' from the four in the original image, and ran it through the tool.

The original logo looks like this:

[Attachment: bslogo_single.png]

And the new image looks like this:

[Attachment: bslogo_unique_annotated.png]

(I added the green lines to delineate each unique line. As you can see there are only 13 different lines, including the blank line).

The tool also emits a list of line indices. For each of the 56 lines in the single logo image, it lists the corresponding index in the 'unique lines' image, in a format easily ingestible by BeebAsm, specifically something like this (I added the comments manually!):

Code: Select all

EQUB 2		; First line of logo -- top of b, i, t,     h, i, f, t -- maps to line 2 in the unique lines image
EQUB 2		; Ditto
EQUB 0		; Third line of logo is totally blank
EQUB 2		; Another three lines like the first and second...
EQUB 2		; ...
EQUB 2		; ...
EQUB 0		; And another blank line
EQUB 1		; Now we're onto a different line. Top of b,    t,    h,       f, t -- maps to line 1 in the unique lines image
EQUB 1		; etc
EQUB 0
EQUB 11
EQUB 11
EQUB 0
; (And so on for 56 entries!)

This file is used to create a 256-entry table where each entry is the 'unique line index' (between 0 and 12) to use to render that line. This is done by including the file four times, with some 'EQUB 0s' (blank lines) in between and at the top and bottom.
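The real tool was written in C#, but the idea is simple enough to sketch in a few lines of Python. Everything here is illustrative: rows are just strings, the ordering of the unique lines is 'first seen' (the real tool clearly orders them differently, judging by the indices above), and the way the blank padding is split between the four copies is a guess.

```python
def unique_lines(rows, blank):
    """Return (unique_rows, indices); index 0 is reserved for the blank row."""
    unique = [blank]
    indices = []
    for row in rows:
        if row not in unique:
            unique.append(row)
        indices.append(unique.index(row))
    return unique, indices


def build_line_table(indices, copies=4, size=256):
    """Repeat the per-logo index list `copies` times, padded out with blank (0) entries."""
    gap = (size - copies * len(indices)) // (copies + 1)  # blank lines above and between copies
    table = []
    for _ in range(copies):
        table += [0] * gap + indices
    table += [0] * (size - len(table))                    # whatever is left goes at the bottom
    return table
```

With a 56-entry index list this gives four copies of the logo (224 entries) and 32 blank entries, for 256 in total.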

Another 256-entry table contains the colour to use for each line.

Effect rendering

It's interesting to compare TB's effect to this one. Both of them use one-line vertical rupture to choose, per-scanline, which line from an image to draw. (One-line vertical rupture is covered extensively in the Twisted Brain write-up). However Wave Runner 'thins out' the logo vertically, as compared to TB's horizontal movement. The TB version stores two copies of the whole logo, one with a two-pixel offset, and it uses these to move the effect horizontally in two-pixel increments. WR on the other hand stores one 'processed' copy of the logo (each unique line appears only once) and 'moves' them vertically.

This vertical 'splitting' is achieved relatively simply. Before the first visible scanline, we initialise a 16-bit variable (the 'current line pointer') with an initial value. Each scanline, another 16-bit value, the 'per-line offset', is added to the 'current line pointer'. The following logic then happens:

If the addition involved a carry from the low to high byte, then draw a line from the logo:
- Take high byte of 'current line pointer' and use it as index into the 256-entry table of unique line indices.
- Take that unique line index and use it to look up into another table of CRTC start addresses. This table contains the start address of each line in the image.
- Set CRTC start address to that address.
- Also use the high-byte of the 'current line pointer' to look up the colour from the 256-entry table of colours.
- Set the palette (by writing to the ULA palette register four times).
If, however, the addition did NOT involve a carry from low to high byte, draw a blank line:
- Exactly the same logic as above, but force the unique line index to 0, which is the 'blank' line. (Note how the top row of the processed image is a totally empty line).

Essentially, what this is doing is stepping 'through' the logo by a fractional number of lines for each scanline, but only drawing a logo line when you step to a 'new' line.
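In Python, the per-scanline decision looks roughly like the sketch below. This is a model of the logic as described, not the render code itself: the function names are invented, the palette handling is left out, and `crtc_start` just shows the divide-by-eight and the split into the CRTC's high and low start address registers (R12 and R13).

```python
def logo_lines(start, per_line_offset, line_index_table, scanlines=256):
    """Return the unique-line index shown on each scanline (0 means the blank line)."""
    pointer = start & 0xFFFF
    shown = []
    for _ in range(scanlines):
        # Did adding the low bytes carry into the high byte?
        carry = ((pointer & 0xFF) + (per_line_offset & 0xFF)) > 0xFF
        pointer = (pointer + per_line_offset) & 0xFFFF
        if carry:
            shown.append(line_index_table[pointer >> 8])  # stepped onto a new logo line
        else:
            shown.append(0)                               # still on the same line: draw blank
    return shown


def crtc_start(address):
    """Split a screen address into the (R12, R13) values written to the CRTC."""
    value = address // 8
    return value >> 8, value & 0xFF
```

For example, a per-line offset of &80 in this model carries on every second scanline, so logo lines alternate with blank ones; smaller offsets spread the logo lines further apart.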

Animating the logo to split up is then a simple matter of spawning tasks that interpolate the 'per-line offset' between different values to make the logo expand, contract and then expand again.

There are many ways this effect could be extended but, as with just about everything else, there wasn't time to try them all out! My biggest regret is that the logo expands downwards instead of in both directions. All the infrastructure is in place to do it (all you need is to interpolate the 'starting value' as you change the per-line offset) but, once again, no time to try it out! Perhaps next year...

VectorEyes · Post by **VectorEyes** » Wed Jul 10, 2019 12:35 am

Intro and Outro Image Sequence

These are relatively simple effects, which showcase Dethmunk's artistic skills by displaying a sequence of images.

There are essentially three types of transition used by these effects:

Palette-based: The palette changes from 'all colours black' to 'standard MODE2 colours' or vice versa. Instant transition from black->image or image->black.
Shadow/Main Display: Bit 0 of ACCCON is flipped to change from rendering the image in Shadow memory to the one in Main memory, or vice versa. Instant transition from one image to another.
Gradual fade from all lines rendering Main, to all lines rendering Shadow, or vice versa (see details below). This can do a gradual transition from one image to another, (or if one image is black you can transition to/from black).

Demo Intro

When the 'Intro Images' effect starts, the palette is all black. The first image ("Bitshifters presents") is decrunched into main RAM, and then a timeline event changes the palette to standard MODE2, thus displaying the image.

The second image is then decrunched into shadow RAM, and after a few seconds, ACCCON bit 0 is flipped to instantly display the 'Wave Runner' logo.

Demo Outro

This effect is very similar to the Intro effect, with one additional feature: On each scanline, it reads a 256-entry table (whose values are all either 0 or 1, indicating 'render from main memory' or 'render from shadow memory' respectively), and uses that to set the appropriate bit in ACCCON.

In fact the 'render' function is so simple that I'll show the whole thing here!

Code: Select all

.OutroRender
{
    ; Wait until near to the beginning of the visible frame.
    JSR wait128
    WAIT_40

    ldy #0  ; Set up Y to count 256 lines. 

    .loop
    LDA lineShadowTable,Y   ; 4 .. load shadow/main state for this line (this is either 0 or 1)
    TSB &FE34               ; 6 .. if bit 0 is set, set it in ACCCON
    EOR #%00000001          ; 2 .. invert bit 0
    TRB &FE34               ; 6 .. and now if it's set (ie it was clear when loaded), clear it in ACCCON

    ; Wait so the loop takes 128 cycles
    WAIT_35
    WAIT_35
    WAIT_35

    ; Must be 123 cycles to here...
    dey      ; 2 == 125
    bne loop ; 3 == 128

    ; Loop has finished, we're done rendering this frame!
    JMP EffectRenderReturn
}

(Note the use of the 65C02-specific 'TSB' and 'TRB' opcodes! When you first hear about these you think they'll be incredibly useful... but then you realise that they only have zero-page and absolute addressing modes (no indexed forms), so they're mostly helpful for flipping bits in memory-mapped registers.)
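If the TSB/EOR/TRB sequence looks a bit opaque, this little Python model shows what it does to the register: bit 0 ends up equal to the table value and every other bit is left alone. The function names are just for the example, and the model ignores the Z flag that the real instructions also set.

```python
def tsb(register, a):
    """TSB: set in memory every bit that is set in A."""
    return register | a


def trb(register, a):
    """TRB: clear in memory every bit that is set in A."""
    return register & ~a & 0xFF


def set_display_bit(acccon, table_value):
    """Model of the four instructions at the top of the render loop."""
    a = table_value           # LDA lineShadowTable,Y  (0 or 1)
    acccon = tsb(acccon, a)   # TSB &FE34: sets bit 0 if the table value was 1
    a ^= 0b00000001           # EOR #%00000001
    acccon = trb(acccon, a)   # TRB &FE34: clears bit 0 if the table value was 0
    return acccon
```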

The first 'Outro' image is displayed by unpacking it to main memory at the start of the effect while the palette is set to all-black, and then setting the palette to normal MODE2 as soon as the decrunch is finished. This means there's a period of black screen between the end of the 'Chequerboard' effect and the image being displayed. Annoyingly, I've realised literally as I'm writing this that we could have easily avoided this delay -- it would have taken 5 minutes to implement -- but I didn't think of it at the time!

The 'random line-by-line fade to next image' is implemented by decrunching the final 'Goodbye' image to shadow memory, and then kicking off a task that copies the value '1' (meaning 'render from shadow for this line') into the 256-entry table that specifies main/shadow render state for each scanline. This copies one value per frame, which means the fade takes around 5 seconds to complete.

The final fade back to black is done by decrunching an entirely black screen to main RAM (overwriting the flying flaming Acorn image) and then kicking off another random line-by-line copy to set the state back to 'Render from Main memory' ...

... except that there's a bug, and instead of transitioning to a black screen, it randomly transitions the lines from 'main' to 'shadow' continually. So the 'Goodbye' image continues to fade in and out instead of disappearing. I decided that this actually looked pretty good (and people would think it was deliberate) so we left it as-is instead of trying to debug it!

VectorEyes · Post by **VectorEyes** » Wed Jul 10, 2019 12:35 am

Vertical Scrolltext

Perhaps surprisingly, this effect is one of the more complicated ones, both in terms of code complexity in the Update and Render functions, and in terms of the pre-processing steps and tools used to generate the code and data.

To understand this effect, we'll start by describing the Render function, then describe how the text movement is handled in the Update, and then finish with how the code and data is structured and created.

The Render function

Each frame, stable raster is used to 'draw' large characters. The characters are not stored as bitmapped images. Instead, they are drawn using a large number of specialised 'glyph line rendering' functions. Each of these uses palette register updates to flip from rendering black to rendering a colour and then back again. In essence there is a 'glyph line renderer' function for each unique horizontal line in the entire font tile set. They all take the same number of cycles (specifically, 34).

A 256-entry table contains, for each scanline, an index value indicating which glyph line function to use on that scanline. (Or it may be a special index which causes a colour change).

The 'starting point' in this table changes from frame to frame. Because there are 256 scanlines, and 256 entries in the table, every entry is used for every frame. (The line function indices are retrieved from the table using absolute indexed addressing based on the table's starting address). However, the order in which they are rendered changes from frame to frame. This is how the scrolltext moves vertically. (See below).

On each scanline:

To make the pattern move left and right:

- The tried-and-tested 16-bit-addition is used to increment a counter.
- The high byte of this counter is used as an index into a 256-entry sine table (whose values vary between 0 and 48).
- The retrieved value is compared to the value that was used on the previous line.
- The difference value is used to control a Clockslide. This variable delay makes sure that the glyph line rendering starts in the right place with respect to the raster scanning.
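The left/right steps above can be modelled roughly like this in Python. The sine table contents, the `start` and `step` names and the example step value are all assumptions for the sketch; only the 16-bit counter, the high-byte lookup and the 0..48 range come from the description. How the signed difference is turned into a Clockslide entry point is not shown here.

```python
import math

# Hypothetical 256-entry sine table scaled to 0..48.
SINE = [round(24 + 24 * math.sin(2 * math.pi * i / 256)) for i in range(256)]

def line_offsets(start, step, lines=256):
    """Horizontal offset for each scanline.

    start and step are 16-bit fixed-point values (hypothetical names);
    only the high byte of the counter indexes the table.
    """
    counter = start & 0xFFFF
    offsets = []
    for _ in range(lines):
        offsets.append(SINE[counter >> 8])
        counter = (counter + step) & 0xFFFF  # 16-bit add with wrap-around
    return offsets

def line_deltas(offsets):
    """Signed change in offset from each line to the next."""
    return [b - a for a, b in zip(offsets, offsets[1:])]

offsets = line_offsets(start=0x0000, step=0x0180)
deltas = line_deltas(offsets)
```

Because each line only needs the change relative to the line before it, the per-line delay stays small even though the absolute offset ranges over 0..48.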

To draw the glyph line:

- The next entry is retrieved from the 256-entry table of 'line function indices'.
- If the high bit is set, then this is a 'control code'. If so:
  - The value (with high bit cleared) is used to look up into the table of 'control code handling functions'.
  - The appropriate function is called (using JSR).
- If the high bit is not set, then this is a normal glyph line:
  - Use the value as an index into a table of addresses of functions that draw glyph lines.
  - JMP to the function.

This logic continues for 256 lines.
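The per-scanline dispatch can be sketched in Python as below. This is only a model of the high-bit test described above: the handler lists are hypothetical stand-ins, and the real code uses JSR/JMP through address tables rather than Python callables.

```python
CONTROL_FLAG = 0x80  # high bit marks a control code, as described above

def dispatch(entry, glyph_line_funcs, control_funcs):
    """Route one table entry to a control handler or a glyph line renderer."""
    if entry & CONTROL_FLAG:
        # Clear the high bit to get the index into the control-code handlers.
        return control_funcs[entry & 0x7F]()
    return glyph_line_funcs[entry]()

# Hypothetical stand-ins for the real routines.
glyph_line_funcs = [lambda i=i: f"draw line {i}" for i in range(107)]
control_funcs = [lambda: "change colour"]

print(dispatch(32, glyph_line_funcs, control_funcs))
print(dispatch(0x80, glyph_line_funcs, control_funcs))
```

Using the high bit as the flag leaves 128 values for glyph lines, which is enough for the 107 unique lines in the font.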

Starting the pattern in the right place

Each line, this effect uses a clockslide to introduce a delay, whose length is based on the difference between the previous and current lines' horizontal position (which comes from the sine table).

You may therefore be wondering how the top line of the screen (which has no 'previous' line to compare to) is delayed by the correct value, based on the 'starting' sine table value.

In actual fact a much longer Clockslide -- which can delay by between 0 and 48 cycles -- is used to position the pattern in the correct place on the first line.

The Update Function

This function is responsible for:

- Updating the starting position in the sine table, which changes at what point in the left/right movement the top scanline starts. (A simple 16-bit addition using the current 'left/right wibble speed' variable).
- Scrolling the text vertically.
- Processing scrolltext control codes.

Scrolling the Scrolltext

Moving the scrolltext is done using a technique akin to hardware scrolling. Each frame, depending on the scrolltext speed, the 'start' index (essentially a 'pointer' into the table of line functions to call for each scanline) is incremented (and wraps around from 255 to 0). Then the lines between the old and the new 'start' indices -- which are the ones which just scrolled off the top of the screen -- are filled in with new values. These 'new' lines are the ones which scrolled onto the bottom of the screen this frame.
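Here is a minimal Python sketch of that 'advance the start index and refill the vacated slots' step. The `scroll` function and the `next_line_index` callable are hypothetical names for illustration; the real routine also has to handle control codes, which this leaves out.

```python
def scroll(table, start, speed, next_line_index):
    """Advance the start index by `speed` lines and refill the vacated slots.

    table is the 256-entry scanline table; next_line_index is a hypothetical
    callable that returns the next line function index from the scrolltext.
    """
    for _ in range(speed):
        # The slot at `start` held the top scanline; once the start index
        # moves on, the same slot is read last, so it becomes the bottom line.
        table[start] = next_line_index()
        start = (start + 1) & 0xFF  # wraps from 255 to 0
    return start

table = [0] * 256
feed = iter(range(1, 100))  # stand-in for the scrolltext decoder
new_start = scroll(table, 254, 3, lambda: next(feed))
```

In this example the three new lines land in slots 254, 255 and 0, and the start index ends up at 1, so only three bytes are written however tall the screen is.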

The function to add the new lines is one of the most complicated in the codebase. Very broadly it goes something like this:

Code:

Loop over <number of lines to be updated this frame> 
.LoadNextByte
  - Load next byte from the current scrolltext character stream.
  - If it is a control code:
    - Immediately process it (jump to appropriate control-code-processing function).
    - Continue (branch to LoadNextByte).
  - Otherwise:
    - Update the 256-entry 'line rendering functions' table at <line index to be updated>.
Next <line index>

The devil is in the details of the penultimate line above ("Update the line rendering functions table")! Without going into great detail, the code must keep track of:

- the position in the scrolltext character stream.
- what glyph is at that position.
- the current line of the glyph (all glyphs are 15 lines high, but the code can handle glyphs of any height).
- the index of the 'line drawing function' corresponding to the current glyph line.
- how many times the line has been repeated so far (because each glyph line repeats for 4 scanlines).

To do this it maintains various pointers and indices -- all of them in zero-page -- and changes them as necessary to do the following:

Code:

- For each character in the scrolltext:
  - Use the byte value as an index into the glyph tables. Look up:
    - How many lines it contains, N. (In practice, always 15).
    - The start address of an N-entry table that contains indices of 'line drawing functions'.
  - For each line in the glyph:
    - Look up the index of the line drawing function for that line.
    - For L = 1 to (number of scanlines per glyph line)
      - Copy the line drawing function index into the 'scanline -> line drawing function' mapping table.
      - Move to next scanline.
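The nested loops above can be expressed compactly as a Python generator. This is a sketch of the logic only: the real code has to pause and resume mid-glyph using zero-page state, whereas a generator gets that for free. The dictionary layout is an assumption; the 'A' entries are the ones quoted in the 'Glyph table generation' section of this write-up.

```python
SCANLINES_PER_GLYPH_LINE = 4  # each glyph line repeats for 4 scanlines

def scanline_indices(text, glyph_tables, repeat=SCANLINES_PER_GLYPH_LINE):
    """Yield one line function index per scanline for the given text."""
    for char in text:
        for line_index in glyph_tables[char]:
            for _ in range(repeat):
                yield line_index

# Line indices for 'A' as quoted in this write-up; the dict layout is assumed.
GLYPH_TABLES = {"A": [32, 43, 77, 94, 94, 94, 49, 49, 85, 85, 85, 85, 85, 85, 0]}

stream = list(scanline_indices("A", GLYPH_TABLES))
```

One 15-line glyph therefore occupies 60 scanlines, finishing with four scanlines of the empty line 0.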

Data and tools

This effect relies on 6502 code (and a number of tables) that was generated by two C# command-line tools, created specially for the effect. As such it blurs the distinction between 'code' and 'data'.

The first tool, 'FontExtractor', is responsible for splitting up font sheets into individual glyph images. The font itself is a monochrome one called 'Razor' and its glyphs are all 14x14 pixels.

So the tool splits this font sheet:

(Image: Razor14x14_example.png)

Into these glyphs (this image just shows some of them), all 14x14 pixels:

The second tool takes the individual glyph images, and does the following:

Unique line detection

Reads every line from every image, and generates a list of all unique lines. There are 107 unique lines across the entire set of glyphs. For this write-up I extended the tool to create a debug image showing all the lines:

(Image: uniqueGlyphLines.png)
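The unique line detection amounts to de-duplicating rows while remembering, for each glyph, which unique row each of its lines maps to. A minimal Python sketch follows; the real tool is written in C# and reads image files, and the tiny glyphs here are made up purely for illustration (they are not from the Razor font).

```python
def build_line_tables(glyphs):
    """Deduplicate glyph rows.

    glyphs maps a character to a list of row strings. Returns
    (unique_lines, glyph_tables), where glyph_tables maps each character
    to a list of indices into unique_lines.
    """
    unique_lines = []
    index_of = {}
    glyph_tables = {}
    for char, rows in glyphs.items():
        table = []
        for row in rows:
            if row not in index_of:
                index_of[row] = len(unique_lines)
                unique_lines.append(row)
            table.append(index_of[row])
        glyph_tables[char] = table
    return unique_lines, glyph_tables

# Tiny made-up 4x3 glyphs, purely for illustration.
glyphs = {
    "I": ["....", ".XX.", ".XX."],
    "L": ["....", "X...", "XXXX"],
}
unique_lines, glyph_tables = build_line_tables(glyphs)
```

Six rows collapse to four unique lines here; across the whole font the same idea gives the 107 unique lines mentioned above.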

Raster-code generation

For every unique line, the tool creates 6502 assembly that will 'draw' the line by writing to the ULA palette register as the raster beam scans across the screen. The code assumes that register A contains the 'foreground' colour palette value and X contains the 'background' (black) colour palette value. Each source pixel takes two cycles, but stores to the ULA palette register take 4 cycles, so the tool needs to deal with the fact that once it has changed colour, it cannot do so again for 2 pixels!

It then generates code like this:

Code:

.line0
NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:
JMP lineReturn
.line1
NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:STA &FE21:STX &FE21:NOP:
JMP lineReturn
.line2
NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:STA &FE21:STX &FE21:NOP:NOP:
JMP lineReturn
.line3
NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:NOP:STA &FE21:NOP:NOP:STX &FE21:NOP:
JMP lineReturn

(and so on to line 106, ie 107 lines in total!)

The tool can only emit three instructions: "NOP", "STA &FE21" and "STX &FE21". As it works its way along the pixels in each line, it keeps track of the current colour, the desired colour, and the number of cycles remaining before the most recently emitted instruction completes. As soon as that cycle count reaches zero, it emits another instruction: either a NOP (if the current and desired colours are identical) or a STA/STX to change the colour.

Every line takes 34 cycles to execute. This provides enough time to make a colour change on any of the last few pixels, wait (4 cycles) for the STA/STX to complete, and then do a final "STX &FE21" to ensure that all lines end on the background/black colour.
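To make the emission rule concrete, here is a Python model of it (the actual tool is C# and its source isn't shown in this post). It treats the 34 cycles as 17 two-cycle slots, which is an assumption that happens to fit the listings in this write-up. The row strings used to check it are inferred by working backwards from those listings, not read from the font.

```python
SLOTS = 17  # assumed: 17 two-cycle slots = 34 cycles (14 pixels + 3 padding slots)

def emit_line(row):
    """Emit instructions for one 14-pixel row ('X' = foreground, '.' = background)."""
    ops = []
    current = "."  # every line starts on the background colour
    busy = 0       # slots still occupied by the previous store
    for slot in range(SLOTS):
        if busy:
            busy -= 1
            continue
        # Past the end of the row the desired colour is always background,
        # which forces the final STX back to black.
        desired = row[slot] if slot < len(row) else "."
        if desired == current:
            ops.append("NOP")  # 2 cycles = 1 slot
        else:
            ops.append("STA &FE21" if desired == "X" else "STX &FE21")
            current = desired
            busy = 1           # 4 cycles = 2 slots
    return ":".join(ops) + ":"

# Inferred rows: an empty line, and an 8-pixel bar with 3 blank pixels each side.
print(emit_line("." * 14))
print(emit_line("...XXXXXXXX..."))
```

With those inferred rows the sketch reproduces the posted line0 and line32 bodies. A single foreground pixel would still come out two pixels wide, which is the 'cannot change again for 2 pixels' limitation in action.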

Glyph table generation

For each glyph, the tool emits a table that says which unique line index to use to draw each line in the glyph.

For instance, the 'A' glyph has the following table:

Code:

.glyph_A
EQUB 15 \\ Line count
EQUB 32:EQUB 43:EQUB 77:EQUB 94:EQUB 94:EQUB 94:EQUB 49:EQUB 49:EQUB 85:EQUB 85:EQUB 85:EQUB 85:EQUB 85:EQUB 85:EQUB 0

Note that the first line index is '32'. If we look at line 32 we see this:

Code:

.line32
NOP:NOP:NOP:STA &FE21:NOP:NOP:NOP:NOP:NOP:NOP:STX &FE21:NOP:NOP:NOP:NOP:
JMP lineReturn

This delays for 6 cycles, then switches to the foreground colour (the STA and the six NOPs after it account for 16 cycles), then switches back to background (black), with the STX and the four trailing NOPs taking the remaining 12. This has the effect of drawing the top line of the A: one line towards the middle of the glyph area.
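The cycle arithmetic is easy to check mechanically. This small Python helper (a made-up name, not part of the tool) sums the standard 6502 timings for the three emitted instructions: 2 cycles for NOP and 4 for an absolute-addressed STA or STX. It counts only the line body, not the trailing JMP.

```python
# Cycle counts for the three instructions the tool emits
# (NOP is implied addressing; STA/STX here use absolute addressing).
CYCLES = {"NOP": 2, "STA": 4, "STX": 4}

def body_cycles(source):
    """Sum the cycles of a colon-separated line body, excluding the final JMP."""
    total = 0
    for statement in source.split(":"):
        statement = statement.strip()
        if statement:
            total += CYCLES[statement.split()[0]]
    return total

LINE32 = "NOP:NOP:NOP:STA &FE21:NOP:NOP:NOP:NOP:NOP:NOP:STX &FE21:NOP:NOP:NOP:NOP:"
print(body_cycles(LINE32))  # 6 + 4 + 12 + 4 + 8
```

Running the same check over each generated line would be a cheap way to confirm that they all come to the same total.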

The final line (ignoring the last 'empty' line, which all glyphs end with to provide a break between glyphs) is '85'. Line 85 looks like this:

Code:

.line85
STA &FE21:NOP:NOP:NOP:NOP:STX &FE21:NOP:NOP:STA &FE21:NOP:NOP:STX &FE21:NOP:
JMP lineReturn

This is flipping between black and foreground twice, to draw the two vertical bars at the bottom of the capital 'A'.

Glyph table address generation

Code is emitted that provides the mapping from the scrolltext characters (ie the actual scrolltext strings, stored as EQUS directives directly in the source) to the addresses of their glyph tables.

Glyph line address generation

Similarly, code is emitted that defines a table that maps from unique line indices to the address of the appropriate 'line rendering' function.
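As a hedged sketch of what 'emitting a table of addresses' could look like, here is a Python function that writes BeebAsm source using one EQUW (16-bit word) per entry. The table label is made up, and the real generated files may well be laid out differently (for example as separate low-byte and high-byte tables, which is common on the 6502).

```python
def emit_address_table(label, targets):
    """Return BeebAsm source for a table of 16-bit addresses, one EQUW per target label."""
    lines = [f".{label}"]
    lines += [f"EQUW {target}" for target in targets]
    return "\n".join(lines)

# Hypothetical table name; the line labels follow the .lineN pattern shown earlier.
print(emit_address_table("lineAddresses", [f"line{i}" for i in range(3)]))
```

Because the assembler resolves the labels, the tool never needs to know where the line functions end up in memory.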

(I've skimmed over the additional complexities that deal with adding the 'control codes' to the tables described above! Suffice to say that the files generated by the tool are 'include-d' in a special order with some additional hard-coded lines in between that 'mix in' the control-code data and functions.)

VectorEyes · Post by **VectorEyes** » Wed Jul 10, 2019 12:35 am

(Placeholder: 'Blobs' Effect)

VectorEyes · Post by **VectorEyes** » Wed Jul 10, 2019 12:36 am

(Placeholder: Chequerboard Effect)

VectorEyes · Post by **VectorEyes** » Wed Jul 10, 2019 12:36 am

Closing thoughts

Before I get to anything else, I'd like to reiterate that Wave Runner was a team effort. Tom Seddon and I did the framework and effects code, the music compression and playback came from SimonM (via Twisted Brain), HexWab helped out with music player speedups (and also prototyped the stable raster system which I started playing around with a year ago as an introduction to 6502 coding!), and it wouldn't have been half the demo without Dethmunk's graphics. Beyond that, numerous other people helped out in all kinds of ways, from making suggestions and contributing code snippets to patching JSBeeb. It would never have been made without a spirit of positive collaboration.

Final thoughts

- If anyone was wondering why the scrolltext sometimes changes colour towards the top of the screen, it's basically a bug: the control code that is meant to change the colour of the text doesn't get recorded in the 'last colour-change that left the top of the screen' variable. If we ever do a v1.1 release that's the first thing I'll be patching!
- The 'background' exomizer decompression was great. You can decompress an entire screen in 4-5 seconds using the CPU cycles that are spare between the end of the 'Update' loop and the start of the next frame. It helped us reduce the time spent staring at a black screen while the demo decrunches stuff. We could have made more use of it for some of the later effects.
- There's still quite a way to go before we start pushing at the edge of the Master's capabilities, both audibly and visually. There was memory spare at the end of development, and there are plenty of techniques still to be discovered. (Combining stable raster and vertical rupture, for instance!) ... So expect another demo at some point!
- We'll be releasing the source code soon. Just need to do a few final tidy-ups.
And that's it. We hope you enjoyed it!

VectorEyes · Post by **VectorEyes** » Wed Jul 10, 2019 12:37 am

(Placeholder: Anything else I forgot!)

VectorEyes · Post by **VectorEyes** » Wed Jul 17, 2019 12:26 am

Just a note to say that I have added the sections on 'NOP Slides and Clockslides' and the 'Double Sine Wave' effect. Work commitments mean I can't write these up as quickly as I'd like, but I'm very happy to answer questions or discuss areas where people want more detail, and I'm sure everyone else who worked on it would be happy to chip in with replies as well!

Phlamethrower · Post by **Phlamethrower** » Wed Jul 17, 2019 11:23 am

A link to the demo would be useful.

Nice use of animated GIFs

VectorEyes · Post by **VectorEyes** » Wed Jul 17, 2019 3:38 pm

Phlamethrower wrote: ↑Wed Jul 17, 2019 11:23 am A link to the demo would be useful.

Nice use of animated GIFs

Thanks, and good point! I'm sure my notes for the first section had a link but apparently not. I've amended the first section and added the link.

The animated GIFs were created by exporting a video from B2 and then running it through FFMPEG. It was a surprisingly easy process.

VectorEyes · Post by **VectorEyes** » Tue Aug 13, 2019 9:44 am

Thread bump! I have updated the Logo Dissolve, Intro/Outro and Vertical Scrolltext write-ups. I will write up some closing thoughts shortly.

BigEd · Post by **BigEd** » Tue Aug 13, 2019 9:45 am

Great - thanks for the bump! Reserving posts is a good idea, but the bump is needed too.

VectorEyes · Post by **VectorEyes** » Sat Aug 31, 2019 6:43 pm

Another thread bump! Closing Thoughts section has been done (actually I wrote it a few weeks ago but didn't want to bump the thread until I'd finished a few other things.)

The source code is now available on Github: https://github.com/bitshifters/wave-runner

Finally, the 'incorrect scrolltext colours' bug has been fixed, alongside a few timing improvements and improved effect transitions, and the new version is now provided by default from the Bitshifters website (https://bitshifters.github.io/posts/pro ... unner.html).

0xC0DE · Post by **0xC0DE** » Sat Aug 31, 2019 7:11 pm

Great read (so far)! And an inspiration for me to attempt some new demo effects on the more humble Electron.

tricky · Post by **tricky** » Sat Aug 31, 2019 8:07 pm

Great write-up

jbnbeeb · Post by **jbnbeeb** » Mon Sep 02, 2019 7:15 pm

Thank you for the write up. Not read through it all yet (but I will) - but all very clearly described so far. Great work!

Thanks,
Jason

dominicbeesley · Post by **dominicbeesley** » Mon Sep 09, 2019 11:51 am

Thanks for this write up - behind the scenes explications are very welcome! I'll give it a good read later tonight

D

stardot.org.uk
