Hacker needed ... for Zarch ;-)

subjects relating to classic games for the archimedes and risc pc
Post Reply
Zarchos
Posts: 2355
Joined: Sun May 19, 2013 9:19 am
Location: FRANCE

Hacker needed ... for Zarch ;-)

Post by Zarchos »

Oh well so I'm Zarchos, and I'm about to complete what I think is some very fast segment filing routines (taking into account the MEMC1A video DMA 1 cycle non penalty when addresses used with STM are on quadwords) ... and there's this game everybody knows on the Archie ... Virus ... err... no I meant : Zarch.
So, have we got some hackers on the Archie on this forum ?
If by any chance you were to find the area of code where there are the triangle filing routines, please tell me where it is.
I think I've got a series of very fast segment filing routines, and I'd be curious to know if it can be implemented as a patch to Zarch (it'd be nice to have Zarch run fast enough to have an ingame tune and still 50 frames per second on an ARM2).
Who knows it could even be an 8 channel tune : if I'm not mistaken, Zarch runs at 50 frames per second on ARM2 Archies with the old MEMC, thus on a faster MEMC1A system + my very fast (I hope) routines and Steve's QTM, it could work.
Basically to fill a triangle or any shape, my routines need a pointer to a list of word couples defining each horizontal segment defining the shape, in the format (memory start address, horizontal length in pixels) terminated by 0.

Thanks for your help.
User avatar
SarahWalker
Posts: 1599
Joined: Fri Jan 14, 2005 3:56 pm
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by SarahWalker »

Zarchos wrote:if I'm not mistaken, Zarch runs at 50 frames per second
It's 15 fps from memory, not 50.
Zarchos
Posts: 2355
Joined: Sun May 19, 2013 9:19 am
Location: FRANCE

Re: Hacker needed ... for Zarch ;-)

Post by Zarchos »

TomWalker wrote:
Zarchos wrote:if I'm not mistaken, Zarch runs at 50 frames per second
It's 15 fps from memory, not 50.
Oh dear, I really thought this game ran faster on an ARM2,
https://www.youtube.com/watch?v=DbENFcmurbo

demonstrating how fast the Archie is ... anyway it's another good reason to try to patch it if, as I hope, my routines are really fast.

I've spent several hours per week these last weeks to code them, after you enlightened me with this video DMA penalty one can avoid if addresses are quadword aligned. (Thanks again btw).

PS :
For the record, Aldebaran, quite similar to Zarch, but coded by a member of the demomaker group Arc Angels, runs at 50 frames per second, it's clearly written in an advert I could find in a BBC ACORN USER from 1992 (December issue, page 31 of the games supplement). The game even displays some objects with realtime texture mapping.
sirbod
Posts: 1624
Joined: Mon Apr 09, 2012 9:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by sirbod »

Zarchos wrote:So, have we got some hackers on the Archie on this forum ?
If by any chance you were to find the area of code where there are the triangle filing routines, please tell me where it is.
I think I've got a series of very fast segment filing routines, and I'd be curious to know if it can be implemented as a patch to Zarch (it'd be nice to have Zarch run fast enough to have an ingame tune and still 50 frames per second on an ARM2).
Who knows it could even be an 8 channel tune : if I'm not mistaken, Zarch runs at 50 frames per second on ARM2 Archies with the old MEMC, thus on a faster MEMC1A system + my very fast (I hope) routines and Steve's QTM, it could work.
I'd start by running the code through Armalyser, then search for a couple of STR's a loop, you should spot it fairly quickly. After a 30 second search, I located it at DADC. It's one of the most unoptimized segment fill routines you're ever likely to see and makes no use of STM's, on entry R10=start address and R11=length in pixels.

On an ARM2, Zarch runs at an average of 12.5 fps - it aims for 25 fps but never really achieves it. To work around this poor performance, it doesn't wait on VSync when it flips the framebuffers, so averages slightly higher fps; Tom's probably close at 15fps.

There's close to zero chance it will ever achieve 50fps as there's just too many calculations and memory accesses going on. You might get a steady 25fps on an ARM3 if you replace it's complete triangle routine to fit into the cache, but not by using word couples as you'd be adding multiple memory accesses for each segment and have to keep resetting the fill registers; you'd achieve better performance by keeping the start/end in registers and merge the segment fill into the triangle code. Bare in mind that the bulk of the triangles are less than 20 pixels, so performance gains are going to be negligible.

Also be aware that Zarch was written for the ARM1, so doesn't make use of MUL/MLA; recoding that is likely to give a better improvement in speed. Back in the day, I coded a replacement triangle routine for Zarch which gave (IIRC) a 12.5% increase in performance on average. I never got around to sorting out the MUL/MLA issue though.

Way down on my list, I want to recode Zarch so it runs natively on the Pi, at a higher resolution, 50fps and fills the screen. Until I can get approval from Superior and David Braben though, it's a non-starter.
Zarchos
Posts: 2355
Joined: Sun May 19, 2013 9:19 am
Location: FRANCE

Re: Hacker needed ... for Zarch ;-)

Post by Zarchos »

Thanks for all these infos.

I'll have a look anyway.
I understand with your project to distribute the games you need an official approval, but for a patch for users already owning a copy, there's no need of that ...

When you say
' but not by using word couples as you'd be adding multiple memory accesses for each segment and have to keep resetting the fill registers; you'd achieve better performance by keeping the start/end in registers and merge the segment fill into the triangle code
'
I'm not sure I get it.
How can you avoid loading a memory address and a length for a segment filing routine ?
I don't have to reset the fill registers : my algo sets up the colour or pattern in R1 to R11 once for all in the callant routine, and then I call my routines with R13 pointing to the list of (memory start address, length).
R14 gets the memory start address
R0 gets the length
I use the value in R0 to set PC to branch to the right piece of code to actually store in memory at [R14] length pixels.
Inside the routines I use R0 and R12 to build, respectively, and if necessary, the 1st word aligned word of the segment including the background in R0, and in R12 the last word aligned word of the segment including the background. And/or I use R0 and/or R12 to get a copy of the colour/pattern with a simple MOV, to STM on quadwords as much as I can.

Note there's no return address in any register, it would waste a register.
Instead, there's a branch back to the piece of code to execute after the filing of the shape is complete, I achieve that when, by using R0=0 (0 is at the end of my table of couples (desination address, length)) it sets the PC to branch to a memory location where the callant routine has beforehand built a branch instruction to the code to execute upon completing the filing process.

The segment filing routines branch one to the next to use, there's no branch back to the callant routine during the whole process of actually drawing the shape.

There's the additional idea, by working with a table of couples (memory start address, length in pixels) that the filing routines will work ok with RasterMan (ie the memory start addresses will need to be modified if RasterMan wobbles the screen on each scanline).
The idea, to me, is that you build your table for a standard display, then you load the value to alter the initial start memory address of each segment to take into account the visual shifting RasterMan operates (which must be compensated in terms of actual bytes to add or remove to the memory start address of the segments, for each scanline).

Could I email you my routines to get your opinion on them, if you've got some time to read the source ?
I think they are clever, but I'd be happy to read your comments to improve them.
I'm no expert so your comments and advice would be more than welcome.
User avatar
SarahWalker
Posts: 1599
Joined: Fri Jan 14, 2005 3:56 pm
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by SarahWalker »

Zarchos wrote:PS :
For the record, Aldebaran, quite similar to Zarch, but coded by a member of the demomaker group Arc Angels, runs at 50 frames per second, it's clearly written in an advert I could find in a BBC ACORN USER from 1992 (December issue, page 31 of the games supplement). The game even displays some objects with realtime texture mapping.
"50Hz wireframe action" is referring only to a wireframe section of the game. The main game itself runs at a similar frame rate to Zarch.
Zarchos
Posts: 2355
Joined: Sun May 19, 2013 9:19 am
Location: FRANCE

Re: Hacker needed ... for Zarch ;-)

Post by Zarchos »

#-o :mrgreen:
sirbod
Posts: 1624
Joined: Mon Apr 09, 2012 9:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by sirbod »

Zarchos wrote:When you say
but not by using word couples as you'd be adding multiple memory accesses for each segment and have to keep resetting the fill registers; you'd achieve better performance by keeping the start/end in registers and merge the segment fill into the triangle code
I'm not sure I get it.
Dropping a segment fill in, would be slower...why, because it's currently retaining the start/end in registers, if you switch to storing pairs then you're adding an STM/LDM pair at around 1280nS per line, based on benchmarking my A310:
SI-A310-ARM2-MEMC1.png
EDIT: And here are the timings with a MEMC1a present
SI-A440-ARM2-MEMC1a.png
You'll need to modify the triangle route as well. And if you're doing that, you probably want to replace the whole triangle routine with an optimal one with the segment fill built into it, using as few fill registers as possible (but more than the STR it currently uses) that are set outside the triangle code. If the average triangle is 20 pixels wide (I'm guessing), then four fill registers is probably as many as you'd need to use, gaining back ~1000nS for every 16 word aligned pixel fills.

If you're going all out on gaining every nS back, you'd also want to optimize the segment fills to STRB the initial misalignment, ending up with four segment fill routines for every length of line up to the four registers. Beyond 16 pixels (ie the four fill words), its a compromise between the size of the segment fill code and using loops so it fits in memory. Statistical analysis of the segment lengths would give you a good cut-off point to switch to loops for the few instances of long line lengths.
Zarchos wrote:How can you avoid loading a memory address and a length for a segment filing routine ?
By building the segment fill into the triangle routine.
Zarchos wrote:I don't have to reset the fill registers : my algo sets up the colour or pattern in R1 to R11 once for all in the callant routine, and then I call my routines with R13 pointing to the list of (memory start address, length).
This wont be optimal for Zarch, due to the small size of most of the triangles.
Zarchos wrote:I use the value in R0 to set PC to branch to the right piece of code to actually store in memory at [R14] length pixels.
Probably the best method for ARM2 (but a big no-no for ARM3 as you're probably aware.)

I doubt we'd see much of an improvement on ARM2, not noticeably frame rate wise at any rate, we could probably improve the triangle routine by 20% at a push. ARM3 you could probably get good gains, by getting the triangle routine combined with the segment fill to fit into the cache.

What you're describing is identical to the routine in Elite, it builds up a pair list for triangles and passes to a flood fill routine. When we updated Elite last year, I promptly ripped it out for ARM3+ and used a slightly unrolled loop as a compromise between cache misses and the extra cycles the loop adds.
Zarchos wrote:Could I email you my routines to get your opinion on them, if you've got some time to read the source ?
Of course you can send them, although I doubt I'd be able to improve on them, I expect they're already about as optimal as you can get them on ARM2.
Last edited by sirbod on Wed Jul 06, 2016 7:47 am, edited 3 times in total.
steve3000
Posts: 2913
Joined: Sun Nov 25, 2012 12:43 am
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by steve3000 »

Looks like I've joined the conversation at the end of the discussion and I'm not sure I can add more than what's been said on the hacking/coding front - but you mentioned an interesting challenge right at the start - running QTM along with Zarch...

It turns out I actually had Zarch loaded up last week on my 'newly created' 1mb ARM2 Arthur OS 1.2 A310 :) ...and the disc is still in the drive - so I'll have a go at running this with QTM when I have a moment and report back!
User avatar
qUE
Posts: 110
Joined: Tue Dec 16, 2014 11:39 pm
Location: Bristol
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by qUE »

Hardest challenge is to "intelligently" assess whether the scanline on the fill is unaligned or not then just cope with the ends using a shifted mask. I suspect there's optimisations to be had by thinking about the fill inverted and cutting away the edges, who knows.

Loads of demo routines have been "open" sourced on the web for other architectures, I bet they've got some really wicked ways of cramming maximum efficiency on quad fills.

Also bet if you've got the memory and a reasonable range of preset vectors, pre-rendering into sprites would give gains.
Zarchos
Posts: 2355
Joined: Sun May 19, 2013 9:19 am
Location: FRANCE

Re: Hacker needed ... for Zarch ;-)

Post by Zarchos »

qUE wrote:Hardest challenge is to "intelligently" assess whether the scanline on the fill is unaligned or not then just cope with the ends using a shifted mask. I suspect there's optimisations to be had by thinking about the fill inverted and cutting away the edges, who knows.
I did that, using something similar to my sprites routines. There's no need of a shifed mask, though. The behaviour of LDR (to load the background) on a non word aligned address with one shift applied to the result + ADDing the shifted fill is enough, I'd say 4 times out of 5.
The fill 'inverted' from the end with STMDA, I did that too. (if I understand what you wrote).
Loads of demo routines have been "open" sourced on the web for other architectures, I bet they've got some really wicked ways of cramming maximum efficiency on quad fills.


Also bet if you've got the memory and a reasonable range of preset vectors, pre-rendering into sprites would give gains.
I don't like copying, and anyway I don't know well enough 68000 ASM.
Yes, as always, the more memory you have, the more it'll help produce faster algos.

I complete writing the routines, test them, send them to Jon, and if he thinks it's not aimed to directly be thrown to the dustbin :( I'll post the source on *.

'So we won't read anything' I hear you think ha ha. :wink:
sirbod
Posts: 1624
Joined: Mon Apr 09, 2012 9:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by sirbod »

I suppose I'd better figure out a way to benchmark Zarch. A couple of choices here:

1. Count the number of frames swaps over a set time period, say 5 mins when left in demo mode, to get an average FPS.
2. Count time taken vs number of triangles drawn, to get an average triangles/sec

I'll also have a hunt though my floppies to see if I still have my triangle routine that I shoved into Zarch back in 1987.
sirbod
Posts: 1624
Joined: Mon Apr 09, 2012 9:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by sirbod »

Have done some more poking around inside Zarch and located the triangle and quadrilateral routines, all coordinates are unsorted on entry:

Code: Select all

Line fill entry point:DADC
R8 - Colour
R10 - Length in pixels
R11 - Start address

Exit:
R0, R10, R11 - corrupted
R1-R9, R12 - preserved

Code: Select all

Quadrilateral entry point: CED0
R0 - X1
R1 - Y1
R2 - X2
R3 - Y2
R4 - X3
R5 - Y3
R6 - X4
R7 - Y4
R8 - Colour

Exit:
R0-R7 - corrupted
R8-R12 - preserved

Code: Select all

Triangle entry point: CEE8
R0 - X1
R1 - Y1
R2 - X2
R3 - Y2
R4 - X3
R5 - Y3
R8 - Colour

Exit:
R0-R7, R9-R12 - corrupted
The Quadrilateral routine calls the triangle routine twice - this could be sped up considerably by coding a bespoke quadrilateral routine. It's possibly the slowest part of Zarch when a full landscape is visible, so could do with optimizing.
Last edited by sirbod on Wed Nov 04, 2015 8:57 pm, edited 1 time in total.
paulb
Posts: 1775
Joined: Mon Jan 20, 2014 9:02 pm
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by paulb »

sirbod wrote:The Quadrilateral routine calls the triangle routine twice - this could be sped up considerably by coding a bespoke quadrilateral routine. It's possibly the slowest part of Zarch when a full landscape is visible, so could do with optimizing.
I am reminded of what Andrew Hutchings said in his Starfighter 3000 talk about drawing quadrilaterals as the general case, drawing triangles as a specialisation of that. I imagine that doing it that way allows horizontal line fills to be consolidated for better performance, since you don't need to revisit the same vertical point in a shape to continue filling where you left off earlier.

Either that doesn't make any sense or you already know what I'm talking about. :lol:
sirbod
Posts: 1624
Joined: Mon Apr 09, 2012 9:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by sirbod »

paulb wrote:Either that doesn't make any sense or you already know what I'm talking about. :lol:
You're spot on. It also removes the calculations needed to work out the angle of the dissecting diagonal (off the top of my head: 2xLDR, 2xMUL and 2n*ADD, where n is the height of the triangle)
sirbod
Posts: 1624
Joined: Mon Apr 09, 2012 9:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by sirbod »

Here's a compromise line fill that fits in under 8k. It sorts out the misaligned start/end first then fills the gap in 1 to 4 word chunks. The MOV R6/R7/R9,R8 needs to move into the triangle routine, I've put it in here so it works as-is. Entry registers match Zarch's line file detailed above.

This could be improved futher to get a few 100nS back with the trade-off of memory required...If the alignment correction was moved into the fill code, the code size would increase to around 50kb. If you fully expanded it to one fill code routine per line length, per start misalignment it would be around 200kb.

There's also a possible improvement to aligning to quad word boundaries, but it may be slower overall due to the alignment check up front.

Code: Select all

DIM code% 128*1024
FOR A%=0 TO 2 STEP 2
P%=code%
addr1=11:count=10:addr2=12
[OPT A%
.line
CMP   count,#4
BLO   short_fill

MOV R6,R8:MOV R7,R8:MOV R9,R8
ADD   addr2,addr1,count
TST   addr1,#%11:STRNEB R8,[addr1],#1
TSTNE addr1,#%11:STRNEB R8,[addr1],#1
TSTNE addr1,#%11:STRNEB R8,[addr1],#1
TST   addr2,#%11:STRNEB R8,[addr2,#-1]!
TSTNE addr2,#%11:STRNEB R8,[addr2,#-1]!
TSTNE addr2,#%11:STRNEB R8,[addr2,#-1]!
SUB   count,addr2,addr1

.short_fill
MOV   count,count,LSL #6-2       ;*64
ADD   count,count,addr2,LSR #1   ;*96
ADD   PC,PC,count
MOV   PC,R14
MOV   PC,R14
]:P%+=24-4
FOR loop%=1 TO 3:[OPT A%:FNline_fill(loop%):]:NEXT
FOR loop%=4 TO 320 STEP 4:[OPT A%:FNline_fill(loop%):]:NEXT
NEXT


DEFFNline_fill(N%)
IF N%>=4 THEN endP%=P%+96 ELSE endP%=P%+24
WHILE N%>=16
    [OPT A%:STMIA (addr1)!,{R6,R7,R8,R9}:]:N%-=16
ENDWHILE
IF N%=12 [OPT A%:STMIA (addr1)!,{R6,R7,R8}:]:N%-=12
IF N%=8 [OPT A%:STMIA (addr1)!,{R7,R8}:]:N%-=8
IF N%=4 [OPT A%:STR R8,[addr1],#4:]:N%-=4
IF N%=3 [OPT A%:STRB R8,[addr1],#1:]:N%-=1
IF N%=2 [OPT A%:STRB R8,[addr1],#1:]:N%-=1
IF N%=1 [OPT A%:STRB R8,[addr1],#1:]:N%-=1
[OPT A%:MOV PC,R14:]
WHILE P%<endP%:[OPT A%:DCD 0:]:ENDWHILE
=0
When I get time, I'll benchmark Zarch so we have a baseline to compare against. Then we can try different routines to see what's optimal. I'm going to pull some statistical analysis out of it as well, so we know what line lengths are more likely to occur. This might allow further optimization.

The big improvements however will come from replacing the triangle routine and implementing a quadrilateral. If you want to test your programming/optimization skills, now's your chance.
User avatar
qUE
Posts: 110
Joined: Tue Dec 16, 2014 11:39 pm
Location: Bristol
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by qUE »

Shifted mask should be quicker than multiple STRBs. Look at Acorn's horizontal line drawing routine in RISC OS for example.

I personally found performance on either rotated load/store (which you can't do on newer architectures anyway unless you disable align in CR) or byte load/store to hammer ARM2.
sirbod
Posts: 1624
Joined: Mon Apr 09, 2012 9:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by sirbod »

qUE wrote:Shifted mask should be quicker than multiple STRBs. Look at Acorn's horizontal line drawing routine in RISC OS for example.
Feel free to post a working example with timings. I believe the RISCOS code you're referring too is this in vduplot,v.

Outside of a triangle/quadrilateral routine it might at first appear to be quicker, but in reality you'd have to STM/LDM the extra registers you need to create the mask and bit shifts. That's a min of three registers, which adds at least 1500nS to the line plot; the knock on effect of that is to make the best case slower (best case being 1 or no pixel out of alignment, depending on use of a branch over the alignment code).

Based on the timings in the graph above, the STRB method is as follows if anyone want to improve on it:

Best case: ~1950nS (start/end aligned)
Worst case: ~4050nS (6 misaligned pixels)

Code: Select all

ADD   addr2,addr1,count
TST   addr1,#%11:STRNEB R8,[addr1],#1
TSTNE addr1,#%11:STRNEB R8,[addr1],#1
TSTNE addr1,#%11:STRNEB R8,[addr1],#1
TST   addr2,#%11:STRNEB R8,[addr2,#-1]!
TSTNE addr2,#%11:STRNEB R8,[addr2,#-1]!
TSTNE addr2,#%11:STRNEB R8,[addr2,#-1]!
sirbod
Posts: 1624
Joined: Mon Apr 09, 2012 9:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by sirbod »

I've now have a test suite, which plays back a recording of all the triangles Zarch plots over the first two demo levels. Triangle stats from that are as follows:

Total triangles recorded: 16380
Widest triangle: 56 pixels
Average triangle width: 18 pixels

Graph showing totals of triangle widths over the recording:
ZARCH_TRI_STATS.png
As I suspected, the bulk of the triangles are small, so we only need hand-crafted fills for the smaller range. Anything over 40 can be covered by a generic line fill.


Breaking these triangles down into their line fill length counts highlights where we need to optimize the fill routine:
ZARCH_LINE_FILL_STATS.png
The bulk of the line fills are under 4 pixels, the longest is 53 pixels, above 43 pixels its in single digits. Based on these figures, we only need to optimize the line fill up to 32 pixels (so 128 routines including start alignment) and leave the rest to a generic routine.
Last edited by sirbod on Thu Nov 05, 2015 4:15 pm, edited 2 times in total.
Zarchos
Posts: 2355
Joined: Sun May 19, 2013 9:19 am
Location: FRANCE

Re: Hacker needed ... for Zarch ;-)

Post by Zarchos »

A short message to say hello and thank you for the interest you show.
Believe me to me it's a real incentive when coding !

To tell you more about my routines, they're not elegant because they don't use a generic algo.
They're designed to be fast, simply.
It's 'brute code' as I like it :
- for each length (1 pixels to 200, I'll later expand that to 384), I've got 16 routines (depending on the position on the quadword aligned base address of the destination address, because any destination address can be described as a quadword aligned address + (0...15) byte(s)).

For the small lengths, the 16 routines for a given length are going to be the same, because there's no splitting of the amount of registers to STM so that most of them are on a quadword aligned address, yes, but it also allows to have a fixed size for each sequence of instructions to deal with the plotting of P pixels at offset 0...15, whatever the quadword alignment ; thus branching to a routine is faster to compute.

Yes it'll use quite a lot of memory, but I think it's not a problem considering most Archie games use 1 Mbyte of RAM when a standard Archie is in fact a 2 Mbyte machine.
Furthermore it's possible to have a table of (address, length) of 'holes' in the routines to use these memory locations to store datas, and this way reduce the 'waste' of unused memory.

I've just seen there's something I did wrong and could be further optimised, so the code should be ready next week, not this week end as I expected.
sirbod
Posts: 1624
Joined: Mon Apr 09, 2012 9:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by sirbod »

Zarchos wrote:Yes it'll use quite a lot of memory, but I think it's not a problem considering most Archie games use 1 Mbyte of RAM when a standard Archie is in fact a 2 Mbyte machine.
We need to fit this inside 640kb of RAM, minus what Zarch needs. Given that the largest triangle is 56 pixels wide, the line fill only needs hand crafting to 64 (which may be lower once I've analysed the line fills), beyond that can get a generic algorithm.

Sounds like you're making good headway. In the meantime, I'm going to look at coding a quadrilateral routine.
sirbod
Posts: 1624
Joined: Mon Apr 09, 2012 9:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by sirbod »

Line fill stats now in the post above. Very interesting, we only need to optimize up to 32 pixel line lengths.

There's a big optimization to be had by making the triangle routine fill when they're less that 4 pixels wide, which covers 35% of all fills.
sirbod
Posts: 1624
Joined: Mon Apr 09, 2012 9:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by sirbod »

sirbod wrote:There's a big optimization to be had by making the triangle routine fill when they're less that 4 pixels wide, which covers 35% of all fills.
I've run the demo loop on an A5000, varying how many pixels are STRB's within the triangle routine before off-loading to the line fill code above. Timings in ms on the Y axis and STRB's within the triangle routine on the X axis:
STRB_INSIDE_TRIANGLE_ROUTINE.png
Based on this in combination with the line fill stats above, we should include 6 STRB's within the triangle code, have bespoke code for 7-31 pixels and a generic for 32+ pixels. With 6 STRB's using the code above, we've already sped it up by 27%!

Interestingly, this runs at a smooth 70fps, so I did some digging into what Zarch is doing that's using cycles. Aside from the game logic, the biggest hitter is it's CLS routine. Oddly, there's two, one that clears the top third of the screen and another that clears the bottom.

The bottom code covers the landscape when it's at its highest point and could be sped up by using the line fill to blank to the left of the far left triangles and to the right of the far right triangles. This would reduce screen writes by around 90% when the landscape is fully visible.

EDIT: I've sketched out a quadrilateral routine and think I can squeeze 4 colour registers for the fill code, any more and we'll need to start preserving registers in/out of the fill, which would be counter productive.
Last edited by sirbod on Fri Nov 06, 2015 5:27 am, edited 1 time in total.
User avatar
qUE
Posts: 110
Joined: Tue Dec 16, 2014 11:39 pm
Location: Bristol
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by qUE »

Jon, I don't have anything setup to test the processing time, but this is theoretical just off the top of my head, how would something like this fair against multiple STRBs?

assuming you've already calculated the screen offest into r12, and r1-r7 are set with the fill colour

ANDS r10,r12,#3
MOVEQ r11,r12
MOVEQ r0,r1
BICNE r11,r12,#3
LDRNE r0,[r11]
MOVNE r10,r10,LSL#3
ORRNE r0,r0,r1,LSL r10
STMIA r11!,{r0-r7}
..then multiple STMIAs until near end and LSR version of above routine

it's just datasheet does mention a penalty for memory access and I can't help but think there's potentially 6 unaligned rotated/masked read/writes there, where as this should reduce it to two aligned
Last edited by qUE on Fri Nov 06, 2015 10:57 am, edited 1 time in total.
sirbod
Posts: 1624
Joined: Mon Apr 09, 2012 9:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by sirbod »

qUE wrote:Jon, I don't have anything setup to test the processing time, but this is theoretical just off the top of my head, how would something like this fair against multiple STRBs?

assuming you've already calculated the screen offest into r12, and r1-r7 are set with the fill colour
Thanks for the code, I'll drop it into the Zarch test suite and post timing results soon. We have 4 colour registers to play with incidentally, I've just updated my previous post to state that.
User avatar
qUE
Posts: 110
Joined: Tue Dec 16, 2014 11:39 pm
Location: Bristol
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by qUE »

I was thinking if registers are tight, I do a trick where I cram any values under 256 into one register and unpack them with mask(/shift) for use when needed, it basically packs upto 4 registers into 1. I normally use it for counting and flags, but that could probably be used for any of the increment/decrements for "growing" the triangle from a point. You can unpack the sign out of it with SUB r,r,#128, or decrement/increment any of the values with ADD/SUB r,r,#1<<value_offset and check if it's hit zero with TST r,#255<<value_offset.

On register allocation I basically treat LDM/STM as king when I can :)
sirbod
Posts: 1624
Joined: Mon Apr 09, 2012 9:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by sirbod »

qUE wrote:On register allocation I basically treat LDM/STM as king when I can :)
As 99% of all line fills are below 32 pixels, we can get away with four fill registers. Once I've done the analysis on alignment of those fills, we'll probably find three registers is all we need, as I suspect next to no 32 pixel fills are going to be word aligned.
Deleted User 9142

Re: Hacker needed ... for Zarch ;-)

Post by Deleted User 9142 »

Sorry if this is deemed OT, but since you're discussing polygons...

I'm assuming that David Braben with Zarch is calculating the polygon edges using a fairly standard method like Bresenham's line algo? When he clips the edges at the viewport boundaries (or screen edges in this case), is he just doing a simple endpoint recalculation (via e.g. Cohen-Sutherland algorithm)? Or is he clipping the lines 'properly' so that the polygon edges (or lines) are drawn precisely the same as if they were not actually clipped?

99% of line clipping routines I've seen in the past (including all of my own) just use the naive endpoint recalculation approach. I still haven't figured how to do it 'properly' and efficiently, myself. Such routines suffer from the 'out-by-one pixel' problem when rendering or calculating clipped lines, due mostly to rounding errors.

I was pleased (and a bit surprised) to see that the line drawing (and clipping) routine in the BBC Micro's MOS does it 'properly'! As does BB4W's LINE command (which relies on the Windows GDI). RISC OS's line drawer probably does it properly, too.

Here is a good article by the author of VirtualDub on this subject:

http://www.virtualdub.org/blog/pivot/entry.php?id=341


David.

http://www.proggies.uk
Zarchos
Posts: 2355
Joined: Sun May 19, 2013 9:19 am
Location: FRANCE

Re: Hacker needed ... for Zarch ;-)

Post by Zarchos »

David1664 wrote:Sorry if this is deemed OT, but since you're discussing polygons...

I'm assuming that David Braben with Zarch is calculating the polygon edges using a fairly standard method like Bresenham's line algo? When he clips the edges at the viewport boundaries (or screen edges in this case), is he just doing a simple endpoint recalculation (via e.g. Cohen-Sutherland algorithm)? Or is he clipping the lines 'properly' so that the polygon edges (or lines) are drawn precisely the same as if they were not actually clipped?

99% of line clipping routines I've seen in the past (including all of my own) just use the naive endpoint recalculation approach. I still haven't figured how to do it 'properly' and efficiently, myself. Such routines suffer from the 'out-by-one pixel' problem when rendering or calculating clipped lines, due mostly to rounding errors.

I was pleased (and a bit surprised) to see that the line drawing (and clipping) routine in the BBC Micro's MOS does it 'properly'! As does BB4W's LINE command (which relies on the Windows GDI). RISC OS's line drawer probably does it properly, too.

Here is a good article by the author of VirtualDub on this subject:

http://www.virtualdub.org/blog/pivot/entry.php?id=341


David.

http://www.proggies.uk
I remember using this algo when I tried to code 3D on the Archie.
I used the dichotomic method to find the edge of the clipped segments.
And it was Bresenham algo I used to trace lines, the 'basic' algo, not the one where there are some increments already 'known' because some segments have a specific slope.
All this was explained in these fantastic 'Graphics gems' books I had (if I remember correctly the titles. And all this in English of course).
Fond memories spending hours and hours trying to understand the contents of these books full of maths at a far too high level of complexity for my knowledge.
sirbod
Posts: 1624
Joined: Mon Apr 09, 2012 9:44 am
Location: Essex
Contact:

Re: Hacker needed ... for Zarch ;-)

Post by sirbod »

David1664 wrote:Sorry if this is deemed OT, but since you're discussing polygons...

I'm assuming that David Braben with Zarch is calculating the polygon edges using a fairly standard method like Bresenham's line algo? When he clips the edges at the viewport boundaries (or screen edges in this case), is he just doing a simple endpoint recalculation (via e.g. Cohen-Sutherland algorithm)? Or is he clipping the lines 'properly' so that the polygon edges (or lines) are drawn precisely...
Bresenham's is an infinitely thin line algorithm based on integers as far as I'm aware, and not really applicable to polygons which have depth. I've not really analysed DB's triangle routine as I'm not planning on using it, I'll be replacing it with a quadrilateral routine.

I can't say I've ever looked at "industry standard" algorithms, I've always worked out my own maths from scratch, so I couldn't tell you what DB implemented in known algorithm terms.

I can tell you that I'll be using 19 bit precision floats, lookups tables for vector angles of distances smaller than 64x64 pixels (which covers 99% of tri/quads plotted) and that left and right screen clipping is done as part of the line fill and is effectively free. Vertical clipping may not be required on the top of the screen (it's covered by the top screen blanking code) and the bottom clipping comes free as you stop at Y=256

Essentially I've tailored a routine to fit the majority of cases after analysis of the game's requirements. I'll post the code up once it's coded, along with comparison timings, for people to improve on.
Post Reply

Return to “32-bit acorn software: classic games”