ARM instruction timings

jubber · Post by **jubber** » Wed Nov 22, 2023 8:29 am

Came across the following line of text

"ADD R1,R1,R1,LSL#1; R1 = R1 + (R1 << 1). Shifting a number left one place multiplies it by two, so this instruction multiplies R1 by three, thus avoiding a MUL instruction."

Are MUL instructions bad? I hadn't really thought about execution time until reading that, but each instruction takes a different amount of time on other processors, so I suppose it makes sense. Although, curious how this doesn't screw the pipeline.

Does this apply to other instructions too? For instance does the number of registers you place on the stack make a difference to the speed of execution of a STM instruction?

Cheers,

Robin.

SKS1 · Post by **SKS1** » Wed Nov 22, 2023 9:50 am

jubber wrote: ↑Wed Nov 22, 2023 8:29 am Are MUL instructions bad? I hadn't really thought about execution time until reading that, but each instruction takes a different amount of time on other processors, so I suppose it makes sense. Although, curious how this doesn't screw the pipeline.

Does this apply to other instructions too? For instance does the number of registers you place on the stack make a difference to the speed of execution of a STM instruction?

To begin with on ARM1, we didn't have MUL! MUL is usually fine, but many cases can be done quicker using shifts (power of two) or shifts and adds (as your example). Have a gander at Pete's book, section 3.7. https://www.chiark.greenend.org.uk/~the ... kerell.pdf
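A minimal sketch of the arithmetic behind the quoted `ADD R1,R1,R1,LSL#1` trick, modelled in Python rather than ARM code. The helper names (`add_shifted`, `times_three`, `times_power_of_two`) are hypothetical and only illustrate the idea; the 32-bit mask stands in for the register width.

```python
MASK32 = 0xFFFFFFFF  # ARM registers are 32 bits wide


def add_shifted(a, b, shift):
    """Model ADD Rd,Ra,Rb,LSL #shift, wrapping to 32 bits."""
    return (a + (b << shift)) & MASK32


def times_three(r1):
    # ADD R1,R1,R1,LSL#1  ->  R1 = R1 + (R1 << 1)
    return add_shifted(r1, r1, 1)


def times_power_of_two(r1, n):
    # MOV R1,R1,LSL #n  ->  R1 = R1 << n
    return (r1 << n) & MASK32


if __name__ == "__main__":
    for value in (1, 7, 1000):
        print(value, times_three(value), times_power_of_two(value, 3))
```

The results agree with an ordinary multiply modulo 2^32, which is all the shift-and-add form relies on.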

jubber · Post by **jubber** » Wed Nov 22, 2023 9:52 am

https://gab.wallawalla.edu/~curt.nelson ... dix_B3.pdf partly answers this, although it's for modern ARM CPUs. I'm guessing the pipeline and everything else just stalls while a slow instruction is executed. Still not sure about multiple registers with an LDM/STM - but really interested by the note that suggests instructions that nop themselves due to condition codes don't necessarily take just one cycle. Again, this info might be wrong for ARM2 era machines.

jubber · Post by **jubber** » Wed Nov 22, 2023 9:58 am

That's a great pointer! Thanks for the book recommendation in another post - I've been skimming parts of it with the search function from time to time. I've got about five books open in various tabs along with other useful resources like an interactive immediate value checker

https://alisdair.mcdiarmid.org/arm-imme ... -encoding/

but there are only so many hours in the day to work this stuff out while also having a job, kids etc. Wish I was 16 again.

So far I have managed to write a program that can plot a dot in mode 9. It's slow going!

NickLuvsRetro · Post by **NickLuvsRetro** » Wed Nov 22, 2023 10:47 am

Worth pointing out the stardot Discord server has a #programming channel which is also worth tapping into for advice on this kind of stuff.

Can be useful for quick regular chats on ARM specifics.

gfoot · Post by **gfoot** » Wed Nov 22, 2023 11:05 am

I investigated this a bit a few years ago and made a video about it showing my understanding of what's going on - I don't remember whether the video was any good, but it's here: https://youtu.be/59sO1BGYqWs

Here's a diagram from the video that shows what I observed and my understanding of it - in the video I talk through it a lot more of course:

I think later CPUs had much more complex pipelines, but it was fairly simple for ARM2. I believe instructions have a fetch cycle, a decode cycle, and then one or more execution cycles. The execution cycles run in sequence, and one instruction executes fully before the next can start. However, the fetch and decode of the following instructions are overlapped with execution and take priority over it: an instruction fetch needs access to memory, so it delays any pending execution cycle that also needs memory access.

Regarding MUL, I'm not sure whether I showed it in this video, but it takes a variable number of cycles depending upon the arguments, due to the way the multiplier works. Something like 10-20 cycles I believe, from memory. So executing one or two constant-time instructions instead is a big win if one argument is fixed and only has a few bits set.

jubber · Post by **jubber** » Wed Nov 22, 2023 12:52 pm

Thanks for the Discord hint NickLuvsRetro! I found it with a quick Google https://discord.gg/pRy44Wz

And gfoot - I'll watch your video. Thanks for the information. It does indeed look like MUL is slow. I'm using it in my simple point plotter to calculate x + (y*320) but perhaps that would be better as a combination of y*256 + y*64 (or 128 and 32 for MODE 9).

gfoot · Post by **gfoot** » Wed Nov 22, 2023 3:30 pm

Yes, to multiply by a constant that has just two bits set, you can load a shifted copy of the value into a register, and then add another shifted copy of the value to it.
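To make the two-bits-set case concrete, here is a hedged Python sketch that finds the two shift amounts for a constant such as 320 or 160 and checks the MOV/ADD pair against a plain multiply. The function names are hypothetical, and the code models only the arithmetic, not the timing.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit registers


def two_bit_shifts(constant):
    """Return (hi, lo) bit positions for a constant with exactly two bits set."""
    bits = [i for i in range(32) if (constant >> i) & 1]
    if len(bits) != 2 or constant > MASK32:
        raise ValueError("constant must have exactly two bits set")
    lo, hi = bits
    return hi, lo


def multiply_two_bit(y, constant):
    """Multiply y by a two-bit constant using one MOV and one ADD."""
    hi, lo = two_bit_shifts(constant)
    temp = (y << hi) & MASK32            # MOV temp,y,LSL #hi
    return (temp + (y << lo)) & MASK32   # ADD y,temp,y,LSL #lo


if __name__ == "__main__":
    print(two_bit_shifts(320))  # shifts for y*256 + y*64
    print(two_bit_shifts(160))  # shifts for y*128 + y*32
    print(multiply_two_bit(100, 160))
```

For 320 the shifts come out as 8 and 6 (256 + 64), and for 160 as 7 and 5 (128 + 32), matching the decompositions mentioned in the thread.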

It looks like it was a different video where I saw the cost of MUL in cycles: https://youtu.be/s715Rv86KtA?t=279 This shows a slow one (17 cycles) and a faster one (9 cycles) as they happened to have different arguments. MOV followed by ADD would be just two cycles.

jubber · Post by **jubber** » Wed Nov 22, 2023 3:59 pm

The difference isn't vast, but I ran my little bit of code 50,000 times from BASIC with a simple TIME=0 PRINT TIME thing around it and got the following -

```
MOV R3,#160 ; mode 9 width (4 bit mode)
MUL R1,R3,R1 ; y = 160 * y
```

and got 626 time units (milliseconds?)

```
MOV R3,R1,LSL #7 ; temp = y*128
ADD R1,R3,R1,LSL #5 ; y = temp + (y*32), so y = 160 * y
```

and got 617 - so enough of a difference it's worth doing, but MUL isn't fatally slow, in this specific case. Of course I don't know how MUL works - maybe it has early exits for cases like this.

Post by **Rich Talbot-Watkins** » Wed Nov 22, 2023 4:03 pm

If you're just CALLing that 50,000 times, I imagine that's not a fair test as the majority of the time you measure will just be the overhead of the CALL.

A fairer test would be to literally assemble that snippet 50,000 times and call it once, i.e.

```
DIM code% 50000*8+4
P%=code%
FOR c%=1 TO 50000
[OPT 2
; snippet under test, assembled 50,000 times back to back
MOV R3,#160
MUL R1,R3,R1
]
NEXT
[OPT 2:MOV PC,R14:]
:
TIME=0:CALL code%:PRINT TIME
```

jubber · Post by **jubber** » Wed Nov 22, 2023 4:06 pm

That's a great tip - thanks.

gfoot · Post by **gfoot** » Wed Nov 22, 2023 4:06 pm

Note that you'll need the P% initialisation before the FOR loop for that to work.

Post by **Rich Talbot-Watkins** » Wed Nov 22, 2023 4:11 pm

Ugh yeah. Thanks! (going to correct that snippet)

So used to setting P% inside a loop.

jubber · Post by **jubber** » Wed Nov 22, 2023 4:14 pm

For the curious - the printed TIME values - 4 for the MUL approach and 2 for the shifts.

Post by **Rich Talbot-Watkins** » Wed Nov 22, 2023 4:16 pm

Basically MUL can take up to 17 cycles to execute, depending on how big the second operand is. It keeps shifting out the bottom two bits each cycle until it's zero. So, in your case, when that operand is 160, I would expect it to take five cycles to execute: 1 to fetch the opcode, and 4 to perform the multiplication.
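The description above can be turned into a small estimate. This Python sketch follows only what the post says (one cycle for the opcode, then one cycle per two bits shifted out until the operand is zero); the function name is hypothetical, and it is not checked against real ARM2 hardware or a datasheet.

```python
def mul_cycles_estimate(operand):
    """Estimate MUL cycles from the operand that is shifted out two bits at a time."""
    operand &= 0xFFFFFFFF  # treat the operand as an unsigned 32-bit value
    steps = 0
    while operand:
        operand >>= 2      # two bits consumed per cycle, per the description
        steps += 1
    return 1 + steps       # plus one cycle to fetch the opcode


if __name__ == "__main__":
    for value in (160, 0xFFFF, 0xFFFFFFFF):
        print(hex(value), mul_cycles_estimate(value))
```

With an operand of 160 this gives the five cycles quoted in the post, and a full 32-bit operand gives the 17-cycle worst case.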

Post by **Rich Talbot-Watkins** » Wed Nov 22, 2023 4:17 pm

Maybe try something like:

```
DIM code% 50000*8+16
P%=code%
[OPT 2
MOV R0,#32 ; run the whole unrolled block 32 times
.loop
]
FOR c%=1 TO 50000
[OPT 2
; snippet under test, assembled 50,000 times back to back
MOV R3,#160
MUL R1,R3,R1
]
NEXT
[OPT 2
SUBS R0,R0,#1
BNE loop
MOV PC,R14
]
:
TIME=0:CALL code%:PRINT TIME
```

for a bit more precision!
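A rough Python sketch of why the extra loop helps, turning a TIME reading into approximate cycles per snippet. It assumes TIME counts centiseconds and assumes an 8 MHz CPU clock; the clock value and the function names are assumptions, not something stated in the thread. It also ignores the SUBS/BNE overhead and any difference in cycle length between memory and internal cycles.

```python
def cycles_per_snippet(ticks_cs, snippets, clock_hz=8_000_000):
    """Approximate CPU cycles spent per snippet for a TIME reading in centiseconds."""
    return ticks_cs * clock_hz / (100 * snippets)


def cycles_per_tick(snippets, clock_hz=8_000_000):
    """Resolution of the measurement: cycles per snippet represented by one tick."""
    return cycles_per_snippet(1, snippets, clock_hz)


if __name__ == "__main__":
    single_pass = 50_000          # unrolled block run once
    looped = 50_000 * 32          # unrolled block run 32 times
    print(cycles_per_snippet(4, single_pass))  # TIME=4 for the MUL version
    print(cycles_per_snippet(2, single_pass))  # TIME=2 for the shift version
    print(cycles_per_tick(single_pass), cycles_per_tick(looped))
```

Under these assumptions a single tick is worth 1.6 cycles per snippet when the block runs once, and 0.05 cycles per snippet when it runs 32 times, so readings of 4 and 2 are too coarse to say much on their own.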

IanJeffray · Post by **IanJeffray** » Mon Nov 27, 2023 2:19 pm

jubber wrote: ↑Wed Nov 22, 2023 3:59 pm 626 time units (milliseconds?)

Centiseconds.

stardot.org.uk
