ARM instruction timings

jubber
Posts: 379
Joined: Sat May 14, 2016 1:05 pm

Post by jubber »

I came across the following line of text:

"ADD R1,R1,R1,LSL#1; R1 = R1 + (R1 << 1). Shifting a number left one place multiplies it by two, so this instruction multiplies R1 by three, thus avoiding a MUL instruction."

Are MUL instructions bad? I hadn't really thought about execution time until reading that, but each instruction takes a different amount of time on other processors, so I suppose it makes sense. Although I'm curious how this doesn't screw up the pipeline.

Does this apply to other instructions too? For instance, does the number of registers you place on the stack make a difference to the execution speed of an STM instruction?

Cheers,

Robin.
SKS1
Posts: 330
Joined: Sat Sep 19, 2020 12:04 am
Location: Highland Perthshire

Post by SKS1 »

jubber wrote: Wed Nov 22, 2023 8:29 am
"Are MUL instructions bad? I hadn't really thought about execution time until reading that, but each instruction takes a different amount of time on other processors, so I suppose it makes sense. Although, curious how this doesn't screw the pipeline.

Does this apply to other instructions too? For instance does the number of registers you place on the stack make a difference to the speed of execution of a STM instruction?"

To begin with, on ARM1 we didn't have MUL! MUL is usually fine, but many cases can be done quicker using shifts (for powers of two) or shifts and adds (as in your example). Have a gander at Pete's book, section 3.7: https://www.chiark.greenend.org.uk/~the ... kerell.pdf
Miserable old curmudgeon who still likes a bit of an ARM wrestle now and then. Pi 4, 3, ARMX6, SA Risc PC, A540, A440

Post by jubber »

https://gab.wallawalla.edu/~curt.nelson ... dix_B3.pdf partly answers this, although it's for modern ARM CPUs. I'm guessing the pipeline and everything else just stalls while a slow instruction is executed. Still not sure about multiple registers with an LDM/STM, but I'm really interested by the note suggesting that instructions which NOP themselves due to condition codes don't necessarily use one cycle. Again, this info might be wrong for ARM2-era machines.

Post by jubber »

That's a great pointer! Thanks for the book recommendation in another post - I've been skimming parts of it with the search function from time to time. I've got about five books open in various tabs along with other useful resources like an interactive immediate value checker

https://alisdair.mcdiarmid.org/arm-imme ... -encoding/

but there are only so many hours in the day to work this stuff out while also having a job, kids etc. Wish I was 16 again.

So far I have managed to write a program that can plot a dot in mode 9. It's slow going!
NickLuvsRetro
Posts: 288
Joined: Sat Jul 17, 2021 4:18 pm

Post by NickLuvsRetro »

Worth pointing out the stardot Discord server has a #programming channel which is also worth tapping into for advice on this kind of stuff. :)

Can be useful for quick regular chats on ARM specifics.
gfoot
Posts: 987
Joined: Tue Apr 14, 2020 9:05 pm

Post by gfoot »

I investigated this a bit a few years ago and made a video about it showing my understanding of what's going on - I don't remember whether the video was any good, but it's here: https://youtu.be/59sO1BGYqWs

Here's a diagram from the video that shows what I observed and my understanding of it - in the video I talk through it a lot more of course:
[Attached image: pipelining.png]
I think later CPUs had much more complex pipelines, but it was fairly simple for ARM2. I believe instructions have a fetch cycle, a decode cycle, and then one or more execution cycles. The execution cycles run in sequence, and one instruction executes fully before the next can start. However, fetch and decode are overlapped with execution and take priority over it: an instruction fetch needs access to memory, so any pending execution cycle that also needs memory access is delayed.

Regarding MUL, I'm not sure whether I showed it in this video, but it takes a variable number of cycles depending upon the arguments, due to the way the multiplier works. Something like 10-20 cycles I believe, from memory. So executing one or two constant-time instructions instead is a big win if one argument is fixed and only has a few bits set.

Post by jubber »

Thanks for the Discord hint, NickLuvsRetro! I found it with a quick Google: https://discord.gg/pRy44Wz

And gfoot - I'll watch your video. Thanks for the information. It does indeed look like MUL is slow. I'm using it in my simple point plotter to calculate x + (y*320) but perhaps that would be better as a combination of y*256 + y*64 (or 128 and 32 for MODE 9).

Post by gfoot »

Yes, to multiply by a constant that has only two bits set, you can load the value into a register and then add it to a shifted version of itself.
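
As a rough illustration of the idea, here is a small Python sketch (not ARM code, and not from the thread) that finds which shifts a constant needs and checks that summing the shifted values gives the same answer as a multiply. The function names are made up for this example, and results are wrapped to 32 bits on the assumption that we are modelling a 32-bit register.

```python
def shift_amounts(constant: int) -> list[int]:
    """Return the positions of the set bits in a positive constant."""
    return [bit for bit in range(constant.bit_length()) if (constant >> bit) & 1]


def multiply_by_shifts(value: int, constant: int) -> int:
    """Multiply using only shifts and adds, wrapped to 32 bits like a register."""
    total = 0
    for shift in shift_amounts(constant):
        total += value << shift  # one shifted copy of value per set bit
    return total & 0xFFFFFFFF


print(shift_amounts(320))  # [6, 8], i.e. y*64 + y*256
print(shift_amounts(160))  # [5, 7], i.e. y*32 + y*128
print(multiply_by_shifts(100, 160) == 100 * 160)  # True
```

A constant with two set bits needs two shifted terms, which is why 320 and 160 each reduce to a pair of shift-and-add operations.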

It looks like it was a different video where I saw the cost of MUL in cycles: https://youtu.be/s715Rv86KtA?t=279 This shows a slow one (17 cycles) and a faster one (9 cycles) as they happened to have different arguments. MOV followed by ADD would be just two cycles.

Post by jubber »

The difference isn't vast, but I ran my little bit of code 50,000 times from BASIC with a simple TIME=0 PRINT TIME thing around it and got the following -

MOV R3,#160 ; mode 9 width (4 bit mode)
MUL R1,R3,R1 ; y = 160 * y

and got 626 time units (milliseconds?)

MOV R3,R1,LSL #7 ; temp=y*128
ADD R1,R3,R1,LSL #5 ; y=temp + (y*32) so y = 160 * y

and got 617 - so enough of a difference it's worth doing, but MUL isn't fatally slow, in this specific case. Of course I don't know how MUL works - maybe it has early exits for cases like this.
Rich Talbot-Watkins
Posts: 2054
Joined: Thu Jan 13, 2005 5:20 pm
Location: Palma, Mallorca

Post by Rich Talbot-Watkins »

If you're just CALLing that 50,000 times, I imagine that's not a fair test as the majority of the time you measure will just be the overhead of the CALL.

A fairer test would be to literally assemble that snippet 50,000 times and call it once, i.e.

REM Room for 50,000 MOV/MUL pairs (8 bytes each) plus the 4-byte return
DIM code% 50000*8+4
P%=code%
FOR c%=1 TO 50000
[OPT 2
MOV R3,#160
MUL R1,R3,R1
]
NEXT
[OPT 2:MOV PC,R14:]
:
TIME=0:CALL code%:PRINT TIME
Last edited by Rich Talbot-Watkins on Wed Nov 22, 2023 4:11 pm, edited 1 time in total.
Reason: Corrected the code

Post by jubber »

That's a great tip - thanks.

Post by gfoot »

Note that you'll need the P% initialisation before the FOR loop for that to work.

Post by Rich Talbot-Watkins »

Ugh yeah. Thanks! (going to correct that snippet)

So used to setting P% inside a loop.

Post by jubber »

For the curious, the printed TIME values were 4 for the MUL approach and 2 for the shifts.

Post by Rich Talbot-Watkins »

Basically MUL can take up to 17 cycles to execute, depending on how big the second operand is. It keeps shifting out the bottom two bits each cycle until it's zero. So, in your case, when that operand is 160, I would expect it to take five cycles to execute: 1 to fetch the opcode, and 4 to perform the multiplication.
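
The description above can be turned into a tiny Python model. This is only a sketch of the "two bits per cycle until zero" explanation as worded in this post, not a cycle-accurate model of the real multiplier, whose exact counts may differ. The function name is hypothetical, and the minimum of one multiply cycle for an operand of zero is an assumption.

```python
def mul_cycles(operand: int) -> int:
    """Estimate MUL cycles using the simplified description in the post."""
    operand &= 0xFFFFFFFF      # registers are 32 bits wide
    steps = 0
    while operand != 0:
        operand >>= 2          # two bits are shifted out per cycle
        steps += 1
    steps = max(steps, 1)      # assumption: at least one multiply cycle
    return 1 + steps           # plus one cycle to fetch the opcode


print(mul_cycles(160))         # 5
print(mul_cycles(0xFFFFFFFF))  # 17
```

Under this model, 160 needs four shifts to reach zero, giving the five cycles quoted, and a full 32-bit operand needs sixteen, giving the 17-cycle worst case.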

Post by Rich Talbot-Watkins »

Maybe try something like:

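REM Room for 50,000 MOV/MUL pairs plus the four loop and return instructions
DIM code% 50000*8+16
P%=code%
[OPT 2
MOV R0,#32 ; run the unrolled block 32 times
.loop
]
FOR c%=1 TO 50000
[OPT 2
MOV R3,#160
MUL R1,R3,R1
]
NEXT
[OPT 2
SUBS R0,R0,#1 ; count down the repeats
BNE loop
MOV PC,R14 ; return to BASIC
]
:
TIME=0:CALL code%:PRINT TIME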
for a bit more precision!
IanJeffray
Posts: 6016
Joined: Sat Jun 06, 2020 3:50 pm

Post by IanJeffray »

jubber wrote: Wed Nov 22, 2023 3:59 pm
"626 time units (milliseconds?)"

Centiseconds.
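
Since TIME counts centiseconds, a reading from the unrolled test can be converted into an approximate cycle count per copy of the snippet. The Python sketch below shows the arithmetic only. The 8 MHz clock is an assumption (adjust it for the machine or emulator in use), the function name is hypothetical, and the example numbers are illustrative, not measurements from this thread. It also ignores loop overhead, interrupts, and any memory contention.

```python
def cycles_per_copy(time_cs: float, copies: int, clock_hz: float = 8_000_000) -> float:
    """Convert a TIME reading in centiseconds to average cycles per copy."""
    seconds = time_cs / 100            # TIME ticks 100 times per second
    return seconds * clock_hz / copies


# Illustrative only: 32 passes over 50,000 unrolled copies, reading of 100
copies = 32 * 50000
print(cycles_per_copy(100, copies))    # 5.0
```

With only a few centiseconds on the clock, a single tick is a large fraction of the reading, which is why running the unrolled block several times gives a more useful figure.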