ARM instruction timings
Came across the following line of text
"ADD R1,R1,R1,LSL#1; R1 = R1 + (R1 << 1). Shifting a number left one place multiplies it by two, so this instruction multiplies R1 by three, thus avoiding a MUL instruction."
Are MUL instructions bad? I hadn't really thought about execution time until reading that, but each instruction takes a different amount of time on other processors, so I suppose it makes sense. Although, curious how this doesn't screw the pipeline.
Does this apply to other instructions too? For instance does the number of registers you place on the stack make a difference to the speed of execution of a STM instruction?
Cheers,
Robin.
Re: ARM instruction timings
jubber wrote: ↑Wed Nov 22, 2023 8:29 am
Are MUL instructions bad? I hadn't really thought about execution time until reading that, but each instruction takes a different amount of time on other processors, so I suppose it makes sense. Although, curious how this doesn't screw the pipeline.
Does this apply to other instructions too? For instance does the number of registers you place on the stack make a difference to the speed of execution of a STM instruction?
To begin with on ARM1, we didn't have MUL! MUL is usually fine, but many cases can be done quicker using shifts (power of two) or shifts and adds (as your example). Have a gander at Pete's book, section 3.7. https://www.chiark.greenend.org.uk/~the ... kerell.pdf
Miserable old curmudgeon who still likes a bit of an ARM wrestle now and then. Pi 4, 3, ARMX6, SA Risc PC, A540, A440
Re: ARM instruction timings
https://gab.wallawalla.edu/~curt.nelson ... dix_B3.pdf partly answers this, although it's for modern ARM CPUs. I'm guessing the pipeline and everything else just stalls while a slow instruction is executed. Still not sure about multiple registers with an LDM/STM - but I'm really interested by the note suggesting that instructions which NOP themselves due to condition codes aren't guaranteed to take exactly one cycle. Again, this info might be wrong for ARM2-era machines.
Re: ARM instruction timings
That's a great pointer! Thanks for the book recommendation in another post - I've been skimming parts of it with the search function from time to time. I've got about five books open in various tabs along with other useful resources like an interactive immediate value checker
https://alisdair.mcdiarmid.org/arm-imme ... -encoding/
but there are only so many hours in the day to work this stuff out while also having a job, kids etc. Wish I was 16 again.
So far I have managed to write a program that can plot a dot in mode 9. It's slow going!
- NickLuvsRetro
Re: ARM instruction timings
Worth pointing out the stardot Discord server has a #programming channel which is also worth tapping into for advice on this kind of stuff.
Can be useful for quick regular chats on ARM specifics.
Re: ARM instruction timings
I investigated this a bit a few years ago and made a video about it showing my understanding of what's going on - I don't remember whether the video was any good, but it's here: https://youtu.be/59sO1BGYqWs
Here's a diagram from the video that shows what I observed and my understanding of it - in the video I talk through it a lot more of course. (Diagram not included.)
I think later CPUs had much more complex pipelines but it was fairly simple for ARM2. I believe instructions have a fetch cycle, a decode cycle, and then one or more execution cycles. The execution cycles execute in sequence, and one instruction executes fully before the next can start. However, the instruction fetch and decode are overlapped with execution and take priority over it: an instruction fetch needs access to memory, so it will delay any pending execution cycle that also needs memory access.
Regarding MUL, I'm not sure whether I showed it in this video, but it takes a variable number of cycles depending upon the arguments, due to the way the multiplier works. Something like 10-20 cycles I believe, from memory. So executing one or two constant-time instructions instead is a big win if one argument is fixed and only has a few bits set.
Re: ARM instruction timings
Thanks for the Discord hint, NickLuvsRetro! I found it with a quick Google: https://discord.gg/pRy44Wz
And gfoot - I'll watch your video. Thanks for the information. It does indeed look like MUL is slow. I'm using it in my simple point plotter to calculate x + (y*320) but perhaps that would be better as a combination of y*256 + y*64 (or 128 and 32 for MODE 9).
Re: ARM instruction timings
Yes, to multiply by a constant that has only two bits set, you can load a shifted copy of the value into a register, and then add another shifted copy of the value to it.
It looks like it was a different video where I saw the cost of MUL in cycles: https://youtu.be/s715Rv86KtA?t=279 This shows a slow one (17 cycles) and a faster one (9 cycles) as they happened to have different arguments. MOV followed by ADD would be just two cycles.
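As a quick sanity check of the shift-and-add idea, here is a minimal Python sketch (not ARM code, and not from the thread). The helper names `set_bit_positions` and `shift_add_multiply` are made up for illustration. It lists which bits are set in a constant, which tells you the shift amounts to use, and confirms that adding the shifted copies gives the same result as a real multiply.

```python
def set_bit_positions(constant: int) -> list[int]:
    """Return the positions of the set bits in a non-negative constant.

    Each position is a shift amount: 160 = (1 << 7) + (1 << 5).
    (Hypothetical helper, for illustration only.)
    """
    return [b for b in range(constant.bit_length()) if (constant >> b) & 1]


def shift_add_multiply(value: int, constant: int) -> int:
    """Multiply value by constant using only shifts and adds."""
    total = 0
    for shift in set_bit_positions(constant):
        total += value << shift  # one shifted copy per set bit
    return total


if __name__ == "__main__":
    for constant in (3, 160, 320):
        shifts = set_bit_positions(constant)
        # Check the identity over a range of plausible screen rows.
        ok = all(shift_add_multiply(y, constant) == y * constant
                 for y in range(256))
        print(constant, "-> shifts", shifts, "matches multiply:", ok)
```

A constant with two set bits needs two shifted copies, which is what the MOV plus ADD pair provides. Constants with many set bits need more instructions, so the saving over MUL shrinks.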
Re: ARM instruction timings
The difference isn't vast, but I ran my little bit of code 50,000 times from BASIC with a simple TIME=0 ... PRINT TIME wrapper around it and got the following -
MOV R3,#160 ; mode 9 width (4 bit mode)
MUL R1,R3,R1 ; y = 160 * y
and got 626 time units (milliseconds?)
MOV R3,R1,LSL #7 ; temp=y*128
ADD R1,R3,R1,LSL #5 ; y = temp + (y*32), so y = 160 * y
and got 617 - so enough of a difference it's worth doing, but MUL isn't fatally slow, in this specific case. Of course I don't know how MUL works - maybe it has early exits for cases like this.
- Rich Talbot-Watkins
Re: ARM instruction timings
If you're just CALLing that 50,000 times, I imagine that's not a fair test as the majority of the time you measure will just be the overhead of the CALL.
A fairer test would be to literally assemble that snippet 50,000 times and call it once, i.e.
Code: Select all
DIM code% 50000*8+4
P%=code%
FOR c%=1 TO 50000
[OPT 2
MOV R3,#160
MUL R1,R3,R1
]
NEXT
[OPT 2:MOV PC,R14:]
:
TIME=0:CALL code%:PRINT TIME
Last edited by Rich Talbot-Watkins on Wed Nov 22, 2023 4:11 pm, edited 1 time in total.
Reason: Corrected the code
Re: ARM instruction timings
That's a great tip - thanks.
Re: ARM instruction timings
Note that you'll need the P% initialisation before the FOR loop for that to work.
- Rich Talbot-Watkins
Re: ARM instruction timings
Ugh yeah. Thanks! (going to correct that snippet)
So used to setting P% inside a loop.
Re: ARM instruction timings
For the curious, the printed TIME values were 4 for the MUL approach and 2 for the shifts.
- Rich Talbot-Watkins
Re: ARM instruction timings
Basically MUL can take up to 17 cycles to execute, depending on how big the second operand is. It keeps shifting out the bottom two bits each cycle until it's zero. So, in your case, when that operand is 160, I would expect it to take five cycles to execute: 1 to fetch the opcode, and 4 to perform the multiplication.
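The description above can be turned into a rough Python model. This is only a sketch of the rule as stated in this post (two bits shifted out per cycle until the operand is zero, plus one fetch cycle), not a verified model of the real multiplier. The function names are made up for illustration.

```python
def mul_execute_cycles(operand: int) -> int:
    """Count cycles under the rule described in the post:
    shift the operand right by two bits per cycle until it is zero.

    Assumption: the operand is treated as an unsigned 32-bit value.
    An operand of 0 gives 0 here; the thread does not say what the
    real chip does in that case.
    """
    operand &= 0xFFFFFFFF
    cycles = 0
    while operand:
        operand >>= 2
        cycles += 1
    return cycles


def mul_total_cycles(operand: int) -> int:
    """Add the single fetch cycle mentioned in the post."""
    return 1 + mul_execute_cycles(operand)


if __name__ == "__main__":
    for operand in (3, 160, 0xFFFF, 0xFFFFFFFF):
        print(hex(operand), "->", mul_total_cycles(operand), "cycles")
```

Under this model an operand of 160 gives 4 execution cycles plus the fetch, and a full 32-bit operand gives 16 plus the fetch, which matches the figures of five and seventeen quoted above.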
- Rich Talbot-Watkins
Re: ARM instruction timings
Maybe try something like this, for a bit more precision:
Code: Select all
DIM code% 50000*8+16
P%=code%
[OPT 2
MOV R0,#32
.loop
]
FOR c%=1 TO 50000
[OPT 2
MOV R3,#160
MUL R1,R3,R1
]
NEXT
[OPT 2
SUBS R0,R0,#1
BNE loop
MOV PC,R14
]
:
TIME=0:CALL code%:PRINT TIME
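To turn a printed TIME value from a test like this into an approximate cycle count, something along these lines could be used. It is a back-of-envelope Python sketch, not part of the thread. TIME counts centiseconds; the clock speed is an assumption you have to supply for your own machine, and the function name `cycles_per_snippet` is made up. It ignores the SUBS/BNE loop overhead and any memory or video contention.

```python
def cycles_per_snippet(centiseconds: float, copies: int,
                       loops: int, clock_hz: float) -> float:
    """Estimate CPU cycles per assembled snippet.

    centiseconds: the value printed by TIME (1 unit = 1/100 s)
    copies:       how many times the snippet was assembled (50000 above)
    loops:        how many passes the outer loop makes (32 above)
    clock_hz:     CPU clock in Hz (an assumption; depends on the machine)
    """
    seconds = centiseconds / 100.0
    return seconds * clock_hz / (copies * loops)


if __name__ == "__main__":
    # Made-up example figures, purely to show the arithmetic:
    # a run of 100 centiseconds on an assumed 8 MHz CPU.
    estimate = cycles_per_snippet(100, copies=50000, loops=32,
                                  clock_hz=8_000_000)
    print("approx cycles per snippet:", estimate)
```

The reason the extra loop helps is timer resolution: with readings as small as 4 and 2, a single tick either way is a large fraction of the result, whereas 32 passes make the reading roughly 32 times larger for the same one-tick uncertainty.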
- IanJeffray