Challenge #16
According to the challenge description this time will be harder. As always, here is the link for the original challenge. this time I couldn’t solve it completely, but I think the challenge it’s lacking information in the code. The instruction movdqa xmm1, xmmword ptr [rip + .LCPI0_0]
reference to .LCPI0_0
which is not in the code. Yes I can assume that here we are copying a global variable, but what is it’s value. For that reason, I just partially solved this challenge, and my formal description of the program can be inaccurate.
Also if I’m missing something obvious, related to this .LCPI0_0 tag feel free to let me know.
Analysis
Let’s break it into pieces. The code until commented tag # BB#1.
f: # @f
# BB#0:
xor eax, eax
test rsi, rsi
je .LBB0_8
# BB#1: # %.lr.ph.preheader
xor ecx, ecx
mov rax, rsi
and rax, -4 ;; why -4?
pxor xmm0, xmm0
pxor xmm2, xmm2
je .LBB0_5
We start the program checking if rsi
, which it’s a parameter supplied, it’s 0. If that’s the case we just return the program with eax = 0
from the xor eax, eax
instruction.
Right after that we something quite tricky, if you noticed I added a note to the code that says why -4?. These instructions, the one on the last block of code, first set registers ecx, xmm0, and xmm2
to 0. Then we have an and operation between rsi
and -4
, in case this operation it’s 0 the program will jump to .LBB0_5 tag. The trick here it’s that this will be hold only when 0 <= rsi <= 3
. This operation it’s also equivalent to (rsi >> 2) << 2
, or rsi / 4 * 4
with integer operations. The reason for this, is that the representation of -4
as a 2 complement, it’s 0b1111111111111100
. Then when we and with this representation, we are zeroing the two last bits of a number, which it’s equivalent to the previously mentioned. This is super smart, I mean with only one operation in assembly you achieve that, compilers do magic.
Main logic
The main logic of the program it’s in tag .LBB0_3, if you look at the first block of code you can spot right away a loop. Where the stop condition is if rax
, it’s equal to the counter rcx
. This counter increases by 4 on each iteration. Let’s analyze the whole loop.
.LBB0_3: # %vector.body
# =>This Inner Loop Header: Depth=1
movdqa xmm3, xmm2
movdqa xmm4, xmm0
movzx edx, word ptr [rdi + rcx]
movd xmm0, edx
pinsrw xmm0, edx, 0
movzx edx, dh
pinsrw xmm0, edx, 4
movzx edx, word ptr [rdi + rcx + 2]
movd xmm2, edx
pinsrw xmm2, edx, 0
movzx edx, dh
pinsrw xmm2, edx, 4
pand xmm0, xmm1
pand xmm2, xmm1
paddq xmm0, xmm4
paddq xmm2, xmm3
add rcx, 4
cmp rax, rcx
jne .LBB0_3
Ok, we have here an array in rdi
. In the loop we are making two copies, each one of 2 bytes. For example, these are the copies on each iteration:
Iteration 0:
rdi[0] to xmm0, and rdi[2] to xmm2
Iteration 1:
rdi[4] to xmm0, and rdi[6] to xmm2
Iteration 2:
rdi[8] to xmm0, and rdi[10] to xmm2
Iteration 3:
rdi[12] to xmm0, and rdi[10] to xmm2
...
Iteration i:
rdi[4*i] to xmm0, and rdi[4*i+2] to xmm2
Now how many iterations we will have, depends on the value of (rsi >> 2)
. We got this value from instructions mov rax, rsi
and and rax, -4
, which it’s (rsi >> 2) << 2
, we omit this last bitwise operation because rcx
increase 4 by 4. Given that this program is not so big, we can try to guess the code in C that produce this.
int i = 0;
double a = 0, b = 0;
int end = (rax >> 2) << 2;
for (i = 0; i < end; i++) {
a += rdi[4*i];
b += rdi[4*i+2];
}
a += b;
if (rsi == end)
return 2*a;
// put array into last position that we processed
rdi += end;
rsi -= end;
// process remaining values of array
while (rsi != 0) {
a += *rdi;
rdi++;
}
return a;
This is how would look like the code bellow .LBB0_3 tag, including the part after the loop. This is basically computing the sums of value in the array, but why in this way? I mean, this code gives the same result as a simple loop over the array, and accumulating the sums in a variable.
I’m kind of lost on why, it’s implemented in this way. The formal description of the code would be then
Formal description
Computes the sum of values of an array, and return the result.
Necessary remarks
In this particular challenge I don’t feel quite confident with my answer, specially for the instruction movdqa xmm1, xmmword ptr [rip + .LCPI0_0]
. The tag .LCPI0_0
is not defined in the code, but given the use of rip
I can say this is a global variable. Also the code it’s so optimized, that instead of iteration over the array as a normal human, the compiler decided to iterate in this weird way. It’s all confusing, but if we assume the masking operations like pand xmm0, xmm1
and pand xmm2, xmm1
doesn’t change the value of xmm0
neither xmm2
, then we can be confident in our formal description. Given that this is not the case, I cannot mark this challenge as completely solved.