Challenge #16

According to the challenge description this time will be harder. As always, here is the link for the original challenge. this time I couldn’t solve it completely, but I think the challenge it’s lacking information in the code. The instruction movdqa xmm1, xmmword ptr [rip + .LCPI0_0] reference to .LCPI0_0 which is not in the code. Yes I can assume that here we are copying a global variable, but what is it’s value. For that reason, I just partially solved this challenge, and my formal description of the program can be inaccurate.

Also if I’m missing something obvious, related to this .LCPI0_0 tag feel free to let me know.


Let’s break it into pieces. The code until commented tag # BB#1.

f:                                      # @f
# BB#0:
        xor     eax, eax
        test    rsi, rsi
        je      .LBB0_8
# BB#1:                                 #
        xor     ecx, ecx
        mov     rax, rsi
        and     rax, -4 ;; why -4?
        pxor    xmm0, xmm0
        pxor    xmm2, xmm2
        je      .LBB0_5

We start the program checking if rsi, which it’s a parameter supplied, it’s 0. If that’s the case we just return the program with eax = 0 from the xor eax, eax instruction.

Right after that we something quite tricky, if you noticed I added a note to the code that says why -4?. These instructions, the one on the last block of code, first set registers ecx, xmm0, and xmm2 to 0. Then we have an and operation between rsi and -4, in case this operation it’s 0 the program will jump to .LBB0_5 tag. The trick here it’s that this will be hold only when 0 <= rsi <= 3. This operation it’s also equivalent to (rsi >> 2) << 2, or rsi / 4 * 4 with integer operations. The reason for this, is that the representation of -4 as a 2 complement, it’s 0b1111111111111100. Then when we and with this representation, we are zeroing the two last bits of a number, which it’s equivalent to the previously mentioned. This is super smart, I mean with only one operation in assembly you achieve that, compilers do magic.

Main logic

The main logic of the program it’s in tag .LBB0_3, if you look at the first block of code you can spot right away a loop. Where the stop condition is if rax, it’s equal to the counter rcx. This counter increases by 4 on each iteration. Let’s analyze the whole loop.

.LBB0_3:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movdqa  xmm3, xmm2
        movdqa  xmm4, xmm0

        movzx   edx, word ptr [rdi + rcx]
        movd    xmm0, edx
        pinsrw  xmm0, edx, 0
        movzx   edx, dh
        pinsrw  xmm0, edx, 4

        movzx   edx, word ptr [rdi + rcx + 2]
        movd    xmm2, edx
        pinsrw  xmm2, edx, 0
        movzx   edx, dh
        pinsrw  xmm2, edx, 4

        pand    xmm0, xmm1
        pand    xmm2, xmm1
        paddq   xmm0, xmm4
        paddq   xmm2, xmm3

        add     rcx, 4
        cmp     rax, rcx
        jne     .LBB0_3

Ok, we have here an array in rdi. In the loop we are making two copies, each one of 2 bytes. For example, these are the copies on each iteration:

Iteration 0:
rdi[0] to xmm0, and rdi[2] to xmm2
Iteration 1:
rdi[4] to xmm0, and rdi[6] to xmm2
Iteration 2:
rdi[8] to xmm0, and rdi[10] to xmm2
Iteration 3:
rdi[12] to xmm0, and rdi[10] to xmm2
Iteration i:
rdi[4*i] to xmm0, and rdi[4*i+2] to xmm2

Now how many iterations we will have, depends on the value of (rsi >> 2). We got this value from instructions mov rax, rsi and and rax, -4, which it’s (rsi >> 2) << 2, we omit this last bitwise operation because rcx increase 4 by 4. Given that this program is not so big, we can try to guess the code in C that produce this.

int i = 0;
double a = 0, b = 0;
int end = (rax >> 2) << 2;

for (i = 0; i < end; i++) {
    a += rdi[4*i];
    b += rdi[4*i+2];

a += b;
if (rsi == end)
    return 2*a;

// put array into last position that we processed
rdi +=  end;
rsi -= end;

// process remaining values of array
while (rsi != 0) {
    a += *rdi;

return a;

This is how would look like the code bellow .LBB0_3 tag, including the part after the loop. This is basically computing the sums of value in the array, but why in this way? I mean, this code gives the same result as a simple loop over the array, and accumulating the sums in a variable.

I’m kind of lost on why, it’s implemented in this way. The formal description of the code would be then

Formal description

Computes the sum of values of an array, and return the result.

Necessary remarks

In this particular challenge I don’t feel quite confident with my answer, specially for the instruction movdqa xmm1, xmmword ptr [rip + .LCPI0_0]. The tag .LCPI0_0 is not defined in the code, but given the use of rip I can say this is a global variable. Also the code it’s so optimized, that instead of iteration over the array as a normal human, the compiler decided to iterate in this weird way. It’s all confusing, but if we assume the masking operations like pand xmm0, xmm1 and pand xmm2, xmm1 doesn’t change the value of xmm0 neither xmm2, then we can be confident in our formal description. Given that this is not the case, I cannot mark this challenge as completely solved.