Dr Clèm's Blog

Tags: Coding C++ Assembler Optimization GCC

Are ^ and != operators equivalent on unsigned char in C++?

Wednesday Oct 25, 2017 03:51, last edition on Wednesday Oct 25, 2017 03:57

I was trying to optimize a C++ code where I check if two unsigned char are different. So the first thing that I wrote was n1 != n2. unsigned char are an integer type, and an efficient way to perform this operation is an exclusive or using bitwise operators, which is ^ in C++. So I could simply change my code to n1 ^ n2. But I was wondering if there are equivalent or if one of them is more efficient. I wrote the simplest functions for each operation in separate files.

% cat neq.cpp
bool f(unsigned char n1, unsigned char n2)
{
 return n1 != n2;
}
% cat xor.cpp
bool f(unsigned char n1, unsigned char n2)
{
 return n1 ^ n2;
}
Then, I compiled them to produce assembler code with the -S option of GCC
% g++ -S neq.cpp xor.cpp
For each source file .cpp, the corresponding assembler code is store in a .s file. If we compare the two file, they are equivalent, of course apart for the file name of the source code
% diff neq.s xor.s
1c1
< .file "neq.cpp"
---
> .file "xor.cpp"
Using ^ or != are strictly equivalent for unsigned char.

My initial goal was to optimize my code but there is nothing that we can do to improve it in C++. To perform optimizations, we will use the compiler. Let us take a look at the assembler code, where the first line is skipped because irrelevant

.text .globl _Z1fhh .type _Z1fhh, @function _Z1fhh: .LFB0: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp .cfi_def_cfa_register 6 movl %edi, %edx movl %esi, %eax movb %dl, -4(%rbp) movb %al, -8(%rbp) movzbl -4(%rbp), %eax cmpb -8(%rbp), %al setne %al popq %rbp .cfi_def_cfa 7, 8 ret .cfi_endproc .LFE0: .size _Z1fhh, .-_Z1fhh .ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609" .section .note.GNU-stack,"",@progbits
Let us compile with different sets of flats, -O, or -O1 which is equivalent, will drastically reduce the number of instruction passed to the CPU. Indeed, the block between .cfi_startproc and .cfi_endproc will be reduced to 3 instructions, against 15 without flag.
.text .globl _Z1fhh .type _Z1fhh, @function _Z1fhh: .LFB0: .cfi_startproc cmpb %sil, %dil setne %al ret .cfi_endproc .LFE0: .size _Z1fhh, .-_Z1fhh .ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609" .section .note.GNU-stack,"",@progbits
-O2 can be used, -O3 and -Ofast will not affect the assembler code. In my case, the option -march (-march=bdver2) also affects the generate assembler code when combine with -O2. There is also -Os to optimize for size. Because I cannot judge which one is the best based on the assembler code, let us benchmark the operator with a dummy program
% cat main.cpp
int main()
{
 for(unsigned char n1 = 0; n1 < 255; ++n1)
 {
  for(unsigned char n2 = 0; n2 < 255; ++n2)
  {
   for(unsigned short int i = 0; i < 65535; ++i)
   {
    n1 ^ n2;
   }
  }
 }
 unsigned char n1 = 255;
 for(unsigned char n2 = 0; n2 < 255; ++n2)
 {
  for(unsigned short int i = 0; i < 65535; ++i)
  {
   n1 ^ n2;
  }
 }
}
Then, let us write a script to display the computation time in each case
% cat script.sh
#!/bin/bash

printf "g++ main.cpp; time ./a.out"
g++ main.cpp; time ./a.out
printf "\n"
printf "g++ -O1 main.cpp -o o1; time ./o1"
g++ -O1 main.cpp -o o1; time ./o1
printf "\n"
printf "g++ -O2 main.cpp -o o2; time ./o2"
g++ -O2 main.cpp -o o2; time ./o2
printf "\n"
printf "g++ -O2 -march=bdver2 main.cpp -o march; time ./march"
g++ -O2 -march=bdver2 main.cpp -o march; time ./march
printf "\n"
printf "g++ -Os main.cpp -o os; time ./os"
g++ -Os main.cpp -o os; time ./os
We now makethis script executable
% chmod +x script.sh
and we execute it
% ./script.sh g++ main.cpp; time ./a.out real 0m11.138s user 0m11.138s sys 0m0.000s g++ -O1 main.cpp -o o1; time ./o1 real 0m2.649s user 0m2.645s sys 0m0.004s g++ -O2 main.cpp -o o2; time ./o2 real 0m0.002s user 0m0.001s sys 0m0.000s g++ -O2 -march=bdver2 main.cpp -o march; time ./march real 0m0.002s user 0m0.001s sys 0m0.000s g++ -Os main.cpp -o os; time ./os real 0m0.002s user 0m0.001s sys 0m0.000s
With no doubt, the flags optimizes the code. But we don't have enough data to distinguish between -O2, -O2 -march=bdver2 and -Os, so let us change the code to deasable some optimizations closer to real life case and perform it only on these three last cases because the first two cases might take a very long time.
 % cat main2.cpp
#include <iostream> // std::cout
#include <ostream> // std::endl

bool f(const unsigned char & n1, const unsigned char & n2)
{
 return n1 ^ n2;
}

int main()
{
 bool * tmp = new bool [256];
 for(unsigned short int i = 0; i < 256; ++i)
 {
  tmp[i] = false;
 }
 for(unsigned char n1 = 0; n1 < 255; ++n1)
 {
  for(unsigned char n2 = 0; n2 < 255; ++n2)
  {
   {
    for(unsigned short int i = 0; i < 65535; ++i)
    {
     tmp[n1] += f(n1, n2);
    }
   }
  }
 }
 unsigned char n1 = 255;
 for(unsigned char n2 = 0; n2 < 255; ++n2)
 {
  {
   for(unsigned short int i = 0; i < 65535; ++i)
   {
    tmp[n1] += f(n1, n2);
   }
  }
 }
 std::cout<  return 0;
}
Here the new version of the script to test only the relevent flags
% cat script2.sh
#!/bin/bash

printf "g++ -O2 main2.cpp -o o2; time ./o2"
g++ -O2 main2.cpp -o o2; time ./o2
printf "\n"
printf "g++ -O2 -march=bdver2 main2.cpp -o march; time ./march"
g++ -O2 -march=bdver2 main2.cpp -o march; time ./march
printf "\n"
printf "g++ -Os main2.cpp -o os; time ./os"
g++ -Os main2.cpp -o os; time ./os
Now, let us execute this script
% ./script2.sh g++ -O2 main2.cpp -o o2; time ./o2 0x13cfc20 real 0m3.803s user 0m3.803s sys 0m0.001s g++ -O2 -march=bdver2 main2.cpp -o march; time ./march 0x16d0c20 real 0m3.843s user 0m3.841s sys 0m0.000s g++ -Os main2.cpp -o os; time ./os 0x1cc1c20 real 0m4.962s user 0m4.953s sys 0m0.008s
Running several times gives similar results. -Os made a code a little bit slower than the ones produced by -O2 and -O2 -march=bdver2. Notice that forcing inlining of f improves the computation time for -Os and give similar results than for the other flags.

About 50% of the time-O2 is faster -O2 -march=bdver2, but the relative difference is less than 1%. There is no clear gain from -march=bdver2 in this case.

.section .text.unlikely,"ax",@progbits .LCOLDB0: .text .LHOTB0: .p2align 4,,15 .globl _Z1fhh .type _Z1fhh, @function _Z1fhh: .LFB0: .cfi_startproc cmpb %sil, %dil setne %al ret .cfi_endproc .LFE0: .size _Z1fhh, .-_Z1fhh .section .text.unlikely .LCOLDE0: .text .LHOTE0: .ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609" .section .note.GNU-stack,"",@progbits
Here the difference in the f function
% diff o2.s march.s
6c6,7
< .p2align 4,,15
---
> .p2align 4,,10
> .p2align 3
To conclude, we first showed that ^ and != are strictly equivalent for unsigned char, then that the maximum of optimization for this operator is achived with -O2 and finally that -march=bdver2 does not provide noticeable improvement. In a real life code, other optimizations can make -Os performing better than -O2 and other flags, such as -O3, might affect the executable.

Mastodon Follow me Mastodon Share
Comments
There is no comment yet.
Dr Clément Février

I am Dr Clément Février, French, living in Grenoble. I defended my PhD on July 4th, 2016. After my defense I run as deputy deputy (not a typo) for the national parliamentary elections in the 1st circonscription of Isère for the political movement La France Insoumise.