Dr Clèm's Blog

Tags: Coding C++ Assembler Optimization GCC

Are ^ and != operators equivalent on unsigned char in C++?

Wednesday Oct 25, 2017 03:51, last edition on Wednesday Oct 25, 2017 03:57

I was trying to optimize a C++ code where I check if two unsigned char are different. So the first thing that I wrote was n1 != n2. unsigned char are an integer type, and an efficient way to perform this operation is an exclusive or using bitwise operators, which is ^ in C++. So I could simply change my code to n1 ^ n2. But I was wondering if there are equivalent or if one of them is more efficient. I wrote the simplest functions for each operation in separate files.

% cat neq.cpp
bool f(unsigned char n1, unsigned char n2)
{
 return n1 != n2;
}
% cat xor.cpp
bool f(unsigned char n1, unsigned char n2)
{
 return n1 ^ n2;
}
Then, I compiled them to produce assembler code with the -S option of GCC
% g++ -S neq.cpp xor.cpp
For each source file .cpp, the corresponding assembler code is store in a .s file. If we compare the two file, they are equivalent, of course apart for the file name of the source code
% diff neq.s xor.s
1c1
< .file "neq.cpp"
---
> .file "xor.cpp"
Using ^ or != are strictly equivalent for unsigned char.

My initial goal was to optimize my code but there is nothing that we can do to improve it in C++. To perform optimizations, we will use the compiler. Let us take a look at the assembler code, where the first line is skipped because irrelevant

.text .globl _Z1fhh .type _Z1fhh, @function _Z1fhh: .LFB0: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp .cfi_def_cfa_register 6 movl %edi, %edx movl %esi, %eax movb %dl, -4(%rbp) movb %al, -8(%rbp) movzbl -4(%rbp), %eax cmpb -8(%rbp), %al setne %al popq %rbp .cfi_def_cfa 7, 8 ret .cfi_endproc .LFE0: .size _Z1fhh, .-_Z1fhh .ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609" .section .note.GNU-stack,"",@progbits
Let us compile with different sets of flats, -O, or -O1 which is equivalent, will drastically reduce the number of instruction passed to the CPU. Indeed, the block between .cfi_startproc and .cfi_endproc will be reduced to 3 instructions, against 15 without flag.
.text .globl _Z1fhh .type _Z1fhh, @function _Z1fhh: .LFB0: .cfi_startproc cmpb %sil, %dil setne %al ret .cfi_endproc .LFE0: .size _Z1fhh, .-_Z1fhh .ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609" .section .note.GNU-stack,"",@progbits
-O2 can be used, -O3 and -Ofast will not affect the assembler code. In my case, the option -march (-march=bdver2) also affects the generate assembler code when combine with -O2. There is also -Os to optimize for size. Because I cannot judge which one is the best based on the assembler code, let us benchmark the operator with a dummy program
% cat main.cpp
int main()
{
 for(unsigned char n1 = 0; n1 < 255; ++n1)
 {
  for(unsigned char n2 = 0; n2 < 255; ++n2)
  {
   for(unsigned short int i = 0; i < 65535; ++i)
   {
    n1 ^ n2;
   }
  }
 }
 unsigned char n1 = 255;
 for(unsigned char n2 = 0; n2 < 255; ++n2)
 {
  for(unsigned short int i = 0; i < 65535; ++i)
  {
   n1 ^ n2;
  }
 }
}
Then, let us write a script to display the computation time in each case
% cat script.sh
#!/bin/bash

printf "g++ main.cpp; time ./a.out"
g++ main.cpp; time ./a.out
printf "\n"
printf "g++ -O1 main.cpp -o o1; time ./o1"
g++ -O1 main.cpp -o o1; time ./o1
printf "\n"
printf "g++ -O2 main.cpp -o o2; time ./o2"
g++ -O2 main.cpp -o o2; time ./o2
printf "\n"
printf "g++ -O2 -march=bdver2 main.cpp -o march; time ./march"
g++ -O2 -march=bdver2 main.cpp -o march; time ./march
printf "\n"
printf "g++ -Os main.cpp -o os; time ./os"
g++ -Os main.cpp -o os; time ./os
We now makethis script executable
% chmod +x script.sh
and we execute it
% ./script.sh g++ main.cpp; time ./a.out real 0m11.138s user 0m11.138s sys 0m0.000s g++ -O1 main.cpp -o o1; time ./o1 real 0m2.649s user 0m2.645s sys 0m0.004s g++ -O2 main.cpp -o o2; time ./o2 real 0m0.002s user 0m0.001s sys 0m0.000s g++ -O2 -march=bdver2 main.cpp -o march; time ./march real 0m0.002s user 0m0.001s sys 0m0.000s g++ -Os main.cpp -o os; time ./os real 0m0.002s user 0m0.001s sys 0m0.000s
With no doubt, the flags optimizes the code. But we don't have enough data to distinguish between -O2, -O2 -march=bdver2 and -Os, so let us change the code to deasable some optimizations closer to real life case and perform it only on these three last cases because the first two cases might take a very long time.
 % cat main2.cpp
#include <iostream> // std::cout
#include <ostream> // std::endl

bool f(const unsigned char & n1, const unsigned char & n2)
{
 return n1 ^ n2;
}

int main()
{
 bool * tmp = new bool [256];
 for(unsigned short int i = 0; i < 256; ++i)
 {
  tmp[i] = false;
 }
 for(unsigned char n1 = 0; n1 < 255; ++n1)
 {
  for(unsigned char n2 = 0; n2 < 255; ++n2)
  {
   {
    for(unsigned short int i = 0; i < 65535; ++i)
    {
     tmp[n1] += f(n1, n2);
    }
   }
  }
 }
 unsigned char n1 = 255;
 for(unsigned char n2 = 0; n2 < 255; ++n2)
 {
  {
   for(unsigned short int i = 0; i < 65535; ++i)
   {
    tmp[n1] += f(n1, n2);
   }
  }
 }
 std::cout<  return 0;
}
Here the new version of the script to test only the relevent flags
% cat script2.sh
#!/bin/bash

printf "g++ -O2 main2.cpp -o o2; time ./o2"
g++ -O2 main2.cpp -o o2; time ./o2
printf "\n"
printf "g++ -O2 -march=bdver2 main2.cpp -o march; time ./march"
g++ -O2 -march=bdver2 main2.cpp -o march; time ./march
printf "\n"
printf "g++ -Os main2.cpp -o os; time ./os"
g++ -Os main2.cpp -o os; time ./os
Now, let us execute this script
% ./script2.sh g++ -O2 main2.cpp -o o2; time ./o2 0x13cfc20 real 0m3.803s user 0m3.803s sys 0m0.001s g++ -O2 -march=bdver2 main2.cpp -o march; time ./march 0x16d0c20 real 0m3.843s user 0m3.841s sys 0m0.000s g++ -Os main2.cpp -o os; time ./os 0x1cc1c20 real 0m4.962s user 0m4.953s sys 0m0.008s
Running several times gives similar results. -Os made a code a little bit slower than the ones produced by -O2 and -O2 -march=bdver2. Notice that forcing inlining of f improves the computation time for -Os and give similar results than for the other flags.

About 50% of the time-O2 is faster -O2 -march=bdver2, but the relative difference is less than 1%. There is no clear gain from -march=bdver2 in this case.

.section .text.unlikely,"ax",@progbits .LCOLDB0: .text .LHOTB0: .p2align 4,,15 .globl _Z1fhh .type _Z1fhh, @function _Z1fhh: .LFB0: .cfi_startproc cmpb %sil, %dil setne %al ret .cfi_endproc .LFE0: .size _Z1fhh, .-_Z1fhh .section .text.unlikely .LCOLDE0: .text .LHOTE0: .ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609" .section .note.GNU-stack,"",@progbits
Here the difference in the f function
% diff o2.s march.s
6c6,7
< .p2align 4,,15
---
> .p2align 4,,10
> .p2align 3
To conclude, we first showed that ^ and != are strictly equivalent for unsigned char, then that the maximum of optimization for this operator is achived with -O2 and finally that -march=bdver2 does not provide noticeable improvement. In a real life code, other optimizations can make -Os performing better than -O2 and other flags, such as -O3, might affect the executable.

Mastodon Follow me Mastodon Share
Comments
There is no comment yet.
Post a comment

* required field.

Your comment

*

About you

*

*

Dr Clément Février

Bonjour, Je suis Clément Février, docteur en physique théorique de l’université de Grenoble Alpes, ingénieur Recherche et Développement dans le domaine de l’imagerie médicale et de la chirurgie mini-invasive chez Surgivisio et soutien du mouvement La France Insoumise.


Dites les velotaffeur, vous avez entendu parler d'un casque qui a un feu avant, un feu arrière et des clignotants sur les cotés / avant arrière ? (Enfin un simple bandeau led de chaque coté se gère hein)
Genre un truc pour que les voitures voient qu'on tourne à droite ou gauche sans qu'on ai besoin de tendre le bras ?

Parce que bon, j'ai testé ce soir, passer sur les marquages au sol + les rails du tramway avec un bras levé sous la pluie, niveau stabilité on repassera ^^'

J'arrive pas à trouver :/ et les solutions existantes de clignotants sont seulement pour l'arrière

Si le coeur vous en dit vous pouvez partager ;)

#velotaff #boostappréciés :)

The only thing that really matter is, at the end of the day, to choose the correct method automatically without conditional statement at runtime. I really wonder is a workaround is possible.
I already tried to add "std::tuple< d1*, d2 > b_tuple" to class b, using some forward declaration, but, of course, std::get< i >(b_tuple) cannot compile since it will be known at runtime only.

But I allow complete refactoring code, even the b and d1 & d2. Templates, new classes and sub classes and even are fine (I already put 100-lines define to add iterator to enum and another one to allow enum inheritance, and my coworkers begin to hate me ^^)
For example, method "create" can become a class, or class b can be construct using a macro to make somehow the base class aware of derived ones
BASE(b, ...) std::tuple< __VA_ARGS__ > g_ ## b; class b