# How to diagonalize a matrix of 4,722,366,482,869,645,213,696 elements? Part 2

###### Friday Nov 24, 2017 01:12

In the previous part, I talked about how one should be careful with the size of integer types. There are more considerations to take into account, in particular integral promotion. I was surprised to see that the return type of an arithmetic operation is not necessarily the type of its two operands. For example, in the following code

```cpp
unsigned char a;
unsigned char b;
auto c = a + b;
```
`c` is not an `unsigned char`. This is surprising, because one would expect that if both operands have the same type, no further conversion is needed. However, and Andrey Chubukov, whom I had as a teacher during a workshop, used to tell us that all great stories begin with "However" (notice that the converse does not hold). However,

"Arithmetic operators do not accept types smaller than `int` as arguments, and integral promotions are automatically applied after lvalue-to-rvalue conversion, if applicable."

cppreference.com

And indeed, one can check that `c` is an `int`.

###### A concluding word

Implicit conversions make the use of `auto` more complicated, because for types smaller than `int` the prototype is not

T T::operator+(const T2 & b) const;

but

int T::operator+(const int & b) const;

In our case, this can be problematic because memory consumption is the limiting factor.

##### Sparse symmetric matrix

Let us consider the antiferromagnetic Heisenberg model on one triangle at zero field. The matrix representation of the Hamiltonian is an 8×8 matrix.

###### Sparse matrix

Apart from the diagonal terms, only a few off-diagonal terms are non-zero: it is a sparse matrix. We can use this property to save memory by storing only the non-zero elements.

Let us consider that we need maximum machine precision, so all elements are `long double`. The size of the matrix is then 2^(2×Ns)×`sizeof(long double)`, with Ns the number of sites. There are 3 sites on a triangle, and a `long double` is 16 bytes, so this matrix takes 1 kB (1024 bytes).

With a sparse matrix, one only needs to store the non-zero elements and their positions. There are 8 diagonal terms, plus 6 on the upper triangle and 6 on the lower triangle, so 20 non-zero elements. We know that we will eventually deal with indices up to 2^72, so the index will be stored in a 16-byte integer type, called `index` in the following, that we will define later.
The size in memory is then 20×(`sizeof(long double)+sizeof(index)`) = 640 bytes. So we saved memory by a factor of 1.6.

Notice that in this case, the indices can be stored in `unsigned char` and the values in `signed char`, which are both 1 byte, so this matrix can be stored in 40 bytes.

One can also notice that many elements share the same value, so one can save more memory by storing each distinct value only once. In addition, many values do not need to be stored as `long double`: smaller floating-point types save even more memory. To achieve this, we can rewrite the Hamiltonian to favor `signed char` and then apply a function to restore the actual physical values.

Sparse matrices take advantage of the size. Indeed, the ratio between the memory consumption of the dense matrix and that of the sparse matrix increases with the size. In other words, a sparse matrix saves relatively more space as the size increases, e.g. the ratio is about 2.3 for Ns = 4.

###### Hermitian matrix

The Hamiltonian is a Hermitian operator. Thus, one only needs the diagonal and either the upper or the lower triangle to have complete knowledge of the matrix. Now, one computes only these elements and stores the non-zero ones, leading to a memory consumption of (8+6)×(`sizeof(long double)+sizeof(index)`) = 448 bytes, or 28 bytes with all optimizations, so about 2.3 times less memory.

##### Symmetries

The next step is to use symmetries to block-diagonalize the Hamiltonian. This is introduced in Section 4.4.1 Symmetries of my PhD thesis. The short version is that one can diagonalize each block independently, and thereby save more memory, because one works only in a subspace of the Hilbert space.

##### Algorithm

There are many algorithms to solve eigenproblems, but most of them need to know the matrix. So we want to focus on algorithms where the matrix can be implicit.

###### Power iteration

The power iteration is one of them. It is a very simple algorithm to implement, and probably the one using the least memory, because one only needs to store one vector. Assuming the elements are `double`, and applying the up/down symmetry, it takes 256 GB of memory. Applying any other symmetry, e.g. translations or the C2 rotation, will make it fit. The problem is that it converges to the greatest (in absolute value) eigenvalue. We are looking for the ground state, which is the smallest eigenvalue, but there is always a degenerate state of higher absolute eigenvalue.

###### Locally Optimal Block Preconditioned Conjugate Gradient

Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) is a pretty good candidate: one can directly compute the smallest eigenvalue. The first step is to minimize the Rayleigh quotient, which can be done by storing only a couple of floating-point variables. Then, one applies a simple iterative method to find the eigenvector. I have not tried this method yet, but it looks quite promising. My main concern, and I guess the reason why it is not used for very large matrices, is that one needs to minimize the eigenvalue over 2^Ns variables, in my case 68,719,476,736 independent variables. So it will definitely take a very long time to converge, but I will give it a try eventually.

###### Lanczos algorithm

The Lanczos algorithm is widely used in condensed matter physics to find the ground state of a system. It is a very fast algorithm, but one may need to store more information. I say may because, now that I am having a fresh look at the algorithm, it may be possible to store only one vector. I have to think about it.

In any case, a common way to write the routine, because it is very easy, is

1. Build the reduced sparse matrix (sparse matrix + Hermitian + symmetries)
2. Send the matrix to the routine
3. In the routine, at each iteration, apply the matrix to the Lanczos vector
This is what I did during my PhD thesis. Now, I want to use the fact that one only needs a function v ↦ H v, where H is the Hamiltonian. In this case, one does not need to store the Hamiltonian anymore. I have not yet finished rewriting my code, so I will come back to this algorithm later.

##### Random numbers

Lanczos algorithm begins by initializing a random vector.

###### Broken random numbers generators on Apple products
Two weeks ago, I wrote about bad practices in random number generation, where the actual good practices are only in the source code and barely documented. I wrote

"After 107374182350 iterations, the probabilities are still 0. I launched it on my laptop, a MacBook Pro from my lab (without Mac OS, but Ubuntu instead), but it does not have a hardware random generator. Maybe the issue comes from here. Latter, I will try on my server which have one."

rand()%range, or the wrong way to generate random numbers

It turned out that it always gave 0. The reason behind this is beyond the scope of this post, but I discovered after some web research that it is a well-known issue that Apple products have a zero probability for some numbers, e.g. the range [0, 7], in their implementation of what is supposed to be a uniform distribution. So, never use an Apple product to perform scientific computation involving `rand()`!

On a homemade computer, which has a true random generator based on stochastic processes, I obtain good results.

As you can see, I did not wait for the true random generator part to end: even in an environment with a fair level of entropy, it took a couple of days for these roughly 161,061,273,600 iterations, because the software spends all its time waiting for the hardware to accumulate enough entropy. In any case, the trend is already clear.
It proves that `rand()%range` is not uniform, shows that the random generators of the C++11 standard are indeed uniform, and demonstrates that MacBook Pros have a broken random number generator.

###### Lanczos vector

Because the range of the random numbers does not span the whole range of numbers that it is supposed to, the initial vector is de facto constrained to a subset of the Hilbert space. In some cases, maybe most of them, this effect will be cancelled by the iterative method, but it can also be too restrictive and stall the iterative process. In particular, it will be problematic on very small systems.

# How to diagonalize a matrix of 4,722,366,482,869,645,213,696 elements? Part 1

###### Introduction

I am trying to diagonalize a very large matrix (4,722,366,482,869,645,213,696 elements) in C++, and it leads to some issues. I noticed that I solved most of the issues that I encountered just by pretending to explain the problem to someone. So I decided that, from now on, I will share these notes. I will not explain everything, just the technical difficulties. I will give references for the rest, especially my PhD thesis.

Today, I will present one of the first technical issues that I faced. It is linked to the types of variables.

A simple example is the result of 2^N. In my code, I need to compute 2^36. The most straightforward way to achieve this is by using bitwise operations. In fact, 2^N can be written `1 << N`, for `N∈ℕ`. Indeed, the operator `<<` shifts all the bits to the left by the number given on the right-hand side. A few examples:
`1 << 0`: 1 is not changed, so `1 << 0` = 1₂ = 1₁₀ = 2^0.
`1 << 1` moves the 1 by one position, so `1 << 1` = 10₂ = 2₁₀ = 2^1.
`1 << 2` moves the 1 by two positions, so `1 << 2` = 100₂ = 4₁₀ = 2^2.
Here is the issue: for C++, `1` is an `int`, which means that it is at least 16 bits, but the actual number of bits is not specified and is implementation-dependent. Let us assume that `sizeof(int)` is equal to 4. This is the number of bytes, so it means that `int` is 32 bits (`sizeof(int)`×8).
Only the first 31 bits (`sizeof(int)`×8-1) represent a positive number, which means that one can only shift a bit up to 30 times (`sizeof(int)`×8-1-1) to the left and still get a positive number.
`1 << 30` = 01000000000000000000000000000000₂ = 1073741824₁₀ = 2^30
`1 << 31` = 10000000000000000000000000000000₂ = -2147483648₁₀ ≠ 2^31
`1 << 32` = 00000000000000000000000000000000₂ = 0₁₀ ≠ 2^32

If you try to compile a code with the last case, you will have a warning

main.cpp: In function ‘int main()’:
main.cpp:10:17: warning: left shift count >= width of type [-Wshift-count-overflow]
  auto val = 1<<32;
                ^

As you can see, `auto` does not become `long long int` but `int`, i.e. the type of the expression `1 << 32`.

###### What can we do?

The case `1 << 31` is pretty easy. We just need to explicitly make the operand an `unsigned int`. In this case, the last bit is not used for the sign, so `(unsigned int)1 << 31` = 2147483648 = 2^31.

What about `1 << 32`? Can I cast the result, i.e. write `(unsigned long long int)(1 << 32)`? According to the C++ standard, `unsigned long long int` is at least 64 bits, so one might think that this would do the job. Let us see what happens: `1 << 32` is computed first, as an `int`, so it yields `0`, which is then cast into an `unsigned long long int`. The result is then 0, so, no, it is not a workaround.
Notice that in this case, the compiler will not give you a warning, which is a bug, I think.

The solution is to first cast `1` into a type that holds enough bits for the bitwise operation, and then apply the operator. Let us do it step by step.
`1` is an `int`, so its binary representation is
00000000000000000000000000000001₂
Now, we add enough bits with the cast `((unsigned long long int) 1)`. It has the following binary representation:
0000000000000000000000000000000000000000000000000000000000000001₂
`((unsigned long long int) 1) << 32` moves this 1 by 32 positions:
0000000000000000000000000000000100000000000000000000000000000000₂ = 4294967296₁₀ = 2^32. We now have a safe 2^N for N∈[0, 63]. This is enough for 2^36, which was the first thing that I had difficulties with.

Notice that the largest value of a 64-bit unsigned integer in C++ is 2^64-1 = 18,446,744,073,709,551,615, which is smaller than the number of elements in my matrix, 4,722,366,482,869,645,213,696 = 2^72. We will see that this will be another problem to tackle.

###### What's next?

The main problem that I am facing is memory consumption. I am using the Lanczos algorithm to solve the eigenproblem. Right now, I am rewriting my code to use a very nice property of this method, which is the fact that the matrix can be defined implicitly.

Edit: The source code that I used for the tests. To compile

% g++ -std=gnu++14 main.cpp

###### main.cpp
```cpp
#include <string>    // std::string
#include <iostream>  // std::ostream, std::cout, std::endl
#include <typeinfo>  // typeid
#include <bitset>    // std::bitset
#include <array>     // std::array
#include <memory>    // std::shared_ptr
#include <stdexcept> // std::runtime_error
#include <cstdio>    // popen, pclose, fgets, feof

// Reviewed
// Run a shell command and capture its standard output
const std::string exec(const char * cmd)
{
 std::array< char, 128 > buffer;
 std::string result;
 std::shared_ptr< FILE > pipe(popen(cmd, "r"), pclose);
 if(!pipe) throw std::runtime_error("popen() failed!");
 while(!feof(pipe.get()))
 {
  if(fgets(buffer.data(), 128, pipe.get()) != nullptr)
  {
   result += buffer.data();
  }
 }
 return result;
}

// Reviewed
// Display the demangled type of a variable
// (no auto for std::ostream)
template< typename Name, typename Var >
void display_type(
 const Name & name,
 const Var & var,
 std::ostream & s = std::cout
)
{
 std::string command = "c++filt -t ";
 command += typeid(var).name();
 s<<"Type of \""<<name<<"\""<<std::endl;
 auto stype = exec(command.data());
 s<<stype<<std::endl;
}

int main()
{
 std::cout<<"* 1<<32"<<std::endl;
 auto val1 = 1<<32;
 display_type("1<<32", val1);
 std::cout<<"1<<32 in base 2";
 std::cout<<std::endl;
 unsigned char * binary1 = reinterpret_cast< decltype(binary1) >(&val1);
 for(auto i = (signed) sizeof(val1) - 1; i >= 0; --i)
 {
  std::cout<<std::bitset< 8 >(binary1[i]);
 }
 std::cout<<std::endl;
 std::cout<<"1<<32 in base 10";
 std::cout<<std::endl<<val1<<std::endl;

 std::cout<<std::endl;
 std::cout<<"* ((unsigned long long int) 1)<<32"<<std::endl;
 auto val2 = ((unsigned long long int) 1)<<32;
 display_type("((unsigned long long int) 1)<<32", val2);
 std::cout<<"((unsigned long long int) 1)<<32 in base 2";
 std::cout<<std::endl;
 unsigned char * binary2 = reinterpret_cast< decltype(binary2) >(&val2);
 for(auto i = (signed) sizeof(val2) - 1; i >= 0; --i)
 {
  std::cout<<std::bitset< 8 >(binary2[i]);
 }
 std::cout<<std::endl;
 std::cout<<"((unsigned long long int) 1)<<32 in base 10";
 std::cout<<std::endl<<val2<<std::endl;
 return 0;
}
```

# Are ^ and != operators equivalent on unsigned char in C++?

###### Wednesday Oct 25, 2017 03:51, last edition on Wednesday Oct 25, 2017 03:57

I was trying to optimize a C++ code where I check whether two `unsigned char` are different. So the first thing that I wrote was `n1 != n2`. `unsigned char` is an integer type, and an efficient way to perform this operation is an exclusive or, using the bitwise operator `^` in C++. So I could simply change my code to `n1 ^ n2`. But I was wondering whether they are equivalent, or if one of them is more efficient. I wrote the simplest function for each operation, in separate files.

% cat neq.cpp
bool f(unsigned char n1, unsigned char n2)
{
return n1 != n2;
}
% cat xor.cpp
bool f(unsigned char n1, unsigned char n2)
{
return n1 ^ n2;
}
Then, I compiled them to produce assembler code with the `-S` option of GCC
% g++ -S neq.cpp xor.cpp
For each source file .cpp, the corresponding assembler code is stored in a .s file. If we compare the two files, they are equivalent, apart of course from the file name of the source code
% diff neq.s xor.s
1c1
< .file "neq.cpp"
---
> .file "xor.cpp"
Using `^` or `!=` is strictly equivalent for `unsigned char`.

My initial goal was to optimize my code, but there is nothing more that we can do to improve it in C++. To perform optimizations, we will use the compiler. Let us take a look at the assembler code, where the first line is skipped because it is irrelevant.

.text
.globl _Z1fhh
.type _Z1fhh, @function
_Z1fhh:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl %edi, %edx
movl %esi, %eax
movb %dl, -4(%rbp)
movb %al, -8(%rbp)
movzbl -4(%rbp), %eax
cmpb -8(%rbp), %al
setne %al
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size _Z1fhh, .-_Z1fhh
.ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609"
.section .note.GNU-stack,"",@progbits
Let us compile with different sets of flags. -O, or -O1 which is equivalent, drastically reduces the number of instructions passed to the CPU: the block between .cfi_startproc and .cfi_endproc is reduced to 3 instructions, against 15 without any flag.
.text
.globl _Z1fhh
.type _Z1fhh, @function
_Z1fhh:
.LFB0:
.cfi_startproc
cmpb %sil, %dil
setne %al
ret
.cfi_endproc
.LFE0:
.size _Z1fhh, .-_Z1fhh
.ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609"
.section .note.GNU-stack,"",@progbits
`-O2` can be used; `-O3` and `-Ofast` do not affect the assembler code further. In my case, the option `-march` (`-march=bdver2`) also affects the generated assembler code when combined with `-O2`. There is also `-Os`, to optimize for size. Because I cannot judge which one is the best based on the assembler code, let us benchmark the operator with a dummy program
% cat main.cpp
int main()
{
for(unsigned char n1 = 0; n1 < 255; ++n1)
{
for(unsigned char n2 = 0; n2 < 255; ++n2)
{
for(unsigned short int i = 0; i < 65535; ++i)
{
n1 ^ n2;
}
}
}
unsigned char n1 = 255;
for(unsigned char n2 = 0; n2 < 255; ++n2)
{
for(unsigned short int i = 0; i < 65535; ++i)
{
n1 ^ n2;
}
}
}
Then, let us write a script to display the computation time in each case
% cat script.sh
#!/bin/bash

printf "g++ main.cpp; time ./a.out"
g++ main.cpp; time ./a.out
printf "\n"
printf "g++ -O1 main.cpp -o o1; time ./o1"
g++ -O1 main.cpp -o o1; time ./o1
printf "\n"
printf "g++ -O2 main.cpp -o o2; time ./o2"
g++ -O2 main.cpp -o o2; time ./o2
printf "\n"
printf "g++ -O2 -march=bdver2 main.cpp -o march; time ./march"
g++ -O2 -march=bdver2 main.cpp -o march; time ./march
printf "\n"
printf "g++ -Os main.cpp -o os; time ./os"
g++ -Os main.cpp -o os; time ./os
We now make this script executable
% chmod +x script.sh
and we execute it
% ./script.sh
g++ main.cpp; time ./a.out

real	0m11.138s
user	0m11.138s
sys	0m0.000s

g++ -O1 main.cpp -o o1; time ./o1

real	0m2.649s
user	0m2.645s
sys	0m0.004s

g++ -O2 main.cpp -o o2; time ./o2

real	0m0.002s
user	0m0.001s
sys	0m0.000s

g++ -O2 -march=bdver2 main.cpp -o march; time ./march

real	0m0.002s
user	0m0.001s
sys	0m0.000s

g++ -Os main.cpp -o os; time ./os

real	0m0.002s
user	0m0.001s
sys	0m0.000s
Without a doubt, the flags optimize the code. But we do not have enough data to distinguish between `-O2`, `-O2 -march=bdver2` and `-Os`, so let us change the code to disable some optimizations, closer to a real-life case, and test only these last three cases, because the first two might take a very long time.
% cat main2.cpp
#include <iostream> // std::cout
#include <ostream> // std::endl

bool f(const unsigned char & n1, const unsigned char & n2)
{
return n1 ^ n2;
}

int main()
{
bool * tmp = new bool[256];
for(unsigned short int i = 0; i < 256; ++i)
{
tmp[i] = false;
}
for(unsigned char n1 = 0; n1 < 255; ++n1)
{
for(unsigned char n2 = 0; n2 < 255; ++n2)
{
{
for(unsigned short int i = 0; i < 65535; ++i)
{
tmp[n1] += f(n1, n2);
}
}
}
}
unsigned char n1 = 255;
for(unsigned char n2 = 0; n2 < 255; ++n2)
{
{
for(unsigned short int i = 0; i < 65535; ++i)
{
tmp[n1] += f(n1, n2);
}
}
}
std::cout<<tmp<<std::endl;
return 0;
}
Here is the new version of the script, testing only the relevant flags
% cat script2.sh
#!/bin/bash

printf "g++ -O2 main2.cpp -o o2; time ./o2"
g++ -O2 main2.cpp -o o2; time ./o2
printf "\n"
printf "g++ -O2 -march=bdver2 main2.cpp -o march; time ./march"
g++ -O2 -march=bdver2 main2.cpp -o march; time ./march
printf "\n"
printf "g++ -Os main2.cpp -o os; time ./os"
g++ -Os main2.cpp -o os; time ./os
Now, let us execute this script
% ./script2.sh
g++ -O2 main2.cpp -o o2; time ./o2
0x13cfc20

real	0m3.803s
user	0m3.803s
sys	0m0.001s

g++ -O2 -march=bdver2 main2.cpp -o march; time ./march
0x16d0c20

real	0m3.843s
user	0m3.841s
sys	0m0.000s

g++ -Os main2.cpp -o os; time ./os
0x1cc1c20

real	0m4.962s
user	0m4.953s
sys	0m0.008s
Running it several times gives similar results. `-Os` produced code a little bit slower than the ones produced by `-O2` and `-O2 -march=bdver2`. Notice that forcing the inlining of `f` improves the computation time for `-Os` and gives results similar to the other flags.

About 50% of the time, `-O2` is faster than `-O2 -march=bdver2`, but the relative difference is less than 1%. There is no clear gain from `-march=bdver2` in this case.

Here is the assembler code generated with `-O2`:

.section .text.unlikely,"ax",@progbits
.LCOLDB0:
.text
.LHOTB0:
.p2align 4,,15
.globl _Z1fhh
.type _Z1fhh, @function
_Z1fhh:
.LFB0:
.cfi_startproc
cmpb %sil, %dil
setne %al
ret
.cfi_endproc
.LFE0:
.size _Z1fhh, .-_Z1fhh
.section .text.unlikely
.LCOLDE0:
.text
.LHOTE0:
.ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609"
.section .note.GNU-stack,"",@progbits
Here is the difference in the `f` function
% diff o2.s march.s
6c6,7
< .p2align 4,,15
---
> .p2align 4,,10
> .p2align 3
To conclude, we first showed that `^` and `!=` are strictly equivalent for `unsigned char`, then that the maximum optimization for this operator is achieved with `-O2`, and finally that `-march=bdver2` does not provide a noticeable improvement. In a real-life code, other optimizations can make `-Os` perform better than `-O2`, and other flags, such as `-O3`, might affect the executable.
