Slide 1 TITLE
30s
I am Magnus Øverbø
My thesis work has been on the topic of Cryptanalysis of irregularly clocked
LFSR: using ARBP search on FPGA. And my supervisor for the thesis is Professor
Petrovic.
Slide 2 CONTENT
30s
The outline of the presentation is as follows.
First a background on the thesis, followed by a brief explanation of the theory
involved. Then I'll run a demo of the system which was created for the FPGA,
before going through the analysis of gathered data and the the conclusion of
this thesis.
Slide 3 THESIS
0100 - 2min
The problem we were trying to solve in this thesis was to obtain the correlation
metric used by the generalised correlation attack. Where the cipher system in
question employ irregularly clocked LFSR, specifically 0/1 clocking.
We wanted to ascertain whether an FPGA or CPU could perform an approximate
search of a long search word within a sequence twice its length. Which was
needed for this cryptanalysis application.
It was hypothesised that the FPGA would be faster, given a normalised clock
speed, even though past work has shown FPGA to be faster. This is because
previous work mainly focused on just the search algorithm with preloaded data,
whereas we were running a complete system for searching in relation to
cryptanalysis, that must handle logging, and surrounding functionality.
The main outcomes of this thesis was an implementation of unconstrained ARBP
search for CPU and FPGA to be further used in relation to the Generalised
Correlation attack.
An evaluation of the unconstrained ARBP search on FPGA and CPU, in relation to
cryptanalysis.
And it was also found that the GMP MPZ library is not the best option for use in
an ARBP search algorithms.
Slide 4 CRYPTANALYSIS
0300 - 4min
The cipher system on the right shows the cipher system and the irregularly
clocked LFSR which is decimated by the clocking LFSR, producing the keystream X'
that in return is used to produce ciphertext Z.
In order to perform cryptanalysis on the cipher system, and recover the initial
state of the two LFSRs used to produce the keystream, we utilise the generalised
correlation attack.
Which is because the GCA bases its correlation metric on the the Levenshtein
distance between two sequences of unequal length, whereas Siegenthaler utilises
Hamming distance that require sequences of equal length.
As the ciphertext and undecimated bit sequence are of unequal length, we must
find the Levenshtein distance between the sequences for correlation and further
cryptanalysis.
Petrovic proposed in a previous paper, to use an approximate row-wise
bit-parallel search algorithm on FPGA to obtain the LD as bit-parallel
algorithms are well suited for hardware implementation.
Especially given long search words as bit-parallelism allows for handling
logical operations of binary sequences in parallel.
The implemented system in this thesis obtains the LD between the search word (Z)
and the search text (X) for each of the (2^L)-1 possible initial states of the
decimated LFSR.
The LD is obtained through the unconstrained approximate row-wise bit-parallel
search, using the shift-AND algorithm by Wu and Manber. Where the approximation
is given by allowing a set number of errors to occur in the search.
As we see on the right. the shift-AND algorithm is based on an NFA, where a
table of search state values represent current matches with a given number of
errors.
The NFA represent three search state values, with 0, 1 and 2 errors allowed, and
the entire NFA is updated for each new bit being searched in the search text.
The transitions from one state to another can be:
horizontal, direct match
Vertical, insertion of a character into the search word
Diagonal, substitution of character in the search word
Diagonal, epsilon, deletion of a character in the search word
Each active state in the NFA represent a match or partial match.
When the final state (right-most-state) is active, we have a match with a given
number of errors in it.
Slide 5 FPGA DEMO
0700 - 5min
1 Start python script
./serial_logger.py 11 128 64 50
2 Upload FPGA design to board via Vivado
arbp_L011_M0128_K0064.bit
3 Showcase that the board has started with blinking LEDs
The LEDs mark the 16 most significant bits of the current LFSR state being searched
Verification of a deployed system is tricky on FPGA as it is a hardware system with limited output.
My verification methods were the LEDs, and the UART communication.
Everything else runs within the FPGA
4 Back to python script
What we see is all matches transferred from the FPGA over the UART serial connection
For each initial state of the decimated LFSR,
it generates the undecimated sequence
one bit at a time
For each bit it updates all search state values in the NFA
When all are updated it checks for a matches
If a match has occurred it transfers:
- The current initial state
- The current position in search text
- The lowest error level a match occurs at
In this script we plot this and show how the search transgress
5 Stop the script
6 Talk about the output
The output in this demo shows how the search transgress
It tags the status of the search whether a
- new min has occurred
- a rival of the current min has occurred
- or if it is greater than current min value
7 Restart script and let run
8 Hit switch for faster run on FPGA
9 Swap back to show how much farther the search has gotten without output
Slide 6 ANALYSIS
1200 - 3min
In our data set we have two subjects, the two implementations generated for CPU
and FPGA.
The subjects have three separate individual variables which directly affect the
processing time. The LFSRs feedback polynomial degree, which decides the number
of searches which must be performed. M, the search word length, which sets size
of our registers and length of the search text and affects the CPU
implementation quite severely. Lastly, K, the error threshold which depict the
number of errors allowed in the search and depict how many values must be
updated for each position in the search text.
Together they make up the base value Rops and is used to compare the tests and
evaluate the processing time. The behaviour of this base value is an exponential
growth within the polynomial degree if both M and K increase, but it has a
linear growth if K or M is kept constant
As we can see on the right, the processing times are plotted as separate sets
based on the subject and polynomial degree. With the search word length plotted
along the X-axis, with multiple values representing the different K-values
tested.
On the Y axis the total processing time is plotted in seconds.
As we see at a glance, The CPU grows at a much faster rate than the FPGA, which
processes a polynomial of degree 20 faster than CPU with a polynomial of degree
16.
We also see that FPGAs clock rate, when doubled halves the processing time.
Slide 7 FPGA
1500 - 2min
On the right we see the mean processing time for a single Rop for the FPGA,
which is the time required to update a single search state value. And is
calculated as a standard mean value.
The plotted data is with an FPGA clock rate of 50MHz.
As we see it shows an exponential decay, before reaching a consistent mean value
at M>=512b. The initial decay is because of short search words performs fewer
search state updates and surrounding operations causes larger mean time.
But at M>=512b it start to reach a consistent mean time, of about 80E-8
Which at 50MHz means it uses four clock cycles to update a single search state
value. Giving us the search time estimation function of R_ops*4f.
Slide 8 CPU
1700 - 1min
On the right we see the same plot as for FPGA using the same mean time
calculation, however it has a linear growth. Which is directly caused by the CPU
having a fixed maximum register size of usually 64b on a standard CPU
Slide 9 CPU (COMP)
1800 - 1min
However in our case, because it was implemented by utilising the GMP MPZ library
it is fixed to 32b value. So to compensate for this issue where bit-parallel
operations are split into multiple operations, we must compensate the Rops value
by a factor of the number of splits.
This is given by the ceiling of the search word length, divided by the register
size.
After compensation we get the plot on the right, which shows the same decay as
FPGA due to the same issues.
Overall the data set is less consistent than the FPGA which is likely due to the
GMP MPZ causing overhead.
Even so, we see that the mean time is higher than the FPGA, with a mean time of
8E-8, whereas CPU have 0.4E-6
Slide 10 CONCLUSION
1900 - 1min
So to conclude, improvements could be made to both implementations.
FPGA could be parallelised more, better RAM implementation and redesign of data
transfer while the CPU could be implemented using 64b and a more suited library
eg GMP MPN
Even so, the FPGA is shown in this thesis to be between 13 and 173 times faster
than the CPU, without clock normalisation.
The overview of CPU vs FPGA is shown with and without clock normalisation, and
is much faster than the CPU while running at half speed.
The FPGA design implemented is capable of searching an LFSR with polynomial of
degree 32 within 24h, completing a search word of 2k with 1024 as the error
threshold. Given a cluster of 365 FPGAs running at 2.39GHz.
Even so, polynomial degrees greater than 32 becomes increasingly difficult to
search, but achievable. This is because an FPGA design can be realised in
hardware as actual circuitry.
Given enough circuits operating in parallel. even large searches could be
performed.
Slide 11 CITATIONS
0min