Volume 7 - Issue 4
Parallel Improved Schnorr-Euchner Enumeration SE++ on Shared and Distributed Memory Systems, With and Without Extreme Pruning
- Fabio Correia
Darmstadt University of Technology, Darmstadt, Germany
fabio.lei.67@gmail.com
- Artur Mariano
Darmstadt University of Technology, Darmstadt, Germany
artur.mariano@sc.tu-darmstadt.de
- Alberto Proença
University of Minho, Braga, Portugal
aproenca@di.uminho.pt
- Christian Bischof
Darmstadt University of Technology, Darmstadt, Germany
christian.bischof@tu-darmstadt.de
- Erik Agrell
Chalmers University of Technology, Gothenburg, Sweden
agrell@chalmers.se
Keywords: Enumeration, Parallel, Shared Memory, Distributed Memory, OpenMP, MPI
Abstract
The security of lattice-based cryptography relies on the hardness of lattice problems, such as the Shortest Vector Problem (SVP) and the Closest Vector Problem (CVP). This paper presents two parallel implementations of SE++, with and without extreme pruning. SE++ is an enumeration-based CVP solver that can easily be adapted to solve the SVP. We improved the SVP version of SE++ with an optimization that avoids symmetric branches, which boosts its performance by 50%, and applied the extreme pruning technique to this improved version. Extreme pruning is the fastest way known to date to solve the SVP with enumeration: it solves the SVP for lattices in much higher dimensions in less time than implementations without extreme pruning. Our parallel implementation of SE++ with extreme pruning targets distributed memory multi-core CPU systems, while our SE++ without extreme pruning is designed for shared memory multi-core CPU systems. Both implementations address load balancing for optimal performance, using a master-slave mechanism in the distributed memory implementation and specific bounds for task creation in the shared memory implementation. The parallel implementation of SE++ without extreme pruning scales linearly for up to 8 threads and almost linearly for 16 threads. It also achieves super-linear speedups on some instances, because the workload may shrink when threads find shorter vectors earlier than the sequential implementation does. Tests with our Improved SE++ implementation showed that it outperforms the state-of-the-art implementation by 35% to 60%, while maintaining scalability similar to that of the SE++ implementation. Our parallel implementation of SE++ with extreme pruning achieves linear speedups for up to 8 (working) processes and speedups of up to 13x for 16 (working) processes.
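For illustration only, the following is a minimal sketch, not the authors' code, of the kind of bounded task creation the abstract refers to for the shared memory implementation: OpenMP tasks are spawned only near the root of the enumeration tree, while deeper levels recurse sequentially to limit task overhead. The function enumerate_node() and the constant TASK_DEPTH_BOUND are hypothetical names introduced here.

```c
/*
 * Minimal sketch (assumed, not the authors' implementation): bounded OpenMP
 * task creation in a recursive tree search. Tasks are only created up to a
 * fixed depth, balancing load without flooding the runtime with tiny tasks.
 */
#include <omp.h>
#include <stdio.h>

#define TASK_DEPTH_BOUND 3   /* hypothetical bound: spawn tasks only near the root */

static void enumerate_node(int depth, int max_depth)
{
    if (depth == max_depth) {
        /* leaf: a candidate lattice vector would be evaluated here */
        return;
    }
    for (int child = 0; child < 2; child++) {
        if (depth < TASK_DEPTH_BOUND) {
            /* near the root: create a task so idle threads can pick up work */
            #pragma omp task firstprivate(child)
            enumerate_node(depth + 1, max_depth);
        } else {
            /* deeper levels: recurse sequentially to avoid task overhead */
            enumerate_node(depth + 1, max_depth);
        }
    }
    #pragma omp taskwait   /* wait for the subtree rooted here to finish */
}

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp single          /* one thread starts the recursion */
        enumerate_node(0, 20);
    }
    printf("enumeration finished\n");
    return 0;
}
```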