Volume 7 - Issue 4
Parallel Improved Schnorr-Euchner Enumeration SE++ on Shared and Distributed Memory Systems, With and Without Extreme Pruning
- Fabio Correia
Darmstadt University of Technology, Darmstadt, Germany
fabio.lei.67@gmail.com
- Artur Mariano
Darmstadt University of Technology, Darmstadt, Germany
artur.mariano@sc.tu-darmstadt.de
- Alberto Proença
University of Minho, Braga, Portugal
aproenca@di.uminho.pt
- Christian Bischof
Darmstadt University of Technology, Darmstadt, Germany
christian.bischof@tu-darmstadt.de
- Erik Agrell
Chalmers University of Technology, Gothenburg, Sweden
agrell@chalmers.se
Keywords: Enumeration, Parallel, Shared Memory, Distributed Memory, OpenMP, MPI
Abstract
The security of lattice-based cryptography relies on the hardness of lattice problems, such as the Shortest Vector Problem (SVP) and the Closest Vector Problem (CVP). This paper presents two parallel implementations of SE++, with and without extreme pruning. SE++ is an enumeration-based CVP solver that can easily be adapted to solve the SVP. We improved the SVP version of SE++ with an optimization that avoids symmetric branches, which boosts its performance by 50%, and applied the extreme pruning technique to this improved version. Extreme pruning is the fastest way known to date to solve the SVP with enumeration: it solves the SVP for lattices in much higher dimensions in less time than implementations without extreme pruning. Our parallel implementation of SE++ with extreme pruning targets distributed memory multi-core CPU systems, while our SE++ without extreme pruning is designed for shared memory multi-core CPU systems. Both implementations address load balancing for optimal performance, using a master-slave mechanism in the distributed memory implementation and specific bounds for task creation in the shared memory implementation. The parallel implementation of SE++ without extreme pruning scales linearly for up to 8 threads and almost linearly for 16 threads. It also achieves super-linear speedups on some instances, because the workload may shrink when threads find shorter vectors earlier than the sequential implementation does. Tests with our Improved SE++ implementation showed that it outperforms the state-of-the-art implementation by 35% to 60%, while maintaining scalability similar to that of the SE++ implementation. Our parallel implementation of SE++ with extreme pruning achieves linear speedups for up to 8 (working) processes and speedups of up to 13x for 16 (working) processes.
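For illustration only, the following is a minimal sketch, not the authors' code, of the kind of bounded task creation the abstract refers to for the shared memory implementation: OpenMP tasks are spawned only near the root of the enumeration tree, while deeper levels recurse sequentially to limit task overhead. The function enumerate_node() and the constant TASK_DEPTH_BOUND are hypothetical names introduced here.

```c
/*
 * Minimal sketch (assumed, not the authors' implementation): bounded OpenMP
 * task creation in a recursive tree search. Tasks are only created up to a
 * fixed depth, balancing load without flooding the runtime with tiny tasks.
 */
#include <omp.h>
#include <stdio.h>

#define TASK_DEPTH_BOUND 3   /* hypothetical bound: spawn tasks only near the root */

static void enumerate_node(int depth, int max_depth)
{
    if (depth == max_depth) {
        /* leaf: a candidate lattice vector would be evaluated here */
        return;
    }
    for (int child = 0; child < 2; child++) {
        if (depth < TASK_DEPTH_BOUND) {
            /* near the root: create a task so idle threads can pick up work */
            #pragma omp task firstprivate(child)
            enumerate_node(depth + 1, max_depth);
        } else {
            /* deeper levels: recurse sequentially to avoid task overhead */
            enumerate_node(depth + 1, max_depth);
        }
    }
    #pragma omp taskwait   /* wait for the subtree rooted here to finish */
}

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp single          /* one thread starts the recursion */
        enumerate_node(0, 20);
    }
    printf("enumeration finished\n");
    return 0;
}
```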