BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701025T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20190220T095400Z
LOCATION:Sydney Room
DTSTART;TZID=Europe/Stockholm:20180703T113000
DTEND;TZID=Europe/Stockholm:20180703T120000
UID:submissions.pasc-conference.org_PASC18_sess249_pap117@linklings.com
SUMMARY:A Massively Parallel Algorithm for the Approximate Calculation of
Inverse p-th Roots of Large Sparse Matrices
DESCRIPTION:Paper\nComputer Science and Applied Mathematics\n\nA Massively
Parallel Algorithm for the Approximate Calculation of Inverse p-th Roots
of Large Sparse Matrices\n\nLass, Mohr, Wiebeler, Kühne, Plessl\n\nWe pres
ent the *submatrix method*, a highly parallelizable method for the
approximate calculation of inverse *p*-th roots of large sparse sym
metric matrices which are required in different scientific applications. W
e follow the idea of Approximate Computing, allowing imprecision in the fi
nal result in order to be able to utilize the sparsity of the input matrix
and to allow massively parallel execution. For an *n*×*n* mat
rix, the proposed algorithm allows distributing the calculations
over *n* nodes with little communication overhead. The approxi
mate result matrix exhibits the same sparsity pattern as the input matrix,
allowing for efficient reuse of allocated data structures. We evaluate th
e algorithm with respect to the error that it introduces into calculated r
esults, as well as its performance and scalability. We demonstrate that th
e error is relatively limited for well-conditioned matrices and that resul
ts are still valuable for error-resilient applications like preconditionin
g even for ill-conditioned matrices. We discuss the execution time and sca
ling of the algorithm on a theoretical level and present a distributed imp
lementation of the algorithm using MPI and OpenMP. We demonstrate the scal
ability of this implementation by running it on a high-performance compute
cluster comprised of 1024 CPU cores, showing a speedup of 665× comp
ared to single-threaded execution.\n\nFull paper: https://doi.o
rg/10.1145/3218176.3218231
END:VEVENT
END:VCALENDAR