
# Parallel Longest Increasing Subsequences in Scalable Time and Memory

Peter Krusche and Alexander Tiskin

Department of Computer Science
University of Warwick, Coventry, CV4 7AL, UK

PPAM 2009

What is in this talk?

- Some discussion of asymptotic scalability measures for parallel algorithms.
- A simple algorithmic problem that is hard to parallelize scalably...
- ... and a scalable algorithm for it.

Outline

- Parallel Algorithms: The BSP Model; Asymptotic Scalability
- Longest Increasing Subsequences: The Problem; Sequential LIS; Permutation String Comparison
- Parallel LIS Computation: Previous Work; Our Algorithm
- Summary and Outlook

Machine Model Ingredients

- A BSP computer with p processors/cores/threads.
- External and per-processor memory.
- Superstep-style code execution.

Algorithm Ingredients: Sequential Algorithms

We have a problem of size n. We study...
- the total work W(n),
- the memory requirement M(n),
- the input/output size I(n).

We assume that the input and output are stored in the environment (e.g. external memory).

Algorithm Ingredients: Parallel Algorithms

Across all supersteps of the algorithm, we look at...
- the computation time W(n, p),
- the communication cost H(n, p),
- the local memory cost M(n, p).

How do these costs relate to scalability?

Classical Criterion: Work Optimality

An algorithm is work-optimal (w.r.t. a sequential algorithm) if

W(n, p) = O(W(n) / p).

Example (Matrix Multiplication): algorithms achieving W(n, p) = O(n^3 / p) are work-optimal w.r.t. the sequential O(n^3) method.

We have absolute work-optimality if Θ(W(n)) is a lower bound on the total work for the given problem and the given model.

Why Scalable Communication and Memory?

- Scalable memory allows the number of virtual threads to be increased until subproblems fit into caches.
- Scalable communication models algorithmic bus bandwidth sharing.
- Algorithms with scalable memory and communication can be simulated efficiently on more complex parallel machine models.

The Problem

Given a sequence of n numbers, find the longest subsequence that is increasing.

Example: 2, 9, 1, 3, 7, 5, 6, 4, 8 — one solution is 1, 3, 5, 6, 8; an alternate solution is 2, 3, 5, 6, 8.

Sequential Algorithms

The LIS can be found by patience sorting.

(see [Knuth:73, Aldous/Diaconis:99, Schensted:61]).

Another approach: LIS via permutation string comparison.

(see [Hunt/Szymanski:77]).

For both algorithms, W(n) = O(n log n) in the comparison-based model.
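The O(n log n) bound can be illustrated with the patience-sorting approach: for each possible subsequence length, keep the smallest tail value seen so far, and extend or tighten using binary search. A minimal sketch (the function name `lis_length` is ours, not from the talk):

```python
from bisect import bisect_left

def lis_length(seq):
    """Length of a longest increasing subsequence, patience-sorting style.

    tails[k] holds the smallest possible tail value of an increasing
    subsequence of length k + 1; each element either extends the longest
    subsequence found so far or tightens one of the tails.
    """
    tails = []
    for x in seq:
        i = bisect_left(tails, x)  # first tail >= x (strictly increasing LIS)
        if i == len(tails):
            tails.append(x)        # x extends the longest subsequence
        else:
            tails[i] = x           # x is a smaller tail for length i + 1
    return len(tails)

# The example sequence from this talk:
print(lis_length([2, 9, 1, 3, 7, 5, 6, 4, 8]))  # -> 5 (e.g. 1, 3, 5, 6, 8)
```

Each of the n elements costs one binary search over at most n tails, giving the O(n log n) comparison-based bound stated above.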

Permutation String Comparison

Definition (Input data): Let x = x_1 x_2 ... x_n and y = y_1 y_2 ... y_n be two permutation strings over an alphabet Σ.

Definition (Subsequence): A string u is a subsequence of x if u can be obtained by deleting zero or more elements from x.

Definition (Longest Common Subsequence): An LCS(x, y) is any string which is a subsequence of both x and y and has maximum possible length.

The length of such a sequence is denoted LLCS(x, y).

LIS via LCS

How can we compute the comparison-based LIS using an LCS computation?

1. Copy the sequence and sort it.
2. Compute the LCS of the sequence and its sorted copy.

LCS grid dags and highest-score matrices

Parallel LIS Algorithms

Garcia (2001): LIS by parallel dynamic programming (which, in fact, can solve a slightly more general problem than LIS).

Our Tool: Semi-local Sequence Comparison

Definition (Substring): A substring of a string x can be obtained by removing zero or more characters from the beginning and/or the end of x.

Definition (Highest-score matrix): The element A(i, j) of the LCS highest-score matrix of two strings x and y gives the LLCS of y_i ... y_j and x.

Definition (Semi-local LCS): Solutions to the semi-local LCS problem are given by a highest-score matrix A(i, j).

Critical Points

Definition (Critical Point): An odd half-integer point (i - 1/2, j + 1/2) is critical iff

A(i, j) + 1 = A(i - 1, j) = A(i, j + 1) = A(i - 1, j + 1).

Theorem

1. For permutation string inputs of length n, N = 2n such critical points are sufficient to implicitly represent the whole matrix [Schmidt:95, Alves+:06, Tiskin:05].
2. There is an algorithm to obtain these points in time O(n^1.5) [Tiskin:06].

Highest-score Matrices

Example (highest-score matrix for x = 4312 and y = 1234)

Given the implicit highest-score matrices DA and DB for two adjacent grid dag blocks, we compute the distribution matrix DC for the union of these blocks.

For each block in DC, only a subset of the nonzeros in DA and DB is relevant.

Divide-and-conquer Multiplication

- The smaller the blocks, the fewer nonzeros are relevant!
- All data for processing a block of size h can be stored in space O(h).
- We can partition a block and count its nonzeros in time O(h).

⇒ We get a divide-and-conquer algorithm for multiplying highest-score matrices of size N running in time O(N^1.5).

Parallel Multiplication

We first partition DC into p blocks and then finish the computation on these blocks independently [K, Tiskin:06].

To apply this algorithm to the LIS problem, we need to (min, +)-multiply work-optimally!
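For reference, the operation being parallelized is (min, +) matrix multiplication: C(i, j) = min over k of A(i, k) + B(k, j). A naive dense version (cubic time, on explicit matrices — unlike the implicit divide-and-conquer algorithm of the talk) looks like:

```python
def minplus_multiply(A, B):
    """Dense (min, +) product: C[i][j] = min over k of A[i][k] + B[k][j].

    This is the cubic-time reference definition; the algorithm in the talk
    instead exploits the implicit (critical-point) representation of
    highest-score matrices to multiply them in subquadratic time.
    """
    n, m, k_dim = len(A), len(B[0]), len(B)
    return [[min(A[i][k] + B[k][j] for k in range(k_dim)) for j in range(m)]
            for i in range(n)]

A = [[0, 2], [1, 0]]
B = [[0, 3], [1, 0]]
print(minplus_multiply(A, B))  # -> [[0, 2], [1, 0]]
```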

Work-optimal Multiplication

Lemma: We can locate a set of k nonzeros in a block of size h in time O(h √k).

Load Balancing

- Processors locate groups of at most N/p nonzeros each.
- There are at most p complete groups (N nonzeros overall).
- There is at most one incomplete group on each processor.

Work-optimal Multiplication

Lemma: We can locate all nonzeros using W(N, p) = O(N^1.5 / p).

Matches can be distributed in a scalable fashion using, e.g., sorting by regular sampling.
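Sorting by regular sampling can be sketched as follows: each processor sorts its local keys and contributes p regular samples; p - 1 splitters chosen from the sorted samples then define the buckets, which are provably balanced for distinct keys. This is a generic PSRS-style sketch under the assumption that every block holds at least p keys, not code from the talk.

```python
from bisect import bisect_right

def regular_sampling_buckets(local_blocks, p):
    """Partition keys across p 'processors' via regular sampling (PSRS-style).

    local_blocks: p lists of keys, one per processor (assumed distinct keys,
    at least p per block). Returns p buckets; concatenated, they form the
    globally sorted order, and each bucket is balanced to within a factor
    of about two of N/p.
    """
    # Phase 1: local sort, then p regular samples per processor.
    samples = []
    for block in local_blocks:
        block.sort()
        step = max(1, len(block) // p)
        samples.extend(block[::step][:p])
    # Phase 2: sort all p*p samples and pick p - 1 global splitters.
    samples.sort()
    splitters = [samples[(i + 1) * p] for i in range(p - 1)]
    # Phase 3: route every key to the bucket its splitter interval selects.
    buckets = [[] for _ in range(p)]
    for block in local_blocks:
        for key in block:
            buckets[bisect_right(splitters, key)].append(key)
    for b in buckets:
        b.sort()
    return buckets
```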

Discussion of Practicality

- Speedup over the sequential algorithm is possible if the input is distributed equally between the processors.
- This is interesting if we have big records to compare: in the comparison-based model, our sequence elements do not have to be numbers.
- Speedup for large sequences is limited because the sequential algorithm is asymptotically very fast.

Summary and Outlook

We have shown a scalable approach to computing longest increasing subsequences.

We are currently working on an improvement to the parallel multiplication algorithm:

⇒ We aim to reduce the work for multiplication to W(n, p) = O((n log n)/p).

This will bring us a step closer to achieving general work-optimality in a scalable fashion.

Thanks for listening!

Questions?
