«Detecting Program Execution Phases using Heuristic Search Omar Benomar, Houari Sahraoui, and Pierre Poulin Dept. I.R.O., Universit´ de Montr´al e e ...»
Detecting Program Execution Phases
using Heuristic Search
Omar Benomar, Houari Sahraoui, and Pierre Poulin
Dept. I.R.O., Universit´ de Montr´al
Abstract. Understanding a program from its execution traces is extremely diﬃcult because a trace consists of thousands to millions of
events, such as method calls, object creation and destruction, etc. Nonetheless, execution traces can provide valuable information, once abstracted
from their low-level events. We propose to identify feature-level phases based on events collected from traces of the program execution. We cast our approach in an optimization problem, searching through the dynamic information provided by the program’s execution traces to form a set of phases that minimizes coupling while maximizing cohesion. We applied and evaluated our search algorithms on diﬀerent execution scenarios of JHotDraw and Pooka.
Keywords: Execution phase, execution trace, dynamic analysis, genetic algorithm 1 Introduction There is a consensus today that program comprehension is a major challenge in software maintenance . However, understanding a program is a considerable resource-consuming activity . To address this issue, a growing programcomprehension research community actively develops techniques and tools to support maintenance. One such family of techniques deals with dynamic analysis, which helps in understanding behavioral aspects of the analyzed program.
Dynamic analysis shows to developers information from a diﬀerent perspective to better grasp how a program executes. This execution comprehension is crucial when analyzing a program because for many problems, it is more precise than static analysis, which relies on the source code. However, this comes at a much higher cost in complexity. Typically, a program execution produces an execution trace that records execution events such as method calls and returns, object creations and destructions, etc. Usually, an execution generates thousands to millions of such events. This is prohibitively too large for a developer to even just look at, in order to gain a better understanding of the program.
Fortunately, not all execution events need to be considered to grasp the dynamic behavior of the program. In fact, developers get a better idea of the execution when they get the “big picture” of the run-time information. For all these reasons, one should focus directly on useful parts of execution traces that relate to 2 Detecting Execution Phases using Heuristic Search system functionality in order to reduce such complexity. Ideally, this abstraction should be computed automatically by identifying in the execution, the phases that correspond to program functionalities.
Some approaches (e.g., ) have addressed the problem of phase identiﬁcation, but they lack scalability by considering executions containing only a few thousands of events. Another issue with existing approaches is the online processing, i.e., the program run-time information is processed on the ﬂy instead of being collected in ﬁles and treated oﬄine. While online processing beneﬁts from lower memory usage (for large execution traces), one can question the pertinence of detecting high-level phases from the developer’s perspective if the developer is performing the execution phases online [13, 16]. An additional limitation is that most high-level phase detection techniques make use of several parameters or thresholds in their algorithms, such as the estimated number of execution phases . The determination of appropriate values for these parameters can be diﬃcult and may vary between programs. Furthermore, the results can be very sensitive to the modiﬁcation of such parameters.
We propose an automatic approach for the detection of high-level execution phases from previously recorded execution traces, based on object lives, and without the speciﬁcation of parameters or thresholds. Our technique is simple and based on the heuristic that, to a certain extent, diﬀerent phases involve diﬀerent objects. We apply a metaheuristic to implement our heuristic. We utilize in particular a genetic algorithm to search for the best decomposition of an execution trace into high-level phases according to the objects triggered during the program execution. To the best of our knowledge, it is novel to use a searchbased approach for high-level phase detection. We used JHotDraw and Pooka as case studies for the evaluation of our approach, and identiﬁed execution phases on seven use-case scenarios of these two open source software systems.
The rest of the paper is organized as follows. Section 2 details our phaseidentiﬁcation technique and explains the underlying heuristic and the search algorithm. The settings and results of the evaluation appear in Section 3, and Section 4 discusses the achieved results and their impact. Section 5 introduces related work and contributions that are comparable to our work. Finally, Section 6 summarizes the content of this paper and exposes limitations of our approach and future research directions.
2 Phase Identiﬁcation
A program execution typically involves various functionalities. Knowing which part of the execution (phase) belongs to which functionality helps maintainers to focus on this part during their comprehension and maintenance tasks. However, there are no explicit links in a program (source code or execution) between functionalities and execution events. The goal of our work is to explore various heuristics that approximate this link. Before introducing these heuristics, we ﬁrst deﬁne the related concepts. Then we present the implementation of heuristics using a genetic algorithm to search for program execution phases.
Detecting Program Execution Phases using Heuristic Search 3
2.1 Basic Deﬁnitions
Event: An event is an action that occurs periodically during the execution, e.g., a method call, a method return, an object creation. It encapsulates information such as the class of the triggered method, the instance of the class, and the method name.
Object: An object is the instance of the class concerned by the execution event.
Objects are uniquely identiﬁed.
Object life: An object begins its life when it is created and the end of its life is approximated by its last appearance (reference) in the execution trace.
Trace: A trace is the sequence of all execution events representing the entire execution scenario.
Phase: A phase is a sequence of consecutive execution events; it is a portion of the trace.
Cut position: A cut position is the location of the ﬁrst event of a phase in the trace.
Phase identiﬁcation solution: A solution is a set of cut positions that decomposes the trace into phases.
2.2 Heuristics Our approach for phase identiﬁcation stands on assumptions concerning the activities of objects during their lifetime. Our rationale is based on the role of
objects during program execution and is outlined as follows:
– Two successive execution phases should not have many active objects in common, otherwise it could suggest that the program is still within the same phase.
– Not all objects active in a phase are representative of this phase. Such objects are more general and are indiﬀerently used during the execution. Other objects characterize the phase as they are only triggered during this particular phase.
– A phase does not begin between two successive method calls, between two successive method returns, or even between a method call and a method return.
A phase switch occurs only when the program’s execution exits a method and enters a method.
2.3 Detection Algorithm
We approach phase identiﬁcation as an optimization problem. We consider phases as subsets of the execution events contained in the execution trace. The phase detection problem then becomes one of determining the best decomposition of the execution trace’s events.
Considering that an execution trace contains n events (possibly in the order of hundreds of thousands), and that a particular execution contains any number l m of phases, the number of possible solutions is Ck, where 0 ≤ l ≤ (n/2) − 2, k = m−1 is the number phase shifts in the trace, and l is the number of positions 4 Detecting Execution Phases using Heuristic Search l in the execution trace where a phase shift can occur. The number Ck depends on the shape of the call tree. The wider the tree is, the larger l will be.
To understand the eﬀect on l, consider two extreme cases. First, a call tree with one branch coming out of the root node, this branch consisting of a series of method calls followed by a series of method returns; then l = 0 because there is no method return followed by a method entry in the entire trace. Second, a call tree with a depth of 1, and each method call is immediately followed by a method return; then l = (n/2) − 2 because there are as many potential phase shifting positions as the number of pairs of method return/call (in this order), minus the root method call (at the beginning of the trace) and the root method return (at the end of the trace).
The number k of phase switches during execution is the number of phases minus 1. For instance, an execution phase containing two phases has one phase switch. The problem of phase detection considers therefore the number of possible combinations of k from l. As mentioned before, the number of events n can be very large and despite the fact that l ≤ (n/2) − 2, l remains also large. This results in an exceedingly large search space to investigate exhaustively. Hence, we rely on a metaheuristic search, in our case a genetic algorithm, to ﬁnd a solution to our phase identiﬁcation problem.
The algorithm starts by creating an initial population of solutions, i.e., diﬀerent trace decompositions. In the ﬁrst generation, these solutions are generated randomly. Then in an iterative process, every iteration produces a new generation of solutions derived from the previous ones. For each iteration, we push the ﬁttest candidates to the next generation (elitism), and then we generate the rest of the solutions composing the next generation by combining/modifying existing solutions using crossover and/or mutation operators. The ﬁtness of a solution is computed using a function implementing the heuristics stated earlier.
The details about the main aspects of our algorithm are presented in the following subsections.
Solution Encoding A solution is a decomposition of the execution trace into diﬀerent chunks representing phases. Our algorithm searches for the best cut positions to decompose the execution trace into phases. These cut positions in the trace are events where a phase shift occurs. Figure 2 (left) schematizes two solutions: A and B. The rectangle represents the entire trace, that is divided in two phases in solution A with one cut position (dashed line), and in four phases in solution B with three cut positions. As all solutions are decompositions of the same execution trace, we simply represent a solution by a vector of integers containing the cut positions. Each cut position consists of a valid potential phase switching event position in the trace, according to heuristics in Section 2.2. This vector representation of our solution maps perfectly with the genomic representation solutions in genetic algorithms, where each vector position corresponds to a chromosome of our solution or phenotype. The vector size indicates the number of execution phases, and therefore, it can have diﬀerent sizes as we do not limit our search to a ﬁxed number of phases.
Detecting Program Execution Phases using Heuristic Search 5 Initial Population At the beginning of the search algorithm, we create an initial population of solutions, i.e., integer vectors containing phase cut positions.
In order to diversify the population’s individuals, we generate N solutions as
follows (in our experiments, N was set to 100):
1. We randomly choose cut positions in the trace, within the number of events in the trace. The positions must be valid phase shifting positions, i.e., a cut position is a method call (start of a phase) AND its preceding event must be a method return (end of a phase).
2. The random cut positions are sorted in ascending order because events are ordered in time, and phases are successive. Two equal cut positions are merged, and only one of them is conserved.
3. In order to vary the number of phases in an initial population of N individuals, we generate N/10 solutions with two phases, N/10 other solutions with three phases, and so on. In total, we produce solutions containing two to eleven phases. Again, our technique is not bound by a ﬁxed number of phases or even a subset of phases’ numbers, and therefore the number of phases can exceed eleven during the search.
Fitness Function The ﬁtness function is the most important element of our approach. It encodes our phase detection heuristics explained in Section 2.2, and thus, guides the search process. There are diﬀerent ways to translate the proposed heuristics into measurable properties that allow evaluating the quality of a phase detection solution. We propose three metrics that are combined to deﬁne ﬁtness functions for our problem.
1) Phase Coupling: Two successive phases are coupled if they share objects. An object is shared by phases if its lifetime covers them. Figure 1 illustrates eight diﬀerent cases that can occur when computing coupling over two phases. These cases diﬀer in the way object lifetimes are distributed over the two successive phases. Some of them are more desirable than others, and therefore, are given larger weights when computing the result. The latter is a linear combination of the number of objects per category. Here are the details and rationale behind our weight aﬀectation starting from the most desirable to the less desirable distribution of object lifetimes. We illustrate the diﬀerent cases using the examples of Figure 1. We refer to consecutive phases as the ﬁrst phase and the second phase (phase i and phase j in Figure 1).
First, objects that are included in one phase have a weight of 6, i.e., they are created and destroyed within each phase. This is the ideal case since each phase would involve a diﬀerent set of objects, e.g., Obj1 and Obj2.