Pools of AAMs: Towards Automatically Fitting any Face Image
Julien Peyras, Adrien Bartoli, Samir Khoualed
LASMEA, Clermont-Ferrand, France
Abstract
Fitting a single generic AAM to an unseen face (one that is not in the training set) under an arbitrary pose and expression is very difficult. The variability of the data is so high that the fitting process usually gets stuck in one of the numerous local minima. We show that a solution to this problem consists in separating the sources of variability. We build a pool of specialized AAMs, each trained over multiple identities all shown under the same pose and expression. We then retain the AAM that shows the smallest residual error when fitted to the input image. The fitting obtained in this manner is very accurate on unseen faces. The ultimate goal is to automatically train a person-specific AAM. In addition, the pool of specialized AAMs allows us to recognize the face pose and expression at each frame of a video with good performance. The proposed method has potential applications in Human-Computer Interaction and driver surveillance, to name but a few.
1 Introduction
The problem of face analysis in still images and videos has been extensively studied for years. This intense research activity is motivated by the large range of possible applications in the medical, psychological and linguistic fields (cognitive studies, expression transfer onto an avatar, etc.). Face analysis is a difficult topic since face images vary in identity, pose and expression. The sought-after model should automatically and reliably describe previously unseen faces under any pose and expression. We describe the two most promising approaches.
The first one is Bartlett et al.'s machine-learning-based expression analysis solution proposed in . Several classifiers are trained for face and eye detection, as well as for the presence and intensity of particular Action Units. These are the elementary deformations occurring on a face, as described by Ekman's Facial Action Coding System . This method is probably the best-performing one in the literature for expression analysis on unseen faces (faces that are not explicitly learnt by the classifiers). The method is not model-based, which makes it difficult to retrieve the shape and thus restricts the range of possible applications.
The second established approach is the Active Appearance Model (AAM) proposed by Cootes et al. . An ad hoc face AAM is trained on manually labeled images, so as to learn the shape and appearance bases. An optimization process is used to fit the AAM to an input image: the shape and appearance coefficients of the model are tuned until the model instance matches the input picture. Retrieving the face shape is important for many video post-processing systems. Obviously, the performance of such systems is directly related to the quality of the face shape description, i.e., the fitting accuracy is crucial.
As Gross et al.  first pointed out, it is important to distinguish between two situations, which lead to two different levels of achievable fitting accuracy:
• the person-specific context, where the fitted face has been explicitly learnt by the model. The fitting accuracy is usually very good in this context, and reliable enough for post-processing systems. In , Lucey et al. use person-specific AAMs to retrieve the face shape and successfully classify facial deformations into Action Units.
• the person-generic context, where the fitted face is not in the training set. As first shown by Gross et al. in , the fitting process is much harder than in the person-specific context. In , Peyras et al. showed with carefully chosen experiments that fitting an unseen face with an AAM is much less accurate than fitting a face that belongs to the set of images used to train the model. They explained the reason for this: in the generic context, the appearance counterpart of the model cannot fully explain the appearance of the face in the input image. As an unfortunate consequence, the minimum error of the cost function corresponds to a biased position of the model. Even when initialised at the best possible position (the ground-truth shape), the AAM drifts away.
The problem of fitting unseen faces is a cornerstone for a large number of applications. To date, no method has proven able to accurately fit previously unseen faces under a wide range of poses and expressions. AAMs appear to provide an interesting basis for tackling this problem. One could think that adding more training data would increase the ability of the model to generalize to unseen faces. Indeed, this ability does increase with the amount of training data. In practice, however, the higher complexity of the AAM makes its fitting unreliable, because it induces numerous local minima in the cost function. In other words, the model is so flexible that it 'explains' spurious non-face solutions in the image. As a consequence, the solution for reliable and accurate fitting must combine these two contradictory conditions:
• the complexity of an AAM must be kept as low as possible so as to preserve a large convergence basin and be able to ﬁnd the global cost minimum,
• the range of face images that the AAM can explain must be large, so that the global cost minimum matches the sought-after solution.
The first condition is satisfied by limiting the size of the training set, while the second one requires expanding it. To bypass this contradiction, we propose to separate the sources of variability within the training data. Instead of considering the face as an object that varies in identity, pose and expression, we see it as a collection of objects that vary in identity only: each object has a constant pose and expression. In this view, an AAM must model only one of the three sources of variability, identity, so as to fit a variety of unseen faces under the same pose and expression. We say that such an AAM is specialized to a particular pose and expression pair. To deal with many poses and facial deformations, we train a pool of specialized AAMs.
Contribution. We showed in  that fitting an unseen face with local models increases the generalization ability and the fitting accuracy compared to global models covering all facial features. The fitting bias is reduced to a point where the fitting accuracy on unseen faces is equivalent to the accuracy of manual labels. Following this insight, we design two categories of specialized AAMs that locally model the face: the upper AAMs, built to fit the eyes and eyebrows, and the lower AAMs, designed to fit the mouth. This also has the advantage of modeling separately the possible combinations of facial deformations. Our strategy consists in running all upper and lower AAMs on an input picture.
For each category we keep the AAM with the smallest residual error. This AAM is expected to be the one most accurately fitted to the face, and should represent the current pose and expression of that face. Consequently, we expect our method to automatically provide accurate labels on unseen faces under varying expression and pose, and also to correctly classify the pose and expression at any frame of a video. The process is admittedly slow and costly. This is often not a limitation: the long off-line training is performed only once, on a video of a person who frequently uses the device at hand. For example, personal-computer communication interfaces and car driver monitoring systems could be equipped with this technology. As two important contributions, we show that:
• good ﬁtting accuracy, good robustness to position perturbation and high classiﬁcation rates are obtained,
• the obtained labels can be used to automatically train a person-speciﬁc AAM, which is able to ﬁt the face and classify its expression in real-time.
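The winner-takes-all selection over the pool can be sketched as follows. The `select_winner` helper, the dictionary layout of a pool entry, and the mock residuals are illustrative assumptions, not the actual implementation (which runs 30 iterations of the Simultaneous Inverse Compositional algorithm per AAM):

```python
def select_winner(pool, image):
    """Fit every specialized AAM in the pool to the image and keep the
    one with the smallest residual error at convergence.

    Each pool entry is assumed to be a dict with a (pose, expression)
    label and a `fit` callable returning (shape, residual)."""
    best_label, best_shape, best_residual = None, None, float("inf")
    for aam in pool:
        shape, residual = aam["fit"](image)
        if residual < best_residual:
            best_label, best_shape, best_residual = aam["label"], shape, residual
    return best_label, best_shape, best_residual

# Toy pool with mock fits standing in for real AAM optimizations.
pool = [
    {"label": ("frontal", "smile"),   "fit": lambda img: (None, 4.2)},
    {"label": ("frontal", "neutral"), "fit": lambda img: (None, 1.3)},
    {"label": ("10deg",   "neutral"), "fit": lambda img: (None, 2.7)},
]
label, _, residual = select_winner(pool, image=None)
# The winning label classifies the pose and expression of the face.
```

The winning entry supplies both the fitted shape and, through its label, the pose and expression classification.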
Organization. Section 2 reviews the literature and introduces the AAMs. Section 3 presents the specialized AAMs and the pose and expression database we have used to perform our experiments. In section 4 we show experimental results on still images in a leave-one-identity-out fashion, and on a video where an unseen person displays a series of poses and expressions. We compare the performance of the specialized AAMs against a classical AAM learning all the data. Section 5 concludes and discusses our perspectives. The good fitting results of the specialized models will allow us to build a person-specific AAM for real-time tracking and pose and expression classification of the just-learnt person.
2.1 Previous Work
The concept of fitting several models is not new: Cootes et al. used one model for each face pose in . However, despite the advantages it presents, this solution was not pursued afterward.
The AAM is not the only face fitting solution in the literature; we review some others. Cristinacce et al. proposed a competitive template-matching solution called Constrained Local Models in , which was further studied by Wang et al. in . This solution exhibits better fitting results than AAMs. Note that these methods could serve as the specialized models in our framework. Indeed, pools would increase the discriminability between correct and wrong alignments, which is an important ability when aligning objects with a very high and complex range of variability.
The 3DMM (3D Morphable Model) presented by Vetter et al. in  can recover the 3D structure of a face from a single picture. This model is too heavy to be automatically and reliably fitted to faces under any pose and expression. Here too, the specialization of multiple 3DMMs could help improve the results.
2.2 Background on the AAM
An AAM combines two linear subspaces, one for the shape and one for the appearance. They are learnt from a labeled set of training images . A certain percentage of the total shape and appearance variance of the training set is kept. As a rule of thumb,  showed that keeping 60% of the shape variance and 100% of the appearance variance is 'optimal' in the person-generic context. We therefore keep 60% shape and 95% appearance variance, so as to keep the AAM size reasonable.
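The variance-retention rule can be made concrete with a small helper that counts how many leading PCA modes are needed to reach a given fraction of the total variance. The function name and the toy eigenvalue spectrum below are illustrative, not taken from the paper:

```python
import numpy as np

def n_modes_for_variance(eigenvalues, fraction):
    """Smallest number of leading PCA modes whose cumulative variance
    reaches `fraction` of the total variance."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(ev) / ev.sum()
    return int(np.searchsorted(cumulative, fraction) + 1)

# Toy eigenvalue spectrum (total variance = 9.0).
eigs = [5.0, 2.0, 1.0, 0.5, 0.3, 0.2]
k_shape = n_modes_for_variance(eigs, 0.60)  # modes kept for 60% shape variance
k_app = n_modes_for_variance(eigs, 0.95)    # modes kept for 95% appearance variance
```

Keeping fewer shape modes than appearance modes is consistent with the trade-off above: a small shape basis keeps the model rigid enough to avoid spurious minima, while a rich appearance basis lets it explain many identities.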
Fitting an AAM consists in finding the shape and appearance instances that make the residual error between the image and the synthesized model as small as possible. We use Baker and Matthews' optimization framework  with the Simultaneous Inverse Compositional algorithm.
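The residual minimized during fitting can be illustrated, for the appearance part only, as the norm of what the appearance subspace cannot reconstruct. This sketch uses a plain least-squares projection and leaves out the shape/warp update of the actual Simultaneous Inverse Compositional algorithm:

```python
import numpy as np

def appearance_residual(sampled_pixels, mean_app, app_basis):
    """Norm of the part of the (shape-normalized) sampled pixels that
    the linear appearance subspace cannot explain."""
    diff = sampled_pixels - mean_app
    coeffs, *_ = np.linalg.lstsq(app_basis, diff, rcond=None)
    return float(np.linalg.norm(diff - app_basis @ coeffs))

# Toy 3-pixel "image" and a 2-mode appearance basis spanning the
# first two pixel axes: only the third pixel is left unexplained.
basis = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])
res = appearance_residual(np.array([2.0, 3.0, 4.0]), np.zeros(3), basis)
```

This is the quantity compared across the pool: a well-matched specialized AAM leaves little of the face unexplained, so its residual at convergence is small.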
3 A Pool of Specialized AAMs
3.1 The Concept
In , both global and local models are specialized to the frontal pose and neutral expression. Since stuffing various poses and expressions into a single AAM spoils its fitting performance, we extend here the concept of the specialized AAM. The idea is to build a pool of AAMs, each specialized to a particular pose and expression pair. The whole pool would then encompass a continuum of poses and expressions.
Each specialized AAM is built over N different identities, giving the AAM a certain ability to explain unseen faces. Unfortunately, none of the publicly available face databases offers a large range of facial deformations under several head poses with homogeneous illumination. For this reason, we had to build our own pose and expression database, which we present in the rest of the section.
3.2 The Pose and Expression Database
Our current database has 15 identities taken under 3 views (frontal, 10◦ and 20◦ in azimuth) displaying 21 facial (upper or lower) deformations. We kept the illumination homogeneous. All pictures (63 per identity) were thoroughly labeled by hand to maximize the label accuracy. Taking pictures and labeling them represents about 3 hours of work per identity. The facial deformations we use are shown in figure 1. Figure 2 shows a sample of people from the database.
It is obvious that more people, more poses and more deformations could be included in the database, so as to fit more unseen people under a less restricted range of poses and expressions. However, one faces several difficulties:
• it is time-consuming and tedious to label images with the high accuracy that the present study requires,
Figure 1: Facial deformations represented in the database. The manually placed landmarks represent the vertices used for training or ﬁtting (for testing purposes). The deformation number is indicated on top of each of the thumbnails. Each deformation is meant to represent some Action Units or a particular combination of them .
Figure 2: 5 of the 15 identities of the database under all three poses (0◦, 10◦, 20◦), for deformation no. 5.
• the appearance and deformation of faces are wide-ranging. The set of people forming the database must capture this diversity, in both quality and quantity,
• the quality of the deformations is very important to prevent badly defined deformation classes and possible overlaps between them. The people composing the database should therefore be actors or have a particular talent for performing facial deformations on demand.
4 Evaluation and Tests
4.1 Leave-One-Identity-Out Test
The test consists in training a pool of specialized AAMs on N identities and fitting one of the remaining faces. In this way, the identity we fit is unknown to the AAMs. We perform this leave-one-identity-out test 15 times. N can be at most 14. For each test identity, 63 images (21 expressions under 3 poses) must be fitted with all upper or all lower specialized AAMs. For each image, we run all AAMs and keep as the winner the one with the smallest residual error at convergence, after 30 iterations. Our goal is to assess the three following points:
• the ﬁtting accuracy, i.e., the quality of each label position on the face at convergence: we measure it by comparison with manual labels taken as a reference,
• the basin of convergence, i.e., the ability to cope with perturbed initializations,
• the classiﬁcation rate, i.e., the frequency of correct correspondence between the pose and expression of the winning AAM and the true pose and expression.
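The evaluation protocol can be sketched as a generic leave-one-identity-out loop. The function and argument names are illustrative, and the training/fitting callables below are stubs standing in for the real AAM training and fit-and-select pipeline:

```python
def leave_one_identity_out(images_by_id, train_pool, fit_and_select):
    """For each identity, train the pool on all other identities, then
    fit every image of the held-out identity with that pool."""
    results = {}
    for held_out in images_by_id:
        train_ids = [i for i in images_by_id if i != held_out]
        pool = train_pool(train_ids)  # the pool never sees held_out
        results[held_out] = [fit_and_select(pool, img)
                             for img in images_by_id[held_out]]
    return results

# Stub pipeline: "training" just records which identities are in the
# pool, and "fitting" reports that list alongside the image.
images_by_id = {"A": ["a1"], "B": ["b1"], "C": ["c1"]}
results = leave_one_identity_out(
    images_by_id,
    train_pool=lambda ids: sorted(ids),
    fit_and_select=lambda pool, img: (img, pool),
)
```

In the actual experiment each of the 15 identities is held out in turn, its 63 images are fitted, and the winner's labels are compared against the manual ground truth to compute accuracy and classification rates.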