Hanoi University of Science and Technology

Master Thesis

Facial Image Forgery and Detection: Benchmark and Application

Pham Minh Tam
tam.pm202708M@sis.hust.edu.vn

School of Information and Communication Technology

Supervisor: Assoc. Prof. Huynh Quyet Thang
Supervisor's signature

Institution: School of Information and Communication Technology

Hanoi, 12/2021
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
Student
Signature and Name
I would also like to thank my family, my girlfriend and my friends for their wise counsel and sympathetic ear. You have always been there for me.

Finally, I would like to thank the Vingroup Innovation Foundation (VINIF) for its financial support.
Abstract

In recent years, visual forgery has reached a level of sophistication that humans cannot identify fraud, which poses a significant threat to information security. A wide range of malicious applications have emerged, such as fake news, defamation or blackmailing of celebrities, impersonation of politicians in political warfare, and the spreading of rumours to attract views. As a result, a rich body of visual forensic techniques has been proposed in an attempt to stop this dangerous trend. In this thesis, I introduce two new models, named Efficient-Frequency and WADD, to improve the results on the fake face image detection problem. I also present a benchmark that provides in-depth insights into visual forgery and visual forensics, using a comprehensive and empirical approach, and propose a novel end-to-end visual forensic framework that can incorporate different modalities to efficiently classify real and forged content. More specifically, we develop an independent framework that integrates state-of-the-art counterfeit generators and detectors, and measure the performance of these techniques using various criteria. We also perform an exhaustive analysis of the benchmarking results, to determine the characteristics of the methods, which serves as a comparative reference in this never-ending war between measures and countermeasures.
Student
Signature and Name
Contents

1 Introduction
   1.1 Context
   1.2 Research Problems
   1.3 Contributions
   1.4 Thesis Outline
   1.5 Selected Publications
2 Preliminaries and literature survey
   2.1 Image classification problem
   2.2 Visual forgery techniques
      2.2.1 Graphics-based techniques
      2.2.2 Feature-based techniques
   2.3 Visual forensics techniques
      2.3.1 Computer vision techniques
      2.3.2 Deep learning techniques
3 Proposed facial forgery detection models
   3.1 Efficient-Frequency model
   3.2 WADD (Wavelet Attention for Deepfake Detection) model
4 Proposed dual benchmarking framework for facial forgery and detection
   4.1 Framework
   4.2 Datasets
      4.2.1 Dual-benchmarking datasets (DBD)
      4.2.2 External datasets
   4.3 Measurements
   4.4 Experimental Procedures
   4.5 Reproducibility Environment
5 Experimental Results and Performance Analysis
   5.1 Efficiency comparison
   5.2 End-to-end comparison with existing datasets
   5.3 Dual-benchmarking comparison
      5.3.1 Forensic generalisation and forgery feature overlapping
      5.3.2 Qualitative study of forensic-forgery duel
      5.3.3 Influence of contrast
      5.3.4 Effects of brightness
      5.3.5 Robustness against noise
      5.3.6 Robustness against image resolution
      5.3.7 Influence of missing information
      5.3.8 Adaptivity to image compression
   5.4 Performance guidelines
List of Tables

1.1 Comparison Between Existing Benchmarks on Facial Forensics
2.1 Taxonomy of visual forgery techniques
4.1 Statistics of real datasets
5.1 Model size and detection speed
5.2 Statistics of the train, validation and test splits of existing datasets
5.3 Performance (Accuracy | Precision | Recall | F1-score) of visual forensic techniques on different datasets
5.4 Statistics of the train, validation and test splits of synthetic datasets
5.5 Performance of visual forensics techniques against visual forgery techniques
5.6 Performance guideline for visual forensics
List of Figures

2.1 A vanilla CNN
2.2 Convolution layer
2.3 Max and average pooling
2.4 Global average pooling
2.5 Flatten and fully connected layer
2.6 Activation functions
3.1 Overview of the Efficient-Frequency pipeline
3.2 EfficientNet architecture
3.3 WADD model
3.4 Wavelet pooling
3.5 Attention layer
4.1 Dual benchmarking framework
4.2 Size of facial forgery datasets
4.3 Images from the DBD dataset
5.1 Generalisation ability of forensic techniques
5.2 Overlapping features of forgery techniques
5.3 Suspicious regions of forged images
5.4 Effects of illumination factors
5.5 Robustness against noise
5.6 Robustness against image resolution
5.7 Influence of missing information
5.8 Adaptivity to image compression
Chapter 1

Introduction

1.1 Context

"Deepfake" refers to synthetic media in which a person in an existing image or video is replaced with someone else's likeness.
The term "fake images" or "forgery image" have emerged in the recent years, because therehave been a lot of tools helping change the content of images such as Photoshop software Althoughthese tools are so powerful, users must have a huge a mount of knowledge to use them As a results,the number of fake images is low and it takes a lot of resources to make a fake photo But now,rather than images simply being altered by editing software such as Photoshop or videos beingdeceptively edited, there’s a new breed of machine-made fakes – and they could eventually make itimpossible for us to tell fact from fiction With the development of deep learning techniques, thereare a lot of methods which can generate "fake images" quickly, easily and in bulk
"Deep fakes" are the most prominent form of what’s being called “synthetic media”: images,sound and video that appear to have been created through traditional means but that have, in fact,been constructed by complex software Deep fakes have been around for years and, even thoughtheir most common use to date has been transplanting the heads of celebrities onto the bodies ofactors in pornographic videos, they have the potential to create convincing footage of any persondoing anything, anywhere There are many app which use "Deep fake" technique to help peopleeasily make forgery content such as Zao or FaceApp software
The growth of social networks has made the spread of fake photos and videos rapid and widespread. These fake images and videos are distributed with false content, causing many economic and social problems. Therefore, many organizations, especially social networking companies such as Facebook and Twitter, are focusing on the research and development of systems to identify fake photos, so as to promptly prevent the spread of fake images and videos on the internet.
Given these urgent requirements, this thesis investigates methods to detect fake facial images (forensic methods) and ways to generate such fake facial images (forgery methods). Because this thesis focuses only on facial fake images, the terms "fake images" or "forged images" refer to images with changed faces, and the term "forensics" refers to methods used to solve the fake facial image detection problem. The results of this thesis could help social network administrators and communities to choose appropriate forensic methods to reduce the consequences of fake facial images.
1.2 Research Problems

The development of fake visual content, such as fake images and fake videos, has undergone rapid growth in recent years. Facial content has attracted particular interest, since the face plays an essential role in human communication and represents the identity of the person. Forged facial images and videos have reached such a high level of quality that even people with good vision, under ideal lighting conditions, cannot distinguish between real and forged content. This enables a wide range of malicious applications, such as counterfeit news generation, click-baits, impersonation and fraudulent transactions [32, 2].
Given the importance of fake content detection, visual forensics has emerged to detect fake images, especially in social media content. Fake facial image detection can be defined as follows: given an image x, we need to develop a detection model D such that:

y = D(x) = \begin{cases} 0, & \text{if } x \text{ is a real image} \\ 1, & \text{if } x \text{ is a fake facial image} \end{cases} \quad (1.1)
Fake facial image detection can be categorised into two paradigms. On the one hand, computer vision approaches rely on handcrafted features to detect anomalous patterns in visual content, including frequency-based techniques, visual artefacts, and techniques that examine head poses. On the other hand, deep learning approaches leverage deep neural networks (DNNs) to automatically extract hidden features that go beyond human perception.
However, recent advances in modern artificial intelligence have given rise to a new and evolving class of visual forgery techniques. These techniques exploit the power of AI to hide the digital footprints generated by the forgery process, and can trick even the latest forensic techniques. They can be divided into two categories: graphics-based and feature-based. The former often mixes and disguises fake artefacts with common ones to produce realistic forged content, even on commodity hardware. The latter relies on the power of DNNs to increase the level of realism of the forged content. Video forgery is more challenging than image forgery, since the generation of fake videos requires more fine-grained and precise features than in the image case.
In this continual war between visual forgery (measures) and visual forensics (countermeasures), existing techniques have rarely been evaluated using the same benchmarks. This is primarily due to the challenges in aligning the different settings. Countermeasures are often proposed for a previous forgery technique, and soon become obsolete when a new forgery measure appears. Consequently, the interpretation of performance results from forensic techniques is challenging, since baselines tend to change quickly over time. Moreover, visual forgery and visual forensics have not been subject to a fair comparison, as the reported performance is generally based on small datasets and a limited range of adverse conditions. The recent social and political damage from fake visual content requires a common ground to allow us to understand the timeline, compare variations, and keep pace with this war.
In this thesis, I report the first independent dual benchmarking study to evaluate visual forgery and forensics methods in a unified framework. Using this framework, a comprehensive performance comparison is conducted on a wide range of state-of-the-art forensic techniques, forgery baselines, and real-world datasets. To envision the future of the forgery/forensics war, I also conduct in-depth analyses of synthetic datasets to extract insights into the performance behaviour of the benchmarked methods. Based on these results, I propose several guidelines for the selection of appropriate visual forensic techniques for particular application settings. In addition, I propose two new models to improve performance on the fake image detection problem. Researchers and gatekeepers can use our generic framework to reduce the complexity of future benchmarking studies.
Table 1.1: Comparison Between Existing Benchmarks on Facial Forensics

Benchmarks | #Forensics | #Forgery | #Adversaries | Image | Video | Code | New Datasets
1.3 Contributions

The contributions of this thesis can be summarised as follows:
• New forgery datasets. I apply a range of forgery techniques to generate forged content, resulting in a sizeable collection of datasets that is then used to explore the ability of forensic techniques to handle malicious applications. Compared to existing datasets, our dataset covers a larger range of contents and forgery types (8 techniques in 3 types, resulting in 1,000,000 forged images and 21,095 videos), which enables a thorough investigation of the field.

• Reproducible dual benchmarking. I publish the first large-scale reproducible benchmarking framework¹ that can assist in a dual comparison of a wide range of forensic and forgery techniques. The framework is designed in a component-based architecture, which allows the direct integration of new forgery and forensics techniques besides the default ones. Our framework also provides an application layer, which aids the investigation of the effects of different imagery factors on forensic performance.
• Performance guideline. I present an exhaustive list of performance results at different levels of granularity. From this, I extract a comparative reference that can be used to select an appropriate forensic approach in particular cases of forgery.
• New models. I propose two new models to improve the results of fake image detection. In the first model (Efficient-Frequency), I combine visual and frequency information to boost accuracy. In the second model (WADD), I use a wavelet layer to reduce the number of parameters while keeping comparable accuracy.
1 https://github.com/tamlhp/dfd_benchmark
1.4 Thesis Outline
In the remainder, the thesis is organised as follows:
In chapter 2, we introduce the background knowledge of the image classification problem in section 2.1 and survey different visual forgery techniques in section 2.2, which is divided into two subsections that discuss graphics-based and feature-based techniques; visual forensics techniques are then reviewed in section 2.3.
In chapter 3, we propose two novel models which can improve the results of deepfake detection.
In chapter 4, we introduce the setup used for our benchmark, including the component-based design, datasets, metrics and evaluation procedures. I also introduce the dual-benchmarking datasets (DBD).
In chapter 5, we report the experimental results.
In chapter 6, we provide a summary of the findings and practical guidelines, and conclude the thesis.
1.5 Selected Publications

This thesis is based on the following research papers:
• Chau Xuan Truong Du, Huynh Thanh Trung, Pham Minh Tam, Nguyen Quoc Viet Hung, and Jun Jo. "Efficient-Frequency: a hybrid visual forensic framework for facial forgery detection." In 2020 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 707-712. IEEE, 2020.
This paper proposes a new model, named Efficient-Frequency, to improve the results of fake facial image detection.
• Minh Tam Pham, Thanh Trung Huynh, Van Vinh Tong, Thanh Thi Nguyen, Hongzhi Yin, and Quoc Viet Hung Nguyen. "A dual benchmarking study of facial forgery and facial forensics." (12912)
In this work, we develop a benchmark that offers a comprehensive empirical study on the performance comparison of deepfake detection models. Specifically, we integrate several state-of-the-art forgery and forensics techniques in a comparable manner, and measure distinct characteristics of these techniques under various settings. We then provide an in-depth analysis of the benchmark results. This analysis is presented in chapter 2 and chapter 4.
Chapter 2
Preliminaries and literature survey
2.1 Image classification problem

Image classification is one of the most fundamental tasks in computer vision. It involves predicting a specific class or label for a given image. The task can be divided into two types:
• Single-label classification predicts one label for each image. This is the most common classification task in supervised image classification. A single label or annotation is present for each image, so the model outputs a single prediction for each image that it sees. The output from the model is a vector with a length equal to the number of classes, whose values denote the scores that the image belongs to each class. A softmax activation function is usually used to make sure the scores sum up to one, and the maximum of the scores is taken to form the model's output. Fake face image detection is a single-label classification problem with two classes: real and fake.
• Multi-label classification predicts two or more labels for each image. Multi-label classification is a classification task where each image can contain more than one label, and some images can contain all the labels simultaneously. Instead of a softmax function, this task usually uses a sigmoid activation function to predict whether the given image contains each given class. A threshold is then used to decide the labels of the given image. The sketch below contrasts the two output heads.
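To make the distinction concrete, here is a minimal PyTorch sketch of the two output heads (my own illustration, not code from the thesis; all sizes are arbitrary):

```python
import torch
import torch.nn as nn

features = torch.randn(4, 512)           # a batch of 4 feature vectors

# Single-label head: softmax scores sum to one; argmax picks the class.
single_head = nn.Linear(512, 2)          # e.g. two classes: real vs fake
single_scores = torch.softmax(single_head(features), dim=1)
single_pred = single_scores.argmax(dim=1)

# Multi-label head: an independent sigmoid per label, thresholded at 0.5.
multi_head = nn.Linear(512, 5)           # 5 independent labels
multi_scores = torch.sigmoid(multi_head(features))
multi_pred = (multi_scores > 0.5).int()
```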
A convolutional neural network is the most popular way to solve this problem. In the remainder of this section, I discuss convolutional neural networks in more detail.
A convolutional neural network (CNN) is a neural network for solving image-related problems such as image classification and image segmentation. A CNN is built by stacking many convolution layers and combining them with fully connected layers at the end of the model, as in Figure 2.1. Convolutional networks were inspired by biological processes, in that the connectivity pattern between neurons resembles the organisation of the animal visual cortex: individual neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field. There are many well-known CNN architectures, such as Inception [52].
Convolution layers Convolution layers are the major building blocks used in convolutional neural networks. A convolution is the simple application of a filter, via the convolution operation, to an input in order to calculate a map, called a feature map, indicating the locations and strength of a detected feature in an input image. The filter is smaller than the input data, and the type of multiplication applied between a filter-sized patch of the input and the filter is a dot product: the element-wise multiplication between the filter-sized patch of the input and the filter, which is then summed, always resulting in a single value. Using a filter smaller than the input is intentional, as it allows the same filter (set of weights) to be multiplied by the input array multiple times at different points on the input. Specifically, the filter is applied systematically to each overlapping part or filter-sized patch of the input data, left to right, top to bottom. The size of the filter is usually small, such as 3 × 3 or 5 × 5.
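The sliding dot product can be illustrated in a few lines (my own sketch; the filter values are arbitrary):

```python
import torch
import torch.nn.functional as F

image = torch.randn(1, 1, 8, 8)            # one 8 x 8 single-channel image
kernel = torch.tensor([[[[-1., 0., 1.],    # one 3 x 3 filter
                         [-1., 0., 1.],
                         [-1., 0., 1.]]]])

# Each output value is the dot product of the filter with one filter-sized
# patch of the input, swept left to right, top to bottom.
feature_map = F.conv2d(image, kernel)      # shape: (1, 1, 6, 6)
print(feature_map.shape)
```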
Pooling layers Pooling layers are used to reduce the dimensions of the feature maps. Thus, they reduce the number of parameters to learn and the amount of computation performed in the network. The pooling layer also summarises the features present in a region of the feature map generated by a convolution layer.
1 https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
2 https://viniciuscantocosta.medium.com/understanding-the-structure-of-a-cnn-b220148e2ac4
Figure 2.2: Convolution layer
So, further operations are performed on the summarised features instead of on the precisely positioned features generated by the convolution layer. This makes the model more robust to variations in the positions of the features in the input image.
There are several different types of pooling layers (a short code example follows this list):
• Max pooling selects the maximum element from the region of the feature map covered by the filter. Thus, the output after a max pooling layer is a feature map containing the most prominent features of the previous feature map.
3 https://www.researchgate.net/figure/Pooling-layer-operation-oproaches-1-Pooling-layers-For-the-function-of-decreasing-the_fig4_340812216
4 https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/blocks/global-average-pooling-2d
Figure 2.4: Global average pooling
• Average pooling computes the average of the elements present in the region of the feature map covered by the filter. Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of the features present in a patch.
• Global pooling reduces each channel in the feature map to a single value. For example, with an input of size c × h × w, after forwarding through global pooling, the result has size c × 1 × 1. A max or average operation can be used to aggregate the features in each channel. The operation of global average pooling is illustrated in Figure 2.4.
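The three pooling operations can be sketched as follows (my own illustration; the input shape is arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)                  # (batch, channels, height, width)

max_pooled = F.max_pool2d(x, kernel_size=2)  # (1, 3, 4, 4), keeps maxima
avg_pooled = F.avg_pool2d(x, kernel_size=2)  # (1, 3, 4, 4), keeps averages

# Global average pooling: each channel collapses to a single value.
gap = F.adaptive_avg_pool2d(x, 1)            # (1, 3, 1, 1)
```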
Fully connected layer After going through the convolution layers and pooling layers, the feature map is flattened into a one-dimensional vector and fed into fully connected layers. "Fully connected" implies that every neuron in the previous layer is connected to every neuron in the next layer.
5 https://www.researchgate.net/figure/Flattening-step-in-CNN_fig2_343263135
The output from the convolutional and pooling layers represents high-level features of the input image. The purpose of the fully connected layer is to use these features to classify the input image into various classes based on the training dataset.
Activation functions Activation functions are non-linear transformations that are used after convolution layers or fully connected layers, before sending the outputs to the next layer of neurons or finalising them as the output. They make it easier for models to generalise or adapt to a variety of data and to differentiate between the outputs.
Loss functions The loss function computes the distance between the current output of the algorithm and the expected output. It is a method to evaluate how well the algorithm models the data. Loss functions can be categorised into two groups: one for classification (discrete values: 0, 1, 2, ...) and the other for regression (continuous values).
Cross-entropy is a common loss function for classification problems, with the formula:

CE = -\sum_{i=1}^{C} y_i \log(\hat{y}_i),

where C is the number of classes, y_i indicates whether the image belongs to class i, and \hat{y}_i is the predicted score for class i.
In a fake face image detection problem, with one image as input, the output of the model is a real value in [0, 1] which represents the probability that the input image is fake. So binary cross-entropy is used in the fake face image detection problem, with the formula below:

BCE = -\big(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\big),

where y ∈ {0, 1} is the ground-truth label (1 for fake) and \hat{y} is the predicted probability of the image being fake.
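A minimal sketch of binary cross-entropy on a small batch, assuming the model already outputs sigmoid probabilities (the numbers are made up for illustration):

```python
import torch
import torch.nn as nn

prob_fake = torch.tensor([0.9, 0.2, 0.7])   # model outputs, after a sigmoid
labels = torch.tensor([1.0, 0.0, 1.0])      # 1 = fake, 0 = real

loss = nn.BCELoss()(prob_fake, labels)

# The same value computed by hand from the formula above:
manual = -(labels * torch.log(prob_fake)
           + (1 - labels) * torch.log(1 - prob_fake)).mean()
```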
6 https://medium.com/@shrutijadon10104776/survey-on-activation-functions-for-deep-learning-9689331ba092
2.2 Visual forgery techniques
Visual forgery techniques aim to create a false image/video by injecting incorrect information (e.g. a false identity) into an original image/video. We can classify these into two categories: (i) graphics-based techniques and (ii) feature-based techniques. Table 2.1 lists the techniques whose characteristics are summarised below and which are used in our benchmark.
Table 2.1: Taxonomy of visual forgery techniques

Name             | Id Swap | Att Swap | Att Mani | Video specific
Deepfake [23]    |    ✓    |          |          |
3DMM [57]        |    ✓    |          |          |
FaceSwap-2D [14] |    ✓    |          |          |
FaceSwap-3D [30] |    ✓    |          |          |
MonkeyNet [49]   |         |    ✓     |          |       ✓
ReenactGAN [64]  |         |    ✓     |          |       ✓
StarGAN [7]      |         |    ✓     |    ✓     |
X2Face [61]      |         |    ✓     |          |       ✓
2.2.1 Graphics-based techniques

These techniques are often used to replace the face of a source person (A) with the face of a target person (B), using handcrafted features (e.g. the landmark points of a human face) to forge the image. We describe several typical graphics-based techniques below.
Faceswap-2D This technique first detects the facial landmarks of the two faces; the set of landmark points on a 2D scale is then used to fit the face of the target person B onto the source image of person A.
Colour adjustment: In this routine, the aim is to transform the histogram of the image of person B in order to match it with the histogram of the image of person A, so that the swapped region blends into the source image.
Faceswap-3D This graphics-based technique goes beyond the Faceswap-2D method by modelling the facial landmarks in three dimensions. The key difference in this approach is the 3D setting of the facial landmarks, which makes the generated image harder to detect:
\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = R \begin{pmatrix} U \\ V \\ W \end{pmatrix} + t \quad (2.4)

where (X, Y, Z) are the camera coordinates, (U, V, W) are the world coordinates of a landmark, R is a 3 × 3 rotation matrix, and t is a 3 × 1 translation vector. After 3D modelling, the projection into 2D is carried out as follows:
s \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \quad (2.5)

where (x, y) are the image coordinates, s is a scale factor, f is the focal length, and (c_x, c_y) is the optical centre.
After 3D modelling, the projection of the 3D landmarks onto their 2D equivalents, as shown in Equation 2.5, should match the 2D landmarks detected in the image. This can be formulated as an optimisation problem:

\min_{s, R, t} \sum_{i} \big\| p_i - \mathrm{proj}(s, R, t; P_i) \big\|^2 \quad (2.6)

where p_i is the i-th 2D landmark detected in the image and proj(s, R, t; P_i) is the 2D projection of the corresponding 3D landmark P_i under the pose (s, R, t).
This optimisation is often referred to as a P3P problem, and a strategy for solving it can be found in the literature. The face of the target person is then rendered using the estimated head pose (sA, RA, tA) of xA.
3D-Morphable face model (3DMM) This graphics-based technique is also used for face replacement. Instead of a linear mapping from 3D to 2D, this model uses a nonlinear mapping that is learned by an encoder-decoder deep neural network.
Formally, given a set of 2D face images \{I_i\}_{i=1}^{N}, 3DMM constructs three deep neural networks: (i) an encoder E_M that estimates the projection parameters; (ii) a shape encoder-decoder pair (E_S, D_S); and (iii) a texture encoder-decoder pair (E_T, D_T). The networks are trained by minimising a rendering loss of the form

\sum_{i=1}^{N} \big\| R\big(E_M(I_i), D_S(E_S(I_i)), D_T(E_T(I_i))\big) - I_i \big\|,

where R is a rendering layer that reconstructs the face image from the estimated projection, shape and texture; the reconstruction error reflects the quality of the training process. After the training process, the fake image can be generated by manipulating the learnt shape and texture representations.
2.2.2 Feature-based techniques

Recent fake image generators have leveraged advanced neural network architectures, such as encoder-decoder models and generative adversarial networks (GANs), to produce forged images of superior quality without the need for feature engineering or expert knowledge. We describe some representative feature-based techniques below.
Deepfake This technique uses an autoencoder architecture to replace one face with other faces. The typical architecture of this model is composed of one encoder En and two decoders De_X, one per identity X ∈ {A, B}, trained with a reconstruction loss of the form

L_rec = E_{x_X}\big[ \| x_X - De_X(En(x_X)) \|_1 \big],

which forces the shared encoder En to learn identity-independent features. A discriminator D can additionally be trained against the encoder-decoder (ED) network, and the loss functions used to train the ED and D networks are combined as

L = L_rec + \lambda_{adv} L_{adv},

where λ_adv is a balancing hyper-parameter between originality (reconstruction loss) and realistic rendering (adversarial loss). After the training process, the fake image can be obtained by applying the decoder of the target identity to the encoding of the source face.
StarGAN This feature-based technique manipulates facial attributes (e.g. hair colour, skin, gender, facial expression). To achieve this, StarGAN first groups the training images that share a particular combination of attributes as a domain. It then uses a generator G to learn a mapping between multiple domains: G(x, c) → y, where x and y are the input and output images, respectively, and c is a target domain which is randomised in the training process to enable a flexible transition. The model also employs a discriminator D to distinguish real images from fake ones and to classify images into their domains. Three loss functions are used:
• Adversarial loss: This loss function aims to ensure that the generated image is indistinguishable from real images.

• Domain classification loss: This loss helps the generator to capture the domain information of the images more effectively.
• Reconstruction loss: This guarantees that G translates only the domain information from the input while preserving the remaining content; a cycle-consistency formulation is applied to guarantee that the generator can reconstruct the original image using the original domain information.
ReenactGAN This technique transfers the source person's facial expressions to generate a fake image. Instead of using a pixel-wise transformation, the model maps the target image onto a latent space that closely captures the facial contours (i.e. boundaries).
The architecture of ReenactGAN consists of three DNNs: (i) an encoder (En), which embeds the target image into a latent boundary space; (ii) a target-specific decoder (De), which converts the latent boundary back into the target face; and (iii) a boundary transformer, which fits the boundaries of the target face to those of the source image. The encoder (En) and decoder (De) are trained to faithfully reconstruct faces, while the transformer adapts the latent boundaries to be similar to those of the source.
Monkey-Net This technique animates a source image by learning a set of motion-specific keypoints in an unsupervised manner, which allows it to describe relative movements between pixels. Then, only the relevant motion-specific patterns of the source image are transferred to generate the fake content.

The Monkey-Net framework contains three components. The first is the keypoint detector, which extracts motion-specific keypoints from the input frames. The output of this module is fed to the second component, the dense motion predictor, which translates the sparse keypoints into a motion heat map. The third module, called the motion transfer network, combines the source image with the motion heat map to generate the fake image. To train the model, a generator network G is trained together with the keypoint detector in an adversarial manner, where a discriminator D is responsible for distinguishing the real image from the fake one.
X2Face This technique controls the pose and expression of a given face image. X2Face takes two inputs: a source frame and a driving frame. The source frame is put through an encoder-decoder architecture named the embedding network, which learns a bilinear sampler to construct the mapping from the source frame to an embedded face. The driving frame is put through an encoder-decoder architecture named the driving network, which learns a bilinear sampler to transform the embedded face into the generated frame.
The network is trained in two stages. The first training stage is fully self-supervised, using images sampled from the same video. To this end, the generated frame and the driving frame have the same identity, which guarantees that the latent embedding learnt by the driving network must encode the variation factors (e.g. pose, expression, zoom), via a pixelwise L1 loss between the generated and driving frames. In the second training stage, additional identity loss functions are applied to enforce that the identities of the generated and source frames are the same. After training, the network is able to inject into a given source frame the variation factors from a driving frame.
2.3 Visual forensics techniques

Following the rapid development of forgery techniques, as well as the emerging threat of forged artefacts, many studies of visual forensics methods have been carried out. We can divide these methods into two categories: (i) computer vision techniques, which rely on handcrafted features to detect anomalous patterns (e.g. frequency, head pose); and (ii) deep learning techniques, which leverage the advances in deep learning to automatically learn hidden features that are non-trivial for humans.

2.3.1 Computer vision techniques
FDBD This method analyses the input image in the frequency domain to discover anomalous content. A frequency domain analysis is used to exploit the repetitive nature of generator artefacts.

More precisely, FDBD adopts a discrete Fourier transform (DFT) to decompose the input image into sinusoidal components of various frequencies. This spectral decomposition of the input image (which is treated as an M × N signal) reveals the distribution of signal energy over different frequency ranges:

X_{k,l} = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} x_{n,m} \, e^{-\frac{2\pi i}{N} kn} \, e^{-\frac{2\pi i}{M} lm},

where X_{k,l} is the frequency-domain representation, in which each frequency is associated with a signal amplitude and a phase.
The resulting spectra expose frequency artefacts that are characteristic of face manipulation techniques (e.g. DeepFake and FaceSwap).
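A minimal sketch of this kind of spectral analysis (my own illustration of the idea, not the FDBD implementation):

```python
import numpy as np

def log_spectrum(image: np.ndarray) -> np.ndarray:
    """2D DFT magnitude spectrum on a log scale, low frequencies centred."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    return np.log(np.abs(spectrum) + 1e-8)

# GAN-generated faces often show periodic, grid-like peaks in the
# high-frequency bands of this spectrum, which such detectors exploit.
gray = np.random.rand(256, 256)  # stand-in for a grayscale face crop
print(log_spectrum(gray).shape)
```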
Global consistency: Fake image generators, and especially feature-based techniques, often smooth a given face by interpolating the latent space of network features with supporting data points. However, these data points are not necessarily meaningful when new faces are generated, resulting in a mixture of different facial characteristics (e.g. differences in colour between the left and right eyes), which is referred to as global consistency.
Illumination estimation: An original image may contain incident illumination, and this poses a challenge when rendering a fake image with similar illumination conditions. Visual forgery techniques often leave traces of illumination-related artefacts: for example, a typical artefact of the DeepFake algorithm is a shading effect around the nose, in which one side is too dark.

Geometry estimation: Facial geometry is often taken into account in graphics-based models (e.g. 3D-Morphable) or feature-based generators (e.g. geometry estimators) to make the counterfeit image more realistic. However, this is often approximate, and leads to inaccurate details (artefacts). These artefacts typically appear along the boundary of the face mask (e.g. the nose, eyebrows and teeth) in the form of blending spots (strong edges or high contrast) or holes (missing detail).
HPBD When a forgery technique is used to inject the face of the target person into the source image, the facial landmarks may be mismatched. These errors in landmark locations can be discovered using a 2D head pose estimation between the real and fake regions of the input image. To achieve this, HPBD compares head poses across all facial landmarks and uses the central region to look for anomalies and discrepancies.
More precisely, the model utilises the 3D configuration of the facial landmarks, as described above. HPBD splits the system of 68 landmark points into two parts, representing the central and border regions of the face, and estimates a rotation matrix from each part. The vectors ⃗v_a and ⃗v_c representing the orientations of the head are then calculated as ⃗v_a = R_a ⃗w and ⃗v_c = R_c ⃗w, where R_a and R_c are the rotation matrices estimated from all landmarks and from the central landmarks, respectively, and ⃗w is a fixed reference direction. The difference between ⃗v_a and ⃗v_c is small for real images, and significantly larger for synthesised images. This feature is therefore a robust indicator for use in separating fake images from real ones.
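The final comparison step can be sketched as follows (my own illustration; the landmark fitting and pose estimation are abstracted away, and the vectors are made up):

```python
import numpy as np

def pose_discrepancy(v_a: np.ndarray, v_c: np.ndarray) -> float:
    """Cosine distance between the two estimated head-orientation vectors."""
    cos_sim = v_a @ v_c / (np.linalg.norm(v_a) * np.linalg.norm(v_c))
    return 1.0 - cos_sim

v_all = np.array([0.10, 0.02, 0.99])      # pose from all landmarks
v_central = np.array([0.35, 0.05, 0.93])  # pose from central landmarks

# Small for a real image (consistent poses), larger for a spliced face.
print(pose_discrepancy(v_all, v_central))
```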
2.3.2 Deep learning techniques

MesoNet This method detects forged images at a mesoscopic level of analysis. Two variants have been proposed based on the mesoscopic properties of the image, namely Meso-4 and MesoInception-4.
Meso-4: This variant is designed with four layers, which alternate between convolution and pooling, followed by fully connected layers. MesoInception-4: This variant uses inception modules instead of the first two convolutional layers. The idea behind the inception operation is to enrich the function space of the model by applying different kernel shapes to multiple convolutional layers simultaneously. The other vanilla convolution layers in Meso-4 are replaced by dilated convolutions [48] to avoid overfitting.
Capsule-forensics This method uses a capsule network to improve forgery detection, especially against highly realistic photos/videos.
The model first locates the face in the image and rescales it to a size of 128 × 128. This is then passed through a feature extractor, whose output is fed into a capsule network that contains: (i) three primary capsules, each of which integrates statistical pooling to enhance forgery detection; and (ii) two output capsules, which are dynamically routed from the primary capsules to decide between real and fake.
XceptionNet This approach [8] adopts the Inception architecture [53, 51] to extract the underlying features of input images to distinguish between fake and real images. The original Inception architecture maps the input data from the original space to multiple smaller spaces separately, and the cross-channel correlations between the smaller spaces are then put together via convolutional layers.
XceptionNet goes beyond existing Inception architectures by entirely decoupling the correlations across space and channels. It has 36 convolutional layers, which act as the feature extraction module of the whole network. This module in turn consists of three parts, each of which is constructed from a linear stack of depth-wise separable convolution layers with residual connections. This linear stacking increases the flexibility of development in terms of implementation and modification for high-level libraries such as Keras. The first part, referred to as the entry flow, processes the data once, while the second, called the middle flow, processes the data eight times. The final part, called the exit flow, then processes the data once. Finally, a logistic regression layer is applied for binary classification (real/fake).
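The depth-wise separable building block at the heart of this design can be sketched as follows (my own illustration, not the XceptionNet code; layer sizes are placeholders):

```python
import torch.nn as nn

# A per-channel spatial filter (depth-wise) followed by a 1x1 point-wise
# convolution that mixes information across channels: spatial and
# cross-channel correlations are handled by separate layers.
def separable_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )
```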
GAN-fingerprint In this approach, two types of fingerprints are investigated: model fingerprints and image fingerprints.
• Model fingerprint: This approach is based on the observation that, even if two well-trained GAN models differ only in their hyper-parameter configurations, the non-convexity of the loss functions and the adversarial equilibrium between the generator and discriminator make each model leave its own traces, even when their high-quality generations are otherwise equivalent. This uniqueness can be exploited to trace GAN-based modifications.
• Image fingerprint: If fake images are generated by the same GAN instance, they often have stable, common patterns, and vice versa. This uniqueness hints that the encoding of an image fingerprint is possible.
Using these two observations, GAN-fingerprint learns the model fingerprint for each source, and then uses it to map an input image to its fingerprint. Formally, given an image-model pair (I, y), where I is the input image and y ∈ Y is a GAN instance, the model learns a reconstruction function R : I → R(I) using pixel-wise reconstruction losses. An image is then attributed to a source according to the correlation between the model fingerprint F^y_mod and the image fingerprint F^I_im:

P(y \mid I) = \frac{\exp\big(\mathrm{cor}(F^{y}_{mod}, F^{I}_{im})\big)}{\sum_{\hat{y} \in Y} \exp\big(\mathrm{cor}(F^{\hat{y}}_{mod}, F^{I}_{im})\big)}.

The losses L_adv, L_pix and L_cls are then put together in a weighted-sum combination to train the model.
Chapter 3
Proposed facial forgery detection models
3.1 Efficient-Frequency model

To predict a fake face image, I use a pipeline that extracts the face from an image and detects whether this face is fake or real. The pipeline contains two main parts: a face detection part and a classification model part. First, the face is detected from the input images or video frames using the face detection module. The extracted face image is then forwarded to the classification model to detect whether it is fake or real.
Face detection A multi-task cascaded convolutional network (MTCNN) pipeline is used to extract the face from the given images or video frames. The pipeline contains three main steps. In the first step, the given image is rescaled to a range of different sizes (a.k.a. an image pyramid), then a shallow fully convolutional network (the so-called P-Net) is employed to produce the candidate windows. In the second step, a more complex CNN model, namely R-Net, is adopted to refine the window candidates and keep only the high-potential ones. In the last step, a powerful CNN model, namely O-Net, further refines the candidates and locates the facial landmark positions. Between the steps, non-maximum suppression (NMS) is used to filter the candidate bounding boxes.
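As an illustration of how such a cascaded detector is typically invoked, the sketch below uses the facenet-pytorch package; the package and its API are my assumption about one common MTCNN implementation, not necessarily the one used in this thesis:

```python
from PIL import Image
from facenet_pytorch import MTCNN  # pip install facenet-pytorch

# P-Net, R-Net and O-Net, with NMS between stages, run inside this detector.
mtcnn = MTCNN(image_size=224, margin=20)

img = Image.open("frame.jpg")
face = mtcnn(img)                 # aligned face crop as a tensor, or None
boxes, probs = mtcnn.detect(img)  # bounding boxes and confidences
```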
Figure 3.1 depicts the overview of the Efficient-Frequency pipeline. As can be seen from the figure, the extracted face image is analysed in the frequency domain using the Fourier transform. The original image and its frequency-domain representation are then forwarded into two separate EfficientNet models.
Figure 3.1: Overview of the Efficient-Frequency pipeline
These EfficientNet models learn the underlying features. The learnt features are combined by a late-fusion mechanism which considers the importance of the information, then forwarded to a fully connected layer, and the common cross-entropy loss is used for binary classification.
Frequency analysis We utilise the discrete Fourier transform (DFT) to obtain the frequency-domain representation of the input image. The representation can be considered as a spectral decomposition of the image that indicates the distribution of its energy over a range of frequencies given the spatial resolution. For 2-dimensional image data of size M × N, the Fourier transform can be computed as in section 2.3.1. It is worth noting that the obtained representation inherits the same dimensionality as the original image. An azimuthal average is then applied to flatten the representation into a 1-dimensional form. The transformation can be considered as a compression where similar frequency components are gathered and averaged into a feature vector. The compression helps to significantly reduce the number of features with minimal loss of information, resulting in a more robust representation of the input image.
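A minimal sketch of this compression (my own implementation of the idea, assuming the flattening is a radial average of the spectrum):

```python
import numpy as np

def azimuthal_average(image: np.ndarray) -> np.ndarray:
    """Compress a 2D spectrum into a 1D radial-frequency profile."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(image)))
    h, w = spectrum.shape
    y, x = np.indices((h, w))
    r = np.hypot(x - w // 2, y - h // 2).astype(int)
    # Average all spectral magnitudes that share the same radius.
    sums = np.bincount(r.ravel(), weights=spectrum.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)

profile = azimuthal_average(np.random.rand(128, 128))
print(profile.shape)  # one value per integer radius
```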
Classification model We employ EfficientNet as the backbone network, as it performs better pattern extraction while guaranteeing the efficiency of the model. The concept of the model is designed using a multi-objective neural architecture search that optimises the two mentioned criteria, accuracy and efficiency. Our model leverages the original EfficientNet variant, namely EfficientNet-B0, as this variant can capture facial detail information better than scaled-up variants such as EfficientNet-B1, where the details can be washed out due to over-scaling.
Figure 3.2: EfficientNet architecture
The network starts with a convolutional layer of size 3 × 3, which performs lightweight filtering. Then, the network continues with multiple stacked mobile inverted bottleneck (MBConv) blocks. An MBConv block consists of a point-wise (1 × 1) convolution, a depth-wise convolution and a spatially-filtered feature map. Instead of standard convolution, MBConv uses a depth-wise separable layer to reduce the computational cost while guaranteeing the quality of pattern extraction. The last layer of an MBConv block is a spatially-filtered feature map which projects information from the previous layer back to a low-dimensional subspace using another point-wise convolution. The low-dimensional subspace helps to preserve the essential information while reducing the complexity of the model. For each MBConv block, a residual connection is added to aid gradient flow during backpropagation. After the stacked MBConv layers, the output feature map is flattened into a 1-dimensional vector using a fully connected layer at the end of the network.
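A simplified MBConv block could look as follows (my own sketch; the squeeze-and-excitation stage of the real EfficientNet block is omitted):

```python
import torch.nn as nn

class MBConv(nn.Module):
    """Expand (1x1) -> depth-wise 3x3 -> project (1x1), with a residual."""
    def __init__(self, ch: int, expand: int = 6):
        super().__init__()
        mid = ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, ch, 1, bias=False), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        # The residual connection aids gradient flow during backpropagation.
        return x + self.block(x)
```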
Late-fusion mechanism As discussed, we use two separate EfficientNets to extract the patterns from the original image and its frequency-domain representation, resulting in two 1-dimensional feature vectors.
... estimation between the real and fake regions of the input image To achieve this, HPBDcompares head poses across all facial landmarks and uses the central region to look for anomaliesand discrepanciesMore... data-page="31">
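As a rough sketch of how such a weighted late fusion could look (my own illustration based on the description above; the feature dimension and the single learnt weight are assumptions):

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Weighted fusion of visual and frequency features, then classification."""
    def __init__(self, dim: int = 1280):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnt importance
        self.classifier = nn.Linear(dim, 2)           # real vs fake

    def forward(self, f_visual, f_freq):
        fused = self.alpha * f_visual + (1 - self.alpha) * f_freq
        return self.classifier(fused)
```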
Chapter 3
Proposed facial forgery detection models
To predict a fake face image, I use a pipeline to extract face from image and detect this face
is fake or... refine thewindow candidates and keep only the high potential ones In the last step, a powerful CNN modelnamely O-Net further refines the candidates and locates the facial landmarks positions