Hanoi University of Science and Technology

Master Thesis

Facial Image Forgery and Detection: Benchmark and Application

Pham Minh Tam
tam.pm202708M@sis.hust.edu.vn

School of Information and Communication Technology

Supervisor: Assoc. Prof. Huynh Quyet Thang
Supervisor's signature

Institution: School of Information and Communication Technology

Hanoi, 12/2021
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
Student
Signature and Name
I would also like to thank my family, my girlfriend and my friends for their wise counsel and sympathetic ear. You have always been there for me.

Finally, I would like to thank the Vingroup Innovation Foundation (VINIF) for its financial support.
Abstract

In recent years, visual forgery has reached a level of sophistication that humans cannot identify fraud, which poses a significant threat to information security. A wide range of malicious applications have emerged, such as fake news, defamation or blackmailing of celebrities, impersonation of politicians in political warfare, and the spreading of rumours to attract views. As a result, a rich body of visual forensic techniques has been proposed in an attempt to stop this dangerous trend. In this thesis, I introduce two new models, named Efficient-Frequency and WADD, to improve the results on the fake face image detection problem. I also present a benchmark that provides in-depth insights into visual forgery and visual forensics, using a comprehensive and empirical approach, and propose a novel end-to-end visual forensic framework that can incorporate different modalities to efficiently classify real and forged content. More specifically, we develop an independent framework that integrates state-of-the-art counterfeit generators and detectors, and measure the performance of these techniques using various criteria. We also perform an exhaustive analysis of the benchmarking results, to determine the characteristics of the methods, which serves as a comparative reference in this never-ending war between measures and countermeasures.
Student
Signature and Name
Contents

1 Introduction
   1.1 Context
   1.2 Research Problems
   1.3 Contributions
   1.4 Thesis Outline
   1.5 Selected Publications
2 Preliminaries and literature survey
   2.1 Image classification problem
   2.2 Visual forgery techniques
      2.2.1 Graphics-based techniques
      2.2.2 Feature-based techniques
   2.3 Visual forensics techniques
      2.3.1 Computer vision techniques
      2.3.2 Deep learning techniques
3 Proposed facial forgery detection models
   3.1 Efficient-Frequency model
   3.2 WADD (Wavelet Attention for Deepfake Detection) model
4 Proposed dual benchmarking framework for facial forgery and detection
   4.1 Framework
   4.2 Datasets
      4.2.1 Dual-benchmarking datasets (DBD)
      4.2.2 External datasets
   4.3 Measurements
   4.4 Experimental Procedures
   4.5 Reproducibility Environment
5 Experimental Results and Performance Analysis
   5.1 Efficiency comparison
   5.2 End-to-end comparison with existing datasets
   5.3 Dual-benchmarking comparison
      5.3.1 Forensic generalisation and forgery feature overlapping
      5.3.2 Qualitative study of forensic-forgery duel
      5.3.3 Influence of contrast
      5.3.4 Effects of brightness
      5.3.5 Robustness against noise
      5.3.6 Robustness against image resolution
      5.3.7 Influence of missing information
      5.3.8 Adaptivity to image compression
   5.4 Performance guidelines
List of Tables

1.1 Comparison Between Existing Benchmarks on Facial Forensics
2.1 Taxonomy of visual forgery techniques
4.1 Statistics of real datasets
5.1 Model size and detection speed
5.2 Statistics of the train, validation and test splits of existing datasets
5.3 Performance (Accuracy | Precision | Recall | F1-score) of visual forensic techniques on different datasets
5.4 Statistics of the train, validation and test splits of synthetic datasets
5.5 Performance of visual forensics techniques against visual forgery techniques
5.6 Performance guideline for visual forensics
List of Figures

2.1 A vanilla CNN
2.2 Convolution layer
2.3 Max and average pooling
2.4 Global average pooling
2.5 Flatten and fully connected layer
2.6 Activation functions
3.1 Overview of the Efficient-Frequency pipeline
3.2 EfficientNet architecture
3.3 WADD model
3.4 Wavelet pooling
3.5 Attention layer
4.1 Dual benchmarking framework
4.2 Size of facial forgery datasets
4.3 Images from the DBD dataset
5.1 Generalisation ability of forensic techniques
5.2 Overlapping features of forgery techniques
5.3 Suspicious regions of forged images
5.4 Effects of illumination factors
5.5 Robustness against noise
5.6 Robustness against image resolution
5.7 Influence of missing information
5.8 Adaptivity to image compression
Chapter 1

Introduction

1.1 Context

"Deepfake" refers to synthetic media in which a person in an existing image or video is replaced with someone else's likeness.
The term "fake images" or "forgery image" have emerged in the recent years, because therehave been a lot of tools helping change the content of images such as Photoshop software Althoughthese tools are so powerful, users must have a huge a mount of knowledge to use them As a results,the number of fake images is low and it takes a lot of resources to make a fake photo But now,rather than images simply being altered by editing software such as Photoshop or videos beingdeceptively edited, there’s a new breed of machine-made fakes – and they could eventually make itimpossible for us to tell fact from fiction With the development of deep learning techniques, thereare a lot of methods which can generate "fake images" quickly, easily and in bulk
"Deep fakes" are the most prominent form of what’s being called “synthetic media”: images,sound and video that appear to have been created through traditional means but that have, in fact,been constructed by complex software Deep fakes have been around for years and, even thoughtheir most common use to date has been transplanting the heads of celebrities onto the bodies ofactors in pornographic videos, they have the potential to create convincing footage of any persondoing anything, anywhere There are many app which use "Deep fake" technique to help peopleeasily make forgery content such as Zao or FaceApp software
The growth of social networks has made the spread of fake photos and videos rapid and widespread. These fake images and videos are distributed with false content, causing many economic and social problems. Therefore, many organizations, especially social networking companies such as Facebook and Twitter, are focusing on the research and development of systems to identify fake photos, so as to promptly prevent the spread of fake images and videos on the internet.
Given these urgent requirements, this thesis investigates methods to detect fake facial images (forensic methods) and ways to generate such fake facial images (forgery methods). Because this thesis focuses only on facial fake images, the terms "fake images" or "forged images" refer to images with changed faces, and the term "forensics" refers to methods used to solve the fake facial image detection problem. The results of this thesis could help social network administrators and communities to choose appropriate forensic methods to reduce the consequences of fake facial images.
1.2 Research Problems

The development of fake visual content, such as fake images and fake videos, has undergone rapid growth in recent years. Facial content has attracted particular interest, since the face plays an essential role in human communication and represents the identity of the person. Forged facial images and videos have reached such a high level of quality that even people with good vision, under ideal lighting conditions, cannot distinguish between real and forged content. This enables a wide range of malicious applications, such as counterfeit news generation, click-baits, impersonation and fraudulent transactions [32, 2].
Given the importance of fake content detection, visual forensics has emerged to detect fake images, especially in social media content. Fake facial image detection can be defined as follows: given an image x, we need to develop a detection model D such that:

y = D(x) = \begin{cases} 0, & \text{if } x \text{ is a real image} \\ 1, & \text{if } x \text{ is a fake facial image} \end{cases} \quad (1.1)
Fake facial image detection can be categorised into two paradigms. On the one hand, computer vision approaches rely on handcrafted features to detect anomalous patterns in visual content, including frequency-based techniques, visual artefacts, and techniques that examine head poses. On the other hand, deep learning approaches leverage deep neural networks (DNNs) to automatically extract hidden features that go beyond human perception.
However, recent advances in modern artificial intelligence have given rise to a new and evolving class of visual forgery techniques. These techniques exploit the power of AI to hide the digital footprints generated by the forgery process, and can trick even the latest forensic techniques. They can be divided into two categories: graphics-based and feature-based. The former often mixes and disguises fake artefacts with common ones to produce realistic forged content, even on commodity hardware. The latter relies on the power of DNNs to increase the level of realism of the forged content. Video forgery is more challenging than image forgery, since the generation of fake videos requires more fine-grained and precise features than in the image case.
In this continual war between visual forgery (measures) and visual forensics (countermeasures), existing techniques have rarely been evaluated using the same benchmarks. This is primarily due to the challenges in aligning the different settings. Countermeasures are often proposed for a previous forgery technique, and soon become obsolete when a new forgery measure appears. Consequently, the interpretation of performance results from forensic techniques is challenging, since baselines tend to change quickly over time. Moreover, visual forgery and visual forensics have not been subject to a fair comparison, as the reported performance is generally based on small datasets and a limited range of adverse conditions. The recent social and political damage from fake visual content requires a common ground to allow us to understand the timeline, compare variations, and keep pace with this war.
In this thesis, I report the first independent dual benchmarking study to evaluate visual forgery and forensics methods in a unified framework. Using this framework, a comprehensive performance comparison is conducted on a wide range of state-of-the-art forensic techniques, forgery baselines, and real-world datasets. To envision the future of the forgery/forensics war, I also conduct in-depth analyses of synthetic datasets to extract insights into the performance behaviour of the benchmarked methods. Based on these results, I propose several guidelines for the selection of appropriate visual forensic techniques for particular application settings. In addition, I propose two new models to improve performance on the fake image detection problem. Researchers and gatekeepers can use our generic framework to reduce the complexity of future benchmarking studies.
Table 1.1: Comparison Between Existing Benchmarks on Facial Forensics

Benchmarks | #Forensics | #Forgery | #Adversaries | Image | Video | Code | New Datasets
1.3 Contributions

The contributions of this thesis can be summarised as follows:
• New forgery datasets. I apply a range of forgery techniques to generate forged content, resulting in a sizeable collection of datasets that is then used to explore the ability of forensic techniques to handle malicious applications. Compared to existing datasets, our dataset covers a larger range of contents and forgery types (8 techniques in 3 types, resulting in 1,000,000 forged images and 21,095 videos), which enables a thorough investigation of the field.

• Reproducible dual benchmarking. I publish the first large-scale reproducible benchmarking framework¹ that can assist in a dual comparison of a wide range of forensic and forgery techniques. The framework is designed in a component-based architecture, which allows the direct integration of new forgery and forensics techniques besides the default ones. Our framework also provides an application layer, which aids the investigation of the effects of different imagery factors on forensic performance.
• Performance guideline. I present an exhaustive list of performance results at different levels of granularity. From this, I extract a comparative reference that can be used to select an appropriate forensic approach in particular cases of forgery.
• New models. I propose two new models to improve the results of fake image detection. In the first model (Efficient-Frequency), I combine visual and frequency information to boost accuracy. In the second model (WADD), I use a wavelet layer to reduce the number of parameters while keeping comparable accuracy.
1 https://github.com/tamlhp/dfd_benchmark
1.4 Thesis Outline
In the remainder, the thesis is organised as follows:
In chapter 2, we introduce the background knowledge of the image classification problem in section 2.1 and survey different visual forgery techniques in section 2.2, which is divided into two subsections that discuss graphics-based and feature-based techniques; visual forensics techniques are then reviewed in section 2.3.
In chapter 3, we propose two novel models which can improve the results of deepfake detection.
In chapter 4, we introduce the setup used for our benchmark, including the component-based design, datasets, metrics and evaluation procedures. I also introduce the dual-benchmarking datasets (DBD).
In chapter 5, we report the experimental results.
In chapter 6, we provide a summary of the findings and practical guidelines, and conclude the thesis.
1.5 Selected Publications

This thesis is based on the following research papers:
• Chau Xuan Truong Du, Huynh Thanh Trung, Pham Minh Tam, Nguyen Quoc Viet Hung, and Jun Jo. "Efficient-Frequency: a hybrid visual forensic framework for facial forgery detection." In 2020 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 707-712. IEEE, 2020.
This paper proposes a new model, named Efficient-Frequency, to improve the results of fake facial image detection.
• Minh Tam Pham, Thanh Trung Huynh, Van Vinh Tong, Thanh Thi Nguyen, Hongzhi Yin, and Quoc Viet Hung Nguyen. "A dual benchmarking study of facial forgery and facial forensics." (12912)
In this work, we develop a benchmark that offers a comprehensive empirical study on the performance comparison of deepfake detection models. Specifically, we integrate several state-of-the-art forgery and forensics techniques in a comparable manner, and measure distinct characteristics of these techniques under various settings. We then provide an in-depth analysis of the benchmark results. This analysis is presented in chapter 2 and chapter 4.
Chapter 2
Preliminaries and literature survey
2.1 Image classification problem

Image classification is one of the most fundamental tasks in computer vision. It involves predicting a specific class or label for a given image. The task can be divided into two types:
• Single-label classification predicts one label for each image. This is the most common classification task in supervised image classification. A single label or annotation is present for each image, so the model outputs a single prediction for each image that it sees. The output from the model is a vector with a length equal to the number of classes, whose values denote the scores that the image belongs to each class. A softmax activation function is usually used to make sure the scores sum up to one, and the maximum of the scores is taken to form the model's output. Fake face image detection is a single-label classification problem with two classes: real and fake.
• Multi-label classification predicts two or more labels for each image. Multi-label classification is a classification task where each image can contain more than one label, and some images can contain all the labels simultaneously. Instead of a softmax function, this task usually uses a sigmoid activation function to predict whether the given image contains each given class. A threshold is then used to decide the labels of the given image. The sketch below contrasts the two output heads.
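To make the distinction concrete, here is a minimal PyTorch sketch of the two output heads (my own illustration, not code from the thesis; all sizes are arbitrary):

```python
import torch
import torch.nn as nn

features = torch.randn(4, 512)           # a batch of 4 feature vectors

# Single-label head: softmax scores sum to one; argmax picks the class.
single_head = nn.Linear(512, 2)          # e.g. two classes: real vs fake
single_scores = torch.softmax(single_head(features), dim=1)
single_pred = single_scores.argmax(dim=1)

# Multi-label head: an independent sigmoid per label, thresholded at 0.5.
multi_head = nn.Linear(512, 5)           # 5 independent labels
multi_scores = torch.sigmoid(multi_head(features))
multi_pred = (multi_scores > 0.5).int()
```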
A convolutional neural network is the most popular way to solve this problem. In the remainder of this section, I discuss convolutional neural networks in more detail.
A convolutional neural network (CNN) is a neural network for solving image-related problems such as image classification and image segmentation. A CNN is built by stacking many convolution layers and combining them with fully connected layers at the end of the model, as in Figure 2.1. Convolutional networks were inspired by biological processes, in that the connectivity pattern between neurons resembles the organisation of the animal visual cortex: individual neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field. There are many well-known CNN architectures, such as Inception [52].
Convolution layers Convolution layers are the major building blocks used in convolutional neural networks. A convolution is the simple application of a filter, via the convolution operation, to an input in order to calculate a map, called a feature map, indicating the locations and strength of a detected feature in an input image. The filter is smaller than the input data, and the type of multiplication applied between a filter-sized patch of the input and the filter is a dot product: the element-wise multiplication between the filter-sized patch of the input and the filter, which is then summed, always resulting in a single value. Using a filter smaller than the input is intentional, as it allows the same filter (set of weights) to be multiplied by the input array multiple times at different points on the input. Specifically, the filter is applied systematically to each overlapping part or filter-sized patch of the input data, left to right, top to bottom. The size of the filter is usually small, such as 3 × 3 or 5 × 5.
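The sliding dot product can be illustrated in a few lines (my own sketch; the filter values are arbitrary):

```python
import torch
import torch.nn.functional as F

image = torch.randn(1, 1, 8, 8)            # one 8 x 8 single-channel image
kernel = torch.tensor([[[[-1., 0., 1.],    # one 3 x 3 filter
                         [-1., 0., 1.],
                         [-1., 0., 1.]]]])

# Each output value is the dot product of the filter with one filter-sized
# patch of the input, swept left to right, top to bottom.
feature_map = F.conv2d(image, kernel)      # shape: (1, 1, 6, 6)
print(feature_map.shape)
```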
Pooling layers Pooling layers are used to reduce the dimensions of the feature maps. Thus, they reduce the number of parameters to learn and the amount of computation performed in the network. The pooling layer also summarises the features present in a region of the feature map generated by a convolution layer.
1 https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
2 https://viniciuscantocosta.medium.com/understanding-the-structure-of-a-cnn-b220148e2ac4
Figure 2.2: Convolution layer
So, further operations are performed on the summarised features instead of on the precisely positioned features generated by the convolution layer. This makes the model more robust to variations in the positions of the features in the input image.
There are several different types of pooling layers (a short code example follows this list):
• Max pooling selects the maximum element from the region of the feature map covered by the filter. Thus, the output after a max pooling layer is a feature map containing the most prominent features of the previous feature map.
3 https://www.researchgate.net/figure/Pooling-layer-operation-oproaches-1-Pooling-layers-For-the-function-of-decreasing-the_fig4_340812216
4 https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/blocks/global-average-pooling-2d
Figure 2.4: Global average pooling
• Average pooling computes the average of the elements present in the region of the feature map covered by the filter. Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of the features present in a patch.
• Global pooling reduces each channel in the feature map to a single value. For example, with an input of size c × h × w, after forwarding through global pooling, the result has size c × 1 × 1. A max or average operation can be used to aggregate the features in each channel. The operation of global average pooling is illustrated in Figure 2.4.
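The three pooling operations can be sketched as follows (my own illustration; the input shape is arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)                  # (batch, channels, height, width)

max_pooled = F.max_pool2d(x, kernel_size=2)  # (1, 3, 4, 4), keeps maxima
avg_pooled = F.avg_pool2d(x, kernel_size=2)  # (1, 3, 4, 4), keeps averages

# Global average pooling: each channel collapses to a single value.
gap = F.adaptive_avg_pool2d(x, 1)            # (1, 3, 1, 1)
```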
Fully connected layer After going through the convolution layers and pooling layers, the feature map is flattened into a one-dimensional vector and fed into fully connected layers. "Fully connected" implies that every neuron in the previous layer is connected to every neuron in the next layer.
5 https://www.researchgate.net/figure/Flattening-step-in-CNN_fig2_343263135
The output from the convolutional and pooling layers represents high-level features of the input image. The purpose of the fully connected layer is to use these features to classify the input image into various classes based on the training dataset.
Activation functions Activation functions are non-linear transformations that are used after convolution layers or fully connected layers, before sending the outputs to the next layer of neurons or finalising them as the output. They make it easier for models to generalise or adapt to a variety of data and to differentiate between the outputs.
Loss functions The loss function computes the distance between the current output of the algorithm and the expected output. It is a method to evaluate how well the algorithm models the data. Loss functions can be categorised into two groups: one for classification (discrete values: 0, 1, 2, ...) and the other for regression (continuous values).
Cross-entropy is a common loss function for classification problems, with the formula:

CE = -\sum_{i=1}^{C} y_i \log(\hat{y}_i),

where C is the number of classes, y_i indicates whether the image belongs to class i, and \hat{y}_i is the predicted score for class i.
In a fake face image detection problem, with one image as input, the output of the model is a real value in [0, 1] which represents the probability that the input image is fake. So binary cross-entropy is used in the fake face image detection problem, with the formula below:

BCE = -\big(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\big),

where y ∈ {0, 1} is the ground-truth label (1 for fake) and \hat{y} is the predicted probability of the image being fake.
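A minimal sketch of binary cross-entropy on a small batch, assuming the model already outputs sigmoid probabilities (the numbers are made up for illustration):

```python
import torch
import torch.nn as nn

prob_fake = torch.tensor([0.9, 0.2, 0.7])   # model outputs, after a sigmoid
labels = torch.tensor([1.0, 0.0, 1.0])      # 1 = fake, 0 = real

loss = nn.BCELoss()(prob_fake, labels)

# The same value computed by hand from the formula above:
manual = -(labels * torch.log(prob_fake)
           + (1 - labels) * torch.log(1 - prob_fake)).mean()
```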
6 https://medium.com/@shrutijadon10104776/survey-on-activation-functions-for-deep-learning-9689331ba092
2.2 Visual forgery techniques
Visual forgery techniques aim to create a false image/video by injecting incorrect information (e.g. a false identity) into an original image/video. We can classify these into two categories: (i) graphics-based techniques and (ii) feature-based techniques. Table 2.1 lists the techniques whose characteristics are summarised below and which are used in our benchmark.
Table 2.1: Taxonomy of visual forgery techniques

Name             | Id Swap | Att Swap | Att Mani | Video specific
Deepfake [23]    |    ✓    |          |          |
3DMM [57]        |    ✓    |          |          |
FaceSwap-2D [14] |    ✓    |          |          |
FaceSwap-3D [30] |    ✓    |          |          |
MonkeyNet [49]   |         |    ✓     |          |       ✓
ReenactGAN [64]  |         |    ✓     |          |       ✓
StarGAN [7]      |         |    ✓     |    ✓     |
X2Face [61]      |         |    ✓     |          |       ✓
2.2.1 Graphics-based techniques

These techniques are often used to replace the face of a source person (A) with the face of a target person (B), using handcrafted features (e.g. the landmark points of a human face) to forge the image. We describe several typical graphics-based techniques below.
Faceswap-2D This technique first detects the facial landmarks of the two faces; the set of landmark points on a 2D scale is then used to fit the face of the target person B onto the source image of person A.
Colour adjustment: In this routine, the aim is to transform the histogram of the image of person B in order to match it with the histogram of the image of person A, so that the swapped region blends into the source image.
Faceswap-3D This graphics-based technique goes beyond the Faceswap-2D method by modelling the facial landmarks in three dimensions. The key difference in this approach is the 3D setting of the facial landmarks, which makes the generated image harder to detect:
\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = R \begin{pmatrix} U \\ V \\ W \end{pmatrix} + t \quad (2.4)

where (X, Y, Z) are the camera coordinates, (U, V, W) are the world coordinates of a landmark, R is a 3 × 3 rotation matrix, and t is a 3 × 1 translation vector. After 3D modelling, the projection into 2D is carried out as follows:
s \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \quad (2.5)

where (x, y) are the image coordinates, s is a scale factor, f is the focal length, and (c_x, c_y) is the optical centre.
After 3D modelling, the projection of the 3D landmarks onto their 2D equivalents, as shown in Equation 2.5, should match the 2D landmarks detected in the image. This can be formulated as an optimisation problem:

\min_{s, R, t} \sum_{i} \big\| p_i - \mathrm{proj}(s, R, t; P_i) \big\|^2 \quad (2.6)

where p_i is the i-th 2D landmark detected in the image and proj(s, R, t; P_i) is the 2D projection of the corresponding 3D landmark P_i under the pose (s, R, t).
This optimisation is often referred to as a P3P problem, and a strategy for solving it can be found in the literature. The face of the target person is then rendered using the estimated head pose (sA, RA, tA) of xA.
3D-Morphable face model (3DMM) This graphics-based technique is also used for face replacement. Instead of a linear mapping from 3D to 2D, this model uses a nonlinear mapping that is learned by an encoder-decoder deep neural network.
Formally, given a set of 2D face images \{I_i\}_{i=1}^{N}, 3DMM constructs three deep neural networks: (i) an encoder E_M that estimates the projection parameters; (ii) a shape encoder-decoder pair (E_S, D_S); and (iii) a texture encoder-decoder pair (E_T, D_T). The networks are trained by minimising a rendering loss of the form

\sum_{i=1}^{N} \big\| R\big(E_M(I_i), D_S(E_S(I_i)), D_T(E_T(I_i))\big) - I_i \big\|,

where R is a rendering layer that reconstructs the face image from the estimated projection, shape and texture; the reconstruction error reflects the quality of the training process. After the training process, the fake image can be generated by manipulating the learnt shape and texture representations.
2.2.2 Feature-based techniques

Recent fake image generators have leveraged advanced neural network architectures, such as encoder-decoder models and generative adversarial networks (GANs), to produce forged images of superior quality without the need for feature engineering or expert knowledge. We describe some representative feature-based techniques below.
Deepfake This technique uses an autoencoder architecture to replace one face with other faces. The typical architecture of this model is composed of one encoder En and two decoders De_X, one per identity X ∈ {A, B}, trained with a reconstruction loss of the form

L_rec = E_{x_X}\big[ \| x_X - De_X(En(x_X)) \|_1 \big],

which forces the shared encoder En to learn identity-independent features. A discriminator D can additionally be trained against the encoder-decoder (ED) network, and the loss functions used to train the ED and D networks are combined as

L = L_rec + \lambda_{adv} L_{adv},

where λ_adv is a balancing hyper-parameter between originality (reconstruction loss) and realistic rendering (adversarial loss). After the training process, the fake image can be obtained by applying the decoder of the target identity to the encoding of the source face.
StarGAN This feature-based technique manipulates facial attributes (e.g. hair colour, skin, gender, facial expression). To achieve this, StarGAN first groups the training images that share a particular combination of attributes as a domain. It then uses a generator G to learn a mapping between multiple domains: G(x, c) → y, where x and y are the input and output images, respectively, and c is a target domain which is randomised in the training process to enable a flexible transition. The model also employs a discriminator D to distinguish real images from fake ones and to classify images into their domains. Three loss functions are used:
• Adversarial loss: This loss function aims to ensure that the generated image is indistinguishable from real images.

• Domain classification loss: This loss helps the generator to capture the domain information of the images more effectively.
• Reconstruction loss: This guarantees that G translates only the domain information from the input while preserving the remaining content; a cycle-consistency formulation is applied to guarantee that the generator can reconstruct the original image using the original domain information.
ReenactGAN This technique transfers the source person's facial expressions to generate a fake image. Instead of using a pixel-wise transformation, the model maps the target image onto a latent space that closely captures the facial contours (i.e. boundaries).
The architecture of ReenactGAN consists of three DNNs: (i) an encoder (En), which embeds the target image into a latent boundary space; (ii) a target-specific decoder (De), which converts the latent boundary back into the target face; and (iii) a boundary transformer, which fits the boundaries of the target face to those of the source image. The encoder (En) and decoder (De) are trained to faithfully reconstruct faces, while the transformer adapts the latent boundaries to be similar to those of the source.
Monkey-Net This technique animates a source image by learning a set of motion-specific keypoints in an unsupervised manner, which allows it to describe relative movements between pixels. Then, only the relevant motion-specific patterns of the source image are transferred to generate the fake content.

The Monkey-Net framework contains three components. The first is the keypoint detector, which extracts motion-specific keypoints from the input frames. The output of this module is fed to the second component, the dense motion predictor, which translates the sparse keypoints into a motion heat map. The third module, called the motion transfer network, combines the source image with the motion heat map to generate the fake image. To train the model, a generator network G is trained together with the keypoint detector in an adversarial manner, where a discriminator D is responsible for distinguishing the real image from the fake one.
X2Face This technique controls the pose and expression of a given face image. X2Face takes two inputs: a source frame and a driving frame. The source frame is put through an encoder-decoder architecture named the embedding network, which learns a bilinear sampler to construct the mapping from the source frame to an embedded face. The driving frame is put through an encoder-decoder architecture named the driving network, which learns a bilinear sampler to transform the embedded face into the generated frame.
The network is trained in two stages. The first training stage is fully self-supervised, using images sampled from the same video. To this end, the generated frame and the driving frame have the same identity, which guarantees that the latent embedding learnt by the driving network must encode the variation factors (e.g. pose, expression, zoom), via a pixelwise L1 loss between the generated and driving frames. In the second training stage, additional identity loss functions are applied to enforce that the identities of the generated and source frames are the same. After training, the network is able to inject into a given source frame the variation factors from a driving frame.
2.3 Visual forensics techniques

Following the rapid development of forgery techniques, as well as the emerging threat of forged artefacts, many studies of visual forensics methods have been carried out. We can divide these methods into two categories: (i) computer vision techniques, which rely on handcrafted features to detect anomalous patterns (e.g. frequency, head pose); and (ii) deep learning techniques, which leverage the advances in deep learning to automatically learn hidden features that are non-trivial for humans.

2.3.1 Computer vision techniques
FDBD This method analyses the input image in the frequency domain to discover anomalous content. A frequency domain analysis is used to exploit the repetitive nature of generator artefacts.

More precisely, FDBD adopts a discrete Fourier transform (DFT) to decompose the input image into sinusoidal components of various frequencies. This spectral decomposition of the input image (which is treated as an M × N signal) reveals the distribution of signal energy over different frequency ranges:

X_{k,l} = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} x_{n,m} \, e^{-\frac{2\pi i}{N} kn} \, e^{-\frac{2\pi i}{M} lm},

where X_{k,l} is the frequency-domain representation, in which each frequency is associated with a signal amplitude and a phase.
The resulting spectra expose frequency artefacts that are characteristic of face manipulation techniques (e.g. DeepFake and FaceSwap).
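A minimal sketch of this kind of spectral analysis (my own illustration of the idea, not the FDBD implementation):

```python
import numpy as np

def log_spectrum(image: np.ndarray) -> np.ndarray:
    """2D DFT magnitude spectrum on a log scale, low frequencies centred."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    return np.log(np.abs(spectrum) + 1e-8)

# GAN-generated faces often show periodic, grid-like peaks in the
# high-frequency bands of this spectrum, which such detectors exploit.
gray = np.random.rand(256, 256)  # stand-in for a grayscale face crop
print(log_spectrum(gray).shape)
```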
Global consistency: Fake image generators, and especially feature-based techniques, often smooth a given face by interpolating the latent space of network features with supporting data points. However, these data points are not necessarily meaningful when new faces are generated, resulting in a mixture of different facial characteristics (e.g. differences in colour between the left and right eyes), which is referred to as global consistency.
Illumination estimation: An original image may contain incident illumination, and this poses a challenge when rendering a fake image with similar illumination conditions. Visual forgery techniques often leave traces of illumination-related artefacts: for example, a typical artefact of the DeepFake algorithm is a shading effect around the nose, in which one side is too dark.

Geometry estimation: Facial geometry is often taken into account in graphics-based models (e.g. 3D-Morphable) or feature-based generators (e.g. geometry estimators) to make the counterfeit image more realistic. However, this is often approximate, and leads to inaccurate details (artefacts). These artefacts typically appear along the boundary of the face mask (e.g. the nose, eyebrows and teeth) in the form of blending spots (strong edges or high contrast) or holes (missing detail).
HPBD When a forgery technique is used to inject the face of the target person into the source image, the facial landmarks may be mismatched. These errors in landmark locations can be discovered using a 2D head pose estimation between the real and fake regions of the input image. To achieve this, HPBD compares head poses across all facial landmarks and uses the central region to look for anomalies and discrepancies.
More precisely, the model utilises the 3D configuration of the facial landmarks, as described above. HPBD splits the system of 68 landmark points into two parts, representing the central and border regions of the face, and estimates a rotation matrix from each part. The vectors ⃗v_a and ⃗v_c representing the orientations of the head are then calculated as ⃗v_a = R_a ⃗w and ⃗v_c = R_c ⃗w, where R_a and R_c are the rotation matrices estimated from all landmarks and from the central landmarks, respectively, and ⃗w is a fixed reference direction. The difference between ⃗v_a and ⃗v_c is small for real images, and significantly larger for synthesised images. This feature is therefore a robust indicator for use in separating fake images from real ones.
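The final comparison step can be sketched as follows (my own illustration; the landmark fitting and pose estimation are abstracted away, and the vectors are made up):

```python
import numpy as np

def pose_discrepancy(v_a: np.ndarray, v_c: np.ndarray) -> float:
    """Cosine distance between the two estimated head-orientation vectors."""
    cos_sim = v_a @ v_c / (np.linalg.norm(v_a) * np.linalg.norm(v_c))
    return 1.0 - cos_sim

v_all = np.array([0.10, 0.02, 0.99])      # pose from all landmarks
v_central = np.array([0.35, 0.05, 0.93])  # pose from central landmarks

# Small for a real image (consistent poses), larger for a spliced face.
print(pose_discrepancy(v_all, v_central))
```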
2.3.2 Deep learning techniques

MesoNet This method detects forged images at a mesoscopic level of analysis. Two variants have been proposed based on the mesoscopic properties of the image, namely Meso-4 and MesoInception-4.
Meso-4: This variant is designed with four layers, which alternate between convolution and pooling, followed by fully connected layers. MesoInception-4: This variant uses inception modules instead of the first two convolutional layers. The idea behind the inception operation is to enrich the function space of the model by applying different kernel shapes to multiple convolutional layers simultaneously. The other vanilla convolution layers in Meso-4 are replaced by dilated convolutions [48] to avoid overfitting.
Capsule-forensics This method uses a capsule network to improve forgery detection, especially against highly realistic photos/videos.
The model first locates the face in the image and rescales it to a size of 128 × 128. This is then passed through a feature extractor, whose output is fed into a capsule network that contains: (i) three primary capsules, each of which integrates statistical pooling to enhance forgery detection; and (ii) two output capsules, which are dynamically routed from the primary capsules to decide between real and fake.
XceptionNet This approach [8] adopts the Inception architecture [53, 51] to extract the underlying features of input images to distinguish between fake and real images. The original Inception architecture maps the input data from the original space to multiple smaller spaces separately, and the cross-channel correlations between the smaller spaces are then put together via convolutional layers.
XceptionNet goes beyond existing Inception architectures by entirely decoupling the correlations across space and channels. It has 36 convolutional layers, which act as the feature extraction module of the whole network. This module in turn consists of three parts, each of which is constructed from a linear stack of depth-wise separable convolution layers with residual connections. This linear stacking increases the flexibility of development in terms of implementation and modification for high-level libraries such as Keras. The first part, referred to as the entry flow, processes the data once, while the second, called the middle flow, processes the data eight times. The final part, called the exit flow, then processes the data once. Finally, a logistic regression layer is applied for binary classification (real/fake).
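The depth-wise separable building block at the heart of this design can be sketched as follows (my own illustration, not the XceptionNet code; layer sizes are placeholders):

```python
import torch.nn as nn

# A per-channel spatial filter (depth-wise) followed by a 1x1 point-wise
# convolution that mixes information across channels: spatial and
# cross-channel correlations are handled by separate layers.
def separable_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )
```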
GAN-fingerprint In this approach, two types of fingerprints are investigated: model fingerprints and image fingerprints.
• Model fingerprint: This approach is based on the observation that, even if two well-trained GAN models differ only in their hyper-parameter configurations, the non-convexity of the loss functions and the adversarial equilibrium between the generator and discriminator make each model leave its own traces, even when their high-quality generations are otherwise equivalent. This uniqueness can be exploited to trace GAN-based modifications.
• Image fingerprint: If fake images are generated by the same GAN instance, they often have stable, common patterns, and vice versa. This uniqueness hints that the encoding of an image fingerprint is possible.
Using these two observations, GAN-fingerprint learns the model fingerprint for each source, and then uses it to map an input image to its fingerprint. Formally, given an image-model pair (I, y), where I is the input image and y ∈ Y is a GAN instance, the model learns a reconstruction function R : I → R(I) using pixel-wise reconstruction losses. An image is then attributed to a source according to the correlation between the model fingerprint F^y_mod and the image fingerprint F^I_im:

P(y \mid I) = \frac{\exp\big(\mathrm{cor}(F^{y}_{mod}, F^{I}_{im})\big)}{\sum_{\hat{y} \in Y} \exp\big(\mathrm{cor}(F^{\hat{y}}_{mod}, F^{I}_{im})\big)}.

The losses L_adv, L_pix and L_cls are then put together in a weighted-sum combination to train the model.
Chapter 3
Proposed facial forgery detection models
3.1 Efficient-Frequency model

To predict a fake face image, I use a pipeline that extracts the face from an image and detects whether this face is fake or real. The pipeline contains two main parts: a face detection part and a classification model part. First, the face is detected from the input images or video frames using the face detection module. The extracted face image is then forwarded to the classification model to detect whether it is fake or real.
Face detection A multi-task cascaded convolutional network (MTCNN) pipeline is used to extract the face from the given images or video frames. The pipeline contains three main steps. In the first step, the given image is rescaled to a range of different sizes (a.k.a. an image pyramid), then a shallow fully convolutional network (the so-called P-Net) is employed to produce the candidate windows. In the second step, a more complex CNN model, namely R-Net, is adopted to refine the window candidates and keep only the high-potential ones. In the last step, a powerful CNN model, namely O-Net, further refines the candidates and locates the facial landmark positions. Between the steps, non-maximum suppression (NMS) is used to filter the candidate bounding boxes.
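As an illustration of how such a cascaded detector is typically invoked, the sketch below uses the facenet-pytorch package; the package and its API are my assumption about one common MTCNN implementation, not necessarily the one used in this thesis:

```python
from PIL import Image
from facenet_pytorch import MTCNN  # pip install facenet-pytorch

# P-Net, R-Net and O-Net, with NMS between stages, run inside this detector.
mtcnn = MTCNN(image_size=224, margin=20)

img = Image.open("frame.jpg")
face = mtcnn(img)                 # aligned face crop as a tensor, or None
boxes, probs = mtcnn.detect(img)  # bounding boxes and confidences
```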
Figure 3.1 depicts the overview of the Efficient-Frequency pipeline. As can be seen from the figure, the extracted face image is analysed in the frequency domain using the Fourier transform. The original image and its frequency-domain representation are then forwarded into two separate EfficientNet models.
Figure 3.1: Overview of the Efficient-Frequency pipeline
These EfficientNet models learn the underlying features. The learnt features are combined by a late-fusion mechanism which considers the importance of the information, then forwarded to a fully connected layer, and the common cross-entropy loss is used for binary classification.
Frequency analysis We utilise the discrete Fourier transform (DFT) to obtain the frequency-domain representation of the input image. The representation can be considered as a spectral decomposition of the image that indicates the distribution of its energy over a range of frequencies given the spatial resolution. For 2-dimensional image data of size M × N, the Fourier transform can be computed as in section 2.3.1. It is worth noting that the obtained representation inherits the same dimensionality as the original image. An azimuthal average is then applied to flatten the representation into a 1-dimensional form. The transformation can be considered as a compression where similar frequency components are gathered and averaged into a feature vector. The compression helps to significantly reduce the number of features with minimal loss of information, resulting in a more robust representation of the input image.
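A minimal sketch of this compression (my own implementation of the idea, assuming the flattening is a radial average of the spectrum):

```python
import numpy as np

def azimuthal_average(image: np.ndarray) -> np.ndarray:
    """Compress a 2D spectrum into a 1D radial-frequency profile."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(image)))
    h, w = spectrum.shape
    y, x = np.indices((h, w))
    r = np.hypot(x - w // 2, y - h // 2).astype(int)
    # Average all spectral magnitudes that share the same radius.
    sums = np.bincount(r.ravel(), weights=spectrum.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)

profile = azimuthal_average(np.random.rand(128, 128))
print(profile.shape)  # one value per integer radius
```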
Classification model We employ EfficientNet as the backbone network, as it performs better pattern extraction while guaranteeing the efficiency of the model. The concept of the model is designed using a multi-objective neural architecture search that optimises the two mentioned criteria, accuracy and efficiency. Our model leverages the original EfficientNet variant, namely EfficientNet-B0, as this variant can capture facial detail information better than scaled-up variants such as EfficientNet-B1, where the details can be washed out due to over-scaling.
Figure 3.2: EfficientNet architecture
The network starts with a convolutional layer of size 3 × 3, which performs lightweight filtering. Then, the network continues with multiple stacked mobile inverted bottleneck (MBConv) blocks. An MBConv block consists of a point-wise (1 × 1) convolution, a depth-wise convolution and a spatially-filtered feature map. Instead of standard convolution, MBConv uses a depth-wise separable layer to reduce the computational cost while guaranteeing the quality of pattern extraction. The last layer of an MBConv block is a spatially-filtered feature map which projects information from the previous layer back to a low-dimensional subspace using another point-wise convolution. The low-dimensional subspace helps to preserve the essential information while reducing the complexity of the model. For each MBConv block, a residual connection is added to aid gradient flow during backpropagation. After the stacked MBConv layers, the output feature map is flattened into a 1-dimensional vector using a fully connected layer at the end of the network.
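A simplified MBConv block could look as follows (my own sketch; the squeeze-and-excitation stage of the real EfficientNet block is omitted):

```python
import torch.nn as nn

class MBConv(nn.Module):
    """Expand (1x1) -> depth-wise 3x3 -> project (1x1), with a residual."""
    def __init__(self, ch: int, expand: int = 6):
        super().__init__()
        mid = ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, ch, 1, bias=False), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        # The residual connection aids gradient flow during backpropagation.
        return x + self.block(x)
```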
Late-fusion mechanism As discussed, we use two separate EfficientNets to extract the patterns from the original image and its frequency-domain representation, resulting in two 1-dimensional feature vectors.
... estimation between the real and fake regions of the input image To achieve this, HPBDcompares head poses across all facial landmarks and uses the central region to look for anomaliesand discrepanciesMore... data-page="31">
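As a rough sketch of how such a weighted late fusion could look (my own illustration based on the description above; the feature dimension and the single learnt weight are assumptions):

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Weighted fusion of visual and frequency features, then classification."""
    def __init__(self, dim: int = 1280):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnt importance
        self.classifier = nn.Linear(dim, 2)           # real vs fake

    def forward(self, f_visual, f_freq):
        fused = self.alpha * f_visual + (1 - self.alpha) * f_freq
        return self.classifier(fused)
```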
Chapter 3
Proposed facial forgery detection models
To predict a fake face image, I use a pipeline to extract face from image and detect this face
is fake or... refine thewindow candidates and keep only the high potential ones In the last step, a powerful CNN modelnamely O-Net further refines the candidates and locates the facial landmarks positions