

HCMC UNIVERSITY OF

TECHNOLOGY AND EDUCATION

Faculty of Electrical and Electronics Engineering

Student name: Vo Anh Quoc Student ID: 15151254

Student name: Tran Van Son Student ID: 15151211

Major: Automation and Control Engineering Technology

Program: Full-time program

Cohort: 2015 – 2019 Class: 151511C

I. THESIS NAME:

BUILDING ATTENDANCE SYSTEM PROTOTYPE

BASED ON FACE RECOGNITION USING MACHINE LEARNING

II. TASKS:

1. INITIAL FIGURES AND DOCUMENTS:

2. CONTENT OF THE THESIS:

- Designing and constructing a real-time classroom attendance system prototype based on face recognition using Machine Learning, implemented on the NVIDIA Jetson Nano Developer Kit and a Logitech C270 webcam.

- Understanding and applying algorithms for the Face Detection and Face Verification stages.

III. RECEIVED DATE:

IV. THESIS COMPLETED DATE:

V. ADVISOR: My-Ha Le, PhD

Ho Chi Minh City, 4th July 2019

Advisor                Head of Department


SCHEDULE

Week 9 – 10 (14/5 – 27/5), Vo Anh Quoc:
- Getting samples of people's faces
- Training and testing the results
- Adjusting and completing the software

Week 11 – 12 (28/5 – 10/6), Tran Van Son:
- Setting up the hardware (NVIDIA Jetson Nano) and installing the libraries and environment to build the code
- Testing and completing the system on the NVIDIA Jetson Nano

Week 13 – 14 (11/6 – 24/6), Tran Van Son and Vo Anh Quoc:
- Writing the final report
- Preparing for the presentation meeting

Ho Chi Minh City, 4th July 2019

ADVISOR


Ho Chi Minh City, 4th July 2019

Implementer

Tran Van Son Vo Anh Quoc


ACKNOWLEDGEMENT

First and foremost, we would like to express our sincere thanks to our thesis advisor, My-Ha Le, PhD. Without his assistance and dedicated involvement in every step of the process, this thesis could not have been accomplished. He is one of the notable lecturers at Ho Chi Minh City University of Technology and Education, with studies and papers in fields related to Artificial Intelligence in general and Image Processing in particular, so we consider ourselves lucky to have worked with him. He guided us from the very beginning of the research, from reading and searching for remarkable papers to choosing the approach to follow. He also showed us the core values we needed to focus on in our graduation thesis and final-year project. Moreover, he taught us how to present and debate when standing in front of the graduation thesis council.

Second, we would like to thank all the lecturers of the Faculty of Electrical and Electronics Engineering. This thesis combines the knowledge we have learned over the past four years, and we know that every lecture will be useful for our career paths after graduation.

Third, we would like to thank the Intelligent System Laboratory (ISLab), Faculty of Electrical and Electronics Engineering, for providing the facilities that enabled us to carry out this thesis.

Last but not least, special mention goes to our parents and the members of ISLab, who are our kind friends. Even though we did not work on the same topics, they enthusiastically supported us whenever we got stuck.

Ho Chi Minh City, 4th July 2019

Tran Van Son Vo Anh Quoc


SOCIALIST REPUBLIC OF VIETNAM

Independence – Freedom – Happiness

******

ADVISOR’S COMMENT SHEET

Student name: Vo Anh Quoc Student ID: 15151254

Student name: Tran Van Son Student ID: 15151211

Major: Automation and Control Engineering Technology

Program: Full-time program

Cohort: 2015 – 2019 Class: 151511C

1. About the thesis contents:

- The students fulfilled the requirements of the graduation thesis.

2. Advantages:

- The system works stably.

- The system identifies subjects with high accuracy.

6. Mark: 9.8 (In writing: Nine point eight)

Ho Chi Minh City, 4th July 2019

Advisor


SOCIALIST REPUBLIC OF VIETNAM

Independence – Freedom – Happiness

******

REVIEWER’S COMMENT SHEET

Student name: Vo Anh Quoc Student ID: 15151254

Student name: Tran Van Son Student ID: 15151211

Major: Automation and Control Engineering Technology

Program: Full-time program

Cohort: 2015 – 2019 Class: 151511C

1. About the thesis contents:

2. Advantages:

3. Disadvantages:

4. Propose defending thesis?

5. Rating:

6. Mark: ……… (In writing: ………)

Ho Chi Minh City, 9th July 2019

Reviewer


TABLE OF CONTENTS

THESIS TASKS i

SCHEDULE ii

ASSURANCE STATEMENT iv

ACKNOWLEDGEMENT v

ADVISOR’S COMMENT SHEET vi

REVIEWER’S COMMENT SHEET vii

TABLE OF CONTENTS viii

ABBREVIATIONS AND ACRONYMS xi

LIST OF FIGURES xii

LIST OF TABLES xv

CHAPTER 1: OVERVIEW 1

1.1 INTRODUCTION 1

1.1.1 Introduction to Face Recognition Problem 1

1.1.2 Face Recognition Application 4

1.2 THESIS OBJECTIVE 6

1.3 RESEARCH SCOPE OF THE THESIS 6

1.4 RESEARCH METHOD 6

1.5 MAIN CONTENT 7

CHAPTER 2: ARTIFICIAL INTELLIGENCE 8

2.1 OVERVIEW 8

2.1.1 Artificial Intelligence 8

2.1.2 Machine Learning 10

2.1.3 Deep Learning 14

2.2 CONVOLUTIONAL NEURAL NETWORK 18

2.2.1 Introduction 18

2.2.2 Convolutional Neural Network Architectures 19

2.2.2.1 Convolutional Layer 20

2.2.2.2 Non-linearity 22

2.2.2.3 Stride and Padding 23

2.2.2.4 Pooling layer 25

2.2.2.5 Flattening Layer 26

2.2.2.6 Fully-Connected Layer 27

2.3 MTCNN (Multi-task Cascaded Convolutional Neural Networks) 27


2.3.1 Multi-Task 27

2.3.2 CNN Architectures 29

2.4 ONE SHOT LEARNING AND TRIPLET LOSS 32

2.4.1 One Shot Learning 32

2.4.2 Triplet Loss 34

2.5 SVM CLASSIFIER 36

2.5.1 Image classification 36

2.5.1.1 Definition of SVM 38

2.5.1.2 Hyperplanes and Support Vectors 39

2.5.1.3 Maximal-Margin Classifier 40

2.5.1.4 Soft Margin Classifier 41

2.5.1.5 Support Vector Machines (Kernels) 42

CHAPTER 3: HARDWARE AND RELATED WORKS 44

3.1 HARDWARE COMPONENTS 44

3.1.1 NVIDIA Jetson Nano Developer Kit 44

3.1.2 Logitech Webcam C270 48

3.1.3 Arduino Uno R3 50

3.2 HARDWARE BLOCK DIAGRAM AND WIRING DIAGRAM 55

3.3 THE CONSTRUCTION OF HARDWARE PLATFORM 56

CHAPTER 4: SOFTWARE AND ALGORITHM 57

4.1 SOFTWARE AND INTEGRATED DEVELOPMENT ENVIRONMENT 57

4.1.1 Facenet Introduction 57

4.1.2 Python 58

4.1.3 Visual Studio Code (VSC) 59

4.2 NETWORK ARCHITECTURE 60

4.2.1 Inception v1 60

4.2.1.1 The Premise: 61

4.2.1.2 The Solution: 61

4.2.2 Inception-ResNet v1 and v2 63

4.2.2.1 The Premise 64

4.2.2.2 The Solution 64

4.3 ALGORITHM AND FLOWCHART 66

4.3.1 Flowchart of collecting training data 66

4.3.2 Flowchart of face identification using trained data 68

CHAPTER 5: EXPERIMENTAL RESULT 70

5.1 DATA COLLECTING 70

5.2 DATA DESCRIPTION 70

5.3 FACE RECOGNITION PROCESS 75


CHAPTER 6: CONCLUSION AND FUTURE WORKS 82

6.1 CONCLUSION 82

6.1.1 Advantages 82

6.1.2 Disadvantages 83

6.2 FUTURE WORKS 83

REFERENCES 84


ABBREVIATIONS AND ACRONYMS

AI: Artificial Intelligence

CCTV: Closed-Circuit Television

CNN: Convolutional Neural Network

DL: Deep Learning

ML: Machine Learning

PIN: Personal Identification Number

MTCNN: Multi-task Cascaded Convolutional Neural Networks

SVM: Support Vector Machine

NMS: Non-maximum Suppression


LIST OF FIGURES

Figure 1.1 Face recognition process 2

Figure 1.2 The general scheme of face recognition on an input image 3

Figure 1.3 Selfie pay app of MasterCard 4

Figure 1.4 FaceTech applied to authentication on the iPhone X 5

Figure 1.5 OptimEyes screens at 450 petrol stations in the UK 5

Figure 2.1 Relation Between AI, Machine Learning, and Deep Learning 10

Figure 2.2 Structure of a biological neuron 11

Figure 2.3 Structure of artificial neurons 12

Figure 2.4 Comparison of performance between DL and other learning algorithms 15

Figure 2.5 The organization of the visual cortex system 16

Figure 2.6 A recurrent neural network and the unfolding into a full network 17

Figure 2.7 Autoencoder architecture 17

Figure 2.8 A computer sees an image as an array of pixel values 19

Figure 2.9 CNN architecture 20

Figure 2.10 The input and filter of a convolutional layer 20

Figure 2.11 The convolutional operation of a CNN 21

Figure 2.12 The result of a convolution operation 21

Figure 2.13 Perform multiple convolutions on an input 22

Figure 2.14 The convolution operation for each filter 22

Figure 2.15 The feature map after applying the rectifier function 23

Figure 2.16 Resulting feature map 24

Figure 2.17 Apply zero padding for input image 24

Figure 2.18 The max pooling operation 25

Figure 2.19 The max pooling operation for multi feature map 26

Figure 2.20 Flattening operation 26

Figure 2.21 Fully-Connected layer 27

Figure 2.22 Overall process of MTCNN[9] 29

Figure 2.23 P-Net architecture[9] 30

Figure 2.24 R-Net architecture[9] 31

Figure 2.25 O-Net architecture[9] 32


Figure 2.26 One shot learning with a limited data 33

Figure 2.27 Triplet loss operation [10] 34

Figure 2.28 Triplet loss on two positive faces (Obama) and one negative face (Macron)[10] 35

Figure 2.29 An image classification model is represented on an image 36

Figure 2.30 An example training set for four visual categories 38

Figure 2.31 Possible hyperplanes 38

Figure 2.32 Hyperplanes in 2D and 3D feature space 39

Figure 3.1 NVIDIA Jetson Nano Developer Kit 44

Figure 3.2 Jetson Nano block function 44

Figure 3.3 Jetson Nano block diagram 45

Figure 3.4 Jetson Nano compute module with 260-pin edge connector 46

Figure 3.5 NVIDIA Jetson Nano Dev Kit pinout 47

Figure 3.6 Performance of various deep learning inference networks with Jetson Nano and TensorRT, using FP16 precision and batch size 1 48

Figure 3.7 Logitech Webcam C270 49

Figure 3.8 Arduino Uno R3, top view 51

Figure 3.9 Arduino Uno R3 pinout 54

Figure 3.10 Hardware block diagram of the system 55

Figure 3.11 Hardware wiring diagram of the system 55

Figure 3.12 The construction of hardware platform 56

Figure 4.1 Trained via Triplet Loss 57

Figure 4.2 FaceNet’s algorithm operation 58

Figure 4.3 Visual Studio Code interface 59

Figure 4.4 From left: a dog occupying most of the image, a dog occupying a part of it, and a dog occupying very little space [7] 61

Figure 4.5 The naive inception module [7] 62

Figure 4.6 Inception module with dimension reduction [7] 62

Figure 4.7 GoogLeNet. The orange box is the stem, which has some preliminary convolutions. The purple boxes are auxiliary classifiers. The wide parts are the inception modules [7] 63


Figure 4.8 Inception modules A, B, C in an Inception-ResNet. Note how the pooling layer was replaced by the residual connection, and also the additional 1x1 convolution before addition [8] 64

Figure 4.9 Reduction Block A (35x35 to 17x17 size reduction) and Reduction Block B (17x17 to 8x8 size reduction). Refer to the paper for the exact hyper-parameter setting [8] 64

Figure 4.10 Activations are scaled by a constant to prevent the network from dying [8] 65

Figure 4.11 The top image is the layout of Inception v4 The bottom image is the layout of Inception-ResNet [8] 66

Figure 4.12 Detected face by using MTCNN 67

Figure 4.13 Flowchart of taking dataset 67

Figure 4.14 Flowchart of training dataset 67

Figure 4.15 Face is recognized after detecting 68

Figure 4.16 Flowchart of recognition process 69

Figure 5.1 Collecting the faces 70

Figure 5.2 Facial images are collected under different condition of emotion 71

Figure 5.3 Dataset in Condition 1 72

Figure 5.4 Dataset in Condition 2 72

Figure 5.5 Dataset in Condition 3 73

Figure 5.6 Dataset in Condition 4 73

Figure 5.7 Dataset in Condition 5 74

Figure 5.8 Dataset in Condition 6 74

Figure 5.9 Testing result in classroom with fluorescent light 80

Figure 5.10 Testing result in classroom without fluorescent light 80

Figure 5.11 Saved timesheet 81


LIST OF TABLES

Table 3-1 Specifics of Jetson Nano 45

Table 3-2 Requirements while using camera 49

Table 3-3 Specifics of Camera 49

Table 3-4 Specifics of Arduino Uno R3 52

Table 5-1 Example of confusion matrix 76

Table 5-2 Testing result in Condition 1 76

Table 5-3 Testing result in Condition 2 77

Table 5-4 Testing result in Condition 3 77

Table 5-5 Testing result in Condition 4 78

Table 5-6 Testing result in Condition 5 78

Table 5-7 Testing result in Condition 6 79


CHAPTER 1: OVERVIEW

1.1 INTRODUCTION

The 21st century is an era of technological explosion and development, in which much progress has been made to help humans accomplish their tasks. In particular, computer technology plays a vital role in many fields of human life. Recently, a vast range of impressive artificial-intelligence-based applications has emerged. For instance, self-driving cars are cars driven without a driver by AI technologies and automatic learning (AL). Another application is chatbots for customer service, which help respond quickly and efficiently to customers' main worries, questions, or problems; the most advanced chatbots are already able to answer open questions by using NLP and AL to respond in a human-like way. In the authentication problem [1], biometric recognition systems built on AI and the biological characteristics and traits of an individual are more reliable than traditional methods such as smart cards, wallets, keys, and tokens using PINs and passwords. Such systems can be built with various techniques. The most commonly used are fingerprint and iris methods, which require the individual's active participation to access the system. Lately, however, the latest systems allow participants to be authenticated without such intervention. Among these methods, face recognition seems to be the most viable technique so far, since every individual can easily be captured and monitored by the system.

1.1.1 Introduction to Face Recognition Problem

Face recognition has been one of the most interesting and important research fields over the past two decades. The reasons come from the need for automatic recognition and surveillance systems, the interest in how the human visual system performs face recognition, the design of human-computer interfaces, and so on. This research draws on knowledge and researchers from disciplines such as neuroscience, psychology, computer vision, pattern recognition, image processing, and machine learning. Many papers have been published to overcome different factors (such as illumination, expression, scale, and pose) and achieve better recognition rates, while there is still no technique that is robust against uncontrolled practical cases, which may involve several of these factors simultaneously.

Face recognition is a visual pattern recognition problem, in which the face, represented as a three-dimensional object subject to varying illumination, pose, expression, and other factors, needs to be identified from acquired images. Given a picture taken by a digital camera, we would like to know whether there is any person inside, where his or her face is located, and who he or she is. Towards this goal, the face recognition procedure is generally separated into three steps [2]: Face Detection, Feature Extraction, and Face Recognition (shown in Figure 1.1).

Face Detection: The main function of this step is to determine whether human faces appear in a given image and where these faces are located. The expected outputs of this step are patches containing each face in the input image. This step also involves Face Alignment, which adjusts the scales and orientations of these patches in order to make the subsequent face recognition stages more robust and easier to design. Besides serving as pre-processing for face recognition, face detection can also be used for region-of-interest detection, video and image classification, etc.

Features Extraction: After detecting all human faces in a given image, human-face patches are extracted from the image. Feature extraction is performed for information packing, dimension reduction, salience extraction, and noise cleaning. After this step, a face patch is usually transformed into a vector of fixed dimension or into a set of fiducial points and their corresponding locations.

Figure 1.1 Face recognition process

Face Recognition: After formalizing the representation of each face, the last step is to recognize the identities of these faces. In order to achieve automatic recognition, a face database needs to be built. For each person, several images are taken, and their features are extracted and stored in the database. Then, when an input face image comes in, face detection and feature extraction are performed, and its features are compared to each face class stored in the database. Much research and many algorithms have been proposed to deal with this classification problem. There are two general applications of face recognition: one is called identification and the other verification. Face identification means that, given a face image, we want the system to tell who he or she is, or the most probable identity; in face verification, given a face image and a guess of the identity, we want the system to tell whether the guess is true or false. In Figure 1.2, we show an example of how these three steps work [2] on an input image.

Figure 1.2 The general scheme of face recognition on an input image: (a) the input image and the result of face detection (the red rectangle); (b) the extracted face patch; (c) the feature vector after feature extraction; (d) comparing the input vector with the stored vectors in the database by classification techniques and determining the most probable class (the red rectangle). Here each face patch is expressed as a d-dimensional vector.
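The identification and verification modes described above can be sketched with a toy embedding database. This is only an illustration: the names and the short 4-dimensional embeddings are made up (a real FaceNet embedding, as used later in this thesis, is 128-dimensional), and the distance threshold is arbitrary.

```python
import numpy as np

# Hypothetical database: one reference embedding per enrolled person.
database = {
    "alice": np.array([0.9, 0.1, 0.0, 0.1]),
    "bob":   np.array([0.1, 0.8, 0.2, 0.0]),
}

def identify(embedding, database):
    """Identification: return the closest enrolled identity."""
    distances = {name: np.linalg.norm(embedding - ref)
                 for name, ref in database.items()}
    return min(distances, key=distances.get)

def verify(embedding, claimed_name, database, threshold=0.5):
    """Verification: accept the claimed identity if the distance is small."""
    return np.linalg.norm(embedding - database[claimed_name]) < threshold

query = np.array([0.85, 0.15, 0.05, 0.1])  # embedding of a new face image
print(identify(query, database))           # -> alice
print(verify(query, "alice", database))    # -> True
```

In practice the query vector would come from the feature-extraction step, and each database entry would be built from several enrollment images of the same person.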


1.1.2 Face Recognition Application

Face recognition is applied in countless areas, such as surveillance camera systems, criminal identification, and access and security. Here are some applications in human life:

Access and security

Facial biometrics can be integrated with physical devices and objects. Instead of using passcodes, mobile phones and other consumer electronics will be accessed via their owners' facial features. Apple, Samsung, and Xiaomi Corp have all installed FaceTech in their phones.

Payments

In 2016, MasterCard launched a new selfie-pay app called MasterCard Identity Check. Customers open the app to confirm a payment using their camera, and that's that. Facial recognition is already used in stores and at ATMs, but the next step is to do the same for online payments. Chinese e-commerce firm Alibaba and its affiliated payment software Alipay are planning to apply the software to purchases made over the Internet.

Figure 1.3 Selfie pay app of MasterCard

Advertising

The ability to collect and collate masses of personal data has given marketers and advertisers the chance to get closer than ever to their target markets. FaceTech could do much the same by allowing companies to recognize certain demographics: for instance, if the customer is a male between the ages of 12 and 21, the screen might show an ad for the latest FIFA game.

Grocery giant Tesco plans to install OptimEyes screens at 450 petrol stations in the UK to deliver targeted ads to customers.

At present, the face recognition problem faces challenges which significantly affect recognition ability. For example, images are commonly obtained from Closed-Circuit Television (CCTV) cameras in public places, and they are usually not frontal face images and are of poor quality. There are many methods to solve this problem; among them, Machine Learning (ML) has given the top performance so far. Therefore, this thesis describes how to apply that method to building an attendance system prototype based on face recognition using Machine Learning.

Figure 1.4 FaceTech applied to authentication on the iPhone X

Figure 1.5 OptimEyes screens at 450 petrol stations in the UK

1.2 THESIS OBJECTIVE

This thesis mainly focuses on the following key goals:

 Research and understand the Inception-ResNet V1 architecture.

 Research and understand the MTCNN algorithm and how it works for face detection by facial feature extraction.

 Research and understand CNNs and how they work for facial feature extraction.

 Research and understand triplet loss, a Machine Learning technique, and how it works for classification and verification.

 Research and use suitable and compatible components for building the attendance system prototype. In particular, this thesis mainly focuses on using the NVIDIA Jetson Nano Developer Kit and the Logitech Camera C270.

 The overall target is building an attendance system prototype based on face recognition using Machine Learning.

1.3 RESEARCH SCOPE OF THE THESIS

The scope of this thesis is to build a real-time face recognition system based on the related algorithms, the NVIDIA Jetson Nano Developer Kit, and the Logitech Camera C270. The input face images are assumed to be frontal faces.

1.4 RESEARCH METHOD

At the beginning, we chose the project topic and then quickly carried out general research on that keyword. After that, we gave a presentation to our advisor, My-Ha Le, PhD. He accepted the ideas we presented and guided us on how to proceed.

First, we researched papers and surveys related to our chosen project topic.

Second, after reading those papers and surveys, we chose the method we wanted to apply to the project.

Third, we found and read materials relevant to the chosen project. In particular, we had to understand the theories of CNNs, Machine Learning, the triplet-loss algorithm, and so on.

Fourth, we built the hardware system and combined its parts together.

Fifth, we applied the theories from the third step to the hardware system by coding.

Sixth, we checked and measured the accuracy of the system.

We also needed to report to our advisor every two weeks.

1.5 MAIN CONTENT

The thesis, namely "Building attendance system prototype based on face recognition using Machine Learning", includes the following chapters:

Chapter 1, Overview: This chapter provides an overview of the whole thesis and its requirements, including the Introduction, Objective, Scope, Research Method, and Main Content of the thesis.

Chapter 2, Artificial Intelligence: This chapter provides the knowledge relevant to building a face recognition system, such as Convolutional Neural Networks, MTCNN, the triplet-loss algorithm, and the SVM classifier.

Chapter 3, Hardware and related works: This chapter describes the hardware components chosen to build the system and how they are combined.

Chapter 4, Software and Algorithm: This chapter presents the algorithms and diagrams.

Chapter 5, Experimental results: This chapter shows the experimental results obtained after building the system.

Chapter 6, Conclusion and Development Orientation: This chapter provides conclusions in terms of the advantages and limitations of this thesis. It also summarizes the contributions and proposes ideas and development orientations for future work.


CHAPTER 2: ARTIFICIAL INTELLIGENCE

2.1 OVERVIEW

2.1.1 Artificial Intelligence

Artificial intelligence (AI) has the power to disrupt many industries, and it is no surprise that it is among the most buzzworthy topics in business today. However, AI is also widely misunderstood: in addition to being touted as the next big thing for cutting-edge companies, AI is also supposedly going to replace human workers and dismantle industry as we know it [3].

"Artificial Intelligence" describes machines that use decision-making or computing processes that mimic human cognition. In practice, this requires that a machine understand information about its environment, whether that is a physical space, the status of a game, or relevant information from a database. An artificially intelligent system can then use this data to optimize actions that help achieve a specific goal, like winning a game of chess.

Artificial Intelligence is a wide field encompassing several sub-fields, techniques, and algorithms. The field is based on the goal of making a machine as smart as a human; that was literally the initial overarching goal. Back in 1956, researchers came together at Dartmouth with the explicit goal of programming computers to behave like humans. This was the modern birth of Artificial Intelligence as we know it today.

To further explain the goals of Artificial Intelligence, researchers extended this primary goal to six main goals:

- Logical Reasoning: Enable computers to do the types of sophisticated mental tasks that humans are capable of. Examples of logical reasoning problems include playing chess and solving algebra word problems.

- Knowledge Representation: Enable computers to describe objects, people, and languages. Examples include object-oriented programming languages such as Smalltalk.

- Planning and Navigation: Enable a computer to get from point A to point B. For example, the first self-driving robot was built in the early 1960s.

- Natural Language Processing: Enable computers to understand and process language. One of the first projects in this area attempted to translate between English and Russian.

- Perception: Enable computers to interact with the world through sight, hearing, touch, and smell.

- Emergent Intelligence: Intelligence that is not explicitly programmed but emerges from the rest of the explicit AI features. The vision for this goal was to have machines exhibit emotional intelligence, moral reasoning, and more.

Even with these main goals, this does not categorize the specific Artificial Intelligence algorithms and techniques. The following are six of the major algorithms and techniques within Artificial Intelligence:

- Machine Learning: The field of artificial intelligence that gives computers the ability to learn without being explicitly programmed.

- Search and Optimization: Algorithms such as gradient descent that iteratively search for local maxima or minima.

- Constraint Satisfaction: The process of finding a solution to a set of constraints that impose conditions the variables must satisfy.

- Logical Reasoning: An example of logical reasoning in artificial intelligence is an expert system that emulates the decision-making ability of a human expert.

- Probabilistic Reasoning: Combines the capacity of probability theory to handle uncertainty with the capacity of deductive logic to exploit the structure of a formal argument. The result is a richer and more expressive formalism with a broad range of possible application areas.

- Control Theory: A formal approach to finding controllers that have provable properties. This usually involves a system of differential equations describing a physical system such as a robot or an aircraft.
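The gradient descent technique listed above under Search and Optimization can be shown on a one-variable toy problem. This is a minimal sketch; the function f(x) = (x - 3)^2, the step size, and the iteration count are all arbitrary illustrative choices.

```python
def grad(x):
    # Derivative of f(x) = (x - 3)^2, whose minimum is at x = 3.
    return 2 * (x - 3)

x = 0.0           # arbitrary starting point
lr = 0.1          # learning rate (step size)
for _ in range(100):
    x -= lr * grad(x)   # step against the gradient

print(round(x, 4))  # -> 3.0 (converged to the minimum)
```

The same idea, applied to millions of weights instead of one variable, is how the neural networks discussed later in this chapter are trained.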

Artificial Intelligence, Machine Learning, and Deep Learning are each a subset of the previous field, as shown in Figure 2.1. Artificial Intelligence is the overarching category for Machine Learning, and Machine Learning, in turn, is the overarching category for Deep Learning.

2.1.2 Machine Learning

Machine Learning is a subset of Artificial Intelligence. While Artificial Intelligence aims to make computers smart, Machine Learning takes the stance that we should give data to the computer and let the computer learn on its own. The idea that computers might be able to learn for themselves was conceived by Arthur Samuel in 1959.

One major breakthrough led to the emergence of Machine Learning as the driving force behind Artificial Intelligence: the invention of the internet. With the internet, a huge amount of digital information is being generated, stored, and made available for analysis; this is when you start hearing about Big Data.

Figure 2.1 Relation Between AI, Machine Learning, and Deep Learning

Moreover, Machine Learning algorithms have been the most effective at leveraging all of this Big Data.

Neural networks are a key piece of some of the most successful machine learning algorithms, and their development has been key to teaching computers to think and understand the world the way humans do. Essentially, a neural network emulates the human brain: brain cells, or neurons, are connected via synapses, and this is abstracted as a graph of nodes (neurons) connected by weighted edges (synapses). The structure of a biological neuron is illustrated in Figure 2.2.

Our brains use extremely large interconnected networks of neurons to process information and model the world we live in. Electrical inputs are passed through this network of neurons, resulting in an output. In the case of a biological brain, this could mean contracting a muscle or signaling your sweat glands to produce sweat. A neuron collects inputs through structures called dendrites, effectively sums all of these inputs, and fires if the resulting value is greater than its firing threshold. When the neuron fires, it sends an electrical impulse through its axon to its boutons, which can then be networked to thousands of other neurons via connections called synapses.

Figure 2.2 Structure of a biological neuron


There are about one hundred billion (100,000,000,000) neurons inside the human brain, each with about one thousand synaptic connections. It is effectively the way in which these synapses are wired that gives our brains the ability to process information the way they do.

Neuron models are, at their core, simplified models based on biological neurons, which allows them to capture the essence of how a biological neuron functions. We usually refer to these artificial neurons as "perceptrons".

As shown in Figure 2.3, a typical perceptron has many inputs, and these inputs are all individually weighted. The perceptron weights can either amplify or attenuate the original input signal. For example, if the input is 1 and the input's weight is 0.2, the input will be decreased to 0.2. These weighted signals are then added together and passed into the activation function, which converts the input into a more useful output. There are many different types of activation function, but one of the simplest is the step function: a step function typically outputs a 1 if the input is higher than a certain threshold, and otherwise outputs 0.
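The weighted-sum-plus-step-function behavior just described can be written out directly. This sketch reuses the 0.2-weight example from the text; the other weights and the threshold are arbitrary illustrative values.

```python
import numpy as np

def perceptron(inputs, weights, threshold=0.5):
    """Weighted sum of the inputs passed through a step activation."""
    total = np.dot(inputs, weights)       # amplify/attenuate each input, then sum
    return 1 if total > threshold else 0  # step function

inputs  = np.array([1.0, 1.0, 0.0])
weights = np.array([0.2, 0.6, 0.9])      # input of 1 * weight 0.2 contributes only 0.2

print(perceptron(inputs, weights))                      # 0.2 + 0.6 = 0.8 > 0.5 -> 1
print(perceptron(np.array([1.0, 0.0, 0.0]), weights))   # 0.2 <= 0.5 -> 0
```

Stacking many such units into layers, and replacing the step with a smooth activation, gives the neural networks described next.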

Figure 2.3 Structure of artificial neurons

The details of some of these components are as follows:

- Neurons: A neural network is a graph of neurons. Like a single neuron, a neural network has inputs and outputs, which are represented by input neurons and output neurons. Input neurons have no predecessor neurons but do have an output; similarly, an output neuron has no successor neurons but does have inputs.

- Connections and Weights: A neural network consists of connections, each connection transferring the output of a neuron to the input of another neuron. Each connection is assigned a weight.

- Propagation Function: The propagation function computes the input of a neuron from the outputs of predecessor neurons. The propagation function is leveraged during the forward propagation stage of training.

- Learning Rule: The learning rule is a function that modifies the weights of the connections. This serves to produce a favored output for a given input for the neural network. The learning rule is leveraged during the backward propagation stage of training.

- Learning Types: There are many different algorithms that can be used when training artificial neural networks, such as Supervised Learning, Unsupervised Learning, and Reinforcement Learning, each with their own separate advantages and disadvantages. The learning process within artificial neural networks is a result of altering the network's weights with some kind of learning algorithm. The objective is to find a set of weight matrices which, when applied to the network, will hopefully map any input to a correct output.

Supervised learning is used if the desired output for the network is also provided with the input while training the network. By providing the neural network with both an input and output pair, it is possible to calculate an error based on its target output and actual output. It can then use that error to make corrections to the network by updating its weights.

In unsupervised learning, the network is only given a set of inputs, and it is the neural network's responsibility to find some kind of pattern within the inputs provided without any external aid. This type of learning paradigm is often used in data mining and is also used by many recommendation algorithms due to their ability to predict a user's preferences based on the preferences of other similar users it has grouped together.

Reinforcement learning is similar to supervised learning in that some feedback is given; however, instead of providing a target output, a reward is given based on how well the system performed. The aim of reinforcement learning is to maximize the reward the system receives through trial and error. This paradigm relates strongly to how learning works in nature; for example, an animal might remember the actions it has previously taken which helped it to find food (the reward).

2.1.3 Deep Learning

Deep Learning is at the cutting edge of what machines can do, and developers and business leaders absolutely need to understand what it is and how it works. This unique type of algorithm has far surpassed any previous benchmarks for classification of images, text, and voice.

It also powers some of the most interesting applications in the world, like autonomous vehicles and real-time translation. There was certainly a bunch of excitement around Google's Deep Learning based AlphaGo beating the best Go player in the world, but the business applications for this technology are more immediate and potentially more impactful.

Deep learning is a specific subset of Machine Learning, which is a specific subset of Artificial Intelligence. For individual definitions:

- Artificial Intelligence is the broad mandate of creating machines that can think intelligently.

- Machine Learning is one way of doing that, by using algorithms to glean insights from data.

- Deep Learning is one way of doing that, using a specific algorithm called a Neural Network.


Deep Learning is just a type of algorithm that seems to work really well for predicting things. Deep Learning and Neural Nets, for most purposes, are effectively synonymous. If people try to confuse you and argue about technical definitions, don't worry about it: like Neural Nets, labels can have many layers of meaning.

Neural networks are inspired by the structure of the cerebral cortex. At the basic level is the perceptron, the mathematical representation of a biological neuron. Like in the cerebral cortex, there can be several layers of interconnected perceptrons.

Input values, or in other words our underlying data, get passed through this "network" of hidden layers until they eventually converge to the output layer. The output layer is our prediction: it might be one node if the model just outputs a number, or a few nodes if it is a multiclass classification problem.

The hidden layers of a Neural Network perform modifications on the data to eventually feel out what its relationship with the target variable is. Each node has a weight, and it multiplies its input value by that weight. Do that over a few different layers, and the Neural Network is able to essentially manipulate the data into something meaningful.
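The forward pass just described can be sketched for a tiny fully connected network. The layer sizes, weights, and function names below are made-up illustrative values, not from the thesis:

```python
# A sketch of the forward pass: input values flow through one hidden
# layer to a single output node. All weights are illustrative.

def dense_layer(inputs, weight_matrix, activation):
    # Each node multiplies every input by its own weight, sums them,
    # and applies the activation function.
    return [activation(sum(x * w for x, w in zip(inputs, node_weights)))
            for node_weights in weight_matrix]

relu = lambda v: max(0.0, v)

inputs = [1.0, 2.0]
hidden = dense_layer(inputs, [[0.5, -0.5], [0.25, 0.75]], relu)  # hidden layer
output = dense_layer(hidden, [[-1.0, 1.0]], relu)                # output layer
print(output)  # -> [1.75]
```

Training would then consist of comparing this output with a target and adjusting the weights, as described in the learning rule above.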

Deep Learning is important for one reason only: we have been able to achieve meaningful, useful accuracy on tasks that matter. Machine Learning has been used for classification on images and text for decades, but it struggled to cross the threshold: there is a baseline accuracy that algorithms need to have to work. Deep Learning is finally enabling us to cross that line in places we were not able to before.

Figure 2.4 Comparison of Deep Learning performance with other learning algorithms

Computer vision is a great example of a task that Deep Learning has transformed into something realistic for business applications. Figure 2.4 shows that using Deep Learning to classify and label images is not only better than any other traditional algorithm: it is starting to be better than actual humans.

Deep Learning Models:

Convolutional Neural Network:

Convolutional neural networks, CNNs for short, are a type of feed-forward artificial neural network in which the connectivity pattern between neurons is inspired by the organization of the visual cortex system, as shown in Figure 2.5.

Recurrent Neural Network:

Figure 2.6 shows that a sequence model is usually designed to transform an input sequence into an output sequence that lives in a different domain. Recurrent neural networks, RNNs for short, are suitable for this purpose and have shown tremendous improvement in problems like handwriting recognition, speech recognition, and machine translation.

Figure 2.5 The organization of the visual cortex system


A recurrent neural network model is born with the capability to process long sequential data and to tackle tasks with context spreading over time. The model processes one element in the sequence at each time step. After computation, the newly updated unit state is passed down to the next time step to facilitate the computation of the next element. Imagine the case when an RNN model reads all the Wikipedia articles, character by character; it can then predict the following words given the context.

Autoencoders:

Different from the previous models, autoencoders, shown in Figure 2.7, are for unsupervised learning. They are designed to learn a low-dimensional representation of a high-dimensional data set, similar to what Principal Component Analysis (PCA) does. The autoencoder model tries to learn an approximation function f(x) ≈ x to reproduce the input data. However, it is restricted by a bottleneck layer in the middle with a very small number of nodes.

Figure 2.6 A recurrent neural network and the unfolding into a full network
Figure 2.7 Autoencoder architecture


With limited capacity, the model is forced to form a very efficient encoding of the data, which is essentially the low-dimensional code we learned.

2.2 CONVOLUTIONAL NEURAL NETWORK

2.2.1 Introduction

Convolutional Neural Networks (CNNs) are the premier deep learning model for computer vision. Computer vision has become so good that it currently beats humans at certain tasks, and CNNs play a major part in this success story.

CNNs are used to evaluate inputs through convolutions. The input is convolved with a filter. This convolution leads the network to detect edges and lower-level features in earlier layers and more complex features in deeper layers of the network. CNNs are used in combination with pooling layers, and they often have fully connected layers at the end, as you can see in the picture below. Run forward propagation as you would in a vanilla neural network and minimize the loss function through backpropagation to train the CNN [4][5].

Image classification is the task of taking an input image and outputting a class (a cat, dog, etc.) or a probability of classes that best describes the image. For humans, this task of recognition is one of the first skills we learn from the moment we are born, and it is one that comes naturally and effortlessly as adults. Without even thinking twice, we are able to quickly and seamlessly identify the environment we are in as well as the objects that surround us. When we see an image, or just when we look at the world around us, most of the time we are able to immediately characterize the scene and give each object a label, all without even consciously noticing. These skills of being able to quickly recognize patterns, generalize from prior knowledge, and adapt to different image environments are ones that we do not share with our fellow machines.

When a computer sees an image (takes an image as input), it will see an array of pixel values. Depending on the resolution and size of the image, it will see, for example, a 32 x 32 x 3 array of numbers (the 3 refers to RGB values).


(a) is what people see, (b) is what the computer sees

Each of these numbers is given a value from 0 to 255, which describes the pixel intensity at that point. These numbers, while meaningless to us when we perform image classification, are the only inputs available to the computer. The idea is that you give the computer this array of numbers and it will output numbers that describe the probability of the image being a certain class.

For example: 0.8 for cat, 0.15 for dog, 0.05 for bird, etc.
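In practice, class probabilities like these are typically produced by a softmax over the network's raw class scores. A minimal sketch, where the scores below are made-up illustrative values:

```python
# Softmax: exponentiate each raw class score and normalize so the
# resulting values are non-negative and sum to 1, i.e. a probability
# distribution over the classes.
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 0.5, -1.0])  # e.g. raw scores for cat, dog, bird
print(probs)  # largest score -> largest probability
```

Whatever the raw scores are, the outputs always sum to 1, which is what lets us read them as probabilities.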

What we want the computer to do is to be able to differentiate between all the images it is given and figure out the unique features that make a dog or that make a cat. This is the process that goes on in our minds subconsciously as well. When we look at a picture of a dog, we can classify it as such if the picture has identifiable features such as paws or four legs. In a similar way, the computer is able to perform image classification by looking for low-level features such as edges and curves, and then building up to more abstract concepts through a series of convolutional layers. This is a general overview of what a CNN does.

2.2.2 Convolutional Neural Network Architectures

All CNN models follow a similar architecture. In order to achieve the functionality we talked about, a Convolutional Neural Network processes an image through several layers, as shown in Figure 2.9:

Figure 2.8 How a computer sees an image: an array of pixel values

- Convolutional Layer – Used to extract features

- Non-Linearity Layer – Introduces non-linearity into the system

- Pooling (Down-sampling) Layer – Reduces the number of weights and controls overfitting

- Flattening Layer – Prepares data for Classical Neural Network

- Fully-Connected Layer – Standard Neural Network used for classification

2.2.2.1 Convolutional Layer

Figure 2.9 CNN architecture

In Figure 2.10, on the left side is the input to the convolution layer, for example the input image. On the right is the convolution filter, also called the kernel; we will use these terms interchangeably. This is called a 3x3 convolution due to the shape of the filter.

Figure 2.10 The input and filter of a convolutional layer
(a) is an input image, (b) is a filter/kernel

The main building block of a CNN is the convolutional layer. Convolution is a mathematical operation that merges two sets of information. In our case, the convolution is applied on the input data using a convolution filter to produce a feature map, as shown in Figure 2.11 and Figure 2.12.

(a) is an input image, (b) is the result after applying the convolutional layer

The mathematical formula of the convolution operation is illustrated in Equation 2.1:

(I * K)_{xy} = Σ_{i=1}^{h} Σ_{j=1}^{w} K_{ij} · I_{x+i−1, y+j−1} (2.1)

where I is an input image and K is a filter of size h x w.

We perform the convolution operation by sliding this filter over the input. At every location, we do element-wise matrix multiplication and sum the result. This sum goes into the feature map. The red area where the convolution operation takes place is called the receptive field. Due to the size of the filter, the receptive field is also 3x3.
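The sliding-filter operation just described can be sketched in plain Python. This is a simplified single-channel version; the input and kernel values are made-up illustrative data:

```python
# Sliding-window convolution: at each position, element-wise multiply
# the kernel with the receptive field and sum the products into the
# feature map.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    feature_map = []
    for i in range(h - kh + 1):          # slide vertically
        row = []
        for j in range(w - kw + 1):      # slide horizontally
            row.append(sum(image[i + m][j + n] * kernel[m][n]
                           for m in range(kh) for n in range(kw)))
        feature_map.append(row)
    return feature_map

# A 7x7 input convolved with a 3x3 filter yields a 5x5 feature map.
image = [[1] * 7 for _ in range(7)]
kernel = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
fm = conv2d(image, kernel)
print(len(fm), len(fm[0]))  # -> 5 5
```

Note the output is smaller than the input; the output-shape rule below makes this precise.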

Figure 2.11 The convolutional operation of a CNN
Figure 2.12 The result of a convolution operation

The output shape of the convolutional layer is determined by the shape of the input and the shape of the convolution kernel window. In general, assuming the input shape is n_h × n_w and the convolution kernel window shape is k_h × k_w, the output shape will be:

(n_h − k_h + 1) × (n_w − k_w + 1) (2.2)

For example, in Figure 2.11, the input image had a height (n_h) and width (n_w) of 7 and a convolution kernel with a height (k_h) and width (k_w) of 3, yielding an output with a height and width of (7 − 3 + 1) × (7 − 3 + 1), or 5 × 5.

One more important point before we visualize the actual convolution operation: we perform multiple convolutions on an input, each using a different filter and resulting in a distinct feature map. We then stack all these feature maps together, and that becomes the final output of the convolution layer.

In Figure 2.13, if we used 10 different filters, we would have 10 feature maps of size 32x32x1, and stacking them along the depth dimension would give us the final output of the convolution layer: a volume of size 32x32x10, shown as the large blue box on the right. Note that the height and width of the feature map are unchanged and still 32; this is due to padding, and we will elaborate on that shortly.

In Figure 2.14, we can see how two feature maps are stacked along the depth dimension. The convolution operation for each filter is performed independently, and the resulting feature maps are disjoint.

2.2.2.2 Non-linearity

Figure 2.13 Performing multiple convolutions on an input
Figure 2.14 The convolution operation for each filter

After every convolutional layer, we usually have a non-linearity layer. The problem is that without it our Neural Network would behave just like a single perceptron, because the sum of all the layers would still be a linear function, meaning the output could be calculated as a linear combination of the inputs. This layer is also called the activation layer, because we use one of the activation functions. In the past, nonlinear functions like sigmoid and tanh were used, but it turned out that the function that gives the best results when it comes to the speed of training of the Neural Network is the rectifier function. So, this layer is often a ReLU layer, which removes linearity by setting values that are below 0 to 0, since the rectifier function is described by Equation 2.3:

f(x) = max(0, x) (2.3)

Figure 2.15 shows how it looks once applied to one of the feature maps. In the second image, the feature map, black values are negative ones, and after we apply the rectifier function, they are removed from the image.

(a) is the original image, (b) is the feature map, (c) is the output after the non-linearity
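Applying the rectifier of Equation 2.3 element-wise to a feature map is a one-liner. The feature map below is a made-up example:

```python
# ReLU: every negative value in the feature map is replaced by 0,
# positive values pass through unchanged.

def relu(x):
    return max(0, x)

feature_map = [[-2, 3], [1, -5]]
activated = [[relu(v) for v in row] for row in feature_map]
print(activated)  # -> [[0, 3], [1, 0]]
```

This is exactly the "black values are removed" effect in Figure 2.15: negatives become zero while everything else is untouched.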

2.2.2.3 Stride and Padding

Stride specifies how much we move the convolution filter at each step. By default, the value is 1. We can have bigger strides if we want less overlap between the receptive fields. This also makes the resulting feature map smaller, since we are skipping over potential locations.

The following figure demonstrates a stride of 2. Note that the feature map got smaller. In general, when the stride for the height is s_h and the stride for the width is s_w, the output shape is:

⌊(n_h − k_h + s_h) / s_h⌋ × ⌊(n_w − k_w + s_w) / s_w⌋ (2.4)


(a) is a 5 x 5 image, (b) is a 3x3 feature map

Figure 2.16 shows the result when implementing a 3×3 convolution with a stride of 2. We see that the size of the feature map, 3×3, is smaller than the input, 7×7, because the convolution filter needs to be contained in the input.

Figure 2.17 shows that if we want to maintain the same dimensionality, we can use padding to surround the input with zeros. The gray area around the input is the padding. We either pad with zeros or with the values on the edge. Now the dimensionality of the feature map matches the input. Padding is commonly used in CNNs to preserve the size of the feature maps; otherwise they would shrink at each layer, which is not desirable.

If we add a total of p_h rows of padding (roughly half on top and half on bottom) and a total of p_w columns of padding (roughly half on the left and half on the right), the output shape will be:

(n_h − k_h + p_h + 1) × (n_w − k_w + p_w + 1) (2.5)

Figure 2.16 Resulting feature map
Figure 2.17 Applying zero padding to the input image
(a) is an input array with padding, (b) is the feature map


In general, when the stride for the height is s_h, the stride for the width is s_w, and we add p_h rows of padding and p_w columns of padding, the output shape is:

⌊(n_h − k_h + p_h + s_h) / s_h⌋ × ⌊(n_w − k_w + p_w + s_w) / s_w⌋ (2.6)

2.2.2.4 Pooling Layer

After a convolution operation, we usually perform pooling to reduce the dimensionality. This enables us to reduce the number of parameters, which both shortens the training time and combats overfitting. Pooling layers downsample each feature map independently, reducing the height and width while keeping the depth intact.

The most common type of pooling is max pooling, which just takes the max value in the pooling window. Contrary to the convolution operation, pooling has no parameters. It slides a window over its input and simply takes the max value in the window. Similar to a convolution, we specify the window size and stride.

In Figure 2.18, we have a sort of a filter once again. We used max pooling with a 2×2 window on the 4×4 image. As you already guessed, the filter picks the largest number of the part of the image it covers. This way we end up with smaller representations that contain enough information for our Neural Network to make correct decisions. If the input to the pooling layer has the dimensionality 32x32x10, using the same pooling parameters described above, the result will be a 16x16x10 feature map, shown in Figure 2.19.

Figure 2.18 The max pooling operation
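The max pooling operation above can be sketched in plain Python for a single feature map; the 4x4 values below are made-up illustrative data:

```python
# Max pooling: slide a window over the feature map and keep only the
# largest value in each window. With size=2 and stride=2 the windows
# do not overlap, halving each spatial dimension.

def max_pool(feature_map, size=2, stride=2):
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            row.append(max(feature_map[i + m][j + n]
                           for m in range(size) for n in range(size)))
        pooled.append(row)
    return pooled

fm = [[1, 3, 2, 1],
      [4, 6, 5, 0],
      [7, 2, 9, 8],
      [1, 0, 3, 4]]
print(max_pool(fm))  # 4x4 -> 2x2: [[6, 5], [7, 9]]
```

Applied independently to each of the 10 feature maps of a 32x32x10 volume, this produces the 16x16x10 output described above.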
