
Master's thesis in Control Engineering and Automation: Face recognition performance comparison between K-nearest neighbors algorithm and self-organized map


DOCUMENT INFORMATION

Basic information

Title: Face Recognition Performance Comparison between K-Nearest Neighbors Algorithm and Self-Organized Map
Author: Nguyễn Đức Minh
Supervisor: GS.TS. Hồ Phạm Huy Ánh
University: Ho Chi Minh University of Technology
Major: Control Engineering & Automation
Document type: Master Thesis
Year: 2020
City: Ho Chi Minh City
Format
Pages: 157
Size: 2.86 MB


Structure

  • 1.1 Pattern Recognition
  • 1.2 Face Recognition
  • 2.1 Project Overview
  • 2.2 Problem Statement
  • 2.3 Project Objective
  • 2.4 Project Methodology
    • 2.4.1 Study and Research
    • 2.4.2 Design and Implementation
    • 2.4.3 Performance Comparison
  • 3.1 Introduction
  • 3.2 Historical Background
    • 3.2.1 Origins of Machine Learning
    • 3.2.2 Origins of Neural Networks
  • 3.3 Machine Learning Algorithms
    • 3.3.1 An overview
    • 3.3.2 Machine Learning Models
      • 3.3.2.1 Artificial Neural Networks
      • 3.3.2.2 Decision Trees
      • 3.3.2.3 Linear Regression
      • 3.3.2.4 Support Vector Machine
      • 3.3.2.5 k-Nearest Neighbors
      • 3.3.2.6 Bayesian Networks
  • 3.4 k-Nearest Neighbors
    • 3.4.1 KNN Algorithms
      • 3.4.1.1 Determine value of K
      • 3.4.1.2 Distance calculation
      • 3.4.1.3 Output class measurement
    • 3.4.2 Application of KNN
  • 3.5 Neural Network Algorithms
    • 3.5.1 Biological and Artificial Neurons
      • 3.5.1.1 Biological Neurons
      • 3.5.1.2 Artificial Neurons
    • 3.5.2 Architecture of Neural Networks
      • 3.5.2.1 Feed-forward Networks
      • 3.5.2.2 Recurrent Networks
      • 3.5.2.3 Stochastic Neural Networks
      • 3.5.2.4 Boltzmann Machine
      • 3.5.2.5 Modular Neural Networks
    • 3.5.3 Neural Network Training
      • 3.5.3.1 Definition of Training
      • 3.5.3.2 Selection of Cost Function
      • 3.5.3.3 Memorization of Inputs
      • 3.5.3.4 Determination of Weight
    • 3.5.4 Learning Paradigms
      • 3.5.4.1 Supervised Learning
      • 3.5.4.2 Unsupervised Learning
      • 3.5.4.3 Reinforcement Learning
      • 3.5.4.4 Training Function
    • 3.5.5 Learning Algorithm
    • 3.5.6 Employing Artificial Neural Networks
      • 3.5.6.1 Selection of Model
      • 3.5.6.2 Selection of Learning Algorithm
      • 3.5.6.3 Robustness
    • 3.5.7 Applications of Artificial Neural Networks
  • 3.6 Kohonen Self-Organizing Map
    • 3.6.1 SOM Network Architecture
    • 3.6.2 Training Process of SOM
      • 3.6.2.1 The Competitive Process
      • 3.6.2.2 The Cooperative Process
      • 3.6.2.3 The Adaptive Process
      • 3.6.2.4 Ordering and Convergence
    • 3.6.3 SOM Applications
  • 3.7 Conclusion
    • 3.7.1 Differences between ML and ANN
    • 3.7.2 Applying ML and ANN in this project
  • 4.1 Discrete Cosine Transform
    • 4.1.1 Introduction
    • 4.1.2 Properties of DCT
    • 4.1.3 Definition of DCT
      • 4.1.3.1 Overview
      • 4.1.3.2 One-Dimensional Type-2 DCT
      • 4.1.3.3 Two-Dimensional Type-2 DCT
    • 4.1.4 DCT Image Compression in JPEG format
      • 4.1.4.1 Overview
      • 4.1.4.2 Example with detailed process
    • 4.1.5 Other Applications of DCT
    • 4.1.6 Conclusion
  • 4.2 Illumination Normalization
    • 4.2.1 An Engineering Approach
    • 4.2.2 IN Techniques
      • 4.2.2.1 Introduction
      • 4.2.2.2 Finding least error IN technique
      • 4.2.2.3 Applying and testing AS with DCT in image
    • 4.2.3 Conclusion
      • 4.2.3.1 Research Analysis
      • 4.2.3.2 Disadvantages of the Proposed Method
  • 4.3 Residual Network for Image Data Encoding
    • 4.3.1 Motivation
    • 4.3.2 Deviation from other Neural Networks
      • 4.3.2.1 Identity block
      • 4.3.2.2 Convolutional block
    • 4.3.3 Applying ResNet in image encoding
    • 4.3.4 Conclusion
  • 5.1 Data Gathering
    • 5.1.1 High Resolution Images
    • 5.1.2 Online Database of Images
    • 5.1.3 Low Resolution Images
  • 5.2 Image Pre-processing
    • 5.2.1 High Resolution Images
    • 5.2.2 Online Database Images
    • 5.2.3 Low Resolution Images
  • 5.3 Image Data Compression
    • 5.3.1 Input design for SOM system
    • 5.3.2 Input design for KNN system
  • 5.4 Reshape and Save Image Data
  • 6.1 Network Architecture
  • 6.2 Network Size
  • 6.3 Training Time
  • 6.4 Other Parameters
  • 7.1 System Architecture
  • 7.2 Determine value of k
  • 7.3 Other Parameters
  • 8.1 Trained Data
  • 8.2 Untrained Data
  • 8.3 SOM Recognition Program
  • 8.4 KNN Recognition Program
  • 9.1 Calibrate SOM System
    • 9.1.1 Optimal Number of Neurons
    • 9.1.2 Optimal Number of Epochs
  • 9.2 Calibrate KNN System
    • 9.2.1 Optimal Number of K
    • 9.2.2 Optimal Number of Nearest Distance
  • 10.1 GUI Design for SOM System
    • 10.1.1 Main Window
    • 10.1.2 Database Camera Window
    • 10.1.3 Database Modifying Window
    • 10.1.4 Input Modifying Window
    • 10.1.5 Input Camera Window
  • 10.2 GUI Design for KNN System
    • 10.2.1 Main Window
    • 10.2.2 Database Acquisition Windows
    • 10.2.3 Training Data Encoding Wizard
    • 10.2.4 Test Face Recognition Wizard
  • 12.1 Significance of the Project
  • 12.2 Remaining Limitations

Content

From there, the two methods are compared in terms of theory and application, face recognition accuracy, and their respective advantages and disadvantages. As a further step from my university thesis [1],

Pattern Recognition

Pattern recognition is a branch of artificial intelligence associated with the classification or description of observations - defining points in an appropriate multidimensional space. Pattern recognition aims to categorize data (patterns) based either on a priori knowledge or on statistical information extracted from the patterns. The patterns to be classified are usually groups of measurements or observations. A complete pattern recognition system consists of a sensor that gathers the observations; a feature extraction mechanism that computes numeric or symbolic information from those observations; and a classification or description scheme that classifies or describes observations based on the extracted features [10]. Pattern recognition has extensive application in astronomy, medicine, robotics, and remote sensing by satellites.

Although some of the barriers that hindered such automated pattern recognition systems have been removed by recent advances in computer hardware, providing machines capable of faster and more sophisticated computation, the field of pattern recognition is still very much in its early stages [11].

In summary, pattern recognition can be described as the categorization of input data into recognizable classes via the extraction of significant features or attributes of the data from a background of irrelevant details. The functional block diagram of an adaptive pattern recognition system is shown in Figure 1.1.1.

Figure 1.1.1: Block diagram of a pattern recognition system

Face Recognition

Face recognition, although considered a casual task for the human brain, has proved extremely difficult to simulate artificially, because although similarities exist between faces, they can vary considerably in terms of age, skin color, angle, facial expressions or facial details such as glasses or beards. The problem is further complicated by varying lighting conditions, image qualities and geometries, as well as the possibility of partial occlusion and disguise [9].

For basic face recognition systems, some of these effects can be avoided by assuming and ensuring uniform background and lighting conditions. This assumption is acceptable for some applications, such as automated separation of goods on a production line, where the lighting condition can be controlled and the image background is uniform. For many applications, however, this is not suitable, and systems must be designed to accurately classify images subjected to a variety of unpredictable conditions. Figure 1.2.1 outlines the block diagram of a typical face recognition system.

Figure 1.2.1: Block diagram of a face recognition system

Project Overview

Face recognition focuses on still images, which can generally be separated into image-based and feature-based approaches. Face recognition is commonly used in many applications, such as human-machine interfaces and automatic access control systems, which involve comparing an image with a database of stored faces in order to identify the subject in the image.

This project includes the research, design, development and performance comparison of two efficient facial recognition systems. Each system uses a completely different algorithm, but both aim for the same goal of recognizing facial identity and are based on the general architecture of facial recognition systems. The first system follows the image-based approach and is programmed in MATLAB. The other system applies the remaining approach, feature-based, with Python as the programming language.

Problem Statement

Face recognition is a widely deployed technology in our daily life, used for many applications such as surveillance, intrusion detection, personal identification in a database, etc. To date, there are plenty of implemented methods that can produce good results in one of those fields. The most powerful method for this application as of 2020 is the advanced Convolutional Neural Network, some branches of which can reach an accuracy of up to 99.63% on the Labeled Faces in the Wild dataset [12].

While this technology has received a lot of attention abroad, especially in developed countries where computer vision research has become an important field due to its increasing commercial and law-enforcement potential, it is no surprise to see face recognition appear in domestic applications. Yet we rarely find any article that compares the performance of the popular methods, as public usage focuses only on applying one of them to a specific task. This denies a realistic usage overview to those who are interested in this field but have not experienced all aspects of it.

Project Objective

Therefore, our project focuses on developing two different techniques, both of which can provide a solution for an efficient high-speed face recognition system in surveillance applications, and then comparing their speed and accuracy under different conditions. From this, we can find out which method is suited to which situation, as well as their respective advantages.

The goal of this project is to study and design two efficient high-speed face recognition systems, one in MATLAB and the other in Python, using different popular algorithms, and then perform a real-application performance analysis on the same hardware. The specific objectives are listed below.

• To study and understand simple pattern recognition using face images.

• To design two different models for a face recognition system.

• To enhance the models for a high-speed system to serve surveillance purposes.

• To develop a program in MATLAB based on the first designed model.

• To develop a program in Python based on the second designed model.

• To create a database set of face images and use them to validate both systems.

• To perform tests for program optimization for both systems.

• To perform a performance analysis for each system.

• To demonstrate the effectiveness of each system.

Project Methodology

Study and Research

In this phase, we focus on the following missions:

• Research different approaches on face recognition.

• Study about neural networks, machine learning and image compression methods.

• Focus our study on the selected solutions.

• Propose a general design to visualize our research.

Design and Implementation

In this phase, we apply what we’ve studied to:

• Design the system working principle.

• Execute the design by developing two different software programs in MATLAB and Python.

• Test the theory against our calculations and adjust the program parameters.

• Optimize the performance and methods of computation.

• Create a simple Graphical User Interface (GUI) of the program.

Performance Comparison

In this phase, we apply the implementation to real life usage:

• Perform accuracy check for both implemented systems.

• Perform speed check for both implemented systems.

• Ensure test environments are balanced for both systems.

• Specify the most suitable application, advantages and disadvantages of both systems.

In this chapter, we will give an introduction to Machine Learning (ML) and its subset, Artificial Neural Networks (ANNs). A historical background is presented, and various types of learning methods and neural networks will be shown and explained along with their applications. More importantly, the neural network sub-type used for the first face recognition system and the other machine learning approach used for the second system in our project will be described in detail so as to convey their concepts to the reader.

Introduction

ML is a field of Artificial Intelligence (AI) concerned with training algorithms to interpret large amounts of data with as little human guidance as possible The algorithms process data, then use what they “learned” from those calculations to adjust themselves in order to make better decisions on the next batch of data.

ANNs, or neural networks in general, are electronic models based on the neural structure of the brain. It goes without saying that some problems which are beyond the capability of computers are actually solvable with small effort by an animal mind, since the brain learns from experience, a feature that ordinary machines do not possess. That aspect also marks ANNs as a branch of the ML field, since they share the same goal: make programs carry out tasks continuously with as little human interaction as possible.

Biologically inspired methods of computing are thought to be the next major leap in the computing industry. Computers do rote tasks very well, like keeping records or performing complex mathematical calculations. However, they have trouble recognizing even simple patterns, much less generalizing those patterns of the past into actions of the future [3].

Nowadays, a new field of programming that uses words far more "human" than traditional computation, like behave, react, self-organize, learn, generalize, and forget, is developing at a significant rate. From the troubled early years of development to

Historical Background

Origins of Machine Learning

During the early stages of AI development, in 1955, researchers defined AI as "making intelligent machines that have the ability to achieve goals like humans do". To achieve that target, many fundamental methods were proposed. Machine Learning (ML) is one of those approaches, and it started the huge changes in the world we are living in.

In 1959, Arthur Samuel defined ML as a large field of AI that gives computers the ability to learn without being explicitly programmed. This means a single program, once created, will be able to learn how to do some intelligent activities outside the notion of programming. He created a game of checkers to visualize his concept, later known as the first computer learning program. The program "improved" its skills the more it played, studying which moves lead to winning strategies and parsing those moves into its program automatically.

Figure 3.2.1: Model comparison between normal programmed code and a Machine Learning program

Samuel also designed a number of mechanisms allowing his program to become better. In what Samuel called rote learning, his program recorded/remembered all positions it had already seen and combined this with the values of the reward function.

In 1957, Frank Rosenblatt combined Donald Hebb's model of brain cell interaction with Arthur Samuel's Machine Learning efforts and created the perceptron program. This software was installed in a machine called the Mark I Perceptron, which was built for image recognition purposes. This made the software and the algorithms transferable and available for other machines.

Described as the first successful neuro-computer, the Mark I Perceptron nevertheless ran into broken expectations. Although the perceptron seemed promising, it could not recognize many kinds of visual patterns (such as faces), causing frustration and stalling neural network research. It would be several years before the frustrations of investors and funding agencies faded. Neural network and Machine Learning research struggled until a resurgence during the 1990s.

In 1967, the nearest neighbor algorithm was conceived, which was the beginning of basic pattern recognition. This algorithm was used for mapping routes and was one of the earliest algorithms applied to the traveling salesperson's problem of finding the most efficient route: a salesperson enters a selected city and the program repeatedly visits the nearest cities until all have been visited. Marcello Pelillo has been given credit for inventing the "nearest neighbor rule."

In the 1960s, the discovery and use of multilayers opened a new path in machine learning and started neural network research. It was discovered that providing and using two or more layers in the perceptron offered significantly more processing power than a perceptron using one layer. Other versions of neural networks were created after the perceptron opened the door to "layers" in networks, and the variety of neural networks continues to expand. The use of multiple layers led to feed-forward neural networks and back-propagation.

Recently, ML has become popular and widely used in many fields, responsible for some of the most significant technological advancements, such as the new industries of self-driving vehicles and automated drones. ML established standard concepts that have kept AI research and applications flourishing from the 1990s to this day, including supervised and unsupervised learning, robotics algorithms, the Internet of Things (IoT) and many more. Technically, ML is designed to be adaptive over time, continuously learning, which makes its decisions increasingly accurate the longer it operates. However, in real-life situations, some tasks require absolute accuracy and detailed explanation, such as medical diagnosis, disease analysis and pharmaceutical development. This leads directly to the problem of ML: after significant training time it is considered a black-box algorithm. This makes machine decisions non-transparent and non-understandable, even to the eyes of experts, which reduces trust in ML specifically and AI generally.

Back to the above concept: researchers define an Artificial Neural Network (ANN) as having hidden layers used to respond to more complicated tasks than the earlier perceptrons could. We can see that ANNs are a primary tool used for Machine Learning. Neural networks use input and output layers and include a hidden layer designed to transform input into data that can be used by the output layer. The hidden layers are excellent for finding patterns too complex for a human programmer to detect, meaning a human could not find the pattern and then teach the device to recognize it.

We will discuss the detailed historical background of neural networks in the next section.

Origins of Neural Networks

In 1943, neurophysiologist Warren McCulloch and mathematician Walter Pitts wrote a paper on how neurons might work. In order to describe how neurons in the brain might work, they modeled a simple neural network using electrical circuits.

In 1949, Donald Hebb wrote The Organization of Behavior, a work which pointed out the fact that neural pathways are strengthened each time they are used, a concept fundamentally essential to the ways in which humans learn. If two nerves fire at the same time, he argued, the connection between them is enhanced.

As computers became more advanced in the 1950s, it was finally possible to simulate a hypothetical neural network. The first step towards this was made by Nathaniel Rochester of the IBM research laboratories. Unfortunately for him, the first attempt to do so failed.

In 1959, Bernard Widrow and Marcian Hoff of Stanford developed models called "ADALINE" and "MADALINE." In a typical display of Stanford's love for acronyms, the names come from their use of Multiple ADAptive LINear Elements. ADALINE was developed to recognize binary patterns so that, if it was reading streaming bits from a phone line, it could predict the next bit. MADALINE was the first neural network applied to a real-world problem, using an adaptive filter that eliminates echoes on phone lines. While the system is as ancient as air traffic control systems, it is still in commercial use.

In 1962, Widrow and Hoff developed a learning procedure that examines the value before the weight adjusts it (i.e., 0 or 1) according to the rule:

Weight Change = Pre-weight line value × Error

It is based on the idea that while one active perceptron may have a big error, one can adjust the weight values to distribute it across the network, or at least to adjacent perceptrons. Applying this rule still results in an error if the line before the weight is 0, although this will eventually correct itself. If the error is conserved so that all of it is distributed to all of the weights, then the error is eliminated.

Despite the later success of the neural network, traditional von Neumann architecture took over the computing scene, and neural research was left behind. Ironically, John von Neumann himself suggested the imitation of neural functions by using telegraph relays or vacuum tubes.

In the same time period, a paper was written suggesting there could not be an extension from the single-layered neural network to a multiple-layered neural network. In addition, many people in the field were using a learning function that was fundamentally flawed because it was not differentiable across the entire line. As a result, research and funding went drastically down.

This was coupled with the fact that the early successes of some neural networks led to an exaggeration of their potential, especially considering the practical technology at the time. Promises went unfulfilled, and at times greater philosophical questions led to fear. Writers pondered the effect that the so-called "thinking machines" would have on humans, ideas which are still around today.

The idea of a computer which programs itself is very appealing. If Microsoft's Windows 2000 could reprogram itself, it might be able to repair the thousands of bugs that the programming staff made. Such ideas were appealing but very difficult to implement. In addition, von Neumann architecture was gaining in popularity. There were a few advances in the field, but for the most part research was few and far between.

In 1972, Kohonen and Anderson independently developed a similar network, which we will discuss more later. They both used matrix mathematics to describe their ideas but did not realize that what they were doing was creating an array of analog ADALINE circuits. The neurons are supposed to activate a set of outputs instead of just one.

The first multilayered network, an unsupervised network, was developed in 1975.

In 1982, interest in the field was renewed. John Hopfield of Caltech presented a paper to the National Academy of Sciences. His approach was to create more useful machines by using bidirectional lines; previously, the connections between neurons were only one-way. That same year, Reilly and Cooper used a "hybrid network" with multiple layers, each layer using a different problem-solving strategy.

Also in 1982, there was a joint US-Japan conference on Cooperative/Competitive Neural Networks. Japan announced a new Fifth Generation effort on neural networks, and US papers generated worry that the US could be left behind in the field. (Fifth-generation computing involves artificial intelligence: the first generation used switches and wires, the second generation used the transistor, the third generation used solid-state technology like integrated circuits and higher-level programming languages, and the fourth generation is code generators.) As a result, there was more funding and thus more research in the field.

In 1986, with multiple-layered neural networks in the news, the problem was how to extend the Widrow-Hoff rule to multiple layers. Three independent groups of researchers, one of which included David Rumelhart, a former member of Stanford's psychology department, came up with similar ideas which are now called back-propagation networks because they distribute pattern recognition errors throughout the network. Hybrid networks used just two layers; these back-propagation networks use many. The result is that back-propagation networks are "slow learners," needing possibly thousands of iterations to learn [13].

Today, neural network discussions are occurring everywhere. Yet the field's future lies in hardware development, the most important key to the whole technology. Currently, most

Machine Learning Algorithms

An overview

As this is not the main topic that we are focusing on, the following section will only convey the most general knowledge we need.

"A core objective of a learner is to generalize from its experience" [14] We also learn things from mistake and so is machine learning original idea Actions and mistakes here can be expressed as experience and the term "generalize" represents the understanding of what was the problem that caused the mistake and how should things behave correctly, then build the ability to perform accurately on the same field of tasks later on This field can be called as a learning data set, comes from generally unknown probability distribution. The learner then build the general model about this field that can be used to produce predictions in new cases Optimally, this can lead to tasks’ result improvement, as showed in Figure3.3.1for general cases applied in human individual.

Practically, because training sets are finite and the future is uncertain, learning theory usually does not yield guarantees of the performance of algorithms. Instead, probabilistic bounds on the performance are quite common. The bias-variance decomposition is one way to quantify generalization error [15]. Details about the properties of machine learning will be discussed in the next part.

Aside from software developments, compatible hardware that can provide sufficient performance for real applications is also an important topic. Since the 2010s, advances in both machine learning algorithms and computer hardware have led to more efficiency.

Although machine learning has been transformative in some fields, machine-learning programs often fail to deliver expected results [16]. There are many reasons for this drawback: lack of (suitable) data, lack of access to the data, data bias, privacy problems, badly chosen tasks and algorithms, wrong tools and people, lack of resources, and evaluation problems. For most applications, data bias is the main cause, especially when there is a large set of training data.

Figure 3.3.1: Illustration of learning from experience [2]

A machine learning system trained only on current objects may not be able to predict the necessary steps when challenged by a new object that is not represented in the training data. When trained on man-made data, machine learning is likely to pick up the same constitutional and unconscious biases already present in society. In facial recognition applications we suffer the same effect: ML programs often predict a new face that is not in the trained set as the closest face that has some minor similarities.

Because of such challenges, the effective use of machine learning may take longer to be adopted in other domains. Artificial intelligence scientists have expressed concern for fairness in machine learning, aiming to reduce bias and to propel its use for human good.

Machine Learning Models

After several years of research, ML algorithms can currently be categorized into learning types, and each learning type can apply to some of the models. An implemented model then becomes feasible software and is integrated into hardware, forming what is called a complete system. The types of machine learning algorithms differ in their approach, the type of data they input and output, and the type of task or problem that they are intended to solve.

Broadly, there are 3 types of Machine Learning Algorithms: Supervised Learning, Unsupervised Learning and Reinforcement Learning. These are the most common and well-defined bases in this field. As Neural Networks follow the same trait as a sub-type of Machine Learning, learning paradigms are discussed with more detailed examples in section 3.5.4.

In this section we will focus on machine learning models, which are the skeletons of the system. Each model can use a different approach as long as it can perform training and process trained data with additional parameters to make predictions. Various types of models have been used and researched for machine learning systems.

Because ANN is the kind of model we selected to implement for our project, it will be discussed later in a separate section, 3.5.1.

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features [17].

Specifically, a decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. As represented in Figure 3.3.2, a decision tree has the following contents: each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label as well as the decision taken after computing all attributes. The paths from root to leaf represent classification rules. It is one way to display an algorithm that only contains conditional control statements.

Figure 3.3.2: Basic contents of a decision tree

Decision trees have a natural "if then else" construction that makes them fit easily into a programmatic structure. They are also well suited to categorization problems where attributes or features are systematically checked to determine a final category. However, in practice we might face one problem: how do we know when to split a branch? As we can see, the decision of making strategic splits heavily affects a tree's accuracy. Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes; in other words, we can say that the purity of the node increases with respect to the target variable. A decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes. This approach can then lead to the overfitting problem.

To reduce the possibility of overfitting, researchers have provided some decision tree algorithms to follow. Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items [18]. Below is a list of the most common DT algorithms:

• ID3 (Iterative Dichotomiser 3) and C4.5 (successor of ID3), both used Information Gain to represents entropy of information contents.

• CART (Classification And Regression Tree), which uses Gini impurity to measure how often a randomly chosen element from the set would be incorrectly labeled. In regression, CART uses variance reduction, as other metrics would require discretization before being applied. The variance reduction of a node N is defined as the total reduction of the variance of the target variable x due to the split at this node (a common formulation is given after this list).

• Chi-square automatic interaction detection (CHAID) performs multi-level splits when computing classification trees.

• Multivariate adaptive regression spline (MARS) extends decision trees to handle numerical data better.
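For reference, a common textbook form of the variance-reduction criterion mentioned in the CART item above (the thesis may state it differently): the split of the sample set S at node N into subsets S_L and S_R is chosen to maximize

I_V(N) = \mathrm{Var}(S) - \frac{|S_L|}{|S|}\,\mathrm{Var}(S_L) - \frac{|S_R|}{|S|}\,\mathrm{Var}(S_R)

where Var(·) denotes the variance of the target variable within a subset.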

From the above properties of DTs, we can observe some advantages of this method and why it is not applicable for complicated programs such as facial recognition.

The first impression of DTs must be that they are very easy to understand, even for people from a non-analytical background; no statistical knowledge is required to read and interpret them. Their graphical representation is very intuitive, and users can easily relate it to their hypotheses. Secondly, decision trees require relatively little effort from users for data preparation, since decision choices are mainly based on yes/no options or simplified parameters. Furthermore, being a non-parametric method helps decision trees reduce assumptions about the space distribution and the classifier structure.

But there are many drawbacks, of which the most common problem with DTs is the high possibility of overfitting. Decision-tree learners can create over-complex trees that do not generalize the data well. This problem is solved by setting constraints on model parameters and by pruning, using the above algorithms. Aside from overfitting, DTs also create biased trees if some classes dominate; it is therefore recommended to balance the data set prior to fitting the decision tree. Generally, DTs give low prediction accuracy for a data set compared to other machine learning algorithms [17].

Linear regression (LR) is another supervised-learning method. It attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable and the other is considered to be a dependent variable. In practice, linear regression is used to estimate real values such as the cost of houses, the number of calls or total sales, as long as the data are based on continuous variables. Here, we establish the relationship between the independent and dependent variables by fitting a best line. This best-fit line is known as the regression line and is represented by the linear equation below. The coefficients a and b are derived by minimizing the sum of squared differences of distance between the data points and the regression line.

y = ax + b \quad (3.1)

• y is a vector of observed values, often denoted as Y.

• x is an n-dimensional column vector, often assembled into the regression matrix X.

• a is a vector of regression coefficients with the same structure as y, with an additional dimension applied for the first column of x.

• b is a vector of error or noise with the same structure as y, which accounts for other factors and directly affects the observed value y.
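As a small illustration of equation (3.1), the sketch below fits a best line by minimizing the sum of squared differences. It is a minimal NumPy example with invented data, not the thesis's implementation:

import numpy as np

# Invented observations: y is roughly 2x + 1 with noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Least-squares fit of y = a*x + b using the design matrix [x, 1]
A = np.vstack([x, np.ones_like(x)]).T
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"a = {a:.3f}, b = {b:.3f}")  # coefficients of the regression line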

The above are the basic concepts of LR, which bring limitations such as weak exogeneity, fixed error variance, and the assumption that the errors of the response variables are uncorrelated with each other. Fortunately, many extensions of linear regression have been developed to overcome these drawbacks. Below are some of them:

• Simple and multiple linear regression, where the simple form applies for a single scalar x and y as given by equation 3.1, and the multiple version is used when a matrix or multi-dimensional parameters are given.

• Generalized linear models, which are broadly used in multiple situations, including modeling positive quantities at large scale or modeling categorical data.

• Hierarchical linear models, which organize the data into a hierarchy of regressions, for example where A is regressed on B and B is regressed on C.

Aside from extensions, researchers have also specified many estimation methods which increase the effectiveness of the parameters a, b and x [19]. Namely:

• Bayesian linear regression, adapted from a probabilistic graphical model that represents a set of random variables and their conditional independence, discussed in section 3.3.2.6.

Support vector machine (SVM) is a supervised machine learning model that uses classification algorithms for two- or multiple-group problems. It is well known as a fast and dependable classification method that performs very well with a limited amount of data. After giving an SVM model sets of labeled training data for each category, the model can be used to solve specific categorizing tasks.

Developed at AT&T Bell Laboratories by Vapnik and colleagues (Boser et al., 1992; Guyon et al., 1993; Vapnik et al., 1997), it is one of the most robust prediction methods, based on the statistical learning framework or VC theory proposed by Vapnik and Chervonenkis (1974) and Vapnik (1982, 1995).
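As a brief usage illustration (assuming scikit-learn is available; the data and parameters here are invented, and this is not the thesis's code), a linear SVM can be trained on two labeled groups in a few lines:

from sklearn import svm

# Two labeled groups of 2-D points (toy data)
X = [[0, 0], [1, 1], [8, 9], [9, 8]]
y = [0, 0, 1, 1]

clf = svm.SVC(kernel="linear")  # maximum-margin linear classifier
clf.fit(X, y)
print(clf.predict([[7, 7]]))  # -> [1]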

k-Nearest Neighbors

KNN Algorithms

The KNN algorithm can be separated into 4 simplified steps as follows:

1) Data preparation: Load the training as well as the test data. Determine the total number of features p and the number of input samples n.

2) Determine value of k: Select the number of nearest neighbors, known as k. For any given problem, a small value of k will lead to a large variance in predictions; alternatively, setting k to a large value may lead to a large model bias. This value can be determined using the cross-validation method for a more efficient result. In practice, k is usually chosen to be odd, so as to avoid ties.

3) Distance calculation: Calculate the distance between the query instance x_i (i = 1, 2, ..., n) and all the training samples x_l, where l = 1, 2, ..., n. There are many distance functions, but Euclidean is the most commonly used measure; it is mainly used when data is continuous, along with the Manhattan distance. Apply this formula for each point of the test data.

4) Output class measurement: Sort the calculated distances and determine the nearest neighbors based on the k-th minimum distance (ascending order), which means the k-th most similar attributes. Afterward, gather the category Y of the nearest neighbors and remove all neighbors whose distance is larger than the k-th minimum distance. Finally, use the simple majority of the categorized neighbors as the prediction value of the query instance. This is the KNN output.

As stated in step 2, the value of k directly affects the final result, so the selection of k should be made with care. There is much research in this field, in which many techniques are used and experimental results compared to draw conclusions. Usually, the k parameter in the KNN classifier is chosen empirically: depending on the problem, different numbers of nearest neighbors are tried and the parameter with the best performance and accuracy is chosen to define the classifier. A detailed consideration for the face recognition application will be discussed in the next chapter.

In step 3 above, the k-nearest-neighbor classifier is commonly based on the Euclidean distance between a test sample and the specified training samples. Let x_i be an input sample vector with p dimensions, (x_{i1}, x_{i2}, \dots, x_{ip}), and let n be the total number of training samples (i = 1, 2, \dots, n). The Euclidean distance between two samples x_i and x_l, where l = 1, 2, \dots, n, is defined as:

d(x_i, x_l) = \sqrt{(x_{i1} - x_{l1})^2 + (x_{i2} - x_{l2})^2 + \cdots + (x_{ip} - x_{lp})^2} = \sqrt{\sum_{s=1}^{p} (x_{is} - x_{ls})^2} \quad (3.2)

Continuing from the distance calculation step, we now have a list of distances between the test sample and all training samples. The class label assigned to a test example is then determined by the majority vote of its k nearest neighbors. Define y(d_i) as the output deciding which class c_k the input x_i belongs to, namely the class having the most members among the k nearest neighbors:

y(d_i) = \arg\max_k \sum_{x_j \in kNN} y(x_j, c_k) \quad (3.3)

The above are the general steps of KNN. Depending on the specific application, we can use different types of KNN, which basically derive from one or all of the above procedures. Nonetheless, no matter which type of KNN is used, the output serves classification or regression problems.
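The four steps above can be condensed into a short program. The following is a minimal illustrative sketch in Python/NumPy implementing equations (3.2) and (3.3) on toy data; the names and values are assumptions made for the example, not the thesis's actual KNN system:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Step 3: Euclidean distance between the query and every training sample
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 4: keep the k nearest neighbors (ascending distance)
    nearest = np.argsort(dists)[:k]
    # Simple majority vote over the neighbors' class labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy usage: two classes in a 2-D feature space
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.5, 8.2]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([7.0, 7.0])))  # -> 1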

Application of KNN

The traditional KNN is well known and widely used for its simplicity and easy implementation. KNN expects the class conditional probabilities to be locally constant, and suffers from bias in high dimensions. KNN is an extremely flexible classification scheme and does not involve any preprocessing of the training data. This can offer both space and speed advantages in very large problems. However, in many practical applications, it fails to get good results on most data sets because of the uneven distribution of the examples among classes [21].

Below are some of the notable applications of KNN:

• Data mining in medicine, finance, agriculture and more

• In regression learning, KNN can be used to predict dependent variables

Neural Network Algorithms

Biological and Artificial Neurons

As this is not the main topic that we are focusing on, the following section will only convey the most general knowledge we need.

Basically, a typical neuron consists of a cell body (soma), dendrites, and an axon, as shown in Figure 3.5.1.

The term neurite is used to describe either a dendrite or an axon, particularly in its undifferentiated stage. Dendrites are thin structures that arise from the cell body, often extending for hundreds of micrometers and branching multiple times, giving rise to a complex "dendritic tree". An axon (also called a nerve fiber when myelinated) is a special cellular extension (process) that arises from the cell body at a site called the axon hillock and travels for a distance, as far as 1 meter in humans or even more in other species. Nerve fibers are often bundled into fascicles, and in the peripheral nervous system bundles of fascicles make up nerves (like strands of wire make up cables). The cell body of a neuron frequently gives rise to multiple dendrites, but never to more than one axon, although the axon may branch hundreds of times before it terminates. At the majority of synapses, signals are sent from the axon of one neuron to a dendrite of another. There are, however, many exceptions to these rules: neurons that lack dendrites, neurons that have no axon, synapses that connect an axon to another axon or a dendrite to another dendrite, etc. [23].

All neurons are electrically excitable, maintaining voltage gradients across their membranes by means of metabolically driven ion pumps, which combine with ion channels embedded in the membrane to generate intracellular-versus-extracellular concentration differences of ions such as sodium, potassium, chloride, and calcium. Changes in the cross-membrane voltage can alter the function of voltage-dependent ion channels. If the voltage changes by a large enough amount, an all-or-none electrochemical pulse called an action potential is generated, which travels rapidly along the cell's axon and activates synaptic connections with other cells when it arrives.

In most cases, neurons are generated by special types of stem cells. It is generally believed that neurons do not undergo cell division, but recent research in dogs shows that in some instances in the retina they do [24]. Astrocytes are star-shaped glial cells that have also been observed to turn into neurons by virtue of the stem cell characteristic of pluripotency. In humans, neurogenesis largely ceases during adulthood; but in two brain areas, the hippocampus and olfactory bulb, there is strong evidence for the generation of substantial numbers of new neurons [25].

Artificial neurons are created based on the structure of human neurons and their interconnections. However, because our knowledge of neurons is incomplete and our computing power is limited, these models are just gross idealizations of real networks of neurons. Figure 3.5.2 illustrates a typical artificial neuron with respect to the biological one shown before in Figure 3.5.1.

Figure 3.5.2: Artificial neuron with similar structure compare to a human neuron

In most cases, an artificial neuron follows a working principle named the firing rule.

3.5.1.2.1 Firing Rules. A single neuron fires or doesn't fire when it reads a particular input pattern. This so-called "fire" action (equivalent to the emission of an action potential in a biological neuron) actually obeys a regulation called the firing rule. This is an important concept in neural networks and accounts for their high flexibility.

A firing rule determines how one calculates whether a neuron should fire for any input pattern It relates to all the input patterns, not only the ones on which the node was trained.

A simple firing rule can be implemented using the Hamming distance technique. The rule goes as follows:

Take a collection of training patterns for a node, some of which cause it to fire (the 1-taught set of patterns) and others which prevent it from doing so (the 0-taught set). Then the patterns not in the trained collection cause the node to fire if, on comparison, they have more input elements in common with the 'nearest' pattern in the 1-taught set than with the 'nearest' pattern in the 0-taught set. If there is a tie, then the pattern remains in the undefined state.

For example, a 3-input neuron is taught to output 1 when the input (X1, X2 and X3) is 111 or 101, and to output 0 when the input is 000 or 001. Then, before applying the firing rule, the truth table is:

Table 1: Truth table of the above example

X1:  0    0    0    0    1    1    1    1
X2:  0    0    1    1    0    0    1    1
X3:  0    1    0    1    0    1    0    1
OUT: 0    0    0/1  0/1  0/1  1    0/1  1

We can see that currently there are 4 patterns whose outputs are undefined: 010, 011, 100 and 110. As an example of the way the firing rule is applied, let's first consider the pattern 010. Compared to the other defined patterns, it differs from 000 in 1 element, from 001 in 2 elements, from 101 in 3 elements and from 111 in 2 elements. Therefore, the 'nearest' pattern is 000, which belongs to the 0-taught set. Thus the firing rule requires that the neuron should not fire when the input is 010. On the other hand, 011 is equally distant from two taught patterns (001 and 111) that have different outputs, and thus its output stays undefined (0/1).

By applying the firing rule to every column, the following truth table is obtained.

Table 2: Truth table of the above example with the firing rule applied

X1:  0    0    0    0    1    1    1    1
X2:  0    0    1    1    0    0    1    1
X3:  0    1    0    1    0    1    0    1
OUT: 0    0    0    0/1  0/1  1    1    1

The difference between the two truth tables is called the generalization of the neuron. Therefore, the firing rule gives the neuron a sense of similarity and enables it to respond 'sensibly' to patterns not seen during training [26].
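The worked example above can be checked mechanically. The sketch below is an illustration (not taken from the thesis) that applies the Hamming-distance firing rule to the four taught patterns and reproduces Table 2:

def fires(pattern, taught_1=("111", "101"), taught_0=("000", "001")):
    # Hamming distance: number of differing input elements
    dist = lambda a, b: sum(x != y for x, y in zip(a, b))
    d1 = min(dist(pattern, t) for t in taught_1)  # nearest 1-taught pattern
    d0 = min(dist(pattern, t) for t in taught_0)  # nearest 0-taught pattern
    if d1 < d0:
        return "1"
    if d0 < d1:
        return "0"
    return "0/1"  # tie: the output stays undefined

for p in ["000", "001", "010", "011", "100", "101", "110", "111"]:
    print(p, "->", fires(p))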

"1" or "0", which we called binary, answers These applications include the recognition of text, the identification of speech, the image deciphering of scenes, etc which are required to turn real-world inputs into discrete values limited to some known set, like the 0-9 digits for example Because of this limitation of output options, these applications usually involve networks composed of neurons that just simply sum up smooth inputs.

Figure 3.5.3 shows a fundamental representation of the simple artificial neuron described above, called a node, from an engineering point of view.

Figure 3.5.3: Structure of a node with its inputs (xi), weights (wi), activation function σ(z) and output function ν

These networks may utilize the binary properties of the OR or AND operators on inputs, followed by the firing rule. These functions, and many others, can be built into the summation and transfer functions of a network [27].

3.5.1.2.3 Complicated Artificial Neuron. The previous neuron doesn't do anything that conventional computers don't do already. Figure 3.5.4 shows a more sophisticated neuron, the McCulloch and Pitts model (MCP).

Figure 3.5.4: Structure of a node with its inputs (xi), weights (wi), weighted inputs (wixi), activation function σ(z) and output function ν

The difference between this and the very first model is that the inputs are 'weighted': the effect that each input has on decision making depends on the weight of the particular input. The weight of an input is a number which, when multiplied with the input, gives the weighted input. These weighted inputs are then added together and, if they exceed a pre-set threshold value, the neuron fires; in other cases, the neuron does not fire.
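A minimal sketch of the weighted MCP-style neuron just described (the weight and threshold values are arbitrary examples, not parameters from the thesis):

def mcp_neuron(inputs, weights, threshold):
    # Weighted sum of the inputs; the neuron fires if it exceeds the threshold
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# Example: two excitatory inputs and one inhibitory input
print(mcp_neuron([1, 1, 0], weights=[0.6, 0.6, -1.0], threshold=1.0))  # fires: 1
print(mcp_neuron([1, 1, 1], weights=[0.6, 0.6, -1.0], threshold=1.0))  # does not fire: 0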

Architecture of Neural Networks

Different types of ANN have different structures, serving various purposes and sophisticated tasks. However, there are four common types that are the most general among all kinds of neural networks:

Figure 3.5.5 shows a general feed-forward neural network architecture with an interconnected group of nodes.

Figure 3.5.5: General structure of feed-forward neural network

This type of neural network allows signals to travel one way only: from input to output. There is no feedback (loops), meaning that the output of any layer does not affect that same layer. Feed-forward ANNs tend to be straightforward networks that associate inputs with outputs. They are extensively used in pattern recognition. Note that the number of hidden layers may vary, but input and output layers must exist.

Feed-forward networks are the first and simplest type of artificial neural network devised. They can be divided into 5 main sub-types:

The earliest kind of neural network is the Single-Layer Perceptron (SLP) network, which consists of a single layer of output nodes, meaning that inputs are fed directly to the outputs via a series of weights. In this way it can be considered the simplest kind of feed-forward network, working like one large single neuron. The sum of the products of weights and inputs is calculated in each node, and if this value is above some threshold (typically 0) the neuron fires and takes the activated value (typically 1); otherwise it takes the deactivated value (typically -1).

Figure 3.5.6: A two-layer neural network capable of calculating XOR

Perceptrons can be trained by a simple learning algorithm usually called the delta rule. It calculates the errors between the calculated output and the sample output data, and uses this to create an adjustment to the weights, thus implementing a form of gradient descent [28].
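A minimal sketch of delta-rule training for a single perceptron, using toy AND-gate data (the learning rate and epoch count are arbitrary choices for the example):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # inputs
t = np.array([0, 0, 0, 1])                      # AND targets
w = np.zeros(2); b = 0.0; lr = 0.1

for epoch in range(20):
    for x, target in zip(X, t):
        y = 1 if w @ x + b > 0 else 0  # threshold activation
        error = target - y             # difference between sample and calculated output
        w += lr * error * x            # delta rule: adjust weights toward the target
        b += lr * error

print(w, b)  # a separating line for AND, e.g. w = [0.2, 0.1], b = -0.2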

Single-unit perceptrons are only capable of learning linearly separable patterns: in 1969, a famous monograph entitled Perceptrons by Marvin Minsky and Seymour Papert showed that it was impossible for a single-layer perceptron network to learn an XOR function. They conjectured (incorrectly) that a similar result would hold for a multi-layer perceptron network. Although a single threshold unit is quite limited in its computational power, it has been shown that networks of parallel threshold units can approximate any continuous function from a compact interval of the real numbers into the interval [-1, 1].

A single-layer neural network can compute a continuous output instead of a step function. A common choice is the so-called logistic function:

y = \frac{1}{1 + e^{-x}} \quad (3.4)

With this choice, the single-layer network is identical to the logistic regression model, widely used in statistical modeling. The logistic function is also known as the sigmoid function. It has a continuous derivative, which allows it to be used in back-propagation. This function is also preferred because its derivative is easily calculated:

y' = y(1 - y) \quad (3.5)

A Multi-Layer Perceptron (MLP) network consists of multiple layers of computational units, usually interconnected in a feed-forward way. Each neuron in one layer has directed connections to the neurons of the subsequent layer. In many applications, the units of these networks apply the sigmoid function as an activation function.

The universal approximation theorem for neural networks states that every continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multi-layer perceptron with just one hidden layer. This result holds only for restricted classes of activation functions, e.g. the sigmoid functions [29].

Multi-layer networks use a variety of learning techniques, the most popular being back-propagation. Here, output values are compared with the correct answer to compute the value of some predefined error function. By various techniques, the error is then fed back through the network. Using this information, the algorithm adjusts the weights of each connection in order to reduce the value of the error function by some small amount. After repeating this process for a sufficiently large number of training cycles, the network will usually converge to some state where the error of the calculations is small; in this case one says that the network has learned a certain target function. To adjust the weights properly, one applies a general method for non-linear optimization called gradient descent: the derivative of the error function with respect to the network weights is calculated, and the weights are then changed such that the error decreases. For this reason, back-propagation can only be applied to networks with differentiable activation functions. Figure 3.5.5, shown earlier, is an example of a multi-layer feed-forward perceptron neural network.
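A compact numerical illustration of back-propagation with the sigmoid of equations (3.4) and (3.5): a small 2-2-1 network learning XOR. The hyperparameters and initialization are arbitrary example choices and, as the following paragraphs note, an unlucky initialization can end in a local minimum:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W1 = rng.normal(size=(2, 2)); b1 = np.zeros(2)   # hidden layer
W2 = rng.normal(size=(2, 1)); b2 = np.zeros(1)   # output layer
lr = 0.5

for _ in range(20000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the error, using y' = y(1 - y)
    d_out = (y - t) * y * (1 - y)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    # Gradient descent: change the weights so that the error decreases
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid; b1 -= lr * d_hid.sum(axis=0)

print(y.round(2))  # ideally approaches [[0], [1], [1], [0]]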

In general, the problem of teaching a network to perform well, even on samples that were not used as training samples, is a quite subtle issue that requires additional techniques. This is especially important for cases where only very limited numbers of training samples are available. The danger is that the network overfits the training data and fails to capture the true statistical process generating the data. Computational learning theory is concerned with training classifiers on a limited amount of data. In the context of neural networks, a simple heuristic called early stopping often ensures that the network will generalize well to examples not in the training set.

Other typical problems of the back-propagation algorithm are the speed of convergence and the possibility of ending up in a local minimum of the error function. Today, there are practical solutions that make back-propagation in multi-layer perceptrons the solution of choice for many machine learning tasks.

ADALINE (Adaptive Linear Neuron, later called Adaptive Linear Element) was developed by Professor Bernard Widrow and his graduate student Ted Hoff at Stanford University in 1960 [30]. It consists of a weight, a bias and a summation function. The main operation of this network is:

y_i = w x_i + b \quad (3.6)

Its adaptation is defined through a cost function (error metric) of the residual e = d_i - (b + w x_i), where d_i is the desired output, using the Mean Square Error (MSE) metric.

While the ADALINE is thereby capable of simple linear regression, it has limited practical use. There is an extension of the ADALINE, called the Multiple ADALINE (MADALINE), that consists of two or more ADALINEs connected in series.
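A minimal sketch of the ADALINE adaptation in equation (3.6): the weight and bias are updated to reduce the MSE of the residual (toy data; the learning rate is an arbitrary choice):

x = [0.0, 1.0, 2.0, 3.0]
d = [2.0 * xi + 1.0 for xi in x]  # desired outputs of an invented linear target
w, b, lr = 0.0, 0.0, 0.05

for _ in range(200):
    for xi, di in zip(x, d):
        e = di - (b + w * xi)  # residual e = d_i - (b + w x_i)
        w += lr * e * xi       # LMS update follows the negative MSE gradient
        b += lr * e

print(w, b)  # approaches w = 2, b = 1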

Radial Basis Functions (RBFs) are powerful techniques for interpolation in multidimensional space. An RBF is a function which has a distance criterion with respect to a center built in. Radial basis functions have been applied in the area of neural networks, where they may be used as a replacement for the sigmoidal hidden layer transfer characteristic in multi-layer perceptrons [31].

The radial basis function network shares a similar structure to that of the feed-forward network. Figure 3.5.7 shows the architecture of an RBF network.

Figure 3.5.7: Architecture of an RBF network

RBF networks can be trained by:

• Deciding on how many hidden units there should be

• Deciding on their centers and the sharpness (standard deviation) of their Gaussians

• Training up the output layer

Generally, the centers and the standard deviations are decided on first by examining the vectors in the training data. The output layer weights are then trained using the delta rule [32]. Back-propagation is the most widely applied neural network technique. RBF networks can be trained on:

• Classification data (each output represents one class), and then they can be used directly as classifiers of new data

• (x,f(x))points of an unknown function f, and then can be used for interpolation

Advantages of RBF networks include finding the input to output map using local approximators Usually, the supervised segment is simply a linear combination of the approximators Since linear combiners have few weights, these networks trained extremely fast and require fewer training samples RBFs also have the advantage in which the one can add extra units with centers near parts of the input which are difficult to classify.
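The training recipe above can be sketched as follows (an illustrative Python/NumPy example, not the thesis implementation: the centers are picked from the training data, the Gaussian sharpness is fixed by hand, and the output layer is solved by linear least squares as a stand-in for the delta rule):

    import numpy as np

    def rbf_design(X, centers, sigma):
        # Gaussian RBF activations: one column per hidden unit.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2 * sigma ** 2))

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0])                         # unknown function f to learn

    centers = X[rng.choice(len(X), 10, replace=False)]  # decide on 10 hidden units
    sigma = 0.8                                         # sharpness of the Gaussians
    Phi = rbf_design(X, centers, sigma)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)         # train the output layer

    print(np.abs(Phi @ w - y).mean())                   # small training error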

Neural Network Training

Once a network has been categorized and structured for a particular application, that network is ready to be trained. To start this process, the initial weights are chosen randomly. Then the training, or learning, begins.

One of the key elements of a neural network is its ability to learn. A neural network is not just a complex system, but a complex adaptive system, meaning it can change its internal structure based on the information flowing through it. Typically, this is achieved through the adjusting of weights, executed by the training process. In Figure 3.5.5 on page 29, each line represents a connection between two neurons and indicates the pathway for the flow of information. Each connection has a weight, a number that controls the signal between the two neurons. If the network generates a "good" output (which we'll define later), there is no need to adjust the weights. However, if the network generates a "poor" output, an error so to speak, then the system adapts, altering the weights in order to improve subsequent results [36].

That's the big picture of training; now we'll cover a more theoretically precise definition of this process.

Given a specific task to solve and a class of functions F, learning means using a set of observations in order to find f* ∈ F which solves the task in an optimal sense.

This entails defining a cost function C : F → R such that, for the optimal solution f*, C(f*) ≤ C(f) ∀ f ∈ F (no solution has a cost less than the cost of the optimal solution).

The cost function C is an important concept in learning, as it is a means of measuring how far away we are from an optimal solution to the problem that we want to solve. Learning algorithms then search through the solution space in order to find a function that has the smallest possible cost [37].

For applications where the solution is dependent on some data, the cost must necessarily be a function of the observations; otherwise we would not be modeling anything related to the data. It is frequently defined as a statistic to which only approximations can be made.

As a simple example, consider the problem of finding the model f which minimizes the expected squared error [26]:

C = E[(f(x) − y)²]

for data pairs (x, y) drawn from some distribution D. In practical situations we would only have N samples from D and thus, for the above example, we would only minimize the empirical cost:

Ĉ = (1/N) Σ_{i=1}^{N} (f(x_i) − y_i)²

Thus, the cost is minimized over a sample of the data rather than the true data distribution. When N → ∞, some form of online learning must be used, where the cost is partially minimized as each new example is seen. While online learning is often used when D is fixed, it is most useful in the case where the distribution changes slowly over time. In neural network methods, some form of online learning is frequently also used for finite data sets.

While it is possible to arbitrarily define some ad hoc cost function, frequently a particular cost will be used either because it has desirable properties (such as convexity) or because it arises naturally from a particular formulation of the problem (e.g., in a probabilistic formulation, the posterior probability of the model can be used as an inverse cost). Generally, the cost function will depend on the task we wish to perform. The three main categories of learning tasks are:

• Supervised Learning - Essentially, a strategy that involves a teacher that is smarter than the network itself. Let's take facial recognition as an example. The teacher shows the network a bunch of faces, and the teacher already knows the name associated with each face. The network makes its guesses, then the teacher provides the network with the answers. The network can then compare its answers to the known "correct" ones and make adjustments according to its errors.

• Unsupervised Learning - Required when there isn't an example data set with known answers. Imagine searching for a hidden pattern in a data set, or finding a matched face image inside a group of different faces. An application of this is clustering, i.e. dividing a set of elements into groups according to some unknown pattern.

• Reinforcement Learning - A strategy built on observation. Think of a little mouse running through a maze. If it turns left, it gets a piece of cheese; if it turns right, it receives a little shock. Presumably, the mouse will learn over time to turn left. Its neural network makes a decision with an outcome (turn left or right) and observes its environment (rewarded or punished). If the observation is negative, the network can adjust its weights in order to make a different decision the next time [36].

Each of the above learning types will be presented in a more mathematical way later.

The memorization of patterns and the subsequent response of the network can be categorized into two general paradigms:

Associative mapping is one in which the network learns to produce a particular pattern on the set of output units whenever another particular pattern is applied on the set of input units. Associative mapping can generally be divided into two mechanisms:

In auto-association, an input pattern is associated with itself and the states of the input and output units coincide. This is used to provide pattern completion, i.e. to produce a pattern whenever a portion of it or a distorted pattern is presented. In the second case, the network actually stores pairs of patterns, building an association between two sets of patterns.

Hetero-association is further related to two recall mechanisms:

• Nearest-neighbor recall - Where the output pattern produced corresponds to the stored input pattern closest to the pattern presented.

• Interpolative recall - Where the output pattern is a similarity-dependent interpolation of the stored patterns corresponding to the pattern presented.

Yet another paradigm, which is a variant of associative mapping, is classification, i.e. when there is a fixed set of categories into which the input patterns are to be classified.

In regularity detection, units learn to respond to particular properties of the input patterns. Whereas in associative mapping the network stores the relationships among patterns, in regularity detection the response of each unit has a particular 'meaning'. This type of learning mechanism is essential for feature discovery and knowledge representation.

Information is stored in the weight matrix W of a neural network. Following the way learning is performed, we can distinguish two major categories of neural networks with respect to the way they interact with the weight matrix:

• Fixed networks - Networks in which the weights cannot be changed, i.e. dW/dt = 0. In such networks, the weights are fixed a priori according to the problem to be solved.

• Adaptive networks - Networks which are able to change their weights, i.e. dW/dt ≠ 0.

Learning Paradigms

As mentioned before in Section 3.3.2 on page 15 and Section 3.5.3.2 on page 40, there are three major learning paradigms, each corresponding to a particular abstract learning task. They are: supervised learning, unsupervised learning and reinforcement learning. Normally, any given type of network architecture can be employed in any of those tasks.

In supervised learning, we are given a set of example pairs (x, y), x ∈ X, y ∈ Y, and the aim is to find a function f in the allowed class of functions that matches the examples.

In other words, we wish to infer the mapping implied by the data: the cost function is related to the mismatch between our mapping and the data and it implicitly contains a prior knowledge about the problem domain.

A commonly used cost is the MSE, which tries to minimize the average error between the network's output f(x) and the target value y over all the example pairs. When one tries to minimize this cost using gradient descent for the class of neural networks called MLPs (Multi-Layer Perceptrons, as stated before), one obtains the well-known back-propagation algorithm for training neural networks.

Tasks that fall within the paradigm of supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation). The supervised learning paradigm is also applicable to sequential data (e.g., for speech and gesture recognition). This can be thought of as learning with a "teacher," in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.

In unsupervised learning, we are given some data x, and the cost function to be minimized can be any function of the data x and the network's output, f.

The cost function is dependent on the task (what we are trying to model) and our a priori assumptions (the implicit properties of our model, its parameters and the observed variables).

As a trivial example, consider the model f(x) = a, where a is a constant, and the cost

C = (E[x] − f(x))²

Minimizing this cost will give us a value of a that is equal to the mean of the data. The cost function can be much more complicated. Its form depends on the application: for example, in compression it could be related to the mutual information between x and y; in statistical modeling, it could be related to the posterior probability of the model given the data.

Tasks that fall within the paradigm of unsupervised learning are in general estimation problems; these applications include clustering, the estimation of statistical distributions, compression and filtering.

In reinforcement learning, data x is usually not given, but generated by an agent's interactions with the environment. At each point in time t, the agent performs an action y_t and the environment generates an observation x_t and an instantaneous cost c_t, according to some (usually unknown) dynamics. The aim is to discover a policy for selecting actions that minimizes some measure of long-term cost, i.e. the expected cumulative cost. The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated.

More formally, the environment is modeled as a Markov Decision Process (MDP) with states s_1, …, s_n and actions a_1, …, a_m ∈ A, with the following probability distributions: the instantaneous cost distribution P(c_t | s_t), the observation distribution P(x_t | s_t), and the transition P(s_{t+1} | s_t, a_t), while a policy is defined as a conditional distribution over actions given the observations. Taken together, the two define a Markov Chain (MC). The aim is to discover the policy that minimizes the cost, i.e. the MC for which the cost is minimal.

Artificial neural networks are frequently used in reinforcement learning as part of the overall algorithm. Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision-making tasks.

The behavior of an artificial neural network depends on both the weights and the input-output function (transfer function) that is specified for the units. This function typically falls into one of three categories:

For linear units, the output activity is proportional to the total weighted input.

For threshold units, the output is set at one of two levels, depending on whether the total input is greater than or less than some threshold value [35].

For sigmoid units, the output varies continuously but not linearly as the input changes. Sigmoid units bear a greater resemblance to real neurons than do linear or threshold units, but all three must be considered rough approximations.
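The three unit types can be written down directly (a small illustrative Python snippet; the threshold value is an arbitrary choice):

    import numpy as np

    def linear_unit(net):                 # output proportional to the weighted input
        return net

    def threshold_unit(net, theta=0.0):   # one of two levels around a threshold
        return np.where(net > theta, 1.0, 0.0)

    def sigmoid_unit(net):                # continuous but non-linear response
        return 1.0 / (1.0 + np.exp(-net))

    net = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(linear_unit(net), threshold_unit(net), sigmoid_unit(net), sep="\n")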

To make a neural network that performs some specific task, the method of connecting units to one another must be chosen, and the weights on the connections must be set properly. The connections determine whether it is possible for one unit to influence another; the weights specify the strength of that influence.

Learning Algorithm

Training a neural network model essentially means selecting one model from the set of allowed models (or, in a Bayesian framework, determining a distribution over the set of allowed models) that minimizes the cost criterion. There are numerous algorithms available for training neural network models; most of them can be viewed as a straightforward application of optimization theory and statistical estimation.

Most of the algorithms used in training artificial neural networks employ some form of gradient descent. This is done by simply taking the derivative of the cost function with respect to the network parameters and then changing those parameters in a gradient-related direction.

Evolutionary methods, simulated annealing, expectation-maximization and non-parametric methods are among the other commonly used methods for training neural networks.

Employing Artificial Neural Networks

The greatest advantage of ANNs is their ability to be used as an arbitrary function approximation mechanism which learns from observed data.

The selection of model will depend on the data representation and the application. Overly complex models tend to lead to problems with learning.

Numerous trade-offs exist between learning algorithms. Almost any algorithm will work well with the correct hyper-parameters for training on a particular fixed data set. However, selecting and tuning an algorithm for training on unseen data requires a significant amount of experimentation.

If the model, cost function and learning algorithm are selected appropriately, the resulting artificial neural network can be extremely robust. With the correct implementation, artificial neural networks can be used naturally in online learning and large data set applications. Their simple implementation and the mostly local dependencies exhibited in their structure allow for fast, parallel implementations in hardware.

Applications of Artificial Neural Networks

ANN models can be used to infer a function from observations. This is particularly useful in applications where the complexity of the data or task makes the design of such a function by hand impractical [38].

For real-life applications, the tasks to which artificial neural networks are applied tend to fall within broad categories such as function approximation (regression), classification (pattern recognition), and data processing (clustering, filtering, compression).

Kohonen Self-Organizing Map

SOM Network Architecture

We'll concentrate on the particular type of SOM known as a Kohonen Network, the one that we use for our thesis. This SOM has a feed-forward structure with a single SOM layer (sometimes referred to as the computational layer) arranged in rows and columns. Each neuron is fully connected to all the source nodes in the input layer, as presented in Figure 3.6.1. Attached to every neuron there is a weight vector with the same dimension as the input vectors. The number of input dimensions is usually a lot higher than the output grid dimension.

Clearly, for a one-dimensional map, the SOM layer will just have a single row (or a single column).

Training Process of SOM

The training, or self-organization process involves four major components:

1. Initialization - All the connection weights are initialized with small random values.

2. Competition - For each input pattern, the neurons compute their respective values of a discriminant function, which provides the basis for competition. The particular neuron with the smallest value of the discriminant function is declared the winner.

3. Cooperation - The winning neuron determines the spatial location of a topological neighborhood of excited neurons, thereby providing the basis for cooperation among neighboring neurons.

4. Adaptation - The excited neurons decrease their individual values of the discriminant function in relation to the input pattern through suitable adjustment of the associated connection weights, such that the response of the winning neuron to the subsequent application of a similar input pattern is enhanced.

If the input space is D-dimensional (i.e. there are D input units), we can write the input patterns as x = {x_i : i = 1, …, D}, and the connection weights between the input units i and the neurons j in the computation layer can be written w_j = {w_ji : j = 1, …, N; i = 1, …, D}, where N is the total number of neurons.

We can then define our discriminant function to be the squared Euclidean distance between the input vector x and the weight vector w_j, for each neuron j:

d_j(x) = Σ_{i=1}^{D} (x_i − w_ji)²    (3.12)

In other words, the neuron whose weight vector comes closest to the input vector (i.e. is most similar to it) is declared the winner. In this way the continuous input space can be mapped to the discrete output space of neurons by a simple process of competition between the neurons.

In neurobiological studies we find that there is lateral interaction within a set of excited neurons. When one neuron fires, its closest neighbors tend to get excited more than those further away. There is a topological neighborhood that decays with distance.

We want to define a similar topological neighborhood for the neurons in our SOM. If S_ij is the lateral distance between neurons i and j on the grid of neurons, we take

T_{j,I(x)} = exp(−S²_{j,I(x)} / (2σ²))    (3.13)

as our topological neighborhood, where I(x) is the index of the winning neuron. This has several important properties: it is maximal at the winning neuron, it is symmetrical about that neuron, it decreases monotonically to zero as the distance goes to infinity, and it is translation invariant (i.e. independent of the location of the winning neuron).

A special feature of the SOM is that the size σ of the neighborhood needs to decrease with time. A popular time dependence is an exponential decay:

σ(t) = σ0 exp(−t / τ_σ)    (3.14)

Clearly our SOM must involve some kind of adaptive, or learning, process by which the outputs become self-organized and the feature map between inputs and outputs is formed.

The point of the topographic neighborhood is that not only does the winning neuron get its weights updated, but its neighbors have their weights updated as well, although by not as much as the winner itself. In practice, the appropriate weight update equation is

Δw_ji = η(t) · T_{j,I(x)}(t) · (x_i − w_ji)    (3.15)

in which we have a time (epoch) t dependent learning rate η(t) calculated by

η(t) = η0 exp(−t / τ_η)    (3.16)

and the updates are applied for all the training patterns x over many epochs.

The effect of each learning weight update is to move the weight vectors w_j of the winning neuron and its neighbors towards the input vector x. Repeated presentation of the training data thus leads to topological ordering.

Provided the parameters (σ0, τ_σ, η0, τ_η) are selected properly, we can start from an initial state of complete disorder, and the SOM algorithm will gradually lead to an organized representation of activation patterns drawn from the input space (however, it is possible to end up in a metastable state in which the feature map has a topological defect).
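Putting the competitive, cooperative and adaptive processes of equations (3.12)-(3.16) together gives a compact training loop (an illustrative Python/NumPy sketch; the grid size, parameter values and random training data are assumptions made for the example, not the settings used later in this thesis):

    import numpy as np

    rng = np.random.default_rng(0)
    D, grid = 3, (10, 10)                  # input dimension, SOM layer size
    N = grid[0] * grid[1]
    W = rng.random((N, D))                 # small random initial weights
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])

    sigma0, tau_sigma = 5.0, 1000.0        # neighborhood size decay, eq. (3.14)
    eta0, tau_eta = 0.1, 1000.0            # learning rate decay, eq. (3.16)
    X = rng.random((500, D))               # training patterns

    for t in range(2000):
        x = X[rng.integers(len(X))]
        # Competition: the winner minimizes the discriminant of eq. (3.12).
        win = np.argmin(((W - x) ** 2).sum(axis=1))
        # Cooperation: Gaussian topological neighborhood on the grid, eq. (3.13).
        S2 = ((coords - coords[win]) ** 2).sum(axis=1)
        sigma = sigma0 * np.exp(-t / tau_sigma)
        T = np.exp(-S2 / (2 * sigma ** 2))
        # Adaptation: move the winner and its neighbors towards x, eq. (3.15).
        eta = eta0 * np.exp(-t / tau_eta)
        W += eta * T[:, None] * (x - W)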

There are two identifiable phases of this adaptive process:

1. Ordering or self-organizing phase - During which the topological ordering of the weight vectors takes place. Typically this will take as many as 1000 iterations of the SOM algorithm, and careful consideration needs to be given to the choice of neighborhood and learning rate parameters.

2. Convergence phase - During which the feature map is fine-tuned and comes to provide an accurate statistical quantification of the input space. Typically the number of iterations in this phase will be at least 500 times the number of neurons in the network, and again the parameters must be chosen carefully.

SOM Applications

SOMs are commonly used as visualization aids for classification purposes. They make it easier to see relationships between vast amounts of high-dimensional data. Typical applications of SOMs include visualization of process states or financial results by representing the central dependencies within the data on the map. SOMs have been used in many practical applications, such as [40]:

• State diagrams for processes and machines

Conclusion

Difference between ML and ANN

As mentioned earlier, the difference between machine learning and neural networks is one of application and scale. Machine learning is a process, while a neural network is a construct. Simply put, neural networks are used to do machine learning, but since there are other methods of machine learning, the terms can't be used interchangeably.

Because neural networks are used in deep learning, they can be considered an evolution of machine learning. It is something of a simplistic view, but neural networks operate on a larger and more complex scale than basic machine learning. Thousands of algorithms can be layered together into a single network for more complex calculations.

Figure 3.7.1: Relation between AI, ML and NN

Neural networks do have advantages over other machine learning methods. They're well-suited for generalization and finding unseen patterns. There are no preconceptions about the input and how it should be arranged, which is a problem with supervised learning. Also, neural networks excel at learning non-linear rules and defining complex relationships between data.

Applying ML and ANN in this project

The computing world has gained a lot from both ML and ANNs. Their ability to learn by example makes them very flexible and powerful. They are also very well suited for real-time systems because of their fast response and computational times, due to their parallel architecture.

While machine-learning-powered enterprise analytics is becoming a core facet of business intelligence and has recently established its value globally, neural networks are still being developed and contribute to areas of research such as neurology and psychology. They are regularly used to model parts of living organisms and to investigate the internal mechanisms of the brain.

Even though neural networks and machine learning have huge potential, our current focus in this project is to integrate neural networks with ML, AI, fuzzy logic and related subjects. The ANN, especially the SOM, has been chosen in this project because it is well suited for clustering problems in developing AI systems such as face recognition and identification systems, since neural network training data corresponds to noisy, complex sensor data, such as inputs from cameras. In this project the input data and training database are preprocessed digital images.

Before being processed by machine learning or a neural network, the input image has to be normalized and preprocessed with either an image-based or a feature-based extraction method. In this chapter, we give an introduction to image compression and reconstruction using the Discrete Cosine Transform (DCT) combined with the Anisotropic diffusion (AS) based normalization technique. A comparison to fortify the choice of DCT and AS is also performed. The definition and more detailed applications of the DCT will also be covered in this chapter. Another approach to encode image data to produce a low-dimension feature representation of faces will also be presented here. This method is based on a modified version of Google's FaceNet project [41], which includes a Residual Neural Network to replace the original Convolutional Neural Network.

Discrete Cosine Transform

Introduction

The rapid growth of digital imaging applications has increased the need for effective and standardized image compression techniques. From that need, a basic technique known as the Discrete Cosine Transform (DCT) was developed by Ahmed, Natarajan, and Rao in 1974 [42]. The DCT is a close relative of the Discrete Fourier Transform (DFT).

The discovery of the DCT was an important achievement for the research community working on image compression, since it is an efficient technique for image coding and has been successfully used for face recognition in much research [43]. Later, it became very popular and several versions were proposed [44].

In real-life applications, compression standards like JPEG, MP3, MPEG, and H.263 handle the compression of still images, audio streams, motion video and video telephony, respectively. They all employ the basic technique of the DCT [45].

Properties of DCT

The DCT possesses some fine properties, such as de-correlation, energy compaction, separability, symmetry and orthogonality.

In practice, the DCT is an algorithm that is widely used for data compression, since it converts data (pixels, waveforms, etc.) into sets of frequencies. The first frequencies in the set are the most meaningful; the latter, the least. Most of the signal information tends to be concentrated in a few low-frequency components of the DCT, and therefore high-frequency components can be eliminated, since most high-frequency DCT coefficients are almost zero. Thus, such coefficients can be ignored without degrading the data quality.

Figure 4.1.1 illustrates how the DCT transforms an image from the spatial domain to the elementary frequency domain. Lower frequencies are more obvious in an image than higher frequencies. When an image is transformed into its frequency components and the higher-frequency coefficients are discarded, the amount of data needed to describe the image without sacrificing too much image quality is reduced. Hence, it can be concluded that the DCT de-correlates the image data, and after the transform each coefficient can be encoded independently without losing compression efficiency.

Figure 4.1.1: DCT transforms an image from the spatial domain to the elementary frequency domain

Now assume that we are going to apply the DCT to compress pixels. Each pixel will become one DCT coefficient; therefore we obtain an 8×8 DCT matrix for each 8×8 block of pixels. When reconstructing a compressed image, a small number of DCT coefficients is used to reduce redundancy and recover the original image from the selected coefficients. The coefficients with the largest magnitude (i.e. low frequency) are mainly located in the upper left corner of the DCT matrix. This section is related to illumination variation and smooth regions such as the forehead and cheeks of the face image. On the contrary, the coefficients with the lowest magnitude (i.e. high frequency) are situated in the bottom right corner of the DCT matrix, and the coefficients with mid magnitude (i.e. medium frequency) are found in the middle region. This represents the general structure of the face, as in Figure 4.1.2 [6].

Figure 4.1.2: Three frequency regions of a DCT matrix and its histogram, applied to a sample image block containing 8×8 pixels

Many researchers have tried to truncate the number of DCT coefficients. Some researchers suggested saving the first five values from every DCT block to restore the image without significant errors [46]. Besides, in an attempt to increase the performance of a face system with variation in facial geometry and illumination [47], a 2D-DCT method was proposed for feature extraction. Recently, an advanced DCT through histograms was developed, which improves the compression ability of the DCT [48].

Definition of DCT

Based on the properties listed above, we can say that the DCT is regarded as a discrete-time version of the Fourier cosine series. Hence, it is considered a Fourier-related transform similar to the DFT, but using only real numbers. Since the DCT is real-valued, in applications it provides a better approximation of a signal with fewer coefficients produced (see Figure 4.1.3).

The first thing we have to note is that the DCT is a linear invertible function f : R^N → R^N (where R denotes the set of real numbers), or equivalently an N×N square matrix. There are many types of DCT, varying from type I to VIII. The DCT-II (type-2 DCT) is probably the most commonly used form, and is often simply referred to as "the DCT" [49].

Mathematically, the One-Dimensional DCT (1D-DCT) X[k] of a sequence x[n] of length N is defined as:

X[k] = α[k] Σ_{n=0}^{N−1} x[n] cos[π(2n+1)k / (2N)],  k = 0, 1, …, N−1    (4.1)

Figure 4.1.3: Amplitude spectra of the gray-scaled image above, under the DFT and the DCT

From that, we obtain the inverse 1D-DCT as:

x[n] = Σ_{k=0}^{N−1} α[k] X[k] cos[π(2n+1)k / (2N)],  n = 0, 1, …, N−1    (4.2)

In equations 4.1 and 4.2 above, α[k] is:

α[k] = √(1/N) for k = 0, and α[k] = √(2/N) for k = 1, …, N−1

The basis sequences of the 1D-DCT are real discrete-time sinusoids, defined by:

c_N[n, k] = cos[π(2n+1)k / (2N)]

Each element of the transformed list X[k] in equation 4.1 is the inner (dot) product of the input list x[n] and a basis vector. The constant factors are chosen so that the basis vectors are orthogonal and normalized. From this point, the DCT can be written as the product of a vector (the input list) and the N×N orthogonal matrix whose rows are the basis vectors.

In MATLAB, for normalization, this function has the X[0] term multiplied by 1/√2, and the resulting matrix is multiplied by an overall scale factor of √(2/N). This makes the DCT-II matrix orthogonal, but breaks the direct correspondence with a real-even DFT of half-shifted input [49].
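As a quick check of equations (4.1) and (4.2), the orthonormal DCT-II and its inverse are available directly in SciPy (a short illustrative Python snippet, assuming SciPy is installed):

    import numpy as np
    from scipy.fftpack import dct, idct

    x = np.array([8.0, 16.0, 24.0, 32.0])     # a short 1D signal
    X = dct(x, type=2, norm='ortho')          # forward 1D-DCT, eq. (4.1)
    x_back = idct(X, type=2, norm='ortho')    # inverse 1D-DCT, eq. (4.2)

    print(np.allclose(x, x_back))             # True: the transform is invertible
    print(X[0], np.sqrt(len(x)) * x.mean())   # DC term equals sqrt(N) * mean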

The 1D-DCT above is useful in processing one-dimensional signals such as speech waveforms. For the analysis of two-dimensional (2D) signals such as images, we need a 2D version of the DCT.

For an N×N square matrix, the 2D-DCT is computed in a simple way: the 1D-DCT is applied to each row of the matrix and then to each column of the result [45]. In detail, we can see Figures 4.1.4 and 4.1.5 below:

Figure 4.1.4: The first step in computing the 2D-DCT of an N1×N2 matrix is to apply the 1D-DCT to the elements of each row

Figure 4.1.5: Next, we compute the 1D-DCT of each column of the matrix obtained in the step above

Now we have the 2D-DCT matrix C_x[k1, k2] converted from the original x[n1, n2] matrix. Since the 2D-DCT can be computed by applying 1D transforms separately to the rows and columns, the 2D-DCT is separable in the two dimensions.
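Separability means the 2D-DCT can be built from the 1D routine exactly as Figures 4.1.4 and 4.1.5 describe (an illustrative Python/SciPy sketch; the helper name dct2 is our own):

    import numpy as np
    from scipy.fftpack import dct

    def dct2(a):
        # Apply the 1D-DCT to every row, then to every column of the result.
        return dct(dct(a, type=2, norm='ortho', axis=1), type=2, norm='ortho', axis=0)

    block = np.arange(64, dtype=float).reshape(8, 8)
    C = dct2(block)
    print(C[0, 0])   # DC coefficient: 8 times the block mean for an 8x8 block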

Figure 4.1.6: A more generic 2D-DCT architecture for an 8×8 matrix

Regarding the 2D-DCT's properties, it is similar to a Fourier transform but uses purely real math. It has purely real transform-domain coefficients and works strictly with positive frequencies. The 2D-DCT is equivalent to a DFT of roughly twice the length, operating on real data with even symmetry, where in some variants the input and output data are shifted by half a sample.

As the 2D-DCT is simpler to evaluate than the Fourier transform, it has become the basic algorithm in image compression standards such as JPEG. In image processing, the 2D-DCT represents an image as a sum of sinusoids of varying magnitudes and frequencies. It has the special property that, for a typical image, most of the visually significant information about the image is concentrated in just a few coefficients of the DCT.

The series form of the 2D-DCT is defined as:

C_x[k1, k2] = α[k1] α[k2] Σ_{n1=0}^{N1−1} Σ_{n2=0}^{N2−1} x[n1, n2] cos[π(2n1+1)k1 / (2N1)] cos[π(2n2+1)k2 / (2N2)]

The equation above is also called the forward transform of the 2D-DCT. Mathematically, the DCT is reversible and there is no loss of image definition until the coefficients are quantized. Therefore, we have the inverse transform (2D-IDCT) as follows:

x[n1, n2] = Σ_{k1=0}^{N1−1} Σ_{k2=0}^{N2−1} α[k1] α[k2] C_x[k1, k2] cos[π(2n1+1)k1 / (2N1)] cos[π(2n2+1)k2 / (2N2)]

For further application in image processing, we can use the 2D-DCT to create a set of basis functions for a given, known input array size that can be pre-computed and stored. Each 2D basis matrix (8×8 DCT coefficients) is the outer product of two 1D basis vectors. Each basis matrix can be thought of as an image and is characterized by a horizontal and a vertical spatial frequency. This involves computing values for a convolution mask (8×8 window) [50] that will be presented in later parts. The 8×8 DCT basis functions are shown in Figure 4.1.7 below:

Figure 4.1.7: The DCT transforms an 8×8 block of input values to a linear combination of these 64 patterns. The patterns are referred to as the 2D-DCT basis functions

The pixels in the DCT image in Figure 4.1.1 describe the proportion of each two-dimensional basis function present in the image. Each basis function, grouped into the matrices, is characterized by a horizontal and a vertical spatial frequency. The matrices shown in Figure 4.1.7 are arranged left to right and top to bottom in order of increasing frequency. The top-left function (brightest pixel) is the basis function of the "DC" coefficient, which represents zero spatial frequency and defines the basic hue for the entire block (see Figure 4.1.8). It is the average of the pixels in the input, and is typically the largest coefficient in the DCT of an image. Along the top row the basis functions have increasing horizontal spatial frequency content; down the left column the functions have increasing vertical spatial frequency content.

Figure 4.1.9 below illustrates the comparison between a normal picture matrix and a DCT coefficient matrix.

For most images, much of the signal energy lies at low frequencies; these appear in the upper left corner of the DCT matrix, as shown in Figure 4.1.9. Compression is achieved since the lower right values represent higher frequencies, and are often small enough to be neglected with little visible distortion.

Figure 4.1.8: A more generic view of the usage of the 2D-DCT basis functions. The DC coefficient is the mean of each corresponding block; the AC coefficients are simply the remaining coefficients in the block

Figure 4.1.9: Comparison between pixel data matrix and DCT coefficients

When used for image compression, the 2D-DCT is restricted to a limited computation size. Therefore, in practice, rather than taking the transform of the whole image, the 2D-DCT is applied separately to 8×8 blocks of the image. To compute the 2D blocked DCT, since the 2D-DCT is separable, each row is partitioned into lists of length 8, the 1D-DCT is applied to them, the resulting lists are rejoined, and then the whole image is transposed and the process repeated.

For each block of the input image, we have 8×8 pixels, and the single DCT input is a 64×1 array of integers. This array contains each pixel's gray-scale level, where the pixels have levels from 0 to 255. When the two-dimensional blocked DCT is computed, the output DCT coefficients (ranging from −1024 to 1023) are quantized, entropy coded and transmitted. Generally, the DCT formula is applied to each row and column of the block (refer to Figure 4.1.6).
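A minimal sketch of the blocked transform with coefficient masking (illustrative Python; dct2 and idct2 follow the separable construction shown earlier, and the triangular low-frequency mask is an arbitrary example):

    import numpy as np
    from scipy.fftpack import dct, idct

    def dct2(a):
        return dct(dct(a, type=2, norm='ortho', axis=1), type=2, norm='ortho', axis=0)

    def idct2(a):
        return idct(idct(a, type=2, norm='ortho', axis=1), type=2, norm='ortho', axis=0)

    def blocked_dct_compress(img, keep):
        # keep: an 8x8 binary mask selecting which DCT coefficients survive.
        out = np.zeros(img.shape, dtype=float)
        for r in range(0, img.shape[0], 8):
            for c in range(0, img.shape[1], 8):
                block = img[r:r+8, c:c+8].astype(float)
                out[r:r+8, c:c+8] = idct2(dct2(block) * keep)
        return out

    # Keep only the low-frequency coefficients in the upper-left triangle.
    keep = np.fromfunction(lambda i, j: (i + j) < 4, (8, 8)).astype(float)
    img = np.random.default_rng(0).integers(0, 256, size=(64, 64))
    approx = blocked_dct_compress(img, keep)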

In a more overall view, Figure 4.1.10 illustrates three 2D blocked DCT matrices obtained by dividing the face image into N1×N2 blocks, where each block consists of 8×8 pixels and each pixel contributes one DCT coefficient. Therefore, in practice, we prefer to use square images whose number of pixels per side is divisible by 8 as input, to make N1 = N2 (512×512 pixels, for example).

Figure 4.1.10: Illustration of the 2D blocked DCT of a face image with 8×8 blocks. These blocks were randomly picked from the image [3]

DCT Image Compression in JPEG format

In 1992 the Joint Photographic Experts Group (JPEG) established the first international standard for still image compression, in which the encoders and decoders were Discrete Cosine Transform (DCT) based.

In the DCT algorithm used for JPEG image compression, the input image is first divided into blocks and the 2D-DCT of each block is computed. The resultant DCT coefficients for each block are quantized separately, discarding redundant information, and the high-frequency coefficients are neglected. After that, to reconstruct the image, we decode the quantized DCT coefficients of each block separately and compute the 2D-IDCT of each block to reconstruct the compressed image block. Finally, the reconstructed blocks are put back together to form an image (refer to Figure 4.1.11).

The image file size is also reduced, because the DCT coefficients are divided by a quantization matrix in order to get long runs of zeros. Quantized coefficients are encoded separately using a lossless method (Huffman coding, for example). Although there is some loss of quality in the reconstructed image, it is still recognizable as an approximation of the original image [51].

Figure 4.1.11: Illustration of the JPEG image compression processes using DCT and IDCT [4]

For example, an 8-bit gray-scale face image of a subject in JPEG format is chosen to be compressed (the image is resized to 64×64 pixels in this case). The image is divided into blocks, and each block consists of 8×8 pixels. From each pixel we acquire a single input as an integer. This array contains each pixel's gray-scale level, where the 8-bit pixels have levels from 0 to 255. When the two-dimensional blocked DCT of the 8×8 blocks is computed, the output DCT coefficients range from −1024 to 1023 and are quantized after that to further reduce the data size. This example is illustrated in Figure 4.1.12.

Up to now we have obtained the blocked 2D-DCT matrices with the quantized coefficients, but we still haven't taken full advantage of the DCT properties. We noted above that the DCT output contains most of the input image data in the upper left corner of the matrix, and the lower right values can be excluded without affecting the reconstructed image quality too much. Hence, we can use an 8×8 mask matrix in which the first coefficients are valued "1" and all the remaining are zeros (refer to Figure 4.1.13). Applying this mask to all the 2D-DCT matrices selects their first DCT coefficients out of the 64 in each block.

In this example, only 8 DCT coefficients are selected out of the 64 for masking in each block. Selecting fewer DCT coefficients for masking leads to higher compression but greater quality deterioration of the compressed image. From this point, the image data can be stored or transmitted as a bit stream by using zigzag ordering of the DCT coefficients (refer to Figure 4.1.14).

Figure 4.1.12: Illustration of the blocked 2D-DCT of a face image with 8×8 blocks.

Figure 4.1.13: An example of the 8×8 matrix used as the mask, with the first 10 coefficients selected.

If we want to reconstruct the image to see the result, we can use the 2D-DCT matrices obtained after applying the mask above. The generated masked image is concentrated on a few low-frequency components, as the high-frequency components were discarded during quantization. In particular, the masked image is reconstructed using the 2D-IDCT transform. The resultant image is a compressed image, which is blurred due to the loss of quality, evidently showing the block structure [52]. Figure 4.1.15 below illustrates this so-called JPEG compression example. A more detailed example can be found in Figures 4.1.16 and 4.1.17 on page 65.

Figure 4.1.14: Zigzag ordering of DCT coefficients, converting a 2D matrix into a 1D array of integers, as used in JPEG image compression. The frequency (horizontal and vertical) increases in this order, and the coefficient variance (average of the magnitude squared) decreases in this order.
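The zigzag scan of Figure 4.1.14 can be generated programmatically (an illustrative Python sketch of the standard JPEG traversal):

    def zigzag_indices(n=8):
        # Visit (row, col) pairs diagonal by diagonal, alternating direction.
        order = []
        for s in range(2 * n - 1):
            lo, hi = max(0, s - n + 1), min(s, n - 1)
            rows = range(hi, lo - 1, -1) if s % 2 == 0 else range(lo, hi + 1)
            order.extend((i, s - i) for i in rows)
        return order

    idx = zigzag_indices()
    print(idx[:6])   # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]

Flattening a quantized block into a 1D stream is then just [block[i][j] for i, j in idx].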

Other Applications of DCT

The DCT has a number of applications in information theory. The DCT is widely used in JPEG image compression and in MJPEG, MPEG, and DV video compression. DCTs are also widely employed in solving partial differential equations by spectral methods, where the different variants of the DCT correspond to slightly different even/odd boundary conditions at the two ends of the array.

Conclusion

Since the high information redundancy and correlation in face images result in inefficiencies when such images are used for face recognition, DCTs can be used to reduce image information redundancy, because only a subset of the transform coefficients is necessary to preserve the most important facial features, such as the hair outline, eyes and mouth.

It will be demonstrated experimentally later that when DCT coefficients are fed into a neural network, an efficient, high-speed face recognition rate can be achieved by using a very small proportion of the transform coefficients. This makes DCT-based face recognition much faster than other approaches.

Illumination Normalization

An Engineering Approach

Feature extraction under uncontrolled illumination conditions is a major concern and a great challenge for recognition in real-world applications [53]. Illumination Normalization (IN) is a preprocessing technique that compensates for different lighting conditions. Before carrying out feature extraction by different methods, a variety of IN techniques were used to normalize the illumination under such constraints. This is to restore a face image back to its normal lighting condition [54].

Figure 4.1.15: 2D-DCT and IDCT applied on a sample face image in the JPEG compression example.

In practice, we often normalize the illumination of images before extracting their features using the DCT, in order to compensate for the coefficient variances and set the blocks to a more equalized intensity. It is very important to categorize preprocessors according to their suitability for a particular algorithm. For this reason, many studies were conducted to choose the best IN technique specifically for DCT performance improvement. Researchers have attempted to improve face recognition under uncontrolled lighting conditions through the choice of various techniques for illumination compensation and preprocessing enhancement.

Although the number of effective IN techniques in the literature is very small [55], the choice of a technique that improves a particular algorithm is still very necessary. The remaining problem is that, until now, no IN study has focused on face recognition using the DCT by finding the most suitable preprocessor. Our main motivation behind emphasizing the DCT is that most practical transform coding systems are based on the DCT, since it provides a good compromise between information packing ability and computational complexity [56]. Therefore, the 21 most well-known IN techniques were established and tested, and finding the one that gives the best combinative compression with the DCT and can be reconstructed with the least error is the goal of this evaluation.

Figure 4.1.16: An example of the above process from an image block’s pixel data to quantized DCT coefficients.

Figure 4.1.17: An example of the above process from quantized DCT coefficients to a quantized image block, using the IDCT. By using quantized DCT coefficients to store data, we save a lot of bits, but we no longer have an exact replica of the original image block.

IN Techniques

The 21 most well-known IN techniques proposed below are available in a MATLAB toolbox called INFace (Illumination Normalization techniques for robust Face recognition).

It is a free-to-share collection of MATLAB functions and scripts intended to help researchers working in the field of face recognition [57].

In detail, these 21 IN techniques are [58]:

• Single Scale Retinex (SSR) algorithm

• Multi Scale Retinex (MSR) algorithm

• Adaptive Single scale Retinex (ASR) algorithm

• HOMOmorphic filtering (HOMO) based normalization technique

• Single Scale self-Quotient image (SSQ)

• Multi Scale self-Quotient image (MSQ)

• Discrete Cosine Transform (DCT) based normalization technique

• WAvelet (WA) based normalization technique

• Wavelet Denoising (WD) based normalization technique

• ISotropic diffusion (IS) based normalization technique

• AniSotropic diffusion (AS) based normalization technique

• Steerable Filter (SF) based normalization technique

• Non-Local Means (NLM) based normalization technique

• Adaptive Non-Local means (ANL) based normalization technique

• Modified AniSotropic diffusion (MAS) normalization technique

• Single scale WEBerfaces (WEB) normalization technique

• MultiScale Weberfaces (MSW) normalization technique

• Large and Small Scale Features (LSSF) normalization technique

• Tan and Triggs (TT) normalization technique

• Difference Of Gaussian filtering (DOG) based normalization technique

4.2.2.2 Finding the least-error IN technique

All the calculations and tests described below in this section are presented in the document Enhanced Face Recognition Using Discrete Cosine Transform, which was financially supported by Kano State Scholarship Board, batch 502, under the Kano State Government Nigeria, and academically supported by Universiti Sultan Zainal Abidin (UniSZA), Kuala Terengganu, Terengganu, Malaysia [6].

To conduct the test of the 21 IN techniques, in order to find the one that gives the best combinative compression with the DCT and can be reconstructed with the least error, we need to run some simulations. A sample image from our database was normalized by implementing each of the IN technique source codes provided in the toolbox. After that, the image was compressed using the principle of JPEG compression with the 2D-DCT. The compression was carried out to test which illumination technique would compensate for lighting effects without significantly degrading the quality of the images. Accordingly, the following steps were executed:

1. Load a sample image from the database.

2. Apply the IN technique and set the compression quality.

3. Compress the image using the 2D-DCT and reconstruct it using the IDCT.

4. Plot the original image and the compressed images, and save them to files (refer to Figure 4.2.1).

5. Compute the error measurements between the original and reconstructed images.

In step 2, after performing the IN technique on the image and applying the 2D-DCT compression, only 60 percent of the DCT coefficients were extracted for encoding, using the mask mentioned in the DCT section. The compressed features were obtained by run-length encoding, and then reconstructed using the IDCT. The reconstructed features were retrieved to determine the quality variation between the original image and its reconstructed image.

After that, in step 5, error measurements were computed using these metrics: Peak Signal-to-Noise Ratio (PSNR), Mean Square Error (MSE), Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). A higher PSNR and lower values of the other errors mean a better result.
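These four metrics are straightforward to compute (illustrative Python/NumPy; 255 is the peak value for 8-bit gray-scale images):

    import numpy as np

    def error_metrics(original, reconstructed, peak=255.0):
        diff = original.astype(float) - reconstructed.astype(float)
        mse = np.mean(diff ** 2)                 # Mean Square Error
        rmse = np.sqrt(mse)                      # Root Mean Square Error
        mae = np.mean(np.abs(diff))              # Mean Absolute Error
        psnr = 10 * np.log10(peak ** 2 / mse) if mse > 0 else np.inf
        return psnr, mse, rmse, mae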

In addition, the average results over the images were also calculated. The results are stated in Table 3.

Table 3: Comparison of the errors of the 21 proposed techniques, reporting PSNR (dB), MSE, RMSE and MAE for each technique [6]

The results show that the AS technique has the highest PSNR and the lowest MAE, as indicated in Table 3. For this reason, AS is proven to be the best and most suitable IN technique for recognition using the DCT. Hence, the filtered images of AS demonstrated high similarity with their original images despite the lossy compression.

Actually, the working principle of the AS technique is to estimate the luminance function using anisotropic smoothing. Therefore, it is to be expected that the design of the DCT with the AS technique will compensate for lighting effects and noise, and preserve object edges effectively. The comparison result table is just another piece of evidence for this prediction.

4.2.2.3 Applying and testing AS with DCT in image compression

After the comparison to find the least-error IN technique, the DCT was again applied on the image preprocessed by the chosen AS technique for further simulation.

Before applying AS and the DCT to calculate the compression performance of this method, a necessary preprocessing step was performed to make the images suitable for efficient feature extraction (resizing, gray-scaling), and AS was applied to remove the illumination variations. The performance of the proposed DCT and AS techniques was evaluated on many different sample image sets.

Figure 4.2.1: Results showing the effect of the 21 IN techniques applied on a sample image

All the images were gray-scale, taken against a simple homogeneous background, with the subjects in an upright, frontal position with tolerance for some side movement.

Suppose i is the number of training images, and k is the total number of images belonging to a particular subject h in each database. The training process begins by computing a mean image from the i training images. Therefore, the mean image A_i(x, y) can be described as:

A_i(x, y) = (1/i) Σ_{g=1}^{i} I_g^h(x, y)

where I_g^h(x, y) represents an image g of subject h from the number of subjects considered.

Then the DCT of the mean image A_i(x, y) is computed. The definition of the DCT for an input image f(x, y) and output image F(u, v) of size M×N can be written as:

F(u, v) = α(u) α(v) Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} f(x, y) cos[π(2x+1)u / (2M)] cos[π(2y+1)v / (2N)]

The algorithm implementation procedure of the proposed AS and DCT method, using JPEG coding of an image, can be illustrated step by step as follows:

1. Load the images of the same size from a database (in our test, the images' resolution is …).

2. Perform a blocked DCT for a given image.

3. Implement a 2D-DCT for each of the 8×8 blocks.

4. Apply the JPEG default normalization matrix (Q) to normalize the DCT transformed coefficients (refer to Figure 4.2.2).

5. Scale the DCT transformed image by the maximum coefficient value in the transformed image (i.e. a scale factor).

6. Perform an integer conversion of the DCT transformed image.

7. Select the number of DCT coefficients, 1 through 64, for the image to be compressed (using the mask).

8. Perform ordering of the transformed coefficients by zigzag scan, using MATLAB matrix indexing.

9. Save the obtained bit streams as feature vectors during enrollment, then compare the saved features with new features during testing.

10. Measure and evaluate the performance (verification and identification rates) of each algorithm.

Figure 4.2.2: The JPEG default normalization matrix (Q). The DCT coefficient matrix is divided element-wise by the corresponding elements of Q and rounded to get the normalized DCT coefficients.
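The rounded element-wise division by Q, and the corresponding de-quantization on the decoder side, can be sketched as follows (illustrative Python; the matrix shown is the standard JPEG luminance quantization table, which we assume here to be the Q of Figure 4.2.2):

    import numpy as np

    # Standard JPEG luminance quantization matrix.
    Q = np.array([
        [16, 11, 10, 16,  24,  40,  51,  61],
        [12, 12, 14, 19,  26,  58,  60,  55],
        [14, 13, 16, 24,  40,  57,  69,  56],
        [14, 17, 22, 29,  51,  87,  80,  62],
        [18, 22, 37, 56,  68, 109, 103,  77],
        [24, 35, 55, 64,  81, 104, 113,  92],
        [49, 64, 78, 87, 103, 121, 120, 101],
        [72, 92, 95, 98, 112, 100, 103,  99]])

    def quantize(dct_block):
        return np.round(dct_block / Q).astype(int)   # rounded element-wise division

    def dequantize(q_block):
        return q_block * Q                           # approximate DCT coefficients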

The proposed algorithm above is demonstrated in Figure 4.2.3 [6]:

Figure 4.2.3: Algorithm implementation procedure of the proposed DCT using JPEG coding of an image (called ASDCT, meaning AS and DCT)

In the last step of the above algorithm, the results have to be evaluated graphically with some numerical outcomes. In order to observe the behavior of the recognition process, different algorithms were tested on some selected databases. Extensive experiments have to be conducted and evaluated by varying the False Acceptance Rate (FAR), which represents the percentage of invalid inputs that are incorrectly accepted. The verification results should be observed at 1%, 10% and 100% FAR.

Obviously, the results are expected to vary across the databases. Receiver Operating Characteristic (ROC) and Cumulative Match Characteristic (CMC) curves are generated to describe the results of each method. While the ROC curve illustrates the rates of correct and false detection, the CMC curve shows the probability of detection at numerous ranks, or in other words, for the entire database [59].

As a result, the verification rates at different FARs are presented in the ROC curve (refer to Figure 4.2.4), while the CMC curve depicts the rank recognition rates (as can be seen in Figure 4.2.5). The performance is demonstrated based on aggregate statistics and the relative ordering of matching rates. As expected, the proposed method ASDCT achieved high recognition accuracy.

Figure 4.2.4: Receiver operating characteristic (ROC) curve of ASDCT with Mahalanobis Cosine distance (MAHCOS [5]) on a sample image set [6].

Figure 4.2.5: Cumulative match characteristic (CMC) curve of ASDCT with Mahalanobis Cosine distance (MAHCOS [5]) on a sample image set [6].

Conclusion

Out of the 21 IN techniques, AS is proven to be the most suitable preprocessor for face recognition using the DCT. Hence, an enhanced DCT technique has been developed for efficient face recognition. This method improves the performance of a face recognition system under varying illumination conditions.

A justifiable reason for carrying out the study is to confirm that some feature extraction algorithms are more sensitive to illumination than others, and that no algorithm is insensitive to illumination variations. In other words, each of the IN techniques works better with some feature extractors than with others. Therefore, recognition using AS and a few selected coefficients of the DCT has provided additional performance over the traditional face recognition techniques. In particular, the AS technique estimates the luminance function using anisotropic smoothing and supports the decorrelation of the data in this experiment, preserving object boundaries effectively without degrading the quality of the images.

The discovery of the most suitable illumination compensator above was carried out through the JPEG standard method, which is one of the typical image compression techniques that use the DCT. Therefore we can say that, to determine the preprocessor that works best with a given feature extractor, the testing criterion needs to reflect the properties of the extractor, which we can see clearly in the above cases.

The major contributions of the proposed algorithm ASDCT are:

1. It promotes the decorrelation of the DCT sufficiently.

2. It compensates for the effects of illumination and creates almost invariant input images.

However, the limitation of this study is that we can only assure the use of AS for efficient recognition using the DCT algorithm; it has not yet been proven for algorithms other than the DCT.

Furthermore, we can see that the normalizing effect of the AS technique can only be observed within certain limits. Normally, with poor illumination, it is difficult for all algorithms and classifiers to recognize the faces, since illumination appears in images as noise. Nevertheless, recognition with the DCT can still be greatly improved by applying AS to compensate for illumination as much as possible in uncontrolled environments.

Residual Network for Image Data Encoding

Motivation

Image compression has been a significant task in the field of signal processing for many decades, in order to achieve efficient transmission and storage. Classical image compression standards, such as the 2D-DCT based JPEG, usually rely on quality standards for the original image quantized into block-based forms. Along with the fast development of new image formats and high-resolution camera devices, existing image compression standards cannot be expected to remain optimal and general compression solutions.

One way to overcome the above limitation is to use deep-learning-based image compression methods, which have proved efficient in achieving better subjective quality at extremely low bit rates. Based on this, we propose to utilize deep residual learning to maintain the same receptive field with fewer parameters. This strategy not only reduces the model size, but also improves the performance greatly.

In the previous section, we developed a know-how of how the DCT compresses image data from its spatial form to a truncated frequency region, which is the more physical way. In the case of FaceNet, we have another approach to this task: represent the image as a vector of coefficients, encoded in such a way that images can be separated by comparing those variables. As a result, FaceNet "directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity" [41].

Figure 4.3.1 shows the system structure of FaceNet. We can choose which "deep architecture" is used, as well as the Euclidean distance optimization method. Originally, FaceNet used a typical Convolutional Neural Network (CNN) to train on and transform the image data.

Figure 4.3.1: General structure of FaceNet and its learning objective

We suggest replacing the CNN with its modification, ResNet, to reduce the so-called vanishing and exploding gradients that happen with deep neural networks, as mentioned in section 3.5.2.1.6. Presented by Microsoft in 2015, this architecture achieved a 3.57 percent top-5 error on the ImageNet dataset and won 1st place in the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) classification competition [63].

In this section, we will try to use ResNet to compress and encode the image into a representation that can be applied in a further face recognition system. Other applications of the proposed method will also be shown.

Deviation from other Neural Networks

About the properties of ResNet, we can refer to CNN since it is basically an upgrade of CNN with added residual connection between layers We already know the main

4.3 R E S I D U A L N E T W O R K F O R I M A G E D ATA E N C O D I N G 76 application of CNN is feature extraction in images The same trait applies for ResNet as well As stated in section3.5.2.1.6, we only need to focus on main different with CNN: the identity block and convolutional block.

The identity block is the standard block used in ResNet and corresponds to the case where the input activation has the same dimension as the output activation Its’ structure is very similar with typical CNN convolutional operation: a few convolutional layers, with selected filters and some support data reduction functions But the different is one shortcut path has been included to add input and output together before performs final activation function.

Assume we are analyzing a ResNet neuron with a matrix inside regards as pixel block data, which is an output of local receptive field, here we will pass it through 3 convolutional operations with filter size 1×1 and 3×3 Zero padding has been applied to fit such filters with the analyzing block Batch normalization is also can be used to normalizes the output of a previous activation layer by subtracting the batch’s mean and dividing by the batch standard deviation The activation function is the commonly used Rectified Linear Unit (ReLU) to eliminate negative values from the output.

Figure 4.3.2: Identity block, taken from Figure 3.5.11

Here is a step-by-step account of how ResNet operates inside each identity block, from left to right in Figure 4.3.2:

1. Perform a convolutional operation with a 1×1 filter and stride of 1. In a computer vision application, each value in the matrix on the left corresponds to a single pixel value, and we convolve the 1×1 filter with the image by multiplying its values element-wise with the original matrix, then summing them up and adding a bias. Finally, we apply batch normalization and the ReLU activation function. The output of this operation is another matrix of the same size, whose values represent the feature response of the input data to the applied filter.

2. Perform a convolutional operation with a 3×3 filter and stride of 1, with zero padding added to the output matrix of step 1, so that the output matrix of this step has the same size as the input. The calculation is the same as in the previous step, including batch normalization and the ReLU activation function.

3. Perform a convolutional operation with a 1×1 filter and stride of 1 on the output matrix from step 2. No activation function is applied in this step.

4. The input of step 1 and the output of step 3 are added together, as they have the same matrix size. After that, apply the ReLU activation function to obtain the output of the identity block; a code sketch of the whole block follows.
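As a minimal illustration of these four steps, here is a sketch using TensorFlow's Keras API (an assumption on our part, not the library used in our implementation; the filter counts passed in are placeholders):

```python
import tensorflow as tf
from tensorflow.keras import layers

def identity_block(x, filters):
    """Identity block: the shortcut adds the unchanged input back to the
    main path before the final ReLU, so input and output dims must match."""
    f1, f2, f3 = filters  # f3 must equal the input's channel count
    shortcut = x
    y = layers.Conv2D(f1, 1, strides=1, padding="same")(x)   # step 1: 1x1 conv
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(f2, 3, strides=1, padding="same")(y)   # step 2: 3x3 conv, zero padding
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(f3, 1, strides=1, padding="same")(y)   # step 3: 1x1 conv, no activation
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])                          # step 4: add the shortcut
    return layers.Activation("relu")(y)                      # final ReLU
```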

As you can see in Figure 3.5.11 in section 3.5.2.1.6, the data dimensions are not kept unchanged throughout the network. There are cases where we need to reduce or increase the data dimensions by changing the filter size or stride level; as a result, the input and output dimensions no longer match up along the shortcut path. In such situations, the convolutional block can be applied. It has one difference from the identity block: there is a convolutional layer in the shortcut path.

Figure 4.3.3: Convolutional block, taken from Figure 3.5.11

The convolutional layer in the shortcut path is used to resize the input to a different dimension, so that the dimensions match up in the final addition that adds the shortcut value back to the main path. Figure 4.3.3 does not visualize the stride level, so let us perform another step-by-step walkthrough for the convolutional block.

1. Perform a convolutional operation with a 1×1 filter and stride of 2 if we expect to reduce the activation's height and width, as in this case, by a factor of 2. Different stride or filter values can be used depending on the desired output matrix size.

2. Perform a convolutional operation with a 3×3 filter and stride of 1, with zero padding added, exactly as in the identity block's 2nd step.

3. Perform a convolutional operation with a 1×1 filter and stride of 1 on the output matrix from step 2, so that the output matches the dimensions expected for the final addition. No activation function is applied.

4. Perform a convolutional operation with a 1×1 filter and stride of 2 on the input of step 1, as the projection shortcut path. The result has the same dimensions as the output of step 3. No activation function is applied.

5. The outputs of step 3 and step 4 are added together, as they have the same matrix size. After that, apply the ReLU activation function to obtain the output of the convolutional block; a code sketch of this block follows.
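Under the same assumptions as the identity-block sketch above (reusing its imports), the convolutional block differs only in the stride of the first convolution and the extra 1×1 convolution on the shortcut path:

```python
def convolutional_block(x, filters, stride=2):
    """Convolutional block: a strided 1x1 conv on the shortcut resizes the
    input so that the final addition matches the main path's dimensions."""
    f1, f2, f3 = filters
    y = layers.Conv2D(f1, 1, strides=stride, padding="same")(x)  # step 1: 1x1 conv, stride 2
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(f2, 3, strides=1, padding="same")(y)       # step 2: 3x3 conv
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(f3, 1, strides=1, padding="same")(y)       # step 3: 1x1 conv, no activation
    y = layers.BatchNormalization()(y)
    s = layers.Conv2D(f3, 1, strides=stride, padding="same")(x)  # step 4: shortcut conv
    s = layers.BatchNormalization()(s)
    y = layers.Add()([y, s])                                     # step 5: add the two paths
    return layers.Activation("relu")(y)
```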

Applying ResNet in image encoding

Upon the release of ResNet in 2015, the authors introduced several basic types of ResNet according to network depth, including ResNet-18, ResNet-34 and ResNet-50. All of them share the same properties, arranging layers of different identity and convolutional blocks. Further research has already performed multiple experiments with different depths of ResNet, including on ImageNet [7], and we can inherit those results to select the most suitable variant for our project.

Figure 4.3.4: Proposed ResNet architectures from [7] and their performance

The deeper the ResNet, the more computation is required. For an effective face recognition system, we need balanced performance between execution time and accuracy to ensure practical applicability, as some theoretical models require expensive CPUs. Thus we use ResNet-29, a modified ResNet-34 with some layers removed for faster calculation but with little difference in failure rate. The figure and table below illustrate the overall architecture of the proposed ResNet-29.

As shown in the proposed model and Table 4, the input to this system is a 220×220×3 data array representing the width and height of the picture's pixels. Each pixel contains RGB values ranging from 0 to 255 and is thus arranged as 3 arrays. To obtain this input, the image must have been optimized beforehand. Regarding the identity block and the convolutional block in this model, we use only two convolutional operations in each, which is simpler than the original ResNet-50 model shown in Figures 3.5.11, 4.3.2 (step 3 removed) and 4.3.3 (step 4 removed). Furthermore, in the Fully Connected layer, we extend the average pooling method and select only the 128 best-responding neurons. The output of this method is a 128×1×1 array containing the encoded features of the input image; each value should be a small number to ensure data storage efficiency.

Conclusion

Image compression admits many approaches, and each has its own strong points. The simple yet still effective DCT-based approach is one example. A more modern and advanced deep residual neural network approach like ResNet is better suited to a specific application, which in our case is facial feature extraction. Due to its complexity, we proposed a smaller version of ResNet-34, named ResNet-29, with fewer layers, to pair with the DCT-based approach in terms of resource consumption.

In the next section, we will implement two face recognition systems. One is supported by ResNet-29 (feature-based) and the other is powered by DCT with AS (image-based). We will check their performance in many aspects and see whether the more advanced ResNet method brings any significant improvement in practical use cases compared with the mainstream DCT-based system.

Layer        | Input size  | Output size | Kernel                    | Stride
conv1        | 220×220×3   | 110×110×64  | 7×7×3                     | 2
max-pool1    | 110×110×64  | 55×55×64    | 3×3×64                    | 2
iden-block1  |             |             |                           | Two 1 and One 2
iden-block4  |             |             |                           | Two 1 and One 2
iden-block7  |             |             |                           | Two 1 and One 2
iden-block10 | 14×14×512   | 14×14×512   | Two 3×3×512               | Two 1
avg-pool     | 14×14×512   | 1×1×512     | 14×14×512                 | 14
fc1          | 1×1×512     | 1×1×128     | Maxout pooling, size p=4  |

Table 4: Structure of ResNet-29 model used for image data

This is the design and implementation part of our project. After the wall of text in the theory part, this is the place where there are more figures than words. The material in the previous chapters is necessary to understand our method and parameter selections. However, since it is a bit "general", we will repeat it at some important points in this part.

In this project, we designed two face recognition systems. One is DCT-SOM based and the other is ResNet-KNN based. Each has a slightly different preprocessing method but maintains basic similarities so that the two remain comparable.

For the DCT-SOM system, the programming language used to design and implement our face recognition system is MATLAB, due to its Neural Network and Image Processing Toolboxes, which helped us obtain efficient code (for both compiling and debugging). MATLAB also has a huge online community that provides everything we need to know when problems happen.

On the other hand, we chose the Python programming language for the ResNet-KNN system for several reasons. Firstly, Python has recently become more popular as a very strong, general-purpose language compared to MATLAB or C++, so it has many useful libraries applicable to our project. Secondly, Python can easily be deployed on multiple operating systems, including Windows and Linux, and can be embedded on a Raspberry Pi as well as other development kits. Our last reason relates to this project's target: to vary the implementation activities and compare actual performance, checking whether Python can outperform MATLAB in terms of applications.

Generally, our progress consists of two major steps:

We also spent time creating a GUI that serves as the demonstration. Instructions on how to use it will be given later in the next part.

To sum up, this process has the following four steps:

1. Gather the face image data

2. Pre-process the images

3. Compress the images' data, by DCT in MATLAB and by ResNet-29 in Python

4. Reshape and save the image data

Data Gathering

High Resolution Images

Face images of different persons were taken under uniform lighting conditions and against plain backgrounds with a 12-megapixel digital camera for the training database. Each face image was roughly 3920×2204 pixels. Figure 5.1.1 shows a sample face image.

Another workaround is to use a digital camera to record a short clip featuring several facial expressions and some small angling of the face from a couple of directions. After that, the images are extracted from the clip. This method has proven to be more efficient and time-saving than the first one.

Figure 5.1.1: A high resolution face image (3920×2204 pixels)

Online Database of Images

Here we use images from the ORL (Olivetti Research Laboratory) database, provided by AT&T Laboratories Cambridge. It contains a set of face images taken between April 1992 and April 1994 at the lab. In total, there are 400 images of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement) [64].

Figure 5.1.2 presents a sample set containing 30 images of three subjects from the ORL database.

Figure 5.1.2: Sample images from ORL database

Low Resolution Images

These images are taken directly from a USB webcam, which supports a maximum resolution of 640×480 pixels. We chose this type of webcam to test actual usage, as VGA webcams are the most popular camera input for PCs and embedded systems.

Figure 5.1.3 shows the interface in which the webcam screen is placed, with its real-time video display.

Figure 5.1.3: Images taken from built-in webcam

Image Pre-processing

High Resolution Images

Since the images were taken with a portable device, we had to pre-process them manually using Photoshop, a raster graphics editor developed and published by Adobe Systems for Windows and OS X.

Applying the modifications is pretty simple; we follow the process below to complete this task.

1. Crop and resize the image from 3920×2204 pixels to 512×512 pixels.

2. Auto-adjust the image levels, contrast, brightness and colors.

3. For the SOM system, convert the image from RGB color to 8-bit grayscale. The KNN system, however, does not need this step, since ResNet can handle RGB data effectively.

Figure 5.2.1: (a) Original image (b) Preprocessed image

Online Database Images

All of these images are already formatted in 8-bit grayscale; we only need to crop and resize them again to achieve square images of 92×92 pixels for import into the SOM system. Our experiments will also test whether the KNN system can perform well on originally grayscale pictures. This is a reasonable question, since this kind of format usually requires less storage than RGB.

Low Resolution Images

The image is automatically pre-processed once it is taken from the webcam, thanks to the code we created in the MATLAB and Python GUIs. The process is similar to that applied to the high resolution images.

Image Data Compression

Input design for SOM system

The purpose of compressing images is to decrease noise effects and generalize their features. If the images are saved directly, noisy features will affect the results. To compress the image into a desired format that can be input to our SOM neural network for the recognition process, we implement the following steps by executing our MATLAB code contained in the file predct.m (a Python sketch of the equivalent pipeline follows the list).

1. Import the image into MATLAB using the imread command.

2. Convert it to double with im2double.

3. Compress the image by applying an 8×8 blocked 2D DCT; the format of this command can be found on www.mathworks.com.

4. Choose an appropriate mask for the DCT (this feature works like a filter: it discards unimportant components to remove redundant noise). Figure 5.3.1 shows the modified image after this step.

Figure 5.3.1: (a) Original image (b) DCT compressed image

5. Resize the image to 8×8 pixels and save it to a specific location using the imwrite command. Figure 5.3.2 shows the final result.

Figure 5.3.2: Zoomed result image after resize
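For readers following along in Python, the sketch below reproduces the same pipeline with SciPy (an assumption on our part, not our MATLAB implementation; the triangular mask keeps 36 low-frequency coefficients for illustration, whereas our MATLAB mask keeps 61):

```python
import numpy as np
from PIL import Image
from scipy.fftpack import dct

def blocked_dct_compress(path):
    """Blocked 8x8 2D DCT with a low-frequency mask, mirroring predct.m."""
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float64) / 255.0
    h, w = img.shape
    out = np.zeros_like(img)
    mask = np.flipud(np.tri(8))  # keep coefficients with i + j <= 7 (low frequencies)
    for i in range(0, h - h % 8, 8):
        for j in range(0, w - w % 8, 8):
            block = img[i:i+8, j:j+8]
            coeffs = dct(dct(block.T, norm="ortho").T, norm="ortho")  # 2D DCT of the block
            out[i:i+8, j:j+8] = coeffs * mask                         # discard high frequencies
    return out
```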

Input design for KNN system

Similar to the SOM system, we need to analyze the input image for the KNN system. Using the proposed ResNet-29 model for image encoding, as discussed in section 4.3, we summarize it as a step-by-step implementation using the Python code contained in the file api.py (a hedged sketch of these steps follows the list).

1. Import the image into Python using the PIL.Image.open command.

2. Convert it to a numpy array with np.array.

3. Encode the image by applying the ResNet-29 model. This model has been implemented, pre-trained and published at https://github.com/davisking/dlib-models.

4. Store the output as a 128×1 numpy array and save it to a specific location using the pickle command. Figure 5.3.3 shows the encoding process output.

Figure 5.3.3: (a) Original image (b) Encoded array
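A sketch of these four steps is shown below; the dlib calls and model file names come from the dlib-models repository cited above, but the face detection and landmark alignment steps are our assumption about what api.py does internally.

```python
import pickle
import numpy as np
import dlib
from PIL import Image

# Model files published at https://github.com/davisking/dlib-models
detector = dlib.get_frontal_face_detector()
shaper = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
encoder = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

def encode_face(image_path, out_path):
    img = np.array(Image.open(image_path).convert("RGB"))          # steps 1 and 2
    face = detector(img, 1)[0]                                     # locate the face (assumes one is present)
    shape = shaper(img, face)                                      # landmark alignment
    vec = np.array(encoder.compute_face_descriptor(img, shape))    # step 3: 128-D encoding
    with open(out_path, "wb") as f:                                # step 4: save with pickle
        pickle.dump(vec.reshape(128, 1), f)
    return vec
```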

Reshape and Save Image Data


For the image data to be input into the neural network, it should take the form of a single column or row, no matter what the original dimensions are. Currently, all our DCT-compressed and resized face images are in the form of 8×8 pixels, as shown in Figure 5.3.2.

The reason for this is that MATLAB's SOM neural network toolbox only allows inputs to be rows or columns of a matrix, as we can see in Figure 5.4.1.

Figure 5.4.1: SOM toolbox data selection screen

An 8×8-pixel grayscale image, when imported into the MATLAB workspace, takes the form of an 8×8 matrix in which each component represents the intensity of the corresponding pixel. Hence, we need to transform this 8×8 matrix into either a 64×1 or a 1×64 array (a 64-component array) so that it can be used as a properly formatted input for training the neural network.

To do this, we use a simple command called reshape, whose format can be found in Figure 5.4.2 below. There is also a sample picture of the reshaped image, which has been resized for easier viewing.

By contrast, in the case of ResNet we have already mapped the image encoding into a row of 128 parameters, so no reshaping is needed.

Finally, we apply the same method to all other images and save the data into an accessible file: a data.mat file for the SOM system and a Python-serialized model.clf file for the KNN system. As a result, the database input into the neural network is a 64×N or 128×N matrix, a set of N image vectors combined together, not the actual images.

Figure 5.4.2: (a) MATLAB's reshape command (b) Reshaped image
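The NumPy sketch below mirrors this reshape-and-stack step; the random arrays stand in for the 30 compressed face images.

```python
import numpy as np

images = [np.random.rand(8, 8) for _ in range(30)]  # stand-ins for 30 DCT-compressed faces
columns = [img.reshape(64, 1) for img in images]    # 8x8 matrix -> 64x1 column
database = np.hstack(columns)                       # 64x30 database matrix, one image per column
print(database.shape)                               # (64, 30)
```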

After research and study, the SOM was found to be efficient for image data clustering and proved to be an accurate technique for matching untrained input images against the closest entries in a trained image database.

Normally, if we create an SOM neural network in MATLAB with help from its Neural Network Toolbox, all of the parameters are set to default values. These values can be found on www.mathworks.com.

However, default values are only there for reference. To fit our design, some of them should be configured, and to find the optimal result we need to perform several tests and adjustments. How those tests are executed and configured will be discussed later in another part; for now, we simply point out the general procedure we applied to find out which options we could adjust so as to shape the neural network to our best use.

Network Architecture

We already know that the principal goal of the SOM is to transform an incoming signal pattern of arbitrary dimension into a lower-dimensional discrete map, and to perform this transformation adaptively in a topologically ordered fashion [39]. Therefore, the original purpose of the SOM is not to solve pattern recognition problems, but to rearrange data in a way that is fundamentally topological in character. Knowing this, we deployed the following system: a SOM network that classifies our images, which have been converted to DCT-based vectors, into distinct groups so as to identify whether a certain input face image is "recognizable" or "unrecognizable" compared to the database [65]. Figure 6.1.1 shows the structure of our SOM network, where the SOM Layer is the grid-like layer that we saw earlier in Figure 3.6.1.

The input vector p is the column of pixels of our compressed input image, which has been converted into a 64×1 DCT-based vector. The ||ndist|| box receives the input vector p and the input weight matrix IW^{1,1}, and produces a vector having S^1 elements. These elements are the negatives of the distances between the input vector and the vectors _iIW^{1,1} formed from the rows of the input weight matrix; that is, the ||ndist|| box computes the net input n^1 of the competitive layer by finding the Euclidean distance between the input vector p and the weight vectors. The competitive transfer function C receives the net input vector for the layer and returns a neuron output of 1 for the "winner", the neuron associated with the most positive element of the net input n^1; the other neurons output 0. The neuron whose weight vector is closest to the input vector has the least negative net input and wins the competition to output a 1. Hence, the competitive transfer function C produces a 1 for the output element a^1_i corresponding to i*, the "winner"; all other output elements in a^1 are 0.

Figure 6.1.1: General architecture of our SOM

Thus, when a vector p is presented, the weights of the winning neuron and its close neighbors move toward p. Consequently, after many presentations, neighboring neurons learn vectors similar to each other, and thus the SOM network learns to classify the input vectors it sees. The SOM network used in our project contains N nodes ordered in a one-dimensional lattice structure.
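The competitive step just described can be sketched in a few lines of NumPy (the weights and input are placeholder values; MATLAB's toolbox performs the same computation internally):

```python
import numpy as np

def competitive_layer(p, IW):
    """One competitive step: p is a (64,) input vector, IW an (S1, 64) weight matrix."""
    n1 = -np.linalg.norm(IW - p, axis=1)  # negative Euclidean distances (||ndist||)
    a1 = np.zeros(len(IW))
    a1[np.argmax(n1)] = 1.0               # winner (least negative net input) outputs 1
    return a1

IW = np.random.rand(121, 64)              # e.g. an 11x11 map of 64-D weight vectors
p = np.random.rand(64)
print(np.flatnonzero(competitive_layer(p, IW)))  # index of the winning neuron
```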

Network Size

This is the number of neurons inside the SOM layer. Under normal conditions, the more neurons in the network, the better the recognition rate (the number of recognized images over the total number of trials). However, we have to consider that more neurons also lead to a more complex architecture, which can reduce efficiency when learning from an unchanging database.

So, to begin with, we have to decide the size of the image database. After some study and configuration, we decided to make it a 30-image database, with 10 images for each subject.

Firstly, we load the data into MATLAB as a 64×30 matrix named P, as presented in Figure 6.2.1. As mentioned before, this is exactly the extracted pixel intensity of the data images.

Figure 6.2.1: Database of images is loaded into MATLAB workspace

Now we type nnstart into the Command Window of MATLAB to call up the Neural Network toolbox. Since our network is a SOM, used for a clustering application, we press the Clustering app button to activate the main window of the SOM generating process, shown in Figure 6.2.2.

Figure 6.2.2: The starting screen of SOM generating tool

Pressing Next, we move to the Select Data screen, which was already mentioned in section 5.4 and shown in Figure 5.4.1. Since we have just loaded our data into the MATLAB workspace, we can easily find and select it in the Inputs pop-up menu; remember to tick the Matrix columns option in the sample type selection. Press Next to proceed.

On the next screen, we select the size of the neural network to be 11×11, meaning that there are 121 neurons in total.

After that, on the following screen, we can start training our newly created SOM neural network to see whether its performance matches what we require. Note that training multiple times will generate different results due to different initial conditions and sampling, so we need to run it a few times to reach a sound decision.

Using this kind of testing, we can finally decide on a structure for the SOM network, for example with 128 neurons for our project, by applying the MATLAB command newsom. A sample structure with this configuration is shown in Figure 6.2.3.

Figure 6.2.3: An SOM topology map with 128 neurons (8×16)

Training Time

Before looking into how we set the training time, we first have to get familiar with a new definition: the epoch. Quoting from a freely accessible version of the MATLAB ANN toolbox glossary: an epoch is the presentation of the set of training (input and/or target) vectors to a network and the calculation of new weights and biases. Note that training vectors can be presented one at a time or all together in a batch [67].

In MATLAB, an epoch can be thought of as one completed iteration of the training procedure of the artificial neural network. That is, once all the vectors in our training set have been used by the training algorithm, one epoch has passed. Thus, the "real-time duration", in other words the training time, depends on the training method used and the maximum number of epochs allowed.

Setting a maximum number of epochs prevents the network from falling into an infinite training loop if the result does not converge. In our case, setting this parameter helps the network achieve the optimal result within the required elapsed time.

The optimal result we mentioned refers to how "well" the images of one subject are grouped together in the SOM topology map. Figure 6.3.1 shows the comparison between a "good" and a "bad" result.

Figure 6.3.1: (a) Good sample hit (b) Bad sample hit

The "good" outcome shown above was achieved with the maximum number of epochs set to 13. By contrast, training the network for 1000 epochs led to the second result, where the images were scattered around and their clusters almost overlapped with each other.

To set the maximum number of epochs, simply use the command net.trainParam.epochs.
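For reference, a rough Python analogue of creating and training such a network, with the number of presentations capped, might use the third-party minisom package (an assumption on our part; our actual implementation relies on MATLAB's toolbox):

```python
import numpy as np
from minisom import MiniSom  # third-party package, a rough analogue of MATLAB's SOM tools

data = np.random.rand(30, 64)                             # placeholder 64-D vectors for 30 images
som = MiniSom(11, 11, 64, sigma=1.0, learning_rate=0.5)   # 11x11 map = 121 neurons
som.train_random(data, 13 * len(data))                    # roughly 13 epochs of random presentations
print(som.winner(data[0]))                                # grid coordinates of the winning neuron
```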

Other Parameters

Aside from the network size and the maximum number of epochs, the rest of our SOM network parameters were left at their default setup. These values will be stated in the next part.

As discussed earlier in section 3.4, KNN was selected for comparison with the SOM system design due to some similarities in their working methods.

In Python, we can make use of the open-source Scikit-learn library to implement the KNN algorithm.

As we did in the SOM section, the KNN architecture selections and detailed implementation work will be discussed here.

System Architecture

Although not categorized in the same field as the CNN and ResNet, which are strong at pattern recognition, KNN is most useful for data clustering and classification. In this respect, KNN and SOM share the same basic properties. However, KNN and SOM themselves have very different algorithms; this is natural, since SOM is a neural network based architecture built on the unsupervised learning principle, whereas KNN is a machine learning method. We designed the KNN system based on the same idea applied to SOM: classify vectors converted from image data and arrange them into specific groups, so as to determine which group a new input belongs to.

In terms of the mathematical algorithm, KNN is much simpler than SOM because no weight implementation is needed. Furthermore, KNN does not need many iterations or epochs, which are only used in neural network training.

To be homologous with SOM, we designed the general KNN architecture based on Figure 6.1.1. The redesigned version is shown in Figure 7.1.1.

The input vector p is the column of the encoded input image, which is a 128×1 vector in numpy array form. The ||ndist|| box receives the input vector p and the training data TD, which contains S^1 elements with a given class for each element. The ||ndist|| box computes the net input n^1 for the query distance list, here denoted C since it has the same purpose as the Competitive block of the SOM. This n^1 value is the Euclidean distance between each parameter of the input vector p and all training samples inside TD. The return value of C is a list of the k nearest elements of the training data TD compared to the input vector p; in other words, those with the lowest n^1 values and the most elements of the same class. Since the "class" is literally an integer number, the KNN output can be regarded as 1×1.

Figure 7.1.1: KNN working principle

The training time for KNN is not selectable, as it is for SOM, since data is classified upon each input, whether from the training sets or a test image. In other words, we have to run the KNN algorithm S^1 times for the training process.

Determine value of k

As shown in the previous figure, we have to choose the optimal k for our face recognition problem, since the performance of a KNN classifier varies significantly when k is changed, as well as with the choice of distance metric. In practice, odd numbers are selected for the k parameter for three reasons: firstly, to increase the speed of the algorithm by avoiding even classifiers; secondly, to avoid the chance of two different classes having the same number of votes; finally, pilot experiments with even k showed no significant change in the results. In addition to this choosing rule, the maximum value of k is normally not greater than the square root of the training data set size n, for computation time efficiency.

The table below presents the test results when multiple selections of k are applied to the same data inputs. The experiments were carried out in 2014 by a group of researchers [68], using many different data sets, taken from the UCI Machine Learning Repository [69], that represent real-life classification problems. The accuracy of each classifier on each normalized data set is the average of 10 runs.

Table 5: Description of UCI data sets used (columns: data set, number of examples n)

Table 6: Results of k-NN with different k values

According to the experiments, using k=√n could provide better results compared to other methods. Sometimes a higher k, or the alternative methods proposed by the report's authors, could provide good results but also require more resources. Therefore, using k=√n is generally a feasible choice for the KNN classifier.

To implement the selected k value, we pass it to the n_neighbors parameter of the KNN library's function sklearn.neighbors.KNeighborsClassifier. Since the value of n depends on the number of elements in the training set, let us assume we use the same database as SOM: 30 images, with 10 images for each subject. The rounded k value in this case is 5.

Other Parameters

Based on the designed architecture, we also choose the distance and brute options for the function used in classification. Aside from these, the sklearn.neighbors library provides some other parameters, which we keep at their default setup; a minimal sketch combining these choices is shown below.
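In the sketch, the encodings are placeholder values, while n_neighbors, weights and algorithm follow the settings just described.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# 30 face encodings (10 per subject), each a 128-D vector; placeholder data here
X = np.random.rand(30, 128)
y = np.repeat(["subject1", "subject2", "subject3"], 10)

k = int(round(np.sqrt(len(X))))  # k = sqrt(n) rule of thumb -> 5 for n = 30
clf = KNeighborsClassifier(n_neighbors=k, weights="distance", algorithm="brute")
clf.fit(X, y)

probe = np.random.rand(1, 128)   # encoding of a new face image
print(clf.predict(probe))        # predicted subject label
```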

After finishing the initial design and establishment of our systems, in this part we give a detailed explanation and description of the SOM neural network and KNN algorithm testing, for a variety of trained and untrained inputs, to confirm their validity. The experimental work carried out to attain optimal efficiency and a high-speed design will also be presented. The testing and experimental work includes:

A variety of tests were carried out with both trained and untrained input face images of different subjects with various facial expressions, leading to the validation of our SOM neural network design and KNN algorithm system.

Trained Data

This is the initial test, the first to check the validity of our systems on the face recognition task.

To do this, we created a database consisting of 5 different subjects, shown in Figure 8.1.1 (a). Then we input 5 images into the network, 3 of which were taken from the trained database while the other 2 were not, simulated them and recorded the accuracy of our systems. These input images are shown in Figure 8.1.1 (b).

Figure 8.1.1: (a) Trained database (b) Input subjects for the initial test

To verify the result, we ran the test 10 times on each input image. This gives a general view of the overall recognition ability of our systems. The same process was applied to both systems.

The results turned out to be very positive. For each subject within the database, 10 times out of 10, both systems identified them and pointed out the matched image. For subjects not learned in the database, the SOM system categorized them as unknown with a success rate of 95%, while KNN performed at around 90%.

Hence, we can conclude that our systems are able to identify exact trained data and distinguish abnormal inputs with acceptable accuracy.

Untrained Data

This time, our test was slightly different from the first one. We gave our systems a more difficult task: recognizing an untrained image of a subject who is inside the trained database.

To implement this experiment, we input a data set with 5 different facial expressions taken from a single person, shown in Figure 8.2.1 (a).

Figure 8.2.1: (a) Trained database (b) Input subject for the second test

Next, we input an image, taken of the same person, that the neural network had not seen before, shown in Figure 8.2.1 (b).

In this case, the result showed an accuracy of 90% over a 10-run SOM simulation and 100% for the KNN system. Note that this check is based on only 10 runs, so 9 successful runs out of 10 can be regarded as good enough. In conclusion, both systems we designed can reliably recognize an input that has some similarity to the trained data.

SOM Recognition Program

After confirming the validity of the SOM in categorizing, and thus recognizing, different face images, we built a prototype of our recognition program at a small data scale.

The program imports a data set of 30 images divided among 3 persons, meaning each person can have up to 10 different images in their face database.

The overall structure of our SOM program is presented in Figure 8.3.1 below.

Figure 8.3.1: General process of the face recognition system using SOM

We simulated the network with a size of 120 neurons and a maximum number of epochs of 20. Our inputs were 5 different images, 3 of which were inside the database and 2 of which were not; our database was a set of 30 images of 3 persons, as shown in Figure 8.3.2.

Running the test 50 times, we achieved a success rate of 90%. From the result of this test, our system proved to be efficient at the face recognition task.

KNN Recognition Program

The validation results in the previous sections show KNN's capability to categorize processed image data and generate good results in face recognition. In practice, KNN performance mainly depends on how well the input data has been analyzed, so we created a prototype system to provide for this need.

Figure 8.3.2: (a) Trained database (b) Input subjects for the program

The data scale is the same as for the SOM program: we import a data set of 30 images divided among 3 persons, where 10 different images are given for each person's face database.

The overall structure of our KNN program is presented in Figure 8.4.1 below.

Data inputs for KNN were analyzed by the 29 convolutional layers of the ResNet model, which checks whether the input image is in grayscale or RGB form before performing the specified pre-processing tasks. For the KNN training set, we select k equal to √n, where in this case n = 30, thus k = 5 for data categorization.

Tested with the same test set as the SOM system, we re-ran the KNN system 50 times and obtained an average result of 96%. As a result, we have shown that both the SOM and KNN systems are efficient at the face recognition task.

Figure 8.4.1: General process of the face recognition system using KNN

The requirements for efficient face recognition lie in two main characteristics: the least execution time, while maintaining the highest possible accuracy. These can be improved by tuning the system's parameters, covering both the image pre-processing method and the neural network efficiency. However, if we counted all of these parameters in our computation, there would be an overload of possible options.

Calibrate SOM System

Optimal Number of Neurons

As mentioned before, the number of neurons inside an SOM network affects the effectiveness of that network when clustering the database. In most cases, the number of neurons should be enough to draw a recognizable topology map of the input data, while not being so excessive that it complicates solving the problem.

In our case, we have 3 different persons; therefore, our SOM network should have a minimum of 4 neurons (1 spare for the case where our input face is a stranger). However, errors always occur, and the faces used for the database and the input differ more or less, so we have to count on the worst case, in which every image is mapped onto a particular neuron inside the computational layer. Finally, we concluded that the minimum number of neurons should be 30; to give some space between clusters for better recognition, we doubled this value to 60.

Taking this number as the minimum point, we started to carry out several tests, performed by varying the number of neurons on each trial. The database used for each trial consists of 30 images from 3 different persons. In total, we tested it on all face images in the ORL and our own data sets. With the variety in lighting conditions, face angles, facial details and expressions, our tests ensured the results were accurate and general.

Table 7 shows the tabulated results of our test. For this test, we kept 61 DCT coefficients for image masking and trained the network for 13 epochs.

Table 7: Comparison of different number of neurons vs simulation time and accuracy

Figure 9.1.1 illustrates the results from Table 7 on a graph.

Figure 9.1.1: Effect of number of neurons vs accuracy

From these statistics, 80 is the optimal number of neurons.

Optimal Number of Epochs

Recall from a previous section that setting a maximum number of epochs prevents the network from falling into an infinite training loop if the result does not converge.

Figure 9.1.2: Effect of number of epochs vs accuracy

Calibrate KNN System

Optimal Number of K

We have already stated how the k value can affect KNN effectiveness.

In the implementation section, we described the reasoning behind k=√n, and the KNN system was validated successfully with that generalized value. In fact, we only need to consider the k value when classifying a new data input. Our arguments were drawn from general research, and we were still unsure whether this value really is the best fit for the specific purpose of facial recognition.

Thus we conducted some more tests to find the optimal value of k for our system.


Let us analyze our specifications. We know that KNN is basically a lazy learning method that processes training at each data input, which in our case is a new picture to be categorized. As a form of supervised learning, we may not be able to use the same mindset with KNN as with SOM, because SOM is unsupervised-learning based. Our input is an array representing one person, i.e. one class; our output is an integer number representing the class name. Logically, we can try a small k here; even k=1 seems reasonable. Reducing the k value might make the boundaries between classes less distinct, but in our situation that may not matter. Therefore, we performed tests with several k values to determine the optimal k, in order to balance processing speed and accuracy.

Table 9 shows the results of our test. The test sets are similar to SOM's. The k values were chosen as odd numbers to avoid tied votes, and the Euclidean distance threshold is 0.6. As n in this data set is relatively small (30), a small increase in k leads to significant changes in the result.

Accuracy (%) | 96.0 | 96.9 | 97.4 | 97.4 | 97.2 | 95.5 | 92.3

Table 9: Comparison of different numbers of k vs simulation time and accuracy

Figure 9.2.1: Effect of number of K vs accuracy

From the results, we can see that k=3 is actually a better choice than k=√n.

Keep in mind that this number mostly depends on how large n is and on how the input data is arranged. Specifically for the ResNet-29 encoding, we expect the data to already be well localized for each face's features. Therefore a small k is good enough in this case and balances execution time against accuracy.

Optimal Nearest Distance

Aside from the k value, we also need to specify the maximum Euclidean distance accepted into the k list. Consider the case where a completely different person is presented to the system. Obviously, the calculated distances for that input are very large for all training points. However, if this distance is not bounded, or the bound is too large, there is a higher chance that our system will mistake the strange person for one of the persons in the training set, which is incorrect. Conversely, if this bound is too small, our system becomes too sensitive to input changes. Therefore, we need to specify the optimal distance; any distance exceeding this number is discarded from the k list.

We use the same training data set as before with k=3, but now add an anonymous person as input, to check the effectiveness of the selected distance. Table 10 shows the results of our test.

Accuracy (%) | 88.6 | 91.1 | 93.0 | 97.1 | 97.4 | 97.2 | 96.5

Table 10: Comparison of different distance thresholds vs accuracy

Figure 9.2.2: Effect of distance threshold D vs accuracy

From the results, we adopt a distance threshold of 0.6 for the KNN system, to balance recognizing the correct person against the ability to detect unknown faces.
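One way this rejection rule might be implemented on top of the scikit-learn classifier from the earlier sketch (the 0.6 threshold is the value calibrated above):

```python
import numpy as np

def predict_with_threshold(clf, probe, threshold=0.6):
    """Reject a face whose nearest training encoding lies beyond the threshold."""
    distances, _ = clf.kneighbors(probe, n_neighbors=1)
    if distances[0, 0] > threshold:
        return "unknown"          # too far from every trained face
    return clf.predict(probe)[0]  # otherwise fall back to normal k-NN voting
```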

GUI implementation of the face recognition systems is part of our project objective. A user-friendly GUI program is considered a good demonstration of the project work, so it is necessary to create two GUIs for our two systems. The general process of designing the GUIs and instructions on how to use them are briefly discussed in this section.

GUI Design for SOM System

Main Window

Referring to Figure 10.1.1, here is a list of the button and pop-up menu functions.

1. WEBCAM: Opens the Database Camera Window; use the webcam to capture your own images for the database.

2. MODIFY: Opens the Database Modifying Window to manage the database with stored images from our project and the ORL database.

3. VIEW: Previews the database; this button can be used to view a single set or all of the sets, based on your choice in the pop-up menu (4).

Figure 10.1.1: SOM GUI main window

4. Set Selection: Selects the set you want to view or load into the program.

5. Load: Press this button after selecting a certain data set in order to load it into the recognition program.

6. MODIFY: Opens the Input Modifying Window to change your input image slots.

7. Input Selection: Selects an input to load into the program.

8. Preview: Views your input image.

9. START: Press this button to run the recognition program.

10. Result: Prints the recognition output.

You can only load the input image after loading your data set. The START button is only enabled after you have loaded all required components into the program (database and input images).

Database Camera Window

Figure 10.1.2: SOM GUI database camera window

Referring to Figure 10.1.2, here is a list of each section's function.

1. Data Management: Selects the data set, as well as the ID inside it, that you want to modify (add new data or erase current data).

2. Take Snapshot: Press the Start button to take your image inputs into the selected ID slot; choose Auto Order so that the image slot automatically advances from 1 to 10 while you shoot.

3. Preview: Works similarly to section (2); however, it is for previewing the images, which are shown on the Preview screen.

Normally, you start the process by selecting a data set and an ID slot inside it. Next, take snapshots of your own face to write them to the selected ID slot. Finally, preview your images to see whether you need to retake any shot.

Database Modifying Window

Figure 10.1.3: SOM GUI database modifying window

Referring to Figure 10.1.3, here is a list of each panel's function.

1. Select Database Slot: Selects the data slot you wish to replace or remove; the VIEW and DELETE buttons will change their labels and work according to the change in your selection.

2. Select New Data: Selects the new data you want to add to, or replace in, the database slot selected in panel (1).

To sum up, you start the process by selecting the database slot to be replaced or removed, previewing it if you like. Panel (2) then allows you to choose new data to add into the empty slot or to replace an existing slot.

Input Modifying Window

Referring to Figure 10.1.4, here is a list of each panel's function.

1. Select Input Slot: Selects the input slot you wish to replace or remove; the VIEW button will change its label and work according to the change in your selection.

2. Select New Data: Selects the new data you want to add to, or replace in, the input slot selected in panel (1).

Figure 10.1.4: SOM GUI input modifying window

Regularly, you start the process by selecting the input slot to be replaced or removed, previewing it if you like. Panel (2) then allows you to choose new data to add into the empty slot or to replace an existing slot.

Input Camera Window

This window only shows up if you choose New Input in the Input Selection pop-up menu of the Main Window.

Referring to Figure 10.1.5, here is a list of each button's function.

1. Start: Press this button to begin. Its label will change to Take Snapshot; press it again so that the program takes your photo and searches for it in the database loaded earlier in the Main Window. The result will be shown in the Result panel.

2. Retry: Only enabled after you take your image; press this button if you wish to do it again.

GUI Design for KNN System

Main Window

We tried to create a similar workflow in the KNN system to that of the SOM system; therefore, a main window is mandatory.

Referring to figures 10.2.1 and 10.2.2, here is a list of the main menu's contents.

2. Real Time Log: Displays general information during program execution.

Figure 10.2.1: KNN GUI Main window

3. File - New Project: Begins the data acquisition procedure or the recognition preparation steps. It will be introduced in the next steps.

4. Options - Change Camera ID: Allows the user to change the camera input, based on its ID.

5. Auto Save Function: The program automatically stores all user data changes, such as the camera ID or the latest work space location, in the Windows user default folder on drive C. We can erase this file to reset all data to default values.

Database Acquisition Windows

Upon selection of File - New Project, a new window appears. Referring to figure 10.2.3, here is a list of this window's contents.

Figure 10.2.2: (a) "File" drop down list (b) "Options" drop down list

(c) "Options - Change Camera ID" window (d) Auto save location

For "Data Input" tab, we concentrate on collecting data for training set There are 2 buttons in this tab:

1 Add New Face Identity: Start the wizard to add new groups or new faces to the existing group.

2 Re-Train Added Data: Start the wizard to specify added groups of face to begin

For "Recognition Input" tab, we specify the trained data and provide test data as well as output handling There are 2 buttons in this tab:

1 Static Picture Recognition: Given a trained data, then capture a facial picture from camera The program will perform face recognition in that static picture taken.

2 Re-Train Added Data: Given a trained data, then perform a live video feed from camera The program will perform face recognition in that live video feed with stable frames per second.

From here, we separate into two different implementation paths:

• Training data acquisition and encoding.

• Test data analysis, which outputs the facial prediction.

Training Data Encoding Wizard

Upon selection of File - New Project - Add New Face Identity, figure 10.2.4 will appear. This window requires the user to enter their information in order to begin facial data capture for training purposes.

1. User Group: Master folder for a group of facial pictures, for example "ORL Database", "HCMUT Khoa Dien Tu" and so on. The User Group will be used as the work space, so it is a mandatory field.

2. User Group Confirmation Button: After manually entering the User Group, press the Apply button to access the next steps.

3. User Name: Sub-folder for a person's facial pictures, for example "Minh", "Thay Anh" and so on. The User Name will be used as the KNN input label for categorization and will be displayed in the recognition result, so it is a mandatory field.

4. User Name Confirmation Button: After manually entering the User Name, press the Apply button to access the next steps.

5. Personal Path: Displays the output path of the upcoming pictures, based on the given Group and User Name. Path properties: BrowsedPersonalPath/UserGroup/UserName; a new folder is automatically created if the path does not yet exist.

6. Browse for Personal Path: Opens the Windows browse-for-folder function.

7. OK to continue to next step: This button confirms the data input and continues to the next step. It is only accessible once all mandatory fields are non-empty.

In case users want to add to or modify their saved pictures, they have to correctly re-enter their previous entries, including User Group, User Name and Personal Path. The program does not restrict data input as the SOM system does, allowing more flexible handling; it mostly depends on the users' demands.

After the information input section has been passed, the program opens two windows: a camera feed and a photo supporter with an auto snapshot function.

Here the program requires the user to put their face in front of the camera and either press the "Take Picture" button to manually record a picture or make use of the auto snapshot function. Below is the description of this window:

1. User Group: Master folder for a group of facial pictures, for example "ORL Database", "HCMUT Khoa Dien Tu" and so on. The User Group is used as the work space, so it is a mandatory field.

2. Auto take picture when a face is detected: A master check box to disable or enable the auto snapshot function.

3. Sensitivity Level: Requires the program to detect whether a face is present. After n (the sensitivity level) consecutive frames with a face detected, the program stores the n-th frame; otherwise, it resets the counter and loops again.

4. Desired number of pictures: During both manual and automatic image recording, the program continuously counts the current number of image files in the Personal Path. If the number of image files on the drive reaches this threshold, neither the manual nor the automatic function will record a new image.

5. Current number of pictures: Displays the current number of image files in the Personal Path.

Next, upon selection of File - New Project - Re-Train Added Data, a new window will appear. This window requires the user to browse to their desired User Group, which contains the related User Names, as input for supervised training.

After browsing to the User Group path, the Start Training button becomes accessible. The training process should take several minutes, depending on the hardware configuration and the number of images inside the User Group. The output of the training is stored in a file, which will be used as input in the next steps.

Test Face Recognition Wizard

Upon selection of File - New Project - Static Picture Recognition or Real-Time Recognition, figure 10.2.6 will appear. This window requires the user to browse for the trained data. The program will perform facial recognition to check whether that user belongs to the trained User Group or not. The test user image can be taken either as a static picture or from a real-time video feed.

(b) Figure 10.2.3: (a) "Data Input" tab contents (b) "Recognition Input" tab contents

Figure 10.2.4: User Data Input Wizard

Figure 10.2.5: (a) Camera feed window with manual snapshot button at the bottom (b) Photo supporter window with the auto snapshot function

In this chapter, we finalize the two completed systems and bring them into a head-to-head comparison of facial recognition performance, as well as each system's advantages. Their motivation and objective are the same, but the designs are completely different. Although they are based on the same metric for learning, the Euclidean distance, they may excel in different situations. Therefore, we created Table 11 to summarize their general properties, from which we can get an overview of the two systems. Training time and accuracy were based on our training sets from the previous section, with 30 training items featuring 3 persons and 1 positive test item.

From the table, we can see that the SOM system performs training much faster than the KNN system. The reason lies in the pre-processing algorithm. The image-based process requires impressively fewer resources and less computation time than the feature-based deep neural network ResNet-29. While 2D-DCT and anisotropic diffusion apply only a few mathematical operations to each pixel value of one picture, ResNet performs convolution computations across many layers with 1×1 and 3×3 kernels, also for just one picture. As a result, although the KNN algorithm is much simpler than SOM in terms of classification, the main problem of the whole KNN system is that it requires more brute-force calculation in preparation. This makes the total training time for KNN almost unreasonable for real applications. Fortunately, the designed KNN system boasts significantly higher accuracy in the final prediction. Therefore, both systems are worth considering when searching for an efficient high-speed face recognition system.

Item           | SOM-based System                                 | KNN-based System
Approach       | Image-based                                      | Feature-based
Pre-Processing | 2D-DCT and Anisotropic diffusion                 | Pre-trained ResNet-29
Output Type    | 64×1, integer 0 to 255                           | 128×1, float −1 to 1
Classification | Self-Organized Map (Feed-Forward Neural Network) | K-Nearest Neighbors

Table 11: General system comparison between the designed SOM and KNN

Significance of the Project

After performing the simulations in MATLAB and Python, we can see that our project delivers two successful designs of efficient high-speed face recognition systems, especially for surveillance applications with large amounts of individuals' data. Based on the results we obtained, both systems met all the main objectives, correctly recognizing nearly 95% of the persons inside the database.

For the first system, thanks to the 2D-DCT and SOM Neural Network techniques, our face database can be saved after discarding redundant data, making it very light (about 200KB for a database with 30 images of 3 persons, down from originally more than 10MB of data) and easier to store and transfer, which generally leads to fast training times. Furthermore, its excellent working speed is another noticeable plus: in particular, it can give you the result in less than 1 second. In addition, we also provide the supplementary image pre-processing method ASDCT to help remove unwanted illumination features from the input images, which further improves the correct recognition rate in some cases.

The second system is powered by a more advanced feature extraction technique named ResNet, combined with a fast and simple data classifier, K-Nearest Neighbors. The processed image data can also be much more compact, not only comparable to 2D-DCT but also making the surveillance application feasible. Although the encoding method requires more resources and its working speed is a bit slower than the first system's, the average accuracy is significantly higher, which compensates for the disadvantages.

By the time we completed our project objective, we had familiarized ourselves with MATLAB's and Python's commands and supporting tools, such as the ANN Toolbox, the Image Processing and Image Acquisition Toolboxes, the open-source Scikit-Learn library, the Face Recognition toolbox, and MATLAB and Python GUI development. We also acquired a lot of knowledge about Machine Learning, ANNs and their potential in modern industry as well as high-level technology applications.

Hence, in brief, 2D-DCT, ResNet-29, the SOM neural network and K-Nearest Neighbors are the core algorithms behind the design and implementation of our efficient high-speed face recognition systems. Simulated using MATLAB and Python, they have proven highly accurate at recognizing a variety of face images with different facial expressions on plain backgrounds under slight changes in lighting conditions, and they open the door to implementation on other platforms, thanks to the OS-friendly Python language.

Remaining Limitations

Despite appearing very efficient at face recognition, both systems still retain some weaknesses. Nevertheless, most of these problems come not from the systems themselves but from the image quality of the input and database (e.g., overly complex backgrounds, differing light exposure across database images, blurred images, inhomogeneous face positions and more).

The most serious issue with our systems is that neither can precisely point out an individual who is not inside the database. The system sometimes tends to return someone inside the database instead of confirming that it cannot recognize the input image and returning the "unknown" value. However, considering that this system is intended for surveillance purposes, where the authority possesses all citizens' personal identity information including their face images, the weakness mentioned above is acceptable.

Last but not least is the limitation of our resources and project scale. Currently, our tests only cover a limited interval of parameter optimization and data scale, because the system we are operating on is just a personal computer. The processor of our test hardware is not specialized for this particular task, which may affect performance; during our tests, we recorded some anomalies in the statistics. With a more capable system, we would have more opportunity to test at a larger scale and could improve the peak performance of our programs.

Regarding the SOM system, after extensive study and research, our recommendations for improving and enhancing the face recognition program are as follows:

• Applying image processing methods other than DCT: methods that are superior, with improved image compression algorithms that require less processing time than DCT while still maintaining good compression capability.

• Further SOM neural network efficiency testing based on the following factors:

– Selecting the optimal number of DCT coefficients to use for face image compression, which will lead to less DCT processing time and faster program execution. In this project, we are currently using 61 DCT coefficients for image compression.

– Selecting other optimal parameters for our neural network rather than using the defaults. Although most of the default parameters are recommended by broad experimentation from the main MATLAB website and its community, it seems that for a particular task these parameters should be tuned to find the most suitable values.

– Extending our range of testing to find better solutions. This can be done by expanding the system scale, database scale and parameter scale.

• Completing a fully developed GUI program that can be converted into a stand-alone application (.exe), so that it can run freely on any machine without having MATLAB installed as a prerequisite.

Regarding the KNN system, expected to be a more advanced method than the SOM system and free of the OS-limited MATLAB language, there are still many open points for improvement, listed below:

• Optimize parameters in deep neural network for feature extraction step in order to perform faster and also maintain the final result performance.

• Further KNN efficiency testing based on the following factors:

– Can be applied as a regression learning method, to optimize and adapt its parameters (number of k, bounded distance, etc ) based on different input conditions, rather than using a default value.

– Use a more advanced nearest neighbors searching method to boost the output speed Some of the method can be mentioned are KDTree or BallTree.

– Extending our range of testing to find better solutions. This can be done by expanding the system scale, the database scale, and the parameter ranges.

• The Python-based GUI was written from scratch with the Tkinter library, which is not convenient, especially for inexperienced developers. We also tried to generate an executable file (.exe) from the Python source (.py), but the result does not yet work. Therefore, the ability to run the application on any machine without having Python installed remains an open point for improvement.
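As a concrete illustration of the adaptive-k and tree-search points above, the sketch below uses scikit-learn; this library choice is an assumption for illustration rather than the thesis’ own implementation, and the random embeddings are placeholders standing in for real deep-network face features.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for 128-D face embeddings and their
# identity labels; real features would come from the deep network
# used in the feature-extraction step.
rng = np.random.default_rng(0)
X = rng.random((200, 128))
y = rng.integers(0, 10, size=200)

# A KD-tree (or "ball_tree") backend replaces brute-force search to
# speed up neighbor lookups; GridSearchCV adapts k by cross-validation
# instead of relying on a fixed default value.
knn = KNeighborsClassifier(algorithm="kd_tree")
search = GridSearchCV(knn, param_grid={"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)

Note that KD-trees lose their advantage on very high-dimensional embeddings, so the Ball-tree backend may suit deep features better; this trade-off itself is worth covering in the extended tests.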

All of the above-mentioned factors aim at achieving the shortest possible program execution time while maintaining the highest possible accuracy, so as to produce an efficient, high-speed face recognition system that can be deployed for practical, real-life applications.


Student name : NGUYEN DUC MINH

Date of birth : 01/11/1994
Origin : Ho Chi Minh City
Address : 173 Ngo Tat To Street, Ward 22, Binh Thanh District, Ho Chi Minh City

Department : Control Engineering and Automation

Faculty : Electrical and Electronics Engineering


HELLA Vietnam Co., Ltd. (9/2016 – present)

Position : Advanced SW Test Engineer
