
Research on FPGA for keyword spotting using deep learning techniques (Nghiên cứu FPGA cho nhận dạng từ khóa sử dụng kỹ thuật học sâu)


DOCUMENT INFORMATION

Basic information

Title: Nghiên cứu FPGA cho nhận dạng từ khóa sử dụng kỹ thuật học sâu (Research on FPGA for keyword spotting using deep learning techniques)
Author: Dương Văn Hải
Supervisor: Nguyễn Quốc Cường
University: Đại học Bách Khoa Hà Nội (Hanoi University of Science and Technology)
Major: Control and Automation
Document type: Master's thesis
Year of publication: 2021
City: Hà Nội
Pages: 60
Size: 2.34 MB


Content



HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY


SOCIALIST REPUBLIC OF VIETNAM

Independence - Freedom - Happiness

————————

CONFIRMATION OF MASTER'S THESIS REVISION

Full name of the thesis author: Dương Văn Hải

Thesis topic: Research on FPGA for keyword spotting using deep learning techniques

Major: Control Engineering and Automation

Student ID: CBC19009

The author, the scientific supervisor, and the thesis examination committee confirm that the author has revised and supplemented the thesis according to the minutes of the committee meeting on April 22, 2021, with the following contents:

• Added titles for the algorithms and a list of algorithms

• Revised the caption of Figure 4.1 and added Figure 5.2

• Revised Equation 5.1 and added Equations 5.2 and 5.3

• Revised the FPGA implementation section of the GRU block

• Moved the DCN and GRU programming parts of Chapter 5 to the Appendix

May 17, 2021

Supervisor          Thesis author

Committee chair


THESIS TOPIC

Research on FPGA for keyword spotting using deep learning techniques

Supervisor

(Signature and full name)


Acknowledgments

This thesis could not have been completed without the professional advice and moral support that I received from my supervisor, Nguyễn Quốc Cường. Although he has many students under his supervision, he still devoted a large share of his valuable time to helping improve this research.

The thesis is the distillation of countless hours with the people at the C1-311 laboratory, who always encouraged me to persevere on the arduous research path, now and in the future.

Thanks to my family and my girlfriend for all their love and support.

Special thanks to Thành, who accompanied me throughout my time at Hanoi University of Science and Technology.

Thesis abstract

In recent years, with the development of speech recognition, keyword spotting has become a popular way to initiate voice interaction (for example, "OK Google", "Alexa", or "Hey Siri"). There are a few approaches to this problem. For example, in the fixed-keyword approach, many samples of a specific keyword are collected and a neural network is trained to classify keyword versus non-keyword. Besides, "query-by-example" methods take several examples of the keywords as templates and compare an audio segment against the templates to decide whether a keyword is present. In this thesis, the fixed-keyword approach is used; moreover, a new deep neural network model is proposed with high accuracy and a small model size.

In addition, a new deep learning model always costs considerable energy and resources. To improve the efficiency of neural network models, efficient network architectures, model compression, and model quantization methods have been proposed. Even so, hardware platforms such as GPUs may not be flexible enough to support all kinds of optimization. Hence, hardware-software co-design is needed to accelerate computation as much as possible, and the flexibility of the hardware platform becomes a key characteristic. The FPGA therefore becomes a suitable choice, offering good energy efficiency and the flexibility to configure the hardware. Thus, to improve the performance and speed of the proposed recognition model, this thesis presents a neural network design architecture on an FPGA platform with unchanged accuracy.

Student

(Signature and full name)


The thesis would not have been completed without the academic advisement and emotional support that I received from my advisor, Professor Nguyen Quoc Cuong. Though having many students under his supervision, he still spends a huge chunk of his precious time helping me be better off doing research. He also creates a wonderful environment for students to compete and improve their research prospects. I would like to express my gratitude to Professor Cuong and wish him success in research and leadership as well as happiness in life.

The thesis is a summary of countless hours with people at the C1-311 lab, who always encourage me to be persistent on a present and future research pathway full of struggles. Thanks to a special friend who supported me at the worst time, when I felt alienated. Thanks to the other juniors with whom I worked in depth and enjoyed coffee breaks. Thank you to my family and my girlfriend for all of their love and support.

Special thanks to Thanh for accompanying me throughout the whole time here at HUST.

Hanoi, April 15, 2021
Duong Van Hai


In recent years, with the development of voice assistants, keyword spotting has become a common way to begin an interaction with the voice interface (e.g., "OK Google", "Alexa", or "Hey Siri"). Despite all advances in understanding spoken commands, it is difficult for machines to decide whether an utterance is actually intended as a command or if it is just conversational speech. Various approaches were developed to solve this problem, and an approach that retains the advantages of hands-free operation is to use a pre-defined word or short phrase to wake up the machine and signal that the following speech will be a command; this task is so-called Keyword Spotting (KWS). There are several existing methods to tackle the KWS problem. For example, collecting numerous variations of a specific keyword utterance and then training neural networks to classify keyword versus non-keyword has been a promising method in the field. Besides, Query-by-Example (QbE) methods usually take several examples of the keywords as templates and compare the test audio segment against the templates to make detection decisions. In this thesis, the method applying to a single keyword is focused on. Moreover, a new model architecture for KWS is proposed, achieving high accuracy with a small footprint.

Furthermore, the process of developing a new deep learning model is always energy-intensive. To improve the efficiency of KWS, efficient neural network architectures, model compression, or model quantization methods were proposed. Despite the efficient algorithms and methods, hardware platforms, such as GPUs, might not be flexible enough to support all sorts of optimizations.

Hence, a hardware-software co-design is required to accelerate the computation at the edge, and the flexibility of hardware platforms becomes a key feature. Field Programmable Gate Arrays (FPGAs) thus become a good candidate, providing good energy efficiency and flexibility to configure the hardware. Thus, the proposed system was built on an FPGA to improve the speed of the model without sacrificing accuracy.

The main contributions of this work include: 1) proposing a new model architecture for the KWS system and 2) deriving an FPGA implementation of the proposed method. The thesis is organized as follows:

• Chapter 1 introduces the overview of the KWS problem and its challenges.

• Chapter 2 provides a brief review of some essential concepts of KWS systems and FPGA design.

• Chapter 3 presents the most recent methods related to the KWS problem.

• Chapter 4 proposes the new network architecture for KWS and presents experimental results.

• Chapter 5 derives the FPGA implementation and provides an analysis of the proposed approach's performance.

List of Abbreviations

DCN Deformable Convolutional Network

GRU Gated Recurrent Unit

CNN Convolutional Neural Network

RNN Recurrent Neural Network

KWS Keyword Spotting

CRNN Convolutional Recurrent Neural Network

QbE Query-by-Example

LSTM Long Short-Term Memory

FPGA Field Programmable Gate Array

DNN Deep Neural Network

QoR Quality of Results

IoT Internet of Things

LVCSR Large-Vocabulary Continuous Speech Recognition

HMM Hidden Markov Model

DTW Dynamic Time Warping

MFCC Mel Frequency Cepstral Coefficients

RTF Real Time Factor

HLS High-Level Synthesis


List of Tables

4.1 Dataset statistics
4.2 Performance of the proposed model and the CRNN-A model: number of parameters and FRR (%) in clean and noisy conditions (SNR = 5 dB) at 0.5 FA per hour
4.3 Performance of the proposed model and the WaveNet model: number of parameters and FRR (%) in clean and noisy conditions (SNR = 5 dB) at 0.5 FA per hour
5.1 Detailed performance of Design 1 (100 MHz)
5.2 Detailed performance of Design 2 (100 MHz)
5.3 Detailed performance of Design 3 (200 MHz)
5.4 Processing time for 200 frames of 40-dimensional MFCC


List of Figures

2.1 KWS approaches
2.2 FPGA HLS design flow
3.1 Framework of the Deep KWS system [23]
3.2 Attention-based end-to-end model for KWS [20]
3.3 WaveNet-based KWS architecture [29]
4.1 End-to-end proposed KWS system
4.2 The deformable convolutional network
4.3 Illustration of bilinear interpolation
4.4 DET curves for different parameters of the proposed model
4.5 The concentration of learned offset values of the DCN layer
4.6 DET curves for the proposed architecture (blue) compared to the CRNN-A (dashed orange) baseline in clean (a) and noisy (b) conditions
5.1 Block diagram of the proposed KWS system
5.2 Time consumption (%) on CPU and GPU of each component in the proposed system
5.3 Vivado HLS report of a naive implementation
5.4 Operation of the line buffer and window for 2D filtering
5.5 Vivado HLS report of a line buffer and window implementation
5.6 FPGA design performance compared with CPU and GPU
5.7 A simple block design of the system in Vivado


List of Algorithms

4.1 Deformable convolution at channel c, timestep t, frequency f
4.2 Summary of the operations of the proposed model for each input of T frames
5.1 Pseudocode of a convolution layer

List of Listings

5.1 C++/High-Level Synthesis (HLS) implementation of the KWS system
5.2 Application to communicate with the IP core
A.1 C++/HLS implementation of convolution using a line buffer
A.2 C++/HLS implementation of bilinear interpolation
B.1 Design 1: C++/HLS implementation of a Gated Recurrent Unit (GRU)
B.2 Design 2: Implementation of the GRU using loop tiling and loop unrolling
B.3 Design 3: Implementation of the GRU with separated operations

Table of Contents

2.1 KWS Approaches
2.2 KWS Evaluation
2.2.1 Real time factor
2.2.2 False alarm rate
2.2.3 False alarm per hour per keyword
2.2.4 False rejection rate
2.3 FPGA Implementation Overview
2.3.1 Programming Tools
2.3.2 Design Flow
2.4 Summary
3 Related Works
3.1 Deep KWS
3.2 Attention-based KWS
3.3 WaveNet-based KWS
3.4 Summary
4 Proposed System
4.1 End-to-end Small-footprint KWS
4.1.1 System Description
4.1.2 Deformable Convolutional Network
4.1.3 Gated Recurrent Unit
4.1.4 Attention Mechanism
4.2 Experimental Result
4.2.1 Experiments
4.2.2 Proposed Model: DCN-based Attention Model
4.2.3 Attention-based Baseline Model (CRNN-A)
4.2.4 WaveNet-based Baseline Model (WaveNet)
4.3 Summary
5 FPGA Design
5.1 FPGA Implementation
5.1.1 Deformable Convolution Network Implementation
5.1.2 Gated Recurrent Unit Implementation
5.1.3 System Integration
5.2 Implementation Result
5.2.1 FPGA Platform
5.2.2 FPGA Performance
5.3 Summary
Conclusion
Publication
A FPGA Implementation of DCN
A.1 Line Buffer
A.2 Bilinear Interpolation
B FPGA Implementation of GRU
B.1 Approach 1
B.2 Approach 2


Chapter 1

Introduction

Hands-free operation of machines based on speech recognition allows for fast, natural, and convenient interaction. Despite all advances in understanding spoken commands, it is difficult for machines to decide whether an utterance is actually intended as a command or if it is just conversational speech. One solution is to have the user push a button before uttering a command. However, this greatly reduces the aforementioned flexibility and convenience of hands-free systems. An approach that retains these advantages is to use a pre-defined word or short phrase to wake up the machine and signal that the following speech will be a command. The problem of recognizing this is so-called Keyword Spotting (KWS), a sub-field of automatic speech recognition.

In recent years, with the development of voice assistants, keyword spotting has become a common way to begin an interaction with the voice interface (e.g., "OK Google", "Alexa", or "Hey Siri"). These systems collect numerous variations of a specific keyword utterance and train neural networks, which have been promising methods in the field. In the last few years, deep learning-based approaches have proven to be an efficient solution due to their small footprint with superior performance. As shown first in [1], a Deep KWS model with all fully-connected layers laid the foundation for applying deep learning to KWS. In this approach, a simple multi-layer feed-forward neural network was used to predict the frame-level posteriors of sub-keywords; the keyword is considered detected if the confidence score computed from these posteriors exceeds a threshold. However, the fully connected architecture is simple and cannot capture the temporal correlation in the speech features. Besides, the densely connected network also consumes a huge memory footprint. Later, more advanced neural network models were proposed to improve KWS systems, utilizing Convolutional Neural Network (CNN) layers for exploiting spatially local correlations and Recurrent Neural Network (RNN) layers for processing sequence data of any length. By combining their advantages, a more recent model, the Convolutional Recurrent Neural Network (CRNN), yielded an acceptable latency and high accuracy, yet required significantly fewer parameters than Deep KWS.

On the other hand, there have been Query-by-Example (QbE) approaches that detect query keywords of any kind. QbE methods usually take several examples of the keywords as templates and compare the test audio segment against the templates to make detection decisions. In [2], example keywords are decoded with a Large-Vocabulary Continuous Speech Recognition (LVCSR) system to get their lattice representation as templates. Besides, in [3] a Long Short-Term Memory (LSTM) is used as a feature extractor: for each audio segment, a fixed-length representation of the audio is created by taking the activations from the last hidden layer of the LSTM and stacking them over a fixed number of frames. This approach embeds audio segments of different lengths into a fixed-dimensional space, therefore vector distance can be used for similarity measurement.

In this thesis, I focus on analyzing an end-to-end KWS system using a Deformable Convolutional Network (DCN) with an attention mechanism for a small memory-footprint model, applied to a single keyword. The DCN was first introduced in [4]. Regular convolution operates on a pre-defined rectangular grid from an input feature or a set of input feature maps, based on the defined filter size. This grid can be of size 3 × 3, 5 × 5, etc. However, useful information that we want to extract and classify can be deformed or occluded within the input. In a DCN, the grid is deformable in the sense that each grid point is moved by a learnable offset, and the convolution operates on these moved grid points, which is thereby called "deformable convolution". The DCN is therefore able to focus on regions with more information. In other words, the DCN has a tendency to ignore irrelevant frequencies in the input features or silent segments in the input audio.

Recently, the DCN has mainly been used in image recognition tasks, such as object detection. The DCN improves the accuracy of DeepLab, Faster R-CNN, R-FCN, FPN, etc. By using DCN + FPN + Aligned Xception, MSRA won 2nd Runner Up in the COCO Detection Challenge and 3rd Runner Up in the Segmentation Challenge.

Furthermore, the process of developing a new deep learning model is always energy-intensive. A recent study conducted by Strubell et al. [5] reveals that the estimated carbon emission from training a transformer model [6] with neural architecture search using GPUs can be five times as much as the carbon emission of a car over its whole lifetime. Even though the amount of CO2 produced by deep learning is still negligible compared with the total carbon emissions of the whole planet, low-power solutions are still in urgent need in order to slow down the increasing energy consumption of deep learning.

To improve the efficiency of Deep Neural Network (DNN) computation, efficient neural network architectures such as SqueezeNet [7] and ShuffleNet [8] were proposed. These designs could reduce the number of parameters and operations without significantly degrading the model performance. Meanwhile, model compression and quantization were introduced to effectively reduce the model size and improve efficiency. Model compression methods [9] aim to prune the redundant structures in neural networks that have no or minimal impact on the model performance, to avoid wasting memory and computing power. On the other hand, model quantization [10] can reduce the number of bits used in the parameters and intermediate results of the DNN so that the computation becomes more efficient. DNNs optimized by the aforementioned techniques can achieve comparable performance with significantly smaller memory usage and better computing efficiency. Modified optimization methods, including stochastic weight averaging [11], low-precision stochastic variance-reduced gradient [12], etc., were recently proposed to train DNNs with low-precision floating-point numbers.

Despite the efficient algorithms and methods of model compression and quantization, hardware platforms, such as GPUs, might not be flexible enough to support all sorts of optimizations. For example, most commercial GPUs support only full-precision floating-point operations; using fixed-point or even mixed-precision numbers for calculation is not fully supported on most GPUs. Meanwhile, random network compression might not benefit the computation efficiency of GPUs, since the computation is highly parallelized and the overhead of memory alignment is higher than the gains obtained from model compression. Hence, a hardware-software co-design is required to accelerate DNN computation at the edge, and the flexibility of hardware platforms becomes a key feature. Field Programmable Gate Arrays (FPGAs) thus become a good candidate, providing good energy efficiency and the flexibility to configure the hardware.

Facing these opportunities and challenges, in this work a small-footprint, highly accurate KWS system is proposed. Also, the system was built on an FPGA to improve the speed of the model without sacrificing accuracy.

The thesis is organized as follows. Chapter 2 provides a brief review of some essential concepts of KWS systems and FPGA design. The most recent methods related to the KWS problem are presented in Chapter 3. Next, Chapter 4 shows the proposed network architecture for KWS, presents experimental results, and provides an analysis of the proposed approach's performance. Finally, Chapter 5 derives the FPGA implementation of the proposed model.


2.1 KWS Approaches

In the first approach, the speech is first transcribed by a speech recognizer, and then a text search algorithm determines the keywords that occurred in the utterance [13]. Therefore, the first approach is called "LVCSR-based KWS". In the second approach, KWS is considered as a function or a classifier without passing through the speech recognition step; it is so-called "Direct KWS".

LVCSR-based KWS consists of two phases. First, a large-vocabulary speech recognizer converts large audio archives to phoneme or word lattices. Then, in the second phase, the lattice-based search looks for the set of target keywords. The first phase of LVCSR-based KWS is offline while the second one is online. This approach has three major drawbacks. Firstly, a large amount of labeled data is required to train LVCSR-based KWS. Secondly, the computational cost implied by large-vocabulary decoding is high. The third drawback is a decrease in performance for Out-Of-Vocabulary (OOV) words.

Figure 2.1: KWS approaches

In the second approach (Direct KWS), KWS is completely independent of the speech recognition task. Direct KWS can be considered as either "Template Matching KWS" or "End-to-end KWS". In the first group, the most intuitive way to search over spoken utterances is to directly look for those parts of the utterances that sound like the target keywords. One or more spoken utterances of the keywords are used as templates, which are then compared with the input speech to make decisions about keyword occurrences. The most widely used approach for template matching is Dynamic Time Warping (DTW) [14]. Although KWS based on DTW has been widely used, it has several weaknesses. Firstly, it has very high complexity, which is not acceptable in real-time applications. Secondly, getting a fixed-length representation of the time-varying input speech is a difficult task. QbE KWS is an improved version of DTW-based KWS that represents each word as a vector and computes only the similarities between two single vectors. Thus, it is much more efficient than conventional DTW [15]. Additionally, the QbE retrieval performance is higher than DTW [15]. QbE KWS has been presented in the literature from two different points of view. In the first one, the methods are based on template matching of decoded queries and spoken utterances. It means that an automatic speech recognizer decodes the input query to its phonetic transcription; then, the decoded query is text-searched in the decoded spoken utterances [16]. Thus, in this case, the QbE KWS depends on speech recognition and does not belong to the "Direct KWS" group. In the second view, methods try to match the query's extracted features with spoken utterance features. They usually borrow the idea from DTW and outperform the first group of QbE KWS [17].

In the second group, an end-to-end keyword spotter is a function with two input and two output arguments [18]. The input arguments are the input speech utterances and the set of target keywords. The output arguments consist of the confidence measure of keyword occurrence in the input speech and the keyword position. If the calculated confidence measure is greater than a predefined threshold, the keyword spotter confirms the occurrence of the target keyword at the corresponding position of the input speech utterance. End-to-end means a simple model that directly detects the keyword occurrences. It does not usually need any complex search algorithm; moreover, no alignment of the training set is required. End-to-end KWS is usually based on an RNN and exploits the Connectionist Temporal Classification (CTC) loss function for training the RNN. The main recent end-to-end KWS approaches have been presented in [19, 20, 21].

2.2 KWS Evaluation

There are several metrics for evaluating a KWS system. In this section, the main metrics for evaluating KWS systems are discussed.

2.2.1 Real time factor

A common metric for measuring the speed of KWS systems is the Real Time Factor (RTF). If it takes time P to process an input of duration I, the RTF is computed as:

RTF = P / I (2.1)

For example, a system that needs 2 seconds to process a 10-second utterance has RTF = 0.2; real-time operation requires RTF ≤ 1.

2.2.2 False alarm rate

A false alarm in the field of KWS refers to detecting a non-keyword as a keyword. The false alarm rate is the ratio of the number of false detections of keyword occurrences to the whole number of non-keyword occurrences:

FAR = Total False Acceptances / Total Non-keywords (2.2)

2.2.3 False alarm per hour per keyword

This term refers to the ratio of the false alarm rate to the whole number of keywords multiplied by the duration of the whole test set in hours:

FA/Kw/hour = FAR / (Total Keywords × Test set duration (hours)) (2.3)

2.2.4 False rejection rate

A false rejection in the field of KWS refers to detecting a keyword as a non-keyword. The false rejection rate is the ratio of the number of false rejections of keywords to the whole number of keyword occurrences:

FRR = Total False Rejections / Total Keywords (2.4)
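As a quick worked example with made-up numbers: on a test set containing 500 keyword occurrences and 10,000 non-keyword occurrences, a system that falsely accepts 50 non-keywords and falsely rejects 20 keywords has FAR = 50/10,000 = 0.5% and FRR = 20/500 = 4%.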

2.3.1 Programming Tools

The Vivado HLS tool uses a C test bench to simulate the C function prior to synthesis and to verify the RTL output using C/RTL co-simulation.

The tool also adds some constraints to the exported IP block, such as the clock period, clock uncertainty, and FPGA target. The clock uncertainty defaults to 12.5% of the clock period, but there are options to change it. Finally, the tool allows directives to be added that steer the synthesis process toward a specific behavior or optimization. Directives are optional and do not change the behavior of the C code in the simulations, only in the synthesized IP block.

When synthesizing HLS code, the tool exports a synthesis report showing the performance metrics of the generated design. The report's information on performance metrics is presented below:

• Area: The amount of hardware resources required to implement the design, based on the resources available in the target FPGA. The resource types are Look-Up Tables (LUTs), Flip-Flops (FFs), Block RAMs (BRAMs), and DSP48s.

• Latency: The number of clock cycles required for the function to compute all output values.

• Iteration Interval (II): The number of clock cycles before the function can accept new input data.

• Loop Initiation Interval: The number of clock cycles before the next iteration of the loop starts to process data.

• Loop Latency: The number of cycles to execute all iterations of the loop.

• Loop Iteration Latency: The number of clock cycles it takes to complete one iteration of the loop.

Vivado HLS also provides optional directives that can be used to optimize the design: reduce latency, improve throughput performance, and reduce area and device resource utilization of the resulting RTL code. These pragmas can be added directly to the source code for the kernel.

• Pipeline: The PIPELINE pragma reduces the initiation interval for a function or loop by allowing the concurrent execution of operations. A pipelined function or loop can process new inputs every N clock cycles, where N is the initiation interval (II) of the loop or function.

• Array Partition: Partitions an array into smaller arrays or individual elements. This partitioning:

– results in RTL with multiple small memories or multiple registers instead of one large memory;

– effectively increases the number of read and write ports for the storage;

– potentially improves the throughput of the design, but requires more memory instances or registers.

• Unroll: Unrolls loops to create multiple independent operations rather than a single collection of operations. The UNROLL pragma transforms loops by creating multiple copies of the loop body in the RTL design, which allows some or all loop iterations to occur in parallel.

• Stream: By default, array variables are implemented as RAM. If the data stored in the array is consumed or produced in a sequential manner, a more efficient communication mechanism is to use streaming data as specified by the STREAM pragma, where FIFOs are used instead of RAMs.

• Array Map: Combines multiple smaller arrays into a single large array to help reduce block RAM resources.

• Resource: Specifies that a specific library resource (core) is used to implement a variable (array, arithmetic operation, or function argument) in the RTL. If the RESOURCE pragma is not specified, Vivado HLS determines the resource to use.

• Dataflow: The DATAFLOW pragma enables task-level pipelining, allowing functions and loops to overlap in their operation, increasing the concurrency of the RTL implementation and increasing the overall throughput of the design.

Figure 2.2: FPGA HLS design flow.
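As an illustration of how these directives are used, below is a minimal, hypothetical HLS C++ kernel (a small dense layer y = Wx + b; the function name, sizes, and pragma choices are assumptions for this sketch, not code from the thesis):

#define N 64

// Hypothetical dense-layer kernel annotated with HLS pragmas.
void dense64(const float W[N][N], const float x[N],
             const float b[N], float y[N]) {
    // Split x across several memories so multiple elements
    // can be read in the same clock cycle.
#pragma HLS ARRAY_PARTITION variable=x cyclic factor=8 dim=1

ROW:
    for (int i = 0; i < N; ++i) {
        float acc = b[i];
COL:
        for (int j = 0; j < N; ++j) {
            // Ask for a new iteration every clock cycle; the report may
            // show a larger II if the floating-point accumulation
            // creates a loop-carried dependency.
#pragma HLS PIPELINE II=1
            acc += W[i][j] * x[j];
        }
        y[i] = acc;
    }
}

Without the pragmas, Vivado HLS schedules the loops sequentially; the synthesis report described above (latency, II, area) is where one checks whether each directive actually paid off.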

2.3.2 Design Flow

In this section, the preparation and setup of the proposed models on the FPGA are described, and the design flow is presented, as summarized in Figure 2.2.

First of all, the proposed model was trained using PyTorch. After the training process is done, the trained weights of the neural networks were saved to text files. All of the parameters can be stored directly in BRAM when porting the model to the FPGA if the model size is small; otherwise, DRAM is used to store the weights. Each module of the neural networks was reimplemented in C++ instead of Python in order to adapt to HLS.

To verify the implementation, random input was generated for the neural networks. The input was then fed into both implementations (Python and FPGA), and the output of the FPGA was compared with the output of the reference Python implementation. If the error is less than 10^-5, the FPGA implementation is considered correct. The modules are assembled together to form a complete HLS system, which was then synthesized and exported as a single IP core. Finally, a simple application on Vivado SDK was programmed to transfer data and verify the whole system.
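A minimal sketch of the element-wise check described above (the function name, buffer layout, and error handling are assumptions, not the thesis code):

#include <cmath>
#include <cstdio>

// Compare the FPGA (HLS C simulation) output against the reference
// output exported from the Python model, element by element.
bool outputs_match(const float* fpga_out, const float* ref_out,
                   int n, float tol = 1e-5f) {
    for (int i = 0; i < n; ++i) {
        if (std::fabs(fpga_out[i] - ref_out[i]) > tol) {
            std::printf("mismatch at %d: fpga=%g ref=%g\n",
                        i, fpga_out[i], ref_out[i]);
            return false;
        }
    }
    return true;
}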

2.4 Summary

This chapter introduced the basic and essential concepts of KWS, including the main approaches to the KWS problem and some metrics for evaluating the performance of a KWS system. FPGA programming was also presented, covering the programming tools and the design flow applied to developing neural network modules. In the next chapter, the most recent methods related to the KWS problem will be presented and analyzed.


Chapter 3

Related Works

The aim of KWS is to detect specific keywords/key-phrases in an audio stream, and there are several conventional KWS systems. For example, DTW-based KWS searches for matches between a sliding window of the speech signal and the template of the keyword based on the DTW algorithm [22]. LVCSR-based KWS first utilizes an LVCSR system to transcribe the input audio stream into text or rich lattices, then employs efficient search algorithms to determine the position of the keyword in the text or lattices. Another widely used KWS approach is Keyword/Filler Hidden Markov Model (HMM)-based KWS: for each predefined keyword, this approach trains an HMM on a training dataset, and a filler HMM is trained for all non-keyword speech, noise, and silence. Both LVCSR-based KWS and Keyword/Filler HMM-based KWS use Viterbi search for decoding.

Nowadays, many applications on mobile devices require real-time KWS. Such KWS systems should have a small memory footprint, low latency, and low computational cost without suffering a loss of accuracy. However, the conventional KWS approaches do not meet these requirements, because a great deal of memory and computation is needed for Viterbi search. Therefore, this chapter examines in detail several approaches to small-footprint keyword spotting: DNN-based, WaveNet-based, and attention-based KWS.

3.1 Deep KWS

In 2014, a DNN-based small-footprint KWS system called Deep KWS was proposed by Chen et al. to solve the problem of real-time KWS on mobile devices [23]. This system is composed of three modules, as can be seen in Figure 3.1: the feature extraction module, the deep neural network module, and the posterior handling module.

Figure 3.1: Framework of Deep KWS system [23]


Feature Extraction: The feature extraction module produces 40-dimensional acoustic feature vectors from an audio stream using log-filterbank energies. In order to provide sufficient context information for the current frame, 30 past frames and 10 future frames are stacked with the current frame to form a larger feature vector.

Classifier: The deep neural network is used to estimate the posterior probabilities of the entire keywords (keywords could be items such as "okay", "google", and so on) given the stacked input feature vectors. The deep neural network model is a standard feed-forward fully connected neural network with k hidden layers and n hidden nodes per layer, each computing a non-linear function of the weighted sum of the output of the previous layer. The last layer has a softmax which outputs an estimate of the posterior of each output label.

Post-processing: Once the posterior probabilities are calculated for each frame of the feature vector sequence, the confidence score of the predefined keyword/key-phrase can be computed from them. As the raw posteriors generated by the neural network are normally noisy, before the confidence scores are computed, the posteriors are smoothed by taking their mean over a sliding window of w_smooth frames, using the formula from [23]:

p'_{ij} = (1 / (j − h_smooth + 1)) × Σ_{k=h_smooth..j} p_{ik}, with h_smooth = max(1, j − w_smooth + 1) (3.1)

where p_{ik} is the posterior of label i at frame k. Since a keyword is composed of a phoneme sequence, if the phonemes of a given keyword/key-phrase appear in a sliding window of the frame sequence with high probabilities, then the keyword/key-phrase also appears in that window with high probability. Finally, the posterior handling module calculates the confidence score from the smoothed posteriors: the confidence at the j-th frame is the (n − 1)-th root of the product of the maximum smoothed posteriors of all keyword labels over the past 100 frames,

confidence = ( Π_{i=1..n−1} max_{h_max ≤ k ≤ j} p'_{ik} )^(1/(n−1)), with h_max = max(1, j − 99) (3.2)
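As a concrete sketch of this posterior-handling step, the following plain C++ renders equations (3.1) and (3.2) as reconstructed above (the buffer layout, label ordering, and window defaults are assumptions, not code from [23] or the thesis):

#include <algorithm>
#include <cmath>
#include <vector>

// posteriors[t][i]: network posterior of label i at frame t
// (label 0 is assumed to be the filler/non-keyword label).
// Returns the Deep-KWS-style confidence at the last frame.
float kws_confidence(const std::vector<std::vector<float>>& posteriors,
                     int w_smooth = 30, int w_max = 100) {
    const int T = (int)posteriors.size();
    const int n = (int)posteriors[0].size();  // n - 1 keyword labels
    const int j = T - 1;

    float prod = 1.0f;
    for (int i = 1; i < n; ++i) {
        float best = 0.0f;
        for (int k = std::max(0, j - w_max + 1); k <= j; ++k) {
            // smoothed posterior p'_{ik}: mean over the last w_smooth frames
            const int h = std::max(0, k - w_smooth + 1);
            float s = 0.0f;
            for (int t = h; t <= k; ++t) s += posteriors[t][i];
            best = std::max(best, s / (k - h + 1));
        }
        prod *= best;
    }
    return std::pow(prod, 1.0f / (n - 1));  // (n-1)-th root of the product
}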

However, the Deep KWS system also suffers from some disadvantages. Firstly, a large amount of training data is needed to train the DNN; according to the paper [23], more than two thousand training examples per keyword are used. For keywords which rarely appear in natural language, it is difficult to collect enough training data. Secondly, the Deep KWS system uses the DNN to estimate the posterior probabilities from a sequence of feature vectors, but the DNN cannot model the temporal dimension of the audio data. Thirdly, the dense layers do not have any parameter-sharing mechanism (for example, sharing a set of parameters across different timesteps); therefore, they require a significantly large number of parameters.


3.2 Attention-based KWS

Figure 3.2: Attention-based end-to-end model for KWS [20].

Compared with Deep KWS, the attention-based end-to-end model is composed of only two modules, the feature extraction module and the deep neural network module, without the posterior handling module.

Feature Extraction: The positive training samples have a frame length of T = 1.9 seconds, which ensures the entire wake-up word is included. Accordingly, in the attention models, the input window is set to 189 frames to cover the length of the wake-up word. Each audio frame is computed based on a 40-channel Mel-filterbank with 25 ms windowing and a 10 ms frame shift. The filterbank features are then converted to per-channel energy normalized (PCEN) [28] Mel-spectrograms.

Classifier: The overall architecture can be seen in Figure 3.2. The end-to-end system consists of an encoder and an attention mechanism. The encoder transforms the input features into a high-level representation using a simple RNN. Then the attention mechanism weights the encoder features and generates a fixed-length vector. Two types of attention were investigated, average attention and soft attention:

• Average attention: the attention model has no trainable parameters, and α_t is set as the average over the T frames:

α_t = 1 / T (3.3)

• Soft attention: this attention method is borrowed from speaker verification [abc]; the model learns shared-parameter non-linear attention weights. First, the model learns a scalar score e_t:

e_t = v^T tanh(W h_t + b) (3.4)

Then the normalized weights α_t are computed from these scalar scores:

α_t = exp(e_t) / Σ_{j=1..T} exp(e_j) (3.5)

Finally, by a linear transformation and a softmax function, the pooled vector becomes a score used for keyword detection (a small sketch of this pooling follows below). To improve the end-to-end approach, further encoder architectures were explored, including LSTM, Gated Recurrent Unit (GRU), and CRNN.
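A small C++ sketch of the soft-attention pooling in equations (3.4) and (3.5) (dimensions and names are illustrative; this is not the authors' code):

#include <cmath>
#include <vector>

// Given encoder states h[t] (each of dimension D), compute
// alpha_t = softmax_t( v^T tanh(W h_t + b) ) and return the
// attention-pooled vector c = sum_t alpha_t * h_t.
std::vector<float> soft_attention(
        const std::vector<std::vector<float>>& h,
        const std::vector<std::vector<float>>& W,   // D x D
        const std::vector<float>& b,                // D
        const std::vector<float>& v) {              // D
    const int T = (int)h.size(), D = (int)v.size();
    std::vector<float> e(T);
    for (int t = 0; t < T; ++t) {
        float score = 0.0f;
        for (int i = 0; i < D; ++i) {
            float z = b[i];
            for (int j = 0; j < D; ++j) z += W[i][j] * h[t][j];
            score += v[i] * std::tanh(z);  // e_t = v^T tanh(W h_t + b)
        }
        e[t] = score;
    }
    float denom = 0.0f;
    for (int t = 0; t < T; ++t) denom += std::exp(e[t]);
    std::vector<float> c(D, 0.0f);
    for (int t = 0; t < T; ++t) {
        const float alpha = std::exp(e[t]) / denom;  // eq. (3.5)
        for (int i = 0; i < D; ++i) c[i] += alpha * h[t][i];
    }
    return c;
}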

Post-processing: Unlike some other approaches, the end-to-end system outputs a confidence score directly, without post-processing. The attention-based system is triggered directly when p(y = 1) exceeds a preset threshold.

Experiments on real-world wake-up data show that, with a similar number of parameters, the attention models outperform Deep KWS by a large margin. Among encoder architectures, the GRU is preferred over the LSTM, and the best performance is achieved by the CRNN.

One of the problems of this model is the PCEN features, because they increase the number of operations and the number of parameters. It also adopts a conventional CNN as the first layer, which captures redundant information and cannot focus on the most important regions of the input features.

3.3 WaveNet-based KWS

In recent years, the WaveNet model has been widely used for speech/audio synthesis. It does not contain any recurrent connection, and all operations are fully parallelizable, making it one of the fastest sequence modelling approaches. In [29], the authors presented an end-to-end stateless temporal model which can take advantage of a large context while limiting computation and avoiding saturation issues. An architecture was explored based on a stack of dilated convolution layers, effectively operating on a broader scale than standard convolutions while limiting model size. A further improvement was the use of gated activations and residual skip-connections, inspired by the WaveNet-style architecture explored previously for text-to-speech applications [30] and voice activity detection [31], but never applied to KWS at the time the article was published. Moreover, ResNets differ from WaveNet models in that they do not leverage skip-connections and gating, and apply convolution kernels in the frequency domain, drastically increasing the computational cost. In addition, the long-term dependency the model can capture is exploited by implementing a custom "end-of-keyword" target labeling, increasing the accuracy of the model.

Feature Extraction: The acoustic features are 20-dimensional log-Mel filterbank energies (LFBEs), extracted from the input audio every 10 ms over a window of 25 ms.

Figure 3.3: WaveNet-based KWS architecture [29].

Classifier: WaveNet was initially proposed in [30] as a generative model for speech synthesis and other audio generation tasks. It consists of stacked causal convolution layers wrapped in a residual block with gated activation units, as depicted in Figure 3.3.

• Dilated convolution: Standard convolutional networks cannot capture long temporal patterns with reasonably small models due to the increase in computational cost yielded by larger receptive fields. Dilated convolutions skip some input values so that the convolution kernel is applied over an area larger than its own size. The network therefore operates on a larger scale, without the downside of increasing the number of parameters. The receptive field r of a network made of L stacked dilated convolution layers, with kernel size s_i and dilation rate d_i in the i-th layer, reads (a small numerical check follows this list):

r = 1 + Σ_{i=1..L} (s_i − 1) · d_i

• Gated activations and residual connections: Gated activation units, a combination of tanh and sigmoid activations controlling the propagation of information to the next layer, prove to efficiently model audio signals; in WaveNet-style layers they take the form z = tanh(W_f ∗ x) ⊙ σ(W_g ∗ x), where ∗ denotes convolution and ⊙ element-wise multiplication. Residual learning strategies such as skip connections are also introduced to speed up convergence and to address the vanishing gradient issue posed by training deeper models. Each layer yields two outputs: one is directly fed to the next layer as usual, while the second one skips it; all skip-connection outputs are then summed into the final output of the network. A large temporal dependency can therefore be achieved by stacking multiple dilated convolution layers.

Post-processing: Like the Deep KWS approach, the end-to-end system computes smoothed posteriors by averaging the output over a sliding context window of w_smooth frames. However, the models do not require any post-processing step besides smoothing, as opposed to other multi-class models. Indeed, the system triggers when the smoothed keyword posterior exceeds a pre-defined threshold.

The original motivation of WaveNet was to design a generative model for audio. Therefore, there is no evidence suggesting that WaveNet would be appropriate for classification problems. The WaveNet model contains gated functions, which increase the computational complexity remarkably. Moreover, it contains many blocks and residual connections, which means more parameters and much more data needed to train the model. Applying the WaveNet model is a new direction in KWS, but there is still insufficient empirical experimentation to establish its efficiency.

3.4 Summary

This chapter provided an overview of research on small-footprint KWS systems, including DNN-based, attention-based, and WaveNet-based KWS, and discussed some pros and cons of the mentioned models. The next chapter presents the proposed system, which uses a DCN-based model for KWS, together with experimental results and an analysis of the proposed approach's performance.


Chapter 4

Proposed System

Two variants of RNN were evaluated: LSTM [32] and GRU [33]. Like the LSTM unit, the GRU has gating units that control the flow of information inside the unit, but without a separate memory cell. However, the GRU is preferred over the LSTM, as it gives better performance with lower complexity [24]. Thus, the GRU is adopted as a building block of the proposed model (its update equations are given below for reference).
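For reference, the standard GRU update (as defined in [33]; the notation here is the generic one, not copied from the thesis) is:

z_t = σ(W_z x_t + U_z h_{t−1} + b_z) (update gate)
r_t = σ(W_r x_t + U_r h_{t−1} + b_r) (reset gate)
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h) (candidate state)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t (hidden state)

where σ is the logistic sigmoid and ⊙ denotes element-wise multiplication. The absence of a separate memory cell and output gate is what makes the GRU cheaper than the LSTM, which matters for the FPGA implementation in Chapter 5.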

Secondly, in recent years, the attention architecture has been widely used in speech recognition [34, 35, 36] and speaker verification [37] and has achieved many successes. Therefore, the use of the attention mechanism in KWS has great potential to improve the quality of the system. In [20], the authors presented an attention-based model that exploited the canonical CRNN architecture followed by an attention mechanism and significantly outperformed previous approaches by a large margin. Based on the idea of [20], an architecture is developed that combines the CRNN network as the encoder part with the attention part. Instead of using a regular CNN in the CRNN, a DCN, a special type of CNN, is proposed. The DCN operates on fixed-size convolution filters like a regular CNN but with variously shaped grids, so the learnable filters are able to concentrate on the regions with the most information.

Figure 4.1 shows the overall architecture with the corresponding configurations. The input of the proposed model is a T-by-F matrix, where T is the length of the feature sequence and F is the dimensionality of a feature. The acoustic features are MFCCs, extracted from raw time-domain signals. The subsequent part includes one DCN layer followed by a multi-layer unidirectional GRU that processes the entire frame sequence. Outputs of the GRU layers are fed to the attention layer to generate a more compact feature representation. Lastly, a linear transformation and softmax decoding are applied to obtain a corresponding prediction score.

The end-to-end model directly outputs the confidence score without any post-processing methods; no searching algorithms are involved, and no alignments are needed beforehand to train the model. The system is triggered when the output exceeds a pre-defined threshold.


Figure 4.1: End-to-end proposed KWS system.

4.1.2 Deformable Convolutional Network

The information in the audio features is highly deformed, and only a few regions contain useful information. Regular convolution operates on a pre-defined rectangular grid from an input feature or a set of input feature maps, based on the defined filter size. This grid can be of size 3 × 3, 5 × 5, etc. However, useful information that we want to extract and classify can be deformed or occluded within the input, so these filters cannot appropriately cover feature structures of various shapes, which limits their ability to learn more information from the features. This thesis proposes the use of deformable convolution instead of conventional convolution to solve this problem. The DCN was first introduced in [4]. It learns to transform the rectangular grid into a deformed one, which lets the model concentrate on regions with more useful information; the neural network is therefore able to focus on regions with more information.

The deformable convolution architecture is depicted in Figure 4.2; it consists of three steps:

Step 1: Generate offsets. The offsets indicate the pixel coordinates to use when performing the deformable convolution. Each offset value is a pair (offset_t, offset_f), where offset_t and offset_f are the coordinates of the pixel along the time and frequency axes, respectively.

Offsets are learnable variables generated by a conventional convolution (ConvOffset) operating on the input features:

O = ConvOffset(X) (4.1)

where ConvOffset has filter sizes (L_T, L_F) and strides (S_T, S_F). The offsets form a tensor O ∈ R^{C×T′×F′}, where T′ = ⌊(T − L_T)/S_T⌋ + 1 and F′ = ⌊(F − L_F)/S_F⌋ + 1, and O[c, n, k] is the offset of the deformable convolution at channel c and time-frequency index (n, k). For each kernel window of size (L_T, L_F), L_T × L_F offset values must be generated when performing the convolution; to ensure that enough offset values are produced for the entire input feature map, the number of filters of ConvOffset is therefore set to L_T × L_F × 2.

Figure 4.2: The deformable convolutional network.
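As a quick illustrative calculation (the numbers are arbitrary, not the thesis configuration): for an input of T = 100 frames and F = 40 coefficients, a ConvOffset with filter sizes (L_T, L_F) = (3, 3) and strides (1, 1) gives T′ = ⌊(100 − 3)/1⌋ + 1 = 98 and F′ = ⌊(40 − 3)/1⌋ + 1 = 38, and it needs 3 × 3 × 2 = 18 output channels to supply one (offset_t, offset_f) pair per kernel tap.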

Step 2: Bilinear interpolation. Bilinear interpolation is, in general, a weighted sum of the pixel values around the offset coordinate. Figure 4.3 shows the function operating on an input pixel of the working grid. For the original pixel x_1, the new value x′_1 is calculated from the four pixels x_5, x_6, x_8, x_9 with the pair of values (Δx, Δy):

x′_1 = x_5 (1 − Δx)(1 − Δy) + x_6 Δx (1 − Δy) + x_8 (1 − Δx) Δy + x_9 Δx Δy (4.2)

In this model, the bilinear interpolation function uses the four pixels surrounding the offset, with values p_00 = X(x_L, y_L), p_01 = X(x_H, y_L), p_10 = X(x_L, y_H), p_11 = X(x_H, y_H), where the coordinates are x_L = ⌊offset_f⌋, x_H = ⌊offset_f⌋ + 1, y_L = ⌊offset_t⌋, and y_H = ⌊offset_t⌋ + 1.
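A plain C++ sketch of this interpolation step, i.e. equation (4.2) over a row-major feature map (the bounds handling and names are assumptions, not the thesis listing A.2):

#include <cmath>

// Sample the feature map X (T x F, row-major) at fractional
// coordinates (t, f) by bilinear interpolation of the four
// surrounding pixels; out-of-range reads are clamped to the border.
float bilinear(const float* X, int T, int F, float t, float f) {
    const int yL = (int)std::floor(t), yH = yL + 1;  // time axis
    const int xL = (int)std::floor(f), xH = xL + 1;  // frequency axis
    const float dy = t - (float)yL, dx = f - (float)xL;

    auto at = [&](int y, int x) {
        y = y < 0 ? 0 : (y >= T ? T - 1 : y);
        x = x < 0 ? 0 : (x >= F ? F - 1 : x);
        return X[y * F + x];
    };

    return at(yL, xL) * (1 - dx) * (1 - dy)
         + at(yL, xH) * dx       * (1 - dy)
         + at(yH, xL) * (1 - dx) * dy
         + at(yH, xH) * dx       * dy;
}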

