
Building a Smart Facial Recognition, Sound Analysis, and Security Alerts System Using Raspberry Pi


DOCUMENT INFORMATION

Basic information

Title: Building a smart facial recognition, sound analysis, and security alerts system using Raspberry Pi
Author: Hoang Minh Duc
Supervisor: Dr. Le Xuan Hai
Institution: Vietnam National University, Hanoi, International School
Major: Informatics and Computer Engineering
Document type: Graduation project
Year: 2025
City: Hanoi
Pages: 82
File size: 2.49 MB


Structure

  • CHAPTER 1. INTRODUCTION
    • I. Background
    • II. Reason for choosing the topic
      • 1. Main Reasons
      • 2. Literature Review
    • III. Challenges in Smart Security Systems
    • IV. About OpenCV
      • 1. Introduction to OpenCV
      • 2. Key Features of OpenCV
      • 3. Applications of OpenCV
    • V. About Histogram of Oriented Gradients (HOG)
      • 1. Gradient Computation
      • 2. Cell Division
      • 3. Block Normalization
      • 4. Feature Vector
    • VI. About Deep Learning and dlib's ResNet
      • 1. Deep Learning Overview
      • 2. dlib's ResNet for Face Recognition
  • CHAPTER 2. DESIGN AND SYSTEM DEPLOYMENT
    • I. System design
      • 1. System Architecture Overview
      • 2. System Block Diagram
      • 3. Detailed System Components
    • II. Hardware implementation
      • 1. Processing Unit
      • 2. Camera
      • 3. Microphones
      • 4. Storage
      • 5. Power Supply
    • III. Software development
      • 1. Raspberry Pi OS
      • 2. Installing the OS on Raspberry Pi
      • 3. Update the operating system and install necessary libraries
      • 4. Static IP Configuration
    • IV. Building software
      • 1. Building a program for collecting data from authorized members
      • 2. Building a Training program for the Face Recognition Model
      • 3. Building a program for Face Recognition System
      • 4. Building a program for speech recognition and email sending
      • 5. Building a program for User interface (Telegram bot)
    • V. The “Smart” of the system
      • 1. Powered Facial Recognition
      • 2. Smart Sound Analysis
      • 3. Intelligent Alerts & Notifications
      • 4. Conclusion
    • VI. FINAL PRODUCT
      • 1. Testcase
      • 2. Compare with existing solutions
  • CHAPTER 3. FUTURE ENHANCEMENTS
    • I. Future enhancements
      • 1. Integration with IoT Devices
      • 2. Smart Alerts and Automation
      • 3. Edge Computing
      • 4. Improved Facial Recognition with Deep Learning
      • 5. Privacy and Security Improvements
      • 6. Voice-Activated Controls
      • 7. Integration with External Security Systems
    • II. Limitations
      • 1. Lighting Conditions
      • 2. False Positives and False Negatives
      • 3. Environmental Noise
      • 4. Privacy Concerns
      • 5. Hardware Limitations
    • III. Development direction
      • 1. Accuracy Improvement in Face Recognition
      • 2. Enhancing Audio Recognition
      • 3. Performance Optimization
      • 4. User Interface and Experience
    • IV. Appendix
    • V. Conclusion
    • VI. References

Content

Building a smart facial recognition, sound analysis, and security alerts system using Raspberry Pi

INTRODUCTION

Background

Technological advancements have revolutionized the integration of the physical, digital, and organic realms, leading to innovative tools that connect tangible and virtual realities. Key components of Industry 4.0 encompass the Internet of Things (IoT), smart cities, artificial intelligence, autonomous vehicles, robotics, 3D printing, advanced materials, nanotechnology, and significant breakthroughs in biological sensing.

The Internet of Things (IoT) is fundamentally about "Connecting Everything to the Internet," enabling a network of interconnected devices that communicate through a shared protocol, usually via the internet. This system allows users to remotely control any network-enabled device from virtually anywhere, making device management incredibly straightforward—simply connect the device to the internet.

Traditional wireless remote control systems often suffer from limited range, but the advent of the internet has transformed this landscape, enabling innovations in automatic control. With the rising demand for seamless information exchange and the proliferation of internet-connected devices, the internet has emerged as the most efficient medium for transmitting control signals. This approach not only saves time and enhances safety for household electrical devices but also reduces costs and secures both networks and assets, providing significant benefits for individuals and businesses.

Reason for choosing the topic

My project aims to address a practical challenge by creating a dependable face recognition, audio analysis, and smart security alert system for real-world applications. The focus of this applied research is to develop effective solutions to immediate issues rather than engaging in theoretical studies.

The project integrates advanced technologies such as facial recognition, audio analysis, and intelligent security alert systems, showcasing their collaborative potential in real-world scenarios. This initiative provides valuable insights into the practical applications of IoT and AI.

1.2 Real-World Applications in Security:

As the demand for smart home and office security systems rises, this project focuses on essential needs like unauthorized access identification, suspicious activity detection, and real-time alert provision, showcasing how affordable embedded systems can significantly improve safety.

1.3 Cost-Effective and Scalable Solution:

The system is built as an affordable security solution using the Raspberry Pi 4 Model B, Raspberry Pi Camera, and a USB microphone, demonstrating budget-friendly options for small businesses and individual users. This scalable solution enhances accessibility for various users seeking effective security measures.

The project provides a comprehensive learning opportunity in various domains:

- Embedded Systems: Programming and interfacing with Raspberry Pi hardware

- Machine Learning: Implementing facial recognition using libraries such as OpenCV and face_recognition

- Audio Processing: Analyzing sound patterns using tools such as librosa and Sounddevice

- Software Development: Building a robust application integrating all components

The system offers the following core features:

- Real-time Facial Recognition: Identify authorized individuals and flag unknown or suspicious persons

- Sound Analysis: Detect unusual audio patterns, such as breaking glass, screams, or other warning signals

- Automated Alerts: Send timely notifications to users or authorities via email or app alerts

- User Interface: Enable intuitive control and monitoring through the Telegram application

The project focuses on creating a comprehensive system that integrates facial and voice recognition technologies, along with real-time notifications through email and Telegram, to improve automation and security. It encompasses various features, including the collection and training of facial data, real-time individual identification, and the processing of voice commands for automated tasks.

Recent advancements in artificial intelligence (AI) and the Internet of Things (IoT) have significantly enhanced smart security systems. This section examines the current technologies, systems, and research contributions that have shaped the evolution of these innovative security solutions.

Facial recognition technology plays a crucial role in contemporary security, encompassing applications from biometric authentication to sophisticated surveillance systems. The development of tools like OpenCV, TensorFlow, and PyTorch has facilitated the implementation of accurate face detection and recognition solutions. Studies highlight the effectiveness of convolutional neural networks (CNNs) in delivering precise facial identification. Nonetheless, challenges such as occlusions, inconsistent lighting, and the need for real-time performance remain significant hurdles in the field.

Audio analysis is becoming an essential component of security systems, significantly improving their capability to identify and react to urgent situations. Techniques such as spectrogram analysis and advanced deep learning models, including recurrent neural networks (RNNs) and convolutional neural networks (CNNs), have proven effective in detecting unusual sounds.

The integration of IoT has transformed security systems through enhanced connectivity and automation, allowing devices such as smart cameras, sensors, and microphones to work together for a holistic security perspective. Additionally, leveraging cloud and edge computing improves scalability, data processing efficiency, and real-time responsiveness, significantly boosting system performance.

The project is built on a solid theoretical foundation, utilizing established libraries like cv2, face_recognition, and speech_recognition. These libraries are grounded in well-researched algorithms, including HOG (Histogram of Oriented Gradients) for facial feature extraction and advanced machine learning models for effective speech processing.

Challenges in Smart Security Systems

Despite advancements, smart security systems face hurdles such as false positives in recognition and detection, high computational demands, and privacy concerns. Addressing these issues requires innovative approaches, including algorithm optimization, secure data handling, and user-focused design principles [3].

The project integrates established academic research in facial and voice recognition, as well as notification systems. It utilizes proven techniques like Haar cascades for face detection, the face_recognition library for encoding and comparing faces, and the SpeechRecognition library for processing voice inputs. These technologies are extensively covered in academic and technical literature, ensuring a robust foundation for the project.

About OpenCV

OpenCV, or Open Source Computer Vision Library, is a robust open-source library tailored for real-time image and video processing. Initially created by Intel in 2000 and subsequently backed by Willow Garage and Itseez, it has become one of the most popular libraries for image analysis, computer vision, and artificial intelligence applications.

OpenCV provides over 2,500 optimized algorithms for a wide range of tasks, including:

- Image processing (filtering, thresholding, transformations)

- Object detection (Haar cascades, YOLO, SSD)

- Feature extraction (SIFT, SURF, ORB)

- Video analysis (optical flow, motion tracking)

- Machine learning (classification, clustering, regression)

OpenCV supports multiple programming languages, including Python, C++, Java, and MATLAB. Its Python bindings make it particularly popular among researchers and developers [4].

OpenCV runs on various platforms, including Windows, Linux, macOS, Android, and iOS. It also supports hardware acceleration using GPUs with CUDA and OpenCL [4].

OpenCV integrates effectively with popular deep learning frameworks such as TensorFlow, PyTorch, and Caffe. Its DNN module enables the deployment of pre-trained neural networks, making it ideal for applications like face recognition and object classification.

- Face detection and recognition: OpenCV's pre-trained Haar Cascade and DNN modules are widely used for face detection and recognition.

- Augmented reality: OpenCV is used in AR applications to overlay virtual objects on real-world images by detecting and tracking key points [4].

- Autonomous driving: The library plays a critical role in lane detection, object detection, and traffic sign recognition [4].

- Medical imaging: OpenCV is used for analyzing medical images like X-rays and MRIs [4].

About Histogram of Oriented Gradients (HOG)

The Histogram of Oriented Gradients (HOG) is a prominent feature descriptor utilized in computer vision and image processing, primarily for object detection, including faces and pedestrians. Renowned for its effectiveness, HOG is extensively employed in feature extraction tasks, particularly those focused on human detection.

First, the image is converted to grayscale (if it is in color)

Gradients are computed in the horizontal (x) and vertical (y) directions using methods such as the Sobel operator, which identifies variations in pixel intensity across both axes.

The gradients are computed as $G_x = \partial I / \partial x$ and $G_y = \partial I / \partial y$, where $I$ is the intensity of the pixel [5].

The gradient magnitude $G$ and orientation $\theta$ are calculated at each pixel:

$G = \sqrt{G_x^2 + G_y^2}$, $\theta = \arctan(G_y / G_x)$

This gives information about how much change in intensity occurs at each point in the image and the direction of that change [5].
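For illustration, the same quantities can be computed with OpenCV's Sobel operator; this is a minimal sketch, assuming OpenCV is installed and using a placeholder image path:

```python
import cv2

# Load an image in grayscale (placeholder path)
gray = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)

# Gradients in the horizontal (x) and vertical (y) directions via the Sobel operator
gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)

# Per-pixel gradient magnitude G and orientation theta (in degrees)
magnitude, angle = cv2.cartToPolar(gx, gy, angleInDegrees=True)
```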

The image is divided into small cells (e.g., 8x8 pixels), and for each cell, the gradient orientations are grouped into a histogram. The bins in the histogram correspond to different gradient directions [5].

Gradient magnitudes are collected into specific bins according to their orientation. For instance, a gradient direction of 45° will contribute its magnitude to the appropriate bin, with bins typically covering angles from 0 to 180° or 0 to 360°.

After calculating the histograms for individual cells, the histograms are grouped into larger blocks (e.g., 2x2 cells). This block-based approach helps improve the robustness of the features against lighting changes [5].

Normalization is applied to the block histograms to reduce the effect of illumination changes and improve the invariance to local contrast

The final HOG feature vector is obtained by concatenating the normalized histograms from all blocks across the image. This feature vector represents the object’s shape and texture.

Training: HOG features can be used with a classifier (commonly a Support Vector Machine or SVM) to train a model for detecting specific objects, like faces or pedestrians [5]

The classifier is trained on a set of positive samples and negative samples by using the HOG features extracted from these images

Detection: After training, the classifier can be used to detect faces (or other objects) in unseen images by sliding a detection window over the image and extracting the HOG features from each window. The classifier then decides whether the window contains the object (e.g., a face) based on the HOG feature vector [5].
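As a concrete sketch of this sliding-window pipeline, OpenCV ships a HOG descriptor with a pre-trained linear SVM for pedestrian detection; the snippet below is illustrative only, and the image path is a placeholder:

```python
import cv2

image = cv2.imread("street.jpg")  # placeholder path

# HOG descriptor with OpenCV's pre-trained pedestrian SVM
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

# Slide a detection window over the image at multiple scales
boxes, weights = hog.detectMultiScale(image, winStride=(8, 8), scale=1.05)

for (x, y, w, h) in boxes:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detections.jpg", image)
```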

HOG (Histogram of Oriented Gradients) demonstrates resilience to lighting changes by utilizing gradient information instead of raw pixel values, which reduces its sensitivity to fluctuations in lighting conditions.

Simple and Efficient: HOG is computationally lightweight and easy to implement. It has proven to be highly effective in many object detection tasks, particularly in detecting pedestrians and faces [5].

HOG (Histogram of Oriented Gradients) is highly effective in representing the shapes and edge structures of objects, making it particularly valuable for face detection, where shape is a crucial feature.

HOG can face challenges in accurately detecting objects when there are changes in orientation, such as with a rotated or tilted face. These variations in pose can significantly impact the effectiveness of feature extraction, leading to decreased accuracy in object recognition.

Doesn't Handle Occlusions Well: If part of an object is obscured—like a face covered by glasses or a hand—HOG might miss important details, which can reduce detection accuracy [5].

HOG-based methods necessitate extensive and well-curated datasets for effective training, especially when detecting complex objects or operating in varied environments Additionally, careful tuning is essential to optimize their performance.

About Deep Learning and dlib's ResNet

Deep Learning, a subset of Machine Learning, utilizes artificial neural networks, specifically deep neural networks, to tackle complex challenges like image recognition, speech processing, and natural language understanding. By employing multi-layered architectures, it effectively extracts hierarchical features from data, enhancing its capability for pattern recognition.

Key characteristics of Deep Learning include:

Neural Networks: Deep learning models are based on artificial neural networks, which consist of multiple layers of interconnected nodes

Feature Extraction: Unlike traditional machine learning, where feature engineering is crucial, deep learning automatically extracts relevant features

Large-scale Data Training: Deep learning models require vast amounts of labeled data and computational power to achieve high accuracy

Backpropagation and Optimization: These models use backpropagation to update weights and optimize learning using algorithms like stochastic gradient descent (SGD) or Adam [6]

2. dlib's ResNet for Face Recognition: dlib is a popular machine-learning library that provides advanced computer vision tools, including a deep learning-based face recognition system built on ResNet (Residual Network).

ResNet, a deep convolutional neural network architecture created by Microsoft, addresses the vanishing gradient problem commonly found in deep networks. By incorporating skip (residual) connections, ResNet enables information to bypass certain layers, thereby enhancing gradient flow and ensuring greater training stability.

2.1 dlib's Implementation of ResNet: dlib's face recognition model is based on a modified ResNet-34, trained on a large dataset of facial images. The model maps facial features into 128-dimensional embeddings using a deep CNN, which enables efficient face matching and verification.

Feature Extraction: The ResNet model extracts deep facial features from input images

Face Encoding: It generates a 128-dimensional embedding that uniquely represents a face [6]

Face Matching: Faces are compared using Euclidean distance between their embeddings—smaller distances indicate higher similarity

Pre-trained Model: The dlib ResNet model is pre-trained and optimized for real-time face recognition applications
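A minimal sketch of this encode-and-compare flow with the face_recognition library (file names are placeholders, and each image is assumed to contain exactly one face):

```python
import face_recognition
import numpy as np

known = face_recognition.load_image_file("known_person.jpg")
candidate = face_recognition.load_image_file("candidate.jpg")

# 128-dimensional embeddings; [0] assumes one face per image
known_enc = face_recognition.face_encodings(known)[0]
candidate_enc = face_recognition.face_encodings(candidate)[0]

# Smaller Euclidean distance means higher similarity; 0.6 is a common threshold
distance = np.linalg.norm(known_enc - candidate_enc)
print("Match" if distance < 0.6 else "No match", distance)
```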

Typical applications include:

- Face Verification & Identification (e.g., security systems, biometric authentication)

- Facial Attribute Analysis (e.g., age and emotion recognition)

- Surveillance & Monitoring (e.g., automated face tracking in videos)

- AI-powered Smart Devices (e.g., smart home security)

DESIGN AND SYSTEM DEPLOYMENT

System design

The facial recognition and audio analytics security system provides real-time surveillance, monitoring, and alerting by integrating facial recognition technology, audio detection capabilities, and intelligent security alerts.

The system consists of three main modules: facial recognition, audio detection, and alerting/notification.

These modules interact with each other through a central processor (Raspberry Pi 4), which coordinates the data flow, processes the data, and sends alerts when necessary.

Below is a high-level block diagram of the system:

Objective: This module is responsible for identifying and recognizing faces in real time

Camera (Raspi Camera Rev 1.3): Records video input from the monitored area

Face Detection: Uses libraries such as OpenCV or face_recognition to detect faces in video frames [7]

Face Recognition API: After detecting a face, the system uses the face_recognition library to extract facial features and compare them to a database of known faces [7]

Face Matching: If the detected face matches an entry in the database, the system confirms the identity and takes appropriate action [7]

Face Database: A local database that stores images of authorized individuals. This allows the system to compare real-time captured video with stored faces for identification [7].

The captured image undergoes a face detection algorithm, followed by analysis through face recognition algorithms. When a match is found with a known identity, the system activates an action, such as sending a notification or alert.

Objective: This module captures and processes audio to detect specific events (e.g., loud noises)

Microphone Array (ReSpeaker): A high-quality microphone array is used to capture audio from the environment

Audio Classification: Audio data is analyzed to identify specific audio patterns (e.g., loud noises, alarms, or human voices) [8]

Audio Analysis: Using signal processing and machine learning techniques, the system analyzes audio patterns to find events that may indicate a security breach (glass breaking, door slamming, etc.) [8]

The microphone array captures audio, which is then processed and analyzed. If an event (e.g., a loud noise) is detected, the system triggers an alert.

Goal: The alert system generates notifications or takes actions based on the results from the Facial Recognition and Audio Detection modules

Generate alerts: Based on the results of the recognition and detection processes, the system generates alerts, which may include notifying the user [9]

Notification: Alerts may be sent to the user via email or via the Telegram application. The system may also display a live video feed or provide a summary of the event [9].

3.4 Data Flow and Process Flow

- The camera records real-time video

- The face detection algorithm processes the video frames and recognizes the faces

- The face_recognition library compares the detected faces to a stored database of authorized individuals

- If a match is found, the system identifies the person and takes appropriate action (e.g., allowing entry or sending a notification) [7]

- The microphone array continuously collects audio data from the environment

- The audio analysis system processes the collected audio for predefined patterns or unusual sounds

- If a significant audio event is detected (e.g., glass breaking or loud noise), the system analyzes the context and triggers an alert [8]

- When a face is successfully recognized or an alarm sound is detected, the system triggers an alert

- Alerts are sent to users via mobile or via Email

Authentication: Users must authenticate before accessing the system interface to ensure that only authorized persons can interact with the system

Access control: Only authorized faces and audio are processed, and the system ignores unknown or unclassified events [9].

Hardware implementation

The Smart Security System's hardware setup was meticulously crafted for reliability, scalability, and cost-effectiveness, effectively supporting essential computational tasks. Key components were selected to fulfill specific roles within the system, ensuring optimal performance and functionality.

Raspberry Pi 4 Model B (Primary Edge Device):

- Quad-core Cortex-A72 (ARM v8) 64-bit SoC @ 1.5GHz

- Dual-band Wi-Fi and Gigabit Ethernet

- CPU (Cortex-A72): Handles the computational load of facial recognition, audio processing, and alert generation in real-time

- RAM: Supports AI models and multi-threaded operations for video and audio processing

- USB Ports: Connects peripherals like cameras, microphones, and external storage

- GPIO Pins: Can be used to interface with external sensors (e.g., motion sensors) for extended functionality

- Wi-Fi Module: Enables communication with the cloud server and mobile applications

- Processes real-time video and audio streams

- Executes facial recognition and sound classification algorithms

- Communicates with the cloud server for data storage and alerts

- Cost-effective and compact for deployment in smart homes

- Supports Python-based AI libraries like TensorFlow Lite and OpenCV

- Low cost and energy-efficient

- Fully compatible with Python-based libraries like OpenCV, TensorFlow Lite, and PyTorch

- Compact design makes it suitable for discreet deployment

- Connector: converts from micro USB Type B to USB Type C

- Camera connector (CSI): this is where the camera plugs into the Raspberry Pi

- USB ports: this is where the microphone plugs into the Raspberry Pi

Figure 2.6 Raspberry Pi Ethernet and USB ports

- MicroSD card slot: the most important part of the Raspberry Pi; this is where the operating system is read from the microSD card

Raspberry pi Camera Rev 1.3 (Video Input):

- Captures video streams for facial recognition and activity monitoring

- High-definition video ensures accurate recognition

- Plug-and-play compatibility with Raspberry Pi

- Cheap but high-quality video ensures accurate recognition and detection

- Simple interface ensures compatibility with Raspberry Pi

ReSpeaker USB Microphone Array (Audio Input):

- 4-microphone array for 360° voice pickup

- Built-in noise reduction and echo cancellation

- Captures environmental audio for sound analysis

- Enhanced sound clarity in noisy environments

Local storage for video and audio data logs

- Provides consistent 5V/3A power to the Raspberry Pi

Software development

Insert the microSD card into the computer via a card reader

Select CHOOSE OS -> Raspberry Pi OS (Legacy, 64-bit) Lite

CHOOSE STORAGE -> select the microSD card

Settings -> Enable SSH (for remote control) -> Write the OS to the memory card

Once complete, the microSD card will be ready to use

2. Installing the OS on Raspberry Pi:

Install the microSD card in the Raspberry Pi:

- Insert the card into the microSD slot on the Raspberry Pi

- Connect the power supply. The Raspberry Pi will boot automatically

- Connect to Wi-Fi or Ethernet

Find the IP address of the Raspberry Pi:

- Open https://192.168.1.1 and log in to the router interface

- Find the IP address of the Raspberry Pi

Connect to the IP address of the Raspberry Pi:

Log in with the saved information:

3. Update the operating system and install necessary libraries:

Open a terminal and run:

sudo apt update
sudo apt upgrade -y

To get started with your project, ensure you install all essential libraries by running the following commands: `pip install opencv-python`, `pip install opencv-contrib-python`, `pip install face_recognition`, `pip install dlib`, `pip install numpy`, `pip install pillow`, `pip install SpeechRecognition`, `pip install pyaudio`, `pip install gTTS`, `pip install python-telegram-bot`, and `pip install requests`.

Purpose: OpenCV (Open Source Computer Vision Library) is a highly popular library used for computer vision tasks

It is essential for image processing, including:

Face detection: The library uses various algorithms (such as Haar Cascades) to detect faces in images or video streams

Image manipulation: It allows for basic image transformations, such as resizing, cropping, and changing color spaces

Camera access: OpenCV enables the use of webcams or other video capture devices, which is necessary for real-time face recognition applications

In this project, cv2 is used to:

Read and display video frames from the camera

Convert images from color (BGR) to grayscale or RGB

Draw bounding boxes around detected faces

Save images of detected faces

Purpose: The face_recognition library is a high-level library built on top of dlib that simplifies face recognition tasks. It uses machine learning models to:

Detect faces: Find the locations of faces in images or video

Encode faces: Convert faces into a vector representation (encoding) that can be compared with other face encodings for recognition

Compare faces: Match a detected face with a known database of encodings and identify the person.

face_recognition is used to:

Detect and encode faces from images captured by the camera

Compare these face encodings with pre-stored encodings to recognize people

Save the face encodings to be used for later identification

Purpose: The os library provides a way to interact with the operating system, specifically for file and directory operations. It allows:

Create directories: Ensures that folders are created if they do not already exist

Iterate through files and directories: Helps in traversing through the dataset folder to read images of different individuals

Manage file paths: Provides a cross-platform way to work with file paths.

os is used to:

Check and create directories for storing datasets (images of different people)

Iterate over the dataset folder to access images for training the model

Purpose: The pickle module is used for serializing (saving) and deserializing (loading) Python objects to and from files. It is commonly used for storing trained models, configurations, and any other Python objects.

pickle is used to:

Save the face encodings and labels (person names) into a file (trained_model.pkl) after training the model

Load the saved face encodings and labels during the recognition phase for identifying faces in real time

The Telegram library facilitates interaction with the Telegram Bot API, enabling programs to send and receive messages, manage bot commands, and engage with users on the Telegram platform.

Purpose: defaultdict is used to manage the email-sent flags

NumPy is an essential library for numerical computing in Python, offering robust support for large, multi-dimensional arrays and matrices. It includes a variety of mathematical functions to manipulate these arrays, making it particularly useful for tasks such as calculating the mean of face encodings.

Purpose: smtplib is a built-in Python module for sending emails using the Simple Mail Transfer Protocol (SMTP). It is used to send email notifications when a face is recognized.

Purpose: Making HTTP requests. It is used to send POST requests to the Telegram API for sending messages.

Purpose: speech_recognition is a library that provides easy-to-use interfaces for speech recognition, allowing the program to convert speech into text. It is used to detect voice commands and process them.

The threading module in Python's standard library enables the concurrent execution of multiple threads, allowing tasks such as speech recognition and face recognition to run in parallel without interrupting the main program.

Utilizing pre-built libraries such as cv2, face_recognition, and speech_recognition, along with APIs like Telegram and Gmail, enables the project to achieve its objectives effectively while optimizing time and resource constraints, making it an ideal choice for my system-development initiative.

To ensure that the device keeps the same IP address every time it connects to the network, edit the DHCP client configuration:

sudo nano /etc/dhcpcd.conf
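A typical static-IP stanza appended to /etc/dhcpcd.conf looks like the following; the interface name and all addresses are examples and must be adapted to the local network:

```
interface wlan0
static ip_address=192.168.1.50/24
static routers=192.168.1.1
static domain_name_servers=192.168.1.1 8.8.8.8
```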

Building software

1. Building a program for collecting data from authorized members:

cv2: OpenCV library, used for image processing, face detection, and camera operations.

os: Used to interact with the operating system, especially for file management (like creating directories).

The `create_directory` function is designed to create a dedicated folder within the dataset for storing images of a specific individual, such as Duc. It utilizes `os.makedirs(path)` to ensure that the directory is created only if it does not already exist.

This code asks for the person's name and stores it in person_name. It then calls create_directory to create a folder inside dataset/ where the images will be saved.

Initializes the camera using OpenCV. The argument 0 refers to the default camera.

Haar Cascade: This classifier is used for face detection. It is a pre-trained model in OpenCV that uses pattern recognition to detect faces in images. The method camera.read() captures a frame from the video feed, exiting the loop if the read fails. To enhance face detection, cv2.cvtColor converts the captured frame from BGR to grayscale, as detection is more effective on grayscale images. The detectMultiScale function identifies faces in the grayscale image, with the parameters 1.1 and 4 regulating the scale factor and the minimum number of neighbors for a valid detection. Finally, cv2.rectangle draws a rectangle around each detected face in the live video feed.

cv2.imwrite: Saves the detected face (cropped from the image) to the dataset folder

The images are saved with the format {person_name}_{count}.jpg

The loop will stop once 100 images have been captured for the person

Releases the camera and closes all OpenCV windows once the program is finished. Result:

Figure 2.24 Folders of collected images
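Putting the steps above together, a condensed sketch of the collection program might look like this (it mirrors the fragments shown in the appendix; the 100-image limit follows the description above):

```python
import os
import cv2

def create_directory(name):
    # Create dataset/<name> if it does not already exist
    path = f"dataset/{name}"
    os.makedirs(path, exist_ok=True)
    return path

person_name = input("Person name: ")
save_path = create_directory(person_name)

camera = cv2.VideoCapture(0)  # default camera
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

count = 0
while count < 100:  # stop once 100 face images are collected
    ok, frame = camera.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 4):
        count += 1
        cv2.imwrite(f"{save_path}/{person_name}_{count}.jpg", gray[y:y+h, x:x+w])

camera.release()
cv2.destroyAllWindows()
```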

2. Building a Training program for the Face Recognition Model:

face_recognition: Library for face detection and recognition.

cv2: For image processing.

os: For managing directories.

pickle: For saving and loading Python objects (used here to store the model).

dataset_path: The path where the images are stored (the dataset folder).

encodings: List to store the face encodings.

labels: List to store the labels (person names).

The `detect_and_crop_face` function processes an image by converting it to grayscale and utilizing Haar Cascade to detect faces. It subsequently crops the identified faces and returns them. The `cv2.CascadeClassifier` is employed to load the pre-trained Haar Cascade classifier essential for effective face detection.

Loops through each subdirectory in dataset/, where each subdirectory is named after a person

Loops through the files in each person's folder and processes only image files (JPEG, PNG, JPG)

Detects and crops faces from each image using the detect_and_crop_face function

Converts the cropped face to RGB and uses face_recognition.face_encodings to generate a unique encoding for the face

The encoding is then added to the encodings list, and the person’s name is added to the labels list

After processing all images, the face encodings and labels are saved into a trained_model.pkl file using pickle
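A simplified sketch of this training program follows; for brevity it encodes whole images directly rather than Haar-cropped faces, and the exact structure stored in trained_model.pkl is assumed:

```python
import os
import pickle
import cv2
import face_recognition

dataset_path = "dataset"
encodings, labels = [], []

for person in os.listdir(dataset_path):
    person_dir = os.path.join(dataset_path, person)
    if not os.path.isdir(person_dir):
        continue
    for filename in os.listdir(person_dir):
        if not filename.lower().endswith((".jpg", ".jpeg", ".png")):
            continue
        image = cv2.imread(os.path.join(person_dir, filename))
        rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        found = face_recognition.face_encodings(rgb)
        if found:  # keep only images in which a face was found
            encodings.append(found[0])
            labels.append(person)

# Persist the encodings and labels with pickle, as described above
with open("trained_model.pkl", "wb") as f:
    pickle.dump({"encodings": encodings, "labels": labels}, f)
```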

3. Building a program for Face Recognition System:

cv2: OpenCV library for image and video processing, used for capturing frames from the camera, displaying images, and drawing rectangles around faces.

face_recognition: A library for face recognition tasks. It is used here to extract face locations and encodings from images.

pickle: Used to load the pre-trained face recognition model (which contains face encodings and labels) from a file.

defaultdict, deque: From the collections module. defaultdict is used to handle missing keys automatically, and deque is a list-like container optimized for fast appends and pops.

mean: A function from the numpy library to compute the average encoding for each person.

The model is loaded from a file (trained_model.pkl) that contains face encodings and their corresponding labels (names)

The model was presumably created earlier by training the system on labeled images

The system calculates an average encoding for each individual based on all their images in the training dataset

For each label (person), it computes the mean of all their encodings to represent them as a single average encoding

The camera is set up to record real-time video using the default camera index, which is 0. It uses distance_buffers to maintain the distances between the current face encoding and the average encodings of individuals, with a maximum length that restricts storage to the last 10 values per person. Additionally, frame_counters track the number of consecutive frames in which a person is unrecognized or labeled as a "Stranger."

This loop continuously captures frames from the camera until the user presses 'q' to exit

If the frame can't be read from the camera, the loop will stop

The camera captures frames in BGR color format, which is what OpenCV uses. However, for face_recognition to function properly, the image must be converted to RGB format. The face_recognition.face_locations function is then used to detect the locations of faces within the image.

face_recognition.face_encodings: Extracts the encoding for each detected face

For each detected face, it compares the face encoding with the average encodings of known individuals

The face distance is computed, and the face with the minimum distance is considered the best match

If the best match's distance is below a threshold (0.6), the face is recognized as a known individual. Otherwise, it is considered unknown (displayed as "Không rõ", Vietnamese for "unknown").

If a recognized face's distance has been calculated for the last 10 frames, the system checks the average distance

If the average distance is below 0.4 (indicating a high match), the face is highlighted with a green rectangle, and the name is displayed

The counter for that individual is reset

If the system has not recognized a face in 10 consecutive frames, it labels the person as a "Stranger."

The face is highlighted with a red rectangle, and the label "Stranger" is displayed

The frame with the recognized faces and labels is displayed in a window titled "Face Recognition."

The loop continues until the user presses the 'q' key to quit

Releases the camera and closes any OpenCV windows after exiting the loop

Face Recognition: The system recognizes known faces from a pre-trained model and matches them with the average encoding for each person

Distance Buffering: To improve accuracy, the system uses a buffer of the last 10 face distances and validates recognition by averaging these values

Stranger Detection: If a face is not recognized for 10 consecutive frames, the system labels it as "Stranger."

Real-Time Display: The system continuously captures video from the camera and displays the frames with rectangles and labels for recognized faces
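The following is a condensed sketch of the recognition loop described above; the stored model format and a single shared stranger counter are simplifying assumptions:

```python
import pickle
from collections import defaultdict, deque
import cv2
import face_recognition
import numpy as np

with open("trained_model.pkl", "rb") as f:
    data = pickle.load(f)  # assumed to hold "encodings" and "labels"

# Average encoding per person
averages = {}
for label in set(data["labels"]):
    encs = [e for e, l in zip(data["encodings"], data["labels"]) if l == label]
    averages[label] = np.mean(encs, axis=0)

distance_buffers = defaultdict(lambda: deque(maxlen=10))  # last 10 distances
stranger_frames = 0  # consecutive frames with an unrecognized face

camera = cv2.VideoCapture(0)
while True:
    ok, frame = camera.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    locations = face_recognition.face_locations(rgb)
    for (top, right, bottom, left), enc in zip(
            locations, face_recognition.face_encodings(rgb, locations)):
        names = list(averages)
        dists = [np.linalg.norm(averages[n] - enc) for n in names]
        best = int(np.argmin(dists))
        if dists[best] < 0.6:
            name = names[best]
            distance_buffers[name].append(dists[best])
            stranger_frames = 0
            # Confirm only when the 10-frame average distance is low enough
            if (len(distance_buffers[name]) == 10
                    and np.mean(distance_buffers[name]) < 0.4):
                cv2.rectangle(frame, (left, top), (right, bottom), (0, 255, 0), 2)
                cv2.putText(frame, name, (left, top - 8),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
        else:
            stranger_frames += 1
            if stranger_frames >= 10:
                cv2.rectangle(frame, (left, top), (right, bottom), (0, 0, 255), 2)
                cv2.putText(frame, "Stranger", (left, top - 8),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)
    cv2.imshow("Face Recognition", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

camera.release()
cv2.destroyAllWindows()
```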

4. Building a program for speech recognition and email sending:

speech_recognition: Used to recognize speech from the microphone.

smtplib: Used to send emails through the Gmail SMTP server.

email.mime.text and email.mime.multipart: Used to create and format the email body and attachments.

The send_email function allows users to send an email by specifying a subject and body, along with the sender's and recipient's email addresses. To use this function, an app password generated for the sender's Gmail account is necessary, particularly for accounts with 2-step verification enabled.

MIMEText and MIMEMultipart: These classes are used to create the structure of the email, including the body and subject

The function connects to Gmail's SMTP server (smtp.gmail.com) to authenticate the sender's credentials and send the email, handling any errors that may occur during the process. Additionally, the recognize_and_check_keyword function monitors speech to detect a specific keyword. The speech recognizer is initialized with `sr.Recognizer()`, and the microphone is activated to capture audio input.

recognizer.adjust_for_ambient_noise(source): Adjusts the recognizer's sensitivity to ambient noise.

audio = recognizer.listen(source): Captures the audio from the microphone.

Speech Recognition: The audio is processed and converted into text using Google's speech recognition service.

Keyword Check: It checks whether the keyword "hello" is present in the recognized text.

If the keyword is found, it triggers the send_email function to send an email with the detected keyword and recognized speech

If the recognition fails, an error message is printed

This while loop continuously calls the recognize_and_check_keyword function, allowing the program to always listen for the keyword and send an email whenever it's detected
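A compact sketch of this listen-and-notify loop (all addresses and the app password are placeholders; the keyword and Gmail SMTP settings follow the description above):

```python
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import speech_recognition as sr

SENDER = "sender@gmail.com"        # placeholder
RECEIVER = "receiver@example.com"  # placeholder
APP_PASSWORD = "app-password"      # Gmail app password (2-step verification)

def send_email(subject, body):
    msg = MIMEMultipart()
    msg["From"], msg["To"], msg["Subject"] = SENDER, RECEIVER, subject
    msg.attach(MIMEText(body, "plain"))
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(SENDER, APP_PASSWORD)
        server.send_message(msg)

def recognize_and_check_keyword():
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        text = recognizer.recognize_google(audio)
        if "hello" in text.lower():
            send_email("Keyword detected", f"Recognized speech: {text}")
    except sr.UnknownValueError:
        print("Could not understand the audio")

while True:
    recognize_and_check_keyword()
```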

5. Building a program for User interface (Telegram bot):

Figure 2.32 System model for the Telegram bot

telegram and telegram.ext are used to interact with the Telegram Bot API:

• Update represents an incoming update (message, command, etc.) from the user

• CallbackContext provides the context for the handler function, such as the bot and other relevant data

• Application is used to initialize and run the bot

• CommandHandler is used to handle specific commands sent by users (like /start, /collect, etc.)

The /start command is triggered when the user starts a conversation with the bot or sends /start

The start function sends a welcome message to the user, providing a brief explanation of the available commands:

• /collect for collecting face data

• /train for training the model

• /voice for listening to the user and sending an email

• The /train command starts the training process for the face recognition model

• It first sends a message to inform the user that the model is being trained

• The bot then calls the train_model() function (defined earlier) to process the collected face data and train the face recognition model

• Once the training is complete, the bot sends a confirmation message that the model has been successfully trained

The /recognize command starts the face recognition process

• The chat_id and bot_token are extracted from the update and context These are used to send messages and handle further interactions with the user

• The bot sends a message to inform the user that face recognition is starting

• The function handle_face_recognition(chat_id, bot_token) is called to begin recognizing faces

• async def voice(update: Update, context: CallbackContext):: This defines the voice command that can be triggered by a user in the chat

The function recognize_and_check_keyword(update, context) is designed to process voice input by listening for specific keywords in spoken messages, such as "gửi email," which activates the email sending feature.
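A minimal sketch of the bot wiring with python-telegram-bot, assuming the v20+ async API; the token is a placeholder, and train_model is assumed to be the training function described earlier:

```python
from telegram import Update
from telegram.ext import Application, CallbackContext, CommandHandler

BOT_TOKEN = "123456:PLACEHOLDER"

async def start(update: Update, context: CallbackContext):
    await update.message.reply_text(
        "Commands: /collect - collect face data, /train - train the model, "
        "/recognize - start face recognition, /voice - listen and send email")

async def train(update: Update, context: CallbackContext):
    await update.message.reply_text("Training the model...")
    train_model()  # assumed: the training routine shown earlier
    await update.message.reply_text("Model trained successfully.")

app = Application.builder().token(BOT_TOKEN).build()
app.add_handler(CommandHandler("start", start))
app.add_handler(CommandHandler("train", train))
app.run_polling()
```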

The “Smart” of the system

The Face Recognition and Sound Analysis Security System is designed to enhance security and improve efficiency through automation. Its intelligent features include advanced facial recognition technology and sound analysis capabilities, making it a cutting-edge solution for modern security needs.

- Uses Deep Learning models (CNNs, HOG, Dlib, OpenCV) to recognize faces accurately

- Can distinguish between authorized and unauthorized individuals in real time

- Works under varying lighting conditions and with partial occlusions

- Supports continuous learning, improving accuracy over time

Uses Machine Learning and Signal Processing (RNNs, spectrogram analysis) to recognize abnormal sounds like:

- Glass breaking (forced entry detection)

- Detects patterns in sound to differentiate between normal household noises and threats

- Sends real-time alerts to users via Telegram, email

- Differentiates between normal events and potential security threats

- Supports multi-factor authentication, combining facial recognition with voice commands

- Supports role-based access control so only authorized users can modify settings

This advanced security system enhances traditional surveillance by intelligently detecting, learning from, and responding to security events in real time. With its automated features, it offers a proactive, efficient, and user-friendly solution for enhanced safety.

FINAL PRODUCT

Table 1. Action commands

Number | Action | Expected result | Real result
1 | /collect | The system performs collecting a new face | 10/10 times
2 | /train | The system begins to train on the collected images | 10/10 times
3 | /recognize | The system performs recognizing real-time faces | 10/10 times
4 | | The system performs sending an email | 10/10 times
5 | | The model performs connecting to the Telegram bot | 10/10 times
6 | /voice | The model performs listening to the user's voice | 10/10 times

Below, my system is compared with existing solutions in terms of accuracy, cost, and scalability across different technologies such as cloud-based AI services; a detailed breakdown follows [10]:

Table 2. Accuracy comparison

Feature | My system | Cloud-Based AI
Facial Detection | High (~95%-99%) | Very High (~99%)
Facial Recognition | High (~95%-99%) | Very High (~99%)
Variation handling | Handles variations well | Excellent adaptability
Occlusion Handling | Moderate | Best (deep learning models trained on diverse datasets)

Table 3. Cost comparison

Feature | My system | Cloud-Based AI
Hardware Cost | Moderate (requires GPU for training) |
Implementation Cost | Moderate (pre-trained models available) | High (pay-per-use API calls)
Maintenance Cost | Moderate | High (ongoing cloud fees)
Scalability Cost | High (GPU clusters required) | Low (auto-scaling with cloud)

Table 4. Scalability comparison

Feature | My system | Cloud-Based AI
Ease of Scaling | Moderate (more GPUs required) | Very Easy (auto-scaling in cloud)
Real-time Performance | Good on small datasets | High-speed response
Infrastructure Needs | Local processing only | Fully managed cloud setup
Request handling | Good | Best (handles millions of requests)

My system is an affordable and lightweight solution, perfect for offline use and small-scale deployments. Nevertheless, it lacks the accuracy and scalability offered by cloud-based alternatives.

FUTURE ENHANCEMENTS

Future enhancements

The existing Face Recognition and Sound Analysis Security System offers a strong and reliable solution; however, there are numerous opportunities for enhancements and new features that could significantly improve its performance, scalability, and user experience.

The system can be integrated with a variety of IoT devices to enhance automation and security. These devices can include [11]:

Smart Locks: Automatically lock or unlock doors based on facial recognition or authorized sound events

Smart Lighting: Trigger lighting systems when a face is recognized or when a suspicious sound is detected, improving visibility in monitored areas

Cameras: Integration with additional surveillance cameras or cloud-based cameras can provide enhanced coverage for monitoring larger areas

The system can become more intelligent by adding automation features:

Automated Action Triggers: For example, the system could automatically lock doors, turn on lights, or activate an alarm when suspicious activity is detected, without needing manual intervention [12]

User behavior learning leverages machine learning algorithms to analyze occupant patterns, enabling the system to adapt security measures accordingly. For instance, if the system identifies that a family member consistently returns home at a specific time, it can modify security protocols to align with that individual's arrival, enhancing overall safety and convenience.

By implementing edge computing (on the local device, such as a Raspberry Pi or other hardware), the system can:

Reduce Latency: Immediate actions such as locking doors or sounding alarms can be taken without relying on the cloud, thus reducing response time [13]

Lower Bandwidth Usage: Processing the video and sound data locally reduces the need for continuous streaming to the cloud, lowering bandwidth consumption and costs [13]

4. Improved Facial Recognition with Deep Learning

While the current face recognition model is efficient, using more advanced deep learning techniques can improve the system's accuracy:

Deep Learning Models: Training deeper convolutional neural networks (CNNs) or using more complex architectures such as ResNet could provide higher accuracy and robustness in face detection

3D Facial Recognition: Integrating 3D face recognition models can improve recognition accuracy, especially in varied lighting conditions or when faces are partially obscured

As privacy and data security are paramount in systems that process sensitive information, additional security measures could be implemented:

End-to-End Encryption: Ensuring that all data (including video and audio) is encrypted from the point of capture to storage and transmission

Data masking and anonymization are essential for users prioritizing privacy, as they enable the system to protect stored facial data by encoding images. Instead of retaining the actual images, only the encoded features are preserved, ensuring that personal information remains secure.

Ensuring GDPR compliance is essential for any system, as it upholds privacy regulations like the General Data Protection Regulation in Europe. This compliance empowers users by granting them control over their personal data and the ability to request the deletion of their information from the system when desired.

Incorporating voice-activated controls could allow users to interact with the system more intuitively Commands like:

“Show me the last detected face.”

These voice commands could be processed through a natural language processing (NLP) system, allowing hands-free interaction

7. Integration with External Security Systems

To expand the functionality, the system can integrate with existing security systems [14]:

• Integration with Alarm Systems: Trigger alarms in response to face recognition failures or sound detection of a potential break-in

• Monitoring Services: The system could notify a professional security monitoring service if a breach is detected

Limitations

While the system offers many advantages, there are some challenges and limitations to consider:

Face recognition systems often face challenges in low-light environments or when a person's face is partially covered, such as by masks, hats, or sunglasses. To improve performance in these conditions, utilizing infrared cameras or implementing adaptive lighting in the deployment areas can be effective solutions.

2. False Positives and False Negatives

All security systems face the challenge of false positives, which incorrectly identify a person or sound as a threat, and false negatives, which fail to detect genuine threats. To minimize these errors, it is essential to continuously enhance the accuracy of the face recognition and sound detection algorithms. Implementing regular system updates and retraining models can significantly reduce these issues.

Environmental noise, including background chatter, music, and traffic, can disrupt sound detection accuracy. Implementing advanced noise cancellation and filtering techniques can enhance the precision of sound classification.

Concerns about privacy often arise with the use of facial recognition technology. To address these issues, it is crucial to educate users about the functionality of the system, implement robust data protection measures, and offer users the ability to manage and delete their data as needed.

The system aims to be both cost-effective and scalable; however, hardware limitations of devices such as the Raspberry Pi can impact real-time performance, particularly in extensive setups with numerous cameras and microphones. To enhance performance, utilizing more powerful edge devices or distributing processing tasks across multiple devices is recommended.

Development direction

The system delivered satisfactory functionality; however, there are key areas for improvement, including enhancements in functionality, performance optimizations, and the introduction of additional features.

1. Accuracy Improvement in Face Recognition:

Increasing the size of the dataset for training face encodings can greatly enhance recognition accuracy. This improvement can be realized by gathering more images of each individual in various conditions, such as different lighting, angles, and facial expressions.

Advanced face recognition algorithms, particularly those utilizing deep learning techniques such as Convolutional Neural Networks (CNNs) or pre-trained models like FaceNet and VGGFace, can significantly enhance accuracy and effectively manage complex scenarios.

Keyword Expansion: The system can be extended to recognize multiple keywords or phrases, allowing more versatile control and interaction, for example, detecting "security alert," "help," or "lock door."

Noise Filtering: Implementing noise cancellation algorithms or using more sophisticated speech recognition APIs can improve the system’s performance in noisy environments [18]

Language Support: Currently, the system supports Vietnamese ("vi-VN"). Adding support for more languages would make the system more adaptable for international users.

Utilizing multi-threading or parallel processing techniques can significantly enhance the efficiency of face and audio recognition tasks, allowing them to operate simultaneously without performance degradation. This approach is particularly beneficial for devices with limited hardware capabilities, such as the Raspberry Pi, ensuring a smoother and more responsive user experience.

Mobile App Integration: Creating a mobile application to manage and monitor the security system could offer an intuitive user interface for controlling settings and receiving alerts on the go [18]

A Graphical User Interface (GUI) enables users to easily configure and manage the system through a desktop or web-based platform, providing features such as person management, adjustable threshold settings, and access to alert history.

Appendix

This appendix presents the implementation of the face recognition system using Python libraries such as OpenCV, face_recognition, and others. It outlines the integration of Telegram bot functionality through the Telegram API, utilizing command handlers for user interaction. The system also incorporates email notifications using smtplib and MIME for sending messages. Additionally, it relies on speech recognition and sound processing libraries such as SpeechRecognition and librosa for audio input handling, and on threading for running the monitoring processes efficiently.

# Create the directory if it does not already exist
def create_directory(name):
    path = f"dataset/{name}"
    if not os.path.exists(path):
        os.makedirs(path)
    return path

The function `detect_and_crop_face(image_path)` utilizes OpenCV to read an image and convert it to grayscale. It employs a pre-trained Haar cascade classifier to detect faces within the image. The detected faces are then cropped from the original image and stored in a list. Finally, the function returns the list of cropped faces for further processing.

To collect data using a camera, the function `collect_data(person_name)` is implemented, which starts by creating a directory for the specified individual. It initializes the camera and loads a pre-trained face detection model. The process begins by capturing images, converting them to grayscale, and detecting faces in the frames. The operation continues until the camera fails to read a frame, ensuring a comprehensive collection of images for the specified person.

for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)
    count += 1
    cv2.imwrite(f"{save_path}/{person_name}_{count}.jpg", gray[y:y+h, x:x+w])
    # cv2.imshow("Face", frame)
if count >= 50:
    print(f"Collected all 50 images for {person_name}.")
    break
if cv2.waitKey(1) & 0xFF == ord('q'):
    break
camera.release()
cv2.destroyAllWindows()

To train a facial recognition model, the function `train_model()` initializes by setting the dataset path to "dataset" and creating empty lists for encodings and labels. It then iterates through each person's directory within the dataset, processing their data while ensuring that only directories are considered. For each individual, the function prints a message indicating that their data is being processed, followed by iterating through the files in their respective folder.

The script processes image files by checking whether the filename ends with a supported image format (jpg, jpeg, or png). If the file is not an image, it skips the file and prints a message. For valid images, it attempts to detect and crop faces. If no faces are found, it logs a message and continues to the next file. For each detected face, the script converts the image to RGB format and encodes the face. Successful encodings are appended to a list along with the corresponding labels, and a success message is printed for each encoded face. If encoding fails, an error message is displayed. Finally, the trained model, containing the encodings and labels, is saved to a file named "trained_model.pkl", with a confirmation message upon successful saving.

# Extract MFCC features from the audio
def extract_features(audio_data, fs=44100):
    # n_mfcc value assumed; the original value was lost in extraction
    mfcc = librosa.feature.mfcc(y=audio_data.flatten(), sr=fs, n_mfcc=13)
    mfcc_mean = np.mean(mfcc, axis=1)
    return mfcc_mean

# Predict the sound class
def predict_sound(model, audio_data, fs=44100):
    mfcc_features = extract_features(audio_data, fs)
    prediction = model.predict([mfcc_features])
    return prediction

The `send_email` function is designed to send an automated email notification upon face recognition. It takes parameters such as the user's name, chat ID, bot token, email-sent flags, and an optional email content message. The function sets up the sender and receiver email addresses and constructs a message with a subject line indicating it is from a bot. If the user's name has not been flagged as having received an email, the function attempts to send the email using Gmail's SMTP server with SSL encryption. Upon successful sending, it confirms that the email was sent successfully.

A confirmation that the notification email was sent to {receiver_email} for the recognized person {name} is then transmitted through Telegram using the specified chat ID and bot token.

The code then marks the email as sent to the specified recipient using `email_sent_flags.add(name)`. If an exception occurs during the email sending process, it captures the error and prints a message indicating the failure, while also sending a notification via Telegram with the error details. If the email was previously sent, it notifies that an email has already been sent to the recipient.

The function `recognize_and_check_keyword` uses the SpeechRecognition library to capture and process voice input. It initializes a speech recognizer and a microphone, adjusts for ambient noise, and then listens for audio. Once the audio is captured, it attempts to recognize the speech with Google's speech recognition service in Vietnamese. The recognized text is printed, and the function checks it for specific keywords such as "open door," "turn off the light," and "send email."

When a keyword is found in the transcribed text, the system notifies the user that it was recognized. If the keyword "send email" is detected, the user is prompted to dictate the email content, which is then transcribed and sent to the specified recipient using the bot's token. If voice recognition fails, an error message is printed indicating either that the speech could not be understood or that the voice recognition service could not be reached.
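A hedged sketch of `recognize_and_check_keyword` under these assumptions is shown below; the exact Vietnamese keyword strings and the wiring to the helper functions are not shown in the source, so they are illustrative only.

import speech_recognition as sr

# Vietnamese keywords (assumed): "open door", "turn off the light", "send email"
KEYWORDS = ["mở cửa", "tắt đèn", "gửi email"]

def recognize_and_check_keyword(chat_id, bot_token, email_sent_flags):
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)   # calibrate to background noise
        print("Listening for a command...")
        audio = recognizer.listen(source)
    try:
        text = recognizer.recognize_google(audio, language="vi-VN")
        print(f"Recognized: {text}")
        for keyword in KEYWORDS:
            if keyword in text.lower():
                print(f"Keyword detected: {keyword}")
                if keyword == "gửi email":
                    content = recognize_email_content()   # dictate the message body
                    send_email("user", chat_id, bot_token, email_sent_flags, content)
    except sr.UnknownValueError:
        print("Could not understand the speech.")
    except sr.RequestError as e:
        print(f"Could not reach the speech recognition service: {e}")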

The function `recognize_email_content` uses the SpeechRecognition library to capture and transcribe the email body from voice input. It initializes a recognizer and a microphone, adjusts for ambient noise, and listens for audio. It then attempts to recognize the speech in Vietnamese using Google's speech recognition service. On success it prints and returns the transcribed email content; if the audio is unclear or a request error occurs, it returns an appropriate error message, as in the fragment below:

    except sr.RequestError as e:
        print(f"Could not reach the speech recognition service: {e}")
        return "Error while recognizing email content."
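For completeness, the whole function could be sketched as follows, consistent with the description above; the fragment just shown is its error-handling tail, and the prompt strings are illustrative.

import speech_recognition as sr

def recognize_email_content():
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        print("Dictate the email content...")
        audio = recognizer.listen(source)
    try:
        content = recognizer.recognize_google(audio, language="vi-VN")
        print(f"Email content: {content}")
        return content
    except sr.UnknownValueError:
        print("Could not understand the email content.")
        return "Error while recognizing email content."
    except sr.RequestError as e:
        print(f"Could not reach the speech recognition service: {e}")
        return "Error while recognizing email content."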

To send an audio alert via email, the `send_email_alert` function takes the alert type, sender's email, receiver's email, and password. It builds a multipart email message, setting the sender, receiver, and the subject "Audio Alert", with a body notifying the recipient of the detected audio type. Using `smtplib.SMTP_SSL`, the function logs in to the Gmail server securely and sends the email. On success it confirms that the email was sent; otherwise it catches and prints any error encountered.
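A minimal sketch of `send_email_alert` following this description (the exact body text is an assumption):

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def send_email_alert(alert_type, sender_email, receiver_email, password):
    message = MIMEMultipart()
    message["From"] = sender_email
    message["To"] = receiver_email
    message["Subject"] = "Audio Alert"
    message.attach(MIMEText(f"Alert: {alert_type} detected by the system.", "plain"))

    try:
        with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
            server.login(sender_email, password)
            server.sendmail(sender_email, receiver_email, message.as_string())
        print("Alert email sent successfully.")
    except Exception as e:
        print(f"Error sending alert email: {e}")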

The system monitors audio and sends alerts when it detects the sound of breaking glass or a door slamming. It uses the trained model to analyze recordings made at a sampling rate of 44,100 Hz, each three seconds long. The process starts with a message indicating that audio recording and monitoring are in progress, and then runs continuously in a loop for real-time detection and alerting.
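The monitoring function itself is not listed in full, so the following setup is a sketch: the function name and the 60-second `duration` default are assumptions, while `fs` and `record_duration` follow the values stated above. The loop body is the sequence of fragments listed below.

import time
import sounddevice as sd

def monitor_audio(model, chat_id, bot_token, duration=60):
    fs = 44100               # sampling rate (Hz), as stated above
    record_duration = 3      # seconds per recording
    start_time = time.time()
    print("Recording and monitoring audio...")
    while True:
        ...                  # loop body: see the fragments below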

        # Stop once the monitoring window has elapsed
        if time.time() - start_time > duration:
            print("Audio monitoring finished.")
            send_telegram_message(chat_id, "Audio monitoring finished.", bot_token)
            break

        # Record from the microphone
        audio_data = sd.rec(int(record_duration * fs), samplerate=fs,
                            channels=1, dtype='float32')
        sd.wait()   # wait until the recording is finished

        # Classify the recorded sound
        prediction = predict_sound(model, audio_data, fs)

The system then checks for specific sounds. If it detects the sound of breaking glass, it prints "Glass breaking detected!" and sends an email alert; if it detects a knocking sound, it prints "Knocking detected!" and likewise sends an email alert. After each detection cycle, the system pauses for one second before recording again.
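As a sketch, the branch this paragraph describes could be factored into a helper like the one below; the helper name and the label strings `glass_breaking` and `knocking` are assumptions and must match the classes the classifier was trained on. Inside the monitoring loop it would be invoked as `handle_prediction(prediction[0], sender_email, receiver_email, password)`.

import time

def handle_prediction(label, sender_email, receiver_email, password):
    # Label strings are assumptions; match them to the classifier's training classes
    if label == "glass_breaking":
        print("Glass breaking detected!")
        send_email_alert("glass breaking", sender_email, receiver_email, password)
    elif label == "knocking":
        print("Knocking detected!")
        send_email_alert("knocking", sender_email, receiver_email, password)
    time.sleep(1)            # pause one second before the next recording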

import requests

# Send a notification via Telegram
def send_telegram_message(chat_id, message, bot_token):
    try:
        url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
        payload = {"chat_id": chat_id, "text": message}
        response = requests.post(url, json=payload)
        if response.status_code == 200:
            print("Telegram notification sent successfully.")
        else:
            print(f"Could not send Telegram notification. Error code: {response.status_code}")
    except requests.exceptions.RequestException as e:   # network errors
        print(f"Error sending Telegram notification: {e}")

Conclusion

In conclusion, this project effectively meets its three main objectives: the Facial Recognition System, Audio Analysis, and Smart Security Alert System, all powered by Raspberry Pi. By integrating computer vision, speech processing, and alert mechanisms, it delivers a comprehensive and intelligent security solution. The utilization of embedded systems like Raspberry Pi ensures that the project remains scalable and cost-effective.

The advancement of smart security systems significantly enhances safety in both private and public spaces. By integrating technologies such as facial recognition, audio analysis, and real-time alerts, these systems provide a proactive security solution that outperforms traditional methods. Additionally, the incorporation of IoT technology increases efficiency and scalability, making them suitable for diverse environments, including homes, offices, and public areas.

Smart security systems offer user-friendly interfaces and robust security features, enabling users to manage and monitor their safety efficiently. As technology evolves, these systems are emerging as essential solutions for real-time, intelligent security, significantly enhancing safety and providing peace of mind. Consequently, they play a crucial role in shaping the future of security technology.

