OVERVIEW
INTRODUCTION
This article discusses the development of a waking behavior monitoring system for babies aged 6 months and older, focusing on how various factors, such as sleep processes and positions, can impact their sleep. Babies often change positions during sleep, which can lead to waking up or even falling out of their crib. To address this issue, the implementation group has designed a system that monitors the baby's movements and sends real-time notifications to parents' phones when the baby wakes up or moves out of a designated area. While there are various monitoring solutions available, each utilizing different algorithms, the primary goal remains the same: to ensure accurate real-time updates on the baby's behavior. This system aims to integrate multiple characteristics of the baby's behavior for enhanced monitoring and safety.
To determine if a baby is asleep or awake, the system calculates the Eye Aspect Ratio (EAR) using 12 points on the baby's eyes. Next, it employs the MediaPipe library to assess the baby's movement through marked skeleton points. Users can define a monitored area, and the system utilizes an algorithm to ensure the baby remains within this designated space. If any of these conditions are met, the system sends a photo and a direct notification to the parent via Telegram.
PROJECT OBJECTIVES
Analyzing and collecting data on infant sleep behavior is a crucial initial step in developing effective solutions for parents and caregivers, ensuring high applicability and accuracy.
Design and implement an LSTM network model tailored for the monitoring system, adhering to specified requirements. Utilize the MediaPipe library for selective point calculations on the skeleton and employ the Eye Aspect Ratio (EAR) to detect the baby's alertness.
The last goal is to use Telegram to enable the transmission and reception of notification data to the parent's phone.
A summary of the system design and implementation process includes the following steps:
1. Collect data about the actions the baby performs while sleeping and waking up (crawling, moving arms/legs, rolling). From there, select the actions and body parts that tend to change between sleeping and waking.
2. Select solutions (neural networks and hardware) that can identify and calculate actions and body parts in preparation for training and program execution.
3. Develop the training model and apply image processing algorithms (MediaPipe/OpenCV/Shapely.Geometry) to test and run the system demo in software.
4. Test and optimize based on test sets to ensure stable execution, high accuracy, and limited errors during execution.
RESEARCH METHODOLOGY
This article discusses the design and implementation of a baby monitoring system that utilizes behavior and skeletal analysis during sleep. It references existing research and theoretical foundations, highlighting similar systems identified in previous scientific studies and research group topics.
The assessment of a baby's parameters during awakening relies on algorithms that identify key landmarks on both eyes, while also integrating skeletal recognition to analyze behaviors and movements throughout the sleep monitoring process.
The system's ability to simultaneously recognize and monitor three baby behaviors is one of its key benefits:
- The baby waking up from sleep
- The baby moving and displaying indications of getting out of bed
- The baby leaving the parentally selected monitoring range
The results will be compared with those of similar systems to enhance and overcome the shortcomings of other approaches.
THESIS OUTLINE
The project will consist of 5 main chapters, details of each chapter include:
CHAPTER 1 - OVERVIEW: The issue, solutions, and the goals and scope of the research are briefly introduced in this chapter.
CHAPTER 2 - BACKGROUND: In this chapter, the implementation group discusses the theory of neural networks (RNN, LSTM), the Python programming language, and the libraries used in the project: MediaPipe (support for landmarks on the skeleton) and OpenCV (computer vision).
CHAPTER 3 - SYSTEM DESIGN: In this chapter, the analysis of the block diagram of the system, the solutions proposed by the team, and the details of the functional components of each block are presented.
CHAPTER 4 - EXPERIMENTAL RESULTS: This chapter covers deploying the system on the hardware, presenting the execution results, and building a complete system model to provide an evaluation in all aspects.
CHAPTER 5 - CONCLUSION: This chapter presents the results achieved after completing the system, and from there gives directions to develop and expand the application of the system in the future.
BACKGROUND
INTRODUCTION TO DEEP LEARNING
New technologies in computer science are rapidly evolving, particularly in the realm of Artificial Intelligence (AI). Machine Learning and Deep Learning have become integral to modern society, enabling computers to perform complex tasks that are challenging for humans. These advancements allow for the identification of numerous objects in images, as well as speech and text recognition, enhancing human-computer interaction.
Deep Learning, a significant subset of Machine Learning, encompasses a vast array of computationally intensive techniques and methods that are utilized to address various complex problems.
Figure 2.1: The training process between Machine Learning and Deep Learning
Deep learning discussions often highlight the significance of Recurrent Neural Networks (RNNs) in addressing sequence-related challenges. Traditional neural network architectures typically consist of three main components: the input layer, hidden layer, and output layer. This structured division leads to a key limitation: the inputs and outputs of conventional neural networks are independent of one another. As a result, the traditional neural network is not suitable for sequence or time-series problems, where subsequent predictions depend on the data and context of previous steps.
Figure 2.2: Traditional Neural Network model
The RNN network model was developed to address the problem by utilizing internal loops that enable memory to retain information from previous computational steps, facilitating accurate predictions for the current step.
Recurrent neural networks (RNNs) are dynamic systems characterized by internal states that evolve with each classification time step. The presence of circular connections between neurons in upper and lower layers, along with optional self-feedback connections, facilitates this process. These feedback links enable RNNs to transmit information from past events to current processing stages, allowing them to effectively remember time-series occurrences.
Recurrent Neural Networks consist of hidden layers formed by recurrent cells that utilize feedback connections, allowing their states to be influenced by both past and present inputs. The arrangement of these recurrent layers can vary, leading to the creation of different types of RNNs, which are primarily distinguished by their network architecture and recurrent cell design. This diversity in cell types and internal connections gives RNNs a wide range of capabilities.
Figure 2.3: The structure of a Recurrent Neural Network with loops
A recurrent neural network A consists of an input \(X_t\) and an output \(H_t\), featuring a loop that enables the transmission of information across different steps of the network. This looping mechanism can be viewed as a chain of interconnected copies of the network, which allows information to be memorized and passed along.
Figure 2.4: Equivalent unrolled representation of a Recurrent Neural Network
The above model describes the implementation and calculation inside the RNN neural network:
The inputs \(X_{0,1,2,\ldots,t}\) represent one-hot vectors at each step from 0 to \(t\). The hidden state \(A_t\) at step \(t\) serves as the network's memory and is calculated from the previous hidden state \(A_{t-1}\) together with the current input at that step.
The function \(f\) is typically a nonlinear function, such as the hyperbolic tangent (tanh) or ReLU. The output at the \(t\)-th position, denoted \(h_t\), is a probability vector derived from the states memorized so far and is used to estimate the network's next output; this relationship is expressed mathematically as \(h_t = \text{softmax}(V A_t)\).
The softmax function is a normalized exponential function that calculates the probability of a class among all possible classes. This probability is essential for determining the target class for a given input.
The softmax function converts k-dimensional vectors of any real values into k-dimensional vectors that sum to 1 Regardless of whether the input values are positive, negative, zero, or exceed 1, the softmax function consistently maps them to a range between 0 and 1.
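As a small illustration of the formulas above, the following NumPy sketch computes one RNN step and applies softmax to obtain the output probabilities. The weight matrices U, W, V and the toy dimensions are arbitrary placeholders chosen for the example, not parameters of the actual monitoring system.

```python
import numpy as np

def softmax(z):
    """Map a real-valued vector to probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

def rnn_step(x_t, a_prev, U, W, V):
    """One RNN time step: A_t = tanh(U x_t + W A_{t-1}), h_t = softmax(V A_t)."""
    a_t = np.tanh(U @ x_t + W @ a_prev)   # hidden state (the network's "memory")
    h_t = softmax(V @ a_t)                # output probability vector
    return a_t, h_t

# Toy dimensions: 4-dimensional one-hot input, 8 hidden units, 3 output classes
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(3, 8))
x_t = np.array([0.0, 1.0, 0.0, 0.0])   # one-hot input at step t
a_prev = np.zeros(8)                   # initial hidden state
a_t, h_t = rnn_step(x_t, a_prev, U, W, V)
print(h_t, h_t.sum())                  # probabilities summing to 1
```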
Depending on the number of inputs and outputs of each problem, an appropriate RNN configuration can be chosen for the training process.
Figure 2.5: Types of issues in RNN
- One to one: the simplest case, for problems with 1 input and 1 output, often seen in standard Neural Networks (NN) and Convolutional Neural Networks (CNN)
- One to many: the problem will have one input but many outputs
- Many to one: the opposite of case One to many, used in problems with many inputs but only 1 output
- Many to many: for problems with many inputs and outputs
RNNs are now used in deep learning to solve issues involving sequence data or time-series data.
LONG SHORT-TERM MEMORY NETWORK (LSTM)
Recurrent neural networks (RNNs) are widely used in fields that involve sequential data, such as text, audio, and video. However, traditional RNNs with sigmoid or tanh cells struggle to capture relevant information when there are significant input gaps. Long short-term memory (LSTM) networks address this issue by incorporating gate functions into their cell structure, effectively managing long-term dependencies.
The Long Short-Term Memory (LSTM) network is a specialized type of Recurrent Neural Network (RNN) designed to address the limitations of traditional RNNs. While RNNs theoretically transmit information across layers, they struggle to retain information over long sequences due to the vanishing gradient problem, which restricts their learning to only nearby states (short-term memory). To overcome this challenge, the LSTM architecture was developed, enabling the network to effectively learn and remember information over extended periods.
Introduced in 1997 by Hochreiter & Schmidhuber, Long Short-Term Memory (LSTM) networks have significantly evolved and gained popularity in machine learning and deep learning. LSTM addresses the long-term dependency issue inherent in traditional Recurrent Neural Networks (RNNs): remembering information for long periods is practically its default behavior rather than something it struggles to learn, setting it apart from conventional neural network architectures.
Figure 2.6: In a typical RNN, the repeating module just has one layer
LSTM networks, like other recurrent architectures, feature a chain-like structure; however, they differ from standard RNNs by incorporating multiple layer types instead of just one (tanh). In an LSTM, four interconnected layers work together to enhance the network's performance.
Figure 2.7: An LSTM has a repeating module with four interconnected layers
Tanh Function
The Tanh function is a commonly used activation function in deep learning that transforms real number inputs into values ranging from -1 to 1. Unlike the sigmoid function, the Tanh function is symmetric around zero, which helps address the saturation issues present in the sigmoid function. This symmetry enhances its performance in various neural network applications.
Figure 2.8: Chart of the Tanh function
The LSTM network architecture utilizes the cell state to facilitate information transmission across its nodes, effectively addressing the short-term memory limitations of traditional RNNs. This is achieved through the unique property of the cell state, which allows information to be stored and carried throughout the process without alteration.
Figure 2.9: Details of cell state structure
The Long Short-Term Memory (LSTM) network features the ability to selectively manage information within its cells, allowing data to be added or discarded. This process is facilitated by gates that screen the information passing through; each gate consists of a Sigmoid network layer that produces outputs within the range of (0, 1). A value of 0 signifies that no information is transmitted, while a value of 1 indicates complete transmission. Typically, an LSTM architecture includes three such gates to maintain and operate the cells.
Figure 2.10: Gate structure consists of a sigmoid layer and a multiplication
The sigmoid function transforms real number inputs into values within the range (0, 1), effectively representing probabilities. For small negative inputs, the output approaches zero, while large positive inputs yield outputs close to one. In these saturated regions, input variations lead to only minimal changes in the output, resulting in a stable and continuous response.
The vanishing gradient problem is a significant drawback of the saturated sigmoid function, making it less common in modern neural networks. This issue arises when the magnitude of the input values is large, leading to gradients that approach zero. Consequently, the weights associated with the affected units receive minimal updates, hindering the learning process.
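As a small illustration of the saturation behavior described above, the sketch below evaluates the sigmoid and tanh functions and their derivatives at a few sample inputs; the sample values are arbitrary and chosen only to show how the gradients vanish for large inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
sig, th = sigmoid(xs), np.tanh(xs)
# Derivatives: sigmoid'(x) = s(1 - s), tanh'(x) = 1 - tanh(x)^2
d_sig, d_th = sig * (1 - sig), 1 - th ** 2
for x, s, ds, t, dt in zip(xs, sig, d_sig, th, d_th):
    print(f"x={x:6.1f}  sigmoid={s:.4f} (grad {ds:.4f})  tanh={t:+.4f} (grad {dt:.4f})")
# For |x| around 10 both gradients are close to zero: the vanishing gradient effect.
```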
Figure 2.11: Chart of the Sigmoid function
2.2.2 Detailed structure of the LSTM memory cell
The initial phase involves selectively removing data from the LSTM cell's internal state at the previous time step, \(C_{t-1}\). During this phase, the forget gate's activation value \(f_t\) is computed from the current input value \(x_t\), the output value \(h_{t-1}\) from the preceding LSTM cell, and the forget gate's bias \(b_f\): \(f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\). The sigmoid function transforms these activation values into results that range between 0 and 1.
Figure 2.12: Detailed structure of the forget gate in the first step
In the second step of the LSTM cell's operation, the cell determines which information to store in the internal state \(C_t\). This involves two key calculations: the input gate \(i_t\) and the candidate cell state \(\tilde{C}_t\). The input gate, represented by the sigmoid function, selects the values to be updated, while the candidate cell state is computed using the hyperbolic tangent function. The equations governing these processes are \(i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\) and \(\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)\).
The Tanh layer generates a vector of candidate values \(\tilde{C}_t\), which is then added to the state. This process combines the values to deliver the latest update for the cell state.
Figure 2.13: Input gate layer in the second step
In the third step, the previous cell state \(C_{t-1}\) is updated to a new state \(C_t\) based on the calculation results obtained previously.
Figure 2.14: The process of updating \(C_t\) for the internal cell at the third step
In this process, we apply Hadamard (element-wise) multiplication to the previous state \(C_{t-1}\) and the forget gate \(f_t\) to eliminate the information we intend to discard. We then add the scaled candidate value \(i_t * \tilde{C}_t\) to form the new state: \(C_t = f_t * C_{t-1} + i_t * \tilde{C}_t\).
The final output value of the cell is calculated through a series of screening processes. Initially, a sigmoid layer is employed to decide which parts of the cell state will be output: \(o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\). Subsequently, the cell state \(C_t\) is passed through a tanh function to constrain its values within the range (-1, 1). Ultimately, this constrained value is multiplied by the output of the sigmoid layer to obtain the desired output value, \(h_t = o_t * \tanh(C_t)\).
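To tie the four gate equations together, the following NumPy sketch performs one LSTM cell step. The weight matrices, biases, and toy sizes are placeholders for illustration only, not the weights learned by the monitoring system.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the equations in Section 2.2.2.
    W and b hold the parameters of the four layers: f (forget), i (input),
    c (candidate) and o (output)."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])     # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat         # new cell state
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state / output
    return h_t, c_t

# Toy sizes: 3-dimensional input, 5 hidden units
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(5, 8)) for k in "fico"}
b = {k: np.zeros(5) for k in "fico"}
h, c = np.zeros(5), np.zeros(5)
h, c = lstm_step(rng.normal(size=3), h, c, W, b)
print(h.shape, c.shape)
```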
Figure 2.15: The last step to filter information in the output
2.2.3 Types of advanced LSTM models
The LSTM network model described above represents the fundamental LSTM architecture. However, numerous upgraded versions have emerged, designed to enhance training efficiency and improve performance.
Figure 2.16: Types of advanced LSTM models
Among the LSTM network variants that were introduced later, type (1), proposed by Gers & Schmidhuber, adds pathways (peephole connections) at the interconnected gates; this helps the gate layers receive the cell state as an additional input value.
OpenCV
Computer vision, the field focused on enabling computers to analyze and understand images and videos, opens up exciting opportunities across technology, engineering, and entertainment. Solving even small challenges in this domain can lead to significant advancements. To further vision research and share knowledge effectively, a library of efficient, portable programming functions, ideally available for free, is crucial.
OpenCV, or Open Source Computer Vision Library, was launched in 1999 with the goal of advancing computer vision technology, primarily by an Intel team. Since then, numerous programmers have contributed to its development, leading to significant updates. A major release, OpenCV 2, occurred in 2009, focusing mainly on enhancing the C++ interface. For the most recent version of the library, users can visit the official OpenCV website.
The package features nearly 2,500 optimized algorithms and has achieved over 2.5 million downloads with more than 40,000 users globally. OpenCV, licensed under BSD, is applicable for both commercial and academic use. To fully grasp the OpenCV library, consulting various books on the subject is recommended. However, this paper provides a foundational understanding of OpenCV, making it easier to delve into more detailed resources. The material presented here is intentionally aligned with recent OpenCV sources to enhance convenience for readers.
In terms of the features offered by OpenCV, it can be divided into the following groups:
- Object detection (objdetect, features2d, nonfree)
- Image/video I/O, processing, and display (core, imgproc, highgui)
OpenCV is organized into five main sections: the CV component, which encompasses advanced computer vision and fundamental image processing techniques; the ML section, which features a variety of statistical classifiers and clustering tools; HighGUI, responsible for input/output capabilities for video and photo management; and CXCore, which provides essential content and data structures.
Figure 2.18: The basic structures of OpenCV
In addition, the structure of OpenCV is divided into modules; in other words, it includes a set of static or shared libraries.
Some popular modules are now supported in OpenCV:
- Core functionality (core): a compact module used to define basic data structures, including the multidimensional array Mat and basic functions used by all other modules
- Image Processing (imgproc): essential for enhancing images through various techniques, including linear and non-linear image filtering, geometric transformations such as resizing and perspective adjustments, color space conversions, and histogram generation (a short usage sketch follows this list)
- Video Analysis (video): a module used to analyze video for motion estimation, background subtraction, and other algorithms depending on the problem
- 2D Features Framework (features2d): a module for detecting salient features, computing descriptors, and matching them, used to retrieve parameters for recognition
- Object Detection (objdetect): detects objects and instances of predefined classes (people, animals, vehicles, ...)
- Video I/O (videoio): easy-to-use interface for video capture and encoding.
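As a small illustration of how the core, imgproc, highgui, and videoio modules listed above fit together, the following sketch reads frames from a camera, resizes them, and converts them to grayscale; the camera index 0 and the window name are assumptions for the example.

```python
import cv2

cap = cv2.VideoCapture(0)                 # videoio: open the default camera (index 0 assumed)
while cap.isOpened():
    ok, frame = cap.read()                # core: each frame is an array (cv::Mat)
    if not ok:
        break
    small = cv2.resize(frame, (640, 360))             # imgproc: geometric transformation
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)     # imgproc: color space conversion
    cv2.imshow("preview", gray)                        # highgui: display the result
    if cv2.waitKey(1) & 0xFF == ord("q"):              # quit on 'q'
        break
cap.release()
cv2.destroyAllWindows()
```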
MediaPipe
MediaPipe is a versatile framework designed for building pipelines that perform inference operations on various types of sensory data It enables the integration of model inference, media processing algorithms, data transformations, and other modular components to create a comprehensive perception system.
A MediaPipe pipeline processes sensory information, such as audio and video streams, that enters the network, and in turn outputs perceptual information, such as face landmark streams and object localization streams. An illustration of this process can be found in Figure 2.19.
Figure 2.19: MediaPipe is used for object detection
MediaPipe is designed for ML practitioners, including researchers, students, and software developers, who implement production-ready machine learning applications, publish supporting code, and develop technological prototypes. Its primary purpose is to facilitate the rapid development of perception pipelines using reusable inference models and components. Additionally, MediaPipe supports the effective deployment of perception technologies, ensuring that the processing stages operate efficiently on target devices.
MediaPipe addresses these challenges by abstracting and integrating multiple perception models into reliable pipelines. Its architecture encompasses all necessary instructions to analyze sensory input and generate perceived outcomes. With a consistent interface centered on time-series data, MediaPipe components work seamlessly together.
Practitioners can easily repurpose applications across different projects by utilizing pipelines that maintain consistent behavior across various platforms. This allows for the development of applications on desktops, which can then be seamlessly deployed on mobile devices.
Some of MediaPipe's popular solutions currently available include:
- Object detection by Google Lens
MediaPipe stands out from other frameworks by utilizing fewer system resources and less energy, making it highly efficient for embedded systems and IoT devices with limited capacity. It supports GPU compute and rendering nodes that can be integrated with other GPU- and CPU-based nodes. While MediaPipe does not offer a single cross-API GPU abstraction, it allows individual nodes to be programmed with various APIs, such as OpenGL ES, Metal, and Vulkan, to leverage platform-specific functionalities. This GPU support ensures that GPU nodes maintain efficiency while benefiting from the same encapsulation and composability as CPU nodes.
Many of the key challenges in Computer Vision are addressed by Google's MediaPipe, which offers open-source pre-built examples utilizing specific pre-trained TensorFlow or TFLite models as effective solutions.
In the realm of computer vision, challenges in facial recognition persist. MediaPipe offers an effective identification solution by analyzing photos or videos of human faces, highlighting key features, and determining the location of the bounding box. Utilizing the BlazeFace network as its foundation, MediaPipe Face Detection implements modified backbones and replaces the non-maximum suppression algorithm to enhance processing speed.
Figure 2.20: Face detection with MediaPipe
2.4.2.2 Face Mesh
A common application of perception is landmark estimation, as illustrated by a MediaPipe graph that segments portraits and performs facial landmark detection. To reduce the computational load of executing both tasks at once, one effective approach is to apply them to two separate groups of frames. This can be easily achieved using a demultiplexing node in MediaPipe, which separates the input stream's packets into interleaving subsets and outputs each subset as a distinct stream.
Figure 2.21: Output from landmark detection and segmentation
MediaPipe offers a Face Mesh solution that identifies a series of 468 points on the face, creating a comprehensive mesh rather than just a bounding box as in Face Detection. This mesh facilitates 3D face image editing, 3D alignment, and anti-spoofing tasks, all achievable with a single live camera.
Figure 2.22: The Face Mesh created by 468 Landmark points on the face
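For reference, a minimal Face Mesh sketch with the MediaPipe Python API looks roughly like the following; the image path is a placeholder and the printed landmark index is only an example of one of the 468 points.

```python
import cv2
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh

# Static-image mode; "baby.jpg" is a placeholder path, not a file from the dataset.
with mp_face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1) as face_mesh:
    image = cv2.imread("baby.jpg")
    results = face_mesh.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if results.multi_face_landmarks:
        landmarks = results.multi_face_landmarks[0].landmark
        print(len(landmarks))                     # 468 normalized (x, y, z) landmarks
        print(landmarks[33].x, landmarks[33].y)   # one example point near an eye corner
```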
Hand recognition is another key feature of MediaPipe's skeleton models, providing users with a visual representation of the hand skeleton by marking and connecting specific points.
In the hand detection output, MediaPipe accurately positions 21 3D landmark coordinate points on the knuckles inside the detected hand area.
Human Pose Estimation, an advancement of the Hands Detection solution, offers a comprehensive 3D skeleton model of the human body by defining and connecting joint points to accurately simulate the human skeleton.
MediaPipe utilizes the BlazePose framework to identify and display 33 3D landmarks throughout the human body, including the face, as illustrated in the figure below.
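A minimal sketch of extracting the 33 pose landmarks with the MediaPipe Python API is shown below; the camera index and the landmark drawing are assumptions added for illustration.

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)   # default camera assumed
with mp_pose.Pose(min_detection_confidence=0.5, min_tracking_confidence=0.5) as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks:
            # 33 landmarks, each with normalized x, y, z and a visibility score
            nose = results.pose_landmarks.landmark[mp_pose.PoseLandmark.NOSE]
            print(f"nose at ({nose.x:.2f}, {nose.y:.2f}), visibility {nose.visibility:.2f}")
            mp_drawing.draw_landmarks(frame, results.pose_landmarks, mp_pose.POSE_CONNECTIONS)
        cv2.imshow("pose", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
```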
Botogram (Telegram bot framework)
With the help of the Python framework botogram, developers can concentrate solely on building Telegram bots without having to worry about the underlying Bots API.
Botogram stands out among Telegram libraries by prioritizing a robust development environment and providing an exceptional API. Unlike many libraries that merely wrap the Bots API, botogram handles most implementation details, allowing developers to concentrate solely on building their bots.
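A minimal botogram sketch of the kind of alert bot used here is given below; the API token placeholder, the command name, and the message text are illustrative assumptions rather than the team's actual bot.

```python
import botogram

bot = botogram.create("YOUR-TELEGRAM-BOT-API-KEY")   # placeholder token from @BotFather

@bot.command("status")
def status_command(chat, message, args):
    """Reply with the latest monitoring status."""
    chat.send("Baby status: sleeping")               # illustrative message only

if __name__ == "__main__":
    bot.run()
```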
HARDWARE DESIGN
The system is designed to monitor a baby's wakefulness and movement from bedtime to waking up, which imposes specific technical requirements. It must be compact and user-friendly, with a surveillance camera positioned to capture the entire area around the baby. Additionally, the system should ensure stability and accurate real-time identification to promptly notify parents. While these specifications are approximate and not standardized, relevant and suitable parameters were selected for the design. The system diagram is illustrated in Figure 3.1.
Figure 3.1: System diagram
3.1.1 Central Processing Block
The system uses an embedded NVIDIA Jetson Nano computer as the central processing unit. The central processor must be able to execute the behavioral recognition model on parameters derived from the video, integrating them with the identification model. The hub processes and transmits signals to both the display and output blocks. At the core of this system is the NVIDIA Jetson Nano, featuring a 64-bit ARM quad-core CPU, a 128-core NVIDIA GPU, and 4 GB of memory. This setup enables the NVIDIA Jetson Nano to run multiple neural networks simultaneously for object detection applications while consuming less than 5 watts of power.
The NVIDIA Jetson Nano is backed by NVIDIA JetPack, which provides essential board support packages (BSP) and powerful software libraries like CUDA, cuDNN, and TensorRT for deep learning, computer vision, GPU computing, and multimedia processing. Additionally, the SDK facilitates the installation of popular machine learning frameworks such as TensorFlow, PyTorch, Keras, and MXNet, enabling rapid development and integration of AI models.
The Jetson Nano features a Maxwell GPU with 128 CUDA cores, setting it apart from other minicomputers such as the Raspberry Pi. Additionally, it is equipped with a quad-core ARM A57 CPU running at 1.43 GHz and 4 GB of RAM, enhancing its performance capabilities.
Figure 3.2: Pinout of Jetson Nano
1) micro SD card slot for main storage
3) Micro-USB port for 5V power input or data
8) DC Barrel jack for 5V power input
The board features PCIe and USB 3.0 ports, a 64-bit quad-core ARM Cortex-A57 CPU, and 4 GB of RAM. It is capable of encoding at 4K 30 fps and decoding at 4K 60 fps.
The Jetson Nano is equipped with a powerful quad-core 64-bit ARM CPU and a 128-core integrated NVIDIA GPU, along with 4 GB of LPDDR4 memory, providing 472 GFLOPS for rapid execution of modern AI algorithms. This capability allows it to manage multiple high-resolution sensors and run several neural networks simultaneously.
The Jetson Nano boasts impressive processing power for video applications, enabling it to handle multiple video streams for tasks such as object identification, tracking, and obstacle avoidance. While it is not intended for 4K video playback, it is capable of decoding up to 8 video streams or cameras at Full HD 30 frames per second, in addition to 4K at 60 frames per second. For the purpose of tracking objects, machine learning algorithms can run while these streams are decoded concurrently.
The black 4MP USB web camera serves as the system's input component, designed for monitoring, video recording, and photography during an infant's sleep. With support for 2.0 megapixels/1080p and a compact size of 61.5 x 90 x 45.6 mm, it offers a maximum resolution of Full HD 1920 x 1080 at 15 FPS and can handle various lower resolutions effectively.
The black 4MP USB web camera is compatible with Windows, macOS, and Linux, featuring a built-in 4MP CMOS sensor that enhances performance, delivering clear and bright image quality.
Figure 3.3: Black 4MP USB Web Camera
In addition, a mouse and keyboard are connected directly to the system so that users can interact with it and calibrate the monitoring area via the display screen.
The warning block will perform two key functions: it will send notifications directly to parents through Telegram and activate audible alarms on the system's speakers, ensuring that parents receive alerts even when their phones are not in use.
The Jetson Nano's built-in USB port connects to a Logitech Z121 speaker, a 2W stereo speaker designed for easy installation with a long connection cord. This output block is responsible for playing notification sounds to alert parents of any changes in their baby's behavior or movement. Additionally, the system is powered directly by a 5V-4A adapter connected to the Jetson Nano.
To enhance system monitoring and control, the output is connected to a separate display. In this setup, the Jetson Nano interfaces with a Glowy 19-inch (GL19) computer monitor through its HDMI connector. This display features a rapid 5 ms response time, supports up to 16.7 million colors, and has a standard resolution of 1600x900 on a PLS panel, in a compact design that ensures optimal image quality and frame integrity.
Figure 3.5: The Glowy 19-inch computer screen
Overview of system connections between input, output and central blocks
Figure 3.6: Overview of connected hardware devices.
SOFTWARE DESIGN
The sleeping baby monitoring model was designed with three key requirements: detecting when the baby wakes up, monitoring baby movements, and identifying when the baby gets out of bed. The model is developed using the Python programming language and is implemented on a Jetson Nano 4GB embedded computer running the Ubuntu operating system.
3.2.1 The overview of the software system
The model utilizes a camera to capture each frame, which is then processed to calculate essential parameters for predicting model outcomes and delivering notification data through three distinct methods. The software processing flow of the system is illustrated in Figure 3.7.
Figure 3.7: Block diagram of software system
The image processing and parameter calculation block is crucial in the sleeping baby monitoring model, as it utilizes the MediaPipe support library to detect the baby's skeleton and calculate the Eye Aspect Ratio.
The comparison and decision block utilizes the previously calculated data from the processing and parameter calculation block to evaluate it against the trained database using the LSTM network, ultimately leading to informed conclusions.
The notification block has three main functions: display on the screen, send messages via the Telegram application, and announce through the speaker.
The implementation group has identified a suitable dataset comprising 15 videos, each ranging from 20 to 50 seconds in length, featuring babies aged 6 to 8 months.
Figure 3.8: Illustration of the video in the dataset
3.2.3 Flowchart of data collection algorithm
The sleeping baby monitoring model requires the collection of video clips of the baby both during sleep and wakefulness to effectively detect when the baby wakes up. Additionally, the team gathers video footage of the baby lying still and moving to meet the second requirement of monitoring the baby's movements. The data collection process for sleeping baby detection is illustrated in Figure 3.9 below.
Figure 3.9: Flowchart of data collection for baby wake-up detection
First, choose the number of frames needed for training. The implementation team decided to use 2,000 image frames divided equally between the two classes "Sleep" and "Wake up".
After opening the camera, the face is processed by the facemesh model of the MediaPipe library, and the team extracts the coordinates of 12 key points out of the 468 points of the facemesh model. Figure 3.10 below depicts these 12 points:
Figure 3.10: The image depicts 12 selected points in the model
The system calculates the Eye Aspect Ratio (EAR) using the coordinates of the 12 newly acquired eye points. The ratio is determined from six specific points around each eye, labeled P1 through P6 in a clockwise direction starting from the left corner of the eye. The formula for calculating the EAR is provided below.
Figure 3.11: Feature points of the eyes [17]
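The exact expression is not reproduced in the extracted text; the standard Eye Aspect Ratio formula from [17], using the six points \(P_1\) through \(P_6\) of one eye, is:

\[
\mathrm{EAR} = \frac{\lVert P_2 - P_6 \rVert + \lVert P_3 - P_5 \rVert}{2\,\lVert P_1 - P_4 \rVert}
\]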
The system exports the EAR parameters calculated for each frame into a CSV file. Once the specified number of parameters has been collected, the program concludes the data retrieval process.
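A sketch of the per-frame EAR computation and CSV export is given below. The six landmark indices per eye follow commonly used MediaPipe Face Mesh indices and the output file name is arbitrary; the team's exact 12 selected points are not reproduced in the text, so these values are assumptions.

```python
import csv
import math

# Commonly used MediaPipe Face Mesh indices for the 6 points of each eye (assumed).
LEFT_EYE = [33, 160, 158, 133, 153, 144]
RIGHT_EYE = [362, 385, 387, 263, 373, 380]

def _dist(a, b):
    return math.hypot(a.x - b.x, a.y - b.y)

def eye_aspect_ratio(landmarks, idx):
    """EAR = (|P2-P6| + |P3-P5|) / (2 |P1-P4|) for one eye."""
    p = [landmarks[i] for i in idx]
    return (_dist(p[1], p[5]) + _dist(p[2], p[4])) / (2.0 * _dist(p[0], p[3]))

def append_ear_row(landmarks, label, path="ear_dataset.csv"):
    """Average the two eyes' EAR and append one labelled row per frame."""
    ear = (eye_aspect_ratio(landmarks, LEFT_EYE) +
           eye_aspect_ratio(landmarks, RIGHT_EYE)) / 2.0
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([ear, label])   # label: "Sleep" or "Wake up"
    return ear
```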
Below is the data collection flowchart of baby body motion detection:
Figure 3.12: Flowchart of getting data of body motion detection
To begin training, select the required number of frames, with the implementation team opting for a total of 2000 images—1000 frames designated for the "Body Moving" class and 1000 frames for the "No Moving" class.
After opening the camera, the system utilizes the MediaPipe library's pose model to identify 33 body points, which are then saved to a CSV file.
Finally, once enough frames have been obtained for training, the program finishes retrieving the data.
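A short sketch of flattening the 33 pose landmarks into one CSV row per frame follows; the column layout and file name are illustrative assumptions.

```python
import csv

def append_pose_row(pose_landmarks, label, path="pose_dataset.csv"):
    """Flatten the 33 MediaPipe pose landmarks (x, y, z, visibility) into one row."""
    row = []
    for lm in pose_landmarks.landmark:   # 33 landmarks -> 132 values
        row.extend([lm.x, lm.y, lm.z, lm.visibility])
    row.append(label)                    # "Body Moving" or "No Moving"
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)
```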
After collecting and categorizing data during the retrieval step, the implementation team labels each input dataset. They then train a model using an LSTM network to detect waking babies and body movements. Upon completion of the training process, the trained weights and network data are saved in a file with the ".h5" extension.
Figure 3.13: Flowchart of the training algorithm
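A minimal Keras training sketch consistent with the flow described above is shown here; the layer sizes, the 30-frame window, and the file names are illustrative assumptions rather than the team's exact configuration.

```python
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

TIME_STEPS = 30   # assumed window length, consistent with the time step described later

def make_windows(values, labels, steps=TIME_STEPS):
    """Group consecutive frames into (steps, n_features) training windows."""
    X, y = [], []
    for i in range(len(values) - steps):
        X.append(values[i:i + steps])
        y.append(labels[i + steps - 1])
    return np.array(X), np.array(y)

data = pd.read_csv("ear_dataset.csv", header=None)   # one EAR value + label per frame
values = data.iloc[:, :-1].to_numpy(dtype="float32")
labels = (data.iloc[:, -1] == "Wake up").astype("float32").to_numpy()
X, y = make_windows(values, labels)

model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(TIME_STEPS, X.shape[2])),
    LSTM(32),
    Dense(1, activation="sigmoid"),   # binary output: Sleep vs Wake up
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=20, batch_size=32, validation_split=0.2)
model.save("wakeup_model.h5")         # weights saved with the ".h5" extension
```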
The sleeping baby monitoring model incorporates three key features: wake-up detection, movement detection, and outside detection. The accompanying flowchart illustrates the wake-up detection algorithm utilizing the LSTM network model. After extensive testing, the team determined a time step of 30 for optimal performance. The system continuously calculates the EAR, and every 30 frames the output is compared with the trained weight file to generate the prediction results.
Figure 3.14: Flowchart of the algorithm to detect the baby waking up
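A sketch of the 30-frame sliding-window prediction loop described above follows; it assumes the model file saved earlier and a caller that supplies the per-frame EAR value, with all names being illustrative.

```python
from collections import deque
import numpy as np
from tensorflow.keras.models import load_model

TIME_STEPS = 30
model = load_model("wakeup_model.h5")   # trained weight file (assumed name)
window = deque(maxlen=TIME_STEPS)       # rolling buffer of the last 30 EAR values

def update_and_predict(ear_value):
    """Push one frame's EAR; return 'Wake up'/'Sleep' once 30 frames are buffered."""
    window.append([ear_value])
    if len(window) < TIME_STEPS:
        return None                      # not enough frames yet
    x = np.array(window, dtype="float32")[np.newaxis, ...]   # shape (1, 30, 1)
    prob = float(model.predict(x, verbose=0)[0][0])
    return "Wake up" if prob > 0.5 else "Sleep"
```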
The baby motion detection function utilizes a second LSTM network that analyzes the coordinates of the 33 key body points. These coordinates are compared against a trained weight file to generate accurate predictions. The process is visually represented in the baby motion detection flowchart, illustrated in Figure 3.15.
Figure 3.15: Flowchart of algorithm to detect moving baby
The model's third function is to detect when the baby leaves the monitored area, as illustrated in Figure 3.16. The system creates a polygon that outlines the monitored boundary, and the implementation team identifies five key points on the body, the nose, left/right hand, and left/right foot, to use for this function. If at least one of these points lies outside the polygon, the system emits the message "OUTSIDE".
Figure 3.16: Flowchart of algorithm to detect baby outside
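A sketch of the boundary check using shapely.geometry is given below, assuming the five tracked points arrive as pixel coordinates and the polygon corners come from the parent's calibration; the coordinates shown are illustrative values only.

```python
from shapely.geometry import Point, Polygon

def is_outside(monitored_area_corners, tracked_points):
    """Return True if at least one tracked point lies outside the monitored polygon.

    monitored_area_corners: list of (x, y) corners selected by the parent.
    tracked_points: dict of the five points (nose, hands, feet) as (x, y) pixels.
    """
    area = Polygon(monitored_area_corners)
    return any(not area.contains(Point(x, y)) for x, y in tracked_points.values())

# Illustrative values only
area = [(100, 100), (540, 100), (540, 380), (100, 380)]
points = {"nose": (320, 200), "left_hand": (560, 250), "right_hand": (300, 260),
          "left_foot": (330, 340), "right_foot": (310, 350)}
if is_outside(area, points):
    print("OUTSIDE")   # message emitted by the system
```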
RESULTS AND DISCUSSIONS
Results of the practical model
The sleeping baby monitoring system is illustrated in the accompanying figure This system features a camera, keyboard, and mouse as input devices, while the output devices consist of a speaker and monitor At its core, the system is powered by the Jetson Nano embedded computer.
System results and evaluation
In order to obtain the most objective results and evaluate the performance of the system, the research team conducted a practical model test on the three functions of the system.
The team proposed using a team member to simulate a baby's behavior for real-time checks, as they lacked an actual baby for testing. This approach does not compromise the model's predictions, which primarily depend on the activity of the object being monitored. The model's "wake-up detection" utilizes EAR calculations for accurate predictions, illustrated in figures (a) and (b). Figures (c) and (d) showcase the predicted output for "Moving baby detection," while figures (e) and (f) represent the "Detecting a baby outside" functionality.
Figure 4.2: The test results in three cases: (a-b) wake up, (c-d) moving, and (e-f) outside
The system alerts parents' phones when it detects one of three specific cases, providing updates on the baby's status This feature allows caretakers to avoid the need for constant supervision of the baby.
Figure 4.3: The notification results sent to the user's phone through the Telegram application
Utilizing two parallel LSTM networks simultaneously promotes a simple, compact, and scalable system architecture due to the lightweight nature of LSTM networks. However, the system does experience some false positives, as the training data series has yet to be fully optimized.
Baby wake-up detection reached an accuracy of 94.5% over 200 real-time tests. The "Sleep" label showed no errors, as the eye ratio is nearly zero during sleep, allowing for accurate detection. However, the "Wake up" label experienced some inaccuracies due to noise from blinking.
Detecting baby movements achieves an accuracy of 93.5%; however, inaccuracies arise due to the unpredictable nature of a baby's limb movements. Additionally, the training data utilized may not be fully optimized, contributing to these errors.
The tables below describe the accuracy of the system in detecting the baby waking up and detecting the baby moving, respectively.
Table 4.1: The table describes the accuracy of "Baby wake-up detection"
Table 4.2: The table describes the accuracy of "Moving baby detection"
To provide a more objective view of the topic, below is a comparison table with previously published "sleeping baby monitoring systems".
Table 4.3: Comparison table with previous models [15]
Live Video Yes Yes Yes Yes Yes Yes
Boundary No No No Yes Yes Yes
No No No Yes Yes No
No No Yes No No No
No No No No Yes No
No No Yes No No Yes
No No No No No Yes
The DXR-8 Video Baby Monitor is an award-winning non-Wi-Fi device that features interchangeable lens technology, allowing users to switch between zoom, normal, and wide-angle lenses for clear day and night monitoring. It also includes two-way audio capabilities, enhancing communication with the baby. Priced at $165.99, less than many competitors, the DXR-8 offers a cost-effective solution for parents. However, it primarily focuses on live video monitoring, which limits its ability to recognize and track a baby's activities and behaviors.
The Nanit device is a popular baby monitoring system that integrates various features to track sleep patterns and breathing movements using the Breathing Band. It offers 1080p HD video quality and is accessible through the Nanit App, allowing parents to monitor their baby's sleep in real time. The technology also provides guidance on optimal sleep times for children. Additionally, Breathing Wear enables users to view their baby's breathing rate per minute, enhancing the system's functionality. The current retail price for the complete equipment bundle, which includes the Nanit Pro Camera, a choice of wall mount, a travel pack with a flex stand and case, and one Breathing Band, is $322.
Concerning the device [13], Lollipop is one of the analytical systems with a wealth of functions for monitoring and multi-action assistance for infants. The system focuses on monitoring and detecting a baby's behavior by analyzing crying sounds, effectively distinguishing them from ambient noises such as doors opening or closing and television sounds. Furthermore, it can detect when a baby crosses a designated boundary, a feature the implementation team also used to identify when a baby moves outside the monitored area.
The Lollipop Smart Baby Camera offers advanced monitoring capabilities, including noise detection that alerts parents via an app if loud sounds are detected near their baby. Its smart camera excels in low-light conditions and automatically adjusts monitoring settings from day to night. Additionally, the system features a Lollipop Intelligent Air Quality Sensor that simultaneously measures temperature, humidity, and air quality, providing daily data logging charts for easy tracking. The Lollipop Smart Baby Camera is priced at $169, while the Lollipop Smart Air Quality Sensor costs $55.
The implementation group successfully completed 4 out of the 7 proposed items, aligning with their original goal to concentrate on "Computer Vision." Additionally, they enhanced the "Awake Detection from Eye" feature by optimizing it through training with a lightweight LSTM network.
CONCLUSION AND FURTHER WORK
Conclusion
After extensive research, the team, guided by Assoc. Prof. Truong Ngoc Son, successfully completed the project "Design and Implementation of a Baby Monitoring System." This system utilizes two parallel LSTM networks to monitor sleeping babies effectively. All functional components operate correctly, ensuring accurate data updates. The system's image processing speed and detection accuracy have been objectively evaluated, achieving results that meet the original goals.
However, the Jetson Nano has proven inadequate for the demands of faster detection and higher accuracy, particularly because its central processor limits the threaded processing speed required by the LSTM networks. The system performs optimally when subjects are directly facing the camera under moderate lighting conditions. Due to time and budget constraints, the implementation team has been unable to address all arising issues or enhance the system's efficiency, resulting in a compact and simple model that only meets basic objectives rather than more demanding real-world needs.
Future work
To enhance the system's quality, it is essential to focus on continuous improvement in both speed and accuracy. The upcoming development direction of the project will prioritize the following enhancements:
- Build and develop a better hardware model with faster processing speed and more stability when the system has to perform complex calculations.
- Combine more sensors, NLP, and IoT to develop more tasks such as "Detecting crying babies" and "Breathing Monitoring".
- Expand the dataset of baby movements, such as lying face down, crawling, and rolling.
[1] Paul J. Werbos. Backpropagation through time: What it does and how to do it. Proc. of the IEEE, 78(10):1550–1560, 1990.
[2] Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, Jun. 1989.
[3] Yong Yu, Xiaosheng Si, Changhua Hu, and Jianxun Zhang. A Review of Recurrent Neural Networks: LSTM Cells and Network Architectures. Neural Computation, vol. 31, issue 7, July 2019.
[4] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research (JMLR), 3(1):115–143, 2002.
[5] Qi Lyu and Jun Zhu. Revisit long short-term memory: an optimization perspective. In Deep Learning and Representation Learning Workshop (NIPS 2014), pages 1–9, 2014.
[6] R. Szeliski. Computer Vision: Algorithms and Applications. Springer, 2011.
[7] R. Laganière. OpenCV 2 Computer Vision Application Programming Cookbook. Packt Publishing, 2011.
[8] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, et al. MediaPipe: A Framework for Building Perception Pipelines. Google Research, Jun. 2019.
[9] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," in Neural Computation, vol. 9, no. 8, pp. 1735–1780, 15 Nov. 1997, doi: 10.1162/neco.1997.9.8.1735.
[10] MOTOROLA MBP36XL Baby Monitor. Available online: https://www.motorola.com/us/motorola-mbp36xl-2-5-portablevideo-baby-monitor-with-2-cameras/p (accessed on 3 June 2021).
[11] Infant Optics DXR-8 Video Baby Monitor. Available online: https://www.infantoptics.com/dxr-8/ (accessed on November 2022).
[12] Nanit Pro Smart Baby Monitor. Available online: https://www.nanit.com/products/nanit-pro-complete-monitoring-system?mount=wall-mount (accessed on November 2022).
[13] Lollipop Baby Monitor with True Crying Detection. Available online: https://www.lollipop.camera/ (accessed on November 2022).
[14] Lienhart, R.; Maydt, J. An extended set of Haar-like features for rapid object detection. In Proceedings of the International Conference on Image Processing, Rochester, NY, USA, 22–25 September 2002.
[15] Khan, T. An Intelligent Baby Monitor with Automatic Sleeping Posture Detection and Notification. AI 2021, 2, 290–306. https://doi.org/10.3390/ai2020018
[16] M.-T. Duong, T.-D. Do, M. C. Le, V.-B. Nguyen and M.-H. Le, "An Efficient Data Collecting Method for Enhanced Real-Time Drowsiness Detection Systems," 2021 International Conference on System Science and Engineering (ICSSE), 2021, pp. 105–110, doi: 10.1109/ICSSE52999.2021.9538480.
[17] Akihiro Kuwahara, Kazu Nishikawa. "Eye fatigue estimation using blink detection based on Eye Aspect Ratio Mapping," doi: https://doi.org/10.1016/j.cogr.2022.01.003.
[18] M. Ramzan, H. U. Khan, S. M. Awan, A. Ismail, M. Ilyas and A. Mahmood, "A Survey on State-of-the-Art Drowsiness Detection Techniques," in IEEE Access, vol. 7, pp. 61904–61919, 2019, doi: 10.1109/ACCESS.2019.2914373.