INTRODUCTION
Background
Technological progress has fostered the connection between the physical world, the digital realm, and the organic world, resulting in production tools that bridge the gap between the tangible and the virtual. Typical components of Industry 4.0 include the emergence of the Internet of Things (IoT), smart cities, artificial intelligence, autonomous vehicles, robots, 3D printing, new materials, nanotechnology, and breakthroughs in biological sensing.
IoT, which stands for "Internet of Things," can loosely be translated as "connecting everything to the Internet." It is a system in which every device is interconnected and all things are linked through a common protocol, typically a communication network or the Internet. Simply put, all you need is a network-connected device and you can control it from wherever you are; remote control with IoT has become straightforward, since all that is required is to connect the device to the Internet.
For wireless remote control systems, limited range is a key weakness. In contrast, because the Internet extends globally, distance is no longer a limiting factor, which opens a new avenue in the field of automatic control. As the demand for information exchange grows and internet-connected devices become more widespread, using the Internet to transmit control signals is the most convenient method. It saves time, provides safety features for household electrical devices, and is cost-effective, while helping to ensure both network and asset security for every user.
Motivation
Based on the requirements of practical IoT applications, a robot able to support reconnaissance is needed. The designed robot should offer remote control capabilities, camera streaming, and environmental temperature readings to support rescue missions, firefighting assistance, or remote monitoring. Additionally, by integrating machine learning for object recognition, this model could play a useful role in a wide range of activities.
Research methods
The content of the research "Mini robot design for supporting reconnaissance" is divided into the following main parts:
1. Collecting data on the process of designing a robot control system that supports reconnaissance
2. Designing solutions for the robot control system, including a model with live stream functionality
3. Writing the code and designing the control system
Objectives of the topic
After completing this research, I decided to create a project with the following main objectives:
- Design a complete product consisting of a remotely controlled robot that live streams images through a web server and pushes environmental temperature and humidity data to the web server, using the ESP32 and ESP32-CAM over WiFi
- Write control programs for the ESP32 and ESP32-CAM kits, set up a web server, and implement data transmission
- Build a system for object detection from the images streamed by the ESP32-CAM
Limitations
• The robot is connected via WiFi, so the control range is limited.
• Issues arise in the case of an unstable network connection.
From the ESP32-CAM image capture handler (sketched below):
• if (httpd_resp_send(req, (const char *)fb->buf, fb->len) != ESP_OK) { httpd_resp_send_500(req); } — sends the image data to the client via the response; if sending fails, an HTTP error 500 is returned.
• esp_camera_fb_return(fb); — returns the frame to the system, freeing the memory used for the frame.
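For reference, a minimal sketch of this kind of capture handler is shown below, based on the esp32-camera and esp_http_server APIs. The handler name and the JPEG content type are illustrative assumptions rather than the project's exact code.

```c
#include "esp_camera.h"
#include "esp_http_server.h"

/* Illustrative ESP32-CAM capture handler: grabs one frame and returns it
   to the web client as a JPEG (handler name is an assumption). */
static esp_err_t capture_handler(httpd_req_t *req)
{
    camera_fb_t *fb = esp_camera_fb_get();      /* Grab a frame from the camera driver */
    if (fb == NULL) {
        httpd_resp_send_500(req);               /* No frame available: report a server error */
        return ESP_FAIL;
    }

    httpd_resp_set_type(req, "image/jpeg");     /* Tell the client this response is a JPEG image */
    esp_err_t res = httpd_resp_send(req, (const char *)fb->buf, fb->len);
    if (res != ESP_OK) {
        httpd_resp_send_500(req);               /* Sending failed, e.g. on an unstable connection */
    }

    esp_camera_fb_return(fb);                   /* Return the frame to the system, freeing its memory */
    return res;
}
```

Registering such a handler with httpd_register_uri_handler on a URI like /capture would let the web page request a still image from the robot on demand.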
Object detection
One of the crucial fields in Artificial Intelligence is Computer Vision. Computer Vision is a domain that involves methods for acquiring and processing digital images, analyzing and recognizing images, detecting objects, generating images, super-resolution imaging, and much more. Object Detection is perhaps the most prominent aspect of machine vision because of its frequent real-world applications.
Object Detection refers to the ability of computer systems and software to locate objects in an image and identify each object. It has been widely used for face detection, vehicle detection, pedestrian counting, security systems, and autonomous vehicles. There are many approaches to object detection that can be applied in various practical fields, and, like any other technology, many innovative and excellent applications of Object Detection will continue to come from programmers and software developers.
SSD (Single Shot Detector) was designed with real-time object detection in mind. Faster R-CNN creates bounding boxes using a region proposal network and then classifies the objects inside those boxes, which makes it faster than its predecessors; although it is considered accurate, the entire pipeline runs at about 7 frames per second, far below what real-time processing requires.
SSD speeds up the process by removing the region proposal network. To compensate for the drop in accuracy this would otherwise cause, SSD introduces several enhancements, such as default boxes and multi-scale feature maps. These enhancements allow SSD to work with lower-resolution images, increasing speed even further while approaching the accuracy of Faster R-CNN.
The SSD model is divided into two stages:
1. Extracting feature maps (based on the VGG16 backbone network); to improve efficiency in object detection, backbones such as ResNet, InceptionNet, or MobileNet are also recommended
2. Applying convolutional filters to detect objects
3.3.2 The SSD (Single Shot Detector) algorithm in Object Detection
Figure 3.14 Simulation of SSD algorithm
SSD takes a single input image, plus ground truth boxes that give the locations of the objects' bounding boxes during training. During the object detection phase, a small fixed set of default boxes is evaluated on each feature map. These boxes correspond to different aspect ratios on feature maps of different scales (the 8x8 and 4x4 scales in figures (b) and (c) are two examples). For each default box (the dashed boxes in the diagram) we must predict a probability distribution c = (c1, c2, ..., cn) over the classes C = {C1, C2, ..., Cn}. During training, default boxes are first matched with ground truth boxes so that the error measured by the localization loss is minimized (usually using the Smooth L1 function, which will be explained in section 2.2). Then the error of the labels predicted for each detected object in the default boxes is minimized through the confidence loss (usually using the softmax function, which will also be explained in section 2.2).
Thus, the loss function for object detection differs from the loss function for image classification in that it includes an additional localization loss for the position error of the predicted boxes relative to the ground truth boxes.
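For reference, the combined objective formed by these two terms is the standard SSD multibox loss from the original SSD paper (stated here as general background, not as this project's own derivation):

$$ L(x, c, l, g) = \frac{1}{N}\Bigl( L_{\text{conf}}(x, c) + \alpha\, L_{\text{loc}}(x, l, g) \Bigr) $$

where $N$ is the number of matched default boxes, $L_{\text{conf}}$ is the softmax confidence loss over the class scores $c$, $L_{\text{loc}}$ is the Smooth L1 localization loss between the predicted box parameters $l$ and the ground truth boxes $g$, and $\alpha$ is a weight balancing the two terms (set to 1 in the paper).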
This is the general operating principle of SSD. However, the specific architecture of the layers and the loss function of SSD will be explored below.
SSD first builds a 3D output feature map block by forward propagation through a standard architecture (such as VGG16). This part of the network (from the input picture to Conv7 in Figure 2) is called the base network. Then, in the Extra Feature Layers part of the design, additional structures are appended beneath the base network to perform object detection. The interpretation of these layers is straightforward:
• Layers of the SSD model:
  o Input Layer: Receives input pictures of 300 x 300 x 3 (width x height x channels) for the SSD300 architecture or 500 x 500 x 3 for the SSD500 architecture.
  o Conv5_3 Layer: This is the base network, which uses the VGG16 design but removes a few fully connected layers at the end. Its output is the Conv4_3 layer, a feature map with dimensions of 38 x 38 x 512.
  o Conv4_3 Layer: With dimensions of 38 x 38 x 512, Conv4_3 can be regarded as a feature map. Two main operations are performed on it:
    - The subsequent output layer is obtained by applying a convolutional layer, as in a standard CNN. With a kernel size of 3 x 3 x 1024 in the convolutional layer, the resulting Conv6 layer has dimensions of 19 x 19 x 1024.
    - At the same time, a classifier based on a 3 x 3 convolutional filter is applied to detect objects on the feature map, as seen in the diagram. This step is rather involved because it must both classify each object and locate it (via bounding box detection), and it is comparable to the procedure outlined in Figure 2. The feature map of dimensions 38 x 38 x 512 is first divided into a 38 x 38 grid of cells (depth is ignored, as the convolution is applied across the depth). Next, four default bounding boxes with various aspect ratios are generated for each grid cell. For each default bounding box, two things must be predicted: a probability distribution over the labels, i.e. a vector of dimension n_classes + 1 (note: the background always adds 1 to the total number of classes), and four offset parameters that give the object's bounding box within the frame. As a result, each default bounding box carries (n_classes + 1) class scores plus 4 offsets, so 4 x (n_classes + 1 + 4) outputs must be predicted for a single cell. Multiplied by the number of cells in Conv4_3, the output tensor has dimensions of 38 x 38 x 4 x (n_classes + 5); if the background is already counted inside n_classes as one of the labels, this becomes 38 x 38 x 4 x (n_classes + 4). In total, 38 x 38 x 4 bounding boxes are produced at this layer.
  o Applying the classifier to the feature maps of the Conv7, Conv8_2, Conv9_2, Conv10_2, and Conv11_2 layers follows a similar process. The shape of each layer depends on the convolution applied in the preceding layer, the kernel filter size (always 3 x 3 in the figure), and the convolution's stride. Either four or six default bounding boxes are generated for every cell in the feature map.
As a result, the number of default boxes created in the subsequent layers is:
  o Conv7: 19 x 19 x 6 = 2166 boxes (6 boxes/cell)
  o Conv8_2: 10 x 10 x 6 = 600 boxes (6 boxes/cell)
  o Conv9_2: 5 x 5 x 6 = 150 boxes (6 boxes/cell)
  o Conv10_2: 3 x 3 x 4 = 36 boxes (4 boxes/cell)
  o Conv11_2: 1 x 1 x 4 = 4 boxes (4 boxes/cell)
This gives a total of 5776 + 2166 + 600 + 150 + 36 + 4 = 8732 boxes at the output, which means classes must be predicted for roughly 8732 bounding boxes. This is far more than YOLO, which only has to predict 98 boxes at its output, and it is why the algorithm is slower than YOLO. For each default bounding box, the output vector to be predicted is

$$ y^{T} = [\underbrace{x,\, y,\, w,\, h}_{\text{bounding box}},\ \underbrace{c_1,\, c_2,\, \ldots,\, c_C}_{\text{scores of } C \text{ classes}}] $$
• Applying feature maps of different sizes: After the feature map is obtained from the base network, additional convolutional layers keep being added. By shrinking the feature map, these layers reduce the number of boxes that need to be predicted, and they make it possible to detect objects of different sizes: smaller feature maps are better at detecting big objects, while larger feature maps are better at detecting small ones. More details on the kernel filter sizes are given in the architecture diagram in the Extra Feature Layers section.
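As background on how the different feature maps are assigned to different object sizes, the original SSD paper chooses the default-box scale for the k-th of m feature maps as

$$ s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 1}\,(k - 1), \qquad k \in [1, m], $$

with $s_{\min} = 0.2$ and $s_{\max} = 0.9$, so the earliest (largest) feature map uses the smallest boxes and the last (smallest) feature map uses the largest ones. This is cited here from the paper as a reference point, not as a parameter choice made in this project.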
PRODUCT AND TEST CASES
Product
This is the display screen of the web server. It contains five buttons used to control the direction of the robot's movement, followed by two sliders for adjusting the intensity of the flashlight and the speed of the robot's movement. The last slider controls the angle of the servo motor, allowing the camera to observe the surroundings from various angles. In addition, the web server displays real-time readings of the temperature and humidity of the environment around the robot.
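To illustrate how such a slider can drive the robot, the sketch below shows one plausible ESP32-side handler that reads a speed value from the HTTP query string and updates the motor PWM duty. The URI, the "speed" parameter name, the 0-255 range, and the set_motor_speed helper are assumptions made for this sketch, not the project's exact implementation, and the LEDC timer/channel configuration is omitted.

```c
#include <stdint.h>
#include <stdlib.h>
#include "esp_http_server.h"
#include "driver/ledc.h"

/* Hypothetical helper: apply an 8-bit duty value to the LEDC channel that
   drives the L298N enable pin (LEDC timer/channel setup not shown). */
static void set_motor_speed(uint32_t duty)
{
    ledc_set_duty(LEDC_LOW_SPEED_MODE, LEDC_CHANNEL_0, duty);
    ledc_update_duty(LEDC_LOW_SPEED_MODE, LEDC_CHANNEL_0);
}

/* Illustrative handler for a request such as /speed?speed=180 */
static esp_err_t speed_handler(httpd_req_t *req)
{
    char query[32];
    char value[8];

    /* Read the query string and extract the "speed" parameter */
    if (httpd_req_get_url_query_str(req, query, sizeof(query)) == ESP_OK &&
        httpd_query_key_value(query, "speed", value, sizeof(value)) == ESP_OK) {
        set_motor_speed((uint32_t)atoi(value));     /* Apply the slider value as the PWM duty */
        return httpd_resp_send(req, "OK", HTTPD_RESP_USE_STRLEN);
    }

    /* Missing or malformed parameter */
    httpd_resp_send_err(req, HTTPD_400_BAD_REQUEST, "speed parameter required");
    return ESP_FAIL;
}
```

The flashlight and servo sliders can follow the same pattern, each mapped to its own URI and output channel.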
With the advancement of science and technology, we can easily come across mini robot design projects for reconnaissance, which mainly use LiDAR modules or color sensors. For this project, I decided to use a camera for the following reasons.
Firstly, compared to LiDAR, which uses laser pulses to measure distances and create 3D point clouds, cameras provide several advantages. While LiDAR operates independently of lighting conditions, it struggles with black or highly light-absorbing surfaces, is more expensive, and is more complex to maintain. Moreover, LiDAR cannot capture the colors or detailed characteristics of objects, which cameras can easily do.
Secondly, compared to color sensors, which detect and measure the color of light and typically provide only RGB (Red, Green, Blue) values, cameras offer far more comprehensive data. Color sensors are limited to color data without any context about shapes or patterns and cannot capture or process images. In contrast, cameras provide high-resolution images with many details about the scene, including objects, patterns, and colors.
Figure 4.3 Object detection program display
Thirdly, while a camera may struggle in low-light conditions, this has been addressed by using the built-in flash LED on the ESP32-CAM. Additionally, with SSD-based detection, objects seen through the camera can be recognized easily.
In summary, the camera's ability to capture detailed, high-resolution images with rich color information makes it the preferred choice for this reconnaissance mini robot project.
Test cases
Table 1: Table of action commands

| Number | Action | Expected results | Real results | Delay | Result |
|---|---|---|---|---|---|
| 1 | Forward command | The model performs forward 10/10 times | 9/10 (once the network connection was disconnected midway) | 0.5 s | Pass |
| 2 | Backward command | The model performs backward 10/10 times | 10/10 | 0.5 s | Pass |
| 3 | Stop command | The model performs stop 10/10 times | 10/10 | 0.5 s | Pass |
| 4 | Turn left command | The model performs turn left 10/10 times | 10/10 | 0.5 s | Pass |
| 5 | Turn right command | The model performs turn right 10/10 times | 10/10 | 0.5 s | Pass |
| 6 | Camera rotation command | The model rotates the camera 10/10 times | 10/10 | 0.5 s | Pass |
This table presents the results of testing various commands on the reconnaissance robot model, to verify its responsiveness and functionality. Here is a breakdown of each column and its meaning:
1. Number: The identifier for each test case, ranging from 1 to 6.
2. Action: The specific command given to the model (e.g., forward command, backward command).
3. Expected results: The anticipated outcome for each command, indicating that the model should perform the specified action 10 out of 10 times.
4. Real results: The actual performance of the model during the test. For example, for the forward command, the model performed the action 9 out of 10 times due to a network disconnection midway.
5. Delay: The delay time (0.5 seconds) between issuing the command and the model's response.
6. Result: Whether the test passed or failed, based on whether the real results met the expected results.
- Expected results: Model performs forward 10/10 times
- Real results: Model performed forward 9/10 times due to a network disconnection
- Expected results: Model performs backward 10/10 times
- Real results: Model performed backward 10/10 times
- Expected results: Model performs stop 10/10 times
- Real results: Model performed stop 10/10 times
- Expected results: Model performs turn left 10/10 times
- Real results: Model performed turn left 10/10 times
- Expected results: Model performs turn right 10/10 times
- Real results: Model performed turn right 10/10 times
- Expected results: Model rotates the camera 10/10 times
- Real results: Model rotated the camera 10/10 times
Overall, the table summarizes the reliability and performance of the model in executing the given commands, showing that all tests passed successfully despite a minor network interruption during the forward command test.
Table 2: Table of object detection

| Number | Objects | Expected results | Real results | Reliability | Result |
|---|---|---|---|---|---|
This table summarizes the results of object detection tests using the COCO (Common Objects in Context) dataset. It evaluates the system's ability to correctly identify various objects and provides metrics on the reliability of these identifications.
1. Number: The identifier for each test case, from 1 to 8.
2. Objects: The specific object being tested for identification (e.g., Person, Car).
3. Expected results: The anticipated outcome, which is successful identification 10 out of 10 times.
4. Real results: The actual performance of the system in identifying the objects. For example, for the Mouse object, the system successfully identified it 9 out of 10 times.
5. Reliability: The percentage representing the confidence or accuracy of the identification process.
6. Result: The outcome of the test, indicating whether the system passed based on its performance.
The table demonstrates that the system reliably identifies various objects with high accuracy, achieving perfect scores in most cases except for the Mouse, which was missed in one instance. The reliability percentages indicate the confidence level of the system in identifying each object; all are above 80%, signifying robust performance. All tests passed according to the criteria set.
Table 3: Table of temperature and humidity measurements

| Number | Environmental temperature and humidity | Robot measured | Accuracy | Result |
|---|---|---|---|---|
This table presents the results of testing the robot's ability to measure environmental temperature and humidity. It compares the measurements taken by the robot to the actual environmental conditions and gives the accuracy of the measurements along with the result of each test.
1. Number: The identifier for each test case, from 1 to 5.
2. Environmental temperature and humidity: The actual environmental conditions for temperature and humidity.
3. Robot measured: The temperature and humidity values measured by the robot.
4. Accuracy: The percentage accuracy of the robot's measurements compared to the actual environmental conditions.
5. Result: The outcome of the test, indicating whether the robot's measurements are acceptable.
The table shows that the robot's temperature and humidity measurements are generally accurate, with all tests passing. The accuracy percentages indicate how closely the robot's measurements match the actual environmental conditions, with values ranging from 81% to 100%. This suggests that the robot performs well in measuring temperature and humidity, although there is some variation in its accuracy.
RESULTS AND DEVELOPMENT DIRECTION
Results
After 12 weeks of research, I am proficient in using the ESP32 to communicate with modules such as the L298N, DHT11, and SG90. Additionally, I can program the ESP32-CAM to upload images to the web and perform smooth, low-latency video streaming over the Internet via WiFi. Moreover, the system can use SSD to perform object recognition with an accuracy of over 70%.
In summary, I have successfully completed the project "Mini robot design for supporting reconnaissance." This project began as an idea and was then fully realized, demonstrating significant progress in engineering and technology. The project effectively integrated sensors and the ESP32 microcontroller to create a mini robot capable of operating in various conditions, with great potential for further development.
Throughout the project's completion, Dr. Pham Thi Viet Huong, with her extensive knowledge, provided enthusiastic and dedicated guidance, ensuring the project's timely completion.
Finally, the project's completion not only met the set challenges but also reinforced my knowledge, skills, and experience in this field.
Development Direction
• Increase the accuracy of the recognition algorithm by collecting more data for training.
• Integrate the recognition system through an API to display detection results on the web.
• Integrate a LiDAR sensor to build 3D models of areas inaccessible to humans.
• Integrate a 5G module to control the robot from farther away, avoiding the range limits of WiFi connectivity.