MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
GRADUATION THESIS
MAJOR: COMPUTER ENGINEERING TECHNOLOGY
GENERATE 3D MODELS FROM IMAGES
Ho Chi Minh City, 7/2024
INTRODUCTION
Recent advancements in computer vision have led to the creation of advanced algorithms for extracting valuable information from images. A notable application is the generation of three-dimensional (3D) models from two-dimensional (2D) images, which has significant implications for augmented reality, virtual reality, robotics, archaeology, and the preservation of cultural heritage.
The process of generating 3D models from images is often referred to as photogrammetry: the reconstruction of the spatial geometry and appearance of real-world objects or scenes through computational methods. Traditional photogrammetric techniques typically depend on manual measurements, stereo vision, or structured-light scanning, which can be labor-intensive and time-consuming and often require specialized hardware.
This graduation thesis investigates the generation of high-quality 3D models from images by utilizing advanced computer vision techniques. The emphasis is on developing a workflow that ensures precise and efficient reconstruction of both object geometry and appearance from multiple 2D images. Key aspects of the project will be thoroughly explored to enhance the understanding of this process:
Image Acquisition and Preprocessing: The acquisition of image data and preprocessing steps to enhance the quality and suitability of input images for 3D reconstruction
Feature Extraction and Matching: Techniques for extracting distinctive features from images and establishing correspondences across multiple views, essential for robust reconstruction
Surface Reconstruction: Methods for generating a coherent 3D surface mesh from the estimated depth maps, incorporating techniques for mesh regularization and refinement
Texture Mapping and Rendering: Strategies for mapping texture information from input images onto the reconstructed 3D model and rendering realistic visualizations
The project aims to advance 3D reconstruction techniques through practical implementation, experimentation, and evaluation of the proposed methodologies using real-world datasets. By benchmarking these methods against existing approaches, the work seeks to enhance the accuracy and efficiency of generating 3D models from images, promoting their adoption across various applications.
In summary, this graduation project explores the fascinating field of 3D reconstruction from images through computer vision, highlighting how deep learning techniques can transform this area and lead to innovative applications and technological advancements.
- Build a system to collect image data based on photogrammetry
- Control exposure levels to achieve high detail quality
- Build a pipeline that runs structure from motion, generates a point cloud, and reconstructs a surface
- Reconstruct the collected images in different Structure-from-Motion software packages
- Compare and benchmark the results against existing methods
This thesis explores the process of capturing images of an object from various angles to generate spatial points that form interconnected planes. The resulting data can be exported in formats compatible with 3D editing software, facilitating advanced modeling and design applications.
- Data Collection: Gather diverse image datasets and ground truth data
- Preprocessing: Enhance image quality and consistency
- Feature Extraction and Matching: Evaluate algorithms for identifying key points and establishing correspondences
- Surface Reconstruction: Implement algorithms for generating 3D surface meshes and refine surface quality
- Texture Mapping and Rendering: Develop methods for texture extraction, mapping, and realistic rendering
- Evaluation: Assess reconstruction accuracy quantitatively and qualitatively
- Performance Analysis: Measure computational efficiency and scalability
- Comparison with Baselines: Benchmark against existing methods
SUBJECTS AND SCOPE OF RESEARCH
- Pipeline and workflow for data collection, pre-processing of raw data, feature extraction, depth analysis, plane reconstruction, and texture mapping
Chapter 1: Introduction: Provides an overview of the topic, research objectives, scope, research methods, and the target audience
Chapter 2: Theoretical Basis: Introduces the current state of research, research directions, and relevant services being utilized
Chapter 3: System Design and Implementation: Presents the general model of the entire system, the system components, detailed design of each component, and the devices used in the components
Chapter 4: Results: Presents the execution results of the system model and compares them with other methods
Chapter 5: Conclusion and Future Development: Draws conclusions, highlights strengths and weaknesses, and suggests directions for future development of the model
This report presents a structured framework for communicating research findings on generating 3D models from images using computer vision, ensuring clarity in the study's objectives, methods, and outcomes.
THEORETICAL BASIS
INTRODUCTION TO COMPUTER VISION
Computer vision, a vital field within computer science, strives to empower computers to interpret and analyze images and videos similarly to human vision. Its primary objective is to develop artificial systems capable of extracting meaningful information from various visual data types, including video sequences, depth images, and multi-camera views. By focusing on real-world scene description and analysis, computer vision identifies and reconstructs essential properties such as color, shape, texture, and illumination from captured images.
Computer vision mimics human vision but lacks the extensive context and experience that humans possess. Throughout their lives, individuals learn to distinguish objects, assess distances, perceive motion, and recognize anomalies in images. The goal of computer vision is to replicate these human abilities in machines, enabling them to accurately understand and interpret visual information.
Figure 2.1 Human visual system vs computer vision
Computer vision enables machines to perform visual tasks efficiently by utilizing cameras, data, and algorithms, in contrast to the human visual system comprising retinas, optic nerves, and a visual cortex. This technology is essential for applications such as product inspection and surveillance, where speed and accuracy are crucial. A system trained to inspect a production asset can analyze thousands of products or processes a minute, noticing imperceptible defects or issues, and can quickly surpass human capabilities.
Figure 2.2 Computer vision and related disciplines
3D RECONSTRUCTION
3D reconstruction is the process of generating a three-dimensional model of an object or scene from two-dimensional images or other data sources. This technology aims to create a virtual representation for diverse applications such as visualization, animation, simulation, and analysis. It is widely utilized across various fields, including computer vision, robotics, and virtual reality.
Photogrammetry is a popular method for 3D reconstruction that analyzes images to obtain 3D data. By applying geometric and optical principles, it effectively reconstructs the three-dimensional shapes of objects and environments.
Structure from Motion (SfM) is a technique used to estimate the 3D structure of a scene from a collection of 2D images. It involves determining the camera poses and the 3D structure simultaneously by tracking features across multiple images.
Depth sensing technologies, such as LiDAR (Light Detection and Ranging) and structured light scanning, directly capture depth information of a scene, which can be used for 3D reconstruction
Stereo vision utilizes multiple cameras to capture images from various angles, allowing for the triangulation of corresponding points. This process infers depth information, which is essential for accurate 3D reconstruction of scenes.
Volumetric reconstruction techniques reconstruct the 3D shape of an object or scene by directly analysing voxel data, which represents the volume occupied by the object
Multi-View Stereo (MVS) techniques aim to reconstruct 3D geometry by combining information from multiple views of a scene. They typically involve dense matching of image patches to estimate depth at each pixel.
Point cloud processing focuses on analyzing 3D point sets derived from sensors like LiDAR or stereo vision. Key techniques, including point cloud registration and segmentation, are essential for effectively processing and interpreting this data.
Surface reconstruction techniques aim to create a continuous surface representation from sparse or dense point clouds. This is often done using algorithms such as Delaunay triangulation or Poisson surface reconstruction.
Meshing involves creating a mesh representation (consisting of vertices, edges, and faces) from point clouds or other 3D data. This mesh can then be used for visualization, simulation, or analysis.
After reconstructing a 3D model, texture mapping techniques allow for the application of images onto its surface, enhancing realism. Subsequently, rendering techniques can be utilized to produce lifelike images or animations of the reconstructed scene.
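As a brief illustration of the surface reconstruction and meshing steps described above, the following sketch builds a mesh from a point cloud with Poisson surface reconstruction using the open-source Open3D library; the file names and parameter values (neighborhood radius, octree depth, target triangle count) are assumptions for illustration, not settings used in this project.

```python
# Minimal sketch: Poisson surface reconstruction from a point cloud with Open3D.
# "scan.ply" and all parameter values are illustrative placeholders.
import open3d as o3d

pcd = o3d.io.read_point_cloud("scan.ply")          # cloud from SfM/MVS or a scanner

# Poisson reconstruction needs oriented normals; estimate them from local neighborhoods.
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.05, max_nn=30))
pcd.orient_normals_consistent_tangent_plane(30)

# Higher depth gives a finer (and heavier) mesh.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
    pcd, depth=9)

# Optionally simplify the mesh and save it for later texturing or editing.
mesh = mesh.simplify_quadric_decimation(target_number_of_triangles=100_000)
o3d.io.write_triangle_mesh("scan_mesh.ply", mesh)
```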
INTRODUCTION TO 3D ACQUISITION
A 3D acquisition taxonomy is given in the figure below.
3D acquisition methods can be categorized into active and passive techniques. Active techniques utilize controlled light sources to capture 3D information, often employing temporal or spatial modulation of the illumination. While these methods are computationally less demanding, they are limited to environments where lighting can be controlled. In contrast, passive techniques rely on ambient light without any special controls, making them more computationally intensive due to the lack of simplified lighting conditions.
Single-Vantage Methods operate from a singular perspective, providing streamlined setups that minimize occlusion issues. In contrast, Multi-Vantage Systems utilize several viewpoints or light sources, strategically placed at a significant distance from one another, to accurately gather 3D information. These systems require a broad baseline between their components for optimal performance.
Structured Light Techniques: Project a known pattern onto an object and capture its deformation to infer 3D shape. Useful for fast, accurate 3D shape acquisition with a single image or continuous video.
Time-of-Flight (ToF) technology measures the duration it takes for light to travel to an object and return, utilizing lasers (LIDAR) for accurate distance measurement. This method often generates detailed point clouds, providing high precision in spatial data, although it tends to be more costly compared to other measurement techniques.
Shape-from-Shading and Photometric Stereo: Use light intensity variations to infer surface orientation. Photometric stereo uses multiple light sources to improve stability and accuracy.
Shape-from-Texture and Shape-from-Contour: Infer 3D shape from patterns or outlines in an image. They work well with textured surfaces and regular shapes.
Shape-from-Defocus: Uses varying focal lengths to create depth maps based on the level of blur in images
Shape-from-Silhouettes: Uses silhouettes from multiple angles to approximate 3D shape. It is cost-effective but can miss internal cavities and fine details.
To enhance accuracy and accommodate a wider variety of objects, hybrid approaches combine the strengths of various methods, such as shape-from-silhouettes and stereo techniques. These strategies aim to leverage the benefits of multiple techniques concurrently. The subsequent sections delve deeper into the methods outlined in the taxonomy, focusing primarily on the Structure-from-Motion (SfM) techniques that represent the core of this project.
TRIANGULATION CONCEPT
Multi-vantage approaches utilize the principle of triangulation to extract depth information, which is a fundamental concept in structure-from-motion (SfM) methods.
Stereo imaging involves capturing two images simultaneously from different viewpoints, as illustrated in Figure 2.4.
Stereo-based 3D reconstruction operates on a straightforward principle: given two images of a specific point, the spatial position of that point is determined through the intersection of the two projection rays. This method is known as triangulation.
The triangulation process, repeated for many points, allows for the reconstruction of the 3D shape and configuration of objects within a scene. This method necessitates comprehensive knowledge of the camera parameters, including their relative positions, orientations, and settings such as focal length. The determination of these parameters is known as camera calibration. Additionally, successful triangulation requires solving the correspondence problem, which entails identifying the corresponding point in a second image for a specific point in the first image, or vice versa.
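To make the triangulation principle concrete, the short sketch below recovers a 3D point from one matched pixel pair using OpenCV's linear triangulation; the intrinsic matrix, baseline, and pixel coordinates are illustrative assumptions, not values measured in this project.

```python
# Minimal sketch of triangulation: intersect the projection rays of two
# calibrated views to recover a 3D point. All numbers are illustrative.
import numpy as np
import cv2

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])              # assumed intrinsic matrix

# Camera 1 at the origin, camera 2 offset along x (a simple stereo baseline).
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

# One corresponding image point in each view, as 2 x N pixel coordinates.
pts1 = np.array([[350.0], [260.0]])
pts2 = np.array([[310.0], [260.0]])

# Linear triangulation (DLT); the result is in homogeneous coordinates.
X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)
X = (X_h[:3] / X_h[3]).ravel()
print("Triangulated 3D point:", X)
```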
BASIC CONCEPTS OF IMAGE FORMATION
The initial phase of a computer vision system involves image acquisition, which is essential for modeling the image formation process. This process relies on geometric primitives and transformations to convert 3-D geometric features into 2-D representations. Additionally, image formation is influenced by discrete color and intensity values, requiring an understanding of environmental lighting, camera optics, and sensor properties.
A digital image, denoted as f(x, y), represents the output of an image sensor at specific spatial locations defined by 2D Cartesian coordinates (x = 1, 2, ..., M; y = 1, 2, ..., N). This image is obtained by spatially sampling and quantizing intensity values from a continuous-tone or analog image. In this context, the indices x and y correspond to the rows and columns of the image, with individual pixels identified by their 2D spatial coordinates.
The image formation process can be mathematically represented as:

Image = PSF ∗ Object function + Noise

This equation illustrates the relationship between an object or scene being imaged and the resulting image. The object function describes how light from a source interacts with the surface of the scene, which is then captured by the camera. The point spread function (PSF) serves as the system's impulse response, characterizing how it responds to a point light source and influencing the clarity of the image. A narrow PSF is indicative of a high-quality imaging system that produces sharp images, whereas a broad PSF leads to blurred images, highlighting the importance of the PSF in evaluating imaging performance.
The convolution operator transforms the object's light reflection into image data by convolving it with the PSF, and the noise term accounts for noise introduced by the imaging system; the process effectively converts an input distribution into an output distribution.
Figure 2.6 Digital image formation process
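A compact numerical sketch of this convolution model is shown below; the Gaussian PSF, the synthetic object, and the noise level are illustrative assumptions rather than a model of the actual camera used in this project.

```python
# Minimal sketch of the image formation model: image = PSF * object + noise.
import numpy as np
from scipy.signal import convolve2d

def gaussian_psf(size=9, sigma=1.5):
    """Normalized 2-D Gaussian kernel used as a simple PSF model."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    psf = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return psf / psf.sum()

# A synthetic "object function": a bright square on a dark background.
obj = np.zeros((64, 64))
obj[24:40, 24:40] = 1.0

psf = gaussian_psf()
blurred = convolve2d(obj, psf, mode="same", boundary="symm")   # optics blur
noise = np.random.normal(scale=0.02, size=obj.shape)           # sensor noise
image = blurred + noise                                        # observed image

print("object range:", obj.min(), obj.max())
print("image range:", round(image.min(), 3), round(image.max(), 3))
```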
Radiometry involves measuring electromagnetic radiation, particularly in the optical spectrum, while photometry focuses on assessing the sensitivity of cameras and human vision. The plane angle θ is defined as the ratio of the arc length s of a circle to its radius r, θ = s/r, measured from the circle's center.
A solid angle, denoted as Ω, is defined as the ratio of a surface area S on a sphere to the square of its radius, Ω = S/r². Solid angles are measured in steradians (sr). A complete sphere has a surface area of 4πr², which corresponds to a total solid angle of 4π sr.
Figure 2.7 Plane angle in 2D & Solid Angle in 3D
In the spherical coordinate system, local coordinates for a surface patch are defined by the colatitude (polar angle) θ, where θ = 0 aligns with the normal n, and the longitude (azimuth) φ. The solid angle Ω, measured in steradians (sr), represents the area of the surface patch on a unit sphere, with Ω = 4π for the entire sphere. As illustrated in Figure 2.8, lines and patches that are tilted relative to the viewer's perspective appear smaller in apparent length and area due to the foreshortening effect. When a surface of area A is tilted at an angle θ between the surface normal and the line of observation, the solid angle it subtends is diminished by a foreshortening factor of cos θ.
Figure 2.8 Solid angle in spherical coordinate system
The brightness of an opaque object relies on external light sources, as it does not produce its own energy. Image irradiance, which refers to the brightness at a specific point in an image, is directly proportional to the scene radiance. The gray value in an image quantifies this irradiance. Ultimately, the illumination perceived by a viewer or camera is influenced by the intensity, position, orientation, and type of the light source. A typical image formation process is illustrated in the figure below.
Figure 2.9 Image formation process in a camera
Image irradiance is influenced not only by the light source but also by the reflective properties and local orientations of the imaged surfaces. The light reflected from a surface is determined by its microstructure and physical characteristics.
An image acquisition system transforms three-dimensional real-world scenes into two-dimensional images through a camera projection process, which compresses 3D data into a 2D format. As a result, reconstructing depth or 3D information from these 2D images presents a significant challenge in the field of computer vision.
The human visual system excels at interpreting shadows and shading to perceive depth, as shading offers essential cues for depth perception. Our brains can transform a 2D image, such as a face, into a 3D object by analyzing lighting conditions, with variations in darkness aiding the understanding of 3D shapes. This concept is exemplified in Figure 2.10, which displays three circles of varying grayscale intensities, illustrating how these differences effectively communicate a strong sense of scene structure.
Figure 2.10 Different shadings give a cue for the actual 3D shape
The main objective of shape from shading is to extract information about surface normals solely from the intensity image.
To project 3D primitives onto the image plane, we utilize a linear 3D-to-2D projection matrix. The simplest projection model is orthographic, which does not require division to obtain the final inhomogeneous result. However, the perspective model is more widely used, as it provides a more accurate representation of how real cameras capture images.
Figure 2.11 Geometry in 2D-image generation
The projection technique was described by the Swiss mathematician, engineer, and astronomer Leonhard Euler in 1756; the episcope was one of the first projection devices.
Projection is a technique used to map a three-dimensional object onto a two-dimensional plane. It involves mapping points P(x, y, z) onto their corresponding image points P'(x', y', z') in the projection (view) plane, which serves as the display surface.
Perspective projections create a three-dimensional graphic image on a plane using straight lines that radiate from a common point and pass through points on the object. This technique mimics human vision, where parallel lines not aligned with the projection plane converge at distinct vanishing points. The projection reference point is the point from which coordinate positions are transferred; distances and angles are not preserved, and parallel lines converge at the center of projection. There are three types of perspective projections, characterized by their vanishing points, and perspective foreshortening causes objects and lengths to appear smaller as they are further from the center of projection. The projection lines are not parallel, and a center of projection must be specified.
Parallel projection involves extending parallel lines from each vertex of an object until they intersect the view plane, allowing for accurate representations of the object's proportions. This technique is characterized by a center of projection positioned at an infinite distance from the projection plane, ensuring that the projection lines remain parallel. While it is widely used in drafting to create scale drawings of three-dimensional objects, parallel projection does not offer a realistic depiction of the object. The intersection points of the projection lines with the view plane form the vertices of the object in the projection.
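The contrast between the perspective and parallel (orthographic) models can be seen in a few lines of code; the focal length and sample points below are arbitrary illustrative values.

```python
# Minimal sketch contrasting the two projection models discussed above.
# Perspective: x = f*X/Z, y = f*Y/Z (division by depth).
# Orthographic (parallel): x = X, y = Y (the depth coordinate is dropped).
import numpy as np

def perspective_project(points, f=1.0):
    """Project Nx3 camera-frame points with a simple pinhole model."""
    X, Y, Z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([f * X / Z, f * Y / Z], axis=1)

def orthographic_project(points):
    """Parallel projection: keep X and Y, ignore depth."""
    return points[:, :2].copy()

pts = np.array([[0.5, 0.2, 2.0],
                [0.5, 0.2, 4.0]])   # same (X, Y), different depths

print("perspective :\n", perspective_project(pts))    # nearer point projects farther from center
print("orthographic:\n", orthographic_project(pts))   # identical rows, depth ignored
```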
CAMERA SELF-CALIBRATION
Camera calibration is the process of determining the camera's intrinsic parameters, which include the focal length and the position of the optical centre, as well as the extrinsic parameters: the 3D position and orientation of the camera frame in relation to a specific world coordinate system. Traditionally, this calibration is performed offline using a 3D calibration object with a known pattern as a reference. Self-calibration offers a more flexible approach, allowing the camera to be calibrated directly from an image sequence, even when the camera's movements are unknown.
The camera self-calibration framework employed here comprises four stages [5].
Figure 2.15 Camera Self-Calibration Framework
Stage 1 - Image Preparation (Feature Detection and Matching): Feature detection and matching is the initial stage of 3D reconstruction from multiple views. Given a pair of images, a set of correspondences needs to be established so that a 3D structure can be constructed or an in-between view can be generated. First, for any object in an image, feature points on the object are extracted to provide a feature description of the object. Such feature points usually lie on high-contrast regions of the image, such as object edges, corners, and blobs. Then, each region around a detected feature location is converted into a more compact and robust (invariant) descriptor that can be matched against other descriptors.
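A minimal sketch of this stage with OpenCV's SIFT detector and a ratio test is shown below; the image file names are placeholders, not files from this project's dataset.

```python
# Minimal sketch of feature detection and matching (Stage 1).
import cv2

img1 = cv2.imread("view_000.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view_001.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)   # keypoints + 128-D descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors and keep only distinctive matches (Lowe's ratio test).
matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in knn if m.distance < 0.75 * n.distance]

print(f"{len(kp1)} / {len(kp2)} keypoints, {len(good)} putative correspondences")
```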
Stage 2 - Fundamental Matrix Calibration: With two views of a scene taken from different view angles, the geometrical relationship between the views is given by epipolar geometry. In epipolar geometry, this relation is encoded by the fundamental matrix F, which is initialized here using the linear normalized eight-point method. F can be calculated directly from corresponding points in the two images: for each pair of corresponding points x_i and x'_i, the relationship governed by F is
x'_i^T F x_i = 0

Defining the point correspondences x_i ~ [u_i v_i 1]^T and x'_i ~ [u'_i v'_i 1]^T, the constraint on the elements of the fundamental matrix F for one correspondence has the form:

[u'_i u_i   u'_i v_i   u'_i   v'_i u_i   v'_i v_i   v'_i   u_i   v_i   1] f = 0

For n pairs of correspondences, the constraints can be stacked and written as:

Kf = 0 (5)

where K is an n × 9 measurement matrix and f is the 9-vector of the entries of F. The Gold Standard method [7] is then used to optimize F over the set of corresponding points for each image pair by minimizing the geometric distance d:

d = Σ_i [ d(x_i, x̂_i)² + d(x'_i, x̂'_i)² ] (6)

where x_i and x'_i are the measured correspondences, and x̂_i and x̂'_i are the estimated correspondences.
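Continuing the earlier feature-matching sketch, the fragment below estimates F from the matched points with OpenCV; FM_RANSAC adds outlier rejection on top of an eight-point-style initialization, and the threshold and confidence values are illustrative assumptions.

```python
# Minimal sketch of Stage 2: fundamental matrix estimation from matched points.
# kp1, kp2 and good come from the feature-matching sketch in Stage 1.
import numpy as np
import cv2

pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC,
                                        ransacReprojThreshold=1.0,
                                        confidence=0.999)

# Check the epipolar constraint x'^T F x = 0 for the inliers.
x1 = np.hstack([pts1, np.ones((len(pts1), 1))])
x2 = np.hstack([pts2, np.ones((len(pts2), 1))])
residuals = np.abs(np.sum(x2 * (x1 @ F.T), axis=1))
print("mean |x'^T F x| over inliers:", residuals[inlier_mask.ravel() == 1].mean())
```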
Stage 3 - Intrinsic Matrix Calibration: The intrinsic parameters can be represented by the matrix A:

A = [ f_u   s    u_0
       0   f_v   v_0
       0    0     1  ]

In mapping the camera coordinate system to the image coordinate system, f_u and f_v are the magnification factors for the x and y directions, respectively, (u_0, v_0) are the coordinates of the principal point, and s is the skew between the axes. Assuming square pixels, this simplifies to f_u = f_v = f and s = 0, allowing f to be interpreted as the lens's focal length expressed in pixel units. By establishing point correspondences across three images and calculating the fundamental matrices from these correspondences, the camera's intrinsic parameters and motion parameters can be recovered, leading to the computation of coherent perspective projection matrices essential for reconstructing the 3D structure, albeit only up to a scale factor.
Stage 4 - Camera Motion Estimation: Camera motion estimation is a fundamental challenge in computer vision, with tracking methods categorized as either marker-based or markerless.
Markerless methods track camera motion by utilizing natural scene features, eliminating the need for fiducial reference marks, which makes them ideal for unprepared environments. The full 3D camera motion is obtained through geometric constraints that connect corresponding feature points across multiple images.
Figure 2.16 Two cameras indicated by their centers; the camera baseline intersects each image plane
As shown in Fig. 2.16, the two cameras are indicated by their camera centres C_0 and C_1. The camera centres, the 3D point X, and its image points x and x' all lie in the same plane, the epipolar plane.
Epipolar geometry defines the relationship between two views through the geometry induced by the baseline joining the camera centres and the two image planes. Key components include the epipoles, which are the intersection points of the baseline with the image planes, and the epipolar lines, formed by the intersection of the epipolar planes with the image planes. The epipolar constraint, which connects corresponding image points across the two views, can be expressed using the essential matrix E. In this context, two cameras with projection matrices P and P' observe corresponding image points x and x' of the same 3D point X.
X = RX' + t (7)

Pre-multiplying both sides by X^T [t]_x gives

X^T [t]_x R X' = X^T E X' = 0 (8)

where the essential matrix E is defined as the product of the skew-symmetric cross-product matrix of the translation vector t and the rotation matrix R:

E = [t]_x R (9)

The essential matrix encodes the epipolar geometry between two camera views, and in terms of normalized image points the epipolar constraint is written as:

x̃'^T E x̃ = 0 (10)

where x̃ is the image point expressed in normalized form, x̃ = A⁻¹x [8]. The relationship between the fundamental and essential matrices is thus:

E = A'^T F A
The essential matrix E and the fundamental matrix F differ primarily in the type of information they encode. While the essential matrix E captures the rotation and translation relating the cameras' position and orientation in the environment, the fundamental matrix F includes this information along with the intrinsic parameters of both cameras. Essentially, E is a geometric representation that maps the physical coordinates of a point P as viewed by the left camera to its projection in the right camera, whereas F relates the point's location in the two images using pixel coordinates. The fundamental matrix is used for uncalibrated cameras, while the essential matrix is used for calibrated cameras.
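The sketch below illustrates this relationship for calibrated cameras: given an (assumed, illustrative) intrinsic matrix A, E can be obtained from the F estimated earlier, or estimated directly from the matched points, and then decomposed into R and t with OpenCV.

```python
# Minimal sketch: from F to E and camera motion. The intrinsics are illustrative
# values, not this project's calibration; F, pts1, pts2 come from the earlier sketches.
import numpy as np
import cv2

A = np.array([[2900.0, 0.0, 1824.0],
              [0.0, 2900.0, 1216.0],
              [0.0, 0.0, 1.0]])

# Same intrinsics for both views, so E = A'^T F A reduces to A^T F A.
E_from_F = A.T @ F @ A

# Or estimate E directly from the correspondences and recover (R, t).
E, mask = cv2.findEssentialMat(pts1, pts2, A, method=cv2.RANSAC, threshold=1.0)
_, R, t, pose_mask = cv2.recoverPose(E, pts1, pts2, A)

print("E from F (up to scale):\n", E_from_F / np.linalg.norm(E_from_F))
print("E estimated directly (up to scale):\n", E / np.linalg.norm(E))
print("rotation:\n", R)
print("translation direction:", t.ravel())   # the scale cannot be recovered from images alone
```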
STRUCTURE FROM MOTION
Structure from Motion (SfM) techniques for reconstructing 3D models from multiple views of an object or scene can be categorized into three main classes: depth-map merging, volumetric-based methods, and feature-point based approaches.
Figure 2.17 SfM system flow chart
The workflow for reconstructing a 3D model from 2D images utilizes Structure from Motion (SfM) techniques, which begin by capturing multiple images of an object or scene from various angles. This diverse set of images ensures a comprehensive perspective for precise 3D reconstruction. Once collected, the images are prepared and processed through the SfM pipeline, resulting in a dense point cloud that represents the 3D coordinates of points in the scene. This point cloud is subsequently transformed into a continuous surface, typically represented as a mesh, creating a coherent and smooth 3D surface that reflects the object's geometry.
Texture reconstruction enhances the visual fidelity of a 3D model by applying color and texture details derived from input images, ultimately providing a realistic appearance to the point cloud data.
POINT CLOUD
A point cloud is a three-dimensional data set consisting of numerous points, each defined by X, Y, and Z coordinates, representing specific locations in space. Point clouds are essential for depicting the surface geometry of physical objects or environments, and they are generated using scanning technologies such as LiDAR (Light Detection and Ranging), structured light scanning, and photogrammetry.
Figure 2.18 Example of a point cloud from model
Point clouds can include various attributes such as color, intensity, and reflectance, depending on the acquisition sensor. The density of point clouds varies, with denser clouds capturing more points and providing finer detail of the surface geometry.
A point cloud typically consists of many individual points, ranging from thousands to billions, depending on the data capture method and the object or scene being represented.
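As a small illustration of working with such data, the sketch below loads, inspects, and cleans a point cloud with the open-source Open3D library; the file name and parameter values are placeholders, not data from this project.

```python
# Minimal sketch: loading and inspecting a point cloud with Open3D.
import open3d as o3d
import numpy as np

pcd = o3d.io.read_point_cloud("dense_cloud.ply")
print(pcd)                                   # reports the number of points
print("has colors:", pcd.has_colors())       # per-point attributes, if present

# Downsample to a manageable density and remove obvious outliers.
down = pcd.voxel_down_sample(voxel_size=0.005)
clean, kept = down.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

# The points themselves are just an N x 3 array of X, Y, Z coordinates.
xyz = np.asarray(clean.points)
print("points after cleaning:", xyz.shape)
```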
PHOTOGRAMMETRY
Photogrammetry comprises all techniques concerned with making measurements of real-world objects and terrain features from images [9].
Photogrammetry is a 3D coordinate measurement technique that utilizes photographs captured from various viewpoints, employing both standard digital cameras and specialized photogrammetry cameras. This method is used for applications such as measuring coordinates, quantifying distances, and generating topographic maps and digital elevation models. By analyzing multiple images of the same scene, photogrammetry reconstructs the 3D geometry of objects or landscapes, relying on the principles of geometry and optics. Understanding the geometry of a single photograph is crucial for accurate measurements, whether the images are obtained from terrestrial or aerial sources, including film cameras, digital cameras, or electronic scanners mounted on tripods, airborne, or spaceborne platforms.
Photogrammetry generates a point cloud by triangulating corresponding features across multiple images, enabling the creation of detailed 3D models. By overlaying these images onto the reconstructed geometry, photogrammetry also produces textured models that enhance visual realism.
Photogrammetry is relatively cost-effective and can be performed using consumer-grade cameras. It can capture large areas and complex scenes with high detail, and it can be used to create textured 3D models, making it suitable for applications such as visualizations and simulations.
To achieve high-quality photogrammetry images, it is essential to manage lighting, shadows, refraction, and reflections to prevent distortions. The subject must remain consistent and motionless, with no foreground objects that could disrupt the capture process. Additionally, the camera's focus, aperture, and white balance settings should be locked to maintain uniformity across all images.
Intentional and consistent camera movement is essential for capturing every angle efficiently while minimizing motion blur. This technique guarantees high-quality images, which are crucial for producing accurate 3D models.
Figure 2.21 Camera movement to capture for an object
SYSTEM DESIGN AND IMPLEMENTATION
INTRODUCTION
To achieve high-resolution 3D reconstructions, a meticulous four-stage pipeline is used. The initial stage focuses on capturing intricate details through high-resolution raw images, which preserve uncompressed sensor data, ensuring maximum detail and flexibility in the processing phase.
After capture, a careful preprocessing phase is essential for maintaining data integrity. This process includes correcting color discrepancies, precisely eliminating lens distortion to ensure accurate geometry, and optionally downscaling or enhancing details to optimize processing efficiency and improve reconstruction quality.
The pipeline then recovers the 3D structure by feeding the preprocessed images to Structure from Motion (SfM) software. This software identifies corresponding features across the images to estimate camera positions and create a sparse 3D point cloud that represents the scene's geometry. The point cloud is then densified to enhance detail and accuracy.
In the final stage, the dense point cloud is converted into a complete 3D model. First, mesh generation algorithms create a surface mesh that approximates the underlying geometry. Next, texture mapping techniques apply color and appearance information from the high-resolution images to the mesh, ensuring the 3D model accurately represents the visual characteristics of the captured scene. Finally, the model can be exported for various applications, providing a robust and meticulous approach to high-quality 3D reconstruction from detailed imagery.
A 3D PHOTOGRAMMETRY SCANNER SYSTEM
A 3D photogrammetry scanner system is an automated tool designed for precise data collection. By leveraging photogrammetry principles, it efficiently captures and processes multiple images from different angles and positions around the target object or scene, ensuring accurate 3D modeling.
Figure 3.2 Photogrammetry Scanner system diagram
The system uses an Arduino to control the motors and send signals for image capture by the camera, while an LCD displays the progress of data collection. A power module powers the entire setup, and the camera unit is flexible: a digital camera, smartphone, or any image-capturing device can be used. Once the data collection is complete, the information is transmitted to a computer and processed using specialized software in the subsequent stages.
To build the 3D photogrammetry scanner, this project uses the components shown in Figure 3.3.
The Arduino Mega 2560 serves as the central controller for the 3D photogrammetry scanner, coordinating all components. It manages the Nema 17 stepper motor through the Easy Driver A3973 board, which receives control signals and supplies power to the motor. User interaction is handled via an LCD 1607 module with a keypad shield, enabling switching between automated and manual control modes. Additionally, the SG90 servo motor, also controlled by the Arduino Mega, acts as a mechanical trigger for the camera's shutter, ensuring synchronized image capture. The turntable's rotation is precisely adjustable, allowing increments of approximately 10 degrees based on the revolution steps programmed in the Arduino.
Captured data from the DSLR Canon EOS 6D camera is directly transferred to a laptop via a mini USB cable, ensuring immediate access to images for processing
Efficient power management is achieved using an MB102 breadboard power supply, which powers both the Arduino Mega and the SG90 servo. Additionally, a Hi-Link AC-to-DC power supply converts mains electricity into the DC voltage required by the Nema 17 motor. This setup ensures each component receives the necessary power to function correctly, maintaining the overall stability and reliability of the system.
3.1.2 SYSTEM FLOWCHART FOR PHOTOGRAMMETRY SCANNER
The flowchart of the photogrammetry scanner system is structured into three main states: IDLE, RUNNING, and MENU, each serving distinct functions to control the scanning process efficiently
In the IDLE state, the system awaits user input while the Arduino monitors the keypad for key presses. Depending on the input, actions such as advancing the stepper motor, adjusting the step size, or switching between manual and automated modes can be executed. The current status and mode are shown on the LCD display. When the user activates the run flag to initiate an automated scan, the system shifts to the RUNNING state.
In the RUNNING state, the system automates image capture: the Arduino controls the stepper motor for precise turntable rotation and triggers the camera shutter through the servo motor. To maintain accurate timing, it incorporates pre-exposure and post-exposure delays, while continuously updating the current step count and status on the LCD display. After reaching the designated number of exposures, the system completes the automated sequence, shows the total exposures taken, and transitions back to the IDLE state.
The MENU state enables users to configure various system parameters, including step size, pre-wait time, post-wait time, and servo delay. Users navigate these options using the keypad, with each selection leading to a specific parameter-setting state for value adjustment. After configuring the desired settings, the system returns to the main menu, allowing the scanning process to be tailored to individual requirements.
Together, these states form a cohesive workflow that manages user interactions, automated scanning sequences, and system configurations, making the photogrammetry scanner system both flexible and efficient
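A language-agnostic sketch of this control flow is given below, written in Python purely for readability; the actual controller is Arduino firmware, and the function names, step counts, and delays here are illustrative assumptions rather than the project's code.

```python
# Minimal sketch of the IDLE / RUNNING control flow described above.
import time

state = "IDLE"
steps_per_rev = 36          # e.g. ~10 degrees of turntable rotation per exposure
pre_wait = post_wait = 0.5  # settle / save delays in seconds (illustrative)

def rotate_turntable(steps):      # placeholder for the stepper-driver commands
    print(f"rotating {steps} step(s)")

def trigger_shutter():            # placeholder for the servo pressing the shutter
    print("capture")

def run_scan():
    for exposure in range(steps_per_rev):
        time.sleep(pre_wait)      # let vibrations settle before exposing
        trigger_shutter()
        time.sleep(post_wait)     # wait for the camera to store the image
        rotate_turntable(1)
        print(f"exposure {exposure + 1}/{steps_per_rev}")

if state == "IDLE":
    state = "RUNNING"             # user pressed the "run" key
if state == "RUNNING":
    run_scan()
    state = "IDLE"                # back to waiting for input
```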
IMAGE PRE-PROCESSING WITH CAPTURE ONE SOFTWARE
Capture One is professional image editing software for photographers and digital artists, developed by Phase One. It offers advanced tools for photo editing, tethered shooting, and raw image processing, making it a robust solution for enhancing images.
Figure 3.6 Process multiple images in Capture One interface
Once the image capture process concludes, the images are imported into Capture One for enhancement. The software allows for resizing images, automatic exposure adjustments, and improvements to textures and colors. Such preparation guarantees optimal image quality for input into the Structure from Motion (SfM) software, aiding the development of accurate and detailed 3D models.
SFM SOFTWARE: MESHROOM
Meshroom is a free and open-source 3D reconstruction application built on the AliceVision photogrammetric computer vision framework. This tool enables users to generate 3D models from a collection of photographs, making it an invaluable resource for various applications in the field of photogrammetry. The software is designed to be accessible to both professionals and hobbyists who are interested in 3D reconstruction and modeling.
Figure 3.7 User interface in open source Meshroom software
The output images from Capture One are imported into Meshroom and the photogrammetry pipeline is chosen. To configure the pipeline, adjust the nodes and parameters according to the specific conditions of the dataset and initiate the process. The pipeline supports several feature describer types based on different algorithms, allowing users to select the most suitable option for their use case. Once configured, the software runs the photogrammetry pipeline automatically and, upon completion, exports the results as a 3D model file.
- SIFT (Scale-Invariant Feature Transform): detects key points invariant to scale, rotation, and illumination changes, offering high accuracy. Ideal for projects requiring high precision and robustness, such as detailed archaeological documentation and heritage preservation.
- DSP-SIFT (Domain-Size Pooling Scale-Invariant Feature Transform): extracts dense features for improved reconstruction in texture-less or repetitive regions. Suitable for creating detailed models in areas with few distinct features, such as smooth architectural surfaces or large, homogeneous landscapes.
- AKAZE (Accelerated KAZE): faster than SIFT, using nonlinear scale spaces for better performance and accuracy. Effective for real-time 3D scanning applications where quick processing is essential, such as interactive augmented reality (AR) experiences.
- SURF (Speeded Up Robust Features): optimized for speed and suitable for real-time applications, but less accurate than SIFT in some cases. Useful for rapid prototyping and real-time applications where speed is more critical than precision, such as robotics and automation.
- FAST (Features from Accelerated Segment Test): a high-speed corner detector intended for real-time performance. Ideal for high-speed video processing tasks, such as sports analysis or surveillance, where rapid detection is necessary.
- KAZE: detects features in a nonlinear scale space for detailed and accurate key points, useful for high-quality 3D reconstruction. Best for high-fidelity reconstructions in scientific research and medical imaging, where detailed and accurate models are critical.
Table 3.3: Feature describer types and their use cases
These feature types enable Meshroom to accurately reconstruct 3D models by providing robust and efficient methods for feature detection and matching, tailored to various application needs
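For batch processing, Meshroom can also be driven from the command line instead of the GUI; the sketch below assumes the meshroom_batch executable and its --input/--output flags, which vary between Meshroom releases, so these names should be verified against the installed version's --help output, and the folder paths are placeholders.

```python
# Minimal sketch: invoking Meshroom's command-line pipeline from Python.
# Binary name and flags are assumptions; verify them for the installed release.
import subprocess
from pathlib import Path

images = Path("dataset/processed_images")   # output of the Capture One stage
out_dir = Path("dataset/meshroom_output")
out_dir.mkdir(parents=True, exist_ok=True)

subprocess.run(
    ["meshroom_batch", "--input", str(images), "--output", str(out_dir)],
    check=True,
)
print("Meshroom results written to", out_dir)
```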
CAMERA CALIBRATION AND TRANSMISSION
To achieve maximum accuracy in photogrammetry, it is vital to control several key factors, including resolution, lens quality, and stability. High-resolution cameras are crucial for capturing the fine details needed for precise reconstructions, while high-quality lenses reduce distortions that can compromise image clarity. Ensuring stability during image capture is essential to prevent blurring and guarantee accurate photo alignment, which is fundamental for creating reliable 3D models. Additionally, accurately marking points in the scene with high-contrast targets improves the matching of points across images, which is critical for robust reconstruction. Finally, meticulous calibration of the camera and lens parameters is necessary to maintain accurate geometric relationships between the images and real-world objects.
Figure 3.9 Camera capture and instant transmission
To achieve optimal accuracy and fidelity, it is essential to use reliable equipment and techniques. In this project, we employed a Canon EOS 6D DSLR with a 50 mm lens to ensure ideal framing and exposure of the subject. The captured images are promptly transmitted to a laptop through the Canon Utility application via a mini-USB connection, facilitating efficient processing for precise 3D model creation.
RESULTS
PRE-PROCESSING IMAGE RESULT
In the second stage of the process, the images captured earlier are processed using Capture One. This involves scaling down all original images by 50% to minimize file size while maintaining high-quality texture and detail. The images are then sequentially renamed to facilitate faster feature extraction upon import into the SfM software.
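In this project the downscaling and renaming are done in Capture One, but the same preprocessing can be scripted; the sketch below shows the idea with OpenCV, and the folder names are placeholders rather than the project's actual paths.

```python
# Minimal sketch: downscale images by 50% and rename them sequentially.
import cv2
from pathlib import Path

src = Path("dataset/raw")
dst = Path("dataset/processed_images")
dst.mkdir(parents=True, exist_ok=True)

for i, path in enumerate(sorted(src.glob("*.jpg"))):
    img = cv2.imread(str(path))
    half = cv2.resize(img, None, fx=0.5, fy=0.5, interpolation=cv2.INTER_AREA)
    cv2.imwrite(str(dst / f"img_{i:04d}.jpg"), half)   # sequential names for SfM import
```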
STRUCTURE FROM MOTION RESULT
The third stage of the process was run on a Lenovo IdeaPad Gaming laptop featuring an Intel Core i5-10300H processor and an NVIDIA GTX 1650 Ti GPU. Despite some hardware constraints, the Meshroom pipeline completed in about 5 hours, yielding results that closely matched the preview from the sparse point cloud node, indicating a high level of precision.
Figure 4.3 Sparse point cloud reconstruction with camera positions in Meshroom
However, additional steps are necessary within the software to refine the model's mesh, achieve smoother surface reconstruction, and enhance the overall texture quality of the model.
Figure 4.4 Point cloud result from Meshroom
After densifying and meshing the point cloud and applying the texture extracted from the images, we obtain a 3D object file in OBJ format. This file is then opened in Blender, as illustrated in Figure 4.5. However, the surface quality is not yet refined, necessitating the next step of mesh refinement in Blender.
Figure 4.5 Final Texture result from Meshroom
REFINE & EXPORT IN BLENDER
To begin, import the reconstructed 3D model into Blender using the import function for formats such as OBJ or FBX, ensuring proper scaling and orientation. Use Blender's selection tools to eliminate unnecessary parts of the mesh, such as floating vertices and stray edges, to streamline the model and reduce file size. Next, apply the Decimate modifier, which decreases the number of vertices, edges, and faces while preserving the model's overall shape, making it more manageable in size and computational load without sacrificing detail.
Figure 4.6 After refinement in Blender
The model retains its quality even after significant reduction and is ready for export to various formats using Blender's export function. This marks the successful completion of the project's objectives.
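These clean-up and export steps can also be scripted; the sketch below is written against the Blender 3.x Python API (operator names differ in Blender 4.x), and the file paths and decimation ratio are illustrative assumptions rather than the settings used for this model.

```python
# Minimal sketch of the Blender clean-up step using the bpy API (Blender 3.x).
import bpy

# Import the textured model produced by Meshroom (path is a placeholder).
bpy.ops.import_scene.obj(filepath="dataset/meshroom_output/texturedMesh.obj")
obj = bpy.context.selected_objects[0]
bpy.context.view_layer.objects.active = obj

# Reduce the face count while keeping the overall shape (Decimate modifier).
mod = obj.modifiers.new(name="Decimate", type='DECIMATE')
mod.ratio = 0.2                      # keep roughly 20% of the faces
bpy.ops.object.modifier_apply(modifier="Decimate")

# Export the lightened model for use in other applications.
bpy.ops.export_scene.obj(filepath="dataset/export/model_refined.obj")
```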
CONCLUSION AND FUTURE WORK
CONCLUSION
This thesis examines the process of creating high-quality 3D models from images through advanced computer vision techniques. It highlights the significance of image acquisition and preprocessing, particularly in managing exposure levels for optimal detail. By utilizing established algorithms for feature extraction and matching, the work builds robust reconstruction pipelines that lead to precise 3D surface meshes.
During the project, we assessed multiple Structure-from-Motion (SfM) software solutions, demonstrating their effectiveness in converting image collections into coherent spatial point clouds and surfaces. By benchmarking these tools against existing methods, we gained insights into their performance and identified opportunities for refinement and enhancement.
FUTURE DEVELOPMENT
Looking ahead, the future development of this project holds significant potential for advancing the field of 3D reconstruction:
The integration of deep learning models could significantly enhance feature extraction, matching accuracy, and surface reconstruction. By leveraging the capabilities of neural networks, these processes can be automated and optimized, leading to reduced manual intervention and improved reconstruction quality.
Real-time Reconstruction: Investigate real-time reconstruction techniques to enable dynamic applications in augmented reality (AR) and robotics. Developing algorithms that can process and reconstruct 3D models in real time will open new avenues for interactive and responsive environments.
Multi-sensor fusion combines data from various sensors, including RGB cameras and depth sensors like LiDAR, to enhance depth estimation and texture mapping. This multi-modal technique significantly improves the accuracy and detail of reconstructed models, especially in intricate scenes.
Incorporating semantic understanding into the reconstruction pipeline enhances the differentiation between object classes, ultimately improving model interpretability. This advancement has significant implications for applications in automated inspection, virtual heritage preservation, and autonomous navigation systems.
To enhance scalability and efficiency in reconstruction pipelines, it is essential to optimize computational performance for managing larger datasets and more intricate scenes. This involves investigating distributed computing methods and parallel processing strategies to significantly reduce reconstruction times.
To promote widespread adoption of 3D reconstruction software among non-expert users, it is essential to enhance its user interface and accessibility. By incorporating intuitive tools and effective visualization techniques, the process of creating and manipulating 3D models can be significantly simplified, making it more approachable for users across various domains.
This thesis seeks to advance 3D reconstruction technology through continued research and development, impacting areas such as cultural heritage preservation and industrial applications. By integrating emerging technologies and methodologies, it aims to fully leverage computer vision's capabilities for creating detailed and realistic 3D models from images.
In summary, this thesis establishes a strong groundwork for the ongoing pursuit of comprehensive and efficient 3D reconstruction, paving the way for future advancements and exciting opportunities in computer vision and related fields.
REFERENCES
[1] R. Szeliski, Computer Vision: Algorithms and Applications, 2nd ed., 2021.
[2] M. Vergauwen, "3D Reconstruction from Multiple Images," Foundations and Trends in Computer Graphics and Vision, 2009.
[3] M. K. Bhuyan, Computer Vision and Image Processing: Fundamentals and Applications, 1st ed., 2020.
[4] M. Sonka, V. Hlavac, and R. Boyle, Image Processing, Analysis and Machine Vision, Cengage Learning, 2007.
[5] L. Ling, "Dense Real-time 3D Reconstruction from Multiple Images," M.Eng. thesis, Aug. 2013.
[6] R. I. Hartley, "In Defense of the Eight-Point Algorithm," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 6, 1997.
[7] Cornell University and Indian Institute of Technology Roorkee, "Deep Fundamental Matrix Estimation without Correspondences."
[8] P.-H. Lee, J.-W. Huang, and H.-Y. Lin, "3D Model Reconstruction Based on Multiple View Image Capture," in Proc. IEEE International Symposium on Intelligent Signal Processing and Communication Systems, Nov. 2012.
[9] E. Rosten and T. Drummond, "Machine Learning for High-Speed Corner Detection," in Proc. European Conference on Computer Vision, 2006.
[10] J. Dong and S. Soatto, "Domain-Size Pooling in Local Descriptors: DSP-SIFT," UCLA Vision Lab, University of California, Los Angeles.
[11] L. Kalms, K. Mohamed, and D. Göhringer, "Accelerated Embedded AKAZE Feature Detection Algorithm on FPGA."
[12] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features."
[13] D. G. Viswanathan, "Features from Accelerated Segment Test (FAST)."
[14] P. F. Alcantarilla, A. Bartoli, and A. J. Davison, "KAZE Features," in Proc. European Conference on Computer Vision, 2012.