Overview
Smartphones have become essential tools for a diverse range of users, serving not only as communication devices but also for activities such as listening to music, gaming, and photography. Often, we need to enhance our images before sharing them on social media by removing unwanted elements such as street signs, shadows, or unexpected individuals. Achieving a flawless photo every time can be challenging, but with a few simple techniques we can significantly improve a picture's appearance. This article introduces an Android app designed to efficiently remove unwanted objects from photos, helping users achieve the perfect image.
Problem statement
Removing objects from photos while preserving their natural appearance is a challenging task. Numerous methods and applications have been developed for both mobile and desktop platforms to tackle this issue, but not all of them deliver flawless results. The process is closely related to image inpainting, which is concerned with finding effective ways to fill in the gaps left behind.
Our project consists of two key components: first, we investigate existing methods for reconstructing an image after editing; second, we apply deep learning to improve the reconstruction. In addition, we optimize our deep learning model so that it runs efficiently on mobile devices.
Figure 1.1: Example of removing an unwanted object. The image on the left contains unwanted children playing in the background; the result on the right keeps only the main girl.
(Source: https://www.fiverr.com/ad_grfx/remove-a-person-or-object-from-a-picture-with-photoshop)
Motivation
Mobile phones have become the leading gadget for all age groups, with users spending more time on their phones than on any other device, which highlights the dominance of mobile in today's digital landscape. Additionally, the popularity of image enhancement apps like FaceApp, Camera360, and BeautyPlus continues to grow, reflecting the increasing demand for mobile photography and editing tools.
Mobile applications increasingly incorporate features for removing unwanted objects from images, making this an intriguing yet challenging task, particularly for newcomers to image processing and deep learning.
Problems
There are some problems with building an Android application specifically for this task.
• Mobile hardware has limited resources and cannot deliver high performance.
• Mobile technologies change rapidly, so the product needs constant updates; older versions and different operating systems may not be fully supported.
The deep learning approach faces challenges with lengthy training and inference times, along with the need for substantial amounts of data. Most deep learning applications are designed for IoT devices and servers, and their optimization for personal mobile devices remains inadequate. Consequently, deploying deep learning models on mobile requires optimization steps that can diminish the expected performance.
Goal and scope
The goals of this project included:
• Study the in-camera processing pipeline, and research and apply work related to the topic.
• Propose, implement, and evaluate an application for removing unwanted objects from an image.
• Customize, optimize, and quantize the model so that it is suitable for the mobile environment.
• For the application, we decided to make it run locally for several reasons: this way, we can utilize the device's hardware and keep the app independent of the Internet.
• Regarding the scope of supported smartphones, we target Android devices with at least 8 CPU cores and more than 4 GB of RAM (somewhat above the average device today), running Android 5.0 or later. The application is compatible with a range of devices, but higher-end hardware yields better results and shorter run times.
• With the approach we are using, there are many aspects we could work on, so we focus only on the main points.
Research method
To complete the project, our team works together and applies the following methodologies:
• Combine theory and implementation, and evaluate the results based on user experience and the final product.
• Compare the results with other authors and published work, and propose new methods and techniques that can operate on mobile.
• Test on different devices to obtain objective results.
• Weekly meetings with lecturers and instructors.
Progress
To obtain the best research results, our group conducts the work in the following main stages:
• Stage 1: Research the theory of deep learning, GANs, generative models, and related topics in order to choose the most appropriate approach.
• Stage 2: Research quantization and other compression methods, and attempt to optimize the model so that suitable models can be deployed on mobile.
• Stage 3: Train the model, or find a suitable checkpoint if the training phase does not go well. Also implement a traditional fallback method in case the deep learning approach is not successful.
• Stage 4: Implement android application and deploy the model.
Thesis outline
This report will be presented in 7 chapters with specific details:
Chapter 1 - Introduction: This chapter describes the current problem so that we can analyze the feasibility, motivation, scope, and challenges before stepping into the topic.
Chapter 2 - Background: This section provides foundational knowledge on images and image processing, explores various techniques for image inpainting, and discusses the deep learning methods that serve as the basis for the implementation of these inpainting techniques.
Chapter 3 - Related work: In this chapter, a survey of available approaches is conducted to identify the most promising one to build upon.
Chapter 4 - System analysis and design: This chapter focuses on the methods we used, including the data set, architecture, training method, and evaluation metrics.
Chapter 5 - Experimental result: In this chapter, we compare the final result with previous methods in several aspects: time, image quality, mobile memory, and other standard evaluation metrics.
Chapter 6 - Application: This chapter describes the mobile application and how it works, giving users an intuitive tool for experiencing our methods.
Chapter 7 - Conclusion: In the concluding chapter, we summarize our findings and provide a final assessment of our work so far. Additionally, we evaluate the advantages and disadvantages of the existing solutions and explore potential avenues for future development.
Image on the human eye
When light strikes an object, it is reflected and enters the human eye through the lens. The image is formed on the retina, located at the back of the eye, where it appears inverted. The brain then processes and interprets this image, allowing us to perceive its size and depth from various angles.
Figure 2.1: How an image is formed in the eye [4]
Figure 2.1 illustrates the process of image formation in the human eye: an object, such as a tree, creates an inverted image on the retina. Despite this inversion, the brain interprets and corrects the image, allowing us to perceive the tree in its correct, upright position.
Image on an analog camera
An analog camera captures images through a chemical reaction on a roll of film, which typically contains thirty to fifty sequential photographs. The standard film strip sizes are 35 mm and 16 mm. At a basic level, light particles (photons) that pass through the camera react with the silver halide particles on the strip.
Image on a digital camera
A digital camera uses a CCD (charge-coupled device) array as its image sensor. This technology detects the analog signal generated by photon energy and transforms it into an electrical signal. The CCD has a matrix structure resembling an image, with each cell serving as a sensor that measures light intensity.
Cameras and human eyes operate on similar principles, as both rely on reflected light to form images. In the eye, light passes through the lens and forms an image on the retina, which the brain then interprets, estimating the height and depth of objects from the constructed angles. In a camera, light reflected from the scene strikes the sensor chip, where photon collisions are recorded and stored. Major companies employ proprietary technologies to manufacture CCD sensors, since the quality of a digital image hinges on this component. Digital images also offer the significant advantage of being copied without loss of quality, unlike their analog counterparts.
Digital image
A digital image is a two-dimensional representation characterized by finite and discrete coordinates (x, y), where each individual element is referred to as a pixel or pel. Designed for processing on digital computers, these images enable efficient manipulation and analysis in a wide range of applications.
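To make the discrete representation concrete, the short sketch below (the file name is only a placeholder) loads an image with Pillow and inspects it as a NumPy array of pixels:

```python
import numpy as np
from PIL import Image

# Load an image (the file name is a placeholder) and view it as a discrete 2-D array.
img = Image.open("example.jpg").convert("RGB")
pixels = np.asarray(img)                      # shape: (height, width, 3)

height, width, channels = pixels.shape
print(f"Resolution: {width}x{height}, channels: {channels}")

# Each discrete coordinate indexes one pixel; here we read the pixel at row 10, column 20.
r, g, b = pixels[10, 20]
print(f"Pixel at (x=20, y=10): R={r}, G={g}, B={b}")
```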
Image inpainting
Object removal techniques can be categorized into two main types: image inpainting and copy-move. The copy-move method is simpler, replicating sections of the image to cover the unwanted object, but it is less effective in complex scenes and works best with repetitive structures. Image inpainting is the preferred technique for removing unwanted objects: it reconstructs the missing pixels so that the completed image looks realistic and preserves the original context. Beyond object removal, inpainting can also enhance image quality and correct distortions.
Common approaches to image inpainting include sequential-based, CNN-based, and GAN-based methods. Sequential-based approaches encompass patch-based and diffusion-based techniques. Patch-based methods fill in missing areas by matching and pasting suitable patches, employing tools such as Markov Random Fields (MRF) and the Sum of Squared Differences (SSD). Diffusion-based methods instead propagate content from the boundary of the hole inwards, using techniques such as fractional-order derivatives and Fourier transforms to exploit diffusion coefficients.
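As a concrete illustration of these traditional approaches, the sketch below uses OpenCV's built-in inpainting routines, which implement Telea's fast-marching method and a Navier-Stokes based diffusion method; the file names and mask are assumptions made purely for the example.

```python
import cv2

# Image to repair and a binary mask marking the unwanted region
# (non-zero pixels = area to be filled). Both file names are placeholders.
image = cv2.imread("photo.jpg")
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)

# Fast-marching method (Telea): fills the hole from the boundary inwards.
result_telea = cv2.inpaint(image, mask, 3, cv2.INPAINT_TELEA)

# Diffusion-based method built on the Navier-Stokes equations.
result_ns = cv2.inpaint(image, mask, 3, cv2.INPAINT_NS)

cv2.imwrite("result_telea.jpg", result_telea)
cv2.imwrite("result_ns.jpg", result_ns)
```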
Deep-learning approaches, particularly those built on convolutional neural networks (CNNs), excel at image reconstruction: they can generate intricate textures and fill in large missing regions. Unlike traditional patch-based methods, these approaches can create entirely new content, which makes them far better suited to complex restoration tasks.
Figure 2.2: An example showing the need to generate novel fragments for the task of image inpainting; panel (c) shows the result of Globally and Locally Consistent Image Completion [7].
Filtering replaces a pixel's value according to a specific operation or function and is at the core of traditional image processing methods. These operations are commonly referred to as filters, masks, kernels, templates, or windows. In digital imaging, two primary types of filters are used:
• Suppress the high frequencies in an image: smoothing the image.
• Suppress the low frequencies in an image: enhancing the image.
For sharpening, the Laplacian filter is the usual choice. For smoothing, there are several common options in traditional image processing (a short code sketch of these filters follows the list):
• Mean filter: slides a window over the image and replaces the value at the window's center with the average of all pixel values contained in it. This effectively smooths the image and reduces noise, as illustrated in Figure 2.3.
Figure 2.3: An example of the mean filter.
• Median filter: operates within a defined window radius and replaces the luminosity of the target pixel with the median value of the pixels in that window.
• Max filter: replaces the central pixel with the lightest one in the running window.
• Min filter: replaces the central pixel with the darkest one in the running window.
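A minimal sketch of these four filters using SciPy's ndimage module; the random array simply stands in for a grayscale image:

```python
import numpy as np
from scipy import ndimage

# A synthetic noisy grayscale image; it merely stands in for real data.
image = np.random.randint(0, 256, size=(64, 64)).astype(np.float32)

window = 3  # side length of the sliding window

mean_smoothed   = ndimage.uniform_filter(image, size=window)  # mean filter: average of the window
median_smoothed = ndimage.median_filter(image, size=window)   # median filter: median of the window
max_filtered    = ndimage.maximum_filter(image, size=window)  # max filter: lightest pixel in the window
min_filtered    = ndimage.minimum_filter(image, size=window)  # min filter: darkest pixel in the window

print(mean_smoothed[0, 0], median_smoothed[0, 0], max_filtered[0, 0], min_filtered[0, 0])
```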
Deep learning models often use transposed convolution for upsampling. Traditional methods such as bilinear and bicubic upsampling are not learnable and can only be applied outside the trained architecture, which limits the model's effectiveness. Learnable techniques such as transposed convolution (deconvolution) improve both speed and quality, making them a valuable building block in deep learning frameworks.
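The sketch below, written with PyTorch purely for illustration, contrasts a fixed bilinear upsampling layer (no trainable parameters) with a learnable transposed convolution; the tensor sizes are arbitrary.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)           # a feature map: batch=1, 16 channels, 32x32

# Fixed upsampling: no trainable parameters, applied outside the learned layers.
bilinear = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
y_fixed = bilinear(x)                     # -> (1, 16, 64, 64)

# Learnable upsampling: a transposed convolution whose weights are trained end to end.
deconv = nn.ConvTranspose2d(in_channels=16, out_channels=16,
                            kernel_size=4, stride=2, padding=1)
y_learned = deconv(x)                     # -> (1, 16, 64, 64)

print(y_fixed.shape, y_learned.shape)
```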
Sub-pixel convolution, illustrated in Figure 2.4, is a specialized form of deconvolution layer: it performs a standard convolution in the low-resolution space, followed by a periodic shuffling of the resulting channels into a higher-resolution output. Compared with ordinary scaling convolutions, it has the same computational complexity while using more parameters, which gives it greater modeling capability.
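A minimal PyTorch sketch of the idea, with arbitrary sizes: an ordinary convolution runs in the low-resolution space and produces r² times the target number of channels, after which a periodic shuffle rearranges those channels into a higher-resolution map.

```python
import torch
import torch.nn as nn

upscale = 2                               # upscaling factor r
x = torch.randn(1, 16, 32, 32)            # low-resolution feature map

# Ordinary convolution in low-resolution space, producing r^2 * out_channels channels.
conv = nn.Conv2d(in_channels=16, out_channels=16 * upscale ** 2,
                 kernel_size=3, padding=1)

# Periodic shuffle that rearranges channels into spatial positions.
shuffle = nn.PixelShuffle(upscale)

y = shuffle(conv(x))                      # -> (1, 16, 64, 64)
print(y.shape)
```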
Evaluating the quality of generated images in deep learning, particularly for GAN models, is a significant challenge. Key metrics for assessing image content include the Manhattan distance (L1), mean squared error (L2), structural similarity index (SSIM), peak signal-to-noise ratio (PSNR), and Fréchet Inception Distance (FID). These metrics measure the quality of a generated image, particularly how visible its distortions are. By quantifying the disparity between the generator's output and the ground-truth image, they also serve as loss functions that models are trained to minimize.
The simplest of these, the L1 and L2 distances, compare the two images pixel by pixel:

$$L_1 = \frac{1}{N}\sum_{p} \left| x(p) - y(p) \right|, \qquad L_2 = \frac{1}{N}\sum_{p} \left( x(p) - y(p) \right)^2$$

where:
• p stands for a pixel and N for the number of pixels.
• x(p) are the pixel values of the processed patch.
• y(p) are the pixel values of the ground truth.
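A small NumPy sketch of these two pixel-wise measures, with random arrays standing in for a generated patch and its ground truth:

```python
import numpy as np

# Stand-ins for a generated (processed) patch x and its ground truth y.
x = np.random.rand(64, 64, 3)
y = np.random.rand(64, 64, 3)

l1 = np.mean(np.abs(x - y))      # Manhattan / L1 distance per pixel, averaged
l2 = np.mean((x - y) ** 2)       # mean squared error (L2)

print(f"L1: {l1:.4f}  L2 (MSE): {l2:.4f}")
```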
The peak signal-to-noise ratio (PSNR) is a widely used metric for assessing the image quality degradation caused by codecs and compression methods. In this context, the "signal" is the original image, while the "noise" is the error introduced by compression. PSNR is a logarithmic quantity, typically expressed in decibels (dB), and provides a standardized way to evaluate image fidelity. It is defined as:
$$\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{MAX_I^2}{\mathrm{MSE}}\right) = 20\log_{10}(MAX_I) - 10\log_{10}(\mathrm{MSE})$$

where:
• MSE is the mean squared error between the two images.
• $MAX_I$ is the maximum possible pixel value of the image; if pixels are 8 bits, $MAX_I = 2^8 - 1 = 255$.
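A direct, illustrative translation of the PSNR formula into NumPy, assuming 8-bit images so that MAX_I = 255:

```python
import numpy as np

def psnr(reference: np.ndarray, test: np.ndarray, max_i: float = 255.0) -> float:
    """Peak signal-to-noise ratio in decibels for two images of equal shape."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:                      # identical images: PSNR is infinite
        return float("inf")
    return 10.0 * np.log10(max_i ** 2 / mse)

# Toy data: an original image and a slightly perturbed "compressed" copy.
original = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
compressed = np.clip(original + np.random.randint(-5, 6, (64, 64)), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(original, compressed):.2f} dB")
```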
The Structural Similarity Index Measure (SSIM) is based on visible structures in the image. Using SSIM as a metric in the loss function produces a more visually appealing enhanced image than PSNR does. For two image blocks x and y, it is computed as:
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

where:
• $\mu_x$ and $\mu_y$ are the means of pixel intensity in image blocks x and y.
• $\sigma_x$ and $\sigma_y$ are the standard deviations of pixel intensity in image blocks x and y, and $\sigma_{xy}$ is their covariance.
• $C_1$ and $C_2$ are constants that keep the denominators away from zero.

The index can also be decomposed into three measurements: luminance $l(x,y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}$, contrast $c(x,y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}$, and structure $s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}$ with $C_3 = C_2/2$. Multiplying these three components yields the SSIM formula above, which quantifies the perceived similarity of the two images.
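In practice SSIM is rarely re-implemented by hand; the sketch below simply calls scikit-image's structural_similarity on two illustrative arrays:

```python
import numpy as np
from skimage.metrics import structural_similarity

# Two illustrative 8-bit images: a reference and a slightly perturbed copy.
reference = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
noise = np.random.randint(-10, 11, reference.shape)
distorted = np.clip(reference.astype(int) + noise, 0, 255).astype(np.uint8)

# data_range is the dynamic range of the pixel values (255 for 8-bit images).
ssim_value = structural_similarity(reference, distorted, data_range=255)
print(f"SSIM: {ssim_value:.4f}")
```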
The Fréchet Inception Distance (FID) is a key metric for evaluating the quality of images produced by generative models such as generative adversarial networks (GANs). It measures the similarity between the distribution of generated images and the distribution of the real images used to train the generator. It is computed as:
$$\mathrm{FID} = \left\| \mu - \mu_w \right\|^2 + \mathrm{tr}\!\left( \Sigma + \Sigma_w - 2\,(\Sigma\,\Sigma_w)^{1/2} \right)$$

where:
• $\mu$ and $\mu_w$ refer to the feature-wise means of the real and generated images.
• $\Sigma$ and $\Sigma_w$ are the covariance matrices of the real and generated feature vectors.
• $\left\| \mu - \mu_w \right\|^2$ is the sum squared difference between the two mean vectors.
• tr refers to the trace linear-algebra operation, i.e. the sum of the elements along the main diagonal of the square matrix.
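Assuming the feature vectors have already been extracted (for example from an Inception network, which is not shown here), the distance itself reduces to a few lines of NumPy and SciPy:

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_features: np.ndarray, generated_features: np.ndarray) -> float:
    """FID between two sets of feature vectors, each of shape (num_samples, feature_dim)."""
    mu, mu_w = real_features.mean(axis=0), generated_features.mean(axis=0)
    sigma = np.cov(real_features, rowvar=False)
    sigma_w = np.cov(generated_features, rowvar=False)

    covmean = linalg.sqrtm(sigma @ sigma_w)        # matrix square root of the product
    if np.iscomplexobj(covmean):                   # discard tiny imaginary parts from numerics
        covmean = covmean.real

    return float(np.sum((mu - mu_w) ** 2) + np.trace(sigma + sigma_w - 2.0 * covmean))

# Toy feature sets standing in for Inception activations.
real = np.random.randn(200, 64)
fake = np.random.randn(200, 64) + 0.5
print(f"FID: {frechet_distance(real, fake):.2f}")
```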
Generative Adversarial Networks (GANs) are a deep learning architecture designed for generative modeling. These networks can produce realistic images that resemble photographs of human faces, even though the faces are entirely fictional and do not depict any actual individual.
(Source: https://developers.google.com/machine-learning/gan/gan_structure)
The architecture of a Generative Adversarial Network consists of two key models: the generator and the discriminator. The generator creates new samples that mimic an existing data distribution, while the discriminator determines whether a given sample is authentic or fabricated. The relationship between the two models is illustrated in Figure 2.5.
The generator model takes a random vector as input and produces a sample in the target domain. This input vector is typically drawn from a Gaussian (Normal) distribution and serves as the seed of the generative process.
The discriminator model classifies input images as either real or fake: a real image comes from the dataset, while a fake one was produced by the generator. In essence, the discriminator is a standard classification model. Once training is complete, the discriminator is set aside and the focus shifts to the generator, which is the central component of the GAN architecture.
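The sketch below is a minimal, purely illustrative PyTorch version of these two models; the latent dimension, layer sizes, and image size are arbitrary assumptions and not the architecture used later in this work.

```python
import torch
import torch.nn as nn

latent_dim = 100                      # size of the random input vector (assumed)

# Generator: maps a random vector drawn from a Normal distribution to a 28x28 image.
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 28 * 28), nn.Tanh(),
)

# Discriminator: classifies a flattened image as real (from the dataset) or fake (generated).
discriminator = nn.Sequential(
    nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

z = torch.randn(16, latent_dim)        # batch of random input vectors
fake_images = generator(z)             # generator output: (16, 784)
realness = discriminator(fake_images)  # probability each sample is real: (16, 1)
print(fake_images.shape, realness.shape)
```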
Figure 2.6: The discriminator can easily distinguish generated samples from real ones.
(Source: https://developers.google.com/machine-learning/gan/gan_structure)