MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION

INFORMATION TECHNOLOGY
GRADUATION THESIS
STUDY LAMBDA ARCHITECTURE AND
BUILD A DEMO APPLICATION
LECTURER: M.S. QUACH DINH HOANG
STUDENTS: NGUYEN THANH PHAT, TRAN DUC TUAN
SKL011153
MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY FOR HIGH-QUALITY TRAINING
INFORMATION TECHNOLOGY
CAPSTONE PROJECT
LECTURER: M.S. Quach Dinh Hoang
STUDENTS: Nguyen Thanh Phat - 19110101
          Tran Duc Tuan - 19110140
STUDY LAMBDA ARCHITECTURE AND BUILD A
DEMO APPLICATION
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, August 11, 2023
GRADUATION PROJECT ASSIGNMENT
Advisor: M.S. Quach Dinh Hoang
1. Project title: Study Lambda architecture and build a demo application
2. Initial materials provided by the advisor: The advisor suggested reference sources and articles about Lambda architecture to help us carry out this project.
3. Content of the project: Combine Big Data tools and take advantage of their strengths, then study Lambda architecture and how it resolves the scalability and complexity problems of Big Data systems.
4. Final product: A model that applies Lambda architecture to stream and process a massive amount of data, and a dashboard that illustrates these data as charts.
CHAIR OF THE PROGRAM
(Sign with full name)
ADVISOR
(Sign with full name)
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, August 11, 2023
ADVISOR’S EVALUATION SHEET
Major: Software Engineering
Project title: Study Lambda architecture and build a demo application
Advisor: M.S. Quach Dinh Hoang
EVALUATION
1. Content of the project:
2. Strengths:
3. Weaknesses:
4. Approval for oral defense? (Approved or denied)
5. Overall evaluation: (Excellent, Good, Fair, Poor)
6. Mark: ……… (in words: )
Ho Chi Minh City, August 11, 2023
ADVISOR
(Sign with full name)
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, August 11, 2023
PRE-DEFENSE EVALUATION SHEET
Major: Software Engineering
Project title: Study Lambda architecture and build a demo application
Name of Reviewer: Ph.D. Tran Nhat Quang
THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
Ho Chi Minh City, August 11, 2023
EVALUATION SHEET OF DEFENSE COMMITTEE MEMBER
Major: Software Engineering
Project title: Study Lambda architecture and build a demo application
Name of Defense Committee Member:
EVALUATION
1. Content of the project:
2. Strengths:
3. Weaknesses:
4. Approval for oral defense? (Approved or denied)
5. Overall evaluation: (Excellent, Good, Fair, Poor)
6. Mark: ……… (in words: )
Ho Chi Minh City, August 11, 2023
COMMITTEE MEMBER
(Sign with full name)
ACKNOWLEDGEMENTS
We would like to offer our heartfelt appreciation to the Faculty for High-Quality Training, Ho Chi Minh City University of Technology and Education, and all the professors and teachers who have actively taught and supported us during our study and research.
In particular, we would like to express our heartfelt thanks to Mr. Quach Dinh Hoang for personally instructing and leading us, and for establishing all suitable conditions for us to carry out this thesis. He helped us a lot in picking a topic and directed us in understanding and investigating the theory and putting it into practice. During the thesis implementation, the knowledge he provided was extremely valuable, not only assisting us in completing the thesis but also supplementing a significant amount of core knowledge.
However, due to our limited professional expertise and personal experience, shortcomings in the report are unavoidable. We would like to receive your feedback and further guidance to make this report more thorough.
Ho Chi Minh City, August 11, 2023
Student group
Nguyen Thanh Phat - 19110101 Tran Duc Tuan - 19110140
FOREWORD
To overcome the limitations of NoSQL and traditional databases, we combine these tools to ensure the system runs with low latency, high performance and fault tolerance, allowing it to continue running even if parts of it fail unexpectedly. This architecture is called the Lambda architecture, which we will discuss and put into practice later.
Applying that practice, our team would like to research deeper and present a report on the Lambda architecture application topic: "Study Lambda architecture and build a demo application".
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
FOREWORD
TABLE OF CONTENTS
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
1.1 Motivation for selecting the topic
1.2 Research object
1.3 Project scope
1.4 Aims and objectives
CHAPTER 2 OVERVIEW OF LAMBDA ARCHITECTURE
2.1 Definition
2.2 Desired properties
2.2.1 Fault tolerance
2.2.2 Low latency
2.2.3 High throughput
2.2.4 Scalability
2.2.5 Extensibility
2.2.6 Ad hoc queries
2.3 Architecture
2.3.1 Batch layer
2.3.2 Speed layer
2.3.3 Serving layer
CHAPTER 3 APPLY LAMBDA ARCHITECTURE TO SOLVE THE PROBLEMS OF DATABASE SCALABILITY IN THE REAL WORLD
3.1 Some related research
3.1.1 Research on “Big Data: Principles and Best Practices of Scalable Realtime Data Systems” [1]
3.1.1.1 Summary
3.1.1.2 Result of the research
3.2 Design Lambda architecture
CHAPTER 4 EXPERIMENTS AND RESULTS
4.1 Data for experiments
4.1.1 Using Twitter API
4.1.1.1 Definition
4.1.1.2 Limitation
4.1.2 Using static dataset
4.1.2.1 Reference source for data
4.1.2.2 Description of the dataset
4.2 Environments and tools
4.2.1 Python
4.2.2 Docker
4.2.3 Visual Studio Code
4.2.4 Great Expectations
4.3 Big data tools
4.3.1 Snowflake
4.3.1.1 Definition
4.3.1.2 Architecture
4.3.1.2.1 Database storage
4.3.1.2.2 Query processing
4.3.1.2.3 Cloud services
4.3.2 Apache Spark
4.3.2.1 Definition
4.3.2.2 Architecture
4.3.2.2.1 What is RDD?
4.3.2.2.2 Lazy evaluation
4.3.2.2.3 How Spark works
4.3.2.3 What is Spark Streaming?
4.3.3 Apache Kafka
4.3.3.1 Definition
4.3.3.2 Architecture
4.3.3.2.1 Consumer
4.3.3.2.2 Producer
4.3.3.2.3 Topic
4.3.3.2.4 Partition
4.3.3.2.5 Broker
4.3.3.3 How Kafka works
4.3.4 Apache Airflow
4.3.4.1 Definition
4.3.4.2 Architecture
4.3.4.2.1 Scheduler
4.3.4.2.2 Executor
4.3.4.2.3 Webserver
4.3.4.2.4 DAG directory
4.3.4.2.5 Workers
4.3.4.2.6 Metadata database
4.4 Implementation process
4.4.1 Setup environment using Docker
4.4.2 Ingest data
4.4.2.1 How to simulate a real-time dataset
4.4.3 Design database
4.4.4 Insert data to Kafka topic and data lake
4.4.5 Batch layer
4.4.6 Speed layer
4.4.7 Serving layer
4.5 Training model
4.5.1 Feature Selection & Data Split
4.5.1.1 Checking null values and selecting features
4.5.1.2 Numerical data
4.5.1.3 Categorical data
4.5.2 Modeling
4.5.3 Hyperparameter tuning
4.6 Experimental Results
CHAPTER 5 CONCLUSIONS AND DEVELOPMENT ORIENTATIONS
5.1 Content implemented
5.2 Restrictions
5.3 Further development direction
REFERENCES
LIST OF FIGURES
Figure 1 Lambda architecture
Figure 2 Proposed Lambda architecture
Figure 3 Snowflake's architecture
Figure 4 Spark RDDs
Figure 5 Spark architecture
Figure 6 Kafka architecture
Figure 7 Airflow's architecture
Figure 8 Environment file
Figure 9 Kafka and Zookeeper containers
Figure 10 Fetch data workflow
Figure 11 Airflow variables
Figure 12 Database schema
Figure 13 Kafka jobs workflow
Figure 14 Batch layer workflow
Figure 15 Speed layer workflow
Figure 16 Calculate average total sale ratio
Figure 17 Missing values by column
Figure 18 Heatmap of correlations between variables
Figure 19 Features standard scaler
Figure 20 Replace null values
Figure 21 Convert a categorical variable into dummy variables
Figure 22 Split the dataset into train & test sets
Figure 23 Pre-built algorithms from the scikit-learn package
Figure 24 Explained variance score of 5 models
Figure 25 Mean absolute percentage error of 5 models
Figure 26 Cross validation of 5 models
Figure 27 List of hyperparameters
Figure 28 Best hyperparameters and MAPE
Figure 29 True and predicted prices
Figure 30 Dashboard
Figure 31 Charts depicting the total sale amount, total customers and average sale ratio
Figure 32 Charts about total customers by property type and town
CHAPTER 1 INTRODUCTION
1.1 Motivation for selecting the topic
With traditional databases, an application we start developing grows bigger and bigger over time as it becomes more popular. Eventually you will hit the limits of traditional database technologies.
Suppose we build a simple web analytics application. The application should track the number of pageviews for any URL a customer wishes to track. The customer's web page pings the application's web server with its URL every time a pageview is received. The application should also be able to tell us, at any time, what the top 50 URLs are, sorted by number of pageviews. Suppose the database used to develop the application is a traditional relational database such as MySQL or PostgreSQL. Whenever someone loads a web page tracked by the application, the page pings your web server with the pageview and the server increments the corresponding row in the database.
There is no problem with the scalability or complexity of the database until the application becomes popular and the amount of data received grows very large. Then many problems emerge; there are ways to patch them, but only temporarily, because the incoming volume eventually exceeds the write throughput an RDBMS can sustain for these update increments.
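As a minimal sketch of this naive design (using SQLite for illustration; the table name and schema are our own assumptions), the per-pageview increment looks like this:

```python
# Naive pageview counter: one database write per pageview.
import sqlite3

conn = sqlite3.connect("analytics.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS pageviews (url TEXT PRIMARY KEY, views INTEGER)"
)

def record_pageview(url: str) -> None:
    # This per-row increment is exactly the operation that stops
    # scaling once the write volume becomes very large.
    conn.execute(
        "INSERT INTO pageviews (url, views) VALUES (?, 1) "
        "ON CONFLICT(url) DO UPDATE SET views = views + 1",
        (url,),
    )
    conn.commit()

def top_urls(limit: int = 50):
    # The "top 50 URLs" query from the example above.
    return conn.execute(
        "SELECT url, views FROM pageviews ORDER BY views DESC LIMIT ?",
        (limit,),
    ).fetchall()

record_pageview("https://example.com/home")
print(top_urls())
```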
Due to the limitations of traditional databases, we might decide to use NoSQL databases instead. NoSQL helps handle very large amounts of data at large scale and avoids other problems that occur with traditional databases. Examples include large-scale computation systems like Hadoop and databases such as Cassandra and MongoDB.
Although these tools can resolve the problems of database scalability and application complexity, they come with serious trade-offs.
Hadoop, for example, can parallelize large-scale batch computations on very large amounts of data, but the computations have high latency. We don't use Hadoop for anything where we need low-latency results.
NoSQL databases like Cassandra achieve their scalability by offering a much more limited data model than you're used to with something like SQL. Squeezing our application into these limited data models can be very complex.
In summary, several problems emerge as an application built on traditional databases or NoSQL databases becomes more and more popular:
● Cassandra: reading data is slow, and aggregations are not supported
● MongoDB: joining two documents is complex, so it is hard to perform complex or ad-hoc queries
● Hadoop: high latency when computing data
Given the problems mentioned above, we will research and study the Lambda architecture, which provides a consistent approach with high performance, low latency and fault tolerance as the application grows. It also removes the trade-offs of each individual tool by combining multiple different technologies and tools. We will then design and build a demo application based on this architecture.
1.2 Research object
After researching the topic, reading a book on how the Lambda architecture was born to resolve these problems, and applying the knowledge learned at school, our group identified the objects to be studied as:
• Lambda architecture
• Apache Kafka
• Apache Spark
1.3 Project scope
The scope of this project includes validating the ingested data to ensure the quality of data meets our requirements.
1.4 Aims and objectives
In this project, we will research, design and construct a demo application that relies on the Lambda architecture to solve problems of latency, availability and consistency when ingesting and processing data in real time. In addition, we create a real-time dashboard that presents information and depicts features of the data as charts, to verify that the application runs as expected. We also apply machine learning to this dataset, training a model and making predictions as new data arrives in the database.
CHAPTER 2 OVERVIEW OF LAMBDA ARCHITECTURE
2.1 Definition
The Lambda Architecture is a Big Data processing architecture that enables processing massive amounts of data at scale by building big data systems as a series of layers. It allows for massive data processing by scaling out rather than scaling up.
The Lambda Architecture is a new paradigm for handling vast amounts of data It is a scalable and easy-to-understand approach to processing large volumes of rapidly arriving data using batch and stream-processing methods It is used to create high-performance, scalable, flexible, and extendable systems
There are three layers in this architecture, as shown in Figure 1:
• Batch layer: computes over a vast amount of data, with high latency
• Speed layer: streams real-time data
• Serving layer: indexes the views from the batch layer and speed layer so that they can be used in ad-hoc queries with low latency
Figure 1 Lambda architecture
2.2.3 High throughput
High throughput refers to the amount of data processed and transmitted from one place to another in a given time. Processing an enormous amount of data drives latency higher, and the cost of processing large packets to achieve high throughput rises rapidly.
2.2.4 Scalability
Scalability is the ability to maintain performance when the amount of data or the load increases rapidly, by adding resources to the system. The Lambda architecture is horizontally scalable, meaning that you do not need to replace your commodity (inexpensive) machines with more powerful and expensive ones; you just add more commodity machines to distribute the data.
2.2.5 Extensibility
Extensible systems allow functionality to be added with minimal development cost. A new feature, or a change to an existing feature, may require a migration of old data into a new format. Part of making a system extensible is therefore making it easy to do large-scale migrations.
2.2.6 Ad hoc queries
Being able to run ad hoc queries on large data is extremely important. Ad hoc queries are single questions or requests for a database, written in SQL or another query language by the user on demand, when the user needs information outside of regular reporting or predefined queries.
2.3 Architecture
2.3.1 Batch layer
When a system ingests data continuously, the data is fed simultaneously to a data lake, which stores an immutable, constantly growing master dataset, and to the speed layer. The batch layer then takes the data from the data lake and processes it, computing arbitrary functions on this arbitrary dataset.
The most important mission of this layer is to compute over all the data to produce a value (for example, the total income of a taxi company from the date of establishment until now). Because the data is very large and the layer computes over all of it, the result of that computation arrives with high latency. We can tune the throughput by altering the size of the batch: more data in a batch means that data waits longer to be processed, which increases latency, but larger batch sizes may increase total throughput overall.
On the other hand, the batch layer is easy to use and can recompute the data at any time, because it simply loops over all the data and processes it. We can therefore parallelize these functions in the batch layer and get highly scalable computations. Usually, this layer is scheduled to run once or twice a day.
Even if data is lost in other layers, the results can be recomputed by running through the dataset again. The batch layer also precomputes the data into batch views and sends them to the serving layer for ad hoc queries.
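A minimal sketch of such a batch recomputation in PySpark might look like the following; the input path, schema and aggregation are illustrative assumptions, not the project's actual code:

```python
# Batch layer sketch: recompute a batch view from scratch over the
# entire master dataset in the data lake.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-layer").getOrCreate()

# Read the whole immutable master dataset.
sales = spark.read.csv("datalake/sales/*.csv", header=True, inferSchema=True)

# Compute the batch view over all data: total sales per town since
# the beginning of the dataset.
batch_view = sales.groupBy("Town").agg(F.sum("Sale Amount").alias("total_sales"))

# Persist the batch view for the serving layer to index.
batch_view.write.mode("overwrite").parquet("serving/batch_view/total_sales_by_town")
```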
2.3.2 Speed layer
Because a batch-layer computation may take a few hours or even several days to complete, the data arriving during this period would otherwise be missing from the results. This is where the speed layer comes in: it takes care of the data that is yet to be computed by the batch layer. The main goal is to ensure new data is represented in query functions as quickly as the application requirements demand.
The speed layer is similar to the batch layer, except that it produces a real-time view based on the data it receives. The big difference is that the speed layer deals only with recent data, while the batch layer looks at all data at once.
Besides, the speed layer updates the real-time view as new data arrives, which helps minimize latency. Consequently, we can run analytics on real-time data in this layer with very low response time.
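A speed-layer job of this kind could be sketched with Spark Structured Streaming as below; the topic name, schema and sink are illustrative assumptions, and the spark-sql-kafka connector package must be available to Spark:

```python
# Speed layer sketch: maintain an incrementally updated real-time view
# over records arriving on a Kafka topic.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("speed-layer").getOrCreate()

schema = StructType().add("Town", StringType()).add("Sale Amount", DoubleType())

# Read new records from the Kafka topic as an unbounded stream.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "sales")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
          .select("r.*"))

# Real-time view of totals over the recent, not-yet-batched data.
realtime_view = stream.groupBy("Town").agg(F.sum("Sale Amount").alias("total_sales"))

query = (realtime_view.writeStream
         .outputMode("complete")
         .format("memory")                      # in-memory sink, for illustration only
         .queryName("realtime_view")
         .trigger(processingTime="5 seconds")   # micro-batch interval
         .start())
query.awaitTermination()
```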
2.3.3 Serving layer
The serving layer supports read-only data access and real-time queries. The outputs of the batch layer and speed layer are stored in this layer. When new batch views are available, the serving layer automatically swaps them in so that more up-to-date results are available. This layer also supports indexing the views so that we can run ad hoc queries with low latency.
To get a complete result at any point in time, we need to combine the views from the batch and speed layers. Because the result from the batch layer is out of date while the result from the speed layer covers only recent data, we aggregate the two to obtain results in real time.
Additionally, once data has flowed from the batch layer into the serving layer, the corresponding results in the real-time views are no longer needed, so those views can be discarded. This reduces both recomputation time and the storage needed in the database.
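Conceptually, a serving-layer query merges the two views; the dict-based sketch below is an illustrative stand-in for the indexed tables:

```python
# Merge a complete-but-stale batch view with a recent-only real-time view.
def merged_total(key: str, batch_view: dict, realtime_view: dict) -> float:
    return batch_view.get(key, 0.0) + realtime_view.get(key, 0.0)

batch_view = {"Hartford": 1_200_000.0}   # computed hours ago over all data
realtime_view = {"Hartford": 35_000.0}   # sales since the last batch run

print(merged_total("Hartford", batch_view, realtime_view))  # 1235000.0
```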
CHAPTER 3 APPLY LAMBDA ARCHITECTURE TO SOLVE THE PROBLEMS OF DATABASE SCALABILITY IN THE REAL WORLD
3.1 Some related research
3.1.1 Research on “Big Data: Principles and Best Practices of Scalable Realtime Data Systems” [1]
3.1.1.1 Summary
This book aims to resolve the problems that emerge when an application becomes more and more popular and ingests a vast amount of data over a period of time. Services like social networks, web analytics, and intelligent e-commerce often need to manage data at a scale too big for traditional databases. As scale and demand increase, so does complexity. Fortunately, scalability and simplicity are not mutually exclusive; rather than reaching for some trendy technology, a different approach is needed. Big data systems are horizontally scalable, using many separate machines working in parallel to store and process data, which introduces fundamental challenges unfamiliar to most developers.
The book describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. It shows how to build systems using the Lambda architecture by taking advantage of new clustered big data tools. It also covers how this architecture was born, what the limitations of traditional databases are, why NoSQL is not a panacea, and why we need to combine multiple different big data tools.
Furthermore, it details how each layer in this architecture works, why those layers are needed, and what desired properties they should have. It then introduces tools for each layer, such as Hadoop and Cassandra. Following a realistic example, the book guides readers through the theory of big data systems, how to use them in practice, and how to deploy and operate them once they are built.
3.1.1.2 Result of the research
This research aimed to understand the need for the Lambda architecture and how it resolves the scalability and complexity problems of big data systems. It helped us understand each layer in detail: how it works, and its benefits and challenges. It also gave us more knowledge about tools such as Hadoop and Cassandra, as well as how to combine them to create a big data system with high performance, low latency, and fault tolerance.
3.2 Design Lambda architecture
Figure 2 Proposed Lambda architecture
After reading this book, we propose the Lambda architecture designed in Figure 2 to resolve the problems of big data. In this architecture, when the system ingests data, it stores the data in a Kafka topic, which then feeds two consumers directly: the data lake and the speed layer.
The data lake stores a copy of the master data alongside the huge amount of incoming data. We use Snowflake as the cloud database for the data lake because it enables data storage, processing and analytics solutions that are faster, easier to use and more flexible.
The batch layer ingests data from the data lake and runs computations over all of it, which may take a long time to generate a result as a batch view. Apache Spark is a great tool here, processing data more than 10 times faster than Hadoop, and it provides APIs for many different purposes.
In the speed layer, we leverage Spark's streaming API, Spark Streaming, to stream near real-time data as it arrives in Kafka topics and compute it with very low latency and high performance.
We use Apache Airflow to orchestrate the workflows in this project. The batch layer is executed every 20 minutes, and the speed layer processes data every 5 seconds in micro-batches. Through the Airflow webserver UI we can easily monitor, manage, author and view the logs of all tasks.
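A minimal sketch of such an Airflow DAG is shown below; the DAG id, start date and command are illustrative assumptions:

```python
# Airflow DAG sketch: trigger the batch-layer job every 20 minutes.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="batch_layer",
    start_date=datetime(2023, 1, 1),
    schedule_interval=timedelta(minutes=20),  # run the batch job every 20 minutes
    catchup=False,
) as dag:
    run_batch = BashOperator(
        task_id="run_batch_job",
        bash_command="spark-submit /opt/jobs/batch_layer.py",
    )
```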
Finally, the serving layer stores both the batch views and the real-time views in Snowflake. Combining the results from the batch layer and speed layer produces a complete, real-time result at any time.
The main reason for using Kafka to distribute data to each layer is durability: if data went straight to the layers and the system crashed, all the data from that period would be lost. Kafka avoids this problem by persisting the data, so it can be retrieved in full whether or not the system goes down. Another benefit of Kafka is that we can resend old data to the layers by adjusting the consumer offset.
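A minimal sketch with the kafka-python client illustrates both publishing records and replaying old ones by rewinding the offset; the topic name and broker address are assumptions:

```python
import json
from kafka import KafkaProducer, KafkaConsumer, TopicPartition

# Publish a record to the topic that feeds the data lake and speed layer.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("sales", {"Town": "Hartford", "Sale Amount": 350000.0})
producer.flush()

# Because Kafka persists every record, a consumer can replay old data
# simply by rewinding its offset to the beginning of the partition.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
partition = TopicPartition("sales", 0)
consumer.assign([partition])
consumer.seek_to_beginning(partition)
for message in consumer:        # blocks, reading from offset 0 onwards
    print(message.offset, message.value)
```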
The reason we use Spark instead of Hadoop is that Spark can run up to 100 times faster in memory and 10 times faster on disk than Hadoop. It reduces the number of read/write cycles to disk and keeps intermediate data in memory, hence the faster processing speed. Besides, it offers many APIs, such as Spark Streaming, Spark SQL and MLlib, for various purposes. It can also process real-time data from real-time event sources such as Twitter, Instagram or Facebook.
Finally, for the database, we use the Snowflake cloud database instead of other NoSQL or SQL databases to store the data for the data lake and the serving layer, because of the limited resources of a local machine. Snowflake stores data in a columnar format, which is well suited to building analytics dashboards. In addition, we do not need to install any software to use it, or worry about managing or configuring hardware on a local machine, because Snowflake does that for us. It provides a webserver UI through which we can easily create and manage databases, grant permissions, or view the history of a query directly.
By combining these tools, we design and create an architecture based on the Lambda architecture that handles massive quantities of data by taking advantage of both batch and stream processing methods while removing the trade-offs of each individual tool.
CHAPTER 4 EXPERIMENTS AND RESULTS
4.1 Data for experiments
4.1.1 Using Twitter API
to use and get the data in near real-time. But there are some trade-offs to using it without paying for a product package.
4.1.2 Using static dataset
4.1.2.1 Reference source for data
In this project, we use the raw Real Estate Sales 2001-2020 GL dataset, provided as CSV files. The dataset spans 2001 to 2020 and contains about 997 thousand records for the system to ingest.
Trang 27Here is the link to download the dataset: 2001-2018
https://catalog.data.gov/dataset/real-estate-sales-4.1.2.2 Description of the dataset
The Office of Policy and Management maintains a listing of all real estate sales with a sales price of $2,000 or greater that occur between October 1 and September 30 of each year. For each sale record, the file includes: town, property address, date of sale, property type (residential, apartment, commercial, industrial or vacant land), sales price, and property assessment.
This is a description of the features included in this file [4] (a minimal loading sketch follows the list):
• Serial Number: Serial number
• List Year: Year the property was listed for sale
• Date Recorded: Date the sale was recorded locally
• Town: Town name
• Address: Address
• Assessed Value: Value of the property used for local tax assessment
• Sale Amount: Amount the property was sold for
• Sales Ratio: Ratio of the sale price to the assessed value
• Property Type: Type of property including: Residential, Commercial, Industrial, Apartments, Vacant, etc
• Residential Type: Indicates whether property is single or multifamily residential
• Non-Use Code: A non-usable sale code, which typically means the sale price is not reliable for use in determining a property value. See the attachments on the dataset description page for a listing of codes
• Assessor Remarks: Remarks from the assessor
• OPM remarks: Remarks from OPM
• Location: Lat / Lon coordinates
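A minimal sketch of loading and inspecting the dataset with pandas (the local file name is an assumption):

```python
import pandas as pd

# Load the static CSV dataset described above.
sales = pd.read_csv("Real_Estate_Sales_2001-2020_GL.csv")

print(sales.shape)             # roughly 997k rows
print(sales.columns.tolist())  # the fields listed above
print(sales[["Town", "Sale Amount", "Sales Ratio"]].head())
```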
4.2 Environments and tools
4.2.3 Visual Studio Code
Visual Studio Code (known as VS Code) is a free, open-source text editor created by Microsoft. VS Code is available for multiple platforms, such as Windows, Linux, and macOS. Although the editor is relatively lightweight, it includes powerful features and extensions that have made VS Code one of the most popular development environment tools.
4.2.4 Great Expectations
Great Expectations is a platform for validating, documenting and profiling data to maintain quality and collaborate with other teams. It is written in Python and is used to assert that data meets specific requirements for correctness, uniqueness, consistency and accuracy when it is ingested into or transformed within a database.
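A minimal sketch of such a check, using the classic pandas-backed Great Expectations API (pre-1.0 versions; the specific expectations are illustrative assumptions about what this project validates):

```python
import great_expectations as ge

# Load the dataset as a pandas-backed Great Expectations dataset.
sales = ge.read_csv("Real_Estate_Sales_2001-2020_GL.csv")

# Assert basic data-quality requirements on the incoming records.
sales.expect_column_values_to_not_be_null("Town")
sales.expect_column_values_to_be_between("Sale Amount", min_value=2000)

# Run the accumulated expectation suite and report overall success.
result = sales.validate()
print(result.success)
```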
4.3 Big data tools
4.3.1 Snowflake
4.3.1.1 Definition
Snowflake is a fully managed SaaS (software as a service) platform for data warehousing, data lakes, data engineering, data science, data application development, and the secure sharing and consumption of real-time or shared data. It is built on top of Amazon Web Services, Google Cloud Platform or Microsoft Azure cloud infrastructure, allowing for storage and compute, on-the-fly scalable compute, data sharing, and third-party tool support in order to handle the demanding needs of growing enterprises [5].
4.3.1.2 Architecture
Figure 3 Snowflake's architecture
Snowflake's architecture, shown in Figure 3, is a hybrid of traditional shared-disk and shared-nothing database architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the platform. But similar to shared-nothing architectures, Snowflake processes queries using MPP (massively parallel processing) compute clusters, where each node in the cluster stores a portion of the entire data set locally. This approach offers the data management simplicity of a shared-disk architecture, but with the performance and scale-out benefits of a shared-nothing architecture [6].
4.3.1.2.1 Database storage
When data is loaded into Snowflake, Snowflake reorganizes that data into its internal optimized, compressed, columnar format and stores this optimized data in cloud storage.
Snowflake manages all aspects of how this data is stored: the organization, file size, structure, compression, metadata, statistics, and other aspects of data storage are handled by Snowflake. The data objects stored by Snowflake are neither directly visible nor accessible to customers; they are only accessible through SQL query operations run using Snowflake [6].
4.3.1.2.2 Query processing
Query execution is performed in the processing layer. Snowflake processes queries using “virtual warehouses”. Each virtual warehouse is an MPP compute cluster composed of multiple compute nodes allocated by Snowflake from a cloud provider [6].
Each virtual warehouse is an independent compute cluster that manages its own resources, such as RAM and disk, and does not share them with other virtual warehouses. As a result, each virtual warehouse has no impact on the performance of the others.
Virtual warehouses can be started and stopped at any time. They can also be resized at any time, even while running, to accommodate the need for more or fewer compute resources based on the type of operations being performed, which saves a lot of cost.
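A minimal sketch using the snowflake-connector-python client shows both on-the-fly resizing and a query; the credentials, warehouse and table names are illustrative assumptions:

```python
import snowflake.connector

# Connect to a Snowflake account (placeholder credentials).
conn = snowflake.connector.connect(
    user="DEMO_USER",
    password="***",
    account="demo-account",
    warehouse="DEMO_WH",
    database="SALES_DB",
)
cur = conn.cursor()

# Resize the virtual warehouse on the fly, even while queries run.
cur.execute("ALTER WAREHOUSE DEMO_WH SET WAREHOUSE_SIZE = 'LARGE'")

# Query a serving-layer view; storage, optimization and metadata are
# handled by Snowflake behind the scenes.
cur.execute("SELECT town, total_sales FROM batch_view ORDER BY total_sales DESC LIMIT 10")
for row in cur.fetchall():
    print(row)
conn.close()
```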
4.3.1.2.3 Cloud services
The cloud services layer is a collection of services that coordinate activities across Snowflake. These services tie together all of the different components of Snowflake in order to process user requests, from login to query dispatch. The cloud services layer also runs on compute instances provisioned by Snowflake from the cloud provider [6].
Services managed in this layer include authentication, infrastructure management, metadata management, query parsing and optimization, and access control.
4.3.2.2 Architecture
Figure 4 Spark RDDs
As shown in Figure 4, RDDs are the building blocks of any Spark application. RDD stands for “Resilient Distributed Dataset”: the name reflects that the dataset is distributed across multiple nodes in a cluster and can tolerate failures of the system. An RDD is immutable and follows lazy transformations.
The data stored in an RDD is split into small pieces that are replicated across multiple nodes. This makes RDDs highly resilient, because they can recover quickly from failures.
Since RDDs are immutable, you cannot modify them after creation, but you can derive new ones by calling transformation functions. The benefit of using RDDs is that the data stored in them can be processed in parallel.
With RDDs, there are two types of operations you can perform (a minimal sketch follows at the end of this section):
• Transformations: operations applied to create a new RDD
• Actions: operations that apply a computation and pass the result back to the driver
In Spark, lazy evaluation means that you can apply as many transformations as you want, but Spark does not execute them until you call an action. Instead, it builds a DAG that records the operations in order. When an action is called, Spark executes the DAG in the most optimized way.
The benefit of lazy evaluation is improved efficiency: it saves execution time and reduces the memory needed to store intermediate results.
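A minimal PySpark sketch of both operation types and of lazy evaluation (the sample data is arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 1_000_001))

# Transformations: nothing executes yet; Spark only records these
# steps in the DAG.
squares = rdd.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Action: triggers execution of the whole DAG in one optimized pass
# and returns the result to the driver.
total = evens.reduce(lambda a, b: a + b)
print(total)
```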