Study lambda architecture and build a demo application


DOCUMENT INFORMATION

Title: Study Lambda Architecture And Build A Demo Application
Authors: Nguyen Thanh Phat, Tran Duc Tuan
Advisor: M.S. Quach Dinh Hoang
University: Ho Chi Minh City University of Technology and Education
Major: Information Technology
Type: Graduation Project
Year: 2023
City: Ho Chi Minh City
Pages: 64
File size: 4.24 MB


MINISTRY OF EDUCATION AND TRAINING

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION

GRADUATION THESIS
INFORMATION TECHNOLOGY

STUDY LAMBDA ARCHITECTURE AND BUILD A DEMO APPLICATION

LECTURER: M.S. QUACH DINH HOANG
STUDENTS: NGUYEN THANH PHAT, TRAN DUC TUAN

SKL 011153

MINISTRY OF EDUCATION AND TRAINING

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION

FACULTY FOR HIGH-QUALITY TRAINING

CAPSTONE PROJECT
INFORMATION TECHNOLOGY

STUDY LAMBDA ARCHITECTURE AND BUILD A DEMO APPLICATION

LECTURER: M.S. Quach Dinh Hoang
STUDENTS: Nguyen Thanh Phat - 19110101, Tran Duc Tuan - 19110140

THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

Ho Chi Minh City, August 11, 2023

GRADUATION PROJECT ASSIGNMENT

Advisor: M.S Quach Dinh Hoang

1. Project title: Study Lambda architecture and build a demo application

2. Initial materials provided by the advisor: the advisor suggested reference sources and articles about Lambda architecture to support this project.

3. Content of the project: combine Big Data tools and take advantage of them, then learn about Lambda architecture and how it resolves the problems of scalability and complexity in Big Data systems.

4. Final product: a model that applies Lambda architecture to stream and process a massive amount of data, and a dashboard that illustrates these data with charts.

CHAIR OF THE PROGRAM

(Sign with full name)

ADVISOR

(Sign with full name)


THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

Ho Chi Minh City, August 11, 2023

ADVISOR’S EVALUATION SHEET

Major: Software Engineering

Project title: Study Lambda architecture and build a demo application

Advisor: M.S Quach Dinh Hoang

EVALUATION

1. Content of the project:

2. Strengths:

3. Weaknesses:

4. Approval for oral defense? (Approved or denied)

5. Overall evaluation: (Excellent, Good, Fair, Poor)

6. Mark: ……… (in words: )

Ho Chi Minh City, August 11, 2023

ADVISOR

(Sign with full name)


THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

Ho Chi Minh City, August 11, 2023

PRE-DEFENSE EVALUATION SHEET

Major: Software Engineering

Project title: Study Lambda architecture and build a demo application

Name of Reviewer: PhD Tran Nhat Quang


THE SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

Ho Chi Minh City, August 11, 2023

EVALUATION SHEET OF DEFENSE COMMITTEE MEMBER

Major: Software Engineering

Project title: Study Lambda architecture and build a demo application

Name of defense Committee Member:

EVALUATION

1. Content of the project:

2. Strengths:

3. Weaknesses:

4. Approval for oral defense? (Approved or denied)

5. Overall evaluation: (Excellent, Good, Fair, Poor)

6. Mark: ……… (in words: )

Ho Chi Minh City, August 11, 2023

COMMITTEE MEMBER

(Sign with full name)


ACKNOWLEDGEMENTS

We would like to offer our heartfelt appreciation to the Department of High-Quality Training, Ho Chi Minh City University of Technology and Education, and all the professors and teachers who have actively taught and supported us during our study and research.

In particular, we would like to express our heartfelt thanks to Mr. Quach Dinh Hoang for personally instructing and guiding us, and for establishing suitable conditions for us to carry out this thesis. He helped us greatly in picking a topic and directing us to understand and investigate the theory and put it into practice. Throughout the thesis, the knowledge he provided was extremely valuable, not only assisting us in completing the thesis but also supplementing a significant amount of core knowledge.

However, due to our limited professional expertise and personal experience, shortcomings in this report are unavoidable. We would welcome your comments and further guidance to make the report more thorough.

Ho Chi Minh City, August 11, 2023

Student group

Nguyen Thanh Phat - 19110101 Tran Duc Tuan - 19110140


To overcome the limitations of NoSQL and traditional databases, we combine these tools to ensure the system runs with low latency, high performance, and fault tolerance, allowing it to continue running even if parts of the system go down unexpectedly. This architecture is called the Lambda architecture, which we will discuss and put into practice later.

Following that approach, our team researched the subject in depth and presents a report on an application of the Lambda architecture, titled "Study Lambda architecture and build a demo application".


TABLE OF CONTENTS

ACKNOWLEDGEMENTS 7

FOREWORD 8

TABLE OF CONTENTS 9

LIST OF FIGURES 13

CHAPTER 1 INTRODUCTION 15

1.1 Motivation for selecting the topic

1.2 Research object

1.3 Project scope

1.4 Aims and objectives

CHAPTER 2 OVERVIEW OF LAMBDA ARCHITECTURE 18

2.1 Definition

2.2 Desired properties

2.2.1 Fault tolerance

2.2.2 Low latency

2.2.3 High throughput

2.2.4 Scalability

2.2.5 Extensibility

2.2.6 Ad hoc queries

2.3 Architecture

2.3.1 Batch layer

2.3.2 Speed layer

2.3.3 Serving layer

CHAPTER 3 APPLY LAMBDA ARCHITECTURE TO SOLVE THE PROBLEMS OF DATABASE SCALABILITY IN THE REAL WORLD 22


3.1 Some related research

3.1.1 Research on “Big Data: Principles and Best Practices of Scalable Realtime Data Systems” [1]

3.1.1.1 Summary

3.1.1.2 Result of the research

3.2 Design Lambda architecture

CHAPTER 4 EXPERIMENTS AND RESULTS 26

4.1 Data for experiments

4.1.1 Using Twitter API

4.1.1.1 Definition

4.1.1.2 Limitation

4.1.2 Using static dataset

4.1.2.1 Reference source for data

4.1.2.2 Description of the dataset

4.2 Environments and tools

4.2.1 Python

4.2.2 Docker

4.2.3 Visual Studio Code

4.2.4 Great Expectations

4.3 Big data tools

4.3.1 Snowflake

4.3.1.1 Definition

4.3.1.2 Architecture

4.3.1.2.1 Database storage

4.3.1.2.2 Query processing


4.3.1.2.3 Cloud services

4.3.2 Apache Spark

4.3.2.1 Definition

4.3.2.2 Architecture

4.3.2.2.1 What is RDD

4.3.2.2.2 Lazy evaluation

4.3.2.2.3 How Spark works?

4.3.2.3 What is Spark Streaming?

4.3.3 Apache Kafka

4.3.3.1 Definition

4.3.3.2 Architecture

4.3.3.2.1 Consumer

4.3.3.2.2 Producer

4.3.3.2.3 Topic

4.3.3.2.4 Partition

4.3.3.2.5 Broker

4.3.3.3 How Kafka works?

4.3.4 Apache Airflow

4.3.4.1 Definition

4.3.4.2 Architecture

4.3.4.2.1 Scheduler

4.3.4.2.2 Executor

4.3.4.2.3 Webserver

4.3.4.2.4 DAG directory


4.3.4.2.5 Workers

4.3.4.2.6 Metadata database

4.4 Implementation process

4.4.1 Setup environment using Docker

4.4.2 Ingest data

4.4.2.1 How to simulate real-time dataset

4.4.3 Design database

4.4.4 Insert data to Kafka topic and data lake

4.4.5 Batch layer

4.4.6 Speed layer

4.4.7 Serving layer

4.5 Training model

4.5.1 Feature Selection & Data Split

4.5.1.1 Checking null value and select features

4.5.1.2 Numerical data

4.5.1.3 Categorical data

4.5.2 Modeling

4.5.3 Hyperparameter tuning

4.6 Experimental Results

CHAPTER 5 CONCLUSIONS AND DEVELOPMENT ORIENTATIONS 61

5.1 Content implemented

5.2 Restrictions

5.3 Further development direction

REFERENCES 62


LIST OF FIGURES

Figure 1 Lambda architecture 18

Figure 2 Proposed Lambda architecture 23

Figure 3 Snowflake's architecture 29

Figure 4 Spark RDDs 31

Figure 5 Spark architecture 33

Figure 6 Kafka architecture 36

Figure 7 Airflow's architecture 37

Figure 8 Environment file 39

Figure 9 Kafka and Zookeeper containers 40

Figure 10 Fetch data workflow 41

Figure 11 Airflow variables 43

Figure 12 Database schema 44

Figure 13 Kafka jobs workflow 45

Figure 14 Batch layer workflow 47

Figure 15 Speed layer workflow 48

Figure 16 Calculate average total sale ratio 50

Figure 17 Missing value by columns 51

Figure 18 Heatmap of Correlations Between Variables 52

Figure 19 Features Standard Scaler 53

Figure 20 Replace Null value 53

Figure 21 Convert A Categorical Variable into Dummy Variables 54

Figure 22 Split the Dataset to Train & Test Set 54

Figure 23 Pre-built algorithms by scikit-learn package 54

Figure 24 Explained Variance Score of 5 Models 55

Figure 25 Mean Absolute Percentage Error of 5 Models 56

Figure 26 Cross validation of 5 Models 56

Figure 27 List of hyperparameters 57

Figure 28 Best Hyperparameters and MAPE 58

Figure 29 True and Predicted Prices 58

Figure 30 Dashboard 59
Figure 31 Charts depict the total sale amount, total customer and average sale ratio 59
Figure 32 Charts about total customer by property type and town 60

CHAPTER 1 INTRODUCTION

1.1 Motivation for selecting the topic

With traditional databases, the application we start developing gets bigger and bigger over time as it becomes more popular. Eventually you will hit the limits of traditional database technologies.

Suppose we build a simple web analytics application. The application should track the number of pageviews for any URL a customer wishes to track. The customer's web page pings the application's web server with its URL every time a pageview is received. Besides, the application should tell us at any time what the top 50 URLs are, sorted by number of pageviews. The database used to develop the application is a traditional database such as MySQL or PostgreSQL. Whenever someone loads a web page tracked by the application, it pings your web server with the pageview and increments the corresponding row in the database.

There is no problem with the scalability or complexity of your database until the application becomes popular and the amount of data received becomes huge. Then many problems may emerge; there are solutions, but they are only temporary, because the volume of incoming data exceeds the write capacity of the RDBMS when updating increments.
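The pageview counter described above can be sketched with an in-memory SQLite database standing in for MySQL or PostgreSQL; the table and column names are illustrative, not taken from the project.

```python
import sqlite3

# In-memory SQLite as a stand-in for a traditional RDBMS (MySQL/PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (url TEXT PRIMARY KEY, count INTEGER)")

def record_pageview(url: str) -> None:
    # Every pageview is one read-modify-write against the database; at high
    # traffic this per-row UPDATE becomes the write bottleneck described above.
    cur = conn.execute(
        "UPDATE pageviews SET count = count + 1 WHERE url = ?", (url,)
    )
    if cur.rowcount == 0:
        conn.execute("INSERT INTO pageviews (url, count) VALUES (?, 1)", (url,))

def top_urls(n: int = 50):
    # "Top N URLs sorted by number of pageviews" from the scenario above.
    return conn.execute(
        "SELECT url, count FROM pageviews ORDER BY count DESC LIMIT ?", (n,)
    ).fetchall()

for u in ["/home", "/home", "/about", "/home"]:
    record_pageview(u)
print(top_urls(2))  # [('/home', 3), ('/about', 1)]
```

This works fine at small scale; the point of the chapter is that the per-pageview UPDATE pattern is exactly what stops scaling as traffic grows.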

Due to the limitations of traditional databases, we decide to use NoSQL databases. NoSQL helps handle very large amounts of data, large scale, and other problems that occur in traditional databases. There are large-scale computation systems like Hadoop and databases such as Cassandra and MongoDB.

Although NoSQL can resolve the problems of database scalability and application complexity, there are serious trade-offs.

Hadoop, for example, can parallelize large-scale batch computations on very large amounts of data, but the computations have high latency. We don't use Hadoop for anything where we need low-latency results.


NoSQL databases like Cassandra achieve their scalability by offering a much more limited data model than you are used to with something like SQL. Squeezing our application into these limited data models can be very complex.

Several problems emerge as the application becomes more and more popular, across both traditional and NoSQL databases:

● Cassandra: reading data is slow, and aggregations are not supported

● MongoDB: joining two documents is complex, so it is hard to perform complex or ad-hoc queries

● Hadoop: high latency when computing data

From the problems mentioned above, we will research and study the Lambda architecture, which provides a consistent approach with high performance, low latency, and fault tolerance as the application grows in popularity. It also removes the trade-offs of each individual tool by combining multiple technologies. We will then design and build a demo application based on this architecture.

1.2 Research object

After researching the topic, reading about how the Lambda architecture was created to resolve these problems, and applying the knowledge learned at school, our group identified the objects to be studied as:

• Lambda architecture

• Apache Kafka

• Apache Spark


to ensure the quality of data meets our requirements

1.4 Aims and objectives

In this project, we will research, design, and construct a demo application that relies on the Lambda architecture to solve problems of latency, availability, and consistency when ingesting and processing data in real time. In addition, we create a real-time dashboard that presents information and depicts some features of the data as charts, to verify that the application runs as expected. We also apply some machine learning to this dataset, retraining the model to predict as new data arrives in the database.


CHAPTER 2 OVERVIEW OF LAMBDA ARCHITECTURE

2.1 Definition

The Lambda Architecture is a Big Data processing architecture that enables processing massive amounts of data at scale and building big data systems as a series of layers. It allows for massive data processing by scaling out rather than scaling up.

The Lambda Architecture is a new paradigm for handling vast amounts of data: a scalable and easy-to-understand approach to processing large volumes of rapidly arriving data using both batch and stream-processing methods. It is used to create high-performance, scalable, flexible, and extendable systems.

There are 3 layers in this architecture, as shown in figure 1:

• Batch layer: computes over a vast amount of data, with high latency

• Speed layer: streams real-time data

• Serving layer: indexes the views from the batch layer and speed layer so they can be used in ad hoc queries with low latency

Figure 1 Lambda architecture


2.2.3 High throughput

High throughput means that a vast amount of data is processed and transmitted from one place to another in a given time. Processing an enormous amount of data raises latency, and the cost of processing large packets to achieve high throughput increases rapidly.

2.2.4 Scalability

Scalability is the ability to maintain performance when the amount of data or load increases rapidly, by adding resources to the system. The Lambda architecture is horizontally scalable: you do not need to replace your commodity (inexpensive) machines with more powerful and expensive ones, you just add more commodity machines to distribute the data.

2.2.5 Extensibility

Extensible systems allow functionality to be added at minimal development cost. Sometimes a new feature, or a change to an existing feature, requires migrating old data into a new format. Part of making a system extensible is making it easy to do such large-scale migrations.

2.2.6 Ad hoc queries

Being able to do ad hoc queries on large data is extremely important. Ad hoc queries are single questions or requests for a database, written in SQL or another query language by the user on demand, when the user needs information beyond regular reporting or predefined queries.


2.3 Architecture

2.3.1 Batch layer

When a system ingests data continuously, the data is fed simultaneously to a data lake, which stores an immutable, constantly growing dataset on which arbitrary functions can be computed, and to the speed layer. The batch layer then gets the data from the data lake and processes it.

The most important mission of this layer is to compute over all the data to produce a value (for example, the total income of a taxi company from its establishment date until now). Because the data is very large and all of it is computed, the result of that computation arrives with high latency. We can tune the throughput by altering the size of the batch: more data in a batch means that data waits longer to be processed, which increases latency, but overall larger batch sizes may increase total throughput.

On the other hand, the batch layer is easy to use and can recompute the data at any time, because it simply loops over all the data and processes it. Therefore we can parallelize these functions in the batch layer and get highly scalable computations. Usually this layer is scheduled to run once or twice a day.

Even if data is lost in other layers, the results can be recomputed by running through the dataset. The batch layer also precomputes the data into batch views and sends them to the serving layer for ad hoc queries.
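The batch layer's recompute-everything behavior can be sketched in a few lines of plain Python; the records and fields here are made up for illustration (the taxi-income example above).

```python
from collections import defaultdict

# Immutable, append-only master dataset (e.g. taxi trips); fields hypothetical.
master_dataset = [
    {"company": "A", "fare": 10.0},
    {"company": "A", "fare": 7.5},
    {"company": "B", "fare": 12.0},
]

def compute_batch_view(records):
    # Loop over ALL data on every run: simple, trivially recomputable, and
    # easy to parallelize per record, but high latency on large datasets.
    view = defaultdict(float)
    for r in records:
        view[r["company"]] += r["fare"]
    return dict(view)

batch_view = compute_batch_view(master_dataset)
print(batch_view)  # {'A': 17.5, 'B': 12.0}
```

Because the view is a pure function of the whole dataset, it can always be regenerated after a failure, which is the fault-tolerance property discussed above.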

2.3.2 Speed layer

When the batch layer computes the data, it may take a few hours or several days to complete, so data arriving in that period would otherwise be missing from the views. This is where the speed layer comes in: it takes care of the data that has not yet been computed by the batch layer. The main goal is to ensure new data is represented in query functions as quickly as the application requires.


The speed layer is similar to the batch layer, except that it produces a real-time view based on the data it receives. The big difference is that the speed layer deals only with recent data, while the batch layer looks at all data at once.

Besides, it updates the real-time view as new data arrives, which helps minimize latency. Consequently, we can do analytics on real-time data in this layer with very low response times.
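The speed layer's incremental updates can be sketched the same way; again the record fields are hypothetical.

```python
# Real-time view maintained by the speed layer: updated incrementally as each
# new record arrives, instead of being recomputed from the whole dataset.
realtime_view = {}

def on_new_record(record):
    # O(1) work per record keeps latency very low, but this view only ever
    # covers recent data that the batch layer has not yet absorbed.
    company = record["company"]
    realtime_view[company] = realtime_view.get(company, 0.0) + record["fare"]

for rec in [{"company": "A", "fare": 5.0}, {"company": "C", "fare": 3.0}]:
    on_new_record(rec)
print(realtime_view)  # {'A': 5.0, 'C': 3.0}
```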

2.3.3 Serving layer

The serving layer supports read-only data access and real-time queries. The output of the batch layer and speed layer is stored in this layer. When new batch views are available, the serving layer automatically swaps them in so that more up-to-date results are available. This layer also supports indexing the views so that we can run ad hoc queries with low latency.

To get a complete result at a point in time, we need to combine the views from the batch layer and speed layer. Because the batch-layer result is out of date and the speed-layer result is recent, we aggregate them to obtain a real-time result.

Additionally, once data from the batch layer reaches the serving layer, the corresponding results in the real-time views are no longer needed, so we can remove those views. This reduces recomputation time and storage in the database.
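A minimal sketch of that serving-layer combination, assuming for illustration that both views are simple key-to-total maps (in the project itself they live in Snowflake):

```python
def query(key, batch_view, realtime_view):
    # The batch view is complete but stale (up to the last batch run); the
    # real-time view covers only what arrived since. Their sum is the
    # up-to-date answer at query time.
    return batch_view.get(key, 0.0) + realtime_view.get(key, 0.0)

batch_view = {"A": 17.5, "B": 12.0}    # produced hours ago by the batch layer
realtime_view = {"A": 5.0, "C": 3.0}   # records since the last batch run

print(query("A", batch_view, realtime_view))  # 22.5
print(query("C", batch_view, realtime_view))  # 3.0
```

Once the next batch view absorbs those recent records, the matching real-time entries can be dropped, exactly as described above.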


CHAPTER 3 APPLY LAMBDA ARCHITECTURE TO SOLVE THE PROBLEMS OF DATABASE SCALABILITY IN THE REAL WORLD

3.1 Some related research

3.1.1 Research on “Big Data: Principles and Best Practices of Scalable Realtime Data Systems” [1]

3.1.1.1 Summary

This book aims to resolve the problems that emerge when an application becomes more and more popular and ingests a vast amount of data in a short period of time. Services like social networks, web analytics, and intelligent e-commerce often need to manage data at a scale beyond the ability of traditional databases. As scale and demand increase, so does complexity. Fortunately, scalability and simplicity are not mutually exclusive; rather than adopting some trendy technology, a different approach is needed. Besides, big data systems are horizontally scalable, using many separate machines working in parallel to store and process data, which introduces fundamental challenges unfamiliar to most developers.

This book describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. It shows how to build systems using the Lambda architecture by taking advantage of clustered big data tools. Additionally, it explains how this architecture was born, what the limitations of traditional databases are, why NoSQL is not a panacea, and why we need to combine multiple different big data tools.

On the other hand, it details how each layer in this architecture works, why those layers are needed, and what desired properties they require, then introduces tools for each layer such as Hadoop and Cassandra. Following a realistic example, the book guides readers through the theory of big data systems, how to use them in practice, and how to deploy and operate them once they are built.

3.1.1.2 Result of the research

This research aims to understand the need for the Lambda architecture and how it resolves the problems of scalability and complexity in big data systems. It helped us learn in detail about each layer: how it works, and its benefits and challenges. Besides, this research gave us more knowledge about tools such as Hadoop and Cassandra, as well as how to combine them to create a big data system with high performance, low latency, and fault tolerance.

3.2 Design Lambda architecture

Figure 2 Proposed Lambda architecture

After reading this book, we propose the Lambda architecture designed in figure 2 to resolve the problems of facing big data. In this architecture, when the system ingests data, it stores the data in a Kafka topic, which then feeds two consumers directly: the data lake and the speed layer.

The data lake stores a copy of the master data, with a huge amount of incoming data. We use Snowflake as the cloud database for the data lake because it enables data storage, processing, and analytic solutions that are faster, easier to use, and more flexible.

The batch layer ingests data from the data lake and performs computations on all of it, which may take a long time to generate a result as a batch view. Apache Spark is a great tool for this: it processes data up to 10 times faster than Hadoop and supports APIs for many different purposes.

In the speed layer, we leverage Spark Streaming, a Spark API, to stream near real-time data as it arrives in Kafka topics and compute it with very low latency and high performance.

We use Apache Airflow to orchestrate the workflows in this project. The batch layer is executed every 20 minutes, and the speed layer processes data every 5 seconds in micro-batches. We can access the Airflow webserver UI to easily monitor, manage, author, and view logs of all tasks.

Finally, the serving layer stores both batch views and real-time views in Snowflake. The combination of results from the batch layer and speed layer produces a complete real-time result at any time.

The main purpose of using Kafka to distribute data to each layer is durability: if the system crashes while data is flowing to the layers, we would otherwise lose the data from that period. Because Kafka stores the data, all of it can be recovered whether or not the system goes down. Another benefit of using Kafka is that we can resend old data to the layers by adjusting the consumer offset.

The reason we use Spark instead of Hadoop is that Spark can run up to 100 times faster in memory and 10 times faster on disk than Hadoop. It reduces the number of read/write cycles to disk and stores intermediate data in memory, hence the faster processing speed. Besides, it supports many APIs, such as Spark Streaming, Spark SQL, and MLlib, for various purposes. In addition, it can process real-time data from real-time events on platforms like Twitter, Instagram, or Facebook.
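The replay-by-offset property can be illustrated with a toy in-memory model; this is not the Kafka API, only the idea that a durable, ordered log plus a per-consumer offset makes re-delivery possible.

```python
# Toy model of a Kafka-style log: the broker keeps an ordered, durable log,
# and each consumer tracks its own offset into it. Rewinding the offset
# re-delivers old records after a crash. (Real Kafka is far more involved.)
class ToyLog:
    def __init__(self):
        self.records = []          # durable, append-only log

    def append(self, record):
        self.records.append(record)

class ToyConsumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        # Deliver everything from the current offset onward, then advance.
        batch = self.log.records[self.offset:]
        self.offset = len(self.log.records)
        return batch

    def seek(self, offset):
        self.offset = offset       # rewind: resend old data to a layer

log = ToyLog()
for r in ["r0", "r1", "r2"]:
    log.append(r)

consumer = ToyConsumer(log)
print(consumer.poll())   # ['r0', 'r1', 'r2']
consumer.seek(1)
print(consumer.poll())   # ['r1', 'r2'], old data replayed after a crash
```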

Eventually, for the database, we use the Snowflake cloud database instead of other NoSQL or SQL databases to store the data for the data lake and serving layer, because of resource limitations on the local machine. Snowflake stores data in a columnar format, which is well suited to building analytic dashboards. In addition, we do not need to install any software to use it, or worry about managing and configuring hardware on the local machine, because Snowflake handles that instead. It gives us a webserver UI where we can easily create and manage databases, grant permissions, or view the history of a query directly.

By combining these tools, we design and create an architecture based on the Lambda architecture that handles massive quantities of data by taking advantage of both batch and stream processing methods, removing the trade-offs of each individual tool.


CHAPTER 4 EXPERIMENTS AND RESULTS

4.1 Data for experiments

4.1.1 Using Twitter API

The Twitter API lets us get data in near real-time, but there are some trade-offs when using it without paying for a product package.

4.1.2 Using static dataset

4.1.2.1 Reference source for data

In this project, we use the raw Real Estate Sales 2001-2020 GL dataset, provided as CSV files. The dataset spans 2001 to 2020 and contains about 997 thousand records to ingest into the system.

Here is the link to download the dataset: https://catalog.data.gov/dataset/real-estate-sales-2001-2018

4.1.2.2 Description of the dataset

The Office of Policy and Management maintains a listing of all real estate sales with a sales price of $2,000 or greater that occur between October 1 and September 30 of each year. For each sale record, the file includes: town, property address, date of sale, property type (residential, apartment, commercial, industrial, or vacant land), sales price, and property assessment.

This is a description of the above features included in the file: [4]

• Serial Number: Serial number

• List Year: Year the property was listed for sale

• Date Recorded: Date the sale was recorded locally

• Town: Town name

• Address: Address

• Assessed Value: Value of the property used for local tax assessment

• Sale Amount: Amount the property was sold for

• Sales Ratio: Ratio of the sale price to the assessed value

• Property Type: Type of property including: Residential, Commercial, Industrial, Apartments, Vacant, etc

• Residential Type: Indicates whether property is single or multifamily residential

• Non-Use Code: Non usable sale code typically means the sale price is not reliable for use in the determination of a property value See attachments in the dataset description page for a listing of codes

• Assessor Remarks: Remarks from the assessor

• OPM remarks: Remarks from OPM

• Location: Lat / Lon coordinates
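The Sales Ratio feature, defined above as the ratio of the sale price to the assessed value, can be checked with a couple of made-up records (values are illustrative, not from the dataset):

```python
def sales_ratio(sale_amount: float, assessed_value: float) -> float:
    # Sales Ratio as described in the feature list: sale price / assessed value.
    return sale_amount / assessed_value

records = [
    {"Town": "Hartford", "Assessed Value": 150000.0, "Sale Amount": 180000.0},
    {"Town": "Stamford", "Assessed Value": 300000.0, "Sale Amount": 270000.0},
]
for r in records:
    r["Sales Ratio"] = sales_ratio(r["Sale Amount"], r["Assessed Value"])

print([round(r["Sales Ratio"], 2) for r in records])  # [1.2, 0.9]
```

A ratio above 1 means the property sold for more than its assessed value; below 1, for less.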


4.2 Environments and tools

4.2.3 Visual Studio Code

Visual Studio Code (known as VS Code) is a free, open-source text editor created by Microsoft. VS Code is available for different platforms such as Windows, Linux, and macOS. Although the editor is relatively lightweight, it includes powerful features and libraries that have made VS Code one of the most popular development environment tools.

4.2.4 Great Expectations

Great Expectations is a platform for validating, documenting, and profiling data to maintain quality and collaborate with other teams. It is written in Python and is used to assert that data meets specific requirements for correctness, uniqueness, consistency, and accuracy when it is ingested or transformed into a database.
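The kind of checks such a tool performs can be illustrated in plain Python, without the Great Expectations library itself; the column names here are hypothetical, and the $2,000 floor comes from the dataset description in this chapter.

```python
# Each "expectation" asserts one property (non-null, unique, in-range) of the
# data before it enters the database. This only sketches the idea; it is not
# the Great Expectations API.
rows = [
    {"id": 1, "sale_amount": 180000.0},
    {"id": 2, "sale_amount": 270000.0},
]

def expect_not_null(rows, col):
    return all(r.get(col) is not None for r in rows)

def expect_unique(rows, col):
    values = [r[col] for r in rows]
    return len(values) == len(set(values))

def expect_between(rows, col, lo, hi):
    return all(lo <= r[col] <= hi for r in rows)

checks = [
    expect_not_null(rows, "sale_amount"),
    expect_unique(rows, "id"),
    expect_between(rows, "sale_amount", 2000, 10**9),  # sales are $2,000+
]
print(all(checks))  # True
```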


4.3 Big data tools

4.3.1 Snowflake

4.3.1.1 Definition

Snowflake is a fully managed SaaS (software as a service) that provides a platform for data warehousing, data lakes, data engineering, data science, data application development, and secure sharing and consumption of real-time or shared data. It is built on top of the Amazon Web Services, Google Cloud Platform, or Microsoft Azure cloud infrastructure, allowing for storage and compute, on-the-fly scalable compute, data sharing, and third-party tool support, in order to handle the demanding needs of growing enterprises [5].

4.3.1.2 Architecture

Figure 3 Snowflake's architecture

Snowflake’s architecture, shown in figure 3, is a hybrid of traditional shared-disk and shared-nothing database architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the platform. But similar to shared-nothing architectures, Snowflake processes queries using MPP (massively parallel processing) compute clusters, where each node in the cluster stores a portion of the entire data set locally. This approach offers the data management simplicity of a shared-disk architecture with the performance and scale-out benefits of a shared-nothing architecture [6].

4.3.1.2.1 Database storage

When data is loaded into Snowflake, Snowflake reorganizes it into its internal optimized, compressed, columnar format and stores this optimized data in cloud storage.

Snowflake manages all aspects of how this data is stored: the organization, file size, structure, compression, metadata, statistics, and other aspects of data storage are handled by Snowflake. The data objects stored by Snowflake are not directly visible or accessible by customers; they are only accessible through SQL query operations run using Snowflake [6].

4.3.1.2.2 Query processing

Query execution is performed in the processing layer. Snowflake processes queries using “virtual warehouses”. Each virtual warehouse is an MPP compute cluster composed of multiple compute nodes allocated by Snowflake from a cloud provider [6].

Each virtual warehouse is an independent compute cluster that manages resources such as RAM and disk and does not share them with other virtual warehouses. As a result, each virtual warehouse has no impact on the performance of the others.

Virtual warehouses can be started and stopped at any time. They can also be resized at any time, even while running, to accommodate the need for more or fewer compute resources based on the operations being performed, which saves a lot of cost.

4.3.1.2.3 Cloud services

The cloud services layer is a collection of services that coordinate activities across Snowflake. These services tie together all of the different components of Snowflake in order to process user requests, from login to query dispatch. The cloud services layer also runs on compute instances provisioned by Snowflake from the cloud provider [6].


Services managed in this layer include:

4.3.2.2 Architecture

Figure 4 Spark RDDs

In figure 4, RDDs are the building blocks of any Spark application. RDD stands for “Resilient Distributed Dataset”: the dataset can be distributed across multiple nodes in a cluster and can tolerate failures of the system. An RDD is immutable and follows lazy transformations.

The data stored in an RDD is split into small pieces that are replicated across multiple nodes. This makes RDDs highly resilient, because they can recover quickly from any failure.

Because RDDs are immutable, you cannot modify them after creation, but you can transform them by calling a function. The benefit of using RDDs is that the data stored in them can be processed in parallel.

With RDDs, there are two types of operations you can perform:

• Transformations: operations that are applied to create a new RDD

• Actions: operations that apply computation and pass the result back to the driver

In Spark, lazy evaluation means that you can apply as many transformations as you want, but Spark will not execute them until you call an action. Instead, it creates a DAG that stores the operations in order. When an action is called, Spark executes the DAG in the most optimized way.

The benefit of lazy evaluation is improved efficiency: it saves execution time and reduces the memory needed to store intermediate results.
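Lazy evaluation can be imitated in a few lines of plain Python: transformations only record a plan (a tiny DAG of operations), and nothing runs until an action is called. This mimics the behaviour, not Spark's API.

```python
class LazyDataset:
    def __init__(self, data, ops=()):
        self.data = data
        self.ops = ops                 # recorded plan, not yet executed

    def map(self, f):                  # transformation: returns a new plan
        return LazyDataset(self.data, self.ops + (("map", f),))

    def filter(self, p):               # transformation: returns a new plan
        return LazyDataset(self.data, self.ops + (("filter", p),))

    def collect(self):                 # action: only now does the plan execute
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# Nothing has been computed yet; only the plan exists.
print(ds.collect())  # [20, 30, 40]
```

Because the whole chain is known before execution, an engine like Spark can reorder and fuse the recorded steps, which is where the efficiency gain described above comes from.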

REFERENCES

[1] Nathan Marz and James Warren, Big Data: Principles and Best Practices of Scalable Realtime Data Systems, April 29, 2015.
[2] "Twitter API Documentation," [Online]. Available: https://developer.twitter.com/en/docs/platform-overview#:~:text=The%20Twitter%20API%20is%20a,Spaces
[3] "Twitter rate limits," [Online]. Available: https://developer.twitter.com/en/docs/twitter-api/rate-limits
[4] "Real Estate Sales 2001-2020 GL," [Online]. Available: https://data.ct.gov/Housing-and-Development/Real-Estate-Sales-2001-2020-GL/5mzw-sjtu
[5] SnapLogic, "What is the Snowflake Data Platform?," 14 January 2022. [Online]. Available: https://www.snaplogic.com/blog/snowflake-data-platform
[6] "Key Concepts & Architecture, Snowflake," [Online]. Available: https://docs.snowflake.com/en/user-guide/intro-key-concepts
[7] "Apache Airflow documentation," [Online]. Available: https://airflow.apache.org/docs/apache-airflow/stable/index.html
[8] Apache Airflow, "Running Airflow in Docker," [Online]. Available: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html
[9] abhilash, "What is lambda architecture?," [Online]. Available: https://www.educative.io/answers/what-is-lambda-architecture
[10] N. Vaidya, "Apache Spark Architecture – Spark Cluster Architecture Explained," [Online]. Available: https://www.edureka.co/blog/spark-architecture/