Building a Real-Time Data Processing System for an Affiliate Application
INTRODUCTION
Data has become important in every organization, and using data effectively is indispensable for the development and success of every industry.
Data-Driven Decision Making (DDDM) enables businesses to make informed decisions through data analysis rather than relying on intuition or subjective assumptions. This approach has transformed banking operations by facilitating faster, cost-effective, and more accurate decision-making, which enhances productivity and boosts profits. By utilizing data, banks can better understand customer needs, anticipate market changes, and develop effective business strategies. The DDDM process encompasses several key steps that guide banks in leveraging data for optimal decision-making:
Collect and analyze relevant data
Use tools and methods to discover patterns or trends in data
Apply discovered knowledge to help make accurate business decisions.
DDDM offers the following benefits:
Reliable data empowers businesses to make accurate and informed decisions, ensuring that choices are based on factual information rather than subjective assumptions. This trust in data significantly enhances overall business operations.
Data-Driven Decision Making (DDDM) empowers companies to make objective decisions by eliminating bias and subjectivity from the process. By relying solely on data and thorough analysis, businesses can ensure that their strategies are grounded in factual information rather than personal opinions, experiences, or biases. This approach fosters more reliable and effective decision-making, ultimately leading to better outcomes.
Enhancing operational productivity is essential for businesses, enabling quicker and more effective decision-making through data-driven insights rather than relying solely on trial and error. The rise of "data democratization" facilitates this shift, as self-service business intelligence (BI) tools empower users to independently inquire about and analyze their enterprise data within data lake platforms.
Leveraging data effectively provides businesses with a significant competitive advantage, enabling them to gain a comprehensive understanding of the market. This insight allows for informed decision-making, the identification of opportunities that competitors may overlook, and the recognition of potential challenges that can be addressed. Ultimately, this strategic use of data fosters long-term success for the business.
Some banking problems that can be addressed with data include the following:
Customer portrait analysis involves leveraging the data collected by banks to gain deeper insights into their clientele, ultimately enhancing the customer experience. This process includes analyzing customer feedback, tracking user behavior, and employing machine learning algorithms to derive actionable insights. By understanding customer preferences and behaviors, banks can tailor their services to meet individual needs and improve overall satisfaction.
Relevant product recommendations: some case studies include the following:
– Netflix analyzes customer movie viewing behavior data and makes relevant movie recommendations for customers, or, based on the data, makes decisions to invest in new movie production funds.
Enhancing operational productivity is crucial for banks, as analyzing data collected during banking operations can lead to optimized processes and reduced costs. By leveraging data from their ERP (Enterprise Resource Planning) systems, banks can effectively track inventory, identify production bottlenecks, and streamline their supply chains.
Effective risk management leverages data to identify and mitigate potential threats. For instance, banks can utilize machine learning models that analyze transaction data from their Core Banking systems to detect fraudulent activities or money laundering attempts. This proactive approach enables the automatic blocking of suspicious transactions while simultaneously alerting human operators for further investigation.
Data analysis in marketing and sales offers valuable insights into customer behavior, preferences, and purchasing trends. By utilizing predictive analytics, businesses can identify specific customer segments for targeted and cost-effective marketing strategies. Additionally, analyzing customer sentiment through social media data enhances understanding of consumer attitudes and helps refine marketing approaches.
Effective product development involves analyzing user feedback and telemetry data from test products, as well as insights gained from A/B testing, to understand user needs and preferences. The concept of Big Data, which encompasses characteristics such as volume, variety, velocity, and veracity, plays a crucial role in informing product design decisions.
Reliability poses a significant challenge in advanced data analysis and artificial intelligence, as poor-quality input data results in poor-quality output. To address emerging problems and challenges associated with big data, a revolution in data management is underway, transforming how data is stored, processed, managed, and delivered to users. Compared to traditional data management systems, new Big Data technologies offer enhanced scalability and cost-effectiveness, making them better suited for handling large volumes of data.
In the competitive financial sector, bank employees face significant challenges in efficiently opening new accounts and meeting key performance indicators (KPIs) related to card issuance. This process demands considerable time and effort, as staff must identify the right target customers to drive revenue. Consequently, banks invest heavily in training their employees to enhance their skills and effectiveness in this area, yet many tellers still struggle to achieve desired results. The rate of calls that receive no attention from customers is very high, causing customers to become frustrated with taking phone calls from consultants.
To effectively address this challenge, banks must enhance their outreach strategies to customers. A recent survey indicates that customers are increasingly drawn to affiliates promoted by Key Opinion Leaders (KOLs), who are recognized and trusted figures within their respective fields. By leveraging the influence of KOLs, banks can provide customers with valuable insights into the advantages of these promotions.
The affiliate model, combined with collaborations with Key Opinion Leaders (KOLs) in the financial sector, offers banks a strategic solution to enhance customer engagement. By partnering with KOLs and influencers who share a similar target audience, banks can move beyond traditional calling methods. These KOLs can be trained to effectively promote the bank's products and services, leveraging influential communication channels such as social media, blogs, online videos, and live events to reach potential customers.
By partnering with Key Opinion Leaders (KOLs), banks can leverage their influence and credibility to foster customer interest and trust, thereby enhancing access to target audiences and unlocking new business opportunities. It is essential for banks to ensure that these collaborations adhere to relevant regulations and guidelines. Implementing an affiliate model in conjunction with KOL partnerships can effectively address the challenges faced by banks, facilitating more efficient customer engagement, building trust, and generating new business prospects through the promotion and referrals of trusted KOLs.
THEORETICAL OVERVIEW
2.1.1 Introducing the Evolution of Data Analytics
The evolution of data warehousing began with the goal of enabling business leaders to gain analytical insights by consolidating data from operational databases into centralized warehouses. This centralized approach facilitates decision support and enhances business intelligence (BI).
Figure 1 - A typical data lake pattern
Centralized warehouses have played a crucial role in decision support and business intelligence (BI) for over thirty years. However, the evolution of data types and hosting methods has led to significant changes in data warehouse architecture. Key advancements in this area reflect the ongoing transformation of BI solutions.
The first generation of data analytics platforms introduced schema-on-read, allowing data to remain unprocessed until needed for analysis. This innovation enhanced flexibility and scalability in data analysis by accommodating unstructured and semi-structured data sources.
The second generation of data analytics platforms has revolutionized data warehousing by offering faster, more reliable, and cost-effective solutions through distributed systems and cloud computing. These advanced platforms enhance accessibility to data analysis for a wider range of users, thanks to their robust analytics capabilities and intuitive data visualization and presentation tools.
The fourth generation of data analytics platforms marks a significant advancement in data warehousing innovation, driven by the demand for real-time and predictive analytics. These systems are designed to efficiently handle vast amounts and diverse types of data, automating processes to enhance data management capabilities.
Leveraging artificial intelligence (AI) and machine learning (ML) technologies, businesses can enhance data input, integration, quality, and governance through eight key activities. These advancements significantly improve organizations' decision-making capabilities by generating valuable data-driven insights and recommendations.
Many years ago, a number of issues with the initial generation of systems began to arise.
According to recent research by Harvard Business Review, data is increasingly dispersed across various environments, including data centers, private clouds, and public clouds. The study highlights the need for leaders in data and analytics to navigate the complexities of data management, with integration, performance, and costs being significant challenges. Notably, 82% of businesses have adopted hybrid cloud strategies, while 92% support multi-cloud approaches, driving a demand for effective integration of public cloud solutions with private clouds and on-premises data centers. However, legacy data warehouses often struggle to support these deployment models, leading to inefficient operations, inconsistent user experiences, and difficulties in managing diverse solutions.
According to the International Data Corporation (IDC), the total amount of data generated is expected to reach around 163 zettabytes (ZB) by 2025, a significant increase from approximately 0.5 ZB in 2010. This dramatic growth in data volume is primarily attributed to advancements in internet technologies, which have fueled the expansion of various industries.
The telecoms sector experienced significant transformation due to the surge of data generated by social media platforms like Facebook, Twitter, and Instagram, as well as by e-commerce and streaming services. This influx of data not only reshaped consumer habits but was also fueled by technological advancements in the Internet of Things (IoT), leading to an unprecedented volume of information in the digital landscape.
Figure 2 - The speed of data
Flexibility: Older data warehouse solutions frequently cannot handle different tool and engine versions or offer fine-grained control over the resources allotted to jobs and tasks.
Legacy data warehouses require all users, groups, and workloads to operate on the same versions of query engines and tools, which often hinders innovation and complicates upgrades. In contrast, modern data warehouses support multiple contexts and versions, allowing for rapid adoption of the latest technologies while ensuring operational stability with established versions.
2.1.2 The second-generation data analytics platform
To tackle emerging challenges, second-generation data analytics platforms shifted to utilizing data lakes, which are cost-effective storage solutions featuring file APIs that accommodate standard open formats like ORC and Apache Parquet. This transformation was initiated by the Apache Hadoop movement, using the Hadoop File System (HDFS) for cheap storage. Data lakes based on this traditional architecture are deployed on-premises.
HDFS serves as the centralized storage system while MapReduce functions as the computational model, leading to the development of various components such as HBase (a key-value database), Apache Hive (for SQL querying), and Apache Pig. Additionally, new computational models like Tez and Spark have been introduced to address the demands of large-scale batch data processing.
In the Hadoop ecosystem, data is stored in HDFS, a distributed file system that spans multiple nodes within a cluster. When a job is submitted to YARN for processing, it is divided into smaller tasks that are then distributed across the cluster's nodes. Research indicates that "data locality" is a crucial optimization technique in Apache Hadoop's design. Upon job submission, YARN breaks the job into numerous map and reduce tasks, as illustrated in Figure 5, and prioritizes scheduling these tasks on nodes based on specific criteria:
1. Node local or Data local: the JobTracker will prioritize scheduling tasks to the node that contains the data related to the corresponding task.
Figure 5 - Map and Reduce Tasks
LAKEHOUSE IMPLEMENTATION
Below are the goals and technical requirements for the banking business's affiliate project:
Fast and accurate result retrieval: an important requirement is that the Lakehouse system must efficiently process and deliver query results with speed and accuracy, ensuring that users receive timely information without delays or errors during data retrieval.
To maintain data integrity in the Lakehouse system, it is crucial to ensure the accuracy of information. This accuracy is vital for correct payment processing for publishers and for reliable transaction details for affiliates, preventing significant losses or errors.
The Lakehouse system must ensure data availability for Data Science (DS) and Artificial Intelligence (AI) tools, facilitating model training and resolving user-related issues in web and app environments. This guarantees that data is both accessible and dependable, essential for crafting intelligent solutions.
To maintain accuracy and fairness in publisher payments, the Lakehouse system must ensure the correctness of financial data. This commitment to precise data handling guarantees that payment figures are calculated accurately, thereby preventing potential disputes or disagreements among publishers.
The Lakehouse system must efficiently manage and process high volumes of data, ensuring the capability to store large data files and execute complex queries on big data seamlessly, without delays or interruptions.
The Lakehouse system must be scalable to accommodate increasing data demands and user traffic, allowing for seamless expansion by adding hardware resources or enlarging server clusters to enhance processing and storage capabilities.
To achieve optimal performance in a Lakehouse system, it is essential to implement strategies such as query optimization, utilizing indexes, and incorporating caching mechanisms. These practices help to minimize query time and enhance processing speed, ensuring the system operates efficiently.
The Lakehouse system must be designed for high reliability and availability, prioritizing data integrity while minimizing downtime. It is essential to implement robust backup and recovery mechanisms to swiftly and effectively restore operations in the event of failures.
The Lakehouse system must feature resource elasticity, enabling it to dynamically adjust resources to accommodate fluctuating workloads. This includes automatic scaling and resource allocation driven by demand, as well as effective resource management to enhance performance and optimize resource utilization.
The system I built for this project includes:
PySpark: the Python API for Apache Spark, used for big data processing and complex analytics on distributed systems
Parquet: High-performance data storage format, suitable for working with large data
Postgres: Relational database with real-time data processing capabilities
ClickHouse: Column-based high-performance analytical database that supports real-time data querying
Superset: Data visualization and report generation tool
Airflow: Job scheduling and management tool
Kafka and Flink: Real-time event and streaming data processing tools
With this system, I can process big data, store and query data in real time, visualize data, and organize work efficiently. A minimal ingestion sketch is shown below.
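As one possible implementation of the ingestion path, the sketch below uses PySpark Structured Streaming to read events from Kafka and land them as Snappy-compressed Parquet files. The broker address, topic name, lake paths, and event schema are assumptions for illustration rather than the exact configuration used in the project.

```python
# Minimal sketch: stream affiliate click events from Kafka into the lake as Parquet.
# Requires the spark-sql-kafka connector package on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("affiliate-clicks-ingest").getOrCreate()

# Hypothetical event schema for a publisher/user interaction record.
event_schema = StructType([
    StructField("pubCode", StringType()),
    StructField("advCode", StringType()),
    StructField("campainCode", StringType()),
    StructField("device", StringType()),
    StructField("city", StringType()),
    StructField("country", StringType()),
    StructField("eventTime", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")   # assumed broker address
       .option("subscribe", "affiliate_clicks")           # assumed topic name
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Write micro-batches to the raw zone of the data lake as Parquet (Snappy by default).
query = (events.writeStream
         .format("parquet")
         .option("path", "/lake/raw/affiliate_clicks")             # assumed lake path
         .option("checkpointLocation", "/lake/_chk/affiliate_clicks")
         .outputMode("append")
         .start())
query.awaitTermination()
```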
The data collected from the affiliate link between the publisher (pub) and the user (user) provides a series of extremely important basic information. This aims to optimize the next strategy for the target audience and to provide insight into the campaign's performance. By collecting and analyzing this data, you can effectively monitor each publisher's performance while also gaining a deeper understanding of user characteristics and behaviors. This knowledge opens up opportunities for further optimization and personalization of your marketing strategies.
The data provides crucial insights into the transactional relationship between publishers and users, highlighting the interaction process. Each record features distinct attributes, including the encrypted publisher code, publisher name, and comprehensive user details such as device type, city, country, and a random phone number.
Collecting data is essential for organizations to gain a comprehensive understanding of campaign performance, enabling them to evaluate and compare results across different publishers. By analyzing user information, organizations can set targeted goals and optimize strategies based on key factors such as geographic location, device usage, and traffic sources.
Organizations can implement effective follow-up strategies to optimize costs and improve performance by leveraging diverse and detailed databases. This approach enables the development of personalized strategies while monitoring trends and adjusting tactics over time for continuous improvement.
3.4 Storing data in the data lake layer
Data in a data lake is organized within a hierarchical file structure, resembling a traditional operating system's filesystem, which allows for easy file movement and renaming. This structure also enables fine-grained Role-Based Access Control (RBAC) at both directory and sub-directory levels, enhancing security and organization.
Figure 12 - The hierarchical file system structure
The ingestion flow is described below:
Data providers will deliver information through the Ingestion Layer, utilizing the Flink tool to stream data into the Raw Datastore. Here, the data is organized based on the source's structure and stored in a hierarchical filesystem format.
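The sketch below illustrates one way the hierarchical raw-zone layout could be produced, partitioning Parquet output by source and date. The staging path, column names, and directory convention are assumptions for illustration.

```python
# Minimal sketch of the hierarchical raw-zone layout, assuming a
# /lake/raw/source=<source>/dt=<yyyy-MM-dd>/ convention (illustrative path names).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format

spark = SparkSession.builder.appName("raw-zone-layout").getOrCreate()

# 'events' is assumed to already contain 'source' and 'eventTime' columns.
events = spark.read.parquet("/lake/staging/affiliate_clicks")   # assumed staging path

(events
 .withColumn("dt", date_format(col("eventTime"), "yyyy-MM-dd"))
 .write
 .partitionBy("source", "dt")          # directory per source, directory per day
 .mode("append")
 .parquet("/lake/raw"))                # e.g. /lake/raw/source=web/dt=2024-01-01/...
```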
ESTIMATIONS
It is impossible to estimate the time and resources needed upfront without knowing what type of processing or transformation needs to be performed on the data.
I will make assumptions, split the data that needs to be processed in one day into categories, and estimate the resources and time needed to completely process 20 TB of data in one day.
Data ingestion is distributed equally across the 30 sources, meaning the 20 TB of data will be ingested evenly through 30 sources; each source will ingest approximately 666.67 GB per day.
All data pipelines will be independent; this means that when multiple data pipelines run together, one failed pipeline will not affect the others.
Regardless of whether the data lakehouse is implemented on-premises or in the cloud, the communication and computation conditions will remain consistent, as long as the configurations are identical.
The data can be categorized into three main types based on their composition: 20% is structured and semi-structured data, which includes formats like SQL, JSON, XML, and CSV; 30% consists of textual data, encompassing raw text and natural language text files; and the remaining 50% is classified as unstructured data, primarily comprising web data.
The data will be kept for at least 1 year.
The parquet format will serve as the primary file format within the data lakehouse due to its columnar structure, allowing for efficient data retrieval by enabling the selection of individual columns. Additionally, parquet supports various compression schemes, such as Snappy, Gzip, and Zlib, which help to significantly reduce data size.
Before being ingested into the data lakehouse, structured and semi-structured data will undergo preprocessing, transforming them into parquet files. These files will be compressed using the Snappy compression scheme, achieving an average compression ratio of 76%. This reduces this portion from about 4 TB (20% of the daily 20 TB) to approximately 0.96 TB.
Textual and unstructured data require additional processing to be converted into parquet files, so they are stored in their raw forms after ingestion. Consequently, the total data size after ingestion will be 16.96 TB, a reduction from 20 TB per day to 16.96 TB per day in the data lake.
To accommodate data retention for one year, an initial storage requirement of 16.96 TB per day translates to a total of 6,190.4 TB. Considering additional transformations and analytics, the data size is expected to increase by at least 50%, resulting in a minimum total data size of 9,285.6 TB.
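The arithmetic behind these figures can be checked with a few lines of Python; the values below simply restate the assumptions from this section.

```python
# Quick arithmetic check of the storage estimate (figures from the text; the 76%
# Snappy reduction applies only to the structured/semi-structured share).
daily_raw_tb = 20.0
structured_share = 0.20           # structured + semi-structured portion
compression_ratio = 0.76          # size reduced by 76% -> 24% remains

structured_after = daily_raw_tb * structured_share * (1 - compression_ratio)  # ~0.96 TB
rest_after = daily_raw_tb * (1 - structured_share)                            # 16 TB kept raw
daily_ingested_tb = structured_after + rest_after                             # ~16.96 TB

yearly_tb = daily_ingested_tb * 365          # ~6190.4 TB retained for one year
with_derived_tb = yearly_tb * 1.5            # +50% for transformations -> ~9285.6 TB
print(daily_ingested_tb, yearly_tb, with_derived_tb)
```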
In order to save storage space, the data should be further categorized into 4 types:
1. Processed data: this type of data is accessed frequently, so it should be kept in parquet format with the Snappy compression scheme.
2. Intermediate data: this type of data is a by-product of the process of creating processed data but is still needed by other processes. It is accessed less frequently than processed data, so it should be kept in parquet format with the Gzip compression scheme.
3. Temporary data: this type of data is generated while analyses and testing are performed; after those processes are done, it is no longer needed. It should be scheduled for deletion.
4. Archived data: this type of data is rarely accessed and can consume the most space in the data lakehouse, so it should be kept in ORC file format with the Zlib compression scheme.
If the 4 types of data have the distribution 10%, 20%, 10%, and 60%, then the final size of the data becomes:
Type 3 data is scheduled for deletion every month, so it is excluded from the calculation.
The final data size after 1 year should be: 222.8544 + 352.8528 + 167.1408 = 742.848 TB.
The storage space should not be filled beyond 80%, so after 1 year the necessary storage will be: 742.848 / 0.8 = 928.56 TB.
In order to store 928.56 TB after 1 year, the storage layer should have 920 TB and the data warehouse should have 9 TB.
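The per-type split can be reproduced as follows. Note that the per-type compression factors used here (roughly 0.24 for Snappy Parquet, 0.19 for Gzip Parquet, and 0.03 for Zlib ORC) are inferred so that the result matches the figures above; they should be read as assumptions rather than measured values.

```python
# Sketch of the per-type storage split. The compression factors are inferred so the
# result matches the figures in the text; treat them as assumptions, not measurements.
total_tb = 9285.6                       # yearly data including derived datasets

shares = {"processed": 0.10, "intermediate": 0.20, "temporary": 0.10, "archived": 0.60}
factors = {"processed": 0.24,           # Parquet + Snappy
           "intermediate": 0.19,        # Parquet + Gzip
           "archived": 0.03}            # ORC + Zlib

# Temporary data is deleted monthly, so it is excluded from the final total.
final_tb = sum(total_tb * shares[k] * factors[k] for k in factors)
# 222.8544 + 352.8528 + 167.1408 = 742.848 TB

capacity_tb = final_tb / 0.8            # keep utilization below 80% -> ~928.56 TB
print(round(final_tb, 3), round(capacity_tb, 2))
```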
The processing layer must operate exclusively in the evening to ready data for the following day, with a total processing window of 12 hours from 18:00 to 6:00. To efficiently handle 25 TB of data (20 TB for ingestion and 5 TB post-transformation), the system requires an average processing speed of 2.083 TB/h, equating to approximately 606.718 MB/s.
Estimating speed in a parallel and multistage data processing system can be complex. In a case study involving the processing of 150 GB of data, I utilized 50 GB of memory and 5 cores for the driver, along with 80 GB for 4 workers, resulting in a total processing time of 20 minutes. This calculation reveals an average processing speed of approximately 0.125 GB/s, or 119.21 MB/s. Assuming that increasing the number of identical jobs proportionally increases processing speed, the system would require about 6 such jobs to meet the required throughput.
The resources needed for 1 job are: 8 cores (for 4 workers) + 5 cores (driver) = 13 cores, and 80 GB + 50 GB = 130 GB of memory. So the resources needed for 6 jobs are: 13 * 6 = 78 cores and 130 * 6 = 780 GB of memory.
The cluster needs 78 CPU cores, 780 GB of memory, and at least 929 TB of mass storage to process 20 TB/day for 1 year.
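The sizing arithmetic can be summarized in a short sketch using the figures above; the per-job throughput is taken from the benchmark run described in the previous paragraph.

```python
import math

# Sketch of the cluster-sizing arithmetic, using the benchmark figures from the text
# (150 GB processed in 20 minutes by one job with a 5-core driver and 4 workers).
daily_to_process_tb = 25.0                 # 20 TB ingest + 5 TB post-transformation
window_hours = 12.0                        # 18:00 -> 06:00
required_mb_s = daily_to_process_tb * 1024 * 1024 / (window_hours * 3600)  # ~606.7 MB/s

job_mb_s = 119.21                          # measured throughput of one benchmark job
jobs_needed = math.ceil(required_mb_s / job_mb_s)        # -> 6 parallel jobs

cores_per_job, mem_per_job_gb = 13, 130    # 8 worker cores + 5 driver cores; 80 GB + 50 GB
print(jobs_needed, jobs_needed * cores_per_job, jobs_needed * mem_per_job_gb)
# 6 jobs -> 78 cores, 780 GB of memory
```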
The data ingestion and processing layers will utilize Apache Spark to extract and transform data from various sources within the data lakehouse, while Airflow will be employed to orchestrate the ETL data pipelines efficiently.
Spark is designed for parallel data processing and analysis, demonstrating the capability to manage petabytes of data. Utilizing Spark for data processing significantly enhances overall system performance. Additionally, as an ingestion tool, it efficiently reads data in parallel, boosting ingestion performance.
Airflow is a workflow management platform; it can be used to schedule and orchestrate data pipeline jobs, and it provides a web UI to manage running jobs.
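A minimal Airflow sketch of how the nightly ETL could be orchestrated is shown below; the DAG id, schedule, and spark-submit script paths are hypothetical.

```python
# Minimal Airflow sketch (assumed DAG id, schedule, and job scripts): orchestrate the
# nightly Spark ETL that prepares data for the following day.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="affiliate_nightly_etl",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 18 * * *",          # start of the 18:00-06:00 window
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_raw",
        bash_command="spark-submit /jobs/ingest_raw.py",         # assumed job script
    )
    transform = BashOperator(
        task_id="transform_to_serving",
        bash_command="spark-submit /jobs/transform_serving.py",  # assumed job script
    )
    ingest >> transform   # run the transformation only after ingestion succeeds
```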
The storage layer will use Apache Hadoop as the data lake and Delta Lake to add features that Hadoop does not provide.
Hadoop offers a robust storage solution for building data lakes capable of handling petabytes of data by easily extending storage across multiple machines and automatically duplicating files or utilizing erasure coding to prevent data loss. However, it follows the Write Once, Read Many (WORM) principle, meaning data can only be written to the cluster and not updated directly; users must read the old data, merge it with new data, and replace the original file, which complicates the transformation process. Delta Lake addresses this challenge by creating transactions that maintain records of all operations performed on the data, thereby simplifying processing and analysis.
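The sketch below shows how such a transactional update might look with Delta Lake on top of HDFS: a merge upserts new records without manually rewriting files. The table paths, the clickId key column, and the Delta session configuration are assumptions, and the delta-spark package must be available on the Spark classpath.

```python
# Minimal sketch of a Delta Lake upsert on top of HDFS (assumed paths and key column).
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder
         .appName("delta-upsert")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# New records for the day (assumed path from the raw zone).
updates = spark.read.parquet("hdfs:///lake/raw/affiliate_clicks/dt=2024-01-01")

# Existing Delta table (assumed path); MERGE records the change as a transaction
# instead of forcing a manual read-merge-replace of HDFS files.
target = DeltaTable.forPath(spark, "hdfs:///lake/delta/affiliate_clicks")
(target.alias("t")
 .merge(updates.alias("u"), "t.clickId = u.clickId")      # assumed primary key column
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```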
The data serving layer has to support multiple types of serving; it is divided further into 2 smaller components:
ClickHouse, a columnar database, will be utilized for the data warehouse, enabling support for SQL queries. This functionality allows it to operate like other SQL databases, making it suitable for business intelligence (BI) reporting and advanced data analysis (see the sketch after this list).
API-based serving will use a server built on Finatra to provide data to other applications.
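For the warehouse component, the sketch below creates a ClickHouse MergeTree table that could back the BI queries benchmarked in the next chapter. The database, table, and column names are assumed for illustration, as are the host and port.

```python
# Sketch of the warehouse component (assumed database, table, and column names): a
# ClickHouse MergeTree table that could back the BI queries benchmarked later.
from clickhouse_driver import Client

client = Client(host="clickhouse", port=9000)   # assumed host/port

client.execute("CREATE DATABASE IF NOT EXISTS affiliate")
client.execute("""
    CREATE TABLE IF NOT EXISTS affiliate.conversions (
        advCode      String,
        campainCode  String,
        pubCode      String,
        orderValue   Float64,
        commission   Float64,
        status       String,
        eventDate    Date
    )
    ENGINE = MergeTree
    ORDER BY (advCode, campainCode, eventDate)
""")
```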
BENCHMARK
Logging and Tracing in OpenTelemetry are vital for measuring performance and latency in pipelines. Logging delivers structured records of key events, while Tracing enables tracking execution across distributed components. Together, they provide an in-depth view of system operations, helping to identify and optimize weaknesses and latencies. By simulating data and integrating it into the OpenTelemetry Tracing system, we can evaluate if the pipeline fulfills customer performance expectations.
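A minimal sketch of how pipeline stages might be wrapped in OpenTelemetry spans is shown below; it uses the console exporter for illustration, and the instrumentation name and stage bodies are placeholders.

```python
# Minimal OpenTelemetry sketch: wrap pipeline stages in spans so per-stage latency
# can be measured end to end (console exporter used for illustration only).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("affiliate.pipeline")     # hypothetical instrumentation name

def process_batch(records):
    # Each stage gets its own span; the exporter reports start/end timestamps.
    with tracer.start_as_current_span("ingest"):
        pass   # read from Kafka (placeholder)
    with tracer.start_as_current_span("transform"):
        pass   # clean and enrich records (placeholder)
    with tracer.start_as_current_span("load"):
        pass   # write Parquet / Postgres (placeholder)

with tracer.start_as_current_span("pipeline_run"):
    process_batch([])
```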
Below are the evaluation parameters for the system:
Operating system: Ubuntu 18.04.6 LTS x86_64, Linux 4.15.0-209-generic
Virtualizer: KVM/QEMU, System Container + Multipass
CPU: 8 vCPUs, Intel Xeon Gold 5120 @ 2.20GHz
QEMU: version 1:2.11+dfsg-1ubuntu7.42
The network speed test outcome offers essential insights into the efficiency and performance of network infrastructure by measuring data transfer rates, latency, and overall responsiveness, which are critical factors in evaluating the network's ability to manage data transmissions effectively.
Table 2 - Network speed test results
Columns: Try, Idle Latency (ms), Download
Kafka, under the topic "1," facilitates data streaming from the data provider to the Data Ingestion Layer, where the data is stored in Parquet file format. Following five experimental measurements, the system achieved a throughput of 2000 records per second, yielding significant results.
The real-time pipeline efficiently processes data from the source to prepared Parquet files at a rate of 3000 records per second, enabling seamless querying in PostgreSQL. This deployment enhances application performance for end-users and provides valuable insights for publishers.
Table 4 - Kafka streaming speed into Parquet
The results from querying the Postgres database, including the specified fields, are now ready for web or application deployment. This data can be seamlessly integrated into the user interface, offering valuable insights and enhanced functionality for end-users.
Table 5 - Kafka streaming speed into Postgres
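The Kafka-to-Postgres leg of this pipeline could be implemented with a foreachBatch JDBC sink, as sketched below; the broker, topic, JDBC URL, table, and credentials are placeholders, and the PostgreSQL JDBC driver is assumed to be on the Spark classpath.

```python
# Sketch of the Kafka -> Postgres leg of the real-time pipeline using a JDBC sink
# driven by foreachBatch (all connection details are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clicks-to-postgres").getOrCreate()

# Streaming source: the same Kafka topic used for the Parquet leg (assumed names).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "affiliate_clicks")
          .load()
          .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS payload"))

def write_to_postgres(batch_df, batch_id):
    # Each micro-batch is appended to Postgres through the JDBC writer.
    (batch_df.write
     .format("jdbc")
     .option("url", "jdbc:postgresql://postgres:5432/affiliate")   # assumed URL
     .option("dbtable", "public.click_events")                     # assumed table
     .option("user", "etl")
     .option("password", "secret")                                 # placeholder
     .option("driver", "org.postgresql.Driver")
     .mode("append")
     .save())

(events.writeStream
 .foreachBatch(write_to_postgres)
 .option("checkpointLocation", "/lake/_chk/click_events_pg")
 .start()
 .awaitTermination())
```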
To enhance our web system's data processing efficiency, we evaluated the query performance of ClickHouse, a leading columnar database management system. This assessment was vital for aligning response times with our performance goals. The test results revealed important insights into ClickHouse's query speed and overall performance. By analyzing response times across different loads and query complexities, we gained a better understanding of the database's capabilities and identified potential optimization areas. The benchmarked queries and their measured response times are listed below; an illustrative version of one query follows the list.
1. Retrieve the total order value (Retrieve all records from advCode and calculate the total order value for that adv)
2. Order value ratio (Calculate the order value ratio): 43 ms
3. Retrieve the total order value for advCode, campainCode, formDate, and toDate selected, divided by the total order value for advCode, campainCode, formDate, and toDate from the previous month
4. Retrieve the total number of orders for advCode from formDate to toDate, campainCode: 36 ms
5. Retrieve the total commission value for advCode from formDate to toDate, campainCode: 42 ms
6. Commission value ratio (Calculate the commission value ratio): 37 ms
8. Retrieve the total number of clicks for advCode from formDate to toDate, campainCode: 84 ms
9. Click ratio (Calculate the click ratio): 69 ms
10. Retrieve the total records for advCode, campainCode, formDate, and toDate in the current period, divided by the records for advCode, campainCode, formDate, and toDate from the previous month
11. Conversion rate (Calculate the conversion rate from click to order): 35 ms
12. Calculate the total click records by advCode, campainCode, formDate, and toDate: 57 ms
13. Calculate the total records by advCode, campainCode, formDate, and toDate with status
14. Retrieve the total order value per day in the month (Retrieve all records from advCode and calculate the total order value for that adv)
15. Retrieve the total commission value per day in the month (Retrieve all records from advCode and calculate the total commission value for that adv)
16. Retrieve the total clicks per day in the month (Retrieve all records from advCode and count the records for that adv)
17. Total revenue by top 10 pubs (Calculate the total revenue by pub for each adv): 40 ms
18. Revenue ratio by pub (Calculate the total revenue by pub for each adv in month t-1, divided by the total revenue by pub for each adv in month t-2)
19. Calculate total clicks by device for advCode: 57 ms
20. Calculate total clicks by country: 23 ms
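As an illustration, query 4 from the list above could be expressed against the assumed schema sketched earlier roughly as follows; the parameter values, host, and port are placeholders.

```python
# Illustrative form of one benchmarked aggregate (query 4: total orders for an advCode
# and campainCode between formDate and toDate), against the assumed schema above.
from datetime import date
from clickhouse_driver import Client

client = Client(host="clickhouse", port=9000)   # assumed host/port

rows = client.execute(
    """
    SELECT count() AS total_orders
    FROM affiliate.conversions
    WHERE advCode = %(adv)s
      AND campainCode = %(camp)s
      AND eventDate BETWEEN %(from_date)s AND %(to_date)s
    """,
    {"adv": "ADV001", "camp": "CMP01",              # placeholder identifiers
     "from_date": date(2024, 1, 1), "to_date": date(2024, 1, 31)},
)
print(rows[0][0])
```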
OVERVIEW
A lakehouse serves as a crucial architectural link between data lakes and data warehouses, enabling organizations to store raw data while facilitating complex queries and analysis. This article explores the effective implementation of a lakehouse architecture that handled a 1-billion-data-point dataset within three months. However, it identifies significant gaps in data quality assurance, security, and the extensibility of the infrastructure using Kubernetes. Notably, the current system lacks a robust data quality assurance framework, essential for ensuring data cleanliness, accuracy, and consistency. To build trust in the lakehouse's information, a comprehensive approach involving data profiling, cleansing, and continuous monitoring is necessary.
Data security remains a significant gap in the current system, which lacks robust measures to protect sensitive information. To strengthen the lakehouse against potential security breaches, it is crucial to implement strategies such as encryption, access controls, and thorough auditing. Additionally, incorporating Kubernetes-native security features can greatly enhance the system's overall resilience.
Integrating Kubernetes into the lakehouse architecture is crucial for enhancing scalability and manageability. With its powerful container orchestration capabilities, Kubernetes streamlines deployment, optimizes resource utilization, and provides an agile framework for meeting the evolving demands of data processing. To future-proof the architecture, a strategic integration with Kubernetes is essential for maintaining responsiveness and supporting growth.
In summary, the effectiveness of the current lakehouse architecture in fulfilling immediate query needs relies on overcoming significant challenges related to data quality, security, and scalability. To build a resilient and future-proof lakehouse architecture, it is essential to implement strong data quality assurance practices, enhance data security measures, and adopt Kubernetes for infrastructure orchestration.