Building and Deploying a Data Warehouse Model for a Financial Institution
THEORETICAL BASIS
Overview of Data Warehouse
According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process (Inmon, 2005).
A subject-oriented data warehouse is specifically designed to enhance analysis and reporting within a particular domain, containing datasets that are interconnected and relevant to the main subject. Its primary goal is to serve as a centralized and optimized information source tailored to meet the needs of specific user groups. By concentrating on a distinct subject, this type of data warehouse significantly improves data retrieval and processing performance, facilitating more efficient data analysis and reporting.
The integration of data for a data warehouse involves aggregating information from diverse sources, including relational databases, Excel files, CSV files, and flat files. This process necessitates thorough data cleaning, organization, and condensation to maintain consistency and reliability across the dataset.
Non-volatile data storage ensures that once a record is finalized, it cannot be altered or deleted, providing long-term data retention. After the processes of extraction, transformation, cleaning, and loading into the data warehouse, the data remains unchanged. Operations within the warehouse are restricted to adding new records and retrieving existing ones, maintaining the integrity and durability of the stored information.
• Time-variant: Data retrieval is not affected by other data or interactions, and the storage duration is longer compared to operational systems.
Operational systems focus on storing current values, whereas data warehouses are designed to maintain long-term historical information. The time dimension plays a crucial role in ensuring the uniqueness of the data stored in these warehouses.
Comparison of Data Warehouse and Database
A Data Warehouse serves as a centralized repository for an organization's data, accessible exclusively to authorized employees for data analysis and reporting. Input data is sourced from various origins, including relational databases, Excel files, CSV files, and text files, which are aggregated at the lowest level. Through a comprehensive ETL (Extract, Transform, Load) process, this data undergoes transformation and cleaning, ensuring it is readily available for analytical reporting.
A database is an organized collection of interrelated data designed for efficient information recording and querying. Typically accessed through computer systems, databases support multiple users within an organizational framework. They facilitate online processing and are often normalized, particularly in relational database models, to reduce data redundancy and enhance storage efficiency.
Table 2-1: Differences between Database and Data Warehouse

Criteria         | Database                                                    | Data Warehouse
Purpose          | Manipulate and query data                                   | Process, integrate, and analyze data
Function         | Supports daily operations for users interacting with the database | Supports strategic decisions for businesses; used by employees and business leaders for analysis and reporting
Model            | Entity-relationship model                                   | Multidimensional data model
Design           | Highly complex: normalized (for RDBMS) according to standards such as 1NF, 2NF, and 3NF to reduce redundant data and optimize storage capacity | Tables are easily joined in the data warehouse to serve queries
Data             | Detailed, updated regularly                                 | Historical and statistical; can only be added, not deleted or updated
Update frequency | Regular                                                     | Only in specific cases when analysis is needed
Orientation      | Application orientation                                     | Subject orientation
A database is an application or system specifically designed to store information in one central location, while a data warehouse is a comprehensive collection of various data sources and information systems that facilitates organization, analysis, and reporting based on user queries. Importantly, a data warehouse can consolidate data from multiple databases and other diverse sources.
Architecture of Data Warehouse
The basic architecture of a data warehouse includes four components:
• Data Sources: Structured and unstructured data from various sources are consolidated in a single location
• Data Processing Area: The area where data extraction, transformation, and loading (ETL) processes are performed to ensure data consistency before the data is loaded into the data warehouse
• Data Mart: The place where data is processed, consolidated, and stored. Based on this, data marts can be developed for specific fields or subjects to improve system performance
• Reporting: The area where querying, report generation, visualization, and data analysis are performed.
Logical model of Data Warehouse
1.4.1 Star schema
A star schema is a data modeling technique characterized by a central Fact table surrounded by related Dimension tables. The Fact table, which holds dynamic data generated from various operations, is linked to the Dimension tables that contain static data. The keys in the Fact table are derived from the keys of the Dimension tables, ensuring a clear and organized structure for data analysis.
Advantages: The logical model offers a clear and straightforward structure, with each Dimension table linked to the Fact table through a single relationship, eliminating the need for additional Dimension tables. This streamlined approach enhances data simplicity, facilitating easier querying and improving execution times.
Disadvantages: There is data redundancy, as each Dimension table stores its own information separately.
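For illustration, a minimal star-schema query might look like the sketch below; the fact and dimension names (LOAN_FCT, CST_DIM, CCY_DIM) and their columns are hypothetical and are not taken from the project's actual model.

-- Each dimension is reached from the fact table through a single join on its surrogate key
SELECT d_cst.CST_NAME,
       d_ccy.CCY_CODE,
       SUM(f.LOAN_AMT) AS TOTAL_LOAN_AMT
FROM   LOAN_FCT f
JOIN   CST_DIM  d_cst ON f.CST_DIM_ID = d_cst.CST_DIM_ID
JOIN   CCY_DIM  d_ccy ON f.CCY_DIM_ID = d_ccy.CCY_DIM_ID
GROUP BY d_cst.CST_NAME, d_ccy.CCY_CODE;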
1.4.2 Snowflake schema
The snowflake schema is an extension of the star schema, where each dimension table can be connected to other dimension tables through a many-to-one relationship.
Advantages: The snowflake schema involves normalization of Dimension tables, which reduces data redundancy, saves storage space, and ensures consistency and integrity of data.
Disadvantages: Normalized Dimension tables can lead to slower query performance, since queries become more complex and involve multiple joins. Additionally, the intricate relationships between tables in a normalized structure often necessitate increased maintenance and management efforts.
1.4.3 Fact constellations schema (Galaxy schema)
The galaxy schema integrates both the star schema and the snowflake schema, featuring multiple Fact tables that share Dimension tables. This structure represents a combination of various Data Marts, enhancing data organization and accessibility.
Advantages: Reduces the size of the database, especially when using Dimension tables with many values.
Disadvantages: More complicated and riskier to design compared to the star schema and snowflake schema.
Financial Services Data Model
The Financial Services Data Model (FSDM) serves as a foundational business model for financial organizations, offering standardized data classification principles and methodologies. Developed by IBM and derived from global finance and banking projects, FSDM is essential for creating Data Warehouse projects and enhancing management and transaction systems.
The structure of the Financial Services Data Model (FSDM) consists of nine distinct zones, where all data from the financial institution is modeled and reorganized in accordance with data warehouse standards, ensuring effective consolidation across these zones.
Figure 2-3: Overview architecture of FSDM (Linh, 2022)
• Involved Party (IP): Stores all information related to entities such as organizations, individuals, etc., that have relationships with the company
• Classifications (CL): Stores classifications of various entities, for example, if the Involved Party is an individual customer, the classifications might include male, female, etc
• Arrangement (AR): Represents the agreement or cooperation relationship between the financial institution and the related Involved Parties
• Product (PD): Represents the products and services that the organization provides to the related Involved Parties
• Location (LO): Stores information about addresses or business areas related to an Involved Party
• Condition (CD): Specifies the requirements governing how the financial institution conducts business with the parties involved in its business relationships
• Event (EV): An event or action taken by the bank related to its business activities
• Resource Item (RI): Any asset related to the organization's business operations
• Business Direction Item (BD): A business strategy closely linked to the organization's operations with the Involved Party
For example, the nine concepts can be linked in a single business scenario: an individual (Involved Party, IP) who belongs to a certain gender according to the classification male (Classifications, CL) opens a mortgage loan account (Arrangement, AR) for a mortgage product (Product, PD) at a branch in Hanoi (Location, LO); the agreement related to the contract is interest at a fixed rate (Condition, CD), to be paid through monthly interest payments (Event, EV), with the collateral being the house (Resource Item, RI), under the business strategy of low-income housing (Business Direction Item, BD).
Extract – Transform – Load process in Data Warehouse
Gathering data from diverse sources, encompassing both structured and unstructured formats, is essential for effective analysis. Leveraging IBM's InfoSphere DataStage platform allows for the creation of Sequence Jobs and Parallel Jobs, which facilitate the execution of ETL (Extract, Transform, Load) processes to streamline data collection and management.
• Extract: Data extraction involves gathering information from a variety of sources, which may include different data structures such as databases, Excel files, and raw data formats. The primary objective of this process is to efficiently retrieve data from these systems for further processing (a minimal SQL sketch of the three steps follows this list)
• Transform: This process involves transforming data from the source into a different structure suitable for loading into the target database
• Load: This is the process of pushing data into the target data warehouse after it has been transformed
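As a minimal illustration of the three steps expressed in plain SQL (outside DataStage), the sketch below extracts rows from a hypothetical source table, transforms a status code, and loads the result into a hypothetical target table; all object names here are assumptions made for the example only.

-- Extract from the source, transform the status code, then load into the target
INSERT INTO DWH_LOAN_TGT (LOAN_ID, LOAN_STATUS, LOAN_AMT)
SELECT src.LOAN_ID,
       CASE src.STATUS_CD WHEN 'A' THEN 'ACTIVE' ELSE 'CLOSED' END,   -- transform
       src.LOAN_AMT
FROM   SRC_LOAN src;                                                  -- extract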
Storing Data: Storing normalized data in Oracle Database, DB2, or Redshift; all of these systems are capable of handling large datasets.
When building a Data Warehouse, it is essential to design effective data models that include Dimension and Fact tables. Utilizing IBM's 9 concepts data model ensures both consistency and scalability in the system. To keep the database updated with new data and maintain synchronization with Dimension and Fact tables, ETL tools are employed. These functions can be integrated within IBM InfoSphere DataStage or developed using third-party tools.
Building Dashboards and Reports: Utilizing BI tools such as Tableau, Power BI, or IBM Cognos to query and visualize data within the Data Warehouse. Dimension tables and Fact tables are connected to the BI tools to create reports and dashboards.
Automating Data Synchronization: Automatically updating and synchronizing reports and dashboards with the latest data, using ETL tools and scheduling automation for this process
After establishing the data integration requirements and analytical goals, the database layers will be designed according to the 9 concepts model to effectively integrate various data sources into the Data Warehouse system.
Data source: Data from various systems (including structured and unstructured data) is aggregated and brought to a single location (Staging)
Staging: To ensure the performance and integrity of the server containing source data, the Staging layer is created to store data collected from sources in preparation for subsequent processing
Data Warehouse/System of Records (SoR): After the ETL process, data is moved from the Staging layer to the Data Warehouse layer and organized according to IBM's 9 concepts model
Data Mart: Data is divided into Fact and Dimension tables, optimizing data queries to meet predefined requirements.
Data storage mechanism in Data Warehouse
Slowly Changing Dimension (SCD) tables are essential for managing and retrieving data efficiently, particularly in the creation of Data Marts. They enable the tracking of data changes over time, which enhances the accuracy and reliability of analysis and reporting within Data Marts.
When creating Data Marts, the emphasis is often on specific information types, including sales, profits, and customer details. To effectively manage changes in related attributes, corresponding Slowly Changing Dimension (SCD) tables are essential. For instance, managing customer information necessitates SCD tables to track changes in attributes such as name, address, phone number, and email over time.
SCD tables commonly have the following columns (a minimal DDL sketch is given after the list):
• Surrogate Key: A surrogate key serves as the primary key that connects the various tables within a Data Mart. Usually implemented as an auto-incremented identifier, it effectively replaces the primary key of the business entity, ensuring streamlined data management and integrity
• Natural Key: A natural key is created from the attributes of a business entity and may not be unique across different records. It serves as a crucial element for searching and retrieving records associated with that entity
• Effective Date: This is the start date when the record becomes effective. When the record is updated, this date changes to mark the change
• Expiry Date: The expiry date signifies the point at which a record ceases to be valid. When a record is updated, the expiry date of the previous entry is adjusted to indicate its termination
• Current Flag: This flag marks the current record. It is set to true for the current record and false for historical records
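A minimal Oracle-style DDL sketch of such a dimension table, using hypothetical names (CST_DIM and its columns) purely to illustrate the columns listed above:

CREATE TABLE CST_DIM (
    CST_DIM_ID   NUMBER         PRIMARY KEY,  -- surrogate key, generated by the ETL flow
    CST_ID       VARCHAR2(30),                -- natural key taken from the source system
    CST_NAME     VARCHAR2(200),
    EFF_DT       DATE,                        -- effective date of the record
    END_DT       DATE,                        -- expiry date of the record
    CURRENT_FLAG CHAR(1)                      -- 'Y' for the current record, 'N' for history
);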
SCD tables used in Data Marts can be classified into three main types: SCD Type 1, SCD Type 2, and Snapshot
In this type of Slowly Changing Dimension (SCD Type 1), information is updated without retaining a history of previous changes, as new values simply overwrite the old ones. This approach is ideal for tables that need to store only the most current data of an entity, disregarding any historical information.
Figure 2-5 Example of SCD Type 1 storage
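A hedged sketch of how SCD Type 1 can be applied in SQL, overwriting the attribute in place; the table and column names (CST_DIM, CST_ADDR) are illustrative assumptions rather than the project's actual objects.

-- SCD Type 1: overwrite the old value, no history is kept
UPDATE CST_DIM
SET    CST_ADDR = 'New address'
WHERE  CST_ID   = 'C0001';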
The SCD Type 2 table captures historical changes in entity information by generating new records whenever updates occur. These records are sequentially numbered based on the time of change, with only the most recent entry deemed valid. This approach is essential for maintaining historical business data and enables effective analysis of information changes over time.
Example: A collateral is effective from October 15, 2022. On date T, March 15, 2023, when the source data changes the value from 10 billion to 9 billion, the Data Warehouse directly updates the record's end date (END_DT) to March 15, 2023, marking the end of validity for that record, and generates a new record with the new asset value of 9 billion, effective from March 15, 2023, with an indefinite end date.
Figure 2-6 Example of SCD Type 2 storage on date T+1
On June 20, 2023, the source data updated the collateral value from 9,000,000,000 to 9,500,000,000, prompting the Data Warehouse to generate a new record with an effective date of T+x, while simultaneously invalidating the previous record.
Figure 2-7 Example of SCD Type 2 storage on date T+x
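The collateral example above could be handled with SCD Type 2 roughly as sketched below, closing the currently valid record and inserting a new one; the table, column, and sequence names are assumptions for illustration, not the project's exact ETL code.

-- Step 1: close the record that is currently valid
UPDATE RI_VAL_DIM
SET    END_DT = DATE '2023-03-15'
WHERE  RI_ID  = 'COL001'
AND    END_DT = DATE '9999-12-31';

-- Step 2: insert a new record carrying the new value, with an open-ended expiry date
INSERT INTO RI_VAL_DIM (RI_VAL_DIM_ID, RI_ID, RI_VALUE, EFF_DT, END_DT)
VALUES (RI_VAL_SEQ.NEXTVAL, 'COL001', 9000000000, DATE '2023-03-15', DATE '9999-12-31');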
Storing snapshot data in a data warehouse enables the retention of data versions at specific moments, allowing for historical analysis. This approach avoids overwriting current data by creating and saving copies as snapshots, ensuring that past states of the data are preserved for future reference.
Periodic data snapshots can be created daily, weekly, or monthly to reflect the data's status at specific intervals. Each snapshot includes a comprehensive copy of the data, encompassing all associated tables and relevant information.
To efficiently store daily data, organizations can save information from days T, T+1, and so on up to T+n. While this method enhances the convenience of analyzing historical data, it also necessitates increased storage capacity.
Figure 2-8 Example of Snapshot storage
In a dynamic source database where information is updated daily, the snapshot mechanism ensures that data is consistently recorded. Even on days T+1 and T+2, when no changes occur, data continues to be added to the data warehouse, with variations tracked through the Data_date field.
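A minimal sketch of this snapshot mechanism, assuming a hypothetical snapshot table whose columns are the source columns plus the Data_date field; the table names are invented for the example.

-- Append today's full copy of the source data, stamped with the snapshot date
INSERT INTO LOAN_SNP_FCT
SELECT src.*, TRUNC(SYSDATE) AS DATA_DATE
FROM   LN_SOURCE src;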
Introduction of the tools and technology used in the project
DBeaver is a versatile database management tool that supports JDBC drivers and is aimed at improving database management and development. It offers a robust graphical user interface, making it easier to interact with and manage various types of databases efficiently.
This software provides a range of essential features, including database querying, data viewing and editing, and table structure analysis. Users can create and manage various database objects such as tables, keys, relationships, and views. Additionally, it offers robust backup and data recovery options, along with support for SQL debugging.
Oracle (Oracle Database) is a popular and powerful database management system developed by Oracle Corporation. The platform is designed to manage, store, and retrieve data in large-scale enterprise environments.
Oracle Database utilizes the relational model to efficiently store and process data in table formats, establishing relationships between those tables. It employs SQL (Structured Query Language) for querying and database interaction. The platform boasts numerous features and robust capabilities that enhance its functionality:
• High reliability: The system ensures data integrity and recovery capabilities after incidents
• Easy integration and scalability: It allows integration with other enterprise applications and scalability to meet business growth needs
• High performance: Provides query optimization and fast data processing capabilities, ensuring high performance and quick response times
• Big data management: Supports storage and processing of big data and is capable of handling complex workloads
1.8.3 IBM InfoSphere DataStage and QualityStage
IBM InfoSphere DataStage and QualityStage are essential tools for extracting, transforming, and loading (ETL) data, ensuring high-quality data for effective business analysis. They support a wide range of data sources, including sequential files, indexed files, relational databases, external data sources, archives, and enterprise applications.
• Integrates data from the widest range of enterprise and external data sources
• Scales to process and transform large data
• Utilizes parallel processing for scalability
• Handles complex transformations and manages multiple integration processes
• Utilizes direct connections to enterprise applications as source or target
• Utilizes metadata for analysis and maintenance
• Operates in batch, real-time, or as a web service
InfoSphere DataStage and QualityStage can access data from:
• Enterprise Resource Planning (ERP) and Customer Relationship Management (CRM) systems
• Performance Management and Online Analytical Processing (OLAP) systems
IBM InfoSphere features a series of interconnected processing stages that outline the data flow from source to target. Each stage generally consists of at least one input and/or output data stream, with some capable of handling multiple inputs and outputs to various stages. Key stages utilized in job design include Transform, Filter, Aggregator, Remove Duplicates, Join, Lookup, Copy, Sort, and Container.
Figure 2-9 Properties of an input stage of a Parallel Job in DataStage
To perform ETL process on data, Parallel jobs and Sequence jobs are required
The Parallel Job feature in DataStage enables simultaneous processing of data tasks, enhancing speed and efficiency through a parallel processing model. By dividing data processing tasks into stages, it allows for concurrent execution across multiple nodes in a distributed environment.
Figure 2-10 A Parallel Job in DataStage
A Sequence Job in DataStage allows users to create a data processing workflow that follows a linear sequence of tasks. Each task, or stage, executes in order from top to bottom, with the output of one task seamlessly passed to the next for further processing.
Figure 2-11 A Sequence Job in DataStage
The architecture is divided into two parts:
• Shared Components: Includes Unified User Interface, Common Services, and Common Parallel Processing
• Runtime Architecture: Includes the OSH (Orchestrate Shell) script command sets
DataStage comprises four key components:
• Administrator: manages user setup, project criteria, and project movement
• Manager: the primary interface for storing and managing reusable metadata, allowing users to view and edit repository content
• Designer: facilitates job design by specifying data sources, transformation methods, and data destinations, with jobs compiled into executable files
• Director: responsible for confirming, scheduling, executing, and monitoring these jobs
Parallel Jobs on the DataStage server
IBM DataStage facilitates ETL (Extract, Transform, Load) processes through the design of Flow Jobs, utilizing various mechanisms tailored to the capabilities of both source and target systems. The three predominant design approaches are ET-L, EL-T, and E-T-L:
• ET-L: The Extraction and Transformation steps are performed on the source server, while the Load step occurs on the ETL server, which subsequently transfers the data to the target server
• EL-T: The Extraction step is executed on the source server, the Load step is performed on the ETL server, and the Transformation step is done on the target server
The E-T-L approach streamlines data aggregation into data warehouses while efficiently transferring data between source and target systems, thereby alleviating the strain on the ETL server's engine. This model effectively addresses the data warehouse needs of Core Banking systems and offers scalability options tailored to the unique data processing requirements of different organizations.
1.8.4 Structured Query Language – Procedural Language
Structured Query Language (SQL) is a specialized programming language that enables users to access and manipulate databases. It is designed for managing data within relational database management systems and for processing streams in relational data stream management systems, serving as a vital tool for effective communication with databases.
SQL PL (SQL Procedural Language) enhances the SQL query language by incorporating procedural programming features It enables the creation of procedures, functions, and program blocks that run directly on the database server, offering increased power and flexibility for data manipulation and executing complex tasks efficiently.
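As a small illustration of SQL PL (DB2-style syntax), the hypothetical procedure below counts the records of a staging table into an output parameter; the procedure name is invented for this example and the table name is borrowed from the staging examples in this report.

-- A minimal SQL PL procedure that runs directly on the database server
CREATE OR REPLACE PROCEDURE COUNT_TODAY_LOANS (OUT P_CNT INTEGER)
LANGUAGE SQL
BEGIN
    SELECT COUNT(*) INTO P_CNT FROM STG_LNMAST_TDY;
END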
Power BI Desktop is a powerful application that connects to various data sources, enabling users to transform and visualize data effectively. By combining these data sources into a cohesive data model, users can create interconnected tables and charts. These visualizations can be shared as reports within the organization, making Power BI Desktop an essential tool for business projects. Most users create reports in Power BI Desktop and then leverage Power BI Service for sharing, enhancing collaboration and data-driven decision-making.
Some of the most popular applications and features of Power BI Desktop include:
• Transforming and cleaning data to create a data model
• Creating visuals such as charts or graphs, providing visual presentations of the data
• Creating reports, which are collections of visuals, across one or multiple report pages
• Publishing reports to Power BI Service to share them with others, without needing installation on individual machines
This chapter offers a comprehensive overview of Data Warehousing concepts, focusing on technologies and tools essential for data storage and processing to enhance financial reporting. After examining the latest advancements and comparing them with existing resources, I will develop a Data Warehouse model accompanied by specialized databases tailored to this model. Further details regarding these models will be elaborated in Chapter 2 of the project.
DESIGN AND BUILD DATA WAREHOUSE
Design of Data Warehouse Model
The source data passes through Staging, Atomic, and Data Mart layers to generate statistical reports, where:
Data Source: The data area includes various test data sources related to loans at TP Bank
Staging Area: This area collects data from different sources and stores it daily, divided into 3 layers: Today, Preday, and Minus
Atomic/SoR: Data moved from the Staging layer to Atomic/System of Record is structured according to the Data 9 Concepts model, providing data for the Data Mart
Data Mart: This layer is built with data marts consisting of Dimension and Fact tables. The data here is exploited to serve reporting purposes
Reporting: This layer uses business requirements to produce reports. Reports can be in the form of numerical tables or dashboards
Figure 3-2 Data Flow in Staging layer
Data is transferred from the Today layer to the Preday layer, with source data being loaded into Today. An SQL query is then employed to identify new or modified records, which are compared against the Preday layer. Any changes detected between Today and Preday are flagged and subsequently transferred to the Minus table.
After the source data has been extracted and aggregated in Staging and divided into the Today, Preday, and Minus layers, it continues to be processed for loading into Atomic.
Figure 3-3 Data Flow in SOR layer
Data in the Data Mart will be organized in accordance to the Star Schema through Dimension and Fact tables
Figure 3-4 Data Flow in Data Mart layer
After data has been stored and processed through the above stages, it can be retrieved from Data Mart to extract statistical reports
Building Database for Data Warehouse
The UAT data tables of TP Bank are used to exemplify the data that an enterprise stores during its operations. These tables include:
Figure 3-5 Source data in Data Warehouse System
Staging refers to the storage area for data extracted from multiple sources. Utilizing the source table information from the UAT loan dataset at TP Bank, I have developed specific Staging tables for efficient data management.
Figure 3-6 Minus tables in Staging Area
2.2.3 SOR Area
Figure 3-7 Data Relationship Diagram in SOR Area
Figure 3-8 Entity Relationship Diagram in SOR Area
The diagram outlines the key entities represented by tables, including Involved Party (IP), Customer (CST), Organizational Units (OU), Arrangement (AR), Arrangement Summary (AR_TVR_SMY), Product (PD), Resource Item (RI), Resource Item Value (RI_VAL), and Currency (CCY). Additionally, it describes the relationships between these entities through various tables: IP_X_IP (Involved Party to Involved Party), RI_X_IP (Resource Item to Involved Party), AR_X_IP (Arrangement to Involved Party), and AR_X_PD (Arrangement to Product).
Building Database for Data Mart
2.3.1 Structure of tables in Data Mart Area
Dimension tables typically feature a structure where the primary key is a field, often numeric, that contains unique, meaningless values known as a Surrogate Key. This Surrogate Key is usually generated during ETL data processing flows as an auto-generated key, ensuring uniqueness within the Data Warehouse.
In the Dimension table, the Natural Key serves as the primary key of the business data, exemplified by the CST_ID field in the Dimension Customer table that stores customer codes. While CST_ID could function as the primary key, it is essential to implement a Surrogate Key, such as CST_DIM_ID, to avoid issues with duplicate natural keys when importing data from multiple source systems.
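A sketch of how the Surrogate Key can be generated with a database sequence while loading the Dimension table; the sequence and table names below are illustrative only (the project itself relies on sequences such as AR_SEQUENCE, as mentioned later in this report).

-- Create a sequence and use it to assign a new surrogate key to each incoming customer
CREATE SEQUENCE CST_DIM_SEQ START WITH 1 INCREMENT BY 1;

INSERT INTO CST_DIM (CST_DIM_ID, CST_ID, CST_NAME)
SELECT CST_DIM_SEQ.NEXTVAL, s.CST_ID, s.CST_NAME
FROM   CST s;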
2.3.1.2 Update values in Dimension tables
When a Data Warehouse identifies a change in a record's value within a Dimension table, it must be configured to respond accordingly. There are two primary processing options available: Type 1 and Type 2.
Type 1 Slowly Changing Dimension (SCD Type 1) simply overwrites the changed data in the Dimension table. The data warehouse designer chooses Type 1 when changes in the source data are not significant enough to alter the meaning of the Fact table.
Type 2 Slowly Changing Dimension (SCD Type 2) allows tracking changes in the Dimension table and accurately linking Fact records with the current Dimension record. When the data warehouse identifies an update in the source data, instead of overwriting, the system updates the status of the old record and creates a new record in the Dimension table. This new record is assigned a new Surrogate Key, and from this point onward, the data warehouse will use this new record to link with any new Fact records generated. Fact records generated on previous days will still link to the old Dimension record.
Figure 3-10 Slowly Changing Dimension Processing
This type of change most clearly reflects the evolution of data over time because every single change in the source data is recorded in the Data Warehouse
A Fact table is characterized by the level of detail in its data and typically lacks a primary key, relying instead on foreign keys to link to Dimension tables for contextual understanding. These tables often contain fields that hold numerical values, representing the key performance measures essential for the business.
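A hedged DDL sketch of such a fact table, holding only foreign keys to the surrounding dimensions and a numeric measure; the names are simplified illustrations, not the actual AR_ANL_FCT definition described next.

CREATE TABLE LOAN_FCT (
    CST_DIM_ID  NUMBER,          -- foreign key to the customer dimension
    CCY_DIM_ID  NUMBER,          -- foreign key to the currency dimension
    PD_DIM_ID   NUMBER,          -- foreign key to the product dimension
    DATA_DATE   DATE,            -- snapshot date of the measures
    LOAN_AMT    NUMBER(18,2)     -- numeric measure used for reporting
);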
The AR_ANL_FCT table is a fact table for loans, with the primary key being a combination of the keys from the surrounding Dimension tables: CCY_DIM_ID, AC_AR_DIM_ID, PD_DIM_ID, OU_DIM_ID, and CST_DIM_ID. The AR_ANL_FCT loan fact table serves as the central component of the Data Mart, utilizing a Snapshot mechanism for storage. In addition, the associated Dimension tables, which detail various types of information, are maintained using the SCD Type 2 mechanism.
Having successfully designed the Data Warehouse model and constructed the databases for each area, I now have a comprehensive Data Warehouse ready for ETL processes and data analysis. I will implement the dataset containing loan information into this model to generate insightful financial statistical reports.
DEPLOYMENT OF LOAN DATA INTO THE DATA WAREHOUSE
ETL data from Data Source to Staging
Data extracted from the Oracle database undergoes an ETL process, where it is first stored in the Staging area before being loaded into the Data Warehouse. This Staging area consists of three layers: Today, Preday, and Minus. By comparing the Today and Preday layers after loading, we can effectively identify and flag records that have been deleted, added, or modified.
SELECT A.*, S.DATA_DATE AS PPN_DT
FROM (
    (SELECT A.UNQ_ID_SRC_STM, 'I' REC_IND FROM STG_LNMAST_TDY A   -- new records: in Today only
     MINUS
     SELECT B.UNQ_ID_SRC_STM, 'I' REC_IND FROM STG_LNMAST_PDY B)
    UNION ALL
    (SELECT B.UNQ_ID_SRC_STM, 'D' REC_IND FROM STG_LNMAST_PDY B   -- deleted records: in Preday only
     MINUS
     SELECT A.UNQ_ID_SRC_STM, 'D' REC_IND FROM STG_LNMAST_TDY A)
    UNION ALL
    (SELECT C.UNQ_ID_SRC_STM, 'U' REC_IND FROM                    -- updated records: in both, values changed
       (SELECT A.* FROM STG_LNMAST_TDY A, STG_LNMAST_PDY B
        WHERE A.UNQ_ID_SRC_STM = B.UNQ_ID_SRC_STM
        MINUS
        SELECT B.* FROM STG_LNMAST_PDY B) C)
) A
JOIN ZSYSDAY_DWH S ON 1 = 1
Then, the data in the Minus layer is flagged as follows:
Figure 3-12 Data in Minus layer of Loan Master table
ETL data from Staging to Atomic
The Atomic storage tables are structured based on the nine FSDM data concepts, enabling effective data management. By utilizing the Minus and Today records in Staging, we can efficiently query and identify new, modified, or deleted data to be incorporated into the Atomic storage tables.
SELECT FUNCTION_HASH02('PD' || '|' || 'IFTB.PRODUCT', mns.PD_CODE) PD_ID,
       dty.PD_CODE UNQ_ID_SRC_STM,
       CV.CL_ID SRC_STM_ID, s.DATA_DATE PPN_DT, dty.ETL_DATE EFF_DT
FROM STG_IFTB_PRODUCT_MNS_DA mns
LEFT JOIN STG_IFTB_PRODUCT_TDY_DA dty ON dty.PD_CODE = MNS.PD_CODE
LEFT JOIN CV CV ON CV.CL_SCM_CODE = 'SRC_STM' AND CV.CL_CODE = 'IFTB.PRODUCT'
JOIN ZSYSDAY_DWH s ON 1 = 1
Data is loaded into temporary tables TWT, TMP for transformation, aggregation, and encoding before being added to the storage table
Figure 3-13 Sequence Job loads data from temporary table to storage table
Before transferring data from temporary tables to storage tables in the Atomic layer, records are processed according to their assigned flags (Insert, Update, Delete). Notably, records marked for deletion are not physically removed from the Data Warehouse storage tables; instead, they are flagged as deleted to maintain their complete history.
-- Update the END_CST_DT of matching records (used to close records flagged as deleted)
MERGE INTO CST D
USING (SELECT CST_ID, END_CST_DT
       FROM TMP_CST_DA) S
ON (D.CST_ID = S.CST_ID)
WHEN MATCHED THEN
UPDATE SET D.END_CST_DT = S.END_CST_DT;

-- Update changed records or insert new records
MERGE INTO CST D
USING (SELECT CST_ID, CST_LCS_TP_ID, CST_TP_ID, IDY_TP_ID, EFF_CST_DT, END_CST_DT,
              CST_MSEG_ID, UNQ_ID_IN_SRC_STM
       FROM TMP_CST_DA) S
ON (D.CST_ID = S.CST_ID)
WHEN MATCHED THEN
UPDATE SET D.CST_LCS_TP_ID = S.CST_LCS_TP_ID,
           D.CST_TP_ID = S.CST_TP_ID,
           D.IDY_TP_ID = S.IDY_TP_ID,
           D.EFF_CST_DT = S.EFF_CST_DT,
           D.END_CST_DT = S.END_CST_DT,
           D.CST_MSEG_ID = S.CST_MSEG_ID,
           D.UNQ_ID_IN_SRC_STM = S.UNQ_ID_IN_SRC_STM
WHEN NOT MATCHED THEN
INSERT (CST_ID,
        UNQ_ID_IN_SRC_STM,
        CST_LCS_TP_ID, CST_TP_ID, IDY_TP_ID, EFF_CST_DT, END_CST_DT, CST_MSEG_ID)
VALUES (S.CST_ID,
        S.UNQ_ID_IN_SRC_STM,
        S.CST_LCS_TP_ID, S.CST_TP_ID, S.IDY_TP_ID, S.EFF_CST_DT, S.END_CST_DT, S.CST_MSEG_ID);
ETL data from Atomic to Data Mart
The newly introduced Dim tables will feature a surrogate key, which guarantees the uniqueness of each record and prevents duplicate IDs during data extraction from various sources. This key is generated as a sequentially increasing number using a database Sequence, such as AR_SEQUENCE.NEXTVAL.
AR.AR_ID AC_AR_ID, AR.UNQ_ID_IN_SRC_STM AC_NBR, ic.CCY_CODE CCY_CODE,
PD.UNQ_ID_IN_SRC_STM PD_CODE, CST.CST_ID CST_CODE,
OU.ORG_CODE BR_CODE,
LEFT JOIN CCY_DUNQ ic ON AR.CCY_ID = ic.CCY_ID
LEFT JOIN AR_X_PD_RLTNP_TOANNT AR_X_PD
ON AR.AR_ID = AR_X_PD.AR_ID AND AR_X_PD.EFF_DT