Big Data Storage Patterns

Excerpt from: Aytas y building a modern data platform big data systems 2021 (pages 62 - 65)

Over the years, storage patterns have changed drastically with regard to data processing. With the pace of big data, many concepts have emerged that are very close to each other but differ in subtle ways. A big data platform should use the relevant patterns effectively to make the most of its data. In this section, we explain three storage patterns: data lakes, data warehouses, and data marts.

4.1.1 Data Lakes

The data lake concept is quite new. Dixon (2010) first used the term, likening it to a large body of water in a natural state. The catch is that data lakes may contain raw data that needs further processing to become worthwhile. This is, in fact, one of the defining differences between a traditional data store and a data lake. In the ingestion phase, data can be stored without a schema. Later, the schema can be defined when the data is ready to be read.
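To make the idea of defining the schema at read time concrete, here is a minimal sketch in Python. The raw records, the field names, and the `read_with_schema` helper are all hypothetical and serve only as illustration; a real data lake would keep such records in files or object storage rather than in a list.

```python
import json

# Raw events as they might land in the lake: stored as-is, no schema enforced.
RAW_EVENTS = [
    '{"user": "alice", "amount": "12.50", "ts": "2021-03-01"}',
    '{"user": "bob", "amount": "7", "extra": "ignored at read time"}',
    'not even json',
]

# Schema chosen by the reader, long after ingestion (hypothetical fields).
SCHEMA = {"user": str, "amount": float}


def read_with_schema(raw_lines, schema):
    """Apply a schema while reading; skip records that cannot satisfy it."""
    rows = []
    for line in raw_lines:
        try:
            record = json.loads(line)
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
        except (ValueError, KeyError, TypeError):
            # Malformed or incomplete raw data stays in the lake but is not returned.
            continue
    return rows


rows = read_with_schema(RAW_EVENTS, SCHEMA)
```

Ingestion accepted all three lines, including the malformed one. Only the reader decides which fields matter and how to type them, so another stakeholder could read the same raw data with a different schema.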

The data lake can contain structured, semi-structured, and unstructured data from both internal and external sources. We can derive results by combining all of these data types. For semi-structured and unstructured data, parsing or schema definition may be delayed until the data is read. Different stakeholders may extract data from the same data source in various ways. Nevertheless, this has a cost in terms of processing and knowledge transfer. Extracting data becomes a burden if it is done more than once. Thus, it might be a good idea to do the computation once and save the structured result once the data has been read.
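The suggestion to pay the extraction cost once can be sketched as follows. This is an illustration under assumptions: the pipe-delimited raw format, the `parse_raw` and `materialize` helpers, and the in-memory CSV (standing in for a saved file) are all hypothetical.

```python
import csv
import io

RAW_LOG = "2021-03-01|alice|12.50\n2021-03-01|bob|7\n"  # hypothetical raw format

parse_calls = 0  # counts how often the expensive parsing step runs


def parse_raw(raw_text):
    """Expensive step: turn raw text into structured rows."""
    global parse_calls
    parse_calls += 1
    rows = []
    for line in raw_text.splitlines():
        day, user, amount = line.split("|")
        rows.append({"day": day, "user": user, "amount": float(amount)})
    return rows


def materialize(rows):
    """Save the structured result once (an in-memory CSV stands in for a file)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["day", "user", "amount"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_structured(csv_text):
    """Cheap step: later readers load the saved structured copy."""
    return list(csv.DictReader(io.StringIO(csv_text)))


structured_csv = materialize(parse_raw(RAW_LOG))

# Two stakeholders read the same structured copy; the raw data is parsed only once.
sales_view = read_structured(structured_csv)
finance_view = read_structured(structured_csv)
```

Both views come from the saved copy, so the parsing logic and the knowledge embedded in it do not have to be repeated by each team.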

Since the data lake may contain data sources from outside the company, it might not be a good area for analysts. Domain knowledge about the data in a lake can be close to nonexistent, and one might need to elicit the metadata from the data itself. Nonetheless, it might be a good sandbox for data scientists and other interested parties. I believe it is beneficial to document the metadata once the data has been investigated, so that it is available for future use.
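Eliciting metadata from the data itself can start with something as simple as the sketch below. The sample records and the `infer_metadata` helper are hypothetical; the point is only that observed field names, types, and fill rates can be written down once and reused.

```python
from collections import defaultdict

# Hypothetical records from an external source that came with no documentation.
records = [
    {"id": 1, "country": "TR", "score": 0.8},
    {"id": 2, "country": "DE"},
    {"id": 3, "country": None, "score": 0.4},
]


def infer_metadata(rows):
    """Summarize observed field types and how often each field is populated."""
    types = defaultdict(set)
    filled = defaultdict(int)
    for row in rows:
        for field, value in row.items():
            if value is not None:
                types[field].add(type(value).__name__)
                filled[field] += 1
    return {
        field: {"types": sorted(types[field]), "fill_rate": filled[field] / len(rows)}
        for field in types
    }


metadata = infer_metadata(records)
```

The resulting dictionary is a candidate for the documentation mentioned above: it records, for instance, that `score` is a float that is present in only two of the three records.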

The data lake can serve as a source for other big data storage such as data warehouses and data marts. Once the data is cleansed, it can be moved out of the data lake into more finely controlled storage.

4.1.2 Data Warehouses

Data warehouses have been widely adopted by the industry for business intelligence and analytics. Like data lakes, data warehouses can hold large amounts of data. Nevertheless, traditional warehouses cannot be scaled horizontally the way data lakes can, because the underlying technologies differ.

Moreover, data warehouses accept well-defined data structures, often accompanied by some documentation. Companies typically require structured data since they drive business decisions based on the data. All decisions might be in limbo if the data structure or metadata is not well defined.

Data warehouses are the systems where different sets of structured data can be combined. Aggregating the data produces the information that the business needs. The derived information becomes the organizational memory, supporting reporting that goes back years across several aspects of the business. We can go even further and set up a single source of truth from the aggregated data. The single source of truth is an important asset that provides the same information to every department in the organization. It helps organizations make decisions based on the same facts.
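As a small illustration of combining structured data sets and aggregating them into one shared definition, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a warehouse. The `orders` and `customers` tables, their columns, and the `revenue_by_region` view are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10, 100.0), (2, 10, 50.0), (3, 20, 30.0);
    INSERT INTO customers VALUES (10, 'EMEA'), (20, 'APAC');
""")

# One agreed-upon definition of "revenue by region" that every department queries.
conn.execute("""
    CREATE VIEW revenue_by_region AS
    SELECT c.region AS region, SUM(o.amount) AS revenue
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
""")

revenue = conn.execute(
    "SELECT region, revenue FROM revenue_by_region ORDER BY region"
).fetchall()
```

Because every report reads from the same view, two departments cannot end up with two different revenue numbers for the same region.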

Data warehouses provide ad hoc querying methods over different interfaces. Data analysts and engineers, as well as executives, can discover business insights from different viewpoints. Occasionally, a query yields an insight into the business; it can then become a dashboard over the underlying data that serves different stakeholders.

4.1.3 Data Marts

A data mart is a relatively simple warehouse specialized in one area of the business or organization. For example, a company might have a data mart for sharing information with its partners. A data mart is organized around subjects such as sales, customers, and partners, and meets the expectations of different stakeholders within the organization.

Once the data has been processed in the data warehouse, one can import it into the relevant data mart. Because they have a more focused use, data marts can be fast and easy to use. Reporting is also a common use case for data marts. If the company wants to have dashboards for sales, then it would make sense to feed data into a sales data mart and build dashboards on top of it.
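The flow from warehouse to a subject-specific mart can be sketched in a few lines. The `warehouse_rows` sample and the `build_sales_mart` helper are hypothetical; in practice the mart would be a table or a separate store rather than a dictionary.

```python
from collections import defaultdict

# Hypothetical cleansed warehouse rows covering several business subjects.
warehouse_rows = [
    {"subject": "sales", "month": "2021-01", "amount": 120.0},
    {"subject": "sales", "month": "2021-01", "amount": 80.0},
    {"subject": "sales", "month": "2021-02", "amount": 50.0},
    {"subject": "partners", "month": "2021-01", "amount": 999.0},
]


def build_sales_mart(rows):
    """Keep only the sales subject and pre-summarize it by month."""
    mart = defaultdict(float)
    for row in rows:
        if row["subject"] == "sales":
            mart[row["month"]] += row["amount"]
    return dict(mart)


sales_mart = build_sales_mart(warehouse_rows)

# A dashboard reads the small summary directly instead of scanning the warehouse.
january_sales = sales_mart["2021-01"]
```

The mart holds only the sales subject, already summarized, which is what makes lookups from a sales dashboard fast and simple.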

4.1.4 Comparison of Storage Patterns

The comparison of different patterns of data storage and processing is presented in Table 4.1.

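Table 4.1 Comparison of big data storage patterns

Properties   | Data lake         | Data warehouse        | Data mart
-------------+-------------------+-----------------------+------------------------------
Usage        | Big data analysis | Business intelligence | Fast lookups for summary data
             | Machine learning  | Reporting             |
             | Data insights     | Data analytics        |
             |                   | Data visualization    |
-------------+-------------------+-----------------------+------------------------------
Data source  | Internal systems  | Data lakes            | Data warehouses
             | External systems  | Relational databases  |
-------------+-------------------+-----------------------+------------------------------
Data types   | Structured        | Aggregated            | Summarized
             | Unstructured      | Transferred           |
             | Semi-structured   |                       |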
