7.3 Aggregating Big Data Sets


Data aggregation is the process of shaping the data into a meaningful summary.

Once the data gets aggregated, we can query the summarized data rather than going through the raw data. Hence, aggregation makes querying more efficient and enables engineers, analysts, and managers to get insights about the data quicker. Before aggregation, we have to move the data into a common processing environment, as discussed in the first two sections. Once the data is moved, we cleanse it, transform it into the target summary, and save the summarized data with an appropriate retention policy. The aggregation stages are depicted in Figure 7.5.

7.3.1 Data Cleansing

Data cleansing is the step where we fix the problems that come with the data. We can remove incorrect, incomplete, or corrupted data, reformat it when there is a formatting problem, and remove duplicates. Since we receive the data from multiple sources, each data source might bring its own way of formatting or labeling data. If the data is inaccurate or misleading, we cannot reach actionable and insightful results.

Thus, we have to cleanse the data before processing it. The cleansing problem depends entirely on the data sources we depend on, and it varies drastically from one working set to another. Nevertheless, we will try to give some common scenarios to fix before starting the transformation.

Some of the data we receive might not be usable because it might be partial, incorrect, or even corrupted. We have to detect these problems and remove rows that don't conform to our data quality guidelines.

Figure 7.5 Data aggregation stages: deduplicated, parsed raw data from various data sources is cleaned up with friendly column/attribute names, joined and aggregated into intermediary results, and stored with retention.

The problem is easier when some of the columns simply have no data; we can easily filter those rows out. Nevertheless, it gets more complicated when the data contains incorrect values. For example, a country column might have correct countries along with some random text. We have to remove the rows with random text, and an easy way is to execute an inner join with the country table. When the data is corrupted, we might not even be able to load it properly, so it gets eliminated during loading. Nevertheless, it is always good to have a threshold for loading errors. For instance, for every billion rows, 100 loading errors might be fine.
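
As a rough sketch, the checks above could look like the following in PySpark; the table locations, the user_id and country columns, and the reference table of valid countries are illustrative assumptions, not something fixed by the platform.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cleansing").getOrCreate()

# Load raw data; DROPMALFORMED silently discards rows that cannot be parsed at all.
raw = spark.read.option("mode", "DROPMALFORMED").json("/data/raw/events/")
countries = spark.read.parquet("/data/reference/countries/")  # one row per valid country

# Drop rows whose mandatory columns are missing.
complete = raw.filter(F.col("user_id").isNotNull() & F.col("country").isNotNull())

# Keep only rows whose country value appears in the reference table
# (a left semi join acts like an inner join but keeps only the left-hand columns).
valid = complete.join(countries, on="country", how="left_semi")

# Apply a rejection threshold, e.g. tolerate about 100 bad rows per billion;
# here the check covers every rejected row, not only unparseable ones.
total, kept = raw.count(), valid.count()
if total > 0 and (total - kept) / total > 100 / 1_000_000_000:
    raise RuntimeError(f"Too many rejected rows: {total - kept} of {total}")
```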

The next step is to fix the differences between data sources. We might get "True" as text from one data source and "1" from another for the same boolean value. We need to convert them into the same type, going through all such columns and fixing them before any transformation. Moreover, some data might come in completely different forms such as JSON. We might need to explode JSON values into appropriate columns before joining with other tables.
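
A minimal sketch of both fixes in PySpark, assuming a made-up is_active_raw column and a hypothetical JSON payload schema:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("normalize").getOrCreate()
df = spark.read.parquet("/data/staging/events/")  # illustrative path

# Normalize "True"/"1"/"yes" from different sources into a single boolean column.
df = df.withColumn(
    "is_active",
    F.when(F.lower(F.col("is_active_raw")).isin("true", "1", "yes"), F.lit(True))
     .otherwise(F.lit(False)),
)

# Explode a JSON payload column into proper columns before joining with other tables.
payload_schema = StructType([
    StructField("device", StringType()),
    StructField("app_version", StringType()),
    StructField("duration_ms", IntegerType()),
])
df = (df
      .withColumn("payload", F.from_json(F.col("payload_json"), payload_schema))
      .select("*", "payload.*")
      .drop("payload", "payload_json"))
```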

The data might be almost complete, but not fully. For example, we might have a city but not a country for some of the rows. We can potentially guess the country value by observing the city name. Yet, we need to decide carefully and document such inferences; we never want to provide inaccurate data. It is probably better to miss the data than to have fallacious data. The consumers of the data should not have doubts about its accuracy.
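
A conservative sketch of such a backfill, assuming a hypothetical city-to-country reference table: the country is filled in only when a city maps to exactly one country, and ambiguous cases are left missing.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("impute-country").getOrCreate()
events = spark.read.parquet("/data/clean/events/")            # has city and country columns
city_country = spark.read.parquet("/data/reference/cities/")  # columns: city, country

# Keep only cities that map to exactly one country; ambiguous names stay unresolved.
unambiguous = (city_country
               .groupBy("city")
               .agg(F.countDistinct("country").alias("n_countries"),
                    F.first("country").alias("guessed_country"))
               .filter(F.col("n_countries") == 1)
               .select("city", "guessed_country"))

# Fill the missing country only when there is an unambiguous guess; document this rule.
imputed = (events
           .join(unambiguous, on="city", how="left")
           .withColumn("country", F.coalesce(F.col("country"), F.col("guessed_country")))
           .drop("guessed_country"))
```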

Finally, some of the data might not align with the overall data trend, e.g. extreme cases. We might want to smooth the data by removing such outliers. To detect them, we need to run some exploratory queries to find the top-n elements for a given column or columns. Once we detect such outliers, we should come up with a reasonable threshold to remove them.
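
A sketch of this exploration and smoothing in PySpark, using an illustrative purchase_amount column; the 99.9th percentile threshold is only an example of a reasonable cutoff.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("outliers").getOrCreate()
df = spark.read.parquet("/data/clean/purchases/")  # illustrative data set

# Exploratory query: inspect the top-n values of a column to spot extreme cases.
df.orderBy(F.col("purchase_amount").desc()).select("purchase_amount").show(20)

# Pick a threshold, e.g. the 99.9th percentile, and drop anything above it.
threshold = df.approxQuantile("purchase_amount", [0.999], 0.001)[0]
smoothed = df.filter(F.col("purchase_amount") <= threshold)
```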

Consequently, we should review the data in terms of completeness, accuracy, and outliers. We should then develop a set of rules to cleanse the data and add it as a step in our pipeline. We should also run quality checks on the data, as well as some anomaly detection.

7.3.2 Data Transformation

The transformation step executes a series of stages that take one or more data sources to the target data. Depending on the path from source to target, the transformation can be costly in time and effort. Yet, data transformation helps to organize data better for both humans and computers, and we can produce a data model that is efficient for both querying and storage. The transformation process utilizes transformation functions and stages that consist of common mutations such as filtering and aggregation. Transformation stages define the transition from one or more data sources to the target data.

7.3.2.1 Transformation Functions

Transformation functions let us mutate data from one or more data sources to the target. Some of the transformation functions change the data, while others simply update the metadata. Metadata updates include casting, type conversion, localization, and renaming. Some common transformation functions that change the data are as follows; a combined sketch appears after these descriptions.

Filtering Data can come with rows that are irrelevant for what our target might need. In that case, we would like to filter these rows out. For example, we might want to summarize API call-based data and remove any other HTTP calls.

Projection When summarizing data, we generally don't need all the columns. Thus, we would project a subset of columns or attributes from the source data.

Mapping Mapping involves the conversion of one or more columns into a target column or columns. For example, we might combine two columns, weight and height, to compute BMI (body mass index).

Joining The target data might need information from many sources, and we might need to join two data sources based on common keys. For example, we can join the customer action table derived from logs with the customer table.

Masking We might need to hide original data due to security concerns. We would hide the original data with some modifications.

Partitioning Most common Big Data systems have the notion of partitioning. We need to limit the amount of data we process by partitioning it into time ranges. A typical transformation might limit data to weeks, days, or hours.

Aggregation We can apply numerous aggregation functions over one or more data sources grouped by a set of columns. In some cases, we can even introduce our own aggregation functions. Common aggregation functions are sum, max, min, median, count, and so forth.
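
The sketch below chains most of these functions in PySpark over a hypothetical request log joined with a customer table; every table, column, and path name is illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transform").getOrCreate()
requests = spark.read.parquet("/data/clean/requests/")    # illustrative sources
customers = spark.read.parquet("/data/clean/customers/")

summary = (requests
           # Filtering: keep only API calls.
           .filter(F.col("path").startswith("/api/"))
           # Partitioning: limit the work to a single partition range.
           .filter(F.col("dt") == "2021-01-01")
           # Projection: keep only the columns the target needs.
           .select("dt", "customer_id", "latency_ms")
           # Mapping: derive a new column from an existing one.
           .withColumn("latency_s", F.col("latency_ms") / 1000)
           # Joining: enrich with customer attributes over a common key.
           .join(customers.select("customer_id", "email", "segment"), on="customer_id")
           # Masking: hide the original email behind a hash.
           .withColumn("email", F.sha2(F.col("email"), 256))
           # Aggregation: group by a set of columns and apply aggregation functions.
           .groupBy("dt", "segment")
           .agg(F.count("*").alias("api_calls"),
                F.avg("latency_s").alias("avg_latency_s")))

summary.write.mode("overwrite").partitionBy("dt").parquet("/data/agg/api_calls_by_segment/")
```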

7.3.2.2 Transformation Stages

Going from multiple sources to targets might involve a complex directed acyclic graph (DAG), where we would have to output intermediary results. We might use these intermediary results in other flows. Most commonly, we need these results simply because the process would become cumbersome otherwise. There are a few areas we need to harden for the transformation stages.

Timestamping We already discussed timestamping in data transfer. We should timestamp the data for intermediary stages as well. Flows outside of the current one might depend on the intermediary stage to complete their transformation. Timestamping will ensure that the data won't be incomplete for other flows.
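
One possible way to implement this, sketched in plain Python: the stage writes its data first and only then publishes a small marker file with the completion timestamp, which dependent flows check before reading. The _COMPLETED marker convention and the paths are assumptions for illustration.

```python
import json
import time
from pathlib import Path

def publish_stage(output_dir: str) -> None:
    """Mark an intermediary stage as complete only after all of its data is written."""
    marker = Path(output_dir) / "_COMPLETED"
    marker.write_text(json.dumps({"completed_at": int(time.time())}))

def stage_is_ready(output_dir: str) -> bool:
    """Dependent flows call this before reading, so they never see partial data."""
    return (Path(output_dir) / "_COMPLETED").exists()

# Example: a dependent flow waits until the intermediary result is published.
# publish_stage("/data/intermediate/a_b_joined/dt=2021-01-01")
# assert stage_is_ready("/data/intermediate/a_b_joined/dt=2021-01-01")
```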

Validation Important steps in complex transformations should have relevant validations and anomaly detection. Failed validations or detected anomalies should fail the entire transformation: we don't want to waste resources on computing inaccurate data, and more importantly, we shouldn't produce misleading data.
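
A minimal sketch of such a validation step; the particular checks (a non-empty result and a bounded null rate on the join key) are merely examples of rules that should fail the whole transformation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("validate").getOrCreate()
df = spark.read.parquet("/data/intermediate/a_b_joined/")  # illustrative stage output

total = df.count()
null_keys = df.filter(F.col("customer_id").isNull()).count()

# Fail fast: a failed validation should stop the whole transformation,
# so downstream steps never compute on top of inaccurate data.
if total == 0:
    raise RuntimeError("Validation failed: intermediary result is empty")
if null_keys / total > 0.01:
    raise RuntimeError(f"Validation failed: {null_keys}/{total} rows miss the join key")
```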

Idempotency Retrying any step in the transformation should always give the same effect. Each step should therefore clean up its previous output before rerunning. For example, we should clear the directory for the data before executing the transformation step.
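
A sketch of an idempotent step under the assumption that the output lives on a local or mounted filesystem; the previous output is cleared and the write overwrites whatever remains, so rerunning leaves the same end state.

```python
import shutil
from pathlib import Path

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("idempotent-step").getOrCreate()

def run_step(input_path: str, output_path: str) -> None:
    # Clean up any partial output from a previous failed attempt first,
    # so rerunning this step always leaves the same end state.
    out = Path(output_path)
    if out.exists():
        shutil.rmtree(out)

    df = spark.read.parquet(input_path)
    # mode("overwrite") also replaces existing data if the directory reappears
    # between the cleanup and the write.
    df.write.mode("overwrite").parquet(output_path)
```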

Check Pointing When a single transformation step gets complex or takes a long time to compute, we should consider dividing it into multiple stages. Multiple stages allow us to fall back to the previous stage when a stage fails. Furthermore, multiple stages make it easier to optimize the transformation.
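
A sketch of that division: each stage persists its result, and a rerun skips any stage whose checkpoint already exists, so a failure falls back to the last completed stage. The stage names, paths, and the local-filesystem check are placeholders.

```python
from pathlib import Path
from typing import Callable

from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("checkpointed-flow").getOrCreate()

def checkpointed(name: str, compute: Callable[[], DataFrame],
                 base: str = "/data/checkpoints") -> DataFrame:
    """Run a stage once; later reruns read the stored result instead of recomputing."""
    path = f"{base}/{name}"
    if Path(path).exists():
        return spark.read.parquet(path)
    result = compute()
    result.write.mode("overwrite").parquet(path)
    return result

# Placeholder stages of a long transformation, split so that a failure in stage 2
# does not force stage 1 to be recomputed.
stage1 = checkpointed("a_b_joined", lambda: spark.read.parquet("/data/clean/a").join(
    spark.read.parquet("/data/clean/b"), on="key"))
stage2 = checkpointed("a_b_c_aggregated", lambda: stage1.groupBy("key").count())
```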

7.3.3 Data Retention

We should consider data retention as part of the aggregation process. We should decide the retention policy for each aggregation step. Having well-defined retention on steps offers advantages for downstream data consumers and system admins. Consumers know how long the data is kept and can set up their business logic accordingly. System admins may not need to execute manual or semiautomated processes to free unused data. We should also have a retention policy for intermediate stages of data. Having a retention policy on intermediate results helps us react to bugs and data problems more easily: we can rerun the transformation steps from the stage where the bug was introduced and skip the previous ones, as long as the intermediate data is still retained.
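
As an illustration, a retention step over a date-partitioned Hive-style table could drop old partitions like this; the table name and the 90-day policy are examples.

```python
from datetime import date, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("retention").enableHiveSupport().getOrCreate()

RETENTION_DAYS = 90  # example policy agreed with the data's consumers
cutoff = (date.today() - timedelta(days=RETENTION_DAYS)).isoformat()
table = "analytics.api_calls_by_segment"  # hypothetical aggregated table

# Each partition value looks like 'dt=2021-01-01'; drop everything older than the cutoff.
for row in spark.sql(f"SHOW PARTITIONS {table}").collect():
    dt = row[0].split("=", 1)[1]
    if dt < cutoff:
        spark.sql(f"ALTER TABLE {table} DROP IF EXISTS PARTITION (dt='{dt}')")
```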

7.3.4 Data Reconciliation

Data reconciliation is a postprocessing method to improve the accuracy of the data by handling missing records and incorrect values. Reconciliation jobs work with updated data sources to produce the most accurate results. Ideally, all of the data comes cleansed to the aggregation layer. Nonetheless, data might deviate due to integration failures, upstream data format changes, and many other factors.

In some cases, reconciliation might need to coexist with normal aggregation because we deal with a stream of data. For example, if we are processing data daily, we might have to cut a user session that spans midnight into two pieces. We can correct this behavior with a reconciliation job that finds such cases and updates them.

Although we aim for perfect accuracy, it is not always possible. When it comes to reconciliation jobs, we must define an error margin that triggers them: if the error rate is under the margin, we skip postprocessing. If the error margin is too high, we might serve unexpectedly inaccurate data; if it is too low, we waste resources on unnecessary postprocessing. We should find a good balance for the error margin depending on the business requirements.
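
A sketch of that trigger in plain Python; the margin value and the way the error rate is measured are placeholders that would come from the business requirements.

```python
ERROR_MARGIN = 0.005  # e.g. tolerate up to 0.5% inaccurate rows before reconciling

def should_reconcile(inaccurate_rows: int, total_rows: int) -> bool:
    """Trigger postprocessing only when the error rate exceeds the margin."""
    if total_rows == 0:
        return False
    return inaccurate_rows / total_rows > ERROR_MARGIN

# Example: 3,000 suspect rows out of 10 million is 0.03%, under the margin, so skip.
print(should_reconcile(3_000, 10_000_000))  # False
```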

When we are running reconciliation as part of our data pipelines, we should avoid rewriting whole partitions if possible. One way to deal with such operations is to write reconciliation updates to a separate table partitioned the same way as the original aggregation. When reading for accuracy, we can consolidate results from both tables and return the most accurate answer to the user. In many cases, a view over both tables can serve users very effectively and avoids materializing the whole consolidation.
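
One way to express this consolidation is a view in which corrected rows from the reconciliation table take precedence over the original aggregation; the table and key names below are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reconciliation-view").enableHiveSupport().getOrCreate()

# Rows in the reconciliation table override the original aggregation for the same key;
# everything else is served from the original table, so no partition gets rewritten.
spark.sql("""
    CREATE OR REPLACE VIEW analytics.api_calls_reconciled AS
    SELECT r.dt, r.segment, r.api_calls, r.avg_latency_s
    FROM analytics.api_calls_reconciliation r
    UNION ALL
    SELECT a.dt, a.segment, a.api_calls, a.avg_latency_s
    FROM analytics.api_calls_by_segment a
    LEFT ANTI JOIN analytics.api_calls_reconciliation r
      ON a.dt = r.dt AND a.segment = r.segment
""")
```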
