The data science life cycle is a relatively new concept. It is an extension of the data life cycle, which has a long history in data mining and data management.
A typical data life cycle has steps like data acquisition, data cleaning, data processing, data publishing, and saving results (Stodden, 2020). As an extension of the data life cycle, the data science life cycle has the same phases with additional steps.
It consists of business objectives, data understanding, data ingestion, data preparation, data exploration, feature engineering, data modeling, model evaluation, model deployment, and operationalizing. The data science life cycle is depicted in Figure 8.1.
8.2.1 Business Objective
The vital part of creating a data solution is to establish the grounds for the work.
It sounds obvious. However, many people rush into solutions without thinking
about business value.

Figure 8.1 Data science life cycle.

Once a business problem is identified, it can be cast into
potential data science problems. The following questions require answers:
● How many items can we sell?
● What should the price of the new item be?
● How should we arrange our items?
● Which segment of the customers should we target?
● Are spikes normal?
● Which books should we recommend?
● Which user groups exist?
● Is it likely for customers to accept a new offer?
These are just a few of the questions that call for data science. Nevertheless, having questions is still not enough. A justification of how the organization benefits from investing in the data science project is needed. The project’s objective might be customer retention, additional revenue, or user engagement. Setting the objective is crucial to starting the project.
8.2.2 Data Understanding
Understanding data is a prerequisite to building a data science solution. However, teams often do not allocate enough time for it. The meaning of the data, as well as its strengths, costs, and weaknesses, should be understood. The historical data might not be comprehensive enough for the problem being solved. Moreover, it may be only partly relevant to the solution. Even if the data is available, there might be costs associated with it.
Historical transactions or logged data may not have been collected with the same concerns as the data science solution. The data might lack the information needed or might be in a different format. Revisions to the infrastructure might be needed to get better data.
Moreover, the data might have other issues, such as reliability and consistency problems.
While building the solution, several trade-offs are made about those issues.
There is also a cost associated with retrieving data. It might be necessary to buy data from a third-party company or to partner with another organization to bring in data. Even if the data is already in the platform, there is some preprocessing cost.
While thinking about a data solution, the costs should also be considered as it could be challenging to justify the expenses later. It is important to uncover costs before investing.
8.2.3 Data Ingestion
Data ingestion might be a step that can be skipped if the needed data is already in the storage system. In reality, additional data or modifications to current data integrations are often needed, such as integration with a completely new data source: a new database or third-party data. If the Big Data platform is flexible enough, integration with a new data source may be easy. Once the technical part of the integration has been figured out, the focus shifts to frequency and backfilling. A daily update to the data might be good enough for many cases, but it also depends on the frequency of deployments. Moreover, there may be a need to backfill historical data. Backfilling can occur when a modification is needed in the existing pipeline and extra attributes are computed down the chain.
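As an illustration, a minimal Python sketch of a backfill over a historical date range is shown below; the daily ingestion job is only a hypothetical placeholder for whatever the platform's integration layer provides.

from datetime import date, timedelta
from typing import Callable

def backfill(start: date, end: date, ingest_day: Callable[[date], None]) -> None:
    # Re-run the same daily ingestion job over a historical range, for example
    # after an extra attribute has been added to the pipeline.
    day = start
    while day <= end:
        ingest_day(day)
        day += timedelta(days=1)

def ingest_day(day: date) -> None:
    # Placeholder: a real job would pull one day of data from the new source
    # and write it to the storage system.
    print(f"ingesting partition {day.isoformat()}")

backfill(date(2021, 1, 1), date(2021, 1, 31), ingest_day)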
8.2.4 Data Preparation
Once all the data is in the storage system, it is ready for data preparation. Preparation involves formatting, removing unrecoverable rows, making consistency modifications, and inferring missing values. Beyond structural issues, information leaks should also be watched for.
The source format of the data might not fit the project. For example, it might come in JSON and need to be converted into a column-oriented format. If some rows are corrupted or missing many values, it might be a good idea to remove them.
Consistency problems like uppercase/lowercase or yes/no vs 1/0 can be addressed.
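A minimal sketch of these structural fixes with pandas is shown below; the file and column names are hypothetical, and writing Parquet assumes pyarrow or fastparquet is installed.

import pandas as pd

# Convert JSON source data (one record per line) into a column-oriented format.
df = pd.read_json("orders.json", lines=True)

# Remove rows that are corrupted or missing most of their values.
df = df.dropna(thresh=len(df.columns) // 2)

# Address consistency problems such as uppercase/lowercase and yes/no vs 1/0.
df["city"] = df["city"].str.lower()
df["is_member"] = df["is_member"].replace({"yes": 1, "no": 0})

df.to_parquet("orders.parquet", index=False)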
In the last step, missing values can be inferred through imputation. Imputation is the process of replacing missing values in rows with substitutions (Efron, 1994).
It can use different techniques like mean/median, most frequent, and feature similarity.
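A minimal sketch of imputation with scikit-learn's SimpleImputer is shown below, using mean imputation for a numeric column and the most frequent value for a categorical one; the tiny data set is made up for illustration.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0],
                   "segment": ["a", "b", np.nan, "b"]})

# Replace missing numeric values with the column mean.
df[["age"]] = SimpleImputer(strategy="mean").fit_transform(df[["age"]])

# Replace missing categorical values with the most frequent value.
df[["segment"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["segment"]])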
A leak is a condition where the collected data has columns that give information about the target but are not available at the time of computation (Kaufman et al., 2012).
For example, the total number of items in the shopping cart is recorded at checkout time. However, that total can only be known after the checkout.
Therefore, it cannot be part of the model while the customer is still shopping. Leakage should be watched for carefully, as modeling happens on historical data.
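As a sketch of how this might be handled in practice, any column that is only known after the prediction moment can be dropped before modeling; the file and column names below are hypothetical.

import pandas as pd

df = pd.read_parquet("orders.parquet")

# Columns known only after checkout must not be used to predict during shopping.
leaky_columns = ["total_items_at_checkout"]
features = df.drop(columns=leaky_columns, errors="ignore")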
8.2.5 Data Exploration
Data exploration is a phase for forming hypotheses about the data by unveiling points of interest, characteristics, and primary patterns. Data exploration can use many tools, from data visualization software to libraries like pandas. Data exploration tools help summarize data through simple aggregations to uncover the nature of the data, and they provide an easy way to detect relationships between variables, outliers, and the data distribution.
Data exploration tools can be used to script analyses over many aspects of the data. With automated exploration tools, the data can be visualized in many ways. Variables can be chosen, and statistics derived from them. Moreover, many visualization tools allow writing SQL statements, through which a more sophisticated view of the data can be obtained.
Consequently, data insights that can inspire subsequent feature engineering and later model development are sought.
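A minimal exploration sketch with pandas is shown below; the file and column names are hypothetical, the correlation call uses the numeric_only flag available in recent pandas versions, and the histogram call assumes matplotlib is installed.

import pandas as pd

df = pd.read_parquet("orders.parquet")

print(df.describe())                       # summary statistics per column
print(df["order_hour"].value_counts())     # distribution of a single variable
print(df.corr(numeric_only=True))          # pairwise relationships between variables
df.hist(figsize=(10, 6))                   # distributions and outliers at a glance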
8.2.6 Feature Engineering
Feature engineering refers to all activities carried out to determine and select informative features for machine-learning models (Amershi et al., 2019). A feature is a direct or derived attribute in a data set. Feature engineering generally involves adding new features based on existing ones or selecting useful ones from a data set, and it helps data science algorithms perform better. It is an indispensable element that makes the difference between a good model and a bad one.
Feature engineering requires a good grasp of domain knowledge. It often requires some creativity to come up with good features that can potentially help the algorithms. Suppose a takeaway restaurant business and its delayed orders need to be examined. There are columns like the order id, the order timestamp, the number of items ordered, and the delivery timestamp. The data set can be enriched with two additional columns. The first is an attribute such as the order hour, which depends on the order timestamp. The second is the order duration, which depends on the order timestamp and the delivery timestamp.
With these two additions, the hour of the day can easily be correlated with durations, and the delivery time for a customer can be predicted.
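A minimal pandas sketch of this enrichment is shown below; the column names and sample rows are hypothetical.

import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2],
    "order_ts": pd.to_datetime(["2021-06-01 18:45", "2021-06-01 12:10"]),
    "delivery_ts": pd.to_datetime(["2021-06-01 19:30", "2021-06-01 12:40"]),
    "item_count": [3, 1],
})

# Order hour, derived from the order timestamp.
orders["order_hour"] = orders["order_ts"].dt.hour

# Order duration in minutes, derived from the order and delivery timestamps.
orders["order_duration_min"] = (
    orders["delivery_ts"] - orders["order_ts"]
).dt.total_seconds() / 60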
There is no one-size-fits-all solution for feature engineering, as it heavily depends on the domain. Still, some common approaches exist. Columns can be generalized so that it is easier to fit the model: instead of age, a feature like age group can be added, and sparse categories can be replaced with an "others" category to get better results. After adding new features, redundant columns left over from the new features and unused features like the order id can be removed, as they do not provide anything for a potential model. Lastly, data science algorithms generally do not have a way of dealing with enumeration types, so these can be replaced with numerical values.
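A minimal sketch of these common approaches with pandas is shown below; the column names, bin edges, and frequency threshold are hypothetical.

import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3],
                   "age": [17, 34, 71],
                   "cuisine": ["pizza", "pizza", "sushi"]})

# Generalize: age groups instead of raw age.
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 60, 120],
                         labels=["child", "young", "adult", "senior"])

# Replace sparse categories with "others".
counts = df["cuisine"].value_counts()
df["cuisine"] = df["cuisine"].where(df["cuisine"].map(counts) >= 2, "others")

# Remove redundant and unused columns, then encode enumerations numerically.
df = df.drop(columns=["age", "order_id"])
df = pd.get_dummies(df, columns=["age_group", "cuisine"])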
It takes time to engineer good features, and sometimes the engineered features do not help the model as much as expected. The features then need to be revisited and revised. Feature engineering is not popular, but it is fundamental to building good models: it supplies features that algorithms can understand.
8.2.7 Modeling
Modeling is the step where data science algorithms come to the stage. In the first section, data science applications using these algorithms were discussed. One or more algorithms are chosen for modeling. Which models to train depends on several factors, such as the size, type, and quality of the data set. Once models are chosen, duration, cost, and output should be monitored. Modeling consists of several steps, as follows (a short sketch appears after the list):
● Choose one or more algorithms such as random forest and decision trees.
● Split the data into two sets. The first split is for training purposes, and the second split is for model testing.
● Feed data into the chosen algorithms to build the model.
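A minimal sketch of these steps with scikit-learn is shown below; the synthetic data set and the choice of a random forest are assumptions for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for a prepared feature matrix X and target y.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Split the data: one part for training, the other for model testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Choose an algorithm and feed the training data into it to build the model.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)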
Once the model is built from the data set and algorithms, the next step is to evaluate the performance of the models. Problems may arise during the modeling stage, and there will be a need to go back to feature engineering or even further to get better results.
8.2.8 Model Evaluation
Evaluating models is a decisive step in the data science life cycle. Depending on the evaluation outcome, the model can be deployed, or previous steps can be revisited to revise the current model. During model training, the chosen models are trained with a selected validation method. In model evaluation, the model is tested against predefined metrics.
Several validation methods split the data set, such as hold-out, cross-validation, leave-one-out, and bootstrap. In the hold-out method, the available data is separated into two mutually exclusive sets, a training set and a test set. The model is trained on the training set and evaluated with the test set. In the k-fold cross-validation method, the data set is split into k mutually exclusive subsets, the k folds. One of the subsets is used as the test set, and the model is trained on the remaining sets. The process is repeated k times, each time choosing a different test set. In the leave-one-out method, one data point is left out for testing, and the model is trained on the remaining data. This is continued until all data points get a chance to be tested. In the bootstrap method, m data points are sampled from the data set. The model is trained and tested against them. The process is repeated m times (Kohavi et al., 1995). A validation method is chosen depending on the amount of data and compute resources available. The hold-out method is cheap and easy, but the rest require a bit more sophistication and compute resources.
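A minimal sketch of k-fold cross-validation with scikit-learn is shown below (k = 5); the synthetic data set and random forest model are assumptions for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = RandomForestClassifier(random_state=42)

# k = 5: the data is split into five folds; each fold serves once as the test set.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())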
There are various ways to check the performance of the model on the test data set. Some of the notable measures are confusion matrix, accuracy, precision, sensi- tivity, specificity, F1 score, PR curve, and receiver operating characteristics (ROC) curve (Powers, 2011). A few measurements will be briefly discussed. Recall the concept of a confusion matrix as follows:
● True positives (TP): Predicted positive and are positive.
● False positives (FP): Predicted positive and are negative.
● True negatives (TN): Predicted negative and are negative.
● False negatives (FN): Predicted negative and are positive.
Accuracy is commonly used and can be defined as follows:
\[
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{8.1}
\]
Precision is another measurement method that tries to answer how much the model is right when it says it is right. The formula for precision is as follows:
\[
\text{Precision} = \frac{TP}{TP + FP} \tag{8.2}
\]
Which measurement method to use depends on the domain. For example, false negatives in cancer detection should not be missed. Thus, a measurement method that aligns with such a concern is needed.
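A minimal sketch of computing these measures with scikit-learn is shown below; the labels and predictions are made up for illustration.

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels from the test set
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, tn, fn)
print(accuracy_score(y_true, y_pred))    # (TP + TN) / (TP + FP + TN + FN), Eq. (8.1)
print(precision_score(y_true, y_pred))   # TP / (TP + FP), Eq. (8.2)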
8.2.9 Model Deployment
After evaluating the models, a model that meets expectations and has a reasonable performance is selected. The selected model is placed into a real-world system where investment can finally be returned in the deployment step. Depending on the solution, the model’s interaction with the outside world may vary. A classifier may expose an API for the outside world to make classifications.
Typically, a model can be serialized into various formats, and the serialized artifact can be loaded back into production with the rest of the system. Once the model is loaded, it can receive new data and return results. If the model is expected to receive real-time requests, a REST API can be built around the model to service requests. Although manual deployment of the model is possible, it is highly valuable to have an auto-deployment mechanism that enables data scientists to train new models easily and deploy their solutions on demand.
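A minimal sketch of such a deployment is shown below, assuming the model was serialized with joblib and that Flask is used for the REST API; the file name and request format are hypothetical.

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")   # previously serialized model artifact

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a payload such as {"features": [[0.1, 0.2, ...]]}.
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)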
When building a data science solution, a task queue can be useful. Machine learning models may take some time to evaluate a given request. Therefore, distributing work through a queue to multiple workers and letting the caller return immediately instead of waiting for the result can alleviate potential delays. The calling service can poll for the status and get the result once the model has evaluated the request and the worker has saved the result to the result storage. With a task queue, the deployment might look like the one in Figure 8.2. Nevertheless, the suggested solution increases the complexity of the system with additional components. Thus, the requirements should be carefully considered before committing to them.
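A minimal sketch of the task-queue variant is shown below, assuming Celery with a Redis message broker and result backend; the broker URLs and model file name are hypothetical.

import joblib
from celery import Celery

app = Celery("model_worker",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

model = joblib.load("model.joblib")

@app.task
def predict(features):
    # Runs on a model worker; the caller returns immediately and polls later.
    return model.predict([features]).tolist()

# Caller side (for example, the model API):
#   async_result = predict.delay([1.2, 3.4, 5.6])
#   if async_result.ready():
#       prediction = async_result.get()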
8.2.10 Operationalizing
Once there is a successful working solution, the last step is to operationalize.
Operationalizing requires automation, testing, and monitoring practices.

Figure 8.2 A sample data science model deployment: model APIs, a message broker, model workers, and a model result store.

Most
data science projects start as research and development projects. At some point, they become part of the production system. Thus, they have to meet production system requirements like the rest of the infrastructure.
Operationalizing a data science project requires investment from the organiza- tion as they are different from the standard software development life cycle. A good way to handle operationalizing data science projects is to have people with oper- ational experience in the data science teams. With this approach, the operational requirements can be baked into the solution early on. Deployment, monitoring, and other operations can become relatively easy. This will be discussed further down in the chapter.