Data Engineering and Data Sciences

Knowledge discovery in databases (KDD) and pattern recognition (PR) are the core problem-solving techniques for applying data engineering to data science problems. Before examining KDD, let us review problem-solving techniques in general. The output of any problem-solving methodology is knowledge.

Knowledge can satisfy one or more themes, such as empirical, rational and pragmatic knowledge. Individuals can build coherence between different pieces of such knowledge, or consensus can serve as the ultimate criterion for judging it.

Different methodologies for acquiring knowledge include ontological, epistemological, axiological and rhetorical methodologies.

In data engineering problems, the various notions of knowledge discovery crystallize into three steps: problem definition, problem analysis and solution implementation. For problem definition, we need to be able to specify the end data science problem as a composition of well-defined statistical problems. For problem analysis, we need to understand the present and new system of algorithms implementing the statistical problems. For solution implementation, we need to generate conceptual solutions that take into account the subject matter knowledge related to implementation/deployment detail. Additionally, the implementation/deployment detail can be validated against decision-making processes to determine triggers for redesign, reconfiguration or decommissioning.

The three steps of problem definition, problem analysis and solution implementation can also benefit from the various standards in systems life cycles. The triggers for problem definition can result from situation analysis and objective formulation.

The triggers for problem analysis can result from solution search and evaluation. The triggers for solution implementation can result from solution selection and decision making. A systems life cycle is a systematic and structured procedure for formulating objectives and finding solutions given a situation analysis. Evaluating decisions and avoiding frequent errors are also part of problem solving with a systems life cycle. An iterative situation analysis is summarized as a problem definition derived from one or more of the mining tasks, system architectures and future goals. Ideas obtained from strengths–weaknesses–opportunities–threats (SWOT), cost–benefit and root cause analyses drawn from systems thinking are also useful in problem definition. Each method of systems analysis can emphasize one or more of institution, discourse, solution and convention in the problem-solving process.

The problem-solving techniques applied to the scenario of data engineering are called KDD. According to the standard definition in [7], the KDD process consists of the following nine steps. KDD, which deals with data mining and machine intelligence in the computational space, can be fruitfully combined with human–computer interaction (HCI) in the cognitive space, which deals with questions of human perception, cognition, intelligence and decision making when humans interact with machines.

Consequently, variants of the steps in KDD and HCI have been incorporated into industry standards like complex event processing (CEP), real-time business intelligence (RTBI), the Cross Industry Standard Process for Data Mining (CRISP-DM), the Predictive Model Markup Language (PMML) and Sample, Explore, Modify, Model and Assess (SEMMA). The nine KDD steps are listed below, followed by a short code sketch of their computational core.

1. Learning from the Application Domain: includes understanding relevant prior knowledge, the goals of the application and a certain amount of domain expertise;

2. Creating a Target Dataset: includes selecting a dataset or focusing on a subset of variables or data samples on which discovery is to be performed;

3. Data Cleansing (and Preprocessing): includes removing noise or outliers, strategies for handling missing data, etc.;

4. Data Reduction and Projection: includes finding useful features to represent the data, dimensionality reduction, etc.;

5. Choosing the Function of Data Mining: includes deciding the purpose and principle of the model for mining algorithms (e.g., summarization, classification, regression and clustering);

6. Choosing the Data Mining Algorithm: includes selecting the method(s) to be used for searching for patterns in the data, such as deciding which models and parameters may be appropriate and matching a particular data mining method with the criteria of the KDD process;

7. Data Mining: includes searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, clustering, sequence modeling, dependency and link analysis;

8. Interpretation: includes interpreting the discovered patterns and possibly returning to any of the previous steps, as well as possible visualization of the extracted patterns, removing redundant or irrelevant patterns and translating useful patterns into terms understandable by users;

9. Using Discovered Knowledge: includes incorporating this knowledge into the performance of the system, taking actions based on the knowledge or documenting it and reporting it to interested parties, as well as checking for, and resolving, potential conflicts with previously believed knowledge.
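
The following is a minimal sketch of the computational core of these steps (cleansing, reduction, algorithm choice, mining and interpretation), using scikit-learn on a synthetic dataset. The dataset, model choices and parameter values are illustrative assumptions, not prescriptions from the text.

```python
# Minimal, illustrative KDD-style pipeline; all parameter choices are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer          # step 3: data cleansing
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA             # step 4: data reduction and projection
from sklearn.tree import DecisionTreeClassifier   # steps 5-6: classification with a tree

# Step 2: create a target dataset (synthetic here), with some missing values injected
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.02] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 3-7: cleansing, reduction and mining composed as one pipeline
kdd_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # handle missing data
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=5)),                # dimensionality reduction
    ("mine", DecisionTreeClassifier(max_depth=3)),  # interpretable tree patterns
])
kdd_pipeline.fit(X_train, y_train)

# Steps 8-9: interpret the result and use the discovered knowledge
print("held-out accuracy:", kdd_pipeline.score(X_test, y_test))
print("tree depth:", kdd_pipeline.named_steps["mine"].get_depth())
```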

In practice, problem solving for data science with pattern recognition is defined in two steps. The first step is the study of the problem context for problem formulation, data collection and feature extraction. The second step is the study of the mathematical solution for modeling, evaluation and decision making. Whereas features (or key data dimensions) are the output of the first step, models (or key complexity measures) are the output of the second step. The features engineered in the first step give the analyst an idea of the data properties that capture the data complexity. Typically, the data properties are obtained by data sampling, data transformation and feature selection. The models engineered in the second step give the analyst an idea of the data properties that impact the analytics solution. Accuracy of the solution depends on the goodness of fit between model and problem. By removing irrelevant dimensions and compressing discriminatory information, feature selection may change the computational complexity of the model. Thus, the data properties in the model are quantitatively described through measures of computational complexity such as the degree of linear separability, the length of the class boundary, the shapes of class manifolds, uncertainty in feature space and uncertainty in complexity space. Research in both computational and statistical machine learning uses these data properties to carry out a systematic evaluation of algorithms against problems.
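
As an illustration, the degree of linear separability mentioned above can be approximated by the training accuracy of a linear classifier, measured before and after feature selection. The sketch below assumes a synthetic labelled dataset and uses this simple proxy; it is not the formal complexity measure from the literature.

```python
# Illustrative proxy for the degree of linear separability of a labelled dataset.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=400, n_features=30, n_informative=4,
                           n_redundant=10, random_state=1)

def linear_separability(X, y):
    """Fraction of training points a linear decision boundary classifies correctly."""
    clf = LinearSVC(max_iter=10000).fit(X, y)
    return clf.score(X, y)

print("before feature selection:", linear_separability(X, y))

# Feature selection removes irrelevant dimensions and compresses discriminatory
# information, which can change the measured complexity of the problem.
X_sel = SelectKBest(f_classif, k=4).fit_transform(X, y)
print("after feature selection: ", linear_separability(X_sel, y))
```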

In data mining project development, the methodology for translating business objectives into data mining objectives is often not immediately available. The common pitfalls of data mining implementation are addressed by questions on data mining cost estimation. The main factors affecting cost estimation are the data sources, the development platform and the expertise of the development team. Technical elements include data quality, data ownership, data generation processes and the type of data mining problem.

As discussed in [8], the drivers for cost estimation in data mining project development can be listed as follows. Until all of these drivers are sufficiently addressed, KDD and PR are conducted in a non-sequential, open-ended cycle of iterations in which backtracking to previous phases is usually necessary.

• Data Drivers

• Model Drivers

• Platform Drivers

• Tools and Techniques Drivers

• Project Drivers

• People Drivers.

At the beginning of a data mining project, we need to be able to design hypotheses and experiments that evaluate and validate the impact of implementation and deployment in detail. In this context, agile and lean software development methodologies, like SCRUM and parallel thinking, need to be flexible enough to manage data product development, management paradigms and programming paradigms in terms of features, personnel and cost. From a modeling perspective, the development process centers around a system model with executable specifications of the design, under continuous testing and verification of the implementation. Feasibility and compatibility of design goals are to be analyzed by multidomain simulation. Each organization has varying levels of formality in its modeling processes, driven by people culture, legal regulation, best practices and latest trends.

Six Sigma is a popular method for monitoring the quality and cost of organizational processes and activities. The focus of Six Sigma is on increasing throughput while reducing bottlenecks in scalability. Six Sigma is implemented by defining the skill sets necessary to successfully coordinate people's activities around organizational objectives. The formal approach to Six Sigma is defined by the acronym DMAIC, which stands for Define, Measure, Analyze, Improve and Control. The DMAIC cycle is monitored by statistical process control (SPC) tools. Problem definition and root cause analysis are the most difficult parts of the Six Sigma method, and the ideas taken from KDD and PR can help with these parts. Control charts, flow charts, scatter diagrams, concept maps and infographics are also popular data visualization techniques for Six Sigma.
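
As a minimal illustration of SPC monitoring within a DMAIC cycle, the sketch below computes an individuals-style control chart with mean ± 3-sigma limits over hypothetical process measurements and flags out-of-control points. In practice, the sigma for an individuals chart is usually estimated from the moving range rather than the sample standard deviation used here.

```python
# Simplified control-chart check over hypothetical process data (illustrative only).
import numpy as np

rng = np.random.default_rng(42)
measurements = rng.normal(loc=10.0, scale=0.5, size=50)  # hypothetical process metric
measurements[30] += 3.0                                   # inject a special-cause shift

center = measurements.mean()
sigma = measurements.std(ddof=1)                          # proxy; moving range is typical
ucl, lcl = center + 3 * sigma, center - 3 * sigma         # upper/lower control limits

out_of_control = np.flatnonzero((measurements > ucl) | (measurements < lcl))
print(f"center line = {center:.2f}, UCL = {ucl:.2f}, LCL = {lcl:.2f}")
print("out-of-control sample indices:", out_of_control)
```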
