Data science programming is largely driven by declarative programming languages such as SQL, R and Scala, and by object-oriented programming languages such as Java, Python and Julia. Depending on the choice of programming language, the following data science programming stacks are available:
• JavaScript: JavaScript libraries are suitable for data visualization and exploratory data analysis. D3.js, Protovis, jQuery and Knockout are some of the popular JavaScript libraries; OpenRefine and Tableau are related data-wrangling and visualization tools.
• R: R has an extensive suite of libraries for all data analytics tasks and is the de facto standard platform for statistical modeling. However, R is most useful for datasets that fit in the main memory of a single computer. Thus, R is better suited to rapid prototyping than to large-scale development of data science patterns. Functionally, R is comparable to open-source analytics tools such as Pentaho and RapidMiner.
• Python: Python is an object-oriented programming language and an alternative to R when the dataset is too large for main memory. Models built in Python can exploit both CPU and disk for analyzing big data, for example by keeping arrays on disk (see the sketch after this list). However, Python packages tend to be specialized for a particular analytics task or model. NumPy, SciPy, the scikits, matplotlib, SageMath, PyTables and IPython are some of the popular Python libraries.
• Java: Java is an object-oriented programming language. Java is useful for performing Big Data Analytics on commodity hardware. The Hadoop software stack released by the Apache Foundation is written in Java. The Hadoop stack has components for data storage, data querying, parallel processing and machine learning. Research stacks such as Weka, LIBSVM Tools and ELKI are also written in Java.
• Scala: Scala is a functional programming language. The programming constructs in Scala are especially useful for machine learning on big data. The Apache Spark stack is written in Scala. The Spark stack has components for real-time computing and machine learning.
• C: In terms of sheer performance, C remains hard to beat for data analytics. However, mixing the low-level memory management of C with the abstractions of data science patterns is a difficult task. Many analytics packages produced by research communities are written in either C or C++. Vowpal Wabbit, LNKnet, megam, VFML and BLAS are some of the popular C libraries.
• Julia: Julia is said to combine the usability of Python with the performance of C. Julia was built for technical and statistical computing. Julia’s packages for analytics include those for parallel execution, linear algebra and signal processing.
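As noted in the Python entry above, models in Python can work with data that does not fit in main memory by keeping the array on disk. The following is a minimal sketch of that idea using NumPy’s memmap; the file name, dtype and array shape are illustrative assumptions, not a prescribed layout.

```python
# Sketch: out-of-core statistics with NumPy's memmap, so the array lives on
# disk and only the chunk currently being processed occupies main memory.
# The file name, dtype and shape are hypothetical.
import numpy as np

# Create a small on-disk array for the demonstration.
data = np.memmap("measurements.dat", dtype=np.float64, mode="w+", shape=(1_000_000,))
data[:] = np.random.default_rng(0).normal(size=data.shape)
data.flush()

# Re-open read-only and compute a running mean chunk by chunk,
# never loading the whole array into memory at once.
arr = np.memmap("measurements.dat", dtype=np.float64, mode="r", shape=(1_000_000,))
chunk = 100_000
total, count = 0.0, 0
for start in range(0, arr.shape[0], chunk):
    block = np.asarray(arr[start:start + chunk])  # materialize one chunk only
    total += block.sum()
    count += block.size
print("mean =", total / count)
```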
To scale machine learning with multithreaded programming, we can exploit one or more of main memory, CPU cores and hard disk. The open-source software described below is designed for scalable machine learning algorithms that use parallel processing over a hardware cluster. The design is a mixture of data-parallel and task-parallel approaches to parallel processing. The resulting libraries, such as Mahout and MLlib, are widely used in the data science community.
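As a rough illustration of the data-parallel approach, the sketch below splits a dataset across CPU cores, has each worker compute partial sufficient statistics for its split, and combines the partial results. Plain Python multiprocessing stands in for a cluster; this is not Mahout or MLlib code, and the number of splits and the statistics chosen are illustrative.

```python
# Sketch of the data-parallel pattern behind libraries such as Mahout and MLlib:
# each worker computes partial statistics on its own data split, and the partial
# results are combined afterwards. Multiprocessing stands in for a cluster.
from multiprocessing import Pool
import numpy as np

def partial_stats(split: np.ndarray):
    """Sufficient statistics (count, sum, sum of squares) for one data split."""
    return split.size, split.sum(), np.square(split).sum()

if __name__ == "__main__":
    data = np.random.default_rng(1).normal(loc=3.0, size=1_000_000)
    splits = np.array_split(data, 8)          # data parallelism: one split per core
    with Pool(processes=8) as pool:
        parts = pool.map(partial_stats, splits)
    n = sum(p[0] for p in parts)
    s = sum(p[1] for p in parts)
    ss = sum(p[2] for p in parts)
    print("mean =", s / n, "variance =", ss / n - (s / n) ** 2)
```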
• Iterative and Real-Time Applications: Hadoop is the most common iterative computing framework. Hadoop-like software is mainly built for embarrassingly parallel problems, with iterative computations and in-memory data structures for caching data across iterations. The software is not suitable if sufficient statistics cannot be defined within each data or task split. As a design pattern, MapReduce has its origin in functional programming languages and is built for embarrassingly parallel computations. Machine learning algorithms that can be expressed as statistical queries over summations are suitable for the MapReduce programming model (a sketch of this summation form is given after this list). Furthermore, the algorithms for iterative and real-time processing can be categorized by the number of MapReduce executions needed per iteration. Examples of iterative algorithms are ensemble algorithms, optimization algorithms, time–frequency analysis methods and graph algorithms over social networks, neural networks and tensor networks.
• Graph Parallel Processing Paradigms: Extensions to iterative computing frameworks using MapReduce add loop-aware task scheduling, loop-invariant caching and in-memory directed acyclic graphs, such as the Resilient Distributed Datasets (RDDs) that may be cached in memory and reused across iterations. By comparison, the R programming environment is designed for single-threaded, single-node execution over a shared array architecture suitable for iterative computing. R has a large collection of serial machine learning algorithms, and R extensions integrate R with Hadoop to support distributed execution over Hadoop clusters. Beyond key-value-set computing in MapReduce frameworks, we also have iterative computing frameworks and programming models built on graph parallel systems with think-like-a-vertex programming models, such as the Bulk Synchronous Parallel (BSP) model found in Apache Hama, Apache Giraph, Pregel, GraphLab and GoFFish. In contrast to the MapReduce programming model, the BSP programming model is suitable for machine learning algorithms such as conjugate gradient and support vector machines. A real-time data processing environment such as Kafka-Storm can help integrate real-time data processing with machine learning. BSP is suitable for implementing deterministic algorithms on graph parallel systems (a minimal vertex-centric sketch also follows this list). However, the user must architect the movement of data by minimizing the dependencies affecting parallelism and consistency across jobs in the graph. Such dependencies in a graph may be categorized as asynchronous, iterative, sequential and dynamic; data is associated with edges within a machine and with vertices across machines. In contrast to the BSP model, Spark has RDDs that allow parallel operations on data partitioned across the machines in a cluster. RDDs can then be stored to and retrieved from the Hadoop Distributed File System.
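The first sketch below illustrates the “statistical queries over summations” form mentioned in the Iterative and Real-Time Applications entry: a mapper emits per-class sufficient statistics as key-value pairs, a shuffle groups them by key, and a reducer sums them to obtain per-class means in a single pass. Plain Python functions stand in for Hadoop mapper and reducer tasks, and the toy records are hypothetical.

```python
# Sketch of the "statistical queries over summations" form that fits the
# MapReduce model: the mapper emits per-class sufficient statistics for its
# split, the reducer sums them by key. Pure Python stands in for Hadoop tasks.
from itertools import groupby
from operator import itemgetter
import numpy as np

def mapper(split):
    """Emit (class_label, (count, feature_vector)) pairs for one data split."""
    for label, x in split:
        yield label, (1, np.asarray(x, dtype=float))

def reducer(label, values):
    """Sum the partial statistics for one class and return its mean vector."""
    count = sum(c for c, _ in values)
    total = sum(v for _, v in values)
    return label, total / count

# Toy dataset: (label, feature vector) records spread over two splits.
splits = [[(0, [1.0, 2.0]), (1, [4.0, 4.0])],
          [(0, [3.0, 2.0]), (1, [6.0, 8.0])]]

# Shuffle phase: group the mapper output by key before reducing.
pairs = sorted((kv for s in splits for kv in mapper(s)), key=itemgetter(0))
means = [reducer(k, [v for _, v in group])
         for k, group in groupby(pairs, key=itemgetter(0))]
print(means)   # per-class mean vectors computed in one MapReduce pass
```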
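The second sketch illustrates the think-like-a-vertex BSP model: in each superstep every vertex consumes the messages it received, updates its value and sends messages along its out-edges, with a barrier between supersteps. A serial Python loop stands in for a Pregel- or Giraph-style graph cluster; the PageRank-style update and the toy graph are illustrative.

```python
# Sketch of the Bulk Synchronous Parallel, think-like-a-vertex model used by
# Pregel, Giraph and Hama: each superstep runs a per-vertex compute step and
# exchanges messages at the barrier. A serial loop stands in for a cluster.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}      # adjacency lists
rank = {v: 1.0 / len(graph) for v in graph}            # vertex values
damping, supersteps = 0.85, 20

inbox = {v: [] for v in graph}
for _ in range(supersteps):
    outbox = {v: [] for v in graph}
    for v in graph:                                    # "compute" per vertex
        if inbox[v]:                                   # messages from last superstep
            rank[v] = (1 - damping) / len(graph) + damping * sum(inbox[v])
        share = rank[v] / len(graph[v])
        for neighbour in graph[v]:                     # send along out-edges
            outbox[neighbour].append(share)
    inbox = outbox                                     # barrier: exchange messages
print(rank)
```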
In applying various algorithms in open-source software to a particular data science problem, the analyst building data products has to focus on the algorithm properties with respect to two steps, namely feature engineering and model fitting. Features encode information from raw data that can be consumed by machine learning algorithms. Algorithms that build features from data using domain knowledge and exploratory data analysis are called feature construction algorithms. Algorithms that build features from data using scientific knowledge and model evaluation are called feature extraction algorithms. Both feature construction and feature extraction algorithms are commonly used in Big Data Analytics. When many features are available for analysis, a compact mapping of the features to the data is derived by feature selection or variable selection algorithms. For a successful solution, the data analyst must obtain the maximum amount of information about the data science problem by engineering a large number of independent, relevant features from the data. A small sketch contrasting these steps follows.
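The sketch below contrasts the three steps on a toy table of raw measurements: a constructed ratio feature (domain knowledge), features extracted by projecting onto principal components (PCA via an SVD), and feature selection by keeping the highest-variance columns. The data and the choice of two components are illustrative assumptions.

```python
# Sketch contrasting feature construction, feature extraction and feature
# selection on a toy table of raw measurements. Column meanings are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
raw = rng.normal(size=(500, 4))                 # raw data: 500 rows, 4 measurements

# Feature construction: a domain-knowledge feature, e.g. the ratio of two columns.
ratio = raw[:, 0] / (np.abs(raw[:, 1]) + 1e-9)

# Feature extraction: project onto the top-2 principal components (PCA via SVD).
centred = raw - raw.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
components = centred @ vt[:2].T                 # two extracted features per row

# Feature selection: keep the columns with the highest variance.
keep = np.argsort(raw.var(axis=0))[-2:]
selected = raw[:, keep]

features = np.column_stack([ratio, components, selected])
print(features.shape)                           # (500, 5) engineered features
```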
Feature engineering can integrate multiple statistical analyses in a loosely coupled manner with minimum coordination between the various analysts. Depending on the statistical analysis being performed, feature engineering is an endless cycle of iterative code changes and tests. The templates and factors affecting the variability and dependencies of data, respectively, may be categorized into desired–undesired, known–unknown, controllable–uncontrollable and observable–unobservable factors involved in feature design and validation. For big data systems focused on scalability, feature engineering is focused on throughput rather than latency. Distant supervision and crowdsourcing are two popular techniques that have been used to define features at web scale (a toy distant-supervision sketch follows below). Several feature engineering approaches can be compared in terms of feature interdependence, feature subset selection, feature search method, feature evaluation method, parametric feature stability and predictive performance.
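A toy sketch of distant supervision is given below: sentences are labelled automatically by matching them against facts from a small knowledge base, trading label quality for the throughput needed at web scale. The relation, the facts and the sentences are hypothetical.

```python
# Sketch of distant supervision: sentences are labelled automatically by matching
# them against facts from a knowledge base, trading label quality for throughput.
# The knowledge base and sentences below are hypothetical toy data.
known_ceo_of = {("Satya Nadella", "Microsoft"), ("Tim Cook", "Apple")}

sentences = [
    "Satya Nadella announced Microsoft's quarterly results.",
    "Tim Cook visited a supplier in Texas.",
    "Apple and Microsoft both reported strong earnings.",
]

def weak_label(sentence, facts):
    """Label a sentence positive if both entities of any known fact appear in it."""
    return int(any(person in sentence and company in sentence
                   for person, company in facts))

labels = [weak_label(s, known_ceo_of) for s in sentences]
print(list(zip(labels, sentences)))   # noisy labels usable as training data at scale
```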
Better variable selection, noise reduction and class separation may be obtained by adding interdependent features into an objective function.
Univariate and multivariate algorithms for feature engineering are implemented using filter, wrapper and embedded programming paradigms, contrasted in the sketch below. Feature engineering with filters ranks features or feature subsets independently of a predictor or classifier. Feature engineering with wrappers uses a classifier to assess features or feature subsets. To search for features or feature subsets with the aid of a machine learning process, embedded feature engineering combines the search criteria evaluating the goodness of a feature or feature subset with the search criteria generating a feature or feature subset. Common ways to evaluate the goodness of a feature are statistical significance tests, cross-validation and sensitivity analysis. Common ways to generate features are exhaustive search, heuristic or stochastic search, feature or feature subset ranking and relevance, forward selection and backward elimination. Heuristic preprocessing, feature normalization, iterative sampling, Parzen windows, risk minimization, minimum description length, maximum likelihood estimation, generalized linear models, convolutional filters, clustering, decision trees, rule induction, neural networks, kernel methods, ensemble methods, matrix decomposition methods, regression methods and Bayesian learning are some of the most commonly used feature engineering models. Applications of feature engineering include data visualization, case-based reasoning, data segmentation and market basket analysis.
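The sketch below contrasts the three paradigms on a synthetic classification task using scikit-learn (one of the scikits mentioned earlier): a univariate filter, recursive feature elimination as a wrapper, and an L1-penalized model as an embedded method. The estimator choices and parameter values are illustrative, not a prescribed pipeline.

```python
# Sketch of filter, wrapper and embedded feature selection on a synthetic task.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Filter: rank features with a univariate statistic, independent of any classifier.
filt = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("filter   :", filt.get_support(indices=True))

# Wrapper: use a classifier to assess feature subsets (recursive elimination).
wrap = RFE(estimator=LogisticRegression(max_iter=1000),
           n_features_to_select=5).fit(X, y)
print("wrapper  :", wrap.get_support(indices=True))

# Embedded: the L1 penalty makes the feature search part of model fitting itself.
emb = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("embedded :", np.flatnonzero(emb.coef_[0]))
```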