Table of Contents
Mastering Java Machine Learning
Credits
Foreword
About the Authors
About the Reviewers
What this book covers
What you need for this book
Who this book is for
1 Machine Learning Review
Machine learning – history and definition
What is not machine learning?
Machine learning – concepts and terminology
Machine learning – types and subtypes
Datasets used in machine learning
Machine learning applications
Practical issues in machine learning
Machine learning – roles and process
Data quality analysis
Descriptive data analysis
Basic label analysis
Basic feature analysis
Visualization analysis
Univariate feature analysis
Categorical features
Continuous features
Multivariate feature analysis
Data transformation and preprocessing
Training, validation, and test set
Feature relevance analysis and dimensionality reduction
Feature search techniques
Feature evaluation techniques
Filter approach
Univariate feature selection
Information theoretic approach
Statistical approach
Multivariate feature selection
Minimal redundancy maximal relevance (mRMR)
Correlation-based feature selection (CFS)
Algorithm input and output
How does it work?
Advantages and limitations
Naïve Bayes
Algorithm input and output
How does it work?
Advantages and limitations
Logistic Regression
Algorithm input and output
How does it work?
Advantages and limitations
Non-linear models
Decision Trees
Algorithm inputs and outputs
How does it work?
Advantages and limitations
K-Nearest Neighbors (KNN)
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Support vector machines (SVM)
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Ensemble learning and meta learners
Bootstrap aggregating or bagging
Algorithm inputs and outputs
How does it work?
Random Forest
Advantages and limitations
Boosting
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Model assessment, evaluation, and comparisons
Model assessment
Model evaluation metrics
Confusion matrix and related metrics
ROC and PRC curves
Gain charts and lift curves
Model comparisons
Comparing two algorithms
McNemar's Test
Paired-t test
Wilcoxon signed-rank test
Comparing multiple algorithms
Sample end-to-end process in Java
Weka experimenter and model selection
RapidMiner experiments
Visualization analysis
Feature selection
Model process flow
Model evaluation metrics
Evaluation on Confusion Metrics
ROC Curves, Lift Curves, and Gain Charts
Results, observations, and analysis
Summary
References
3 Unsupervised Machine Learning Techniques
Issues in common with supervised learning
Issues specific to unsupervised learning
Feature analysis and dimensionality reduction
Notation
Linear methods
Principal component analysis (PCA)
Inputs and outputs
How does it work?
Advantages and limitations
Random projections (RP)
Inputs and outputs
How does it work?
Advantages and limitations
Multidimensional Scaling (MDS)
Inputs and outputs
How does it work?
Advantages and limitations
Nonlinear methods
Kernel Principal Component Analysis (KPCA)
Inputs and outputs
How does it work?
Advantages and limitations
Manifold learning
Inputs and outputs
How does it work?
Advantages and limitations
Clustering
Clustering algorithms
k-Means
Inputs and outputs
How does it work?
Advantages and limitations
DBSCAN
Inputs and outputs
How does it work?
Advantages and limitations
Mean shift
Inputs and outputs
How does it work?
Advantages and limitations
Expectation maximization (EM) or Gaussian mixture modeling (GMM)
Input and output
How does it work?
Advantages and limitations
Hierarchical clustering
Input and output
How does it work?
Advantages and limitations
Self-organizing maps (SOM)
Inputs and outputs
How does it work?
Advantages and limitations
Spectral clustering
Inputs and outputs
How does it work?
Advantages and limitations
Affinity propagation
Inputs and outputs
How does it work?
Advantages and limitations
Clustering validation and evaluation
Internal evaluation measures
Rand index
F-Measure
Normalized mutual information index
Outlier or anomaly detection
Outlier algorithms
Inputs and outputs
How does it work?
Advantages and limitations
Distance-based methods
Inputs and outputs
How does it work?
Advantages and limitations
Density-based methods
Inputs and outputs
How does it work?
Advantages and limitations
Clustering-based methods
Inputs and outputs
How does it work?
Advantages and limitations
High-dimensional-based methods
Inputs and outputs
How does it work?
Advantages and limitations
One-class SVM
Inputs and outputs
How does it work?
Advantages and limitations
Outlier evaluation techniques
Supervised evaluation
Unsupervised evaluation
Real-world case study
Tools and software
Business problem
Machine learning mapping
Data collection
Data quality analysis
Data sampling and transformation
Feature analysis and dimensionality reduction
Observations and clustering analysis
Outlier models, results, and evaluation
Observations and analysis
How does it work?
Advantages and limitations
Transductive SVM (TSVM)
Inputs and outputs
How does it work?
Advantages and limitations
Case study in semi-supervised learning
Tools and software
Datasets and analysis
Feature analysis results
Experiments and results
Analysis of semi-supervised learning
Active learning
Representation and notation
Active learning scenarios
Active learning approaches
Uncertainty sampling
How does it work?
Least confident sampling
Smallest margin sampling
Label entropy sampling
Advantages and limitations
Version space sampling
Query by disagreement (QBD)
How does it work?
Query by Committee (QBC)
How does it work?
Advantages and limitations
Data distribution sampling
How does it work?
Expected model change
Expected error reduction
Variance reduction
Density weighted methods
Advantages and limitations
Case study in active learning
Tools and software
Business problem
Machine learning mapping
Data Collection
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
5 Real-Time Stream Machine Learning
Assumptions and mathematical notations
Basic stream processing and computational techniques
Stream computations
Drift Detection Method or DDM
Early Drift Detection Method or EDDM
Monitoring distribution changes
Welch's t test
Kolmogorov-Smirnov's test
CUSUM and Page-Hinckley test
Adaptation methods
Online linear models with loss functions
Inputs and outputs
How does it work?
Advantages and limitations
Online Naïve Bayes
Inputs and outputs
How does it work?
Advantages and limitations
Non-linear algorithms
Hoeffding trees or very fast decision trees (VFDT)
Inputs and outputs
How does it work?
Advantages and limitations
Ensemble algorithms
Weighted majority algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Online Bagging algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Online Boosting algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Validation, evaluation, and comparisons in online setting
Model validation techniques
Prequential evaluation
Holdout evaluation
Controlled permutations
Evaluation criteria
Comparing algorithms and metrics
Incremental unsupervised learning using clustering
Modeling techniques
Inputs and outputs
How does it work?
Advantages and limitations
Inputs and outputs
How does it work?
Advantages and limitations
Density based
Inputs and outputs
How does it work?
Advantages and limitations
Grid based
Inputs and outputs
How does it work?
Advantages and limitations
Validation and evaluation techniques
Key issues in stream cluster evaluation
Evaluation measures
Cluster Mapping Measures (CMM)
V-Measure
Other external measures
Unsupervised learning using outlier detection
Partition-based clustering for outlier detection
Inputs and outputs
How does it work?
Advantages and limitations
Distance-based clustering for outlier detection
Inputs and outputs
How does it work?
Exact Storm
Abstract-C
Direct Update of Events (DUE)
Micro Clustering based Algorithm (MCOD)
Approx Storm
Advantages and limitations
Validation and evaluation techniques
Case study in stream learning
Tools and software
Business problem
Machine learning mapping
Data collection
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
Supervised learning experiments
Concept drift experiments
Clustering experiments
Outlier detection experiments
Analysis of stream learning results
Chain rule and Bayes' theorem
Random variables, joint, and marginal distributions
Marginal independence and conditional independence
Factors
Factor types
Distribution queries
Probabilistic queries
MAP queries and marginal MAP queries
Graph concepts
Graph structure and properties
Subgraphs and cliques
Path, trail, and cycles
Combined reasoning
Independencies, flow of influence, D-Separation, I-Map
Flow of influence
I-Map
Inference
Elimination-based inference
Variable elimination algorithm
Input and output
How does it work?
Advantages and limitations
Clique tree or junction tree algorithm
Input and output
How does it work?
Advantages and limitations
Propagation-based techniques
Belief propagation
Factor graph
Messaging in factor graph
Input and output
How does it work?
Advantages and limitations
Sampling-based techniques
Forward sampling with rejection
Input and output
How does it work?
Advantages and limitations
Learning
Learning parameters
Maximum likelihood estimation for Bayesian networks
Bayesian parameter estimation for Bayesian network
Prior and posterior using the Dirichlet distribution
Learning structures
Measures to evaluate structures
Methods for learning structures
Constraint-based techniques
Inputs and outputs
How does it work?
Advantages and limitations
Search and score-based techniques
Inputs and outputs
How does it work?
Advantages and limitations
Markov networks and conditional random fields
Representation
Parameterization
Gibbs parameterization
Factor graphs
Log-linear models
Independencies
Global
Pairwise Markov
Markov blanket
Inference
Learning
Conditional random fields
Specialized networks
Tree augmented network
Input and output
How does it work?
Advantages and limitations
Markov chains
Hidden Markov models
Most probable path in HMM
Machine learning mapping
Data sampling and transformation
Multi-layer feed-forward neural network
Inputs, neurons, activation function, and mathematical notation
Multi-layered neural network
Structure and mathematical notations
Activation functions in NN
Sigmoid function
Hyperbolic tangent ("tanh") function
Training neural network
Empirical risk minimization
Parameter initialization
Loss function
Gradients
Gradient at the output layer
Gradient at the Hidden Layer
Parameter gradient
Feed forward and backpropagation
How does it work?
Regularization
L2 regularization
L1 regularization
Limitations of neural networks
Vanishing gradients, local optimum, and slow training
Deep learning
Building blocks for deep learning
Rectified linear activation function
Restricted Boltzmann Machines
Definition and mathematical notation
Input and outputs
How does it work?
Deep Autoencoders
Deep Belief Networks
Inputs and outputs
How does it work?
Deep learning with dropouts
Definition and mathematical notation
Inputs and outputs
How does it work?
Learning Training and testing with dropouts
Sparse coding
Convolutional Neural Network
Local connectivity
Parameter sharing
Discrete convolution
Pooling or subsampling
Normalization using ReLU
CNN Layers
Recurrent Neural Networks
Structure of Recurrent Neural Networks
Learning and associated problems in RNNs
Long Short Term Memory
Gated Recurrent Units
Case study
Tools and software
Business problem
Machine learning mapping
Data sampling and transformation
Feature analysis
Models, results, and evaluation
Basic data handling
Parameter search using Arbiter
Results and analysis
Summary
References
8 Text Mining and Natural Language Processing
NLP, subfields, and tasks
Machine translation
Semantic reasoning and inferencing
Text summarization
Automating question and answers
Issues with mining unstructured data
Text processing components and transformations
Document collection and standardization
Inputs and outputs
How does it work?
Tokenization
Inputs and outputs
How does it work?
Stop words removal
Inputs and outputs
How does it work?
Stemming or lemmatization
Inputs and outputs
How does it work?
Local/global dictionary or vocabulary?
Feature representation and similarity
Vector space model
Binary
Term frequency (TF)
Inverse document frequency (IDF)
Term frequency-inverse document frequency (TF-IDF)
Similarity measures
How does it work?
Advantages and limitations
Text clustering
Feature transformation, selection, and reduction
Clustering techniques
Generative probabilistic models
Input and output
How does it work?
Advantages and limitations
Distance-based text clustering
Non-negative matrix factorization (NMF)
Input and output
How does it work?
Advantages and limitations
Evaluation of text clustering
Named entity recognition
Hidden Markov models for NER
Input and output
How does it work?
Advantages and limitations
Maximum entropy Markov models for NER
Input and output
How does it work?
Advantages and limitations
Deep learning and NLP
Tools and usage
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
Analysis of text processing results
References
9 Big Data Machine Learning – The Final Frontier
What are the characteristics of Big Data?
Big Data Machine Learning
General Big Data framework
Big Data cluster deployment frameworks
Hortonworks Data Platform
Cloudera CDH
Amazon Elastic MapReduce
Microsoft Azure HDInsight
Data acquisition
Publish-subscribe frameworks
Source-sink frameworks
SQL frameworks
Message queueing frameworks
Custom frameworks
Data storage
HDFS
NoSQL
Key-value databases
Document databases
Columnar databases
Graph databases
Data processing and preparation
Hive and HQL
Spark SQL
Amazon Redshift
Real-time stream processing
Machine Learning
Visualization and analysis
Batch Big Data Machine Learning
H2O as Big Data Machine Learning platform
H2O architecture
Machine learning in H2O
Tools and usage
Case study
Business problem
Machine Learning mapping
Data collection
Data sampling and transformation
Experiments, results, and analysis
Feature relevance and analysis
Evaluation on test data
Analysis of results
Spark MLlib as Big Data Machine Learning platform
Spark architecture
Machine Learning in MLlib
Tools and usage
Experiments, results, and analysis
k-Means
k-Means with PCA
Bisecting k-Means (with PCA)
Gaussian Mixture Model
Random Forest
Analysis of results
Real-time Big Data Machine Learning
SAMOA as a real-time Big Data Machine Learning framework
SAMOA architecture
Machine Learning algorithms
Tools and usage
Experiments, results, and analysis
Analysis of results
The future of Machine Learning
Eigendecomposition
Positive definite matrix
Singular value decomposition (SVD)
Mastering Java Machine Learning
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: June 2017
Foreword
Dr. Uday Kamath is a volcano of ideas. Every time he walked into my office, we had fruitful and animated discussions. I have been a professor of computer science at George Mason University (GMU) for 15 years, specializing in machine learning and data mining. I have known Uday for five years, first as a student in my data mining class, then as a colleague and co-author of papers and projects on large-scale machine learning. While a chief data scientist at BAE Systems Applied Intelligence, Uday earned his PhD in evolutionary computation and machine learning. As if having two high-demand jobs was not enough, Uday was unusually prolific, publishing extensively with four different people in the computer science faculty during his tenure at GMU, something you don't see very often. Given this pedigree, I am not surprised that less than four years since Uday's graduation with a PhD, I am writing the foreword for his book on mastering advanced machine learning techniques with Java. Uday's thirst for new stimulating challenges has struck again, resulting in this terrific book you now have in your hands.
This book is the product of his deep interest and knowledge in sound and well-grounded theory, and at the same time his keen grasp of the practical feasibility of proposed methodologies. Several books on machine learning and data analytics exist, but Uday's book closes a substantial gap: the one between theory and practice. It offers a comprehensive and systematic analysis of classic and advanced learning techniques, with a focus on their advantages and limitations, practical use, and implementations. This book is a precious resource for practitioners of data science and analytics, as well as for undergraduate and graduate students keen to master practical and efficient implementations of machine learning techniques.
The book covers the classic techniques of machine learning, such as classification, clustering, dimensionality reduction, anomaly detection, semi-supervised learning, and active learning. It also covers advanced and recent topics, including learning with stream data, deep learning, and the challenges of learning with big data. Each chapter is dedicated to a topic and includes an illustrative case study, which covers state-of-the-art Java-based tools and software, and the entire knowledge discovery cycle: data collection, experimental design, modeling, results, and evaluation. Each chapter is self-contained, providing great flexibility of usage. The accompanying website provides the source code and data. This is truly a gem for both students and data analytics practitioners, who can experiment first-hand with the methods just learned or deepen their understanding of the methods by applying them to real-world scenarios.
As I was reading the various chapters of the book, I was reminded of the enthusiasm Uday has for learning and knowledge. He communicates the concepts described in the book with clarity and with the same passion. I am positive that you, as a reader, will feel the same. I will certainly keep this book as a personal resource for the courses I teach, and strongly recommend it to my students.
Dr. Carlotta Domeniconi
Associate Professor of Computer Science, George Mason University
About the Authors
Dr. Uday Kamath is the chief data scientist at BAE Systems Applied Intelligence. He specializes in scalable machine learning and has spent 20 years in the domain of AML, fraud detection in financial crime, cyber security, and bioinformatics, to name a few. Dr. Kamath is responsible for key products in areas focusing on the behavioral, social networking, and big data machine learning aspects of analytics at BAE AI. He received his PhD at George Mason University, under the able guidance of Dr. Kenneth De Jong, where his dissertation research focused on machine learning for big data and automated sequence mining.
I would like to thank my friend, Krishna Choppella, for accepting the offer to co-author this book and being an able partner on this long but satisfying journey.
Heartfelt thanks to our reviewers, especially Dr. Samir Sahli for his valuable comments, suggestions, and in-depth review of the chapters. I would like to thank Professor Carlotta Domeniconi for her suggestions and comments that helped us shape various chapters in the book. I would also like to thank all the Packt staff, especially Divya Poojari, Mayur Pawanikar, and Vivek Arora, for helping us complete the tasks in time. This book required making a lot of sacrifices on the personal front and I would like to thank my wife, Pratibha, and our nanny, Evelyn, for their unconditional support. Finally, thanks to all my lovely teachers and professors for not only teaching the subjects, but also instilling the joy of learning.
Krishna Choppella builds tools and client solutions in his role as a solutions architect for analytics at BAE Systems Applied Intelligence. He has been programming in Java for 20 years. His interests are data science, functional programming, and distributed computing.
About the Reviewers
Samir Sahli was awarded a BSc degree in applied mathematics and information sciences from the University of Nice Sophia-Antipolis, France, in 2004. He received MSc and PhD degrees in physics (specializing in optics/photonics/image science) from University Laval, Quebec, Canada, in 2008 and 2013, respectively. During his graduate studies, he worked with Defence Research and Development Canada (DRDC) on the automatic detection and recognition of targets in aerial imagery, especially in the context of uncontrolled environment and sub-optimal acquisition conditions. He has worked since 2009 as a consultant for several companies based in Europe and North America specializing in the area of Intelligence, Surveillance, and Reconnaissance (ISR) and in remote sensing.
Dr. Sahli joined McMaster Biophotonics in 2013 as a postdoctoral fellow. His research was in the field of optics, image processing, and machine learning. He was involved in several projects, such as the development of a novel generation of gastrointestinal tract imaging device, hyperspectral imaging of skin erythema for individualized radiotherapy treatment, and automatic detection of the precancerous Barrett's esophageal cell using fluorescence lifetime imaging microscopy and multiphoton microscopy.
Dr. Sahli joined BAE Systems Applied Intelligence in 2015. He has since worked as a data scientist to develop analytics models to detect complex fraud patterns and money laundering schemes for insurance, banking, and governmental clients using machine learning, statistics, and social network analysis tools.
Prashant Verma started his IT career in 2011 as a Java developer in Ericsson, working in the telecom domain. After a couple of years of Java EE experience, he moved into the big data domain and has worked on almost all of the popular big data technologies such as Hadoop, Spark, Flume, Mongo, Cassandra, and so on. He has also played with Scala. Currently, he works with QA Infotech as a lead data engineer, working on solving e-learning problems with analytics and machine learning.
Prashant has worked for many companies, such as Ericsson and QA Infotech, with domain knowledge of telecom and e-learning. He has also worked as a freelance consultant in his free time.
I want to thank Packt Publishing for giving me the chance to review the book, as well as my employer and my family for their patience while I was busy working on this book.
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1785880519.
If you'd like to join our team of regular reviewers, you can e-mail us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Dedicated to my parents, Krishna Kamath and Bharathi Kamath, my wife, Pratibha
Shenoy, and the kids, Aaroh and Brandy
Dr. Uday Kamath
To my parents
Krishna Choppella
There are many notable books on machine learning, from pedagogical tracts on the theory of learning from data; to standard references on specializations in the field, such as clustering and outlier detection or probabilistic graph modeling; to cookbooks that offer practical advice on the use of tools and libraries in a particular language. The books that tend to be broad in coverage are often short on theoretical detail, while those with a focus on one topic or tool may not, for example, have much to say about the difference in approach in a streaming as opposed to a batch environment. Besides, for the non-novices with a preference for tools in Java who wish to reach for a single volume that will extend their knowledge (simultaneously, on the essential aspects), there are precious few options. Finding in one place
The pros and cons of different techniques given any data availability scenario: when data is labeled or unlabeled, streaming or batch, local or distributed, structured or unstructured
A ready reference for the most important mathematical results related to those very techniques for a better appreciation of the underlying theory
An introduction to the most mature Java-based frameworks, libraries, and visualization tools with descriptions and illustrations on how to put these techniques into practice
is not possible today, as far as we know.
The core idea of this book, therefore, is to address this gap while maintaining a balance between treatment of theory and practice, with the aid of probability, statistics, basic linear algebra, and rudimentary calculus in the service of one, and emphasizing methodology, case studies, tools, and code in support of the other.
According to the KDnuggets 2016 software poll, Java, at 16.8%, has the second highest share in popularity among languages used in machine learning, after Python. What's more, this marks a 19% increase from the year before! Clearly, Java remains an important and effective vehicle to build and deploy systems involving machine learning, despite claims of its decline in some quarters. With this book, we aim to reach professionals and motivated enthusiasts with some experience in Java and a beginner's knowledge of machine learning. Our goal is to make Mastering Java Machine Learning the next step on their path to becoming advanced practitioners in data science. To guide them on this path, the book covers a veritable arsenal of techniques in machine learning (some of which they may already be familiar with, others perhaps not as much, or only superficially), including methods of data analysis, learning algorithms, evaluation of model performance, and more in supervised and unsupervised learning, clustering and anomaly detection, and semi-supervised and active learning. It also presents special topics such as probabilistic graph modeling, text mining, and deep learning. Not forgetting the increasingly important topics in enterprise-scale systems today, the book also covers the unique challenges of learning from evolving data streams and the tools and techniques applicable to real-time systems, as well as the imperatives of the world of Big Data:
How does machine learning work in large-scale distributed environments?
What are the trade-offs?
How must algorithms be adapted?
How can these systems interoperate with other technologies in the dominant Hadoop ecosystem?
This book explains how to apply machine learning to real-world data and real-world domains with the right methodology, processes, applications, and analysis. Accompanying each chapter are case studies and examples of how to apply the newly learned techniques using some of the best available open source tools written in Java. This book covers more than 15 open source Java tools supporting a wide range of techniques between them, with code and practical usage. The code, data, and configurations are available for readers to download and experiment with. We present more than ten real-world case studies in Machine Learning that illustrate the data scientist's process. Each case study details the steps undertaken in the experiments: data ingestion, data analysis, data cleansing, feature reduction/selection, mapping to machine learning, model training, model selection, model evaluation, and analysis of results. This gives the reader a practical guide to using the tools and methods presented in each chapter for solving the business problem at hand.
What this book covers
Chapter 1, Machine Learning Review, is a refresher of basic concepts and techniques that the reader would have learned from Packt's Learning Machine Learning in Java or a similar text. This chapter is a review of concepts such as data, data transformation, sampling and bias, features and their importance, supervised learning, unsupervised learning, big data learning, stream and real-time learning, probabilistic graph models, and semi-supervised learning.
Chapter 2, Practical Approach to Real-World Supervised Learning, cobwebs dusted, dives straight into the vast field of supervised learning and the full spectrum of associated techniques. We cover the topics of feature selection and reduction, linear modeling, logistic models, non-linear models, SVM and kernels, ensemble learning techniques such as bagging and boosting, validation techniques and evaluation metrics, and model selection. Using WEKA and RapidMiner, we carry out a detailed case study, going through all the steps from data analysis to analysis of model performance. As in each of the other chapters, the case study is presented as an example to help the reader understand how the techniques introduced in the chapter are applied in real life. The dataset used in the case study is UCI HorseColic.
Chapter 3, Unsupervised Machine Learning Techniques, presents many advanced methods in clustering and outlier techniques, with applications. Topics covered are feature selection and reduction in unsupervised data, clustering algorithms, evaluation methods in clustering, and anomaly detection using statistical, distance, and distribution techniques. At the end of the chapter, we perform a case study for both clustering and outlier detection using a real-world image dataset, MNIST. We use the Smile API to do feature reduction and ELKI for learning.
Chapter 4, Semi-supervised Learning and Active Learning, gives details of algorithms and techniques for learning when only a small amount of labeled data is present. Topics covered are self-training, generative models, transductive SVMs, co-training, active learning, and multi-view learning. The case study involves both learning systems and is performed on the real-world UCI Breast Cancer Wisconsin dataset. The tools introduced are JKernelMachines, KEEL, and JCLAL.
Chapter 5, Real-Time Stream Machine Learning, covers how data streams in real time present unique circumstances for the problem of learning from data. This chapter broadly covers the need for stream machine learning and applications, supervised stream learning, unsupervised cluster stream learning, unsupervised outlier learning, evaluation techniques in stream learning, and metrics used for evaluation. A detailed case study is given at the end of the chapter to illustrate the use of the MOA framework. The dataset used is Electricity (ELEC).
Chapter 6, Probabilistic Graph Modeling, shows that many real-world problems can be effectively represented by encoding complex joint probability distributions over multi-dimensional spaces. Probabilistic graph models provide a framework to represent, draw inferences, and learn effectively in such situations. The chapter broadly covers probability concepts, PGMs, Bayesian networks, Markov networks, Graph Structure Learning, Hidden Markov Models, and Inferencing. A detailed case study on a real-world dataset is performed at the end of the chapter. The tools used in this case study are OpenMarkov and WEKA's Bayes network. The dataset is UCI Adult (Census Income).
Chapter 7, Deep Learning. If there is one superstar of machine learning in the popular imagination today, it is deep learning, which has attained dominance among the techniques used to solve the most complex AI problems. Topics broadly covered are neural networks, issues in neural networks, deep belief networks, restricted Boltzmann machines, convolutional networks, long short-term memory units, denoising autoencoders, recurrent networks, and others. We present a detailed case study showing how to implement deep learning networks, tune the parameters, and perform learning. We use DeepLearning4J with the MNIST image dataset.
Chapter 8, Text Mining and Natural Language Processing, details the techniques, algorithms, and tools for performing various analyses in the field of text mining. Topics broadly covered are areas of text mining, components needed for text mining, representation of text data, dimensionality reduction techniques, topic modeling, text clustering, named entity recognition, and deep learning. The case study uses real-world unstructured text data (the Reuters-21578 dataset) highlighting topic modeling and text classification; the tools used are MALLET and KNIME.
Chapter 9, Big Data Machine Learning – the Final Frontier, discusses some of the most important challenges of today. What learning options are available when data is either big or available at a very high velocity? How is scalability handled? Topics covered are big data cluster deployment frameworks, big data storage options, batch data processing, batch data machine learning, real-time machine learning frameworks, and real-time stream learning. In the detailed case study for both big data batch and real-time learning, we select the UCI Covertype dataset and the machine learning libraries H2O, Spark MLlib, and SAMOA.
Appendix A, Linear Algebra, covers concepts from linear algebra, and is meant as a brief refresher. It is by no means complete in its coverage, but contains a whirlwind tour of some important concepts relevant to the machine learning techniques featured in the book. It includes vectors, matrices and basic matrix operations and properties, linear transformations, matrix inverse, eigen decomposition, positive definite matrix, and singular value decomposition.
Appendix B, Probability, provides a brief primer on probability. It includes the axioms of probability, Bayes' theorem, density estimation, mean, variance, standard deviation, Gaussian standard deviation, covariance, correlation coefficient, binomial distribution, Poisson distribution, Gaussian distribution, central limit theorem, and error propagation.
What you need for this book
This book assumes you have some experience of programming in Java and a basic understanding of machine learning concepts. If that doesn't apply to you, but you are curious nonetheless and self-motivated, fret not, and read on! For those who do have some background, it means that you are familiar with simple statistical analysis of data and concepts involved in supervised and unsupervised learning. Those who may not have the requisite math or must poke the far reaches of their memory to shake loose the odd formula or funny symbol, do not be disheartened. If you are the sort that loves a challenge, the short primer in the appendices may be all you need to kick-start your engines; a bit of tenacity will see you through the rest! For those who have never been introduced to machine learning, the first chapter was written as much for you as for those needing a refresher. It is your starter-kit to jump in feet first and find out what it's all about. You can augment your basics with any number of online resources. Finally, for those innocent of Java, here's a secret: many of the tools featured in the book have powerful GUIs. Some include wizard-like interfaces, making them quite easy to use, and do not require any knowledge of Java. So if you are new to Java, just skip the examples that need coding and learn to use the GUI-based tools instead!
Who this book is for
The primary audience of this book is professionals who work with data and whose responsibilities may include data analysis, data visualization or transformation, and the training, validation, testing, and evaluation of machine learning models, presumably to perform predictive, descriptive, or prescriptive analytics using Java or Java-based tools. The choice of Java may imply a personal preference and therefore some prior experience programming in Java. On the other hand, perhaps circumstances in the work environment or company policies limit the use of third-party tools to only those written in Java and a few others. In the second case, the prospective reader may have no programming experience in Java. This book is aimed at this reader just as squarely as it is at their colleague, the Java expert (who came up with the policy in the first place).
A secondary audience can be defined by a profile with two attributes alone: an intellectual curiosity about machine learning and the desire for a single comprehensive treatment of the concepts, the practical techniques, and the tools. A specimen of this type of reader can opt to skip the math and the tools and focus on learning the most common supervised and unsupervised learning algorithms alone. Another might skim over Chapters 1, 2, 3, and 7, skip the others entirely, and jump headlong into the tools, a perfectly reasonable strategy if you want to quickly make yourself useful analyzing that dataset the client said would be here any day now. Importantly, too, with some practice reproducing the experiments from the book, it'll get you asking the right questions of the gurus! Alternatively, you might want to use this book as a reference to quickly look up the details of the algorithm for affinity propagation (Chapter 3, Unsupervised Machine Learning Techniques), or remind yourself of an LSTM architecture with a brief review of the schematic (Chapter 7, Deep Learning), or dog-ear the page with the list of pros and cons of distance-based clustering methods for outlier detection in stream-based learning (Chapter 5, Real-Time Stream Machine Learning). All specimens are welcome and each will find plenty to sink their teeth into.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The algorithm calls the eliminate function in a loop, as shown here."
A block of code is set as follows:
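import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
// Load the dataset; trainingFile is assumed to hold the path to the data file
DataSource source = new DataSource(trainingFile);
Instances data = source.getDataSet();
// If no class attribute is set, default to the last attribute
if (data.classIndex() == -1)
    data.setClassIndex(data.numAttributes() - 1);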
Any command-line input or output is written as follows:
Correctly Classified Instances 53 77.9412 %
Incorrectly Classified Instances 15 22.0588 %
New terms and important words are shown in bold.