Table of Contents

Mastering Java Machine Learning

Credits

Foreword

About the Authors

About the Reviewers

What this book covers

What you need for this book

Who this book is for

1 Machine Learning Review

Machine learning – history and definition

What is not machine learning?

Machine learning – concepts and terminology

Machine learning – types and subtypes

Datasets used in machine learning

Machine learning applications

Practical issues in machine learning

Machine learning – roles and process

Data quality analysis

Descriptive data analysis

Basic label analysis

Basic feature analysis

Visualization analysis

Univariate feature analysis

Categorical features

Continuous features

Multivariate feature analysis

Data transformation and preprocessing

Training, validation, and test set

Feature relevance analysis and dimensionality reduction

Feature search techniques

Feature evaluation techniques

Filter approach

Univariate feature selection

Information theoretic approach

Statistical approach

Multivariate feature selection

Minimal redundancy maximal relevance (mRMR)

Correlation-based feature selection (CFS)

Algorithm input and output

How does it work?

Advantages and limitations

Naïve Bayes

Algorithm input and output

How does it work?

Advantages and limitations

Logistic Regression

Algorithm input and output

How does it work?

Advantages and limitations

Non-linear models

Decision Trees

Algorithm inputs and outputs

How does it work?

Advantages and limitations

K-Nearest Neighbors (KNN)

Algorithm inputs and outputs

How does it work?

Advantages and limitations

Support vector machines (SVM)

Algorithm inputs and outputs

How does it work?

Advantages and limitations

Ensemble learning and meta learners

Bootstrap aggregating or bagging

Algorithm inputs and outputs

How does it work?

Random Forest

Advantages and limitations

Boosting

Algorithm inputs and outputs

How does it work?

Advantages and limitations

Model assessment, evaluation, and comparisons

Model assessment

Model evaluation metrics

Confusion matrix and related metrics

ROC and PRC curves

Gain charts and lift curves

Model comparisons

Comparing two algorithms

McNemar's Test

Paired-t test

Wilcoxon signed-rank test

Comparing multiple algorithms

Sample end-to-end process in Java

Weka experimenter and model selection

RapidMiner experiments

Visualization analysis

Feature selection

Model process flow

Model evaluation metrics

Evaluation on Confusion Metrics

ROC Curves, Lift Curves, and Gain Charts

Results, observations, and analysis

Summary

References

3 Unsupervised Machine Learning Techniques

Issues in common with supervised learning

Issues specific to unsupervised learning

Feature analysis and dimensionality reduction

Notation

Linear methods

Principal component analysis (PCA)

Inputs and outputs

How does it work?

Advantages and limitations

Random projections (RP)

Inputs and outputs

How does it work?

Advantages and limitations

Multidimensional Scaling (MDS)

Inputs and outputs

How does it work?

Advantages and limitations

Nonlinear methods

Kernel Principal Component Analysis (KPCA)

Inputs and outputs

How does it work?

Advantages and limitations

Manifold learning

Inputs and outputs

How does it work?

Advantages and limitations

Clustering

Clustering algorithms

k-Means

Inputs and outputs

How does it work?

Advantages and limitations

DBSCAN

Inputs and outputs

How does it work?

Advantages and limitations

Mean shift

Inputs and outputs

How does it work?

Advantages and limitations

Expectation maximization (EM) or Gaussian mixture modeling (GMM)

Input and output

How does it work?

Advantages and limitations

Hierarchical clustering

Input and output

How does it work?

Advantages and limitations

Self-organizing maps (SOM)

Inputs and outputs

How does it work?

Advantages and limitations

Spectral clustering

Inputs and outputs

How does it work?

Advantages and limitations

Affinity propagation

Inputs and outputs

How does it work?

Advantages and limitations

Clustering validation and evaluation

Internal evaluation measures

Rand index

F-Measure

Normalized mutual information index

Outlier or anomaly detection

Outlier algorithms

Inputs and outputs

How does it work?

Advantages and limitations

Distance-based methods

Inputs and outputs

How does it work?

Advantages and limitations

Density-based methods

Inputs and outputs

How does it work?

Advantages and limitations

Clustering-based methods

Inputs and outputs

How does it work?

Advantages and limitations

High-dimensional-based methods

Inputs and outputs

How does it work?

Advantages and limitations

One-class SVM

Inputs and outputs

How does it work?

Advantages and limitations

Outlier evaluation techniques

Supervised evaluation

Unsupervised evaluation

Real-world case study

Tools and software

Business problem

Machine learning mapping

Data collection

Data quality analysis

Data sampling and transformation

Feature analysis and dimensionality reduction

Observations and clustering analysis

Outlier models, results, and evaluation

Observations and analysis

How does it work?

Advantages and limitations

Transductive SVM (TSVM)

Inputs and outputs

How does it work?

Advantages and limitations

Case study in semi-supervised learning

Tools and software

Datasets and analysis

Feature analysis results

Experiments and results

Analysis of semi-supervised learning

Active learning

Representation and notation

Active learning scenarios

Active learning approaches

Uncertainty sampling

How does it work?

Least confident sampling

Smallest margin sampling

Label entropy sampling

Advantages and limitations

Version space sampling

Query by disagreement (QBD)

How does it work?

Query by Committee (QBC)

How does it work?

Advantages and limitations

Data distribution sampling

How does it work?

Expected model change

Expected error reduction

Variance reduction

Density weighted methods

Advantages and limitations

Case study in active learning

Tools and software

Business problem

Machine learning mapping

Data Collection

Data sampling and transformation

Feature analysis and dimensionality reduction

Models, results, and evaluation

5 Real-Time Stream Machine Learning

Assumptions and mathematical notations

Basic stream processing and computational techniques

Stream computations

Drift Detection Method or DDM

Early Drift Detection Method or EDDM

Monitoring distribution changes

Welch's t test

Kolmogorov-Smirnov's test

CUSUM and Page-Hinckley test

Adaptation methods

Online linear models with loss functions

Inputs and outputs

How does it work?

Advantages and limitations

Online Naïve Bayes

Inputs and outputs

How does it work?

Advantages and limitations

Non-linear algorithms

Hoeffding trees or very fast decision trees (VFDT)

Inputs and outputs

How does it work?

Advantages and limitations

Ensemble algorithms

Weighted majority algorithm

Inputs and outputs

How does it work?

Advantages and limitations

Online Bagging algorithm

Inputs and outputs

How does it work?

Advantages and limitations

Online Boosting algorithm

Inputs and outputs

How does it work?

Advantages and limitations

Validation, evaluation, and comparisons in online setting

Model validation techniques

Prequential evaluation

Holdout evaluation

Controlled permutations

Evaluation criteria

Comparing algorithms and metrics

Incremental unsupervised learning using clustering

Modeling techniques

Inputs and outputs

How does it work?

Advantages and limitations

Inputs and outputs

How does it work?

Advantages and limitations

Density based

Inputs and outputs

How does it work?

Advantages and limitations

Grid based

Inputs and outputs

How does it work?

Advantages and limitations

Validation and evaluation techniques

Key issues in stream cluster evaluation

Evaluation measures

Cluster Mapping Measures (CMM)

V-Measure

Other external measures

Unsupervised learning using outlier detection

Partition-based clustering for outlier detection

Inputs and outputs

How does it work?

Advantages and limitations

Distance-based clustering for outlier detection

Inputs and outputs

How does it work?

Exact Storm

Abstract-C

Direct Update of Events (DUE)

Micro Clustering based Algorithm (MCOD)

Approx Storm

Advantages and limitations

Validation and evaluation techniques

Case study in stream learning

Tools and software

Business problem

Machine learning mapping

Data collection

Data sampling and transformation

Feature analysis and dimensionality reduction

Models, results, and evaluation

Supervised learning experiments

Concept drift experiments

Clustering experiments

Outlier detection experiments

Analysis of stream learning results

Chain rule and Bayes' theorem

Random variables, joint, and marginal distributions

Marginal independence and conditional independence

Factors

Factor types

Distribution queries

Probabilistic queries

MAP queries and marginal MAP queries

Graph concepts

Graph structure and properties

Subgraphs and cliques

Path, trail, and cycles

Combined reasoning

Independencies, flow of influence, D-Separation, I-Map

Flow of influence

I-Map

Inference

Elimination-based inference

Variable elimination algorithm

Input and output

How does it work?

Advantages and limitations

Clique tree or junction tree algorithm

Input and output

How does it work?

Advantages and limitations

Propagation-based techniques

Belief propagation

Factor graph

Messaging in factor graph

Input and output

How does it work?

Advantages and limitations

Sampling-based techniques

Forward sampling with rejection

Input and output

How does it work?

Advantages and limitations

Learning

Learning parameters

Maximum likelihood estimation for Bayesian networks

Bayesian parameter estimation for Bayesian network

Prior and posterior using the Dirichlet distribution

Learning structures

Measures to evaluate structures

Methods for learning structures

Constraint-based techniques

Inputs and outputs

How does it work?

Advantages and limitations

Search and score-based techniques

Inputs and outputs

How does it work?

Advantages and limitations

Markov networks and conditional random fields

Representation

Parameterization

Gibbs parameterization

Factor graphs

Log-linear models

Independencies

Global

Pairwise Markov

Markov blanket

Inference

Learning

Conditional random fields

Specialized networks

Tree augmented network

Input and output

How does it work?

Advantages and limitations

Markov chains

Hidden Markov models

Most probable path in HMM

Machine learning mapping

Data sampling and transformation

Multi-layer feed-forward neural network

Inputs, neurons, activation function, and mathematical notation

Multi-layered neural network

Structure and mathematical notations

Activation functions in NN

Sigmoid function

Hyperbolic tangent ("tanh") function

Training neural network

Empirical risk minimization

Parameter initialization

Loss function

Gradients

Gradient at the output layer

Gradient at the Hidden Layer

Parameter gradient

Feed forward and backpropagation

How does it work?

Regularization

L2 regularization

L1 regularization

Limitations of neural networks

Vanishing gradients, local optimum, and slow training

Deep learning

Building blocks for deep learning

Rectified linear activation function

Restricted Boltzmann Machines

Definition and mathematical notation

Input and outputs

How does it work?

Deep Autoencoders

Deep Belief Networks

Inputs and outputs

How does it work?

Deep learning with dropouts

Definition and mathematical notation

Inputs and outputs

How does it work?

Learning Training and testing with dropouts

Sparse coding

Convolutional Neural Network

Local connectivity

Parameter sharing

Discrete convolution

Pooling or subsampling

Normalization using ReLU

CNN Layers

Recurrent Neural Networks

Structure of Recurrent Neural Networks

Learning and associated problems in RNNs

Long Short Term Memory

Gated Recurrent Units

Case study

Tools and software

Business problem

Machine learning mapping

Data sampling and transformation

Feature analysis

Models, results, and evaluation

Basic data handling

Parameter search using Arbiter

Results and analysis

Summary

References

8 Text Mining and Natural Language Processing

NLP, subfields, and tasks

Machine translation

Semantic reasoning and inferencing

Text summarization

Automating question and answers

Issues with mining unstructured data

Text processing components and transformations

Document collection and standardization

Inputs and outputs

How does it work?

Tokenization

Inputs and outputs

How does it work?

Stop words removal

Inputs and outputs

How does it work?

Stemming or lemmatization

Inputs and outputs

How does it work?

Local/global dictionary or vocabulary?

Feature representation and similarity

Vector space model

Binary

Term frequency (TF)

Inverse document frequency (IDF)

Term frequency-inverse document frequency (TF-IDF)

Similarity measures

How does it work?

Advantages and limitations

Text clustering

Feature transformation, selection, and reduction

Clustering techniques

Generative probabilistic models

Input and output

How does it work?

Advantages and limitations

Distance-based text clustering

Non-negative matrix factorization (NMF)

Input and output

How does it work?

Advantages and limitations

Evaluation of text clustering

Named entity recognition

Hidden Markov models for NER

Input and output

How does it work?

Advantages and limitations

Maximum entropy Markov models for NER

Input and output

How does it work?

Advantages and limitations

Deep learning and NLP

Tools and usage

Data sampling and transformation

Feature analysis and dimensionality reduction

Models, results, and evaluation

Analysis of text processing results

References

9 Big Data Machine Learning – The Final Frontier

What are the characteristics of Big Data?

Big Data Machine Learning

General Big Data framework

Big Data cluster deployment frameworks

Hortonworks Data Platform

Cloudera CDH

Amazon Elastic MapReduce

Microsoft Azure HDInsight

Data acquisition

Publish-subscribe frameworks

Source-sink frameworks

SQL frameworks

Message queueing frameworks

Custom frameworks

Data storage

HDFS

NoSQL

Key-value databases

Document databases

Columnar databases

Graph databases

Data processing and preparation

Hive and HQL

Spark SQL

Amazon Redshift

Real-time stream processing

Machine Learning

Visualization and analysis

Batch Big Data Machine Learning

H2O as Big Data Machine Learning platform

H2O architecture

Machine learning in H2O

Tools and usage

Case study

Business problem

Machine Learning mapping

Data collection

Data sampling and transformation

Experiments, results, and analysis

Feature relevance and analysis

Evaluation on test data

Analysis of results

Spark MLlib as Big Data Machine Learning platform

Spark architecture

Machine Learning in MLlib

Tools and usage

Experiments, results, and analysis

k-Means

k-Means with PCA

Bisecting k-Means (with PCA)

Gaussian Mixture Model

Random Forest

Analysis of results

Real-time Big Data Machine Learning

SAMOA as a real-time Big Data Machine Learning framework

SAMOA architecture

Machine Learning algorithms

Tools and usage

Experiments, results, and analysis

Analysis of results

The future of Machine Learning

Eigendecomposition

Positive definite matrix

Singular value decomposition (SVD)

Mastering Java Machine Learning

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: June 2017

Dr. Uday Kamath is a volcano of ideas. Every time he walked into my office, we had fruitful and animated discussions. I have been a professor of computer science at George Mason University (GMU) for 15 years, specializing in machine learning and data mining. I have known Uday for five years, first as a student in my data mining class, then as a colleague and co-author of papers and projects on large-scale machine learning. While a chief data scientist at BAE Systems Applied Intelligence, Uday earned his PhD in evolutionary computation and machine learning. As if having two high-demand jobs was not enough, Uday was unusually prolific, publishing extensively with four different people in the computer science faculty during his tenure at GMU, something you don't see very often. Given this pedigree, I am not surprised that less than four years since Uday's graduation with a PhD, I am writing the foreword for his book on mastering advanced machine learning techniques with Java. Uday's thirst for new stimulating challenges has struck again, resulting in this terrific book you now have in your hands.

This book is the product of his deep interest and knowledge in sound and well-grounded theory, and at the same time his keen grasp of the practical feasibility of proposed methodologies. Several books on machine learning and data analytics exist, but Uday's book closes a substantial gap: the one between theory and practice. It offers a comprehensive and systematic analysis of classic and advanced learning techniques, with a focus on their advantages and limitations, practical use and implementations. This book is a precious resource for practitioners of data science and analytics, as well as for undergraduate and graduate students keen to master practical and efficient implementations of machine learning techniques.

The book covers the classic techniques of machine learning, such as classification, clustering, dimensionality reduction, anomaly detection, semi-supervised learning, and active learning. It also covers advanced and recent topics, including learning with stream data, deep learning, and the challenges of learning with big data. Each chapter is dedicated to a topic and includes an illustrative case study, which covers state-of-the-art Java-based tools and software, and the entire knowledge discovery cycle: data collection, experimental design, modeling, results, and evaluation. Each chapter is self-contained, providing great flexibility of usage. The accompanying website provides the source code and data. This is truly a gem for both students and data analytics practitioners, who can experiment first-hand with the methods just learned or deepen their understanding of the methods by applying them to real-world scenarios.

As I was reading the various chapters of the book, I was reminded of the enthusiasm Uday has for learning and knowledge. He communicates the concepts described in the book with clarity and with the same passion. I am positive that you, as a reader, will feel the same. I will certainly keep this book as a personal resource for the courses I teach, and strongly recommend it to my students.

Dr. Carlotta Domeniconi

Associate Professor of Computer Science, George Mason University

About the Authors

Dr. Uday Kamath is the chief data scientist at BAE Systems Applied Intelligence. He specializes in scalable machine learning and has spent 20 years in the domain of AML, fraud detection in financial crime, cyber security, and bioinformatics, to name a few. Dr. Kamath is responsible for key products in areas focusing on the behavioral, social networking and big data machine learning aspects of analytics at BAE AI. He received his PhD at George Mason University, under the able guidance of Dr. Kenneth De Jong, where his dissertation research focused on machine learning for big data and automated sequence mining.

I would like to thank my friend, Krishna Choppella, for accepting the offer to co-author this book and being an able partner on this long but satisfying journey. Heartfelt thanks to our reviewers, especially Dr. Samir Sahli for his valuable comments, suggestions, and in-depth review of the chapters. I would like to thank Professor Carlotta Domeniconi for her suggestions and comments that helped us shape various chapters in the book. I would also like to thank all the Packt staff, especially Divya Poojari, Mayur Pawanikar, and Vivek Arora, for helping us complete the tasks in time. This book required making a lot of sacrifices on the personal front and I would like to thank my wife, Pratibha, and our nanny, Evelyn, for their unconditional support. Finally, thanks to all my lovely teachers and professors for not only teaching the subjects, but also instilling the joy of learning.

Krishna Choppella builds tools and client solutions in his role as a solutions architect for analytics at BAE Systems Applied Intelligence. He has been programming in Java for 20 years. His interests are data science, functional programming, and distributed computing.

About the Reviewers

Samir Sahli was awarded a BSc degree in applied mathematics and information sciences from the University of Nice Sophia-Antipolis, France, in 2004. He received MSc and PhD degrees in physics (specializing in optics/photonics/image science) from University Laval, Quebec, Canada, in 2008 and 2013, respectively. During his graduate studies, he worked with Defence Research and Development Canada (DRDC) on the automatic detection and recognition of targets in aerial imagery, especially in the context of uncontrolled environment and sub-optimal acquisition conditions. He has worked since 2009 as a consultant for several companies based in Europe and North America specializing in the area of Intelligence, Surveillance, and Reconnaissance (ISR) and in remote sensing.

Dr. Sahli joined McMaster Biophotonics in 2013 as a postdoctoral fellow. His research was in the field of optics, image processing, and machine learning. He was involved in several projects, such as the development of a novel generation of gastrointestinal tract imaging device, hyperspectral imaging of skin erythema for individualized radiotherapy treatment, and automatic detection of the precancerous Barrett's esophageal cell using fluorescence lifetime imaging microscopy and multiphoton microscopy.

Dr. Sahli joined BAE Systems Applied Intelligence in 2015. He has since worked as a data scientist to develop analytics models to detect complex fraud patterns and money laundering schemes for insurance, banking, and governmental clients using machine learning, statistics, and social network analysis tools.

Prashant Verma started his IT career in 2011 as a Java developer in Ericsson, working in the telecom domain. After a couple of years of Java EE experience, he moved into the big data domain and has worked on almost all of the popular big data technologies such as Hadoop, Spark, Flume, Mongo, Cassandra, and so on. He has also played with Scala. Currently, he works with QA Infotech as a lead data engineer, working on solving e-learning problems with analytics and machine learning.

Prashant has worked for many companies, such as Ericsson and QA Infotech, with domain knowledge of telecom and e-learning. He has also worked as a freelance consultant in his free time.

I want to thank Packt Publishing for giving me the chance to review the book, as well as my employer and my family for their patience while I was busy working on this book.

Dedicated to my parents, Krishna Kamath and Bharathi Kamath, my wife, Pratibha Shenoy, and the kids, Aaroh and Brandy.

Dr. Uday Kamath

To my parents

Krishna Choppella

There are many notable books on machine learning, from pedagogical tracts on the theory of learning from data; to standard references on specializations in the field, such as clustering and outlier detection or probabilistic graph modeling; to cookbooks that offer practical advice on the use of tools and libraries in a particular language. The books that tend to be broad in coverage are often short on theoretical detail, while those with a focus on one topic or tool may not, for example, have much to say about the difference in approach in a streaming as opposed to a batch environment. Besides, for non-novices with a preference for tools in Java who wish to reach for a single volume that will extend their knowledge on all the essential aspects at once, there are precious few options. Finding in one place

The pros and cons of different techniques given any data availability scenario: when data is labeled or unlabeled, streaming or batch, local or distributed, structured or unstructured

A ready reference for the most important mathematical results related to those very techniques for a better appreciation of the underlying theory

An introduction to the most mature Java-based frameworks, libraries, and visualization tools with descriptions and illustrations on how to put these techniques into practice

is not possible today, as far as we know.

The core idea of this book, therefore, is to address this gap while maintaining a balance between treatment of theory and practice with the aid of probability, statistics, basic linear algebra, and rudimentary calculus in the service of one, and emphasizing methodology, case studies, tools and code in support of the other.

According to the KDnuggets 2016 software poll, Java, at 16.8%, has the second highest share in popularity among languages used in machine learning, after Python. What's more, this marks a 19% increase from the year before! Clearly, Java remains an important and effective vehicle to build and deploy systems involving machine learning, despite claims of its decline in some quarters. With this book, we aim to reach professionals and motivated enthusiasts with some experience in Java and a beginner's knowledge of machine learning. Our goal is to make Mastering Java Machine Learning the next step on their path to becoming advanced practitioners in data science. To guide them on this path, the book covers a veritable arsenal of techniques in machine learning, some of which they may already be familiar with, others perhaps not as much, or only superficially. These include methods of data analysis, learning algorithms, evaluation of model performance, and more in supervised and unsupervised learning, clustering and anomaly detection, and semi-supervised and active learning. It also presents special topics such as probabilistic graph modeling, text mining, and deep learning. Not forgetting the increasingly important topics in enterprise-scale systems today, the book also covers the unique challenges of learning from evolving data streams and the tools and techniques applicable to real-time systems, as well as the imperatives of the world of Big Data:

How does machine learning work in large-scale distributed environments?

What are the trade-offs?

How must algorithms be adapted?

How can these systems interoperate with other technologies in the dominant Hadoop ecosystem?

This book explains how to apply machine learning to real-world data and real-world domains with the right methodology, processes, applications, and analysis. Accompanying each chapter are case studies and examples of how to apply the newly learned techniques using some of the best available open source tools written in Java. This book covers more than 15 open source Java tools supporting a wide range of techniques between them, with code and practical usage. The code, data, and configurations are available for readers to download and experiment with. We present more than ten real-world case studies in Machine Learning that illustrate the data scientist's process. Each case study details the steps undertaken in the experiments: data ingestion, data analysis, data cleansing, feature reduction/selection, mapping to machine learning, model training, model selection, model evaluation, and analysis of results. This gives the reader a practical guide to using the tools and methods presented in each chapter for solving the business problem at hand.

What this book covers

Chapter 1, Machine Learning Review, is a refresher of basic concepts and techniques that the reader would have learned from Packt's Learning Machine Learning in Java or a similar text. This chapter is a review of concepts such as data, data transformation, sampling and bias, features and their importance, supervised learning, unsupervised learning, big data learning, stream and real-time learning, probabilistic graph models, and semi-supervised learning.

Chapter 2, Practical Approach to Real-World Supervised Learning, cobwebs dusted, dives straight into the vast field of supervised learning and the full spectrum of associated techniques. We cover the topics of feature selection and reduction, linear modeling, logistic models, non-linear models, SVM and kernels, ensemble learning techniques such as bagging and boosting, validation techniques and evaluation metrics, and model selection. Using WEKA and RapidMiner, we carry out a detailed case study, going through all the steps from data analysis to analysis of model performance. As in each of the other chapters, the case study is presented as an example to help the reader understand how the techniques introduced in the chapter are applied in real life. The dataset used in the case study is UCI HorseColic.

Chapter 3, Unsupervised Machine Learning Techniques, presents many advanced methods in clustering and outlier techniques, with applications. Topics covered are feature selection and reduction in unsupervised data, clustering algorithms, evaluation methods in clustering, and anomaly detection using statistical, distance, and distribution techniques. At the end of the chapter, we perform a case study for both clustering and outlier detection using a real-world image dataset, MNIST. We use the Smile API to do feature reduction and ELKI for learning.

Chapter 4, Semi-supervised Learning and Active Learning, gives details of algorithms and techniques for learning when only a small amount of labeled data is present. Topics covered are self-training, generative models, transductive SVMs, co-training, active learning, and multi-view learning. The case study involves both learning systems and is performed on the real-world UCI Breast Cancer Wisconsin dataset. The tools introduced are JKernelMachines, KEEL, and JCLAL.

Chapter 5, Real-Time Stream Machine Learning, starts from the observation that real-time data streams present unique circumstances for the problem of learning from data. This chapter broadly covers the need for stream machine learning and its applications, supervised stream learning, unsupervised cluster stream learning, unsupervised outlier learning, evaluation techniques in stream learning, and metrics used for evaluation. A detailed case study is given at the end of the chapter to illustrate the use of the MOA framework. The dataset used is Electricity (ELEC).


Chapter 6, Probabilistic Graph Modeling, shows that many real-world problems can be effectively represented by encoding complex joint probability distributions over multi-dimensional spaces. Probabilistic graph models provide a framework to represent, draw inferences, and learn effectively in such situations. The chapter broadly covers probability concepts, PGMs, Bayesian networks, Markov networks, Graph Structure Learning, Hidden Markov Models, and Inferencing. A detailed case study on a real-world dataset is performed at the end of the chapter. The tools used in this case study are OpenMarkov and WEKA's Bayes network. The dataset is UCI Adult (Census Income).

Chapter 7, Deep Learning: if there is one superstar of machine learning in the popular imagination today, it is deep learning, which has attained dominance among the techniques used to solve the most complex AI problems. Topics broadly covered are neural networks, issues in neural networks, deep belief networks, restricted Boltzmann machines, convolutional networks, long short-term memory units, denoising autoencoders, recurrent networks, and others. We present a detailed case study showing how to implement deep learning networks, tune the parameters, and perform learning. We use DeepLearning4J with the MNIST image dataset.

Chapter 8, Text Mining and Natural Language Processing, details the techniques, algorithms, and tools for performing various analyses in the field of text mining. Topics broadly covered are areas of text mining, components needed for text mining, representation of text data, dimensionality reduction techniques, topic modeling, text clustering, named entity recognition, and deep learning. The case study uses real-world unstructured text data (the Reuters-21578 dataset) highlighting topic modeling and text classification; the tools used are MALLET and KNIME.

Chapter 9, Big Data Machine Learning – the Final Frontier, discusses some of the most important challenges of today. What learning options are available when data is either big or available at a very high velocity? How is scalability handled? Topics covered are big data cluster deployment frameworks, big data storage options, batch data processing, batch data machine learning, real-time machine learning frameworks, and real-time stream learning. In the detailed case study for both big data batch and real-time learning, we select the UCI Covertype dataset and the machine learning libraries H2O, Spark MLlib, and SAMOA.

Appendix A, Linear Algebra, covers concepts from linear algebra, and is meant as a brief refresher. It is by no means complete in its coverage, but contains a whirlwind tour of some important concepts relevant to the machine learning techniques featured in the book. It includes vectors, matrices and basic matrix operations and properties, linear transformations, matrix inverse, eigen decomposition, positive definite matrix, and singular value decomposition.

Appendix B, Probability, provides a brief primer on probability. It includes the axioms of probability, Bayes' theorem, density estimation, mean, variance, standard deviation, Gaussian standard deviation, covariance, correlation coefficient, binomial distribution, Poisson distribution, Gaussian distribution, central limit theorem, and error propagation.


What you need for this book

This book assumes you have some experience of programming in Java and a basic understanding of machine learning concepts. If that doesn't apply to you, but you are curious nonetheless and self-motivated, fret not, and read on! For those who do have some background, it means that you are familiar with simple statistical analysis of data and the concepts involved in supervised and unsupervised learning. Those who may not have the requisite math, or must poke the far reaches of their memory to shake loose the odd formula or funny symbol, do not be disheartened. If you are the sort that loves a challenge, the short primer in the appendices may be all you need to kick-start your engines, and a bit of tenacity will see you through the rest! For those who have never been introduced to machine learning, the first chapter was written as much for you as for those needing a refresher; it is your starter kit to jump in feet first and find out what it's all about. You can augment your basics with any number of online resources. Finally, for those innocent of Java, here's a secret: many of the tools featured in the book have powerful GUIs. Some include wizard-like interfaces, making them quite easy to use, and do not require any knowledge of Java. So if you are new to Java, just skip the examples that need coding and learn to use the GUI-based tools instead!


Who this book is for

The primary audience of this book is professionals who work with data and whose responsibilities may include data analysis, data visualization or transformation, and the training, validation, testing, and evaluation of machine learning models, presumably to perform predictive, descriptive, or prescriptive analytics using Java or Java-based tools. The choice of Java may imply a personal preference and therefore some prior experience programming in Java. On the other hand, perhaps circumstances in the work environment or company policies limit the use of third-party tools to only those written in Java and a few others. In the second case, the prospective reader may have no programming experience in Java. This book is aimed at this reader just as squarely as it is at their colleague, the Java expert (who came up with the policy in the first place).

A secondary audience can be defined by a profile with two attributes alone: an intellectual curiosity about machine learning and the desire for a single comprehensive treatment of the concepts, the practical techniques, and the tools. A specimen of this type of reader can opt to skip the math and the tools and focus on learning the most common supervised and unsupervised learning algorithms alone. Another might skim over Chapters 1, 2, 3, and 7, skip the others entirely, and jump headlong into the tools, a perfectly reasonable strategy if you want to quickly make yourself useful analyzing that dataset the client said would be here any day now. Importantly, too, with some practice reproducing the experiments from the book, it'll get you asking the right questions of the gurus! Alternatively, you might want to use this book as a reference to quickly look up the details of the algorithm for affinity propagation (Chapter 3, Unsupervised Machine Learning Techniques), or remind yourself of an LSTM architecture with a brief review of the schematic (Chapter 7, Deep Learning), or dog-ear the page with the list of pros and cons of distance-based clustering methods for outlier detection in stream-based learning (Chapter 5, Real-Time Stream Machine Learning). All specimens are welcome and each will find plenty to sink their teeth into.


In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The algorithm calls the eliminate function in a loop, as shown here."

A block of code is set as follows:

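import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Load the training data; if no class attribute is set, use the last attribute.
DataSource source = new DataSource(trainingFile);
Instances data = source.getDataSet();
if (data.classIndex() == -1)
    data.setClassIndex(data.numAttributes() - 1);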

Any command-line input or output is written as follows:

Correctly Classified Instances      53      77.9412 %
Incorrectly Classified Instances    15      22.0588 %
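
The percentages in the sample output above follow directly from the two counts: 53 + 15 = 68 instances in total, so 53/68 and 15/68 give the two rates. The following is a minimal sketch of that arithmetic in plain Java. The class name `AccuracySummary` and the method `percent` are hypothetical names chosen for illustration; they are not part of WEKA or of the book's code, and the column spacing is only an approximation of the tool's layout.

```java
import java.util.Locale;

public class AccuracySummary {

    /** Returns count/total as a percentage string with four decimal places. */
    public static String percent(int count, int total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        // Locale.ROOT keeps the decimal separator a dot on every machine.
        return String.format(Locale.ROOT, "%.4f", 100.0 * count / total);
    }

    public static void main(String[] args) {
        int correct = 53;                  // count taken from the sample output
        int incorrect = 15;                // count taken from the sample output
        int total = correct + incorrect;   // 68 instances overall

        System.out.println("Correctly Classified Instances      "
                + correct + "      " + percent(correct, total) + " %");
        System.out.println("Incorrectly Classified Instances    "
                + incorrect + "      " + percent(incorrect, total) + " %");
    }
}
```

Running `main` reproduces the two percentages shown in the sample output, which is a quick way to sanity-check any evaluation summary against its raw counts.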

New terms and important words are shown in bold.

Posted: 02/03/2019, 10:43

