Lecture Business management information system - Lecture 26: Data mining. In this chapter, the following content will be discussed: What is data mining? Why data mining? What applications? What techniques? What process? What software?
Data Mining
Lecture 26
Today’s Lecture
What is data mining?
Why data mining?
What applications?
What techniques?
What process?
What software?
Data mining may be defined as follows:
Data mining is a collection of techniques for the efficient, automated discovery of previously unknown, valid, novel, useful and understandable patterns in large databases. The patterns must be actionable so that they can be used in an enterprise’s decision making.
What is Data Mining?
Efficient automated discovery of previously unknown
patterns in large volumes of data
Patterns must be valid, novel, useful and understandable
Businesses are mostly interested in discovering past
patterns to predict future behaviour
A data warehouse, as discussed earlier, is an enterprise’s memory. Data mining can provide intelligence using that memory.
Amazon.com uses associations: recommendations to customers are based on past purchases and on what other customers are purchasing.
“Just for Feet”, a retailer in the USA, has about 200 stores, each carrying up to 6000 shoe styles, each style in several sizes. Data mining is used to find the right shoes to stock in the right store.
More examples appear in the case studies to be discussed later.
Data Mining
We assume we are dealing with large volumes of data, perhaps gigabytes or even terabytes.
Although data mining is possible with smaller amounts of data, the bigger the data, the higher the confidence in any previously unknown pattern that is discovered.
There is considerable hype about data mining at present, and Gartner Group has listed data mining as one of the top ten technologies to watch.
Why Data Mining Now?
Growth in the generation and storage of corporate data: the information explosion.
Need for sophisticated decision making: current database systems are Online Transaction Processing (OLTP) systems, and OLTP data is difficult to use for such applications. Why?
Evolution of technology: much cheaper storage, easier data collection, better database management, and better tools for data analysis and understanding.
Information explosion
Database systems have been in use since the 1960s in Western countries (perhaps since the 1980s in India).
These systems have generated mountains of data.
Point-of-sale terminals and bar codes on many products, railway bookings, educational institutions, huge numbers of mobile phones and electronic commerce all generate data.
Government is now collecting a lot of information.
Information explosion
Internet banking via networked computers and ATMs
Credit and debit cards
Medical data, doctors, hospitals
Transportation, Indian railways, automatic toll collection
on toll roads, growing air travel
Passports, NRI visas, Other visas, NRI money
transfers
Question: Can you think of other examples of data collection?
Information explosion
Many adults in India generate:
Mobile phone transactions. More than 300 million phones in India, reportedly growing at the rate of 10,000 new ones every hour! Mobile companies must save information about calls.
Growing middle class with a growing number of credit and debit card transactions. About 25m credit cards and 70m debit cards in 2007. Annual growth rates of about 30% and 40% respectively. Could be 55m credit cards and 200m debit cards in 2010, resulting in perhaps 500m
Information explosion
India has some huge enterprises, for example Indian Railways, perhaps the busiest network in the world, with 2.5m employees, 10,000 locomotives, 10,000 passenger trains daily, 10,000 freight trains daily and 20m passengers daily.
Growing airline traffic with more than ten airlines, perhaps 30m passengers annually.
Growing number of motor vehicles: registration, insurance, driver licences.
Internet surfing records.
As noted earlier, most enterprise database systems were designed in the 1970s or 1980s, mainly to automate office procedures such as order entry, student enrolment, patient registration and airline reservations. These are well-structured, repetitive operations that are easily automated.
Decision Making
Need for business memory and intelligence.
Need to serve customers better by learning from past interactions.
OLTP data is not a good basis for maintaining an enterprise memory.
The intelligence hidden in data could be the secret weapon in a competitive business world, but given the information explosion not even a small fraction of it could be examined by the human eye.
Question: Why is OLTP not good for maintaining an enterprise memory?
OLTP vs Decision Making
The clerical view of data focuses on the details required for the day-to-day running of an enterprise.
The management view of data focuses on summary data to identify trends, challenges and opportunities.
The detailed data view is the operational view, while the management view is the decision-support view.
Comparison of the two views:
Operational vs Management View
Question: What is the cost of a 100GB disk? What is the cost of a PC and what is its CPU performance?
Decline in Hard Drive cost
Growth in Worldwide Disk Capacity
[Chart not reproduced; axis scale 0 to 18000]
Evolution of Technology
Question: What do the graphs in the last two slides tell us? What scales are used in them? What was the pink line in the first graph?
Evolution of Technology
Database technology has improved over the years.
Data collection is often much better and cheaper now.
The need for analysing and synthesising information is growing in today’s fiercely competitive business environment.
Question: Why can OLTP not be used for sales forecasting and analysis?
Why Data Mining Now?
As noted earlier, the reasons may be summarized as:
• Accumulation of large amounts of data
• Increased affordable computing power enabling data mining processing
• Statistical and learning algorithms
• Availability of software
• Strong business competition
Large amount of data
As already discussed, many enterprises have large amounts of data accumulated over 30+ years.
As noted earlier, some enterprises collect information for analysis; for example, supermarkets in the USA offer loyalty cards in exchange for shopper information.
Loyalty cards in Australia also collect information using a reward system.
Growth of cards
A recent survey in the USA found that the percentages of US adults using the following types of cards were:
Affordable computing power
A variety of statistical and learning algorithms from fields such as statistics and artificial intelligence have been adapted for data mining. With the new focus on data mining, new algorithms are being developed.
Availability of Software
A large variety of DM software is now available. Some of the more widely used packages are:
IBM - Intelligent Miner and more
SAS - Enterprise Miner
Silicon Graphics - MineSet
Oracle - Thinking Machines - Darwin
Angoss - knowledgeSEEKER
Strong Business Competition
In finance, telecom, insurance and retail:
Loan/credit card approval
Loan/Credit card approvals
In a modern society, a bank does not know its customers personally. The only knowledge a bank has of them is the information stored in its computers.
Credit agencies and banks collect a lot of behavioural data about customers from many sources. This information is used to predict the chances of a customer paying back a loan.
Market Segmentation
Large amounts of data about customers contain valuable information.
The market may be segmented into many subgroups according to variables that are good discriminators.
It is not always easy to find variables that will help in market segmentation.
Fraud Detection
Very challenging, since it is difficult to define the characteristics of fraud. Detection is often based on detecting changes from the norm.
In statistics it is common to throw out the outliers, but in data mining it may be useful to identify them, since they could be due either to errors or perhaps to fraud.
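The idea of detecting changes from the norm can be sketched in a few lines. The example below is only an illustration, not a method named in the lecture: it flags values whose z-score (distance from the mean in standard deviations) exceeds a threshold. The function name `flag_outliers`, the threshold of 2.0 and the spending figures are all assumptions made for the example.

```python
from statistics import mean, stdev


def flag_outliers(amounts, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(amounts)
    sigma = stdev(amounts)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [x for x in amounts if abs(x - mu) / sigma > threshold]


# Hypothetical daily card spend for one customer (invented data).
daily_spend = [42, 38, 51, 45, 40, 47, 39, 950, 44, 43]
suspicious = flag_outliers(daily_spend)
print(suspicious)
```

A flagged value is only a candidate for review: as the slide notes, an outlier may be a data error rather than fraud.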
Better Marketing
When customers buy new products, other products may be suggested to them when they are ready.
As noted earlier, in mail order marketing, for example, one wants to know:
- will the customer respond?
- will the customer buy, and how much?
- will the customer return the purchase?
- will the customer pay for the purchase?
Better Marketing
It has been reported that some mail order marketing companies hold more than 1000 variable values on each customer.
The aim is to “lift” the response rate.
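Lift is commonly measured as the response rate of the targeted group divided by the baseline response rate of an untargeted mailing. The sketch below assumes that definition; the function names and the campaign figures are invented for illustration.

```python
def response_rate(responders, mailed):
    """Fraction of mailed customers who responded."""
    return responders / mailed


def lift(targeted_rate, baseline_rate):
    """How many times better the targeted mailing performs than the baseline."""
    return targeted_rate / baseline_rate


# Hypothetical campaign figures (invented data).
baseline = response_rate(200, 10_000)  # untargeted mailing: 2% respond
targeted = response_rate(120, 2_000)   # model-selected customers: 6% respond
campaign_lift = lift(targeted, baseline)
print(round(campaign_lift, 2))
```

A lift above 1 means the selected customers respond more often than customers chosen at random.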
Trend analysis
In a large company, not all trends are visible to the management. It is then useful to use data mining software that will identify trends.
Trends may be long-term, cyclic or seasonal.
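One simple way to separate a long-term trend from seasonal variation is a moving average whose window covers one full season. The sketch below is an illustration under that assumption, not a technique prescribed by the lecture; the quarterly sales figures are invented.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical quarterly sales over two years, with a seasonal peak in Q2.
quarterly_sales = [100, 140, 90, 120, 110, 150, 100, 130]

# A four-quarter window averages out the seasonal swing.
trend = moving_average(quarterly_sales, window=4)
print(trend)
```

The raw series rises and falls each year, while the smoothed series increases steadily, which is the long-term trend.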
Market Basket Analysis
Aims to find what customers buy and which items they buy together.
This may be useful in designing store layouts or in deciding which items to put on sale.
Basket analysis can also be used for applications other than analysing which items customers buy together.
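Finding items that are bought together can be sketched with two standard measures: support (the fraction of baskets containing a pair of items) and confidence (of the baskets containing one item, the fraction that also contain another). The code below is a minimal illustration with invented baskets and hypothetical function names, not a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) gives each pair one canonical form per basket
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: count / total for pair, count in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction also containing `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    both = sum(1 for b in with_antecedent if consequent in b)
    return both / len(with_antecedent)


# Hypothetical shopping baskets (invented data).
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
support = pair_support(baskets)
print(support[("bread", "milk")])
print(confidence(baskets, "bread", "milk"))
```

Counting every pair is practical only for small item sets; dedicated algorithms prune the pairs that cannot reach a minimum support.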
makes customers loyal.
Cheaper to develop a retention plan and retain an old customer than to bring in a new customer
holders that don’t use the card, bank customers with very small amounts of money in their accounts
Web site design
A Web site is effective only if the visitors easily find
what they are looking for
Data mining can help discover affinity of visitors to
pages and the site layout may be modified based on this information
Data Mining Process
Successful data mining involves carefully determining the aims and selecting appropriate data.
The following steps should normally be followed:
1. Requirements analysis
2. Data selection and collection
3. Cleaning and preparing data
4. Data mining exploration and validation
5. Implementing, evaluating and monitoring
6. Results visualisation
Requirements Analysis
The enterprise decision makers need to formulate the goals that the data mining process is expected to achieve. The business problem must be clearly defined. One cannot use data mining without a good idea of what kind of outcomes the enterprise is looking for.
If objectives have been clearly defined, it is easier to evaluate the results of the project.
Data Selection and Collection
Find the best source databases for the data that is required. If the enterprise has implemented a data warehouse, then most of the data could be available there. Otherwise the source OLTP systems need to be identified and the required information extracted and stored in some temporary system.
In some cases, only a sample of the available data may be required.
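Taking a sample of the available data can be as simple as a uniform random draw without replacement. The sketch below assumes the records fit in memory as a list; the function name `sample_records`, the fixed seed and the 10% fraction are choices made for the example only.

```python
import random


def sample_records(records, fraction, seed=0):
    """Draw a uniform random sample of the records without replacement."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one record, but never more than are available.
    k = min(len(records), max(1, round(len(records) * fraction)))
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return rng.sample(records, k)


# Hypothetical record identifiers standing in for extracted rows.
records = list(range(1000))
subset = sample_records(records, 0.1)
print(len(subset))
```

A fixed seed lets the same sample be drawn again when results need to be checked.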
Cleaning and Preparing Data
This may not be an onerous task if a data warehouse containing the required data exists, since most of the work should already have been done when the data was loaded into the warehouse.
Otherwise this task can be very resource intensive; perhaps more than 50% of the effort in a data mining project is spent on this step. Essentially, a data store that integrates data from a number of databases may need to be created. When integrating data, one often encounters problems such as identifying data, dealing with missing data, data conflicts and ambiguity. An ETL (extraction, transformation and loading) tool may be used to overcome these problems.
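Two of the integration problems mentioned above, missing data and data conflicts, can be illustrated with a small merge routine. This is a sketch under simple assumptions, not how any particular ETL tool works: both sources describe the same customer, a missing value is represented by `None`, and the first source wins when values disagree. The record contents and the function name are invented.

```python
def merge_records(primary, secondary):
    """Merge two records for the same customer and report conflicting fields."""
    merged = {}
    conflicts = []
    for field in sorted(set(primary) | set(secondary)):
        a = primary.get(field)
        b = secondary.get(field)
        if a is None:
            merged[field] = b  # fill a missing value from the other source
        elif b is None or a == b:
            merged[field] = a
        else:
            merged[field] = a  # keep the primary value, but record the conflict
            conflicts.append(field)
    return merged, conflicts


# Hypothetical records for one customer held in two source systems.
billing = {"id": 17, "name": "A. Kumar", "city": "Pune", "phone": None}
crm = {"id": 17, "name": "A. Kumar", "city": "Mumbai", "phone": "555-0100"}

merged, conflicts = merge_records(billing, crm)
print(merged)
print(conflicts)
```

Recording conflicts instead of silently overwriting them leaves the decision about which source to trust to a person or a documented rule.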
Exploration and Validation
Assuming that the user has access to one or more data mining tools, a data mining model may be constructed based on the enterprise’s needs. It may be possible to take a sample of the data and apply a number of relevant techniques. For each technique, the results should be evaluated and their significance interpreted.
This is likely to be an iterative process, which should lead to the selection of one or more techniques that are suitable for further exploration, testing and validation.
Implementing, Evaluating and Monitoring
Once a model has been selected and validated, it can be implemented for use by the decision makers. This may involve software development for generating reports, or for results visualisation and explanation for managers.
If more than one technique is available for the given data mining task, it is necessary to evaluate the results and choose the best. This may involve checking the accuracy and effectiveness of each technique.
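Checking the accuracy of competing techniques can be sketched by scoring each one on the same held-out examples and keeping the best. The two prediction rules, the customer fields and the data below are all hypothetical; a real project would compare fitted models and would usually look at more than accuracy.

```python
def accuracy(predict, examples):
    """Fraction of examples for which the prediction matches the known label."""
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)


# Two hypothetical rules for predicting whether a customer will respond.
def rule_recent_buyer(customer):
    return customer["months_since_purchase"] <= 6


def rule_big_spender(customer):
    return customer["annual_spend"] >= 500


# Held-out examples with known outcomes (invented data).
holdout = [
    ({"months_since_purchase": 2, "annual_spend": 300}, True),
    ({"months_since_purchase": 10, "annual_spend": 800}, False),
    ({"months_since_purchase": 4, "annual_spend": 900}, True),
    ({"months_since_purchase": 12, "annual_spend": 100}, False),
]

scores = {
    "recent_buyer": accuracy(rule_recent_buyer, holdout),
    "big_spender": accuracy(rule_big_spender, holdout),
}
best = max(scores, key=scores.get)
print(scores, best)
```

Scoring on examples that were not used to build the techniques gives a fairer estimate of how each will perform on new data.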
Implementing, Evaluating and Monitoring
Regular monitoring of the performance of the techniques that have been implemented is required.
Every enterprise evolves with time, and so must the data mining system. Monitoring may from time to time lead to the refinement of the tools and techniques that have been implemented.
Results Visualisation
Explaining the results of data mining to the decision makers is an important step. Most DM software includes data visualisation modules, which should be used in communicating data mining results to the managers.
Clever data visualisation tools are being developed to display results that deal with more than two dimensions. The visualisation tools available should be tried, and used if found effective for the given problem.
What is data mining?
Why data mining?
What applications?
What techniques?
What process?
What software?