Trịnh Tấn Đạt
Faculty of Information Technology, Saigon University
Email: trinhtandat@sgu.edu.vn
Website: https://sites.google.com/site/ttdat88/
Why preprocess the data?
Descriptive data summarization
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, …
e.g., occupation=“ ”
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from
“Not applicable” data value when collected
Different considerations between the time when the data was collected and when it is analyzed
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics.
Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view: accuracy, completeness, consistency, timeliness, believability, value added, interpretability, and accessibility
Data type
Numeric: the most commonly used data type; the stored content is numeric
Characters and strings: strings are arrays of characters
Boolean: for binary data with true and false values
Time-related data: data with time- or sequence-related properties
Sequential data: the data itself has a sequential relationship
Time series data: each value changes with time
Data type
Spatial data: for data including spatially related attributes
For example, Google Map, Integrated Circuit Design Layout, Wafer Exposure Layout, Global Positioning System (GPS), etc.
Text data: for paragraph descriptions, including patent reports, diagnostic reports, etc.
Structured data: library bibliographic data, credit card data
Semi-structured data: email, extensible markup language (XML)
Unstructured data: social media data such as messages in Facebook
Multimedia data: pictures, audio, video, etc.; their volumes are massive compared to other data types, so data compression is needed for storage
Data scale
Each data variable has a corresponding attribute and a scale used to quantify and measure it:
a natural quantitative scale
a qualitative scale
When it is hard to find a directly measurable attribute for a variable, a proxy attribute can be used instead as the measurement
Common scales: nominal scale, categorical scale, ordinal scale, interval scale, ratio scale, and absolute scale
“A proxy attribute is a variable that is used to represent or stand in for another variable or attribute that is difficult to measure directly. A proxy attribute is typically used in situations where it is not possible or practical to measure the actual attribute of interest. For example, in a study of income, the amount of money a person earns per year may be difficult to determine accurately. In such a case, a proxy attribute, such as education level or occupation, may be used instead.” (ChatGPT)
Six common scales
Nominal scale: values are used only as codes and have no meaning for mathematical operations
Categorical scale: data are grouped according to their characteristics, and each category is marked with a numeric code indicating the category to which it belongs
Ordinal scale: expresses the ranking and ordering of the data without establishing the degree of variation between them
Interval scale: also called distance scale; describes numerical differences between values in a meaningful way
Ratio scale: different numbers can be compared to each other by ratio
Absolute scale: the numbers measured have absolute meaning
Data inspection
Goal: inspect the obtained data from different viewpoints to find errors in advance, and then correct or remove some of them after discussion with domain experts
Data are categorized into quantitative and qualitative aspects
Inspect central tendencies (mean, median, etc.) and variability
Inspect data omissions, data noise, etc. in different graphs
Data discovery and visualization
Statistical table: a table made according to specific rules after the data have been organized
Statistical chart: graphical representation of various characteristics of statistical data in different graphic styles
Data Type:
Frequency: histogram, bar plot, pie chart
Distribution: box plot, Q-Q plot
Trends: trend chart
Relationships: scatter plot
Different data categories have different statistical charts
Categorical data: Bar chart applicable
Continuous data: histogram and pie chart applicable
Major Tasks in Data Preprocessing
Forms of Data Preprocessing
Descriptive data summarization
Mining Data Descriptive Characteristics
Motivation
To better understand the data: central tendency, variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Median: a holistic measure
Middle value if odd number of values, or average of the middle two values otherwise
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
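As a quick illustration, here is a minimal Python sketch of these measures (NumPy and the standard library Counter are assumptions, not part of the slides; the sample reuses the price list from the binning example later on):

```python
import numpy as np
from collections import Counter

x = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # sample values
w = np.array([1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1])           # illustrative weights

mean = x.mean()                           # (1/n) * sum(x_i)
weighted_mean = np.average(x, weights=w)  # sum(w_i * x_i) / sum(w_i)
median = np.median(x)                     # n is even, so the average of the two middle values
mode = Counter(x.tolist()).most_common(1)[0][0]  # most frequent value (21: the data is unimodal)

print(mean, weighted_mean, median, mode)
```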
Symmetric vs Skewed Data
positively and negatively skewed data
Four moments of a distribution: mean, variance, skewness, and kurtosis
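The four moments can be estimated directly from data; a short sketch (NumPy assumed, with a synthetic normal sample):

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=0.0, scale=1.0, size=10_000)

mu = x.mean()                               # 1st moment: mean
var = x.var()                               # 2nd central moment: variance
skew = ((x - mu) ** 3).mean() / var ** 1.5  # 3rd standardized moment: skewness (~0 here)
kurt = ((x - mu) ** 4).mean() / var ** 2    # 4th standardized moment: kurtosis (~3 for a normal)

print(mu, var, skew, kurt)
```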
Measuring the Dispersion of Data
Quartiles, outliers, and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 − Q1
Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1, plotted individually in a boxplot
Variance and standard deviation (sample: s, population: σ)
Standard deviation s (or σ) is the square root of the variance s² (or σ²)
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
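A minimal sketch of these dispersion measures in Python (NumPy assumed), again on the price data:

```python
import numpy as np

x = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]  # the usual 1.5 * IQR rule

s2 = x.var(ddof=1)      # sample variance s^2 (divides by n - 1)
sigma2 = x.var(ddof=0)  # population variance sigma^2 (divides by N)
s = np.sqrt(s2)         # sample standard deviation

print(q1, q3, iqr, outliers, s2, s)
```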
Properties of the Normal Distribution Curve
The normal (distribution) curve
From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
From μ–2σ to μ+2σ: contains about 95% of the measurements
From μ–3σ to μ+3σ: contains about 99.7% of the measurements
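These percentages can be checked numerically; a small sketch assuming SciPy is available:

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean of a normal distribution
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"mu +/- {k} sigma: {p:.4f}")  # prints ~0.6827, 0.9545, 0.9973
```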
Boxplot Analysis
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extend to the minimum and maximum
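A minimal boxplot sketch (Matplotlib assumed; the price list is the sample used elsewhere in these slides):

```python
import matplotlib.pyplot as plt

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

fig, ax = plt.subplots()
# Box ends at Q1/Q3, line at the median; points beyond 1.5 * IQR are drawn individually
ax.boxplot(prices, whis=1.5)
ax.set_ylabel("price ($)")
plt.show()
```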
Visualization of Data Dispersion: Boxplot Analysis
Line chart
Line chart: displays changes as a series of data points connected by straight line segments, where the data points are ordered by their x-axis values and the y-axis values are used to compare data trends among variables
Histogram Analysis
Graph displays of basic statistical class descriptions
Frequency histograms
A univariate graphical method
Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data
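A frequency histogram of the same price data, as a short sketch (Matplotlib assumed):

```python
import matplotlib.pyplot as plt

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

fig, ax = plt.subplots()
ax.hist(prices, bins=4)  # 4 equal-width classes; each rectangle's height is the class frequency
ax.set_xlabel("price ($)")
ax.set_ylabel("frequency")
plt.show()
```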
Cameraman image with its histogram
Low-contrast cameraman image with its histogram
High-contrast cameraman image with its histogram
Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Allows the user to view whether there is a shift in going from one distribution to another
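A Q-Q plot sketch against the normal distribution (SciPy and Matplotlib assumed; the exponential sample is a deliberately non-normal example):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

sample = np.random.default_rng(0).exponential(scale=2.0, size=200)

# Sample quantiles vs. theoretical normal quantiles; points off the straight
# line indicate a shift or shape difference between the two distributions.
stats.probplot(sample, dist="norm", plot=plt)
plt.show()
```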
Scatter plot
Displays relationships between variables; usually uses dots to represent the values of two different numeric variables, where the position of each dot on the x-axis and y-axis indicates the values of an individual data point
Loess Curve
Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression
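A loess-style sketch using the LOWESS smoother from statsmodels (an assumption; note this implementation exposes the smoothing parameter `frac` but fixes the local fits to degree-1 polynomials):

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)  # noisy dependence on x

smoothed = lowess(y, x, frac=0.3)  # frac: fraction of the data used for each local fit

plt.scatter(x, y, s=8)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red")  # columns are (sorted x, fitted y)
plt.show()
```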
Positively and Negatively Correlated Data
Not Correlated Data
Data preprocessing
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
history or changes of the data were not registered
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
a global constant: e.g., “unknown”, a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
interpolation
the most probable value: inference-based such as Bayesian formula or decision tree
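The automatic strategies above map directly onto pandas operations; a minimal sketch (pandas assumed, with a made-up income/class table):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [30_000, np.nan, 52_000, np.nan, 41_000, 78_000],
    "class":  ["low",  "low",  "high", "high", "low",  "high"],
})

filled_const = df["income"].fillna(-1)                  # a global constant / sentinel
filled_mean = df["income"].fillna(df["income"].mean())  # the attribute mean
filled_class_mean = df["income"].fillna(                # attribute mean per class: smarter
    df.groupby("class")["income"].transform("mean")
)
filled_interp = df["income"].interpolate()              # interpolation (for ordered data)
```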
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning:
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
Regression:
smooth by fitting the data into regression functions
Clustering:
detect and remove outliers
Combined computer and human inspection:
detect suspicious values and check by human (e.g., deal with possible outliers)
Simple Discretization Methods: Binning
Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform grid
If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N
The most straightforward, but outliers may dominate the presentation
Skewed data is not handled well
Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximately the same number of samples
Good data scaling
Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into equal-frequency (equi-depth) bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
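The same partitioning and smoothing as a short NumPy sketch (the rounding of bin means mirrors the worked example above):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-depth (frequency) partitioning: equal-sized groups of the sorted values.
# Equal-width partitioning would instead cut at edges of width W = (max - min) / N.
bins = np.array_split(np.sort(prices), 3)

for b in bins:
    by_means = np.full_like(b, round(b.mean()))         # every value becomes the bin mean
    lo, hi = b[0], b[-1]
    by_boundaries = np.where(b - lo <= hi - b, lo, hi)  # snap to the nearest bin boundary
    print(b, by_means, by_boundaries)
```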
Regression (figure: data smoothed by fitting a regression line)
Cluster Analysis
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real-world entity, attribute values from different sources are different
Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration
Redundant data often occur when integrating multiple databases
Object identification: the same attribute or object may have different names in different databases
Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Data Transformation
Goal: convert the data into a format suitable for the data mining approach, or enrich the content of the data by converting or re-encoding the original data to increase its value
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
Data Transformation: Normalization
Min-max normalization: to [new_min_A, new_max_A]
$v' = \frac{v - \min_A}{\max_A - \min_A}(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73600 - 12000}{98000 - 12000}(1.0 - 0) + 0 = 0.716$
Z-score normalization (μ: mean, σ: standard deviation):
$v' = \frac{v - \mu_A}{\sigma_A}$
Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73600 - 54000}{16000} = 1.225$
Normalization by decimal scaling:
$v' = \frac{v}{10^j}$, where j is the smallest integer such that $\max(|v'|) < 1$
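All three normalizations in one Python sketch (NumPy assumed), reproducing the slide's numbers:

```python
import numpy as np

income = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0.0, 1.0]
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min()) \
         * (new_max - new_min) + new_min    # 73,600 -> ~0.716

# Z-score normalization, using the slide's mu and sigma
mu, sigma = 54_000, 16_000
zscore = (income - mu) / sigma              # 73,600 -> 1.225

# Decimal scaling: divide by 10^j, j the smallest integer with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal = income / 10 ** j                  # j = 5, so 73,600 -> 0.736

print(minmax, zscore, decimal, sep="\n")
```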
Data Reduction Strategies
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies
Data cube aggregation
Dimensionality reduction — e.g., remove unimportant attributes
Data compression
Numerosity reduction — e.g., fit data into models
Discretization
Data Reduction
The value of data differs with the data resolution, and it can be enhanced through the process of data aggregation
The data collection stage should collect as many recordable variables as possible, and then aggregate the data to obtain a more compact data set carrying the same information as the original data
Benefits include:
Enhanced data quality
Decreased time for data mining
Increased data value and readability
Decreased costs for data storage
Attribute Subset Selection
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features
Reduces the number of attributes in the discovered patterns, making them easier to understand
Heuristic methods (due to the exponential number of choices):
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward elimination
Decision-tree induction
Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
(figure: the induced decision tree splits on A4 at the root, with Class 1/Class 2 leaves)
Reduced attribute set: {A1, A4, A6}
Heuristic Feature Selection Methods
There are $2^d$ possible feature subsets of d features
Several heuristic feature selection methods:
Best single features under the feature independence assumption: choose by significance tests
Best step-wise feature selection (see the sketch after this list):
The best single feature is picked first
Then the next best feature conditioned on the first, and so on
Step-wise feature elimination:
Repeatedly eliminate the worst feature
Best combined feature selection and elimination
Optimal branch and bound:
Use feature elimination and backtracking
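To ground the step-wise forward selection bullet, here is a minimal greedy sketch; the dataset, classifier, and cross-validated scoring function are all illustrative assumptions (scikit-learn), not part of the slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

def score(features):
    """Cross-validated accuracy using only the given feature subset."""
    clf = DecisionTreeClassifier(random_state=0)
    return cross_val_score(clf, X[:, features], y, cv=5).mean()

# Step-wise forward selection: pick the best single feature first, then
# repeatedly add whichever remaining feature improves the score the most.
selected, remaining, best = [], list(range(X.shape[1])), 0.0
while remaining:
    cand = max(remaining, key=lambda f: score(selected + [f]))
    cand_score = score(selected + [cand])
    if cand_score <= best:  # stop when no feature improves the score
        break
    selected.append(cand)
    remaining.remove(cand)
    best = cand_score

print(selected, best)
```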