Trịnh Tấn Đạt
Faculty of Information Technology, Saigon University
Email: trinhtandat@sgu.edu.vn
Website: https://sites.google.com/site/ttdat88/
Why preprocess the data?
Descriptive data summarization
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, …
e.g., occupation=“ ”
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from
“Not applicable” data value when collected
Different considerations between the time when the data was collected and when it is analyzed
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics.
Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view: accuracy, completeness, consistency, timeliness, believability, value added, interpretability, and accessibility
Data type
Numeric: the most commonly used data type; the stored content is numeric
Characters and strings: strings are arrays of characters
Boolean: for binary data with true and false values
Time-related data: data with time- or sequence-related properties
Sequential data: the data itself has a sequential relationship
Time series data: each value changes with time
Data type
Spatial data: for data including spatially related attributes
For example, Google Map, Integrated Circuit Design Layout, Wafer Exposure Layout, Global Positioning System (GPS), etc.
Text data: for paragraph descriptions, including patent reports, diagnostic reports, etc.
Structured data: library bibliographic data, credit card data
Semi-structured data: email, extensible markup language (XML)
Unstructured data: social media data such as messages in Facebook
Multimedia data: pictures, audio, video, etc.; their volumes are massive compared to other data types, so data compression is needed for storage
Data scale
Each data variable has a corresponding attribute and a scale used to quantify and measure it:
a natural quantitative scale
a qualitative scale
When it is hard to find a directly measurable attribute for a variable, a proxy attribute can be used instead as the measurement
Common scales: nominal scale, categorical scale, ordinal scale, interval scale, ratio scale, and absolute scale
“A proxy attribute is a variable that is used to represent or stand in for another variable or attribute that is difficult to measure directly. A proxy attribute is typically used in situations where it is not possible or practical to measure the actual attribute of interest. For example, in a study of income, the amount of money a person earns per year may be difficult to determine accurately. In such a case, a proxy attribute, such as education level or occupation, may be used instead.” (ChatGPT)
Six common scales
Nominal scale: values are used only as codes and have no meaning for mathematical operations
Categorical scale: data are grouped according to their characteristics, and each category is marked with a numeric code indicating the category to which it belongs
Ordinal scale: expresses the ranking and ordering of the data without establishing the degree of variation between them
Interval scale: also called distance scale; describes numerical differences between values in a meaningful way
Ratio scale: different numbers can be compared to each other by ratio
Absolute scale: the numbers measured have absolute meaning
Data inspection
Goal: inspect the obtained data from different viewpoints to find errors in advance, and then correct or remove some of them after discussion with domain experts
Data are categorized into quantitative and qualitative aspects
Inspect central tendencies (mean, median, etc.) and variability
Inspect data omissions, data noise, etc. in different graphs
Data discovery and visualization
Statistical table: a table made according to specific rules after the data have been organized
Statistical chart: graphical representation of various characteristics of statistical data in different graphic styles
Data Type:
Frequency: histogram, bar plot, pie chart
Distribution: box plot, Q-Q plot
Trends: trend chart
Relationships: scatter plot
Different data categories have different statistical charts
Categorical data: Bar chart applicable
Continuous data: histogram and pie chart applicable
Major Tasks in Data Preprocessing
Forms of Data Preprocessing
Descriptive data summarization
Mining Data Descriptive Characteristics
Motivation
To better understand the data: central tendency, variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Median: a holistic measure
Middle value if odd number of values, or average of the middle two values otherwise
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
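As a quick illustration, here is a minimal Python sketch of these measures (NumPy and the standard library Counter are assumptions, not part of the slides; the sample reuses the price list from the binning example later on):

```python
import numpy as np
from collections import Counter

x = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # sample values
w = np.array([1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1])           # illustrative weights

mean = x.mean()                           # (1/n) * sum(x_i)
weighted_mean = np.average(x, weights=w)  # sum(w_i * x_i) / sum(w_i)
median = np.median(x)                     # n is even, so the average of the two middle values
mode = Counter(x.tolist()).most_common(1)[0][0]  # most frequent value (21: the data is unimodal)

print(mean, weighted_mean, median, mode)
```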
Symmetric vs Skewed Data
positively and negatively skewed data
Four moments of a distribution: mean, variance, skewness, and kurtosis
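The four moments can be estimated directly from data; a short sketch (NumPy assumed, with a synthetic normal sample):

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=0.0, scale=1.0, size=10_000)

mu = x.mean()                               # 1st moment: mean
var = x.var()                               # 2nd central moment: variance
skew = ((x - mu) ** 3).mean() / var ** 1.5  # 3rd standardized moment: skewness (~0 here)
kurt = ((x - mu) ** 4).mean() / var ** 2    # 4th standardized moment: kurtosis (~3 for a normal)

print(mu, var, skew, kurt)
```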
Measuring the Dispersion of Data
Quartiles, outliers, and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 − Q1
Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1, plotted individually in a boxplot
Variance and standard deviation (sample: s, population: σ)
Standard deviation s (or σ) is the square root of the variance s² (or σ²)
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
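A minimal sketch of these dispersion measures in Python (NumPy assumed), again on the price data:

```python
import numpy as np

x = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]  # the usual 1.5 * IQR rule

s2 = x.var(ddof=1)      # sample variance s^2 (divides by n - 1)
sigma2 = x.var(ddof=0)  # population variance sigma^2 (divides by N)
s = np.sqrt(s2)         # sample standard deviation

print(q1, q3, iqr, outliers, s2, s)
```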
Properties of the Normal Distribution Curve
The normal (distribution) curve
From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
From μ–2σ to μ+2σ: contains about 95% of the measurements
From μ–3σ to μ+3σ: contains about 99.7% of the measurements
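These percentages can be checked numerically; a small sketch assuming SciPy is available:

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean of a normal distribution
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"mu +/- {k} sigma: {p:.4f}")  # prints ~0.6827, 0.9545, 0.9973
```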
Boxplot Analysis
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extend to the minimum and maximum
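A minimal boxplot sketch (Matplotlib assumed; the price list is the sample used elsewhere in these slides):

```python
import matplotlib.pyplot as plt

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

fig, ax = plt.subplots()
# Box ends at Q1/Q3, line at the median; points beyond 1.5 * IQR are drawn individually
ax.boxplot(prices, whis=1.5)
ax.set_ylabel("price ($)")
plt.show()
```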
Visualization of Data Dispersion: Boxplot Analysis
Line chart
Line chart: displays changes as a series of data points connected by straight line segments, where the data points are ordered by their x-axis values and the y-axis values are used to compare data trends among variables
Histogram Analysis
Graph displays of basic statistical class descriptions
Frequency histograms
A univariate graphical method
Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data
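A frequency histogram of the same price data, as a short sketch (Matplotlib assumed):

```python
import matplotlib.pyplot as plt

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

fig, ax = plt.subplots()
ax.hist(prices, bins=4)  # 4 equal-width classes; each rectangle's height is the class frequency
ax.set_xlabel("price ($)")
ax.set_ylabel("frequency")
plt.show()
```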
Cameraman image with its histogram
Low-contrast cameraman image with its histogram
High-contrast cameraman image with its histogram
Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Allows the user to view whether there is a shift in going from one distribution to another
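A Q-Q plot sketch against the normal distribution (SciPy and Matplotlib assumed; the exponential sample is a deliberately non-normal example):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

sample = np.random.default_rng(0).exponential(scale=2.0, size=200)

# Sample quantiles vs. theoretical normal quantiles; points off the straight
# line indicate a shift or shape difference between the two distributions.
stats.probplot(sample, dist="norm", plot=plt)
plt.show()
```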
Scatter plot
Displays relationships between variables; usually uses dots to represent the values of two different numeric variables, where the position of each dot on the x-axis and y-axis indicates the values of an individual data point
Loess Curve
Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression
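A loess-style sketch using the LOWESS smoother from statsmodels (an assumption; note this implementation exposes the smoothing parameter `frac` but fixes the local fits to degree-1 polynomials):

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)  # noisy dependence on x

smoothed = lowess(y, x, frac=0.3)  # frac: fraction of the data used for each local fit

plt.scatter(x, y, s=8)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red")  # columns are (sorted x, fitted y)
plt.show()
```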
Positively and Negatively Correlated Data
Not Correlated Data
Data preprocessing
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
history or changes of the data were not registered
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
a global constant: e.g., “unknown”, a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
interpolation
the most probable value: inference-based such as Bayesian formula or decision tree
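The automatic strategies above map directly onto pandas operations; a minimal sketch (pandas assumed, with a made-up income/class table):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [30_000, np.nan, 52_000, np.nan, 41_000, 78_000],
    "class":  ["low",  "low",  "high", "high", "low",  "high"],
})

filled_const = df["income"].fillna(-1)                  # a global constant / sentinel
filled_mean = df["income"].fillna(df["income"].mean())  # the attribute mean
filled_class_mean = df["income"].fillna(                # attribute mean per class: smarter
    df.groupby("class")["income"].transform("mean")
)
filled_interp = df["income"].interpolate()              # interpolation (for ordered data)
```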
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning:
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
Regression:
smooth by fitting the data into regression functions
Clustering:
detect and remove outliers
Combined computer and human inspection:
detect suspicious values and check by human (e.g., deal with possible outliers)
Simple Discretization Methods: Binning
Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform grid
If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N
The most straightforward, but outliers may dominate the presentation
Skewed data is not handled well
Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximately the same number of samples
Good data scaling
Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into equal-frequency (equi-depth) bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
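The same partitioning and smoothing as a short NumPy sketch (the rounding of bin means mirrors the worked example above):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-depth (frequency) partitioning: equal-sized groups of the sorted values.
# Equal-width partitioning would instead cut at edges of width W = (max - min) / N.
bins = np.array_split(np.sort(prices), 3)

for b in bins:
    by_means = np.full_like(b, round(b.mean()))         # every value becomes the bin mean
    lo, hi = b[0], b[-1]
    by_boundaries = np.where(b - lo <= hi - b, lo, hi)  # snap to the nearest bin boundary
    print(b, by_means, by_boundaries)
```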
Regression (figure: data smoothed by fitting a regression line)
Cluster Analysis
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real-world entity, attribute values from different sources are different
Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration
Redundant data often occur when integrating multiple databases
Object identification: the same attribute or object may have different names in different databases
Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Data Transformation
Goal: convert the data into a format suitable for the data mining approach, or enrich the content of the data by converting or re-encoding the original data to increase its value
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
Data Transformation: Normalization
Min-max normalization: to [new_min_A, new_max_A]
$v' = \frac{v - \min_A}{\max_A - \min_A}(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73600 - 12000}{98000 - 12000}(1.0 - 0) + 0 = 0.716$
Z-score normalization (μ: mean, σ: standard deviation):
$v' = \frac{v - \mu_A}{\sigma_A}$
Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73600 - 54000}{16000} = 1.225$
Normalization by decimal scaling:
$v' = \frac{v}{10^j}$, where j is the smallest integer such that $\max(|v'|) < 1$
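All three normalizations in one Python sketch (NumPy assumed), reproducing the slide's numbers:

```python
import numpy as np

income = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0.0, 1.0]
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min()) \
         * (new_max - new_min) + new_min    # 73,600 -> ~0.716

# Z-score normalization, using the slide's mu and sigma
mu, sigma = 54_000, 16_000
zscore = (income - mu) / sigma              # 73,600 -> 1.225

# Decimal scaling: divide by 10^j, j the smallest integer with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal = income / 10 ** j                  # j = 5, so 73,600 -> 0.736

print(minmax, zscore, decimal, sep="\n")
```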
Data Reduction Strategies
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies
Data cube aggregation
Dimensionality reduction — e.g., remove unimportant attributes
Data compression
Numerosity reduction — e.g., fit data into models
Discretization
Data Reduction
The value of data differs with the data resolution, and it can be enhanced through the process of data aggregation
The data collection stage should collect as many recordable variables as possible, and then aggregate the data to obtain a more compact data set carrying the same information as the original data
Benefits include:
Enhanced data quality
Decreased time for data mining
Increased data value and readability
Decreased costs for data storage
Attribute Subset Selection
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features
Reduces the number of attributes in the discovered patterns, making them easier to understand
Heuristic methods (due to the exponential number of choices):
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward elimination
Decision-tree induction
Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
(figure: the induced decision tree splits on A4 at the root, with Class 1/Class 2 leaves)
Reduced attribute set: {A1, A4, A6}
Heuristic Feature Selection Methods
There are $2^d$ possible feature subsets of d features
Several heuristic feature selection methods:
Best single features under the feature independence assumption: choose by significance tests
Best step-wise feature selection (see the sketch after this list):
The best single feature is picked first
Then the next best feature conditioned on the first, and so on
Step-wise feature elimination:
Repeatedly eliminate the worst feature
Best combined feature selection and elimination
Optimal branch and bound:
Use feature elimination and backtracking
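To ground the step-wise forward selection bullet, here is a minimal greedy sketch; the dataset, classifier, and cross-validated scoring function are all illustrative assumptions (scikit-learn), not part of the slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

def score(features):
    """Cross-validated accuracy using only the given feature subset."""
    clf = DecisionTreeClassifier(random_state=0)
    return cross_val_score(clf, X[:, features], y, cv=5).mean()

# Step-wise forward selection: pick the best single feature first, then
# repeatedly add whichever remaining feature improves the score the most.
selected, remaining, best = [], list(range(X.shape[1])), 0.0
while remaining:
    cand = max(remaining, key=lambda f: score(selected + [f]))
    cand_score = score(selected + [cand])
    if cand_score <= best:  # stop when no feature improves the score
        break
    selected.append(cand)
    remaining.remove(cand)
    best = cand_score

print(selected, best)
```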