
Business Intelligence: A Managerial Approach, 2nd Edition, by David King, Chapter 04


Page 1

Chapter 4: Text and Web Mining

Page 2

Learning Objectives

for text mining

and data mining

text mining

project

structure to text-based data

Page 3

mining paradigms

Page 5

(E) election (P) Norodom Ranariddh (P) Norodom Sihanouk (L) Bangkok

(L) Cambodia (L) Phnom Penh (L) Thailand (P) Hun Sen (O) Khmer Rouge (P) Pol Pot

Page 6

Text Mining Concepts

kind of unstructured form (e.g., text)

every 18 months.

option, but a need to stay competitive.

from unstructured data sources

textual databases

Page 7

Data Mining versus Text Mining

• Both seek novel and useful patterns

• Both are semi-automated processes

• The difference is the nature of the data:

  • Structured data: databases

  • Unstructured data: Word documents, PDF files, text excerpts, XML files, and so on

• Text mining: first impose structure on the data, then mine the structured data

Page 8

Text Mining Concepts

in text-rich data environments

(research articles), finance (quarterly reports), medicine (discharge summaries), biology (molecular interactions), technology (patent files), marketing (customer comments), etc.

Page 9

Text Mining Application Area

Page 10

Text Mining Terminology

• Unstructured or semistructured data

• Corpus (and corpora)

• Terms

• Concepts

• Stemming

• Stop words (and include words)

• Synonyms (and polysemes)

• Tokenizing
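Three of the terms above (tokenizing, stop words, stemming) can be illustrated with a minimal sketch. The stop-word list and the suffix-stripping rule are assumptions made for illustration only; a real project would use a curated stop-word list and an established stemmer.

```python
import re

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Split raw text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

terms = preprocess("The analysts are mining patterns in the documents")
print(terms)
```

The crude rule reduces "mining" to "min", which shows why production systems rely on linguistically informed stemmers rather than plain suffix stripping.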

Page 11

Text Mining Terminology

• Singular value decomposition

Page 12

Text Mining for Patent Analysis

(see Applications Case 7.2)

• What is a patent?

an inventor for a limited period of time in exchange for a disclosure of an invention”

• How do we do patent analysis (PA)?

• Why do we need to do PA?

• How does text mining help in PA?

Page 13

Natural Language Processing (NLP)

• Old approach: bag-of-words

• New approach: natural language processing
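A small sketch of why the bag-of-words approach is limited: it discards word order, so two sentences with opposite meanings can receive identical representations. The example sentences are made up for illustration.

```python
from collections import Counter

def bag_of_words(sentence):
    """Represent a sentence as an unordered multiset of lowercase words."""
    return Counter(sentence.lower().split())

a = "the dog bit the man"
b = "the man bit the dog"

# The sentences differ, yet their bags of words are identical.
same_bag = bag_of_words(a) == bag_of_words(b)
print(same_bag)
```

Recovering who did what to whom requires syntactic and semantic structure that a word count alone does not carry.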

Page 14

Natural Language Processing (NLP)

of a topic

• Can/will computers ever understand natural language the same/accurate way we do?

Page 15

Natural Language Processing (NLP)

reading and obtaining knowledge from text

Page 16

Natural Language Processing (NLP)

words, their definitions, sets of synonyms, and various semantic relations between synonym sets

unfavorable opinions toward specific products and services

Page 18

Text Mining Applications

• example coming up

Page 19

Text Mining Applications

• Application Case 7.4: Mining for Lies

• Deception detection

problem is even more difficult

• The study

of interest at military bases

Page 20

Text Mining Applications

• Application Case 7.4: Mining for Lies

Statements Transcribed for Processing

Text Processing Software Identified Cues in Statements

Statements Labeled as Truthful or Deceptive By Law Enforcement

Text Processing Software Generated Quantified Cues

Classification Models Trained and Tested on Quantified Cues

Cues Extracted & Selected

Page 21

Text Mining Applications

• Application Case 7.4: Mining for Lies

Quantity: verb count, noun-phrase count, …

Complexity: average number of clauses, sentence length, …

Uncertainty: modifiers, modal verbs, …

Nonimmediacy: passive voice, objectification, …

Expressivity: emotiveness

Diversity: lexical diversity, redundancy, …

Informality: typographical error ratio

Specificity: spatiotemporal, perceptual information, …

Affect: positive affect, negative affect, etc.
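A few of the cue categories above can be quantified without any linguistic tooling. The sketch below is only an illustration: the modal-verb list is an assumption, word count stands in for the Quantity cues (verb and noun-phrase counts need a part-of-speech tagger), and none of this is the software used in the study.

```python
import re

# Illustrative modal-verb list; an assumption, not the list used in the study.
MODAL_VERBS = {"can", "could", "may", "might", "must",
               "shall", "should", "will", "would"}

def quantify_cues(statement):
    """Turn a transcribed statement into a few numeric cues."""
    sentences = [s for s in re.split(r"[.!?]+", statement) if s.strip()]
    words = re.findall(r"[a-z']+", statement.lower())
    return {
        # Quantity (rough proxy): total number of words.
        "word_count": len(words),
        # Complexity: average sentence length in words.
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        # Uncertainty: number of modal verbs.
        "modal_verb_count": sum(1 for w in words if w in MODAL_VERBS),
        # Diversity: distinct words divided by total words.
        "lexical_diversity": len(set(words)) / len(words) if words else 0.0,
    }

cues = quantify_cues("I might have left early. I could not see the car.")
print(cues)
```

Each statement becomes one row of numeric features, which is the form a classification model needs for training and testing.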

Page 22

Text Mining Applications

• Application Case 7.4: Mining for Lies

Page 23

Text Mining Applications

(gene/protein interaction identification)

[Figure: only identifier labels survived extraction: D007962, D016923, D001773, D019254, D044465, D001769, D002477, D003643, D016158]

Page 24

Text Mining Process

Extract knowledge from available data sources

A0

Unstructured data (text) Structured data (databases) Context-specific knowledge

Linguistic limitations

Context diagram for the text mining process

Page 25

Text Mining Process

Task 1. Establish the Corpus: Collect & Organize the Domain Specific Unstructured Data

Task 2. Create the Term-Document Matrix: Introduce Structure to the Corpus

Task 3. Extract Knowledge: Discover Novel Patterns from the T-D Matrix

The inputs to the process include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.

The output of Task 1 is a collection of documents in some digitized format for computer processing.

The output of Task 2 is a flat file called the term-document matrix, where the cells are populated with the term frequencies.

The output of Task 3 is a number of problem-specific classification, association, and clustering models and visualizations.

Feedback (labeled twice in the figure)

The three-step text mining process
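The three tasks can be sketched end to end on a toy corpus. Everything here is an assumption for illustration: the documents are made up, rows are taken to be documents and columns terms, and the "knowledge extraction" step is reduced to finding the most frequent term rather than building a real classification or clustering model.

```python
from collections import Counter

# Task 1: establish the corpus (three tiny made-up documents).
corpus = {
    "doc1": "text mining finds patterns in text",
    "doc2": "web mining finds patterns in web data",
    "doc3": "data mining uses structured records",
}

# Task 2: create the term-document matrix.
# Rows are documents, columns are terms, cells are term frequencies.
counts = {name: Counter(text.split()) for name, text in corpus.items()}
terms = sorted(set().union(*counts.values()))
tdm = {name: [counts[name][t] for t in terms] for name in corpus}

# Task 3 (placeholder): a trivial "pattern", the most frequent term overall.
totals = Counter()
for c in counts.values():
    totals.update(c)
top_term, top_count = totals.most_common(1)[0]

print(terms)
for name, row in tdm.items():
    print(name, row)
print(top_term, top_count)
```

In practice Task 3 would feed the matrix to a mining algorithm, and its results would flow back to refine the corpus and the matrix, which is what the feedback arrows in the figure indicate.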

Page 26

Text Mining Process

(e.g., textual documents, XML files, emails, Web pages, short notes, voice recordings…)

(e.g., all in ASCII text files)

(e.g., in a flat file, or in a directory as separate files)

Page 27

Text Mining Process

[Figure: example matrix; only the cell values 1, 1, 1, 3, 1 survived extraction]

Page 28

Text Mining Process

Matrix (TDM)

indices (values in cells)?
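The slide asks what values the cells of the TDM should hold. The sketch below shows a few commonly used choices as alternatives to the raw term frequency; the slide text does not say which ones the chapter recommends, so treat the selection and the function names as assumptions.

```python
import math

def binary(tf):
    """1 if the term occurs in the document at all, else 0."""
    return 1 if tf > 0 else 0

def log_frequency(tf):
    """Dampen large counts: 1 + ln(tf) for tf > 0, else 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0

def tf_idf(tf, n_docs, doc_freq):
    """Term frequency scaled by inverse document frequency, ln(N / df)."""
    return tf * math.log(n_docs / doc_freq) if doc_freq else 0.0

# A term seen 3 times in one document, present in 10 of 100 documents.
print(binary(3), log_frequency(3), tf_idf(3, 100, 10))
```

A term that occurs in every document gets a TF-IDF weight of zero, reflecting that it does not help distinguish one document from another.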

Page 29

Text Mining Process

Matrix (TDM)

the dimensionality of the TDM?

very few documents (?)

decomposition (SVD)
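One simple way to reduce the dimensionality of the TDM is to drop terms that occur in very few documents. The sketch below assumes rows are documents and columns are terms, and the threshold `min_docs` is a hypothetical parameter; singular value decomposition is a separate technique and is not shown here.

```python
def prune_rare_terms(tdm, terms, min_docs=2):
    """Keep only columns whose term appears in at least `min_docs` documents."""
    keep = [
        j for j in range(len(terms))
        if sum(1 for row in tdm if row[j] > 0) >= min_docs
    ]
    kept_terms = [terms[j] for j in keep]
    reduced = [[row[j] for j in keep] for row in tdm]
    return reduced, kept_terms

terms = ["data", "mining", "patent", "text"]
tdm = [
    [0, 1, 0, 2],
    [1, 1, 0, 0],
    [2, 1, 1, 0],
]
reduced, kept = prune_rare_terms(tdm, terms)
print(kept)
print(reduced)
```

Here "patent" and "text" each appear in a single document and are removed, leaving a matrix with two columns instead of four.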

Page 30

Text Mining Process

Page 31

Text Mining Application

(research trend identification in literature)

• Mining the published IS literature

Page 32

Text Mining Application

(research trend identification in literature)

Journal: MISQ
Year: 2005
Author(s): A Malhotra, S Gosain and O A El Sawy
Title: Absorptive capacity configurations in supply chains: Gearing for partner-enabled market knowledge creation
Vol/No: 29/1
Pages: 145-187
Keywords: knowledge management; supply chain; absorptive capacity; interorganizational information systems; configuration approaches
Abstract: The need for continual value innovation is driving supply chains to evolve from a pure transactional focus to leveraging interorganizational partnerships for sharing …

Journal: ISR
Year: 1999
Author(s): D Robey and M C Boudreau
Title: Accounting for the contradictory organizational consequences of information technology: Theoretical directions and methodological implications
Vol/No: 10/2
Pages: 167-185
Keywords: organizational transformation; impacts of technology; organization theory; research methodology; intraorganizational power; electronic communication; MIS implementation; culture; systems
Abstract: Although much contemporary thought considers advanced information technologies as either determinants or enablers of radical organizational change, empirical studies have revealed inconsistent findings to support the deterministic logic implicit in such arguments. This paper reviews the contradictory …

Journal: JMIS
Year: 2001
Author(s): R Aron and E K Clemons
Title: Achieving the optimal balance between investment in quality and investment in self-promotion for information products
Vol/No: 18/2
Pages: 65-88
Keywords: information products; internet advertising; product positioning; signaling; signaling games
Abstract: When producers of goods (or services) are confronted by a situation in which their offerings no longer perfectly match consumer preferences, they must determine the extent to which the advertised features of …

Page 33

Text Mining Application

(research trend identification in literature)

Page 34

Text Mining Application

(research trend identification in literature)

Page 35

Text Mining Tools

• Commercial Software Tools

Page 36

Web Mining Overview

Page 37

Web Mining

• Web mining (or Web data mining) is the process of discovering intrinsic relationships from Web data (textual, linkage, or usage)

Web Mining

Web Structure Mining

Source: the uniform resource locator (URL) links contained in the Web pages

Web Content Mining

Source: unstructured textual content of the Web pages (usually in HTML format)

Web Usage Mining

Source: the detailed description of a Web site’s visits (sequence of clicks by sessions)

Page 38

Web Content/Structure Mining

• Mining of the textual content on the Web

• Data collection via Web crawlers

• Web pages include hyperlinks
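Because Web pages include hyperlinks, a crawler can collect both content and link structure from the same HTML. The sketch below extracts link targets with Python's standard `html.parser` module; the sample page and the class name `LinkExtractor` are made up for illustration, and a real crawler would also fetch pages, resolve relative URLs, and respect site policies.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page content used only for this example.
page = '<p>See <a href="https://example.com/a">A</a> and <a href="/b">B</a>.</p>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)
```

The collected links are the raw material for Web structure mining, while the text between the tags feeds Web content mining.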

Page 39

Web Usage Mining

through Web page visits and transactions

agent logs, and client-side cookies

attributes, and usage data

Page 40

Web Usage Mining

based on user access patterns

rules and users' profiles

their interests and profiles

Page 41

Web Usage Mining

(clickstream analysis)

Weblogs

Collecting, Merging, Cleaning, Structuring

How to better the data

How to improve the Web site

How to increase the customer value

User / Customer
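The structuring step of clickstream analysis typically groups raw log records into per-user sessions. The sketch below assumes each record is a `(user, timestamp_in_seconds, url)` tuple and that 30 minutes of inactivity ends a session; both the record layout and the timeout are assumptions, not details from the slides.

```python
from collections import defaultdict

SESSION_GAP = 30 * 60  # assumed 30-minute inactivity timeout, in seconds

def sessionize(records, gap=SESSION_GAP):
    """Group (user, timestamp, url) records into per-user click sequences."""
    by_user = defaultdict(list)
    for user, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, url))

    sessions = []
    for user, clicks in by_user.items():
        current = [clicks[0][1]]
        for (prev_ts, _), (ts, url) in zip(clicks, clicks[1:]):
            if ts - prev_ts > gap:
                # Long pause: close the current session and start a new one.
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions

# Hypothetical log records, deliberately out of order.
log = [
    ("u1", 0, "/home"),
    ("u1", 120, "/products"),
    ("u2", 60, "/home"),
    ("u1", 4000, "/home"),
]
sessions = sessionize(log)
print(sessions)
```

Each resulting sequence of clicks is one unit of analysis for questions such as how to improve the Web site or increase customer value.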

Page 42

Web Mining Success Stories

Web Analytics; Voice of Customer; Customer Experience Management

Customer Interaction on the Web

Analysis of Interactions

Knowledge about the Holistic View of the Customer

Page 43

Web Mining Tools

Page 44

End of the Chapter

• Questions, comments

Posted: 18/12/2017, 15:10