Chapter 4:
Text and Web Mining
Learning Objectives
for text mining
and data mining
text mining
project
structure to text-based data
mining paradigms
(E) election (P) Norodom Ranariddh (P) Norodom Sihanouk (L) Bangkok
(L) Cambodia (L) Phnom Penh (L) Thailand (P) Hun Sen (O) Khmer Rouge (P) Pol Pot
Text Mining Concepts
kind of unstructured form (e.g., text)
every 18 months.
option, but a need to stay competitive.
from unstructured data sources
textual databases
Data Mining versus Text Mining
Both seek novel and useful patterns
Both are semi-automated processes
Difference is the nature of the data:
Structured data: databases
Unstructured data: Word documents, PDF files, text excerpts, XML files, and so on
Text mining: first, impose structure on the data, then mine the structured data
Text Mining Concepts
in text-rich data environments
(research articles), finance (quarterly reports),
medicine (discharge summaries), biology (molecular interactions), technology (patent files), marketing (customer comments), etc.
Text Mining Application Area
Text Mining Terminology
Unstructured or semistructured data
Corpus (and corpora)
Terms
Concepts
Stemming
Stop words (and include words)
Synonyms (and polysemes)
Tokenizing
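Three of the terms above (tokenizing, stop words, stemming) can be illustrated with a minimal sketch. The stop-word list and the suffix-stripping rule below are hypothetical and deliberately crude, chosen only to show the idea. Production work would use a proper stemmer (e.g., a Porter-style algorithm) and a domain-tuned stop list.

```python
import re

# Hypothetical stop-word list; real projects use a longer, domain-tuned one.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "are", "to", "in"}

def tokenize(text):
    """Tokenizing: split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Stemming: strip a few common English suffixes (a crude stand-in)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

terms = preprocess("The miners are mining the mined documents")
print(terms)
```

Note that "mining" and "mined" collapse to the same stem, which is the point of stemming: different inflections of a word are counted as one term.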
Text Mining Terminology
Singular value decomposition
Text Mining for Patent Analysis
(see Application Case 7.2)
What is a patent?
an inventor for a limited period of time in exchange for a disclosure of an invention”
How do we do patent analysis (PA)?
Why do we need to do PA?
How does text mining help in PA?
Natural Language Processing (NLP)
Old approach: bag-of-words
New approach: natural language processing
Natural Language Processing (NLP)
of a topic
Can/will computers ever understand natural
language as accurately as we do?
Natural Language Processing (NLP)
reading and obtaining knowledge from text
Natural Language Processing (NLP)
words, their definitions, sets of synonyms, and
various semantic relations between synonym sets
unfavorable opinions toward specific products and services
Text Mining Applications
example coming up
Text Mining Applications
Application Case 7.4: Mining for Lies
Deception detection
problem is even more difficult
The study
of interest at military bases
Text Mining Applications
Application Case 7.4: Mining for Lies
Statements Transcribed for Processing
Text Processing Software Identified Cues in Statements
Statements Labeled as Truthful or Deceptive By Law Enforcement
Text Processing Software Generated Quantified Cues
Classification Models Trained and Tested on Quantified Cues
Cues Extracted & Selected
Text Mining Applications
Application Case 7.4: Mining for Lies
Quantity: Verb count, noun-phrase count, …
Complexity: Avg. no. of clauses, sentence length, …
Uncertainty: Modifiers, modal verbs, …
Nonimmediacy: Passive voice, objectification, …
Expressivity: Emotiveness
Diversity: Lexical diversity, redundancy, …
Informality: Typographical error ratio
Specificity: Spatiotemporal, perceptual information, …
Affect: Positive affect, negative affect, etc.
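To make "quantified cues" concrete, here is a minimal sketch that turns one statement into a few numeric features drawn from the categories listed above (quantity, complexity, uncertainty, diversity). It is not the software used in the study. The modal-verb list, the feature names, and the sample statement are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny modal-verb list used as an "uncertainty" cue.
MODAL_VERBS = {"may", "might", "could", "would", "should", "can", "must"}

def quantify_cues(statement):
    """Return a small dictionary of numeric cues for one statement."""
    # Sentences are approximated by splitting on ., ! and ?
    sentences = [s for s in re.split(r"[.!?]+", statement) if s.strip()]
    words = re.findall(r"[a-z']+", statement.lower())
    n_words = len(words)
    return {
        # Quantity: how much was said.
        "word_count": n_words,
        # Complexity: average sentence length in words.
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        # Uncertainty: number of modal verbs.
        "modal_verb_count": sum(w in MODAL_VERBS for w in words),
        # Diversity: distinct words divided by total words.
        "lexical_diversity": len(set(words)) / n_words if n_words else 0.0,
    }

cues = quantify_cues("I might have seen him. I could be wrong.")
print(cues)
```

Each statement becomes one row of such features, and those rows, together with the truthful/deceptive labels, are what a classification model would be trained and tested on.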
Text Mining Applications
Application Case 7.4: Mining for Lies
Text Mining Applications
(gene/protein interaction identification)
Text Mining Process
Context diagram for the text mining process
Process (A0): Extract knowledge from available data sources
Inputs: Unstructured data (text); Structured data (databases)
Output: Context-specific knowledge
Constraint: Linguistic limitations
Text Mining Process
Task 1. Establish the Corpus: Collect and Organize the Domain-Specific Unstructured Data
Task 2. Create the Term-Document Matrix: Introduce Structure to the Corpus
Task 3. Extract Knowledge: Discover Novel Patterns from the T-D Matrix
The inputs to the process include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.
The output of Task 1 is a collection of documents in some digitized format for computer processing
The output of Task 2 is a flat file called the term-document matrix, where the cells are populated with the term frequencies
The output of Task 3 is a number of problem-specific classification, association, and clustering models and visualizations
Feedback
The three-step text mining process
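The second step, building a term-document matrix whose cells hold term frequencies, can be sketched in a few lines. The three-document corpus below is hypothetical and stands in for the output of the first step. No stop-word removal or stemming is applied, so the sketch shows only how the matrix is populated.

```python
import re
from collections import Counter

# Hypothetical corpus: document id -> raw text.
corpus = {
    "doc1": "text mining finds patterns in text",
    "doc2": "data mining finds patterns in data",
    "doc3": "web mining uses web data",
}

def term_frequencies(text):
    """Count how often each lowercase word occurs in one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

counts = {doc_id: term_frequencies(text) for doc_id, text in corpus.items()}

# The column order of the matrix: every distinct term, sorted alphabetically.
terms = sorted(set().union(*counts.values()))

# Rows are documents, columns are terms, cells are raw term frequencies.
tdm = {doc_id: [counts[doc_id][t] for t in terms] for doc_id in corpus}

print(terms)
for doc_id, row in tdm.items():
    print(doc_id, row)
```

Most cells are zero even in this toy corpus. With realistic vocabularies the matrix becomes very sparse and very wide, which is why the dimensionality questions on the later slides matter.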
Text Mining Process
(e.g., textual documents, XML files, emails, Web pages, short notes, voice recordings…)
(e.g., all in ASCII text files)
(e.g., in a flat file, or in a directory as
separate files)
Text Mining Process
Text Mining Process
Matrix (TDM)
indices (values in cells)?
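The slide text that survives does not say which indices it has in mind for the cell values. As an assumption, the sketch below shows three commonly used alternatives to raw counts: binary weights, log-scaled frequencies, and inverse document frequency. The function names are hypothetical.

```python
import math

def binary_weight(tf):
    """1 if the term occurs in the document at all, else 0."""
    return 1 if tf > 0 else 0

def log_weight(tf):
    """Dampen large counts: 1 + ln(tf) for tf > 0, else 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0

def inverse_document_frequency(n_docs, doc_freq):
    """ln(N / df): terms found in fewer documents receive a higher weight."""
    return math.log(n_docs / doc_freq)

def tf_idf(tf, n_docs, doc_freq):
    """Raw term frequency scaled by inverse document frequency."""
    return tf * inverse_document_frequency(n_docs, doc_freq)

# A term occurring twice in a document and present in 10 of 100 documents:
print(binary_weight(2), log_weight(2), tf_idf(2, 100, 10))
```

A term that appears in every document gets an inverse document frequency of zero, reflecting that it does nothing to distinguish one document from another.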
Text Mining Process
Matrix (TDM)
the dimensionality of the TDM?
very few documents (?)
decomposition (SVD)
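One of the reduction ideas hinted at above, dealing with terms that occur in very few documents, can be sketched as a simple column filter. Singular value decomposition is not shown here. The threshold `min_docs`, the function name, and the toy matrix are assumptions for illustration only.

```python
def prune_rare_terms(terms, tdm, min_docs=2):
    """Keep only terms that occur in at least `min_docs` documents."""
    keep = [
        j for j in range(len(terms))
        # Document frequency: number of rows with a non-zero cell in column j.
        if sum(1 for row in tdm.values() if row[j] > 0) >= min_docs
    ]
    kept_terms = [terms[j] for j in keep]
    kept_tdm = {doc: [row[j] for j in keep] for doc, row in tdm.items()}
    return kept_terms, kept_tdm

# Hypothetical matrix: rows are documents, columns follow `terms`.
terms = ["data", "mining", "text", "web"]
tdm = {"doc1": [0, 1, 2, 0], "doc2": [2, 1, 0, 0], "doc3": [1, 1, 0, 2]}

kept_terms, kept_tdm = prune_rare_terms(terms, tdm)
print(kept_terms)
print(kept_tdm)
```

Here "text" and "web" each appear in only one document and are dropped, halving the number of columns. Whether dropping such terms is appropriate depends on the task, since rare terms can also be the most discriminating ones.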
Text Mining Process
Text Mining Application
(research trend identification in literature)
Mining the published IS literature
Text Mining Application
(research trend identification in literature)
Journal: MISQ
Year: 2005
Author(s): A Malhotra, S Gosain and O A El Sawy
Title: Absorptive capacity configurations in supply chains: Gearing for partner-enabled market knowledge creation
Vol/No: 29/1
Pages: 145-187
Keywords: knowledge management; supply chain; absorptive capacity; interorganizational information systems; configuration approaches
Abstract: The need for continual value innovation is driving supply chains to evolve from a pure transactional focus to leveraging interorganizational partnerships for sharing …

Journal: ISR
Year: 1999
Author(s): D Robey and M C Boudreau
Title: Accounting for the contradictory organizational consequences of information technology: Theoretical directions and methodological implications
Vol/No: 10/2
Pages: 167-185
Keywords: organizational transformation; impacts of technology; organization theory; research methodology; intraorganizational power; electronic communication; mis implementation; culture; systems
Abstract: Although much contemporary thought considers advanced information technologies as either determinants or enablers of radical organizational change, empirical studies have revealed inconsistent findings to support the deterministic logic implicit in such arguments. This paper reviews the contradictory …

Journal: JMIS
Year: 2001
Author(s): R Aron and E K Clemons
Title: Achieving the optimal balance between investment in quality and investment in self-promotion for information products
Vol/No: 18/2
Pages: 65-88
Keywords: information products; internet advertising; product positioning; signaling; signaling games
Abstract: When producers of goods (or services) are confronted by a situation in which their offerings no longer perfectly match consumer preferences, they must determine the extent to which the advertised features of …
Text Mining Application
(research trend identification in literature)
Text Mining Application
(research trend identification in literature)
Text Mining Tools
Commercial Software Tools
Web Mining Overview
Web Mining
Web mining (or Web data mining) is the
process of discovering intrinsic relationships from Web data (textual, linkage, or usage)
Web Mining
Web Structure Mining
Source: the uniform resource locator (URL) links contained in the Web pages
Web Content Mining
Source: unstructured
textual content of the
Web pages (usually in
HTML format)
Web Usage Mining
Source: the detailed description of a Web site’s visits (sequence
of clicks by sessions)
Web Content/Structure Mining
Mining of the textual content on the Web
Data collection via Web crawlers
Web pages include hyperlinks
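Because Web pages include hyperlinks, a crawler that collects content can also collect the link structure. The sketch below pulls the `href` targets out of one page using Python's standard `html.parser` module. The HTML snippet and the `LinkExtractor` class name are hypothetical, and a real crawler would also fetch pages, resolve relative URLs, and respect robots.txt.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page content standing in for a crawled document.
page = '<p>See <a href="http://example.com/a">A</a> and <a href="/b">B</a>.</p>'

extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)
```

The extracted links are the raw material for structure mining (which pages point to which), while the text between the tags feeds content mining.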
Web Usage Mining
through Web page visits and transactions
agent logs, and client-side cookies
attributes, and usage data
Web Usage Mining
based on user access patterns
rules and users' profiles
their interests and profiles
Web Usage Mining
(clickstream analysis)
Weblogs
Collecting Merging Cleaning Structuring
How to better the data
How to improve the Web site
How to increase the customer value
User /
Customer
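The "Structuring" stage of clickstream analysis typically groups raw log records into per-user sessions. The sketch below assumes a simplified log of (user, timestamp in seconds, URL) tuples and a 30-minute inactivity timeout. Both are illustrative choices, not something the slides specify.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # assumed inactivity gap, in seconds

# Hypothetical, already-cleaned log records: (user, timestamp, url).
log = [
    ("u1", 0, "/home"),
    ("u2", 10, "/home"),
    ("u1", 60, "/products"),
    ("u1", 4000, "/home"),
    ("u2", 50, "/cart"),
]

def sessionize(records, timeout=SESSION_TIMEOUT):
    """Group clicks into (user, [urls]) sessions split on long inactivity."""
    by_user = defaultdict(list)
    for user, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, url))

    sessions = []
    for user, clicks in by_user.items():
        current = [clicks[0][1]]
        # Compare each click with the one before it.
        for (prev_ts, _), (ts, url) in zip(clicks, clicks[1:]):
            if ts - prev_ts > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions

sessions = sessionize(log)
for user, path in sessions:
    print(user, " -> ".join(path))
```

Each session is an ordered sequence of page visits, which is the form needed for mining access patterns, navigation paths, and user profiles.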
Web Mining Success Stories
Web Analytics Voice of Customer Customer Experience Management
Customer Interaction on the Web
Analysis of Interactions
Knowledge about the Holistic View of the Customer
Web Mining Tools
End of the Chapter
Questions, comments