Page 1

NLP in Scala with Breeze and Epic

David Hall

UC Berkeley

Page 2

• Structured Prediction

• Super-fast GPU parser for English


Page 4

Some fruit visionaries say the Fuji could someday tumble the Red Delicious from the top of America's apple heap.

It certainly won’t get there on looks.

Natural Language Processing

Page 6

Named Entity Recognition

Page 7

NER with Epic

> import epic.models.NerSelector
> val nerModel = NerSelector.loadNer("en").get
> val tokens = epic.preprocess.tokenize("Almost 20 years ago, Bill Watterson walked away from \"Calvin &

Page 8

Annotate a bunch of data?

Page 9

Building an NER system

> val data: IndexedSeq[Segmentation[Label, String]] = ???
> val system = SemiCRF.buildSimple(data, startLabel, outsideLabel)
> println(system.bestSequence(tokens).render("O"))
Almost 20 years ago , [PER:Bill Watterson] walked away from `` [MISC:Calvin & Hobbes] ''

Page 10

http://en.wikipedia.org/wiki/List_of_newspaper_comic_strips_A%E2%80%93F

Page 11

Gazetteers

Page 12

Using your own gazetteer

> val data: IndexedSeq[Segmentation[Label, String]] = ???
> val myGazetteer = ???
> val system = SemiCRF.buildSimple(data, startLabel, outsideLabel, gaz = myGazetteer)

Page 13

• Careful with gazetteers!

• If it is built from the training data, the system will rely on it, and only it, to make predictions!

• So only forms already in the gazetteer will be detected.

• Still, can be very useful…
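To make the caveat above concrete, here is a minimal sketch of what a gazetteer lookup does, written in plain Scala rather than against Epic's API. The `GazetteerSketch` object, its `Gazetteer` type, and the greedy longest-match strategy are all assumptions made for illustration; Epic's own gazetteer support may work differently.

```scala
object GazetteerSketch {
  // Hypothetical representation: a known token sequence mapped to its entity type.
  type Gazetteer = Map[IndexedSeq[String], String]

  /** Returns (begin, end, label) for every greedy longest match; end is exclusive. */
  def matches(tokens: IndexedSeq[String], gaz: Gazetteer, maxLen: Int = 4): List[(Int, Int, String)] = {
    val found = List.newBuilder[(Int, Int, String)]
    var begin = 0
    while (begin < tokens.length) {
      // Try the longest candidate span first, then shrink it one token at a time.
      val longest = math.min(maxLen, tokens.length - begin)
      val hit = (longest to 1 by -1).iterator
        .map(len => (begin + len, gaz.get(tokens.slice(begin, begin + len))))
        .collectFirst { case (end, Some(label)) => (begin, end, label) }
      hit match {
        case Some(m) => found += m; begin = m._2 // skip past the matched span
        case None    => begin += 1               // no entry starts here
      }
    }
    found.result()
  }
}
```

Only spans that appear verbatim in the map are ever returned, which is exactly why a system that leans on a gazetteer alone cannot find unseen names.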

Page 14

• Semi-Markov Conditional Random Field

• Don’t worry about the name.

Page 15

Semi-CRFs

Page 18

Building your own features

val dsl = new WordFeaturizer.DSL[L](counts) with SurfaceFeaturizer.DSL
import dsl._

word(begin) +          // word at the beginning of the span
  word(end - 1) +      // end of the span
  word(begin - 1) +    // before (gets things like Mr.)
  word(end) +          // one past the end
  prefixes(begin) +    // prefixes up to some length
  suffixes(begin) +
  length(begin, end) + // names tend to be 1-3 words
  gazetteer(begin, end)
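As a rough picture of what these templates produce for one candidate span, the following plain-Scala sketch emits string features for the span [begin, end). It does not use Epic's DSL: `SpanFeatureSketch`, the feature-name strings, and the `#` boundary symbol are hypothetical choices for illustration, and the gazetteer template is left out.

```scala
object SpanFeatureSketch {
  /** String-valued features for the span [begin, end) of `words`. */
  def features(words: IndexedSeq[String], begin: Int, end: Int, maxAffix: Int = 3): Seq[String] = {
    // Positions outside the sentence map to a boundary symbol instead of throwing.
    def wordAt(i: Int): String = if (i >= 0 && i < words.length) words(i) else "#"
    val first = wordAt(begin)
    val affixLens = 1 to math.min(maxAffix, first.length)
    Seq(
      s"word(begin)=${wordAt(begin)}",
      s"word(end-1)=${wordAt(end - 1)}",
      s"word(begin-1)=${wordAt(begin - 1)}", // picks up cues such as "Mr."
      s"word(end)=${wordAt(end)}",
      s"length=${end - begin}"
    ) ++ affixLens.map(n => s"prefix=${first.take(n)}") ++
      affixLens.map(n => s"suffix=${first.takeRight(n)}")
  }
}
```

For example, `SpanFeatureSketch.features(Vector("Mr.", "Bill", "Watterson", "walked"), 1, 3)` describes the two-word span "Bill Watterson" together with its left and right context.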

Page 19

Using your own featurizer

> val data: IndexedSeq[Segmentation[Label, String]] = ???
> val myFeaturizer = ???
> val system = SemiCRF.buildSimple(data, startLabel, outsideLabel, featurizer = myFeaturizer)

Page 21

Machine Learning Primer

Page 22

Machine Learning Primer

score(x, y) = w^T f(x, y)

Page 23

Machine Learning Primer

score(x, y) = w.t * f(x, y)

Page 24

Machine Learning Primer

score(x, y) = w dot f(x, y)

Page 25

Machine Learning Primer

score(x, y) >= score(x, y’)

Page 26

Machine Learning Primer

w dot f(x, y) >= w dot f(x, y’)

Page 27

Machine Learning Primer

w dot f(x, y) >= w dot f(x, y’)
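The scoring rule above can be sketched in plain Scala with sparse feature maps standing in for Breeze vectors. The toy feature function `f`, its feature names, and the weights in the usage note are invented for illustration only.

```scala
object LinearScoreSketch {
  type Features = Map[String, Double]

  // Toy joint feature function f(x, y): pairs a property of the word with the candidate label.
  def f(x: String, y: String): Features = Map(
    s"capitalized=${x.headOption.exists(_.isUpper)}&label=$y" -> 1.0,
    s"bias&label=$y" -> 1.0
  )

  // score(x, y) = w dot f(x, y); features missing from w contribute zero.
  def score(w: Map[String, Double], x: String, y: String): Double =
    f(x, y).iterator.map { case (k, v) => w.getOrElse(k, 0.0) * v }.sum

  // The predicted label is the one whose score is at least as high as every other label's.
  def predict(w: Map[String, Double], x: String, labels: Seq[String]): String =
    labels.maxBy(y => score(w, x, y))
}
```

With hand-set weights such as `Map("capitalized=true&label=PER" -> 2.0, "bias&label=O" -> 0.5)`, the word "Bill" scores higher as PER than as O, which is the inequality the slides ask the learned weights to satisfy for the correct label.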

Page 28

Machine Learning Primer

Page 29

The Perceptron

> val featureIndex: Index[Feature] = ???
> val labelIndex: Index[Label] = ???
> val weights = DenseVector.rand[Double](featureIndex.size)
> for (epoch <- 0 until numEpochs; (x, y) <- data) {
    val labelScores = DV.tabulate(labelIndex.size) { yy =>
      val features = featuresFor(x, yy)
      weights.t * new FeatureVector(features)
      // or weights dot new FeatureVector(features)
    }
  }

Page 30

The Perceptron (cont’d)

> val featureIndex: Index[Feature] = ???
> val labelIndex: Index[Label] = ???
> val weights = DenseVector.rand[Double](featureIndex.size)
> for (epoch <- 0 until numEpochs; (x, y) <- data) {
    val labelScores = ??? // computed as on the previous slide
    val y_best = argmax(labelScores)
    if (y != y_best) {
      // Move toward the gold label's features and away from the predicted label's.
      weights += new FeatureVector(featuresFor(x, y))
      weights -= new FeatureVector(featuresFor(x, y_best))
    }
  }
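For readers who want to run the update rule without Breeze or Epic on the classpath, here is a self-contained sketch of the same mistake-driven loop using sparse maps. It differs from the slide in two labeled ways: weights start at zero rather than random values, and `PerceptronSketch`, `train`, and the string-keyed features are hypothetical names chosen for illustration.

```scala
object PerceptronSketch {
  type Features = Map[String, Double]

  // Dot product between a sparse weight map and a sparse feature map.
  def score(w: collection.Map[String, Double], feats: Features): Double =
    feats.iterator.map { case (k, v) => w.getOrElse(k, 0.0) * v }.sum

  /** Trains on (input, gold label) pairs; featuresFor plays the role of f(x, y). */
  def train[X](data: Seq[(X, String)], labels: Seq[String], numEpochs: Int)(
      featuresFor: (X, String) => Features): Map[String, Double] = {
    val w = collection.mutable.Map.empty[String, Double].withDefaultValue(0.0)
    for (_ <- 0 until numEpochs; (x, y) <- data) {
      val yBest = labels.maxBy(yy => score(w, featuresFor(x, yy)))
      if (yBest != y) {
        // Mistake: add the gold label's features, subtract the predicted label's.
        for ((k, v) <- featuresFor(x, y)) w(k) += v
        for ((k, v) <- featuresFor(x, yBest)) w(k) -= v
      }
    }
    w.toMap
  }
}
```

Nothing changes when the prediction is already correct, so on separable data the loop stops updating once every training example is classified correctly.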

Page 31

Structured Perceptron

• Can’t enumerate all segmentations! (L * 2^n)

• But dynamic programs exist to max or sum

• … if the feature function has a nice form.
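One such dynamic program can be sketched as follows. It assumes the "nice form" is a score that decomposes into a sum over segments, and it simplifies further by ignoring label-to-label transitions, which a full semi-Markov model would normally include. `SemiMarkovSketch`, `Segment`, and `segScore` are hypothetical names, and the sketch expects at least one finite-scoring segmentation to exist.

```scala
object SemiMarkovSketch {
  final case class Segment(begin: Int, end: Int, label: String)

  /**
   * Finds the highest-scoring segmentation of positions [0, n) when the total score is a
   * sum of per-segment scores (no dependence on the previous segment's label).
   */
  def bestSegmentation(n: Int, labels: Seq[String], maxLen: Int)(
      segScore: (Int, Int, String) => Double): List[Segment] = {
    val best = Array.fill(n + 1)(Double.NegativeInfinity) // best(e): best score covering [0, e)
    val back = new Array[Segment](n + 1)                  // back(e): last segment of that solution
    best(0) = 0.0
    for (end <- 1 to n; begin <- math.max(0, end - maxLen) until end; label <- labels) {
      val s = best(begin) + segScore(begin, end, label)
      if (s > best(end)) { best(end) = s; back(end) = Segment(begin, end, label) }
    }
    // Follow the back-pointers from the end of the sentence to recover the segments.
    var segments = List.empty[Segment]
    var e = n
    while (e > 0) { segments = back(e) :: segments; e = back(e).begin }
    segments
  }
}
```

The table has n + 1 entries and each is filled by looking at no more than maxLen start positions and L labels, so the work is proportional to n * maxLen * L instead of growing exponentially with sentence length. Replacing the max with a log-sum gives the "sum" variant mentioned above.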

Page 32

Structured Perceptron

> val featureIndex: Index[Feature] = ???
> val labelIndex: Index[Label] = ???
> val weights = DenseVector.rand[Double](featureIndex.size)
> for (epoch <- 0 until numEpochs; (x, y) <- data) {
    val y_best = bestStructure(weights, x)

Page 34

Multilingual Parser

Berkeley Epic

Berkeley [Petrov & Klein, 2007]; Epic [Hall, Durrett, and Klein, 2014]

Page 35

Epic Pre-built Models

• Parsing

– English, Basque, French, German, Swedish, Polish, Korean

– (working on Arabic, Chinese, Spanish)

• Part-of-Speech Tagging

– English, Basque, French, German, Swedish, Polish

• Named Entity Recognition

– English

• Sentence segmentation

– English

– (ok support for others)

• Tokenization

– Above languages

Page 37

What Is Breeze?

Dense Vectors, Matrices, Sparse Vectors,

Counters, Matrix Decompositions

Page 38

What Is Breeze?

Nonlinear Optimization, Probability Distributions

Page 39

scalaVersion := "2.11.1"

Page 41

Return Type Selection

Page 42

Return Type Selection

Dynamic: Sparse

Page 43

Linear Algebra: Slices

Page 44

Linear Algebra: Slices

Page 50

UFuncs: Implementation

> object log extends UFunc
> implicit object logDouble extends log.Impl[Double, Double] {
    def apply(x: Double) = scala.math.log(x)
  }

Page 51

UFuncs: Implementation

> object add1 extends UFunc
> implicit object add1Double extends add1.Impl[Double, Double] {
    def apply(x: Double) = x + 1.0
  }
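The pattern behind these definitions is a type class keyed on the function object itself. The sketch below imitates that idea in plain Scala so it runs without Breeze; `UFuncSketch`, `ElementwiseFunc`, and this three-parameter `Impl` are stand-ins invented for illustration and should not be read as Breeze's actual traits.

```scala
object UFuncSketch {
  // Type class: how to apply the function object F to a value of type V, yielding R.
  trait Impl[F, V, R] { def apply(v: V): R }

  // Base trait for function objects: the Impl is selected at compile time from the argument type.
  trait ElementwiseFunc {
    def apply[V, R](v: V)(implicit impl: Impl[this.type, V, R]): R = impl(v)
  }

  object add1 extends ElementwiseFunc {
    // Scalar case, analogous to add1Double on the slide.
    implicit val onDouble: Impl[add1.type, Double, Double] =
      new Impl[add1.type, Double, Double] { def apply(x: Double): Double = x + 1.0 }

    // Lift any scalar implementation to vectors by mapping over the elements.
    implicit def onVector[V, R](
        implicit scalar: Impl[add1.type, V, R]): Impl[add1.type, Vector[V], Vector[R]] =
      new Impl[add1.type, Vector[V], Vector[R]] {
        def apply(xs: Vector[V]): Vector[R] = xs.map(scalar(_))
      }
  }
}
```

With this in place, `UFuncSketch.add1(2.0)` and `UFuncSketch.add1(Vector(1.0, 2.0))` both compile, while `UFuncSketch.add1("two")` is rejected at compile time because no implementation exists for String.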

Page 55

Nonnegative Matrix Factorization

V ≈ W * H

V_ij, W_ij, H_ij >= 0

Page 56

Nonnegative Matrix Factorization

• Input: matrix V

• Output: W * H ≈ V, W and H ≥ 0

[Lee and Seung, 2011]
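One common way to compute such a factorization is with multiplicative updates for the squared-error objective, usually attributed to Lee and Seung. The sketch below is written with plain nested arrays instead of Breeze matrices and may not match the talk's implementation; the random initialization, the fixed iteration count, and the small `eps` guard are illustrative assumptions.

```scala
object NmfSketch {
  type Matrix = Array[Array[Double]]

  def mul(a: Matrix, b: Matrix): Matrix =
    Array.tabulate(a.length, b(0).length) { (i, j) =>
      b.indices.map(k => a(i)(k) * b(k)(j)).sum
    }

  def transpose(a: Matrix): Matrix =
    Array.tabulate(a(0).length, a.length)((i, j) => a(j)(i))

  // Elementwise m * num / den; eps keeps the denominator away from zero.
  private def scaled(m: Matrix, num: Matrix, den: Matrix, eps: Double = 1e-9): Matrix =
    Array.tabulate(m.length, m(0).length)((i, j) => m(i)(j) * num(i)(j) / (den(i)(j) + eps))

  /** Squared Frobenius norm of v - w * h. */
  def error(v: Matrix, w: Matrix, h: Matrix): Double = {
    val wh = mul(w, h)
    (for (i <- v.indices; j <- v(0).indices) yield math.pow(v(i)(j) - wh(i)(j), 2)).sum
  }

  def factorize(v: Matrix, rank: Int, iterations: Int, seed: Long = 0L): (Matrix, Matrix) = {
    val rng = new scala.util.Random(seed)
    // Strictly positive starting values, so every later update stays nonnegative.
    var w: Matrix = Array.fill(v.length, rank)(rng.nextDouble() + 0.1)
    var h: Matrix = Array.fill(rank, v(0).length)(rng.nextDouble() + 0.1)
    for (_ <- 0 until iterations) {
      val wt = transpose(w)
      h = scaled(h, mul(wt, v), mul(mul(wt, w), h)) // H <- H .* (W^T V) ./ (W^T W H)
      val ht = transpose(h)
      w = scaled(w, mul(v, ht), mul(w, mul(h, ht))) // W <- W .* (V H^T) ./ (W H H^T)
    }
    (w, h)
  }
}
```

Because each update only multiplies by ratios of nonnegative quantities, W and H never leave the nonnegative orthant as long as V is nonnegative, so the constraint needs no separate projection step.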

Page 57

(W, H)

Page 58

From Breeze to Gust

+

Page 59

(W, H)

Page 61

(x^2 + y - 11)^2 + (x + y^2 - 7)^2

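Page 63

val df = new DiffFunction[DV[Double]] {
  def calculate(values: DV[Double]) = {
    val (x, y) = (values(0), values(1))
    val a = x * x + y - 11 // first residual
    val b = x + y * y - 7  // second residual
    val value = a * a + b * b
    // Partial derivatives of the objective with respect to x and y.
    val gradient = DV(4 * x * a + 2 * b, 2 * a + 4 * y * b)
    (value, gradient)
  }
}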

Page 64

breeze.optimize.minimize(df, DV(0.0, 0.0))

// DenseVector(3.0000000000012905, 1.999999999997128)
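The objective minimized here is Himmelblau's function, (x^2 + y - 11)^2 + (x + y^2 - 7)^2, which is zero at (3, 2). To show what the optimizer needs from the caller, the sketch below evaluates the value and analytic gradient in plain Scala and runs fixed-step gradient descent from the same starting point. This is a deliberately simpler method than whatever `breeze.optimize.minimize` uses internally, and the step size and iteration count are assumptions picked for this function.

```scala
object HimmelblauSketch {
  /** Returns the function value and its gradient (df/dx, df/dy) at (x, y). */
  def calculate(x: Double, y: Double): (Double, (Double, Double)) = {
    val a = x * x + y - 11 // first residual
    val b = x + y * y - 7  // second residual
    (a * a + b * b, (4 * x * a + 2 * b, 2 * a + 4 * y * b))
  }

  /** Fixed-step gradient descent; stepSize and iterations are illustrative choices. */
  def descend(start: (Double, Double), stepSize: Double = 0.01, iterations: Int = 2000): (Double, Double) =
    (0 until iterations).foldLeft(start) { case ((x, y), _) =>
      val (_, (gx, gy)) = calculate(x, y)
      (x - stepSize * gx, y - stepSize * gy) // step against the gradient
    }
}
```

Calling `HimmelblauSketch.descend((0.0, 0.0))` moves toward the same minimum near (3, 2). The function has other minima as well, so a different starting point can end somewhere else.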

Page 66

www.github.com/dlwh/{breeze, epic, puck}

Posted: 24/10/2014, 13:47
