NLP in Scala with Breeze and Epic
David Hall
UC Berkeley
• Structured Prediction
• Super-fast GPU parser for English
Some fruit visionaries say the Fuji could someday tumble the Red Delicious from the top of America's apple heap.
It certainly won’t get there on looks.
Natural Language Processing
Named Entity Recognition
NER with Epic
> import epic.models.NerSelector
> val nerModel = NerSelector.loadNer("en").get
> val tokens = epic.preprocess.tokenize("Almost 20 years ago, Bill Watterson walked away from \"Calvin &
Annotate a bunch of data?
Building an NER system
> val data: IndexedSeq[Segmentation[Label, String]]
= ???
> val system = SemiCRF.buildSimple(data,
startLabel, outsideLabel)
> println(system.bestSequence(tokens).render("O"))
Almost 20 years ago , [PER:Bill Watterson] walked away from `` [MISC:Calvin & Hobbes] ''
http://en.wikipedia.org/wiki/List_of_newspaper_comic_strips_A%E2%80%93F
Gazetteers
Using your own gazetteer
> val data: IndexedSeq[Segmentation[Label, String]]
= ???
> val myGazetteer = ???
> val system = SemiCRF.buildSimple(data,
startLabel, outsideLabel, gaz = myGazetteer)
• Careful with gazetteers!
• If the gazetteer is built from the training data, the system will rely on it, and only it, to make predictions!
• So, only known forms will be detected.
• Still, can be very useful…
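The caveat above can be made concrete with a small sketch. It assumes a gazetteer is simply a set of known token sequences (a hypothetical representation, not Epic's own gazetteer type) and looks for exact matches using plain Scala collections.

```scala
object GazetteerSketch {
  // Hypothetical gazetteer: a set of known multi-token names.
  val gazetteer: Set[Seq[String]] = Set(
    Seq("Bill", "Watterson"),
    Seq("Calvin", "&", "Hobbes")
  )

  // Longest entry, so we never test spans longer than any known name.
  val maxLen: Int = gazetteer.map(_.length).max

  /** All (begin, end) spans, end exclusive, whose tokens exactly match an entry. */
  def matches(tokens: IndexedSeq[String]): Seq[(Int, Int)] =
    for {
      begin <- tokens.indices
      end <- (begin + 1) to math.min(tokens.length, begin + maxLen)
      if gazetteer.contains(tokens.slice(begin, end))
    } yield (begin, end)
}
```

With these entries, "Bill Watterson" is found wherever it occurs, while an unseen form such as "William Watterson" produces no match. A system that leans only on such a feature would therefore miss every name outside the list.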
• Semi-Markov Conditional Random Field
• Don’t worry about the name.
Semi-CRFs
Building your own features
val dsl = new WordFeaturizer.DSL[L](counts) with SurfaceFeaturizer.DSL
import dsl._
word(begin) // word at the beginning of the span
+ word(end - 1) // end of the span
+ word(begin - 1) // before (gets things like Mr.)
+ word(end) // one past the end
+ prefixes(begin) // prefixes up to some length
+ suffixes(begin)
+ length(begin, end) // names tend to be 1-3 words
+ gazetteer(begin, end)
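To show what such a featurizer might compute for one span, here is a minimal sketch in plain Scala. It does not use Epic's DSL: `wordAt`, `spanFeatures`, the boundary markers, and the fixed affix length of 3 are all assumptions made for illustration, and the gazetteer feature is left out.

```scala
object SpanFeaturesSketch {
  /** Token at position i, or a boundary marker when i falls outside the sentence. */
  def wordAt(tokens: IndexedSeq[String], i: Int): String =
    if (i < 0) "<s>" else if (i >= tokens.length) "</s>" else tokens(i)

  /** String-valued features for the span [begin, end), loosely mirroring the DSL terms. */
  def spanFeatures(tokens: IndexedSeq[String], begin: Int, end: Int): Seq[String] = {
    val first = wordAt(tokens, begin)
    Seq(
      s"first=$first",                      // word(begin)
      s"last=${wordAt(tokens, end - 1)}",   // word(end - 1)
      s"prev=${wordAt(tokens, begin - 1)}", // word(begin - 1)
      s"next=${wordAt(tokens, end)}",       // word(end)
      s"prefix=${first.take(3)}",           // one prefix of the first word
      s"suffix=${first.takeRight(3)}",      // one suffix of the first word
      s"length=${end - begin}"              // span length in tokens
    )
  }
}
```

For the tokens `Mr. Bill Watterson walked` and the span covering "Bill Watterson", the `prev=Mr.` feature is the kind of contextual cue the comment on `word(begin - 1)` refers to.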
Using your own featurizer
> val data: IndexedSeq[Segmentation[Label, String]]
= ???
> val myFeaturizer = ???
> val system = SemiCRF.buildSimple(data,
startLabel, outsideLabel, featurizer = myFeaturizer)
Machine Learning Primer
score(x, y) = w^T f(x, y)
score(x, y) = w.t * f(x, y)
score(x, y) = w dot f(x, y)
score(x, y) >= score(x, y')
w dot f(x, y) >= w dot f(x, y')
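Read as a training goal, the inequality asks that the gold output y score at least as high as every alternative y'. As a sketch of how a mistake-driven learner can move w toward that goal (this is the standard perceptron rule; the symbols \hat{y} for the current best guess and w' for the updated weights are notation introduced here):

```latex
\hat{y} = \arg\max_{y'} \, w^\top f(x, y'), \qquad
\text{if } \hat{y} \neq y:\quad w' = w + f(x, y) - f(x, \hat{y})

w'^\top \bigl( f(x, y) - f(x, \hat{y}) \bigr)
  = w^\top \bigl( f(x, y) - f(x, \hat{y}) \bigr)
  + \lVert f(x, y) - f(x, \hat{y}) \rVert^2
```

Each update therefore raises the score of y relative to the mistaken guess by the squared distance between their feature vectors, although several passes over the data may be needed before the inequality holds for every example.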
The Perceptron
> val featureIndex : Index[Feature] = ???
> val labelIndex: Index[Label] = ???
> val weights = DenseVector.rand[Double](featureIndex.size)
> for ( epoch <- 0 until numEpochs; (x, y) <- data ) {
    val labelScores = DV.tabulate(labelIndex.size) { yy =>
      val features = featuresFor(x, yy)
      weights.t * new FeatureVector(features)
      // or weights dot new FeatureVector(features)
    }
…
}
The Perceptron (cont’d)
> val featureIndex : Index[Feature] = ???
> val labelIndex: Index[Label] = ???
> val weights = DenseVector.rand[Double](featureIndex.size)
> for ( epoch <- 0 until numEpochs; (x, y) <- data ) {
    val labelScores = … // as computed on the previous slide
    val y_best = argmax(labelScores)
    if (y != y_best) {
      weights += new FeatureVector(featuresFor(x, y))
      weights -= new FeatureVector(featuresFor(x, y_best))
    }
  }
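The two perceptron slides can be condensed into one small program. This is a sketch under stated assumptions, not the Breeze version: it uses plain arrays instead of `DenseVector` and `FeatureVector`, starts from zero weights rather than `DenseVector.rand` so the run is deterministic, and the feature names, labels, and toy data are invented for illustration. It also assumes every feature name in the data appears in `featureNames`.

```scala
object PerceptronSketch {
  val labels = Vector("O", "PER")
  val featureNames = Vector("capitalized", "lowercase", "after=Mr.")

  // Joint feature index for (input feature, label): one weight per pair.
  def indicesFor(x: Seq[String], y: Int): Seq[Int] =
    x.map(name => featureNames.indexOf(name) * labels.size + y)

  // score(x, y) = w dot f(x, y), where f(x, y) is a 0/1 indicator vector.
  def score(w: Array[Double], x: Seq[String], y: Int): Double =
    indicesFor(x, y).map(i => w(i)).sum

  // argmax over labels; ties go to the lowest label index.
  def predict(w: Array[Double], x: Seq[String]): Int = {
    var best = 0
    for (y <- 1 until labels.size) if (score(w, x, y) > score(w, x, best)) best = y
    best
  }

  def train(data: Seq[(Seq[String], Int)], numEpochs: Int): Array[Double] = {
    val w = new Array[Double](featureNames.size * labels.size) // all zeros
    for (_ <- 0 until numEpochs; (x, y) <- data) {
      val yBest = predict(w, x)
      if (y != yBest) {
        indicesFor(x, y).foreach(i => w(i) += 1.0)     // add the gold label's features
        indicesFor(x, yBest).foreach(i => w(i) -= 1.0) // subtract the wrong guess's features
      }
    }
    w
  }

  // Toy data: active feature names paired with the index of the gold label.
  val data: Seq[(Seq[String], Int)] = Seq(
    (Seq("capitalized", "after=Mr."), 1),
    (Seq("lowercase"), 0),
    (Seq("capitalized"), 1)
  )
}
```

Calling `PerceptronSketch.train(PerceptronSketch.data, 2)` makes a single correction on the first example, after which all three toy examples are classified correctly.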
Structured Perceptron
• Can’t enumerate all segmentations! (L * 2^n)
• But dynamic programs exist to max or sum
• … if the feature function has a nice form.
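As an illustration of the "max" dynamic program, the sketch below finds the best labeled segmentation without enumerating them all. It assumes the "nice form" means the total score is a sum of independent per-segment scores, and it caps segment length at `maxLen`. The names `Segment`, `bestSegmentation`, and `segScore` are hypothetical, and this is not Epic's implementation. It further assumes `labels` is non-empty, `maxLen` is at least 1, and all scores are finite.

```scala
object SegmentViterbiSketch {
  final case class Segment(begin: Int, end: Int, label: String) // end is exclusive

  /** Highest-scoring segmentation of positions [0, n) under an additive segment score. */
  def bestSegmentation(
      n: Int,
      labels: Seq[String],
      maxLen: Int,
      segScore: (Int, Int, String) => Double
  ): (Double, List[Segment]) = {
    val best = Array.fill(n + 1)(Double.NegativeInfinity) // best(e): best score covering [0, e)
    val back = new Array[Segment](n + 1)                   // last segment of that best prefix
    best(0) = 0.0
    for (end <- 1 to n; begin <- math.max(0, end - maxLen) until end; label <- labels) {
      val s = best(begin) + segScore(begin, end, label)
      if (s > best(end)) { best(end) = s; back(end) = Segment(begin, end, label) }
    }
    // Follow back-pointers from the end of the sentence to recover the segments.
    var segments = List.empty[Segment]
    var pos = n
    while (pos > 0) { segments = back(pos) :: segments; pos = back(pos).begin }
    (best(n), segments)
  }
}
```

The loop does on the order of n × maxLen × L score evaluations, compared with the exponential number of segmentations noted in the first bullet. In a structured perceptron, a routine of this shape would play the role of `bestStructure(weights, x)`, with `segScore` computed from the weights and the span's features.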
Structured Perceptron
> val featureIndex : Index[Feature] = ???
> val labelIndex: Index[Label] = ???
> val weights = DenseVector.rand[Double](featureIndex.size)
> for ( epoch <- 0 until numEpochs; (x, y) <- data ) {
val y_best = bestStructure(weights, x)
Multilingual Parser
Berkeley Epic
Berkeley: [Petrov & Klein, 2007]; Epic: [Hall, Durrett, and Klein, 2014]
Epic Pre-built Models
• Parsing
– English, Basque, French, German, Swedish, Polish, Korean
– (working on Arabic, Chinese, Spanish)
• Part-of-Speech Tagging
– English, Basque, French, German, Swedish, Polish
• Named Entity Recognition
– English
• Sentence segmentation
– English
– (OK support for others)
• Tokenization
– Above languages
What Is Breeze?
Dense Vectors, Matrices, Sparse Vectors, Counters, Matrix Decompositions
Nonlinear Optimization, Probability Distributions
scalaVersion := "2.11.1"
Return Type Selection
Dynamic: Sparse
Linear Algebra: Slices
UFuncs: Implementation
> object log extends UFunc
> implicit object logDouble extends log.Impl[Double, Double] {
    def apply(x: Double) = scala.math.log(x)
  }
> object add1 extends UFunc
> implicit object add1Double extends add1.Impl[Double, Double] {
    def apply(x: Double) = x + 1.0
  }
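The pattern behind these definitions is a type class: the function object looks up an implicit `Impl` for the argument type. The sketch below is a stripped-down stand-in written for illustration. `MiniUFunc` and `seqImpl` are hypothetical names, not Breeze's actual `UFunc` trait, and it lifts over `Seq` where Breeze lifts over its vector and matrix types.

```scala
object UFuncSketch {
  // Hypothetical, simplified stand-in for the UFunc idea; not Breeze's real trait.
  trait MiniUFunc {
    trait Impl[V, R] { def apply(v: V): R }

    // Calling the function object resolves an implementation for the argument type.
    def apply[V, R](v: V)(implicit impl: Impl[V, R]): R = impl(v)

    // Once a scalar implementation exists, lift it element-wise over sequences.
    implicit def seqImpl[V, R](implicit scalar: Impl[V, R]): Impl[Seq[V], Seq[R]] =
      new Impl[Seq[V], Seq[R]] { def apply(vs: Seq[V]): Seq[R] = vs.map(v => scalar(v)) }
  }

  object add1 extends MiniUFunc {
    implicit object add1Double extends Impl[Double, Double] {
      def apply(x: Double): Double = x + 1.0
    }
  }
}
```

Here `UFuncSketch.add1(2.0)` picks `add1Double` directly, and `UFuncSketch.add1(Seq(1.0, 2.0))` picks the lifted version, so one scalar definition serves both calls. The implicits are found because they are members of the `add1` object that prefixes the `Impl` type.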
Nonnegative Matrix Factorization
V ≈ W H
V_ij, W_ij, H_ij ≥ 0
• Input: matrix V
• Output: W * H ≈ V, W and H ≥ 0
[Lee and Seung, 2011]
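One common way to compute such a factorization is the multiplicative update rule associated with Lee and Seung. The sketch below implements it with plain nested arrays rather than Breeze matrices, so it runs without dependencies. The helper names, the seeded random initialization, and the small constant added to each denominator are assumptions of this sketch, and it expects a non-empty, rectangular, nonnegative V.

```scala
object NmfSketch {
  type Matrix = Array[Array[Double]]

  def multiply(a: Matrix, b: Matrix): Matrix =
    Array.tabulate(a.length, b(0).length) { (i, j) =>
      b.indices.map(k => a(i)(k) * b(k)(j)).sum
    }

  def transpose(a: Matrix): Matrix =
    Array.tabulate(a(0).length, a.length)((i, j) => a(j)(i))

  /** Squared Frobenius norm of V - W H. */
  def error(v: Matrix, w: Matrix, h: Matrix): Double = {
    val wh = multiply(w, h)
    (for (i <- v.indices; j <- v(0).indices) yield math.pow(v(i)(j) - wh(i)(j), 2)).sum
  }

  // Element-wise a * num / (den + eps); eps guards against division by zero.
  private def scale(a: Matrix, num: Matrix, den: Matrix): Matrix =
    Array.tabulate(a.length, a(0).length)((i, j) => a(i)(j) * num(i)(j) / (den(i)(j) + 1e-9))

  def factorize(v: Matrix, rank: Int, iterations: Int, seed: Long = 0L): (Matrix, Matrix) = {
    val rng = new scala.util.Random(seed)
    // Strictly positive starting point, so every later entry stays nonnegative.
    var w: Matrix = Array.fill(v.length, rank)(rng.nextDouble() + 0.1)
    var h: Matrix = Array.fill(rank, v(0).length)(rng.nextDouble() + 0.1)
    for (_ <- 0 until iterations) {
      val wt = transpose(w)
      h = scale(h, multiply(wt, v), multiply(multiply(wt, w), h)) // H <- H .* (W^T V) ./ (W^T W H)
      val ht = transpose(h)
      w = scale(w, multiply(v, ht), multiply(w, multiply(h, ht))) // W <- W .* (V H^T) ./ (W H H^T)
    }
    (w, h)
  }
}
```

Because each update only multiplies by nonnegative ratios, W and H never leave the nonnegative orthant, which is what makes this rule convenient for the constraint stated above.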
(W, H)
From Breeze to Gust
(x^2 + y - 11)^2 + (x + y^2 - 7)^2
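val df = new DiffFunction[DV[Double]] {
  def calculate(values: DV[Double]) = {
    val gradient = DV.zeros[Double](2)
    val (x,y) = (values(0),values(1))
    val value = pow(x * x + y - 11, 2) + pow(x + y * y - 7, 2)
    gradient(0) = 4 * x * (x * x + y - 11) + 2 * (x + y * y - 7) // d/dx, derived by hand
    gradient(1) = 2 * (x * x + y - 11) + 4 * y * (x + y * y - 7) // d/dy, derived by hand
    (value, gradient)
  }
}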
breeze.optimize.minimize(df, DV(0.0, 0.0))
// DenseVector(3.0000000000012905, 1.999999999997128)
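The reported minimizer, approximately (3, 2), can be sanity-checked without Breeze. The sketch below evaluates the objective and its hand-derived gradient in plain Scala and compares the gradient against central finite differences. The object and method names are hypothetical, and the gradient formulas are derived here from the expression on the earlier slide rather than taken from it.

```scala
object HimmelblauCheck {
  def value(x: Double, y: Double): Double =
    math.pow(x * x + y - 11, 2) + math.pow(x + y * y - 7, 2)

  // Analytic partial derivatives of the expression above.
  def gradient(x: Double, y: Double): (Double, Double) = {
    val a = x * x + y - 11
    val b = x + y * y - 7
    (4 * x * a + 2 * b, 2 * a + 4 * y * b)
  }

  // Central finite differences, as an independent check of `gradient`.
  def numericGradient(x: Double, y: Double, h: Double = 1e-6): (Double, Double) =
    ((value(x + h, y) - value(x - h, y)) / (2 * h),
     (value(x, y + h) - value(x, y - h)) / (2 * h))
}
```

At (3, 2) both squared terms vanish, so the value and both partial derivatives are exactly zero. That is consistent with the optimizer's output differing from (3, 2) only by rounding.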
www.github.com/dlwh/{breeze, epic, puck}