Writing a parser based on the Earley method, with step-by-step simulation


MILITARY TECHNICAL ACADEMY (HỌC VIỆN KỸ THUẬT QUÂN SỰ), FACULTY OF INFORMATION TECHNOLOGY

-o0o-

COURSE PROJECT

Course: Compiler Theory

Topic: Write a parser based on the Earley method, with step-by-step simulation.

Supervisor: Dr. Hà Chí Trung

Class: CHKHMT-K27B

Ho Chi Minh City, May 2016


TABLE OF CONTENTS

1 Summary

2 The Earley algorithm

a. Initialization

b. Algorithm

+) Prediction

+) Scanning

+) Completion

3 Sentence parsing program based on the Earley parser (Java)

4 References


1 Summary

The Earley algorithm is a basic algorithm that is used fairly widely in parsing systems. However, it still has limitations, such as generating too many redundant rules during parsing. In this report, we present a parsing method based on the Earley algorithm.

The Earley algorithm is one of the algorithms commonly used to build parsing systems. It follows a top-down strategy: it starts from a nonterminal symbol that represents the sentence and applies expansion rules until the input sentence is obtained. The limitation of this approach is that it pays little attention to the input words, so the algorithm generates many redundant rules during parsing. In addition, the Earley algorithm was designed for English, which limits it when applied to Vietnamese. An English input sentence has only one word segmentation, whereas a Vietnamese input sentence may have several. Because the input of the Earley algorithm is a single sentence with a single segmentation, the parser has to rerun the algorithm for every possible segmentation of a Vietnamese sentence. To address this, we observe that among the segmentations of a Vietnamese sentence there are pairs that share the same initial sequence of word categories and differ only in their tails.

The next section presents the basic Earley algorithm to give the reader a general picture of it.


2 The Earley algorithm

The basic Earley algorithm is stated as follows:

Input: a grammar G = (N, T, S, P), where:

• N: the set of nonterminal symbols.

• T: the set of terminal symbols.

• S: the start nonterminal symbol.

• P: the set of production rules.

Input string w = a1 a2 ... an.

Output: a parse of w, or "error".

Notation:

• α, β, γ denote strings of terminal and nonterminal symbols, possibly empty.

• X, Y, Z denote single nonterminal symbols.

• a denotes a terminal symbol.

Earley represents rules using a dot "•".

X → α • β means:

• P contains a production rule X → α β.

• α has already been parsed.

• β is still waiting to be parsed.

• When the dot "•" has moved past β, the rule is complete and the constituent X has been fully parsed; otherwise the rule is incomplete.

For each word position j of the input string, the parser creates an ordered set of states S(j). Each set corresponds to one column of the parse table. Each state has the form (X → α • β, i), where the component after the comma indicates that the rule was introduced in column i.
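As a rough illustration of the state notation above, the following Java sketch models a dotted rule (X → α • β, i) as a small immutable class. The class name `DottedRule` and its methods are hypothetical and chosen only for this example. The dot is printed as `@`, the same character used by the `toString` method of the program in Section 3.

```java
import java.util.List;

// Hypothetical illustration of a state (X -> alpha . beta, i); not the report's code.
final class DottedRule {
    final String head;        // X
    final List<String> body;  // alpha followed by beta
    final int dot;            // number of body symbols already parsed (length of alpha)
    final int origin;         // column i in which the rule was introduced

    DottedRule(String head, List<String> body, int dot, int origin) {
        this.head = head;
        this.body = body;
        this.dot = dot;
        this.origin = origin;
    }

    // A rule is complete when the dot has moved past the last body symbol.
    boolean isComplete() {
        return dot == body.size();
    }

    // Symbol immediately after the dot, or null when the rule is complete.
    String nextSymbol() {
        return isComplete() ? null : body.get(dot);
    }

    // Same rule with the dot moved one symbol to the right.
    DottedRule advance() {
        return new DottedRule(head, body, dot + 1, origin);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(head).append(" ->");
        for (int k = 0; k < body.size(); k++) {
            if (k == dot) sb.append(" @");
            sb.append(' ').append(body.get(k));
        }
        if (isComplete()) sb.append(" @");
        return "(" + sb + ", " + origin + ")";
    }

    public static void main(String[] args) {
        DottedRule r = new DottedRule("VP", List.of("V", "N"), 0, 1);
        System.out.println(r);                                  // (VP -> @ V N, 1)
        System.out.println(r.advance());                        // (VP -> V @ N, 1)
        System.out.println(r.advance().advance().isComplete()); // true
    }
}
```

Keeping the object immutable means that moving the dot always produces a new state, so states already stored in a column are never changed after the fact.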

a. Initialization

• S(0) is initialized to contain ROOT → • S.

• If the last set contains the rule (ROOT → S •, 0), the input string has been parsed successfully.


b. Algorithm

The parsing algorithm applies three operations to each set S(j): Prediction (Predictor), Scanning (Scanner), and Completion (Completer).

+) Prediction

For every state (X → α • Y β, i) in S(j), add the state (Y → • γ, j) to S(j) for each production rule Y → γ in P.

+) Scanning

Let a be the next terminal symbol of the input. For every state (X → α • a β, i) in S(j), add the state (X → α a • β, i) to S(j+1).

+) Completion

For every state (X → γ •, i) in S(j), find each state (Y → α • X β, k) in S(i) and add (Y → α X • β, k) to S(j).

Before a state is added to a set S(j), the set must be checked for that state to avoid duplicates.
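The duplicate check matters most for prediction. The sketch below applies only the Prediction step to a left-recursive toy grammar (E → E + T | T, T → a), which is an assumption made for this example and not part of the report. States are encoded as plain strings so that `List.contains` can act as the duplicate test. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: closure of one column S(j) under Prediction, with a duplicate check.
class PredictClosure {
    // Toy left-recursive grammar (assumed for this demo): E -> E + T | T ; T -> a
    static final Map<String, List<List<String>>> GRAMMAR = Map.of(
        "E", List.of(List.of("E", "+", "T"), List.of("T")),
        "T", List.of(List.of("a")));

    // Returns the states of S(j) reachable from ROOT -> @ start by prediction alone.
    static List<String> predictAll(String start, int j) {
        List<String> column = new ArrayList<>();   // states, in insertion order
        List<String> afterDot = new ArrayList<>(); // symbol after the dot of each state
        add(column, afterDot, "ROOT", List.of(start), j);
        for (int n = 0; n < column.size(); n++) {  // the column grows while it is scanned
            String next = afterDot.get(n);
            // Symbols without rules are terminals here, so nothing is predicted for them.
            for (List<String> body : GRAMMAR.getOrDefault(next, List.of())) {
                add(column, afterDot, next, body, j);
            }
        }
        return column;
    }

    // Adds (head -> @ body, j) unless the column already contains it.
    // Assumes a non-empty body, which holds for the toy grammar.
    static void add(List<String> column, List<String> afterDot,
                    String head, List<String> body, int j) {
        String state = head + " -> @ " + String.join(" ", body) + ", " + j;
        if (column.contains(state)) return;        // duplicate check
        column.add(state);
        afterDot.add(body.get(0));
    }

    public static void main(String[] args) {
        for (String state : predictAll("E", 0)) {
            System.out.println(state);
        }
    }
}
```

Without the `contains` test, the state E → @ E + T would predict itself again on every pass and the loop would never end. With it, the column stabilises at four states.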

To illustrate the algorithm, we parse the sentence "học sinh làm bài tập" ("students do homework") with the following set of grammar rules:

S → N VP
S → P VP
S → N AP
S → VP AP
VP → V N
VP → V NP
NP → N N
NP → N A
AP → R A
N → học sinh
N → bài tập
V → làm

where:

S: sentence
VP: verb phrase
NP: noun phrase
AP: adjective phrase
P: pronoun
N: noun
V: verb
A: adjective
R: adverb


Because the sentence above has several possible word segmentations, while the input of the Earley algorithm is a single sentence with a single segmentation, we illustrate the algorithm with the segmentation: học sinh, làm, bài tập.

The parse table for this segmentation is as follows:

S(0)                 S(1)                 S(2)            S(3)
ROOT → • S, 0        N → học sinh •, 0    V → làm •, 1    N → bài tập •, 2
N → • học sinh, 0
N → • bài tập, 0
V → • làm, 0

Table 1. Illustration of the Earley algorithm
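The walk-through above can be reproduced with a compact recognizer. The following is a minimal sketch under stated assumptions, not the report's program: the names `EarleySketch`, `Item`, `parse` and `recognize` are hypothetical, any symbol without a rule (including P, A and R) is treated as a terminal, and the grammar is assumed to contain no empty rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Minimal Earley recognizer sketch; all names are illustrative.
class EarleySketch {
    static final class Item {
        final String head; final List<String> body; final int dot; final int origin;
        Item(String head, List<String> body, int dot, int origin) {
            this.head = head; this.body = body; this.dot = dot; this.origin = origin;
        }
        boolean complete() { return dot == body.size(); }
        String next() { return body.get(dot); }   // symbol after the dot
        Item advance() { return new Item(head, body, dot + 1, origin); }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Item)) return false;
            Item t = (Item) o;
            return dot == t.dot && origin == t.origin
                && head.equals(t.head) && body.equals(t.body);
        }
        @Override public int hashCode() { return Objects.hash(head, body, dot, origin); }
        @Override public String toString() {
            List<String> parts = new ArrayList<>(body);
            parts.add(dot, "@");                   // "@" marks the dot
            return "(" + head + " -> " + String.join(" ", parts) + ", " + origin + ")";
        }
    }

    // Rules of the example grammar; symbols without an entry act as terminals.
    static final Map<String, List<List<String>>> GRAMMAR = Map.of(
        "S", List.of(List.of("N", "VP"), List.of("P", "VP"),
                     List.of("N", "AP"), List.of("VP", "AP")),
        "VP", List.of(List.of("V", "N"), List.of("V", "NP")),
        "NP", List.of(List.of("N", "N"), List.of("N", "A")),
        "AP", List.of(List.of("R", "A")),
        "N", List.of(List.of("học sinh"), List.of("bài tập")),
        "V", List.of(List.of("làm")));

    // Adds an item unless the column already holds it (duplicate check).
    static void add(List<Item> column, Item item) {
        if (!column.contains(item)) column.add(item);
    }

    // Builds the columns S(0) .. S(n) for an already segmented sentence.
    static List<List<Item>> parse(List<String> words) {
        List<List<Item>> chart = new ArrayList<>();
        for (int j = 0; j <= words.size(); j++) chart.add(new ArrayList<>());
        add(chart.get(0), new Item("ROOT", List.of("S"), 0, 0));
        for (int j = 0; j <= words.size(); j++) {
            List<Item> column = chart.get(j);
            for (int n = 0; n < column.size(); n++) {      // column may grow in the loop
                Item item = column.get(n);
                if (item.complete()) {                      // Completion
                    for (Item p : new ArrayList<>(chart.get(item.origin)))
                        if (!p.complete() && p.next().equals(item.head))
                            add(column, p.advance());
                } else if (GRAMMAR.containsKey(item.next())) { // Prediction
                    for (List<String> body : GRAMMAR.get(item.next()))
                        add(column, new Item(item.next(), body, 0, j));
                } else if (j < words.size() && item.next().equals(words.get(j))) {
                    add(chart.get(j + 1), item.advance());  // Scanning
                }
            }
        }
        return chart;
    }

    // Accepts when the last column contains (ROOT -> S @, 0).
    static boolean recognize(List<String> words) {
        return parse(words).get(words.size()).contains(new Item("ROOT", List.of("S"), 1, 0));
    }

    public static void main(String[] args) {
        List<String> words = List.of("học sinh", "làm", "bài tập");
        List<List<Item>> chart = parse(words);
        for (int j = 0; j < chart.size(); j++)
            System.out.println("S(" + j + "): " + chart.get(j));
        System.out.println(recognize(words)); // prints true
    }
}
```

The sketch only recognizes the sentence; it does not build parse trees. The full columns it prints contain more states than the excerpt in Table 1, because every predicted and completed state is listed.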


3 Sentence parsing program based on the Earley parser

(Java)

EarleyParser Class

import java.util.ArrayList;
import java.util.HashMap;

public class EarleyParser {

    // Node of the resulting parse tree.
    public static class Node {
        String text;
        // Child nodes (the list is named "siblings" in this program).
        ArrayList<Node> siblings = new ArrayList<Node>();

        Node(String s) {
            text = s;
        }
    }

    // Earley state: rule left -> right with the dot at index "current".
    class State {

        class Mypair { // need this to keep the order
            String key;
            ArrayList<State> values;

            Mypair(String key, ArrayList<State> values) {
                this.key = key;
                this.values = values;
            }
        }

        int i; // position in the sentence
        String left;
        int current; // position in the grammar rule
        ArrayList<String> right;
        ArrayList<Mypair> parents; // each right has parents

        State(String left, int current, ArrayList<String> right, int i) {
            this.i = i;
            this.left = left;
            this.right = right;
            this.current = current;
            parents = new ArrayList<Mypair>();
            for (String r : right) {
                parents.add(new Mypair(r, new ArrayList<State>()));
            }
        }

        public void parents(Node node_parent) { // visit parents
            for (Mypair pair : parents) {
                Node son = new Node(pair.key);
                for (State sparent : pair.values) {
                    sparent.parents(son);
                }
                node_parent.siblings.add(son);
            }
        }

        public String toString() {
            String out = left + "->";
            for (int k = 0; k < right.size(); k++) {
                if (k == current)
                    out += "@"; // "@" marks the dot
                out += right.get(k);
            }
            if (right.size() == current)
                out += "@";
            return "(" + out + "," + i + ")";
        }

        // Two states are equal when rule, dot position and origin all match.
        public boolean equals(Object obj) {
            if (obj instanceof State) {
                State s2 = (State) obj;
                if (i != s2.i)
                    return false;
                if (current != s2.current)
                    return false;
                if (!left.equals(s2.left))
                    return false;
                if (right.size() != s2.right.size())
                    return false;
                for (int k = 0; k < right.size(); k++)
                    if (!right.get(k).equals(s2.right.get(k)))
                        return false;
                return true;
            }
            return false;
        }
    } // end of class State

    private Sentence words;
    private HashMap<String, ArrayList<ArrayList<String>>> grammar;
    private String start;
    private ArrayList<ArrayList<State>> charts;
    private ArrayList<Node> trees;

    public EarleyParser(Sentence words, Grammar grammar) {
        this.words = words;
        this.grammar = grammar.getGrammar();
        this.start = grammar.getStartProduction();
        // One chart per position: 0 .. number of words.
        this.charts = new ArrayList<ArrayList<State>>(words.getSentence().size() + 1);
        for (int i = 0; i < words.getSentence().size() + 1; i++) {
            this.charts.add(new ArrayList<State>());
        }
    }

    public ArrayList<Node> getTrees() {
        return trees;
    }

    public int run() {
        // Initialization
        ArrayList<String> right_root = new ArrayList<String>(1);
        right_root.add(start);
        State begin = new State("_ROOT", 0, right_root, 0);
        addIfNotContains(0, begin);
        for (int i = 0; i < words.getSentence().size() + 1; i++) {
            System.out.println("\nWord no " + i);
            if (i < words.getSentence().size())
                System.out.println(words.getSentence().get(i));
            if (charts.get(i).isEmpty()) {
                System.out.println("Nothing to do for this word");
                return i + 1;
            }
            // The chart may grow while it is being processed.
            for (int snum = 0; snum < charts.get(i).size(); snum++) {
                State s = charts.get(i).get(snum);
                System.out.println("state to process " + s);
                if (s.current == s.right.size()) { // end of rule
                    System.out.println("Completer");
                    completer(s, i);
                } else {
                    if (s.right.get(s.current).startsWith("\"")) {
                        System.out.println("Scanner");
                        scanner(s, i);
                    } else {
                        System.out.println("Predictor");
                        predictor(s, i);
                    }
                }
            }
        }

        // Tree
        State last_state = new State("_ROOT", 1, right_root, 0);
        ArrayList<State> array = charts.get(charts.size() - 1);
        trees = new ArrayList<Node>();
        for (State s_root : array) {
            if (s_root.equals(last_state)) {
                Node root = new Node("_ROOT");
                s_root.parents(root);
                trees.add(root);
            }
        }

        // 0 on success, -1 when the last chart has no completed _ROOT state.
        boolean r = charts.get(charts.size() - 1).contains(last_state);
        if (r)
            return 0;
        else
            return -1;
    }

    private void predictor(State s, int j) {
        String B = s.right.get(s.current);
        ArrayList<ArrayList<String>> rules = grammar.get(B);
        for (ArrayList<String> rule : rules) {
            System.out.print("Predictor Action");
            State snew = new State(B, 0, rule, j);
            addIfNotContains(j, snew);
        }
    }

    private void scanner(State s, int j) {
        String B = s.right.get(s.current);
        boolean epsilon = B.equals("\"\"");
        if (j > words.getSentence().size())
            return;
        if (j == words.getSentence().size() && !epsilon) // only empty strings can be scanned in last chart
            return;
        if (epsilon) {
            System.out.print("Scanner Action epsilon");
            State snew = new State(s.left, s.current + 1, s.right, s.i);
            State newAdded = addIfNotContains(j, snew); // adds to current chart
            copyParents(s, newAdded);
        } else if (B.equals(words.getSentence().get(j))) {
            System.out.print("Scanner Action");
            State snew = new State(s.left, s.current + 1, s.right, s.i);
            State newAdded = addIfNotContains(j + 1, snew); // adds to next chart
            // copy parents from duplicated state
            copyParents(s, newAdded);
        }
    }

    private void completer(State s, int k) {
        for (int snum = 0; snum < charts.get(s.i).size(); snum++) {
            State currentState = charts.get(s.i).get(snum);
            if (currentState.current >= currentState.right.size())
                continue;
            if (s.left.equals(currentState.right.get(currentState.current))) {
                System.out.print("Completer Action");
                State newState = new State(currentState.left, currentState.current + 1,
                        currentState.right, currentState.i);
                State newAdded = addIfNotContains(k, newState);
                //newAdded.parents.add(s);
                // newState was actually added (it is not a duplicate):
                // record s as the parent of the symbol just completed
                if (newState == newAdded)
                    newAdded.parents.get(currentState.current).values.add(s);
                copyParents(currentState, newAdded);
            }
        }
    }

    // Adds s to chart num unless an equal state exists; returns the stored state.
    private State addIfNotContains(int num, State s) {
        ArrayList<State> list = charts.get(num);
        for (int i = 0; i < list.size(); i++) {
            if (list.get(i).equals(s)) {
                System.out.println(" NOT added " + s + " to chart " + num);
                return list.get(i);
            }
        }
        System.out.println(" Added " + s + " to chart " + num);
        list.add(s);
        return s;
    }

    private void copyParents(State s, State newAdded) {
        for (int i = 0; i < s.parents.size(); i++) { // both states have the same number of right
            for (State value : s.parents.get(i).values) {
                if (!newAdded.parents.get(i).values.contains(value))
                    newAdded.parents.get(i).values.add(value);
            }
        }
    }
}

Grammar Class
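Judging from the patterns in the listing that follows, each grammar line has the form `Head ::= alternative | alternative`, with `|` separating alternatives except inside double quotes. The sketch below shows that line format in isolation. The class `RuleLineSketch` and its methods are hypothetical. The pipe-splitting pattern is modelled on `RE_SPLIT_PIPES` with the stray space removed, which is an assumption about the intended expression.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how one grammar line could be validated and split; not the report's class.
class RuleLineSketch {
    // Same line shape the listing checks: NonTerminal ::= body
    static final String RULE_LINE = "[A-Za-z][A-Za-z0-9]* ::= (.*)";
    // A pipe counts as a separator only if an even number of quotes follows it.
    static final String SPLIT_PIPES = "\\|(?![^\"]*\"(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // Returns the left-hand side of a rule line, or throws if the line is malformed.
    static String head(String line) {
        if (!line.matches(RULE_LINE)) {
            throw new IllegalArgumentException("Not a rule line: " + line);
        }
        return line.substring(0, line.indexOf("::=")).trim();
    }

    // Returns the alternatives of the rule body, trimmed, in order.
    static List<String> alternatives(String line) {
        String body = line.substring(line.indexOf("::=") + 3);
        List<String> result = new ArrayList<>();
        for (String alternative : body.split(SPLIT_PIPES)) {
            result.add(alternative.trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "VP ::= V N | V NP | \"a|b\"";
        System.out.println(head(line));         // VP
        System.out.println(alternatives(line)); // [V N, V NP, "a|b"]
    }
}
```

The quoted alternative `"a|b"` stays in one piece because the pipe inside it is followed by an odd number of quotes, so the negative lookahead rejects it as a separator.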

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Grammar {

    /*
     * Regular-expression sites
     *
     * http://www.regexr.com/
     * http://www.regexplanet.com/advanced/java/index.html
     */

    final String GR_SEPARATOR = "::=";
    final String RE_SPLIT_SPACES = "[^\\s\"']+|\"([^\"]*)\"|'([^']*)'";
    // new version with parentheses: [^\s\"'()]+|\"([^\"]*)\"|'([^']*)'|\(([^\(]*)\)*\*
    final String RE_SPLIT_SPACES2 = "[^\\s\\\"'()]+|\\\"([^\\\"]*)\\\"|'([^']*)'|\\(([^\\(]*)\\)*\\*";
    // splits on "|" only when it is outside double quotes
    final String RE_SPLIT_PIPES = "\\|(?![^\"]*\"(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
    // parenthesized group followed by *, + or ?: \(([^\(]*)\)*\*
    final String RE_SPLIT_PARENTHESES = "\\(([^\\(]*)\\)*(\\*|\\+|\\?)";

    String filePath;

    HashMap<String, ArrayList<ArrayList<String>>> grammar = new HashMap<String, ArrayList<ArrayList<String>>>();
    LinkedHashSet<String> productions = new LinkedHashSet<String>();
    String startProduction;
    private int production_index = 1;

    public Grammar(String path) throws GrammarErrorException {
        filePath = path;
        readFile();
        semanticAnalysis();
    }

    public Grammar(String text, boolean test) throws GrammarErrorException {
        readString(text);
        semanticAnalysis();
    }

    public Grammar() {
    }

    public void readFile() throws GrammarErrorException {
        File f = new File(filePath);
        if (!f.exists())
            throw new GrammarErrorException("File doesn't exist!");
        try {
            reader(new FileReader(f));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
            throw new GrammarErrorException("File doesn't exist!");
        }
    }

    public void readString(String x) throws GrammarErrorException {
        reader(new StringReader(x));
    }

    private void reader(Reader in) throws GrammarErrorException {
        // try-with-resources closes the reader automatically
        try (@SuppressWarnings("resource") BufferedReader br = new BufferedReader(in)) {
            String line = br.readLine();
            int cont = 0;
            while (line != null) {
                System.out.println("LINE - " + line);
                if (line.matches("[A-Za-z][A-Za-z0-9]* ::= (.*)")) { // match Rule: production ::= body
                    String head = line.substring(0, line.indexOf("::=") - 1);
                    String body = line.substring(line.indexOf("::=") + 3);
                    if (cont == 0)
                        startProduction = head; // the first rule defines the start symbol
                    productions.add(head); // add head to productions list
                    if (grammar.containsKey(head)) {
                        ArrayList<ArrayList<String>> bodies = grammar.get(head);
                        parseBody(body, bodies, cont + 1);
                    } else {
                        ArrayList<ArrayList<String>> bodies = new ArrayList<ArrayList<String>>();
                        grammar.put(head, bodies);
                        parseBody(body, bodies, cont + 1);
                    }
                } else {
                    String abc = "Line " + (cont + 1) + ": \'" + line
                            + "\' doesn't follow:\n Non-Terminal ::= body";
                    throw new GrammarErrorException(abc);
                }
                line = br.readLine();
                cont++;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("\nGrammar - " + grammar);
        System.out.println("Non-Terminals - " + productions);
        System.out.println("StartProduction - " + startProduction);
    }

    private void parseBody(String body, ArrayList<ArrayList<String>> bodies, int lineNum) throws GrammarErrorException {
        String[] tmp2 = body.split(RE_SPLIT_PIPES);
        for (String i : tmp2) {
            System.out.println("-> " + i);
            /*ArrayList<String> parentheses = splitSpecial(i, RE_SPLIT_PARENTHESES);
            System.out.println(" -> " + parentheses);
            */
            Pattern regex = Pattern.compile(RE_SPLIT_PARENTHESES);
            Matcher regexMatcher = regex.matcher(i);
            StringBuffer sb = new StringBuffer();
            while (regexMatcher.find()) {
                // Replace each group "(X)*", "(X)+" or "(X)?" with a fresh production "#n".
                String matched = regexMatcher.group().trim();
                String production = "#" + production_index;
                String rule_body = null;
                if (matched.charAt(matched.length() - 1) == '*') {
                    // #n ::= X #n | ""
                    rule_body = matched.substring(1, matched.length() - 2) + " " + production
                            + " | \"\"";
                } else if (matched.charAt(matched.length() - 1) == '+') {
                    // #n ::= X #n | X
                    rule_body = matched.substring(1, matched.length() - 2) + " " + production
                            + " | " + matched.substring(1, matched.length() - 2);
                } else if (matched.charAt(matched.length() - 1) == '?') {
                    // #n ::= X | ""
                    rule_body = matched.substring(1, matched.length() - 2) + " | \"\"";
                }
                // parse this new rule
                ArrayList<ArrayList<String>> b = new ArrayList<ArrayList<String>>();
                grammar.put(production, b);
                parseBody(rule_body, b, lineNum);
                String replacement = production;
                regexMatcher.appendReplacement(sb, replacement);
                production_index++;
            }
            regexMatcher.appendTail(sb);
            System.out.println(sb.toString());
            ArrayList<String> tmp = splitSpecial(sb.toString(), RE_SPLIT_SPACES);
            System.out.println(tmp);
            // add non-terminals to productions list
            /*for(String j: tmp) {
