Named entity recognition system

Part of the document Extraction of Vietnamese collocation from text corpora (pages 59 - 66)

File name     Number of errors   Total document review time (minutes)   Review time per error (seconds)
102870.seg    21                 8.12                                   23.2
108339.seg    14                 5.31                                   22.76
108559.seg    15                 5.54                                   22.16
109249.seg    14                 4.73                                   20.27
109479.seg    25                 11.45                                  27.48
Total         89                 35.15                                  23.69

Table 5.7: Quality control time in online framework

result, which is presented in Table 5.7. Compared with the offline framework, quality control time in the online framework is slightly lower (24.91 seconds offline versus 23.69 seconds online). This indicates that combining the two tools reduces quality control time. To reduce it further, we need to make the framework more user-friendly and convenient.
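The last column of Table 5.7 appears to be the review time divided by the number of errors. The sketch below recomputes it under that assumption; the interpretation is inferred from the arithmetic, not stated in the text, and the names `ROWS` and `seconds_per_error` are illustrative.

```python
# (file name, number of errors, total review time in minutes), copied from Table 5.7
ROWS = [
    ("102870.seg", 21, 8.12),
    ("108339.seg", 14, 5.31),
    ("108559.seg", 15, 5.54),
    ("109249.seg", 14, 4.73),
    ("109479.seg", 25, 11.45),
]


def seconds_per_error(errors: int, minutes: float) -> float:
    """Convert a review time in minutes to average seconds per error."""
    return minutes * 60 / errors


total_errors = sum(errors for _, errors, _ in ROWS)
total_minutes = sum(minutes for _, _, minutes in ROWS)
overall = seconds_per_error(total_errors, total_minutes)

if __name__ == "__main__":
    for name, errors, minutes in ROWS:
        print(f"{name}: {seconds_per_error(errors, minutes):.2f} s per error")
    print(f"total: {overall:.2f} s per error")
```

The recomputed values agree with the table to within rounding, which supports reading the column as seconds per error.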

In conclusion, in this section we evaluated the time spent by each method. Although annotation time online is higher than offline, the total time online is much lower than offline (64 files per month versus 44 files per 2 days), because many people take part in the annotation process online but not offline. Time cost is important because it reflects the performance of the annotation process. Based on these timings, we will improve the tools to enhance annotation performance.

5.4 Named entity recognition system

While building the NER corpus, we were also building a Vietnamese NER system (as part of Dat Nguyen Ba's project). The corpus is used to evaluate the accuracy of the system (F-measure). In this section, we present the system and the evaluation results.

Our NER system is rule-based. It is built as a set of plugins in the GATE framework, with the architecture shown in Figure 5.4. The system includes four processes, each of which is a GATE plugin called a processing resource. In our system, the input is a document and the output is an automatically annotated document.

52 Chapter 5. Evaluation

Figure 5.4: Named entity recognition system architecture

Documents are processed sequentially by these processing resources:

• Word segmentation

• Part of speech Tagger

• Gazetteer

• Transducer

We describe each processing resource in the following subsections.
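The sequential flow through the four processing resources can be pictured with a minimal sketch. This is not the GATE API: `Document`, `make_stage` and `annotate` are hypothetical names, and each stage is a placeholder that only records that it ran.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    text: str
    # Names of the stages that have processed the document, in order.
    trace: List[str] = field(default_factory=list)


# A processing resource takes a document and returns it with new annotations.
ProcessingResource = Callable[[Document], Document]


def make_stage(name: str) -> ProcessingResource:
    """Build a placeholder stage that only records that it ran."""
    def run(doc: Document) -> Document:
        doc.trace.append(name)
        return doc
    return run


# The order follows the list of processing resources in the text.
PIPELINE: List[ProcessingResource] = [
    make_stage("word_segmentation"),
    make_stage("pos_tagger"),
    make_stage("gazetteer"),
    make_stage("transducer"),
]


def annotate(doc: Document) -> Document:
    """Run every processing resource over the document, one after another."""
    for stage in PIPELINE:
        doc = stage(doc)
    return doc
```

Each stage sees the annotations left by the stages before it, which is why the order of the list matters.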

5.4.1 Preprocessing

Our system uses some existing processing resources that have already been built as GATE plugins: the word segmentation module and the part-of-speech tagger of Pham et al. (2009). These two processing resources form the preprocessing stage of our system.

The biggest difference between English and Vietnamese is that Vietnamese is a monosyllabic language, so a word may consist of one or more tokens, which is not the case in English. Because an entity is a meaningful word or phrase, the first task of the preprocessing step in an NER system is to recognize words. The quality of an NER system depends very much on how well this step is performed. For example, consider this sentence:

Anh Hùng lái xe trên đại lộ 5

In Vietnamese, the phrase "Anh Hùng" can be understood in two ways:

• It contains two words: "Anh" (Mr) and "Hùng" (Hùng is a person name).

• It is a single word: "Anh Hùng" (the hero).

If the word segmentation module works well, the result should be:

"Anh Hùng lái xe trên đại_lộ 5"

"Mr. Hung drove on highway no. 5."

"Hùng", standing after the prefix "Anh", will then be correctly recognized as a person. However, if the word segmentation module does not work well, we will receive a wrong result:

"Anh_Hùng lái xe trên đại_lộ 5."

"The hero drove on highway no. 5."

The problem is that the resulting set of words no longer contains "Hùng" as a separate word, so it is almost impossible to recognize it as an entity.
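The effect of segmentation on a title-prefix rule can be shown with a small sketch. It assumes segmented words are joined with "_" as in the examples; `TITLE_PREFIXES` and `find_persons` are hypothetical stand-ins, not the actual rule used by the system.

```python
from typing import List

# Hypothetical title prefixes; the real system takes such words from a gazetteer.
TITLE_PREFIXES = {"Anh", "Bà", "Ông"}


def find_persons(words: List[str]) -> List[str]:
    """Return capitalized words that directly follow a title prefix."""
    persons = []
    for prev, word in zip(words, words[1:]):
        if prev in TITLE_PREFIXES and word[:1].isupper():
            persons.append(word)
    return persons


# Correct segmentation: "Anh" and "Hùng" are separate words.
good = "Anh Hùng lái xe trên đại_lộ 5".split()
# Wrong segmentation: "Anh_Hùng" is a single word, so the prefix is hidden.
bad = "Anh_Hùng lái xe trên đại_lộ 5".split()

if __name__ == "__main__":
    print(find_persons(good))
    print(find_persons(bad))
```

With the correct segmentation the rule finds "Hùng"; with the wrong one it finds nothing, because neither the prefix nor the name exists as a word of its own.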


5.4.2 Gazetteer

The Gazetteer module consists of several dictionaries, or gazetteers, that are used to create Lookup annotations over words with specific semantics, to be used when writing rules at later stages. Each dictionary contains words of the same semantic category, such as person names, organization names, or phrases that signal the type of surrounding named entities. The gazetteers used in our system can be divided into the following groups:

• Gazetteers that contain potential named entities. In our system these cover Person, Location, Organization, and Facility.

• Gazetteers containing phrases used in contextual rules, such as name prefixes or verbs that are likely to follow a person name.

• Gazetteers of potentially ambiguous named entities.
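A gazetteer lookup of this kind can be sketched as a simple dictionary match over segmented words. The entries and type names in `GAZETTEERS` are illustrative assumptions, not the contents of the actual lists, and `lookup` is a hypothetical helper rather than the GATE gazetteer component.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical miniature gazetteers; the entries are illustrative only.
GAZETTEERS: Dict[str, Set[str]] = {
    "Location": {"Việt_Nam", "Hà_Nội"},
    "PersonTitle": {"Anh", "Bà", "Ông"},
}


def lookup(words: List[str]) -> List[Tuple[int, str, str]]:
    """Return (word index, word, gazetteer type) for every gazetteer hit."""
    hits = []
    for i, word in enumerate(words):
        for gazetteer_type, entries in GAZETTEERS.items():
            if word in entries:
                hits.append((i, word, gazetteer_type))
    return hits


if __name__ == "__main__":
    print(lookup("Bà Nùng đến Hà_Nội".split()))
```

Each hit plays the role of a Lookup annotation that later rules can refer to by type.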

5.4.3 Transducer

The transducer module is a cascade of JAPE grammars, or rules. A JAPE grammar allows one to specify regular expression patterns over semantic annotations. Thus, the results of the previous modules (word segmentation, part-of-speech tagging and gazetteer), in the form of annotations, can be used to recognize and classify named entities. The cascade of JAPE grammars consists of the following components, in order:

• Preprocessing: removing incorrect Lookup annotations and identifying potential named entities

• Recognizing Organization and Facility entities

• Recognizing Location and Nationality entities

• Recognizing Religion entities

• Recognizing Person entities

• Resolving ambiguity and improving quality using contextual rules

Figure 5.5: JAPE rule to recognize Person entities

The pre-processing step removes Lookup annotations that cover only part of a word. For example, the word "trường" (school) is a clue for recognizing an Organization, but it can also be just part of another word with a totally different meaning. Take the following sentence:

"Thị trường Việt Nam trong thời kỳ khủng hoảng"

"The market of Vietnam was so dull during the crisis."

Here "trường" is not a word by itself but part of the word "thị trường" (market).

The pre-processing step is also responsible for recognizing potential named entities, so that later steps can focus on typing them and on disambiguating ambiguous cases. Apart from the named entities recognized by the corresponding gazetteers, we also create NamePhrase annotations over consecutive words whose first letters are capitalized.
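Both pre-processing tasks can be sketched in a few lines. This is an illustration under assumptions, not the JAPE grammar itself: words are assumed to be segmented with "_" joining syllables, and `drop_partial_lookups` and `name_phrases` are hypothetical helper names.

```python
from typing import List, Set, Tuple


def drop_partial_lookups(words: List[str], clue_words: Set[str]) -> List[Tuple[int, str]]:
    """Keep a clue only when it is a whole segmented word.

    Because matching is done on whole words, the clue "trường" does not
    match inside the word "Thị_trường".
    """
    return [(i, w) for i, w in enumerate(words) if w in clue_words]


def name_phrases(words: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, of consecutive capitalized words."""
    spans = []
    start = None
    for i, word in enumerate(words + [""]):  # empty sentinel closes a trailing run
        if word[:1].isupper():
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    return spans


if __name__ == "__main__":
    market = "Thị_trường Việt_Nam trong thời_kỳ khủng_hoảng".split()
    print(drop_partial_lookups(market, {"trường"}))
    print(name_phrases("niềm_vui đến với A Lưới".split()))
```

Note that capitalization alone also fires on sentence-initial words, so the spans are only candidates that later rules still have to type or discard.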

To recognize and classify entities we use a set of JAPE rules. For example, to recognize people the system uses the rule shown in Figure 5.5. In this example, a Person entity is recognized if the looked-up word or phrase is preceded by a person title word. We use prefix and suffix words to recognize entities; in addition, we use sentential context.

However, when using sentential context some cases remain ambiguous. To resolve them we use a higher level of context: document context. For example:

"Bà Nùng vừa hút tẩu thuốc lá vừa kể: "Đời thằng A Lưới khổ lắm. Nhà chẳng còn ai, lao động quần quật cả năm mà cũng không đủ ăn. Không biết đời nó biết bao giờ mới có vợ". Ấy thế mà niềm vui bất ngờ đến với A Lưới anh gặp Hoa cô giáo miền núi mới lên bản."

"Mrs. Nung smoked her pipe and said: "Mr. A Luoi's life is so hard. He has no family left, and although he toils all year round he still does not have enough to eat. Nobody knows when he will get married." Unexpectedly, happiness came to A Luoi. He met Hoa, a teacher from the delta who had just arrived in the hamlet."

"A Lưới" appears twice, but only the first instance is recognized as a person name with a high level of confidence, thanks to the title prefix "thằng" (Mr). Using document context, the second instance of that phrase is therefore also typed as a person name.
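The document-context step can be sketched as copying a confidently assigned type to other occurrences of the same phrase. The function `propagate_types` and its data layout are assumptions made for illustration; the actual system implements this with JAPE rules.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) word indices of a candidate phrase, end exclusive


def propagate_types(words: List[str],
                    phrases: List[Span],
                    typed: Dict[Span, str]) -> Dict[Span, str]:
    """Copy the type of a confidently typed phrase to untyped phrases
    elsewhere in the document that have exactly the same text."""
    text_to_type = {" ".join(words[s:e]): t for (s, e), t in typed.items()}
    result = dict(typed)
    for s, e in phrases:
        text = " ".join(words[s:e])
        if (s, e) not in result and text in text_to_type:
            result[(s, e)] = text_to_type[text]
    return result


if __name__ == "__main__":
    words = "Đời thằng A Lưới khổ lắm . niềm_vui đến với A Lưới".split()
    phrases = [(2, 4), (10, 12)]       # both occurrences of "A Lưới"
    typed = {(2, 4): "Person"}         # typed via the title prefix "thằng"
    print(propagate_types(words, phrases, typed))
```

In the example, the second "A Lưới" has no title prefix, yet it receives the Person type from the first occurrence.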

5.4.4 Experiment

Our corpus is randomly divided into two parts, the first containing 53 documents (2814 sentences) and the second containing 20 documents (1125 sentences).

Figure 5.6: Performance on the training data using strict criteria

Figure 5.7: Performance on the test data using strict criteria

Evaluation Metrics. We evaluate the performance of the system on the test data using the standard Precision, Recall and F-measure metrics under two criteria:

• Strict criteria: an entity is recognized correctly when both the span and the type are the same as in the annotated corpus.

• Lenient criteria: an entity is recognized correctly when the type is correct and the span partially overlaps with the one in the annotated corpus.
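The two criteria can be made concrete with a minimal scoring sketch. It assumes entities are given as (start, end, type) triples with an exclusive end offset and that the F-measure is the balanced F1; neither detail is stated in the text, and the scores reported in this section were not produced by this code.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start, end, type), end exclusive


def overlaps(a: Entity, b: Entity) -> bool:
    """True when the two spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def matches(pred: Entity, gold: Entity, strict: bool) -> bool:
    """Strict: same span and type. Lenient: same type and overlapping span."""
    if pred[2] != gold[2]:
        return False
    if strict:
        return pred[:2] == gold[:2]
    return overlaps(pred, gold)


def score(preds: List[Entity], golds: List[Entity], strict: bool = True) -> Tuple[float, float, float]:
    """Return (precision, recall, F1); each gold entity is matched at most once."""
    unmatched = list(golds)
    correct = 0
    for p in preds:
        for g in unmatched:
            if matches(p, g, strict):
                unmatched.remove(g)
                correct += 1
                break  # stop iterating right after modifying the list
    precision = correct / len(preds) if preds else 0.0
    recall = correct / len(golds) if golds else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [(0, 2, "Person"), (5, 8, "Organization")]
    pred = [(0, 2, "Person"), (6, 8, "Organization")]
    print(score(pred, gold, strict=True))
    print(score(pred, gold, strict=False))
```

In the usage example the Organization span is one word too short, so it counts as an error under the strict criteria but as correct under the lenient ones.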

The results are presented in Figures 5.6, 5.7, and 5.8. Figure 5.7 shows that the system achieves a respectable overall F-measure of 0.83. Given that the performance of the system on the training data reaches only 0.89, we believe there is still a lot of room for improvement. While Religion and Location entities are well recognized, Organization entities appear to be a challenge. One reason is that the names of Vietnamese organizations are sometimes quite long and hard to recognize, especially when they are not capitalized.

Figure 5.8: Performance on the test data using lenient criteria

As a first step, we have successfully built an open source system based on GATE, so that the community can access, use and further develop it to address the problem of recognizing entities in Vietnamese texts.

However, some entity types still yield quite low recognition results, namely Organization, Nationality and Person entities. The reason is that we have not yet applied all contextual factors in the recognition process.
