Research in Linguistically Motivated Event Extraction

--------------------------------------------------------------------------------

Preliminary studies of the linguistically motivated approach to event extraction have concentrated on investigating the contribution of linguistic information to the task (Zhang et al. 2013; Zhang et al. 2012; Zhang and Fang 2011), and have laid a solid foundation for syntactic layer analysis of dialogue utterances.

In Zhang et al. (2013), four different feature sets, constructed according to lexical, grammatical, syntactic and semantic information, are explored in the task of biomedical event extraction. In the biomedical domain, event extraction refers to the automated extraction of structured representations of biological processes from text (Van Landeghem et al. 2013). The experiment is based on BioNLP data provided by the Turku team[1] in the BioNLP'11 shared task. The results provide empirical evidence that linguistic information, in addition to domain knowledge, is important for constructing high-performance feature sets for biomedical event extraction.

The principal investigator has also conducted a preliminary study on linguistically motivated event extraction based on the British component of the International Corpus of English (ICE-GB)[2]. Best et al. (2008) regard an event as Who did What to Whom, Where, When, Why and with What damage. Kolyal et al. (2013) argue that the event definition varies according to the application domain. The event definition here is close to the one used in the ACE (Automatic Content Extraction) program[3], where an event is defined as a specific occurrence involving participants, or something that happens, or is described as a change of state. Given that the extent of an event is a sentence, an event in our study is composed of a set of arguments as follows:

arg0(arg1, arg2, arg3)

where arg0 represents the predicate, namely the main verb of the sentence; arg1 and arg2 refer to the subject and object respectively, indicating the entities involved in the event; and all other items in the sentence are placed in arg3. arg3 therefore constitutes a set of entities which are not properly participants, but should be understood as 'part' of the event, such as the manner or instruments involved when the event happens. In the ACE program, such entities are called attributes. If a sentence contains multiple clauses, the events in subordinate clauses are also placed in the arg3 set. The ICE-GB corpus is POS-tagged and parsed. Based on observation of the syntactic relationships between the word components of ICE-GB, a set of extraction rules was created by linguistic experts in the project group. A recognition accuracy of up to 95% has been achieved in a closed test.
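The arg0(arg1, arg2, arg3) representation described above can be illustrated with a small data structure. The following is only a minimal sketch in Python: the class name Event, its field names, the render method, the bracket notation for arg3 and the example sentence are all hypothetical and are not taken from the project. It does not implement the expert-written extraction rules, which are not given here; it only shows how an event from a subordinate clause can be nested inside the arg3 set of the main event.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Event:
    """Hypothetical container for one event of the form arg0(arg1, arg2, arg3)."""

    predicate: str                  # arg0: main verb of the sentence or clause
    subject: Optional[str] = None   # arg1: subject, if present
    obj: Optional[str] = None       # arg2: object, if present
    # arg3: remaining items (attributes such as manner or time) and
    # events found in subordinate clauses.
    others: List[Union[str, "Event"]] = field(default_factory=list)

    def render(self) -> str:
        """Return a flat string; '_' marks an empty subject or object slot."""
        arg3 = ", ".join(
            item.render() if isinstance(item, Event) else item
            for item in self.others
        )
        subject = self.subject or "_"
        obj = self.obj or "_"
        return f"{self.predicate}({subject}, {obj}, [{arg3}])"


# Invented example sentence (not from ICE-GB):
# "The committee approved the proposal yesterday because the budget had increased."
sub = Event(predicate="increase", subject="the budget")
main = Event(
    predicate="approve",
    subject="the committee",
    obj="the proposal",
    others=["yesterday", sub],  # time attribute plus the subordinate-clause event
)

print(main.render())
```

In this sketch the subordinate-clause event is stored as a nested Event rather than as a string, so that its own predicate and arguments remain accessible; this is one possible design choice, not necessarily the one used in the study.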

 

[1] http://www.biomedcentral.com/1471-2105/13/S11/S4

[2] http://ice-corpora.net/ice/icegb.htm

[3] http://www.itl.nist.gov/iad/mig/tests/ace/