
AutoAnnotator

Requirements Engineering is one of the most important tasks in the software development process. This iterative process relies heavily on natural language (NL) specifications. These NL specifications are then transformed into (more) formal models to avoid ambiguity and other deficiencies. SALE MX provides a method of deriving such models - namely UML models - directly from the NL specifications. Since the transformation from text to models relies on thematic annotations, one has to start with explicitly encoding the semantics of the specification.

SALE provides the analyst with a means to annotate the specification with semantic information and thus make it machine-processable. But this annotation process is rather time-consuming - especially if the annotating analyst is not (well) trained in the usage of SALE.

AutoAnnotator derives the semantics of phrases based on several linguistic analyses performed by NLP programs that are chained together in a processing pipeline.

The Processing Pipeline

Since the subsequent processing steps of SALE MX rely on textual information only, the input to AutoAnnotator must also be a plain text file. (Being only able to process text input, SALE requires the textual specification to be self-contained, without any references to external information such as graphics or tabular appendices. See the article about MX and SALE for more details.)

First, the given specification is read and prepared for evaluation. To this end, we split the text into phrases and the phrases into words. To store the extracted information, an internal data model is built that roughly resembles the imported text. A second representation of the text is built that resembles the future SALE document; it represents a valid SALE document at every stage and can be dumped for examination. Information gathered with the various NLP tools is stored in the first model - all derived (i.e., reasoned) information is stored in the second model.
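
A minimal sketch of this preparation step, here using NLTK's sentence and word tokenizers as stand-ins for the actual implementation (AutoAnnotator's own splitter is not shown; the function and field names are hypothetical):

import nltk  # requires the 'punkt' tokenizer data

def build_first_model(text):
    # Split the specification into phrases and the phrases into words
    # (tokens); the result mirrors the first internal data model.
    model = []
    for phrase in nltk.sent_tokenize(text):
        model.append({"phrase": phrase, "words": nltk.word_tokenize(phrase)})
    return model

spec = ("Chillies are very hot vegetables. "
        "Mike Tyson likes green chillies. "
        "Last week, he ate five of them.")
for entry in build_first_model(spec):
    print(entry["phrase"], "->", entry["words"])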

https://svn.ipd.uni-karlsruhe.de/repos/koerner/mx/public/img/AutoAnnotator/Pipeline.png

As you can see, every stage of the processing pipeline adds some information to the information store, which can later be examined in detail. Subsequent stages can rely on information previously gathered, but they may of course also start from scratch. At the moment there is no duplication of effort (e.g., using multiple parsers and determining a best parse); unfortunately, this also means that failures of a particular tool propagate directly into the subsequent reasoning stages.

After the "simple" NLP information gathering has been completed, subsequent tasks combine the gathered information with ontology queries (see the example below). All reasoning steps modify the second information base only - the original document along with the NLP information is never transformed destructively.
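
The chaining itself could be as simple as the following sketch: each stage enriches a shared information store that later stages read from (the names and the store structure are our assumptions, not AutoAnnotator's actual code):

import nltk

def run_pipeline(text, stages):
    store = {"text": text}  # shared information store
    for stage in stages:
        stage(store)        # each stage adds its results; later stages may rely on them
    return store

def tokenize_stage(store):
    store["tokens"] = [nltk.word_tokenize(s)
                       for s in nltk.sent_tokenize(store["text"])]

def pos_stage(store):
    store["pos"] = [nltk.pos_tag(toks) for toks in store["tokens"]]

store = run_pipeline("Mike Tyson likes green chillies.",
                     [tokenize_stage, pos_stage])
print(store["pos"][0])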

We do not believe that a machine can capture all the semantic information today, but it can certainly ease the process for the annotator.

Example

In the following, we show a small example to give you a feeling for how AutoAnnotator works:

  • Chillies are very hot vegetables.
  • Mike Tyson likes green chillies.
  • Last week, he ate five of them.

Stages 0-2: Document Preparation

After the first and second stage, the input text is transformed into the first model. It looks like this:

Document
|--> Phrase: Chillies are very hot vegetables.
| |--> Words: Chillies, are, very, hot, vegetables, .
|--> Phrase: Mike Tyson likes green chillies.
| |--> Words: Mike, Tyson, likes, green, chillies, .
|--> Phrase: Last week, he ate five of them.
| |--> Words: Last, week, ,, he, ate, five, of, them, .

Note: Punctuation marks are also referred to as words - think of words in this context as "tokens".

After the initialization, the second model (the SALE model) looks like this:

#{ Chillies are very hot vegetables .}
[ #Chillies #are #very #hot #vegetables ].
#{ Mike Tyson likes green chillies .}
[ #Mike #Tyson #likes #green #chillies ].
#{ Last week , he ate five of them .}
[ #Last #week , #he #ate #five #of #them ].

All we know at the moment is that we have several phrases containing some words. Since we do not know the semantics yet, we can only produce comments.

Stage 3: POS-Tagging

Next, a POS-Tagger is used to derive the following information (the POS-tags are attached to the words in the first model):

Chillies/NNS are/VBP very/RB hot/JJ vegetables/NNS ./.
Mike/NNP Tyson/NNP likes/VBZ green/JJ chillies/NNS ./.
Last/JJ week/NN ,/, he/PRP ate/VBD five/CD of/IN them/PRP ./.
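
The concrete tagger is not essential here; any Penn Treebank tagger yields this kind of output. For illustration, with NLTK (a stand-in, not necessarily the tool AutoAnnotator actually uses):

import nltk  # requires the 'averaged_perceptron_tagger' data

tokens = nltk.word_tokenize("Mike Tyson likes green chillies.")
print(nltk.pos_tag(tokens))
# [('Mike', 'NNP'), ('Tyson', 'NNP'), ('likes', 'VBZ'),
#  ('green', 'JJ'), ('chillies', 'NNS'), ('.', '.')]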

Stage 4a: NL Parsing

An NL parser is then used to derive parse trees for the individual sentences. For brevity, we only show the trees without further discussion - if you are interested in the structure and the information contained in the parse trees, have a look at Google (better reference needed ;) ).

(ROOT
 (S
  (NP (NNS Chillies))
  (VP (VBP are)
   (NP
    (ADJP (RB very) (JJ hot))
    (NNS vegetables)))
  (. .)))

(ROOT
 (S
  (NP (NNP Mike) (NNP Tyson))
  (VP (VBZ likes)
   (NP (JJ green) (NNS chillies)))
  (. .)))

(ROOT
 (S
  (NP (JJ Last) (NN week))
  (NP (PRP he))
  (VP (VBD ate)
   (NP
    (NP (CD five))
    (PP (IN of)
     (NP (PRP them)))))
  (. .)))
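
Such bracketed trees can be processed programmatically; here is a small sketch using NLTK's Tree class (the trees themselves come from the parser, the tooling below is our choice for illustration):

from nltk import Tree

parse = Tree.fromstring("""
(ROOT (S (NP (NNP Mike) (NNP Tyson))
         (VP (VBZ likes) (NP (JJ green) (NNS chillies)))
         (. .)))""")

# Collect every noun phrase in the constituent tree
for np in parse.subtrees(lambda t: t.label() == "NP"):
    print(" ".join(np.leaves()))
# Mike Tyson
# green chillies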

Stage 4b: Stanford Typed Dependencies

Besides the parse trees, the Stanford parser provides AutoAnnotator with typed dependencies.

nsubj(vegetables-5, Chillies-1)
cop(vegetables-5, are-2)
advmod(hot-4, very-3)
amod(vegetables-5, hot-4)

nn(Tyson-2, Mike-1)
nsubj(likes-3, Tyson-2)
amod(chillies-5, green-4)
dobj(likes-3, chillies-5)

amod(week-2, Last-1)
tmod(ate-4, week-2)
nsubj(ate-4, he-3)
dobj(ate-4, five-5)
prep(five-5, of-6)
pobj(of-6, them-7)

These dependencies are binary predicates, revealing some information about the structure of the phrase.
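
Programmatically, each dependency line can be read as a triple; a minimal sketch (the regular expression is our interpretation of the output format shown above):

import re

DEP = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")

def parse_dependency(line):
    # Turn 'nsubj(likes-3, Tyson-2)' into (relation, governor, dependent)
    rel, gov, gi, dep, di = DEP.match(line).groups()
    return rel, (gov, int(gi)), (dep, int(di))

print(parse_dependency("nsubj(likes-3, Tyson-2)"))
# ('nsubj', ('likes', 3), ('Tyson', 2))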

A special case is the copula; it denotes that some entity is part of a bigger group or has a certain characteristic. In this case, we see that the copula identifies the chillies as vegetables. This can be expressed with the thematic roles fingens (FIN) and fictum (FIC), specifying the element and the group, respectively.
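
A sketch of this reasoning step, building on the dependency triples from above (the decision rule is simplified; AutoAnnotator's actual logic may differ):

def copula_roles(dependencies):
    # If the phrase contains a copula, mark the subject as the element (FIN)
    # and the predicative noun as the group (FIC).
    roles = {}
    if not any(rel == "cop" for rel, _, _ in dependencies):
        return roles
    for rel, gov, dep in dependencies:
        if rel == "cop":
            roles[gov] = "FIC"   # 'vegetables' denotes the group
        elif rel == "nsubj":
            roles[dep] = "FIN"   # 'Chillies' is the element
    return roles

deps = [("nsubj", ("vegetables", 5), ("Chillies", 1)),
        ("cop", ("vegetables", 5), ("are", 2))]
print(copula_roles(deps))
# {('Chillies', 1): 'FIN', ('vegetables', 5): 'FIC'}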

Stage 5: Named Entity Recognition

Named Entity Recognition (NER) reveals only one piece of information: Mike Tyson is a person:

Mike/PERSON Tyson/PERSON likes/O green/O chillies/O ./O
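
Again for illustration, NLTK's named-entity chunker produces comparable labels (a stand-in for the recognizer actually used):

import nltk  # requires the 'maxent_ne_chunker' and 'words' data

tagged = nltk.pos_tag(nltk.word_tokenize("Mike Tyson likes green chillies."))
for chunk in nltk.ne_chunk(tagged):
    if isinstance(chunk, nltk.Tree):  # named entities come back as subtrees
        print(chunk.label(), " ".join(word for word, tag in chunk.leaves()))
# e.g. PERSON Mike Tyson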

Anaphora Resolution

JavaRAP determines the following anaphor information:

(1,0) Mike Tyson <-- (2,3) he
(1,3) green chillies <-- (2,7) them

To make use of this information, we replace the anaphor with its antecedent. This is supported by the resolution algorithm itself - it resolves third-person pronominal anaphors, and pronominals are less likely to bear a deeper meaning, so we stick to the antecedent:

https://svn.ipd.uni-karlsruhe.de/repos/koerner/mx/public/img/AutoAnnotator/AnaphorResolution.png
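
A sketch of this substitution, parsing JavaRAP's output format as shown above (reading sentence numbers as 1-based and token offsets as 0-based is our interpretation of that format):

import re

LINK = re.compile(r"\((\d+),(\d+)\) (.+?) <-- \((\d+),(\d+)\) (.+)")

def substitute_anaphors(sentences, rap_links):
    # Replace each pronominal anaphor with its antecedent, right to left
    # so that earlier token offsets stay valid after a substitution.
    tokens = [s.split() for s in sentences]
    links = [LINK.match(l).groups() for l in rap_links]
    for _, _, antecedent, ana_sent, ana_off, anaphor in sorted(
            links, key=lambda g: -int(g[4])):
        sent, off = int(ana_sent) - 1, int(ana_off)
        tokens[sent][off:off + len(anaphor.split())] = antecedent.split()
    return [" ".join(t) for t in tokens]

sents = ["Mike Tyson likes green chillies .",
         "Last week , he ate five of them ."]
links = ["(1,0) Mike Tyson <-- (2,3) he",
         "(1,3) green chillies <-- (2,7) them"]
print(substitute_anaphors(sents, links)[1])
# Last week , Mike Tyson ate five of green chillies .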

Stage 6a: Base Forms

The first ontology queried by AutoAnnotator is WordNet. It provides us with the base forms of the words in our phrases.

Having the base form is crucial for the further reasoning steps, since queries must be parametrized - and querying a knowledge base with, e.g., an inflected verb won't be of much use.
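
For illustration, WordNet's morphological analysis is directly available through NLTK (a stand-in for whatever WordNet interface AutoAnnotator uses):

from nltk.stem import WordNetLemmatizer  # requires the 'wordnet' data

wnl = WordNetLemmatizer()
print(wnl.lemmatize("ate", pos="v"))         # eat
print(wnl.lemmatize("likes", pos="v"))       # like
print(wnl.lemmatize("vegetables", pos="n"))  # vegetable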

Stage 6b: World Knowledge

Hidden in the 6th stage is also the querying of Cyc. With the help of Cyc we can determine much richer semantics. Take a look at the second phrase: it states that Mike Tyson actually likes chillies.

Having said that, we can wonder whether liking is an active action, like hitting somebody or walking. On the other hand, it could state some relationship between two entities, like living in a place. And there is a third category to which actions can belong: it could be a state transition, like winning a game. All three possibilities are mapped to different roles in SALE (ACT, STAT, and TRANS) - so which one is to be chosen?

Asking Cyc about to like, we get the following information (excerpt):

Predicate: likes-Generic
  isa:   Predicate Relation
         TemporalExistencePredicate TemporalExistenceRelation
  arity: 2
         (argIsa likes-Generic 1 Agent-Generic)
         (argIsa likes-Generic 2 Thing)

Having a predicate, we conclude that to like is indeed some sort of relationship between two entities - and thus we choose STAT.

Looking at the next phrase, we want to know more about to eat (excerpt):

Collection: EatingEvent
  isa:   Collection ConsumingFood-Food-Topic
         CumulativeEventType DurativeEventType
  genls: Action TemporalThing
         TemporallyExistingThing TemporallyExtendedThing

Having an Action (see genls above), we conclude that the role ACT is the correct one.
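
The decision rule can be sketched as follows (the structure of the Cyc answer and the constant name for state transitions are assumptions on our part):

def sale_role(cyc_answer):
    # Map a Cyc lookup result for a verb to one of the SALE verb roles.
    if "Predicate" in cyc_answer.get("isa", ()):    # relationship between entities
        return "STAT"
    if "Action" in cyc_answer.get("genls", ()):     # active action
        return "ACT"
    if "StateChangeEvent" in cyc_answer.get("genls", ()):  # state transition (assumed constant)
        return "TRANS"
    return None  # leave the verb unannotated for the human analyst

print(sale_role({"isa": ["Predicate", "Relation"]}))      # STAT (to like)
print(sale_role({"isa": ["Collection"],
                 "genls": ["Action", "TemporalThing"]}))  # ACT (to eat)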

Further Refinement

Additional knowledge about the components contained in the text can improve the annotation even further. Unfortunately, the chosen example cannot show this feature. But imagine the following:

  • Michelangelo paints a picture.

Since we do know that painting is an action and Michelangelo is the agens, we might deduce the following:

  • Michelangelo|AG paints|ACT #a picture|PAT .

But we (humans) know that painting is indeed a creative task. Asking Cyc about to paint (remember: the base form to paint has been determined by WordNet in the previous stage), we learn that painting is a creative task, meaning that somebody creates something. So we can use the more specific annotation (cf. the description of SALE for details):

  • Michelangelo|CREA paints|ACT #a picture|OPUSP .

This produces a richer UML diagram in the successive steps of SALE MX.
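
Sketched in code, this refinement is a simple specialization of the generic roles once the ontology has classified the verb (the lookup table below is hypothetical):

def refine_roles(roles, verb_is_creative):
    # Specialize the generic agent/patient roles if the ontology
    # classifies the verb as a creation event.
    if not verb_is_creative:
        return roles
    specialization = {"AG": "CREA", "PAT": "OPUSP"}
    return {word: specialization.get(role, role)
            for word, role in roles.items()}

print(refine_roles({"Michelangelo": "AG", "paints": "ACT", "picture": "PAT"},
                   verb_is_creative=True))
# {'Michelangelo': 'CREA', 'paints': 'ACT', 'picture': 'OPUSP'}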

Final SALE Document

Incorporating all the reasoning, the SALE document looks as follows:

#{ Chillies are very hot vegetables .}
[ Chillies|FIN #{are} $very $hot vegetables|FIC ].
#{ Mike Tyson likes green chillies .}
[ Mike_Tyson|AG likes|STAT $green chillies|PAT ].
[ @Chillies|EQK @chillies|EQD ].
#{ Last week, he ate five of them .}
[ $Last week|TEMP , he|AG ate|ACT *five #of them|PAT ].

[ @them|EQD @chillies|EQK ]. [ @he|EQD @Mike_Tyson|EQK ].

Related Publications

This tool has been developed as part of the ML10 thesis.


