Computational Linguistics

Module: Computational Linguistics
Examiner: Sabine Reinhard
Assessor: Graham Katz

Oral Exam in Computational Linguistics, 2004

Examiners: Sabine Reinhard, Graham Katz

Grade: A

Topics:

1) Regular Expressions, Finite State Automata, Finite State Transducers
2) Machine Translation
3) The critical period hypothesis
4) Basic parsing (did not come up)

Procedure:

You agree on four topics with the examiners in advance and specify about 50 pages of literature for each topic. In the exam, three of the topics are then picked at random, although you may choose the first topic yourself. The rest of the exam essentially runs as a monologue, so you should be able to talk about each topic for roughly 9 minutes. It is best to ask Sabine beforehand what she considers important for each topic; she gives very detailed answers and also recommends topics. Interposed questions are rather rare, but you should still plan for them in your "presentation". In particular, take care to explain the connections between the individual points clearly and comprehensibly.

In my exam I chose "The critical period hypothesis" as the first topic. During my monologue Graham asked once for a precise definition of the CP (from when until when). The second topic was then REs and FSAs; this time a question came right at the start about what it means that FSAs and REs are equivalent. The third topic was Machine Translation. There were no questions during the monologue, but Graham asked one afterwards because I had finished too quickly ("Which approach does Verbmobil use, and when was it developed?"). I could not and did not have to answer this question, since I had not taken the MT course. Sabine came to my rescue here.

In general, Graham apparently likes to ask questions, while Sabine mostly just listens and makes brief corrections. The atmosphere with the two of them was quite relaxed and pleasant; even with the interposed questions you never felt as if they were trying to grill you.

Criticism:

My monologue was criticized for not getting to the main point quickly enough. The questions interposed during the monologues also suggested that general definitions and explanations at the beginning of a topic are welcome.

Monologues:

Below are, roughly, the three monologues I gave. It is best to prepare such texts for your own topics as well and to practise presenting them freely in 10 minutes with a friend.
(I accept no liability for errors in the texts!)

1) The critical period hypothesis

There are numerous different languages spoken around the world. While it is very difficult for an adult to acquire a new language, it seems to be rather effortless for a young child. A baby even seems to learn a language better and more quickly than any adult learner. This is remarkable, since adults otherwise learn better, and it is in fact a property of auto-associative memory to work better the more information is already present (learning enhances learning). The question that immediately arises is why children are so good at learning a language and why we lose this ability with age.
One possible explanation for language learning is motherese (caretaker speech). This hypothesis states that caretakers provide children with a very easy-to-learn version of the language ("smallest sentences for smallest ears"). But this hypothesis turns out to be false: mothers do use complex sentences when they talk to children, and they often use questions and commands, whereas children first learn declarative sentences. I also assume that there is more didactics in a second-language course for adults than in everyday motherese. In fact, one cannot find any external aid that should enable children to learn better than adults. Therefore it is widely assumed that children simply possess an innate ability to learn language more easily than adults, and there is indeed evidence for such an innate ability. The fact that adults later lose this ability has led to the notion of a critical period for language acquisition. Various hypotheses have been put forward that try to explain why there is such a short period for language acquisition and which features of language are mainly affected by it. These hypotheses can be subsumed under the title "critical period hypothesis".
Among the first to claim that there is such a critical period was Lenneberg (1967). In his book "Biological Foundations of Language" he collects studies of language-impaired patients and compares their recovery. These patients mainly suffered from aphasia, an explicit language disturbance resulting from damage to the language areas of the brain. His main finding was that younger patients recovered from aphasia better than older ones. Young children were even able to acquire language after hemispherectomy of the left hemisphere. This led Lenneberg to conclude that language learning has a significant biological basis and is controlled, at least partially, by some underlying maturational timetable.
These findings alone are of course no proof that language can only be learned during a critical period. In order to really investigate this question one would have to conduct an experiment like that of the Egyptian king Psammetichus, who is said to have deprived two children of any linguistic input in order to find out which language they would speak later. Fortunately such an experiment cannot be conducted nowadays. But there are cases where similar things have happened to children, for example isolated deaf children who were not taught any sign language by their hearing parents. Other cases are the so-called "wild children" who were deprived of language exposure until late in life. Three such cases are Isabelle, Genie and Chelsea. Isabelle was the youngest (6) when first exposed to language, and she also came to master the language best. Chelsea, on the other hand, was already 31 when she first learned a language, and her performance is the worst. Genie was studied most thoroughly, by S. Curtiss, and much was found out about which aspects of language can still be acquired later and which must be acquired during a critical period. It seems that the passive, the use of functional category words, proforms and reflexives are among the things which must be learned during a critical period.
These cases provided very strong evidence for the existence of a critical period, in the case of Genie especially for a critical period in the development of the left hemisphere. But one needs to be careful, because the studies were not conducted under controlled conditions and the "wild children" usually suffered from an abnormal childhood. Therefore more controlled studies were performed by E. Newport in order to find clear evidence for the CPH. These were conducted with deaf children who varied widely in the age at which they were first exposed to a sign language. The main finding of these experiments was that the early or native learners performed significantly better than the late learners, even though all groups had been exposed to the sign language for the same amount of time. The types of errors were also remarkably different in the two groups: late learners made many inconsistent errors and used grammar rules variably. Another experiment was done on second-language learners. Here the setup was similar, and so was the result: people who had started to learn a second language in childhood performed significantly better than late learners, although both groups had used the second language for the same amount of time. These experiments clearly show that learners who begin language acquisition at an early maturational stage end up performing better than late learners.
The last piece of evidence for the critical period comes from pidgin and creole languages. Here it is obvious that young learners must have some additional innate ability to learn a language. If people who speak different languages come together and live in one society, usually a language form called a pidgin emerges. This language is very simply structured and lacks many complex forms. If these people then have children who learn this pidgin as a native language, the children start to change the pidgin into a creole. The children do not acquire the simple pidgin (although it is their only linguistic input) but extend it into a more complex, new natural language. This phenomenon only happens when children learn the pidgin; therefore there must be something about the way children learn a language which is lost in adult life.
Many attempts have been made to explain the existence of a critical period for language acquisition. One possible explanation is the "less is more" hypothesis, which states that children are better at learning a language because their cognitive abilities are not yet fully developed.
There are also some people who reject the notion of a critical period altogether and point to examples of successful late language learners. But I think that the main notion of a critical period cannot be pushed aside. It seems that there must be some internal constraint or ability which enables children to learn a language more easily, and it also seems obvious that for some reason this ability declines with age. Whether this decline is due to some maturational timetable or to a loss of plasticity in the brain remains unclear. So far there are no data investigating whether the language-learning ability simply declines because it is no longer used once the basics of a language have been acquired successfully; this could also be a possible explanation. Regardless of any explanation, the data so far suggest that there is a critical period during our life in which we are able to learn the basic rules of a language. This ability is present from birth and declines around puberty.

***************************************************************************

2) Regular expressions, finite state automata, regular languages and finite state transducers

Regular expressions and finite state automata were both defined by Kleene in the 1950s (1951 and 1956). Kleene's work built on the theoretical work of Turing (1936) and on the work of McCulloch and Pitts (the McCulloch-Pitts neuron). Kleene also showed that REs and FSAs are equivalent. The FSA is considered one of the foundations of modern computer science.

Regular expressions are a very important tool in theoretical computer science and in linguistics for defining a language in a formal way. Languages which can be defined by a regular expression are called regular languages. Noam Chomsky defined a hierarchy of languages in terms of their complexity. This four-level containment hierarchy, called the Chomsky hierarchy, corresponds precisely to four classes of machines or grammars for formal languages (regular grammars, context-free grammars, context-sensitive grammars, type-0 grammars).
Each higher level in the hierarchy incorporates the lower levels: anything that can be computed by a machine at a lower level can also be computed by a machine at the next higher level.
More precisely, a regular expression is an algebraic notation for characterizing a set of strings. The main application of regular expressions is to specify patterns for text search strings. Regular expressions over an arbitrary alphabet Σ are formally defined as follows:
• A symbol from the alphabet is a regexp
• The concatenation of two regexps x and y, written xy, is a regexp
• The sum (disjunction / union) of two regexps x and y, written x+y, is a regexp
• The iteration of a regexp x, written x*, is a regexp (Kleene closure)
These three primitive operations are theoretically all that is needed to describe a string pattern. For convenience many other advanced operators are available. Among the most important are anchors (^ $ \b \B) to specify positions in a string, a negation operator (^), an optionality operator (?), a wildcard for any character (.) and operators for ranges or counters. But any complex regular expression can be transformed into a (possibly larger) regular expression which contains only the three primitive operators. Among these operators there is a well-defined precedence hierarchy ( () > *+?{} > sequences, anchors > | ).
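Purely as an illustration (my own examples, not part of the exam literature), here is how the primitive operations and two convenience operators look in Python's re syntax, where "|" is written for the union:

import re

# Concatenation, Kleene star and union over the alphabet {a, b}:
# the pattern (ab)*(a|b), built only from the primitive operations.
primitive = re.compile(r"(?:ab)*(?:a|b)")
print(bool(primitive.fullmatch("ababa")))   # True
print(bool(primitive.fullmatch("abab")))    # False

# The same kind of pattern with convenience operators (a counter and the
# optionality operator), both reducible to the primitives above.
convenient = re.compile(r"(?:ab){2,}a?")
print(bool(convenient.fullmatch("ababa")))  # True: "ab" twice plus an optional "a"
print(bool(convenient.fullmatch("ab")))     # False: needs at least two "ab"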
An extended feature of regular expressions is the ability to refer back to a particular subpart of the matched string (memory; not part of all RE languages and not part of FSAs!). Together with a substitution operation this allows the easy implementation of a natural language understanding program like ELIZA (Weizenbaum 1966).
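As a small, hedged sketch of this memory-plus-substitution idea (the rewrite rule below is invented for illustration and is not an actual ELIZA rule):

import re

# An ELIZA-style rewrite rule: the capture group is the "memory" and
# re.sub performs the substitution.
rule = re.compile(r"\bI am (.*)", re.IGNORECASE)
print(rule.sub(r"Why do you say you are \1?", "I am tired of studying"))
# -> Why do you say you are tired of studying?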

As mentioned earlier, REs are one way of describing finite state automata. The advantages of FSAs are that they are easy to implement and very efficient. An FSA is formally described by the 5-tuple (Q, Σ, q0, F, δ(q,i)), where Q is the set of all states, Σ is the (finite) alphabet, q0 is the start state, F is the set of all final states and δ(q,i) is the transition function from Q × Σ to Q. There are essentially two types of FSA: deterministic FSAs and non-deterministic FSAs. A DFSA is fully determined by its current state and the input, whereas an NFSA contains decision points and/or ε-transitions (epsilon transitions). At decision points there is more than one possible successor state for the given input, and ε-transitions are empty transitions which may be taken at any time without looking at the input or advancing the input pointer. The only formal difference between the two is that NFSAs have sets of states in their transition function and possibly more than one start state (an NFSA with ε-transitions can be given a single start and a single final state). Otherwise the two kinds of FSA are identical, and in fact any NFSA is equivalent to some DFSA. One can even transform any NFSA into a DFSA, but then the number of states may get very large (up to 2^N states, the power set). This conversion works like a parallel expansion of all the possibilities the NFSA has (subset construction).
One application of FSAs in linguistics is to accept or recognize strings of a formal language (= a subset of Σ*). For this we think of the input string as written on a tape with one cell per symbol. The FSA starts in its start state, looks at the next input symbol and crosses the arc labelled with that symbol to the next state; the pointer on the input is advanced by one symbol as well. If no arc with the current input symbol leaves the current state, the FSA rejects the input (or goes to a sink state). A string is accepted if the FSA is in a final state when the end of the input is reached.
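A minimal Python sketch of this recognition procedure (my own encoding; the toy "sheep language" baa+! is a common textbook example):

# Deterministic FSA as the 5-tuple (Q, SIGMA, Q0, F, DELTA), recognizing baa+!
Q = {0, 1, 2, 3, 4}
SIGMA = {"b", "a", "!"}
Q0 = 0
F = {4}
DELTA = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

def recognize(tape):
    """Walk the tape symbol by symbol; a missing transition means rejection."""
    state = Q0
    for symbol in tape:
        if (state, symbol) not in DELTA:
            return False              # implicit sink state
        state = DELTA[(state, symbol)]
    return state in F                 # accept only if we end in a final state

print(recognize("baaa!"))  # True
print(recognize("ba!"))    # False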
When we use an NFSA for recognizing a string we have a problem, as the decision points might lead to rejections of correct strings. There are three approaches to deal with this: backup, look-ahead and parallelism. In the look-ahead approach we also look at the input symbols after the current one to make our decision, in the parallel approach all possibilities are expanded in parallel, and in the backup approach the well-known backtracking technique is used. The backup approach can be implemented as a search, and the common search strategies DFS and BFS can be applied. Here we need to store the current position on the input tape and the state of the NFSA in order to backtrack and explore alternatives if the search fails.
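A sketch of this agenda-based search, under the assumption of a small hand-made NFSA (a queue gives BFS, a stack gives DFS):

from collections import deque

# Non-deterministic transition table with a decision point in state 0.
NDELTA = {(0, "a"): {0, 1}, (1, "b"): {2}}
START, FINAL = 0, {2}

def nd_recognize(tape, bfs=True):
    # Each agenda entry stores an NFSA state plus a tape position.
    agenda = deque([(START, 0)])
    while agenda:
        state, pos = agenda.popleft() if bfs else agenda.pop()
        if pos == len(tape):
            if state in FINAL:
                return True
            continue
        for successor in NDELTA.get((state, tape[pos]), ()):
            agenda.append((successor, pos + 1))
    return False

print(nd_recognize("aaab"))  # True
print(nd_recognize("abb"))   # False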
An FSA can also be used to generate strings. This is simply done by printing the symbol written on the arc that is crossed. Here the FSA might of course have to decide which arc to cross, but this can be done in any arbitrary way. As an FSA is able to both generate and recognize the strings of a formal language, it acts as a definition of that language. An FSA is particularly useful for defining a language because it expresses an infinite set in a closed form.
Both REs and FSAs characterize exactly the regular languages (= sets of strings).

An extension of the finite state automaton is the finite state transducer. A transducer maps between one set of symbols and another; an FST does this via a finite automaton. An FST thus defines a relation between sets of strings; regular relations are sets of pairs of strings. An FST can be seen as a recognizer (output is accept/reject), as a generator (it generates the pairs of strings of the relation), as a translator (it reads one string and outputs another) and as a set relater (it computes the relation between two sets). An FST can be formally defined as a simple extension of an FSA: only the alphabet has to be changed, so that instead of single simple symbols an FST takes complex input-output pairs.
An FST can then be used for morphological parsing. One approach to this is called two-level morphology and was first proposed by Koskenniemi (1983). Here a word is represented as a correspondence between a lexical level and a surface level. On the lexical level we have the morpheme and feature sequence, e.g. CAT +N +PL, whereas on the surface level we have the actual spelling of the word, e.g. CATS.
To use the FST for this two-level morphology we view it as having two tapes, one for the surface level and one for the lexical level. The complex symbols are written as pairs like "a:b", where the left symbols go onto the lexical tape and the right symbols onto the surface tape. Thus each complex symbol a:b expresses how the symbol a on one tape is mapped onto the symbol b on the other tape.
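A toy illustration of this mapping (my own encoding, not Koskenniemi's actual rule formalism): each arc carries a lexical:surface pair, and reading one side of the arcs writes the other.

# Arc labels as lexical:surface pairs for the example word above.
ARCS = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", ""), ("+PL", "s")]

def lexical_to_surface(arcs):
    return "".join(surface for _, surface in arcs)

def surface_to_lexical(arcs):
    return "".join(lexical for lexical, _ in arcs)

print(lexical_to_surface(ARCS))   # cats
print(surface_to_lexical(ARCS))   # cat+N+PL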

***************************************************************************

3) Machine Translation

Translation means to transfer a text from a source language into a target language while preserving the meaning. In order to achieve this, a high degree of knowledge about the content and situation of the text is required.
The big objective of machine translation is fully automatic high-quality MT (FAHQMT), but this goal is still far from being reached. So far there is no machine translation without human intervention or assistance. Today's MT systems are tools that help a human translator rather than stand-alone systems; this is called "computer-aided translation" (CAT).
The attempt to build a working MT system is actually very old: as early as 1933 a mechanical procedure for carrying out translations was patented. An MT system prototype (Russian-English) was demonstrated to the public in 1954. From then on there was a lot of research going on in the area of MT. But already in 1966 MT research in the US was brought almost completely to a halt by a report of the Automatic Language Processing Advisory Committee (ALPAC). Research continued in Canada and Europe, however, and in the mid-1970s working systems like Météo or Systran were introduced. In the 1980s interest in MT increased again and AI techniques were included in new systems. During this time some commercial systems began to appear, and nowadays there is very high demand for MT systems from industry (for translating manuals etc.).

But machine translation is a very demanding task. Language is always situated in a context and even in a culture. In a text the meaning of a sentence is not determined just by itself but also by the preceding and following parts of the text. And the overall meaning of a text is not determined only by the words, phrases or sentences it is made up of, but also by the situation in which it is used. Therefore a computer needs access to a lot of world knowledge to translate a sentence correctly, and it also needs to determine the meaning and the context of the text automatically. This fact alone makes the goal of fully automatic MT very difficult, but there are various problems which further complicate the task.
First there are so-called translation mismatches; these are cases where a meaning or concept in one language cannot be adequately expressed in another language (lexical gaps) or is ambiguous. Here the resources of the target language force the translator to add or delete information relative to the source. Another problem are so-called structural divergences, where some grammatical feature of the source language is realized differently in the target language. Examples are categorial divergence, thematic divergence, and changes in active/passive voice or constituent order. Sometimes single words also need to be translated into multi-word expressions. This problem can be handled by encoding all the needed information in the lexicon or by providing appropriate grammar rules for these cases.
A third problem is ambiguity. In most cases ambiguity arises when a word has two or more unrelated meanings in the SL (e.g. book); this is called lexical ambiguity. A similar kind of ambiguity can occur when words are used metaphorically (like a dying program). Another type is structural ambiguity, which we already encountered in parsing (attachment ambiguity, coordination ambiguity). Reference ambiguity is also often a problem, especially if the translation is into a language which makes more or fewer gender distinctions. Lexical ambiguities can be resolved by the local or global context.
There are three main approaches to MT, which can be classified according to the level of abstraction they use. The first-generation systems use a so-called direct approach; the second-generation systems use a transfer approach or an interlingua approach.
In the direct approach typically only one pair of languages is considered. The idea is that the MT system does as little work as possible. Such a system is composed of several stages, each of which focuses on one type of problem. The first stage is a morphological analysis of the source language input. Then all words are looked up in a bilingual dictionary (content words first). In the next step various rearrangements and local reorderings happen. In the last stage the morphology of the target language is generated and the translation is finished. One advantage of this approach is that it can take structural similarities between the SL and TL fully into account. On the other hand it is a very inflexible approach (mostly uni-directional). Its key characteristic is that it does not use any representations or complex structures and just treats the input as a string of words.
In the transfer approach there are bilingual transfer modules between the monolingual analysis and generation modules. This approach consists of three phases: an analysis phase, a transfer phase and a generation phase. In the analysis step a syntactic parse of the input is produced. In the transfer step a syntactic transfer of the parse tree is done, with some rearranging of the tree, together with the lexical transfer; the last step then generates the target-language output. This approach makes more use of contrastive knowledge (the differences between the languages) and is already more flexible, as it works in both directions. To add a further language one needs one monolingual analysis module, one monolingual generation module and bilingual transfer modules for all languages already in the system. The total number of modules for n languages is n*(n+1), as the small calculation below illustrates.
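A minimal sketch of that count (my own arithmetic, assuming one transfer module per ordered language pair):

def transfer_modules(n):
    # n analysis + n generation + n*(n-1) transfer modules = n*(n+1)
    return n + n + n * (n - 1)

for n in (2, 3, 5):
    print(n, transfer_modules(n))   # 2 -> 6, 3 -> 12, 5 -> 30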
The third approach to MT is based on an interlingua between the SL and TL. This approach involves only two stages, in which a text is converted to and from a single language-independent meaning representation. This interlingua can have any form, but it must be able to capture the meaning of the input sentence completely; semantic representations are one way of defining an interlingua. The interlingua should represent all sentences that mean the same thing in the same way, regardless of the language they are in. In this approach even more work needs to be done by the parser and a (semantic) interpreter, but after this no further transfer is needed. The main drawback of the interlingua approach is that a suitable interlingua representation is not easily found. Also, to translate a sentence into the interlingua representation, an exhaustive analysis of the input is needed, including (in the pure form of the approach) a complete disambiguation. Sometimes, however, techniques for preserving ambiguity are needed in order to produce the same "vague" meaning in the output. The main advantage of this approach is of course that it is highly flexible and easily extendable: adding a new language only ever entails the creation of two new modules, an analysis module (to the interlingua) and a generation module (from the interlingua). (See the Vauquois triangle.)
Further approaches are example-based MT and statistics-based MT. The example-based approach is a kind of direct translation which operates on a bilingual corpus with alignments of translation units at the word, phrase and sentence level. For a translation the system checks whether an adequate translation is stored in the corpus; the result is therefore best if large coherent parts are found in the corpus. The statistics-based approach is also a version of direct translation and likewise uses an aligned bilingual corpus. Here the best translation is selected by a probabilistic model which is computed from a fluency measure (over the target) and a faithfulness measure (from source to target). Both approaches focus more on the result of the translation than on the process.
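As a hedged sketch of that selection step (the two scoring functions below are toy stand-ins, not real language or translation models):

def fluency(target):
    # Toy target-language model: prefers candidates that look like English.
    return 0.5 if target.startswith("the") else 0.1

def faithfulness(source, target):
    # Toy translation model: prefers candidates of matching length.
    return 0.8 if len(target.split()) == len(source.split()) else 0.2

def best_translation(source, candidates):
    # Pick the candidate that maximizes fluency * faithfulness.
    return max(candidates, key=lambda t: fluency(t) * faithfulness(source, t))

print(best_translation("das Haus", ["the house", "house the big"]))
# -> the house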
Crucial design questions when building an MT system are whether a multilingual or only a bilingual system is needed, whether the translation should be reversible or not, and whether the system should be interactive or non-intervening (batch). These factors mainly drive the decision for one particular MT approach.
More research still needs to be done in the area of MT, and today's systems are only CAT tools. But an automatic translation is better than no translation, and even the post-editing of raw translations increases the productivity of human translators by about 50%. And in sublanguage domains there are already very well working systems available which can offer good translations.

Good luck with the exam!
Klaus Tichacek (tichacek@web.de)