The Development and Testing of a Natural Language Processor that Identifies Various Parts of Speech

Background

Natural Language

A natural language processor is a computer program that processes a natural language, that is, a language that people speak and write. In this project, the natural language is American English, hereafter referred to as the natural language. It is particularly difficult to process, in part because of its widespread slang. In addition, the natural language has no single spelling rule that holds without exceptions. These obstacles have kept many researchers from writing a successful natural language processor.
There are two ways to parse a sentence: by computation and by dictionary. A dictionary parser has a compiled list of words, and every word in the sentence must appear in that list. A computation parser examines each word individually and attempts to find its root in order to identify it. In either approach, if a word or word root is not known, the output may be incomplete or corrupted.
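
The difference between the two approaches can be sketched in C++ as follows. This is a minimal illustration and not the project's actual parser: the part-of-speech labels, the suffix rules, and the function names lookupDictionary() and lookupByRoot() are all assumptions made for the example.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical part-of-speech labels; the report does not list its tag set.
enum PartOfSpeech { UNKNOWN, NOUN, VERB, ADJECTIVE, ADVERB };

typedef std::map<std::string, PartOfSpeech> Dictionary;

// Dictionary parsing: the word must appear in the compiled list exactly.
PartOfSpeech lookupDictionary(const Dictionary& dict, const std::string& word) {
    Dictionary::const_iterator it = dict.find(word);
    return it == dict.end() ? UNKNOWN : it->second;
}

// Returns true if `word` ends with `suffix` and is longer than the suffix.
bool hasSuffix(const std::string& word, const std::string& suffix) {
    return word.size() > suffix.size() &&
           word.compare(word.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// One illustrative rule: a suffix and the part of speech it usually produces.
struct SuffixRule {
    const char* suffix;
    PartOfSpeech derived;
};

// Computation parsing: try the whole word, then strip a known suffix and
// look up the remaining root. The rule table is an assumption, not a
// complete description of English word formation.
PartOfSpeech lookupByRoot(const Dictionary& roots, const std::string& word) {
    static const SuffixRule rules[] = {
        { "ing", VERB }, { "ed", VERB }, { "ly", ADVERB }
    };
    PartOfSpeech whole = lookupDictionary(roots, word);
    if (whole != UNKNOWN) return whole;
    for (std::size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); ++i) {
        const std::string suffix(rules[i].suffix);
        if (hasSuffix(word, suffix)) {
            const std::string root = word.substr(0, word.size() - suffix.size());
            if (lookupDictionary(roots, root) != UNKNOWN) return rules[i].derived;
        }
    }
    return UNKNOWN;  // no known word or root: the result is incomplete
}
```

With a dictionary that contains only "walk" and "quick", the dictionary lookup cannot identify "walked", while the root-based lookup can. Both fail on a word whose root is also missing.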
Translation is another obstacle in natural language processing. Word-for-word translation is partially effective, but it is far from reliable. Because languages do not share the same vocabulary or alphabet, some words exist in one language but not in another. For instance, the Inuit are often said to have thirteen different words for snow; if one of those words were translated into English and back again, the type of snow could change.
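
The loss of meaning in a round trip can be shown with a small C++ sketch. The entries "snow-word-A" and "snow-word-B" are placeholders standing in for two distinct source-language words, not real vocabulary, and the function names are assumptions made for the example.

```cpp
#include <map>
#include <string>

typedef std::map<std::string, std::string> WordMap;

// Translates a single word; a word with no entry is returned unchanged.
std::string translateWord(const WordMap& table, const std::string& word) {
    WordMap::const_iterator it = table.find(word);
    return it == table.end() ? word : it->second;
}

// Translates a word into the other language and back again.
std::string roundTrip(const WordMap& forward, const WordMap& backward,
                      const std::string& word) {
    return translateWord(backward, translateWord(forward, word));
}

// Fills two tiny tables with placeholder entries (hypothetical data).
// Two different source words map to the single English word "snow",
// so the reverse table can return only one of them.
void buildExampleTables(WordMap& toEnglish, WordMap& fromEnglish) {
    toEnglish["snow-word-A"] = "snow";
    toEnglish["snow-word-B"] = "snow";
    fromEnglish["snow"] = "snow-word-A";
}
```

With these tables, a round trip starting from "snow-word-B" comes back as "snow-word-A": the English word in the middle cannot record which kind of snow was meant.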

Programming Language

This natural language processor will be written as a Win32 console project in Microsoft Visual C++ 6.0. A map holds the dictionary and is used to identify each word, because a map searches its contents automatically. A list stores the words of the sentence, because a list resizes dynamically. Strings store all of the input data and the individual words, also because they resize dynamically. Three functions are used to identify the part of speech of each word. The first, initWords(), loads all of the words into the parser's dictionary. The second, identify(), gets a word's part of speech from the dictionary. The third, clairify(), is used only when a word that could have multiple parts of speech is encountered. The percentage given at the end of the report is calculated only over the words that the parser found in its dictionary.
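
How these pieces might fit together is sketched below. Only the names initWords(), identify(), and clairify() come from the description above. Their parameters, the sample words, the helper functions splitWords() and tagSentence(), and especially the disambiguation rule inside clairify() are assumptions made for illustration. The sketch also assumes lowercase input without punctuation.

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Maps a word to every part of speech it may take (labels are illustrative).
typedef std::map<std::string, std::vector<std::string> > Dictionary;

// Loads a tiny hard-coded word list; a real dictionary would be far larger.
void initWords(Dictionary& dict) {
    dict["the"].push_back("article");
    dict["dog"].push_back("noun");
    dict["runs"].push_back("verb");
    dict["run"].push_back("verb");   // "run" is ambiguous: verb or noun
    dict["run"].push_back("noun");
}

// Splits a sentence on whitespace into a dynamically sized list of words.
std::list<std::string> splitWords(const std::string& sentence) {
    std::list<std::string> words;
    std::istringstream in(sentence);
    std::string word;
    while (in >> word) words.push_back(word);
    return words;
}

// Placeholder disambiguation (an assumption, not the report's method):
// after an article prefer "noun"; otherwise take the first candidate.
std::string clairify(const std::vector<std::string>& candidates,
                     const std::string& previousTag) {
    if (previousTag == "article") {
        for (std::size_t i = 0; i < candidates.size(); ++i)
            if (candidates[i] == "noun") return "noun";
    }
    return candidates.empty() ? "unknown" : candidates[0];
}

// Looks the word up in the map; calls clairify() only for ambiguous words.
std::string identify(const Dictionary& dict, const std::string& word,
                     const std::string& previousTag) {
    Dictionary::const_iterator it = dict.find(word);
    if (it == dict.end()) return "unknown";
    if (it->second.size() == 1) return it->second[0];
    return clairify(it->second, previousTag);
}

// Tags every word of a sentence in order, passing along the previous tag.
std::list<std::string> tagSentence(const Dictionary& dict,
                                   const std::string& sentence) {
    std::list<std::string> words = splitWords(sentence);
    std::list<std::string> tags;
    std::string previousTag;
    for (std::list<std::string>::const_iterator w = words.begin();
         w != words.end(); ++w) {
        previousTag = identify(dict, *w, previousTag);
        tags.push_back(previousTag);
    }
    return tags;
}
```

In this sketch, tagging "the dog runs" yields article, noun, verb. The ambiguous word "run" is tagged as a noun after "the" and as a verb otherwise, and a word that is missing from the map is tagged "unknown".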


Data