Abstract: Over the past six decades, a major portion of research in AI has been dedicated to recreating human abilities. One such effort is to make computer systems intelligent enough to understand the natural language information that surrounds us in abundance. The process of understanding information can be broken down into contextualizing relevant data and identifying the dependencies among them. Initiatives such as the Semantic Web are in progress to bridge the gap between data, information and knowledge. A more specific goal is a personalized system capable of 'remembering' data and providing 'well thought' responses to questions about that data.
Key Words: Sanskrit, Natural Language Processing, Intelligent Agent
This paper investigates the modules required to build an information assimilation system capable of recording and generating natural language Sanskrit sentences and responses. The input to the system is a simple single-clause declarative sentence in Sanskrit that carries some information, regardless of its relevance to the data already present. Information is extracted from the input using the rules of Sanskrit grammar and inserted into a specially designed knowledge base data structure. Similarly, retrieval of information begins with a simple single-clause interrogative sentence as the input. The exact nature of the expected output, a single word or a single phrase, is identified using Sanskrit grammar rules, and the answer is retrieved from the knowledge base. The efficacy of the entire system, however, depends on the factual correctness of the input and the completeness of the information present in the knowledge base.
The entire system is modularized into 3 functionally independent units: the Sanskrit Interpreter, the Knowledge Base Management System and the Knowledge Base. These three can be visualized as speech, mind and brain, where the brain is merely a storage mechanism operated upon solely by the mind to perform data insertion and extraction. The modules are data coupled and logically cohesive, which keeps them functionally independent and ensures upgradability of the system.
Sanskrit is one of the oldest and most mature languages known to man. Sometimes referred to as the 'language of gods', Sanskrit is also one of the very few well-defined languages with a precise grammar, which makes it highly suitable for Natural Language Processing. Panini [6th century BCE] wrote an elaborate grammar for Sanskrit that eases the creation of unambiguous words and sentences without any need to refer to a dictionary. The grammar was designed to be memorized, and it lays out guidelines for creating and understanding new words.
Unlike today's contemporary languages, Sanskrit has remained unaltered over the course of time. Numerous scriptures of a spiritual and scientific nature have been penned in Sanskrit and have found their way into the modern world. Right from the age of its creation, it has been a carrier of knowledge from generation to generation, passed on verbally. A notable feature is its affinity for rhythmic sentences. This is because Sanskrit was 'designed' to be a vocal language with speech as the primary mode of communication. This very nature is a primary reason for selecting Sanskrit as the preferred language to communicate information.
Grammatical Structure of Sanskrit
Having outlined the reasons for selecting Sanskrit as the language of communication, its workings need to be described in detail. First, however, a brief outline of its grammatical structure is given. There are 16 vowels called 'swar', which are complete in themselves and can be pronounced independently, and 35 consonants called 'vyanjan', each of which must be fused with a vowel in order to be pronounced. Words are primarily categorized into nouns (shabda), verbs (dhatu), and indeclinables (avyaya). Nouns and verbs are used in sentences through inflection, by gender-number-case for nouns and tense-number-person for verbs, while the avyaya remain free of any inflection.
Table 2.1 Noun table
Table 2.2 Verb table
Nouns and verbs exist in their root forms, called shabda (noun root) and dhatu (verb root), and are transformed into shabdaroop and dhaturoop respectively after inflection. The conversion from root to complete word is what gives Sanskrit its unique advantage: it is driven by tables built on 3 genders (ling), 10 tenses (kAla), 8 cases (vibhakti), 3 persons (purush), 3 numbers (vachan) and the trailing consonant/vowel of the respective noun. For nouns, there is one table of the form 8 (case) x 3 (number) for each unique combination of trailing consonant/vowel and gender; for verbs, there is one table of the form 3 (person) x 3 (number) for each of the 10 tenses. Though this is only a superficial structure, it is sufficient to process the most commonly used sentences.
Words in Sanskrit
A noun in Sanskrit is created using the noun root, the trailing consonant/vowel of the noun root, gender, number, and case. Of these five, the first three are already known: the root is the name itself, and the other two are easily derived from it. Number and case depend on the meaning of the sentence. For example, the word 'rAma:' can be broken down into the tuple rAm-a-male-one-nominative, in the aforementioned sequence, by identifying 'rAma:' as 'rAm+a:'. The tuple is derived by locating 'a:' in the table for 'a-male' at location (1,1). Conversely, when the tuple rAm-a-male-one-nominative is available, 'rAma:' can be regenerated.
Verbs are formed in much the same way as nouns, using the verb root, number, person, and tense. Given only the verb root and values for number, person and tense, a meaningful verb is created. For example, a quick lookup of the verb table for 'gacCati' yields the tuple gacC-one-third-present by splitting the word as 'gacC+ati'. In this case 'ati' is defined in the verb table for the present tense at location (1,1). Similarly, recreating the verb from the same tuple is only a table lookup away.
The task of identifying which table a suffix belongs to, out of a large set of tables, is accomplished by building a single suffix tree from all the tables and feeding the word into it backwards. Since a single suffix may occur in several distinct tables, the suffix tree is built so that each successful hit returns the tuple of every match, and these tuples are maintained in a stack local to each word. This preserves redundancy at the lower level, where the meaning of the sentence is yet to be ascertained.
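The backward lookup can be sketched as follows. This is a minimal illustration, not the system's actual implementation: it assumes a trie keyed on reversed suffixes rather than a full suffix tree, the class name `SuffixTrie` is hypothetical, and the suffix entries are a tiny illustrative subset of the noun and verb tables.

```python
class SuffixTrie:
    """Trie over reversed suffixes; a node holds the tuples of suffixes ending there."""

    def __init__(self):
        self.children = {}
        self.tuples = []

    def insert(self, suffix, info):
        # Store the suffix back to front so that words can be fed in backwards.
        node = self
        for ch in reversed(suffix):
            node = node.children.setdefault(ch, SuffixTrie())
        node.tuples.append(info)

    def analyse(self, word):
        """Walk the word from its last character; return the stack of all matches."""
        stack = []
        node = self
        for i in range(len(word) - 1, -1, -1):
            node = node.children.get(word[i])
            if node is None:
                break
            root = word[:i]
            if root:  # a suffix must leave a non-empty root
                for info in node.tuples:
                    stack.append((root,) + info)
        return stack


# Illustrative subset of table entries (assumed, not the complete tables).
ENTRIES = [
    ("a:", ("noun", "a-male", "one", "nominative")),
    ("am", ("noun", "a-male", "one", "accusative")),
    ("am", ("noun", "a-neuter", "one", "nominative")),
    ("am", ("noun", "a-neuter", "one", "accusative")),
    ("ati", ("verb", "present", "one", "third")),
    ("anti", ("verb", "present", "many", "third")),
]

trie = SuffixTrie()
for suffix, info in ENTRIES:
    trie.insert(suffix, info)

print(trie.analyse("rAma:"))
print(trie.analyse("gacCati"))
print(trie.analyse("phalam"))  # one suffix, several tables: all matches are kept
```

The last call returns three candidate tuples for a single word, which is the per-word stack that later stages narrow down once the rest of the sentence is known.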
Construction of words is a straightforward task completed in 3 steps. First, the root word to which the suffix will be attached is obtained. Then the word's tuple is acquired from the larger tuple that defines the meaning of the sentence. Finally, the suffix is fetched from the table corresponding to the tuple and concatenated to the root word.
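The three steps can be sketched as a pair of table lookups. The dictionaries below are an assumed, heavily truncated stand-in for the real 8 x 3 noun tables and 3 x 3 verb tables, and the function names are hypothetical.

```python
# Noun tables keyed by (trailing vowel, gender); cells keyed by (case, number).
NOUN_TABLES = {
    ("a", "male"): {
        ("nominative", "one"): "a:",
        ("accusative", "one"): "am",
    },
}

# Verb tables keyed by tense; cells keyed by (person, number).
VERB_TABLES = {
    "present": {
        ("third", "one"): "ati",
        ("third", "many"): "anti",
    },
}


def build_noun(root, ending, gender, number, case):
    """Step 1: take the root. Step 2: take the tuple. Step 3: fetch and attach the suffix."""
    suffix = NOUN_TABLES[(ending, gender)][(case, number)]
    return root + suffix


def build_verb(root, number, person, tense):
    suffix = VERB_TABLES[tense][(person, number)]
    return root + suffix


print(build_noun("rAm", "a", "male", "one", "nominative"))  # rAma:
print(build_verb("gacC", "one", "third", "present"))        # gacCati
```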
Sentences in Sanskrit
In Sanskrit, a sentence is simply a collection of words with similar properties. As seen above, this redundancy ensures correct construction and also aids in determining the correct meaning of sentences. A sentence can be identified as interrogative or declarative by the presence or absence of a set of 'ka' root words. These roots are finite and constant, so a preliminary scan of the entire sentence for them enables the system to decide all the subsequent operations to perform.
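A minimal sketch of this preliminary scan is shown below. The set `KA_WORDS` is an assumed, non-exhaustive sample of interrogative forms, and for simplicity the sketch matches whole inflected words rather than analysing each word down to its 'ka' root.

```python
# Assumed sample of interrogative words; the real set would cover all 'ka' root forms.
KA_WORDS = {"ka:", "kA", "kim", "kutra", "kadA", "kati"}


def classify(sentence):
    """Return 'interrogative' if any word is a known 'ka' form, else 'declarative'."""
    words = sentence.split()
    if any(word in KA_WORDS for word in words):
        return "interrogative"
    return "declarative"


print(classify("rAma: kutra gacCati"))  # interrogative
print(classify("rAma: gacCati"))        # declarative
```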
Extracting information from a sentence is analogous to creating a super-tuple out of the tuples of its words. The super-tuple is created by merging one selected tuple per word. That tuple is selected from the stack of each word by performing successive common factorization on the elements of the tuples across all words of the sentence.
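One way to read 'successive common factorization' is as repeated agreement filtering: for each shared feature, keep only the candidates whose value is offered by every word. The sketch below follows that reading, which is an assumption. The noun stack is hypothetical (a form whose suffix appears under both a singular and a plural entry), and taking the first surviving candidate is a placeholder tie-break.

```python
def common_values(stacks, feature):
    """Values of `feature` shared by all words that carry it; None if no word does."""
    per_word = [{c[feature] for c in stack if feature in c} for stack in stacks]
    per_word = [values for values in per_word if values]
    if not per_word:
        return None
    return set.intersection(*per_word)


def select_tuples(stacks, features=("number",)):
    """Filter every word's stack feature by feature, then pick one tuple per word."""
    for feature in features:
        allowed = common_values(stacks, feature)
        if allowed is None:
            continue  # no word constrains this feature
        if not allowed:
            raise ValueError(f"words do not agree on {feature}")
        stacks = [
            [c for c in stack if feature not in c or c[feature] in allowed]
            for stack in stacks
        ]
    return [stack[0] for stack in stacks]  # placeholder tie-break: first survivor


# Hypothetical candidate stacks for a two-word sentence.
noun_stack = [
    {"pos": "noun", "number": "one", "case": "genitive"},
    {"pos": "noun", "number": "many", "case": "nominative"},
]
verb_stack = [
    {"pos": "verb", "root": "gacC", "number": "many", "person": "third", "tense": "present"},
]

super_tuple = select_tuples([noun_stack, verb_stack])
print(super_tuple)
```

Here the verb only offers the plural, so the singular reading of the noun is discarded and the super-tuple combines the plural nominative noun with the verb.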
The Knowledge Base of the system is a data structure designed specifically to be comprehensive enough to encompass any form of knowledge. This is accomplished by a data structure called 'Frames and Images' that maps the relationships between words, treating them purely as lexical items.
Frames and Images
This data structure is designed on the premise of establishing 'isA', 'isNotA' and 'actionPerformedBySuperOnMe' relationships between a word and other words, treating each word as a set of word-relationship pairs. Every new word in the knowledge base is called a Frame. A Frame is defined as a set containing Images and numerical constants, whereas an Image is a set of Images and numerical constants that also carries a pointer to its parent Frame and a relationship token. The token defines the relationship of a Frame with the Images it contains. The numerical constants carry no relationship tokens; they are merely numbers with no significance of their own.
This structure is capable of capturing any information solely by defining relationships among words through the three tokens, the only requirement being that a new Frame is created for each distinct word introduced into the knowledge base. A word is made meaningful by the relationships it shares with other words. For example, in the sentence 'Airplane flies in sky', the word 'Airplane' is associated with the word 'sky' over the 'actionPerformedBySuperOnMe' token of the word 'fly'. In the knowledge base, 'Airplane' is the Frame and 'sky' and 'fly' are the Images.
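A minimal sketch of Frames and Images for this example follows. The class and method names are hypothetical, and nesting the 'sky' Image inside the 'fly' Image is only one possible reading of the description, since the exact layout of the figure is not reproduced in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List

IS_A = "isA"
IS_NOT_A = "isNotA"
ACTION = "actionPerformedBySuperOnMe"


@dataclass(eq=False)
class Frame:
    """A word in the knowledge base: a set of Images and numerical constants."""
    name: str
    images: List["Image"] = field(default_factory=list)
    constants: List[float] = field(default_factory=list)


@dataclass(eq=False)
class Image:
    """A word seen from inside a Frame, with a token and a pointer to that Frame."""
    name: str
    token: str
    parent: Frame = field(repr=False)
    images: List["Image"] = field(default_factory=list)
    constants: List[float] = field(default_factory=list)


class KnowledgeBase:
    def __init__(self) -> None:
        self.frames: Dict[str, Frame] = {}

    def frame(self, word: str) -> Frame:
        # Every distinct word gets its own Frame on first mention.
        if word not in self.frames:
            self.frames[word] = Frame(word)
        return self.frames[word]

    def relate(self, subject: str, token: str, obj: str, action: str = None) -> Image:
        parent = self.frame(subject)
        self.frame(obj)
        target = Image(obj, token, parent)
        if action is None:
            parent.images.append(target)
            return target
        self.frame(action)
        # Assumed layout: the action Image holds the Image it is performed on.
        act = Image(action, token, parent, images=[target])
        parent.images.append(act)
        return act


kb = KnowledgeBase()
fly = kb.relate("Airplane", ACTION, "sky", action="fly")
print(sorted(kb.frames))                     # ['Airplane', 'fly', 'sky']
print(fly.name, "->", fly.images[0].name)    # fly -> sky
```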
Fig 1: Pictorial representation of Frames and Images for given example
Tokens are the key features of this data structure, defining relationships among words. The 'isA' token is the logical opposite of the 'isNotA' token, whereas the 'actionPerformedBySuperOnMe' token is independent of logical implication and denotes only the relationship between action, super and me. In the previous example, this token stands for 'fly performed by Airplane on sky', meaning that Airplane is performing the action of fly on sky. The sentence 'Earth is round' is represented as 'Earth' 'isA' 'round', which by inheritance applies 'isA' from all Images in the Frame 'round' to 'Earth'.
Logical operations such as AND, OR and NOT are also performed on the knowledge base. NOT, being a unary operator, simply converts an 'isA' to an 'isNotA' and vice versa, or prefixes 'isNotA' to create an 'isNotAActionPerformedBySuperOnMe' token that is the logical opposite of 'actionPerformedBySuperOnMe'. The AND operator works in a distributive fashion, applying the same scheme of tokens to every word in the AND clause. For example, 'Ball, Earth and Marble are round' applies 'isA round' to 'Ball', 'Earth' and 'Marble'. OR, on the other hand, creates uncertainty about the facts or provides a choice in which each option may be a fact. Thus, an OR operation creates an Image whose name is all the words in the OR clause concatenated together.
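The three operators can be sketched on plain (subject, token, object) triples. The function names are hypothetical, the '|' separator used for the OR name is an assumption (the text only says the words are concatenated), and undoing a negated action token is likewise assumed.

```python
IS_A = "isA"
IS_NOT_A = "isNotA"
ACTION = "actionPerformedBySuperOnMe"
NOT_ACTION = IS_NOT_A + ACTION[0].upper() + ACTION[1:]


def negate(token):
    """NOT: swap isA/isNotA, or toggle the isNotA prefix on the action token."""
    opposites = {
        IS_A: IS_NOT_A,
        IS_NOT_A: IS_A,
        ACTION: NOT_ACTION,
        NOT_ACTION: ACTION,  # assumed: a double negation restores the original token
    }
    return opposites[token]


def apply_and(subjects, token, obj):
    """AND: distribute the same token and object over every word in the clause."""
    return [(subject, token, obj) for subject in subjects]


def apply_or(subject, token, choices, separator="|"):
    """OR: a single Image whose name is the concatenation of all choices."""
    return (subject, token, separator.join(choices))


print(negate(ACTION))
print(apply_and(["Ball", "Earth", "Marble"], IS_A, "round"))
print(apply_or("Ball", IS_A, ["red", "blue"]))
```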
KNOWLEDGE BASE MANAGEMENT SYSTEM
Analogous to a DBMS for relational databases, the KBMS performs insertion, extraction and updating of knowledge in the knowledge base. The primary function of this module is to perform operations on the knowledge base and to interact with the Sanskrit Interpreter to send and receive super-tuples. This involves initiating a search for the elements of a tuple in the case of an interrogative statement, or entering new information into memory after data security checks.
This module maintains a key-value index of Frames to perform faster searches. In addition, each entry into the knowledge base is time stamped to facilitate temporal operations and to service queries requiring knowledge of time. A search is performed when the super-tuple of an interrogative sentence is received from the Interpreter module. The super-tuple contains the normalized elements of each word, including the 'ka' root words that carry the properties of the expected answer. To search memory for a solution, the Frames whose names appear in the super-tuple are fetched from the knowledge base and converted into First-Order Logic (FOL); the super-tuple is likewise converted into FOL. If the query is not satisfied in one pass, a recursive descent search is performed over non-repeating Frames, i.e. all the Images within a Frame are replaced with the contents of their corresponding Frames. This ensures that information without any explicit relation to the original word can be identified and extracted.
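The index, the time stamps and the expansion over non-repeating Frames can be sketched as below. This is a simplified illustration under several assumptions: the class `KBMS` and its methods are hypothetical, the conversion to First-Order Logic is not modelled, only Images carrying the queried token are expanded, and the descent uses an explicit stack instead of recursion. The fact 'round isA shape' is invented for the example.

```python
import time


class KBMS:
    def __init__(self):
        # Key-value index: Frame name -> list of (token, Image name, timestamp).
        self.index = {}

    def insert(self, subject, token, obj, timestamp=None):
        stamp = time.time() if timestamp is None else timestamp
        self.index.setdefault(subject, []).append((token, obj, stamp))
        self.index.setdefault(obj, [])  # every distinct word gets a Frame

    def query(self, subject, token, obj):
        """True if obj is reachable from subject through Images carrying `token`."""
        visited = set()
        pending = [subject]
        while pending:
            name = pending.pop()
            if name in visited:
                continue  # non-repeating Frames: never expand the same Frame twice
            visited.add(name)
            for tok, image, _stamp in self.index.get(name, []):
                if tok != token:
                    continue
                if image == obj:
                    return True
                pending.append(image)  # replace the Image by its Frame's contents
        return False


kbms = KBMS()
kbms.insert("Earth", "isA", "round")
kbms.insert("round", "isA", "shape")  # invented fact, for illustration only
kbms.insert("shape", "isA", "round")  # a cycle, to show that the search terminates

print(kbms.query("Earth", "isA", "round"))  # explicit relation
print(kbms.query("Earth", "isA", "shape"))  # found only after expansion
print(kbms.query("Earth", "isA", "cube"))   # absent
```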