I always wanted to create a chatbot, but I never got around to actually doing it, since I didn’t want to write anything where I have to explicitly give it knowledge or manually feed it language rules. Every time I started sketching, I got stuck, unable to see how to overcome this.
Lately I had another urge (after watching Prometheus), so once again I started by sketching on paper, trying to understand how to attack the problem. Then, after a few evenings, I decided to just start coding, remembering the proverb “programming is understanding”.
First I started by looking at Wikipedia, using classifiers to assign contexts and so on, but it became clear that it would be beneficial to start by looking at the syntactic structures of language.
So instead of jumping into the semantics (which is what I really like to play with) I decided to take a step back and do some syntactic work first. I reformed my plan into what I’m working from today:
1) Parse a dictionary with tagged words and samples
2) Extract grammar rules from the samples (memory based model)
3) Create a random jabber bot using the grammar rules (syntactically correct nonsense)
4) Use the grammar rules to resolve sentence ambiguity
5) Make synonyms, antonyms and derived meanings part of the system (this allows for better context understanding – you can also play the association game with the bot after this) (<=== this is what I’m working on at the moment)
6) Make it possible to add context stimuli to the jabber bot. For example, talking to it about a tree will make it jabber about forests etc.
This will hopefully start to resemble the hybrid in Battlestar Galactica. Even though it talks a lot of nonsense, the “nonsense” is within a semantic context.
“The five lights of the apocalypse rising struggling towards the light, the sins revealed only to those who enter the temple only to the chosen one.”
7) Add simple associative memories. (From here on I am not yet sure how I will implement things)
8) Train it on conversational transcripts (question-answer patterns etc)
9) Feed it with knowledge (from wikipedia)
I will now briefly summarize what I’ve done so far:
1) Building a dictionary from Wiktionary
So I downloaded Wiktionary (wiki dictionary) instead and used a SAX parser to extract information about words and their classes. The cool thing about Wiktionary is that it’s constantly expanding; it has tons of words, their definitions, examples, synonyms, antonyms, attributes, phrases etc. The bad thing is that wiki pages are built not for machines but for humans, so parsing the pages requires some extra time. Anyhow, I spent a day building the parser, tagging the common word classes (verbs, nouns, pronouns etc) and extracting examples and such, so it wasn’t too bad.
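For anyone curious, the extraction can be sketched roughly like this. This is a minimal sketch using the standard Java SAX API; the element names follow the MediaWiki XML export format, and I’m assuming the `===Noun===`-style section headings that English Wiktionary uses to mark word classes – the real parser handles far more cases than this:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class WiktionaryScan {
    // Collects "title:wordClass" pairs from a MediaWiki XML dump string.
    static List<String> scan(String xml) throws Exception {
        List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            StringBuilder buf = new StringBuilder();
            String title;
            public void startElement(String uri, String local, String qName, Attributes a) {
                buf.setLength(0); // start collecting fresh text for this element
            }
            public void characters(char[] ch, int start, int len) {
                buf.append(ch, start, len);
            }
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("title")) title = buf.toString();
                if (qName.equals("text")) {
                    // Crude word-class tagging: look for the wiki section headings.
                    for (String cls : new String[]{"Noun", "Verb", "Adjective", "Pronoun"}) {
                        if (buf.indexOf("===" + cls + "===") >= 0) out.add(title + ":" + cls);
                    }
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>table</title><revision><text>"
                   + "===Noun===\nA piece of furniture...\n===Verb===\nTo postpone..."
                   + "</text></revision></page></mediawiki>";
        System.out.println(scan(xml)); // [table:Noun, table:Verb]
    }
}
```

The nice property of SAX here is that it streams, so the multi-gigabyte dump never has to fit in memory.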
2) Extracting grammar rules from Wiktionary
As I mentioned earlier, most words come with one example or more – my idea was to generate language rules from these. By a language rule I mean a syntactically correct series of word classes. For example, for “the table is red” the rule would be “article->noun->verb->adjective”, and I would store this as a valid rule. Of course, since words have many meanings, classes and different origins (etymologies), it’s not that simple.
Another possible rule extracted from “the table is red” would be “adverb->verb->verb->noun”, since ‘the’ can be either an article or an adverb, ‘table’ a noun or a verb (to table something), and ‘red’ either the describing adjective or the actual color red (a noun).
This was a bit worrying at first, but after running the rule extraction on large amounts of data it turned out that I could extract a lot of unambiguous rule chains by using unambiguous words. In addition I disambiguated a couple of common words by hand. For example, “he” can be the personal pronoun or “The name of the fifth letter of many Semitic alphabets (Phoenician, Aramaic, Hebrew, Syriac, Arabic and others).” – removing the noun meaning exposes “he” phrases for rule extraction. Doing this for the articles, the personal pronouns and the three most common verbs (be, have and do) I think is good enough (some rules will be erroneous, but for this experiment I consider that “noise”). I don’t strive for perfection.
Here is an example of an extracted rule:
Artificial [_start]=>Article [Definite]=>Noun [Countable]=>Verb [SingularFirstPerson, Past, Auxiliary, SingularThirdPerson]=>Article [Indefinite]=>Noun [Countable]=>Punctuation [Period]=>
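The core of the extraction step can be sketched like this. The tiny hand-made lexicon stands in for the parsed Wiktionary data, and `extractRule` is a hypothetical name; the real rules also carry attributes like those shown above:

```java
import java.util.*;

public class RuleExtractor {
    // Toy lexicon standing in for the parsed Wiktionary data:
    // each word maps to its possible word classes.
    static Map<String, List<String>> lexicon = new HashMap<>();
    static {
        lexicon.put("the", List.of("Article"));        // disambiguated by hand
        lexicon.put("table", List.of("Noun", "Verb")); // ambiguous -> blocks extraction
        lexicon.put("is", List.of("Verb"));            // disambiguated by hand
        lexicon.put("red", List.of("Adjective", "Noun"));
        lexicon.put("sky", List.of("Noun"));
        lexicon.put("blue", List.of("Adjective"));
    }

    // Returns a rule chain like "Article->Noun->Verb->Adjective" when every
    // word in the example has exactly one possible class, otherwise null.
    static String extractRule(String sentence) {
        StringBuilder rule = new StringBuilder();
        for (String w : sentence.toLowerCase().split("\\s+")) {
            List<String> classes = lexicon.get(w);
            if (classes == null || classes.size() != 1) return null; // ambiguous
            if (rule.length() > 0) rule.append("->");
            rule.append(classes.get(0));
        }
        return rule.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractRule("the sky is blue"));  // Article->Noun->Verb->Adjective
        System.out.println(extractRule("the table is red")); // null: "table" and "red" are ambiguous
    }
}
```

Run over enough examples, the unambiguous sentences alone yield a large rule library, which is the whole trick.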
3) Create a jabber bot
This was easy – just randomize words following random rules. Rules can also be merged by picking those that maximize their overlap with each other. Here are the first words it said:
“this has then gather next me” After fixing some bugs and improving the rule extraction, these are some of the things it randomly says now (including periods):
“i have an hybrid orbital instead of the test bench…. they are cold turkey rascally ladings are each who semi-annually skidded…”
“she meant me an efficacious oligomenorrhea. a solidarity does chemical castration hersirs. they wear me. she nitrogenized ironically.”
“he was monotonous during his lyophilization.”
“an twinning pigeon hole chihuahua”
“inconsistency secondary hypothermia was an sunni gino-sho near the orecchiette , who was an alpha privative.”
“the mustaches have an swallowtail, interjoin megawatts, wildly.”
Just like anyone who is learning English, the bot mixes up a/an, and some of the rule merges also break the English. However, I’m not aiming to make a grammatically correct bot. I want to create something that can make sense in a semantic manner – therefore I’m not spending time on the nitty-gritty details.
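The generation step itself really is only a few lines. Here is a sketch with hand-made stand-ins for the extracted words and rules (the real bot also does the rule merging mentioned above):

```java
import java.util.*;

public class JabberBot {
    // Toy stand-ins for the structures extracted from Wiktionary:
    // words grouped by class, and rules as chains of classes.
    static Map<String, List<String>> wordsByClass = Map.of(
        "Article", List.of("the", "a"),
        "Noun", List.of("table", "pigeon", "forest"),
        "Verb", List.of("is", "was"),
        "Adjective", List.of("red", "monotonous"));
    static List<String[]> rules = List.of(
        new String[]{"Article", "Noun", "Verb", "Adjective"},
        new String[]{"Article", "Adjective", "Noun"});

    static Random rnd = new Random();

    // Pick a random rule, then fill each slot with a random word of that class.
    static String jabber() {
        String[] rule = rules.get(rnd.nextInt(rules.size()));
        StringBuilder s = new StringBuilder();
        for (String cls : rule) {
            List<String> words = wordsByClass.get(cls);
            if (s.length() > 0) s.append(' ');
            s.append(words.get(rnd.nextInt(words.size())));
        }
        return s.toString();
    }

    public static void main(String[] args) {
        System.out.println(jabber()); // e.g. "the pigeon was red"
    }
}
```

With thousands of rules and a Wiktionary-sized vocabulary, the same loop produces the kind of output quoted above.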
4) Sentence disambiguation
When talking to the bot it needs to know what you mean; to know this, sentences need to be resolved to their meaning. As a first step I use the existing rule library to score the possible ambiguities. The scoring function matches each candidate combination against the possible rules, measuring the amount of overlap.
For example “huge ants are wondering about life” is resolved with the highest score to:
huge Adjective 
ants Noun [Plural, Countable]
are Verb [PluralSecondPerson, PluralFirstPerson, PluralThirdPerson, Auxiliary, SingularSecondPerson]
wondering Verb [PresentParticiples]
about Preposition 
life Noun [Countable]
All the attributes above (within the [ ]) are automatically extracted from Wiktionary.
Anyway, this method is not 100% accurate and I think regular POS taggers (parse trees) might do better, but again, this is good enough for now.
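The scoring idea can be sketched like this, with a toy lexicon and rule library standing in for the real data (the real scorer also weighs the attributes): enumerate every class combination of the sentence and keep the one with the largest contiguous overlap against the rule library.

```java
import java.util.*;

public class Disambiguator {
    // Toy lexicon: each word maps to its possible classes (from Wiktionary).
    static Map<String, List<String>> lexicon = Map.of(
        "the", List.of("Article", "Adverb"),
        "table", List.of("Noun", "Verb"),
        "is", List.of("Verb"),
        "red", List.of("Adjective", "Noun"));
    // Toy rule library; in the real system these come from the extraction step.
    static List<List<String>> rules = List.of(
        List.of("Article", "Noun", "Verb", "Adjective"),
        List.of("Article", "Adjective", "Noun"));

    // Longest contiguous overlap between a candidate tag sequence and a rule.
    static int overlap(List<String> cand, List<String> rule) {
        int best = 0;
        for (int i = 0; i < cand.size(); i++)
            for (int j = 0; j < rule.size(); j++) {
                int k = 0;
                while (i + k < cand.size() && j + k < rule.size()
                        && cand.get(i + k).equals(rule.get(j + k))) k++;
                best = Math.max(best, k);
            }
        return best;
    }

    // Enumerate every class combination and keep the one the rules score highest.
    static List<String> disambiguate(String sentence) {
        List<List<String>> all = new ArrayList<>();
        all.add(new ArrayList<>());
        for (String w : sentence.toLowerCase().split("\\s+")) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : all)
                for (String cls : lexicon.get(w)) {
                    List<String> extended = new ArrayList<>(prefix);
                    extended.add(cls);
                    next.add(extended);
                }
            all = next;
        }
        List<String> winner = null;
        int bestScore = -1;
        for (List<String> cand : all) {
            int score = 0;
            for (List<String> rule : rules) score += overlap(cand, rule);
            if (score > bestScore) { bestScore = score; winner = cand; }
        }
        return winner;
    }

    public static void main(String[] args) {
        System.out.println(disambiguate("the table is red"));
        // [Article, Noun, Verb, Adjective]
    }
}
```

The brute-force enumeration is exponential in sentence length, of course; for the sentence lengths I’m feeding it that hasn’t been a problem yet.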
Hopefully I will take this project further. I would like to “upload” the bot (Java) to a server so it can be tested by everyone while I develop, but maintaining it will take time too… We’ll see where to go from here.