|
BinaryTree
HW 06
Spell Checking Using Binary Trees
|
The myspell program checks the spelling of input words against a user-specified dictionary. Suggested corrections are outputted to the screen. The simplest way to do so is to check if a word is in the dictionary; if not then see what words are most similar to it. However, for this program to be useful, a large dictionary of correct words is necessary. An efficient data structure must be used or the program is useless.
./myspell [-d] dict
The optional flag -d will produce statistics about the data structure used to represent the dictionary, printed to standard error (cerr). The TreeStringSet statistics print the following:
height, size, size of a perfect tree of this height, and percent full
The positional argument dict is the name of the dictionary file. This file must be a sequence of unique lowercase words separated by whitespace, where a 'word' begins and ends with a letter and contains no spaces. Words may occur in any order in the dictionary file, but should not be duplicated.
Input is read from standard input (cin). If input comes from the keyboard, the program continues to read words until the EOF character is encountered (Control-D on Mac OS X and Unix computers, Control-Z on Windows).
Alternatively, we can tell the shell to run the command while directing the contents of a file into standard input
./myspell [-d] dict < inputfile.txt
in which case spell checking of the words in inputfile.txt continues until the end of that file is reached.
Any word in the input that is not found in the dictionary will be sent to standard output (cout) in the following format:
misspelledWord: correction1 correction2 correction3 . . .
If there are no suggested corrections then only the misspelled word and a colon (no whitespace) is outputted.
Corrections appear as soon as they are detected (rather than waiting until the entire input has been read) and so misspelled words are reported in the same order they were encountered (e.g., not sorted alphabetically). In particular, if input comes from the keyboard, corrections for each line should appear as soon as the line is entered.
However words that are misspelled the same way more than once (in the same run of the program) are reported only the first time they are encountered.
This is a sample run of a correctly working myspell program. Note that this example does not cover all edge cases, but is instead meant to demonstrate formatting. Please ask the graders for more direction if the formatting is unclear for any situtation not shown here.
None of the following lines contain a space before the newline (including the blank line). The input was terminated using Ctrl-D, which would appear as additional blank (input) line after the "yay: bay" line.
The debugging line (15 expansions...) is sent to cerr. The other lines go to cout.
% ./myspell -d ispell.words The internal tree structure is The height of this tree is 34 The size of this tree is 34832 The size of a perfect tree of this height is 34359738367 Therefore, this tree is 0.000101374% full This is incorXect incorxect: incorrect
We chose to implement a randomized binary search tree as the data structure for our dictionary. We chose this because it seemed to be an efficient encoding method. While it would not nesecarily have the smallest runtime, the simplicity of encoding would make the entire structure more efficient within our constraints.
We chose to use a queue to track the next item for the iterator. We did this because we could fill the queue with the adresses of each item in the tree, in order. By using this queue of addresses, we could instruct the iterator to increment to the next item using the adress of that item in the queue. This system is a simple way to track the items in the tree in order, because holding the addresses allows us to access all of the information about the object by just dereferenceing the pointer.
We created a new class called Dictionary to hold the words in the desired dictionary as a randomized binary tree. This class contains all of the functions necessary to check the spelling of input words against the dictionary. Our spelling suggestions handles only a single wrong letter.
Most of the lifting for the spellchecker is carried out in the Dictionary class. This is intentional, and allows for the spellchecker to hide most of the encoding of the system.
1.8.7