What is Fowlt?
Fowlt is a free online, context-sensitive English spelling checker. It follows the setup of the Dutch spelling checker Valkuil.net. Both Valkuil and Fowlt are unlike typical spelling checkers: whereas typical checkers mostly look for errors by comparing every word to a built-in dictionary and flagging a word as an error when they cannot find a match, Fowlt is context-sensitive. This means that it also takes into account the context, which is the words around every word. For example, the context for the word "account" in the previous sentence is "also takes into _ the context ,". If Fowlt expects a different word at some position based on the context, and it is quite certain about this, the word is flagged as an error and Fowlt's alternative is presented as the suggested correction. This means, for example, that Fowlt is able to replace the incorrect "there" in "there really nice people" with "they're", simply because "there" is usually not followed by "really nice people", while "they're" is.
To be able to make these kinds of correction suggestions, Fowlt makes use of language models. These models are created by giving large amounts of text to machine learning software (TiMBL and WOPR). From these 'training' texts the model learns that "there" is mostly not followed by "really nice people". However, this also means that if the context of a particular error is very different from what the model has seen in the training data, it won't be able to correct the error. Users should therefore be aware that, although Fowlt can recognize more kinds of errors than regular spelling checkers, it might still miss errors - in particular when you set the tolerance slider to a low value.
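The notion of context described above can be illustrated with a small sketch. It is not Fowlt's own code: the function name context_of, the gap symbol and the window of three words on each side are assumptions, chosen only so that the output matches the example given for the word "account".

```python
def context_of(tokens, index, left=3, right=3, gap="_"):
    """Return the words around tokens[index], with the word itself held out.

    The window size (3 left, 3 right) is an assumption taken from the
    example in the text, not a documented Fowlt setting.
    """
    start = max(0, index - left)
    before = tokens[start:index]
    after = tokens[index + 1:index + 1 + right]
    return before + [gap] + after


# Tokenized version of the example sentence (punctuation is its own token).
sentence = "This means that it also takes into account the context ,".split()
focus = sentence.index("account")

print(" ".join(context_of(sentence, focus)))
# also takes into _ the context ,
```

A context-sensitive module would hand such a context, without the held-out word, to a trained model and compare the model's prediction with the word that was actually written.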
How does fowlt.net work?
In its stand-alone form, Fowlt is an application that takes plain text as input and returns FoLiA XML with information about the detected errors and possible corrections. What happens in between is outlined below:
The text that is to be corrected is given to various modules, each of which looks at the text from a different point of view:
- The WOPR module, the largest of the context-sensitive modules, uses the statistical word prediction software WOPR. The module gives WOPR the context of each word, but not the word itself. WOPR then has to predict which word it would expect, based on the training texts it has seen. If WOPR predicts a different word that is similar to the held-out word, WOPR's prediction is suggested as a correction.
- For many well-known mistakes ("than" versus "then", "you're" versus "your", and "two" versus "to" versus "too", to name a few), Fowlt employs specialized modules. Taking the "you're" versus "your" case as an example: the module was given tens of thousands of example contexts for both the "your" and the "you're" options. When it encounters a new case, the module decides which group of contexts the context of this case resembles more. If its guess does not match the actual word, and the module is very certain about its guess, the word is flagged as an error. For example, Fowlt thinks the context "I really think ... very nice !" looks a lot like contexts for the option "you're". If the actual text contains "your", this is flagged as an error. Both the model creation and the word prediction in these modules are done by the memory-based learning software TiMBL.
- The errorlist module simply uses a large list of common typos and their corrections. The module checks for every word in the input whether it is in the errorlist, and adds the correction when it is. This module is not context-sensitive.
- The lexicon module, which is also not context-sensitive, is based on a huge list of how frequent English words were on the internet in 2008. It checks for every word in the input whether there is a word in the frequency list that is very similar to it, but much more frequent. This word is then presented as the suggested correction.
- The run-on module also uses this list, but uses it to check whether any spaces were accidentally forgotten in the text. This is done by checking whether splitting up a long word produces two words that are both much more frequent than the original word. If so, these two words are suggested as a correction.
- The split module is the opposite of the run-on module: instead of looking whether any spaces are forgotten, it checks whether any spaces have to be removed. This is done by testing whether each combination of two words produces a word which is much more frequent than the two original words. Again, this joined word is suggested as a correction if this is the case.
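The three modules that rely on the frequency list (lexicon, run-on and split) can be sketched as follows. This is a toy illustration rather than Fowlt's implementation: the word counts are invented, the threshold RATIO is a hypothetical reading of "much more frequent", and using Python's difflib to decide which words are "very similar" is an assumption, since the text does not say how similarity is measured.

```python
import difflib

# Invented toy counts; the real frequency list is far larger.
FREQ = {
    "in": 3000, "fact": 400, "infact": 2,
    "web": 60, "site": 80, "website": 900,
    "spelling": 120, "speling": 1,
}
RATIO = 10  # hypothetical threshold for "much more frequent"


def freq(word):
    """Look up a word's count; unknown words count as 0."""
    return FREQ.get(word.lower(), 0)


def much_more_frequent(candidate_count, original_count):
    return candidate_count > RATIO * max(original_count, 1)


def lexicon_suggestion(word):
    """Lexicon module: a very similar but much more frequent word."""
    similar = difflib.get_close_matches(word, list(FREQ), n=5, cutoff=0.8)
    better = [c for c in similar if much_more_frequent(freq(c), freq(word))]
    return max(better, key=freq) if better else None


def runon_suggestion(word):
    """Run-on module: split one word into two much more frequent words."""
    best = None
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        score = min(freq(left), freq(right))  # both halves must be frequent
        if much_more_frequent(score, freq(word)) and (best is None or score > best[0]):
            best = (score, left + " " + right)
    return best[1] if best else None


def split_suggestion(first, second):
    """Split module: join two words into one much more frequent word."""
    joined = first + second
    if much_more_frequent(freq(joined), max(freq(first), freq(second))):
        return joined
    return None


print(lexicon_suggestion("speling"))    # spelling
print(runon_suggestion("infact"))       # in fact
print(split_suggestion("web", "site"))  # website
```

None of these checks looks at the surrounding words, which is why modules of this kind are complemented by the context-sensitive modules described first.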
To make Fowlt available as a webservice, we use CLAM.
We also offer Fowlt as a RESTful webservice, which means developers can integrate Fowlt in their own application easily. This service is free and without restriction (as long as our service can handle the quantities), but also without any warranties about uptime!
Anyone who wants more control over Fowlt's webservice is free to set up their own - Fowlt's source code, including installation instructions, can be found on GitHub. Developers are also free to fork this repository and extend Fowlt. If you would like our help and expertise while setting up Fowlt or integrating it into your application, contact Antal van den Bosch - we are probably interested!
Besides this free service, we integrated Fowlt's technology in a Twitter bot. The Twitter bot is meant as a tongue-in-cheek experiment: it is a bot that corrects random tweets a few times a day. Fortunately most 'victims' see the fun of it. We retweet funny replies.
Frequently asked questions
- What happens to my documents? - Fowlt.net doesn't save any documents permanently. The input text remains available for at most 24 hours after the last correction, and is then deleted.
- What does the button 'Found language errors can be used for scientific research' do? - If you check this box, Fowlt saves all errors and its own corrections. An error is saved in context, which is the three words to its left and the three words to its right. The complete document is not saved. We use your corrections to evaluate and improve Fowlt.net.
- Can other people see my documents? - No, unless you send a document's unique URL to others. This URL only exists for at most 24 hours.
Do you have another question, remark or suggestion? Send a message to Antal van den Bosch.
Fowlt.net was developed by
Special thanks go to Peter Berck, Martin Reynaert and Sebastiaan Tesink.
- Stehouwer, H. and Van den Bosch, A. (2009). Putting the t where it belongs: Solving a confusion problem in Dutch. In S. Verberne, H. van Halteren, and P.-A. Coppen (Eds.), Computational Linguistics in the Netherlands 2007: Selected Papers from the 18th CLIN Meeting, January 22, 2009, Groningen, pp. 21-36. [pdf]
- Stehouwer, H., and Van Zaanen, M. (2009). Language models for contextual error detection and correction. In Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference, Athens, Greece, pp. 41-48. [pdf]
- Van den Bosch, A. (2006). All-words prediction as the ultimate confusible disambiguation. In Proceedings of the HLT-NAACL Workshop on Computationally hard problems and joint inference in speech and language processing, June 2006, New York City, NY. [pdf]
- Van den Bosch, A. (2005). Scalable classification-based word prediction and confusible correction. Traitement Automatique des Langues, 46:2, 39-63. [pdf]
- Van den Bosch, A., and Berck, P. (2009). Memory-based machine translation and language modeling. The Prague Bulletin of Mathematical Linguistics No. 91, pp. 17-26. [pdf]
At the 2010 European Summer School on Logic, Language, and Information in Copenhagen, Antal van den Bosch gave a course on memory-based models of language, the technology on which Fowlt is based.