Existing techniques for tokenisation and sentence boundary identification are extremely accurate whenthe data is perfectly clean (Mikheev, 2002), and have been applied successfully to corpora of news feeds and other post-edited corpora. Informal written texts are readily available, and with the growth ofother informal text modalities (IRC, ICQ, SMS etc.) are becoming an interesting alternative, perhaps better suited as a source for lexical resources and language models for studies of dialogue andspontaneous speech. However, the high degree of spelling errors, irregularities and idiosyncrasies in the use of punctuation, white space and capitalisation require specialised tools. In this paper we study thedesign and implementation of a tool for pre-processing and normalisation of noisy corpora. We argue that rather than having separate tools for tokenisation, segmentation and spelling correction organisedin a pipeline, a unified tool is appropriate because of certain specific sorts of errors. We describe how a noisy channel model can be used at the character level to perform this. We describe how the sequenceof tokens needs to be divided into various types depending on their characteristics, and also how the modelling of white-space needs to be conditioned on the type of the preceding and following tokens.We use trainable stochastic transducers to model typographical errors, and other orthographic changes and a variety of sequence models for white space and the different sorts of tokens. We discuss thetraining of the models and various efficiency issues related to the decoding algorithm, and illustrate this with examples from a 100 million word corpus of Usenet news.
|Published - 2003