What Can We Do With Small Corpora? Document Categorization Via Cross-Entropy

Patrick Juola

Abstract

Many of the large-corpus techniques used for document categorization or similarity judgements share a potential weakness: they require large corpora to be reliable. A powerful test against the distilled wisdom of hundreds of millions or billions of words may be of limited use when only a few thousand characters are available. This paper describes an information-theoretic model, based on a new method for estimating entropy, that produces remarkably accurate judgements of language or even of authorship from relatively tiny corpora. From a sample consisting of a single document not much longer than this abstract, the technique infers the authorship of some of the Federalist Papers without error by estimating the similarity between such samples and the documents in question. The efficiency and generality of this technique suggest that it might be applied with good effect to a host of other problems.