Patrick Juola
Abstract
A possible problem with many of the large-corpus techniques used
for document categorization or similarity judgements is the
very fact that they require large corpora for reliability.
A powerful test against the distilled wisdom of hundreds of
millions or billions of words may be of limited use when only
a few thousand characters are available. This paper describes
an information-theoretic model based on a new method for
estimating entropy that is able to produce remarkably accurate
judgements of language or even of authorship based on
relatively tiny corpora. Given a single sample
document not much longer than this abstract, this technique
is capable of error-free inference of the authorship of some of
the Federalist Papers by estimating the similarity between
the sample and the documents in question. The efficiency
and generality of this technique suggest that it might be
applied with good effect
to a host of other problems.
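The abstract does not spell out the entropy estimator itself. As a rough illustration of the general idea of scoring authorship by how well one author's text "predicts" another, here is a minimal sketch that substitutes an off-the-shelf compressor (zlib) for the paper's new estimator; the function names `cross_entropy_bits_per_char` and `attribute` are hypothetical, and this is not the paper's actual method.

```python
import zlib


def cross_entropy_bits_per_char(reference: str, target: str) -> float:
    """Estimate how surprising `target` is given `reference`, by
    measuring the extra compressed bytes the target costs when it
    is appended to the reference.  This zlib-based proxy stands in
    for the paper's entropy estimator (an assumption, not the
    original method)."""
    base = len(zlib.compress(reference.encode("utf-8"), 9))
    both = len(zlib.compress((reference + target).encode("utf-8"), 9))
    return 8.0 * (both - base) / max(1, len(target))


def attribute(candidates: dict, disputed: str) -> str:
    """Return the candidate author whose sample yields the lowest
    estimated cross-entropy for the disputed text."""
    return min(
        candidates,
        key=lambda author: cross_entropy_bits_per_char(candidates[author], disputed),
    )
```

With author samples of only a few hundred characters, a disputed text that shares vocabulary and phrasing with one candidate costs noticeably fewer extra bits against that candidate's sample, which is the intuition behind applying such similarity scores to problems like the Federalist Papers.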