Technical Questions of Authorship Attribution Related to Our Project

How are PCA graphs generated?

The first step in these analyses is to quantify texts, i.e. convert words to numbers. This is done by putting together a Most-Frequent Word list and by calculating the frequency with which each word appears in each text. These values are then put into a matrix. For example: let the word “the”, being the most common word in the English language, be row 1, and say it appears in three short stories with frequencies of 8.0, 5.0 and 4.0. Row 1 of the matrix then has three columns, one per story, holding the values 8.0, 5.0 and 4.0.
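The quantification step can be sketched roughly as follows. This is a minimal illustration, not the tool used in our project: the function names `tokenize` and `mfw_matrix`, the sample sentences, and the choice to express frequencies as percentages of each text's length are all assumptions made for the example.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def mfw_matrix(texts, n_words):
    """Return (words, matrix): one row per frequent word, one column per text.

    Each cell is the word's relative frequency in that text, as a percentage.
    """
    token_lists = [tokenize(t) for t in texts]
    corpus_counts = Counter(tok for toks in token_lists for tok in toks)
    # Rank by descending corpus count; break ties alphabetically.
    ranked = sorted(corpus_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    words = [w for w, _ in ranked[:n_words]]
    matrix = []
    for w in words:
        row = [100.0 * toks.count(w) / len(toks) if toks else 0.0
               for toks in token_lists]
        matrix.append(row)
    return words, matrix

# Three made-up "short stories", used only to exercise the function.
stories = [
    "The cat sat on the mat and the dog sat by the door.",
    "A dog and a cat met the fox.",
    "The fox ran to the river.",
]
words, matrix = mfw_matrix(stories, 3)
```

Here `words[0]` is "the" and `matrix[0]` holds its three per-story frequencies, which corresponds to row 1 in the example above.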

To generate these graphs, this calculation is done for all the words in all the texts. As a result, what used to be words are now transformed into matrices of numbers. The next step is to generate a matrix called the “covariance matrix” (or correlation matrix), which records how the frequencies of each pair of words vary together across the texts. Variance can be explained as the average squared distance between a word’s values and their mean; covariance extends the same idea to pairs of words.
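As a small worked sketch of this step, the function below computes a sample covariance matrix from a word-by-text frequency matrix. The first row reuses the 8.0, 5.0, 4.0 example; the second row is a hypothetical word added so that there is a pair to compare, and the function name `covariance_matrix` is our own.

```python
def covariance_matrix(matrix):
    """Sample covariance between rows (words), observed across columns (texts)."""
    n_texts = len(matrix[0])
    means = [sum(row) / n_texts for row in matrix]
    centered = [[v - m for v in row] for row, m in zip(matrix, means)]
    return [
        [sum(a * b for a, b in zip(ri, rj)) / (n_texts - 1) for rj in centered]
        for ri in centered
    ]

freqs = [
    [8.0, 5.0, 4.0],  # "the" in three short stories
    [3.0, 4.0, 5.0],  # a hypothetical second word
]
cov = covariance_matrix(freqs)
```

The diagonal entries of `cov` are the variances of the individual words, and the off-diagonal entry is negative here because the second word becomes more frequent as "the" becomes less frequent.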

After the covariance matrix is found, the eigenvectors are calculated. Eigenvectors are vectors found through mathematical equations in order to show the directions of the variance. Each eigenvector has an eigenvalue, which accounts for the magnitude of the variance along that direction. The two eigenvectors with the highest eigenvalues are then used as the axes onto which the data is projected in a two-dimensional space (two-dimensional since we only have two axes). In other words, the lower axis and the left axis you see are the eigenvectors which have been calculated. The percentage in parentheses accounts for how much of the total variance is being projected on that axis. That way, the numeric values into which our textual samples have been transformed are projected in a graph.
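The whole projection can be sketched in plain Python as below. Two caveats: this sketch approximates the eigenvectors by power iteration with deflation, which is a stand-in chosen to avoid external libraries and not necessarily what any stylometry package does, and it arranges the data with one row per text and one column per word. All function names and the sample frequencies are hypothetical.

```python
import math

def center_columns(data):
    """Subtract each column's mean so every feature is centred on zero."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[v - m for v, m in zip(row, means)] for row in data]

def covariance(centered):
    """Sample covariance between the columns of already-centred data."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]

def mat_vec(matrix, vec):
    """Multiply a square matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def top_eigenpairs(cov, k, iterations=500):
    """Approximate the k largest eigenpairs by power iteration with deflation."""
    size = len(cov)
    work = [row[:] for row in cov]
    pairs = []
    for _ in range(k):
        vec = [float(i + 1) for i in range(size)]  # arbitrary non-zero start
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _step in range(iterations):
            nxt = mat_vec(work, vec)
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:  # no variance left to find
                break
            vec = [x / norm for x in nxt]
        value = sum(v * w for v, w in zip(vec, mat_vec(work, vec)))
        pairs.append((value, vec))
        # Deflation: remove the direction just found before searching again.
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(size)]
                for i in range(size)]
    return pairs

def project(centered, pairs):
    """Coordinates of each text on the chosen eigenvector axes."""
    return [[sum(a * b for a, b in zip(row, vec)) for _, vec in pairs]
            for row in centered]

# Hypothetical frequencies of three words (columns) in three texts (rows).
texts = [[8.0, 3.0, 1.0], [5.0, 4.0, 2.5], [4.0, 5.0, 2.0]]
centered = center_columns(texts)
cov = covariance(centered)
pairs = top_eigenpairs(cov, 2)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = [value / total_variance for value, _ in pairs]
coords = project(centered, pairs)
```

Each entry of `coords` is the (x, y) position of one text on the graph, and `explained` holds the share of total variance shown in parentheses on each axis. With only three texts the centred data has at most two independent directions, so in this toy case the two axes together account for essentially all of the variance.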

More General Questions of Authorship Attribution:

Authorship attribution can be broadly defined as “the science of inferring characteristics of the author from the characteristics of documents written by that author”.1 Its aim is usually to ascribe a particular piece of writing to one author from a set of candidates. Authorship attribution is therefore mainly concerned with written texts, or with those features of spoken texts which are shared with written texts, such as lexical choice.

What is stylometry?

Stylometry is a near-synonym for authorship attribution. It is dominated by attempts to identify unique authorial features by quantifying texts, usually in order to attribute authorship to anonymous or disputed texts.2

“Non-traditional” authorship attribution, as opposed to traditional human expert-run methods, is also called statistically or computationally supported authorship attribution.3 It began with Mendenhall’s pioneering study on Shakespeare’s plays and was made famous by Mosteller and Wallace’s study on the disputed authorship of “The Federalist Papers”.4 The field was then reshaped in the late 1990s by significant developments in modern computing and the resulting availability of digital corpora, through their great influence on information retrieval techniques, machine learning, and natural language processing.

Authorship attribution problems can be divided into four types:

1. Deciding whether a given text was written by a candidate author or not.
2. Determining the authorship of a given text which is known to be written by one of a set of candidate authors.
3. Determining which of a set of authors, if any, wrote a given text that is believed to be written by one of them.
4. Determining properties of the author(s) of a given text, including identifying whether a document is singly authored or multiply authored, and whether a text is written by a man or woman. This type of task, for some researchers, is typically named “stylometry” or “profiling”, while “authorship attribution” is reserved for the first three.

What is an authorial fingerprint?

An authorial fingerprint is the characteristic language pattern used by an author: traits which can be extracted and measured quantitatively in order to identify the text(s) written by that author. In practice, it is reasonable to believe that such a fingerprint is difficult to trace definitively, and simple univariate statistics like average word length or word count are not sufficient to identify it conclusively.

What features can be used to identify author(s) of a given text?

There are many stylometric properties that can be used for different purposes, but the most frequently examined are:

1. Vocabulary and idiosyncratic spellings: Specific words can seem to label authors by group identity in terms of time and space but they are largely topic-related rather than style-related. In addition, vocabulary-based analysis is easy for forgers to manipulate.
2. Vocabulary properties:
• Superficial features of vocabulary such as word length, number of syllables, part of speech, and vocabulary richness could be used to identify an author.
• Classes of words that are more common than any individual word and have the potential to vary between texts by different authors, such as synonym pairs and function words. Some researchers use “the most frequent N words in the corpus” as a stand-in for function words.
3. Syntactic properties: an author’s preferred syntactic constructions can be valuable. They can be captured by tagging sample texts for part of speech (POS) or other syntactic constructions. The method’s shortcoming arises from the processing: contraction apostrophes and closing single quotes are indistinguishable in some cases.
4. N-grams: a sequence of n items from a given text or speech. An n-gram of size 1 is referred to as a “unigram”, size 2 as a “bigram”, size 3 as a “trigram”, size 4 as a “four-gram”, etc.
• Vocabulary N-grams combine lexical and syntactic information to assign authors to particular texts. For example, the bigrams “to love” and “the love” distinguish authors who tend to use “love” as a verb from those who use it as a noun.
• Character N-grams: taking advantage of morphological analysis, this approach analyses sequences of characters instead of individual words, such as the trigram “lov” shared by the examples above.
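Both kinds of n-gram in item 4 can be extracted with a few lines of code. The sketch below is illustrative only; the function names are our own, and splitting on whitespace is a simplifying assumption that ignores punctuation.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return the word n-grams of a text as tuples, in order of appearance."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    """Return the character n-grams of a text, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sample = "to love and the love"
bigrams = word_ngrams(sample, 2)
trigram_counts = Counter(char_ngrams(sample, 3))
```

In this sample, `bigrams` contains both ("to", "love") and ("the", "love"), while `trigram_counts["lov"]` counts the character trigram that the two uses of "love" share.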

What methods can be used to analyse the features above?

1. Unsupervised techniques require no prior information about the sample documents, and are often used for data exploration by researchers looking for superficial patterns.5
• Vector Spaces: Quantifying the chosen features creates a high-dimensional document space, with each text’s feature set represented as a vector or point in that space. If two texts appear close together in this space, there is a high probability that they share the same author. However, the difficulty of visualising a high-dimensional space and the problem of the independence of each dimension make the method less practical. To address these problems, researchers usually apply principal component analysis (PCA) instead.
2. Supervised techniques require a priori knowledge, which is often gained from the categorisation of sample texts whose authorship is undisputed. There are many types of reliable analysis in this category, including distance-based methods, general machine learning techniques, and support vector machines, but only the most relevant methods, the Delta method and linear discriminant analysis, were used in our research.
• Delta Method: the most notable supervised analysis technique, which will be explained in the next question.
• Linear Discriminant Analysis (LDA): LDA is much like PCA inasmuch as both are linear transformation methods; however, LDA finds the directions that maximise the separation between known classes, rather than the overall variance, so that the data can be discriminated into different classes. In other words, an LDA projects the data so that we can discriminate between samples, e.g. decide whether a given sample is by Author A or Author B.
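To make the LDA idea concrete, here is a minimal sketch of a two-class Fisher discriminant. It is restricted to two features so that the matrix inverse can be written out by hand, which is an assumption made for brevity; real analyses use many more word frequencies and a statistics library. The function names, the sample frequencies, and the labels "A" and "B" are all hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def scatter(rows, mean):
    """2x2 within-class scatter matrix for two-feature samples."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fisher_direction(class_a, class_b):
    """Return (w, threshold): the discriminant axis and the midpoint cut-off."""
    ma, mb = mean_vector(class_a), mean_vector(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mb[0] - ma[0], mb[1] - ma[1]]
    w = [dot(inv[0], diff), dot(inv[1], diff)]
    # Class B's mean always projects higher than class A's along w.
    threshold = 0.5 * (dot(w, ma) + dot(w, mb))
    return w, threshold

def classify(sample, w, threshold):
    """Assign a sample to class "A" or "B" by its projection onto w."""
    return "B" if dot(w, sample) > threshold else "A"

# Hypothetical frequencies of two words in texts of known authorship.
author_a = [[8.0, 3.0], [7.5, 3.4], [8.4, 2.9], [7.8, 3.3]]
author_b = [[4.0, 5.0], [4.6, 4.7], [3.8, 5.3], [4.3, 4.8]]
w, threshold = fisher_direction(author_a, author_b)
label_1 = classify([7.9, 3.1], w, threshold)
label_2 = classify([4.2, 4.9], w, threshold)
```

Unlike PCA, the axis `w` is chosen using the known author labels: it weighs the gap between the two class means against the spread inside each class, so it need not coincide with the direction of greatest overall variance.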

What is “Delta Method”?

The Delta method is a measure of stylistic difference first suggested by John F. Burrows, who first used it to assign authorship within a collection of Restoration poetry by analysing the frequencies of the 150 most frequent words (MFWs).6 Burrows first established a frequency-ranked word list for a group of corpora, then measured each tested text against this list and calculated z-scores, which represent a sample’s divergence from the means of the main set.7 Delta measures “the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text”.8 The smallest Delta score represents the greatest similarity to the training samples, so the target text is most likely to share authorship with the category that produces it.
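The definition quoted above translates fairly directly into code. The sketch below follows that definition under stated assumptions: each author is represented by a single vector of word frequencies, the means and standard deviations are taken across those author vectors, and every word has a non-zero standard deviation. The names and numbers are hypothetical and this is not Burrows’s own implementation.

```python
import math

def mean_and_std(values):
    """Mean and sample standard deviation of one word's frequencies."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance)

def z_scores(freqs, stats):
    """Convert raw frequencies to z-scores using per-word (mean, std) pairs."""
    return [(f - m) / s for f, (m, s) in zip(freqs, stats)]

def delta(z_a, z_b):
    """Mean of the absolute differences between two lists of z-scores."""
    return sum(abs(a - b) for a, b in zip(z_a, z_b)) / len(z_a)

# Hypothetical frequencies of the same three frequent words per author.
reference = {
    "Author A": [8.0, 3.0, 1.0],
    "Author B": [5.0, 4.0, 2.0],
    "Author C": [4.0, 5.0, 2.5],
}
target = [7.6, 3.2, 1.2]  # the disputed text

# Per-word statistics across the reference set (the "main set").
stats = [mean_and_std(list(col)) for col in zip(*reference.values())]
target_z = z_scores(target, stats)
scores = {
    author: delta(z_scores(freqs, stats), target_z)
    for author, freqs in reference.items()
}
best = min(scores, key=scores.get)
```

Here `best` is the candidate with the smallest Delta, which in this toy data is "Author A". A real analysis would use far more words (Burrows used 150) and would need to handle words whose frequency does not vary across the reference set.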

David L. Hoover conducted extensive research on Delta variations, including varying the number of words from 20 to 800, eliminating contractions and/or personal pronouns, and culling from the list any word-variables for which a single text supplies more than 70% of the occurrences.9 Among his findings, he suggests that eliminating contractions generally reduces the accuracy of the analysis, which may indicate that the use of contractions is a major indicator of author identity, while personal pronouns are perhaps more related to the subject of a text than to its author.10
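The culling rule mentioned above can be sketched as a simple filter. This is only an illustration of the 70% criterion as described here, assuming raw occurrence counts per text; the function name `cull_words` and the counts are hypothetical, and Hoover’s actual procedure may differ in detail.

```python
def cull_words(counts_by_word, threshold=0.7):
    """Drop words for which one text supplies more than `threshold` of all occurrences."""
    kept = {}
    for word, counts in counts_by_word.items():
        total = sum(counts)
        if total > 0 and max(counts) / total > threshold:
            continue  # dominated by a single text, so likely tied to its content
        kept[word] = counts
    return kept

# Hypothetical raw counts of each word in three texts.
counts = {
    "the": [40, 35, 38],
    "she": [30, 2, 1],
    "and": [20, 22, 19],
}
kept = cull_words(counts)
```

In this example "she" is removed because one text supplies 30 of its 33 occurrences, while "the" and "and" are kept.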

What is the advantage of choosing function words as a feature?

Function words such as conjunctions, prepositions and articles indicate an author’s preferred method of expressing their ideas while avoiding the influence of topic on writing style. This is because function words normally carry little meaning in themselves but define a semantic or syntactic relationship between different content words in a corpus. They are topic-independent and relatively easy to spot and visualise because of their high frequency, but their use is largely limited to the analysis of English-language texts.11

Three aspects of accuracy issues should be considered in response to this question:

1. Technical accuracy: The inherent accuracy of the techniques, combined with issues of genre, representativeness and corpus size, makes technical accuracy especially critical.
2. Sample texts accuracy: In theory only features belonging to the author should be used for analysis, while in practice it is difficult to determine which features the author is responsible for, because published books are the final product of a collaboration between the author, editor and others. In addition, the presence of non-authorial materials like quotations, and the selection and preparation of control materials, complicate the issue.
3. Analysis accuracy: Analysts’ knowledge of the science of authorship attribution and of specific scholarly topics may have a great impact on the basis of their decisions, which may consequently result in bias in their analytical results. For more on this, see Expectations and Limitations.

Are there other potential applications of authorship attribution analysis apart from the traditional application to literary research?

Yes. Authorship attribution can be applied to diverse areas, such as intelligence (e.g. attributing messages to known terrorists),12 civil law (e.g. disputed copyright issues),13 and computer forensics (e.g. identifying the authors of software source code).14 Since the late 1990s, authorship analysis has shifted from addressing disputed authorship problems in traditional literary scholarship to addressing real-world texts like blogs and emails.

1. Juola, Patrick. “Authorship Attribution.” Foundations and Trends in Information Retrieval 1.3 (2006): 233.
2. Holmes, David I. “The Evolution of Stylometry in Humanities Scholarship.” Literary and Linguistic Computing 13.3 (1998): 111-112.
3. Stamatatos, Efstathios. “A Survey of Modern Authorship Attribution Methods.” Journal of the American Society for Information Science and Technology 60.3 (2009): 538.
4. Mendenhall, T. C. “The Characteristic Curves of Composition.” Science ns-9.214S (1887): 237-246. Print; Mosteller, Frederick, and David L. Wallace. Inference and Disputed Authorship: The Federalist. Reading, MA: Addison-Wesley, 1964. Print.