To perform our analyses we used Stylo, an open-source package for R, a free software environment for statistical computing. Stylo combines complex statistical analyses with a user-friendly interface. A single function carries out most of the steps in a typical stylometric analysis, such as generating a list of the most frequent words and producing the calculations needed to visualise the data in a chosen graph.
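To make the idea of a most-frequent word list concrete, here is a minimal sketch in Python. It is not Stylo's code: the tokenizer, the function names and the two toy texts are assumptions made for illustration only.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def most_frequent_words(texts, n):
    """Return the n most frequent words across all texts, most frequent first."""
    counts = Counter()
    for text in texts.values():
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]

def relative_frequencies(text, word_list):
    """Frequency of each listed word as a percentage of all tokens in one text."""
    tokens = tokenize(text)
    if not tokens:
        return [0.0] * len(word_list)
    counts = Counter(tokens)
    return [100.0 * counts[word] / len(tokens) for word in word_list]

# Hypothetical toy corpus; real analyses use full-length texts.
corpus = {
    "text_a": "the cat sat on the mat and the dog sat by the door",
    "text_b": "the dog and the cat ran to the door",
}
mfw = most_frequent_words(corpus, 3)
table = {name: relative_frequencies(text, mfw) for name, text in corpus.items()}
```

Each row of `table` is one text described by the relative frequencies of the same words, which is the kind of table the multivariate methods described below take as input.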
We encourage anyone who wishes to pursue stylometry to download Stylo and experiment on their own with the different multivariate methods it offers. Here are a few technical terms necessary to understand in order to perform these analyses:
Sample Size: The sample size is the number of words from a text that you want to be used in the test, e.g. 4,500 words per sample. Stylo offers three options:
- No Sampling: the full length of each text you input is taken into account
- Normal Sampling: you only need to specify how many words each sample will comprise
- Random Sampling: the most effective of the three, since it draws a random sample of the number of words you indicate
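The difference between the two sampling modes can be sketched as follows. This is an illustration under stated assumptions, not Stylo's implementation: we assume normal sampling cuts a text into consecutive, non-overlapping samples, and that random sampling draws words without replacement. The function names are hypothetical.

```python
import random

def normal_sampling(tokens, sample_size):
    """Cut the text into consecutive, non-overlapping samples of sample_size words.

    A trailing remainder shorter than sample_size is dropped (an assumption).
    """
    n_samples = len(tokens) // sample_size
    return [tokens[i * sample_size:(i + 1) * sample_size] for i in range(n_samples)]

def random_sampling(tokens, sample_size, seed=None):
    """Draw sample_size words at random, without replacement, from the whole text."""
    rng = random.Random(seed)
    return rng.sample(tokens, sample_size)

# Hypothetical ten-word text; a real sample might be 4,500 words.
tokens = ["w%d" % i for i in range(10)]
consecutive = normal_sampling(tokens, 4)      # two samples of four words each
shuffled = random_sampling(tokens, 4, seed=1)  # four words from anywhere in the text
```

Passing a `seed` makes a random sample reproducible, which helps when a result needs to be checked later.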
Culling: The culling value sets how widely a word must be shared across the texts in order to be included in the analysis. For example, if I input 33 as my culling parameter, only the words that appear in at least 33% of the texts will be used in the analysis.
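The culling rule just described can be written out in a few lines of Python. This is a sketch of the rule as stated above, not Stylo's own code; the function name and the toy texts are assumptions.

```python
def cull(word_list, tokenized_texts, culling_percent):
    """Keep only the words that occur in at least culling_percent % of the texts."""
    vocabularies = [set(tokens) for tokens in tokenized_texts]
    n_texts = len(vocabularies)
    kept = []
    for word in word_list:
        n_containing = sum(1 for vocabulary in vocabularies if word in vocabulary)
        if 100.0 * n_containing / n_texts >= culling_percent:
            kept.append(word)
    return kept

# Hypothetical corpus of three very short texts.
texts = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "whale", "swam"],
]
words = ["the", "cat", "whale"]
shared_by_third = cull(words, texts, 33)   # "whale" is in 1 of 3 texts (33.3%), so it stays
shared_by_half = cull(words, texts, 50)    # "whale" is dropped
shared_by_all = cull(words, texts, 100)    # only "the" survives
```

A culling value of 0 keeps every word, while 100 keeps only the words found in every single text.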
Principal Component Analysis (PCA): PCA finds the directions of greatest variance in a high-dimensional data space and projects the data onto a lower-dimensional subspace, so that the information becomes readable and patterns can be observed. A typical PCA graph therefore has two axes, which together account for a large share of the variance in the data; the works of an author appear as points in the graph, and the distances between the points reflect how much the works differ along those two axes.
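For readers who want to see the mechanics, the following is a minimal PCA sketch in plain Python. It is not how Stylo computes PCA: we assume a small table of word frequencies (rows are texts, columns are words, values invented for illustration) and find the components by power iteration on the covariance matrix, a technique chosen here only because it needs no external libraries.

```python
import math

def mean_center(rows):
    """Subtract each column's mean so every word frequency is centred on zero."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of the (already centred) columns."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(matrix, vec):
    """Multiply a square matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def top_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    vec = [1.0 / (i + 1) for i in range(len(matrix))]  # arbitrary deterministic start
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # no variance left to explain
            return 0.0, vec
        vec = [x / norm for x in nxt]
    value = sum(v * w for v, w in zip(vec, mat_vec(matrix, vec)))
    return value, vec

def pca(rows, n_components=2):
    """Return the texts' coordinates on the first components and the variance shares."""
    centered = mean_center(rows)
    cov = covariance(centered)
    total_variance = sum(cov[i][i] for i in range(len(cov)))
    components, explained = [], []
    for _ in range(n_components):
        value, vec = top_eigenpair(cov)
        components.append(vec)
        explained.append(value / total_variance)
        # Deflation: remove the direction just found before looking for the next.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(len(vec))]
               for i in range(len(vec))]
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centered]
    return scores, explained

# Hypothetical frequencies of three words in four texts (two per author).
freqs = [[5.0, 1.0, 2.0], [5.2, 1.1, 2.0], [2.0, 3.0, 2.1], [2.1, 3.2, 1.9]]
scores, explained = pca(freqs)
```

Here `scores` holds the two coordinates that would be plotted for each text, and `explained` gives the share of the total variance carried by each axis.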
Multidimensional Scaling (MDS): MDS is another form of visual representation, one that specifically graphs the distances between a set of objects. For example, in an MDS graph, works of an author that are very similar are mapped close to each other, whereas less similar works are mapped farther apart.
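The sketch below shows one common variant, classical (metric) MDS, in plain Python. We do not claim this is the variant or the code Stylo uses; the distance matrix is invented, and the eigenvectors are found by power iteration purely to avoid external libraries.

```python
import math

def double_center(dist):
    """Convert a symmetric distance matrix into the centred Gram matrix used by classical MDS."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]

def top_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    n = len(matrix)
    vec = [float(i + 1) for i in range(n)]  # arbitrary deterministic start
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            return 0.0, [0.0] * n
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n))
    return value, vec

def classical_mds(dist, n_dims=2):
    """Place each object in n_dims dimensions so that map distances approximate dist."""
    gram = double_center(dist)
    n = len(gram)
    coords = [[0.0] * n_dims for _ in range(n)]
    for k in range(n_dims):
        value, vec = top_eigenpair(gram)
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            coords[i][k] = scale * vec[i]
        # Deflation: remove the axis just found before extracting the next one.
        gram = [[gram[i][j] - value * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return coords

# Hypothetical distances between three texts (they happen to form a 3-4-5 triangle).
distances = [[0.0, 3.0, 4.0],
             [3.0, 0.0, 5.0],
             [4.0, 5.0, 0.0]]
points = classical_mds(distances)
```

Because these three distances fit exactly on a flat plane, the two-dimensional map reproduces them; with real texts the map is only an approximation of the original distances.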
Cluster Analysis (CA): A CA or, in the type available in Stylo, a dendrogram, is an analysis in which the data are clustered together hierarchically (from the bottom up in this case) based on the dissimilarity between the units of information. Its advantage lies in how clearly it visualises the way the data cluster. For example, the most similar works of an author cluster at the bottom of the dendrogram, and less similar works join those clusters progressively higher up the hierarchy.
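The bottom-up procedure can be sketched as follows. This is an illustration, not Stylo's implementation: we assume average linkage (the distance between two clusters is the mean distance between their members), and the labels and distances are invented.

```python
def agglomerate(labels, dist):
    """Bottom-up clustering with average linkage.

    Returns the merge history as (sorted member labels, merge height) pairs,
    in the order the merges happen, i.e. from the bottom of the dendrogram up.
    """
    clusters = {i: [i] for i in range(len(labels))}
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                height = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merged = clusters.pop(a) + clusters.pop(b)
        clusters[a] = merged  # the merged cluster reuses the lower id
        merges.append((sorted(labels[i] for i in merged), height))
    return merges

# Hypothetical dissimilarities: two works each by authors A and B.
labels = ["A1", "A2", "B1", "B2"]
dist = [[0, 1, 10, 10],
        [1, 0, 10, 10],
        [10, 10, 0, 2],
        [10, 10, 2, 0]]
history = agglomerate(labels, dist)
```

Reading `history` from first to last entry corresponds to reading a dendrogram from the bottom up: the two A texts merge first, then the two B texts, and the two author clusters join only at the top.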
Other Open-Source Tools:
AntConc: AntConc is a relatively simple computational linguistics tool that runs concordances (listings of every occurrence of a word, shown in its surrounding context) and word-frequency lists on any text a user loads.
AntConc also allows users to search for clusters of words that appear frequently together in texts, highlighting relationships between words and phrases that might not be obvious to the naked eye. It also offers users the ability to output text analyses into cluster analyses.
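As a rough illustration of what such searches involve, the sketch below counts word clusters and lists a word's occurrences in context. It is not AntConc's code: we assume a "cluster" means a run of adjacent words (an n-gram), and the function names and sample sentence are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def cluster_counts(tokens, n=2):
    """Count every run of n adjacent words."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def keyword_in_context(tokens, keyword, window=2):
    """List each occurrence of keyword with up to `window` words on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines

tokens = tokenize("The cat sat on the mat. The cat slept.")
pairs = cluster_counts(tokens, 2)
hits = keyword_in_context(tokens, "cat")
```

Sorting `pairs` by count brings recurring phrases such as "the cat" to the top, while `hits` shows where and how a single word is used.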
Juxta Commons: Juxta (Latin for “alongside”) Commons has many of the same features as Stylo and AntConc, allowing users to upload and analyze text files, but has the added benefit, for which it is named, of allowing users to compare versions of texts with peers working on the same text. This crowd-sourced, open-source sharing of digitized text is tremendously beneficial when cleaning files for a corpus, as it can usually offer a majority opinion on any uncertainties arising around translation issues in a text.
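Comparing two versions of a text comes down to finding where their words diverge. The sketch below does this with Python's standard `difflib` module; it only illustrates the general idea and makes no claim about how Juxta Commons works internally. The function name and the two sample versions are assumptions.

```python
from difflib import SequenceMatcher

def collate(base, witness):
    """List the word-level differences between two versions of a text.

    Each entry is (kind, words in base, words in witness), where kind is
    "replace", "delete" or "insert".
    """
    a, b = base.split(), witness.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    variants = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            variants.append((tag, " ".join(a[a1:a2]), " ".join(b[b1:b2])))
    return variants

# Hypothetical readings of the same line in two editions.
variants = collate("the quick brown fox jumps", "the quick red fox jumps")
```

An empty result means the two versions agree word for word, which is a quick check when cleaning files for a corpus.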
Zotero: Zotero is an all-encompassing bibliographic reference manager, which can also be used to search for sources and save them to be read and referenced later. Zotero is unique as a bibliographic reference manager in that it can be used to clip and export text from far more websites and databases than any other, and has the capability to store and search PDFs, images, screenshots, audio, and video files. In keeping with our open-source ethos, Zotero is free, is not tied to a specific university affiliation, and does not even require an internet connection to use.