A robust, accurate, and uniform corpus is essential to performing meaningful stylometric analysis. The corpus is the foundation of any stylometric project, and a project’s results will only be as good as its corpus. In the interest of offering future scholars a process to work from in compiling their own corpora, here are the steps we went through to prepare ours for this project:
- Determine Parameters for Inclusion
An important first step in compiling a corpus is deciding what to include and what to leave out. Our corpus was constructed specifically to explore the authorship of our main text, The Dynamiter. Since The Dynamiter is a novel, and cross-genre analysis can be problematic, we decided to limit ourselves to prose.
A glance at the Stevensons’ biographies will illustrate that whereas Robert Louis wrote a large number of stories and, most importantly, novels, Fanny wrote only a handful of short stories. This meant that we had to trim Robert Louis’ true corpus so that its final word count would not exceed Fanny’s too dramatically.
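The balancing step above can be sketched in code. The following is a minimal illustration, not our actual procedure: the helper names and the policy of dropping whole works (smallest first) until the larger author's corpus falls within a target word count are our assumptions for the example.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens as a rough word count."""
    return len(text.split())


def trim_corpus(works: dict, target: int) -> dict:
    """Keep whole works (smallest first) until adding another would
    exceed the target word count. Keeping works intact, rather than
    cutting mid-text, preserves each work's internal style."""
    kept, total = {}, 0
    for title, text in sorted(works.items(), key=lambda kv: word_count(kv[1])):
        n = word_count(text)
        if total + n <= target:
            kept[title] = text
            total += n
    return kept
```

In practice one would read each work from a plain text file and choose the target from the smaller author's total word count.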
Once the parameters for the corpus are defined, the next step is finding the texts to fill it.
- Locating Texts
If you are lucky enough to be working with texts whose copyrights have expired, as was our case, finding texts will be easier (which is not to say that it will be easy). Fanny is a relatively obscure author, and any critical and popular attention she has attracted is dwarfed by that paid to her husband. Even so, we were able to locate the majority of both Fanny’s and Robert Louis’ works in open access archives, including UFDC, Hathitrust Digital Library, and The Internet Archive.
Most open access sites will allow you to download a PDF version of their documents. But since stylometric analysis is run on plain text documents, acquiring PDFs is only one part of the process. From PDFs, content will need to be transcribed manually or by Optical Character Recognition (OCR) software like ABBYY, which is generally not offered free of charge.
Another option is to acquire plain text versions of texts directly. This is not a feature available in all archives, but The Internet Archive, for example, offers a “Full Text” option among its many download formats. The problem with the “Full Text” option is that its texts will almost always be full of mistakes and formatting issues from the digitizing process.
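For items on The Internet Archive, the “Full Text” derivative can usually be retrieved from a predictable URL. The sketch below assumes the archive's current `_djvu.txt` naming convention for OCR'd text, which may change, and the item identifier in the usage note is illustrative only.

```python
from urllib.request import urlopen


def fulltext_url(identifier: str) -> str:
    """Build the plain-text download URL for an archive.org item,
    assuming the current `_djvu.txt` derivative naming scheme."""
    return f"https://archive.org/download/{identifier}/{identifier}_djvu.txt"


def fetch_fulltext(identifier: str) -> str:
    """Download the OCR'd plain text (requires network access)."""
    with urlopen(fulltext_url(identifier)) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Given a hypothetical identifier such as `dynamiter00stev`, `fetch_fulltext("dynamiter00stev")` would return the raw OCR output, which then still needs the cleaning described below.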
In short, none of these methods is ideal or problem-free. It is unlikely that all required texts will be located in a single archive, meaning plain text versions will not all be acquired in the same way. To analyze The Dynamiter, we used a combination of all of these methods to curate our final corpus, and we believe most scholars compiling digital corpora will need to do the same.
- Cleaning Texts
Scholars should prepare to devote a significant portion of time to converting PDF and plain text files into versions suitable for analysis, a tedious yet crucial task. Teamwork and clear communication are essential to cleaning texts as quickly, accurately, and consistently as possible.
In order to clean texts for analysis, a physical version of the text (or a PDF image) should be compared side by side with the plain text document. Discrepancies, spelling mistakes, missing text, and strange characters are quite common and should be caught at this stage. Especially for teams running different operating systems, we recommend Notepad++ for Windows and TextWrangler for Mac.
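Some of the most mechanical cleanup can be scripted before the side-by-side proofread begins. The pass below is a hypothetical illustration of that first sweep, not a substitute for manual comparison: it targets artifacts common in OCR'd text (ligature characters, soft hyphens, words broken across line ends, page-break characters) and leaves everything else for human review.

```python
import re
import unicodedata


def clean_ocr_text(raw: str) -> str:
    """A first, automated cleanup pass over OCR'd plain text."""
    text = unicodedata.normalize("NFKC", raw)   # fold ligatures like "ﬁ" into "fi"
    text = text.replace("\u00ad", "")           # drop invisible soft hyphens
    text = text.replace("\f", "\n")             # turn form-feed page breaks into newlines
    text = re.sub(r"-\n(\w)", r"\1", text)      # rejoin words hyphenated at line breaks
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)      # collapse long runs of blank lines
    return text.strip()
```

Anything this pass cannot safely decide, such as misrecognized characters or missing passages, is exactly what the manual side-by-side comparison is for.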