Catch, Tag, and Release: Coordinating our Efforts to Build the Early Modern Corpus

My revised abstract for the SAA 2016 Plenary Roundtable, “The Great Work Begins: EEBO-TCP in the Wild.” See all abstracts at Wine Dark Sea.

The work of correcting EEBO-TCP texts is formidable. MoEML’s work with EEBO-TCP’s XML files shows that transcribers need to fill gaps, capture forme work, correct mis-transcriptions, and restore early modern typographical habits and idiosyncrasies. Only with many partners working in coordination will we be able to establish an accurate corpus suitable for text mining, copy-text editing, and critical editions. We might think of such work in terms of a “catch-tag-release” model, whereby various entities “catch” EEBO-TCP texts from the data stream, “tag” them in TEI Simple (developed by Mueller), correct both tagging and transcriptions through teams of emerging scholars, and then “release” the texts back into the scholarly wilds. Mueller has already described how a corrective tagging process might work, and the Folger’s Digital Anthology project prototypes a repository environment that will allow us to release texts back into the wild. We also need to capture corrective work that has already been done, such as the ISE’s transcriptions of the quarto and folio texts of Shakespeare’s plays. These transcriptions are highly accurate, having been double-keyed by research assistants, carefully checked by the play editors, and peer reviewed. Their markup predates the development of XML or TEI, but can be dynamically converted (with some effort) into TEI Simple for general “release” alongside other EEBO-TCP transcriptions. From this stage, we can use various XSLT scenarios to convert the TEI Simple both into the plaintext suitable for corpus-wide analyses and into a variety of XML forms suitable for web publication and further editorial work. The limitations of EEBO-TCP transcriptions and the effort required to correct them should make us mindful of the effect of “unevenness” across the corpus. The ISE proposes to replace reasonably good EEBO-TCP transcriptions of Shakespeare’s plays with excellent transcriptions.
But what of the texts in which SAA members are less invested? Some of them have error rates of two or more errors per line. Which will we correct first? Will we bestow as much care and time on them as we have on Shakespeare? How will our answers to those questions affect the results of distant reading and data mining exercises?
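
To make the “release” step more concrete, here is a minimal sketch of the conversion from TEI-encoded text to plaintext for corpus-wide analysis. It uses Python’s standard library rather than the XSLT scenarios proposed above, and it is only an illustration: the sample document, the function names, and the choice to mark each gap with “[gap]” are all assumptions, not part of any existing MoEML, ISE, or EEBO-TCP workflow.

```python
import xml.etree.ElementTree as ET

# The TEI namespace, in the "{uri}" form that ElementTree uses for tag names.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample; real EEBO-TCP files are far larger and more varied.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>The <hi>great</hi> work <gap reason="illegible"/> begins.</p>
    <p>Second   paragraph.</p>
  </body></text>
</TEI>"""


def collect_text(elem, parts, gap_marker):
    """Append the text of elem and its descendants to parts, in document order."""
    if elem.tag == TEI_NS + "gap":
        # Keep untranscribed gaps visible so that analyses can count them.
        parts.append(gap_marker)
    else:
        if elem.text:
            parts.append(elem.text)
        for child in elem:
            collect_text(child, parts, gap_marker)
    if elem.tail:
        parts.append(elem.tail)


def tei_to_plaintext(xml_string, gap_marker="[gap]"):
    """Return one line of whitespace-normalized text per <p> in the TEI body."""
    root = ET.fromstring(xml_string)
    body = root.find(f".//{TEI_NS}body")
    if body is None:
        return ""
    lines = []
    # Assumes paragraphs are not nested inside one another.
    for p in body.iter(f"{TEI_NS}p"):
        parts = []
        collect_text(p, parts, gap_marker)
        lines.append(" ".join("".join(parts).split()))
    return "\n".join(lines)


if __name__ == "__main__":
    print(tei_to_plaintext(SAMPLE))
```

Keeping a visible marker for each gap, rather than silently dropping it, is one way to let a later analysis estimate how uneven the transcriptions are from text to text.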