Editing Modernism in Canada


June 7, 2010

A New Build: EMiC Tools in the Digital Workshop

DEMiC +1

On the occasion of our 2010 DEMiC summer institute I’d like to present an interim report on EMiC’s major digital initiatives, our new institutional partnerships, and our four streams of collaborative digital-humanities research: (1) digitization, (2) image-based editing and markup, (3) text analysis, and (4) visualization.

Last June I trekked out to Victoria to attend the Digital Humanities Summer Institute with a group of graduate students, postdocs, and faculty affiliated with the EMiC project. There were a dozen of us; some came with skills and digital editing projects in the works, others were standing at the bottom of the learning curve staring straight up. Most enrolled in one of the two introductory courses in text encoding or digitization fundamentals. Meagan Timney, who is our first EMiC postdoctoral fellow, and I enrolled in Susan Brown and Stan Ruecker’s seminar on Digital Tools for Literary History. They introduced us to a whole range of text-analysis and visualization tools. I started to pick and choose tools that I thought might be useful for the EMiC kit. These tools have been principally intended for the analysis of text datasets, either plain vanilla transcriptions of the kind that one finds on Project Gutenberg or enriched transcriptions marked up in XML. The common denominator is obvious enough: these tools are designed to work with transcribed texts. But what if I wanted tools to work with texts rendered as digital images? What if I didn’t want to read transcribed texts but instead use tools that could read encoded digital images of remediated textual objects? What kind of tools are being developed for linking marked-up transcriptions to images? How can these tools be employed by scholarly editors?

When we started three years ago to assemble the people involved in the EMiC project, few of us could have imagined asking—never mind answering—these questions. At best I had thought that some of the more intrepid among us might produce hypertext editions, or hypermedia editions that provided digital facsimiles of manuscript and print materials along with transcribed and encoded texts that we edited. I thought that we’d mostly be working in print media, especially since so few of us possessed the coding skills necessary to produce much more than rudimentary electronic editions that didn’t really do anything more sophisticated than what we already could do in print. I’m an idealist trying to be realistic: our project funding lasts for seven years, but absurd as it may seem I already started to worry that it might not be long enough for any of us to master the skills we needed. More to the point: the researchers affiliated with the project are, for the most part, humanities scholars who come from institutions without year-round access to the kinds of digital humanities training and expertise available at UVic or McMaster or the University of Alberta and elsewhere. If the idea was to train EMiC participants so that we can work on our own or in collaboration with others on digital editions, it became staggeringly obvious to me that one week a year at DHSI wasn’t going to get many of us very far up the learning curve. Susan and Stan already showed me that we didn’t all need to invent the wheel: that’s what tools are for—collaboration. And that’s what we needed: a collaborative image-based editing tool.

I came to this epiphany after a sleepless night in my UVic dorm room. Like many of my epiphanies, this one wasn’t all that original. The history of digital text-editing tools dates from the late 90s. According to my most recent count, there are as many as 28 different tools in various states, some long orphaned and others still at alpha and beta stages. Even narrowing it down to tools or collections of tools that allow users to edit or display images within the context of textual environments or editions, there are as many as 10 such open-source tools. These range from those that simply display an image alongside a text to software suites that support the development of complete image-based environments with substantial functionality beyond text-to-image mapping. I’ll just mention a few of these tools.

The simplest of these enable the viewing of images alongside text transcription, either for editing or display. Juxta, developed through the NINES project, provides a window for viewing image files alongside transcriptions, which could be useful for an editor checking readings or adding annotations, but does not provide any method for connecting the image with the text beyond the page level.

Figure 1: Juxta

Collex, another tool developed by NINES, can search across several different projects and view and compare edited texts and images; it does not, however, provide a way to create links between image and text, since it is a viewing tool—a collection or exhibit builder—and not an editor.

Figure 2: Collex

Similar in its simplicity is the Versioning Machine, developed by Susan Schreibman at the University of Maryland, a display tool for comparing encoded texts that also enables page images to be linked to the text at the page level.

Figure 3: The Versioning Machine

Among the more complex collaborative editing tools I’ve spent some time working with are TextGrid, developed by a consortium of universities in Germany, and the Edition Production and Presentation Technology (EPPT), developed by Kevin Kiernan at the University of Kentucky. TextGrid consists of two primary components, a repository for research data in the humanities (archives, databases, commons) and a laboratory of tools and services for the creation of digital editions. Though its specs are impressive, it’s stubbornly buggy and slow; it freezes pretty much every time I’ve tried it.

Figure 4: TextGrid

EPPT is a robust set of tools for overlapping multiple images and encoding variants; collating manuscripts; gathering and displaying statistical data; generating paleographical and codicological description; developing glossaries; linking images and texts; and publishing editions. For scholars interested simply in linking image and text, this suite of tools is doubtless overkill. EPPT is probably the most sophisticated tool I’ve worked with so far, but it’s been grandfathered at this point, partly because its image-text linking protocols aren’t conformant with standards of XML markup set by the Text Encoding Initiative (TEI).

Figure 5: EPPT

The only reason I know any of this is that I went to DHSI last June. There I met and started to collaborate with digital humanists who have been instrumental to the transformation of EMiC from a predominantly print-based editing project to one that is increasingly invested in the implementation of computing tools toward the construction of online reading environments, databases, archives, editions, and commons.

EMiC has formed two partnerships to develop desktop-based image-editing tools and a publication engine to do TEI-compliant XML markup and XSLT rendering of not just transcribed texts but of the digitized images of texts. So far we have been building on the desktop-based Image Mark-up Tool (IMT) developed by Martin Holmes at UVic’s Humanities Computing and Media Centre. In its first build IMT was a Windows-based application limited to annotation of individual images; its next build, which has been undertaken in collaboration with our postdoc Meagan Timney, will be a cross-platform tool with the capacity for annotation of multiple images.

Figure 6: IMT 1.8

Our aim is to produce a tool which adheres to best practices in TEI markup but which has a simple enough interface that it can be used by EMiC participants with limited experience in editing XML code. The publication engine will link page images, transcriptions and annotations in an interactive interface, and will include features for navigating multiple versions of a source text. You can follow the development of the tool on the IMT blog.

Figure 7: XSLT rendered IMT page
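
To make the idea of image-text linking a little more concrete, here is a minimal sketch in Python of the kind of markup involved. It assumes the TEI P5 facsimile elements (facsimile, surface, graphic, zone) and the facs pointer attribute. The function name build_linked_page, the image file name, the zone identifiers, and the choice of an ab element for each transcribed block are my own illustrative assumptions, not the structure that IMT or TILE actually produce. The result is a fragment rather than a complete, valid TEI document (it has no teiHeader).

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def tei(tag):
    """Return a tag name qualified with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_linked_page(image_url, zones):
    """Build a TEI fragment that links rectangular image zones to text.

    `zones` is a list of (zone_id, (ulx, uly, lrx, lry), transcription)
    tuples; coordinates are pixel offsets on the page image (hypothetical).
    """
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace
    root = ET.Element(tei("TEI"))

    # The facsimile section describes the page image and its regions.
    facsimile = ET.SubElement(root, tei("facsimile"))
    surface = ET.SubElement(facsimile, tei("surface"))
    ET.SubElement(surface, tei("graphic"), {"url": image_url})

    # The text section holds the transcription.
    text = ET.SubElement(root, tei("text"))
    body = ET.SubElement(text, tei("body"))

    for zone_id, (ulx, uly, lrx, lry), transcription in zones:
        ET.SubElement(surface, tei("zone"), {
            f"{{{XML_NS}}}id": zone_id,
            "ulx": str(ulx), "uly": str(uly),
            "lrx": str(lrx), "lry": str(lry),
        })
        # Each transcribed block points back to its zone through @facs.
        block = ET.SubElement(body, tei("ab"), {"facs": f"#{zone_id}"})
        block.text = transcription
    return root


if __name__ == "__main__":
    page = build_linked_page(
        "page-001.tif",  # hypothetical file name
        [("z1", (120, 80, 940, 140), "First line of the typescript")],
    )
    print(ET.tostring(page, encoding="unicode"))
```

The point of the sketch is only that the link between image and text is itself a piece of markup: a region of the page image gets an identifier, and the transcription points to it.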

The second partnership we’ve formed is with the Text-Image Linking Environment (TILE), currently in development by Doug Reside at the Maryland Institute for Technology in the Humanities (MITH), John Walsh of Indiana University Bloomington, and Dot Porter of the Digital Humanities Observatory at the Royal Irish Academy. The TILE project will produce a web-based, modular, collaborative image markup tool for both manual and semi-automated linking between encoded text, image of text, and image annotation. TILE is built on the existing code of the Ajax XML Encoder (AXE) image tagger, a web-based tool for ‘tagging’ text, video, audio, and image files with XML metadata.

Figure 8: AJAX XML Encoder

(The Shakespeare Quartos Archive is one example of a MITH project marked up using the AXE tool. There are demo videos that run through how to use the SQA.) By 2011, the TILE team will have produced software interoperable with other popular tools (including IMT) and capable of producing TEI-compliant XML for linking image to text. Because TILE is being designed to be interoperable with IMT, the key difference being that the former is web-based and the latter desktop-based, the EMiC project will be able to take advantage of both web and desktop environments. The obvious advantage of TILE is that it offers an environment for real-time collaborative editing, which is certainly something that EMiC could use to its advantage as a research and pedagogical tool to facilitate long-distance collaborations across our distributed network of researchers and institutions. That TILE is designed to be interoperable with IMT is crucial, since our publication engine will be designed to work interchangeably with both of these web-based and desktop-based tools.

What is EMiC going to feed into these markup tools? EMiC is partnered with archives and libraries committed to large-scale digitization projects. In partnership with Library and Archives Canada, EMiC is undertaking the digitization of the LAC’s fonds of modernist authors, beginning with the literary papers of P.K. Page (followed by Elizabeth Smart, F.R. Scott, Marion Dale Scott, Miriam Waddington, Patrick Anderson, John Glassco, among other modernists). This new EMiC partner project will allow us to link to manuscript and typescript versions of published texts as well as previously unpublished texts, correspondence, and other archival materials. With Paul Hjartarson and his cohort of graduate-student collaborators (Kristin Fast, Stefanie Markle, and Kristine Smitka) at the University of Alberta, we are also working toward a project that will see the digitization of the Sheila Watson and Wilfred Watson fonds located at archives in Toronto and Edmonton.

None of the materials that will be processed by EMiC or its partners have been made available through current mass-digitization projects, including those undertaken by Google Books, Canadiana.org, and the Open Content Alliance (OCA)’s Internet Archive. Nor are these texts available in transcription on Project Gutenberg. One of the decisive methodological differences between the digitization and transcription of literary texts under the mandate of these mass-digitization projects and that of EMiC is our attention to the bibliographic and codicological particularities of texts. While the library-driven preservationist program of Canadiana.org (which now includes the digitization of the Canadian Institute for Historical Microreproductions microfiche catalogue, and which includes multiple editions of texts) comes closest to the digitization program of EMiC, neither the image quality nor the file formats of any current national or international mass-digitization projects are suitable to editorial projects that require high-resolution, full-spectrum scans of original materials. (A sidebar: I have recently been in touch with the digitization team at the University of Alberta, which is one of the members of the OCA, to find out how we can access the Internet Archive’s raw individual .tiff files instead of the files packaged for digital readers.) Even though mass-digitization resources are complementary to the work of producing digital editions and archives, these projects cannot really be expected to meet the particular and specialized requirements of scholarly editors.

To take full advantage of the collaborative editing tools under development by EMiC and its partners, and in conjunction with the digitization of archival materials, we will assume responsibility for the production and storage of a digital commons which will be accessible to members of the project for their editorial work, to researchers who cannot otherwise consult physical copies, and to the general reading public. This is the developmental work that our postdocs Meagan Timney and Matt Huculak are in the process of prototyping.

This digital commons is not intended to replace the need for critically edited versions of modernist texts, nor is it designed to reproduce texts currently in print; it is, rather, a digital supplement to the corpus of critical editions already in production by EMiC editors and a repository of raw materials for the production of additional print and digital editions by graduate students, postdocs, and faculty.

EMiC is partnered with the Text Analysis Portal for Research (TAPoR), a multi-institutional digital humanities project under the direction of Geoffrey Rockwell at the University of Alberta that has set up research infrastructure for the study of texts using text-analysis tools. EMiC benefits from the partnership in two key ways:

(a) TAPoR has developed a portal that adapts commercial and open-source text tools to the needs of the research community. The portal is working; they are now extending it and have the technical capacity to support a digital commons such as EMiC’s. As part of the TAPoR portal they have been developing programs or utilities that make other tools work in a portal context. TAPoR is willing to share the program code for the portal under an open-source license with EMiC so that we can adapt it to our purposes. Where appropriate they are also willing to assist in the configuration, testing, and adaptation of the portal.

(b) TAPoR at the University of Alberta and McMaster University has developed and is extending a suite of tools (TAPoRware, Voyeur, and JiTR) that process electronic texts. These tools can be embedded into remote web projects to add functionality to e-texts from projects like EMiC. TAPoR will consult with EMiC to make sure these work within our context and adapt them where feasible to our needs. TAPoR is also willing to share the program code for these tools and utilities under an open-source license.

Figure 9: Voyeur

These text-analysis tools will be employed by researchers affiliated with EMiC to process the corpora digitized and marked up with metadata. You can consult the TAPoR training videos for help with using these tools.
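
For readers who have never run a text-analysis tool, the most basic operation, counting word frequencies in a transcription, can be sketched in a few lines of Python. This is a generic illustration, not the code or interface of TAPoRware, Voyeur, or JiTR; the function name word_frequencies, the sample sentence, and the tiny stopword list are my own assumptions.

```python
import re
from collections import Counter


def word_frequencies(text, stopwords=frozenset()):
    """Count how often each word occurs in a plain-text transcription.

    Words are runs of letters (accented letters included), lowercased;
    anything listed in `stopwords` is ignored.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(word for word in words if word not in stopwords)


if __name__ == "__main__":
    sample = "The sea, the sea and the shore"  # hypothetical sample text
    counts = word_frequencies(sample, stopwords={"the", "and"})
    for word, count in counts.most_common():
        print(word, count)
```

Tools like the ones named above layer a great deal on top of counts of this kind (concordances, collocations, visualizations), but a frequency list is a reasonable first picture of what it means to process a corpus.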

Rather than following past practices of transcribing texts and marking up transcriptions in the creation of electronic texts, EMiC and its partners will pioneer image-based editing, semantic markup, analysis, and visualization of texts in a field of emergent practices in digital-humanities scholarship. Instead of producing reading environments based on linear-discursive transcriptions of texts, EMiC will produce in collaboration with its partners techniques and technologies for encoding and interpreting the complex relations among large collections of visual and audial objects in non-linear reading environments. For example, consider the series of textual visualizations by Stefanie Posavec done in collaboration with Greg McInerny of Microsoft Research. They include textual variants and revisions among six versions of Charles Darwin’s Origin of Species.

Figure 10: Entangled Word Bank

Figure 11: Entangled Word Bank (detail)

Consider also another visualization of the multiple editions of Darwin’s Origin, this time by one of the creators of the Processing visualization environment, Ben Fry. The potential of these Origin projects to the kinds of editorial, text-analysis, and visualization work that EMiC and its partners are undertaking has only just started to be explored.

Figure 12: The Preservation of Favoured Traces

What these particular projects lack, however, is any significant element of interactivity. It will be invaluable to create interactive visualization tools that render in ecological structures the relationships among different versions of texts as well as the masses of linear lists of textual variants and revisions that editions typically relegate to the apparatus of the text. That is what these two Origin projects already do: they produce aesthetic objects as such, instead of what many consider an unreadable apparatus of textual lists. This represents a major transformation in the way in which editors could imagine constructing the apparatuses of digital editions.
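
The linear list of variants that a visualization would replace can itself be generated by machine. The following is a minimal sketch in Python, using the standard library’s difflib module to compare two versions of a passage word by word. It is not the method used in either of the Origin projects, and the function name list_variants and the sample readings are hypothetical.

```python
import difflib


def list_variants(base, revised):
    """List word-level differences between two versions of a passage.

    Returns (kind, base_reading, revised_reading) tuples, where kind is
    'replace', 'delete', or 'insert' as reported by difflib.
    """
    base_words = base.split()
    revised_words = revised.split()
    matcher = difflib.SequenceMatcher(
        a=base_words, b=revised_words, autojunk=False
    )
    variants = []
    for kind, i1, i2, j1, j2 in matcher.get_opcodes():
        if kind != "equal":  # keep only the places where the versions differ
            variants.append((
                kind,
                " ".join(base_words[i1:i2]),
                " ".join(revised_words[j1:j2]),
            ))
    return variants


if __name__ == "__main__":
    # Hypothetical readings from two versions of the same line.
    first = "the preservation of favoured races"
    second = "the preservation of favored races"
    for variant in list_variants(first, second):
        print(variant)
```

A list like this is exactly the raw material of a textual apparatus; the interesting editorial question is what to build on top of it.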

High-performance visualization environments for mass-digitized corpora (e.g., digital books, archives, and libraries) are just now emerging as the next evolutionary stage in the technologies and practices of image-based digital editing in the humanities. Recent developments in information-visualization theory and technology are modeled on ecosystems; see, for instance, the Mandala rich prospect browser developed by Stéfan Sinclair of McMaster University and Stan Ruecker of the University of Alberta.

Figure 13: Mandala

As emergent practices, these projects have yet to develop fully elaborated technologies and practices of producing image-based editing ecosystems, generating editions and archives as ecological structures, and reading the ecologies of archives and editions. If humanities scholars follow the work of the natural sciences in using algorithmic techniques to visualize ways of reading the environment, we can in turn develop technologies in which the environment “reads” text-based visual objects.

Not only can we design and adapt tools to visualize our digital editions, but we can extend the compass of these tools to read across multiple editions and across the whole of our digital commons. Examples of these kinds of analytic tools have been developed as part of The Visible Archive project, designed by Mitchell Whitelaw for the National Archives of Australia. You can view demo videos of the A1 Explorer and the Series Browser. You can also download the programs and launch them as Java executable files.

Figure 14: A1 Explorer

Figure 15: Series Browser

EMiC’s development of visualization tools will be undertaken in partnership and collaboration with programmers affiliated with the Canadian Writing Research Collaboratory at the University of Alberta, as well as the Digital Infrastructure group of the Network in Canadian History and the Environment (NiCHE) at the University of Western Ontario.

Through its partnership with the CFI-funded (2010-15) Canadian Writing Research Collaboratory (CWRC), directed by Susan Brown at the Universities of Guelph and Alberta, EMiC will play a key role in the development of an innovative web-based service-oriented platform that features two key elements:

(a) a database (Online Research Canada, ORCA) to house born-digital scholarly materials, digitized texts, and metadata (indices, annotations, cross-references). Content and tools will be open access wherever possible and designed for interoperability with each other and with other systems. The database will be seeded with a range of existing digital materials, including material produced by the digitization initiatives undertaken by EMiC.

(b) a toolkit for empowering new collaborative modes of scholarly writing online; editing, annotating, and analyzing materials in and beyond ORCA; discovering and collaborating with researchers with intersecting interests; mining knowledge about relations, events and trends, through automated methods and interactive visualizations; and analyzing the system’s usage patterns to discover areas for further investigation. Forms of collaboration will range from the sharing and building of fundamental resources such as bibliographies to the collaborative production of born-digital historical and literary studies.

EMiC’s role in the development of the ORCA database will see our digitization initiatives, including our online editions and digital commons, integrated into a consolidated digital repository.

As a modular tool, TILE will be ported into the CWRC toolkit. As we move forward with CWRC’s development, EMiC will work as a liaison to enable the integration of this image-based editing tool into the toolkit. EMiC’s involvement in the creation of the CWRC toolkit hinges on our project’s partnerships with the IMT and TILE markup tools and their interoperability in web-based and desktop-based environments. By bringing an image-based markup tool into the CWRC toolkit and by making it interoperable with the desktop-based IMT and the EMiC publication engine, we will greatly increase the functionality of the platform for the scholarly editing community and, consequently, the productivity of our project.

If you want to get involved in testing and designing EMiC digital tools, let me (dean.irvine@dal.ca) know and I’ll put you in contact with our tool developers. If you’ve tried these tools and want to let others know how you’ve used them, post a story on our blog. I’m hoping that our contingent of EMiCites will all have epiphanies at DHSI this year; it’s a place to imagine our digital futures.

