IMT – Editing Modernism in Canada
http://editingmodernism.ca

IMTweet
http://editingmodernism.ca/2010/06/imtweet/
Fri, 11 Jun 2010

#emicClouds

A little DHSI playtime for you. First, two word clouds: one of the DHSI Twitter feed, the other of the EMiC Twitter feed. Both feeds were collected using the JiTR webscraper, a beta tool in development by Geoffrey Rockwell at the University of Alberta.

#emic Twitter Feed in JiTR

How did I do this? First I scraped the text from the Twapper Keeper #dhsi2010 and #emic archives into JiTR so that I could clean it up a bit and take out the irrelevant header and footer text. Because JiTR lets you clean up the text (which is not an option in the Twapper Keeper export), you don’t have to work with messy bits that you don’t want to analyze. After that I saved my clean texts and generated what are called “reports.” The report feature creates a permanent URL that you can then paste into various TAPoRware tools. I ran the reports of the #dhsi2010 and #emic feeds through two TAPoRware text-analysis tools, Voyeur and Word Cloud.

#emic Twitter Feed in TAPoR Word Cloud

#dhsi2010 Twitter Feed in TAPoR Word Cloud

If you want to generate these word clouds and interact with them, paste the report URLs I generated using JiTR into the TAPoR Word Cloud tool.

[#emic]

http://ra.tapor.ualberta.ca/~jitr/contents/show_report/4848?key=921556750373594176491

[#dhsi2010]

http://ra.tapor.ualberta.ca/~jitr/contents/show_report/4847?key=78555118459803099008

If you want to try JiTR and do some webscraping and aggregating on your own, let me know and I’ll put you in touch with Geoffrey Rockwell. You’ll need a username and password to test it out.

Gender Peeps

One of the many things that caught my attention on the #emic and #dhsi2010 Twitter feeds was the sudden emergence mid-week of a stream of discourse surrounding gender and the digital humanities. I thought that it might be revealing to compare the ways in which the #emic and #dhsi2010 feeds differ. It turns out that just one tweet in the EMiC stream mentioned Susan Brown, and none of us picked up on the gender and digital humanities discussion that cropped up in the #dhsi stream. Obviously what this tells me is that Voyeur comparisons are only really useful if the documents you’re comparing contain the same keywords. I don’t think it really tells us much about the disposition of EMiC tweeters toward questions of gender. You can check out the results for yourself. Just paste the report URLs above into Voyeur. You’ll need to generate a favourites list of keywords (gender, women, female, feminism, etc.).

Blogospherics

So, after that lackluster result, I thought I’d turn my webscraper to the EMiC blog. I uploaded the individual URLs for each post to Voyeur. You can play around with some keywords on your own; here’s the URL to our blog corpus: http://voyeur.hermeneuti.ca/?corpus=1276258984042.7345. For the screenshot below I picked IMT, digital, and editing, since they are keywords widely used across a significant number of blog posts. What the stats and visualization confirm is what we all probably already know from the anecdotal reports from DHSI reflected in Twitter and blog activity: we’re all keenly interested in the development of IMT. I’ve been particularly impressed by the initiative of DEMiC participants in working directly with Meagan on IMT. What we need to do from here is sustain that dialogue through the blog over the summer and take the opportunity again this fall at the EMiC conference to reprise our conversations about IMT and assess what we’ve done so far toward implementing the features, standards, and protocols we’ve been discussing. Looking ahead to DHSI 2011, we’ll be in a position to ramp up our work with IMT in a specialized image-markup and edition-production seminar. I’d be very interested to see even more blog posts about curricular desiderata for that course.

EMiC Blog in Voyeur

Back to blog analytics. What I did next was scrape each individual blog post into JiTR using a feature called a text aggregator, which allowed me to list all of the URLs for each blog posting and scrape everything at once.

JiTR Text Aggregator

Then I uploaded the blog report URL (http://ra.tapor.ualberta.ca/~jitr/contents/show_report/5038?key=474528093615860579662) into Voyeur. As a comparator document, I also uploaded an updated scraping of the #emic Twitter feed. Let’s see what it tells us about differences between our blogging and tweeting about IMT.

IMT on EMiC Blog and Twitter Feed

There’s not all that much data to work with here, but what we might draw from this analysis is that IMT spiked early on in the Twitter feed (the green line) and was picked up on the blog (the blue line). We could extrapolate from this that our practice is a fairly common representation of the interactive relationship between tweeting and blogging: the conversation begins with probing tweets followed by more expansive commentary on blogs. That, in any case, was one of the reasons why Meg and I wanted EMiCites to tweet during DHSI: it’s not the Twitter feed that generates substantive content, but it does initiate a social exchange of ideas that finds its way to more formalized forums such as blogs and, as Emily has already suggested, journals. These are conversations that will also insinuate themselves into our roundtables and panels at the Conference on Editorial Problems at the University of Toronto in October, and they will be reprised during the EMiC roundtable session on Editorial Networks and Modernist Remediations at the Modernist Studies Association conference in Victoria in November. These conversations will in turn end up in the edited essay collections and special journal issues that we have planned. And, ultimately, our blogs and tweets from DHSI will inform the design and functionality of our digital editions, archives, and commons and their interoperability with toolkits such as those in development by Susan Brown’s CWRC project. It’s a discursive exchange that links print and digital media, a dialogic interaction that extends from tweets to editions, blogs to essays, digital toolkits to roundtables.

Standardisation & its (dis)contents
http://editingmodernism.ca/2010/06/standardisation-its-discontents/
Fri, 11 Jun 2010

At lunch today a few of us met to talk with Meagan about strategies for standardising our projects, including personographies and placeographies, so as to make our various editions as interoperable as possible and to avoid duplicating each other’s labour. By happy chance we were joined by Susan Brown, who mentioned that CWRC is also working towards a standardised personography template, which it might make sense for us to use too, given that EMiC will be one of the projects swimming around in the CWRC ‘fishtank’ (or whatever the term was that Susan used in her keynote).

One outcome of standardising in this way is that researchers could more easily connect our EMiC editions and authors to literatures outside Canada (e.g. through the NINES project), which would be brilliant in terms of bringing them to the attention of wider modernist studies.

Meagan and Martin are, unsurprisingly, way ahead of TEI newbies such as me, to whom this standardisation issue has only just occurred; they are already working on it in the form of a wiki. But, as Meagan said, they would like to hear from us, the user community, about what we would like to see included. Some things will be obvious, like birth and death dates, but might we also want to spend time, for example, encoding all the places where someone lived at all the different points in their life? That particular example seems to me simultaneously extremely useful and incredibly time-consuming. It also seems important to encode people’s roles (poet, editor, collaborator, literary critic, anthologist, etc.), but we need to have discussions about what that list looks like and how we define each of the terms. Then there are the terms used to describe the relationships between people. What does it mean that two people were ‘collaborators’, for instance? (New Provinces has six people’s names on the cover, but the archive makes it very clear that two of them had much more editorial sway than the others.) And how granular do we want to get with our descriptions?
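To make the discussion concrete, here is a rough sketch of what a single personography entry might look like in TEI. This is my own mock-up, not the template Meagan and Martin are drafting: the birth and death dates are Page’s, but the Montreal residence span and the relation to F.R. Scott (and his hypothetical xml:id) are purely illustrative.

    <listPerson>
      <person xml:id="pkpage">
        <persName>
          <forename>Patricia</forename>
          <forename>Kathleen</forename>
          <surname>Page</surname>
        </persName>
        <birth when="1916-11-23">
          <placeName>Swanage, England</placeName>
        </birth>
        <death when="2010-01-14">
          <placeName>Victoria, British Columbia</placeName>
        </death>
        <occupation>poet</occupation>
        <occupation>editor</occupation>
        <!-- illustrative residence span only -->
        <residence notBefore="1941" notAfter="1944">
          <placeName>Montreal</placeName>
        </residence>
      </person>
      <listRelation>
        <!-- #frscott is a hypothetical id for a second person entry -->
        <relation name="collaborator" mutual="#pkpage #frscott"/>
      </listRelation>
    </listPerson>

Even this toy example shows where the decisions lie: every additional element (another <residence>, a finer-grained set of <occupation> values, more <relation> types) multiplies the encoding labour across an entire personography.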

As for placeographies: as I’ve already said on the #emic Twitter feed, one very easy way to standardise these is to ensure we all use the same gazetteer for determining the latitude and longitude of a place when we put in our <geo> codes. I suggest this one at The Atlas of Canada. Once you have the latitude and longitude, there are plenty of sites that will convert them to decimals for you (one example is here).
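In TEI terms, a standardised placeography entry is mercifully simple. A minimal sketch of my own (the coordinates are Montreal’s, already converted to decimal form) might look like this:

    <listPlace>
      <place xml:id="montreal">
        <placeName>Montreal</placeName>
        <location>
          <!-- decimal latitude and longitude from an agreed-upon gazetteer -->
          <geo>45.5017 -73.5673</geo>
        </location>
      </place>
    </listPlace>

As long as we all draw our coordinates from the same gazetteer, entries like this should merge cleanly across our editions.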

As Paul pointed out, it’s worth making the most of times when we meet face-to-face, because as we go along, our projects will change, our analytical interests will be clarified, and the things we need to encode will only make themselves clear gradually. So let’s take advantage of the summer institutes and conferences to talk about the changing needs of our projects and our evolving research questions, because it’s often quicker to have these conversations in person.

Perhaps others who were around the table could chime in with things I’ve forgotten or misrepresented. And for everyone: what are your wish lists of things that you’d like to see included in our -ographies?

Thoughts so far on TEI
http://editingmodernism.ca/2010/06/thoughts-so-far-on-tei/
Tue, 08 Jun 2010

After two days of TEI fundamentals, I have come to a few conclusions.

First, the most interesting thing about TEI is not what you can do, but what you cannot do. TEI is, as far as I understand it, concerned only with the content of the text, ignoring everything else (paratextual elements, marginalia, interesting layout, etc.) – stuff some of us find extremely valuable. On top of this, the coding for variants is messy and complicated, and would be next to impossible to manage for heavily revised texts. For an example of TEI markup for variants, check out: http://www.wwp.brown.edu/encoding/workshops/uvic2010/presentations/html/advanced_markup_09.xhtml.
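For those who don’t want to click through, here is a minimal taste of what variant encoding looks like in TEI’s parallel-segmentation method (the witnesses and readings below are invented). Even a one-word variant requires its own apparatus entry, and the nesting gets unwieldy fast once variants overlap:

    <listWit>
      <witness xml:id="w1944">First edition, 1944</witness>
      <witness xml:id="w1954">Revised edition, 1954</witness>
    </listWit>

    <l>The sea was
      <app>
        <lem wit="#w1944">silver</lem>
        <rdg wit="#w1954">grey</rdg>
      </app>
      that morning.</l>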

As a result, I am glad that EMiC is developing an excellent IMT – this should solve many of the limitations of TEI, at least from my perspective.

Finally, while participants can all learn basic TEI and encoding, the next step of course would be to establish a CSS stylesheet. It seems to me that EMiC, like all publishing houses, should establish a single house style and design a stylesheet that all EMiC participants are free to use. This would ensure a consistent design and look across all EMiC-oriented projects in their digital form. Maybe this could be discussed at a roundtable at next year’s proposed EMiC-oriented course at DHSI.

Something to ponder.

IMT and the P.K. Page Digital Edition
http://editingmodernism.ca/2010/06/435/
Tue, 08 Jun 2010

Well team, I tried to give us a good plug at this morning’s talk.  Given the large size of our contingent this year, I thought it was important to let people know a bit about the project as a whole.  And, it also gives us a chance to define ourselves for ourselves, and remind us of who we represent while we are here.

Though I don’t know how to do it, I am going to attempt to post some of the sections from my talk today on the blog. They provide a taste of the IMT, which we will get a bigger helping of on Friday when Zailig & Meg give us a quick peek at the project in its current state of development. This also follows in Dean’s footsteps in reconfiguring his talk as a blog post.

Enjoy!

Instead of large bulky volumes, the Collected Works of P.K. Page project intends to produce inexpensive print volumes published by the small press Porcupine’s Quill that are based on a hypermedia archive housed through the EMiC website.  These selected texts are highlights that can be used in classrooms and can be enjoyed by a general reading public without a scholarly interest in Page’s work.  The hypermedia archive, which will allow free access to the complete body of Page’s work, is the baby of this project.

My role in this project is essentially to provide a trial (super-trial!) run.  Since we are still in the preliminary stages, I am going to walk you through some of the tools we plan to utilize, and will begin by explaining the editorial approach I bring to the project.

The project I produced as a component of my Master’s thesis is a small edition of poetry and poetry fragments that were written in 1957 by P.K. Page.  Working with original poetry manuscripts from Library and Archives Canada, I am interested in tracing the writing processes of P.K. Page. I want to write the story of the composition of a page of Page, attempting to mark out the sequential stages that a text goes through as it is being written.

Out of a desire to reject the concept that a work has one finalized artistic form, the genetic editorial tradition I draw from seeks to represent, in a readable way, every version of the text available. For this type of work, developing a clear method of transcription is most important. In many ways, the inaccessibility of these editions is the editor’s biggest fear. Because genetic editors become so immersed in the field and acquire specific technical knowledge from extended research, remaining comprehensible is important. A clear method of transcription needs to be both consistently applied by you, the editor, and understandable by a readership that does not share your technical knowledge.

At this stage in the project, each of Page’s poems has been transcribed using this system, which will now be converted into TEI to be represented digitally.  Extending far beyond a basic digital apparatus, the Collected Works digital archive will be a digital edition of every version of every page of Page that can be made available.

The digital toolbox we will draw from to create the online hypermedia archive includes TEI mark-up of each page (including TEI coding specific to genetic editing), the digitization of P.K. Page’s complete fonds by Library and Archives Canada, and UVic’s own Image Mark-up Tool. I will touch on each of these briefly.

The TEI encoding will provide the basis of our project.  Hopefully, we will collaborate with the already existing genetic editing TEI workgroup to develop and use tags that are specific to social text and genetic editing.  Some particular challenges we have encountered so far with the TEI include issues with overlapping hierarchies and attempting to figure out how to categorize a particular page.  With Page’s travel writing, for example, an image, poetry and prose might all occur on the same page.  This makes it difficult to determine what category the item falls into.

Another key issue is the intention of our edition. We aren’t interested in a diplomatic transcription of the page, but in its literary structures. For example, it is irrelevant to the genetic editor whether Page crossed out a line and wrote a new one over top, or drew an arrow pointing to the new revision. The markings she uses to represent additions, deletions, and changes in layout are less important to us than describing the process of revision the text went through.
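To give a sense of what this looks like in practice, here is a sketch of how a single revision might be encoded. The line is invented, and our genetic tagset is still being worked out, so take the details as provisional; the point is that the <subst> element records a deletion and an addition as one act of revision, regardless of whether Page used a strikethrough or an arrow on the page:

    <l>the
      <subst>
        <!-- one act of revision: the deletion and addition belong together -->
        <del rend="strikethrough">cold</del>
        <add place="above">green</add>
      </subst>
      water rose</l>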

To make our editorial process more apparent, as well as to give people access to the markings on the page, the digital archive will combine both image and text. Thanks to Library and Archives Canada, the P.K. Page fonds will be the first to be made entirely available digitally. These scanned images will allow all scholars to work with the same primary materials we had access to, and will allow comparison between the genetic edition we have created and the original text. In some ways, working with these scans is superior to working with the originals: we can zoom in on minute details, and we can compare different versions of a poem that may be housed in different locations.

And finally, and most excitingly, the Collected Works will be using the Image Markup Tool to combine image and text on the same page.  This will allow the genetic transcriptions we provide to be instantly accessible as a layer on top of the original image itself.

This collaboration, which has been facilitated by the EMiC project, will allow us to produce a tool which adheres to best practices in TEI markup but which has a simple enough interface that it can be used by EMiC participants with limited experience in editing XML code. The publication engine will link page images, transcriptions and annotations in an interactive interface, and will include features for navigating multiple versions of a source text.

I have included some screen shots to show how we use the IMT, and how the work produced by IMT will be presented in a browser window.

The Original Poem in a Form Similar to the One Provided by the LAC

To start, here is a high-quality digital image of a page of Page’s poetry, produced on a high-resolution scanner. Viewing this page online instead of in photocopy, a variety of textual features becomes apparent. I can immediately identify that Page used two types of ink, and therefore made two distinct sets of revisions. I can clearly make out the doodle and determine that it is not part of the text. And, finally, I can locate this text within its particular time and place of composition.

The Text Marked Up Genetically

The next image shows the genetic markup of the same page. In the digital archive, this would be available as its own entity, a central part of the edition. Here the colour coding and numbering allow us to identify the number of changes made to a particular line, as well as to easily identify text that was not revised. It also allows us to focus on the text itself, instead of being distracted by the physical details of the document on which the original was composed.

Marking up the Image in the IMT

The Work in Progress: The Image in the IMT

This image shows the process of working with the Image Mark-Up Tool. The current version allows us to draw rectangular boxes around sections of text and then annotate them. In the slide here, we are able to identify a revision and code it using both TEI and our own genetic mark-up language.
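Under the hood, annotations of this kind amount to TEI-style facsimile markup. Here is a simplified sketch of the sort of output involved; the file name and coordinates are invented, and the IMT’s actual schema may differ in its details:

    <facsimile>
      <surface>
        <graphic url="page_poem_1957.jpg"/>
        <!-- a rectangular box drawn around one revised line -->
        <zone xml:id="zone01" ulx="120" uly="340" lrx="580" lry="395"/>
      </surface>
    </facsimile>
    <text>
      <body>
        <!-- the transcription points back to its zone on the image -->
        <l facs="#zone01">the first revision, transcribed genetically</l>
      </body>
    </text>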

What it Might Look Like on the Web

And, finally, here is how the page will look in a browser. As you can see, by mousing over sections of text, you can have the transcriptions “pop up” as layers on top of the image. This functionality is going to be further developed to allow for non-rectangular shapes, as well as for even more layers of text to be compiled on top of the image. In this way, textual and explanatory notes, links to other versions of the text, and definitions and other social-text features can all be accessible from one page in the edition.

Though the project is still in its preliminary stages, we have invested a lot of time into exploring our options to create a new flagship project for the future of editing Canadian authors.  It is a very exciting time for Page scholarship, and for those of us interested in Editing Canadian Modernism.  Woot woot!

(PS, a big thanks to Zailig for the screen captures!)

A New Build: EMiC Tools in the Digital Workshop
http://editingmodernism.ca/2010/06/a-new-build-emic-tools-in-the-digital-workshop/
Mon, 07 Jun 2010

DEMiC +1

On the occasion of our 2010 DEMiC summer institute I’d like to present an interim report on EMiC’s major digital initiatives, our new institutional partnerships, and our four streams of collaborative digital-humanities research: (1) digitization, (2) image-based editing and markup, (3) text analysis, (4) and visualization.

Last June I trekked out to Victoria to attend the Digital Humanities Summer Institute with a group of graduate students, postdocs, and faculty affiliated with the EMiC project. There were a dozen of us; some came with skills and digital editing projects in the works, others were standing at the bottom of the learning curve staring straight up. Most enrolled in one of the two introductory courses in text encoding or digitization fundamentals. Meagan Timney, who is our first EMiC postdoctoral fellow, and I enrolled in Susan Brown and Stan Ruecker’s seminar on Digital Tools for Literary History. They introduced us to a whole range of text-analysis and visualization tools. I started to pick and choose tools that I thought might be useful for the EMiC kit. These tools have been principally intended for the analysis of text datasets, either plain vanilla transcriptions of the kind that one finds on Project Gutenberg or enriched transcriptions marked up in XML. The common denominator is obvious enough: these tools are designed to work with transcribed texts. But what if I wanted tools to work with texts rendered as digital images? What if I didn’t want to read transcribed texts but instead use tools that could read encoded digital images of remediated textual objects? What kind of tools are being developed for linking marked-up transcriptions to images? How can these tools be employed by scholarly editors?

When we started three years ago to assemble the people involved in the EMiC project, few of us could have imagined asking—never mind answering—these questions. At best I had thought that some of the more intrepid among us might produce hypertext editions, or hypermedia editions that provided digital facsimiles of manuscript and print materials along with transcribed and encoded texts that we edited. I thought that we’d mostly be working in print media, especially since so few of us possessed the coding skills necessary to produce much more than rudimentary electronic editions that didn’t really do anything more sophisticated than what we already could do in print. I’m an idealist trying to be realistic: our project funding lasts for seven years, but absurd as it may seem, I had already started to worry that it might not be long enough for any of us to master the skills we needed. More to the point: the researchers affiliated with the project are, for the most part, humanities scholars who come from institutions without year-round access to the kinds of digital humanities training and expertise available at UVic or McMaster or the University of Alberta and elsewhere. If the idea was to train EMiC participants so that we could work on our own or in collaboration with others on digital editions, it became staggeringly obvious to me that one week a year at DHSI wasn’t going to get many of us very far up the learning curve. Susan and Stan had already shown me that we didn’t all need to reinvent the wheel: that’s what tools are for—collaboration. And that’s what we needed: a collaborative image-based editing tool.

IMAGE-BASED EDITING
I came to this epiphany after a sleepless night in my UVic dorm room. Like many of my epiphanies, this one wasn’t all that original. The history of digital text-editing tools dates from the late 90s. According to my most recent count, there are as many as 28 different tools in various states, some long orphaned and others still at alpha and beta stages. Even narrowing it down to tools or collections of tools that allow users to edit or display images within the context of textual environments or editions, there are as many as 10 such open-source tools. These range from those that simply display an image alongside a text to software suites that support the development of complete image-based environments with substantial functionality beyond text-to-image mapping. I’ll just mention a few of these tools.

The simplest of these enable the viewing of images alongside text transcription, either for editing or display. Juxta, developed through the NINES project, provides a window for viewing image files alongside transcriptions, which could be useful for an editor checking readings or adding annotations, but does not provide any method for connecting the image with the text beyond the page level.

Figure 1: Juxta

Collex, another tool developed by NINES, lets users search across several different projects and view and compare edited texts and images; it does not, however, provide a way to create links between image and text, since it is a viewing tool—a collection or exhibit builder—and not an editor.

Figure 2: Collex

Similar in its simplicity is the Versioning Machine, developed by Susan Schreibman at the University of Maryland, a display tool for comparing encoded texts that also enables page images to be linked to the text at the page level.

Figure 3: The Versioning Machine

Among the more complex collaborative editing tools I’ve spent some time working with are TextGrid, developed by a consortium of universities in Germany, and the Edition Production and Presentation Technology (EPPT), developed by Kevin Kiernan at the University of Kentucky. TextGrid consists of two primary components: a repository for research data in the humanities (archives, databases, commons) and a laboratory of tools and services for the creation of digital editions. Though its specs are impressive, it’s stubbornly buggy and slow; it has frozen pretty much every time I’ve tried it.

Figure 4: TextGrid

EPPT is a robust set of tools for overlaying multiple images and encoding variants; collating manuscripts; gathering and displaying statistical data; generating palaeographical and codicological descriptions; developing glossaries; linking images and texts; and publishing editions. For scholars interested simply in linking image and text, this suite of tools is doubtless overkill. EPPT is probably the most sophisticated tool I’ve worked with so far, but it has been grandfathered at this point, partly because its image-text linking protocols aren’t conformant with the standards of XML markup set by the Text Encoding Initiative (TEI).

Figure 5: EPPT

The only reason I know any of this is because I went to DHSI last June. There I met and started to collaborate with digital humanists who have been instrumental to the transformation of EMiC from a predominantly print-based editing project to one that is increasingly invested in the implementation of computing tools toward the construction of online reading environments, databases, archives, editions, and commons.

EMiC has formed two partnerships to develop desktop-based image-editing tools and a publication engine to do TEI-compliant XML markup and XSLT rendering of not just transcribed texts but of the digitized images of texts. So far we have been building on the desktop-based Image Mark-up Tool (IMT) developed by Martin Holmes at UVic’s Humanities Computing and Media Centre. In its first build IMT was a Windows-based application limited to annotation of individual images; its next build, which has been undertaken in collaboration with our postdoc Meagan Timney, will be a cross-platform tool with the capacity for annotation of multiple images.

Figure 6: IMT 1.8

Our aim is to produce a tool which adheres to best practices in TEI markup but which has a simple enough interface that it can be used by EMiC participants with limited experience in editing XML code. The publication engine will link page images, transcriptions and annotations in an interactive interface, and will include features for navigating multiple versions of a source text. You can follow the development of the tool on the IMT blog.

Figure 7: XSLT rendered IMT page
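The figure above shows a page rendered through the engine’s XSLT stage. To give a rough idea of what that transformation does, here is a schematic sketch of my own (not the engine’s actual code; the class and attribute names are invented): each image-linked line in the TEI file becomes an HTML element carrying a pointer to its zone, which the browser layer can then use to position the pop-up transcription over the page image.

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0">
      <!-- turn each image-linked line into a span that remembers its zone -->
      <xsl:template match="tei:l[@facs]">
        <span class="imt-annotation"
              data-zone="{substring-after(@facs, '#')}">
          <xsl:apply-templates/>
        </span>
      </xsl:template>
    </xsl:stylesheet>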

The second partnership we’ve formed is with the Text-Image Linking Environment (TILE), currently in development by Doug Reside at the Maryland Institute for Technology in the Humanities (MITH), John Walsh of Indiana University Bloomington, and Dot Porter of the Digital Humanities Observatory at the Royal Irish Academy. The TILE project will produce a web-based, modular, collaborative image markup tool for both manual and semi-automated linking between encoded text, image of text, and image annotation. TILE is built on the existing code of the Ajax XML Encoder (AXE) image tagger, a web-based tool for ‘tagging’ text, video, audio, and image files with XML metadata.

Figure 8: AJAX XML Encoder

(The Shakespeare Quartos Archive is one example of a MITH project marked up using the AXE tool. There are demo videos that run through how to use the SQA.) By 2011, the TILE team will have produced software interoperable with other popular tools (including IMT) and capable of producing TEI-compliant XML for linking image to text. Because TILE is being designed to be interoperable with IMT (the key difference being that the former is web-based and the latter desktop-based), the EMiC project will be able to take advantage of both web and desktop environments. The obvious advantage of TILE is that it offers an environment for real-time collaborative editing, which is certainly something that EMiC could use to its advantage as a research and pedagogical tool to facilitate long-distance collaborations across our distributed network of researchers and institutions. That interoperability is crucial, since our publication engine will be designed to work interchangeably with both of these web-based and desktop-based tools.

DIGITIZATION
What is EMiC going to feed into these markup tools? EMiC is partnered with archives and libraries committed to large-scale digitization projects. In partnership with Library and Archives Canada, EMiC is undertaking the digitization of the LAC’s fonds of modernist authors, beginning with the literary papers of P.K. Page (followed by Elizabeth Smart, F.R. Scott, Marion Dale Scott, Miriam Waddington, Patrick Anderson, John Glassco, among other modernists). This new EMiC partner project will allow us to link to manuscript and typescript versions of published texts as well as previously unpublished texts, correspondence, and other archival materials. With Paul Hjartarson and his cohort of graduate-student collaborators (Kristin Fast, Stefanie Markle, and Kristine Smitka) at the University of Alberta, we are also working toward a project that will see the digitization of the Sheila Watson and Wilfred Watson fonds located at archives in Toronto and Edmonton.

None of the materials that will be processed by EMiC or its partners have been made available through current mass-digitization projects, including those undertaken by Google Books, Canadiana.org, and the Open Content Alliance (OCA)’s Internet Archive. Nor are these texts available in transcription on Project Gutenberg. One of the decisive methodological differences between the digitization and transcription of literary texts under the mandate of these mass-digitization projects and that of EMiC is our attention to the bibliographic and codicological particularities of texts. The library-driven preservationist program of Canadiana.org (which now extends to the Canadian Institute for Historical Microreproductions microfiche catalogue, including multiple editions of texts) comes closest to the digitization program of EMiC, but neither the image quality nor the file formats of any current national or international mass-digitization project are suitable for editorial work that requires high-resolution, full-spectrum scans of original materials. (A sidebar: I have recently been in touch with the digitization team at the University of Alberta, one of the members of the OCA, to find out how we can access the Internet Archive’s raw individual .tiff files instead of the files packaged for digital readers.) Even though mass-digitization resources are complementary to the work of producing digital editions and archives, these projects cannot be expected to meet the particular and specialized requirements of scholarly editors. To take full advantage of the collaborative editing tools under development by EMiC and its partners, and in conjunction with the digitization of archival materials, we will assume responsibility for the production and storage of a digital commons accessible to members of the project for their editorial work, to researchers who cannot otherwise consult physical copies, and to the general reading public. This is the developmental work that our postdocs Meagan Timney and Matt Huculak are in the process of prototyping.

This digital commons is not intended to replace the need for critically edited versions of modernist texts, nor is it designed to reproduce texts currently in print; it is, rather, a digital supplement to the corpus of critical editions already in production by EMiC editors and a repository of raw materials for the production of additional print and digital editions by graduate students, postdocs, and faculty.

TEXT ANALYSIS
EMiC is partnered with the Text Analysis Portal for Research (TAPoR), a multi-institutional digital humanities project under the direction of Geoffrey Rockwell at the University of Alberta that has set up research infrastructure for the study of texts using text-analysis tools. EMiC benefits from the partnership in two key ways:

(a) TAPoR has developed a portal that adapts commercial and open-source text tools to the needs of the research community. The portal is up and running; they are now extending it and have the technical capacity to support a digital commons such as EMiC’s. As part of the TAPoR portal they have been developing programs and utilities that make other tools work in a portal context. TAPoR is willing to share the program code for the portal with EMiC under an open-source license so that we can adapt it to our purposes. Where appropriate, they are also willing to assist in the configuration, testing, and adaptation of the portal.

(b) TAPoR at the University of Alberta and McMaster University has developed and is extending a suite of tools (TAPoRware, Voyeur, and JiTR) that process electronic texts. These tools can be embedded into remote web projects to add functionality to e-texts from projects like EMiC. TAPoR will consult with EMiC to make sure these tools work within our context and will adapt them where feasible to our needs. TAPoR is also willing to share the program code for these tools and utilities under an open-source license.

Figure 9: Voyeur

These text-analysis tools will be employed by researchers affiliated with EMiC to process the corpora digitized and marked up with metadata. You can consult the TAPoR training videos for help with using these tools.

VISUALIZATION
Rather than following past practices of transcribing texts and marking up transcriptions in the creation of electronic texts, EMiC and its partners will pioneer image-based editing, semantic markup, analysis, and visualization of texts in a field of emergent digital-humanities practices. Instead of producing reading environments based on linear-discursive transcriptions of texts, EMiC will produce, in collaboration with its partners, techniques and technologies for encoding and interpreting the complex relations among large collections of visual and audio objects in non-linear reading environments. For example, consider the series of textual visualizations by Stefanie Posavec, done in collaboration with Greg McInerny of Microsoft Research, which trace textual variants and revisions across six editions of Charles Darwin’s Origin of Species.

Figure 10: Entangled Word Bank

Figure 11: Entangled Word Bank (detail)

Consider also another visualization of the multiple editions of Darwin’s Origin, this time by Ben Fry, one of the creators of the Processing visualization environment. The potential of these Origin projects for the kinds of editorial, text-analysis, and visualization work that EMiC and its partners are undertaking has only just started to be explored.

Figure 12: The Preservation of Favoured Traces

What these particular projects lack, however, is any significant element of interactivity. It would be invaluable to create interactive visualization tools that render as ecological structures the relationships among different versions of texts, as well as the masses of linear lists of textual variants and revisions that editions typically relegate to the apparatus. That is what these two Origin projects already do: they produce aesthetic objects in their own right, instead of what many consider an unreadable apparatus of textual lists. This represents a major transformation in the way editors could imagine constructing the apparatuses of digital editions.

High-performance visualization environments for mass-digitized corpora (e.g., digital books, archives, and libraries) are just now emerging as the next evolutionary stage in the technologies and practices of image-based digital editing in the humanities. Recent developments in information-visualization theory and technology are modeled on ecosystems; see, for instance, the Mandala rich prospect browser developed by Stéfan Sinclair of McMaster University and Stan Ruecker of the University of Alberta.

Figure 13: Mandala

As emergent practices, these projects have yet to develop fully elaborated technologies and practices of producing image-based editing ecosystems, generating editions and archives as ecological structures, and reading the ecologies of archives and editions. If humanities scholars follow the work of the natural sciences in using algorithmic techniques to visualize ways of reading the environment, we can in turn develop technologies in which the environment “reads” text-based visual objects.

Not only can we design and adapt tools to visualize our digital editions, but we can extend the compass of these tools to read across multiple editions and across the whole of our digital commons. Examples of these kinds of analytic tools have been developed as part of The Visible Archive project, designed by Mitchell Whitelaw for the National Archives of Australia. You can view demo videos of the A1 Explorer and the Series Browser. You can also download the programs and launch them as Java executable files.

Figure 14: A1 Explorer

Figure 15: Series Browser

EMiC’s development of visualization tools will be undertaken in partnership and collaboration with programmers affiliated with the Canadian Writing Research Collaboratory at the University of Alberta, as well as with the Digital Infrastructure group of the Network in Canadian History and Environment (NiCHE) at the University of Western Ontario.

CWRC AND ORCA
Through its partnership with the CFI-funded (2010-15) Canadian Writing Research Collaboratory (CWRC), directed by Susan Brown at the Universities of Guelph and Alberta, EMiC will play a key role in the development of an innovative web-based service-oriented platform that features two key elements:

(a) a database (Online Research Canada, ORCA) to house born-digital scholarly materials, digitized texts, and metadata (indices, annotations, cross-references). Content and tools will be open access wherever possible and designed for interoperability with each other and with other systems. The database will be seeded with a range of existing digital materials, including material produced by the digitization initiatives undertaken by EMiC.

(b) a toolkit for empowering new collaborative modes of scholarly writing online; editing, annotating, and analyzing materials in and beyond ORCA; discovering and collaborating with researchers with intersecting interests; mining knowledge about relations, events and trends, through automated methods and interactive visualizations; and analyzing the system’s usage patterns to discover areas for further investigation. Forms of collaboration will range from the sharing and building of fundamental resources such as bibliographies to the collaborative production of born-digital historical and literary studies.

EMiC’s role in the development of the ORCA database will see our digitization initiatives, including our online editions and digital commons, integrated into a consolidated digital repository.

As a modular tool, TILE will be ported into the CWRC toolkit. As we move forward with CWRC’s development, EMiC will act as a liaison to enable the integration of this image-based editing tool into the toolkit. EMiC’s involvement in the creation of the CWRC toolkit hinges on our project’s partnerships with the IMT and TILE markup tools and their interoperability across web-based and desktop-based environments. Bringing an image-based markup tool into the CWRC toolkit and making it interoperable with the desktop-based IMT and the EMiC publication engine will exponentially increase the functionality of the platform for the scholarly-editing community and, consequently, the productivity of our project.

If you want to get involved in testing and designing EMiC digital tools, let me (dean.irvine@dal.ca) know and I’ll put you in contact with our tool developers. If you’ve tried these tools and want to let others know how you’ve used them, post a story on our blog. I’m hoping that our contingent of EMiCites will all have epiphanies at DHSI this year; it’s a place to imagine our digital futures.
