A little DHSI playtime for you. First, two word clouds: one of the DHSI Twitter feed, the other of the EMiC Twitter feed. Both feeds were collected using the JiTR webscraper, a beta tool in development by Geoffrey Rockwell at the University of Alberta.
How did I do this? First I scraped the text from the Twapper Keeper #dhsi2010 and #emic archives into JiTR. I did this because I wanted to clean it up a bit, take out some of the irrelevant header and footer text. Because JiTR allows you to clean up the text (which is not an option in the Twapper Keeper export) you don’t have to work with messy bits that you don’t want to analyze. After that I saved my clean texts and generated what are called “reports.” The report feature creates a permanent URL that you can then paste into various TAPoRware tools. I ran the reports of the #dhsi2010 and #emic feeds through two TAPoRware text-analysis tools, Voyeur and Word Cloud.
If you want to generate these word clouds and interact with them, paste the report URLs I generated using JiTR into the TAPoR Word Cloud tool.
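For the curious: the TAPoR tools do all of the counting for you, but the basic idea underneath a word cloud is just word-frequency tallying on the cleaned text. Here is a toy sketch in Python, with an invented stopword list (any real tool's list would be longer):

```python
import re
from collections import Counter

def word_counts(text, stopwords=frozenset({'the', 'a', 'and', 'of', 'to', 'rt'})):
    """Tokenize cleaned tweet text and tally word frequencies,
    skipping stopwords -- roughly the first step of a word-cloud tool."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

counts = word_counts("RT loving the TEI session at #dhsi2010, TEI all day")
# a cloud generator then scales each word's display size by its count
```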
At lunch today a few of us met to talk with Meagan about strategies for standardising our projects, including personographies and placeographies, so as to make our various editions as interoperable as possible and to avoid duplicating each other's labour. By happy chance we were joined by Susan Brown, who mentioned that CWRC is also working towards a standardised personography template which it might make sense for us to use too, given that EMiC will be one of the projects swimming around in the CWRC ‘fishtank’ (or whatever the term was that Susan used in her keynote).
One outcome of doing this is that our EMiC editions and authors could then be more easily connected by researchers to literatures outside Canada – e.g. through the NINES project – which would be brilliant in terms of bringing them to the attention of wider modernist studies.
Meagan and Martin are, unsurprisingly, way ahead of TEI newbies such as me to whom this standardisation issue has only just occurred, and they are already working on it, in the form of a wiki. But, as Meagan said, they would like to hear from us, the user community, about what we would like to see included. Some things will be obvious, like birth and death dates, but might we also want to spend time, for example, encoding all the places where someone lived at all the different points in their life? That particular example seems to me simultaneously extremely useful, and also incredibly time-consuming. It also seems important to encode people’s roles – poet, editor, collaborator, literary critic, anthologist etc – but we need to have discussions about what that list looks like, and how we define each of the terms. Then there are the terms used to describe the relationships between people. What does it mean that two people were ‘collaborators’, for instance? (New Provinces has six people’s names on the cover but the archive makes it very clear that two of them had much more editorial sway than the others.) And how granular do we want to get with our descriptions?
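For the sake of discussion, here is a rough sketch of what a single personography entry might look like in TEI. The element choices, the xml:ids, and the person referenced by #FRS1 are all my own inventions for illustration, not an agreed EMiC (or CWRC) template:

```xml
<listPerson>
  <person xml:id="AJMS1">
    <persName>A. J. M. Smith</persName>
    <birth when="1902">Montreal</birth>
    <death when="1980"/>
    <occupation>poet</occupation>
    <occupation>anthologist</occupation>
    <!-- residences at different points in a life could go here too,
         at some real cost in encoding time -->
  </person>
  <listRelation>
    <!-- how we define 'collaborator' is exactly the open question -->
    <relation name="collaborator" mutual="#AJMS1 #FRS1"/>
  </listRelation>
</listPerson>
```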
As for placeographies: as I’ve already said on the #emic twitterfeed, one very easy way to standardise these is to ensure we all use the same gazetteer for determining the latitude and longitude of a place when we put in our <geo> codes. I suggest this one at The Atlas of Canada. Once you have the latitude and longitude, there are plenty of sites that will convert them to decimals for you (one example is here).
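And if you would rather not depend on a website at all, the conversion itself is simple arithmetic: degrees, plus minutes over 60, plus seconds over 3600, negated for south and west. A quick sketch in Python (the function name is my own):

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to a signed decimal value.

    hemisphere is one of 'N', 'S', 'E', 'W'; south and west are negative.
    """
    decimal = degrees + minutes / 60.0 + seconds / 3600.0
    return -decimal if hemisphere in ('S', 'W') else decimal

# Victoria, BC is roughly 48 deg 25' 42" N, 123 deg 21' 53" W
lat = dms_to_decimal(48, 25, 42, 'N')
lon = dms_to_decimal(123, 21, 53, 'W')
```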
As Paul pointed out, it’s worth making the most of times when we meet face-to-face, because as we go along, our projects will change, our analytical interests will be clarified, and the things we need to encode will only make themselves clear gradually. So let’s take advantage of the summer institutes and conferences to talk about the changing needs of our projects, and our evolving research questions, because it’s often quicker to have these conversations in person.
Perhaps others who were around the table could chime in with things I’ve forgotten or misrepresented. And for everyone: what are your wish lists of things that you’d like to see included in our -ographies?
After two days of TEI fundamentals, I have come to a few conclusions.
First, the most interesting thing about TEI is not what you can do, but what you cannot do. TEI is, as far as I understand it, concerned only with the content of the text, ignoring everything else (paratextual elements, marginalia, interesting layout, etc.) – stuff some of us find extremely valuable. On top of this, the encoding for variants is messy and complicated, and would be next to impossible for heavily revised texts. For an example of TEI markup for variants, check out: http://www.wwp.brown.edu/encoding/workshops/uvic2010/presentations/html/advanced_markup_09.xhtml.
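For anyone who hasn't seen it, here is a minimal sketch of what TEI variant encoding can look like using the critical-apparatus elements. The witness ids and the readings are invented; now imagine multiplying this across every revision in a heavily reworked manuscript:

```xml
<p>The snow <app>
    <lem wit="#print1936">whirled</lem>
    <rdg wit="#typescript">swirled</rdg>
  </app> over the road.</p>
```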
As a result, I am glad that EMiC is developing an excellent IMT – this should solve many of the limitations of TEI, at least from my perspective.
Finally, while participants can all learn basic TEI and encoding, the next step of course would be to establish a CSS stylesheet. It seems to me that EMiC, like all publishing houses, should establish a single house style, and design a stylesheet that all EMiC participants are free to use. This would ensure a consistent design and look for all EMiC-oriented projects in their digital form. Maybe this could be something discussed at a roundtable at next year's proposed EMiC-oriented course at DHSI.
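To make the suggestion concrete, here is a toy sketch of what a shared stylesheet might look like if TEI documents were styled directly in the browser (browsers can apply CSS to arbitrary XML elements). The element selections and styling are purely illustrative, not any actual EMiC design:

```css
/* hypothetical shared EMiC stylesheet, applied directly to TEI elements */
lg       { display: block; margin: 1em 0; }   /* line group = stanza */
l        { display: block; }                  /* verse line */
persName { font-variant: small-caps; }
del      { text-decoration: line-through; color: #888; }
add      { color: #060; }
```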
Something to ponder.
Well team, I tried to give us a good plug at this morning’s talk. Given the large size of our contingent this year, I thought it was important to let people know a bit about the project as a whole. And, it also gives us a chance to define ourselves for ourselves, and remind us of who we represent while we are here. :)
Though I don’t know how to do it, I am going to attempt to post some of the sections from my talk today on the blog. They provide a taste of the IMT, which we will get a bigger helping of on Friday when Zailig & Meg give us a quick peek into the project in its current development. This also helps follow in Dean’s footsteps in the reconfiguration of his talk as a blog post.
On the occasion of our 2010 DEMiC summer institute I’d like to present an interim report on EMiC’s major digital initiatives, our new institutional partnerships, and our four streams of collaborative digital-humanities research: (1) digitization, (2) image-based editing and markup, (3) text analysis, and (4) visualization.
Last June I trekked out to Victoria to attend the Digital Humanities Summer Institute with a group of graduate students, postdocs, and faculty affiliated with the EMiC project. There were a dozen of us; some came with skills and digital editing projects in the works, others were standing at the bottom of the learning curve staring straight up. Most enrolled in one of the two introductory courses in text encoding or digitization fundamentals. Meagan Timney, who is our first EMiC postdoctoral fellow, and I enrolled in Susan Brown and Stan Ruecker’s seminar on Digital Tools for Literary History. They introduced us to a whole range of text-analysis and visualization tools. I started to pick and choose tools that I thought might be useful for the EMiC kit. These tools have been principally intended for the analysis of text datasets, either plain vanilla transcriptions of the kind that one finds on Project Gutenberg or enriched transcriptions marked up in XML. The common denominator is obvious enough: these tools are designed to work with transcribed texts. But what if I wanted tools to work with texts rendered as digital images? What if I didn’t want to read transcribed texts but instead use tools that could read encoded digital images of remediated textual objects? What kind of tools are being developed for linking marked-up transcriptions to images? How can these tools be employed by scholarly editors?