© 2004-2008 TAPoR @ UAlberta
The written word is one of the most important ways we communicate and preserve information. Whether it takes the form of legal records, novels, historical records, medical case studies, or now WWW pages, written text is an important form of data to preserve. It is one of the primary means by which we communicate in industry, in academia, or for pleasure. As an increasing number of the texts that we care about are created and accessed in electronic form, Canada needs a well-thought-out strategy for preserving those electronic texts that will be of use to future generations. Our future understanding of the past, and of this age of technological change, will be incomplete if we do not take steps to preserve one of the most widely used forms of electronic information: the electronic text.
What is an Electronic Text?
Electronic texts digitally represent oral or written
language in a form suitable for analysis with
a computer. Typically an electronic text is either
an electronic version of a written work, an electronic
version of a transcript of an oral event, or a
document composed on the computer. In any case
the information in an electronic text is meant
to be in a natural language that can be read by
humans when displayed properly. Some examples
of electronic texts would be:
An e-mail message
A medical case study that is stored on a computer
WWW pages that contain a significant amount of text so that to be understood they have to be read
A hypertext manual
An electronic edition of a play with markup and links to images of the original manuscript
A corpus of texts collected for linguistic or lexicographical study like a collection of exemplary texts used in the creation of a dictionary
A business document like a report, proposal, or contract
An interactive CD-ROM dictionary or encyclopedia
An interactive text adventure game where you read passages and make decisions
A collection of legal documents accessible through a retrieval system
A transcript of a series of interviews with embedded interpretative information
A transcript of a court case or administrative proceeding
Electronic texts come in four major forms:
What can we do with electronic texts?
We can use computers to present, manage, and learn
from electronic texts in ways difficult to do
by hand. We can archive large quantities of text
and make reliable copies of these archives. We
can quickly retrieve passages from a large text
database of millions of pages. We can ask where
two or more words occur within the same paragraph.
We can link automatically to other information
from a hypertext. We can quantify writing style
or try to identify the author of a disputed work
by his or her style. We can compare written works
or study the evolution of language usage over
a collection of texts. In general the process
of computer assisted text-analysis uses computers
to search, retrieve, manipulate, measure and classify
natural-language documents for patterns and by
author, subject, and genre or type. Here is a partial
list of some of the activities researchers do
with electronic texts.
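One of the operations described above, asking where two or more words occur within the same paragraph, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `paragraphs_with_all` is hypothetical, paragraphs are assumed to be separated by blank lines, and words are matched as whole lowercase tokens. It is not how any particular text-analysis tool is implemented.

```python
import re

def paragraphs_with_all(text, words):
    """Return (index, paragraph) pairs for paragraphs containing every word in `words`."""
    wanted = {w.lower() for w in words}
    hits = []
    # Assumption: paragraphs are separated by one or more blank lines.
    for i, para in enumerate(re.split(r"\n\s*\n", text.strip())):
        # Reduce the paragraph to its set of distinct lowercase word tokens.
        tokens = set(re.findall(r"[a-z']+", para.lower()))
        if wanted <= tokens:  # every wanted word is present
            hits.append((i, para))
    return hits

doc = """The concordance lists every word in the text.

A scholar can search the text for a word and its context.

Archives preserve the electronic record."""

for index, para in paragraphs_with_all(doc, ["text", "word"]):
    print(index, para)
```

A real retrieval system would build an index ahead of time rather than scan every paragraph on each query, which is what makes searches over millions of pages fast.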
A brief history of electronic texts and text-analysis tools
A good way to understand text analysis is to look
at the tradition of concordancing from which it
evolved. A concordance is a standard study tool
where one can look up a word and find references
to all the passages in the target work where that
word occurs. Concordances are alphabetically sorted lists
of the vocabulary of a text (its different words
or phrases). Occurrences of each word (the keyword)
appear under a headword, each one surrounded by
enough context to make out the meaning, and each
one identified by a citation to the text that
gives its location in the original.
The first text-analysis tools were designed to create paper concordances. In the late 1940s, Father Roberto Busa was one of the first to use computers in the production of concordances with his Index Thomisticus, a project that began by using index cards, moved on to analogue information technology in the 1950s, and then migrated to the computer. The results were finally published in the 1970s, and a CD was released in 1992.
The concordance, however, goes back to the 13th century. Hugh of St. Cher is credited with directing the production of a concordance to the Vulgate Bible by his fellow Dominicans in Paris. This concordance, supposedly finished by 1247, was limited in that it gave only references, without quotations to provide a sense of context. Quotations were apparently added later by English Dominicans to a concordance that has not survived. Finally, a concordance attributed to Conrad of Halberstadt improved on the model, leaving us by the end of the 13th century with a concordance that provided some context along with references.
I.1/577.1 | Four nights will quickly dream away the time; | And
I.1/578.2 Swift as a shadow, short as any dream; | Brief as the
II.2/585.1 | Ay me, for pity! what a dream was here! | Lysander,
III.2/591.1 this derision | Shall seem a dream and fruitless vision, |
IV.1/593.1 as the fierce vexation of a dream.| But first I will
IV.1/594.2 to me | That yet we sleep, we dream. Do not you think | The
IV.1/594.2 rare | vision. I have had a dream, past the wit of man to
IV.1/594.2 the wit of man to | say what dream it was: man is but an
IV.1/594.2 he go | about to expound this dream. Methought I was--there
IV.1/594.2 his heart to report, what my dream | was. I will get Peter
IV.1/594.2 to write a ballad of | this dream: it shall be called
IV.1/594.2 it shall be called Bottom's dream, | because it hath no
V.1/599.1 | Following darkness like a dream, | Now are frolic: not a
V.1/599.2 theme, | No more yielding but a dream, | Gentles, do not
Example of a Key Word In Context display from an interactive concordance of Shakespeare's A Midsummer Night's Dream
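A Key Word In Context display of this kind can be approximated with a short Python sketch. The function name `kwic` and the `width` parameter are illustrative assumptions, not taken from any of the tools discussed here, and the sketch leaves out the citation column (act, scene, and page reference), which would require a text marked up with location information.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per occurrence of `keyword`, with up to `width`
    characters of context on each side and the keyword in a fixed column."""
    # Collapse whitespace so line breaks in the source do not split the context.
    flat = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so every keyword starts at the same column.
        lines.append(f"{left:>{width}} {m.group(0)} {right}")
    return lines

sample = (
    "Four nights will quickly dream away the time. "
    "Swift as a shadow, short as any dream. "
    "I have had a dream, past the wit of man to say what dream it was."
)

for line in kwic(sample, "dream", width=25):
    print(line)
```

Aligning the keyword in one column is the design choice that makes such a display readable at a glance: the eye can run down the keyword and compare the contexts on either side.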
To return to text-analysis tools, the first generation of widely available tools consisted of batch tools that were not interactive but were designed to produce paper concordances. This can be seen in the names and operations of many of these tools. COCOA stands for COunt and COncordance generation on the Atlas. The Oxford University Computing Service took over COCOA in 1978 and produced OCP (the Oxford Concordance Program).
With the availability and increasing power of microcomputers in the 1980s, text-analysis tools migrated from mainframes to personal computers. OCP developed into Micro-OCP, and new programs came out for the personal computer, such as the Brigham Young Concordance program (BYC), later renamed and commercialized as WordCruncher, and the TACT (Text-Analysis and Concordance Tools) environment developed at the University of Toronto. This shift to the microcomputer changed the nature of our use of the tools in two ways. Scholars could now use the tools whenever they wanted on a personal computer instead of having to wait for mainframe time or time on a terminal. This meant that textual scholars were no longer dependent on the paper concordance, but could use the electronic tools in their own place of study. This change in the location of computer-assisted text analysis, along with developments in interface technology, led developers away from a batch concording model towards interactive models that assumed the scholar would have access to the tools and a collection of e-texts for personal study. It is this access to electronic research texts that we need to ensure through the preservation of research text data.
One of the best-known examples of this shift from batch to interactive concording is TACT, which was developed at the University of Toronto and is still in wide use today. TACT is not meant for producing a printed concordance but for exploring the electronic text interactively through queries and windowed displays. TACT did not just automate the job of the concordancer; it changed the perspective of the user of the concordance. It offers advanced features suited to careful text analysis that are not found in text retrieval systems. The WWW-accessible version of TACT, called TACTweb, has further made it possible for researchers to share their textual research data over the Internet.
As interactive concording tools became accessible, researchers began to ask more complex questions of text databases. Rather than simply asking for a list of locations of a word in the text, researchers began to ask for patterns of words, parts of words, linguistic features, and punctuation. They also began to add statistical tools that could count and compare features, and we now have visualization tools that display graphs showing usage across large quantities of text. Many of these techniques are also used in business document management systems and in basic Internet tools like the search engines we depend on to find WWW pages.
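To make the idea of counting and comparing features concrete, here is a minimal Python sketch that measures how often tokens matching a pattern (here, a word stem) occur per 1,000 tokens in each of several texts. The names `tokenize` and `relative_frequency`, the simple tokenization rule, and the two sample texts are assumptions made for illustration; real tools use more careful tokenization and richer statistics.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequency(text, pattern):
    """Occurrences per 1,000 tokens of tokens that fully match the regex `pattern`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    rx = re.compile(pattern)
    matches = sum(1 for t in tokens if rx.fullmatch(t))
    return 1000.0 * matches / len(tokens)

# Hypothetical sample texts, used only to exercise the function.
texts = {
    "A": "We sleep, we dream. I have had a dream. The dreamer dreams.",
    "B": "The court heard the case and the case was closed.",
}

# Compare how often any word beginning with "dream" occurs in each text.
for name, body in texts.items():
    print(name, round(relative_frequency(body, r"dream\w*"), 1))
```

Normalizing by text length is what allows texts of different sizes to be compared; raw counts alone would favour the longer text.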
Why is it important to preserve electronic texts for research?
The primary means by which scholarship in the
humanities and certain social sciences is transmitted,
studied and stored for future use is in the form
of written works like books, journals and manuscripts.
Philosophers, historians, literary critics, art
historians, political scientists and others in
the social sciences and humanities use primary
sources that are texts and produce new research
in the form of written works. An increasing number
of these texts are generated on a computer and
are therefore originally in electronic form. Further,
a significant number of Canadian scholars have
created text research resources that can only
be studied in electronic form with the appropriate
tools. There is now a critical mass of research
resources available as electronic texts, and in
some cases, only as electronic texts. It is safe
to say that a significant number of researchers
now need access to well-maintained electronic
text services in order to conduct research and
the majority will in the near future as computing
methods and text services spread through the disciplines.
Here are some specific reasons for preserving electronic text:
Appropriate investment in the preservation of electronic texts created by or for researchers will not only keep valuable research resources accessible. It will also provide a model for other domains that have significant investments in electronic texts from the health sciences to the insurance sector. We need to begin to solve the problem of how to archive our exploding electronic record before it overwhelms us. The pursuit of efficient models for the research community will benefit Canada.