© 2004-2008 TAPoR @ UAlberta


What is Text Analysis?

Electronic Texts and Text Analysis
by Geoffrey Rockwell and Ian Lancashire

The written word is one of the most important ways we communicate and preserve information. Whether it takes the form of legal records, novels, historical records, medical case studies, or now WWW pages, written text is an important form of data to preserve. It is one of the primary means by which we communicate in industry, in academia, or for pleasure. As an increasing number of the texts that we care about are created and accessed in electronic form, Canada needs a well-thought-out strategy for preserving those electronic texts of use to future generations. Our future understanding of the past, and of this age of technological change, will be incomplete if we do not take steps to preserve one of the most widely used forms of electronic information: the electronic text.

What is an Electronic Text?
Electronic texts digitally represent oral or written language in a form suitable for analysis with a computer. Typically an electronic text is an electronic version of a written work, an electronic version of a transcript of an oral event, or a document composed on the computer. In any case, the information in an electronic text is meant to be in a natural language that can be read by humans when displayed properly. Some examples of electronic texts would be:

An e-mail message
A medical case study that is stored on a computer
WWW pages that contain a significant amount of text so that to be understood they have to be read
A hypertext manual
An electronic edition of a play with markup and links to images of the original manuscript
A corpus of texts collected for linguistic or lexicographical study like a collection of exemplary texts used in the creation of a dictionary
A business document like a report, proposal, or contract
An interactive CD-ROM dictionary or encyclopedia
An interactive text adventure game where you read passages and make decisions
A collection of legal documents accessible through a retrieval system
A transcript of a series of interviews with embedded interpretative information
A transcript of a court case or administrative proceeding

Electronic texts come in four major forms:

    1. A copy of a work that was originally on paper - a digital representation of a literary, dramatic, or other type of written work that was originally in analogue form.
    2. A work composed on the computer and stored in that form, but intended to be printed, like a word-processing file or PDF (Portable Document Format) file.
    3. A work composed on a computer that is meant to be accessed on a computer, like a WWW page, electronic text database, or hypertext.
    4. A transcript of a conversation or other oral event

What can we do with electronic texts?
We can use computers to present, manage, and learn from electronic texts in ways that are difficult to do by hand. We can archive large quantities of text and make reliable copies of these archives. We can quickly retrieve passages from a large text database of millions of pages. We can ask where two or more words occur within the same paragraph. We can link automatically to other information from a hypertext. We can quantify writing style or try to identify the author of a disputed work by his or her style. We can compare written works or study the evolution of language usage over a collection of texts. In general, computer-assisted text analysis uses computers to search, retrieve, manipulate, measure, and classify natural-language documents, whether for patterns or by author, subject, and genre or type. Here is a partial list of some of the activities researchers do with electronic texts.
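One such activity, asking where two or more words occur within the same paragraph, can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular text-analysis tool works: the function names `tokenize` and `paragraphs_with_all` are hypothetical, the sample text is made up, and paragraphs are assumed to be separated by blank lines.

```python
import re


def tokenize(text):
    """Lowercase a string and return its word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def paragraphs_with_all(text, words):
    """Return (index, paragraph) pairs for paragraphs containing every word.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    wanted = {w.lower() for w in words}
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    hits = []
    for i, para in enumerate(paragraphs):
        # A paragraph qualifies when the wanted words are a subset of its tokens.
        if wanted <= set(tokenize(para)):
            hits.append((i, para))
    return hits


# Made-up sample text, for illustration only.
sample = """The concordance lists every word in the text.

A scholar may search the text for a word and its context.

Archives preserve records for later study."""

matches = paragraphs_with_all(sample, ["word", "text"])
for index, paragraph in matches:
    print(index, paragraph)
```

Matching on whole tokens rather than substrings keeps a query for "text" from also matching "context"; a real tool would likely add options for stemming or proximity measured in words.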

A brief history of electronic texts and text-analysis tools.
A good way to understand text analysis is to look at the tradition of concordancing from which it evolved. A concordance is a standard study tool in which one can look up a word and find references to all the passages in the target work where that word occurs. Concordances are alphabetically sorted lists of the vocabulary of a text (its different words or phrases). Occurrences of each word (the keyword) appear under a headword, each one surrounded by enough context to make out the meaning, and each one identified by a citation that gives its location in the original text.

The first text-analysis tools were designed to create paper concordances. Father Roberto Busa, in the late 1940s, was one of the first to use computers in the production of concordances with his Index Thomisticus, a project that began by using index cards, moved on to analogue information technology in the 1950s, and then migrated to the computer. The results were finally published in the 1970s and a CD was released in 1992.

The concordance, however, goes back to the 13th century. Hugh of St. Cher is credited with directing the production of a concordance to the Vulgate Bible by brother Dominican monks in Paris. This concordance, supposedly finished by 1247, had the drawback that it gave only references, with no quotations to provide a sense of context. Quotations were apparently added later by English Dominicans in a concordance that has not survived. Finally, a concordance attributed to Conrad of Halberstadt improved on the model, leaving us by the end of the 13th century with a concordance that provided some context along with references.

	I.1/577.1       | Four nights will quickly dream away the time; | And
	I.1/578.2  Swift as a shadow, short as any dream; | Brief as the
	II.2/585.1       | Ay me, for pity! what a dream was here! | Lysander,
	III.2/591.1   this derision | Shall seem a dream and fruitless vision, |
	IV.1/593.1     as the fierce vexation of a dream.| But first I will
	IV.1/594.2   to me | That yet we sleep, we dream. Do not you think | The
	IV.1/594.2     rare | vision. I have had a dream, past the wit of man to
	IV.1/594.2    the wit of man to | say what dream it was: man is but an
	IV.1/594.2   he go | about to expound this dream. Methought I was--there
	IV.1/594.2    his heart to report, what my dream  | was. I will get Peter
	IV.1/594.2     to write a ballad of | this dream: it shall be called
	IV.1/594.2     it shall be called Bottom's dream, | because it hath no
	V.1/599.1      | Following darkness like a dream, | Now are frolic: not a
	V.1/599.2  theme, | No more yielding but a dream, | Gentles, do not

Example of a Key Word In Context display from an interactive concordance of Shakespeare's A Midsummer Night's Dream
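A display of this kind can be approximated with a short Python sketch. It is a minimal illustration under stated assumptions, not the method used by the concordance shown above: the function name `kwic` and the `width` parameter are hypothetical, the passage is just three lines taken from the display, and the citation column (act, scene, and page) is omitted because the sketch has no location data to draw on.

```python
import re


def kwic(text, keyword, width=30):
    """Return keyword-in-context lines with the keyword aligned in one column.

    `width` is the number of characters of context kept on each side.
    """
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so every keyword starts at column `width`.
        lines.append(f"{left:>{width}}{m.group(0)}{right}")
    return lines


# Three lines quoted from the display above, used here as a tiny sample text.
passage = """Four nights will quickly dream away the time;
Swift as a shadow, short as any dream;
Ay me, for pity! what a dream was here!"""

for line in kwic(passage, "dream"):
    print(line)
```

Aligning the keyword in a fixed column is what lets the eye scan down the occurrences; a fuller version would carry a location reference alongside each line, as the printed concordances did.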

To return to text-analysis tools, the first generation of widely available tools consisted of batch tools that were not interactive, but were designed to produce paper concordances. This can be seen in the names and operations of many of these tools. COCOA stands for COunt and COncordance generation on the Atlas. The Oxford University Computing Service took over COCOA in 1978 and produced OCP (the Oxford Concordance Program).

With the availability and increasing power of microcomputers in the 1980s, text-analysis tools migrated from mainframes to personal computers. OCP developed into Micro-OCP, and new programs came out for the personal computer, such as the Brigham Young Concordance program (BYC), later renamed and commercialized under the name WordCruncher, and the TACT (Text-Analysis and Concordance Tools) environment developed at the University of Toronto. This shift to the microcomputer changed the nature of our use of the tools in two ways. First, scholars could now use tools whenever they wanted on a personal computer instead of having to wait for mainframe time or time on a terminal. This meant that textual scholars were no longer dependent on the paper concordance, but could use the electronic tools in their own place of study. Second, this change in the location of computer-assisted text analysis, along with developments in interface technology, led developers away from a batch concording model towards interactive models that assumed that the scholar would have access to the tools and a collection of e-texts for personal study. It is this access to research electronic texts that we need to ensure through the preservation of research text data.

One of the best-known examples of this shift from batch to interactive concording is TACT, which was developed at the University of Toronto and is still in wide use today. TACT is not meant for producing a printed concordance but for exploring the electronic text interactively through queries and windowed displays. TACT did not just automate the job of the concordancer, but changed the perspective of the user of the concordance. It offers advanced features suitable for careful text analysis that are not found in text retrieval systems. The WWW-accessible version of TACT, called TACTweb, has further made it possible for researchers to share their research textual data over the Internet.

As interactive concording tools became accessible, researchers began to ask more complex questions of text databases. Rather than simply asking for a list of locations of a word in the text, researchers began to ask for patterns of words, parts of words, linguistic features, and punctuation. Researchers began to add statistical tools that could count and compare features, and now we have visualization tools that display graphs showing usage over large quantities of text. Many of these techniques are being used in business document management systems and in basic Internet tools like the search engines we depend on to find WWW pages.
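The step from looking up a single word to asking for patterns and counts can be illustrated with a small Python sketch. It is an assumption-laden example rather than a description of any tool named here: the function names `word_frequencies` and `match_pattern` are hypothetical, the sample sentence is partly made up, and "pattern" is taken to mean a regular expression applied to each distinct word.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercased word form occurs in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def match_pattern(freqs, pattern):
    """Return {word: count} for vocabulary words that fully match a regex."""
    rx = re.compile(pattern)
    return {word: n for word, n in freqs.items() if rx.fullmatch(word)}


# Sample sentence for illustration; only the opening words echo the play.
sample = "We sleep, we dream. I have had a dream, and dreaming I dreamt of dreams."

freqs = word_frequencies(sample)

# All word forms beginning with "dream", with their counts.
dream_forms = match_pattern(freqs, r"dream.*")
for word, count in sorted(dream_forms.items()):
    print(word, count)
```

Counting first and matching against the vocabulary afterwards means the pattern is tested once per distinct word form rather than once per occurrence, which matters as collections grow.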

Why is it important to preserve electronic texts for research?
The primary means by which scholarship in the humanities and certain social sciences is transmitted, studied, and stored for future use is written works such as books, journals, and manuscripts. Philosophers, historians, literary critics, art historians, political scientists, and others in the Social Sciences and Humanities use primary sources that are texts and produce new research in the form of written works. An increasing number of these texts are generated on a computer and are therefore originally in electronic form. Further, a significant number of Canadian scholars have created text research resources that can only be studied in electronic form with the appropriate tools. There is now a critical mass of research resources available as electronic texts, and in some cases, only as electronic texts. It is safe to say that a significant number of researchers now need access to well-maintained electronic text services in order to conduct research, and the majority will in the near future as computing methods and text services spread through the disciplines.

Here are some specific reasons for preserving electronic text:

  1. It is already the case that certain bibliographic databases (those that include a substantial amount of text information) are being used primarily in electronic form. Few scholars use the MLA Index or the Philosopher's Index on paper. Bibliographic research tools such as Iter, which specializes in references having to do with the Middle Ages and Renaissance, are being developed in Canada. Such rich bibliographic resources need to be preserved to provide future scholars access to research literature in their fields.
  2. Major dictionary projects based in Canada, like the Dictionary of Old English (DOE), are producing electronic text databases of language usage for researchers. The DOE has a full-text database of all the significant works written in Old English. There is no organization at present in Canada that can provide access to, let alone preserve, this important resource. The project had to look abroad for electronic text delivery support. Further, the dictionary itself, while it will be published on paper, will also exist in electronic form with additional features that will not be available unless preserved in electronic form. The Comparative Lexicography Project of French and English in Canada is likewise creating a database of Canadian texts in English and French and a bilingual Canadian dictionary. The research text resources of projects like these are crucial to preserve for the ongoing study of our languages. We will not know why we speak and write as we do without such systematic resources.
  3. As mentioned above, one of the better-known text-analysis tools, TACT, was developed at the University of Toronto. Text and document tools are not only being created at Canadian universities; an important commercial document management environment, Livelink by OpenText, benefited from research into text retrieval at the University of Waterloo. Another company, SoftQuad, produces XMetaL, one of the best SGML/XML editors. Canadian industry and researchers have been at the forefront of the development of tools for the study and retrieval of electronic texts, whether for business or academic use. Both development communities benefit from the preservation and dissemination of a variety of electronic texts. For researchers and developers alike, these tools are only as useful as the e-texts available to study with them. An appropriate text service would provide the raw materials for software projects that have had a demonstrated benefit to the Canadian economy.
  4. Canadian researchers are involved in the creation of research-quality electronic editions of important works of literature. The Internet Shakespeare Editions, for example, is a project out of the University of Victoria that is creating electronic versions of Shakespeare's works following best practices in the field. The Trésor de la langue française au Québec (TLFQ) at the Université Laval has assembled a corpus of electronic editions of literary texts of importance to the study of French in Québec. These research electronic texts are important not just to Canadian researchers, but also to researchers around the world. TLFQ is typical of the high level of electronic scholarship by Canadian scholars that we need to preserve and make available outside of Canada in order to increase international understanding of our cultures.
  5. Canadian researchers are also developing new forms of electronic research texts that can only be accessed on a computer. The Lyrical Ballads Bicentenary Project is making this work by Wordsworth and Coleridge available in an electronic form that allows for the comparison of versions. This resource can only exist in electronic form. The Performance in Victorian Hamilton project is creating a text database of all records of musical and theatrical performances in Hamilton in order to study entertainment before cinema. Resources such as these have no analogue on paper and represent new forms of research that can only be preserved in electronic form.
  6. There is a growing number of courses and programmes in the Social Sciences and the Humanities that make significant use of electronic texts to train undergraduate and graduate students in the application of computing to research. McMaster University has started an undergraduate Multimedia programme that includes courses on electronic texts and their study. The University of Alberta has announced an M.A. in Humanities Computing that includes core courses in electronic texts. We need access to a well-maintained Canadian electronic text service to train future students in electronic literacy. The ability to manage, study, and retrieve textual information in electronic form is an important literacy skill in a world where an increasing amount of information is only available electronically.
  7. Finally, Canadian researchers need a service where they can deposit electronic texts, comparable to the national efforts of countries like the United Kingdom, which has set up an Arts and Humanities Data Service. What is at stake is the preservation of our textual heritage. The electronic texts of today will provide rich resources for future study, including works of interest to those who want to study our age and this transition from analogue to digital scholarship.

Appropriate investment in the preservation of electronic texts created by or for researchers will not only keep valuable research resources accessible. It will also provide a model for other domains that have significant investments in electronic texts from the health sciences to the insurance sector. We need to begin to solve the problem of how to archive our exploding electronic record before it overwhelms us. The pursuit of efficient models for the research community will benefit Canada.