RELI/ENGL 39, Fall 2015, University of the Pacific

Category: Coptic

My digital future

This fall, as I have been trying to finish up my book project, Monks and Their Children, I have been asked more than once:  What’s your next project?  When I start describing it, I frequently get the reply:  no, I mean your real project, your next book.  My internal response was always twofold:  the snarky, “What, bringing the study of an entire language into the 21st century is not enough?” and the desperate, “I am not sure I have another monograph in me.”  And as the fall wore on, and 2014 became 2015, I became more and more convinced of the authenticity of those sentiments:  that digital scholarship in early Christian studies and late antiquity is still not regarded as being as legitimate as print monographs and articles, and that indeed I had no interest in writing another monograph.  It’s not that I thought I couldn’t write another book, but that I just had no desire to spend another decade on a long-form argument.  I was more interested in digital writing and digital scholarship that could be read or used by a community more quickly, and in tighter, more focused arguments in essay form.

I also began chafing more and more at the conservatism of the field.  The definitions of “real” scholarship, the structural sexism that colleagues like Ellen Muehlberger and Kelly Baker were documenting in academia, and the perception of Egypt and Coptic as marginal areas of study.  That conservatism stoked my rebellious fires further; I was not going to force myself to come up with a book project just because that was what one “did” as an active scholar.

And then I saw the CFP for the Debates in the Digital Humanities Series.  It’s a call for essays, not monographs, but like Augustine hearing the child chant, “Tolle lege,” I had an epiphany:  I damn well had a third book in me. I just hadn’t put the pieces together.

In fact, I have two projects in mind:  both are examinations of the field of early Christianity as it intersects (or does not) with Digital Humanities.  Both are political and historiographical.

The book (as yet untitled) is about early Christian studies (especially Coptic and other “Eastern” traditions and manuscript collections), cultural heritage, and digitization.  Planned chapters are:

  1. Digitizing the Dead and Dismembered.  About the material legacy of the colonial dismemberment of archives, the limitations of existing DH standards and technologies (e.g., the TEI, the Unicode character set, etc.) to account for these archives, and how these standards, technologies, and practices must transform.  The Coptic language and the White Monastery/Monastery of Shenoute manuscript repository will be the primary source examples, but there should be other examples from Syriac and Arabic.
  2. Can the Colonial Archive Speak? Orientalist Nostalgia, Technological Utopianism, and the Limits of the Digital.  This chapter will look at the practice of constructing digital editions and digital libraries and (building on the issues discussed in the previous chapter) explore the premise that digitization can “recover” an original dismembered archive such as the White Monastery’s repository.  To what extent can digitization recover and reconstruct lost libraries?  What are the political and ethical obligations of Western libraries to digitize manuscripts from Egypt and the wider Middle East?  Does digitization transcend or reify colonial archaeological and archival practices?  This chapter focuses on the concepts of the archive and library and voice.  [HT to Andrew Jacobs for inspiring the chapter title.]
  3. Ownership, Open Access, and Orientalism.  About the benefits, consequences, and dangers of the open access paradigm for digitizing eastern Christian manuscript collections.  Will look at the history of theft of physical text objects from monasteries by Western scholars and will ask whether open access digitization is cultural repatriation or digital colonization.  Will look at a number of complexities:  a) the layers and levels of digitization (metadata, text, images); b) the spectrum of openness and privacy possible; and c) the different constituencies involved in asking the question:  whose heritage is this?  who owns/owned the text?  Church, local monastery, “the world” (as world heritage), American/European scholars who have privileged access to some of these texts already in their libraries or on their computers?  Will explicitly draw on insights from indigenous cultural heritage studies related to digitization and digital repatriation.
  4. Transparency and Overexposure:  Digital Media and Online Scholarship in Debates about Artifact Provenance.  This chapter will examine the extent to which blogs and social media have changed the conversation about the provenance of text-bearing objects we study, and the ethical responsibilities of researchers.  Will also look at the risks of online debates, and suggest ways to have constructive conversations moving forward.  With special attention to the intersections of status (who’s online and who’s not?) and gender.
  5. The Digital Humanities as Cultural Capital: Implications for Biblical and Religious Studies.  Why our field needs to stop treating digital scholarship as derivative or less rigorous, the implications for us being so conservative about digital scholarship as a field, and how Biblical and Religious Studies can contribute to DH as a discipline (not just in content but in concept, in theory, in its very understanding of itself as a discipline or field, in other words, why DH needs Biblical and Religious studies).
  6. Desirable but maybe a stretch:  War and the Western Savior Complex.  Looks at the rhetoric of crisis and loss (especially in the context of the early 21st c. wars and revolutions in the Middle East) around saving texts, artifacts, and traditions.  What does it mean for scholars from Europe and America who are not the policy makers in their countries but are nonetheless citizens of them to be making pleas for the preservation of antiquities and/or cultural traditions that are endangered in part because of the actions of our governments?  (There is also a conflation of ancient traditions and modern Eastern Christian peoples in scholarship and the media; see Johnson’s JAAR article “‘He Made the Dry Bones Live.’”)

The other project will be digital historiography:  using digital and computational methods to crunch Journal of Early Christian Studies (and hopefully its precursor the Second Century?) to look at trends in the field, especially with respect to gender.  Who is publishing, what are we publishing on?  Who is citing whom?  Who is reviewing whom?  How has that changed (or not) over the decades?  This may be one or two essays, not a book.  And it is inspired in part by Ellen Muehlberger’s work micro-blogging statistics on gender in biblical studies book reviews.  I’m taking the Topic Modeling course at DHSI this summer and will think more how that or other methods (concordance text analysis, network analysis, etc.) will support this project.

I hope to publish all of this in digital form, including the monograph on cultural heritage and cultural capital.

So that’s my digital future.  Of course, first I need to get a couple of other things out the door.  And of course Coptic Scriptorium continues.  But when you ask me what my next book is about, there you go.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Digitizing the Dead and Dismembered: Presentation at DH2014 in Lausanne

We’ll be presenting at DH2014 in Lausanne, Switzerland, on July 10, 2014.

“Digitizing the Dead and Dismembered: DH Technologies for the Study of Coptic Texts”

The abstract is below (and at the DH2014 site).

Here are the slides:

You can also download them as a pdf.

Printed abstract:

Digitizing the Dead and Dismembered: DH Technologies for the Study of Coptic Texts

  • Schroeder, Caroline T.
    University of the Pacific
  • Zeldes, Amir
    Humboldt University

This paper will explain the unique challenges to processing and annotating the Coptic language for digital research and present solutions and methodologies for digital linguistic and historical scholarship.

The Coptic language evolved from the language of the hieroglyphs of the pharaonic era and represents the last phase of the Egyptian language.  It is pivotal for a wide range of humanistic disciplines, such as linguistics, biblical studies, the history of Christianity, Egyptology, and ancient history.  Whereas languages like Classical Greek and Latin have enjoyed advances made in digital humanities with fully-fledged online research environments accessible to students and scholars (such as the Perseus Digital Library), until recently no computational tools for Coptic have existed.  Nor has an open digital research corpus been available.  The research team behind Coptic SCRIPTORIUM (Sahidic Corpus Research: Internet Platform for Interdisciplinary multilayer Methods) is developing and providing open-source technologies and methodologies for interdisciplinary research in the Coptic language.  This paper will address the automated tools we are developing for annotating and conducting research on a Coptic digital corpus.

Conducting digitally-assisted and computational research in Coptic using available DH resources is complex for several reasons.  Most texts are preserved from damaged, incomplete, and dismembered manuscripts or papyri.  The DH project has begun to create an online open-access resource for the study of Greek papyri and is beginning to digitize Coptic papyri and ostraca (ancient pot-shards with writing).  These texts, however, are primarily documentary, consisting of wills, contracts, personal letters, etc.  Coptic literary and monastic texts, the core of Coptic SCRIPTORIUM, are essential for the study of the Bible, intellectual history, literary history, and religious history.  The manuscripts containing these texts were removed from Egypt in the seventeenth through nineteenth centuries piece by piece (sometimes page by page).  Some have been published, many have not, and very few have been digitized in a format suitable for digital and computational work.  Texts must be reconstructed from pieces of manuscripts published in fragments and/or stored in various libraries and museums worldwide.  The status of Coptic literary and monastic texts complicates metadata management and corpus architecture:  what constitutes a “work”?  Is it the codex in which a copy of the text appeared (and which may be dispersed across multiple physical repositories), the manuscript fragment housed in a particular library or museum repository, or the work itself, which might survive only in fragments of multiple codices (all copies of a “book” from the monastery’s library), and thus in fragments not only from more than one codex but also from more than one modern repository?

Coptic scholarship still lacks many standards for digital publication and language research that are taken for granted in Greek and Latin. As with other ancient languages, Coptic manuscripts are written without spaces.  However, in contrast to its ancient counterparts, scholarly conventions on word division differ substantially from scholar to scholar.  Additionally, since Coptic is an agglutinative language, the relevant unit for linguistic analysis is the morpheme, below the ‘word’ level.  This means that segmentation guidelines must be developed for both levels of resolution. In order to search multiple texts, guidelines and tools for normalization, part-of-speech tagging and lemmatization of Coptic must be developed.  These tools need to take into account Coptic’s agglutinative nature, e.g. normalizing and annotating on the morpheme and word levels.

Finally, the development of the Coptic language during Egypt’s Greco-Roman era raises questions about the origins of the language, its usage in a multilingual context, and the language practices of its ancient speakers and writers.  Coptic consists of Egyptian grammar, vocabulary, and syntax written primarily in the Greek alphabet; some Egyptian letters were retained, and some Greek and Latin vocabulary was incorporated into the language.  The richness of the vocabulary’s languages of origin varies from author to author, genre to genre.  And despite recent publications on the topic, much research remains to be conducted on the extent and nature of multilingualism in late antique Egypt, especially during the fourth and fifth centuries.   Additionally, due to the agglutinative nature of the language, one word can be comprised of morphemes with different languages of origin.

This paper will focus on the automated tools our project is developing to process the language, especially tokenizing and part-of-speech annotations.  Coptic SCRIPTORIUM has developed the first tokenizer and part-of-speech tagger for the language, and in fact for any language in the Egyptian language family.  The presentation will address the unique challenges to processing and annotating the Coptic language.  We will present our current technical solutions, their accuracy rates, and the potential for future research.  We will also address the ways in which this language’s and corpus’s unique features differentiate them from other more widely studied ancient languages, such as Greek and Latin.  Examples will be drawn from the open-access corpora we are developing and annotating with these tools, available at (backup site  The Coptic corpora processed and annotated with these tools can be searched and visualized in ANNIS, a tool for multi-layer annotated corpora.  We anticipate this presentation to be of interest to scholars in digital humanities working with ancient languages and manuscript corpora as well as DH linguists and corpus linguists.


  1. Bentley Layton, A Coptic Grammar, 3rd ed., rev., Porta Linguarum Orientalium Neue Serie 20 (Wiesbaden: Harrassowitz, 2011), 19–20.
  2. Layton, Coptic Grammar, 5.
  3. J. N. Adams, Mark Janse, and Simon Swain, Bilingualism in Ancient Society (Oxford: Oxford University Press, 2002); Arietta Papaconstantinou, ed., The Multilingual Experience in Egypt from the Ptolemies to the Abbasids (Burlington: Ashgate, 2010).


Tagging Shenoute

By Caroline T. Schroeder & Amir Zeldes
Schroeder presented this paper at the annual meeting of the North American Patristics Society in Chicago, Illinois, on May 24, 2014.  This post is a very minimally edited version of the paper prepared for and delivered at the conference.
Creative Commons License
Tagging Shenoute by Caroline T. Schroeder & Amir Zeldes is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Many thanks to the people and institutions who made this possible: my collaborator Dr. Amir Zeldes; Tito Orlandi, Stephen Emmel, Janet Timbie, and Rebecca Krawiec who have freely given their labor and their advice; and funding agencies of the National Endowment for the Humanities and the German Federal Ministry of Education and Research. I’m also pleased that this year Rebecca Krawiec along with Christine Luckritz Marquis and Elizabeth Platte will be helping to expand our corpus of digitized texts.

Two years ago, I gave a NAPS paper entitled, “Shenoute of Atripe on the Digital Frontier,” in which I explored – and despaired of – challenges to digital scholarship in early Christian studies, especially Coptic. I posed important questions, such as, “Why don’t my web pages make the supralinear strokes in Coptic appear properly?” and “Why do I only have 50 followers on Twitter?”

I am pleased to report that I have in fact solved both of those problems. And today I want to take you on a tour into the weeds of digital Coptic: how to create a data model for computational research in Coptic; the requirements for visualizing and searching the data; and what you can do with this all once you’ve got it.

Today I’m going to get technical, because over these past two years I’ve come to learn two things:

1. Digital scholarship is about community—the community that creates the data, contributes to the development of standards for the creation of that data, and conducts research using the data. In other words, my work won’t succeed if I don’t drag all of you along with me.

2. The truth is not in there (i.e., you might be thinking, “What is she doing talking about ‘data’ – she did her Ph.D. at Duke in the 1990s?!”). In case you came to this paper wondering if I’ve abandoned my Duke Ph.D. post-modern, Foucauldian, patented Liz Clark student cred for some kind of positivist, quantitative stealth takeover of the humanities, well HAVE NO FEAR. The true truth is not in some essentialized compilation of “the data.” As with traditional scholarship, our research questions determine how we create our dataset, and how we curate and annotate it is already an act of interpretation.  (I owe Anke Lüdeling for helping me think through this issue.)

So please, take my hand and take the red pill, not the blue pill, and jump into the data.


Our project is called Coptic Scriptorium, and in a nutshell, it is designed as an interdisciplinary, digital research environment for the study of Coptic language and literature. We are creating technologies to process the language, a richly annotated database of texts formatted in part with these technologies, texts to read online or download, documentation, and ultimately a collaborative platform where scholars and students will be able to study, contribute, and annotate texts. It is open source and open access (mostly CC-BY, meaning that you can download, reuse, remix, edit, research, and publish the material freely as long as you credit the project).

We also invite any of you to collaborate with us. Consider this presentation an open invitation. Our test case was a letter of Shenoute entitled Abraham Our Father, and we’ve since expanded to include another unnamed text by Shenoute (known as Acephalous Work 22 or hereafter A22), some Sahidic Sayings of the Desert Fathers, two letters of Besa, and a few chapters of the Gospel of Mark.

I’ve entitled this paper, “Tagging Shenoute” for two reasons. First, “tagging” refers to the process of annotating a text. To conduct any kind of search or computational work on a corpus of documents, you need to mark them up with annotations, sometimes called “tags.” They might be as simple as tagging an entire document as being authored by Shenoute, or as complex as tagging every word for its part of speech (noun, verb, article, etc.) or its lemma (the dictionary headword for words that have multiple word forms). Second, the pun with the child’s game of tag was too rich to pass up. The Abba himself disdained children’s play and admonished the caretakers of children in his monastery not to goof around:

“As for some people who have children who were entrusted to their care, if it is of no concern to them that they live self-indulgently, joking with them, and sporting with them, they will be removed from this task. For they are not fit to be entrusted with children. It is in this way also with women who have girls given to them.”

Shenoute, Canons vol. 9, DF 186-87 in Leipoldt 4:105-6

And finally, I was inspired by a conversation between two senior Coptic linguists at the Rome 2012 Congress for the International Association of Coptic Studies. When I told them about our nascent project, one replied something along the lines of, “I would not dare to think that Shenoute would allow himself to be tagged!” And the riposte from the other: “And I would not presume to speak for Shenoute!” All of this is to subversively suggest that, despite Shenoute’s own words, he can be fun. While annotation is serious work, there is also an element of play: playing with the data, and pleasure in the text.

The premise of our project is to facilitate interdisciplinary research, to develop a digital environment that will be of use to philologists, historians, linguists, biblical scholars, even paleographers. To that end, we have dared to tag Shenoute in quite a variety of ways:

  • Metadata: information about text, author, dating, history of the manuscript, etc.
  • Manuscript or document structure: page breaks, column breaks, line breaks, damage to the manuscript, different ink colors used, text written as superscript or subscript, text written in a different hand….
  • Linguistic: part of speech (noun, verb, article, stative verb, relative converter, negative prefix, etc.), language of origin (Greek, Hebrew, Latin…), lemmas (dictionary headwords for words with multiple forms)
  • Translations

With hopefully more to come: biblical citations, citations and quotations of other authors, named entities with data linked to other open source projects on antiquity, source language for texts in Coptic translation (e.g., Apophthegmata Patrum and Bible).

We also must be cautious and discerning, on the lookout for the demon of all things shiny and new. As Hugh Cayless writes on the blog for the prosopographical project SNAPDRGN,

“In any digital project there is always a temptation to plan for and build things that you think you may need later, or that might be nice to have, or that might help address questions that you don’t want to answer now, but might in the future. This temptation is almost always to be fought against. This is hard.”

Hugh Cayless, “You Aren’t Gonna Need It” 22 May 2014

Digital scholarship in Coptic must develop annotation standards in conversation with existing conventions in traditional, print scholarship, as well as digital standards used by similar projects on the ancient world and ancient texts. For Shenoute, this means using as titles of texts the incipits delineated by Stephen Emmel in his book Shenoute’s Literary Corpus, manuscript sigla developed by Tito Orlandi and the Corpus dei Manoscritti Copti Letterari as well as id numbers for manuscripts established by the online portal Trismegistos, and part-of-speech tags based on Bentley Layton’s Coptic Grammar.

In the digital world, in addition to Trismegistos, the emerging standard for encoding manuscript information for ancient papyri, inscriptions, and manuscripts is the subset of the Text Encoding Initiative’s XML tagset known as EpiDoc. The Text Encoding Initiative is a global consortium of scholars who have established annotation standards (including a comprehensive set of tags) for marking up text for machine readability. XML stands for Extensible Markup Language, and is used more widely in computer science, including in commercial software. EpiDoc is a subset of TEI annotations used especially by people working in epigraphy or on ancient manuscripts in a variety of languages.  Patristics scholars might be familiar with it because the papyrological portal uses EpiDoc markup to annotate its digital corpus.

So, this is a lot of information – what does the data actually look like? Coptic poses some unique challenges.


To get from base text to annotated corpora, there are a lot of steps: basic digitization of the text, encoding the manuscript information, ensuring the Coptic word forms are rendered properly as bound groups, separating those bound groups into morphemes, normalizing the spelling so you can do genuine searching, and tagging for various annotations. I’m going to briefly go through most of these issues.

Before you can even begin to think about tagging, the data must be in a digital format that can be used and searched: typed in Unicode (UTF8) characters and in recognizable word forms. Many of us in this room probably have various files on our computers with text we keyed into Microsoft Word (or dare I say, WordPerfect?) in legacy fonts. We have developed converters ourselves for a couple of different legacy fonts. But keying in the text is only one piece of the puzzle; users of the data must be able to see the characters on their computers, ideally even if they don’t have a Coptic font or keyboard installed, or on their mobile devices.


So we created an embedded webfont that is installed on our website and inside our search and visualization tool. We’ve even embedded a little Coptic keyboard into the search tool, so that you can key in Coptic characters yourself if your device isn’t capable.

Those of you who have studied Coptic know that it is different from Greek or Latin, in that it is an agglutinative language.


Multiple different morphemes, each with different parts of speech, are plugged together like Legos to create “words,” or as Layton describes them, “bound groups.” When you search Coptic, you might not want to search bound groups but rather the individual morphemes within them. That means, when you digitize the text, you need to be attentive to the morphemes and word segmentation. This process of breaking a text into its constituent parts is called “tokenization”; the token is the smallest possible piece of data you annotate. In English texts, it’s often a word.

There are two problems with Coptic.

–      First, the concept of words is complex in Coptic

–      Second, annotations overlap parts of words. For example, in a manuscript a line might break in the middle of a word.

Here are some examples.  When we say Coptic is agglutinative, we mean that what we might think of as “words” are really bound groups of morphemes, as seen here in these two examples.

[Slide 14]

We’ve color-coded each separate morpheme, so that you can see that each one of these examples is a combination of seven or eight components.  To complicate matters, scholars use different conventions to bind these morphemes into words in print editions.  We follow Layton’s guidelines for visualizing or rendering Coptic bound groups.

But we need not only to see or visualize words as bound groups but also to automate the taking apart of these bound groups. Our tools cannot yet handle text as you might see it in a manuscript, with scriptio continua.

[Slide 15]

But we have automated segmenting bound groups into morphemes, thanks in part to a lexicon that Tito Orlandi graciously gave us, which sped up our work by about a year.
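To make the idea of lexicon-driven segmentation concrete, here is a minimal sketch. It is not the project's actual tokenizer: the tiny lexicon is a toy written in Latin transliteration (built around the bound group hitmpnoute discussed later in this paper), and the longest-match-with-backtracking strategy is simply one plausible approach, assumed for illustration.

```python
from functools import lru_cache

# Toy lexicon in Latin transliteration (an assumption for illustration);
# a real lexicon would hold thousands of Coptic forms.
TOY_LEXICON = frozenset({"hi", "hitm", "p", "noute"})


def segment(bound_group, lexicon=TOY_LEXICON):
    """Split a bound group into lexicon entries, or return None if impossible.

    Longer candidates are tried first; if a choice leads to a dead end,
    the search backtracks and tries a shorter candidate.
    """

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(bound_group):
            return ()  # reached the end: nothing left to segment
        for end in range(len(bound_group), start, -1):
            piece = bound_group[start:end]
            if piece in lexicon:
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        return None  # no lexicon entry fits at this position

    result = solve(0)
    return list(result) if result is not None else None


print(segment("hitmpnoute"))
```

With this toy lexicon the bound group comes apart as hitm + p + noute; a string with no valid analysis returns None, which is the point at which a human annotator would need to step in.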

But we need to dig deeper into our data than morphemes, because we might need to annotate on a level that’s even smaller than the morpheme. If you want to mark up the structure of the manuscript – the line breaks, oversized letters, letters written in different ink colors, etc., you need to annotate on the level of parts of morphemes or individual letters.

[Slide 16]

As in this example, where things that appear in the middle of a morpheme (such as the oversized janja in the middle of the word pejaf) might need to be tagged – size, line break, etc. So you need to annotate on a more granular level than “words” or “morphemes.”

So, now we’ve already got a ton of different ways to tag our data, and we’re not done yet: there are lots of other tags or annotations that you might want to make and use for research.  What you do NOT want to have to do is to write this all up manually using actual XML tags in what is called inline markup.


Instead, if you mark up your data in multiple layers, or what is known as multi-layer standoff markup, you can make more sense of it and tag your data much more easily.

[Slide 18]

Here you can see the smallest level of data, the token layer, at the top. The second layer shows the morpheme segments, aligned with the tokens, except that the two at the end are merged into one because they form one term: Abraham. Line three gives you the bound groups, and line four shows you line breaks; here you see the line ends in the middle of Abraham. Line five shows column breaks, and line six page breaks.
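For readers who think in data structures, the layered arrangement can be sketched as follows. This is only an illustration of the standoff idea, not the format our tools or ANNIS actually use: the `Span` class, the layer names, and the transliterated sample tokens are all hypothetical. Each layer is a list of spans that point at token positions, so layers can overlap freely without nesting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # index of the first token covered
    end: int    # index one past the last token covered
    value: str  # the annotation itself


# Hypothetical sample: "abraham" is split into two tokens by a line break.
tokens = ["p", "noute", "abra", "ham"]

layers = {
    "morph": [Span(0, 1, "p"), Span(1, 2, "noute"), Span(2, 4, "abraham")],
    "bound_group": [Span(0, 2, "pnoute"), Span(2, 4, "abraham")],
    "line": [Span(0, 3, "line 1"), Span(3, 4, "line 2")],
    "page": [Span(0, 4, "page 1")],
}


def annotations_at(token_index, layers):
    """Collect the annotation from every layer that covers one token."""
    return {
        name: span.value
        for name, spans in layers.items()
        for span in spans
        if span.start <= token_index < span.end
    }


for i, tok in enumerate(tokens):
    print(tok, annotations_at(i, layers))
```

Because the line layer and the morpheme layer are independent, the line break can fall inside "abraham" without either annotation having to be split or duplicated, which is exactly what inline XML makes painful.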

Moreover, you want to automate as much of your annotation as possible. We have at least semi-automated normalizing spelling, which eliminates diacritics and supralinear strokes, normalizes spelling variants, deals with abbreviations, and so forth. Normalization is essential both for search and for further automated annotations.  We’ve also semi-automated annotations for language of origin of words in a text, and we are developing a lemmatizer, which will match each word with its dictionary head word.
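As a rough sketch of what one normalization step can look like (not our actual normalizer), the snippet below strips combining marks such as the supralinear stroke using Python's standard `unicodedata` module. It assumes the stroke is encoded as a Unicode combining character (for example U+0304), and the `variants` lookup table for spelling variants is a hypothetical placeholder.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks (Unicode category Mn) from the text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize(word, variants=None):
    """Strip marks, then map a known spelling variant to its standard form.

    `variants` is a hypothetical dict of {variant: standard spelling}.
    """
    bare = strip_marks(word)
    return (variants or {}).get(bare, bare)


# Coptic small letter mi (U+2C99) with a supralinear stroke (U+0304),
# followed by Coptic small letter ni (U+2C9B).
diplomatic = "\u2c99\u0304\u2c9b"
print(normalize(diplomatic) == "\u2c99\u2c9b")
```

In a multi-layer setup the normalized form would be stored as its own annotation layer alongside the diplomatic form, so nothing from the manuscript transcription is thrown away.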

Finally we’ve developed a part of speech tagger, which is a natural language processing algorithm. It learns as it processes more data, based on patterns and probabilities. We have two sets of tags – coarse, which will just tag all nouns as nouns, for example – and fine – which will tag proper nouns, personal subject pronouns, personal object pronouns, etc.

[Slide 20]

And so now your data looks like this:

[Slide 21]

You’ve preserved all your information. By making everything annotations – even spelling normalization – you don’t “lose” information. You just annotate another layer.

So, what can you do with this?

1. Basic search for historical and philological research. Below is a screen shot of the search and visualization tool we are using, ANNIS.  ANNIS was developed originally for computational linguistics work, and we are adapting it for our multidisciplinary endeavor. Here I’ve searched for both the terms God and Lord in Shenoute’s Abraham Our Father.

[Slide 23]

The query is on the upper left, the corpus I’ve selected is in the lower left, and the results are on the right.

You can export your results or select more than one text to search:

[Slide 24]

And if you click on a little plus sign next to “annotations” under any search result, you can see all the annotations for that result.

[Slide 25]

So, noute is a noun, it’s part of this bound group hitmpnoute, it’s on page 518 of manuscript YA, etc.

You can also read the text normalized or in the diplomatic edition of the manuscript inside ANNIS:
Or if you already know the texts you want to read, you can access them easily as stand-alone webpages on our main site (see the HTML normalized and diplomatic pages of texts).

2. Linguistics and Style.  Here, I’ve told ANNIS to give me all the combinations of three parts of speech and the frequencies those sequences occur:


This is known as a “tri-gram” – you’re looking for sequences of three things. I didn’t tell it any particular three parts of speech, I said, give me ALL sequences of three. And then I generated the frequencies. Note: everything I am presenting here is raw data, designed primarily to GENERATE and EXPLORE research questions, not to answer them in a statistically rigorous way. This is raw data.
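The counting behind a tri-gram table is simple enough to sketch in a few lines. This is an illustration only, not the query I ran in ANNIS: the tag names are hypothetical stand-ins for our tagset, and the sketch assumes the tags arrive as one flat sequence, ignoring sentence or bound-group boundaries.

```python
from collections import Counter


def trigram_frequencies(tags):
    """Return (trigram, count, percentage) rows, most frequent first."""
    trigrams = list(zip(tags, tags[1:], tags[2:]))
    counts = Counter(trigrams)
    total = len(trigrams)
    return [(tri, n, 100 * n / total) for tri, n in counts.most_common()]


# Hypothetical tag sequence for a short stretch of text.
tags = ["PREP", "ART", "N", "PAST", "PRON", "V", "PREP", "ART", "N"]

for trigram, count, percent in trigram_frequencies(tags):
    print(" + ".join(trigram), count, f"{percent:.2f}%")
```

The percentages are shares of all tri-grams in the sequence, which is the sense in which the figures quoted in this paper should be read: raw proportions, with no significance testing.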

What do we learn?

The most common combination of three grammatical categories is the preposition + article + noun (“in the house”) across ALL the corpora – this is #1. Not a surprise if you think about it.

[Slide 29]

Also, you’ll notice some distinct differences in genre: the second most common tri-gram in the Apophthegmata Patrum is past tense marker + subject personal pronoun + verb (3.66% of all tri-grams), which fits with the Sayings as a kind of narrative piece. Similarly, for Mark 1-6, the second most common tri-gram is past tense marker + subject personal pronoun + verb (4.03% of tri-grams). Compare that to Besa, where this combination is the 4th most common tri-gram (2.1% of tri-grams), or Shenoute, where it accounts for .91% of tri-grams in A22 (14th most common) and 1.52% in Abraham Our Father (also 4th most common). (My hunch is this tri-gram probably skews HIGH in Abraham compared to its frequency overall in Shenoute, since there are so many narrative references to biblical events in Abraham Our Father.)

A marker for Shenoute’s style, by contrast, is the relative clause.  Article + noun + relative converter occurs .91% of the time in Acephalous Work 22 and .76% in Abraham. But in Mark, it’s the 33rd most common combination and occurs .55% of the time. In the Apophthegmata Patrum, it occurs .44% of the time (the 40th most common combination).

[Slide 30]

Some of you are probably thinking, “Wait a minute, what is this quantitative analysis telling me that I don’t already know? Of course narrative texts use the past tense! And Shenoute’s relative clauses have been giving me conniption fits for years!”  But actually, having data confirm things we already know at this stage of the project is a good thing – it suggests that we might be on the right track. And then with a larger dataset and better statistics, we can next ask other questions about, say, authorship, and bilingualism or translation vs native speakers.  For example:  A) How much of the variation between Mark and the AP on the one hand, and Shenoute and Besa on the other can be explained by the fact that Mark and the AP are translations from the Greek? Can understanding this phenomenon – the syntax of a translated text – help us study other texts for which we only have a Coptic witness and resolve any of those “probably translated from the Greek” questions that arise about texts that survive only in Coptic? B) Shenoute is reported to have lived for over 100 years with a vast literary legacy that spans some eight decades. Did he really write everything attributed to him in those White Monastery codices? Can we use vocabulary frequency and style to attribute authorship to Coptic texts?

3. Language, loan words, and translation practices.  We can also study loan words and translation practices. Quickly let’s take a look at the frequency of Greek loan words in the five sets of texts:

Slide31 In Abraham Our Father, 4.71% of words are Greek; Mark 1-6: 6.33%; A22: 5.44%; Besa: 5.82%; AP: 4.25%.  The texts are grouped on the graph roughly by the size of the corpus – Mark 1-6 is closer in size to Abraham Our Father, and the others are very small corpora. What’s interesting to me is the Apophthegmata Patrum number.  Since it’s a translation text, I’d expect this figure to be higher, more like Mark 1-6.
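The percentages above are simple shares: tokens annotated as Greek divided by all tokens in the corpus. A minimal sketch of that arithmetic follows, assuming each token carries a language-of-origin label (or `None` for native Coptic vocabulary). The data structure and the function name `loan_share` are my own illustration, not the format our tools export.

```python
def loan_share(tokens, language="Greek"):
    """Percent of tokens whose language-of-origin label equals `language`.

    `tokens` is a list of (word, origin) pairs; origin is None for
    words with no loan-word annotation.
    """
    if not tokens:
        return 0.0
    hits = sum(1 for _, origin in tokens if origin == language)
    return 100.0 * hits / len(tokens)


# Hypothetical 20-token corpus: placeholder words, two tagged as Greek.
corpus = [(f"w{i}", "Greek" if i % 10 == 0 else None) for i in range(20)]
print(f"{loan_share(corpus):.2f}% Greek")
```

With corpora as small as some of ours, a handful of tokens moves the percentage noticeably, which is one reason to be cautious about reading too much into the differences.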

4.  Scriptural references and other text reuse.  Is it also possible to use vocabulary frequencies to find scriptural citations? The Tesserae project in Buffalo is working on algorithms to compare two texts in Latin or two texts in Greek to try to identify places where one text cites the other. Hopefully, we will be able to adapt this for Coptic one day.
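To make the idea concrete, here is a toy sketch of shared-vocabulary matching: it flags any pair of passages that have at least two non-trivial words in common. This is only an illustration of the general approach under my own simplifying assumptions (whitespace tokenization, exact word forms, an ad hoc stopword list, English placeholder text); it is not the Tesserae algorithm, and real Coptic would need segmentation and lemmatization first.

```python
def shared_vocab_matches(source_lines, target_lines,
                         min_shared=2, stopwords=frozenset()):
    """Return (source_index, target_index, shared_words) for passage pairs
    that share at least `min_shared` distinct non-stopword tokens."""
    matches = []
    for i, source in enumerate(source_lines):
        source_words = set(source.lower().split()) - stopwords
        for j, target in enumerate(target_lines):
            target_words = set(target.lower().split()) - stopwords
            shared = source_words & target_words
            if len(shared) >= min_shared:
                matches.append((i, j, sorted(shared)))
    return matches


# Placeholder passages in English, purely for illustration.
scripture = ["abraham believed god and it was counted to him",
             "the lord is my shepherd"]
sermon = ["our father abraham believed and was counted righteous",
          "we walked to the monastery"]
stop = frozenset({"and", "the", "to", "it", "was", "is",
                  "my", "our", "we", "him"})

for i, j, words in shared_vocab_matches(scripture, sermon, stopwords=stop):
    print(f"scripture[{i}] ~ sermon[{j}]: {', '.join(words)}")
```

Every candidate pair it returns still has to be read by a person; the point is to narrow down where to look.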

Slide32 In the Digital Humanities, “distant reading” has become a hot topic. Distant reading typically means mining “big data” (large data sets with lots and lots of texts) for patterns. Some humanists have bemoaned this practice as part of the technological takeover of literary studies, the abandonment of close reading in favor of quantitative analyses that don’t require you ever to actually read a text. Can distant reading also serve some very traditional research questions about biblical quotations, authorship identification, prosopography, or the evolution of a dialect?

This project still has a lot to do. We need to improve some of our taggers, create our lemmatizer, link our lemmas to a lexicon, provide universal references so that our texts, translations, and annotations can be cited, and possibly connect with other linked data projects about the ancient world (such as Pelagios and SNAPDRGN).

For today, I hope to have shown you the potential for such work, the need for at least some of us to dive into the matrix of technical data as willingly and as deeply as we dive into the depths of theology and history. And also, I invite you to join us. If you have Coptic material you’d like to digitize, if you have suggestions, if you would like to translate or annotate a text we already have digitized, consider this an invitation. Thank you.


2 NEH Grants to support Coptic SCRIPTORIUM

The Coptic SCRIPTORIUM project is pleased to announce that we have been awarded two grants from the National Endowment for the Humanities.  A grant from the Office of Digital Humanities will support tools and technology for the study of Coptic language and literature in a digital and computational environment.  A grant from the Division of Preservation and Access will support digitization of Coptic texts.


Press Release from the University of the Pacific:


Dr. Caroline T. Schroeder Receives two National Endowment for the Humanities grants

Ann Mazzaferro, Apr 8, 2014

Google has transformed the way we seek knowledge, and most questions can be answered with, “There’s an app for that,” but there are still corners that no search engine or web application has yet reached, among them rare writings in a dead Egyptian language.

With $100,000 in new grants from the National Endowment for the Humanities, Caroline T. Schroeder, associate professor of Religious and Classical Studies at University of the Pacific, plans to change that. Working in collaboration with her project co-director, Amir Zeldes of Humboldt University in Berlin, Schroeder’s goal is to make Coptic accounts of monks battling demons in the desert, early theological controversies, and accounts of life in Egypt’s first Christian monasteries as easy to access online as the morning’s latest news.

“Dr. Schroeder is a distinguished scholar and spectacular teacher, and there is no one more deserving of this prestigious recognition,” said Dr. Rena Fraden, dean of College of the Pacific, the liberal arts and sciences college at University of the Pacific. “Nations – ancient and modern – will always be judged for their contributions to knowledge and the arts. Pacific and Carrie Schroeder belong to this glorious tradition.”

Schroeder received a $40,000 Humanities Collections and Reference Resources grant, which will enable scholars not only to digitize core Coptic texts housed at institutions around the world, but to develop standards for future digitization projects. She also received a $60,000 Digital Humanities Start-Up Grant; it will allow scholars to develop the tools and technologies necessary for computer-aided study and interaction with the materials.

The study of Coptic texts has gained attention in recent years, with high-profile controversies including the announcement in 2012 of an apparent Coptic papyrus text that may refer to “Jesus’ wife,” and increased international focus on the political climate of Egypt.

The digitization of these texts, and the database that Schroeder and her colleagues are working to create, will allow students, researchers, and non-academics alike to translate, analyze and understand the content of these Coptic texts, and to cross-reference the material with other texts and resources, including dictionaries and lexicons.

“This is the most cutting edge grant you can get for this type of work,” said Schroeder, who has taught at Pacific since 2007 and is the director of the Pacific Humanities Center. “This is about creating the technology for the study of the humanities. There aren’t that many technologies that work for Coptic or Egyptian texts. It’s an entire language family, and an important one for history, language, art. This is a world cultural heritage, a study of how our world and culture came to be.”

It will also create a centralized, open-source archive where these texts can be accessed in their entirety, anywhere in the world. This is particularly important, as many of these texts have been separated over centuries; reading one letter penned by a Coptic author may mean traveling to several different libraries and museums across the globe to track down the full account. While some Coptic manuscripts have been published in print, others have not.

“There are a lot of materials from this time and place that need more study. You have to know, if you want to read a letter, that some of the pages are going to be in London, some in Naples, some in Paris,” Schroeder said. “These documents and texts are primarily housed in Western museums and libraries, and our project is committed to being open-access, and to being available to everyone, including people in the country where these texts originated.”

This multi-disciplinary project has involved work with scholars from around the world, as well as collaboration among faculty and students at Pacific. Lauren McDermott, an English major with a Classics minor in the College, learned the Coptic alphabet in order to help proofread, digitize, and encode texts; Alexander Dickerson, a Computer Sciences major in the School of Engineering and Computer Sciences, worked on the coding as well.


About the National Endowment for the Humanities
Created in 1965 as an independent federal agency, the NEH supports research and learning in history, literature, philosophy, and other areas of the humanities by funding selected, peer-reviewed proposals from around the nation. For more information, visit

About University of the Pacific
Established in 1851 as the first university in California, University of the Pacific prepares students for professional and personal success through rigorous academics, small classes, and a supportive and engaging culture. Widely recognized as one of the most beautiful private university campuses in the West, the Stockton campus offers more than 80 undergraduate majors in arts and sciences, music, business, education, engineering and computer science, and pharmacy and health sciences. The university’s distinctive Northern California footprint also includes the acclaimed Arthur A. Dugoni School of Dentistry in San Francisco and the McGeorge School of Law in Sacramento. For more information, visit

March 2014 Coptic SCRIPTORIUM Release notes

Coptic SCRIPTORIUM is pleased to announce a new release of data and an update on our project.  Please visit our site at (backup at

We’ve released several new corpora:
-two fragments of Shenoute’s Acephalous Work #22 (aka A22, from Canons Vol. 3)
-two letters of Besa (to Aphthonia and to Thieving Nuns)
-chapters 1-6 of the Sahidic Gospel of Mark (based on Warren Wells’ Sahidica New Testament)

These corpora include:
⁃    visualizations and annotations of diplomatic manuscript transcriptions (except for Mark)
⁃    visualizations and annotations of the normalized text
⁃    annotations of the English translation (except for some A22 material)
⁃    part-of-speech annotations (which can be searched)
⁃    search and visualization capabilities for normalized text, Coptic morphemes, and bound groups in most of the corpora
⁃    Language of origin annotations (Greek, Hebrew, Latin) in most corpora (which can be searched)
⁃    TEI XML files of the texts in the corpora, which validate to the EpiDoc subset

We’ve also:
⁃    Updated the documentation about our part-of-speech tag set and tagging script.  (If you’re interested at all in Coptic linguistics please do read about our tag set)
⁃    Provided some example queries for our search and visualization tool (ANNIS); just click on a query and ANNIS will open and run it
⁃    updated our Frequently Asked Questions document
⁃    released an update to the Apophthegmata Patrum corpus to incorporate some of the new technologies described above
⁃    improved automation of normalizing text, annotating it for part-of-speech, annotating language of origin, annotating word segmentation (bound groups vs morphemes, etc.)

We would love to hear from you if you use our site; we think it will be useful for people teaching Coptic as well as conducting research.  Please email either of us feedback directly.

The improvements in automation also mean we would love to work with you if you have digitized Coptic texts that you would like to be able to search or annotate, if there are texts you would like to digitize, or if you would like to annotate existing texts in our corpus in new ways.  We are ready to scale up!

Thanks for all of your support.  This project is designed for the use of the entire Coptological community, as well as folks in Linguistics, Classics, and related fields.

January 2014 Coptic SCRIPTORIUM release notes

We’ve released some additional TEI XML files for our SCRIPTORIUM corpora at (backup site

  • All the TEI files have been lightly annotated with linguistic annotations.
  • The metadata has been updated to provide more information about the repositories and manuscript fragments.
  • There are now TEI downloads for every file in our public ANNIS database.
  • All TEI files conform to the EpiDoc TEI XML subset and validate to the EpiDoc schema.
  • The files are licensed under a CC-BY 3.0 license which allows unrestricted reuse and remixing as long as the source is credited (Coptic SCRIPTORIUM).  Linguistic annotations were made possible with the sharing of resources from Dr. Tito Orlandi and the CMCL (Corpus dei Manoscritti Copti Letterari); please credit them, as well.

We welcome your feedback on the TEI XML.  We hope to release more texts in the corpora later this winter or in early spring.


SBL presentation on Digital Technologies to find and study biblical references in Coptic literature

The slides from my 2013 Society of Biblical Literature presentation are now available on and are referenced on Coptic SCRIPTORIUM’s Zotero Group Library page.

Searching for Scripture: Digital Tools for Detecting and Studying the Re-use of Biblical Texts in Coptic Literature (Caroline T. Schroeder, Amir Zeldes)


Some of our most important biblical manuscripts and extra-canonical early Christian literature survive in the Coptic language. Coptic writers are also some of our most important sources for early scriptural quotation and exegesis. This presentation will introduce the prototype for a new online platform for digital and computational research in Coptic, and demonstrate its potential for the detection and analysis of “text-reuse” (quotations from, citations and re-workings of, and allusions to prior texts). The prototype platform will include tools for formatting digital Coptic text as well as a digital corpus of select texts (most specifically the writings of Shenoute of Atripe, who is known for both his biblical citations and his biblical style of writing). It will allow searching for patterns of shared vocabulary with biblical texts as well as for grammatical and syntactical information useful for stylistic analyses. Both the potential uses and limitations of implicit methodologies will be discussed.

New grant funded for Coptic digital studies

The German Federal Ministry of Education and Research (BMBF) has approved Dr. Amir Zeldes’ (Humboldt University) proposal for a young researcher group on Digital Humanities at HU Berlin, starting early next year. The project is called KOMeT (Korpuslinguistische Methoden für eHumanities mit TEI), and aims to apply corpus linguistics methods to ancient texts encoded in TEI XML, focusing initially on richly annotated corpora of Sahidic Coptic. Dissertations within the group will be mentored by Frank Kammerzell, Anke Lüdeling, Laurent Romary and myself.

The group will cooperate with the SCRIPTORIUM project that Dr. Zeldes and I presented at our workshop in Berlin in May.

(The text of this announcement is taken from Amir Zeldes’.)