by Nick

I posted previously about being inspired by Digital Scholarship in the Humanities to mess about with word clouds. The same post also gave me the idea to try some text comparison tools.

TAPoR’s Comparator tool allows you to type in the URLs for two different pieces of text. It then compares the two, producing a word list showing whether words appear in both.

I tried it out with two texts in the pamphlet battle between John Taylor and Henry Walker of 1641 that I’ve been looking at recently. Late in the summer of 1641, a text called The Irish Footman’s Poetry appeared by a third author – one George Richardson. The text referenced various previous pamphlets in the dispute. Although it appeared when Taylor was on a journey down to the south-west of England, it is often attributed to him. (No real George Richardson appears to have existed.)

I ran Richardson’s text through the tool alongside one of Taylor’s pamphlets from the dispute. I had a hazy idea in my head that this could just possibly be a magic tool that could tell me the real author of a pseudonymous text.

Unfortunately it didn’t tell me very much. What it gives you is a list of the words that occur in both texts, along with the ratio of how often each one occurs in the first text compared with the second. In some cases I can imagine this being very useful – for example to trace the transmission of texts in cases where later works reference or draw upon earlier ones. In my case, though, the only words that emerged in common were everyday verbs like “do”.
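For anyone curious what a comparison like this involves, here is a rough Python sketch. It is not TAPoR's actual code, just my guess at the general idea: the function names, the tokenising rule and the use of relative frequencies are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out word tokens (letters, optional apostrophe)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def compare_texts(text_a, text_b):
    """Return the words found in both texts, with relative frequencies and their ratio.

    Assumption: the ratio is relative frequency in A divided by relative
    frequency in B. The real Comparator may calculate it differently.
    """
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())

    shared = {}
    # Only words present in both texts are kept.
    for word in counts_a.keys() & counts_b.keys():
        rel_a = counts_a[word] / total_a
        rel_b = counts_b[word] / total_b
        shared[word] = (rel_a, rel_b, rel_a / rel_b)
    return shared


if __name__ == "__main__":
    # Two made-up snippets standing in for the pamphlets.
    result = compare_texts("I do like this", "They do not")
    for word, (rel_a, rel_b, ratio) in sorted(result.items()):
        print(word, round(rel_a, 3), round(rel_b, 3), round(ratio, 3))
```

A sketch like this also shows why short texts give thin results: with no filtering of common words, the overlap is dominated by everyday vocabulary.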

Then I tried running a more detailed analysis of each text separately using the HyperPo tool. Here are the results for Taylor:

  • Total words (tokens): 1813
  • Unique words (types): 785
  • Highest word frequency: 91
  • Average word frequency: 2.31
  • Standard Deviation of word frequencies: 5.07
  • Average word length: 4.29
  • Standard Deviation of word lengths: 2.11
  • Number of sentences: 44
  • Average words per sentence: 41.2
  • Number of paragraphs: 17
  • Average words per paragraph: 106.6

Here is the same analysis for Richardson:

  • Total words (tokens): 1841
  • Unique words (types): 726
  • Highest word frequency: 86
  • Average word frequency: 2.54
  • Standard Deviation of word frequencies: 5.29
  • Average word length: 4.35
  • Standard Deviation of word lengths: 2.22
  • Number of sentences: 95
  • Average words per sentence: 19.4
  • Number of paragraphs: 38
  • Average words per paragraph: 48.4

Again, not much stands out – and in any case, looking for similarities this way could be distorted if, for instance, the same author was deploying different literary styles in each text.
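If you want to produce figures like the ones in the two lists above for your own texts, something along these lines should get you most of the way. Again, this is only a sketch and not how HyperPo itself works: the function name, the rules for splitting words, sentences and paragraphs, and the choice of population standard deviation are all my assumptions, so the numbers may not match the tool's exactly.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def text_stats(text):
    """Compute simple word, sentence and paragraph statistics for a text."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        raise ValueError("text contains no words")

    counts = Counter(words)
    frequencies = list(counts.values())
    lengths = [len(w) for w in words]

    # Assumption: a sentence ends at ., ! or ?; a paragraph ends at a blank line.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]

    return {
        "tokens": len(words),
        "types": len(counts),
        "highest_word_frequency": max(frequencies),
        "average_word_frequency": len(words) / len(counts),
        # Population standard deviation is an assumption here.
        "sd_word_frequency": pstdev(frequencies),
        "average_word_length": mean(lengths),
        "sd_word_length": pstdev(lengths),
        "sentences": len(sentences),
        "average_words_per_sentence": len(words) / len(sentences),
        "paragraphs": len(paragraphs),
        "average_words_per_paragraph": len(words) / len(paragraphs),
    }


if __name__ == "__main__":
    sample = "The cat sat. The dog ran!\n\nThe end."
    for name, value in text_stats(sample).items():
        print(name, value)
```

Note that the average word frequency is simply tokens divided by types, which is consistent with the figures above (1813 / 785 is about 2.31 for Taylor).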

So, TAPoR’s tools were fun to try out, but not much help in this particular case – a far better way to establish who the real George Richardson might have been is through a detailed contextual, bibliographic and stylistic analysis of the text. That said, I’d still recommend having a play about with TAPoR’s wide range of tools since you may well find something of use.