Can a computer program help us identify unknown writers? 3

I have been working hard at calibrating the computer ‘authorship identification’ program mentioned in previous posts. The results are mixed, but they offer some interesting possibilities.

First of all, I narrowed down the many options offered by the program by running a series of tests to see how well it picked out a known Jay Over text as being by that author, when compared against a set of four texts of known authorship. To help with this, I also referred to a post by the program’s author where he talks about which tests he used to identify J K Rowling as the author of The Cuckoo’s Calling, which had recently been published under a pseudonym.

[Images: JGAAP analysis 1, JGAAP analysis 2]

There are 10 sets of tests: the two highlighted in light purple are tests used by the software author in the identification of J K Rowling, and the others are ones that I found through trial and error to be good identifiers of Jay Over’s writing in my test examples. I have very little idea what the tests actually mean, or what the numbers in the results represent, so the theory underpinning all this is pretty opaque to me! This means I could easily be making some sort of obvious beginner’s mistake that would be clear to someone who knows more about this than I do, but never mind. This is a sort of calibration run, in effect.
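To make a test name like ‘Word Ngrams’ a little less opaque, here is a rough sketch of the general idea behind that kind of test: count short runs of consecutive words in each text, then measure how different two texts’ counts are. This is not the program’s own code, and I do not know which distance measure it really uses; the function names, the choice of two-word runs and the simple ‘sum of differences’ distance are all assumptions made for illustration.

```python
from collections import Counter


def word_ngrams(text, n=2):
    """Count runs of n consecutive words (lower-cased, split on whitespace)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def relative_frequencies(counts):
    """Turn raw counts into proportions so long and short texts compare fairly."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def profile_distance(counts_a, counts_b):
    """Sum of absolute differences in relative frequency (an assumed measure).

    0.0 means identical profiles; 2.0 means no n-grams in common.
    """
    freq_a = relative_frequencies(counts_a)
    freq_b = relative_frequencies(counts_b)
    all_grams = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(g, 0.0) - freq_b.get(g, 0.0)) for g in all_grams)


def rank_authors(test_text, check_texts, n=2):
    """Rank candidate authors by distance to the test text, closest first."""
    test_profile = word_ngrams(test_text, n)
    scores = {
        author: profile_distance(test_profile, word_ngrams(text, n))
        for author, text in check_texts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])
```

With real episodes you would pass the typed-up test text and a dictionary mapping each author to their check text; the author with the smallest distance comes first in the ranking.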

One thing that is interesting and reassuring to see is that there is a distinct gap between the numerical results above for Jay Over as compared to the other texts / authors (see the column ‘Difference’). For instance, for the test ‘Word Ngrams’, the gap between Jay Over and Malcolm Shaw is 382.1303, whereas the gap between Malcolm Shaw and Alan Davidson on the same test is only 64.3022, and the gap between Alan Davidson and Pat Mills is 136.9977. What these units are, I will remain in blissful ignorance of, for now; but it seems to show that Shaw / Davidson / Mills cluster together more closely than they are linked to Over.
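As a quick check on how big that first gap is relative to the others, the three ‘Word Ngrams’ differences quoted above can be compared directly. This is just arithmetic on the quoted numbers; the function name is made up, and since the units are unknown the ratios should be read as a rough guide only.

```python
# Gaps between adjacent authors in the 'Word Ngrams' ranking,
# copied from the figures quoted in the post (units unknown).
gaps = [
    ("Jay Over", "Malcolm Shaw", 382.1303),
    ("Malcolm Shaw", "Alan Davidson", 64.3022),
    ("Alan Davidson", "Pat Mills", 136.9977),
]


def gap_ratios(gaps):
    """Return how many times larger the first gap is than each later gap."""
    first_gap = gaps[0][2]
    return {(a, b): first_gap / size for a, b, size in gaps[1:]}


for (a, b), ratio in gap_ratios(gaps).items():
    print(f"Over/Shaw gap is {ratio:.2f} times the {a}/{b} gap")
```

On these figures the Over/Shaw gap comes out at roughly six times the Shaw/Davidson gap and a little under three times the Davidson/Mills gap.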

What happens when I try it against a different text, one of Malcolm Shaw’s? The results are more mixed this time. Malcolm Shaw is identified as the most likely author in 5 of the 10 tests, but in 4 of the remaining tests it is Jay Over who is suggested as most likely, and Pat Mills is suggested in the other one. (The tests where the wrong author is found are marked in red below.)
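To put the ‘5 out of 10’ in context, the tally can be set against what blind guessing among the four candidate authors would manage. This is a small sketch using only the counts given above; the variable names are invented, and the comparison is rough because the ten tests were hand-picked and are unlikely to be independent of each other.

```python
from collections import Counter

# How often each author was the top suggestion across the 10 tests
# for the first Malcolm Shaw test text (counts taken from the post).
top_picks = Counter({"Malcolm Shaw": 5, "Jay Over": 4, "Pat Mills": 1})
true_author = "Malcolm Shaw"
candidate_authors = 4  # Over, Shaw, Davidson, Mills each have one check text

total_tests = sum(top_picks.values())
hit_rate = top_picks[true_author] / total_tests
chance_rate = 1 / candidate_authors  # picking an author at random

print(f"Correct in {top_picks[true_author]} of {total_tests} tests ({hit_rate:.0%})")
print(f"Random guessing would be right about {chance_rate:.0%} of the time")
print("Most frequent suggestions:", top_picks.most_common(2))
```

So the program does better than a coin toss between four authors would, but nearly all of its misses go to the same wrong author rather than being spread around.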

[Images: JGAAP analysis 3, JGAAP analysis 4]

I tried it again with a different Malcolm Shaw text and again Shaw is only picked out as the likely author in 5 of the 10 tests (a slightly different set of 5, to boot).

[Images: JGAAP analysis 5, JGAAP analysis 6]

For now this is not strong enough evidence to say that this program is a good tool for the job, though it does suggest a couple of intriguing ideas about why Jay Over keeps being suggested for Malcolm Shaw’s texts. One possible explanation is that, in narrowing the tests down, I have simply picked out tests that respond to some characteristic that Jay Over happens to use a lot in his writing. Another, rather more intriguing, possibility is this: we know people wrote under pseudonyms, and we know that Malcolm Shaw, Pat Mills and Alan Davidson are all real names of real people, but do we know for certain that Jay Over is a real name, rather than a pseudonym of Malcolm Shaw’s?

Of course this means more tests. Next, I want to see how the program gets on with a test text by Alan Davidson and one by Pat Mills. Will these show the same sort of mixed results, suggesting that we do not yet have a reliable check to use on an unknown author? Or will the program identify these authors more decisively?

For reference, the following texts were used in the above tests. I am differentiating between ‘test texts’, meaning the ones that the program is asked to decide on an author for; and ‘check texts’, which are the ones that the program knows are by certain authors.

  • Jay Over test text: episode from “Slave of the Clock”, typed by me
  • Jay Over check text: episode from “The Secret of Angel Smith”, typed by me
  • Alan Davidson check text: episode from “Fran of the Floods”, typed by Marckie
  • Malcolm Shaw check text: episode from “The Sentinels”, typed by Lorrbot
  • Malcolm Shaw test text 1: episode from “Bella”, typed by Lorrbot
  • Malcolm Shaw test text 2: episode from “Four Faces of Eve”, typed by Lorrbot
  • Pat Mills check text: episode from “Concrete Surfer”, typed by me

With many thanks to all who have sent in scans of stories (Mistyfan in particular) and/or typed episodes (Marckie/Lorrbot/Mistyfan).

Attached below: PDF version of the analysis, hopefully readable

JGAAP analysis PDF


4 thoughts on “Can a computer program help us identify unknown writers? 3”

  1. Another problem could be stories that had input from several writers, although only one was credited. This was the case with “The Terror Behind the Bamboo Curtain”.

  2. This is becoming more and more interesting!
    I still owe you an episode of ‘Stefa’, but the past two weeks it has been quite busy at work, all of a sudden. But I didn’t forget.
