Tag Archives: JGAAP

Can a computer program help us identify unknown writers? 4

Right now I am sorry to say that I haven’t had great success with the computer program that I was hoping would help us to identify unknown writers. I’m by no means declaring it to be impossible or unrealistic, but I think I will need to ask for help from the experts who wrote the program and/or who do more of this sort of analysis on a day-to-day basis.

My initial trials were to see if I could test a Jay Over script known to be by him against another one known to be by him, so as to see if the program could pick out a ‘known good’ example. It did do that pretty well, but it may be that I calibrated the program options too closely against Jay Over. I haven’t got to the stage of being able to say that this series of tests, done in this way, gives you a good chance of identifying this text by a known author. (Unless that known author is Jay Over, she says slightly bitterly.) And if I can’t do this reasonably reliably, there is no point (as yet, at least) in moving on to trying out unknown author texts.

In my last post about this computer program, I ran a series of 10 tests against a Jay Over text, and the program reliably picked out Jay Over as the most likely author of that text out of a supplied set of 4 test authors. It was much less reliable in picking out a test Malcolm Shaw text out of the same set of test authors: only 5 of the 10 tests suggested that Malcolm Shaw was the best fit. I have now tried the same 10 tests with an Alan Davidson text (“Jackie’s Two Lives”), and with a Pat Mills text (“Girl In A Bubble”). This means that all four of the test authors have been tested against a text that is known to be by them.

  • Unfortunately, in the test using an Alan Davidson text the program was even worse at picking him out as the ‘best fit’ result: it only did so in 2 of the 10 tests, and in 4 of the tests it placed him in last, or least likely to have written that test text.
  • In the test using a Pat Mills text, the program was rather better at picking him out as the ‘best fit’ result, though still not great: it did so in 4 out of 10 tests, and in 3 of the remaining tests he was listed second; and he was only listed as ‘least likely/worst fit’ in one of the tests.
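To put these counts in context: with four candidate authors, picking a name at random would be right about one time in four. The short Python sketch below compares the hit counts quoted in this post with that chance rate. The Jay Over count is left out because an exact figure is not given here, and the variable and function names are illustrative only.

```python
# Hit counts quoted in this post: tests (out of 10) in which the true
# author was ranked as the 'best fit' among 4 candidate authors.
hits = {
    "Malcolm Shaw": 5,
    "Pat Mills": 4,
    "Alan Davidson": 2,
}
TESTS = 10
CANDIDATES = 4

# Rate expected if the program simply picked one of the candidates at random.
chance_rate = 1 / CANDIDATES


def hit_rate(n_hits, n_tests=TESTS):
    """Fraction of tests that ranked the true author first."""
    return n_hits / n_tests


for author, n in hits.items():
    rate = hit_rate(n)
    verdict = "above" if rate > chance_rate else "at or below"
    print(f"{author}: {rate:.0%} ({verdict} the {chance_rate:.0%} chance rate)")
```

On these figures the Alan Davidson result is no better than guessing, while the Shaw and Mills results are above chance but well short of reliable.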

The obvious next step was to try with a larger group of authors. I tried the test texts of Jay Over (“Slave of the Clock”) and of Malcolm Shaw (“Bella” and “Four Faces of Eve”) against a larger group of 6 authors (Primrose Cumming, Anne Digby, Polly Harris, Louise Jordan, Jay Over, Malcolm Shaw).

  • With the Jay Over text, only 7 of the 10 tests chose him as the ‘best fit’, so the attribution of him as the author is showing as less definite in this set of tests.
  • With the Malcolm Shaw texts, only 1 and 3 tests (for “Bella” and for “Eve” respectively) identified him as the ‘best fit’ – not enough for us to have identified him as the author if we hadn’t already known him to be so. (He also came last, or second to last, in 4 of the first set of tests, and the same in the second set of tests.)
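One rough way to judge counts like these is to ask how often random guessing among six authors would do as well. The sketch below uses a simple binomial calculation for that. It assumes the 10 tests are independent, which they almost certainly are not (they all analyse the same texts), so the numbers are a loose guide rather than a proper significance test. The function name is made up for this example.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability of k or more successes in n independent trials,
    each succeeding with probability p (binomial upper tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Six candidate authors, so a random guess is right 1 time in 6.
p_chance = 1 / 6

results = [
    ("Jay Over, Slave of the Clock", 7),
    ("Malcolm Shaw, Four Faces of Eve", 3),
    ("Malcolm Shaw, Bella", 1),
]
for label, k in results:
    p = prob_at_least(k, 10, p_chance)
    print(f"{label}: P(at least {k} of 10 by chance) = {p:.4f}")
```

Under that assumption, 7 hits out of 10 would be very unlikely to happen by luck, whereas 1 hit out of 10 is about what random guessing would produce.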

I should also try with more texts by each author. However I think that right now I will take a break from this, in favour of trying to contact the creators of the program. I hope they may be able to give me better leads on the right direction to take this in. Do we need to have much longer texts for each author, for instance? (We have generally been typing up just one episode for each author – I thought it might be too much of an imposition to ask people to do any more than that, especially as it seemed sensible to try to get a reasonably-sized group of authors represented.) Are there some tests I have overlooked, or some analytical methods that are more likely to be applicable to this situation? Hopefully I will be able to come back with some extra info that means I can take this further – but probably not on any very immediate timescale.

In the meantime, I leave you with the following list of texts that people have kindly helped out with. You may find (as I have) that just looking at the texts themselves is quite interesting and revealing. I am more than happy to send on any of the texts if they would be of interest to others. There are also various scans of single episodes sent on by Mistyfan in particular, to whom many thanks are due.

  • Alison Christie, “Stefa’s Heart of Stone” (typed by Marckie)
  • Primrose Cumming, “Bella” (typed by Lorrbot)
  • Alan Davidson, three texts
    • “Fran of the Floods” (typed by Marckie)
    • “Jackie’s Two Lives” (typed by me)
    • “Kerry In the Clouds” (typed by me, in progress)
  • Anne Digby, “Tennis Star Tina” (typed by Lorrbot)
  • Gerry Finley-Day, “Slaves of War Orphan Farm” (typed by Mistyfan)
  • Polly Harris, two texts
    • “Monkey Tricks” (typed by Mistyfan)
    • “Midsummer Tresses” (typed by Mistyfan)
  • Louise Jordan, “The Hardest Ride” (typed by Mistyfan)
  • Jay Over, two texts
    • “Slave of the Clock” (typed by me)
    • “The Secret of Angel Smith” (typed by me)
  • Malcolm Shaw, five texts
    • “Lucky” (typed by Lorrbot)
    • “The Sentinels” episode 1 (typed by Lorrbot)
    • “The Sentinels” episode 2 (typed by Lorrbot)
    • “Bella” (typed by Lorrbot)
    • “Four Faces of Eve” (typed by Lorrbot)
  • Pat Mills, two texts
    • “Concrete Surfer” (typed by me)
    • “Girl In A Bubble” (typed by me)
  • John Wagner, “Eva’s Evil Eye” (typed by Mistyfan)


Pat Davidson writes

(comment sent by email)
I was interested to read about your computer programme designed to identify authors. If you need another story to test, Alan was the author of the brilliant “Paint It Black” – although this was for Misty, not Jinty [faint carbon copy of one of his invoices attached]. I have carbon copies of some of his actual scripts for various publications, when I can find them, although I know these will be equally faint.

[Image: “Paint It Black” invoice, Alan Davidson]

[editorial comment] Of course I need hardly say that any scripts or further information on Alan Davidson and what he wrote will be extremely welcome! The words ‘eager anticipation’ come to mind.

Can a computer program help us identify unknown writers? 3

I have been working hard at calibrating the computer ‘authorship identification’ program mentioned in previous posts. The results are mixed, but they offer interesting possibilities.

First of all, I narrowed down the many options offered by the program, by running a series of tests to see how well it picked out a known Jay Over text as being by that author, when compared to a set of four texts with known authorship. To help me narrow down the wide set of options on offer, I also referred to a post by the program’s author where he talks about what tests he used to identify J K Rowling as the author of The Cuckoo’s Calling, which had recently been published under a pseudonym.

[Images: JGAAP analysis 1, JGAAP analysis 2]

There are 10 sets of tests: the two highlighted in light purple represent tests used by the software author in the identification of J K Rowling, and the others are ones that I found through trial and error as being good identifiers of Jay Over’s writing in my test examples. I have got very little idea as to what the tests actually mean, or the numbers in the results, so the theory underpinning all this is pretty opaque to me! This means I could easily be making some sort of obvious beginner’s mistake that would be clear to someone who knew more than me about this, but never mind. This is a sort of calibration run, in effect.

One thing that is interesting and reassuring to see is that there is a distinct gap between the numerical results above for Jay Over as compared to the other texts / authors (see the column ‘Difference’). For instance, for the test ‘Word Ngrams’, the gap between Jay Over and Malcolm Shaw is 382.1303, whereas the gap between Malcolm Shaw and Alan Davidson on the same test is only 64.3022, and the gap between Alan Davidson and Pat Mills is 136.9977. What these units are, I will remain in blissful ignorance of, for now; but it seems to show that Shaw / Davidson / Mills cluster together more closely than they are linked to Over.
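The clustering point can be checked with a little arithmetic on the three gaps quoted above. The sketch below assumes that the ‘Difference’ column gives the gap between neighbouring authors in the ranking, so that the Shaw to Davidson and Davidson to Mills gaps can be added to get the total spread of those three. That reading of the column is an assumption, not something confirmed by the program’s documentation.

```python
# Gaps quoted above for the 'Word Ngrams' test (units unknown).
gap_over_shaw = 382.1303
gap_shaw_davidson = 64.3022
gap_davidson_mills = 136.9977

# Assumption: each gap is between neighbours in the ranked list, so the
# spread from Shaw to Mills is the sum of the two smaller gaps.
spread_of_other_three = gap_shaw_davidson + gap_davidson_mills

print(f"Gap from Over to the nearest other author: {gap_over_shaw:.4f}")
print(f"Spread across Shaw, Davidson and Mills:    {spread_of_other_three:.4f}")
print(f"Ratio of the two: {gap_over_shaw / spread_of_other_three:.2f}")
```

If that reading is right, the distance from Jay Over to his nearest neighbour is nearly twice the whole spread of the other three authors, which fits the impression that they sit together as a group.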

What happens when I try it against a different text – one of Malcolm Shaw’s? The results are more mixed this time. Malcolm Shaw is identified as the most likely author in 5 of the 10 tests, but in 4 of the remaining tests it is Jay Over who is suggested as most likely (and Pat Mills is suggested in the other test). (These tests where the wrong author is found are marked in red below.)

[Images: JGAAP analysis 3, JGAAP analysis 4]

I tried it again with a different Malcolm Shaw text and again Shaw is only picked out as the likely author in 5 of the 10 tests (a slightly different set of 5, to boot).

[Images: JGAAP analysis 5, JGAAP analysis 6]

For now this is not really strong enough evidence to say this program is definitely a good tool for this job, though it does suggest a couple of intriguing ideas. One possible explanation is that maybe in narrowing the tests down, I’ve just picked out tests that are good at looking for some characteristic in texts that Jay Over particularly happens to use in his writing. But another one, which is rather more intriguing, is that, well, we know people wrote under pseudonyms, and we know that Malcolm Shaw / Pat Mills / Alan Davidson are all real names of real people – but do we know for absolute definite whether Jay Over is a real name or a pseudonym – of Malcolm Shaw’s???

Of course this means more tests. Next, I want to try to see how the program gets on with a test text by Alan Davidson and a test text by Pat Mills. Will these show the same sort of mixed results, meaning that we might not be at a point of having a really good check to use with an unknown author? Or will these authors be easier for the program to identify more definitely?

For reference, the following texts were used in the above tests. I am differentiating between ‘test texts’, meaning the ones that the program is asked to decide on an author for; and ‘check texts’, which are the ones that the program knows are by certain authors.

  • Jay Over test text: episode from “Slave of the Clock”, typed by me
  • Jay Over check text: episode from “The Secret of Angel Smith”, typed by me
  • Alan Davidson check text: episode from “Fran of the Floods”, typed by Marckie
  • Malcolm Shaw check text: episode from “The Sentinels”, typed by Lorrbot
  • Malcolm Shaw test text 1: episode from “Bella”, typed by Lorrbot
  • Malcolm Shaw test text 2: episode from “Four Faces of Eve”, typed by Lorrbot
  • Pat Mills check text: episode from “Concrete Surfer”, typed by me

With many thanks to all who have sent in scans of stories (Mistyfan in particular) and/or typed episodes (Marckie/Lorrbot/Mistyfan).

Attached below: PDF version of the analysis, hopefully readable

JGAAP analysis PDF

Can a computer program help us identify unknown writers? 2

The jury’s still out at present, but I am very grateful for the kind offers of help of various sorts! I now have text versions of episodes of:

  • “Slave of the Clock” and “The Secret of Angel Smith” (Jay Over)
  • “The Sentinels” (Malcolm Shaw)
  • “Fran of the Floods” (Alan Davidson)
  • “Concrete Surfer” (Pat Mills).

So I have enough to try to see if I can get the program to identify “Slave of the Clock” as being by Jay Over rather than any of the other writers. If anyone is able to send in any more texts, the following would be useful:

  • Some texts by female writers such as Anne Digby, Alison Christie, Benita Brown
  • Some more texts by the writers named above, so that I can offer the program a wider base of texts per writer (rather than keeping on increasing the number of individual authors)

How far have I got so far? Not that far yet, I’m afraid to say. I have downloaded a copy of the program I chose (JGAAP) and I’ve got it to run (not bad in itself as this is not a commercial piece of software with the latest user-friendly features). I’ve loaded up the known authors and the test text (Slave of the Clock). However, the checks that the program gives you as options are very academic, and hard for me to understand as it’s not an area I’ve ever studied. (Binned naming times, analysed by Mahalanobis distance? What the what??) Frankly, I am stabbing at options like a monkey and seeing what I get.

I can however already see that some of the kinds of checks that the program offers are plausibly going to work, so I am optimistic that we may get something useful out of this experiment. These more successful tests involve breaking down the texts into various smaller elements like individual words, or small groups of words, or the initial words of each sentence, or by tagging the text to indicate what parts of speech are used. The idea is that this should give the program some patterns to use and match the ‘test’ text against, and this does seem to be bearing fruit so far.
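As a rough illustration of what “breaking down the texts into smaller elements” can look like, here is a minimal Python sketch that counts individual words, pairs of consecutive words, and the first word of each sentence. It is not how JGAAP itself is implemented. The function names and the sample line are made up for the example, and part-of-speech tagging is left out because it needs a separate tagging tool.

```python
import re
from collections import Counter


def words(text):
    """Lower-cased word tokens; straight and curly apostrophes are kept
    so that a word like "don't" stays in one piece."""
    return re.findall(r"[a-z'’]+", text.lower())


def word_ngrams(text, n):
    """Count every run of n consecutive words (n=1 gives single words)."""
    tokens = words(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_openers(text):
    """Count the first word of each sentence (split on . ! or ?)."""
    sentences = re.split(r"[.!?]+", text)
    firsts = (words(s)[:1] for s in sentences)
    return Counter(f[0] for f in firsts if f)


# Made-up sample line, not taken from any of the typed-up episodes.
sample = "Oh no! The clock is ticking. The clock never stops. Oh, help me!"

print(word_ngrams(sample, 1).most_common(3))
print(word_ngrams(sample, 2).most_common(3))
print(sentence_openers(sample))
```

Counts like these, gathered for each known author and for the test text, are the kind of pattern that can then be compared to see which author’s habits the test text most resembles.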

So, an interim progress report – nothing very definite yet but some positive hints. I will continue working through the options that the program offers, to see if I can narrow down the various analytical checks to a subset that look like they are successfully identifying the author as Jay Over. I will then run another series of tests with a new Jay Over file – I’ll type up an episode of “The Lonely Ballerina” to do that, unless anyone else has kindly done it before me 🙂 – scans from an episode are shown below, just in case! That will be a good test to see if the chosen analytical checks do the job that I hope they will…

Jay Over, Lonely Ballerina pg 1

Jay Over, Lonely Ballerina pg 2
Jay Over, Lonely Ballerina pg 3

Can a computer program help us identify unknown writers?

I don’t know yet, but I’m going to give it a go.

And I’ll need a little help from others, please.

I have been thinking about the problem of unknown writers and how we can try to identify them. In writing story posts here, Mistyfan and I sometimes raise questions about whether such and such a writer might have also written such and so other story, based on things like similar plot lines and the like. But there is a whole area of research into using computers in the Humanities, and a specific technique designed to help you attribute authorship to unknown writers: it’s called Stylometry. I want to try to use one of the pieces of software that does this – JGAAP – to see if we can get any help in thinking about who might have written what, or at least in some cases. (Edited to add: this program is written by the chap who did the analysis that strongly suggested that J K Rowling was the author of “The Cuckoo’s Calling”.)

The way it works is that I need to feed the program a number of texts from Known Authors, because it then compares the unknown writing with those known samples. (All it can ever do is say ‘this piece looks most likely to have been written by Author A out of the list of A – Z that you have given me’ – it’s just matching a sample to a known finite list, so it has limitations.) That means I need some text files (as many as possible) which are typed-up versions of stories where we already know the authors, such as the below:
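The “matching a sample to a known finite list” idea can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not JGAAP’s own method: it builds a profile of three-letter sequences for each text and ranks the known authors by cosine distance from the unknown text. The author labels and texts are placeholders invented for the example.

```python
import math
from collections import Counter


def profile(text, n=3):
    """Relative frequencies of character n-grams in a text."""
    text = " ".join(text.lower().split())
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}


def cosine_distance(p, q):
    """0.0 for identical profiles, 1.0 for profiles with nothing in common."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


def rank_authors(unknown_text, known_texts):
    """Return (author, distance) pairs, closest match first."""
    target = profile(unknown_text)
    distances = {a: cosine_distance(target, profile(t)) for a, t in known_texts.items()}
    return sorted(distances.items(), key=lambda item: item[1])


# Placeholder texts and labels, invented for this example.
known = {
    "Author A": "The clock struck twelve and the dancers froze where they stood.",
    "Author B": "She grabbed her skateboard and raced down the concrete ramp.",
}
unknown = "The clock struck one and the dancers froze again."

ranking = rank_authors(unknown, known)
print(ranking)

# A text by somebody who is not in the list still gets a 'best fit'.
outsider = "Completely unrelated words about gardening and soup."
print(rank_authors(outsider, known)[0][0])
```

The last two lines show the limitation described above: the ranking always names someone from the supplied list, even when the real writer is not on it.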

  • Jay Over, Slave of the Clock / The Secret of Angel Smith / The Lonely Ballerina from Tammy 1982 and 1983
    • I can do the first two but haven’t got any copies of The Lonely Ballerina
  • Alison Christie – see list on the interview post
  • Pat Mills, various stories including Moonchild in Misty and Concrete Surfer in Jinty
    • I am in the middle of typing up the episode of Concrete Surfer included in the post about this story
  • Alan Davidson, Fran of the Floods / The Valley of Shining Mist / Gwen’s Stolen Glory
  • Malcolm Shaw, The Robot Who Cried

Can anyone help by typing up one or more episodes from the stories mentioned, and sending them to me? I’m working out a standard format to use, because it’s going to be important to be consistent about things like how to indicate thought balloons or the text boxes at the beginning of each episode. We can work that out further together of course. Very many thanks in advance!

Once I have enough example files to start running them through the program, this is what I am intending to try (any comments or suggestions will be received with interest).

  1. Can I get the program to work at all?
    • If I load a credited Jay Over text as a Known Author, and a Pat Mills story likewise as a Known Author, will an episode of “Slave of the Clock” be successfully identified as a Jay Over story?
  2. What if I then compare a credited “Pam of Pond Hill” story – will the program identify this as a Jay Over story, or will the comedy style mean it is not as recognisable to the program?
  3. What if I then compare an uncredited “Pam” story with a credited “Pam” story? We think all the Pam stories were written by Jay Over but could this program show us any other views?
  4. What if I then add in more Known Authors and re-run the tests above – will the results still come out the same?
  5. And then excitingly I could try some further tests, like:
    • If I compare an episode of “Prisoner of the Bell” to “Slave of the Clock”, does the former look like the known Jay Over texts?
    • If I compare an episode of “E. T. Estate” by Jake Adams to the uncredited story “The Human Zoo”, what does the program indicate about any plausible attribution?
    • We think Benita Brown probably wrote “Spirit of the Lake” – is there any textual / stylistic similarity we can find between this and “Tomorrow Town” that we know she wrote?

Of course no stylistic attribution program is going to replace a statement from a creator or a source from the time, but we know these are thin on the ground and getting thinner, and what’s more people’s memories and records are getting more fragmentary as time goes by, so this seems worth trying. I don’t expect anything to happen very quickly on this because it does mean quite a bit of typing to get a good body of texts. If anyone is able to help on the typing front then I will be very grateful and hopefully will then be able to show any results sooner rather than later.

Apologies, I had meant to say something about the format of the text. I have a sample document which hopefully can be viewed via this link. In case that doesn’t work, this is what I mean for it to look like:

[Image: text grab of the sample format]

But I can add in extra detail such as the description that the text appeared in a word balloon, if I have a scan of the pages in question.