An "early access" beta of a stylometry program is now available here:

http://peterkirby.com/basic-stylometry.html

The program is very raw and may still have bugs, so you will need some patience if you want to get results with it. Here is some documentation.

Please report any bugs that you encounter, along with the error message and (if feasible) all the data you entered.

Entering Data

(1) The 'Words'

This program is exclusively based on the frequency of single words. (I might use other metrics in future efforts.)

You can use multiple tokens per line, in order to match any of the tokens. You can use this, for example, to list all the forms of a word making up a single lemma. You can also use this to combine multiple words (e.g., multiple prepositions, multiple conjunctions) that do not show up frequently enough on their own but do show up in significant quantities together.

Initial testing has shown that this list of twenty items has significant discriminatory power for samples that are approximately 750+ words (preferably 2000+ words) in length. You can add to, subtract from, or modify the list as you like.

o oi h ai to ta

tou twn ths

tw| tois th| tais

ton tous thn tas

kai

te

de d

men

alla all

gar

eis

en

dia di

ek ec

kata kat kaq

pros

autos autou autw| auton autoi autwn autois autous auth auths auth| authn autai autwn autais autas auto auta

outos toutou toutw| touton outoi toutwn toutois toutous auth tauths tauth| tauthn autai tautais tautas touto touto tauta

tis tinos tini tina tines tinwn tisi tisin tinas ti tina

eimi ei esti estin esmen este eisi eisin hn hsqa hn hmen hte hsan esomai esh| esei estai esomeqa esesqe esontai w h|s h| wmen hte wsi eihn eihs eih eihmen eimen eihte eite eihsan eien esoimhn esoio esoito esoimeqa esoisqe esointo isqi estw este estwn ontwn estwsan einai esesqai wn ousa on esomenos esomenh esomenon

Here are ten additional words that are sometimes useful.

epi ep

mh

oti

upo up

apo ap

meta met meq

ou ouk oux

polus pollou pollw| polun pollh pollhs pollh| pollhn polu pollou pollw| polu polloi pollwn pollois pollous pollai pollwn pollais pollas polla pollwn pollois polla

pas pantos panti panta pas pasa pashs pash| pasan pasa pan pantos panti pan pantes pantwn pasi pasin pantas pantes pasai paswn pasais pasas pasai panta pantwn

ode toude tw|de tonde oide twnde toisde tousde hde thsde th|de thnde aide twnde taisde tasde tode toude tode tade ekeinos ekeinou ekeinw| ekeinon ekeinoi ekeinwn ekeinois ekeinous ekeinh ekeinhs ekeinh| ekeinhn ekeinai ekeinais ekeinas ekeino ekeinou ekeinw| ekeino ekeina ekeinwn ekeinois ekeina

A good rule of thumb is that each 'word' should appear about 5 times or more, on average, in each 'Sample'-sized extract.

There is no single list of words that work best for distinguishing the style of all authors from all other authors. Some adjustment is generally required to arrive at a list that works well for distinguishing the works of one author from the rest.
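
The program's source isn't shown in this post, but the counting it performs can be sketched in Python. This is only an illustrative sketch: the function names (`tokenize`, `word_counts`) are mine, not the program's, and the regex assumes diacritic-stripped Beta Code where `|` (iota subscript) is part of a token.

```python
import re
from collections import Counter

def tokenize(beta_code_text):
    """Split Beta Code text into lowercase word tokens, discarding
    punctuation, numbers, and other non-word symbols.  The '|' of an
    iota subscript is kept as part of the token."""
    return re.findall(r"[a-z|]+", beta_code_text.lower())

def word_counts(tokens, word_lines):
    """For each 'word' line (a set of alternative tokens), count how
    many tokens in the text match any of the alternatives."""
    counts = Counter(tokens)
    return {line: sum(counts[alt] for alt in line.split())
            for line in word_lines}

sample = "o de anqrwpos kai h gunh en th| polei"
words = ["o oi h ai to ta", "kai", "de d", "en"]
print(word_counts(tokenize(sample), words))
```

A quick run like this, against a 'Sample'-sized extract, is an easy way to check the rule of thumb above (about 5+ appearances per 'word' line).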

(2) The 'Sample'

The sample is what you want to compare to the various possible 'authors' to see if any of them are a likely match.

A sample between 1,000 words and 5,000 words in length is preferred. Samples as small as approximately 500 words may work occasionally.

All input must be in "Beta Code." You can copy Beta Code out of the TLG with Diogenes, for example. The program will automatically discard non-word characters such as quote marks, numbers, and other symbols. If you don't have the TLG with Diogenes, you can get it here:

https://kat.cr/tlg-phi-cd-rom-e-with-an ... uarium.2.0

Diogenes has a setting to change its results to "Beta Code." Be sure to use this setting in order to copy over text. You can also increase the number of lines that are returned by Diogenes. Other sources of "Beta Code" Greek are also, of course, acceptable.

(3) The 'Authors'

These are the candidates being considered as possible authors.

The 'Sample' should not appear in any of the authors (remove the 'Sample' from the text of the author, if applicable).

If it's possible that the author isn't any one of these candidates, you'll need to use your own judgment in interpreting the results (see below).

The 'Author' sections should be as long as they can be, while not including any composite material (i.e., material from more than one author). At a minimum they should be about 4-5x as long as the 'Sample'. If necessary, you can break up the 'Sample' into parts (tested separately) in order to meet this requirement.

(4) The 'Controls'

These are extracts of ancient Greek whose authors are known not to be possible authors of the 'Sample'.

Again, the longer the better, while still avoiding composite texts.

These help you decide whether one of the candidates was the author or whether none of them was. (If one of the controls is a significantly closer match for the sample than the best candidate, that may mean that no candidate matches the sample closely enough to be declared the author.)

Interpreting the Results

When you've entered your data, press submit, and the results should appear in a new tab. This preserves your form data (at least for the length of the browser session). Save your form data elsewhere if you want to work with it over multiple browser sessions.

(1) testsize, testsample, sample frequencies, and word list

This just repeats some of the data that you entered, along with a measurement of the sample length and the frequencies of the words in the 'Sample'.

(2) Author Stats and Control Stats

For each 'author'/'control' extract and each 'word', the program calculates the mean and standard deviation of the 'word''s appearances across samples drawn from that 'author'/'control', each exactly the same size as the 'Sample'.
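
As a sketch of that calculation (the program may slice its samples differently; I assume non-overlapping windows here, and the name `window_stats` is mine):

```python
import statistics

def window_stats(tokens, word_alts, window_size, step=None):
    """Mean and population standard deviation of a 'word' count across
    consecutive windows of the author text, each window the same size
    as the 'Sample'.  By default the windows do not overlap."""
    step = step or window_size
    counts = []
    for start in range(0, len(tokens) - window_size + 1, step):
        window = tokens[start:start + window_size]
        counts.append(sum(1 for t in window if t in word_alts))
    return statistics.mean(counts), statistics.pstdev(counts)

# Toy author text: every 10 tokens contain exactly 3 occurrences of 'kai'.
author = (["kai"] * 3 + ["filler"] * 7) * 10
mean, sd = window_stats(author, {"kai"}, window_size=10)
print(mean, sd)
```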

(3) Author Q-Values and Control Q-Values

For each 'author'/'control' extract and each 'word', the observed count of the 'word' in the 'Sample' is compared against that 'author'/'control''s mean and standard deviation to produce a statistic called a Q-value (calculated from the Z-score). The Q-value estimates the likelihood that a count that far from the mean (or farther) would be drawn from a normal distribution with the measured mean and standard deviation. Values closer to 1 correspond to observed frequencies near the mean (more 'likely'); values closer to 0 correspond to observed frequencies several standard deviations away (less 'likely').
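
A two-tailed normal tail probability is the standard way to get such a Q-value from a Z-score; the sketch below assumes that is what the program does (the `sd == 0` handling is my own guess at an edge case):

```python
import math

def q_value(observed, mean, sd):
    """Two-tailed tail probability of the observed count under a
    normal distribution with the given mean and standard deviation.
    math.erfc(|z|/sqrt(2)) equals the area beyond |z| in both tails."""
    if sd == 0:
        return 1.0 if observed == mean else 0.0
    z = (observed - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))

print(q_value(10, 10, 2))   # at the mean -> 1.0
print(q_value(14, 10, 2))   # two standard deviations away -> ~0.0455
```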

(4) Author Chi-Square-Based P-Values and Control Chi-Square-Based P-Values

For each 'author'/'control', the Q-values are combined using Fisher's method for combining several p-values, yielding a single p-value: the likelihood that all the observed frequencies in the 'Sample' would be generated from that 'author'/'control' (according to normal distributions with the measured means and standard deviations).

This is not the only way to arrive at a combined p-value statistic. This method gives equal weight to all of the individual components, and it is more sensitive to the effect of any one extreme value on the result than the Z-score-based method described in (8) below.
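
Fisher's method itself is standard: X = -2 Σ ln(q_i) follows a chi-square distribution with 2k degrees of freedom under the null. A self-contained sketch (the closed-form survival function below works because the degrees of freedom are always even here):

```python
import math

def chi2_sf(x, df):
    """Chi-square survival function for even df, via the closed-form
    series: P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!"""
    m = df // 2
    term, total = 1.0, 1.0
    for i in range(1, m):
        term *= (x / 2) / i
        total += term
    return math.exp(-x / 2) * total

def fisher_combine(q_values):
    """Fisher's method for combining k p-values into one."""
    x = -2.0 * sum(math.log(q) for q in q_values)
    return chi2_sf(x, 2 * len(q_values))

print(fisher_combine([0.5, 0.5]))   # ~0.5966
```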

(5) Bayesian Author Test: Posterior Probabilities from Equal Priors, Chi-Square-Based Method

This is a straight comparison of the different candidate authors, using equal prior likelihood for each. The most likely candidate, using the chi-square-based method, has the highest number.
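
If the combined p-values are treated as likelihoods (an assumption on my part, but the natural reading of "equal priors"), Bayes' rule reduces to normalizing them so they sum to 1. The author names below are made up for illustration:

```python
def posteriors_equal_priors(p_values):
    """With equal priors, the posterior for each candidate is its
    combined p-value divided by the sum over all candidates."""
    total = sum(p_values.values())
    return {name: p / total for name, p in p_values.items()}

print(posteriors_equal_priors({"Plutarch": 0.30, "Lucian": 0.06, "Galen": 0.04}))
# Plutarch -> 0.75
```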

(6) Bayesian Comparison of Best Author to Best Control: from Equal Priors, Chi-Square-Based Method

This is a comparison of the best 'author' candidate to the best 'control' candidate. "$VAR1" indicates which 'author' actually is the best candidate, while "$VAR3" indicates which 'control' actually is the best candidate. The one that is a closer match will have a value greater than 0.5.

Initial testing, using a large number of controls, shows that "$VAR2" values greater than 0.51 are typical for the true author, while "$VAR2" values less than 0.3 are typical when the result is not reliable. (Values between 0.3 and 0.51 do not appear to be probative; further analysis is recommended to confirm or disconfirm the result.)

(7) Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, Chi-Square-Based Method

Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Second-Best Author, Chi-Square-Based Method

Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, Chi-Square-Based Method

This scores every 'Sample'-sized window within the best candidate author and every 'Sample'-sized window outside it, then uses Bayesian analysis to estimate how likely a sample is to come from the best candidate author (rather than from outside that author generally, from the second-best candidate, or from the best control) given that it matches the best candidate as closely as the 'Sample' does.
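
I can only guess at the exact mechanics, but an empirical-Bayes calculation of this shape would look something like the following sketch (the function name and the rate-counting approach are my assumptions, not the program's code):

```python
def window_posterior(inside_scores, outside_scores, sample_score, prior=0.5):
    """Empirical Bayes sketch: how often do windows inside the best
    author score at least as well as the 'Sample' did, versus windows
    outside it?  Combine those two rates with the prior."""
    p_in = sum(s >= sample_score for s in inside_scores) / len(inside_scores)
    p_out = sum(s >= sample_score for s in outside_scores) / len(outside_scores)
    denom = prior * p_in + (1 - prior) * p_out
    return prior * p_in / denom if denom else 0.0

# Toy scores: 3 of 4 inside windows beat the sample's score, 1 of 4 outside.
print(window_posterior([0.9, 0.8, 0.7, 0.6], [0.5, 0.2, 0.1, 0.7], 0.65))
```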

Initial testing indicates that, if the first value is less than 0.9, or if either of the other two is less than 0.7, the candidate being considered may not be a reliable result.

(8) Author Z-Score-Based P-Values and Control Z-Score-Based P-Values

These results are similar to (4) above, which is based on Fisher's chi-square method for combining p-values.

Here, the absolute values of the Z-scores are taken, a weighted average is computed, and the resulting combined "Z-score" is then converted into a combined "p-value."

The weight assigned to each Z-score is equal to the square root of the mean number of appearances of the given 'word'. (If this mean value is less than 4, then a weight of 0 is assigned, and it is thereby removed from consideration.)

This method is less sensitive to extreme individual Q-values (since it takes an average of Z-scores instead of working with the Q-values directly) and is less sensitive to the effect of 'words' with fewer observations (due to weighting).

It is also insensitive to the effect of 'words' with too few appearances on average for the observed frequency to be significant (thus, it will perform more robustly when the practitioner fails to remove such uncommon words on their own).
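
Putting (8) together as a sketch (the function name is mine; returning 1.0 when every 'word' is dropped is my assumption about the degenerate case):

```python
import math

def combined_z_p(z_scores, mean_counts):
    """Weighted average of |Z|, with weight sqrt(mean count); 'words'
    averaging fewer than 4 appearances get weight 0 and drop out.
    The combined Z is then converted to a two-tailed p-value."""
    weights = [math.sqrt(m) if m >= 4 else 0.0 for m in mean_counts]
    if sum(weights) == 0:
        return 1.0
    z = sum(w * abs(zi) for w, zi in zip(weights, z_scores)) / sum(weights)
    return math.erfc(z / math.sqrt(2))

# Three 'words': the third averages only 1 appearance, so its extreme
# Z-score of 5.0 is ignored entirely.
print(combined_z_p([0.0, 2.0, 5.0], [9.0, 16.0, 1.0]))
```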

(9) Bayesian Author Test: Posterior Probabilities from Equal Priors, Z-Score-Based Method

Like (5) above, this is a straight comparison of the candidate authors.

(10) Bayesian Comparison of Best Author to Best Control: from Equal Priors, Z-Score-Based Method

Like (6) above, this is a comparison of the best candidate author against the best candidate control.

(11) Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, Z-Score-Based Method

Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Second-Best Author, Z-Score-Based Method

Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, Z-Score-Based Method

Like (7) above.

Some more notes:

Sometimes the same result is deemed 'unreliable' by one method but appears 'reliable' by another. This is acceptable; it should be interpreted as 'reliable.'

Sometimes one result is deemed 'unreliable' by one method, while a different result appears 'reliable' by another. This is also acceptable; the latter should be interpreted as 'reliable.'

Sometimes two different 'author' candidate results might be deemed 'reliable,' each according to a different method. Favor the Z-score-based method.

(I will show a worked example in another post....)