Basic Stylometry Beta (early access)
 Peter Kirby
 Site Admin
 Posts: 5087
 Joined: Fri Oct 04, 2013 2:13 pm
 Location: Santa Clara
Basic Stylometry Beta (early access)
An "early access" beta of a stylometry program is now available here:
http://peterkirby.com/basicstylometry.html
The program is very raw and may still have bugs, so you will need some patience if you want to get results with it. Here is some documentation.
Please report any bugs that you encounter, along with the error message and (if feasible) all the data you entered.
Entering Data
(1) The 'Words'
This program is exclusively based on the frequency of single words. (I might use other metrics in future efforts.)
You can put multiple tokens on a single line in order to match any of them. You can use this, for example, to list all the forms of a word belonging to a single lemma. You can also use it to combine multiple words (e.g., several prepositions or conjunctions) that do not show up frequently enough on their own but do show up in significant quantities together.
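As an illustration (a sketch in Python, not the program's actual code), tallying each line of alternatives as a single 'word' might look like this:

```python
from collections import Counter

# Illustrative sketch: each 'word' line lists alternative tokens,
# and a match on ANY of the tokens counts toward that line's tally.
word_lines = [
    "kai",
    "de d",        # elided and unelided forms tallied together
    "alla all",
]

tokens = "kai de kai all alla d kai".split()
counts = Counter(tokens)

tallies = {line: sum(counts[tok] for tok in line.split())
           for line in word_lines}
print(tallies)  # {'kai': 3, 'de d': 2, 'alla all': 2}
```

Note how "de d" and "alla all" each produce one combined count rather than two separate ones.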
Initial testing has shown that this list of twenty items has significant discriminatory power for samples that are approximately 750+ words (preferably 2,000+ words) in length. You can add to, subtract from, or modify the list as you like.
o oi h ai to ta
tou twn ths
tw tois th tais
ton tous thn tas
kai
te
de d
men
alla all
gar
eis
en
dia di
ek ec
kata kat kaq
pros
autos autou autw auton autoi autwn autois autous auth auths auth authn autai autwn autais autas auto auta
outos toutou toutw touton autoi toutwn toutois toutous auth tauths tauth tauthn autai tautais tautas touto touto tauta
tis tinos tini tina tines tinwn tisi tisin tinas ti tina
eimi ei esti estin esmen este eisi eisin hn hsqa hn hmen hte hsan esomai esh esei estai esomeqa esesqe esontai w hs h wmen hte wsi eihn eihs eih eihmen eimen eihte eite eihsan eien esoimhn esoio esoito esoimeqa esoisqe esointo isqi estw este estwn ontwn estwsan einai esesqai wn ousa on esomenos esomenh esomenon
Here are ten additional words that are sometimes useful.
epi ep
mh
oti
upo up
apo ap
meta met meq
ou ouk oux
polus pollou pollw polun pollh pollhs pollh pollhn polu pollou pollw polu polloi pollwn pollois pollous pollai pollwn pollais pollas polla pollwn pollois polla
pas pantos panti panta pas pasa pashs pash pasan pasa pan pantos panti pan pantes pantwn pasi pasin pantas pantes pasai paswn pasais pasas pasai panta pantwn
ode toude twde tonde oide twnde toisde tousde hde thsde thde thnde aide twnde taisde tasde tode toude tode tade ekeinos ekeinou ekeinw ekeinon ekeinoi ekeinwn ekeinois ekeinous ekeinh ekeinhs ekeinh ekeinhn ekeinai ekeinais ekeinas ekeino ekeinou ekeinw ekeino ekeina ekeinwn ekeinois ekeina
A good rule of thumb is that each 'word' should appear about 5 times or more, on average, in each 'Sample'-sized extract.
There is no single list of words that works best for distinguishing the style of every author from every other author. Some adjustment is generally required to arrive at a list that works well for distinguishing the works of one particular author from the rest.
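A quick arithmetic check of the rule of thumb, with made-up frequencies:

```python
# Rule-of-thumb check (made-up frequencies for illustration): a 'word'
# occurring 6 times per 1,000 words averages ~12 appearances in a
# 2,000-word sample, but only ~3 in a 500-word sample -- below the
# suggested threshold of about 5 per 'Sample'-sized extract.
def expected_count(freq_per_1000, sample_size):
    return freq_per_1000 * sample_size / 1000

print(expected_count(6, 2000))  # 12.0
print(expected_count(6, 500))   # 3.0
```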
(2) The 'Sample'
The sample is what you want to compare to the various possible 'authors' to see if any of them are a likely match.
A sample between 1,000 words and 5,000 words in length is preferred. Samples as small as approximately 500 words may work occasionally.
All input must be in "Beta Code." You can copy Beta Code out of the TLG with Diogenes, for example. The program will automatically discard non-word characters such as quote marks, numbers, and other symbols. If you don't have the TLG with Diogenes, you can get it here:
https://kat.cr/tlgphicdromewithan ... uarium.2.0
Diogenes has a setting to change its results to "Beta Code." Be sure to use this setting in order to copy over text. You can also increase the number of lines that are returned by Diogenes. Other sources of "Beta Code" Greek are also, of course, acceptable.
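A minimal sketch of the kind of cleanup described above; the program's actual rules may differ, e.g. in how Beta Code diacritics such as *, /, and = are treated:

```python
import re

# Hypothetical cleanup: delete every character that is not a letter
# or whitespace, so digits, quote marks, and Beta Code diacritics
# vanish without splitting words apart.
def clean_beta_code(text):
    return re.sub(r"[^a-zA-Z\s]", "", text).lower().split()

print(clean_beta_code('*KAI\\ E)GE/NETO 12 "E)N" TW=|'))
# ['kai', 'egeneto', 'en', 'tw']
```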
(3) The 'Authors'
These are the candidates being considered as possible authors.
The 'Sample' should not appear in any of the authors (remove the 'Sample' from the text of the author, if applicable).
If it's possible that the author isn't any one of these candidates, you'll need to use your own judgment in interpreting the results (see below).
The 'Author' sections should be as long as they can be, while not including any composite material (i.e., material from more than one author). At a minimum they should be about 4-5x as long as the 'Sample'. If necessary, you can break up the 'Sample' into parts (tested separately) in order to meet this requirement.
(4) The 'Controls'
These are extracts of ancient Greek by writers known not to be the author of the 'Sample'.
Again, the longer the better, while still avoiding composite texts.
These help you decide whether one of the candidates was the author or whether none of them was. (If one of the controls is a significantly closer match for the sample than the best candidate, that may mean that none of the candidates matches the sample significantly enough and that none of them can be declared the author.)
Interpreting the Results
When you've entered your data, press submit, and the results should appear in a new tab. This preserves your form data (at least for the length of the browser session). You will want to save your form data elsewhere if you want to work with it over multiple browser sessions.
(1) testsize, testsample, sample frequencies, and word list
This just repeats some of the data that you entered, along with a measurement of the sample length and the frequencies of the words in the 'Sample'.
(2) Author Stats and Control Stats
For each of the 'author'/'control' extracts, for each 'word', the mean and standard deviation for the appearance of the 'word' is calculated, based on samples made from the 'author'/'control' that are the exact same size as the 'Sample'.
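One way to picture this step (a Python sketch; the program may instead use overlapping or randomly drawn extracts rather than consecutive chunks):

```python
import statistics

# Sketch: slice the author text into consecutive 'Sample'-sized
# extracts, count the 'word' (any of its forms) in each, and take
# the mean and (population) standard deviation of those counts.
def word_stats(author_tokens, forms, sample_size):
    counts = []
    for start in range(0, len(author_tokens) - sample_size + 1, sample_size):
        chunk = author_tokens[start:start + sample_size]
        counts.append(sum(chunk.count(f) for f in forms))
    return statistics.mean(counts), statistics.pstdev(counts)

tokens = ("kai de " * 50).split()         # toy 100-token 'author'
print(word_stats(tokens, ["kai"], 20))    # mean 10, s.d. 0 per 20-token extract
```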
(3) Author Q-Values and Control Q-Values
For each of the 'author'/'control' extracts, for each 'word', the actual number of appearances of the 'word' in the 'Sample' (the observed frequency) is compared to the mean and standard deviation for the 'author'/'control' in order to arrive at a statistic called a Q-value (calculated from the Z-score). This estimates the likelihood that a number at least that far from the mean would be drawn from the normal distribution with the mean and standard deviation calculated from the respective 'author'/'control'. Values closer to 1 correspond to observed frequencies near the mean, which are more 'likely'; values closer to 0 correspond to observed frequencies several standard deviations away from the mean, which are less 'likely.'
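The Q-value described here behaves like a two-tailed normal tail probability, which can be sketched as:

```python
from math import erfc, sqrt

# Q-value as described: the probability of drawing a value at least
# as far from the mean as the observed count, under a normal
# distribution with the author's mean and standard deviation.
def q_value(observed, mean, sd):
    z = (observed - mean) / sd
    return erfc(abs(z) / sqrt(2))   # two-tailed tail probability

print(q_value(5.0, 5.0, 1.0))  # 1.0 (observation exactly at the mean)
print(q_value(8.0, 5.0, 1.0))  # ~0.0027 (three standard deviations out)
```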
(4) Author Chi-Square-Based P-Values and Control Chi-Square-Based P-Values
For each 'author'/'control', the Q-values are combined using a method known as "[wiki]Fisher's method[/wiki]" for combining several p-values, in order to arrive at a single p-value representing the likelihood that all the observed frequencies in the 'Sample' would be generated from that 'author'/'control' (according to the normal distribution, with the measured means and standard deviations).
This is not the only way to arrive at a combined p-value statistic. This method gives equal weight to all of the individual components, and it is more sensitive than the Z-score-based method (below) to the effect of any single extreme value on the result.
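Fisher's method itself is standard and can be sketched self-containedly; the closed-form chi-square survival function below is valid because the degrees of freedom, 2k, are always even:

```python
from math import exp, log

# Fisher's method: combine k p-values via X = -2 * sum(ln q_i),
# then take the chi-square survival probability with 2k degrees of
# freedom. For even df this has the closed form
#   P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
def fisher_combine(qvalues):
    k = len(qvalues)
    x = -2.0 * sum(log(q) for q in qvalues)
    term, total = 1.0, 0.0
    for i in range(k):
        if i > 0:
            term *= (x / 2) / i
        total += term
    return exp(-x / 2) * total

print(fisher_combine([1.0, 1.0]))   # 1.0: nothing unusual observed
print(fisher_combine([0.5, 0.5]))   # ~0.597
```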
(5) Bayesian Author Test: Posterior Probabilities from Equal Priors, Chi-Square-Based Method
This is a straight comparison of the different candidate authors, using an equal prior likelihood for each. The most likely candidate, using the chi-square-based method, has the highest number.
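With equal priors, this comes down to normalizing the candidates' combined p-values so they sum to 1, which is consistent with the numbers shown in the worked example later in this thread:

```python
# With equal priors, each candidate's posterior is its combined
# p-value divided by the sum of all candidates' p-values.
def posteriors_equal_priors(pvalues):
    total = sum(pvalues)
    return [p / total for p in pvalues]

print(posteriors_equal_priors([0.4, 0.1]))  # [0.8, 0.2]
```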
(6) Bayesian Comparison of Best Author to Best Control: from Equal Priors, Chi-Square-Based Method
This is a comparison of the best 'author' candidate to the best 'control' candidate. "$VAR1" indicates which 'author' actually is the best candidate, while "$VAR3" indicates which 'control' actually is the best candidate. The one that is a closer match will have a value greater than 0.5.
Initial testing, using a large number of controls, shows that "$VAR2" values greater than 0.51 are typical when the candidate is the author, while "$VAR2" values less than 0.3 are typical when the candidate being considered is not a reliable result. (Values between 0.3 and 0.51 do not appear to be probative; further analysis is recommended to confirm or disconfirm the result.)
(7) Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, Chi-Square-Based Method
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Second-Best Author, Chi-Square-Based Method
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, Chi-Square-Based Method
This scores every 'Sample'-sized sample in the best candidate author and every 'Sample'-sized sample outside the best candidate author, then uses Bayesian analysis to determine how likely a sample is to be from the best candidate author (rather than from outside that author generally, from the second-best author candidate, or from the best control candidate) if it is as closely similar to the best candidate author as the 'Sample' is.
Initial testing indicates that, if the first value is less than 0.9, or if either of the other two is less than 0.7, the candidate being considered may not be a reliable result.
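A sketch of the Bayesian step described above, where tpr and fpr stand for the 'percentage of samples meeting the test' inside and outside the best candidate author (both reported in the output):

```python
# With prior 0.5, the posterior that a sample passing the test is by
# the best candidate is
#   P(author | pass) = tpr * prior / (tpr * prior + fpr * (1 - prior))
# where tpr and fpr are the pass rates inside and outside the author.
def posterior_given_pass(tpr, fpr, prior=0.5):
    return tpr * prior / (tpr * prior + fpr * (1 - prior))

# With prior 0.5 this reduces to tpr / (tpr + fpr); e.g. pass rates
# of 14/15 inside and 8/101 outside give a posterior of about 0.9218.
print(posterior_given_pass(14 / 15, 8 / 101))
```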
(8) Author Z-Score-Based P-Values and Control Z-Score-Based P-Values
These results are similar to (4) above, which is based on Fisher's chi-square method for combining p-values.
Here, the absolute values of the Z-scores are taken, a weighted average is computed, and then the resulting combined "Z-score" is used to calculate a combined "p-value."
The weight assigned to each Z-score is equal to the square root of the mean number of appearances of the given 'word'. (If this mean value is less than 4, then a weight of 0 is assigned, and the 'word' is thereby removed from consideration.)
This method is less sensitive to extreme individual Q-values (since it averages Z-scores instead of working with the Q-values directly) and is less sensitive to the effect of 'words' with fewer observations (due to the weighting).
It is also insensitive to the effect of 'words' with too few appearances on average for the observed frequency to be significant (thus, it will perform more robustly when the practitioner fails to remove such uncommon words on their own).
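A sketch of the Z-score-based combination; the conversion of the combined Z-score back to a p-value is assumed here to be the same two-tailed normal probability used for the Q-values:

```python
from math import erfc, sqrt

# Sketch: weight each |Z| by the square root of the word's mean
# count (weight 0 if the mean is below 4), take the weighted
# average, and convert the combined Z back to a two-tailed p-value.
def combined_p(zscores, mean_counts):
    weights = [sqrt(m) if m >= 4 else 0.0 for m in mean_counts]
    if sum(weights) == 0:
        return 1.0
    z = sum(w * abs(zs) for w, zs in zip(weights, zscores)) / sum(weights)
    return erfc(z / sqrt(2))

# A word with mean count below 4 is ignored, however extreme its Z:
print(combined_p([0.0, 9.9], [16.0, 1.0]))  # 1.0
print(combined_p([2.0], [4.0]))             # ~0.0455
```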
(9) Bayesian Author Test: Posterior Probabilities from Equal Priors, Z-Score-Based Method
Like (5) above, this is a straight comparison of the candidate authors.
(10) Bayesian Comparison of Best Author to Best Control: from Equal Priors, Z-Score-Based Method
Like (6) above, this is a comparison of the best candidate author against the best candidate control.
(11) Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, Z-Score-Based Method
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Second-Best Author, Z-Score-Based Method
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, Z-Score-Based Method
Like (7) above.
Some more notes:
Sometimes a result is deemed 'unreliable' by one method but appears to be 'reliable' by another. This is acceptable. It should be interpreted as 'reliable.'
Sometimes a result is deemed 'unreliable' by one method, and a different result appears to be 'reliable' by another. This is acceptable. The latter should be interpreted as 'reliable.'
Sometimes two different 'author' candidate results might each be deemed 'reliable,' according to different methods. Favor the Z-score-based method.
(I will show a worked example in another post....)
"... almost every critical biblical position was earlier advanced by skeptics."  Raymond Brown
 Ben C. Smith
 Posts: 4031
 Joined: Wed Apr 08, 2015 2:18 pm
 Location: USA
Re: Basic Stylometry Beta (early access)
You just never seem to fail to impress, Peter Kirby.
(Hey, guys, remember that one time, several years ago, when Peter Kirby failed to impress? .... Yeah, thought so. Nor do I.)
Ben.
ΤΙ ΕΣΤΙΝ ΑΛΗΘΕΙΑ
 Peter Kirby
 Site Admin
 Posts: 5087
 Joined: Fri Oct 04, 2013 2:13 pm
 Location: Santa Clara
Re: Basic Stylometry Beta (early access)
LOL. Thanks!
 Peter Kirby
 Site Admin
 Posts: 5087
 Joined: Fri Oct 04, 2013 2:13 pm
 Location: Santa Clara
Re: Basic Stylometry Beta (early access)
Here's a little study of whether the First Apology (attributed to Justin) and the Second Apology (attributed to Justin) were written by the author of the Dialogue with Trypho (attributed to Justin). It is intended as an example of the sort of thing you can do with the program and also as a sort of sanity check for those who are a bit skeptical (and who believe that the result should indicate that the same person wrote all three).
Author Group
#1 Justin (Dialogue with Trypho)
#2 Tatian
#3 Athenagoras
#4 Irenaeus
#5 Clement of Alexandria
#6 Origen
#7 Eusebius
Control Group
#1 Josephus
#2 Acts
#3 Mark
#4 John
#5 1Cor
#6 Hebrews
#7 Revelation
#8 Life of Adam and Eve
#9 1 Maccabees
#10 2 Maccabees
#11 Polybius
#12 Diodorus Siculus
#13 Dionysius Halicarnassus
#14 Strabo
#15 Plutarch
#16 Arrian
#17 Herodian
#18 Herodotus
#19 Thucydides
#20 Xenophon
#21 Epictetus
#22 Galen
#23 Lucian
#24 Philostratus
#25 Basil
#26 John Chrysostom
To test the First Apology, we take a 4,641-word sample consisting of chapters 50 through 68.
The results indicate that the author is the same as the author of the Dialogue with Trypho.
testsize: 4641
Author Chi-Square-Based P-Values
$VAR1 = '0.888460855804236'; $VAR2 = 0; $VAR3 = 0; $VAR4 = '0.00631746893963075'; $VAR5 = '1.01758673269655e-08'; $VAR6 = '3.82515355529966e-10'; $VAR7 = '0.000730455451654238';
Control Chi-Square-Based P-Values
$VAR1 = '6.50657222634683e-49'; $VAR2 = 0; $VAR3 = 0; $VAR4 = 0; $VAR5 = 0; $VAR6 = 0; $VAR7 = 0; $VAR8 = '0.0217532829559411'; $VAR9 = 0; $VAR10 = 0; $VAR11 = 0; $VAR12 = 0; $VAR13 = 0; $VAR14 = '8.46752788675696e-21'; $VAR15 = 0; $VAR16 = 0; $VAR17 = 0; $VAR18 = 0; $VAR19 = '2.2846420327196e-32'; $VAR20 = 0; $VAR21 = 0; $VAR22 = '2.1288841096972e-40'; $VAR23 = '6.11732606067973e-33'; $VAR24 = 0; $VAR25 = '1.63086849061174e-19'; $VAR26 = '2.44254151170694e-18';
Bayesian Author Test: Posterior Probabilities from Equal Priors, Chi-Square-Based Method
$VAR1 = '0.992129686472721'; $VAR2 = '0'; $VAR3 = '0'; $VAR4 = '0.00705461409743644'; $VAR5 = '1.13632243837593e-08'; $VAR6 = '4.27148632687276e-10'; $VAR7 = '0.0008156876394695';
Bayesian Comparison of Best Author to Best Control: from Equal Priors, Chi-Square-Based Method
$VAR1 = 1; $VAR2 = '0.976100917323069'; $VAR3 = 8; $VAR4 = '0.0238990826769311';
Percentage of Samples in the Best Author Candidate that Meet the P-Value > 0.7 Test, Chi-Square-Based Method
0.181818181818182
Percentage of Samples outside the Best Author Candidate that Meet the P-Value > 0.7 Test, Chi-Square-Based Method
0
Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, Chi-Square-Based Method
1
Percentage of Samples in the Second-Best Author Candidate that Meet the P-Value > 0.7 Test, Chi-Square-Based Method
0
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Second-Best Author, Chi-Square-Based Method
1
Percentage of Samples in the Best Control Candidate that Meet the P-Value > 0.7 Test, Chi-Square-Based Method
0
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, Chi-Square-Based Method
1
Author Z-Score-Based P-Values
$VAR1 = '0.211556501767805'; $VAR2 = '9.87894456773742e-42'; $VAR3 = '2.11748551596198e-09'; $VAR4 = '0.00560915487879656'; $VAR5 = '0.0354182939677903'; $VAR6 = '0.0907131978115263'; $VAR7 = '0.0926953240598759';
Control Z-Score-Based P-Values
$VAR1 = '0.00158311378490499'; $VAR2 = '0.000395314434282135'; $VAR3 = '1.20917740013948e-14'; $VAR4 = '2.64020185029806e-12'; $VAR5 = '0'; $VAR6 = '3.11298773021434e-224'; $VAR7 = '2.86618459242993e-51'; $VAR8 = '0.119270040623747'; $VAR9 = '4.31835159940076e-13'; $VAR10 = '3.78634400157456e-13'; $VAR11 = '3.26965236089271e-05'; $VAR12 = '8.28894037464006e-05'; $VAR13 = '2.83748794406989e-05'; $VAR14 = '0.00699144781369404'; $VAR15 = '8.7302454204118e-16'; $VAR16 = '5.45242882713467e-05'; $VAR17 = '1.69376834935096e-09'; $VAR18 = '2.89547666909129e-07'; $VAR19 = '0.000276715507818051'; $VAR20 = '0.00215812982785909'; $VAR21 = '8.09173261441971e-14'; $VAR22 = '0.000342826130120986'; $VAR23 = '0.0126968173134049'; $VAR24 = '1.1126934485835e-07'; $VAR25 = '0.0142496971279844'; $VAR26 = '0.00448348611287948';
Bayesian Author Test: Posterior Probabilities from Equal Priors, Z-Score-Based Method
$VAR1 = '0.48522970943548'; $VAR2 = '2.26585208304949e-41'; $VAR3 = '4.85670198296137e-09'; $VAR4 = '0.0128652561810854'; $VAR5 = '0.0812360213327496'; $VAR6 = '0.208061384302719'; $VAR7 = '0.212607623891265';
Bayesian Comparison of Best Author to Best Control: from Equal Priors, Z-Score-Based Method
$VAR1 = 1; $VAR2 = '0.639478622961926'; $VAR3 = 8; $VAR4 = '0.360521377038074';
Percentage of Samples in the Best Author Candidate that Meet the P-Value > 0.21 Test, Z-Score-Based Method
0.727272727272727
Percentage of Samples outside the Best Author Candidate that Meet the P-Value > 0.21 Test, Z-Score-Based Method
0
Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, Z-Score-Based Method
1
Percentage of Samples in the Second-Best Author Candidate that Meet the P-Value > 0.21 Test, Z-Score-Based Method
0
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Second-Best Author, Z-Score-Based Method
1
Percentage of Samples in the Best Control Candidate that Meet the P-Value > 0.21 Test, Z-Score-Based Method
0
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, Z-Score-Based Method
1
To test the Second Apology, we put the entire 3,295-word text in.
Last but not least, let's identify an unreliable result (based on an excerpt of Romans 3-4).
Author Chi-Square-Based P-Values
$VAR1 = '1.94493602477902e-05'; $VAR2 = '6.87563449738722e-49'; $VAR3 = 0; $VAR4 = 0; $VAR5 = '3.74277169629409e-28'; $VAR6 = 0; $VAR7 = '5.86049482486032e-12';
Control Chi-Square-Based P-Values
$VAR1 = '3.45057626008072e-39'; $VAR2 = 0; $VAR3 = 0; $VAR4 = 0; $VAR5 = 0; $VAR6 = 0; $VAR7 = 0; $VAR8 = '7.54061444420771e-11'; $VAR9 = 0; $VAR10 = 0; $VAR11 = 0; $VAR12 = '1.93744868969708e-42'; $VAR13 = 0; $VAR14 = '7.58407562510822e-15'; $VAR15 = 0; $VAR16 = 0; $VAR17 = 0; $VAR18 = 0; $VAR19 = 0; $VAR20 = 0; $VAR21 = '1.57529752644245e-43'; $VAR22 = '3.84962930383549e-17'; $VAR23 = '1.64623005556979e-23'; $VAR24 = 0; $VAR25 = '3.95474661623243e-26'; $VAR26 = '4.87924453750406e-24';
Bayesian Author Test: Posterior Probabilities from Equal Priors, Chi-Square-Based Method
$VAR1 = '0.999999698679392'; $VAR2 = '3.53514580326519e-44'; $VAR3 = '0'; $VAR4 = '0'; $VAR5 = '1.92436693075552e-23'; $VAR6 = '0'; $VAR7 = '3.0132060820038e-07';
Bayesian Comparison of Best Author to Best Control: from Equal Priors, Chi-Square-Based Method
$VAR1 = 1; $VAR2 = '0.999996122964913'; $VAR3 = 8; $VAR4 = '3.87703508645621e-06';
Percentage of Samples in the Best Author Candidate that Meet the P-Value > 1.9e-05 Test, Chi-Square-Based Method
0.933333333333333
Percentage of Samples outside the Best Author Candidate that Meet the PValue>1.9e05 Test, ChiSquareBased Method
0.0792079207920792
Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, ChiSquareBased Method
0.921773142112125
Percentage of Samples in the SecondBest Author Candidate that Meet the PValue>1.9e05 Test, ChiSquareBased Method
0
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the SecondBest Author, ChiSquareBased Method
1
Percentage of Samples in the Best Control Candidate that Meet the PValue>1.9e05 Test, ChiSquareBased Method
0.2
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, ChiSquareBased Method
0.823529411764706
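For reference, the "posterior probability" lines above follow from a plain application of Bayes' rule to the two reported hit rates, with the stated prior of 0.5. A minimal sketch (in Python for illustration; the program itself appears to be written in Perl, judging from the Data::Dumper output):

```python
def posterior(hit_rate_author, hit_rate_others, prior=0.5):
    """Posterior probability that a sample meeting the test is by the
    best author candidate, given the rate at which that author's own
    samples meet the test and the rate at which other samples do."""
    num = hit_rate_author * prior
    return num / (num + hit_rate_others * (1.0 - prior))

# Figures reported above for the chi-square-based method:
p = posterior(0.933333333333333, 0.0792079207920792)
print(round(p, 6))  # 0.921773
```

With a 0.5 prior this reduces to hit_rate_author / (hit_rate_author + hit_rate_others), which is why a zero rate outside the author yields a posterior of exactly 1.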
Author Z-Score-Based P-Values
$VAR1 = '0.0806523205258673'; $VAR2 = '0.00144687349965014'; $VAR3 = '0.00230037565992109'; $VAR4 = '0.00170887928048379'; $VAR5 = '0.0410372784494046'; $VAR6 = '0.0354496933328968'; $VAR7 = '0.0571990842645113';
Control Z-Score-Based P-Values
$VAR1 = '0.00462825929965623'; $VAR2 = '0.000103340653796794'; $VAR3 = '9.27852482839872e-06'; $VAR4 = '1.1369044928401e-08'; $VAR5 = '9.5151095815589e-14'; $VAR6 = '2.07222888758312e-234'; $VAR7 = '1.93349986706008e-78'; $VAR8 = '0.0606829808842978'; $VAR9 = '8.26439242154331e-13'; $VAR10 = '4.18605420315407e-07'; $VAR11 = '2.35700063856191e-05'; $VAR12 = '0.000182856448558814'; $VAR13 = '0.00171504960896788'; $VAR14 = '0.0279026368047166'; $VAR15 = '1.56316735712744e-09'; $VAR16 = '0.000240126094636099'; $VAR17 = '1.79762719439227e-05'; $VAR18 = '4.06081317403127e-05'; $VAR19 = '0.00345519144246145'; $VAR20 = '0.00237669054742433'; $VAR21 = '0.00666006758748458'; $VAR22 = '0.0153212403407854'; $VAR23 = '0.0491781693169549'; $VAR24 = '0.000328289585005977'; $VAR25 = '0.0300989557781122'; $VAR26 = '0.0246370106233428';
Bayesian Author Test: Posterior Probabilities from Equal Priors, Z-Score-Based Method
$VAR1 = '0.36694420782355'; $VAR2 = '0.00658284655281216'; $VAR3 = '0.0104660289836991'; $VAR4 = '0.00777489537504488'; $VAR5 = '0.186707481367775'; $VAR6 = '0.161285621452833'; $VAR7 = '0.260238918444286';
Bayesian Comparison of Best Author to Best Control: from Equal Priors, Z-Score-Based Method
$VAR1 = 1; $VAR2 = '0.570645264991572'; $VAR3 = 8; $VAR4 = '0.429354735008428';
Percentage of Samples in the Best Author Candidate that Meet the P-Value > 0.08 Test, Z-Score-Based Method
0.933333333333333
Percentage of Samples outside the Best Author Candidate that Meet the P-Value > 0.08 Test, Z-Score-Based Method
0.0396039603960396
Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, Z-Score-Based Method
0.959294436906377
Percentage of Samples in the Second-Best Author Candidate that Meet the P-Value > 0.08 Test, Z-Score-Based Method
0
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Second-Best Author, Z-Score-Based Method
1
Percentage of Samples in the Best Control Candidate that Meet the P-Value > 0.08 Test, Z-Score-Based Method
0.2
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, Z-Score-Based Method
0.823529411764706
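As a side note, with equal priors the "Bayesian Author Test" posteriors work out to a simple normalization of the per-candidate P-values (each P-value treated as a likelihood). A sketch in Python (for illustration; the program itself appears to be Perl), using the author Z-score-based P-values listed above:

```python
def equal_prior_posteriors(pvalues):
    """With equal priors, Bayes' rule reduces to dividing each
    candidate's likelihood by the sum over all candidates."""
    total = sum(pvalues)
    return [p / total for p in pvalues]

# Author Z-score-based P-values from the listing above:
author_pvalues = [0.0806523205258673, 0.00144687349965014,
                  0.00230037565992109, 0.00170887928048379,
                  0.0410372784494046, 0.0354496933328968,
                  0.0571990842645113]
post = equal_prior_posteriors(author_pvalues)
print(round(post[0], 6))  # 0.366944
```

The first posterior reproduces the $VAR1 value reported in the Bayesian Author Test above.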
The bolded parts are all indications that the result is not reliable and can be discarded.
testsize: 829
Author Chi-Square-Based P-Values
$VAR1 = '1.64139255508376e-05'; $VAR2 = '8.94264926804366e-24'; $VAR3 = '7.14474143179853e-17'; $VAR4 = '1.00436073947967e-19'; $VAR5 = '1.03907335376701e-08'; $VAR6 = '2.50294178139516e-08'; $VAR7 = '1.9828950617084e-13';
Control Chi-Square-Based P-Values
$VAR1 = '3.85812800821849e-36'; $VAR2 = '4.70936019769718e-28'; $VAR3 = '2.35342420462997e-39'; $VAR4 = '4.81452529568474e-37'; $VAR5 = '3.98988031780169e-09'; $VAR6 = 0; $VAR7 = 0; $VAR8 = '2.95840507860174e-17'; $VAR9 = 0; $VAR10 = '1.25070413738294e-51'; $VAR11 = '5.05211938633974e-43'; $VAR12 = '4.59710989342204e-34'; $VAR13 = '1.5275742423603e-40'; $VAR14 = '3.90120214184003e-15'; $VAR15 = 0; $VAR16 = 0; $VAR17 = 0; $VAR18 = 0; $VAR19 = '4.23818579524785e-28'; $VAR20 = '3.37160534646114e-33'; $VAR21 = '4.50902205262221e-17'; $VAR22 = '9.7817794756768e-12'; $VAR23 = '7.35790723030463e-32'; $VAR24 = 0; $VAR25 = '1.88975491293184e-05'; $VAR26 = '7.27129411865981e-16';
Bayesian Author Test: Posterior Probabilities from Equal Priors, Chi-Square-Based Method
$VAR1 = '0.997846701630155'; $VAR2 = '5.43647712322986e-19'; $VAR3 = '4.34348057059187e-12'; $VAR4 = '6.1057791936035e-15'; $VAR5 = '0.000631680651649651'; $VAR6 = '0.00152160565929337'; $VAR7 = '1.20545526472397e-08';
Bayesian Comparison of Best Author to Best Control: from Equal Priors, Chi-Square-Based Method
$VAR1 = 1; $VAR2 = '0.464832627340306'; $VAR3 = 25; $VAR4 = '0.535167372659694';
Percentage of Samples in the Best Author Candidate that Meet the P-Value > 1.6e-05 Test, Chi-Square-Based Method
0.983870967741935
Percentage of Samples outside the Best Author Candidate that Meet the P-Value > 1.6e-05 Test, Chi-Square-Based Method
0.366906474820144
Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, Chi-Square-Based Method
0.728373851043725
Percentage of Samples in the Second-Best Author Candidate that Meet the P-Value > 1.6e-05 Test, Chi-Square-Based Method
0.807692307692308
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Second-Best Author, Chi-Square-Based Method
0.549168975069252
Percentage of Samples in the Best Control Candidate that Meet the P-Value > 1.6e-05 Test, Chi-Square-Based Method
0.804878048780488
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, Chi-Square-Based Method
0.550032988783813
Author Z-Score-Based P-Values
$VAR1 = '0.092944936302373'; $VAR2 = '0.0179815177344596'; $VAR3 = '0.0302195183096689'; $VAR4 = '0.0248526327223616'; $VAR5 = '0.0559421645002796'; $VAR6 = '0.0879517542012107'; $VAR7 = '0.0421041182426673';
Control Z-Score-Based P-Values
$VAR1 = '0.00519145846504496'; $VAR2 = '0.0215540129534138'; $VAR3 = '0.0187010097507574'; $VAR4 = '0.0146081967706652'; $VAR5 = '0.0917803912120555'; $VAR6 = '0.00293637254008674'; $VAR7 = '0.000195125886212675'; $VAR8 = '0.0470097764320487'; $VAR9 = '0.000884154117507113'; $VAR10 = '0.00293141385939037'; $VAR11 = '0.00332657651863712'; $VAR12 = '0.00425320970557994'; $VAR13 = '0.00646680016129299'; $VAR14 = '0.0260502374964299'; $VAR15 = '0.000481371517759226'; $VAR16 = '0.001802460355443'; $VAR17 = '6.4985196428693e-06'; $VAR18 = '3.34298816939969e-05'; $VAR19 = '0.000273259415753231'; $VAR20 = '0.0134573437681482'; $VAR21 = '0.0354389905667045'; $VAR22 = '0.0406696129722549'; $VAR23 = '0.0138978627036434'; $VAR24 = '2.37747537918995e-05'; $VAR25 = '0.0784283645521996'; $VAR26 = '0.0242931502755357';
Bayesian Author Test: Posterior Probabilities from Equal Priors, Z-Score-Based Method
$VAR1 = '0.264050633468642'; $VAR2 = '0.0510843445313165'; $VAR3 = '0.0858517232915847'; $VAR4 = '0.0706047437845793'; $VAR5 = '0.15892811982624'; $VAR6 = '0.24986532172076'; $VAR7 = '0.119615113376876';
Bayesian Comparison of Best Author to Best Control: from Equal Priors, Z-Score-Based Method
$VAR1 = 1; $VAR2 = '0.503152099135476'; $VAR3 = 5; $VAR4 = '0.496847900864524';
Percentage of Samples in the Best Author Candidate that Meet the P-Value > 0.09 Test, Z-Score-Based Method
0.983870967741935
Percentage of Samples outside the Best Author Candidate that Meet the P-Value > 0.09 Test, Z-Score-Based Method
0.405275779376499
Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, Z-Score-Based Method
0.708255603508284
Percentage of Samples in the Second-Best Author Candidate that Meet the P-Value > 0.09 Test, Z-Score-Based Method
0.846153846153846
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Second-Best Author, Z-Score-Based Method
0.537627118644068
Percentage of Samples in the Best Control Candidate that Meet the P-Value > 0.09 Test, Z-Score-Based Method
0.5
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, Z-Score-Based Method
0.66304347826087
"... almost every critical biblical position was earlier advanced by skeptics." – Raymond Brown
Re: Basic Stylometry Beta (early access)
Updates:
(a) A worked example, based on the works of Justin Martyr, has been posted above.
(b) Some of the specific guidance (in the OP) for identifying a 'reliable' or 'unreliable' result has been adjusted, based on experience.
(c) A list of ten additional words that may often be useful has been added to the OP.
(d) The order of the results printed by the program has been reorganized, so that less-important information is appended at the end.
(e) A bug has been fixed.
Re: Basic Stylometry Beta (early access)
Methodologically you will encounter a few problems most frequently:
(1) False Positive, and the Actual Author Is Listed among the Candidates
Check that everything is set up sufficiently well: make sure that the 'Sample' is generous (>2000 words preferably) and that the list of 'words' includes enough features to discriminate between authors effectively.
This can happen when there is a "dead ringer" or "look-alike" for the actual author among the 'authors'. What I mean by this is that there may be another extracted candidate author that has average/mean values for each of the 'words' being considered that are very close to the average/mean values for the actual author.
This can happen, and it is just a natural consequence of the type of measurements being made. Stylometry is less like genetic sequencing (comparing whether two samples are a DNA match) and more like biometrics (comparing whether a person matches the height, weight, sex, build, and description of the suspect). Even with a modestly adequate list of biometric or stylometric measurements, there will typically be more than one person in the population that has these measurements.
The crude solution to this is to remove your "dead ringer" from the list of 'authors', if it is not really a genuine candidate. (It is not recommended to remove it from the list of candidate authors if it is genuinely a possible candidate author.) This will mean that the results are not as robust as they might be otherwise, but it could keep this "look-alike" from interfering with the task of determining whether the sample is more like the (believed) actual author than like the other candidates.
A better solution is to adjust (add to, subtract from, or modify) the list of words and/or to take a larger 'Sample' (if possible). This may provide the data needed to tell the two "look-alikes" apart (for example, perhaps one of them has the stylometric equivalent of a "mole on the left elbow" but the other doesn't; adding another preposition or two to consideration may be just what is needed).
(Words may need to be subtracted, not just added, if they are not, in fact, invariant regardless of subject for this particular author.)
(2) False Positive, and the Actual Author Is Not Listed among the Candidates
One way to fight this problem is to include more controls. The more controls there are, the more likely it is that one of them will present a closer match to the sample than the supposed candidate author, and the more likely you will be able to identify the false positive as not likely being one of the candidates. (Obviously, this is in direct tension with the 'crude' solution mentioned under the other headings, which is why the crude solution is not recommended if there is a good chance that the actual author is not one of the candidates.)
Another way to fight this problem is to make sure that your list of words consists of those which appear sufficiently often (thus avoiding estimates based on faulty indicators, something the chi-square-based method is susceptible to). Along with this, make sure that the sample is not overly large (less than 5000 words max). After checking these things, make sure that the Z-score-based P-value for the candidate author is generously large (greater than 0.05–0.10 seems quite common for authors). This is an absolute reference point, instead of a relative reference point, for the similarity of the 'Sample' to the author. If the candidate author being proposed is simply the best of a bad bunch of matches, the P-values may be low in absolute terms.
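For concreteness, here is one common way a Z-score is converted into a P-value, via a two-sided test against the standard normal distribution. This is an illustrative sketch in Python; the program's exact formula for its Z-score-based P-values may differ:

```python
import math

def two_sided_pvalue(z):
    """Two-sided normal P-value for a z-score, using the complementary
    error function: p = erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# A z-score of 1.96 corresponds to the familiar 0.05 threshold:
print(round(two_sided_pvalue(1.96), 3))  # 0.05
```

On this construction, a sample whose word frequencies sit within about two standard deviations of the author's means will show the "generously large" P-values described above.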
(3) False Negative (and the Actual Author Is Listed among the Candidates)
Check that everything is set up sufficiently well: make sure that the 'Sample' is generous (>2000 words preferably) and that the list of 'words' includes enough features to discriminate between authors effectively.
This can happen when there is a "dead ringer" or "look-alike" for the actual author among the 'controls'. What I mean by this is that there may be another extracted candidate author/control that has average/mean values for each of the 'words' being considered that are very close to the average/mean values for the actual author.
This can happen, and it is just a natural consequence of the type of measurements being made. Stylometry is less like genetic sequencing (comparing whether two samples are a DNA match) and more like biometrics (comparing whether a person matches the height, weight, sex, build, and description of the suspect). Even with a modestly adequate list of biometric or stylometric measurements, there will typically be more than one person in the population that has these measurements.
The crude solution to this is to remove your "dead ringer" from the list of 'controls'. This will mean that the results are not as robust as they might be otherwise (especially if the author might not be any of the candidate authors, which is what the controls are there to gauge), but it could keep this "look-alike" from interfering with the task of determining whether the sample is more like the (believed) actual author than like the other candidates.
A better solution is to adjust (add to, subtract from, or modify) the list of words and/or to take a larger 'Sample' (if possible). This may provide the data needed to tell the two "look-alikes" apart (for example, perhaps one of them has the stylometric equivalent of a "mole on the left elbow" but the other doesn't; adding another preposition or two to consideration may be just what is needed).
(Words may need to be subtracted, not just added, if they are not, in fact, invariant regardless of subject for this particular author.)
Re: Basic Stylometry Beta (early access)
Declension of Greek nouns and conjugation of Greek verbs can be conveniently accessed here:
http://en.wiktionary.org/
Data for Greek word frequency can be accessed here:
http://perseus.uchicago.edu/GreekFrequency.html
With this information you can find other words that you might want to consider as "function words" (a term in stylometry for words that appear frequently, are not a matter of the subject being discussed, and can be considered a matter of style in their use) to be added to your list.
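As an illustration of the kind of screening this enables, a few lines of Python (a hypothetical helper, not part of the program) can surface the most frequent single-word tokens in a transliterated sample as candidate function words:

```python
from collections import Counter

def frequent_tokens(text, top_n=20):
    """Count single-word tokens and return the most frequent ones;
    high-frequency, topic-neutral tokens are candidate function words."""
    tokens = text.lower().split()
    return Counter(tokens).most_common(top_n)

# Tiny transliterated example:
sample = "kai de kai gar en tw kai de en"
print(frequent_tokens(sample, 3))  # [('kai', 3), ('de', 2), ('en', 2)]
```

Tokens that rank high here but are tied to the subject matter (names, topical nouns) should still be excluded by hand before adding anything to the word list.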
Re: Basic Stylometry Beta (early access)
Here are some sample files that you can work with.
Attachments:
control4.txt (1.06 MiB)
control3.txt (2.84 MiB)
control2.txt (611.22 KiB)
control.txt (401.91 KiB)
justin.txt (1018.52 KiB)
greek.txt (1.21 MiB)
Re: Basic Stylometry Beta (early access)
The chi-square-based method (Fisher's method) for aggregating P-values is, in this context, too unreliable and has been 'commented out' of the source code (and is thus no longer calculated or displayed).
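For anyone curious what was removed: Fisher's method combines k independent P-values via the statistic X = -2 Σ ln(p_i), which is chi-square distributed with 2k degrees of freedom under the null hypothesis. A self-contained sketch in Python (illustrative only, not the program's code; it uses the closed-form chi-square survival function for even degrees of freedom):

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent P-values with Fisher's method.
    X = -2 * sum(ln p_i) is chi-square with 2k degrees of freedom."""
    k = len(pvalues)
    x = -2.0 * sum(math.log(p) for p in pvalues)
    # For even df = 2k the chi-square survival function is closed-form:
    # sf(x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

print(round(fisher_combined_pvalue([0.1, 0.1]), 6))  # 0.056052
```

One known weakness, consistent with the unreliability noted above: a single extremely small p_i (as from a rarely attested word) dominates the sum and drives the combined value toward zero.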
Re: Basic Stylometry Beta (early access)
Wow! Really impressive work here (and it's obvious that lots of it went into this)! I applaud the effort to add some objective criteria to the analysis of ancient texts. At least with these kinds of calculations, it's clear for all to see how the final results were arrived at. At least in an argument over these types of results, it's clear what we're arguing about.
It is intimidating to someone who doesn't know ancient Greek, though. I can hardly imagine how much work was put into the calculator, and this thread explaining it, but there are some (pretty obvious, I'm sure you've thought of them) things that would make life easier for the user, and the tool much more powerful, that stand out:
Check boxes/drop-down lists for the common words (features), and a database of Greek documents that could be selected at will for the analysis by their English title and chapter contents. (I know, this is no minor request.)
A way for the program to analyze any sample text and automatically come up with a list of its common words and their statistics, without having to specify beforehand which word features you're interested in, maybe through a built-in database of Greek words/features. This way you could get an idea of which word features are good choices to use with your sample. (The list of 20 important word features you provide is aimed at helping with this issue, I think.)
More of the same (and more tall orders): An ability to save your state between sessions, plus default/preset setups, would save frequent users a lot of time and energy copy/pasting. Maybe most importantly of all, this would make it more feasible to share one's findings with others, and allow them to study what exactly you did and how you did it.
I'm sorry if this comes off as critical at all; I'm very excited by this kind of tool, and it's hard for me not to let my mind run wild and imagine what might be possible. Obviously, if you could get a database of sources connected to it and a way to save states, adding in more/different algorithms of analysis might not seem that hard, and it could really be the start of something amazing.
Jeff