Can an author's unique "literary style" be used to identify him/her as the author of a text? [closed]

2

20

Let's imagine I have two English-language texts written by the same person. Is it possible to apply some Markov chain algorithm to analyse each one: create some kind of fingerprint based on statistical data, and compare the fingerprints obtained from different texts? Let's say we have a library of 100 texts. One person wrote text number 1 and also one of the other texts, and we need to guess which one by analyzing his/her writing style. Is there any known algorithm that does this? Can Markov chains be applied here?

Adduct answered 22/1, 2011 at 23:22 Comment(2)

A famous example is the question of who wrote which of the Federalist Papers. See notes 19 and 20 there. – Fathead 8/5, 2011 at 10:39

I feel this question shouldn't be closed. In 2017 there was a competition on Kaggle, Spooky Author Identification, which shows the relevance of this question. – Inandin 26/9, 2018 at 17:34

19

Absolutely, it is possible; indeed, the record of success in identifying an author from a text, or from some portion of it, is impressive.

A couple of representative studies (warning: links are to pdf files):

To aid your web-search, this discipline is often called Stylometry (and occasionally, Stylogenetics).

So the two most important questions are, I suppose: which classifiers are useful for this purpose, and what data is fed to the classifier?

What I still find surprising is how little data is required to achieve very accurate classification. Often the data is just a word frequency list. (A directory of word frequency lists is available online here.)

For instance, one data set widely used in Machine Learning, and available from a number of places on the Web, comprises works by four authors: Shakespeare, Jane Austen, Jack London, and Milton. These works were divided into 872 pieces (corresponding roughly to chapters), in other words about 220 substantial pieces of text for each of the four authors; each piece becomes a single data point in the data set. Next, a word-frequency scan was performed on each text, and the 70 most common words were kept for the study; the rest of the frequency-scan results were discarded. Here are the first 20 of that 70-word list:

['a', 'all', 'also', 'an', 'and', 'any', 'are', 'as', 'at', 'be', 'been',
  'but', 'by', 'can', 'do', 'down', 'even', 'every', 'for', 'from']

Each data point, then, is just the count of each of the 70 words in one of the 872 chapters:

[78, 34, 21, 45, 76, 9, 23, 12, 43, 54, 110, 21, 45, 59, 87, 59, 34, 104, 93, 40]

Each of these data points is one instance of the author's literary fingerprint.

The final item in each data point is an integer (1-4) representing one of the four authors to whom that text belongs.
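The data-point layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses just the 20 words quoted above rather than the full 70-word list, the function name `fingerprint`, the tokenizing regex, and the sample sentence are made up for the example, and which integer maps to which author is not specified in the data set description.

```python
import re
from collections import Counter

# The first 20 words of the 70-word list quoted above (the full list is not shown here).
VOCAB = ['a', 'all', 'also', 'an', 'and', 'any', 'are', 'as', 'at', 'be', 'been',
         'but', 'by', 'can', 'do', 'down', 'even', 'every', 'for', 'from']

def fingerprint(text, author_id, vocab=VOCAB):
    """Count each vocabulary word in `text`; append the author label as the final item.

    `author_id` is a hypothetical integer label (1-4 in the data set described above).
    """
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [counts[w] for w in vocab] + [author_id]

# Made-up sample text, standing in for one chapter.
sample = "All that glisters is not gold, and a fool and an ass are as one."
print(fingerprint(sample, author_id=1))
```

On a real chapter the counts would be much larger, as in the vector shown above; dividing by the chapter's total word count is a common way to make chapters of different lengths comparable.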

Recently, I ran this dataset through a simple unsupervised ML algorithm; the results were very good, with almost complete separation of the four classes. You can see them in my answer to a previous question on StackOverflow, which concerned text classification using ML generally rather than author identification.

So what other algorithms are used? Apparently, most Machine Learning algorithms in the supervised category can successfully resolve this kind of data. Among these, multi-layer perceptrons (MLPs, a kind of neural network) are often used (Author Attribution Using Neural Networks is one such frequently cited study).
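To make the supervised setup concrete, here is a minimal sketch using a nearest-centroid classifier. Note that this is a deliberately simpler stand-in, not the MLP from the cited study and not the unsupervised algorithm mentioned earlier. The function names and the tiny three-word toy rows are invented for illustration; they are not taken from the actual data set.

```python
import math

def normalize(counts):
    """Turn raw word counts into relative frequencies so text length does not dominate."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def train_centroids(rows):
    """rows: lists of word counts whose final item is the author label."""
    sums, sizes = {}, {}
    for row in rows:
        *counts, label = row
        freqs = normalize(counts)
        acc = sums.setdefault(label, [0.0] * len(freqs))
        for i, f in enumerate(freqs):
            acc[i] += f
        sizes[label] = sizes.get(label, 0) + 1
    # Average the per-author frequency vectors to get one centroid per author.
    return {label: [s / sizes[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, counts):
    """Return the label of the centroid closest (Euclidean distance) to the text."""
    freqs = normalize(counts)
    return min(centroids, key=lambda label: math.dist(freqs, centroids[label]))

# Made-up toy data: three word counts followed by an author label.
toy = [
    [30, 5, 10, 1],
    [28, 6, 12, 1],
    [10, 20, 5, 2],
    [12, 22, 4, 2],
]
centroids = train_centroids(toy)
print(predict(centroids, [29, 4, 11]))  # an unlabeled text, counts only
```

With real data you would hold out some of the labeled chapters and measure accuracy on them, rather than judging the classifier on the texts it was trained on.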

Gamba answered 28/1, 2011 at 11:45 Comment(4)

Is it possible then to trace an anonymous article to its author by analyzing public SNS texts? – Paleontology 18/7, 2016 at 9:46

@FRIdSUN not sure what you mean by "SNS"; I'll assume it's a typo and you meant SMS. If so, my answer is no. The reason is that SMS messages have their own (informal) style, syntax, and usage rules, and those rules would effectively conceal an author's literary prose style. So, for instance, stop word frequency, often a strong signature of author style (i.e., consistent across many of the author's texts), is probably useless for SMS: e.g., SMS texts rarely have any stop words ("a", "an", "the") for brevity's sake, often use symbols instead of stop words ("&" for "and"), etc. – Gamba 19/7, 2016 at 2:32

SNS = Social Network Service. I meant if it is possible to analyze Facebook posts, Twitter tweets, Medium articles and the like to do such identification. – Paleontology 5/9, 2016 at 5:47

Prime reason not to abbreviate something if it is not well known. – Samp 19/6, 2018 at 0:51

1

You might start with a visit to the Apache Mahout web site. There is a giant literature on classification and clustering. Essentially, you want to run a clustering algorithm, and then hope that 'which writer' determines the clusters.

Deacon answered 22/1, 2011 at 23:30 Comment(1)

+1 for the Apache Mahout reference – Jeconiah 22/1, 2011 at 23:36
