Analysis of Shakespeare character speech topics

In Part I of this post I made a topic model of the speech of Shakespeare characters from eight plays. Here in Part II I'll analyze the results of the model. Download notebook.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

from collections import defaultdict
from gensim import corpora, models, similarities
from pprint import pprint

Here I load data from Part I. You can find the data here.

In [2]:
word_data_df = pd.read_csv('data/word_data_df.csv', index_col=0)
personae = [tuple(character) for character in word_data_df[['persona', 'play']].values]
plays = word_data_df['play'].unique()
corpus = corpora.mmcorpus.MmCorpus('data/')
tfidf = models.TfidfModel(corpus)
lsi = models.lsimodel.LsiModel.load('data/shkspr.lsi')

gensim can calculate a similarity value between each pair of characters using cosine similarity. The input is each character's speech from the corpus, transformed first into tf-idf space and then into the topic space of the LSI model.

In [3]:
matsim = similarities.MatrixSimilarity(lsi[tfidf[corpus]], num_best=6)
WARNING:gensim.similarities.docsim:scanning corpus to determine the number of features (consider setting `num_features` explicitly)
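
As a rough illustration of the metric itself, here is a minimal sketch of cosine similarity in plain Python. The three vectors are made-up numbers standing in for characters projected into a 3-topic space; they are not taken from the model, and `cosine_similarity` is a helper written for this example rather than part of gensim.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: an empty vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical topic vectors (invented for illustration only).
speaker_a = [0.6, -0.1, 0.1]
speaker_b = [0.5, -0.2, 0.0]
speaker_c = [-0.1, 0.7, 0.3]

# A vector compared with itself scores (approximately) 1.
print(cosine_similarity(speaker_a, speaker_a))
# Vectors pointing in similar directions score close to 1.
print(cosine_similarity(speaker_a, speaker_b))
# Vectors pointing in different directions score near or below 0.
print(cosine_similarity(speaker_a, speaker_c))
```

Because the score depends only on direction and not on vector length, a character with many lines and a character with few lines can still score as similar if they talk about the same things.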

For each of the ten characters with the most lines, this prints the most similar characters along with their similarity scores. Most characters are most similar to other characters in their own play. Under this model, Mark Antony in Antony and Cleopatra is more closely related to three other characters from Antony and Cleopatra than to himself in Julius Caesar. This doesn't seem too far-fetched, as characters in different plays are concerned with different people and problems.

In [4]:
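# Each entry of matsim is a list of (document index, score) pairs, best match first.
for sims in list(matsim)[:10]:
    # The first match is taken to be the character itself, so use it as the heading.
    persona_index = sims[0][0]
    print('|'.join(personae[persona_index]))
    for other_persona_index, score in sims[1:]:
        print('\t{:<30}{:.3f}'.format('|'.join(personae[other_persona_index]), score))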
	KING CLAUDIUS|hamlet          0.384
	LORD POLONIUS|hamlet          0.314
	MACBETH|macbeth               0.308
	OTHELLO|othello               0.297
	IAGO|othello                  0.291
	BIANCA|othello                0.502
	OTHELLO|othello               0.465
	DESDEMONA|othello             0.458
	LODOVICO|othello              0.425
	EMILIA|othello                0.424
	IAGO|othello                  0.465
	CASSIO|othello                0.397
	EMILIA|othello                0.369
	DESDEMONA|othello             0.339
	HAMLET|hamlet                 0.297
	CLEOPATRA|a_and_c             0.367
	DOMITIUS ENOBARBUS|a_and_c    0.340
	OCTAVIUS CAESAR|a_and_c       0.316
	BRUTUS|j_caesar               0.273
	ANTONY|j_caesar               0.252
	LADY MACBETH|macbeth          0.331
	HAMLET|hamlet                 0.308
	LENNOX|macbeth                0.272
	DUNCAN|macbeth                0.271
	MACDUFF|macbeth               0.255
	CASSIUS|j_caesar              0.530
	ANTONY|j_caesar               0.498
	TITINIUS|j_caesar             0.402
	Servant|j_caesar              0.363
	CAESAR|j_caesar               0.350
	JULIET|r_and_j                0.420
	FRIAR LAURENCE|r_and_j        0.410
	LADY CAPULET|r_and_j          0.322
	BENVOLIO|r_and_j              0.292
	Nurse|r_and_j                 0.277
	DOMITIUS ENOBARBUS|a_and_c    0.439
	OCTAVIUS CAESAR|a_and_c       0.375
	MARK ANTONY|a_and_c           0.367
	BRUTUS|j_caesar               0.317
	HAMLET|hamlet                 0.290
	FRIAR LAURENCE|r_and_j        0.486
	BENVOLIO|r_and_j              0.441
	Nurse|r_and_j                 0.441
	ROMEO|r_and_j                 0.420
	LADY CAPULET|r_and_j          0.381
	BASSANIO|merchant             0.346
	GRATIANO|merchant             0.325
	ANTONIO|merchant              0.309
	SHYLOCK|merchant              0.262
	HAMLET|hamlet                 0.250

Latent Semantic Indexing (LSI) creates a lower-dimensional subspace of the space spanned by all words (i.e. a space in which each word represents one orthogonal dimension). The speech of each character can be projected into this smaller space. Below is the projection of Hamlet's speech into the first 10 dimensions. Because of the way the space is constructed, the first dimensions contain the most information.

In [5]:
[(0, 0.59246577031622616),
 (1, -0.14281562247124016),
 (2, 0.096597123493435619),
 (3, -0.041090063998733474),
 (4, 0.0054346234670322726),
 (5, -0.2281871246338642),
 (6, 0.0022814406628075051),
 (7, -0.114377333096131),
 (8, 0.096684501194655312),
 (9, -0.022321022902042544)]
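
The projection comes back as a list of (topic index, weight) pairs. As a small sketch of how to read it, the snippet below copies the values printed above and ranks the topics by the absolute size of their weight. The variable names are chosen for this example only.

```python
# Values copied from the output above: (topic index, weight) pairs.
hamlet_vec = [
    (0, 0.59246577031622616),
    (1, -0.14281562247124016),
    (2, 0.096597123493435619),
    (3, -0.041090063998733474),
    (4, 0.0054346234670322726),
    (5, -0.2281871246338642),
    (6, 0.0022814406628075051),
    (7, -0.114377333096131),
    (8, 0.096684501194655312),
    (9, -0.022321022902042544),
]

# Rank topics by magnitude; the sign only says which way along the axis.
ranked = sorted(hamlet_vec, key=lambda pair: abs(pair[1]), reverse=True)

for topic, weight in ranked[:3]:
    print('topic {}: {:+.3f}'.format(topic, weight))
# topic 0: +0.592
# topic 5: -0.228
# topic 1: -0.143
```

Note that the ordering of topics by information content holds for the corpus as a whole. An individual character can still carry more weight on a later topic than on an earlier one, as topic 5 does here.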

The functions below plot the projection of each character's speech onto two of the axes (topics) defined by the LSI model. This is useful for visualizing the result of the model. The most important 10 words in each topic are printed above the graph.

In [6]:
def format_topic_coeffs(topic):
    """Return a list of (coefficient, word) tuples with each coefficient
    formatted to 3 decimal places.
    """
    return [('{0:.3f}'.format(coeff), word) for coeff, word in topic]

def plot_axes(x=0, y=1, model=lsi, corpus=corpus, 
              tfidf=tfidf, personae=personae, plays=plays):
    """Plot each character in personae according to the projection of their
    speech into the given x and y topic axes of model.
    Points are colored according to play and labeled with the character.
    :param x: the index of the x axis to plot
    :param y: the index of the y axis to plot
    :param model: the gensim model to project into
    :param corpus: the gensim corpus of documents
    :param tfidf: a tfidf model for converting documents into tfidf space
    :param personae: a list of (character, play) tuples, the order must correspond to
      the order of documents in the corpus
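    :param plays: a list of all the plays existing in the data
    """
    x_data = defaultdict(list)
    y_data = defaultdict(list)
    chars = defaultdict(list)
    # NOTE: the pprint calls and the body of the loop below are missing from
    # this listing; they are reconstructed here as an assumption, based on the
    # printed output and on how x_data, y_data and chars are used further down.
    print('x topic:')
    pprint(format_topic_coeffs(model.show_topic(x)))
    print('')
    print('y topic:')
    pprint(format_topic_coeffs(model.show_topic(y)))
    for persona, doc in zip(personae, corpus):
        play = persona[1]
        # Project the document into topic space; absent topics count as 0.
        projection = dict(model[tfidf[doc]])
        x_data[play].append(projection.get(x, 0.0))
        y_data[play].append(projection.get(y, 0.0))
        chars[play].append(persona[0])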
    plt.figure(figsize=(10, 10))
    ax = plt.gca()
    cmap = plt.get_cmap('Paired')
    play_index = {play: i for i, play in enumerate(plays)}
    for play in play_index:
        color_index = play_index[play] / float(len(play_index))
        plt.scatter(x_data[play], y_data[play], color=cmap(color_index), 
                    label=play, alpha=.5, s=40)
        for char, x, y in zip(chars[play], x_data[play], y_data[play]):
            ax.annotate(char, xy=(x, y), xycoords='data', xytext=(1, 1), 
                        textcoords='offset points', size=10)
    plt.legend(loc=1, ncol=2, scatterpoints=1)

Here the y-axis separates the plays about Romans from other plays. Looking at the list of words that make up this topic, we can see that the Romans talk a lot about "Caesar", "Antony", and "Rome", but not much about "Romeo" or "Tybalt". The characters from Romeo and Juliet are the opposite, and they extend the other way along the y-axis.

In [7]:
plot_axes(x=0, y=1)
x topic:
[('0.192', u'caesar'),
 ('0.125', u'lord'),
 ('0.121', u'antony'),
 ('0.112', u'brutus'),
 ('0.106', u'thou'),
 ('0.105', u'romeo'),
 ('0.093', u'cassio'),
 ('0.091', u'love'),
 ('0.084', u'thee'),
 ('0.078', u'madam')]

y topic:
[('0.513', u'caesar'),
 ('0.378', u'brutus'),
 ('0.286', u'antony'),
 ('0.192', u'cassius'),
 ('-0.151', u'romeo'),
 ('0.139', u'rome'),
 ('-0.108', u'cassio'),
 ('0.090', u'octavius'),
 ('0.081', u'lepidus'),
 ('-0.073', u'tybalt')]
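
To see why the sign of a coefficient matters, here is a minimal sketch that scores two toy speeches against the y topic. The coefficients are copied from the output above; the word weights for the two speeches are invented for illustration, and `project` is a helper written for this example. The real model uses every word in the vocabulary, not just the ten shown.

```python
# Coefficients copied from the y topic printed above.
y_topic = {
    'caesar': 0.513, 'brutus': 0.378, 'antony': 0.286, 'cassius': 0.192,
    'romeo': -0.151, 'rome': 0.139, 'cassio': -0.108, 'octavius': 0.090,
    'lepidus': 0.081, 'tybalt': -0.073,
}


def project(word_weights, topic):
    """Dot product of a sparse word-weight dict with a topic's coefficients."""
    return sum(weight * topic.get(word, 0.0)
               for word, weight in word_weights.items())


# Hypothetical tf-idf weights (made up, not taken from the corpus).
roman_speech = {'caesar': 0.5, 'rome': 0.3, 'brutus': 0.2}
verona_speech = {'romeo': 0.6, 'tybalt': 0.4}

# Positive score: the speech lies on the "Roman" side of the axis.
print(project(roman_speech, y_topic))
# Negative score: the speech lies on the opposite side.
print(project(verona_speech, y_topic))
```

A speech that mentions none of the heavily weighted words would score close to zero and sit near the middle of the axis.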

The next two axes separate out several of the other plays. Characters from Romeo and Juliet, Othello, and A Midsummer Night's Dream extend along the axes in different directions, while the characters from The Merchant of Venice have a smaller projection on the y-axis.

In [8]:
plot_axes(x=2, y=3)
x topic:
[('0.460', u'cassio'),
 ('-0.351', u'romeo'),
 ('-0.170', u'tybalt'),
 ('0.164', u'iago'),
 ('0.163', u'moor'),
 ('-0.125', u'juliet'),
 ('-0.115', u'nurse'),
 ('0.110', u'desdemona'),
 ('0.105', u'lord'),
 ('0.104', u'lieutenant')]

y topic:
[('0.367', u'romeo'),
 ('-0.253', u'hermia'),
 ('-0.244', u'demetrius'),
 ('-0.241', u'lysander'),
 ('0.215', u'cassio'),
 ('0.173', u'tybalt'),
 ('-0.157', u'pyramus'),
 ('-0.152', u'thisby'),
 ('-0.132', u'helena'),
 ('0.124', u'juliet')]

In the next set of axes the characters from The Merchant of Venice are well separated from those in the other plays along the x-axis, while characters from Hamlet and Macbeth extend along the y-axis. That these characters would be clustered is no surprise: in the similarity lists above, Hamlet and Macbeth (the people) each appear among the other's top 3 most similar characters.

In [9]:
plot_axes(x=4, y=5)
x topic:
[('0.279', u'jew'),
 ('-0.245', u'cassio'),
 ('0.231', u'antonio'),
 ('0.215', u'launcelot'),
 ('0.198', u'bassanio'),
 ('-0.192', u'hermia'),
 ('-0.184', u'demetrius'),
 ('-0.181', u'lysander'),
 ('0.156', u'ducats'),
 ('0.150', u'lorenzo')]

y topic:
[('0.315', u'cassio'),
 ('-0.280', u'hamlet'),
 ('0.245', u'brutus'),
 ('-0.147', u'king'),
 ('-0.145', u'lord'),
 ('0.124', u'cassius'),
 ('0.114', u'hermia'),
 ('0.113', u'demetrius'),
 ('0.111', u'jew'),
 ('0.110', u'lysander')]

Back to Part I.
