Topic modeling of Shakespeare characters

In this post I extract all the words spoken by each character in eight of Shakespeare's plays. Then I construct a topic model to see which characters generally speak about similar things. In Part II I look into the information revealed by the topic model.

In [1]:
import nltk
import pandas as pd
from collections import defaultdict
from gensim import corpora, models, similarities

The nltk library includes eight of Shakespeare's plays in XML format, which makes it easy to split the lines up by speaker. Here are the available plays; a sketch of the XML structure follows.

In [2]:
nltk.corpus.shakespeare.fileids()
Out[2]:
['a_and_c.xml',
 'dream.xml',
 'hamlet.xml',
 'j_caesar.xml',
 'macbeth.xml',
 'merchant.xml',
 'othello.xml',
 'r_and_j.xml']
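
Each play file nests SPEECH elements inside ACT and SCENE elements, with each SPEECH holding a SPEAKER and one or more LINE elements. Roughly (an abridged sketch; the element names match the XPath queries used below, and the real files carry extra metadata):

<PLAY>
  <ACT>
    <SCENE>
      <SPEECH>
        <SPEAKER>HAMLET</SPEAKER>
        <LINE>To be, or not to be: that is the question:</LINE>
      </SPEECH>
      ...
    </SCENE>
  </ACT>
</PLAY>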

parse_plays returns two dictionaries, mapping each (speaker, play) pair to the words that character speaks and to a count of those words (stopwords and very short words excluded).

In [3]:
def parse_plays(file_ids, 
                tokenizer=nltk.tokenize.RegexpTokenizer(r'\w+'),
                stopwords=set(nltk.corpus.stopwords.words('english'))):
    """Return two dictionaries, mapping each (speaker, play) pair to the
    words that character speaks and to a count of those words (stopwords
    and words shorter than three characters are excluded from both).
    
    :param file_ids: the nltk file_ids of play xml files
    :param tokenizer: tokenizer to split words within the lines
      default: nltk.tokenize.RegexpTokenizer(r'\w+')
    :param stopwords: set of words to exclude
      default: set(nltk.corpus.stopwords.words('english'))
    """
    lines = defaultdict(list)
    linecounts = defaultdict(int)
    for file_id in file_ids:
        raw_data = nltk.corpus.shakespeare.xml(file_id)
        for child in raw_data.findall('ACT/SCENE/SPEECH'):
            # Key on (speaker, play) so a name shared across plays stays distinct.
            speaker = (child.find('SPEAKER').text, file_id.replace('.xml', ''))
            for line in child.findall('LINE'):
                if line.text is not None:
                    for word in tokenizer.tokenize(line.text):
                        word_lower = word.lower()
                        if word_lower not in stopwords and len(word) > 2:
                            lines[speaker].append(word_lower)
                            # Despite the name, this counts kept words, not stage lines.
                            linecounts[speaker] += 1
    return lines, linecounts
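
As a quick illustration of what the filtering keeps (an aside, not a cell from the original notebook):

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
stopwords = set(nltk.corpus.stopwords.words('english'))
sample = "To be, or not to be: that is the question:"
kept = [w.lower() for w in tokenizer.tokenize(sample)
        if w.lower() not in stopwords and len(w) > 2]
print(kept)  # ['question'] -- every other word is a stopword or too short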

To make cleaning up and manipulating the data easier, I put the relevant data into a pandas DataFrame, keeping only characters who speak at least min_lines = 100 filtered words.

In [4]:
min_lines = 100
lines, linecounts = parse_plays(nltk.corpus.shakespeare.fileids())
word_data = [(speaker[0], speaker[1], count, lines[speaker]) 
             for speaker, count in linecounts.items()
             if count >= min_lines]
word_data_df = pd.DataFrame(word_data, columns=['persona', 'play', 'linecount', 'words'])
word_data_df = word_data_df.sort_values('linecount', ascending=False).reset_index(drop=True)
word_data_df.iloc[:, :3].to_csv('data/word_data_df.csv')
word_data_df.head()
Out[4]:
persona play linecount words
0 HAMLET hamlet 5461 [lord, much, sun, madam, common, seems, madam,...
1 IAGO othello 3857 [sblood, hear, ever, dream, matter, abhor, des...
2 OTHELLO othello 3059 [tis, better, let, spite, services, done, sign...
3 MARK ANTONY a_and_c 2984 [beggary, love, reckon, must, thou, needs, fin...
4 MACBETH macbeth 2653 [foul, fair, day, seen, speak, stay, imperfect...
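
As a quick sanity check (again an illustrative aside), it's easy to see how many characters survive the threshold and how they spread across the plays:

print(len(word_data_df))                    # number of characters kept
print(word_data_df['play'].value_counts())  # characters kept per play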

Here I make a gensim dictionary, which maps each word to an integer id; gensim uses these ids in the later steps to extract the topic model. Words that appear in only one character's vocabulary are filtered out, since they can't connect characters to one another.

In [5]:
line_list = word_data_df['words'].values
dictionary = corpora.Dictionary(line_list)
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
dictionary.filter_tokens(once_ids)
dictionary.compactify()
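
The dictionary can be exercised directly. For example (illustrative; the actual ids depend on the build):

print(dictionary.doc2bow(['love', 'caesar', 'love']))
# -> sparse (word id, count) pairs; one pair carries the count of 2 for 'love'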

The step below creates a sparse vector of (word id, word count) pairs for each character, plus a TF-IDF model. The TF-IDF model converts raw word counts into weights more indicative of the importance of each word to each character.

In [6]:
corpus = [dictionary.doc2bow(words) for words in line_list]
corpora.mmcorpus.MmCorpus.serialize('data/shkspr.mm', corpus)
tfidf = models.TfidfModel(corpus)
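
To see what the re-weighting does, one character's bag of words can be pushed through the model (illustrative; row 0 is Hamlet, since the DataFrame is sorted by linecount):

doc_tfidf = tfidf[corpus[0]]  # Hamlet's (word id, weight) pairs
print(sorted(doc_tfidf, key=lambda pair: -pair[1])[:5])  # his five heaviest words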

Finally, the LSI (latent semantic indexing) model is constructed from the TF-IDF-weighted corpus, and the top words in its first three topics are printed.

In [7]:
lsi = models.lsimodel.LsiModel(corpus=tfidf[corpus], id2word=dictionary)
lsi.save('data/shkspr.lsi')
for i, topic in lsi.show_topics(num_topics=3):
    print('Topic {}:'.format(i))
    print(topic.replace(' + ', '\n'))
    print()
Topic 0:
0.192*"caesar"
0.125*"lord"
0.121*"antony"
0.112*"brutus"
0.106*"thou"
0.105*"romeo"
0.093*"cassio"
0.091*"love"
0.084*"thee"
0.078*"madam"

Topic 1:
0.513*"caesar"
0.378*"brutus"
0.286*"antony"
0.192*"cassius"
-0.151*"romeo"
0.139*"rome"
-0.108*"cassio"
0.090*"octavius"
0.081*"lepidus"
-0.073*"tybalt"

Topic 2:
0.460*"cassio"
-0.351*"romeo"
-0.170*"tybalt"
0.164*"iago"
0.163*"moor"
-0.125*"juliet"
-0.115*"nurse"
0.110*"desdemona"
0.105*"lord"
0.104*"lieutenant"

The topic model is now constructed. In Part II I'll analyze the results.
