Mining Fanfics on AO3 — Part 2: Exploratory Data Analysis

Sophia Z.
6 min read · Mar 5, 2021

With all the data collected in one place, we can now conduct all kinds of Exploratory Data Analysis (EDA). Since I did not have a fixed plan in mind for this project, I simply went ahead and tried to answer various questions I was interested in.

There are two distinct parts to this analysis. The first is an initial exploration of each fiction’s profile information and its interaction with “performance” or “reaction” statistics, the more numerical aspect; the second involves much bulkier textual data that requires a different set of skills to unpack. We will continue with the latter in Part 3.

I will present some of the possibilities I tried in the process, but they are in no way exhaustive. For the first part, this nice EDA process layout written by Destin Gong offered me a solid structure and guidance on using seaborn for visualization. However, instead of presenting the findings in a methodical way (for which Destin’s article shall suffice), I’d like to keep the flow mostly question/curiosity-driven. Both matplotlib/seaborn and Tableau are used for visualization.

Also, my examples are limited to data collected for one specific fandom, while the questions explored may speak to any fanfic-data enthusiast.

Here are the dependencies:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from pandas.api.types import is_string_dtype, is_numeric_dtype
import scipy.stats
# import matplotlib.font_manager
# from matplotlib.font_manager import FontProperties

The last two are commented out because they did not solve my problem with printing non-English characters in seaborn charts. I may work on this issue later; suggestions are welcome.

First, it was helpful to have a summary of different types of variables I have:

num_cols = []
str_cols = []
others = []
for col in df.columns:
    if is_numeric_dtype(df[col]):
        num_cols.append(col)
    elif is_string_dtype(df[col]):
        str_cols.append(col)
    else:
        others.append(col)

print(num_cols)
print(str_cols)
print(others)
Output:
>>>['Word_count', 'Num_chapters', 'Num_comments', 'Num_kudos', 'Num_bookmarks', 'Num_hits', 'Success'] # numerical columns
>>>['Title', 'Author', 'ID', 'Rating', 'Pairing', 'Warning', 'Complete', 'Language'] # categorical columns; 'ID' was converted to string
>>>['Date_updated', 'Date_published'] # date columns

Categorical X Categorical/Numeric

1. Rating composition

df.Rating.value_counts().plot.pie()

2. Language composition

I switched to Tableau here due to the character-printing issue, and also for prettier formatting.

Clearly, most fanfics in this fandom are written in English, while Chinese writers contributed another quarter. The following analysis will focus on these two languages.
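For those staying in pandas, the same language breakdown can be sketched with `value_counts`. This is a minimal toy example (the language labels here are placeholders, not the actual AO3 labels):

```python
import pandas as pd

# Toy frame standing in for the scraped profile data.
df = pd.DataFrame({'Language': ['English'] * 6 + ['Chinese'] * 3 + ['Russian']})

# Share of each language, largest first; .plot.pie() would chart it.
shares = df['Language'].value_counts(normalize=True)
print(shares)
```

`normalize=True` converts raw counts into proportions, which is what a composition chart needs.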

3. Interaction between language and rating

Frequency Distribution
Fiction Length

Note that “Words per chapter” is a calculated field I created in Tableau. I used the aggregate formula SUM([Word count])/SUM([Num chapters]) rather than first calculating [Word count]/[Num chapters] for each work and then averaging. This yields a weighted average that is far less skewed by extreme values.
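The distinction between the two aggregations is easy to see in pandas. A sketch with toy data, using the column names from the summary above:

```python
import pandas as pd

# Toy data: one long multi-chapter work and one short one-shot.
df = pd.DataFrame({'Word_count': [100000, 1000],
                   'Num_chapters': [50, 1]})

# Mean of per-work ratios: the tiny one-shot drags the average down.
per_work = (df['Word_count'] / df['Num_chapters']).mean()  # (2000 + 1000) / 2

# Ratio of sums: a weighted average dominated by actual word volume.
weighted = df['Word_count'].sum() / df['Num_chapters'].sum()  # 101000 / 51

print(per_work, weighted)  # 1500.0 vs ~1980.4
```

The ratio-of-sums version weights each work by its chapter count, so a single unusual work cannot swing the figure.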

Percentage of WIP

For those interested, percent stacked bar charts like this can be made in Tableau through a quick table calculation; instead of percent of total within the entire table, we limit the calculation to the ‘Complete’ dimension.
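Outside Tableau, the same within-group percentages can be sketched with `pd.crosstab`. A toy example (the `Complete` and `Rating` columns come from the profile data; the values here are made up):

```python
import pandas as pd

# Toy data standing in for the scraped profiles.
df = pd.DataFrame({'Complete': [True, True, True, False, False],
                   'Rating': ['Teen', 'Mature', 'Teen', 'Teen', 'Explicit']})

# normalize='index' turns counts into within-row shares, i.e. the
# percentage is computed inside each 'Complete' group, not over the
# whole table; pct.plot(kind='bar', stacked=True) would chart it.
pct = pd.crosstab(df['Complete'], df['Rating'], normalize='index')
print(pct)
```

Each row sums to 1, which is exactly the "limit calculation to one dimension" behavior of the Tableau table calculation.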

We may also ask how ratings affect reader reactions. Comparing percentiles can give us some indication.

The “Reactions” variable is created using the sum of comments, kudos, and bookmarks.
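In pandas, the same “Reactions” field is a one-line row-wise sum over the three count columns from the summary above:

```python
import pandas as pd

# Toy data with the reaction-count columns.
df = pd.DataFrame({'Num_comments': [10, 3],
                   'Num_kudos': [200, 40],
                   'Num_bookmarks': [30, 5]})

# Sum comments, kudos, and bookmarks for each work.
df['Reactions'] = df[['Num_comments', 'Num_kudos', 'Num_bookmarks']].sum(axis=1)
print(df['Reactions'].tolist())  # [240, 48]
```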

Since rating is essentially an ordinal variable, it is better to order it by its intrinsic levels rather than alphabetically. In Tableau, I used Default Properties > Sort and then Manual to achieve this.
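The pandas counterpart of that manual sort is an ordered `Categorical`. A sketch using AO3’s rating levels (ordered here by my reading of their intensity, which is an assumption):

```python
import pandas as pd

# Intrinsic order of the rating levels, mildest first (assumed ordering).
order = ['Not Rated', 'General Audiences', 'Teen And Up Audiences',
         'Mature', 'Explicit']

df = pd.DataFrame({'Rating': ['Explicit', 'Mature', 'General Audiences']})

# An ordered categorical sorts by level, not alphabetically.
df['Rating'] = pd.Categorical(df['Rating'], categories=order, ordered=True)
print(df.sort_values('Rating')['Rating'].tolist())
```

Once the column is ordered, every downstream `sort_values`, groupby, or seaborn axis respects the intrinsic order automatically.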

Numeric X Numeric

1. Correlation with heat map

# Drop the last 'Success' column, which was added for later
# classification purposes and is essentially categorical.
correlation = df.iloc[:, :-1].corr(numeric_only=True)
sns.heatmap(correlation, cmap='GnBu', annot=True)

It seems the number of bookmarks and kudos are very highly correlated while all the others are only somewhat positively correlated.

2. Correlation & distribution with scatterplot

sns.pairplot(df.drop(['ID','Success'], axis=1))

We may also add in a categorical variable to observe differences among groups, such as —

sns.pairplot(df.drop(['ID','Success'], axis=1), hue = 'Complete')

It becomes clear that many numerical variables are highly skewed. I used a log transformation on such variables:

dic = {}
for col in df.drop(['ID', 'Success'], axis=1).columns:
    if is_numeric_dtype(df[col]):
        key = '{}_log'.format(col)
        dic[key] = np.log(df[col].values + 1)  # equivalently, np.log1p
    else:
        dic[col] = df[col].values
df2 = pd.DataFrame.from_dict(dic)

The relationship becomes much clearer after transformation:

sns.pairplot(df2, hue='Rating', plot_kws={'alpha': 0.3})  # alpha adds transparency since groups overlap heavily

Indeed, the correlation becomes stronger as well, especially for the reaction metrics:

As expected, hits, kudos, and bookmarks are very strongly correlated among themselves. More kudos or bookmarks do lead to more comments, while the number of hits is a poor predictor of comments.

Also, the answer to whether writing longer fictions earns one more reactions is, sadly, a clear “no.”

3. Spearman’s rank order correlation

In the case of highly skewed data like this, Spearman’s rank order correlation may be a reasonable choice since ranks wouldn’t be affected by outliers.

The result is similar to the Pearson correlation matrix after log transformation:

# spearmanr expects numeric input, so restrict to the numeric columns
corr_matrix, p_matrix = scipy.stats.spearmanr(df[num_cols])
spearman_corr = pd.DataFrame(corr_matrix, columns=num_cols, index=num_cols)
plt.figure(figsize=(13, 10))
sns.heatmap(spearman_corr, cmap='GnBu', annot=True)

Time Series

With date variables available, we could also explore patterns with time on the x-axis.
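As one possibility, the number of works published per month can be sketched by grouping on `Date_published` (toy dates here; `.plot()` on the result would draw the time series):

```python
import pandas as pd

# Toy publication dates standing in for the scraped data.
df = pd.DataFrame({'Date_published': pd.to_datetime(
    ['2020-01-05', '2020-01-20', '2020-02-11', '2020-04-02'])})

# Count works per calendar month. Note: months with zero works
# (here, March) simply do not appear in the groupby result.
per_month = df.groupby(df['Date_published'].dt.to_period('M')).size()
print(per_month)
```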

Or we could gain some sense of the writing process by plotting the period between the date published and the date of the last update. This could tell you, for example, who the most consistent author is, or whether an author abandoned a work midway.

This Gantt bar chart involves creating a calculated field for the difference between these two dates, using it as Size (the label can be changed to the fiction title or anything else), and designating IDs or another unique identifier as Detail. (The idea came from a kind contributor online, but I can no longer find the source.)
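The underlying duration field is easy to reproduce in pandas. A sketch with hypothetical IDs and dates:

```python
import pandas as pd

# Toy works: one updated over five months, one posted and never updated.
df = pd.DataFrame({
    'ID': ['a1', 'b2'],
    'Date_published': pd.to_datetime(['2020-01-01', '2020-03-15']),
    'Date_updated': pd.to_datetime(['2020-06-01', '2020-03-15']),
})

# Days between first publication and last update -- the Gantt bar length.
df['Days_active'] = (df['Date_updated'] - df['Date_published']).dt.days
print(df[['ID', 'Days_active']])
```

A `Days_active` of zero flags one-shots posted in a single sitting, while a long span with an incomplete status hints at an abandoned work.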

Again, there are limitless possibilities depending on the project goal. I will end my EDA here and pick it up with some text analysis in an upcoming post!
