Mining Fanfics on AO3 — Part 1: Data Collection

Sophia Z.
Published in Nerd For Tech
7 min read · Mar 2, 2021


When starting this project, I had the dual purpose of getting started with web scraping/text mining and actually fetching some insights from fanfics I read and love.

I was following the book Web Scraping with Python, and I found Archive of Our Own (AO3) to be a good playground since it doesn’t have a confusing UI or overwhelming JavaScript to complicate things; plus, all the HTML classes and IDs are very sensible and organized, which I felt grateful for later on.

I’m writing this primarily to document the process and questions, whether for anyone who happens to drop in or for my future self to contemplate (or laugh at ;). The code is very specific to AO3’s web structure, but I imagine it could be useful to someone looking for a quick walkthrough of web scraping from scratch, or to someone interested in conducting similar fandom research. In any case, who can resist the temptation of such a vibrant, diverse, ever-growing repertoire of user-generated textual information?

Understanding Link Formation

Web scraping at its very core is as simple as downloading web content from a provided URL. The URL is therefore the key, and the first thing to explore.

Fortunately, AO3’s tag system made the link formation pretty straightforward.

The basic format is just https://archiveofourown.org/tags/###INSERT TAGS HERE###/works

With some exploration, I also understood the following:

(1) Every page has a fixed, non-customizable maximum of 20 works, and navigating to another page can usually be done by appending “?page=#PAGE NUMBER” after “/works”

(2) To search a specific keyword, we can use this URL: https://archiveofourown.org/works/search?utf8=%E2%9C%93&work_search%5Bquery%5D=###INSERT SEARCH QUERY HERE###

(3) To access the actual fictions, we need the work ID as well as some additional parameters to efficiently get to the entire content — “view_adult=true&view_full_work=true.” If we want comments to be scraped, too, “show_comments=true” can be added.
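For concreteness, here is a small sketch that assembles these three URL patterns. The helper names are mine, and note that tags containing special characters (such as the “/” in ship names) use AO3’s own escaping scheme, so it’s safest to copy those straight from the address bar.

from urllib.parse import quote

BASE = "https://archiveofourown.org"

def tag_works_url(tag, page=1):
    # Listing of works under a tag, pattern (1); spaces etc. are percent-encoded
    return f"{BASE}/tags/{quote(tag)}/works?page={page}"

def search_url(query):
    # Keyword search, pattern (2)
    return f"{BASE}/works/search?utf8=%E2%9C%93&work_search%5Bquery%5D={quote(query)}"

def work_url(work_id, comments=False):
    # A single work's full content, pattern (3)
    url = f"{BASE}/works/{work_id}?view_adult=true&view_full_work=true"
    if comments:
        url += "&show_comments=true"
    return url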

Understanding HTML Structure

Every web page is different, so I won’t go too deep into the discovery of every single marker. It is a tedious but necessary step to correctly identify the information we want to retrieve as well as account for any irregularities.

Firefox Developer Edition is a great tool. You can pick an element and locate its tag in the source code immediately.

For example, we would quickly find out that all the works on a page are listed under an <ol> with the class “work index group,” and every individual work is a <li> element with the “role” of “article.”

Having an initial understanding of the DOM and the naming conventions for class, id, role and the like helps eliminate errors in later steps.
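As a quick illustration, pulling out the 20 work blurbs from a downloaded results page looks roughly like this (a sketch built on the markers mentioned above; verify them against the live page):

from bs4 import BeautifulSoup

def list_works(page_html):
    # Each work on a results page is an <li ... role="article"> inside the
    # <ol class="work index group">, so filtering on the role is the simplest handle
    soup = BeautifulSoup(page_html, "html.parser")
    return soup.find_all("li", role="article")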

Scraping the Basic Statistics

Enough with the exploration, now we can get down to the actual work.

Again, downloading content is very straightforward. One way is using Python’s urllib library.

import urllib.request

# url is the page to download; headers identifies the scraper (e.g. a User-Agent string)
req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)
content = resp.read()

But there are two major issues:

The first is that with popular fandoms that have hundreds of pages, we may quickly encounter the HTTP 429 “Too Many Requests” error.

I tried the Retry utility that the requests library exposes (from urllib3). It is indeed very error-tolerant, but it gave me a couple of blank pages without notifying me. Maybe that’s because I had set the backoff_factor too low and the server still didn’t let my requests through until later. I will need to get more familiar with the functionality of “requests” to solve this efficiently.

import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(connect=7, backoff_factor=5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

Second, vibrant fandoms can get new works every now and then, while a data mining project can take a much longer time to finish. If we conduct analysis on the fly (e.g., fetching one piece of information today and making another request for a different retrieval the next day), we may unexpectedly fetch different page content.

Therefore, a better approach may be to make all the requests at once and store the page content in one place.

Here is the approach I eventually used to download and store page content:
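A minimal sketch of that download-and-store loop (the tag name, page count, and headers below are placeholders of mine, and the original code differs in its details):

import time
import urllib.request
from urllib.error import HTTPError

headers = {'User-Agent': 'ao3-research-scraper (your contact here)'}   # placeholder
start_page = 1      # adjust this to resume after a 429 break
totalPages = 100    # placeholder: the number of search-result pages for the tag

for i in range(start_page, totalPages + 1):
    url = "https://archiveofourown.org/tags/SOME_TAG/works?page=" + str(i)   # placeholder tag
    try:
        req = urllib.request.Request(url, headers=headers)
        content = urllib.request.urlopen(req).read()
    except HTTPError as e:
        # Report the breakpoint so the loop can be restarted from this page later
        print('Stopped at page', i, 'with HTTP error', e.code)
        break
    with open("./SomeContent/" + str(i) + ".html", mode='wb') as f:
        f.write(content)
    print('Saved page', i)
    time.sleep(5)   # AO3 asks for a 5-second interval between requests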

I adopted a relatively manual way to handle 429 errors: when a request fails, the loop breaks and reports the breakpoint, i.e., the page number it stopped at.

And after some rest, I could adjust the start_page variable and restart from where we left off.

One additional note is that AO3 requires a 5-second interval between requests, hence the time.sleep(). So don’t forget to first —

import time
import csv

Once we have saved all the content into HTML files, we can read them back into memory and start applying any functions of interest:

totalPages = ###
for i in range(1, totalPages+1):
    pageName = "./SomeContent/"+str(i)+".html"
    with open(pageName, mode='r', encoding='utf8') as f:
        print('Now we are opening page', i, '...')
        page = f.read()
        process_basic(page)

The first thing I was interested in was scraping statistics like kudos, hits, and bookmarks and organizing them in one place, together with basic information like ID, title, and author.

We will need our old friends for pattern recognition:

from bs4 import BeautifulSoup
import re

First, I created a new file with the designated header row:

header = ['Title', 'Author', 'ID', 'Date_updated', 'Rating', 'Pairing', 'Warning', 'Complete', 'Language', 'Word_count', 'Num_chapters', 'Num_comments', 'Num_kudos', 'Num_bookmarks', 'Num_hits']

with open('SomeName.csv', 'w', encoding='utf8') as f:
    writer = csv.writer(f)
    writer.writerow(header)

Then make the aforementioned process_basic function:
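Below is a rough sketch of what such a process_basic could look like (not the original code). The class and attribute names (heading, datetime, the stats <dd> classes, the required-tag <span> classes) come from my own inspection of AO3’s blurbs, so verify them against the live HTML before relying on them. It uses the BeautifulSoup, re, and csv imports from earlier.

def get_span(work, name):
    # Required-tag symbols (rating, category, warnings, completion) are <span>s in the blurb
    tag = work.find('span', class_=name)
    return tag.text.strip() if tag else ''

def get_stat(work, name):
    # Numeric stats live in <dd> elements inside <dl class="stats">;
    # a work with no kudos/comments simply has no such <dd>, hence the fallback
    tag = work.find('dd', class_=name)
    return tag.text.strip() if tag else '0'

def process_basic(page):
    soup = BeautifulSoup(page, 'html.parser')
    works = soup.find_all('li', role='article')
    rows = []
    for work in works:
        work_id = re.sub(r'\D', '', work.get('id', ''))          # "work_123456" -> "123456"
        heading = work.find('h4', class_='heading')
        title = heading.find('a').text if heading and heading.find('a') else ''
        author_tag = work.find('a', rel='author')
        author = author_tag.text if author_tag else 'Anonymous'  # orphaned works lack an author link
        date_tag = work.find('p', class_='datetime')
        date_updated = date_tag.text if date_tag else ''
        rows.append([title, author, work_id, date_updated,
                     get_span(work, 'rating'), get_span(work, 'category'),
                     get_span(work, 'warnings'), get_span(work, 'iswip'),
                     get_stat(work, 'language'), get_stat(work, 'words'),
                     get_stat(work, 'chapters'), get_stat(work, 'comments'),
                     get_stat(work, 'kudos'), get_stat(work, 'bookmarks'),
                     get_stat(work, 'hits')])
    with open('SomeName.csv', 'a', encoding='utf8', newline='') as f:
        csv.writer(f).writerows(rows)
    # Purely for checking: zero works on a page flags a blank HTML file
    print('Parsed', len(rows), 'works from this page')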

I took a page-wise approach for this function (hence all the lists), but you may run it row-wise as well, processing each work and writing each row into the CSV file.

Another reason this block of code bulks up is all the error handling. Any work that doesn’t have kudos or comments would lack those attributes and thus throw errors, stalling the entire process. Apparently, there can also be works with no word count or no author. Being conservative and meticulous is necessary when dealing with heterogeneous data sources.

The last few lines are just for checking purposes, which helped me spot the empty HTML pages “requests” failed to download.

Scraping the Texts

Some text data, such as tags and summaries, can be scraped from the search result page. Other data, including the actual work content and comments, require opening every fiction’s own link, which once again introduces potential disruptions that we have to be prepared for.

As before, let’s first set up the basic schema for the data we want to scrape and store:

header_row = ['ID', 'Tags', 'Summary', 'Date_published', 'Content', 'Comments']

with open('SomeText.csv', 'w', encoding='utf8') as f:
    writer = csv.writer(f)
    writer.writerow(header_row)

On the first level, we can build simple functions for tags and summaries:
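A sketch of what these two helpers could look like, taking one work blurb (the <li> element) as input; again, the class names are from my own inspection and worth double-checking:

from bs4 import NavigableString

def get_tags(work):
    # All relationship/character/freeform tags of a blurb sit in <ul class="tags commas">
    tag_list = work.find('ul', class_='tags')
    if tag_list is None:
        return ''
    return '; '.join(a.text for a in tag_list.find_all('a', class_='tag'))

def get_summary(work):
    # The summary is a <blockquote class="userstuff summary"> with <p> children
    summary = work.find('blockquote', class_='summary')
    if summary is None:
        return ''
    parts = []
    for child in summary.children:
        # .children also yields whitespace-only NavigableStrings between the <p> tags;
        # applying .text to those raises an error, so we simply skip them
        if isinstance(child, NavigableString):
            continue
        parts.append(child.text.strip())
    return ' '.join(parts)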

How best to store the tags is up to you, whether with a different separator or all in one string; the same goes for the comments we will deal with later on.

Also note that when using BeautifulSoup’s .children here, some blank rows are returned as NavigableString objects, which throw an error if you try to apply .text to them. We may simply skip these blanks.

Another first-level function is for comments. There are two scenarios — one is when there’s 0 or 1 page of comments, and the other is when there are more and we need to make new page requests.

But both would require accessing a particular fiction’s page content:
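A sketch of such an opener, reusing the headers dict and the urllib, time, and BeautifulSoup imports from earlier; the page parameter for turning comment pages is my assumption based on AO3’s comment-pagination links, so check one in your browser:

def fic_url(work_id, page=1):
    # A work's page with the full text and comments visible; `page` turns the comment pages
    return ("https://archiveofourown.org/works/" + str(work_id)
            + "?view_adult=true&view_full_work=true&show_comments=true&page=" + str(page))

def open_fic(url):
    # Download one work page and hand back a BeautifulSoup object
    req = urllib.request.Request(url, headers=headers)
    page = urllib.request.urlopen(req).read()
    time.sleep(5)   # keep to AO3's 5-second request interval
    return BeautifulSoup(page, 'html.parser')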

Both functions take a BeautifulSoup object supplied from the open_fic function as the input. The second function depends on the first and passes onto it a new URL for each page.

Finding how many pages of comments to turn is a little tricky. I leveraged the navigation bar and grabbed the last page number before the “Next” link, as in the sketch below:
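Here is a sketch of the pair; the comments_placeholder id and the pagination markup are what I found by inspecting a work page, so treat them as assumptions to verify:

def get_comments(soup):
    # Gather the comment texts from one opened work/comment page
    section = soup.find(id='comments_placeholder')
    if section is None:
        return ''
    texts = [bq.text.strip() for bq in section.find_all('blockquote', class_='userstuff')]
    return ' ||| '.join(texts)   # pick whatever separator suits your later analysis

def turn_page_comments(soup, work_id):
    # The last numbered link before "Next" in the navigation bar tells us how many
    # comment pages exist; open each remaining page and scrape it in turn
    nav = soup.find('ol', class_='pagination')
    last_page = int(nav.find_all('a')[-2].text)
    all_comments = [get_comments(soup)]
    for n in range(2, last_page + 1):
        next_soup = open_fic(fic_url(work_id, page=n))   # a new URL for each comment page
        all_comments.append(get_comments(next_soup))
    return ' ||| '.join(all_comments)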

Now, the last step is to compile all this information in one place and write it into a CSV file. While it could be achieved in one step, I separated it into two levels in order to handle errors more effectively and restart the loop at the right location:
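A simplified sketch of the two levels (my reconstruction, not the original gist): the “published” <dd> class and the chapters div id are from inspecting a work page, and the start_index2 comment-page resumption described in the notes below is left out for brevity.

def article_to_row(work):
    # Build one CSV row for a single fiction: tags and summary come from the
    # search blurb, everything else from the work's own page
    work_id = re.sub(r'\D', '', work.get('id', ''))
    soup = open_fic(fic_url(work_id))

    # The publish date only appears in the stats block of the work detail page
    date_tag = soup.find('dd', class_='published')
    date_published = date_tag.text if date_tag else ''

    # Full text of the work (all chapters at once, thanks to view_full_work=true)
    content_div = soup.find('div', id='chapters')
    content = content_div.text.strip() if content_div else ''

    try:
        comments = turn_page_comments(soup, work_id)
    except AttributeError:
        # 0 or 1 page of comments: there is no navigation bar, so just scrape this page
        comments = get_comments(soup)

    return [work_id, get_tags(work), get_summary(work), date_published, content, comments]

def process_articles(page, start_index=0):
    # Iterate over the fictions on one saved search page, starting at start_index
    # so the loop can resume at the right article after a breakdown
    soup = BeautifulSoup(page, 'html.parser')
    works = soup.find_all('li', role='article')
    with open('SomeText.csv', 'a', encoding='utf8', newline='') as f:
        writer = csv.writer(f)
        for idx in range(start_index, len(works)):
            print('  processing work index', idx)
            writer.writerow(article_to_row(works[idx]))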

A few notes:

  1. AO3 only shows the publish date on the work detail page, which is why this piece of information is not fetched together with the other statistics.
  2. Error handling for comments: if there’s one page of comments or none, we don’t have to go through the trouble of turn_page_comments(). (It would throw an error, since there’s no navigation bar on such pages.)
  3. Process_articles() iterates through the fictions on each page. Start_index is the index of articles on a page, so that if we break off at any point, we can reset it to the right starting position. Start_index2 indicates the page of comments, which is fed into the article_to_row() function and then into the turn_page_comments() inner loop. Interestingly, I didn’t encounter any “Too Many Requests” errors during the comment iteration process. Maybe the rate limit is only applied to the search page?

Putting it all together, we can jumpstart the process by reading in our stored pages one by one:
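A sketch of that driver loop (the page count is a placeholder of mine):

start_page = 1      # adjust these two after a breakdown to resume where you left off
start_index = 0
totalPages = 100    # placeholder: set to the real number of saved pages

ix = 0              # flag: apply the manual start_index only to the first (resumed) page
for i in range(start_page, totalPages + 1):
    with open("./SomeContent/" + str(i) + ".html", mode='r', encoding='utf8') as f:
        page = f.read()
    print('Now processing page', i, '...')
    if ix == 0:
        process_articles(page, start_index)
        ix = 1
    else:
        process_articles(page)   # later pages start again from their first article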

The “ix” flag makes process_articles() use the manually altered start_index only once, for the current (resumed) page, and then switch back to starting from the first article on every later page.

Due to the five-second interval requirement, opening and processing each work can take a long time, especially when we have to loop through hundreds of pages of comments for just one fiction. Be prepared for a couple of hours or even an overnight wait, depending on how many pages there are in total. The print statements can at least keep us informed of the progress.

Hooray! This is the end of it, and you can merge the datasets and change data types as you wish, for example:

import pandas as pd

basic_stats.Date_updated = pd.to_datetime(basic_stats.Date_updated, format='%d %b %Y')
basic_stats.Word_count = basic_stats.Word_count.apply(lambda x: int(x.replace(',', '')))
text_df.Date_published = pd.to_datetime(text_df.Date_published, format='%Y-%m-%d')
df = basic_stats.merge(text_df, how='inner', on='ID')

Then, it’s time to start cooking!
