To continue my quest for yet more data, I started to look into the transient world of sound. I was once amazed by the automated captions on TV: they were so accurate and appeared with so little delay that I doubted they were automated at all, and yet we can’t have someone typing out all those words for every single program, right? Apparently, this technology has been around for quite a while. If we could jot down everything people are saying, that would mean a lot more data to play with!

TV programs seem a bit too unattainable, so for…
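
The preview cuts off above, but as a rough illustration of the underlying idea, turning speech into text you can analyze, here is a minimal sketch using the SpeechRecognition package. This is not the approach described in the post; the file name and the choice of the free Google Web Speech backend are illustrative assumptions.

# A minimal speech-to-text sketch, assuming a short WAV clip named "clip.wav".
# Requires the SpeechRecognition package (pip install SpeechRecognition);
# the file name and backend here are illustrative, not from the original post.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("clip.wav") as source:    # hypothetical audio file
    audio = recognizer.record(source)       # read the whole clip into memory

# Send the audio to the free Google Web Speech API and print the transcript.
print(recognizer.recognize_google(audio))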


While I was taking this course on Udemy, my dear friend Sruthi gave me a surprisingly relevant challenge to work on: scrape some immigration-related articles from two distinct news publications and compare the keywords they use. It sounds like immense fun, so here’s a progress report dedicated to her!

While it sounds like a straightforward ask, there are a couple of tricky parts; here’s the breakdown:

  1. How to find articles related to a certain topic?
  2. If using search, how to deal with search results that are rendered entirely by JavaScript?
  3. How to define and extract “keywords”? (a rough sketch of points 2 and 3 follows this list)
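
To make points 2 and 3 concrete, here is a minimal sketch, not the exact approach taken in the project: Selenium renders a JavaScript-driven search page, and scikit-learn’s TF-IDF ranks candidate keywords. The search URL, the CSS selector, and both helper functions are illustrative assumptions.

# Rough sketch: render a JavaScript-driven search page, then rank keywords.
# The URL, selector, and helper names are hypothetical.
from selenium import webdriver
from selenium.webdriver.common.by import By
from sklearn.feature_extraction.text import TfidfVectorizer

def fetch_search_results(search_url):
    """Load a JavaScript-rendered search page and return article texts."""
    driver = webdriver.Chrome()          # assumes chromedriver is installed
    driver.get(search_url)
    driver.implicitly_wait(5)            # give the page time to render
    texts = [el.text for el in
             driver.find_elements(By.CSS_SELECTOR, "article")]  # hypothetical selector
    driver.quit()
    return texts

def top_keywords(texts, n=20):
    """Rank words by TF-IDF weight summed over all articles."""
    vec = TfidfVectorizer(stop_words="english", max_features=5000)
    tfidf = vec.fit_transform(texts)
    scores = tfidf.sum(axis=0).A1        # total weight per term
    ranked = sorted(zip(vec.get_feature_names_out(), scores),
                    key=lambda t: t[1], reverse=True)
    return ranked[:n]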

If I…

First, let’s review the data we have collected from AO3 so far:

Title             object
Author            object
ID                int64
Date_updated      datetime64[ns]
Rating            object
Pairing           object
Warning           object
Complete          object
Language          object
Word_count        int64
Num_chapters      int64
Num_comments      int64
Num_kudos         int64
Num_bookmarks     int64
Num_hits          int64
Tags              object
Summary           object
Date_published    datetime64[ns]
Content           object
Comments          object

(See Part 1 for the data collection process)
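
The listing above is the kind of output pandas gives when you check column types. A minimal sketch, assuming the Part 1 data was saved to a CSV file (the file name is hypothetical):

# Reload the scraped data and check column types.
# "ao3_works.csv" is a hypothetical file name; parsing the two date columns
# brings them back as datetime64[ns] rather than plain strings.
import pandas as pd

df = pd.read_csv("ao3_works.csv", parse_dates=["Date_updated", "Date_published"])
print(df.dtypes)   # should match the listing above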

In Part 2, we focused on the more structured data and left out Tags, Summary, Content (the actual fiction text), and Comments. Those will be the focus of this post.

Beyond simple word frequency counts, I first explored…
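
As a baseline for those “simple word frequency counts”, a minimal sketch, assuming a pandas DataFrame df like the one sketched above, with a Content column holding the fiction text:

# Count lowercase word tokens across all fic texts.
import re
from collections import Counter

def word_frequencies(texts):
    """Count lowercase word tokens across a collection of texts."""
    counts = Counter()
    for text in texts:
        if isinstance(text, str):            # skip missing values
            counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

content_counts = word_frequencies(df["Content"])
print(content_counts.most_common(20))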

With all the data collected in one place, we can now conduct all kinds of Exploratory Data Analysis (EDA). Since I did not have a fixed plan in mind for this project, I just went ahead and tried to answer various questions I was interested in.

There are two distinct parts to this analysis: the first is an initial exploration of each fic’s profile information and its interaction with “performance” or “reaction” statistics (the more numerical aspect); the second part involves much bulkier textual data that requires a different set of skills to unpack. …
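
For a flavor of the first, numerical part, a minimal sketch, assuming the DataFrame df with the columns listed earlier; this is an illustration of the kind of question asked, not a reproduction of the actual analysis:

# Illustrative EDA on the "performance"/"reaction" statistics.
stats = ["Word_count", "Num_kudos", "Num_hits", "Num_comments", "Num_bookmarks"]
print(df[stats].corr())          # pairwise correlations, e.g. word count vs. kudos

# Median kudos per rating category, highest first.
print(df.groupby("Rating")["Num_kudos"].median().sort_values(ascending=False))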

When starting this project, I had the dual purpose of getting started with web scraping/text mining and actually drawing some insights from the fanfics I read and love.

I was following the book Web Scraping with Python, and I found Archive of Our Own (AO3) to be a good playground since it doesn’t have a confusing UI or overwhelming JavaScript to complicate things; plus, all the HTML classes and IDs are sensibly named and organized, which I was grateful for later on.
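
For a sense of why those tidy classes and IDs matter, a minimal requests + BeautifulSoup sketch; the work URL and the specific class names below are illustrative assumptions, not necessarily the exact selectors AO3 uses:

# A minimal scraping sketch in the spirit of Web Scraping with Python.
# The work ID and CSS class names are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

url = "https://archiveofourown.org/works/123456"    # hypothetical work ID
soup = BeautifulSoup(requests.get(url).text, "html.parser")

title = soup.find("h2", class_="title")             # assumed title class
kudos = soup.find("dd", class_="kudos")             # assumed kudos class
print(title.get_text(strip=True) if title else "title not found")
print(kudos.get_text(strip=True) if kudos else "kudos not found")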

I’m writing this primarily to document the process and questions, for anyone who happens to drop in, or for my future self…

Sophia Z.

A Disciple of the Secret Lives of Data
