How to build a Database of all the songs you have ever heard using Python?

Rishav Sharma
10 min read · Jan 16, 2022

A guide and a step towards extreme OCD for organization.

Wordcloud animation of all the words associated with the music I heard.

How cool would it be if you could own a database of all the songs you have ever heard?
You could listen to songs that you no longer remember and reminisce about the memories associated with them. It's a simple idea, but where does one start?

Episodes:

  1. The data challenge and solution.
  2. Isolating music — Automated + Manual
  3. Manual checks and building the final database.
  4. Word cloud animation.
  5. How can you make your own database?

1. The data challenge and solution

The majority of us listen to songs on multiple platforms, but my go-to platform for the past couple of years has been YouTube Music. So, that became the starting point. Before attempting to build a music database, I first needed the data of everything I had ever watched on YouTube.
The idea?
Get the history data from YouTube, figure out which videos are music, and voila!
But things didn’t turn out as straightforward.

First thought: Can the YouTube API provide me with my watch history data? I explored for a bit but could not find anything concrete. Closed!

Second thought: Do I need to scrape the data myself from the history page? Ugh! I don’t want to do this. Let’s keep thinking.

The third thought came as a savior. Google allows users to download ALL of their data with their project — Google Takeout. A user can visit Google Takeout and download the required YouTube history data. Steps:

Optionally, you can export just the history by clicking “All YouTube data included” and selecting only “history”.

i) Open the site and click on Deselect all. Check the YouTube and YouTube Music option as shown in the screenshot.

ii) Click on Next step.

iii) Keep everything checked to default and click Export.
Note: You can choose the file format for your history. Options include HTML and JSON. For this task, JSON is more suitable.

Now, you have your entire YouTube history. Sorry, Incognito mode users :P

You can navigate to the ‘watch-history.json’ (or .html) file in your downloaded Takeout folder. That is all we need, and now we can start building the database.
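To get a feel for the export, here is a minimal sketch of pulling video IDs out of the parsed history. The field names (`titleUrl`, `time`) reflect the Takeout JSON format at the time of writing, and the sample records below are made up for illustration:

```python
import re

def extract_video_ids(history):
    """Pull YouTube video IDs out of parsed watch-history.json records.

    Each record for a watched video carries a 'titleUrl' such as
    https://www.youtube.com/watch?v=<id>; records for removed videos
    may lack the key, so we skip those.
    """
    ids = []
    for record in history:
        url = record.get('titleUrl')
        if not url:
            continue
        match = re.search(r'v=([\w-]{11})', url)
        if match:
            ids.append(match.group(1))
    return ids

# In practice: history = json.load(open('watch-history.json', encoding='utf-8'))
sample = [
    {"title": "Watched Ed Sheeran - Shape of You",
     "titleUrl": "https://www.youtube.com/watch?v=JGwWNGJdvx8",
     "time": "2022-01-10T18:03:41.555Z"},
    {"title": "Watched a video that has been removed"},  # no titleUrl key
]
print(extract_video_ids(sample))  # ['JGwWNGJdvx8']
```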

2. Isolating music — Automated + Manual

Now begins the challenge. Not only do you have to fetch all the details of a video from YouTube, but you also need to use those details to somehow classify each video as music or not music. Machine Learning? Do we need you?

Not really…

What features does YouTube provide for every video?

→ Title, Tags, and Description, right? Or so I thought. But there is something more. YouTube has a Category ID attribute that classifies every video into a category. The Category ID for music is 10.

→ So, does this mean that all one needs to do is filter the videos by Category ID, and their database is ready? Sadly no :(
Like any Machine Learning task, this classification by YouTube is not 100% accurate.

So, how accurate is it exactly? How can we test how accurately the Category ID identifies music videos? I did it as follows:

→ I found a public playlist of the most viewed songs on YouTube with the idea that I can check the category ID of all the videos present in this playlist and get an estimate of how accurate the inbuilt classification is.

→ I wrote a script that takes any playlist ID and fetches the category ID of every video in that playlist. When I ran it on the above music playlist, around 90.3% of the videos had Category ID 10. The result is not bad, but can we improve it further?
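The measurement itself boils down to one ratio. A minimal sketch, where the category IDs are made-up stand-ins for what the script fetches (the Data API returns them as strings):

```python
from collections import Counter

def category_accuracy(category_ids, music_category='10'):
    """Share of videos in a known all-music playlist that YouTube
    labelled with the music Category ID."""
    counts = Counter(category_ids)
    return counts[music_category] / len(category_ids)

# Hypothetical category IDs fetched for a 10-video music playlist.
fetched = ['10', '10', '24', '10', '10', '10', '10', '10', '10', '22']
print(f"{category_accuracy(fetched):.1%}")  # 80.0%
```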

One strategy that we could use is as follows:

  • Figure out the category IDs of all the videos from the most-played music playlist and store the top-2 category IDs (together accounting for 97.1% of the music videos).
  • Create a list of the tags associated with all the videos in the playlist, and store the 750 tags most commonly used with music videos in that playlist.
  • Combine the video descriptions and store the 250 words that appear most frequently in music-video descriptions.

Using a combination of the above, we are more likely to filter videos accurately than with Category ID = 10 alone. Here is the Python script that applies our strategy and stores the top categories, top tags, and top description words in a pickle file.

r'''Once we have obtained a list of videos along with all the necessary details from
our history, how do we identify if a particular video is a MUSIC video?

Idea: we could use a combination of CategoryID, Video Tags and Description words.
Problem: How do we know which values to pick?
Solution:
This script experiments with a playlist of ONLY MUSIC videos to figure out
the most common CategoryID, Video Tags and Description words assigned to music videos.

Playlist used: https://www.youtube.com/watch?v=JGwWNGJdvx8&list=PLhsz9CILh357zA1yMT-K5T9ZTNEU6Fl6n
This playlist contains around 5000 of the most played music videos on YouTube. We
could use information from these videos to create a system that can best
identify a music video.
'''
import os
import pickle
import string
from collections import Counter

import pandas as pd
from nltk.corpus import stopwords
from googleapiclient.discovery import build

# If you have never downloaded the list of stopwords before, uncomment the following lines:
# import nltk
# nltk.download('stopwords')

# Opening the API key.
base_path = os.path.dirname(__file__)
with open(os.path.join(base_path, '../_private_data/api_key.txt'), 'r', encoding='utf-8') as f:
    api_key = f.read()

# Creating a YouTube resource.
yt = build('youtube', 'v3', developerKey=api_key)

# Variable to keep track of pages from the playlist.
nextPageToken = None
# List to store the video details from the playlist.
all_video_details = []
playlist_id = 'PLhsz9CILh357zA1yMT-K5T9ZTNEU6Fl6n'

# What's the idea here?
# We loop through each page of the playlist, fetch the video IDs of all present
# videos, fetch the details of those videos and append them to a list. Repeat
# the process until the playlist pages are exhausted.
while True:
    # Request to fetch playlist video IDs.
    pl_req = yt.playlistItems().list(
        part='contentDetails',
        playlistId=playlist_id,
        maxResults=50,
        pageToken=nextPageToken
    )
    pl_response = pl_req.execute()

    # Fetching the video IDs of 50 (the maximum allowed) videos.
    video_ids = []
    for item in pl_response['items']:
        video_ids.append(item['contentDetails'].get('videoId'))

    # Request to fetch details of the obtained 50 videos.
    # The snippet contains Title, Tags, Description and CategoryID.
    vid_req = yt.videos().list(
        part='snippet',
        id=','.join(video_ids)
    )
    vid_response = vid_req.execute()

    # Adding the necessary information to a dictionary and appending
    # it to a list for all the playlist videos.
    for details, yt_id in zip(vid_response['items'], video_ids):
        snippet = details['snippet']
        # Fetching the necessary details from the snippet key.
        every_video_detail = dict(
            VideoID=yt_id,
            Title=snippet.get('title'),
            CategoryID=snippet.get('categoryId'),
            PublishDate=snippet.get('publishedAt'),
            ChannelTitle=snippet.get('channelTitle'),
            Description=snippet.get('description'),
            Tags=snippet.get('tags'),
        )
        all_video_details.append(every_video_detail)

    # get() returns None once all pages are exhausted; break in that case.
    nextPageToken = pl_response.get('nextPageToken')
    if not nextPageToken:
        break

df = pd.DataFrame(all_video_details)

# 1. Filtering the most used CategoryIDs.
# Keeping the top 2 CategoryIDs which YouTube assigns to music.
top_categories = df['CategoryID'].value_counts()[:2]
print(f"Fetched top-2 categories that account for {round(sum(top_categories)/len(df)*100, 2)}% of music videos...")

# 2. Filtering the most used Tags.
# The Tags column contains a list of tags for each video (or None).
# Creating a single list of all the tags associated with the videos.
all_tags = [
    tag.lower() for tags in df['Tags'].tolist() if tags is not None for tag in tags
]
# Fetching the 750 most common tags in music videos.
top_tags = Counter(all_tags).most_common(750)
print(f"Fetched {len(top_tags)} most common tags...")

# 3. Filtering the most used Description words.
desc_words = ' '.join(df['Description'].fillna('').str.lower()).split()
# From these words, we need to remove the stop words.
stopwords_en = stopwords.words('english')
desc_words_filtered = [word for word in desc_words if word not in stopwords_en]
# Fetching the 250 most common words in descriptions, dropping bare punctuation.
desc_most_common = Counter(desc_words_filtered).most_common(250)
desc_most_common = [word for word in desc_most_common if word[0] not in string.punctuation]
print(f"Fetched {len(desc_most_common)} most common words...")

print("Dumping categories, tags and most common words into pickle...")
with open(os.path.join(base_path, 'music_identify_data/top_categories.pkl'), 'wb') as cats:
    pickle.dump(top_categories, cats)
with open(os.path.join(base_path, 'music_identify_data/top_tags.pkl'), 'wb') as tags:
    pickle.dump(top_tags, tags)
with open(os.path.join(base_path, 'music_identify_data/top_desc_words.pkl'), 'wb') as desc:
    pickle.dump(desc_most_common, desc)
print("Complete...")

We now have a strategy for our identification, so we can start building the database. Three separate tasks extract the useful data and perform the identification, handled by these scripts:

  • module_extract_urls → Script to extract all the watched video URLs from the HTML/JSON history file (watch-history.json).
  • module_fetch_video_details → Script to fetch the details such as Title, Description, Tags, Duration of these watched videos using the Python YouTube API.
  • module_identify_music_video → Script that uses our strategy to filter the videos. It creates three flag columns: TagsCheck, DescriptionCheck, and CategoryIDCheck. Each is 1 if the video's tags, description, or category match the top tags, top description words, or top categories respectively, and 0 otherwise.
    We exclude videos that fail all three checks, as well as videos with a duration of 60 seconds or less (YouTube Shorts).
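The flag logic above can be sketched in a few lines of pandas. The reference sets here are small made-up stand-ins for the pickled results, and the DurationSeconds column is a hypothetical name for the parsed duration:

```python
import pandas as pd

# Hypothetical stand-ins for the reference data pickled earlier.
top_categories = {'10', '24'}
top_tags = {'official video', 'lyrics', 'music'}
top_desc_words = {'album', 'stream', 'listen'}

def add_check_columns(df):
    """Flag each video: 1 if its metadata overlaps a music reference set."""
    df['CategoryIDCheck'] = df['CategoryID'].isin(top_categories).astype(int)
    df['TagsCheck'] = df['Tags'].apply(
        lambda tags: int(bool(tags) and any(t.lower() in top_tags for t in tags)))
    df['DescriptionCheck'] = df['Description'].fillna('').apply(
        lambda d: int(any(w in top_desc_words for w in d.lower().split())))
    return df

df = add_check_columns(pd.DataFrame({
    'CategoryID': ['10', '22'],
    'Tags': [['Official Video'], None],
    'Description': ['Stream the album now', 'my vlog from spain'],
    'DurationSeconds': [213, 45],
}))
# Keep videos that pass at least one check and are longer than 60 seconds.
kept = df[(df[['TagsCheck', 'DescriptionCheck', 'CategoryIDCheck']].sum(axis=1) > 0)
          & (df['DurationSeconds'] > 60)]
print(len(kept))  # 1
```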

3. Manual checks, and building the final database.

This script uses the above three modules in sequence to build an initial database of all the videos that could be music. Let's call it IDBM. At this point, I came to terms with the fact that automation can only take you so far: at some point, you need to manually check whether certain videos are music or not. But we can at least make that task as simple as possible. Therefore, we have the final script in place that does the following tasks:

— It breaks IDBM (Initial Database of Music) into three parts: definitely music (Y), definitely not music (N), and maybe music (Maybe).

— If the video category is 10, or if all three checks are 1, we classify the video as definitely music. The idea being that not every music video has category 10, but a video that has category 10 is definitely music.

— If all three checks result in 0, we classify the video as definitely not music.

— For all other cases, we classify the video into the “maybe” category, which requires manual checking. For these videos, we create a column called Is_Music_Manual with all values initially set to No. Now all we need to do is make a copy of the file, open it, and, for each music video, toggle the column's value from No to Yes.

— We finally merge our definitely music data with the manually checked music data.

— Our database is ready :)
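The split-and-merge logic above can be sketched as follows. The column names match the checks described earlier; the sample rows are made up, and N/Y values follow the manual-check convention used later in the walkthrough:

```python
import pandas as pd

def split_music(df):
    """Split the initial database into definite music, definite non-music,
    and a 'maybe' bucket that needs a manual pass."""
    checks = df[['TagsCheck', 'DescriptionCheck', 'CategoryIDCheck']].sum(axis=1)
    definitely = df[(df['CategoryID'] == '10') | (checks == 3)]
    not_music = df[checks == 0]
    maybe = df.drop(definitely.index.union(not_music.index)).copy()
    maybe['Is_Music_Manual'] = 'N'   # toggled to 'Y' by hand later
    return definitely, not_music, maybe

df = pd.DataFrame({
    'Title': ['song A', 'vlog B', 'lecture C'],
    'CategoryID': ['10', '27', '22'],
    'TagsCheck': [1, 0, 1],
    'DescriptionCheck': [1, 0, 0],
    'CategoryIDCheck': [1, 0, 0],
})
definitely, not_music, maybe = split_music(df)
print(len(definitely), len(not_music), len(maybe))  # 1 1 1
# After the manual pass, merge the confirmed rows back in:
confirmed = maybe[maybe['Is_Music_Manual'] == 'Y']
final_db = pd.concat([definitely, confirmed], ignore_index=True)
```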

4. Word cloud animation.

We have all heard that:

A picture is worth a thousand words.

Here’s a rendition that I always use:

An animation is worth a thousand pictures.

The best way to see the most popular artists in your database is to build a word cloud. The following animation was created by merging 11 word clouds, built from the words in the Title, Tags, and Description. The words were added in increasing order of frequency, i.e. the first word cloud used only the words in the bottom 10% by frequency, the second the bottom 20%, and so on.
The final word cloud, where the frame rests after two loops, is the one that uses all the words.
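A minimal sketch of the frame-building idea: slice the word counts cumulatively, rarest words first, then render one cloud per slice. The word counts are made up, and the commented rendering step assumes the third-party wordcloud package:

```python
from collections import Counter

def frequency_slices(word_counts, steps=11):
    """Build cumulative word-frequency dicts: the first slice holds roughly
    the rarest 1/steps of the words, the last slice holds every word, so
    rendering one word cloud per slice gives the animation frames."""
    ordered = sorted(word_counts.items(), key=lambda kv: kv[1])  # rarest first
    slices = []
    for i in range(1, steps + 1):
        cutoff = max(1, round(len(ordered) * i / steps))
        slices.append(dict(ordered[:cutoff]))
    return slices

counts = Counter({'arijit': 40, 'coldplay': 25, 'lofi': 10, 'remix': 5, 'cover': 2})
slices = frequency_slices(counts, steps=5)
print(len(slices[0]), len(slices[-1]))  # 1 5
# Each slice can then be rendered with the wordcloud package, e.g.:
#   from wordcloud import WordCloud
#   frames = [WordCloud(width=800, height=400).generate_from_frequencies(s)
#             .to_image() for s in slices]
#   frames[0].save('wordcloud.gif', save_all=True,
#                  append_images=frames[1:], duration=500)
```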

5. How can you make your own database?

  1. Learn how to connect with the YouTube API. This is a fantastic video on the topic. Set up the API key.
  2. Clone this repository.
  3. Download your YouTube history data as mentioned above.
  4. Create a folder named “_private_data” within the repository and add your history file (watch-history.json) to this.
  5. Add the API key into a text file named api_key.txt. Store it in the folder _private_data.
  6. In the folder named “2. song_database”, create a folder called songs_heard.
  7. Run the create_song_db_1 script from the repository. It should detect the history file and create a CSV file of the filtered videos in the songs_heard folder. The filtered videos are those that pass at least one check and are at least 61 seconds long.
    Note: The script expects a .json file by default. Set the parameter is_json to False in the script if you want to use an HTML file.
  8. Next, run the create_song_db_2 script from the repository. This will create two files. One will contain all the definitely-music videos.
  9. The other will be an Excel sheet named “ManuallyCheckMusic.xlsx”.
  10. Make a copy of this sheet in the same directory and rename it ManuallyCHECKEDMusic.xlsx. Identify the music videos yourself and toggle the N to Y in the Is_Music_Manual column. It is not as challenging as it sounds: I had to check around 1200 videos (6 months of data) and it hardly took 15 minutes.
    Optionally, if you’re okay with not maximizing accuracy, you can just save the definitely music file that the script creates.
  11. If you have checked manually and created the required Excel file, run create_song_db_2.py once again. It will merge the definitely-music and manually checked files into a FinalMusicDatabase.csv file in the same directory as the script.
  12. Completed :D
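One small utility worth sketching: the 61-second Shorts filter in step 7 needs video durations in seconds, but the YouTube Data API returns them as ISO 8601 strings in contentDetails.duration. A minimal converter (it ignores durations longer than a day, which is fine for music videos):

```python
import re

def iso_duration_to_seconds(duration):
    """Convert an ISO 8601 duration like 'PT3M33S' (as returned in the
    YouTube Data API's contentDetails.duration field) to seconds."""
    match = re.fullmatch(r'PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?', duration)
    if not match:
        return 0
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds

print(iso_duration_to_seconds('PT3M33S'))  # 213
print(iso_duration_to_seconds('PT59S'))    # 59 -> excluded as a Short
```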

Code repository.

Note:

  • There is no right or wrong way to do this fun little project, nor was this an exhaustive or foolproof approach. It is the best I could figure out. You might have better ideas or a more efficient filtering system; if so, please share them.
  • YouTube by default keeps only 3 years' worth of data in the history. You can change it to lifetime on your Google activity page.
  • A playlist is a tiny subset of everything you may have ever heard. If you already have playlists of ALL the music you have ever heard, things are even simpler, because the YouTube API offers a direct way of fetching the videos of a playlist. I didn't have such playlists, so I took this long route.


Rishav Sharma

A guy who hates randomness | Quantifying intuition using Data