Songs Recommender System

A step-by-step guide to building a Python-based Songs Recommender System using Cosine Similarity
recommendation system
visualisation
kmeans
Author

Shai Nisan

Published

March 7, 2023

Overview

In this notebook, we will use cosine similarity to produce song recommendations. We will use data from Spotify to cluster songs into different types and see which kinds of songs share the same attributes.

We will use the k-means algorithm to sort the songs into those types, and then use the clustered dataset to build a recommendation system and make a few recommendations.

By the end of this notebook, you will be able to find songs similar to your favorite song and, hopefully, find new favorite songs :)
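
To make the core idea concrete, here’s a minimal sketch of cosine similarity between two feature vectors (the numbers are made up for illustration):

import numpy as np

# cosine similarity = dot(a, b) / (|a| * |b|); it ranges from -1 to 1,
# and values close to 1 mean the vectors point in nearly the same direction
a = np.array([0.47, 0.50, 0.72])   # e.g. danceability, valence, energy of song A
b = np.array([0.60, 0.49, 0.68])   # the same features for song B
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # close to 1.0, so these two songs are similar on these features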

We’ll use a dataset from Kaggle: https://www.kaggle.com/datasets/rodolfofigueroa/spotify-12m-songs

The dataset contains audio features for over 1.2 million songs, obtained with the Spotify API.

Reference for these audio features can be found here: https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features

Let’s go!

First, we’ll import some important packages:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import distance
import difflib
import time
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

%matplotlib inline

Now we can load the data:

df = pd.read_csv("tracks_features.csv")

df.head()
id name album album_id artists artist_ids track_number disc_number explicit danceability ... speechiness acousticness instrumentalness liveness valence tempo duration_ms time_signature year release_date
0 7lmeHLHBe4nmXzuXc0HDjk Testify The Battle Of Los Angeles 2eia0myWFgoHuttJytCxgX ['Rage Against The Machine'] ['2d0hyoQ5ynDBnkvAbJKORj'] 1 1 False 0.470 ... 0.0727 0.02610 0.000011 0.3560 0.503 117.906 210133 4.0 1999 1999-11-02
1 1wsRitfRRtWyEapl0q22o8 Guerrilla Radio The Battle Of Los Angeles 2eia0myWFgoHuttJytCxgX ['Rage Against The Machine'] ['2d0hyoQ5ynDBnkvAbJKORj'] 2 1 True 0.599 ... 0.1880 0.01290 0.000071 0.1550 0.489 103.680 206200 4.0 1999 1999-11-02
2 1hR0fIFK2qRG3f3RF70pb7 Calm Like a Bomb The Battle Of Los Angeles 2eia0myWFgoHuttJytCxgX ['Rage Against The Machine'] ['2d0hyoQ5ynDBnkvAbJKORj'] 3 1 False 0.315 ... 0.4830 0.02340 0.000002 0.1220 0.370 149.749 298893 4.0 1999 1999-11-02
3 2lbASgTSoDO7MTuLAXlTW0 Mic Check The Battle Of Los Angeles 2eia0myWFgoHuttJytCxgX ['Rage Against The Machine'] ['2d0hyoQ5ynDBnkvAbJKORj'] 4 1 True 0.440 ... 0.2370 0.16300 0.000004 0.1210 0.574 96.752 213640 4.0 1999 1999-11-02
4 1MQTmpYOZ6fcMQc56Hdo7T Sleep Now In the Fire The Battle Of Los Angeles 2eia0myWFgoHuttJytCxgX ['Rage Against The Machine'] ['2d0hyoQ5ynDBnkvAbJKORj'] 5 1 False 0.426 ... 0.0701 0.00162 0.105000 0.0789 0.539 127.059 205600 4.0 1999 1999-11-02

5 rows × 24 columns
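
To confirm the size mentioned above, a quick check of the dimensions:

# roughly 1.2 million rows and 24 columns
print(df.shape)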

We’ll convert the “explicit” and “release_date” columns to numerical values:

# booleans become 0/1
df['explicit'] = df['explicit'].astype('int')

# dates become nanoseconds since the epoch; errors='coerce' turns
# unparseable dates into NaT, which the int64 cast maps to a large
# negative sentinel value
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce').astype('int64')

Now let’s bin the years into decades (we want our system to recommend songs from the same decade):

df['year'].mask(df['year'] <= 1910 ,1 , inplace=True)
df['year'].mask(df['year'].between(1911, 1920) ,2 , inplace=True)
df['year'].mask(df['year'].between(1921, 1930) ,3 , inplace=True)
df['year'].mask(df['year'].between(1931, 1940) ,4 , inplace=True)
df['year'].mask(df['year'].between(1941, 1950) ,5 , inplace=True)
df['year'].mask(df['year'].between(1951, 1960) ,6 , inplace=True)
df['year'].mask(df['year'].between(1961, 1970) ,7 , inplace=True)
df['year'].mask(df['year'].between(1971, 1980) ,8 , inplace=True)
df['year'].mask(df['year'].between(1981, 1990) ,9 , inplace=True)
df['year'].mask(df['year'].between(1991, 2000) ,10 , inplace=True)
df['year'].mask(df['year'].between(2001, 2010) ,11 , inplace=True)
df['year'].mask(df['year'].between(2011, 2020) ,12 , inplace=True)
df['year'].mask(df['year'].between(2021, 2030) ,13 , inplace=True)
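
The thirteen mask calls above can be collapsed into a single pd.cut call. Here is a sketch of an equivalent binning (run it instead of the masks above, not after them):

# equivalent decade binning: bins are right-inclusive by default,
# so (-inf, 1910] -> 1, (1910, 1920] -> 2, ..., (2020, 2030] -> 13
bins = [-np.inf, 1910, 1920, 1930, 1940, 1950, 1960,
        1970, 1980, 1990, 2000, 2010, 2020, 2030]
df['year'] = pd.cut(df['year'], bins=bins, labels=range(1, 14)).astype('int')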

Let’s normalize the numerical columns. Why do we have to? Because we want every feature to contribute to the distance computations on a comparable scale, so that no single feature dominates.

cols_to_normalize = ['acousticness', 'danceability', 'duration_ms',
       'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness',
       'tempo', 'valence', 'time_signature', 'year', 'release_date']
        
scaler = StandardScaler()
df[cols_to_normalize] = scaler.fit_transform(df[cols_to_normalize])
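
As a quick sanity check, each standardized column should now have a mean of roughly 0 and a standard deviation of roughly 1:

# verify the scaling: means ~0, standard deviations ~1
print(df[cols_to_normalize].mean().round(3))
print(df[cols_to_normalize].std().round(3))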

We’ll drop unnecessary columns (mainly textual ones):

df.drop(columns=['id', 'album', 'album_id', 'artist_ids', 'track_number', 'disc_number', 'key'], inplace=True)
df.head(2)
name artists explicit danceability energy loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms time_signature year release_date
0 Testify ['Rage Against The Machine'] 0 -0.121562 1.589717 0.918016 1 -0.100716 -1.092029 -0.751691 0.855599 0.277330 0.008781 -0.238621 0.298487 -1.082527 -0.734249
1 Guerrilla Radio ['Rage Against The Machine'] 1 0.558569 1.518454 0.865739 1 0.893324 -1.126297 -0.751533 -0.258227 0.225571 -0.451057 -0.262868 0.298487 -1.082527 -0.734249

We can now gather the relevant columns for clustering:

X = df[['explicit', 'danceability', 'energy', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'duration_ms', 'time_signature', 'release_date', 'year']]

How many clusters should we use? We’ll check with the elbow method:

# creating a list of inertia scores (this might take a while)

# from sklearn.cluster import KMeans

# inertia = []
# for n in range(1, 20):
#     kmeans = KMeans(n_clusters=n, n_init=10, random_state=7)
#     kmeans.fit(X)
#     inertia.append(kmeans.inertia_)
# inertia

# creating a line graph of the inertia scores

# plt.figure(figsize=(12, 8))
# plt.plot(range(1, 20), inertia)
# plt.title('Inertia scores')
# plt.show()
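
If you prefer to pick the elbow programmatically rather than by eye, one rough heuristic (a sketch, assuming the inertia list above has been computed) is the k with the largest second difference, i.e. where the curve bends most sharply:

# inertia[i] corresponds to k = i + 1, so the second difference at
# index j is centered at k = j + 2
# second_diffs = np.diff(inertia, n=2)
# elbow_k = int(np.argmax(second_diffs)) + 2
# print(elbow_k)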

This isn’t enough. We should also look at the silhouette scores.

# creating a list of silhouette scores

# from sklearn.metrics import silhouette_score

# silhouette = []
# for n in range(2, 20):
#     kmeans = KMeans(n_clusters=n, n_init=10, random_state=7)
#     kmeans.fit(X)
#     score = silhouette_score(X, kmeans.labels_)
#     silhouette.append(score)
# silhouette

# plotting the silhouette scores

# plt.figure(figsize=(12, 8))
# plt.plot(range(2, 20), silhouette)
# plt.title('Silhouette scores')
# plt.show()
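
Once the silhouette list is computed, the best-scoring k can be read off directly (higher silhouette is better); a sketch:

# the silhouette loop starts at k = 2, so offset the argmax accordingly
# best_k = int(np.argmax(silhouette)) + 2
# print(best_k)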

We can now cluster the songs into 18 clusters:

# clustering into 18 clusters (styles)

from sklearn.cluster import KMeans

X = df[['explicit', 'danceability', 'energy', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'duration_ms', 'time_signature', 'release_date', 'year']]

# n_init is set explicitly to silence sklearn's FutureWarning about its
# changing default; random_state makes the clustering reproducible
km = KMeans(n_clusters=18, n_init=10, random_state=7)
df['cluster'] = km.fit_predict(X)

df.head()
name artists explicit danceability energy loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms time_signature year release_date cluster
0 Testify ['Rage Against The Machine'] 0 -0.121562 1.589717 0.918016 1 -0.100716 -1.092029 -0.751691 0.855599 0.277330 0.008781 -0.238621 0.298487 -1.082527 -0.734249 1
1 Guerrilla Radio ['Rage Against The Machine'] 1 0.558569 1.518454 0.865739 1 0.893324 -1.126297 -0.751533 -0.258227 0.225571 -0.451057 -0.262868 0.298487 -1.082527 -0.734249 1
2 Calm Like a Bomb ['Rage Against The Machine'] 0 -0.938773 1.562569 0.914435 1 3.436618 -1.099039 -0.751715 -0.441094 -0.214381 1.038065 0.308569 0.298487 -1.082527 -0.734249 2
3 Mic Check ['Rage Against The Machine'] 1 -0.279732 1.552388 0.856287 0 1.315769 -0.736631 -0.751711 -0.446636 0.539822 -0.674995 -0.217001 0.298487 -1.082527 -0.734249 1
4 Sleep Now In the Fire ['Rage Against The Machine'] 0 -0.353544 1.423437 0.727529 1 -0.123132 -1.155581 -0.472676 -0.679930 0.410424 0.304640 -0.266567 0.298487 -1.082527 -0.734249 1
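
To get a feel for the result, we can inspect how many songs landed in each cluster:

# number of songs per cluster
df['cluster'].value_counts().sort_index()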

We can also one-hot encode the cluster column:

encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
enc = pd.DataFrame(encoder.fit_transform(df[["cluster"]]))

# OneHotEncoder sorts its categories, so take the column names from
# encoder.categories_ rather than df["cluster"].unique(), whose order
# depends on row order
enc.columns = encoder.categories_[0]

df[enc.columns] = enc

df.head()
name artists explicit danceability energy loudness mode speechiness acousticness instrumentalness ... 17 8 9 7 15 5 3 0 13 16
0 Testify ['Rage Against The Machine'] 0 -0.121562 1.589717 0.918016 1 -0.100716 -1.092029 -0.751691 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 Guerrilla Radio ['Rage Against The Machine'] 1 0.558569 1.518454 0.865739 1 0.893324 -1.126297 -0.751533 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 Calm Like a Bomb ['Rage Against The Machine'] 0 -0.938773 1.562569 0.914435 1 3.436618 -1.099039 -0.751715 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 Mic Check ['Rage Against The Machine'] 1 -0.279732 1.552388 0.856287 0 1.315769 -0.736631 -0.751711 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 Sleep Now In the Fire ['Rage Against The Machine'] 0 -0.353544 1.423437 0.727529 1 -0.123132 -1.155581 -0.472676 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 36 columns
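
For reference, pandas can do the same one-hot encoding in a single line; a sketch (an alternative to the encoder above, not a follow-up, and the column dtypes may differ):

# equivalent one-hot encoding with pandas
# df = df.join(pd.get_dummies(df['cluster']))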

Let’s sort the dataframe by release date and drop duplicate song names (keeping the first release):

df.sort_values(by=['release_date'], inplace=True)
df.drop_duplicates(subset=['name', 'artists'], inplace=True)

We should also clean the artists column of list symbols (brackets and quotes):

df["artists"] = df["artists"].str.replace("[","")
df["artists"] = df["artists"].str.replace("]","")
df["artists"] = df["artists"].str.replace("'","")
df["artists"] = df["artists"].str.replace("'","")

At this stage, you can optionally apply PCA to reduce the dimensionality and speed up the similarity computations.

# from sklearn.decomposition import PCA

# pca = PCA(n_components=3)

# # keep only numeric columns for PCA
# x = df.drop(columns=['name', 'artists'])

# x = pca.fit_transform(x)

# explained_variance = pca.explained_variance_ratio_

# print(explained_variance)

# # .values avoids index misalignment after the earlier sort
# dataset_pca = pd.DataFrame({'pca1': x[:, 0], 'pca2': x[:, 1], 'pca3': x[:, 2],
#                             'name': df['name'].values, 'cluster': df['cluster'].values})

# dataset_pca.head()

Let’s create a function that finds similar songs using cosine similarity. You can swap in any other distance function from: https://docs.scipy.org/doc/scipy/reference/spatial.distance.html

def find_similar_songs(best_match, artist):

    # locate the query song and its cluster
    found_song_idx = df.index[(df['name'] == best_match) & (df['artists'] == artist)].values
    cluster_data = df['cluster'].loc[found_song_idx].values[0]

    # filter to the relevant cluster only
    x = df[df['cluster'] == cluster_data]

    # store the names of the songs and artists
    song_names = x['name'].values
    artists_names = x['artists'].values

    # drop the categorical columns
    x = x.drop(columns=['name', 'cluster', 'artists'])

    # the query song's feature vector (computed once, outside the loop)
    query = x.loc[found_song_idx].values[0]

    # iterate over the cluster and compute all the cosine distances
    lst = []
    for count, row in enumerate(x.values):
        lst.append([distance.cosine(query, row), count])

    # sort by distance and take the top 5, skipping index 0 (the query song itself)
    lst.sort()
    recs = []
    for i in range(1, 6):
        recs.append([song_names[lst[i][1]], artists_names[lst[i][1]]])

    recs_df = pd.DataFrame(recs, columns=['Similar Song', 'Artist'])

    print("\nHere are songs similar to", best_match)
    print("*************************************************")
    print(recs_df)
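
The row-by-row loop can be slow for large clusters. Here is a vectorized sketch of the same logic using sklearn’s cosine_similarity (imported at the top); the function name is ours, and it assumes the same df as above:

def find_similar_songs_fast(best_match, artist):
    # locate the query song and restrict the search to its cluster
    found_song_idx = df.index[(df['name'] == best_match) & (df['artists'] == artist)].values
    cluster_id = df['cluster'].loc[found_song_idx].values[0]
    x = df[df['cluster'] == cluster_id]

    names = x['name'].values
    artists_names = x['artists'].values
    features = x.drop(columns=['name', 'cluster', 'artists'])

    # one matrix call instead of a Python loop
    sims = cosine_similarity(features.loc[found_song_idx].values, features.values)[0]
    top = np.argsort(-sims)[1:6]  # the first hit is the query song itself, so skip it
    return pd.DataFrame({'Similar Song': names[top], 'Artist': artists_names[top]})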

Our similarity function needs a song and an artist. Let’s create another function that will get a song’s name and guide the user to find the relevant artist from our database.

def get_song_and_artist():
    song = input('Please enter a song that you like:')

    # find the closest song title in the database (handles typos)
    best_match = difflib.get_close_matches(song, df['name'].values, n=1)[0]

    # list every artist who has a song with that title
    artist_lst = df['artists'].loc[df['name'] == best_match].tolist()

    print('\nFound the following artists: ')
    print(artist_lst)
    artist = input('\nWho is the artist? Please choose from the list:')
    return best_match, artist

We can try it out:

best_match, artist = get_song_and_artist()
find_similar_songs(best_match, artist)

Found the following artists: 
['Europe', '"Pickin On Series"', '"Scott Bradlees Postmodern Jukebox", Gunhild Carling']

Here are songs similar to The Final Countdown
*************************************************
                    Similar Song                                  Artist
0                      On and On                                 Triumph
1                          Inoiz                                   Itoiz
2                         Outlaw                           Brighton Rock
3               Kissin' Dynamite                                   AC/DC
4  The Walk - Remastered Version  Eurythmics, Annie Lennox, Dave Stewart

Here’s a short demo: