One problem about my algorithm that used the Jaccard index is that it was super slow. This isn't because I used the Jaccard index (although the runtime is n squared which is pretty bad) but because I wrote it in a way that makes a call to the sqlite3 database for every playlist to compare it to. The database call is like the rate determining step in chemistry kinetics; it's the slowest process in my algorithm that determines how long the process is going to take. Seriously, database calls are super slow. I could rewrite it but I decided to create an additional method using a different way to calculate the similarity score while designing the algorithm to not make any database calls. Because this second method uses the cosine similarity, I'll call it the cosine method from now on.
Below is the Jaccard method.
Below is the Jaccard method.
This is the cosine method. Don't worry I'll explain what I'm doing here.
To avoid making database calls for this new method, I created a file called vectors.rb which I reference instead. Its one big hash with the playlist ID as the key and the vector as the value. Looping through hashes is so much faster! Anyway I'll explain the cosine method because it's a bit more complicated than the Jaccard method.
The first step is to make a matrix for all existing playlists and to create a vector for each playlist. In this matrix, a score of 1 is provided if the artist exists in the playlist and a 0 if they don't.
The first step is to make a matrix for all existing playlists and to create a vector for each playlist. In this matrix, a score of 1 is provided if the artist exists in the playlist and a 0 if they don't.
Here I made a simple 4 by 6 matrix as an example. In reality, there are 8502 unique artists and about 1500 playlists that I look at in my database so it'll be a 8502 by 1500 matrix. Playlist 1's vector will be [0,1,1,1] and Playlist 2's vector is [0,0,0,0]. These are 4 dimensional vectors. You can't visualize this because we can only visualize up to 3 dimensions in reality but same concept. In my method, each vector has 8502 dimensions. Just think of it being an arrow in a 8502-D space with a certain length and there are 1500 of them. We then have to create a vector for the user input and calculate the angle between the vector of one playlist and the user's vector. This is where we calculate the cosine similarity score. If two vectors are close together in space, the angle between them will be small. The cosine of the angle is inversely related to the angle meaning the smaller the angle, the bigger the cosine of the angle. We can therefore use the cosine of the angle as the similarity. You can visualize this by considering vectors in 2 dimensional space:
The equation to calculate the similarity between two vectors are:
The numerator is the dot product of the two vectors and the denominator is the product of the unit vectors.
And thats it! I tested the run time for my two methods and these are the results.
And thats it! I tested the run time for my two methods and these are the results.
500 playlists
Jaccard = 24 seconds
Cosine = 6 seconds
1000 playlists
Jaccard = 50 seconds
Cosine = 8 seconds
Clearly my cosine method is faster to start with and way more scalable.