There are a few common ways to calculate a similarity score. Pearson correlation and Euclidean distance are both popular. I could try to force my data into those, but they aren't a good fit: for my music data, I'm only looking at whether the user has songs in common with a playlist or not. Put into numerical values, that's binary, 0 or 1. Those metrics are better suited to comparing something like movie ratings from 0 to 5. I need a different method.
After some Googling, I decided that the log-likelihood metric would be my best bet. It quantifies how likely or unlikely it is that an overlap between two datasets is due to chance. We compare two likelihoods and look at their ratio.
We can create a 2x2 table of counts covering four different situations.
1. The number of times both event A and event B happen together (k11)
2. The number of times only event A happens and not B (k12)
3. The number of times only event B happens and not A (k21)
4. The number of times neither A nor B happens (k22)
These four values can be used to calculate the log-likelihood ratio.
LLR = 2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))
where H is Shannon's entropy, computed here as the sum of (k_ij / sum(k)) * log(k_ij / sum(k)), without the usual leading minus sign.
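To make the formula above concrete, here is a rough Python sketch of how I read it. The function names (`contingency_counts`, `llr`) and the `catalog_size` parameter (the total number of distinct songs being considered) are my own assumptions for illustration, not something taken from Mahout or the sources below.

```python
import math


def contingency_counts(user_songs, playlist_songs, catalog_size):
    """Build the four counts (k11, k12, k21, k22) from two song collections.

    catalog_size is a hypothetical parameter: the total number of distinct
    songs, needed to count the songs that are in neither collection.
    """
    a, b = set(user_songs), set(playlist_songs)
    k11 = len(a & b)                        # in both
    k12 = len(a - b)                        # only in the user's songs
    k21 = len(b - a)                        # only in the playlist
    k22 = catalog_size - k11 - k12 - k21    # in neither
    return k11, k12, k21, k22


def _h(counts):
    """Sum of (k / N) * log(k / N) over non-zero counts (no leading minus)."""
    total = sum(counts)
    return sum((k / total) * math.log(k / total) for k in counts if k > 0)


def llr(k11, k12, k21, k22):
    """LLR = 2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))."""
    total = k11 + k12 + k21 + k22
    if total == 0:
        return 0.0
    h_matrix = _h([k11, k12, k21, k22])
    h_rows = _h([k11 + k12, k21 + k22])
    h_cols = _h([k11 + k21, k12 + k22])
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * total * (h_matrix - h_rows - h_cols))


counts = contingency_counts({"a", "b", "c"}, {"b", "c", "d"}, catalog_size=10)
print(counts)        # (2, 1, 1, 6)
print(llr(*counts))
```

A score near 0 means the overlap looks like chance; the larger the score, the less the overlap looks like a coincidence.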
But after some more research, I thought I should start out with something a bit simpler. The Jaccard similarity coefficient looks pretty easy to implement. It is also similar to the log-likelihood method in that it draws on the same situations to calculate the intersection and the union, although it only needs the first three: the intersection is k11 and the union is k11 + k12 + k21. It gives a score from 0 to 1 (1 being most similar), which is perfect. All I have to do to get the Jaccard similarity coefficient is divide the size of the intersection by the size of the union.
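As a minimal sketch of that calculation, assuming the user's songs and a playlist's songs are each available as a collection of song IDs (the `jaccard` function name and the sample IDs are just placeholders I made up):

```python
def jaccard(user_songs, playlist_songs):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(user_songs), set(playlist_songs)
    union = a | b
    if not union:
        # Two empty collections: treated as 0.0 here to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)


print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```

Returning 0.0 for two empty collections is a choice I made for the sketch; the coefficient itself is undefined in that case.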
Sources I read through:
1. https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/impl/similarity/package-summary.html
2. http://mail-archives.apache.org/mod_mbox/mahout-user/201105.mbox/%[email protected]%3E
3. http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
4. http://www.slideshare.net/NYCPredictiveAnalytics/building-a-recommendation-engine-an-example-of-a-product-recommendation-engine
5. http://www.slideshare.net/MrChrisJohnson/algorithmic-music-recommendations-at-spotify?related=1&utm_campaign=related&utm_medium=1&utm_source=3
6. http://en.wikipedia.org/wiki/Jaccard_index