So a quick update. I spent 2 hours or so yesterday working on the Netflix data mining. I basically just tidied it up and started testing. I was almost finished and was about to post a blog with results, but decided I wasn’t happy with them. It turned out there wasn’t a bug, but that my code which works out how related users are wasn’t as effective as I thought it would be. See, I used the vector space index to calculate how similar users are. The catch is that it found the users almost 100% similar in almost all cases. I was wondering about this and then realised it was working correctly, and my assumption about how it would work was wrong.
What ended up happening was this,
My user rates the movies like so,
The one I am comparing to rates them like this,
Now, because the ratings are so different, the users should have almost a 0% relation. However, it ended up giving a relation of 99%, which is obviously wrong. There are two ways to improve this: feed in more movies to check, which will increase the distance between the users, or remap the ratings so they are spread further apart, so 1 = -5, 2 = -2, 3 = 0, 4 = 2, 5 = 5.
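To see why this happens, here is a minimal sketch, assuming the similarity measure is plain cosine similarity between rating vectors (my actual code may differ in the details). The ratings below are made up for illustration and are not taken from the Netflix data. Since every rating from 1 to 5 is positive, the two vectors can never point in opposite directions, so the score stays high however much the users disagree. With the remapped values, dislikes become negative and the score can drop all the way to -1.

```python
import math

# Remapping described above: 1 = -5, 2 = -2, 3 = 0, 4 = 2, 5 = 5
REMAP = {1: -5, 2: -2, 3: 0, 4: 2, 5: 5}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A user who rated everything 3 remaps to the zero vector.
        return 0.0
    return dot / (norm_a * norm_b)


def remap(ratings):
    """Convert 1-5 star ratings to the signed scale."""
    return [REMAP[r] for r in ratings]


# Hypothetical ratings for the same three movies (illustration only).
mine = [5, 4, 5]
theirs = [1, 2, 1]

raw_sim = cosine_similarity(mine, theirs)
remapped_sim = cosine_similarity(remap(mine), remap(theirs))

print(f"raw ratings:      {raw_sim:.3f}")
print(f"remapped ratings: {remapped_sim:.3f}")
```

On these made-up ratings the raw score comes out at roughly 0.90 even though the two users disagree on every movie, while the remapped score is about -1.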
I decided to do the second, since it should give better results no matter how many movies are plugged in. And the results are….
I’m not sure. I haven’t put the code in to do this yet. I will do it tonight (it’s a 30-minute job) and post the results for a few queries. Essentially, the way the queries work is that I put in some movies and how I rated them. The program then goes and finds movies which I might like based on that information.
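For anyone curious what a query looks like in code, here is a rough sketch of the idea rather than my actual program. It assumes cosine similarity over the movies two users share, uses the remapped ratings, and scores each unseen movie by adding up similar users' ratings weighted by how similar they are. The function names, the weighting scheme and the placeholder movie titles are all made up for illustration.

```python
import math
from collections import defaultdict

# Remapping from the post: 1 = -5, 2 = -2, 3 = 0, 4 = 2, 5 = 5
REMAP = {1: -5, 2: -2, 3: 0, 4: 2, 5: 5}


def similarity(query, other):
    """Cosine similarity over the movies both users rated (remapped scale)."""
    shared = [m for m in query if m in other]
    if not shared:
        return 0.0
    a = [REMAP[query[m]] for m in shared]
    b = [REMAP[other[m]] for m in shared]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(query, users, top_n=3):
    """Suggest movies the query has not rated, weighted by user similarity."""
    scores = defaultdict(float)
    for ratings in users.values():
        sim = similarity(query, ratings)
        if sim <= 0:
            continue  # ignore users who are unrelated or have opposite taste
        for movie, rating in ratings.items():
            if movie not in query:
                scores[movie] += sim * REMAP[rating]
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Only keep movies with a positive overall score.
    return [movie for movie, score in ranked[:top_n] if score > 0]


# Hypothetical data (placeholder titles, not real Netflix ratings).
query = {"Movie A": 5, "Movie B": 1}
users = {
    "user1": {"Movie A": 5, "Movie B": 1, "Movie C": 5, "Movie D": 1},
    "user2": {"Movie A": 1, "Movie B": 5, "Movie D": 5},
}

print(recommend(query, users))
```

Here user1 agrees with the query and user2 is the opposite, so only user1's ratings count, and Movie C is the only suggestion that ends up with a positive score.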
EDIT: I ended up implementing the above and then tried it out on a few queries.
Like Star Trek, Dislike Star Wars
Like Felicity season 1 and 2 but average thoughts about The Simpsons
Like Futurama
Like Notting Hill, Love Actually and Bridget Jones's Diary
Like The Evil Dead and Evil Dead 2
Like Terminator 1, 2, 3