The Kaggle competition we hosted ended recently. Congratulations to all the competitors, and especially to the top three finishers!
- Steffen Rendle, who identified useful features for users and questions, then used factorization machines to automatically develop a factorization model from those features.
- Alexander d’Yakonov, who used traditional machine learning approaches to solve the challenge as a classification problem.
- Pankaj Mishra, who blended an ensemble of results from collaborative filtering and a variety of IRT models, based on models used in the Netflix prize and the 2010 KDD cup.
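For readers who have not met factorization machines, here is a minimal sketch of how a second-order model of this kind scores one (user, question) pair. It is not Steffen's actual code or feature set. The one-hot encoding, the parameter names `w0`, `w`, `V`, and all the numbers are assumptions made up for illustration.

```python
import math

def fm_score(x, w0, w, V):
    """Second-order factorization machine score for one feature vector.

    x  : list of n feature values (hypothetical one-hot user + question)
    w0 : global bias
    w  : list of n per-feature weights
    V  : n x k list of latent factor vectors, one row per feature
    """
    n = len(x)
    k = len(V[0])
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    # Pairwise term sum_{i<j} <V[i], V[j]> x[i] x[j], computed in O(n * k)
    # with the identity 0.5 * ((sum a)^2 - sum a^2) per latent dimension.
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise

def fm_probability(x, w0, w, V):
    """Squash the score into a probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-fm_score(x, w0, w, V)))

# Illustrative values only: 2 users and 2 questions, one-hot encoded.
x = [1.0, 0.0, 0.0, 1.0]            # user 0 answering question 1
w0 = 0.1
w = [0.2, -0.1, 0.3, -0.4]
V = [[0.5, 0.1], [0.0, 0.2], [0.3, 0.3], [-0.2, 0.4]]
p = fm_probability(x, w0, w, V)
```

With one-hot inputs, the pairwise term reduces to the dot product of the user's and the question's latent vectors, which is why this family of models is closely related to matrix factorization.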
We learned a number of interesting things by hosting this competition:
- A sense of how good our existing approach was. We posted benchmark results from the IRT algorithm at the heart of our internal prediction model. While it didn’t win the competition (and we’d have been disappointed if it had), it was a difficult benchmark to beat: improving on it was a serious challenge for the competitors. Before the competition, though, we didn’t really know whether we could feel confident in the strength of our model, or whether there was something much better we could be doing. Knowing that hundreds of world-class data scientists have looked at the problem gives us a lot of confidence that there isn’t much room for improvement on the results.
- Collaborative filtering/factorization methods for organizing and clustering our questions. While we’ve relied on manual categorization so far, this is a promising way of identifying the actual knowledge areas used by each question, so that we can get a sense of which questions actually use a common set of skills. This should help us give students a more accurate picture of the areas they really understand and which questions they’ll actually get right.
- Steffen’s libFM tool. This proved to be a very powerful way of developing those factorization models.
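To make the IRT benchmark from the first point above a little more concrete, here is a sketch of the simplest item response theory model, the one-parameter (Rasch) model. This post does not say which IRT variant our internal model uses, so treat the model choice, the function names, and the learning rate as assumptions for illustration only.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability that a student answers a question correctly
    under the one-parameter (Rasch) IRT model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def rasch_update(ability, difficulty, correct, learning_rate=0.1):
    """One gradient-ascent step on the log-likelihood of a single response.

    correct is 1 for a right answer, 0 for a wrong one.
    Returns the updated (ability, difficulty) pair.
    """
    error = correct - rasch_probability(ability, difficulty)
    return (ability + learning_rate * error,
            difficulty - learning_rate * error)

# A student of average ability on a question of average difficulty.
p_before = rasch_probability(0.0, 0.0)
# After observing a correct answer, the ability estimate rises
# and the difficulty estimate falls.
new_ability, new_difficulty = rasch_update(0.0, 0.0, correct=1)
```

Each student gets a single ability number and each question a single difficulty number, which makes the model easy to interpret and, as the competition showed for our own IRT benchmark, hard to beat by a wide margin.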
We’re glad to have some new tools in our search for better ways to understand what people are learning and to help them learn better. All the data (training and test) from the competition is now publicly available in the data section of the Kaggle page. We hope that it will continue to be useful for understanding student knowledge and predicting performance.












