Predicting Kickstarter Campaign Success Using Gradient-Boosted Decision Models
For creators, backers, and platforms alike, predicting the success of a crowdfunding campaign before it starts can help with planning and resource management. This paper studies how well the outcome of a crowdfunding campaign can be predicted in the pre-launch phase using only pre-launch variables, which avoids leakage from variables such as the amount pledged or the number of backers. We used the public Kaggle dataset on Kickstarter to train and test three models: Logistic Regression, XGBoost, and CatBoost, the latter two being gradient-boosted decision tree (GBDT) implementations. The models used a mixture of numerical predictors (e.g., log of the campaign funding goal, campaign length) and categorical predictors (e.g., campaign category, country, campaign currency). In our experiments, CatBoost and XGBoost outperformed Logistic Regression, attaining an AUC of 0.755 and a PR-AUC near 0.66, which suggests strong predictive ability on an imbalanced classification problem. XGBoost had slightly better recall, but CatBoost generalized better and gave more consistent results due to its ability to handle categorical features with little to no additional preprocessing. The three most important features were campaign category, campaign length, and funding goal. These results support the conclusion that gradient-boosted ensembles can predict the success of a Kickstarter campaign with good reliability from only minimal information, and they point toward richer models that add text from the campaign description, information about campaign creators, historical data, and other features as predictors and that may achieve even higher reliability.
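
As an illustration of the pre-launch restriction described above, the following minimal sketch builds a feature record from one raw campaign row while leaving out fields known only after launch. The field names (`goal`, `launched`, `deadline`, `category`, `country`, `currency`, `pledged`, `backers`), the date format, and the use of `log1p` rather than a plain logarithm are assumptions made for this example and are not taken from the paper.

```python
import math
from datetime import datetime

# Hypothetical field names; fields known only after launch are treated as leaky.
LEAKY_FIELDS = {"pledged", "backers"}

def prelaunch_features(row):
    """Build a pre-launch feature dict from one raw campaign record."""
    launched = datetime.strptime(row["launched"], "%Y-%m-%d")
    deadline = datetime.strptime(row["deadline"], "%Y-%m-%d")
    features = {
        # log1p is an assumption here; it keeps a zero goal well defined.
        "log_goal": math.log1p(float(row["goal"])),
        # Campaign length in whole days between launch and deadline.
        "duration_days": (deadline - launched).days,
        # Categorical predictors are passed through unchanged.
        "category": row["category"],
        "country": row["country"],
        "currency": row["currency"],
    }
    # Guard against accidentally carrying post-launch information forward.
    assert not LEAKY_FIELDS & set(features)
    return features

# Made-up record used only to exercise the function.
example = {
    "goal": "5000",
    "launched": "2020-01-01",
    "deadline": "2020-01-31",
    "category": "Games",
    "country": "US",
    "currency": "USD",
    "pledged": "1234.5",
    "backers": "42",
}
features = prelaunch_features(example)
```

The resulting dictionary contains only values available before launch, so it could feed any of the three classifiers without exposing the outcome through the pledged amount or backer count.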
