Class Project - Prediction of corruption in WB contracts
By Carlos Grandet, Santiago Matallana and Hector Salvador
Machine Learning for Public Policy (UChicago) - final project
This project provides a machine learning approach to labelling the World Bank contracts that are most prone to corruption and fraud. As a global institution, the World Bank awarded close to 200,000 contracts in the past 12 years, but fewer than 2,000 were investigated for malpractice. Lack of resources and personnel is likely a barrier to further auditing and sanctioning, which allows many corrupt acts to go unnoticed. We therefore aim to provide a model that allows the World Bank to be more efficient in its contract investigations by maximizing its ability to find the high-value contracts involving corruption or fraud. We propose a supervised learning approach based on contract features related to supplier, project, country of origin, procurement information and major sector. We then train on the data to find the best model that maximizes precision at a top N-percentile and test it with a subsample of labelled contracts. Our results on the testing data show that our model has a precision of 50% on the top 20% of our sample with a threshold of 95%. Finally, we run the model on unlabelled data to provide the World Bank with a list of contracts to investigate.
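As an illustration of the evaluation metric mentioned above, here is a minimal sketch of precision at the top fraction of a ranked sample. It is not the project's code: the function name `precision_at_top`, its arguments, and the toy labels and scores are all assumptions made for this example.

```python
def precision_at_top(y_true, scores, fraction):
    """Precision among the top `fraction` of items, ranked by descending score.

    Hypothetical helper for illustration; not taken from the project code.
    y_true holds 1 for a contract labelled as problematic and 0 otherwise.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if len(y_true) != len(scores):
        raise ValueError("y_true and scores must have the same length")
    if not y_true:
        raise ValueError("at least one item is required")

    # Number of items in the top slice (at least one).
    n_top = max(1, int(len(scores) * fraction))

    # Rank items from highest to lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)

    # Share of true positives among the top-ranked items.
    top_labels = [label for _, label in ranked[:n_top]]
    return sum(top_labels) / n_top


# Toy data (made up): 10 contracts, 3 of them labelled positive.
toy_labels = [0, 1, 0, 1, 0, 0, 1, 0, 0, 0]
toy_scores = [0.10, 0.95, 0.30, 0.80, 0.20, 0.60, 0.40, 0.05, 0.70, 0.50]

print(precision_at_top(toy_labels, toy_scores, 0.2))  # prints 1.0
print(precision_at_top(toy_labels, toy_scores, 0.5))  # prints 0.4
```

Ranking by score before slicing is what makes the metric suitable when only a fixed share of contracts can be investigated: it measures how many of the contracts at the top of the list are true positives.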
-
Run python feature_generation.py to add features to the data. Edit the input and output file paths in the script as needed ("data/tothepipe.csv" and "data/tothepipe_II.csv", respectively).
-
Run code from the WB_delivery.ipynb notebook.
-
The output is generated by applying the proposed model to the contracts that have no label (i.e. not marked as unsubstantiated, substantiated, or unfounded). Repeat the feature generation step on that dataset, changing the input and output file names accordingly ("data/tothecheck.csv").
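The separation of labelled from unlabelled contracts could be sketched as follows. This is only an illustration under stated assumptions: the column names `contract_id` and `outcome`, the helper `split_by_label`, and the sample rows are hypothetical and do not come from the project's data files.

```python
import csv
import io

# Label values named in this README; any other value counts as "no label".
KNOWN_LABELS = {"unsubstantiated", "substantiated", "unfounded"}


def split_by_label(rows, label_column):
    """Split rows into (labelled, unlabelled) based on the label column.

    Hypothetical helper for illustration; not taken from the project code.
    """
    labelled, unlabelled = [], []
    for row in rows:
        value = (row.get(label_column) or "").strip().lower()
        if value in KNOWN_LABELS:
            labelled.append(row)
        else:
            unlabelled.append(row)
    return labelled, unlabelled


# Made-up sample standing in for a contracts file.
SAMPLE = """contract_id,outcome
A1,Substantiated
A2,
A3,Unfounded
A4,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
labelled, unlabelled = split_by_label(rows, "outcome")
print([row["contract_id"] for row in unlabelled])  # prints ['A2', 'A4']
```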