Detecting corruption, collusion & fraud

MS Data Science Thesis

View the Project on GitHub carpetri/detecting_corruption

Detecting Collusion, Corruption, and Fraud

by Carlos Petricioli

This project was carried out at the Eric and Wendy Schmidt Data Science for Social Good (DSSG) summer fellowship at the University of Chicago during the summer of 2014. As a data fellow, I was assigned to the project on detecting corruption, collusion, and fraud, with the World Bank as a partner. My mentor was Eric Rozier from the University of Cincinnati; my teammates were Dylan Fitzpatrick, Jeff Alstott, and Misha Teplitskiy; our partner from the World Bank was Elizabeth Wiramidjaja; and we all worked under the supervision of Rayid Ghani, the director of DSSG. This paper reports on the work done during that project. Some of the visualizations and data are private, so this paper presents the public version of the project. For some visualizations and interpretations, fake data was used to protect confidentiality.

The World Bank Group lends billions of dollars each year to fund development projects in its efforts to reduce global poverty. This project helps investigators at the Bank search for patterns of collusion, corruption, and fraud in its contracts data, using models of contract-specific risk. Developing an automated approach to detecting these offenses can help the World Bank efficiently target future investigations.

Contractors providing goods and services on World Bank projects are typically hired through a competitive bidding process. Occasionally, prospective contractors influence the competitive system by colluding with other contractors, bribing government officials, or otherwise manipulating the bidding process. These offenses have far-reaching effects on the price and quality of contract delivery. The World Bank is committed to detecting instances of collusion, corruption, and fraud in order to maximize its global impact.

For this project, we met with the World Bank team in charge of tackling this problem. The first objective was to understand what corruption looks like in the data. Their suggestion was to look for specific patterns in the procurement data. For example, turn-taking behavior among suppliers of goods and services is a possible indicator of collusion. Likewise, a non-competitive bidding pattern, such as one supplier winning all the contracts, is a possible indicator of corruption. There are other possible indicators as well.
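To make the two example patterns concrete, here is a minimal sketch of how they could be turned into numbers for one market (say, one country and sector). The function names, the award sequences, and the choice of these two particular statistics are assumptions made for illustration; they are not the features actually used in the project.

```python
from collections import Counter


def top_winner_share(winners):
    """Fraction of contracts won by the single most frequent winner.

    Values near 1.0 mean one supplier wins almost everything.
    """
    counts = Counter(winners)
    return max(counts.values()) / len(winners)


def alternation_rate(winners):
    """Fraction of consecutive awards where the winner changes.

    Among a small group of suppliers, a rate near 1.0 is consistent
    with turn-taking (it is not proof of it).
    """
    if len(winners) < 2:
        return 0.0
    changes = sum(a != b for a, b in zip(winners, winners[1:]))
    return changes / (len(winners) - 1)


# Hypothetical award sequences, ordered by award date.
rotating = ["A", "B", "A", "B", "A", "B"]
dominated = ["A", "A", "A", "A", "B", "A"]

print(top_winner_share(rotating), alternation_rate(rotating))
print(top_winner_share(dominated), alternation_rate(dominated))
```

Neither number is evidence of wrongdoing on its own; statistics like these would only serve as inputs that help rank contracts for closer review.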

This project incorporated data from multiple sources. The core was historical data on over 300,000 major contracts funded by World Bank loans over the past 20 years, with features such as company name, country, sector, and total award amount. To classify each contract more accurately, we added annual economic development indicators, collected by the World Bank, for countries and the industries within them. Finally, the World Bank gave us investigations data covering companies and projects investigated for collusion, corruption, or fraud in past years, including specific allegations and case outcomes.
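As an illustration of how contract records can be enriched with annual indicators, the sketch below attaches country-year values to each contract. All field names (`country`, `year`, `gdp_per_capita`) and all values are hypothetical placeholders, and the join key is an assumption; the real data schema is not public.

```python
def add_indicators(contracts, indicators):
    """Attach country-year indicators to each contract (a left join).

    `contracts` is a list of dicts with "country" and "year" keys.
    `indicators` maps (country, year) to a dict of indicator values.
    """
    enriched = []
    for contract in contracts:
        key = (contract["country"], contract["year"])
        # Contracts without a matching indicator row keep only their own fields.
        enriched.append({**contract, **indicators.get(key, {})})
    return enriched


# Hypothetical records, for illustration only.
contracts = [
    {"company": "ACME Inc.", "country": "XX", "year": 2010, "amount": 1_000_000},
    {"company": "ACME Inc.", "country": "YY", "year": 2011, "amount": 250_000},
]
indicators = {("XX", 2010): {"gdp_per_capita": 1234.0}}

enriched = add_indicators(contracts, indicators)
print(enriched[0])
```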

[Figure: panels labeled by region code: AFR, EAP, ECA, LCR, MNA, OTH, SAR]

The first big problem the project faced after cleaning the data was that company names appear as different text strings across data sources, so a single company may be represented in several very different ways (e.g. ACME Inc. vs. A.C.M.E. Co.). This mattered because a model that predicts a contract's risk of corruption or fraud needs each entity to be identified by the same string throughout the data. Our solution was company name disambiguation: we reconciled company names by querying each name on Google and comparing the top 10 URL results. Names with at least 7 links in common were considered a single company. The size of the data made this computationally demanding. Google limits automated use of its search, so we created a large number of virtual machines, queried Google for the URLs from each machine, and gathered all the results in a database. The result was good enough for the World Bank team, who now have a better way of investigating companies.
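The matching rule described above can be sketched as follows, assuming the search results have already been collected into a mapping from company name to its list of top URLs. The function name `group_names`, the example URLs, and the choice to merge matches transitively (if A matches B and B matches C, all three become one company) are assumptions made for illustration; the source does not say how chains of matches were handled.

```python
from itertools import combinations


def group_names(top_urls, min_shared=7):
    """Group names whose top-URL lists share at least `min_shared` links.

    `top_urls` maps each company name to its list of search-result URLs.
    Matches are merged transitively with a small union-find structure.
    """
    parent = {name: name for name in top_urls}

    def find(name):
        # Follow parent links to the group's representative (path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    url_sets = {name: set(urls) for name, urls in top_urls.items()}
    for a, b in combinations(top_urls, 2):
        if len(url_sets[a] & url_sets[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for name in top_urls:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Hypothetical search results: the first two names share exactly 7 URLs.
results = {
    "ACME Inc.": [f"acme.example/{i}" for i in range(10)],
    "A.C.M.E. Co.": [f"acme.example/{i}" for i in range(3, 13)],
    "Other Ltd.": [f"other.example/{i}" for i in range(10)],
}
groups = group_names(results)
print(groups)
```

Comparing every pair of names is quadratic in the number of names, so at the scale of the real data some way of limiting which pairs are compared would probably be needed.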

After long nights of waiting for this algorithm to finish, we were finally ready to build tools for proactive investigation within the World Bank. To evaluate contract risk, we generated features and models tracking companies' historical involvement in World Bank projects within specific countries and sectors. We also created co-award network features for each company. For example, below you can see the network for General Electric Company. Blue nodes are the projects that General Electric Company has been part of, and green nodes are the companies that worked on each of those projects. This is the public version of the network; in the version we delivered to the World Bank, colors also indicate whether a company was investigated and found guilty. Network features turned out to be very good predictors of risk in the final model.
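A co-award network of this kind can be built from (project, company) pairs. The sketch below derives two simple features per company; the function names, the example awards, and the two features themselves (number of co-awardees and the share of them that were flagged) are assumptions for illustration, not the feature set delivered to the World Bank.

```python
from collections import defaultdict


def co_award_neighbors(awards):
    """Map each company to the set of companies sharing a project with it.

    `awards` is an iterable of (project_id, company) pairs.
    """
    by_project = defaultdict(set)
    for project, company in awards:
        by_project[project].add(company)

    neighbors = defaultdict(set)
    for companies in by_project.values():
        for company in companies:
            neighbors[company] |= companies - {company}
    return neighbors


def network_features(company, neighbors, flagged):
    """Two illustrative features: degree and share of flagged neighbors."""
    nbrs = neighbors.get(company, set())
    degree = len(nbrs)
    flagged_share = len(nbrs & flagged) / degree if degree else 0.0
    return {"degree": degree, "flagged_neighbor_share": flagged_share}


# Hypothetical awards and a hypothetical set of previously flagged companies.
awards = [("P1", "A"), ("P1", "B"), ("P2", "A"), ("P2", "C"), ("P3", "D")]
flagged = {"B"}

neighbors = co_award_neighbors(awards)
print(network_features("A", neighbors, flagged))
print(network_features("D", neighbors, flagged))
```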

For the model, we trained a binary classifier separating past contracts that were investigated by the World Bank from those that were not. We evaluated and compared models using precision, recall, and area under the ROC curve. A random forest provided the best results across all metrics on a held-out test set.
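For readers unfamiliar with these metrics, the sketch below computes them from scratch on a tiny made-up test set. The labels, scores, and the 0.5 decision threshold are hypothetical and are not results from the project. In practice a library implementation would normally be used instead.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = investigated)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(y_true, scores):
    """Area under the ROC curve.

    Computed as the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one (ties count half).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical held-out labels and model risk scores.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.4]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

precision, recall = precision_recall(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(precision, recall, auc)
```

Precision and recall depend on the chosen threshold, while the area under the ROC curve summarizes ranking quality across all thresholds, which is one reason to report all three.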

Finally, we developed an interactive map and a dashboard for World Bank investigators to track a company's activity across countries, sectors, and time. You can see a public version of the interactive map here. Using this tool, investigators can track the contract awards companies have received, including under different names (e.g. ACME, Inc. vs. ACME Co.); view a risk score for each World Bank contract, as calculated by the contract risk model; and visualize the immediate neighborhood of the company in its co-award network.