Our story “How Google’s Ad Business Funds Disinformation Around the World” found that, despite Google’s public commitments to fight disinformation, it continues to allow websites to use Google’s ad systems to profit from false and misleading content. Our reporting identified websites that were allowed to continue to collect revenue from Google ads, even on stories that appeared to violate the company’s policies against unreliable and harmful claims related to COVID-19, health, elections and climate change. We also found that websites containing misinformation in languages other than English, and in smaller markets, were more likely to be allowed to continue to profit from Google ads than similar English-language websites.
We analyzed datasets of articles and websites containing false claims to determine what proportion of them made money using Google’s ad platforms. We obtained these datasets from organizations that track online disinformation around the world and wrote software to determine whether a web address was currently earning money from Google ads. Between Aug. 23 and Sept. 13, 2022, we ran the datasets through this software system to calculate the proportion of web addresses monetizing with Google ads for each dataset. We include our detailed findings in Appendix A.
Data Sources
We analyzed 17 article and website datasets, totaling more than 13,000 active articles and over 8,000 domains, obtained from nine fact-checking and news quality monitoring organizations. Some of the datasets cover articles and websites from a particular country or region, while others cover subject matter, such as COVID-19 misinformation or climate change misinformation. In Appendix B we include a description of each dataset and the organizations that provided them.
Data Cleaning
The datasets varied in size, types of content and level of curation. We filtered all URL datasets to include only articles published after 2019 to keep the datasets recent and roughly within the same time frame. If the dataset provided information on the type of fact-checked content, we limited it to the most serious forms of disinformation or disinformation purveyors. For example, Brazil’s Netlab provided a column distinguishing between suspected and confirmed purveyors of disinformation, allowing us to select confirmed purveyors.
Some datasets included links to social media platforms, such as Facebook or Twitter. We excluded these links from our analysis. Some datasets also had links to images or PDFs, which we similarly excluded. See Appendix C for a full list of exclusions.
The datasets from the International Fact-Checking Network and Raskrinkavanje included articles that had been archived using a webpage archiving service such as archive.today. In these cases, we wrote programs to extract the original web addresses of the false or misleading articles. For the IFCN dataset, we extracted by hand any addresses that we could not extract by code. For Raskrinkavanje, we excluded from our final analysis any remaining links that could not be extracted. Links that could not be extracted accounted for less than 1% of the total webpages from the datasets. We do not have reason to believe these excluded links biased our results. See Appendix C for more detailed information.
Analyzing a Web Address
Our system to determine whether a web address was currently earning money with Google’s ad systems consists of two components: a web scraper and a data analysis script.
Web Scraper
A web scraper is software that can systematically extract and save data from a visited web page. ProPublica’s scraper uses a library called Playwright, which can mimic human behavior when visiting a site and is often used for automated website testing.
When our web scraper visits any web address, URL or base domain, it collects and saves the following information:
- All network requests initiated by the webpage. Network requests are used to retrieve web content such as images, text and ads or to provide information such as user actions or profile information back to the web servers.
- The response for each network request, if those requests went out to Google servers (a handful of servers we identified as serving or related to Google’s ad content). When successful, these responses contain ad content that the website loads onto the page.
- The webpage content. Once the webpage loads, the scraper captures its HTML, the code that defines what a visitor to that page would see.
- The ads.txt file: The ads.txt file lists all of a website’s advertising partners. Not all websites make this file available to visitors, but it is highly recommended by Google and the IAB Tech Lab as a web advertising transparency best practice.
- A random subpage: When visiting a website, the scraper will select an arbitrary subpage link found on the base domain (e.g. for test.com, test.com/morecontent) and also scrape the same information for that page. This is done to capture cases where the homepage for a website does not run ads, but sections of the website do.
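Two of the collection steps above can be sketched in Python with only the standard library: choosing an arbitrary same-site subpage and locating a site's ads.txt file. This is an illustration, not ProPublica's scraper, which is built on Playwright. The names `LinkCollector`, `same_site_subpages`, `pick_subpage` and `ads_txt_url` are hypothetical, and treating "same host, non-empty path" as the definition of a subpage is an assumption.

```python
import random
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_subpages(base_url, html):
    """Return absolute links on the same host as base_url, excluding the homepage."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    subpages = set()
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar links
        # Assumption: a subpage is any link on the same host with a non-empty path.
        if parsed.netloc == base_host and parsed.path.strip("/"):
            subpages.add(absolute)
    return sorted(subpages)


def pick_subpage(base_url, html, rng=random):
    """Pick one arbitrary same-site subpage, or None if the page links to none."""
    candidates = same_site_subpages(base_url, html)
    return rng.choice(candidates) if candidates else None


def ads_txt_url(page_url):
    """ads.txt is published at the root of the host."""
    parsed = urlparse(page_url)
    return f"{parsed.scheme}://{parsed.netloc}/ads.txt"
```

For the example in the list above, a homepage at test.com that links to /morecontent would yield test.com/morecontent as the subpage to scrape next.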
Our analysis tool processes the above data from each URL to determine whether the address is valid, and if so whether it is monetizing with Google’s ad systems.
We manually identified 10 separate network request and response pairs that indicate a webpage is making a request to a Google server for one or multiple ads. If the response did not contain advertising content, then we did not count the website as monetizing with Google. (This may occur, for example, if the webpage makes an ad request, but Google has demonetized the specific page or website.) We then wrote software that would look for these request-response pairs in the data collected by our web scraper.
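The matching step might look roughly like the Python sketch below. The article does not list its 10 request-response pairs, so the two signatures here are illustrative assumptions, as are the `Exchange` record, the `AD_SIGNATURES` list and the rule that a response "contains advertising content" when its body includes HTML markup.

```python
import re
from dataclasses import dataclass


@dataclass
class Exchange:
    """One network request and the body of its response, as saved by a scraper."""
    request_url: str
    response_body: str


# Placeholder check for "the response contains ad content" (an assumption).
HAS_MARKUP = re.compile(r"<(html|script|iframe|ins)\b", re.IGNORECASE)

# Hypothetical signatures for illustration only; NOT ProPublica's actual 10 pairs.
# Each entry pairs a request-URL pattern with a pattern the response must match.
AD_SIGNATURES = [
    (re.compile(r"^https://googleads\.g\.doubleclick\.net/pagead/ads\?"), HAS_MARKUP),
    (re.compile(r"^https://securepubads\.g\.doubleclick\.net/gampad/ads\?"), HAS_MARKUP),
]


def served_google_ad(exchanges):
    """True if any exchange matches both halves of a known request-response pair."""
    for exchange in exchanges:
        for request_pattern, response_pattern in AD_SIGNATURES:
            if (request_pattern.search(exchange.request_url)
                    and response_pattern.search(exchange.response_body)):
                return True
    return False
```

Requiring both halves to match is what keeps an unanswered ad request, such as one from a demonetized page, from counting as monetization.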
We also identified scenarios where a scraper visit did not result in valid webpage content. These invalid visits can mean the scraper was redirected to a different page from the original page, the content at the web address is no longer available, or the server is no longer reachable.
Thus, for a single web address, there are three possible outcomes of the analysis:
- The web address is valid, and it is monetizing with Google’s ad systems.
- The web address is valid, but it is not monetizing with Google’s ad systems.
- The web address is not valid or the content has been removed.
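The three outcomes can be expressed as a small classification function. The Python below is a sketch under stated assumptions: the `Visit` record, the `Outcome` names and the specific validity checks (an HTTP error status, an unreachable server, or a redirect to a different page) are hypothetical stand-ins for whatever checks the actual analysis script performs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional
from urllib.parse import urlparse


class Outcome(Enum):
    VALID_MONETIZING = "valid, monetizing with Google ads"
    VALID_NOT_MONETIZING = "valid, not monetizing with Google ads"
    INVALID = "not valid or content removed"


@dataclass
class Visit:
    requested_url: str
    final_url: Optional[str]  # None when the server could not be reached
    status: Optional[int]     # HTTP status of the final response, if any
    ad_served: bool           # result of the request-response pair check


def _page_key(url):
    """Host (without a leading www.) and path, ignoring scheme and trailing slash."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host, parsed.path.rstrip("/")


def classify(visit):
    """Map one scraper visit to one of the three outcomes."""
    if visit.final_url is None or visit.status is None or visit.status >= 400:
        return Outcome.INVALID  # unreachable server or missing content
    if _page_key(visit.final_url) != _page_key(visit.requested_url):
        return Outcome.INVALID  # redirected away from the original page
    if visit.ad_served:
        return Outcome.VALID_MONETIZING
    return Outcome.VALID_NOT_MONETIZING
```

Checking validity before monetization means a redirect to an ad-carrying homepage is not counted as the original article earning money.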
We scraped and analyzed each web address in our 17 datasets to determine which of the three categories it fell under. We then compiled the results in a spreadsheet. Appendix A provides the detailed results of this analysis.
Verifying the Results
We hand-checked the results of all of the smaller domain datasets by visiting each page and determining the validity of its web address and whether the webpage was monetizing via Google’s ad systems. For the larger datasets containing individual webpages, we extracted and checked a random sample of web addresses by hand, using a 90% confidence level and 10% margin of error.
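The article does not say how the sample sizes were derived. One common approach, shown here as an assumption rather than as ProPublica's method, is Cochran's formula with a finite population correction and the most conservative proportion, p = 0.5. The function name `sample_size` is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.90, margin=0.10, p=0.5):
    """Sample size for estimating a proportion, with finite population correction."""
    # Two-sided z-score for the confidence level (about 1.645 for 90%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Cochran's formula for an effectively infinite population.
    n0 = z * z * p * (1 - p) / (margin * margin)
    # Correct for the finite size of the dataset being sampled.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)
```

Under these assumptions, a dataset of 9,973 web pages calls for a sample of 68 and a dataset of 814 calls for 63, so the required sample grows very little with dataset size.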
The scraper and analysis tools were designed to make false positives (where we falsely flag a web address as monetizing with Google) very rare. In fact, we never identified a false positive during our audit. There were some instances where ads were displayed at the time of the scrape but not when we manually visited the page later on (or vice versa). In these cases, we manually examined the scraped data to confirm ad content was served at the time of the scrape. There were a few rare instances where content returned from the ad server was never loaded on the page, possibly because of coding errors on the webpage. We still counted these cases as positives, since they are indications of an active monetization relationship with Google.
False negatives (where the scraper did not find ads on the page but ads were present) were more common and arose in several scenarios. For example, the scraper was sometimes blocked from accessing a page or failed to bypass page pop-ups such as consent forms. In our audits, we saw false negative rates of between 0% and 13%.
Because we found false negatives more often than false positives, the true proportion of these web addresses monetizing with Google’s ad systems is likely slightly higher than what we reported.
Appendix A
| Dataset name | Data source | Languages covered | Regions covered | Domains or web pages | Number of valid web addresses analyzed | Number of valid web addresses monetizing with Google ads | % of valid web addresses monetizing with Google ads |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Africa Check Misinformation Web Pages | Africa Check | English | Nigeria, South Africa, and Kenya | Web pages | 66 | 38 | 57.6 |
| Africa Check Misinformation Web Pages Senegal | Africa Check | French | Senegal, Guinea, Mali, Côte d'Ivoire, and Cameroon | Web pages | 44 | 29 | 65.9 |
| Balkans Misinformation Web Pages | Raskrinkavanje | Bosnian-Croatian-Serbian | Serbia, Croatia, Bosnia and Herzegovina | Web pages | 9,973 | 6,216 | 62.3 |
| Balkans Publishers | Raskrinkavanje | Bosnian-Croatian-Serbian | Serbia, Croatia, Bosnia and Herzegovina | Domains | 30 | 26 | 86.7 |
| Brazil Publishers | Netlab | Portuguese | Brazil | Domains | 30 | 24 | 80 |
| Latin American Publishers | Chequeado | Spanish, Portuguese | Argentina, Bolivia, Brazil, Colombia, Costa Rica, Cuba, Ecuador, Venezuela, Peru and Mexico | Domains | 49 | 19 | 38.8 |
| Covid Disinformation Pages | International Fact-Checking Network | Various | Global | Web pages | 814 | 338 | 41.5 |
| NewsGuard Publisher list | NewsGuard | Various | Global | Domains | 7,739 | 4,186 | 54.1 |
| Turkey Disinformation Pages | Teyit | Turkish | Turkey | Web pages | 1,035 | 756 | 73 |
| Turkey Publishers | Teyit | Turkish | Turkey | Domains | 50 | 45 | 90 |
| Spanish Language Publishers | EU DisinfoLab | Spanish | Spain | Domains | 32 | 14 | 43.8 |
| German Language Publishers | EU DisinfoLab | German | Germany, Austria and Switzerland | Domains | 30 | 10 | 33.3 |
| EU Disinformation Pages | EU DisinfoLab | Various | EU | Web pages | 235 | 57 | 24.3 |
| Climate Disinformation Pages | Science Feedback | Various | Global | Web pages | 427 | 86 | 20.1 |
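As a consistency check on the figures above, the percentage column can be recomputed from the two count columns. The short Python sketch below does that. The counts and reported percentages are copied from the results above, while the `ROWS` layout and the `percent_monetizing` helper are hypothetical names introduced for illustration.

```python
# (dataset name, valid addresses analyzed, valid addresses monetizing, reported %)
ROWS = [
    ("Africa Check Misinformation Web Pages", 66, 38, 57.6),
    ("Africa Check Misinformation Web Pages Senegal", 44, 29, 65.9),
    ("Balkans Misinformation Web Pages", 9973, 6216, 62.3),
    ("Balkans Publishers", 30, 26, 86.7),
    ("Brazil Publishers", 30, 24, 80.0),
    ("Latin American Publishers", 49, 19, 38.8),
    ("Covid Disinformation Pages", 814, 338, 41.5),
    ("NewsGuard Publisher list", 7739, 4186, 54.1),
    ("Turkey Disinformation Pages", 1035, 756, 73.0),
    ("Turkey Publishers", 50, 45, 90.0),
    ("Spanish Language Publishers", 32, 14, 43.8),
    ("German Language Publishers", 30, 10, 33.3),
    ("EU Disinformation Pages", 235, 57, 24.3),
    ("Climate Disinformation Pages", 427, 86, 20.1),
]


def percent_monetizing(valid, monetizing):
    """Share of valid web addresses monetizing with Google ads, to one decimal."""
    return round(100 * monetizing / valid, 1)


# Names of any rows whose recomputed percentage differs from the reported one.
mismatches = [
    name
    for name, valid, monetizing, reported in ROWS
    if percent_monetizing(valid, monetizing) != reported
]
```

Every recomputed percentage matches the reported one, so `mismatches` is empty.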
Appendix B: Organization and Dataset details
All datasets were filtered to remove duplicates, archived URLs that could not be successfully unarchived, data before 2019 and URLs from social media sites such as Facebook, Twitter, Weibo, Pinterest, Telegram and WhatsApp (see full list in Appendix C).
Africa Check
Website: https://africacheck.org/
Description: Africa Check is an African nonprofit fact-checking organization founded in South Africa in 2012.
Datasets analyzed:
- Articles in French from Senegal, Guinea, Mali, Côte d’Ivoire and Cameroon between 2019 and 2022 fact-checked and determined to be misinformation.
- Articles in English from Nigeria, South Africa and Kenya between 2019 and 2022 fact-checked and determined to be misinformation.
Raskrinkavanje
Website: https://raskrinkavanje.ba/
Description: Raskrinkavanje is a fact-checking program for media organizations in the Balkans. It was founded in 2017 by Zašto ne, a civil society organization based in Bosnia and Herzegovina.
Datasets analyzed:
- Articles from the region between 2019 and July 2022 that were fact-checked by Raskrinkavanje and determined to be misinformation.
- Thirty websites that were most frequently identified as publishing misinformation by Raskrinkavanje in the region from 2019 to July 2022.
Netlab
Website: http://www.netlab.eco.ufrj.br/
Description: Netlab is a research laboratory of the School of Communication of the Federal University of Rio de Janeiro (UFRJ) that uses network analysis to study online misinformation.
Datasets analyzed:
- A list of websites shared within Brazilian right wing and left wing WhatsApp and Telegram groups and channels in August 2022 and flagged by researchers as a source of disinformation in Portuguese.
Chequeado
Website: https://chequeado.com/
Description: Chequeado is a nonpartisan, nonprofit news monitoring and fact-checking organization founded in Argentina in 2010.
Datasets analyzed:
- Websites determined by LatamChequea, Chequeado’s fact-checking partners in Latin America, to be spreading false information.
International Fact-Checking Network
Website: https://www.poynter.org/ifcn/
Description: The International Fact-Checking Network is a network of 100 fact-checking organizations around the world. It was launched in 2015 by the Poynter Institute, a nonprofit journalism institute based in St. Petersburg, Florida.
Datasets analyzed:
- COVID: links to social media and news content spreading misinformation about the COVID-19 pandemic.
NewsGuard
Website: https://www.newsguardtech.com/
Description: NewsGuard is a company that provides trust ratings for the most visited websites in the U.S., U.K., Canada, Germany, France and Italy.
Datasets analyzed:
- Domains for news websites around the world rated by NewsGuard. Reliability ratings range from 0 to 100 (0 being completely untrustworthy).
Teyit
Website: https://teyit.org/
Description: Teyit is a Turkish nonprofit fact-checking and media literacy social enterprise founded in 2016.
Datasets analyzed:
- Articles that were published in 2019 or later that contained claims categorized as “incorrect association,” “manipulation,” or “distortion” and which the fact-checkers had not seen subsequently corrected. (Fact-checkers provided access to a database containing a wide range of thousands of fact-checks which ProPublica filtered based on the previous criteria.)
EU DisinfoLab
Website: https://www.disinfo.eu/
Description: EU DisinfoLab is a Brussels-based nonprofit organization that studies misinformation in the EU.
Datasets analyzed:
- Articles from the region between 2019 and present that were fact-checked by EU DisinfoLab and determined to be misinformation.
- Websites from Spain and German-speaking countries that were identified as sources of false and misleading claims in the regions.
Science Feedback
Website: https://sciencefeedback.co/
Description: Science Feedback is a nonprofit based in France that produces scientist-expert fact-checks for health and climate news articles.
Datasets analyzed:
- Articles related to climate and climate change published in 2019 or later to which Science Feedback gave its lowest rating, “False.”
Appendix C
All datasets were cleaned with the intention of removing invalid links, social media traffic, archived content and images/PDFs.
Any links originating from the below social media or content hosting sites were removed from the final analysis.
- Google Drive
- Telegram
- TikTok
- Vimeo
- YouTube
Any links ending in any of the below were automatically excluded from the final analysis:
- .png
- .jpg
- .jpeg
- ?type=image
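The two exclusion rules above amount to a host check and a suffix check. In the Python sketch below, the endings are taken from the list above, but the host names are assumptions mapped from the platform names (the article lists platforms, not domains), and `is_excluded` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Assumed host names for the platforms listed above.
EXCLUDED_HOSTS = (
    "drive.google.com",
    "t.me",
    "telegram.me",
    "tiktok.com",
    "vimeo.com",
    "youtube.com",
    "youtu.be",
)

# Endings taken from the list above.
EXCLUDED_ENDINGS = (".png", ".jpg", ".jpeg", "?type=image")


def is_excluded(url):
    """True if the link points to an excluded platform or to an image."""
    host = urlparse(url).netloc.lower()
    # Match the host itself or any subdomain of it (for example www.youtube.com).
    if any(host == h or host.endswith("." + h) for h in EXCLUDED_HOSTS):
        return True
    return url.lower().endswith(EXCLUDED_ENDINGS)
```

Matching on the full host or a dot-prefixed suffix avoids excluding unrelated domains whose names merely end with the same characters.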
We visited any link pointing to one of the archiving sites below and attempted to extract the archived URL. If the extraction failed, or if the extracted link was of a type that would be excluded from the final analysis anyway, we discarded the URL.
- web.archive.org
- webcache.googleusercontent.com
- archive.today
- google.com/url?
- perma.cc
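For two of these services, the original address is embedded in the link itself and can be recovered without visiting the page. The Python sketch below shows that shortcut, which is a different technique from the visit-and-extract approach described above. It handles only web.archive.org and google.com/url? links; services such as archive.today and perma.cc use short identifiers and would still require loading the archived page. The function name `original_url` is hypothetical.

```python
import re
from urllib.parse import parse_qs, urlparse

# Wayback Machine links look like /web/<timestamp>[modifier]/<original URL>.
WAYBACK = re.compile(r"^https?://web\.archive\.org/web/\d+[a-z_]*/(.+)$", re.IGNORECASE)


def original_url(url):
    """Return the original address embedded in an archive link, or None."""
    match = WAYBACK.match(url)
    if match:
        return match.group(1)

    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host in ("google.com", "www.google.com") and parsed.path == "/url":
        # Google redirect links carry the target in the q or url parameter.
        query = parse_qs(parsed.query)
        for key in ("q", "url"):
            if key in query:
                return query[key][0]

    # Not recoverable from the link alone.
    return None
```

A `None` result corresponds to the case above where extraction fails and the URL is discarded, unless the archived page is visited instead.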
This content originally appeared on Articles and Investigations - ProPublica and was authored by Ruth Talbot, Jeff Kao, Craig Silverman and Anna Klühspies.