Ballotpedia increases data extraction efficiency using AWS

100s of research hours saved each month

3k+ source URLs ingested

100+ PDFs extracted and analysed each day

Overview

Ballotpedia, a non-profit organization dedicated to providing neutral information about U.S. political processes and public policy, needed to automate its data extraction process to improve the accuracy and timeliness of its online political encyclopedia. The organization partnered with Lambert Labs to leverage AWS services, including AWS Fargate and Amazon EventBridge, to create a scalable and efficient solution for extracting election information from a variety of sources and formats.

We were satisfied using AWS, but now we’re getting even more out of it with Lambert Labs.
Matt Latourelle, Director of Technology Operations, Ballotpedia

Opportunity / Customer Challenge

Ballotpedia, a comprehensive source of information on American politics, faced a significant challenge in maintaining the timeliness and accuracy of its election data. This data, crucial for their platform, is often fragmented and dispersed across thousands of local government websites, ranging from county clerk offices to city election boards. Prior to implementing a new solution on AWS, Ballotpedia relied on a labor-intensive, manual process. Staff members curated lists of potentially relevant URLs and then regularly checked these sites for updates, manually extracting election information, including key details like election dates, and recording it in spreadsheets.

This approach presented several key challenges. First, it was time-consuming and resource-intensive, requiring significant staff time to monitor the vast and ever-changing landscape of local government websites. Second, the manual process was prone to human error, which could introduce inaccuracies into the data. Third, the frequency of updates was limited by the available resources, so Ballotpedia’s information might not always be up to date, which affected the reliability of the platform. Finally, the manual process struggled with the variability in data formats: election information was not always available in easily parsed HTML and was often embedded in PDFs, which required further manual intervention. Ballotpedia needed a solution on AWS that could automate the data extraction process.

Solution

At the heart of the solution is AWS Fargate, a serverless compute engine for containers that allows Ballotpedia to run its data extraction tasks without managing servers or clusters. The solution can therefore scale with the fluctuating demands of web data extraction, and Ballotpedia pays only for the compute resources actually consumed.

To orchestrate these tasks and keep the data current, Amazon EventBridge triggers the data extraction process on a schedule defined by Ballotpedia. This automation eliminates the need for manual intervention and keeps the election information up to date.
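
A scheduled trigger of this kind could be sketched with boto3 roughly as follows. This is a minimal illustration under stated assumptions, not Ballotpedia's actual configuration: the rule name, the daily rate, the target ID, and every ARN and subnet ID are hypothetical placeholders, and the helper function names are invented for this example. Only the `put_rule` and `put_targets` calls and their parameter names come from the EventBridge API.

```python
from typing import Any


def build_schedule_rule(name: str, schedule: str) -> dict[str, Any]:
    """Build keyword arguments for EventBridge put_rule with a schedule expression."""
    return {"Name": name, "ScheduleExpression": schedule, "State": "ENABLED"}


def build_fargate_target(
    rule_name: str,
    cluster_arn: str,
    task_definition_arn: str,
    role_arn: str,
    subnets: list[str],
) -> dict[str, Any]:
    """Build keyword arguments for put_targets that launch one Fargate task."""
    return {
        "Rule": rule_name,
        "Targets": [
            {
                "Id": "extraction-task",  # hypothetical target ID
                "Arn": cluster_arn,       # for ECS targets, this is the cluster ARN
                "RoleArn": role_arn,      # role EventBridge assumes to run the task
                "EcsParameters": {
                    "TaskDefinitionArn": task_definition_arn,
                    "TaskCount": 1,
                    "LaunchType": "FARGATE",
                    "NetworkConfiguration": {
                        "awsvpcConfiguration": {
                            "Subnets": subnets,
                            "AssignPublicIp": "ENABLED",
                        }
                    },
                },
            }
        ],
    }


def apply_schedule(rule: dict[str, Any], target: dict[str, Any]) -> None:
    """Create the rule and attach the target (requires AWS credentials)."""
    import boto3  # imported lazily so the builders above work without AWS access

    events = boto3.client("events")
    events.put_rule(**rule)
    events.put_targets(**target)
```

Keeping the request construction separate from the API calls means the schedule and target definitions can be inspected or tested without touching an AWS account; `apply_schedule` is the only function that needs credentials.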

Extracted data is stored in Amazon RDS for PostgreSQL, a fully managed relational database service, so that the election information is readily available to Ballotpedia’s internal systems and applications. For data embedded in PDF files, the solution stores the files in Amazon Simple Storage Service (Amazon S3) and uses OCR libraries to extract the relevant text. This allows Ballotpedia to access and use election information regardless of the format in which it is published.
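
The split between HTML and PDF sources might be handled along the following lines. The case study does not describe how the format is detected or how files are named in Amazon S3, so everything here is an assumption made for illustration: the `FetchedDocument` type, the function names, the `pdfs` key prefix, and the choice of a SHA-256 hash of the URL as the object key are all hypothetical. The one fixed fact the sketch relies on is that PDF files begin with the bytes `%PDF-`.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FetchedDocument:
    """A downloaded source page or file (hypothetical type for this sketch)."""
    url: str
    content_type: str
    body: bytes


def is_pdf(doc: FetchedDocument) -> bool:
    """Detect a PDF by its magic bytes, falling back to the Content-Type header."""
    return doc.body.startswith(b"%PDF-") or "application/pdf" in doc.content_type.lower()


def s3_key_for(doc: FetchedDocument, prefix: str = "pdfs") -> str:
    """Derive a stable object key from the source URL (an assumed naming scheme)."""
    digest = hashlib.sha256(doc.url.encode("utf-8")).hexdigest()
    return f"{prefix}/{digest}.pdf"


def route(doc: FetchedDocument) -> tuple[str, Optional[str]]:
    """Return the extraction path and, for PDFs, the key to store the file under."""
    if is_pdf(doc):
        return ("pdf", s3_key_for(doc))  # store the file, then run OCR on it
    return ("html", None)                # parse the markup directly
```

Checking the leading bytes as well as the header guards against servers that label PDFs with a generic content type such as `application/octet-stream`.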

As we built the tool and then revamped our infrastructure, Lambert Labs was sharing knowledge with our engineering team to help us better understand how everything fits together. We’re looking at better ways to deploy our stuff. We were satisfied using AWS, but now we’re getting even more out of it with Lambert Labs. (Matt Latourelle, Director of Technology Operations, Ballotpedia)

Outcome

Ballotpedia’s data extraction became significantly more efficient. Automating the previously manual process of data collection and extraction has increased productivity and saves hundreds of research hours each month, freeing staff to focus on higher-value tasks such as data analysis and content creation.

The automated system updates election information more frequently and accurately, reducing the risk of human error and helping Ballotpedia’s platform remain a reliable source of information for the public. Because the solution runs on serverless technologies like Fargate, it scales with the volume of data, and Ballotpedia pays only for the resources consumed.

The collaboration with Lambert Labs also introduced Ballotpedia to infrastructure engineering and DevOps best practices, raising the bar for their internal engineering team and promoting a culture of continuous improvement.

As a result, Ballotpedia has achieved significant operational improvements, enhanced the quality and timeliness of its data, and strengthened its position as a leading source of information on American politics.

There are many ways to manipulate data, and perhaps this is no more true than when extracting unstructured data from the Web. One part though is constant – if you want to do it at scale efficiently and reliably, using AWS infrastructure is a no-brainer. (George Lambert, Founder & CEO, Lambert Labs)

About Ballotpedia

Ballotpedia is a non-profit organization that operates a comprehensive online political encyclopedia covering federal, state, and local politics, elections, and public policy in the United States, and is a premier resource for unbiased information on these topics. Its mission is to inform the public by providing neutral and accurate information about political processes and public policy in the United States.