Containerised web data extraction for Ballotpedia using Python and AWS

Who are Ballotpedia?

Ballotpedia are a nonprofit online political encyclopedia covering federal, state, and local politics, elections, and public policy in the United States. They are the premier resource for unbiased information on United States elections, politics, and policy.

Why Lambert Labs?

Ballotpedia chose to collaborate with Lambert Labs because of our expertise and experience in running web-based Extract, Transform, Load (ETL) workloads in AWS. We were an especially good fit given our proficiency with Python, Scrapy, Docker containerisation, and PostgreSQL databases, technologies that are well suited to running modern web data extraction applications in the cloud.

Customer challenge

Ballotpedia faced a significant challenge in efficiently and accurately extracting relevant data from a vast array of web sources, including HTML pages and both searchable and non-searchable PDF files. With the ever-evolving landscape of American politics, staying up to date with the latest information was crucial for the organization. However, manual data extraction was time-consuming, resource-intensive, and prone to errors. Ballotpedia needed an automated solution to streamline their data extraction process while maintaining a high level of accuracy. Without one, they would have to either keep extracting the data manually, at great cost in time and staff resources, or offer only a very limited subset of the data required.

Additionally, Ballotpedia required a scalable and cost-effective infrastructure to handle the varying workloads involved in web-based data extraction. It was essential for them to have a solution that could accommodate the growing volume of data, as well as seamlessly integrate with their existing AWS infrastructure. The challenge was to design and implement a robust and efficient cloud-based architecture that would allow Ballotpedia to keep up with the ever-changing political landscape and maintain its reputation as a reliable source of information.

Partner solution

Our primary focus was to create an efficient and automated data extraction system that could handle various web sources and integrate seamlessly with Ballotpedia’s existing AWS infrastructure. We utilized a range of AWS services and technologies to ensure that the solution was not only cost-effective but also met the required performance, security, and reliability standards.

To achieve this, we:

  • Conducted an AWS Well-Architected Review to identify elements of Ballotpedia’s AWS infrastructure that would benefit from improvement or reconfiguration
  • Containerised a Python Scrapy application and Python scripts using Docker and Amazon ECR
  • Ran containers in AWS using ECS Fargate tasks and EventBridge rules
  • Extracted HTML data into Amazon RDS PostgreSQL databases and PDF files into Amazon S3 buckets
  • Used Optical Character Recognition (OCR) libraries to extract text from searchable and non-searchable PDFs, and PyQuery to extract data from HTML
  • Integrated Scrapy with Zyte for browser automation
  • Set up an automated testing suite using Pytest
  • Set up a CI/CD pipeline (linting, formatting, unit tests, database migrations, containerisation) in CircleCI
  • Used Alembic with SQLAlchemy for database migrations
  • Set up monitoring and logging in Amazon CloudWatch
  • Used RDS CloudFormation templates to satisfy Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO)
  • Automated the running and creating of ECS Fargate tasks via JSON templates and Boto3 scripts
  • Set up pre-commit hooks (linting, formatting, unit tests) in local development
  • Used Docker Compose to simplify and streamline local development
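
To give a flavour of one item in the list above, running ECS Fargate tasks from JSON templates with Boto3, here is a minimal sketch. It is an illustration under stated assumptions rather than the code used on the project: the template contents, the cluster, task definition, subnet, security group, container, and spider names are all hypothetical placeholders, and the helper functions `build_run_task_request` and `run_crawler_task` are invented for this example. Only `boto3.client("ecs").run_task` and its request fields are part of the real Boto3 API.

```python
import json

# Hypothetical JSON template for an ECS RunTask request.
# Every name and ID below is a placeholder, not a real resource.
TASK_TEMPLATE = """
{
  "cluster": "example-crawler-cluster",
  "taskDefinition": "example-crawler-task",
  "launchType": "FARGATE",
  "networkConfiguration": {
    "awsvpcConfiguration": {
      "subnets": ["subnet-00000000"],
      "securityGroups": ["sg-00000000"],
      "assignPublicIp": "DISABLED"
    }
  },
  "overrides": {
    "containerOverrides": [
      {"name": "crawler", "command": []}
    ]
  }
}
"""


def build_run_task_request(template_json, spider_name):
    """Fill the JSON template with the command for one Scrapy spider."""
    request = json.loads(template_json)
    container = request["overrides"]["containerOverrides"][0]
    # Override the container command so the task runs a single spider.
    container["command"] = ["scrapy", "crawl", spider_name]
    return request


def run_crawler_task(spider_name):
    """Start one Fargate task and return the ARNs of the started tasks."""
    # Imported here so the request can be built and inspected
    # on a machine without the AWS SDK or credentials.
    import boto3

    ecs = boto3.client("ecs")
    response = ecs.run_task(**build_run_task_request(TASK_TEMPLATE, spider_name))
    return [task["taskArn"] for task in response["tasks"]]


if __name__ == "__main__":
    # Print the request that would be sent, without calling AWS.
    print(json.dumps(build_run_task_request(TASK_TEMPLATE, "example_spider"), indent=2))
```

Keeping the request in a JSON template and filling in only the per-run values means the same script can launch any spider, and the template can be reviewed and version-controlled separately from the Python code. Scheduled runs could use the same task definition as the target of an EventBridge rule.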

Results and benefits

The solution developed by Lambert Labs for Ballotpedia led to significant improvements in their data extraction process, enabling them to efficiently collect and process information from a diverse range of web sources. The containerised web data extraction system proved to be highly scalable and resilient, easily adapting to changes in the volume and complexity of the data. The use of AWS services and technologies allowed for seamless integration with their existing infrastructure and offered an optimized, cost-effective solution.

Moreover, the automated testing suite, CI/CD pipeline, UAT environment, and monitoring and logging capabilities gave Ballotpedia increased confidence in the overall system’s stability and reliability. As a result, Ballotpedia experienced reduced operational overhead, faster turnaround times for data extraction (saving hundreds of hours of staff members’ time each month), and an enhanced ability to maintain and update their digital encyclopedia with the most accurate and up-to-date information on American politics, elections, and public policy.

Ballotpedia’s collaboration with Lambert Labs has led to the formation of an internal DevOps team, focused on enhancing their operational efficiency and security. By undergoing the Well-Architected Review process with Lambert Labs, Ballotpedia has been able to implement numerous recommendations to improve their infrastructure and security. The organization has further engaged Lambert Labs on a monthly retainer to integrate security practices into their DevOps team, ensuring continuous improvements.

To strengthen their DevOps expertise and foster a culture of ongoing learning, the Ballotpedia team is working towards achieving multiple AWS certifications. The release process for all projects has been fully automated, streamlining deployment and reducing potential errors. Additionally, Ballotpedia has adopted the best practices established during the crawler project with Lambert Labs as a blueprint for their other projects, ensuring consistency in their CI/CD pipelines, AWS infrastructure setup, and security across the organization.

About Lambert Labs

Lambert Labs, an AWS Advanced Consulting Partner, is a Python and cloud development company based in London. We specialize in providing Python development services, AWS consulting services, and AWS Well-Architected Framework Review services.

We participate in various APN programs, including:

  • AWS Well-Architected Partner Program
  • AWS Reseller
  • AWS CloudFormation Service Validation holder
  • AWS RDS Service Validation holder
  • AWS Public Sector Program
  • AWS Immersion Day Program

Our expertise in AWS technologies and commitment to delivering high-quality solutions have made us a trusted partner for clients seeking reliable and efficient solutions to meet their business objectives.

Lambert Labs’ work with Ballotpedia was also written up as an AWS customer success story.

If you would like to find out a little bit more about Lambert Labs then please feel free to read about our Python development services, AWS consulting services or AWS Well-Architected Fr