Pinterest Data Pipeline
Summary of the Pinterest Data Pipeline
This project builds an end-to-end data pipeline on AWS, combining batch and stream processing to handle over 30,000 rows of Pinterest data. It utilises Kafka (hosted on Amazon MSK) for batch ingestion, AWS Kinesis for real-time ingestion, and Apache Spark on Databricks for both batch and streaming processing. Key components of the project include:

(Figure: Pinterest Data Pipeline architecture)
1. Data Ingestion
- Batch Ingestion: A RESTful API built with API Gateway sends data through a Kafka REST proxy to multiple Kafka topics hosted on Amazon MSK (Managed Streaming for Apache Kafka). An MSK Connect plugin-connector pair then writes the topic data to an AWS S3 data lake.
- Stream Ingestion: Real-time data is sent to AWS Kinesis data streams and consumed by Spark Structured Streaming running on Databricks for near real-time analysis. A sketch of both ingestion paths follows this list.
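
The snippet below is a minimal sketch of the two ingestion paths, assuming a hypothetical API Gateway invoke URL, topic name, and stream name. The batch path POSTs a record to the Kafka REST proxy in Confluent's JSON envelope; the stream path puts the same record onto Kinesis with boto3.

```python
import json

import boto3
import requests

# Hypothetical endpoints and names -- substitute your deployment's values.
API_INVOKE_URL = "https://<api-id>.execute-api.us-east-1.amazonaws.com/prod"
KAFKA_TOPIC = "pin_topic"            # one of the MSK topics
KINESIS_STREAM = "pinterest-stream"  # the Kinesis data stream

record = {"index": 1, "title": "example pin", "follower_count": 100}

# Batch path: POST the record to the Kafka REST proxy behind API Gateway.
# The REST proxy expects a Confluent-style payload: {"records": [{"value": ...}]}.
response = requests.post(
    f"{API_INVOKE_URL}/topics/{KAFKA_TOPIC}",
    headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    data=json.dumps({"records": [{"value": record}]}),
)
response.raise_for_status()

# Stream path: put the same record onto the Kinesis data stream.
kinesis = boto3.client("kinesis", region_name="us-east-1")
kinesis.put_record(
    StreamName=KINESIS_STREAM,
    Data=json.dumps(record),
    PartitionKey=str(record["index"]),
)
```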
2. Data Processing
- Batch Processing: Data stored in the S3 data lake is processed in batches using Apache Spark on Databricks. Custom Spark transformations clean and aggregate the data: normalising values, handling missing or erroneous entries, and preparing the data for further analysis (see the batch sketch after this list).
- Stream Processing: Spark Structured Streaming processes records from the Kinesis streams as they arrive, enabling near real-time analysis (a full read-to-sink sketch appears after the Data Storage section).
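
As a batch-processing sketch, the PySpark snippet below reads ingested JSON from a hypothetical S3 path and applies illustrative cleaning steps. The placeholder strings and the text-formatted follower_count (e.g. "2k") are assumptions about the raw data, and `spark` is the session Databricks provides.

```python
from pyspark.sql import functions as F

# Hypothetical S3 location for data landed by the MSK connector.
df = spark.read.json("s3a://<bucket>/topics/pin_topic/partition=0/")

cleaned = (
    df
    # Normalise assumed placeholder strings to proper nulls.
    .replace({"User Info Error": None, "No description available": None})
    # Assuming follower_count arrives as text like "2k"/"1M",
    # expand the suffixes and cast to an integer.
    .withColumn(
        "follower_count", F.regexp_replace("follower_count", "k", "000")
    )
    .withColumn(
        "follower_count",
        F.regexp_replace("follower_count", "M", "000000").cast("int"),
    )
    .dropDuplicates()
)

# Example aggregation: largest follower count per category.
cleaned.groupBy("category").agg(
    F.max("follower_count").alias("max_followers")
).show()
```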
3. Data Orchestration
- Apache Airflow, running on Amazon MWAA (Managed Workflows for Apache Airflow), automates the batch processing jobs on Databricks. A custom DAG script orchestrates the pipeline, triggering the Databricks batch job on a daily schedule; a minimal DAG sketch follows.
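
Below is a minimal DAG sketch using the Databricks Airflow provider's DatabricksSubmitRunOperator. The notebook path, cluster ID, and connection ID are placeholders for the real deployment's values.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksSubmitRunOperator,
)

# Hypothetical notebook path on the Databricks workspace.
notebook_task = {"notebook_path": "/Users/<user>/pinterest_batch_processing"}

default_args = {
    "owner": "pinterest-pipeline",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="pinterest_batch_dag",
    default_args=default_args,
    schedule_interval="@daily",  # run the batch job once a day
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    run_batch_job = DatabricksSubmitRunOperator(
        task_id="run_databricks_notebook",
        databricks_conn_id="databricks_default",  # Airflow connection to Databricks
        existing_cluster_id="<cluster-id>",       # placeholder cluster ID
        notebook_task=notebook_task,
    )
```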
4. Data Storage
- S3 Data Lake: Data from the batch pipeline is stored in Amazon S3, where it can be accessed for further transformations or analysis.
- Post-processing storage: For real-time data, Spark Structured Streaming writes the cleaned and processed records to a database such as PostgreSQL for analysis and long-term storage, as in the sketch below.
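
The sketch below ties the streaming pieces together: it reads from Kinesis using the Databricks-specific `kinesis` source, parses the binary payload against an assumed record schema, and appends each micro-batch to a hypothetical PostgreSQL table over JDBC. foreachBatch is used because Structured Streaming has no native JDBC sink, so each micro-batch is written with the ordinary batch JDBC writer (the PostgreSQL JDBC driver must be available on the cluster).

```python
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Assumed schema for the pin records on the stream.
schema = StructType([
    StructField("index", IntegerType()),
    StructField("title", StringType()),
    StructField("follower_count", IntegerType()),
])

raw = (
    spark.readStream
    .format("kinesis")                        # Databricks Kinesis connector
    .option("streamName", "pinterest-stream") # hypothetical stream name
    .option("region", "us-east-1")
    .option("initialPosition", "latest")
    .load()
)

# The connector delivers payloads in a binary "data" column; decode and parse.
parsed = (
    raw.select(F.col("data").cast("string").alias("json"))
       .select(F.from_json("json", schema).alias("record"))
       .select("record.*")
       .dropna(subset=["index"])
)

def write_to_postgres(batch_df, batch_id):
    # Append each micro-batch to a hypothetical PostgreSQL table via JDBC.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/pinterest")
        .option("dbtable", "pin_records")
        .option("user", "postgres")
        .option("password", "<password>")
        .mode("append")
        .save())

query = (
    parsed.writeStream
    .foreachBatch(write_to_postgres)
    .option("checkpointLocation", "/tmp/checkpoints/pin_stream")
    .start()
)
```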
Tools & Technologies
- AWS (S3, API Gateway, MSK, Kinesis, MWAA)
- Kafka (via MSK)
- Apache Spark (batch and streaming processing on Databricks)
- Databricks (for custom Spark transformations and workloads)
- Python (for custom DAG scripts and API emulation)