Introduction
The Serverless Scrapper - TSS
Overview
This project provides a serverless API for web scraping jobs, file storage, and retrieval, built for scalability and automation. The API is deployed using CI/CD via GitHub Actions—developers only need to clone, make changes, and push; deployments are automatic.
Key Features
- Trigger scraping jobs via API
- Store results in S3
- Retrieve or preview results with pre-signed URLs
- Fully automated deployment pipeline
API Endpoints
GET / # API status and version
POST /scrape # Trigger a new scraping job
GET /retrieve # Get pre-signed URLs for the archive file
GET /preview # Get pre-signed URLs for files except the archive

See API Reference for full endpoint details and example requests.
Contribution Area
- `api_handler.py`: Main Lambda handler for API Gateway
- `scrapper.py`: Main Lambda that actually does the scraping
- `Docs/`: API and logging documentation
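A minimal sketch of the kind of method/path routing `api_handler.py` performs (the route table, handler names, and response shapes here are illustrative assumptions, not the repository's actual code):

```python
import json

def status(_event):
    # Matches GET /: report API status and version.
    return {"status": "ok", "version": "0.1.0"}  # placeholder version

def scrape(event):
    # Matches POST /scrape; in the real project this step would
    # trigger the tss-worker lambda rather than answer inline.
    body = json.loads(event.get("body") or "{}")
    return {"job": "queued", "target": body.get("url")}

# Hypothetical route table: (HTTP method, path) -> handler function.
ROUTES = {
    ("GET", "/"): status,
    ("POST", "/scrape"): scrape,
}

def lambda_handler(event, _context=None):
    handler = ROUTES.get((event.get("httpMethod"), event.get("path")))
    if handler is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    return {"statusCode": 200, "body": json.dumps(handler(event))}
```

Keeping the dispatch in one table like this makes it easy to see at a glance which endpoints exist and where to add a new one.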
CI/CD Pipeline
- GitHub Actions: On push to main, the workflow builds, tests, and deploys the API automatically.
- No manual deployment steps required.
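A workflow of roughly this shape typically drives such a pipeline; the file name, runner, and deploy command below are placeholders, not the repository's actual workflow:

```yaml
# .github/workflows/deploy.yml (illustrative sketch)
name: deploy
on:
  push:
    branches: [main]
jobs:
  build-test-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest          # run the test suite before deploying
      - run: ./deploy.sh     # placeholder for the actual deploy step
```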
Contributing
- Clone the repository:
  ```shell
  git clone <repo-url>
  cd the-serverless-scrapper
  ```
- Make your changes and commit.
- Push to the repository:
  ```shell
  git push origin <branch>
  ```
- GitHub Actions will handle build and deployment.
System Flow Diagram
```mermaid
sequenceDiagram
    participant Client
    participant APIGW as API Gateway
    participant TSS-API-Lambda as tss-api lambda
    participant TSS-Worker-Lambda as tss-worker lambda
    participant S3
    Client->>APIGW: POST /scrape
    APIGW->>TSS-API-Lambda: Invoke handler
    TSS-API-Lambda->>TSS-Worker-Lambda: Trigger scraping job
    TSS-Worker-Lambda->>S3: Store results
    TSS-API-Lambda-->>APIGW: Return job info & S3 links
    APIGW-->>Client: Response
    Client->>APIGW: GET /retrieve or /preview
    APIGW->>TSS-API-Lambda: Invoke handler
    TSS-API-Lambda->>S3: Generate pre-signed URLs
    TSS-API-Lambda-->>APIGW: Return URLs
    APIGW-->>Client: Response
```

Getting Help
- See API Reference for endpoint details.
- For issues, open a GitHub issue or contact the maintainer.
Happy Scraping!