TSS Docs

Introduction

The Serverless Scrapper - TSS

Overview

This project provides a serverless API for running web scraping jobs, storing the results, and retrieving them, built for scalability and automation. The API is deployed through a GitHub Actions CI/CD pipeline: developers clone the repository, make changes, and push. Once changes reach main, deployment is automatic.

Key Features

  • Trigger scraping jobs via API
  • Store results in S3
  • Retrieve or preview results with pre-signed URLs
  • Fully automated deployment pipeline

API Endpoints

GET    /           # API status and version
POST   /scrape     # Trigger a new scraping job
GET    /retrieve   # Get pre-signed URLs for the archive file
GET    /preview    # Get pre-signed URLs for files except the archive

See API Reference for full endpoint details and example requests.
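As a rough illustration of how a client might talk to these endpoints, the sketch below builds the HTTP requests using only the Python standard library. The base URL, the `{"url": ...}` request body, and any query parameters are assumptions made for the example, not the documented contract; the API Reference is the authority on the real request shapes.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical base URL; replace with the deployed API Gateway endpoint.
BASE_URL = "https://example.invalid/prod"


def build_scrape_request(target_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST /scrape request.

    The JSON body shape {"url": ...} is an assumption for this sketch.
    """
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/scrape",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_get_request(endpoint: str, **params: str) -> urllib.request.Request:
    """Build a GET request for /, /retrieve, or /preview.

    Any query parameters passed in are URL-encoded; which parameters the
    real API expects is not specified here.
    """
    query = f"?{urllib.parse.urlencode(params)}" if params else ""
    return urllib.request.Request(f"{BASE_URL}{endpoint}{query}", method="GET")


if __name__ == "__main__":
    req = build_scrape_request("https://example.com")
    print(req.get_method(), req.full_url)
    # To actually send the request: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the example runnable without a deployed endpoint.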

Contribution Areas

  • api_handler.py: Main Lambda handler for API Gateway
  • scrapper.py: Worker Lambda that performs the scraping
  • Docs/: API and logging documentation
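For orientation, here is a minimal sketch of how a handler like `api_handler.py` could dispatch the four endpoints. It is not the project's actual code: it assumes an API Gateway REST proxy integration (events carrying `httpMethod` and `path`), and the handler names, the `API_VERSION` placeholder, and the 501 stand-ins are all hypothetical.

```python
import json

API_VERSION = "0.0.0"  # placeholder; the real version string is not given here


def _response(status: int, body: dict) -> dict:
    """Format a response the way an API Gateway proxy integration expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }


def handle_status(event: dict) -> dict:
    """GET /: report API status and version."""
    return _response(200, {"status": "ok", "version": API_VERSION})


def handle_not_implemented(event: dict) -> dict:
    """Stand-in for the real /scrape, /retrieve, and /preview logic."""
    return _response(501, {"error": "not implemented in this sketch"})


# Route table keyed by (HTTP method, path).
ROUTES = {
    ("GET", "/"): handle_status,
    ("POST", "/scrape"): handle_not_implemented,
    ("GET", "/retrieve"): handle_not_implemented,
    ("GET", "/preview"): handle_not_implemented,
}


def lambda_handler(event: dict, context: object) -> dict:
    """Dispatch on (method, path); unknown routes get a 404."""
    key = (event.get("httpMethod"), event.get("path"))
    route = ROUTES.get(key)
    if route is None:
        return _response(404, {"error": f"no route for {key[0]} {key[1]}"})
    return route(event)
```

A route table keeps each endpoint's logic in its own function, which makes it straightforward to test handlers with plain dictionaries instead of a deployed API.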

CI/CD Pipeline

  • GitHub Actions: On push to main, the workflow builds, tests, and deploys the API automatically.
  • No manual deployment steps required.

Contributing

  1. Clone the repository:
    git clone <repo-url>
    cd the-serverless-scrapper
  2. Make your changes and commit.
  3. Push to the repository:
    git push origin <branch>
  4. Once your changes reach main, GitHub Actions builds, tests, and deploys them automatically.

System Flow Diagram

sequenceDiagram
    participant Client
    participant API Gateway
    participant TSS-API-Lambda as tss-api lambda
    participant TSS-Worker-Lambda as tss-worker lambda
    participant S3

    Client->>API Gateway: POST /scrape
    API Gateway->>TSS-API-Lambda: Invoke handler
    TSS-API-Lambda->>TSS-Worker-Lambda: Trigger scraping job
    TSS-Worker-Lambda->>S3: Store results
    TSS-API-Lambda-->>API Gateway: Return job info & S3 links
    API Gateway-->>Client: Response

    Client->>API Gateway: GET /retrieve or /preview
    API Gateway->>TSS-API-Lambda: Invoke handler
    TSS-API-Lambda->>S3: Generate pre-signed URLs
    TSS-API-Lambda-->>API Gateway: Return URLs
    API Gateway-->>Client: Response
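
The two halves of the flow above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's implementation: it assumes the API Lambda invokes the worker asynchronously, and the payload shape, job ID scheme, bucket, and key names are hypothetical. The clients are passed in as arguments; in a real Lambda they would typically be `boto3.client("lambda")` and `boto3.client("s3")`, whose `invoke` and `generate_presigned_url` methods are called here with their standard keyword arguments.

```python
import json
import uuid
from typing import Any, Dict, List


def trigger_scrape(lambda_client: Any, worker_name: str, target_url: str) -> str:
    """Start the worker Lambda asynchronously and return a new job ID.

    InvocationType="Event" asks Lambda to queue the invocation and return
    immediately, so the API can respond without waiting for the scrape.
    """
    job_id = uuid.uuid4().hex
    payload = {"job_id": job_id, "url": target_url}  # assumed payload shape
    lambda_client.invoke(
        FunctionName=worker_name,
        InvocationType="Event",
        Payload=json.dumps(payload).encode("utf-8"),
    )
    return job_id


def presign_results(
    s3_client: Any, bucket: str, keys: List[str], expires_in: int = 3600
) -> Dict[str, str]:
    """Map each S3 object key to a time-limited download URL."""
    return {
        key: s3_client.generate_presigned_url(
            "get_object",
            Params={"Bucket": bucket, "Key": key},
            ExpiresIn=expires_in,
        )
        for key in keys
    }
```

Passing the clients in rather than creating them inside the functions keeps the sketch importable without AWS credentials and lets tests substitute simple fakes.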

Getting Help

  • See API Reference for endpoint details.
  • For issues, open a GitHub issue or contact the maintainer.

Happy Scraping!