AWS - Secure, High‑Throughput Ingestion Pipeline for Large Binary Objects

Published: December 4, 2025 at 04:47 AM EST
4 min read
Source: Dev.to

Overview

The solution enables direct client uploads of large video files (≈150 MB average, up to 1 GB) to Amazon S3 while meeting the following requirements:

  • Low‑latency, client‑side upload – no server‑side proxy.
  • Server‑side validation of file type and size before the object becomes publicly accessible.
  • Automatic virus scanning after upload.
  • Retention policy – 90 days in S3 Standard, then transition to Glacier Deep Archive.
  • Access control – only the authenticated owner can read the file; no public read access.

Client Upload Flow

  1. Authenticate (e.g., Amazon Cognito, JWT).
  2. POST /upload-url with filename (and optionally size).
  3. Backend returns a pre‑signed PUT URL scoped to uploads/{userId}/{uuid}.ext and valid for 5 minutes.
  4. Client performs an HTTP PUT directly to the URL, uploading the file to S3.
  5. S3 emits an ObjectCreated event that triggers the validation & scanning Lambda.
  6. The client polls (or receives a webhook) for the upload status stored in DynamoDB.
  7. When ready, the client requests a pre‑signed GET URL for the final object (private/{userId}/{uuid}.ext).

Backend API (Pre‑Signed URL Generation)

  • IAM role with permission s3:PutObject limited to the bucket and the uploads/${userId}/* prefix.
  • The API generates the URL using the AWS SDK, specifying:
    • Bucket name.
    • Object key (uploads/{userId}/{uuid}.ext).
    • Expiration (5 min).
    • Optional condition s3:x-amz-content-sha256 for payload integrity.

S3 Bucket Configuration

  • S3 Bucket (my‑media‑bucket) – stores raw uploads and final objects:
    • Block public access (bucket policy).
    • Enable Versioning.
    • (Optional) Enable Object Lock for tamper evidence.
  • Lifecycle Policy – retention & archival:
    • Transition objects to Glacier Deep Archive after 90 days.
    • Expire objects after the required retention period (e.g., 7 years).
  • Bucket Policy – enforces access control:
    • Deny any s3:GetObject unless the request is authenticated and the principal matches the object’s userId tag or prefix.
    • Allow only the backend role to generate pre‑signed URLs.
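One way to express the read restriction is a Deny statement that blocks s3:GetObject for every principal except the backend role; because a pre-signed GET URL is signed with that role's credentials, downloads via valid URLs still succeed. The account ID and role name below are placeholders.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAllReadsExceptBackendRole",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-media-bucket/*",
      "Condition": {
        "ArnNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::123456789012:role/media-backend-role"
        }
      }
    }
  ]
}
```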

Lambda Validation & Scanning

Trigger: ObjectCreated events on the uploads/ prefix.

Permissions

  • s3:GetObject, s3:PutObject, s3:DeleteObject on the bucket.
  • dynamodb:PutItem on the UserFiles table.

Workflow

  1. Retrieve object metadata (ContentLength, ContentType).
  2. Size check – reject if > 1 GB.
  3. MIME‑type whitelist (e.g., video/mp4, video/webm).
  4. Download the object to /tmp (512 MB by default; configurable up to 10 GB of ephemeral storage). For larger files, use multipart copy or S3 Object Lambda for a streaming scan.
  5. Run ClamAV (provided via a Lambda layer) to detect viruses.
  6. If validation fails or a virus is found, delete the object and optionally notify the user.
  7. On success, move the object to private/{userId}/{uuid}.ext (copy + delete).
  8. Write a record to DynamoDB with ownership, status, and timestamps.

Virus Scanning Implementation

  • ClamAV database is packaged in a Lambda layer and refreshed daily by a scheduled Lambda.
  • The scanning code calls the clamscan binary on the downloaded file and interprets the exit code.
  • For files larger than the Lambda /tmp space, consider:
    • S3 Object Lambda to stream the object through ClamAV without full download.
    • Step Functions to orchestrate multipart processing.

DynamoDB Metadata Store

  • Table: UserFiles
  • Primary key: fileId (UUID)
  • Attributes: userId, s3Key, status (READY, FAILED, …), uploaded (timestamp), size, contentType

The table enables quick lookup of a user’s files and drives the download‑URL generation logic.

Lifecycle Management

Both rules can be expressed in a single lifecycle configuration (2,555 days ≈ 7 years):

{
  "Rules": [
    {
      "ID": "TransitionToGlacierDeepArchive",
      "Prefix": "",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "GLACIER_DEEP_ARCHIVE"
        }
      ]
    },
    {
      "ID": "ExpireObjects",
      "Prefix": "",
      "Status": "Enabled",
      "Expiration": {
        "Days": 2555
      }
    }
  ]
}

Security & Access Control

  • Bucket policy denies any s3:GetObject unless the request is made by an authenticated principal that matches the object’s userId prefix.
  • IAM roles for the backend API and Lambda are scoped to the minimum required actions.
  • Cognito (or another OIDC provider) supplies JWTs that are validated by API Gateway or ALB before allowing the /upload-url call.
  • Object Lock (optional) can be enabled to make objects immutable for the retention period.

Monitoring & Alerts

  • CloudWatch metrics – Lambda Errors, S3 ObjectCreated counts, and a custom ValidationFailures metric.
  • Alarms – trigger SNS notifications or Lambda retries when validation or scanning failures exceed a threshold.
  • AWS GuardDuty / Macie (optional) – continuous threat detection on the bucket.

Sample Lambda Code (Go)

package main

import (
    "context"
    "fmt"

    "github.com/aws/aws-lambda-go/events"
    "github.com/aws/aws-lambda-go/lambda"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

// Clients are initialised once per container (e.g., in init() via
// config.LoadDefaultConfig) and reused across invocations.
var (
    s3Client     *s3.Client
    dynamoClient *dynamodb.Client
)

// handler processes each ObjectCreated record: validation, virus scan,
// move to the private/ prefix, and the DynamoDB status write go here.
func handler(ctx context.Context, s3Event events.S3Event) error {
    for _, record := range s3Event.Records {
        bucket := record.S3.Bucket.Name
        key := record.S3.Object.Key
        // ... implementation continues ...
        fmt.Printf("Processing object %s in bucket %s\n", key, bucket)
    }
    return nil
}

func main() {
    lambda.Start(handler)
}