AWS - Secure, High‑Throughput Ingestion Pipeline for Large Binary Objects

Published: December 4, 2025 at 04:47 AM EST
4 min read
Source: Dev.to

Overview

The solution enables direct client uploads of large video files (≈150 MB average, up to 1 GB) to Amazon S3 while meeting the following requirements:

  • Low‑latency, client‑side upload – no server‑side proxy.
  • Server‑side validation of file type and size before the object becomes publicly accessible.
  • Automatic virus scanning after upload.
  • Retention policy – 90 days in S3 Standard, then transition to Glacier Deep Archive.
  • Access control – only the authenticated owner can read the file; no public read access.

Client Upload Flow

  1. Authenticate (e.g., Amazon Cognito, JWT).
  2. POST /upload-url with filename (and optionally size).
  3. Backend returns a pre‑signed PUT URL scoped to uploads/{userId}/{uuid}.ext and valid for 5 minutes.
  4. Client performs an HTTP PUT directly to the URL, uploading the file to S3 (a client‑side sketch follows this list).
  5. S3 emits an ObjectCreated event that triggers the validation & scanning Lambda.
  6. The client polls (or receives a webhook) for the upload status stored in DynamoDB.
  7. When ready, the client requests a pre‑signed GET URL for the final object (private/{userId}/{uuid}.ext).
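
As a concrete sketch of steps 2–4, the Go program below asks the backend for a pre‑signed URL and then PUTs the file straight to S3. The endpoint host, response shape, and file name are illustrative assumptions, not part of the API above.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "os"
)

func main() {
    // Step 2: request a pre-signed URL (endpoint and JSON shape assumed;
    // a real client would also send the user's JWT from step 1).
    resp, err := http.Post("https://api.example.com/upload-url?filename=clip.mp4",
        "application/json", nil)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    var out struct {
        URL string `json:"url"` // assumed response field
    }
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        panic(err)
    }

    // Step 4: PUT the file directly to S3 via the pre-signed URL.
    f, err := os.Open("clip.mp4")
    if err != nil {
        panic(err)
    }
    defer f.Close()
    info, err := f.Stat()
    if err != nil {
        panic(err)
    }

    req, err := http.NewRequest(http.MethodPut, out.URL, f)
    if err != nil {
        panic(err)
    }
    req.ContentLength = info.Size() // S3 requires a known Content-Length
    // If the backend signed a Content-Type, the PUT must send the same value.
    req.Header.Set("Content-Type", "video/mp4")

    putResp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer putResp.Body.Close()
    fmt.Println("upload status:", putResp.Status)
}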

Backend API (Pre‑Signed URL Generation)

  • IAM role with permission s3:PutObject limited to the bucket and the uploads/${userId}/* prefix.
  • The API generates the URL using the AWS SDK (a sketch follows this list), specifying:
    • Bucket name.
    • Object key (uploads/{userId}/{uuid}.ext).
    • Expiration (5 min).
    • Optional condition s3:x-amz-content-sha256 for payload integrity.
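
A minimal sketch of the generation step with the AWS SDK for Go v2; the bucket name and function signature are illustrative:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

// presignUpload returns a 5-minute pre-signed PUT URL for
// uploads/{userId}/{objectName} in the media bucket.
func presignUpload(ctx context.Context, userID, objectName string) (string, error) {
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        return "", err
    }
    presigner := s3.NewPresignClient(s3.NewFromConfig(cfg))

    req, err := presigner.PresignPutObject(ctx, &s3.PutObjectInput{
        Bucket: aws.String("my-media-bucket"), // placeholder bucket name
        Key:    aws.String(fmt.Sprintf("uploads/%s/%s", userID, objectName)),
    }, s3.WithPresignExpires(5*time.Minute))
    if err != nil {
        return "", err
    }
    return req.URL, nil
}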

S3 Bucket Configuration

S3 Bucket (my‑media‑bucket) – store raw uploads and final objects:

  • Block public access (bucket policy).
  • Enable Versioning.
  • (Optional) Enable Object Lock for tamper‑evidence.

Lifecycle Policy – retention & archival:

  • Transition objects to Glacier Deep Archive after 90 days.
  • Expire objects after the required retention period (e.g., 7 years).

Bucket Policy – enforce access control:

  • Deny any s3:GetObject unless the request is authenticated and the principal matches the object’s userId tag or prefix.
  • Allow only the backend role to generate pre‑signed URLs.

Lambda Validation & Scanning

Trigger: ObjectCreated events on the uploads/ prefix.

Permissions:

  • s3:GetObject, s3:PutObject, s3:DeleteObject on the bucket.
  • dynamodb:PutItem on the UserFiles table.

Workflow:

  1. Retrieve object metadata (ContentLength, ContentType).
  2. Size check – reject if > 1 GB.
  3. MIME‑type whitelist (e.g., video/mp4, video/webm).
  4. Download the object to /tmp (512 MB by default; Lambda ephemeral storage is configurable up to 10 GB). For larger files, use multipart copy or S3 Object Lambda for a streaming scan.
  5. Run ClamAV (provided via a Lambda layer) to detect viruses.
  6. If validation fails or a virus is found, delete the object and optionally notify the user.
  7. On success, move the object to private/{userId}/{uuid}.ext (copy + delete; see the sketch after this list).
  8. Write a record to DynamoDB with ownership, status, and timestamps.
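
Step 7's "move" is two S3 calls, since S3 has no native rename. A sketch with the Go SDK (client construction omitted; the keys here contain only URL-safe characters, so CopySource needs no extra encoding):

package main

import (
    "context"
    "fmt"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

// moveObject copies uploads/... to private/... and deletes the original.
func moveObject(ctx context.Context, client *s3.Client, bucket, srcKey, dstKey string) error {
    if _, err := client.CopyObject(ctx, &s3.CopyObjectInput{
        Bucket:     aws.String(bucket),
        CopySource: aws.String(bucket + "/" + srcKey), // URL-encode keys with special characters
        Key:        aws.String(dstKey),
    }); err != nil {
        return fmt.Errorf("copy %s: %w", srcKey, err)
    }
    if _, err := client.DeleteObject(ctx, &s3.DeleteObjectInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(srcKey),
    }); err != nil {
        return fmt.Errorf("delete %s: %w", srcKey, err)
    }
    return nil
}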

Virus Scanning Implementation

  • ClamAV database is packaged in a Lambda layer and refreshed daily by a scheduled Lambda.
  • The scanning code calls the clamscan binary on the downloaded file and interprets the exit code (see the sketch after this list).
  • For files larger than the Lambda /tmp space, consider:
    • S3 Object Lambda to stream the object through ClamAV without full download.
    • Step Functions to orchestrate multipart processing.
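
A sketch of the exit-code handling: clamscan exits 0 when the file is clean, 1 when a virus is found, and 2 on errors. The binary path depends on how the layer is packaged and is assumed here:

package main

import (
    "context"
    "errors"
    "fmt"
    "os/exec"
)

// scanFile runs clamscan on a downloaded file and maps its exit code.
func scanFile(ctx context.Context, path string) (clean bool, err error) {
    cmd := exec.CommandContext(ctx, "/opt/bin/clamscan", "--no-summary", path)
    out, runErr := cmd.CombinedOutput()
    if runErr == nil {
        return true, nil // exit code 0: no threats found
    }
    var exitErr *exec.ExitError
    if errors.As(runErr, &exitErr) && exitErr.ExitCode() == 1 {
        return false, nil // exit code 1: virus detected
    }
    return false, fmt.Errorf("clamscan error: %v: %s", runErr, out) // exit code 2 or exec failure
}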

DynamoDB Metadata Store

Table: UserFiles

  • Primary key: fileId (UUID)
  • Attributes: userId, s3Key, status (READY, FAILED, …), uploaded (timestamp), size, contentType

The table drives the download‑URL generation logic; because the primary key is fileId, listing a user’s files requires a global secondary index on userId. A record is written after each successful scan, as sketched below.
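
A sketch of the step-8 write from the validation Lambda (attribute names follow the table above; client setup omitted):

package main

import (
    "context"
    "strconv"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// recordFile persists ownership, status, and timestamps after a successful scan.
func recordFile(ctx context.Context, db *dynamodb.Client, fileID, userID, s3Key, contentType string, size int64) error {
    _, err := db.PutItem(ctx, &dynamodb.PutItemInput{
        TableName: aws.String("UserFiles"),
        Item: map[string]types.AttributeValue{
            "fileId":      &types.AttributeValueMemberS{Value: fileID},
            "userId":      &types.AttributeValueMemberS{Value: userID},
            "s3Key":       &types.AttributeValueMemberS{Value: s3Key},
            "status":      &types.AttributeValueMemberS{Value: "READY"},
            "uploaded":    &types.AttributeValueMemberS{Value: time.Now().UTC().Format(time.RFC3339)},
            "size":        &types.AttributeValueMemberN{Value: strconv.FormatInt(size, 10)},
            "contentType": &types.AttributeValueMemberS{Value: contentType},
        },
    })
    return err
}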

Lifecycle Management

The configuration below transitions objects to Glacier Deep Archive after 90 days and expires them after 2,555 days (roughly seven years):

{
  "Rules": [
    {
      "ID": "TransitionToGlacierDeepArchive",
      "Prefix": "",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "GLACIER_DEEP_ARCHIVE"
        }
      ]
    },
    {
      "ID": "ExpireObjects",
      "Prefix": "",
      "Status": "Enabled",
      "Expiration": {
        "Days": 2555   // ~7 years
      }
    }
  ]
}

Security & Access Control

  • Bucket policy denies any s3:GetObject unless the request is made by an authenticated principal that matches the object’s userId prefix (an illustrative statement follows this list).
  • IAM roles for the backend API and Lambda are scoped to the minimum required actions.
  • Cognito (or another OIDC provider) supplies JWTs that are validated by API Gateway or ALB before allowing the /upload-url call.
  • Object Lock (optional) can be enabled to make objects immutable for the retention period.
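
As an illustration of the deny rule, the statement below blocks all reads of private/ objects except through credentials of the backend role; the account ID and role name are placeholders, and the per‑user prefix check happens when the backend decides whether to sign a GET URL:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyGetExceptBackendRole",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-media-bucket/private/*",
      "Condition": {
        "ArnNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::123456789012:role/media-backend-role"
        }
      }
    }
  ]
}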

Monitoring & Alerts

  • CloudWatch metrics – Lambda Errors, S3 ObjectCreated counts, and a custom ValidationFailures metric (emitted as sketched after this list).
  • Alarms – trigger SNS notifications or Lambda retries when validation or scanning failures exceed a threshold.
  • AWS GuardDuty / Macie (optional) – continuous threat detection on the bucket.
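
The custom ValidationFailures metric can be emitted directly from the validation Lambda; a sketch with the Go SDK (namespace and metric name are assumptions):

package main

import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch"
    cwtypes "github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)

// reportValidationFailure increments the metric that the alarm watches.
func reportValidationFailure(ctx context.Context, cw *cloudwatch.Client) error {
    _, err := cw.PutMetricData(ctx, &cloudwatch.PutMetricDataInput{
        Namespace: aws.String("MediaIngest"), // assumed namespace
        MetricData: []cwtypes.MetricDatum{{
            MetricName: aws.String("ValidationFailures"),
            Value:      aws.Float64(1),
            Unit:       cwtypes.StandardUnitCount,
        }},
    })
    return err
}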

Sample Lambda Code (Go)

package main

import (
    "context"
    "fmt"
    "log"
    "net/url"

    "github.com/aws/aws-lambda-go/events"
    "github.com/aws/aws-lambda-go/lambda"
    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

var (
    s3Client     *s3.Client
    dynamoClient *dynamodb.Client // used by the DynamoDB write in step 8
)

func init() {
    cfg, err := config.LoadDefaultConfig(context.Background())
    if err != nil {
        log.Fatalf("load AWS config: %v", err)
    }
    s3Client = s3.NewFromConfig(cfg)
    dynamoClient = dynamodb.NewFromConfig(cfg)
}

func handler(ctx context.Context, s3Event events.S3Event) error {
    for _, record := range s3Event.Records {
        bucket := record.S3.Bucket.Name
        // Object keys in event records are URL-encoded.
        key, err := url.QueryUnescape(record.S3.Object.Key)
        if err != nil {
            return fmt.Errorf("decode key %q: %w", record.S3.Object.Key, err)
        }

        // Step 2: size check straight from the event record.
        if record.S3.Object.Size > 1<<30 { // 1 GB limit
            log.Printf("rejecting %s: %d bytes over limit", key, record.S3.Object.Size)
            continue // deletion + FAILED status record would follow here
        }

        // Steps 1 & 3: fetch metadata and whitelist the MIME type.
        head, err := s3Client.HeadObject(ctx, &s3.HeadObjectInput{Bucket: aws.String(bucket), Key: aws.String(key)})
        if err != nil {
            return fmt.Errorf("head %s: %w", key, err)
        }
        ct := aws.ToString(head.ContentType)
        if ct != "video/mp4" && ct != "video/webm" {
            log.Printf("rejecting %s: unsupported content type %q", key, ct)
            continue
        }

        // Steps 4-8 (download, ClamAV scan, move, DynamoDB write) continue here.
        fmt.Printf("validated %s (%s) in bucket %s\n", key, ct, bucket)
    }
    return nil
}

func main() {
    lambda.Start(handler)
}