AWS - Secure, High‑Throughput Ingestion Pipeline for Large Binary Objects
Source: Dev.to
Overview
The solution enables direct client uploads of large video files (≈150 MB average, up to 1 GB) to Amazon S3 while meeting the following requirements:
- Low‑latency, client‑side upload – no server‑side proxy.
- Server‑side validation of file type and size before the object becomes available for download.
- Automatic virus scanning after upload.
- Retention policy – 90 days in S3 Standard, then transition to Glacier Deep Archive.
- Access control – only the authenticated owner can read the file; no public read access.
Client Upload Flow
- Authenticate (e.g., Amazon Cognito, JWT).
- POST /upload-url with filename (and optionally size).
- Backend returns a pre‑signed PUT URL scoped to uploads/{userId}/{uuid}.ext and valid for 5 minutes.
- Client performs an HTTP PUT directly to the URL, uploading the file to S3.
- S3 emits an ObjectCreated event that triggers the validation & scanning Lambda.
- The client polls (or receives a webhook) for the upload status stored in DynamoDB.
- When ready, the client requests a pre‑signed GET URL for the final object (private/{userId}/{uuid}.ext).
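The polling step in the flow above could look roughly like the following sketch. It assumes the client has some way to fetch the current status (represented here by a `fetch` callback standing in for a hypothetical status endpoint) and that an in-progress upload reports a value other than READY or FAILED; the name "PENDING" is illustrative and not taken from the design.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// READY and FAILED are the status values named in the UserFiles table.
// Any other value (for example a hypothetical "PENDING") means "not done yet".
const (
	statusReady  = "READY"
	statusFailed = "FAILED"
)

var errTimedOut = errors.New("upload status not final after all attempts")

// pollStatus calls fetch until it reports READY or FAILED, sleeping between
// attempts and doubling the delay each time (capped at maxDelay).
func pollStatus(fetch func() (string, error), attempts int, delay time.Duration) (string, error) {
	const maxDelay = 10 * time.Second
	for i := 0; i < attempts; i++ {
		status, err := fetch()
		if err != nil {
			return "", fmt.Errorf("fetch status: %w", err)
		}
		if status == statusReady || status == statusFailed {
			return status, nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			if delay *= 2; delay > maxDelay {
				delay = maxDelay
			}
		}
	}
	return "", errTimedOut
}

func main() {
	// Stand-in for an HTTP call to a hypothetical status endpoint.
	calls := 0
	fetch := func() (string, error) {
		calls++
		if calls < 3 {
			return "PENDING", nil
		}
		return statusReady, nil
	}
	status, err := pollStatus(fetch, 5, 10*time.Millisecond)
	fmt.Println(status, err)
}
```

Backing off between attempts keeps the status API from being hammered while a large file is still being scanned; a webhook or push notification would avoid polling altogether.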
Backend API (Pre‑Signed URL Generation)
- IAM role with permission s3:PutObject limited to the bucket and the uploads/${userId}/* prefix.
- The API generates the URL using the AWS SDK, specifying:
  - Bucket name.
  - Object key (uploads/{userId}/{uuid}.ext).
  - Expiration (5 min).
  - Optional condition s3:x-amz-content-sha256 for payload integrity.
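As a sketch of how the backend might derive the object key before presigning, the snippet below builds uploads/{userId}/{uuid}.ext from the authenticated user ID and the client-supplied filename. The function names, the UUID generator, and the rule of keeping only a lowercase alphanumeric extension are assumptions for illustration, not part of the original design.

```go
package main

import (
	"crypto/rand"
	"errors"
	"fmt"
	"path"
	"strings"
	"time"
)

// uploadURLTTL mirrors the 5-minute validity described above.
const uploadURLTTL = 5 * time.Minute

// newUUID returns a random (version 4) UUID string.
func newUUID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// buildUploadKey derives uploads/{userId}/{uuid}.ext. Only the extension of
// the client-supplied filename is used, so directory parts of the name are ignored.
func buildUploadKey(userID, id, filename string) (string, error) {
	if userID == "" || id == "" {
		return "", errors.New("userID and id are required")
	}
	ext := strings.ToLower(path.Ext(filename))
	if len(ext) < 2 {
		return "", errors.New("filename has no extension")
	}
	for _, r := range ext[1:] {
		if !(r >= 'a' && r <= 'z') && !(r >= '0' && r <= '9') {
			return "", errors.New("unsupported characters in extension")
		}
	}
	return fmt.Sprintf("uploads/%s/%s%s", userID, id, ext), nil
}

func main() {
	id, err := newUUID()
	if err != nil {
		panic(err)
	}
	key, err := buildUploadKey("user-123", id, "Holiday Clip.MP4")
	fmt.Println(key, err, uploadURLTTL)
}
```

The resulting key and TTL would then be handed to the SDK's presign call (in the Go SDK v2, `PresignPutObject` on a presign client with `s3.WithPresignExpires`). Taking the user ID from the validated token rather than from the request body is what keeps a caller inside their own prefix.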
S3 Bucket Configuration
| Resource | Purpose | Key Settings |
|---|---|---|
| S3 Bucket (my‑media‑bucket) | Store raw uploads and final objects | • Block public access (S3 Block Public Access settings).<br>• Enable Versioning.<br>• (Optional) Enable Object Lock for tamper‑evidence. |
| Lifecycle Policy | Retention & archival | • Transition objects to Glacier Deep Archive after 90 days.<br>• Expire objects after the required retention period (e.g., 7 years). |
| Bucket Policy | Enforce access control | • Deny any s3:GetObject unless the request is authenticated and the principal matches the object’s userId tag or prefix.<br>• Allow only the backend role to generate pre‑signed URLs. |
Lambda Validation & Scanning
Trigger: ObjectCreated events on the uploads/ prefix.
Permissions:
- s3:GetObject, s3:PutObject, s3:DeleteObject on the bucket.
- dynamodb:PutItem on the UserFiles table.
Workflow:
- Retrieve object metadata (ContentLength, ContentType).
- Size check – reject if > 1 GB.
- MIME‑type whitelist (e.g., video/mp4, video/webm).
- Download the object to /tmp (512 MB by default; Lambda ephemeral storage can be configured up to 10,240 MB). For files that exceed the configured space, process the object in parts or use S3 Object Lambda for a streaming scan.
- Run ClamAV (provided via a Lambda layer) to detect viruses.
- If validation fails or a virus is found, delete the object and optionally notify the user.
- On success, move the object to private/{userId}/{uuid}.ext (copy + delete).
- Write a record to DynamoDB with ownership, status, and timestamps.
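The size check, MIME whitelist, and key move from the workflow above can be sketched as pure functions, leaving the S3 calls aside. The 1 GB limit is interpreted here as 1 GiB, and the function and error names are assumptions chosen for this example.

```go
package main

import (
	"errors"
	"fmt"
	"mime"
	"strings"
)

// maxUploadBytes interprets the 1 GB limit as 1 GiB; this is an assumption.
const maxUploadBytes int64 = 1 << 30

// allowedTypes is the example whitelist from the workflow.
var allowedTypes = map[string]bool{"video/mp4": true, "video/webm": true}

var (
	errBadSize = errors.New("object size out of range")
	errBadType = errors.New("content type not allowed")
	errBadKey  = errors.New("key is not under uploads/")
)

// validateUpload applies the size check and MIME whitelist to the
// ContentLength and ContentType values read from the object's metadata.
func validateUpload(size int64, contentType string) error {
	if size <= 0 || size > maxUploadBytes {
		return errBadSize
	}
	// ParseMediaType lowercases the type and strips parameters such as "; codecs=...".
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil || !allowedTypes[mediaType] {
		return errBadType
	}
	return nil
}

// finalKey maps uploads/{userId}/{uuid}.ext to private/{userId}/{uuid}.ext.
func finalKey(uploadKey string) (string, error) {
	const prefix = "uploads/"
	if !strings.HasPrefix(uploadKey, prefix) || len(uploadKey) == len(prefix) {
		return "", errBadKey
	}
	return "private/" + strings.TrimPrefix(uploadKey, prefix), nil
}

func main() {
	fmt.Println(validateUpload(150<<20, "video/mp4"))
	fmt.Println(validateUpload(2<<30, "video/mp4"))
	fmt.Println(finalKey("uploads/user-123/abc.mp4"))
}
```

Note that ContentType is whatever the client declared at upload time, so this check only filters obvious mismatches; it does not prove what the bytes actually contain.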
Virus Scanning Implementation
- ClamAV database is packaged in a Lambda layer and refreshed daily by a scheduled Lambda.
- The scanning code calls the clamscan binary on the downloaded file and interprets the exit code.
- For files larger than the Lambda /tmp space, consider:
  - S3 Object Lambda to stream the object through ClamAV without full download.
  - Step Functions to orchestrate multipart processing.
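A minimal sketch of the exit-code handling is shown below. clamscan documents exit code 0 as "no virus found", 1 as "virus(es) found", and 2 as "some error(s) occurred". The binary and database paths are hypothetical and depend on how the layer is packaged.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
)

type scanResult string

const (
	scanClean    scanResult = "CLEAN"
	scanInfected scanResult = "INFECTED"
)

// interpretExitCode maps clamscan's documented exit codes:
// 0 = no virus found, 1 = virus(es) found, anything else = scan error.
func interpretExitCode(code int) (scanResult, error) {
	switch code {
	case 0:
		return scanClean, nil
	case 1:
		return scanInfected, nil
	default:
		return "", fmt.Errorf("clamscan failed with exit code %d", code)
	}
}

// scanFile runs clamscan on path. binPath and dbDir are hypothetical
// locations (for example under /opt when shipped in a Lambda layer).
func scanFile(ctx context.Context, binPath, dbDir, path string) (scanResult, error) {
	cmd := exec.CommandContext(ctx, binPath, "--no-summary", "--database="+dbDir, path)
	err := cmd.Run()
	if err == nil {
		return interpretExitCode(0)
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return interpretExitCode(exitErr.ExitCode())
	}
	// The binary could not be started at all (missing file, permissions, ...).
	return "", fmt.Errorf("run clamscan: %w", err)
}

func main() {
	res, err := scanFile(context.Background(), "/opt/bin/clamscan", "/opt/clamav/db", "/tmp/upload.mp4")
	fmt.Println(res, err)
}
```

Treating every non-0/1 outcome as an error rather than as "clean" matters here: a scanner that failed to run should never cause the object to be promoted to private/.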
DynamoDB Metadata Store
| Table | Primary Key | Attributes |
|---|---|---|
| UserFiles | fileId (UUID) | userId, s3Key, status (READY, FAILED, …), uploaded (timestamp), size, contentType |
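The table supports direct lookup of a file by fileId and drives the download‑URL generation logic. Listing all of a user’s files efficiently would additionally require a global secondary index on userId.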
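To illustrate how a UserFiles record might drive download‑URL generation, the sketch below checks ownership and status before returning the key to presign. The struct field names mirror the table attributes; the function name and the extra prefix check are assumptions made for this example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// FileRecord mirrors the UserFiles attributes listed in the table.
type FileRecord struct {
	FileID      string
	UserID      string
	S3Key       string
	Status      string // READY, FAILED, ...
	Uploaded    time.Time
	Size        int64
	ContentType string
}

var (
	errNotOwner = errors.New("requester does not own this file")
	errNotReady = errors.New("file is not ready for download")
)

// downloadKey returns the S3 key that may be presigned for a GET,
// or an error if the requester is not the owner or the file is not READY.
func downloadKey(rec FileRecord, requesterID string) (string, error) {
	if requesterID == "" || rec.UserID != requesterID {
		return "", errNotOwner
	}
	if rec.Status != "READY" {
		return "", errNotReady
	}
	// Defence in depth: the key must sit under the owner's private/ prefix.
	if !strings.HasPrefix(rec.S3Key, "private/"+rec.UserID+"/") {
		return "", errNotOwner
	}
	return rec.S3Key, nil
}

func main() {
	rec := FileRecord{
		FileID:   "f1",
		UserID:   "user-123",
		S3Key:    "private/user-123/f1.mp4",
		Status:   "READY",
		Uploaded: time.Now(),
	}
	fmt.Println(downloadKey(rec, "user-123"))
	fmt.Println(downloadKey(rec, "someone-else"))
}
```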
Lifecycle Management
{
  "Rules": [
    {
      "ID": "TransitionToGlacierDeepArchive",
      "Prefix": "",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ]
    },
    {
      "ID": "ExpireObjects",
      "Prefix": "",
      "Status": "Enabled",
      "Expiration": {
        "Days": 2555
      }
    }
  ]
}
The 2555‑day expiration corresponds to roughly 7 years (7 × 365).
Security & Access Control
- Bucket policy denies any s3:GetObject unless the request is made by an authenticated principal that matches the object’s userId prefix.
- IAM roles for the backend API and Lambda are scoped to the minimum required actions.
- Cognito (or another OIDC provider) supplies JWTs that are validated by API Gateway or ALB before allowing the /upload-url call.
- Object Lock (optional) can be enabled to make objects immutable for the retention period.
Monitoring & Alerts
- CloudWatch Metrics – LambdaErrors, S3ObjectCreated, custom ValidationFailures.
- Alarms – trigger SNS notifications or Lambda retries when validation or scanning failures exceed a threshold.
- AWS GuardDuty / Macie (optional) – continuous threat detection on the bucket.
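One possible way to publish the custom ValidationFailures metric from the Lambda is the CloudWatch Embedded Metric Format (EMF), where a specially structured JSON line written to stdout is turned into a metric. The original does not say how the metric is emitted, so this is an assumption; the namespace "MediaPipeline" and the dimension value are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

type metricDef struct {
	Name string `json:"Name"`
	Unit string `json:"Unit"`
}

type metricDirective struct {
	Namespace  string      `json:"Namespace"`
	Dimensions [][]string  `json:"Dimensions"`
	Metrics    []metricDef `json:"Metrics"`
}

type emfMeta struct {
	Timestamp         int64             `json:"Timestamp"` // milliseconds since epoch
	CloudWatchMetrics []metricDirective `json:"CloudWatchMetrics"`
}

// emfCountLine builds one EMF log line carrying a Count metric with a
// single FunctionName dimension.
func emfCountLine(namespace, function, metric string, value float64, tsMillis int64) (string, error) {
	doc := map[string]interface{}{
		"_aws": emfMeta{
			Timestamp: tsMillis,
			CloudWatchMetrics: []metricDirective{{
				Namespace:  namespace,
				Dimensions: [][]string{{"FunctionName"}},
				Metrics:    []metricDef{{Name: metric, Unit: "Count"}},
			}},
		},
		"FunctionName": function, // dimension value
		metric:         value,    // metric value, keyed by metric name
	}
	b, err := json.Marshal(doc)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	now := time.Now().UnixNano() / int64(time.Millisecond)
	line, err := emfCountLine("MediaPipeline", "validator", "ValidationFailures", 1, now)
	if err != nil {
		panic(err)
	}
	// In Lambda, stdout is shipped to CloudWatch Logs, where EMF lines are extracted.
	fmt.Println(line)
}
```

An alarm on this metric can then feed the SNS notification described above without the function making any extra API calls per invocation.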
Sample Lambda Code (Go)
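package main

import (
	"context"
	"fmt"
	"log"
	"net/url"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// Clients are created once per execution environment and reused across invocations.
var (
	s3Client     *s3.Client
	dynamoClient *dynamodb.Client
)

func init() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatalf("load AWS config: %v", err)
	}
	s3Client = s3.NewFromConfig(cfg)
	dynamoClient = dynamodb.NewFromConfig(cfg)
}

func handler(ctx context.Context, s3Event events.S3Event) error {
	for _, record := range s3Event.Records {
		bucket := record.S3.Bucket.Name
		// Object keys in S3 event notifications are URL-encoded.
		key, err := url.QueryUnescape(record.S3.Object.Key)
		if err != nil {
			return fmt.Errorf("decode key %q: %w", record.S3.Object.Key, err)
		}
		// ... implementation continues ...
		fmt.Printf("Processing object %s in bucket %s\n", key, bucket)
	}
	return nil
}

func main() {
	lambda.Start(handler)
}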