Journey with Apache Flink on AWS EKS
Introduction
When building streaming pipelines, one of the biggest challenges isn’t just processing data—it’s keeping jobs running reliably while data continuously flows. Apache Flink addresses this challenge effectively.
Apache Flink Overview
Apache Flink is a distributed stream‑processing framework built for stateful, real‑time data processing. Unlike batch‑first frameworks, Flink treats streams as the core abstraction, with batch as just a special case.
Key strengths
- Exactly‑once processing guarantees
- Native state management for long‑running jobs
- Event‑time processing with watermarks
- Checkpointing that makes failures recoverable
These features make Flink feel less like a tool and more like a platform for always‑on systems.
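To make this concrete, checkpointing is largely a matter of configuration. A minimal flink-conf.yaml sketch (the interval, backend, and bucket are illustrative, not our production values):

```yaml
# Enable periodic, exactly-once checkpoints (illustrative values)
execution.checkpointing.interval: 60s
execution.checkpointing.mode: EXACTLY_ONCE
state.backend: rocksdb                               # keyed state kept in RocksDB
state.checkpoints.dir: s3://my-bucket/checkpoints/   # durable checkpoint storage
```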
Checkpoint‑Based Execution Model
Streaming pipelines evolve over time: logic changes, schemas expand, and performance tuning becomes necessary. Flink’s checkpoint‑based execution model allows us to:
- Restart jobs without breaking downstream systems
- Roll out logic or configuration changes safely
- Treat streaming jobs as living systems, not one‑off deployments
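Restart-and-resume depends on checkpoints outliving the job that wrote them. In configuration terms that comes down to a single retention setting (a sketch, per Flink's checkpointing options):

```yaml
# Keep completed checkpoints when a job is cancelled,
# so a later run can resume from them
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
```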
Its combination of stateful processing, exactly‑once semantics, and safe evolution makes Flink a clear choice for production pipelines.
Running Flink on AWS EKS
Once we decided on Flink, we needed a reliable runtime. AWS EKS provides:
- A managed Kubernetes control plane
- Native integration with AWS services
- Consistent environments across dev, test, and prod
To make Flink truly Kubernetes‑native, we adopted the Flink Kubernetes Operator.
Flink Kubernetes Operator
Installing the operator via Helm introduces a FlinkDeployment Custom Resource Definition (CRD). Deployments become fully declarative: we define the desired state in YAML, and the operator continuously reconciles the cluster toward it.
Operator Responsibilities
- Launches the JobManager pod (hosting the Flink UI)
- Scales TaskManagers as needed
- Configures networking, volumes, and service accounts
- Manages job restarts, upgrades, and recovery
This drastically reduces operational overhead and makes Flink cloud‑native and production‑ready.
Deployment Model
We separate cluster management from job management:
- FlinkDeployment – defines the session cluster (image, resources, Flink config, EFS mounts)
- FlinkSessionJob – defines the actual streaming job (entrypoint, arguments, parallelism, upgrade mode)
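A trimmed session-cluster FlinkDeployment might look like this (image, version, and resource values are illustrative; leaving out spec.job is what makes it a session cluster):

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-session-cluster
spec:
  image: flink:1.17          # illustrative image/version
  flinkVersion: v1_17
  serviceAccount: flink
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "4"
    state.checkpoints.dir: s3://my-bucket/checkpoints/
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "4096m"
      cpu: 2
  # no spec.job here: the operator runs this as a session cluster
```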
Most deployments happen via Terraform, which renders YAML templates and applies them with kubernetes_manifest.
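In Terraform, that pattern boils down to something like this sketch (the template path and variables are assumptions):

```hcl
# Render a YAML template and apply it through the Kubernetes provider
resource "kubernetes_manifest" "streaming_job" {
  manifest = yamldecode(templatefile("${path.module}/templates/flink-session-job.yaml.tpl", {
    job_name    = "my-streaming-job"
    parallelism = 4
  }))
}
```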
Simplified Example of a FlinkSessionJob
```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkSessionJob
metadata:
  name: my-streaming-job
spec:
  deploymentName: my-session-cluster   # session cluster this job attaches to
  job:
    jarURI: s3://my-bucket/jars/my-job.jar
    parallelism: 4
    args:
      - "--input"
      - "s3://my-bucket/input/"
      - "--output"
      - "s3://my-bucket/output/"
    upgradeMode: last-state  # 💡 Tip: last-state reruns the job from the latest checkpoint
```
Operating Jobs Safely
For initial runs we typically start jobs stateless. For restarts, backfills, or upgrades, we rely on checkpoint‑based recovery using upgradeMode: last-state. This ensures:
- Jobs resume from the latest successful checkpoint
- Downstream systems remain stable
- Minimal gaps or duplicates, even for CDC sources
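In the job spec, that policy is a one-field difference (sketch):

```yaml
spec:
  job:
    # initial run: start with empty state
    upgradeMode: stateless
    # restarts, backfills, upgrades: resume from the latest checkpoint
    # upgradeMode: last-state
```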
Safe Change Process
- Update the job or cluster spec.
- Apply the changes via Terraform or kubectl.
- The operator restores state automatically.
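For example, scaling a job up is just a spec edit plus an apply; the operator handles the stateful restart (values illustrative):

```yaml
spec:
  job:
    parallelism: 8           # changed from 4
    upgradeMode: last-state  # the operator restores from the latest checkpoint
```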
Business Logic Changes (Python, SQL, JARs)
- Push updated code to S3.
- Sync to EFS via AWS DataSync.
- Verify files in Flink containers.
- Perform a rolling, stateful upgrade.
This process allows safe iteration on critical streaming pipelines.
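The EFS mount itself is declared once on the cluster spec via a podTemplate; a sketch (the PVC name and mount path are assumptions):

```yaml
spec:
  podTemplate:
    spec:
      containers:
        - name: flink-main-container   # the container name the operator matches for overrides
          volumeMounts:
            - name: job-artifacts
              mountPath: /opt/flink/usrlib
      volumes:
        - name: job-artifacts
          persistentVolumeClaim:
            claimName: flink-efs-pvc   # EFS-backed PVC (assumed)
```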
Standalone Flink Deployment (Fallback)
While the operator is our default, it constrains how much of the underlying Kubernetes configuration we can control. For special cases we maintain a standalone Flink deployment managed with Terraform:
- Separate deployments for JobManager and TaskManagers.
- A Service and Ingress exposing the Flink UI.
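The JobManager half of that setup is ordinary Kubernetes; a trimmed sketch (image and labels are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      containers:
        - name: jobmanager
          image: flink:1.17        # illustrative
          args: ["jobmanager"]
          ports:
            - containerPort: 8081  # REST API / Flink UI, exposed via Service + Ingress
```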
Trade‑offs
| Approach | Pros | Cons |
|---|---|---|
| Operator | Safer, simpler, automated | Limited configurability |
| Standalone | Full control over pods and settings | Higher operational overhead |
Most workloads fit the operator model well, but having a fallback gives us flexibility for edge cases.
Conclusion
From choosing Flink for its stateful streaming model, to running it on AWS EKS, and finally operating jobs safely with the Flink Kubernetes Operator, this journey has shaped how we build and maintain streaming pipelines. Flink is more than a processing engine—it’s a platform for evolving, long‑running data pipelines. Running it Kubernetes‑native on AWS lets us balance operational safety, scalability, and flexibility.
We’re excited to continue sharing our learnings with the AWS Community Builders and the wider streaming community.