
Real-Time Streaming Analytics
Event-Driven Data Processing Guide

A comprehensive enterprise guide to real-time streaming analytics covering Apache Kafka, Apache Flink, Spark Structured Streaming, event-driven architecture patterns (Lambda, Kappa, CQRS), windowing strategies, exactly-once semantics, and production deployment for fraud detection, IoT monitoring, live dashboards, and algorithmic trading across APAC operations.

DATA ANALYTICS | February 2026 | 30 min read | Technical Depth: Advanced

1. What is Real-Time Streaming Analytics

Real-time streaming analytics is the practice of continuously ingesting, processing, and analyzing data the moment it is generated -- event by event, record by record -- rather than accumulating it in batches for later processing. Unlike traditional batch analytics that operates on bounded datasets at scheduled intervals (hourly ETL jobs, nightly data warehouse refreshes), streaming analytics treats data as an unbounded, continuously flowing sequence of events and delivers insights within milliseconds to seconds of the originating event.

This paradigm shift from "store then analyze" to "analyze as it flows" has been driven by modern business demands that simply cannot tolerate minutes or hours of analytical latency. When a fraudulent credit card transaction occurs, detecting it 24 hours later in a batch report means the money is already gone. When an IoT sensor on a turbine detects vibration anomalies, a 15-minute batch window may mean the difference between a preventive shutdown and a catastrophic failure costing millions in repairs and downtime.

The core principle of streaming analytics is that every business event -- a user click, a sensor reading, a financial transaction, a log entry -- is processed as a first-class, immutable record in a continuous data stream. Processing engines apply transformations, aggregations, joins, and pattern detection on these streams in real time, materializing results to dashboards, alerting systems, downstream services, or persistent storage.

<100ms -- Target End-to-End Latency for Critical Events
10M+ -- Events/Second in Production Kafka Clusters
73% -- Enterprises Adopting Streaming by 2027
$28.3B -- Global Stream Processing Market by 2028

1.1 Batch vs. Stream Processing: A Fundamental Comparison

Understanding the difference between batch and stream processing is foundational to designing modern data architectures. The two paradigms are not mutually exclusive -- most production systems employ both -- but they serve fundamentally different analytical needs.

| Dimension | Batch Processing | Stream Processing |
|---|---|---|
| Data Model | Bounded dataset (finite) | Unbounded stream (infinite) |
| Processing Trigger | Scheduled (cron, orchestrator) | Continuous (event-driven) |
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high (optimized for bulk) | High (optimized per event) |
| State Management | Implicit (full dataset available) | Explicit (managed checkpoints) |
| Fault Tolerance | Re-run entire job | Checkpoint-based recovery |
| Windowing | Natural (job boundary = window) | Explicit (tumbling, sliding, session) |
| Complexity | Lower (simpler programming model) | Higher (ordering, late data, state) |
| Typical Tools | Spark Batch, Hive, BigQuery, dbt | Kafka, Flink, Spark Streaming, Kinesis |
| Best For | Reports, ML training, backfills | Alerts, fraud, dashboards, IoT |

1.2 The Event-Driven Paradigm

At the heart of streaming analytics lies the event-driven paradigm. An event is an immutable record of something that happened: a user clicked a button, a temperature sensor reported 85.3 degrees Celsius, a payment of $4,299 was initiated from an unusual location. Events carry a timestamp, a payload, and typically a key that identifies the entity involved.

In an event-driven architecture, components communicate by producing and consuming events through a durable event log (such as Apache Kafka). Producers are decoupled from consumers -- the payment service does not need to know about the fraud detection system, the recommendation engine, or the analytics dashboard. Each consumer processes the event stream independently at its own pace, enabling independent scaling, deployment, and evolution of each component.

// Anatomy of a streaming event
{
  "event_id": "evt_7f2a8b3c-1d4e-5f6a-7b8c-9d0e1f2a3b4c",
  "event_type": "transaction.initiated",
  "timestamp": "2026-02-01T14:23:17.842Z",
  "source": "payment-gateway-sg-01",
  "partition_key": "customer_8847291",
  "payload": {
    "transaction_id": "txn_9182736450",
    "customer_id": "customer_8847291",
    "amount": 4299.00,
    "currency": "SGD",
    "merchant_id": "merchant_electronics_001",
    "merchant_category": "electronics",
    "card_last_four": "7823",
    "geo_location": {"lat": 1.3521, "lon": 103.8198},
    "device_fingerprint": "fp_a8b7c6d5e4f3",
    "ip_address": "203.116.xx.xx",
    "channel": "mobile_app"
  },
  "metadata": {
    "schema_version": "2.1",
    "correlation_id": "corr_abc123def456",
    "trace_id": "trace_789xyz"
  }
}
Key Insight: Events vs. Commands vs. Queries

Events describe facts that have already occurred ("PaymentProcessed", "SensorReadingCaptured"). They are immutable and past-tense. Commands are requests for something to happen ("ProcessPayment", "UpdateInventory"). They may be rejected. Queries request current state ("GetAccountBalance", "ListRecentOrders"). A well-designed streaming system uses events as the source of truth, commands to trigger actions, and materialized views to serve queries -- the essence of CQRS (Command Query Responsibility Segregation).
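The distinction is easy to make concrete in code. The sketch below is a minimal, hypothetical illustration (the class names are examples, not from any specific framework): events are frozen facts, commands are mutable requests that may still be rejected.

```python
from dataclasses import dataclass, FrozenInstanceError

# An event: an immutable fact, named in the past tense.
@dataclass(frozen=True)
class PaymentProcessed:
    transaction_id: str
    amount: float
    currency: str

# A command: a request that may be rejected, named imperatively.
@dataclass
class ProcessPayment:
    transaction_id: str
    amount: float
    currency: str

event = PaymentProcessed("txn_001", 4299.00, "SGD")
try:
    event.amount = 0.0          # facts cannot be rewritten
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Making events frozen at the type level enforces the "immutable source of truth" property that CQRS read models depend on.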

2. Architecture Patterns

2.1 Lambda Architecture

The Lambda architecture, introduced by Nathan Marz, addresses the challenge of serving both real-time and historical analytics from a single system. It operates three layers in parallel: a batch layer that processes the complete historical dataset for accuracy, a speed layer that processes the real-time stream for freshness, and a serving layer that merges results from both to answer queries.

# Lambda Architecture: Conceptual Data Flow
#
# +------------------+     +-------------------+     +------------------+
# |   DATA SOURCES   |     |    BATCH LAYER    |     |  SERVING LAYER   |
# |                  +---->+  (Spark / Hive)   +---->+ (Druid / HBase)  |
# |  Transactions    |     |  Complete recomp. |     |   Batch Views    |
# |  Clickstream     |     |  High accuracy    |     |        +         |
# |  IoT Sensors     |     |  Hours latency    |     |  Merge & Query   |
# |  Log Events      +--+  +-------------------+     |        +         |
# |  API Calls       |  |                            |   Speed Views    |
# +------------------+  |  +-------------------+     |                  |
#                       +->+    SPEED LAYER    +---->+                  |
#                          |  (Flink / Storm)  |     +--------+---------+
#                          |  Incremental      |              |
#                          |  Low latency      |              v
#                          |  Approximate      |     +------------------+
#                          +-------------------+     | API / DASHBOARD  |
#                                                    +------------------+
#
# Pros: Fault-tolerant, batch corrects streaming errors, proven at scale
# Cons: Dual codebase, operational complexity, eventual consistency gap

The batch layer recomputes views from the immutable master dataset (typically stored in a data lake on S3, ADLS, or GCS) using engines like Apache Spark or Hive. These batch views are authoritative but stale -- they reflect data only up to the last batch run. The speed layer processes events in real time using stream processing engines (Flink, Storm, Kafka Streams), producing incremental views that cover the gap between the last batch run and the present moment. The serving layer merges batch and speed views to serve queries with both accuracy and freshness.

Lambda's primary drawback is the dual-codebase problem: the same business logic must be implemented in both the batch and speed layers, often using different languages and frameworks. This leads to divergence bugs where batch and speed results disagree, and doubles the maintenance burden.

2.2 Kappa Architecture

The Kappa architecture, proposed by Jay Kreps (co-creator of Apache Kafka), eliminates the batch layer entirely. All data flows through a single streaming pipeline, and historical reprocessing is achieved by replaying the event log from a desired offset. Kafka's durable, replayable log serves as both the real-time transport and the historical source of truth.

# Kappa Architecture: Single Pipeline, Replayable Log
#
# +------------------+     +-------------------+     +------------------+
# |   DATA SOURCES   +---->+     EVENT LOG     +---->+ STREAM PROCESSOR |
# |                  |     |  (Apache Kafka)   |     | (Flink/KStreams) |
# |  All events flow |     |  Durable          |     |  Single codebase |
# |  into one log    |     |  Replayable       |     |  Handles both    |
# +------------------+     |  Partitioned      |     |  real-time and   |
#                          |  Retention: 7-30d |     |  historical      |
#                          |  or infinite      |     +--------+---------+
#                          +-------------------+              |
#                                                             v
#                          +-------------------+     +------------------+
#                          |   REPROCESSING:   |     |   MATERIALIZED   |
#                          |  Deploy v2 job,   |     |  VIEWS / SINKS   |
#                          |  replay from t=0, |     |  (DB, Cache, S3) |
#                          |  switch on        |     +------------------+
#                          |  catch-up         |
#                          +-------------------+
#
# Pros: Single codebase, simpler operations, consistent semantics
# Cons: Requires replayable log, reprocessing speed limited by log read

When business logic changes (a new fraud rule, updated aggregation), the operator deploys a new version of the streaming job that reads from the beginning of the Kafka log, reprocessing all historical events through the updated logic. Once the new job catches up to the present, traffic is switched from the old to the new version. This approach provides the accuracy of batch recomputation with the simplicity of a single codebase.
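The replay mechanic can be illustrated without Kafka at all. The sketch below (all names illustrative) models the log as a Python list of (offset, event) pairs and shows a "v2" job rebuilding its materialized view from offset 0 with new business logic:

```python
# A durable, replayable log: each entry is (offset, event).
log = [
    (0, {"customer": "c1", "amount": 100.0}),
    (1, {"customer": "c1", "amount": 250.0}),
    (2, {"customer": "c2", "amount": 75.0}),
]

def run_job(log, from_offset, logic):
    """Replay the log from an offset through the given business logic,
    materializing a per-customer view."""
    view = {}
    for offset, event in log:
        if offset >= from_offset:
            logic(view, event)
    return view

# v1 logic: count transactions per customer.
def v1(view, event):
    view[event["customer"]] = view.get(event["customer"], 0) + 1

# v2 logic: sum amounts per customer. Deploying it means replaying the
# whole log from offset 0, then switching traffic once it catches up.
def v2(view, event):
    view[event["customer"]] = view.get(event["customer"], 0.0) + event["amount"]

view_v1 = run_job(log, from_offset=0, logic=v1)
view_v2 = run_job(log, from_offset=0, logic=v2)
```

The key property is that the view is a pure function of the log and the logic, so changing the logic never requires migrating state in place, only replaying.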

Kappa is the preferred architecture for organizations that can afford Kafka's storage costs for long retention periods and whose processing logic is naturally expressed as stream operations. It has become the dominant pattern for greenfield streaming platforms.

2.3 Event Sourcing

Event sourcing is an application architecture pattern where the system's state is determined entirely by a sequence of events rather than by mutable database records. Instead of storing the current account balance as a single row, event sourcing stores every deposit, withdrawal, and transfer as an immutable event. The current balance is derived by replaying all events for that account.

Event sourcing provides a complete audit trail, enables temporal queries ("what was the balance at 3:00 PM yesterday?"), and supports rebuilding state from scratch by replaying events -- a property that integrates naturally with Kappa-style streaming architectures. Apache Kafka or Amazon Kinesis serves as the event store, with stream processors materializing current-state views into read-optimized databases.
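The balance-by-replay idea fits in a few lines. Here is a minimal sketch (event shapes and timestamps are illustrative) showing both the current balance and the temporal query from the paragraph above:

```python
# Event-sourced account: state is derived by folding over the event log.
events = [
    {"type": "deposit",  "amount": 500.0, "ts": "2026-02-01T09:00:00Z"},
    {"type": "withdraw", "amount": 120.0, "ts": "2026-02-01T12:30:00Z"},
    {"type": "deposit",  "amount": 300.0, "ts": "2026-02-01T16:45:00Z"},
]

def balance(events, as_of=None):
    """Replay events (optionally only those up to a timestamp) to derive
    the balance -- temporal queries come for free with event sourcing.
    ISO-8601 UTC strings compare correctly as plain strings."""
    total = 0.0
    for e in events:
        if as_of is not None and e["ts"] > as_of:
            continue
        total += e["amount"] if e["type"] == "deposit" else -e["amount"]
    return total

current = balance(events)
at_3pm = balance(events, as_of="2026-02-01T15:00:00Z")
```

In production the fold is incremental -- a stream processor maintains the running balance as events arrive -- but full replay remains available for rebuilding views or auditing.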

2.4 CQRS -- Command Query Responsibility Segregation

CQRS separates the write model (commands that modify state) from the read model (queries that return state), allowing each to be independently optimized. In a streaming analytics context, commands produce events into the event log, and dedicated stream processors consume these events to build purpose-specific read models (materialized views) optimized for different query patterns.

# CQRS + Event Sourcing with Kafka
#
#   WRITE SIDE              EVENT LOG                   READ SIDE
# +-------------+    +---------------------+     +------------------+
# | Command API |    |    Apache Kafka     |     | View: Dashboard  |
# | - Validate  +--->+   topic: orders     +---->+  (ClickHouse)    |
# | - Produce   |    |                     |     +------------------+
# | - Respond   |    | Immutable log of    |     +------------------+
# +-------------+    | all order events:   +---->+ View: Search     |
#                    |   OrderCreated      |     |  (Elasticsearch) |
#                    |   OrderPaid         |     +------------------+
#                    |   OrderShipped      |     +------------------+
#                    |   OrderDelivered    +---->+ View: Analytics  |
#                    |   OrderCancelled    |     |  (Apache Druid)  |
#                    +----------+----------+     +------------------+
#                               |                +------------------+
#                               +--------------->+ View: ML Features|
#                                                | (Redis / Feast)  |
#                                                +------------------+
#
# Each read model is independently optimized, scaled, and updated

A single order event stream might feed a ClickHouse materialized view for real-time dashboarding, an Elasticsearch index for full-text order search, an Apache Druid datasource for OLAP analytics, and a Redis feature store for machine learning model serving. Each consumer processes the same event stream but materializes it into a shape optimized for its specific access pattern.

3. Core Technologies

3.1 Apache Kafka

Apache Kafka is the de facto standard distributed event streaming platform, serving as the backbone of real-time data infrastructure at over 80% of Fortune 500 companies. Kafka's architecture -- a distributed, partitioned, replicated commit log -- provides durable, high-throughput, low-latency event transport with strong ordering guarantees within partitions.

Kafka's key architectural components include: Brokers (servers that store and serve data), Topics (named feeds of events, analogous to database tables), Partitions (parallelism units within topics, enabling horizontal scaling), Producers (clients that publish events), and Consumers (clients that read events, organized into Consumer Groups for parallel processing). Kafka Connect provides a framework for streaming data between Kafka and external systems (databases, file systems, search indices) without custom code.

| Kafka Capability | Specification | Notes |
|---|---|---|
| Throughput | Millions of events/sec per cluster | LinkedIn runs 7+ trillion messages/day |
| Latency | 2-10ms end-to-end (p99) | Producer to consumer, same datacenter |
| Durability | Replicated across brokers (RF=3 typical) | No data loss with acks=all |
| Retention | Time-based, size-based, or infinite | Tiered storage offloads to S3/GCS |
| Ordering | Guaranteed within a partition | Use partition keys for per-entity ordering |
| Exactly-Once | Idempotent producers + transactions | Available since Kafka 0.11 |
| Schema Registry | Confluent Schema Registry (Avro, Protobuf, JSON) | Schema evolution with compatibility checks |
| Kafka Streams | Lightweight stream processing library | No separate cluster; runs in your app |
# Production Kafka Cluster Configuration (docker-compose excerpt)
# 3-broker cluster with KRaft (no ZooKeeper), Schema Registry, Connect
---
version: '3.8'
services:
  kafka-1:
    image: confluentinc/cp-kafka:7.6.0
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: '1@kafka-1:9093,2@kafka-2:9093,3@kafka-3:9093'
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_LOG_RETENTION_HOURS: 168            # 7 days hot storage
      KAFKA_LOG_SEGMENT_BYTES: 1073741824       # 1GB segments
      KAFKA_NUM_PARTITIONS: 12                  # Default partitions
      KAFKA_DEFAULT_REPLICATION_FACTOR: 3
      KAFKA_MIN_INSYNC_REPLICAS: 2
      KAFKA_COMPRESSION_TYPE: lz4
      KAFKA_MESSAGE_MAX_BYTES: 10485760         # 10MB max message
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: 'false'  # Explicit topic creation
    volumes:
      - kafka-1-data:/var/lib/kafka/data
    deploy:
      resources:
        limits: { memory: 8G }
        reservations: { memory: 6G }
  schema-registry:
    image: confluentinc/cp-schema-registry:7.6.0
    environment:
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: kafka-1:9092,kafka-2:9092,kafka-3:9092
      SCHEMA_REGISTRY_AVRO_COMPATIBILITY_LEVEL: BACKWARD
    ports: ["8081:8081"]
  kafka-connect:
    image: confluentinc/cp-kafka-connect:7.6.0
    environment:
      CONNECT_BOOTSTRAP_SERVERS: kafka-1:9092,kafka-2:9092,kafka-3:9092
      CONNECT_GROUP_ID: connect-cluster
      CONNECT_KEY_CONVERTER: io.confluent.connect.avro.AvroConverter
      CONNECT_VALUE_CONVERTER: io.confluent.connect.avro.AvroConverter
      CONNECT_EXACTLY_ONCE_SOURCE_SUPPORT: enabled
    ports: ["8083:8083"]

3.2 Apache Flink

Apache Flink is the most advanced open-source stream processing engine, purpose-built for stateful computations over unbounded data streams. Unlike frameworks that treat streaming as micro-batching (Spark Streaming), Flink processes events truly one-at-a-time with millisecond latency while maintaining complex distributed state with exactly-once consistency guarantees.

Flink's defining capabilities for enterprise streaming include: distributed snapshot-based checkpointing (Chandy-Lamport algorithm) for exactly-once state consistency, event-time processing with watermarks for handling out-of-order data, a rich windowing API (tumbling, sliding, session, and custom windows), CEP (Complex Event Processing) library for pattern detection, and native support for both SQL and DataStream APIs.

// Apache Flink: Real-Time Fraud Detection Pipeline (Java)
public class FraudDetectionPipeline {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Enable exactly-once checkpointing every 30 seconds
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10_000);
        env.getCheckpointConfig().setCheckpointStorage("s3://checkpoints/fraud-detection/");

        // Configure event-time processing
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Source: Kafka transactions topic
        KafkaSource<Transaction> source = KafkaSource.<Transaction>builder()
            .setBootstrapServers("kafka-1:9092,kafka-2:9092,kafka-3:9092")
            .setTopics("transactions")
            .setGroupId("fraud-detection-v3")
            .setStartingOffsets(OffsetsInitializer.latest())
            .setDeserializer(new TransactionDeserializer())
            .build();

        DataStream<Transaction> transactions = env.fromSource(
            source,
            WatermarkStrategy.<Transaction>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((txn, ts) -> txn.getTimestamp()),
            "Kafka Transactions");

        // Rule 1: Velocity check -- more than 5 transactions in 2 minutes
        DataStream<FraudAlert> velocityAlerts = transactions
            .keyBy(Transaction::getCustomerId)
            .window(SlidingEventTimeWindows.of(Time.minutes(2), Time.seconds(30)))
            .aggregate(new TransactionCounter())
            .filter(count -> count.getTotal() > 5)
            .map(count -> new FraudAlert(
                count.getCustomerId(),
                "VELOCITY",
                "High transaction velocity: " + count.getTotal() + " in 2min",
                FraudAlert.Severity.HIGH));

        // Rule 2: Geographic impossibility -- transactions in different
        // countries within 1 hour (impossible travel)
        DataStream<FraudAlert> geoAlerts = transactions
            .keyBy(Transaction::getCustomerId)
            .process(new ImpossibleTravelDetector(
                Duration.ofHours(1), 500.0 /* km threshold */));

        // Rule 3: Amount anomaly -- transaction > 3x rolling 30-day average
        DataStream<FraudAlert> amountAlerts = transactions
            .keyBy(Transaction::getCustomerId)
            .process(new AmountAnomalyDetector(
                Duration.ofDays(30), 3.0 /* multiplier */));

        // Union all alerts, deduplicate, and sink
        velocityAlerts.union(geoAlerts, amountAlerts)
            .keyBy(FraudAlert::getCustomerId)
            .process(new AlertDeduplicator(Duration.ofMinutes(5)))
            .sinkTo(KafkaSink.<FraudAlert>builder()
                .setBootstrapServers("kafka-1:9092")
                .setRecordSerializer(new FraudAlertSerializer("fraud-alerts"))
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .build());

        env.execute("Fraud Detection Pipeline v3");
    }
}

3.3 Spark Structured Streaming

Apache Spark Structured Streaming extends the Spark SQL engine to handle streaming data using a micro-batch or continuous processing model. Its key advantage is API unification -- the same DataFrame/SQL API used for batch analytics works identically for streaming, enabling a single codebase that processes both historical and real-time data. This makes Structured Streaming the natural choice for organizations already invested in the Spark ecosystem (Databricks, EMR, HDInsight).

Structured Streaming processes data in micro-batches -- by default, each micro-batch starts as soon as the previous one completes, or on a fixed processing-time trigger if one is configured -- or in continuous processing mode (experimental, targeting ~1ms latency). While micro-batching introduces slightly higher latency than true event-at-a-time engines like Flink, it provides excellent throughput and simplifies exactly-once semantics through idempotent writes and write-ahead logging.

3.4 Cloud-Native Streaming Services

| Service | Provider | Max Throughput | Latency | Managed | Best For |
|---|---|---|---|---|---|
| Amazon Kinesis Data Streams | AWS | 1K records/sec (1 MB/s) per shard | ~200ms | Fully managed | AWS-native real-time pipelines |
| Amazon MSK (Managed Kafka) | AWS | Kafka-scale | Kafka-native | Managed brokers | Kafka workloads on AWS |
| Azure Event Hubs | Azure | ~1M events/sec per namespace | ~10ms | Fully managed | Azure-native ingestion, Kafka-compatible API |
| Google Pub/Sub | GCP | Virtually unlimited | ~100ms | Fully managed, serverless | GCP-native event distribution |
| Confluent Cloud | Multi-cloud | Kafka-scale | Kafka-native | Fully managed Kafka | Enterprise Kafka without ops burden |
| Amazon Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) | AWS | Scales with KPUs | ~100ms | Managed Flink | Serverless Flink on AWS |
| Google Dataflow | GCP | Auto-scaled | ~seconds | Managed Beam | Unified batch+stream on GCP |
| Azure Stream Analytics | Azure | ~200MB/sec per SU | ~100ms | Fully managed, SQL | SQL-based streaming on Azure |
Technology Selection Guide for APAC Enterprises

Choose Apache Kafka + Flink when you need the lowest latency (<10ms), complex stateful processing (CEP, pattern detection, ML scoring), exactly-once guarantees, and are willing to invest in operational expertise. Ideal for financial services, telco, and large-scale IoT.

Choose Kafka + Spark Structured Streaming when your team already uses Spark for batch analytics, you want a unified batch/stream codebase, and sub-second (rather than sub-millisecond) latency is acceptable. Ideal for e-commerce, marketing analytics, and data lake streaming ingestion.

Choose Cloud-Native Services (Kinesis, Event Hubs, Pub/Sub) when you want zero operational overhead, are committed to a single cloud provider, and your streaming patterns are standard (ingestion, filtering, aggregation). Ideal for startups, SMBs, and teams without dedicated streaming engineers.

4. Use Cases

4.1 Fraud Detection in Milliseconds

Financial fraud detection is the canonical use case for real-time streaming analytics. A modern fraud detection system must evaluate every transaction against hundreds of rules and ML models within 50-100ms to authorize or block the transaction before it completes. This requires processing the transaction event through a pipeline that includes velocity checks (transaction frequency), geographic analysis (impossible travel detection), amount anomaly scoring (deviation from historical patterns), device fingerprint matching, and real-time ML model inference.

Production fraud detection systems at major APAC banks (DBS, OCBC, Vietcombank) process 10,000-50,000 transactions per second through Apache Flink pipelines backed by Kafka. The streaming engine maintains per-customer state including rolling 30-day transaction statistics, recent geolocation history, and device trust scores -- all updated in real time as events flow through. False positive rates of 0.1-0.3% are achievable with ensemble models combining rule-based and ML-based detection, scoring each transaction in under 20ms.

4.2 IoT Sensor Monitoring

Industrial IoT deployments generate continuous streams of sensor telemetry -- temperature, pressure, vibration, humidity, power consumption -- from thousands of connected devices. Real-time analytics on these streams enables predictive maintenance (detecting degradation patterns before failure), process optimization (adjusting parameters in real time to minimize waste), and safety alerting (immediate response to threshold breaches).

A typical manufacturing IoT streaming architecture in Vietnam's industrial zones processes 100,000+ sensor readings per second from connected PLCs, edge gateways, and SCADA systems. Apache Kafka ingests the raw telemetry, Apache Flink computes windowed aggregations (5-second averages, 1-minute moving standard deviations), and anomaly detection models score each window against equipment-specific baselines. Alerts reaching operations dashboards within 3 seconds of the triggering event enable operators to intervene before minor deviations cascade into production line shutdowns.
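The windowed-aggregation-plus-baseline-scoring step described above can be sketched in plain Python (thresholds, units, and data are illustrative; a Flink job would compute the same quantities per key over event-time windows):

```python
import statistics

def tumbling_averages(readings, window_s=5):
    """Group (timestamp_s, value) readings into 5-second tumbling windows
    and return {window_start: mean} -- the kind of windowed aggregate a
    stream processor would emit."""
    windows = {}
    for ts, value in readings:
        start = (ts // window_s) * window_s
        windows.setdefault(start, []).append(value)
    return {start: statistics.fmean(vals) for start, vals in sorted(windows.items())}

def zscore_anomalies(window_means, baseline_mean, baseline_std, threshold=3.0):
    """Flag windows whose mean deviates more than `threshold` standard
    deviations from the equipment-specific baseline."""
    return [start for start, mean in window_means.items()
            if abs(mean - baseline_mean) / baseline_std > threshold]

# Vibration readings (seconds, mm/s): the last window drifts well above baseline.
readings = [(0, 2.1), (1, 2.0), (3, 2.2), (5, 2.1), (7, 2.0),
            (10, 6.8), (12, 7.1), (14, 6.9)]
means = tumbling_averages(readings)
alerts = zscore_anomalies(means, baseline_mean=2.1, baseline_std=0.3)
```

Running this flags only the window starting at t=10, where the mean vibration jumps far above the baseline; the other windows score near zero.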

4.3 Real-Time Recommendations

E-commerce and content platforms use streaming analytics to personalize user experiences in real time. Every user interaction -- page view, search query, cart addition, scroll depth, hover duration -- flows into a streaming pipeline that updates the user's behavioral profile and triggers recommendation model re-scoring. Unlike batch recommendation systems that update once daily, streaming recommendations reflect the user's intent in the current session.

Streaming recommendation architectures typically combine a pre-computed candidate generation model (updated hourly or daily via batch) with a real-time ranking model that re-scores candidates based on the user's streaming session context. Apache Kafka captures user events, a Flink or Kafka Streams application enriches events with user profile features from a Redis feature store, and a model serving layer (TensorFlow Serving, Triton) provides sub-10ms scoring. This hybrid approach delivers 15-30% improvement in click-through rates compared to batch-only recommendations.

4.4 Live Operational Dashboards

Real-time dashboards transform streaming data into visual intelligence for operational decision-making. Unlike traditional BI dashboards that refresh on intervals (hourly, daily), streaming dashboards update continuously -- showing live order volumes, current system health, real-time revenue counters, and instantaneous anomaly indicators.

The streaming dashboard architecture typically flows from Kafka through a stream processor (Flink SQL or ksqlDB) that computes windowed aggregations, into a real-time OLAP engine (Apache Druid, ClickHouse, or Apache Pinot) optimized for sub-second analytical queries, with Grafana or Apache Superset providing the visualization layer. WebSocket connections from the dashboard frontend to the OLAP engine enable push-based updates without polling.

4.5 Algorithmic Trading

Financial markets demand the lowest possible latency for streaming analytics. Algorithmic trading systems process market data feeds (price ticks, order book updates, trade executions) through complex event processing pipelines that detect trading signals, calculate risk metrics, and generate orders -- all within microseconds to low milliseconds.

While ultra-low-latency trading (sub-microsecond) relies on custom FPGA/ASIC hardware and direct market access, the broader quantitative finance ecosystem uses Apache Kafka for market data distribution, Flink for real-time risk aggregation and signal computation, and in-memory grids (Hazelcast, Apache Ignite) for position and portfolio state management. APAC financial centers (Singapore, Hong Kong, Tokyo) increasingly deploy these streaming architectures for real-time portfolio risk monitoring, cross-market arbitrage detection, and regulatory compliance calculations (real-time margin requirements under Basel III.1).

50ms -- Fraud Decision Latency Target
100K+ -- IoT Events/Second Per Factory
15-30% -- CTR Lift from Real-Time Recommendations
<1ms -- Dashboard Aggregation Latency

5. Implementation Guide

5.1 Infrastructure Setup

A production streaming analytics platform requires careful infrastructure planning across compute, storage, and networking layers. The foundation is a Kafka cluster sized for your peak throughput with headroom for growth, backed by stream processing compute (Flink cluster or managed service), and connected to appropriate sink systems for serving the processed results.

# Infrastructure Sizing Guide for Streaming Analytics
#
# Kafka Cluster Sizing (self-managed)
# ====================================
# Target: 100,000 events/sec ingestion, 1KB avg event, 7-day retention
#
# Throughput: 100K * 1KB = 100 MB/s ingestion
#   With RF=3: 100 MB/s * 3 = 300 MB/s total write throughput
#   Brokers: 300 MB/s / 100 MB/s per broker = 3 brokers minimum
#   Add 50% headroom = 5 brokers recommended
#
# Storage: 100 MB/s * 86400 sec/day * 7 days * 3 replicas = ~181 TB
#   With compaction overhead: ~220 TB total
#   Per broker: 220 TB / 5 = 44 TB (use NVMe SSDs)
#
# Memory: Each broker: 32-64 GB RAM
#   OS page cache is critical for Kafka read performance
#
# Network: 25 GbE per broker (10 GbE minimum)
#
# Flink Cluster Sizing
# ====================================
# TaskManagers: Start with 1 TM per Kafka partition being consumed
# Memory: 4-8 GB per TM for stateless; 16-32 GB for stateful
# Checkpointing: S3/GCS backend, 30-60 second intervals
# State Backend: RocksDB for large state (>10 GB), HashMapState for small
#
# Recommended Instance Types (AWS)
# ====================================
# Kafka brokers:   i3en.2xlarge (8 vCPU, 64 GB, 2x2.5 TB NVMe)
# Flink TaskMgr:   m6i.2xlarge  (8 vCPU, 32 GB) for compute
# Flink JobMgr:    m6i.xlarge   (4 vCPU, 16 GB)
# Schema Registry: t3.large     (2 vCPU, 8 GB)
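The arithmetic in the sizing guide can be wrapped into a quick calculator. This is a back-of-envelope sketch with the guide's own assumptions baked in as defaults (100 MB/s sustainable write per broker, 50% headroom, decimal units); adjust them for your hardware before trusting the output.

```python
import math

def size_kafka_cluster(events_per_sec, avg_event_kb, retention_days,
                       replication_factor=3, broker_mb_per_sec=100,
                       headroom=0.5):
    """Back-of-envelope Kafka sizing, mirroring the guide's arithmetic.
    Uses decimal units (1 TB = 1e6 MB); excludes compaction overhead."""
    ingest_mb_s = events_per_sec * avg_event_kb / 1000           # MB/s in
    write_mb_s = ingest_mb_s * replication_factor                # incl. replicas
    min_brokers = math.ceil(write_mb_s / broker_mb_per_sec)
    recommended = math.ceil(min_brokers * (1 + headroom))
    storage_tb = ingest_mb_s * 86_400 * retention_days * replication_factor / 1e6
    return {"min_brokers": min_brokers,
            "recommended_brokers": recommended,
            "storage_tb": round(storage_tb, 1)}

# The guide's worked example: 100K events/s, 1 KB events, 7-day retention.
plan = size_kafka_cluster(events_per_sec=100_000, avg_event_kb=1,
                          retention_days=7)
```

For the guide's target workload this reproduces the numbers above: 3 brokers minimum, 5 recommended, and roughly 181 TB of raw storage before compaction overhead.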

5.2 Stream Processing Pipeline Design

A well-designed stream processing pipeline follows a series of stages: source ingestion, deserialization and validation, enrichment (joining with reference data), transformation and business logic, windowed aggregation, and sink output. Each stage should be independently testable and monitorable.

# Spark Structured Streaming: E-Commerce Analytics Pipeline (Python/PySpark)
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder \
    .appName("ECommerceStreamingAnalytics") \
    .config("spark.sql.streaming.checkpointLocation", "s3a://checkpoints/ecom/") \
    .config("spark.sql.shuffle.partitions", "24") \
    .getOrCreate()

# Schema for incoming clickstream events
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("user_id", StringType()),
    StructField("session_id", StringType()),
    StructField("timestamp", TimestampType()),
    StructField("product_id", StringType()),
    StructField("category", StringType()),
    StructField("price", DoubleType()),
    StructField("quantity", IntegerType()),
    StructField("page_url", StringType()),
    StructField("referrer", StringType()),
    StructField("device_type", StringType()),
    StructField("geo_country", StringType())
])

# Source: Read from Kafka topic
raw_events = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka-1:9092,kafka-2:9092") \
    .option("subscribe", "clickstream-events") \
    .option("startingOffsets", "latest") \
    .option("maxOffsetsPerTrigger", 500000) \
    .load()

# Parse and validate events
events = raw_events \
    .select(from_json(col("value").cast("string"), event_schema).alias("data")) \
    .select("data.*") \
    .filter(col("event_id").isNotNull()) \
    .withWatermark("timestamp", "5 minutes")

# Real-time revenue aggregation: 1-minute tumbling windows
revenue_per_minute = events \
    .filter(col("event_type") == "purchase") \
    .groupBy(
        window("timestamp", "1 minute"),
        "geo_country",
        "category"
    ).agg(
        sum(col("price") * col("quantity")).alias("revenue"),
        count("event_id").alias("order_count"),
        avg("price").alias("avg_order_value"),
        # Exact distinct counts are not supported in streaming aggregations
        approx_count_distinct("user_id").alias("unique_buyers")
    )

# Session analytics: session windows with 30-minute gap
session_metrics = events \
    .groupBy(
        session_window("timestamp", "30 minutes"),
        "user_id",
        "device_type"
    ).agg(
        count("event_id").alias("events_in_session"),
        collect_set("category").alias("categories_viewed"),
        max(when(col("event_type") == "purchase", 1).otherwise(0)).alias("converted"),
        min("timestamp").alias("session_start"),
        max("timestamp").alias("session_end")
    )

# Sink: Write revenue metrics to ClickHouse via JDBC
revenue_per_minute.writeStream \
    .foreachBatch(lambda df, epoch_id: df.write
        .format("jdbc")
        .option("url", "jdbc:clickhouse://clickhouse:8123/analytics")
        .option("dbtable", "realtime_revenue")
        .mode("append")
        .save()
    ) \
    .outputMode("update") \
    .trigger(processingTime="10 seconds") \
    .start()

# Sink: Write session data to data lake (Parquet)
session_metrics.writeStream \
    .format("parquet") \
    .option("path", "s3a://data-lake/sessions/") \
    .partitionBy("device_type") \
    .outputMode("append") \
    .trigger(processingTime="1 minute") \
    .start()

spark.streams.awaitAnyTermination()

5.3 Windowing Strategies

Windowing is the mechanism by which unbounded streams are divided into finite chunks for aggregation. The choice of window type directly impacts the semantics and latency of your analytics.
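The three core window types can be modeled in a few lines of plain Python. This is a conceptual sketch of window assignment (not any engine's API, and it assumes non-negative timestamps): tumbling windows assign each event to exactly one window, sliding windows to several overlapping ones, and session windows merge events separated by less than an inactivity gap.

```python
def tumbling_window(ts, size):
    """The single tumbling window [start, start + size) containing ts."""
    start = (ts // size) * size
    return (start, start + size)

def sliding_windows(ts, size, slide):
    """All sliding windows [s, s + size) containing ts, where each
    window start s is a multiple of `slide`."""
    first = ((ts - size) // slide + 1) * slide
    last = (ts // slide) * slide
    return [(s, s + size) for s in range(max(first, 0), last + 1, slide)]

def session_windows(timestamps, gap):
    """Merge sorted timestamps into sessions closed by `gap` seconds
    of inactivity; returns (first_ts, last_ts) per session."""
    sessions, current = [], [timestamps[0]]
    for ts in timestamps[1:]:
        if ts - current[-1] <= gap:
            current.append(ts)
        else:
            sessions.append((current[0], current[-1]))
            current = [ts]
    sessions.append((current[0], current[-1]))
    return sessions

# An event at t=125s: one 60s tumbling window, two 60s/30s sliding windows.
tw = tumbling_window(125, 60)
sw = sliding_windows(125, 60, 30)
# Four events with a 30s gap rule collapse into two sessions.
sess = session_windows([0, 10, 100, 105], gap=30)
```

Note the trade-off this makes visible: sliding windows multiply the number of aggregates each event contributes to (size/slide windows per event), while session windows have data-dependent boundaries that only close once the gap elapses.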

5.4 State Management

Stateful stream processing -- where the processor maintains information across events (running counters, aggregation buffers, ML model state) -- is both the most powerful and most operationally challenging aspect of streaming systems. State must survive failures, scale with data volume, and be queryable for debugging and monitoring.

Apache Flink provides two state backends: HashMapStateBackend for small state stored in JVM heap (fast but limited by memory), and EmbeddedRocksDBStateBackend for large state that spills to local disk (slower but scales to terabytes per node). State is periodically checkpointed to durable storage (S3, HDFS, GCS) to enable exactly-once recovery after failures. Incremental checkpointing reduces checkpoint size by storing only state changes since the last checkpoint.

State Management Best Practices

1. Right-size your state TTL: Configure state TTL (time-to-live) to automatically expire state entries that are no longer relevant. Customer profile state kept with no TTL grows without bound; a retention policy such as 30 days caps it.

2. Use key-scoped state: Partition state by entity key (customer_id, device_id) to enable parallel processing and avoid hot partitions.

3. Monitor state size: Track state size per TaskManager and total checkpoint size. State growth exceeding expectations often indicates missing TTL or key cardinality issues.

4. Plan for state migration: Schema changes to state require savepoint-based migration. Use Avro or Protobuf for state serialization to support evolution.
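To make the TTL and key-scoping points concrete, here is a minimal, hypothetical in-memory sketch. Real engines (e.g. Flink via StateTtlConfig) implement this inside the state backend, but the semantics are the same: state is scoped per entity key and expired once its last update is older than the TTL.

```python
import time

class KeyedTtlState:
    # Hypothetical illustration only -- production state lives in the
    # engine's state backend (heap or RocksDB), not a Python dict.
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (value, last_update_time)

    def put(self, key, value):
        self._store[key] = (value, self.clock())

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, updated = entry
        if self.clock() - updated > self.ttl:
            del self._store[key]  # lazily expire stale state on read
            return default
        return value
```

Keying state by entity (customer_id, device_id) is what lets the engine shard it across parallel tasks; the TTL is what keeps it from growing without bound.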

6. Performance & Scalability

6.1 Throughput Optimization

Achieving high throughput in streaming systems requires optimization at every layer -- from Kafka producer batching through network utilization to sink parallelism.

6.2 Partitioning Strategy

Kafka topic partitioning is the primary mechanism for parallelism and data distribution. The partition key determines which partition receives each event, and consequently which consumer processes it. Choosing the right partition key is critical for both performance and correctness.

Partition Strategy | Key Example | Pros | Cons
Entity-based | customer_id, device_id | Ordering guaranteed per entity; enables stateful per-entity processing | Hot partitions if key distribution is skewed
Round-robin | None (null key) | Perfect load balancing; maximum throughput | No ordering guarantees; cannot do per-entity stateful processing
Composite | region + customer_id | Locality-aware; enables regional processing | Higher key cardinality; partition assignment complexity
Time-based | date_hour bucket | Natural time partitioning; easy retention | Skew during peak hours; limited parallelism in off-peak
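The mechanics behind the table amount to a few lines of hashing. Note the hedges: Kafka's default partitioner hashes key bytes with murmur2, and modern clients use a "sticky" batching strategy for null keys; crc32 and a plain counter are stand-ins for illustration only.

```python
import zlib
from itertools import count

_round_robin = count()

def assign_partition(key, num_partitions):
    # Keyed events: hash the key bytes and take the modulus, so every event
    # for a given key lands on the same partition, preserving per-key order.
    # (Kafka's real default partitioner uses murmur2, not crc32.)
    if key is None:
        # Null key: simple round-robin for load balancing; modern Kafka
        # clients use a sticky, batch-oriented variant of this.
        return next(_round_robin) % num_partitions
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

The correctness consequence is visible directly: the same customer_id always maps to the same partition, which is what makes per-entity stateful processing possible; a skewed key distribution maps disproportionately many events to one partition.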

6.3 Exactly-Once Semantics

Exactly-once processing guarantees that each event affects the system's state precisely once, even in the presence of failures and retries. This is critical for financial calculations, inventory management, and any use case where duplication or loss causes material business impact.

End-to-end exactly-once requires coordination across three layers:

  1. Source (Kafka): Idempotent producers (enable.idempotence=true) prevent duplicate writes. Transactional producers wrap multi-partition writes in atomic transactions.
  2. Processing (Flink/Spark): Flink uses distributed snapshots (Chandy-Lamport algorithm) with aligned or unaligned barriers to create consistent checkpoints of all operator state. On failure, the system restores from the last checkpoint and replays from the corresponding Kafka offsets.
  3. Sink: Sinks must be either idempotent (writing the same record twice has no additional effect, using unique keys) or transactional (two-phase commit with the stream processor). Flink provides a TwoPhaseCommitSinkFunction base class for building exactly-once sinks.
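The idempotent-sink option in point 3 can be sketched with SQLite (table and column names are illustrative): making event_id the primary key turns a replayed write into a no-op, so at-least-once delivery from the processor still yields exactly-once effects in the sink.

```python
import sqlite3

def idempotent_write(conn, event):
    # Replaying the same event_id is a no-op thanks to the primary key,
    # so duplicate deliveries do not double-count revenue.
    conn.execute(
        "INSERT INTO orders (event_id, amount) VALUES (:event_id, :amount) "
        "ON CONFLICT(event_id) DO NOTHING",
        event,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (event_id TEXT PRIMARY KEY, amount REAL)")
order = {"event_id": "evt-001", "amount": 99.5}
idempotent_write(conn, order)
idempotent_write(conn, order)  # duplicate delivery, e.g. replay after a failure
```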

6.4 Backpressure Handling

Backpressure occurs when a downstream component cannot keep up with the upstream event rate. Without proper handling, backpressure causes unbounded buffer growth, out-of-memory failures, or data loss. Production streaming systems must implement explicit backpressure strategies.
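At its core, backpressure is a bounded buffer whose put blocks when full, so pressure propagates upstream instead of buffers growing without bound. This toy sketch uses Python's queue.Queue; real engines achieve the equivalent with bounded network buffers (Flink) or source-side rate limits such as maxOffsetsPerTrigger (Spark).

```python
import queue
import threading

buf = queue.Queue(maxsize=100)  # bounded buffer: the backpressure boundary

def produce(events):
    for e in events:
        buf.put(e, block=True)  # blocks when the buffer is full, slowing the producer

def consume(n, out):
    for _ in range(n):
        out.append(buf.get())

out = []
prod = threading.Thread(target=produce, args=(range(500),))
cons = threading.Thread(target=consume, args=(500, out))
prod.start(); cons.start()
prod.join(); cons.join()
# The producer can never run more than 100 events ahead of the consumer.
```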

7. Monitoring & Observability

7.1 Key Metrics for Streaming Systems

Monitoring a streaming analytics platform requires tracking metrics across three dimensions: pipeline health (is data flowing correctly?), performance (how fast is it flowing?), and business accuracy (are the results correct?).

Metric | Description | Alert Threshold | Measurement
Consumer Lag | Difference between latest offset and consumer offset | >10,000 events or >30 seconds | Kafka consumer group lag (Burrow, Prometheus JMX)
End-to-End Latency | Time from event production to sink write | P99 > 5x target SLA | Timestamps embedded in events, measured at sink
Processing Throughput | Events processed per second per operator | <50% of expected baseline | Flink metrics reporter, Spark StreamingQueryListener
Checkpoint Duration | Time to complete a state checkpoint | >50% of checkpoint interval | Flink checkpoint metrics
State Size | Total managed state across all operators | Growth rate exceeding capacity plan | Flink RocksDB metrics, checkpoint size
Backpressure Ratio | Percentage of time operators are backpressured | >10% sustained | Flink backpressure monitor
Kafka ISR Shrinkage | Brokers falling out of in-sync replica set | Any ISR shrinkage event | Kafka broker JMX metrics
Deserialization Errors | Events failing schema validation | >0.1% of total events | Dead letter queue counter

7.2 Consumer Lag Monitoring

Consumer lag is the most critical health indicator for streaming pipelines. Lag represents the distance between the latest event produced to a Kafka partition and the latest event consumed by a consumer group. Increasing lag means the consumer is falling behind and results are becoming stale. Sustained high lag may indicate undersized consumers, a slow sink, an expensive computation, or data skew causing hot partitions.
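Lag itself is simple per-partition arithmetic; the sketch below assumes the offsets have already been fetched (in practice via Kafka's AdminClient APIs or an exporter such as Burrow), and the topic/partition values are illustrative.

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    # Lag per partition = log-end offset minus the consumer group's
    # committed offset (a missing commit counts as reading from 0).
    return {tp: end_offsets[tp] - committed.get(tp, 0) for tp in end_offsets}

ends = {("clickstream-events", 0): 1_000_000, ("clickstream-events", 1): 980_000}
commits = {("clickstream-events", 0): 995_000, ("clickstream-events", 1): 980_000}
lag = consumer_lag(ends, commits)
# Partition 0 is 5,000 events behind; partition 1 is fully caught up.
```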

# Prometheus + Grafana: Kafka Consumer Lag Monitoring
# Using kafka-exporter for JMX metrics exposure

# prometheus.yml scrape config
scrape_configs:
  - job_name: 'kafka-exporter'
    static_configs:
      - targets: ['kafka-exporter:9308']
    scrape_interval: 10s
  - job_name: 'flink-metrics'
    static_configs:
      - targets: ['flink-jobmanager:9249']
    scrape_interval: 15s

# Grafana Alert Rules (as code)
# alert_rules.yaml
groups:
  - name: streaming-health
    interval: 30s
    rules:
      - alert: HighConsumerLag
        expr: kafka_consumergroup_lag_sum{group="fraud-detection-v3"} > 50000
        for: 5m
        labels:
          severity: critical
          team: data-platform
        annotations:
          summary: "Consumer lag exceeding 50K for fraud detection pipeline"
          description: |
            Consumer group {{ $labels.group }} on topic {{ $labels.topic }}
            has accumulated {{ $value }} events of lag.
            This may cause delayed fraud detection.
      - alert: CheckpointFailure
        expr: flink_jobmanager_job_numberOfFailedCheckpoints > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Flink checkpoint failures detected"
      - alert: E2ELatencyBreach
        expr: histogram_quantile(0.99, streaming_e2e_latency_seconds_bucket) > 5
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "P99 end-to-end latency exceeding 5 seconds"

7.3 Dead Letter Queues

Dead letter queues (DLQs) capture events that cannot be processed successfully -- malformed data, schema violations, processing errors, or sink write failures. Rather than failing the entire pipeline or silently dropping events, DLQ routing preserves problematic events for offline investigation and reprocessing while allowing the main pipeline to continue processing healthy events.

Implement DLQs as dedicated Kafka topics (e.g., fraud-detection.dlq) with extended retention (30-90 days). Each DLQ message should include the original event, the error message, the stack trace, the timestamp of failure, and the pipeline version that failed. Build a DLQ reprocessing tool that allows operators to inspect, fix, and replay DLQ events back into the main pipeline after the root cause is resolved.
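A possible DLQ envelope, sketched as a dataclass carrying everything listed above (field names are illustrative, not a standard schema):

```python
import json
import traceback
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class DlqRecord:
    # Illustrative envelope for a DLQ topic message.
    original_event: str    # raw payload as received (decoded for readability)
    error_message: str
    stack_trace: str
    failed_at: str         # ISO-8601 UTC timestamp of the failure
    pipeline_version: str

def to_dlq_message(raw_event: bytes, exc: Exception, version: str) -> str:
    """Serialize a failed event plus its failure context for the DLQ topic."""
    return json.dumps(asdict(DlqRecord(
        original_event=raw_event.decode("utf-8", errors="replace"),
        error_message=str(exc),
        stack_trace="".join(traceback.format_exception(type(exc), exc, exc.__traceback__)),
        failed_at=datetime.now(timezone.utc).isoformat(),
        pipeline_version=version,
    )))
```

Because the envelope preserves the original payload verbatim, a reprocessing tool can replay the event into the main pipeline unchanged once the root cause is fixed.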

7.4 Alerting Strategy

Effective alerting for streaming systems follows a tiered approach that minimizes alert fatigue while ensuring rapid response to critical issues. Page on-call engineers only for conditions that threaten data loss or SLA breaches (repeated checkpoint failures, sustained lag growth); route slower degradations such as elevated DLQ rates to ticket queues and dashboards for review during business hours.

8. Cost Optimization

8.1 Self-Managed vs. Managed Services

The build-vs-buy decision for streaming infrastructure has significant cost and operational implications. Self-managed Kafka and Flink clusters provide maximum control and potentially lower per-event cost at scale, but require dedicated engineering bandwidth for operations, upgrades, security patching, and capacity management. Managed services (Confluent Cloud, Amazon MSK, Amazon Kinesis, Azure Event Hubs) trade higher per-unit cost for zero operational overhead.

Factor | Self-Managed (Kafka + Flink) | Managed Service
Monthly Cost (100K events/sec) | $3,000-$5,000 (EC2/VM cost) | $6,000-$12,000 (service pricing)
Engineering Headcount | 1-2 dedicated streaming engineers | 0.25-0.5 FTE for configuration
Time to Production | 4-8 weeks | 1-2 weeks
Scaling | Manual (add brokers, rebalance) | Automatic or semi-automatic
Upgrades | Rolling upgrades, downtime risk | Handled by provider
Multi-Region | Complex (MirrorMaker 2) | Built-in replication options
Compliance | Full control over data location | Provider's compliance certifications
Breakeven Point | Generally above 500K events/sec | Generally below 500K events/sec

8.2 Right-Sizing Kafka Clusters

Over-provisioned Kafka clusters are a common source of wasted spend. Key strategies for right-sizing include enabling tiered storage to offload aged log segments to cheaper object storage, matching partition counts to actual consumer parallelism rather than speculative maximums, and reviewing replication factor and retention settings against real durability and replay requirements instead of defaults.

8.3 Auto-Scaling Stream Processing

Stream processing workloads often exhibit daily or weekly traffic patterns -- an e-commerce platform sees 3-5x higher event rates during business hours and promotional events compared to overnight baseline. Fixed-size processing clusters waste compute during low-traffic periods.

Flink's Reactive Mode (introduced in Flink 1.13+) automatically adjusts job parallelism when TaskManagers are added or removed, enabling integration with Kubernetes Horizontal Pod Autoscaler (HPA) for metric-driven scaling. Configure HPA to scale based on consumer lag or CPU utilization, with a cooldown period to prevent thrashing during brief traffic spikes.

# Kubernetes HPA for Flink TaskManagers (auto-scale on consumer lag)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flink-taskmanager-autoscaler
  namespace: streaming
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flink-taskmanager
  minReplicas: 3
  maxReplicas: 24
  metrics:
    - type: External
      external:
        metric:
          name: kafka_consumergroup_lag_sum
          selector:
            matchLabels:
              consumergroup: "analytics-pipeline-v2"
        target:
          type: AverageValue
          averageValue: "10000"   # Scale up when lag > 10K per replica
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120   # Wait 2 min before scaling up
      policies:
        - type: Pods
          value: 4                      # Add max 4 pods at a time
          periodSeconds: 300
    scaleDown:
      stabilizationWindowSeconds: 600   # Wait 10 min before scaling down
      policies:
        - type: Pods
          value: 2
          periodSeconds: 600

8.4 Cost Optimization for APAC Deployments

APAC-specific cost optimization strategies include leveraging regional pricing differences (AWS ap-southeast-1 Singapore is typically 10-15% more expensive than ap-northeast-1 Tokyo for compute), using Reserved Instances or Savings Plans for steady-state Kafka broker workloads (40-60% discount), and deploying Spot/Preemptible instances for non-critical Flink batch reprocessing jobs. For Vietnam-based operations, self-managed Kafka on Vietnamese cloud providers (Viettel Cloud, FPT Cloud) can reduce costs by 30-50% compared to hyperscaler pricing while maintaining data sovereignty compliance under PDPD.

Key cost benchmarks:

- 60-80% storage savings with tiered storage
- 40-60% savings with Reserved Instances
- 3-5x traffic variability (peak vs. baseline)
- 30-50% savings with local cloud providers

9. Frequently Asked Questions

Q: What is the difference between batch processing and stream processing?

Batch processing collects data over a period and processes it in bulk at scheduled intervals (hourly, daily), resulting in latencies of minutes to hours. Stream processing handles data continuously as it arrives, event by event, delivering results in milliseconds to seconds. Batch is suited for historical analytics and reporting, while streaming is essential for real-time fraud detection, live dashboards, IoT monitoring, and any use case requiring sub-second response times. Most modern data architectures use both -- batch for comprehensive historical analysis and training ML models, streaming for operational real-time insights.

Q: Should I use Apache Kafka or Apache Flink for real-time analytics?

Kafka and Flink serve complementary roles and are most commonly deployed together. Apache Kafka is a distributed event streaming platform that excels at durable message ingestion, event storage, and pub/sub distribution. Apache Flink is a stream processing engine that performs complex computations -- windowed aggregations, stateful transformations, pattern detection, and ML inference -- on data flowing through Kafka. Think of Kafka as the highway and Flink as the processing plant. For simpler processing needs (filtering, routing, basic aggregation), Kafka Streams or ksqlDB can run within the Kafka ecosystem itself without a separate Flink cluster.

Q: What is the Lambda architecture and when should I use it?

Lambda architecture runs parallel batch and speed (streaming) layers, merging results in a serving layer to provide both accurate historical analysis and low-latency real-time views. Use Lambda when you need guaranteed accuracy from batch recomputation alongside real-time approximations, which is common in financial reporting and compliance scenarios. However, Lambda introduces operational complexity from maintaining two separate codebases and reconciling potential divergences between batch and speed layer outputs. Many organizations now prefer Kappa architecture, which uses a single streaming pipeline for both real-time and historical reprocessing by replaying events from Kafka's durable log.

Q: How do I achieve exactly-once processing semantics in stream processing?

Exactly-once semantics ensure each event is processed precisely once despite failures and retries. The mechanism varies by system: Apache Kafka supports exactly-once via idempotent producers (enable.idempotence=true) and transactional APIs. Apache Flink achieves it through distributed snapshots using the Chandy-Lamport algorithm with aligned or unaligned checkpointing. Spark Structured Streaming uses write-ahead logs and idempotent sinks. For true end-to-end exactly-once, the entire pipeline -- source, processor, and sink -- must coordinate. Common sink-side patterns include two-phase commit protocols (Flink's TwoPhaseCommitSinkFunction), idempotent writes with deduplication keys (e.g., upsert to database using event_id as primary key), and transactional outbox patterns.

Q: What are the cost implications of real-time streaming versus batch processing?

Real-time streaming typically costs 2-5x more than equivalent batch processing due to always-on compute resources (streaming clusters run 24/7 vs batch jobs that run on schedule), higher memory requirements for maintaining processing state, and the operational complexity requiring specialized engineering talent. However, streaming delivers business value that batch cannot: sub-second fraud detection can save financial institutions millions annually in prevented losses, real-time recommendations increase e-commerce conversion rates by 15-30%, and instant IoT alerting prevents costly equipment failures. The ROI calculation should compare the cost premium against the quantifiable business value of timeliness. Managed services like Amazon Kinesis and Azure Event Hubs reduce operational costs by 40-60% versus self-managed Kafka/Flink clusters, making streaming accessible to teams without deep infrastructure expertise.

Q: How do I handle late-arriving data in streaming analytics?

Late data -- events arriving after their processing window has closed -- is an inherent challenge in distributed streaming systems caused by network delays, client-side buffering, mobile device offline periods, and clock skew. This is handled through watermarks and allowed lateness. Watermarks are heuristic timestamps that estimate the progress of event time through the system; events arriving before the watermark are processed normally within their window. Configure an allowed lateness period (e.g., 5 minutes) during which late events trigger window recomputation and result updates. Events arriving beyond the allowed lateness threshold are routed to a side output or dead letter queue for offline processing. Both Apache Flink and Spark Structured Streaming support watermark-based late data handling with configurable policies. The watermark delay and allowed lateness should be tuned based on empirical analysis of your data's late-arrival distribution.
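The bounded-out-of-orderness heuristic described above can be sketched in a few lines of plain Python (generator name and tuple shape are illustrative): the watermark trails the maximum event time seen so far by a fixed delay, and an event is considered late when its timestamp is already behind the current watermark.

```python
def with_watermarks(events, max_delay):
    # Bounded out-of-orderness: the watermark trails the maximum event
    # time seen so far by max_delay seconds. Events whose timestamp is
    # behind the watermark arrive after their window would have fired.
    max_event_time = float("-inf")
    for ts, payload in events:
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - max_delay
        yield ts, payload, ts < watermark  # (event time, payload, is_late)

stream = [(100, "a"), (130, "b"), (90, "c"), (95, "d")]
flags = [late for _, _, late in with_watermarks(stream, max_delay=30)]
# After "b" arrives, the watermark advances to 130 - 30 = 100, so the
# out-of-order events "c" (t=90) and "d" (t=95) are flagged as late.
```

Tuning max_delay is the latency/completeness trade-off: a larger delay admits more stragglers into their windows but holds results back longer before windows are finalized.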

10. Get Started with Real-Time Streaming Analytics

Implementing a production-grade streaming analytics platform requires expertise across distributed systems, data engineering, and cloud infrastructure. Whether you are building a greenfield streaming platform or migrating from batch-only architecture, the key is to start with a high-value use case (fraud detection, IoT monitoring, or real-time dashboarding), prove the architecture at scale, and then expand to additional use cases.

Seraphim Streaming Analytics Engagement Model

Phase 1 -- Architecture Assessment (2 weeks): We evaluate your current data architecture, identify high-value streaming use cases, and design a target-state streaming platform architecture including technology selection (Kafka vs. managed services), infrastructure sizing, and integration patterns.

Phase 2 -- Platform Build (4-8 weeks): Deploy the streaming infrastructure (Kafka cluster, stream processing engine, monitoring stack), build the first production pipeline end-to-end, and establish CI/CD and operational runbooks.

Phase 3 -- Production & Scale (ongoing): Launch the platform with real production traffic, add additional streaming use cases, optimize cost and performance, and train your team on operations and pipeline development.

Schedule a streaming analytics consultation to discuss your real-time data processing requirements.

Get Your Real-Time Streaming Analytics Assessment

Receive a customized architecture report including technology recommendations, infrastructure sizing, cost analysis, and implementation roadmap for your streaming analytics platform.

© 2026 Seraphim Co., Ltd.