Batch and Streaming Feature Pipelines

Build point-in-time training-run features from events and preserve the same meaning in online serving.

A valid training-job SLA row is useful only if the pipeline can build it honestly at scale. The system must produce millions of historical rows for training and keep fresh values ready for live scoring jobs.

Those jobs run at different speeds, but they share one question: what did the scoring system know at the prediction timestamp? If an offline replay sees facts that live serving couldn't have known yet, training gets an unfairly clean view of history.

Figure: Two-clock point-in-time quadrant for run R-204. At `10:00`, only E1 lies inside both time boundaries: it occurred and arrived before `10:00`. E2 is the subtle failure: it occurred at `09:30` but arrived at `11:00`, so an event-time-only replay leaks information unavailable to live serving. E3 occurred after `10:00` and is in the future. The adjacent as-of states advance from queued to warmup_spike to gpu_pressure.

One event has two relevant times

Consider a training-job telemetry event:

Field	Example	Meaning
`event_time`	May 1, 09:30	when the scheduler recorded the heartbeat
`ingested_at`	May 1, 11:00	when this pipeline learned about the heartbeat
`prediction_time`	May 1, 10:00	when the scoring system asked for a training SLA feature row

The heartbeat happened before the scoring request, but it arrived one hour afterward. A live prediction at 10:00 couldn't use it. A faithful replay excludes facts that occurred too late or arrived too late:

event_time <= prediction_time
and ingested_at <= prediction_time

The first boundary prevents future-event leakage. The second reproduces what the live system knew. Keep both timestamps even if your first source rarely arrives late.

Feature stores commonly provide point-in-time historical retrieval. Feast documents historical joins that scan backward from each entity timestamp within a configured time-to-live (TTL) window [1]. Your pipeline still owns the exact replay contract, including whether ingestion time belongs in it.
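The backward scan can be sketched in plain Python. This is not Feast's API: `point_in_time_value`, the `ttl` argument, and the row layout are hypothetical names chosen for illustration, and the `ingested_at` check reflects this lesson's replay contract rather than anything a feature store applies for you.

```python
from datetime import datetime, timedelta

def point_in_time_value(rows, entity, as_of, ttl):
    """Return the newest value visible at as_of, or None if nothing is inside the TTL window.

    Hypothetical helper: each row is a dict with entity, event_time, ingested_at, value.
    """
    candidates = [
        row
        for row in rows
        if row["entity"] == entity
        and as_of - ttl <= row["event_time"] <= as_of  # scan backward, bounded by the TTL
        and row["ingested_at"] <= as_of                # availability boundary (assumed contract)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row["event_time"])["value"]

rows = [
    {"entity": "R-204", "event_time": datetime(2026, 5, 1, 8, 0),
     "ingested_at": datetime(2026, 5, 1, 8, 2), "value": "queued"},
    {"entity": "R-204", "event_time": datetime(2026, 5, 1, 9, 30),
     "ingested_at": datetime(2026, 5, 1, 11, 0), "value": "warmup_spike"},
]
as_of = datetime(2026, 5, 1, 10, 0)

print(point_in_time_value(rows, "R-204", as_of, ttl=timedelta(hours=4)))  # queued
print(point_in_time_value(rows, "R-204", as_of, ttl=timedelta(hours=1)))  # None
```

With a one-hour TTL the 08:00 heartbeat is too old and the 09:30 heartbeat has not arrived yet, so the lookup returns nothing. The missing-value policy then decides what the training row contains.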

Figure: Pipeline components. Event log (event_time + ingested_at), feature definition v1 (time boundary + missing policy), batch replay (historical training rows), and stream updater (newly arrived events).

Rebuild what serving knew

Build the replay in small steps. The first cell defines one normal event, one late arrival, and one future event for run R-204. Run the cells in order.

01-inspect-event-clocks.py

from datetime import datetime
from datetime import timedelta

def dt(value):
    return datetime.fromisoformat(value)

events = [
    {
        "id": "E1",
        "run": "R-204",
        "event_time": dt("2026-05-01T08:00:00"),
        "ingested_at": dt("2026-05-01T08:02:00"),
        "source_version": 1,
        "status": "queued",
    },
    {
        "id": "E2",
        "run": "R-204",
        "event_time": dt("2026-05-01T09:30:00"),
        "ingested_at": dt("2026-05-01T11:00:00"),
        "source_version": 1,
        "status": "warmup_spike",
    },
    {
        "id": "E3",
        "run": "R-204",
        "event_time": dt("2026-05-01T13:00:00"),
        "ingested_at": dt("2026-05-01T13:01:00"),
        "source_version": 1,
        "status": "gpu_pressure",
    },
]
prediction_time = dt("2026-05-01T10:00:00")

for event in events:
    lag_minutes = int((event["ingested_at"] - event["event_time"]).total_seconds() / 60)
    print(
        f'{event["id"]} status={event["status"]} '
        f'occurred={event["event_time"]:%H:%M} arrived={event["ingested_at"]:%H:%M} '
        f'lag={lag_minutes}m'
    )

Output

E1 status=queued occurred=08:00 arrived=08:02 lag=2m
E2 status=warmup_spike occurred=09:30 arrived=11:00 lag=90m
E3 status=gpu_pressure occurred=13:00 arrived=13:01 lag=1m

A filter on event_time alone looks reasonable, but it leaks E2 into a replay of the 10:00 decision.

02-expose-event-time-only-bug.py

def event_time_only(events, run_id, at):
    return [
        event
        for event in events
        if event["run"] == run_id and event["event_time"] <= at
    ]

naive_replay = event_time_only(events, "R-204", prediction_time)
print("event-time-only replay:", [event["status"] for event in naive_replay])
print("leaked late arrival:", "warmup_spike" in {event["status"] for event in naive_replay})

Output

event-time-only replay: ['queued', 'warmup_spike']
leaked late arrival: True

Add the availability boundary. Now replay sees the same facts live serving saw at 10:00.

03-filter-by-decision-time.py

def known_by(events, run_id, at):
    return [
        event
        for event in events
        if (
            event["run"] == run_id
            and event["event_time"] <= at
            and event["ingested_at"] <= at
        )
    ]

faithful_replay = known_by(events, "R-204", prediction_time)
print("decision-time replay:", [event["status"] for event in faithful_replay])
print("late heartbeat excluded:", "warmup_spike" not in {event["status"] for event in faithful_replay})

Output

decision-time replay: ['queued']
late heartbeat excluded: True

A historical training snapshot needs the latest visible fact for each prediction request. The warmup_spike status becomes usable once it arrives at 11:00; gpu_pressure becomes usable from 13:01 onward.

04-build-as-of-rows.py

def state_order(event):
    return (event["event_time"], event.get("source_version", 0))

def revision_value(event):
    return event["status"]

def as_of_status(events, run_id, at):
    visible = known_by(events, run_id, at)
    if not visible:
        return "missing"
    latest_order = max(state_order(event) for event in visible)
    latest = [event for event in visible if state_order(event) == latest_order]
    if len({revision_value(event) for event in latest}) > 1:
        return "quarantined"
    return revision_value(latest[0])

for requested_at in [
    dt("2026-05-01T10:00:00"),
    dt("2026-05-01T12:00:00"),
    dt("2026-05-01T16:00:00"),
]:
    print(f"{requested_at:%H:%M} -> {as_of_status(events, 'R-204', requested_at)}")

Output

10:00 -> queued
12:00 -> warmup_spike
16:00 -> gpu_pressure

Sometimes analysts also need a corrected history of what truly happened by 10:00, regardless of when the source delivered it. Keep that artifact separate. Corrected history is useful for auditing training operations, but it isn't a faithful replay of the model's information set.

05-separate-corrected-history.py

decision_view = [event["status"] for event in known_by(events, "R-204", prediction_time)]
corrected_view = [event["status"] for event in event_time_only(events, "R-204", prediction_time)]

print("decision-time view:", decision_view)
print("corrected event-time view:", corrected_view)

Output

decision-time view: ['queued']
corrected event-time view: ['queued', 'warmup_spike']

Keep online state current

Batch replay, stream updates, and online reads are related but distinct jobs:

Job	Reads	Produces	Typical latency
batch replay	bounded event history	versioned training snapshot	minutes or hours
stream updater	newly arrived events	current feature state	seconds
online read	current feature state	one scoring row	milliseconds

Different stores are acceptable. Different meanings aren't. If batch computes a seven-day backlog mean while online returns a one-hour queue count under the same name, offline metrics can't predict live behavior.
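One way to keep the meaning aligned is to have both paths call a single feature definition. The sketch below assumes a hypothetical `backlog_mean_7d` feature and made-up sample values. The batch path reads full history while the online path reads a trimmed buffer, yet both apply the same window rule.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)  # the window length is part of the feature's meaning

def backlog_mean_7d(samples, as_of):
    """Mean backlog over the seven days ending at as_of (hypothetical feature)."""
    values = [
        value
        for observed_at, value in samples
        if as_of - WINDOW < observed_at <= as_of
    ]
    return sum(values) / len(values) if values else None

as_of = datetime(2026, 5, 1, 10, 0)

# Batch replay reads the full bounded history.
history = [
    (datetime(2026, 4, 20, 12, 0), 40),  # older than seven days, ignored
    (datetime(2026, 4, 28, 12, 0), 4),
    (datetime(2026, 4, 30, 12, 0), 8),
]
# The online store keeps only recent samples (assumed eviction rule).
online_buffer = [sample for sample in history if sample[0] > as_of - WINDOW]

offline_value = backlog_mean_7d(history, as_of)
online_value = backlog_mean_7d(online_buffer, as_of)
print(offline_value, online_value)  # 6.0 6.0
```

Because the window lives in one place, changing it to one hour changes both paths together instead of silently splitting them.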

Feast documents an online store built for low-latency feature serving. For each entity key, it stores only the latest feature values rather than full history [1]. Google Cloud's MLOps guidance describes a feature store as an optional shared repository for definitions, storage, and access across high-throughput batch and low-latency serving workloads [2].

A late, older event mustn't overwrite newer online state. Same-time corrections need a deterministic rule too. Use the same source-owned (event_time, source_version) order in batch replay and online updates: version 2 replaces version 1, including when a stale version-1 retry arrives later. In a real source contract, require a stable sequence or revision field rather than trusting arrival order.

06-protect-latest-online-state.py

online_state = {}

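def apply_latest_state(state, event):
    previous = state.get(event["run"])
    if previous is not None:
        # Redelivery of the event that is already stored.
        if event["id"] == previous.get("id"):
            return "ignored duplicate event"
        # An older (event_time, source_version) never overwrites newer state.
        if state_order(event) < state_order(previous):
            return "ignored stale event"
        if state_order(event) == state_order(previous):
            # Same order key: once quarantined, stay quarantined.
            if previous.get("quarantined"):
                return "quarantine persists"
            # Same order key and same value: a harmless repeat.
            if revision_value(event) == revision_value(previous):
                return "ignored duplicate revision"
            # Same order key but different values: ambiguous, so quarantine.
            state[event["run"]] = {
                "run": event["run"],
                "event_time": event["event_time"],
                "source_version": event.get("source_version", 0),
                "status": "quarantined",
                "quarantined": True,
            }
            return "quarantined conflicting revision"
    # First event for this run, or strictly newer than the stored one.
    state[event["run"]] = event
    return "stored"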

def online_status(state, run_id):
    current = state.get(run_id)
    if current is None:
        return "missing"
    return "quarantined" if current.get("quarantined") else revision_value(current)

arrivals = [
    {
        "id": "E4",
        "run": "R-204",
        "event_time": dt("2026-05-01T10:30:00"),
        "ingested_at": dt("2026-05-01T10:31:00"),
        "source_version": 1,
        "status": "worker_scaled",
    },
    {
        "id": "E4-correction",
        "run": "R-204",
        "event_time": dt("2026-05-01T10:30:00"),
        "ingested_at": dt("2026-05-01T10:34:00"),
        "source_version": 2,
        "status": "worker_scaled_corrected",
    },
    {
        "id": "E4-v1-retry",
        "run": "R-204",
        "event_time": dt("2026-05-01T10:30:00"),
        "ingested_at": dt("2026-05-01T10:36:00"),
        "source_version": 1,
        "status": "worker_scaled",
    },
    {
        "id": "E1-retry",
        "run": "R-204",
        "event_time": dt("2026-05-01T08:00:00"),
        "ingested_at": dt("2026-05-01T11:05:00"),
        "status": "queued",
    },
]

for event in arrivals:
    print(f'{event["id"]}: {apply_latest_state(online_state, event)}')
print("current status:", online_status(online_state, "R-204"))

Output

E4: stored
E4-correction: stored
E4-v1-retry: ignored stale event
E1-retry: ignored stale event
current status: worker_scaled_corrected

Latest-value state isn't enough for aggregates. A duplicated queue update can silently inflate queue backlog unless the updater is idempotent. Idempotent means retrying the same event doesn't change the result after its first successful application. The updater should also reject impossible aggregate states instead of publishing them.

07-deduplicate-aggregate-updates.py

seen_event_ids = set()
origin_backlog = 0

def apply_queue_update(event):
    global origin_backlog
    if event["id"] in seen_event_ids:
        return "duplicate skipped"
    next_backlog = origin_backlog + event["delta"]
    if next_backlog < 0:
        return "blocked negative backlog"
    seen_event_ids.add(event["id"])
    origin_backlog = next_backlog
    return "applied"

queue_updates = [
    {"id": "Q1", "delta": 3},
    {"id": "Q1", "delta": 3},
    {"id": "Q2", "delta": -1},
    {"id": "Q3", "delta": -5},
]

for update in queue_updates:
    result = apply_queue_update(update)
    print(f'{update["id"]}: {result}; backlog={origin_backlog}')

Output

Q1: applied; backlog=3
Q1: duplicate skipped; backlog=3
Q2: applied; backlog=2
Q3: blocked negative backlog; backlog=2

The set makes sequential retries visible, but production needs a durable atomic write: record the event ID and apply its aggregate change together. Otherwise a worker can crash between those steps and a retry can lose or repeat an update. Use a transaction, conditional write, or stream processor state primitive with the same contract.
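A minimal sketch of that contract, using an in-memory SQLite database from the Python standard library as a stand-in for a durable store. The table names, the `apply_queue_update_atomic` helper, and the single `'origin'` row are assumptions for illustration. The event ID insert and the backlog update share one transaction, so a failure in either rolls back both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE applied_events (event_id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE backlog ("
    "name TEXT PRIMARY KEY, value INTEGER NOT NULL CHECK (value >= 0))"
)
conn.execute("INSERT INTO backlog VALUES ('origin', 0)")
conn.commit()

def apply_queue_update_atomic(conn, event):
    """Record the event ID and apply its delta in one transaction."""
    try:
        with conn:  # commits on success, rolls back if any statement raises
            # A duplicate ID violates the primary key and aborts the transaction.
            conn.execute("INSERT INTO applied_events VALUES (?)", (event["id"],))
            # A negative result violates the CHECK constraint and aborts it too.
            conn.execute(
                "UPDATE backlog SET value = value + ? WHERE name = 'origin'",
                (event["delta"],),
            )
        return "applied"
    except sqlite3.IntegrityError:
        return "rejected"

def current_backlog(conn):
    return conn.execute("SELECT value FROM backlog WHERE name = 'origin'").fetchone()[0]

updates = [
    {"id": "Q1", "delta": 3},
    {"id": "Q1", "delta": 3},
    {"id": "Q2", "delta": -1},
    {"id": "Q3", "delta": -5},
]
results = [apply_queue_update_atomic(conn, update) for update in updates]
print(results, current_backlog(conn))
```

The rejected Q3 leaves no row in `applied_events`, matching the in-memory version where a blocked update is not marked as seen. A production store would need the same all-or-nothing behavior from its own transaction or conditional-write primitive.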

Bound lateness and freshness

A stream processor can't wait forever for missing events. It commonly tracks a watermark, an estimate of how far event-time processing has progressed. A watermark isn't a guarantee: an event with an older timestamp can still arrive later. A product policy decides whether that event can update a recent window or must wait for a controlled replay.

This miniature policy accepts updates no more than 45 minutes behind the watermark. A real stream processor also defines windows, triggers, state cleanup, and monitoring.

08-classify-late-events.py

def late_event_action(event_time, watermark, allowed_lateness):
    lateness = watermark - event_time
    if lateness <= timedelta(0):
        return "on-time path"
    if lateness <= allowed_lateness:
        return "accept late update"
    return "quarantine for replay"

watermark = dt("2026-05-01T12:00:00")
allowed_lateness = timedelta(minutes=45)

for event_time in [
    dt("2026-05-01T12:05:00"),
    dt("2026-05-01T11:30:00"),
    dt("2026-05-01T10:00:00"),
]:
    print(f"{event_time:%H:%M} -> {late_event_action(event_time, watermark, allowed_lateness)}")

Output

12:05 -> on-time path
11:30 -> accept late update
10:00 -> quarantine for replay
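The lab takes the watermark as given. One common heuristic, assumed here rather than prescribed by this lesson, estimates it as the newest event time seen so far minus a fixed out-of-orderness bound. `WatermarkTracker` and `max_out_of_orderness` are hypothetical names.

```python
from datetime import datetime, timedelta

class WatermarkTracker:
    """Heuristic watermark: newest event time seen minus a fixed disorder bound."""

    def __init__(self, max_out_of_orderness):
        self.max_out_of_orderness = max_out_of_orderness
        self.max_event_time = None

    def observe(self, event_time):
        # Only a newer event time can advance the watermark.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return self.watermark()

    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_out_of_orderness

tracker = WatermarkTracker(timedelta(minutes=5))
for hhmm in ["11:50", "12:05", "11:30"]:
    event_time = datetime.fromisoformat(f"2026-05-01T{hhmm}:00")
    print(f"{hhmm} -> watermark {tracker.observe(event_time):%H:%M}")
```

The 11:30 event arrives after the watermark has reached 12:00 and does not move it backward. Whether that event still updates state is the lateness policy's decision, not the tracker's.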

Freshness requires a serving policy:

09-apply-serving-freshness-policy.py

def serving_mode(updated_at, requested_at):
    lag_minutes = int((requested_at - updated_at).total_seconds() / 60)
    if lag_minutes < 0:
        raise ValueError("feature update is from the future")
    if lag_minutes <= 15:
        return "normal scoring"
    if lag_minutes <= 45:
        return "fallback and log degraded mode"
    return "abstain from narrow SLA estimate"

requested_at = dt("2026-05-01T12:00:00")
for updated_at in [
    dt("2026-05-01T11:52:00"),
    dt("2026-05-01T11:30:00"),
    dt("2026-05-01T10:30:00"),
]:
    print(f"{updated_at:%H:%M} -> {serving_mode(updated_at, requested_at)}")

Output

11:52 -> normal scoring
11:30 -> fallback and log degraded mode
10:30 -> abstain from narrow SLA estimate

No model fixes a missing or temporally invalid input. Record freshness lag with each prediction trace so operators can distinguish model error from stale state.
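A trace record along these lines is one way to carry freshness with each prediction. The `prediction_trace` helper and its field names are hypothetical; the point is only that feature age and serving mode are logged next to the request.

```python
from datetime import datetime

def prediction_trace(run_id, requested_at, feature_updated_at, mode):
    """Build a log record that carries feature age alongside the serving decision."""
    lag_seconds = (requested_at - feature_updated_at).total_seconds()
    return {
        "run": run_id,
        "requested_at": requested_at.isoformat(),
        "feature_updated_at": feature_updated_at.isoformat(),
        "freshness_lag_seconds": int(lag_seconds),
        "serving_mode": mode,
    }

trace = prediction_trace(
    "R-204",
    requested_at=datetime(2026, 5, 1, 12, 0),
    feature_updated_at=datetime(2026, 5, 1, 11, 30),
    mode="fallback and log degraded mode",
)
print(trace["freshness_lag_seconds"], trace["serving_mode"])
```

An operator reading this trace sees a 1800-second lag and a degraded mode, which points at stale state before anyone suspects the model.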

Compare Offline and Online Rows

Before promotion, replay a sample of requests offline and compare those rows with rows captured from online serving. Start with exact equality for categorical and integer fields; define tolerances explicitly if floating-point aggregations need them.

10-check-offline-online-parity.py

def mismatches(left, right):
    keys = left.keys() | right.keys()
    return {key: (left.get(key), right.get(key)) for key in keys if left.get(key) != right.get(key)}

correction_offline_row = {
    "status": as_of_status(arrivals, "R-204", dt("2026-05-01T10:40:00")),
}
correction_online_row = {"status": online_status(online_state, "R-204")}
correction_differences = mismatches(correction_offline_row, correction_online_row)
print("same-time retry mismatches:", correction_differences)

conflicting_revisions = [
    {
        "id": "E5-a",
        "run": "R-205",
        "event_time": dt("2026-05-01T10:30:00"),
        "ingested_at": dt("2026-05-01T10:31:00"),
        "source_version": 7,
        "status": "worker_scaled",
    },
    {
        "id": "E5-b",
        "run": "R-205",
        "event_time": dt("2026-05-01T10:30:00"),
        "ingested_at": dt("2026-05-01T10:32:00"),
        "source_version": 7,
        "status": "worker_scale_failed",
    },
]
conflict_online_state = {}
for event in conflicting_revisions:
    apply_latest_state(conflict_online_state, event)
conflict_offline = as_of_status(conflicting_revisions, "R-205", dt("2026-05-01T10:40:00"))
conflict_online = online_status(conflict_online_state, "R-205")
assert conflict_offline == conflict_online == "quarantined"
print("equal-order conflict parity:", conflict_offline, "==", conflict_online)

for event in events:
    if event["ingested_at"] <= dt("2026-05-01T16:00:00"):
        apply_latest_state(online_state, event)

offline_row = {
    "status": as_of_status(events, "R-204", dt("2026-05-01T16:00:00")),
    "origin_backlog": origin_backlog,
}
online_row = {
    "status": online_status(online_state, "R-204"),
    "origin_backlog": origin_backlog,
}

print("offline row:", offline_row)
print("online row:", online_row)
print("mismatches:", mismatches(offline_row, online_row))

Output

same-time retry mismatches: {}
equal-order conflict parity: quarantined == quarantined
offline row: {'status': 'gpu_pressure', 'origin_backlog': 2}
online row: {'status': 'gpu_pressure', 'origin_backlog': 2}
mismatches: {}
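When rows carry floating-point aggregates, exact equality can flag differences that come only from summation order. A hedged sketch of an explicit tolerance follows: `mismatches_with_tolerance` and the `backlog_mean` field are hypothetical, and the default tolerance mirrors `math.isclose` rather than any recommendation from this lesson.

```python
import math

def mismatches_with_tolerance(left, right, rel_tol=1e-9, abs_tol=0.0):
    """Exact equality for everything except float pairs, which use math.isclose."""
    differences = {}
    for key in left.keys() | right.keys():
        a, b = left.get(key), right.get(key)
        if isinstance(a, float) and isinstance(b, float):
            same = math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        else:
            same = a == b  # categorical and integer fields stay exact
        if not same:
            differences[key] = (a, b)
    return differences

offline = {"status": "gpu_pressure", "origin_backlog": 2, "backlog_mean": 0.1 + 0.2}
online = {"status": "gpu_pressure", "origin_backlog": 2, "backlog_mean": 0.3}

print(offline == online)                           # False: 0.1 + 0.2 is not exactly 0.3
print(mismatches_with_tolerance(offline, online))  # {}
```

Record the chosen tolerance with the parity result so a reviewer can tell a tolerated rounding difference from a real semantic drift.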

A stale online row produces a visible failure instead of a mysterious model regression.

11-detect-stale-online-state.py

stale_online_row = {
    "status": "worker_scaled",
    "origin_backlog": origin_backlog,
}
differences = mismatches(offline_row, stale_online_row)
print("mismatches:", differences)
print("release allowed:", not differences)

Output

mismatches: {'status': ('gpu_pressure', 'worker_scaled')}
release allowed: False

Publish a compact receipt beside every snapshot and candidate model. It makes replay semantics inspectable instead of leaving them hidden in job code.

12-publish-replay-receipt.py

receipt = {
    "feature_definition": "training_sla_features_v1",
    "replay_boundary": "event_time <= prediction_time and ingested_at <= prediction_time",
    "offline_online_ordering": "(event_time, source_version); equal-key conflicts quarantine",
    "aggregate_updates": "atomic event-id dedup; reject negative backlog",
    "snapshot": "training-sla-train-2026-05-01",
    "parity_samples_checked": 2,
    "freshness_policy": "sla-freshness-v1",
}

assert as_of_status(events, "R-204", dt("2026-05-01T10:00:00")) == "queued"
assert as_of_status(events, "R-204", dt("2026-05-01T12:00:00")) == "warmup_spike"
assert not correction_differences
assert not mismatches(offline_row, online_row)

for key, value in receipt.items():
    print(f"{key}={value}")

Output

feature_definition=training_sla_features_v1
replay_boundary=event_time <= prediction_time and ingested_at <= prediction_time
offline_online_ordering=(event_time, source_version); equal-key conflicts quarantine
aggregate_updates=atomic event-id dedup; reject negative backlog
snapshot=training-sla-train-2026-05-01
parity_samples_checked=2
freshness_policy=sla-freshness-v1

Practice: break the pipeline

Run the lab, then change one condition at a time:

Remove the ingested_at <= at condition from known_by. Confirm the 10:00 row leaks warmup_spike.
Remove the shared ordering checks. Confirm E4-v1-retry undoes the same-time correction online and creates a parity mismatch; then confirm E1-retry can regress state to queued.
Remove seen_event_ids. Confirm the duplicated Q1 update inflates backlog. Then remove the negative-state check and confirm Q3 publishes an impossible queue size.
Change online_row["status"] to "worker_scaled". Confirm parity blocks release.
Move a feature timestamp 90 minutes behind the request. Confirm serving abstains from a narrow SLA estimate.

Explain the replay without looking back

What strong answers show

Evidence	What a strong answer shows
temporal correctness	constructs training rows with explicit event-time and availability boundaries
online state	handles older arrivals, duplicate updates, and freshness failures deliberately
operational replay	versions snapshots, parity receipts, and corrected analytical views separately

When the Contract Breaks

Symptom	Cause	Fix
Training gets better after a backfill but serving doesn't	replay used facts that arrived later	filter by event time and ingestion time
Online status moves backward after retry or same-time corrections disagree	stale or ambiguous event overwrote latest state	compare event timestamp plus source revision; quarantine unresolved ties
Backlog inflates during redelivery	aggregate update ran twice	deduplicate by stable event ID in same atomic write as aggregate change
Reliable model emits bad SLA estimates during upstream lag	freshness wasn't part of serving policy	trace age and fail to a safer response

Next Step

Continue to Gradient Boosted Trees in Production

You can now generate point-in-time correct feature rows. Next you'll train a strong tabular baseline and decide when its validation evidence earns deployment.

References

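[1] Feast Contributors. Feast: Production Feature Store for Machine Learning. 2024. https://feast.dev/

[2] Google Cloud. MLOps: Continuous Delivery and Automation Pipelines in Machine Learning. Official documentation, 2026. https://docs.cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning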

Back to Topics

LearnProduction ML SystemsBatch and Streaming Feature Pipelines

⚙️MediumMLOps & Deployment

Batch and Streaming Feature Pipelines

Build point-in-time training-run features from events and preserve the same meaning in online serving.

12 min read

Learning path

Step 43 of 158 in the full curriculum

Feature Engineering for Production ML Gradient Boosted Trees in Production

One event has two relevant times

Consider a training-job telemetry event:

Field	Example	Meaning
`event_time`	May 1, 09:30	when the scheduler recorded the heartbeat
`ingested_at`	May 1, 11:00	when this pipeline learned about the heartbeat
`prediction_time`	May 1, 10:00	when the scoring system asked for a training SLA feature row

text

event_time <= prediction_time
and ingested_at <= prediction_time

The first boundary prevents future-event leakage. The second reproduces what the live system knew. Keep both timestamps even if your first source rarely arrives late.

Rebuild what serving knew

Build the replay in small steps. The first cell defines one normal event, one late arrival, and one future event for run R-204. Run the cells in order.

01-inspect-event-clocks.py

from datetime import datetime
from datetime import timedelta

def dt(value):
    return datetime.fromisoformat(value)

events = [
    {
        "id": "E1",
        "run": "R-204",
        "event_time": dt("2026-05-01T08:00:00"),
        "ingested_at": dt("2026-05-01T08:02:00"),
        "source_version": 1,
        "status": "queued",
    },
    {
        "id": "E2",
        "run": "R-204",
        "event_time": dt("2026-05-01T09:30:00"),
        "ingested_at": dt("2026-05-01T11:00:00"),
        "source_version": 1,
        "status": "warmup_spike",
    },
    {
        "id": "E3",
        "run": "R-204",
        "event_time": dt("2026-05-01T13:00:00"),
        "ingested_at": dt("2026-05-01T13:01:00"),
        "source_version": 1,
        "status": "gpu_pressure",
    },
]
prediction_time = dt("2026-05-01T10:00:00")

for event in events:
    lag_minutes = int((event["ingested_at"] - event["event_time"]).total_seconds() / 60)
    print(
        f'{event["id"]} status={event["status"]} '
        f'occurred={event["event_time"]:%H:%M} arrived={event["ingested_at"]:%H:%M} '
        f'lag={lag_minutes}m'
    )

Output

E1 status=queued occurred=08:00 arrived=08:02 lag=2m
E2 status=warmup_spike occurred=09:30 arrived=11:00 lag=90m
E3 status=gpu_pressure occurred=13:00 arrived=13:01 lag=1m

A filter on event_time alone looks reasonable, but it leaks E2 into a replay of the 10:00 decision.

02-expose-event-time-only-bug.py

def event_time_only(events, run_id, at):
    return [
        event
        for event in events
        if event["run"] == run_id and event["event_time"] <= at
    ]

naive_replay = event_time_only(events, "R-204", prediction_time)
print("event-time-only replay:", [event["status"] for event in naive_replay])
print("leaked late arrival:", "warmup_spike" in {event["status"] for event in naive_replay})

Output

event-time-only replay: ['queued', 'warmup_spike']
leaked late arrival: True

Add the availability boundary. Now replay sees the same facts live serving saw at 10:00.

03-filter-by-decision-time.py

def known_by(events, run_id, at):
    return [
        event
        for event in events
        if (
            event["run"] == run_id
            and event["event_time"] <= at
            and event["ingested_at"] <= at
        )
    ]

faithful_replay = known_by(events, "R-204", prediction_time)
print("decision-time replay:", [event["status"] for event in faithful_replay])
print("late heartbeat excluded:", "warmup_spike" not in {event["status"] for event in faithful_replay})

Output

decision-time replay: ['queued']
late heartbeat excluded: True

A historical training snapshot needs the latest visible fact for each prediction request. The warmup_spike becomes usable after it arrives at 11:00; gpu_pressure becomes usable after 13:01.

04-build-as-of-rows.py

def state_order(event):
    return (event["event_time"], event.get("source_version", 0))

def revision_value(event):
    return event["status"]

def as_of_status(events, run_id, at):
    visible = known_by(events, run_id, at)
    if not visible:
        return "missing"
    latest_order = max(state_order(event) for event in visible)
    latest = [event for event in visible if state_order(event) == latest_order]
    if len({revision_value(event) for event in latest}) > 1:
        return "quarantined"
    return revision_value(latest[0])

for requested_at in [
    dt("2026-05-01T10:00:00"),
    dt("2026-05-01T12:00:00"),
    dt("2026-05-01T16:00:00"),
]:
    print(f"{requested_at:%H:%M} -> {as_of_status(events, 'R-204', requested_at)}")

Output

00 -> queued
00 -> warmup_spike
00 -> gpu_pressure

05-separate-corrected-history.py

decision_view = [event["status"] for event in known_by(events, "R-204", prediction_time)]
corrected_view = [event["status"] for event in event_time_only(events, "R-204", prediction_time)]

print("decision-time view:", decision_view)
print("corrected event-time view:", corrected_view)

Output

decision-time view: ['queued']
corrected event-time view: ['queued', 'warmup_spike']

Keep online state current

Batch replay, stream updates, and online reads are related but distinct jobs:

Job	Reads	Produces	Typical latency
batch replay	bounded event history	versioned training snapshot	minutes or hours
stream updater	newly arrived events	current feature state	seconds
online read	current feature state	one scoring row	milliseconds

06-protect-latest-online-state.py

online_state = {}

def apply_latest_state(state, event):
    previous = state.get(event["run"])
    if previous is not None:
        if event["id"] == previous.get("id"):
            return "ignored duplicate event"
        if state_order(event) < state_order(previous):
            return "ignored stale event"
        if state_order(event) == state_order(previous):
            if previous.get("quarantined"):
                return "quarantine persists"
            if revision_value(event) == revision_value(previous):
                return "ignored duplicate revision"
            state[event["run"]] = {
                "run": event["run"],
                "event_time": event["event_time"],
                "source_version": event.get("source_version", 0),
                "status": "quarantined",
                "quarantined": True,
            }
            return "quarantined conflicting revision"
    state[event["run"]] = event
    return "stored"

def online_status(state, run_id):
    current = state.get(run_id)
    if current is None:
        return "missing"
    return "quarantined" if current.get("quarantined") else revision_value(current)

arrivals = [
    {
        "id": "E4",
        "run": "R-204",
        "event_time": dt("2026-05-01T10:30:00"),
        "ingested_at": dt("2026-05-01T10:31:00"),
        "source_version": 1,
        "status": "worker_scaled",
    },
    {
        "id": "E4-correction",
        "run": "R-204",
        "event_time": dt("2026-05-01T10:30:00"),
        "ingested_at": dt("2026-05-01T10:34:00"),
        "source_version": 2,
        "status": "worker_scaled_corrected",
    },
    {
        "id": "E4-v1-retry",
        "run": "R-204",
        "event_time": dt("2026-05-01T10:30:00"),
        "ingested_at": dt("2026-05-01T10:36:00"),
        "source_version": 1,
        "status": "worker_scaled",
    },
    {
        "id": "E1-retry",
        "run": "R-204",
        "event_time": dt("2026-05-01T08:00:00"),
        "ingested_at": dt("2026-05-01T11:05:00"),
        "status": "queued",
    },
]

for event in arrivals:
    print(f'{event["id"]}: {apply_latest_state(online_state, event)}')
print("current status:", online_status(online_state, "R-204"))

Output

E4: stored
E4-correction: stored
E4-v1-retry: ignored stale event
E1-retry: ignored stale event
current status: worker_scaled_corrected

07-deduplicate-aggregate-updates.py

seen_event_ids = set()
origin_backlog = 0

def apply_queue_update(event):
    global origin_backlog
    if event["id"] in seen_event_ids:
        return "duplicate skipped"
    next_backlog = origin_backlog + event["delta"]
    if next_backlog < 0:
        return "blocked negative backlog"
    seen_event_ids.add(event["id"])
    origin_backlog = next_backlog
    return "applied"

queue_updates = [
    {"id": "Q1", "delta": 3},
    {"id": "Q1", "delta": 3},
    {"id": "Q2", "delta": -1},
    {"id": "Q3", "delta": -5},
]

for update in queue_updates:
    result = apply_queue_update(update)
    print(f'{update["id"]}: {result}; backlog={origin_backlog}')

Output

Q1: applied; backlog=3
Q1: duplicate skipped; backlog=3
Q2: applied; backlog=2
Q3: blocked negative backlog; backlog=2

Bound lateness and freshness

This miniature policy accepts updates no more than 45 minutes behind the watermark. A real stream processor also defines windows, triggers, state cleanup, and monitoring.

08-classify-late-events.py

def late_event_action(event_time, watermark, allowed_lateness):
    lateness = watermark - event_time
    if lateness <= timedelta(0):
        return "on-time path"
    if lateness <= allowed_lateness:
        return "accept late update"
    return "quarantine for replay"

watermark = dt("2026-05-01T12:00:00")
allowed_lateness = timedelta(minutes=45)

for event_time in [
    dt("2026-05-01T12:05:00"),
    dt("2026-05-01T11:30:00"),
    dt("2026-05-01T10:00:00"),
]:
    print(f"{event_time:%H:%M} -> {late_event_action(event_time, watermark, allowed_lateness)}")

Output

05 -> on-time path
30 -> accept late update
00 -> quarantine for replay

Freshness requires a serving policy:

09-apply-serving-freshness-policy.py

def serving_mode(updated_at, requested_at):
    lag_minutes = int((requested_at - updated_at).total_seconds() / 60)
    if lag_minutes < 0:
        raise ValueError("feature update is from the future")
    if lag_minutes <= 15:
        return "normal scoring"
    if lag_minutes <= 45:
        return "fallback and log degraded mode"
    return "abstain from narrow SLA estimate"

requested_at = dt("2026-05-01T12:00:00")
for updated_at in [
    dt("2026-05-01T11:52:00"),
    dt("2026-05-01T11:30:00"),
    dt("2026-05-01T10:30:00"),
]:
    print(f"{updated_at:%H:%M} -> {serving_mode(updated_at, requested_at)}")

Output

11:52 -> normal scoring
11:30 -> fallback and log degraded mode
10:30 -> abstain from narrow SLA estimate

No model fixes a missing or temporally invalid input. Record freshness lag with each prediction trace so operators can distinguish model error from stale state.
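A trace record that carries the lag might look like the sketch below. The field names and the helper `build_prediction_trace` are assumptions for illustration; the lab doesn't define a trace schema.

```python
from datetime import datetime

def build_prediction_trace(run_id, requested_at, feature_updated_at, serving_mode, estimate):
    # Store the lag itself, not just the two timestamps, so dashboards and alerts
    # can filter on it without recomputing.
    lag_seconds = (requested_at - feature_updated_at).total_seconds()
    return {
        "run": run_id,
        "requested_at": requested_at.isoformat(),
        "feature_updated_at": feature_updated_at.isoformat(),
        "freshness_lag_seconds": lag_seconds,
        "serving_mode": serving_mode,
        "estimate": estimate,  # None when the policy fell back or abstained
    }

trace = build_prediction_trace(
    run_id="R-204",
    requested_at=datetime(2026, 5, 1, 12, 0),
    feature_updated_at=datetime(2026, 5, 1, 11, 30),
    serving_mode="fallback and log degraded mode",
    estimate=None,
)
print(trace["freshness_lag_seconds"], trace["serving_mode"])
```

With a 30-minute-old feature the trace records a lag of 1800.0 seconds alongside the degraded mode, which lets an operator separate "the model was wrong" from "the model was fed stale state".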

Compare Offline and Online Rows

10-check-offline-online-parity.py

def mismatches(left, right):
    keys = left.keys() | right.keys()
    return {key: (left.get(key), right.get(key)) for key in keys if left.get(key) != right.get(key)}

correction_offline_row = {
    "status": as_of_status(arrivals, "R-204", dt("2026-05-01T10:40:00")),
}
correction_online_row = {"status": online_status(online_state, "R-204")}
correction_differences = mismatches(correction_offline_row, correction_online_row)
print("same-time retry mismatches:", correction_differences)

conflicting_revisions = [
    {
        "id": "E5-a",
        "run": "R-205",
        "event_time": dt("2026-05-01T10:30:00"),
        "ingested_at": dt("2026-05-01T10:31:00"),
        "source_version": 7,
        "status": "worker_scaled",
    },
    {
        "id": "E5-b",
        "run": "R-205",
        "event_time": dt("2026-05-01T10:30:00"),
        "ingested_at": dt("2026-05-01T10:32:00"),
        "source_version": 7,
        "status": "worker_scale_failed",
    },
]
conflict_online_state = {}
for event in conflicting_revisions:
    apply_latest_state(conflict_online_state, event)
conflict_offline = as_of_status(conflicting_revisions, "R-205", dt("2026-05-01T10:40:00"))
conflict_online = online_status(conflict_online_state, "R-205")
assert conflict_offline == conflict_online == "quarantined"
print("equal-order conflict parity:", conflict_offline, "==", conflict_online)

for event in events:
    if event["ingested_at"] <= dt("2026-05-01T16:00:00"):
        apply_latest_state(online_state, event)

offline_row = {
    "status": as_of_status(events, "R-204", dt("2026-05-01T16:00:00")),
    "origin_backlog": origin_backlog,
}
online_row = {
    "status": online_status(online_state, "R-204"),
    "origin_backlog": origin_backlog,
}

print("offline row:", offline_row)
print("online row:", online_row)
print("mismatches:", mismatches(offline_row, online_row))

Output

same-time retry mismatches: {}
equal-order conflict parity: quarantined == quarantined
offline row: {'status': 'gpu_pressure', 'origin_backlog': 2}
online row: {'status': 'gpu_pressure', 'origin_backlog': 2}
mismatches: {}

A stale online row produces a visible failure instead of a mysterious model regression.

11-detect-stale-online-state.py

stale_online_row = {
    "status": "worker_scaled",
    "origin_backlog": origin_backlog,
}
differences = mismatches(offline_row, stale_online_row)
print("mismatches:", differences)
print("release allowed:", not differences)

Output

mismatches: {'status': ('gpu_pressure', 'worker_scaled')}
release allowed: False
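The check above compares one pair of rows. A release gate usually samples several keys, so here is a minimal sketch of a per-key report. `row_mismatches` mirrors the lab's `mismatches` helper (with sorted keys for stable output), while `parity_report` and the sample rows are hypothetical.

```python
def row_mismatches(left, right):
    # Same idea as the lab's mismatches(): report every key whose values differ.
    keys = sorted(left.keys() | right.keys())
    return {key: (left.get(key), right.get(key)) for key in keys if left.get(key) != right.get(key)}

def parity_report(pairs):
    """pairs: iterable of (entity_key, offline_row, online_row)."""
    checked = 0
    failures = {}
    for entity_key, offline_row, online_row in pairs:
        checked += 1
        diff = row_mismatches(offline_row, online_row)
        if diff:
            failures[entity_key] = diff
    return {
        "checked": checked,
        "failed": len(failures),
        "failures": failures,
        "release_allowed": not failures,  # any mismatch blocks release in this sketch
    }

report = parity_report([
    ("R-204", {"status": "gpu_pressure", "origin_backlog": 2},
              {"status": "worker_scaled", "origin_backlog": 2}),
    ("R-205", {"status": "quarantined"}, {"status": "quarantined"}),
])
print(report)
```

Here one of two sampled runs fails, so `release_allowed` is `False` and the report names the run and the field that diverged. A zero-tolerance gate is an assumption; a team could choose a threshold instead.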

Publish a compact receipt beside every snapshot and candidate model. It makes replay semantics inspectable instead of leaving them hidden in job code.

12-publish-replay-receipt.py

receipt = {
    "feature_definition": "training_sla_features_v1",
    "replay_boundary": "event_time <= prediction_time and ingested_at <= prediction_time",
    "offline_online_ordering": "(event_time, source_version); equal-key conflicts quarantine",
    "aggregate_updates": "atomic event-id dedup; reject negative backlog",
    "snapshot": "training-sla-train-2026-05-01",
    "parity_samples_checked": 2,
    "freshness_policy": "sla-freshness-v1",
}

assert as_of_status(events, "R-204", dt("2026-05-01T10:00:00")) == "queued"
assert as_of_status(events, "R-204", dt("2026-05-01T12:00:00")) == "warmup_spike"
assert not correction_differences
assert not mismatches(offline_row, online_row)

for key, value in receipt.items():
    print(f"{key}={value}")

Output

feature_definition=training_sla_features_v1
replay_boundary=event_time <= prediction_time and ingested_at <= prediction_time
offline_online_ordering=(event_time, source_version); equal-key conflicts quarantine
aggregate_updates=atomic event-id dedup; reject negative backlog
snapshot=training-sla-train-2026-05-01
parity_samples_checked=2
freshness_policy=sla-freshness-v1
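To let a snapshot and a candidate model point at the same receipt, one option is to store a content hash of it. This is a sketch, not something the lab does: `receipt_fingerprint` is a hypothetical helper, and the abbreviated receipt and the `sla-freshness-v2` value are illustrative.

```python
import hashlib
import json

def receipt_fingerprint(receipt):
    # Sort keys and fix separators so identical content always serializes identically.
    canonical = json.dumps(receipt, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

receipt = {
    "feature_definition": "training_sla_features_v1",
    "snapshot": "training-sla-train-2026-05-01",
    "parity_samples_checked": 2,
    "freshness_policy": "sla-freshness-v1",
}
reordered = dict(reversed(list(receipt.items())))                # same content, different key order
changed = {**receipt, "freshness_policy": "sla-freshness-v2"}    # hypothetical policy bump

print(receipt_fingerprint(receipt) == receipt_fingerprint(reordered))  # True
print(receipt_fingerprint(receipt) == receipt_fingerprint(changed))    # False
```

Key order doesn't affect the fingerprint, while any change to a value does, so a mismatch between the hash stored with the model and the hash stored with the snapshot signals that they were built under different replay semantics.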

Practice: break the pipeline

Run the lab, then change one condition at a time:

Remove the `ingested_at <= at` condition from `known_by`. Confirm the `10:00` row leaks `warmup_spike`.
Remove the shared ordering checks. Confirm `E4-v1-retry` undoes the same-time correction online and creates a parity mismatch; then confirm `E1-retry` can regress state to `queued`.
Remove `seen_event_ids`. Confirm the duplicated `Q1` update inflates backlog. Then remove the negative-state check and confirm `Q3` publishes an impossible queue size.
Change `online_row["status"]` to `"worker_scaled"`. Confirm parity blocks release.
Move a feature timestamp 90 minutes behind the request. Confirm serving abstains from a narrow SLA estimate.
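As a standalone preview of the first exercise, the sketch below toggles the ingestion check on two events. It is a simplified stand-in for the lab's `known_by` and `as_of_status`: the function `status_as_of`, the `check_ingestion` flag, and E1's exact timestamps are assumptions, and it ignores `source_version` and conflict quarantine.

```python
from datetime import datetime

events = [
    # E1 occurred and arrived before 10:00 (exact times assumed for illustration).
    {"id": "E1", "event_time": datetime(2026, 5, 1, 9, 0),
     "ingested_at": datetime(2026, 5, 1, 9, 1), "status": "queued"},
    # E2 occurred at 09:30 but arrived at 11:00.
    {"id": "E2", "event_time": datetime(2026, 5, 1, 9, 30),
     "ingested_at": datetime(2026, 5, 1, 11, 0), "status": "warmup_spike"},
]

def status_as_of(events, at, check_ingestion=True):
    visible = [
        e for e in events
        if e["event_time"] <= at and (not check_ingestion or e["ingested_at"] <= at)
    ]
    if not visible:
        return None
    # Latest event time wins in this simplified version.
    return max(visible, key=lambda e: e["event_time"])["status"]

at = datetime(2026, 5, 1, 10, 0)
print("faithful replay:", status_as_of(events, at))
print("event-time-only replay:", status_as_of(events, at, check_ingestion=False))
```

The faithful replay returns `queued`, while the event-time-only replay returns `warmup_spike`, a state that live serving couldn't have seen at 10:00.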

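Explain the replay without looking back

Why can event_time <= prediction_time still leak information into a replay?

Are streaming updates and online serving the same job?

Why can batch replay and online reads use different storage systems?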

What strong answers show

Evidence	What a strong answer shows
temporal correctness	constructs training rows with explicit event-time and availability boundaries
online state	handles older arrivals, duplicate updates, and freshness failures deliberately
operational replay	versions snapshots, parity receipts, and corrected analytical views separately

When the Contract Breaks

Symptom	Cause	Fix
Training gets better after a backfill but serving doesn't	replay used facts that arrived later	filter by event time and ingestion time
Online status moves backward after retry or same-time corrections disagree	stale or ambiguous event overwrote latest state	compare event timestamp plus source revision; quarantine unresolved ties
Backlog inflates during redelivery	aggregate update ran twice	deduplicate by stable event ID in same atomic write as aggregate change
Reliable model emits bad SLA estimates during upstream lag	freshness wasn't part of serving policy	trace age and fail to a safer response

Next Step

Continue to Gradient Boosted Trees in Production

You can now generate point-in-time correct feature rows. Next you'll train a strong tabular baseline and decide when its validation evidence earns deployment.


