Forecasting and Anomaly Detection

Forecast batch-job demand with time-aware evaluation and turn large forecast errors into reviewable operational alerts.

Ranking and recommendation chose an ordered slate at one moment. Forecasting asks the next production question: how many batch jobs will run tomorrow, and which observed counts fall outside that expectation?

Forecasting predicts a value for a later time. Anomaly detection asks whether an observation is unusual enough to inspect. The two jobs fit together: first estimate expected batch-job volume, then compare the actual count with that expectation. Later monitoring chapters reuse the same forecast-error signal as a live operational check.

Figure: Seven-day batch-job forecast chart comparing the weekly-lag baseline and its ±3 batch-job band with observed volumes. Friday's observed volume of 160 against a forecast of 132 is highlighted as a +28 review event, alongside holdout error and interval coverage evidence. The weekly-lag baseline keeps normal-day MAE at `1.5` batch jobs. Friday's observed `160` falls well outside the `129..135` interval around forecast `132`, creating a `+28` launch review rather than an automatic incident.

Start with an ordered batch-job series

A time series is a sequence of observations indexed by time. This lesson uses daily outbound batch-job counts for one model-serving cluster. Each row has a date, a batch-job count, and a flag for a planned model launch.

The forecast horizon is how far ahead you predict. A one-day horizon answers tomorrow's staffing question. A seven-day horizon helps reserve capacity for the coming week. This lab predicts one day at a time, then evaluates a full future week.

Build four weeks of daily counts. The weekday pattern repeats, but small changes keep the series realistic. The final Friday launch event creates a larger jump.

build-batch-job-series.py

from datetime import date, timedelta
from hashlib import sha256
from json import dumps
from math import ceil

start = date(2026, 1, 5)
weekly_pattern = [100, 112, 115, 118, 132, 82, 76]
weekly_noise = [
    [0, 0, 0, 0, 0, 0, 0],
    [2, -2, 3, -2, 2, 2, 2],
    [4, -2, 4, -2, 0, 1, 2],
    [3, 1, 2, -1, 28, 2, 1],
]

rows = []
for week, noise in enumerate(weekly_noise):
    for weekday, (baseline, offset) in enumerate(zip(weekly_pattern, noise)):
        rows.append(
            {
                "date": start + timedelta(days=7 * week + weekday),
                "weekday": weekday,
                "volume": baseline + offset,
                "launch_event": week == 3 and weekday == 4,
            }
        )

print("rows:", len(rows))
print("first date:", rows[0]["date"])
print("last date:", rows[-1]["date"])
print("launch_event day:", next(row["date"] for row in rows if row["launch_event"]))

Output

rows: 28
first date: 2026-01-05
last date: 2026-02-01
launch_event day: 2026-01-30

The fixture is deliberately small enough to inspect. In production, each model-serving cluster, service tier, and forecast horizon may become a separate series or slice.

Keep future rows out of training

Ordinary tabular prediction often starts with a random train/test split. That's unsafe for forecasting. A shuffled training set can contain January 31 while its test set contains January 6. A model evaluated that way has learned from the future relative to some test rows.

Compare an invalid shuffled split with a chronological split. The audit prints whether the latest training date reaches or passes the earliest test date.

audit-time-splits.py

shuffled_train = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26]
shuffled_test = [index for index in range(len(rows)) if index not in shuffled_train]
time_train = list(range(21))
time_test = list(range(21, 28))

def split_summary(name, train_indices, test_indices):
    latest_train = max(rows[index]["date"] for index in train_indices)
    earliest_test = min(rows[index]["date"] for index in test_indices)
    future_leak = latest_train >= earliest_test
    print(
        f"{name}: latest_train={latest_train} "
        f"earliest_test={earliest_test} future_leak={future_leak}"
    )

split_summary("shuffled", shuffled_train, shuffled_test)
split_summary("time-aware", time_train, time_test)
assert max(time_train) < min(time_test)

Output

shuffled: latest_train=2026-01-31 earliest_test=2026-01-06 future_leak=True
time-aware: latest_train=2026-01-25 earliest_test=2026-01-26 future_leak=False

Hyndman and Athanasopoulos describe time-series cross-validation as repeated evaluation where each test observation occurs after the observations used for training. [1] You can expand the training history or keep a rolling window. Either way, the time direction stays intact.
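The two window styles can be sketched as index generators. This is a minimal illustration rather than the reference's implementation: the helper names `expanding_window_splits` and `rolling_window_splits`, and the 14-row and 7-row window sizes, are assumptions chosen to fit the 28-row fixture.

```python
def expanding_window_splits(n_rows, first_test_start, test_size):
    """Yield (train, test) index lists; the training history grows each fold."""
    for test_start in range(first_test_start, n_rows - test_size + 1, test_size):
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


def rolling_window_splits(n_rows, train_size, test_size):
    """Yield (train, test) index lists; the training window keeps a fixed length."""
    for test_start in range(train_size, n_rows - test_size + 1, test_size):
        train = list(range(test_start - train_size, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


expanding = list(expanding_window_splits(n_rows=28, first_test_start=14, test_size=7))
rolling = list(rolling_window_splits(n_rows=28, train_size=14, test_size=7))

for name, splits in (("expanding", expanding), ("rolling", rolling)):
    for train, test in splits:
        # Every training index must precede every test index.
        assert max(train) < min(test)
        print(f"{name}: train={train[0]}..{train[-1]} test={test[0]}..{test[-1]}")
```

Both generators produce two folds here. They differ only in whether the second fold's training window still starts at row 0 or slides forward to row 7.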

Establish a same-weekday baseline

Batch-job volume has seasonality when a pattern repeats at a known calendar frequency. Saturday volume may stay lower than weekday volume. Predicting tomorrow from yesterday alone will overreact around each weekend.

A seasonal naive forecast reuses the observation from the matching previous season. For daily batch-job data with a weekly pattern, forecast Monday from last Monday, Tuesday from last Tuesday, and so on. [1]

Target day	Same day last week	Forecast	Actual	Forecast error
Monday, Jan 26	Monday, Jan 19	104	103	-1
Friday, Jan 30	Friday, Jan 23	132	160	+28

The Friday arithmetic is 160 - 132 = 28. Positive error means the observed count exceeded the forecast. Negative error means the forecast was too high.

Compute the complete holdout week. Each forecast reads the value exactly seven rows earlier.

forecast-with-weekly-lag.py

holdout = list(range(21, 28))

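def weekly_lag(index):
    # Guard against negative indexing: rows[index - 7] would otherwise wrap
    # around and silently read a later row.
    if index < 7:
        raise ValueError("weekly lag needs at least seven earlier rows")
    return rows[index - 7]["volume"]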

for index in holdout:
    forecast = weekly_lag(index)
    actual = rows[index]["volume"]
    error = actual - forecast
    print(
        f"{rows[index]['date']} forecast={forecast} "
        f"actual={actual} error={error:+d}"
    )

Output

2026-01-26 forecast=104 actual=103 error=-1
2026-01-27 forecast=110 actual=113 error=+3
2026-01-28 forecast=119 actual=117 error=-2
2026-01-29 forecast=116 actual=117 error=+1
2026-01-30 forecast=132 actual=160 error=+28
2026-01-31 forecast=83 actual=84 error=+1
2026-02-01 forecast=78 actual=77 error=-1

Simple baselines matter. A complex model should beat the seasonal naive forecast on future windows before its extra features and operational cost are justified.
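To make the baseline comparison concrete, the sketch below scores a plain naive forecast (yesterday's value) against the same-weekday forecast on weeks two and three of the fixture. The helper `lag_mae` is a name introduced here for illustration, and the evaluation window is an assumption chosen so that it stays before the holdout week.

```python
# Fixture values copied from the lab: weekly pattern plus per-week offsets.
weekly_pattern = [100, 112, 115, 118, 132, 82, 76]
weekly_noise = [
    [0, 0, 0, 0, 0, 0, 0],
    [2, -2, 3, -2, 2, 2, 2],
    [4, -2, 4, -2, 0, 1, 2],
    [3, 1, 2, -1, 28, 2, 1],
]
volumes = [
    base + offset
    for noise in weekly_noise
    for base, offset in zip(weekly_pattern, noise)
]


def lag_mae(lag, indices):
    """MAE of a forecast that reuses the value `lag` days earlier."""
    errors = [volumes[i] - volumes[i - lag] for i in indices]
    return sum(abs(e) for e in errors) / len(errors)


# Score both baselines on weeks two and three, before the holdout week.
evaluation = range(7, 21)
naive_mae = lag_mae(1, evaluation)      # yesterday's value
seasonal_mae = lag_mae(7, evaluation)   # same weekday last week

print("naive (lag 1) MAE:", round(naive_mae, 1))
print("seasonal naive (lag 7) MAE:", round(seasonal_mae, 1))
```

On this fixture the lag-1 forecast pays heavily at each Friday-to-Saturday and Sunday-to-Monday transition, which is why the same-weekday baseline is the bar a more complex model has to clear.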

Measure forecast error in batch jobs

Mean absolute error (MAE) averages the size of each mistake without letting positive and negative errors cancel:

$$\text{MAE} = \frac{1}{n}\sum_{t=1}^{n}|y_t - \hat{y}_t|$$

Here $n$ is the number of evaluated days, $y_t$ is observed volume on day $t$, and $\hat{y}_t$ is the forecast for that day. For the holdout week, the absolute errors are 1, 3, 2, 1, 28, 1, 1. Their sum is 37, so MAE is 37 / 7 ≈ 5.3 batch jobs.

Calculate overall MAE and normal-day MAE separately. The second metric excludes the known launch event day so you can see baseline behavior without hiding the operational spike.

measure-holdout-mae.py

def mean_absolute_error(errors):
    return sum(abs(error) for error in errors) / len(errors)

holdout_errors = [
    rows[index]["volume"] - weekly_lag(index)
    for index in holdout
]
normal_errors = [
    rows[index]["volume"] - weekly_lag(index)
    for index in holdout
    if not rows[index]["launch_event"]
]

print("holdout errors:", holdout_errors)
print("holdout MAE:", round(mean_absolute_error(holdout_errors), 1))
print("normal-day MAE:", round(mean_absolute_error(normal_errors), 1))

Output

holdout errors: [-1, 3, -2, 1, 28, 1, -1]
holdout MAE: 5.3
normal-day MAE: 1.5

MAE remains in batch jobs, which makes it understandable to operations. It doesn't tell you everything: underforecasting a capacity spike may cost more than overforecasting by the same amount. Record asymmetric business cost when the decision requires it.
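One way to record asymmetric cost is to weight underforecasts and overforecasts differently. The sketch below is illustrative only: the weights `UNDER_FORECAST_COST` and `OVER_FORECAST_COST` are hypothetical and would have to come from the actual capacity decision.

```python
# Holdout forecast errors from the lab (actual minus forecast).
holdout_errors = [-1, 3, -2, 1, 28, 1, -1]

# Hypothetical weights: one missing unit of capacity costs three times
# as much as one idle unit. These numbers are assumptions, not lab values.
UNDER_FORECAST_COST = 3.0
OVER_FORECAST_COST = 1.0


def asymmetric_cost(error, under_cost=UNDER_FORECAST_COST, over_cost=OVER_FORECAST_COST):
    """Positive error means the actual count exceeded the forecast."""
    if error > 0:
        return under_cost * error
    return over_cost * -error


costs = [asymmetric_cost(error) for error in holdout_errors]
mean_cost = sum(costs) / len(costs)
mae = sum(abs(error) for error in holdout_errors) / len(holdout_errors)

print("per-day cost:", costs)
print("mean asymmetric cost:", round(mean_cost, 1))
print("MAE for comparison:", round(mae, 1))
```

With equal weights the function reduces to absolute error, so MAE is the special case where both directions cost the same.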

Backtest more than one future window

A single future week is fragile evidence. Rolling-origin evaluation moves the forecast origin forward and measures several later windows. In this fixture, the first fold evaluates January 19 through January 25. The second evaluates January 26 through February 1.

Run both folds. The seasonal baseline can forecast each day because each fold starts after at least one full week of history.

run-rolling-origin-backtest.py

folds = []
for origin in (14, 21):
    indices = list(range(origin, origin + 7))
    errors = [
        rows[index]["volume"] - weekly_lag(index)
        for index in indices
    ]
    fold = {
        "start": rows[indices[0]]["date"],
        "end": rows[indices[-1]]["date"],
        "mae": round(mean_absolute_error(errors), 1),
    }
    folds.append(fold)
    print(f"{fold['start']} to {fold['end']}: MAE={fold['mae']}")

Output

2026-01-19 to 2026-01-25: MAE=0.9
2026-01-26 to 2026-02-01: MAE=5.3

Backtest by model-serving cluster, service tier, weekday, and launch-event status. A global average can hide one cluster that routinely underforecasts peak demand. The target must also match the decision: submitted job count helps GPU reservation, failed job count helps incident response, and queue-delay count helps operator notification.
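A sliced backtest can be sketched with the single-cluster fixture by grouping errors on weekday and launch-event status. The helper `mae_by` and the record layout are assumptions made for illustration. A real backtest would add cluster and service-tier keys, which this fixture does not contain.

```python
from collections import defaultdict

# Fixture values copied from the lab.
weekly_pattern = [100, 112, 115, 118, 132, 82, 76]
weekly_noise = [
    [0, 0, 0, 0, 0, 0, 0],
    [2, -2, 3, -2, 2, 2, 2],
    [4, -2, 4, -2, 0, 1, 2],
    [3, 1, 2, -1, 28, 2, 1],
]
volumes = [
    base + offset
    for noise in weekly_noise
    for base, offset in zip(weekly_pattern, noise)
]
LAUNCH_INDEX = 25  # Friday, Jan 30 in the fixture

# One record per day that has a same-weekday value one week earlier.
records = [
    {
        "weekday": index % 7,
        "launch_event": index == LAUNCH_INDEX,
        "error": volumes[index] - volumes[index - 7],
    }
    for index in range(7, len(volumes))
]


def mae_by(records, key):
    """Group absolute errors by a slice key and return MAE per slice."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key]].append(abs(record["error"]))
    return {value: sum(errors) / len(errors) for value, errors in grouped.items()}


global_mae = sum(abs(r["error"]) for r in records) / len(records)
mae_by_weekday = mae_by(records, "weekday")
mae_by_launch = mae_by(records, "launch_event")

print("global MAE:", round(global_mae, 2))
print("MAE by weekday:", {k: round(v, 2) for k, v in mae_by_weekday.items()})
print("MAE by launch status:", {k: round(v, 2) for k, v in mae_by_launch.items()})
```

The global figure blends the Friday slice with six quieter weekdays, which is the masking effect the paragraph above warns about.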

Distinguish forecast errors from residuals

Two related quantities often get called residuals, but the distinction is useful:

Quantity	Where it comes from	Formula	What it tells you
residual	training data after fitting a model	observed value minus fitted value	whether model fits known history
forecast error	later observation after making a forecast	observed value minus forecast	whether model predicted unseen future data

Hyndman and Athanasopoulos make this distinction explicitly. [1] Our Friday +28 is a forecast error because Friday's observed volume wasn't available when the forecast was issued. An operations team may casually say "residual alert," but the stored artifact should name the measured quantity precisely.

Diagram: daily batch-job counts (ordered history), seasonal baseline (same weekday), future forecast (expected range), and observed volume.

Calibrate an error band from earlier days

A point forecast such as 132 batch jobs doesn't express uncertainty. A prediction interval pairs a forecast with a lower and upper bound intended to cover a stated proportion of future outcomes. Forecast intervals should be evaluated for coverage, not displayed beside the point estimate alone. [1]

For a small first-principles exercise, use the 95th percentile of earlier absolute one-day forecast errors. This produces a symmetric empirical error band. The code deliberately calibrates on January 12 through January 25, before the January 26 holdout starts, then measures coverage again on the later holdout week.

calibrate-empirical-error-band.py

calibration = list(range(7, 21))
calibration_errors = [
    rows[index]["volume"] - weekly_lag(index)
    for index in calibration
]
target_coverage = 0.95
interval_policy = "empirical-absolute-error-p95-v1"

def nearest_rank_quantile(values, probability):
    ordered = sorted(values)
    rank = max(0, ceil(probability * len(ordered)) - 1)
    return ordered[rank]

error_band = nearest_rank_quantile(
    [abs(error) for error in calibration_errors],
    probability=target_coverage,
)
calibration_coverage = sum(
    abs(error) <= error_band
    for error in calibration_errors
) / len(calibration_errors)
holdout_coverage = sum(
    abs(error) <= error_band
    for error in holdout_errors
) / len(holdout_errors)

assert max(calibration) < min(holdout)
print("calibration errors:", calibration_errors)
print("95% empirical absolute-error band:", error_band)
print("in-sample calibration coverage:", round(calibration_coverage, 3))
print("later holdout coverage:", round(holdout_coverage, 3))

Output

calibration errors: [2, -2, 3, -2, 2, 2, 2, 2, 0, 1, 0, -2, -1, 0]
95% empirical absolute-error band: 3
in-sample calibration coverage: 1.0
later holdout coverage: 0.857

The ±3 band explains the mechanics, not a production guarantee. Fourteen calibration errors are too few for a stable 95% interval. Its in-sample calibration coverage is 1.000, while later holdout coverage is 6 / 7 = 0.857 because Friday falls outside the band. A production system should measure out-of-sample interval coverage across rolling windows, forecast horizons, model-serving clusters, and important demand slices.
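Out-of-sample coverage across rolling windows can be sketched on the same fixture: calibrate the band on earlier errors only, then score it on the week that follows. The helper `out_of_sample_coverage` and the two window definitions are assumptions for illustration. Two folds remain far too few for a production estimate.

```python
from math import ceil

# Fixture values copied from the lab.
weekly_pattern = [100, 112, 115, 118, 132, 82, 76]
weekly_noise = [
    [0, 0, 0, 0, 0, 0, 0],
    [2, -2, 3, -2, 2, 2, 2],
    [4, -2, 4, -2, 0, 1, 2],
    [3, 1, 2, -1, 28, 2, 1],
]
volumes = [
    base + offset
    for noise in weekly_noise
    for base, offset in zip(weekly_pattern, noise)
]
# Same-weekday forecast error for every day with a week of history behind it.
errors = {i: volumes[i] - volumes[i - 7] for i in range(7, len(volumes))}


def nearest_rank_quantile(values, probability):
    ordered = sorted(values)
    rank = max(0, ceil(probability * len(ordered)) - 1)
    return ordered[rank]


def out_of_sample_coverage(calibration_indices, evaluation_indices, probability=0.95):
    """Calibrate a band on earlier errors, then score it on later days only."""
    assert max(calibration_indices) < min(evaluation_indices)
    band = nearest_rank_quantile(
        [abs(errors[i]) for i in calibration_indices], probability
    )
    covered = sum(abs(errors[i]) <= band for i in evaluation_indices)
    return band, covered / len(evaluation_indices)


# Expanding calibration windows, each scored on the following week.
windows = [
    (range(7, 14), range(14, 21)),
    (range(7, 21), range(21, 28)),
]
results = [out_of_sample_coverage(cal, ev) for cal, ev in windows]
for (cal, ev), (band, coverage) in zip(windows, results):
    print(f"calibrate {cal[0]}..{cal[-1]} evaluate {ev[0]}..{ev[-1]}: "
          f"band={band} coverage={round(coverage, 3)}")
```

Reporting coverage per fold, rather than one pooled number, shows whether misses cluster in a particular week.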

Turn an unusual error into a review artifact

An anomaly is an observation unusual enough to inspect. It isn't proof that a cluster, pipeline, or model failed. Friday's jump could come from a planned model launch, duplicated event ingestion, a large backfill, or genuine demand growth.

Apply the empirical band to the held-out week. The alert stores enough context to reproduce the comparison.

create-capacity-alert.py

forecast_version = "weekly-lag-v1"
training_cutoff = str(rows[20]["date"])
owner = "capacity-ops"

alerts = []
for index in holdout:
    forecast = weekly_lag(index)
    actual = rows[index]["volume"]
    forecast_error = actual - forecast
    if abs(forecast_error) > error_band:
        alerts.append(
            {
                "series": "fc-a.batch_jobs",
                "forecast_version": forecast_version,
                "training_cutoff": training_cutoff,
                "interval_policy": interval_policy,
                "date": str(rows[index]["date"]),
                "forecast": forecast,
                "actual": actual,
                "forecast_error": forecast_error,
                "lower": forecast - error_band,
                "upper": forecast + error_band,
                "launch_event": rows[index]["launch_event"],
                "action": "review",
                "owner": owner,
                "resolution": "pending",
            }
        )

for alert in alerts:
    print(alert)

Output

{'series': 'fc-a.batch_jobs', 'forecast_version': 'weekly-lag-v1', 'training_cutoff': '2026-01-25', 'interval_policy': 'empirical-absolute-error-p95-v1', 'date': '2026-01-30', 'forecast': 132, 'actual': 160, 'forecast_error': 28, 'lower': 129, 'upper': 135, 'launch_event': True, 'action': 'review', 'owner': 'capacity-ops', 'resolution': 'pending'}

The baseline caught Friday because 160 falls outside [129, 135]. The launch-event flag doesn't erase the event. It changes the first investigation step.

Route alerts with context. A planned model launch still deserves capacity review, but it shouldn't automatically create an incident page.

route-alert-with-context.py

def route_alert(alert):
    if alert["launch_event"]:
        return "review planned model launch capacity"
    return "page unexpected batch-job spike"

for alert in alerts:
    print(alert["date"], "->", route_alert(alert))

Output

2026-01-30 -> review planned model launch capacity

Log the forecast version, training cutoff, interval policy version, observed value, error, known-event flags, owner, and eventual resolution. That receipt helps operations act now and helps future model reviews learn from the alert.

Evaluate alert usefulness separately

Low MAE doesn't guarantee useful alerts. An alert threshold can still page too often or miss disruptions. Once reviewers label whether each historical alert was actionable, measure alert precision and recall:

Metric	Question
alert precision	Of the alerts sent to review, how many were actionable?
alert recall	Of the actionable disruptions, how many did the policy surface?

Compute both metrics on a five-event resolution fixture. The policy surfaced two actionable events, sent one noisy alert, missed one disruption, and correctly stayed quiet once.

evaluate-alert-policy.py

resolved_events = [
    {"alert": True, "actionable": True},
    {"alert": True, "actionable": False},
    {"alert": False, "actionable": True},
    {"alert": False, "actionable": False},
    {"alert": True, "actionable": True},
]

true_positives = sum(event["alert"] and event["actionable"] for event in resolved_events)
false_positives = sum(event["alert"] and not event["actionable"] for event in resolved_events)
false_negatives = sum(not event["alert"] and event["actionable"] for event in resolved_events)

def rate_or_none(numerator, denominator):
    return round(numerator / denominator, 3) if denominator else None

alert_precision = rate_or_none(true_positives, true_positives + false_positives)
alert_recall = rate_or_none(true_positives, true_positives + false_negatives)
print("alert precision:", alert_precision)
print("alert recall:", alert_recall)

Output

alert precision: 0.667
alert recall: 0.667

Tune the policy against operational cost. A missed cluster disruption and a noisy review ticket don't have equal consequences. If a denominator is zero, report None: no evidence isn't the same as measured failure. Page only when urgency warrants interruption; otherwise queue a review with the same evidence.
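A cost-weighted comparison of two alert policies might look like the sketch below. The cost constants and the "alert on everything" alternative are hypothetical, chosen only to show how unequal consequences change which policy looks better.

```python
# Reviewed events from the lab fixture.
resolved_events = [
    {"alert": True, "actionable": True},
    {"alert": True, "actionable": False},
    {"alert": False, "actionable": True},
    {"alert": False, "actionable": False},
    {"alert": True, "actionable": True},
]

# Hypothetical costs: a missed disruption is assumed ten times worse
# than a noisy review ticket.
MISSED_DISRUPTION_COST = 10.0
NOISY_ALERT_COST = 1.0


def policy_cost(events):
    """Total cost of false negatives and false positives under the weights above."""
    false_positives = sum(e["alert"] and not e["actionable"] for e in events)
    false_negatives = sum(not e["alert"] and e["actionable"] for e in events)
    return false_positives * NOISY_ALERT_COST + false_negatives * MISSED_DISRUPTION_COST


# Alternative policy for comparison: send every event to review.
alert_everything = [{**event, "alert": True} for event in resolved_events]

current_cost = policy_cost(resolved_events)
alert_everything_cost = policy_cost(alert_everything)
print("current policy cost:", current_cost)
print("alert-everything cost:", alert_everything_cost)
```

Under these assumed weights the noisier policy is cheaper. With a higher price on reviewer time the ranking would flip, which is why the weights need to come from operations rather than from the model team.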

Reproduce a data-quality anomaly

Not every spike represents demand. A duplicated telemetry event can inflate a count before forecasting code sees it. Reproduce that failure with three raw telemetry rows, two of which describe the same batch job event.

detect-duplicate-telemetry-events.py

raw_job_events = [
    {"event_id": "job-001", "job_id": "batch-100", "type": "started"},
    {"event_id": "job-001", "job_id": "batch-100", "type": "started"},
    {"event_id": "job-002", "job_id": "batch-101", "type": "started"},
]
unique_events = {
    event["event_id"]: event
    for event in raw_job_events
}
deduplication_policy = "telemetry-event-id-v1"
duplicate_rows = len(raw_job_events) - len(unique_events)

print("naive started-job count:", len(raw_job_events))
print("deduplicated started-job count:", len(unique_events))
print("duplicate rows:", duplicate_rows)

Output

naive started-job count: 3
deduplicated started-job count: 2
duplicate rows: 1

An anomaly review should inspect input quality before retraining a model. Retraining on duplicated counts would teach the model to copy a pipeline bug. This tiny dictionary deduper keeps the last row seen for each event ID, which is safe here only because the repeated payloads match. If two rows reuse an event ID but disagree, quarantine the conflict instead of silently choosing one.
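The quarantine rule can be sketched as follows. The function `deduplicate_with_quarantine` and the conflicting `job-003` rows are hypothetical additions for illustration and are not part of the lab fixture.

```python
raw_job_events = [
    {"event_id": "job-001", "job_id": "batch-100", "type": "started"},
    {"event_id": "job-001", "job_id": "batch-100", "type": "started"},
    {"event_id": "job-002", "job_id": "batch-101", "type": "started"},
    # Hypothetical conflict: same event ID, different payload.
    {"event_id": "job-003", "job_id": "batch-102", "type": "started"},
    {"event_id": "job-003", "job_id": "batch-999", "type": "started"},
]


def deduplicate_with_quarantine(events):
    """Keep one row per event ID; set aside IDs whose payloads disagree."""
    first_seen = {}
    conflicted_ids = set()
    for event in events:
        event_id = event["event_id"]
        if event_id not in first_seen:
            first_seen[event_id] = event
        elif first_seen[event_id] != event:
            # Same ID, different content: do not pick a winner.
            conflicted_ids.add(event_id)
    accepted = {
        event_id: event
        for event_id, event in first_seen.items()
        if event_id not in conflicted_ids
    }
    quarantined = [e for e in events if e["event_id"] in conflicted_ids]
    return accepted, quarantined


accepted, quarantined = deduplicate_with_quarantine(raw_job_events)
print("accepted event IDs:", sorted(accepted))
print("quarantined rows:", len(quarantined))
```

Exact duplicates collapse to one accepted row, while every row that shares a conflicted ID is held back for review, so the count excludes it until someone resolves the disagreement.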

Gate and publish a forecast candidate

A minimal release gate should prove time order, calibration order, normal-day accuracy, interval bookkeeping, contextualized alerts, and assigned ownership. These checks don't prove the baseline is production-ready, but they prevent avoidable mistakes from reaching a capacity workflow.

Run the gate.

check-forecast-release-gates.py

gates = {
    "time_order": max(time_train) < min(time_test),
    "calibration_before_holdout": max(calibration) < min(holdout),
    "normal_day_mae": mean_absolute_error(normal_errors) <= 3.0,
    "calibration_coverage_recorded": 0.0 <= calibration_coverage <= 1.0,
    "holdout_coverage_recorded": 0.0 <= holdout_coverage <= 1.0,
    "interval_policy_versioned": bool(interval_policy),
    "deduplication_policy_versioned": bool(deduplication_policy),
    "alerts_have_context": all(
        {
            "series", "forecast_version", "training_cutoff", "interval_policy",
            "date", "forecast", "actual", "forecast_error", "lower", "upper",
            "launch_event", "action", "owner", "resolution",
        } <= alert.keys()
        for alert in alerts
    ),
    "owner_assigned": bool(owner),
}

for name, passed in gates.items():
    print(f"{name}: {passed}")
print("release gate:", all(gates.values()))

Output

time_order: True
calibration_before_holdout: True
normal_day_mae: True
calibration_coverage_recorded: True
holdout_coverage_recorded: True
interval_policy_versioned: True
deduplication_policy_versioned: True
alerts_have_context: True
owner_assigned: True
release gate: True

Publish a receipt. The hash lets a later monitoring job identify the exact baseline, interval, and input policies it evaluates.

publish-forecast-candidate-receipt.py

receipt = {
    "series": "fc-a.batch_jobs",
    "forecast": forecast_version,
    "training_cutoff": training_cutoff,
    "interval_policy": {
        "version": interval_policy,
        "target_coverage": target_coverage,
        "absolute_error_band": error_band,
        "calibration_window": [str(rows[7]["date"]), str(rows[20]["date"])],
        "calibration_coverage": round(calibration_coverage, 3),
        "holdout_window": [str(rows[21]["date"]), str(rows[27]["date"])],
        "holdout_coverage": round(holdout_coverage, 3),
    },
    "input_policy": {
        "deduplication": deduplication_policy,
    },
    "evaluation": {
        "normal_day_mae": round(mean_absolute_error(normal_errors), 1),
        "alert_count": len(alerts),
    },
    "alerts": alerts,
    "release_checks": gates,
    "owner": owner,
    "status": "candidate_for_shadow" if all(gates.values()) else "blocked",
}
payload = dumps(receipt, sort_keys=True, separators=(",", ":"))

print(dumps(receipt, indent=2, sort_keys=True))
print("receipt sha256:", sha256(payload.encode()).hexdigest()[:12])

Output

{
  "alerts": [
    {
      "action": "review",
      "actual": 160,
      "date": "2026-01-30",
      "forecast": 132,
      "forecast_error": 28,
      "forecast_version": "weekly-lag-v1",
      "interval_policy": "empirical-absolute-error-p95-v1",
      "launch_event": true,
      "lower": 129,
      "owner": "capacity-ops",
      "resolution": "pending",
      "series": "fc-a.batch_jobs",
      "training_cutoff": "2026-01-25",
      "upper": 135
    }
  ],
  "evaluation": {
    "alert_count": 1,
    "normal_day_mae": 1.5
  },
  "forecast": "weekly-lag-v1",
  "input_policy": {
    "deduplication": "telemetry-event-id-v1"
  },
  "interval_policy": {
    "absolute_error_band": 3,
    "calibration_coverage": 1.0,
    "calibration_window": [
      "2026-01-12",
      "2026-01-25"
    ],
    "holdout_coverage": 0.857,
    "holdout_window": [
      "2026-01-26",
      "2026-02-01"
    ],
    "target_coverage": 0.95,
    "version": "empirical-absolute-error-p95-v1"
  },
  "owner": "capacity-ops",
  "release_checks": {
    "alerts_have_context": true,
    "calibration_before_holdout": true,
    "calibration_coverage_recorded": true,
    "deduplication_policy_versioned": true,
    "holdout_coverage_recorded": true,
    "interval_policy_versioned": true,
    "normal_day_mae": true,
    "owner_assigned": true,
    "time_order": true
  },
  "series": "fc-a.batch_jobs",
  "status": "candidate_for_shadow",
  "training_cutoff": "2026-01-25"
}
receipt sha256: a40b84aed22b

The receipt keeps calibration evidence separate from later holdout evidence and binds the policies that produced both. Passing this tiny fixture earns shadow evaluation, not a silent production rollout.
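A later monitoring job could check a receipt against its published digest roughly as sketched below. The trimmed receipt and the helper `receipt_digest` are illustrative assumptions. The serialization settings match the lab's `dumps(..., sort_keys=True, separators=(",", ":"))` call.

```python
from hashlib import sha256
from json import dumps


def receipt_digest(receipt):
    """Hash a canonical JSON form so key order does not change the digest."""
    payload = dumps(receipt, sort_keys=True, separators=(",", ":"))
    return sha256(payload.encode()).hexdigest()


# Hypothetical trimmed receipt; the lab's receipt carries more fields.
published = {
    "series": "fc-a.batch_jobs",
    "forecast": "weekly-lag-v1",
    "interval_policy": {
        "version": "empirical-absolute-error-p95-v1",
        "absolute_error_band": 3,
    },
}
published_digest = receipt_digest(published)

# Same content with keys in a different order: the digest should match.
reordered = {
    "interval_policy": {
        "absolute_error_band": 3,
        "version": "empirical-absolute-error-p95-v1",
    },
    "forecast": "weekly-lag-v1",
    "series": "fc-a.batch_jobs",
}
# A changed band: the digest should differ.
tampered = {**published, "interval_policy": {**published["interval_policy"], "absolute_error_band": 5}}

print("reordered matches:", receipt_digest(reordered) == published_digest)
print("tampered matches:", receipt_digest(tampered) == published_digest)
```

The lab prints only the first 12 hex characters for readability. A verification job would more safely compare the full digest, since a short prefix raises the chance of an accidental match.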

Explain the alert without looking back

Explain the complete path in your own words:

Why does a daily batch job series need chronological splits?
What repeated pattern does a same-weekday baseline capture?
Why is January 30's +28 a forecast error rather than a training residual?
What does the empirical ±3 band mean, and what doesn't it guarantee?
Why do alert precision and recall need a separate reviewed dataset?
Why should input deduplication happen before model retraining?

Practice

Change Friday's observed volume from 160 to 134. Predict whether the alert fires before rerunning the lab.
Change probability=0.95 to probability=0.50. Explain why the narrower band sends more rows to review.
Add a second model-serving cluster with a different weekend pattern. Explain why a shared global error band may hide local problems.
Add a two-day forecast horizon. Identify which earlier observations each prediction may use.
Extend resolved_events with two noisy alerts. Compute the new alert precision before running the cell.

Practice answer sketches

Prompt	Reasoning check
Friday becomes `134`	Forecast error becomes `134 - 132 = 2`, which stays inside the `±3` band. No alert fires.
Quantile becomes `0.50`	Band shrinks from `±3` to `±2`. Tuesday's `+3` and Friday's `+28` now enter review.
Add another model-serving cluster	Calibrate and report by cluster before sharing a band. Different local schedules can disappear inside an aggregate.
Predict two days ahead	For a target at time $t$, use information available no later than $t - 2$. Recompute each fold with that cutoff.
Add two noisy alerts	True positives stay at `2`, false positives rise from `1` to `3`, and precision falls to `2 / 5 = 0.4`.
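For the two-day-horizon prompt, the cutoff rule can be checked with a small sketch. It assumes the standard seasonal naive rule, which reuses the most recent same-season observation available at the forecast origin. The function names are introduced here for illustration.

```python
from math import ceil

SEASON_LENGTH = 7  # weekly pattern in daily data


def seasonal_naive_source(target_index, horizon, season_length=SEASON_LENGTH):
    """Index of the most recent same-weekday row available at the origin."""
    lag = season_length * ceil(horizon / season_length)
    return target_index - lag


def information_cutoff(target_index, horizon):
    """Latest row index a forecast issued `horizon` days ahead may read."""
    return target_index - horizon


target = 25  # Friday, Jan 30 in the 28-row fixture
for horizon in (1, 2, 7, 8):
    source = seasonal_naive_source(target, horizon)
    cutoff = information_cutoff(target, horizon)
    assert source <= cutoff  # the forecast never reads past its cutoff
    print(f"horizon={horizon} cutoff_index={cutoff} source_index={source}")
```

For horizons up to seven days the weekly-lag source row does not change, so the same-weekday baseline already respects a two-day cutoff. Beyond seven days the forecast has to reach back two weeks.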

What strong answers show

Evidence	A strong explanation demonstrates
temporal split	proves each evaluated day occurs after its training and calibration inputs
baseline	computes a same-weekday forecast and requires any more complex model to beat it before adding complexity
error metric	calculates MAE in batch jobs and discusses asymmetric capacity cost
uncertainty	treats the tiny empirical band as a teaching device and measures rolling-window coverage in production
anomaly handling	stores reproducible alert context and separates unusual observations from root causes
alert evaluation	measures reviewed alert precision and recall independently from forecast MAE
data quality	checks duplicate or stale inputs before retraining from anomalous observations

When the policy breaks

Symptom	Cause	Fix
Test results collapse at launch	random split leaked future behavior	use chronological rolling-origin validation
Alerts fire every Monday	baseline ignores weekly seasonality	add a same-weekday baseline first
Interval looks precise but misses peaks	band was calibrated in sample or on too little history	measure out-of-sample coverage by horizon and slice
Planned launch creates an incident page	alert routing has no business context	route known events to capacity review
Model learns a fake demand jump	duplicated events entered training rows	deduplicate and audit ingestion before retraining
Operations ignores alerts	receipt lacks ownership or resolution	log policy version, context, owner, and outcome

Next Step

Continue to Monitoring Predictive Models

You can issue a time-aware forecast candidate and turn unusual future errors into reviewable alerts. Next, monitor deployed models for input shift, delayed outcome degradation, and retraining decisions.

References

[1] Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice, Third Edition. https://otexts.com/fpp3/

LearnProduction ML SystemsForecasting and Anomaly Detection

📊MediumEvaluation & Benchmarks

Forecasting and Anomaly Detection

Forecast batch-job demand with time-aware evaluation and turn large forecast errors into reviewable operational alerts.

16 min read

Learning path

Step 46 of 158 in the full curriculum

Ranking and Recommendation Systems Monitoring Predictive Models

Start with an ordered batch-job series

Build four weeks of daily counts. The weekday pattern repeats, but small changes keep the series realistic. The final Friday launch event creates a larger jump.

build-batch-job-series.py

from datetime import date, timedelta
from hashlib import sha256
from json import dumps
from math import ceil

start = date(2026, 1, 5)
weekly_pattern = [100, 112, 115, 118, 132, 82, 76]
weekly_noise = [
    [0, 0, 0, 0, 0, 0, 0],
    [2, -2, 3, -2, 2, 2, 2],
    [4, -2, 4, -2, 0, 1, 2],
    [3, 1, 2, -1, 28, 2, 1],
]

rows = []
for week, noise in enumerate(weekly_noise):
    for weekday, (baseline, offset) in enumerate(zip(weekly_pattern, noise)):
        rows.append(
            {
                "date": start + timedelta(days=7 * week + weekday),
                "weekday": weekday,
                "volume": baseline + offset,
                "launch_event": week == 3 and weekday == 4,
            }
        )

print("rows:", len(rows))
print("first date:", rows[0]["date"])
print("last date:", rows[-1]["date"])
print("launch_event day:", next(row["date"] for row in rows if row["launch_event"]))

Output

rows: 28
first date: 2026-01-05
last date: 2026-02-01
launch_event day: 2026-01-30

The fixture is deliberately small enough to inspect. In production, each model-serving cluster, service tier, and forecast horizon may become a separate series or slice.

Keep future rows out of training

Compare an invalid shuffled split with a chronological split. The audit prints whether the latest training date reaches or passes the earliest test date.

audit-time-splits.py

shuffled_train = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26]
shuffled_test = [index for index in range(len(rows)) if index not in shuffled_train]
time_train = list(range(21))
time_test = list(range(21, 28))

def split_summary(name, train_indices, test_indices):
    latest_train = max(rows[index]["date"] for index in train_indices)
    earliest_test = min(rows[index]["date"] for index in test_indices)
    future_leak = latest_train >= earliest_test
    print(
        f"{name}: latest_train={latest_train} "
        f"earliest_test={earliest_test} future_leak={future_leak}"
    )

split_summary("shuffled", shuffled_train, shuffled_test)
split_summary("time-aware", time_train, time_test)
assert max(time_train) < min(time_test)

Output

shuffled: latest_train=2026-01-31 earliest_test=2026-01-06 future_leak=True
time-aware: latest_train=2026-01-25 earliest_test=2026-01-26 future_leak=False

Establish a same-weekday baseline

Target day	Same day last week	Forecast	Actual	Forecast error
Monday, Jan 26	Monday, Jan 19	104	103	-1
Friday, Jan 30	Friday, Jan 23	132	160	+28

The Friday arithmetic is 160 - 132 = 28. Positive error means the observed count exceeded the forecast. Negative error means the forecast was too high.

Compute the complete holdout week. Each forecast reads the value exactly seven rows earlier.

forecast-with-weekly-lag.py

holdout = list(range(21, 28))

def weekly_lag(index):
    return rows[index - 7]["volume"]

for index in holdout:
    forecast = weekly_lag(index)
    actual = rows[index]["volume"]
    error = actual - forecast
    print(
        f"{rows[index]['date']} forecast={forecast} "
        f"actual={actual} error={error:+d}"
    )

Output

2026-01-26 forecast=104 actual=103 error=-1
2026-01-27 forecast=110 actual=113 error=+3
2026-01-28 forecast=119 actual=117 error=-2
2026-01-29 forecast=116 actual=117 error=+1
2026-01-30 forecast=132 actual=160 error=+28
2026-01-31 forecast=83 actual=84 error=+1
2026-02-01 forecast=78 actual=77 error=-1

Simple baselines matter. A complex model should beat the seasonal naive forecast on future windows before its extra features and operational cost are justified.

Measure forecast error in batch jobs

Mean absolute error (MAE) averages the size of each mistake without letting positive and negative errors cancel:

\text{MAE} = \frac{1}{n}\sum_{t=1}^{n}|y_t - \hat{y}_t|

Calculate overall MAE and normal-day MAE separately. The second metric excludes the known launch event day so you can see baseline behavior without hiding the operational spike.

measure-holdout-mae.py

def mean_absolute_error(errors):
    return sum(abs(error) for error in errors) / len(errors)

holdout_errors = [
    rows[index]["volume"] - weekly_lag(index)
    for index in holdout
]
normal_errors = [
    rows[index]["volume"] - weekly_lag(index)
    for index in holdout
    if not rows[index]["launch_event"]
]

print("holdout errors:", holdout_errors)
print("holdout MAE:", round(mean_absolute_error(holdout_errors), 1))
print("normal-day MAE:", round(mean_absolute_error(normal_errors), 1))

Output

holdout errors: [-1, 3, -2, 1, 28, 1, -1]
holdout MAE: 5.3
normal-day MAE: 1.5
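
Both figures can be checked by hand from the printed holdout errors:

```latex
\text{MAE}_{\text{holdout}} = \frac{1 + 3 + 2 + 1 + 28 + 1 + 1}{7} = \frac{37}{7} \approx 5.3
\qquad
\text{MAE}_{\text{normal}} = \frac{1 + 3 + 2 + 1 + 1 + 1}{6} = \frac{9}{6} = 1.5
```

Friday alone contributes 28 of the 37 batch jobs of absolute error, which is why the two metrics differ so much.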

Backtest more than one future window

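Run two folds: one that forecasts the week starting at row 14 and one that forecasts the holdout week starting at row 21. The seasonal baseline can forecast every day in both because each fold starts after at least one full week of history.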

run-rolling-origin-backtest.py

folds = []
for origin in (14, 21):
    indices = list(range(origin, origin + 7))
    errors = [
        rows[index]["volume"] - weekly_lag(index)
        for index in indices
    ]
    fold = {
        "start": rows[indices[0]]["date"],
        "end": rows[indices[-1]]["date"],
        "mae": round(mean_absolute_error(errors), 1),
    }
    folds.append(fold)
    print(f"{fold['start']} to {fold['end']}: MAE={fold['mae']}")

Output

2026-01-19 to 2026-01-25: MAE=0.9
2026-01-26 to 2026-02-01: MAE=5.3
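
The two hard-coded origins generalize to a small fold generator. The sketch below is one possible shape, assuming an expanding training window and non-overlapping test weeks. The helper name `rolling_origin_folds` and its parameters are hypothetical, not part of the lab.

```python
def rolling_origin_folds(n_rows, first_origin, horizon, step):
    """Yield (train_indices, test_indices) pairs in chronological order."""
    origin = first_origin
    while origin + horizon <= n_rows:
        train = list(range(0, origin))                 # every row before the origin
        test = list(range(origin, origin + horizon))   # the next `horizon` rows
        yield train, test
        origin += step


fold_indices = list(
    rolling_origin_folds(n_rows=28, first_origin=14, horizon=7, step=7)
)
for train, test in fold_indices:
    # Each fold must train strictly before it tests.
    assert max(train) < min(test)
    print(f"train rows 0..{train[-1]} -> test rows {test[0]}..{test[-1]}")
```

With 28 rows this reproduces the same two test windows as the lab, rows 14 to 20 and rows 21 to 27. A longer history would add folds without changing the code.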

Distinguish forecast errors from residuals

Two related quantities often get called residuals, but the distinction is useful:

Quantity	Where it comes from	Formula	What it tells you
residual	training data after fitting a model	observed value minus fitted value	whether model fits known history
forecast error	later observation after making a forecast	observed value minus forecast	whether model predicted unseen future data
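
A small numeric contrast may help. The sketch below assumes a hypothetical fitted model, the mean of the training Fridays, purely to have something that is fitted to history. The weekly-lag baseline in this lesson fits nothing, so it produces forecast errors only. The Friday counts are reconstructed from the lab's printed forecasts and errors.

```python
# Friday counts reconstructed from the lab's printed outputs:
# three training Fridays, then the held-out launch Friday.
training_fridays = [132, 134, 132]
holdout_friday = 160

# Hypothetical fitted model: the mean of the training Fridays.
fitted_value = sum(training_fridays) / len(training_fridays)

# Residuals: observed minus fitted, on rows the model has already seen.
residuals = [observed - fitted_value for observed in training_fridays]

# Forecast error: observed minus forecast, on a row the model never saw.
forecast_error = holdout_friday - fitted_value

print("fitted value:", round(fitted_value, 1))
print("residuals:", [round(r, 1) for r in residuals])
print("forecast error:", round(forecast_error, 1))
```

The residuals stay within about one batch job of zero, which only says the model fits known history. The forecast error of roughly 27 is the quantity that reveals the launch-day miss.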

Calibrate an error band from earlier days

calibrate-empirical-error-band.py

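from math import ceil

# Rows 7..20: the earliest rows with a same-weekday value one week back,
# all of them before the holdout week.
calibration = list(range(7, 21))
calibration_errors = [
    rows[index]["volume"] - weekly_lag(index)
    for index in calibration
]
target_coverage = 0.95
interval_policy = "empirical-absolute-error-p95-v1"

def nearest_rank_quantile(values, probability):
    # Nearest-rank method: take the ceil(p * n)-th smallest value (1-based).
    ordered = sorted(values)
    rank = max(0, ceil(probability * len(ordered)) - 1)
    return ordered[rank]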

error_band = nearest_rank_quantile(
    [abs(error) for error in calibration_errors],
    probability=target_coverage,
)
calibration_coverage = sum(
    abs(error) <= error_band
    for error in calibration_errors
) / len(calibration_errors)
holdout_coverage = sum(
    abs(error) <= error_band
    for error in holdout_errors
) / len(holdout_errors)

assert max(calibration) < min(holdout)
print("calibration errors:", calibration_errors)
print("95% empirical absolute-error band:", error_band)
print("in-sample calibration coverage:", round(calibration_coverage, 3))
print("later holdout coverage:", round(holdout_coverage, 3))

Output

calibration errors: [2, -2, 3, -2, 2, 2, 2, 2, 0, 1, 0, -2, -1, 0]
95% empirical absolute-error band: 3
in-sample calibration coverage: 1.0
later holdout coverage: 0.857
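
The nearest-rank arithmetic behind these numbers is worth writing out, because it shows how coarse the estimate is with only 14 calibration errors. Here $|e|_{(k)}$ denotes the $k$-th smallest absolute calibration error, a notation introduced only for this note:

```latex
\text{rank} = \lceil 0.95 \times 14 \rceil = \lceil 13.3 \rceil = 14
\;\Rightarrow\;
\text{band} = |e|_{(14)} = \max_t |e_t| = 3

\text{holdout coverage} = \frac{6}{7} \approx 0.857
```

Because the 14th of 14 sorted values is the maximum, every calibration error falls inside the band by construction, so the in-sample coverage of 1.0 carries no independent evidence. In the holdout week, a single miss out of seven days (Friday) is enough to move coverage from 1.0 to 0.857.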

Turn an unusual error into a review artifact

Apply the empirical band to the held-out week. The alert stores enough context to reproduce the comparison.

create-capacity-alert.py

forecast_version = "weekly-lag-v1"
training_cutoff = str(rows[20]["date"])
owner = "capacity-ops"

alerts = []
for index in holdout:
    forecast = weekly_lag(index)
    actual = rows[index]["volume"]
    forecast_error = actual - forecast
    if abs(forecast_error) > error_band:
        alerts.append(
            {
                "series": "fc-a.batch_jobs",
                "forecast_version": forecast_version,
                "training_cutoff": training_cutoff,
                "interval_policy": interval_policy,
                "date": str(rows[index]["date"]),
                "forecast": forecast,
                "actual": actual,
                "forecast_error": forecast_error,
                "lower": forecast - error_band,
                "upper": forecast + error_band,
                "launch_event": rows[index]["launch_event"],
                "action": "review",
                "owner": owner,
                "resolution": "pending",
            }
        )

for alert in alerts:
    print(alert)

Output

{'series': 'fc-a.batch_jobs', 'forecast_version': 'weekly-lag-v1', 'training_cutoff': '2026-01-25', 'interval_policy': 'empirical-absolute-error-p95-v1', 'date': '2026-01-30', 'forecast': 132, 'actual': 160, 'forecast_error': 28, 'lower': 129, 'upper': 135, 'launch_event': True, 'action': 'review', 'owner': 'capacity-ops', 'resolution': 'pending'}

The baseline caught Friday because 160 falls outside [129, 135]. The launch-event flag doesn't erase the event. It changes the first investigation step.

Route alerts with context. A planned model launch still deserves capacity review, but it shouldn't automatically create an incident page.

route-alert-with-context.py

def route_alert(alert):
    if alert["launch_event"]:
        return "review planned model launch capacity"
    return "page unexpected batch-job spike"

for alert in alerts:
    print(alert["date"], "->", route_alert(alert))

Output

2026-01-30 -> review planned model launch capacity

Evaluate alert usefulness separately

Metric	Question
alert precision	Of the alerts sent to review, how many were actionable?
alert recall	Of the actionable disruptions, how many did the policy surface?

Compute both metrics on a five-event resolution fixture. The policy surfaced two useful events, sent one noisy alert, and missed one disruption.

evaluate-alert-policy.py

resolved_events = [
    {"alert": True, "actionable": True},
    {"alert": True, "actionable": False},
    {"alert": False, "actionable": True},
    {"alert": False, "actionable": False},
    {"alert": True, "actionable": True},
]

true_positives = sum(event["alert"] and event["actionable"] for event in resolved_events)
false_positives = sum(event["alert"] and not event["actionable"] for event in resolved_events)
false_negatives = sum(not event["alert"] and event["actionable"] for event in resolved_events)

def rate_or_none(numerator, denominator):
    return round(numerator / denominator, 3) if denominator else None

alert_precision = rate_or_none(true_positives, true_positives + false_positives)
alert_recall = rate_or_none(true_positives, true_positives + false_negatives)
print("alert precision:", alert_precision)
print("alert recall:", alert_recall)

Output

alert precision: 0.667
alert recall: 0.667
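
Written as ratios over the fixture's counts, with two true positives, one false positive, and one false negative:

```latex
\text{precision} = \frac{TP}{TP + FP} = \frac{2}{2 + 1} \approx 0.667
\qquad
\text{recall} = \frac{TP}{TP + FN} = \frac{2}{2 + 1} \approx 0.667
```

The remaining event, with no alert and nothing actionable, is a true negative and enters neither ratio.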

Reproduce a data-quality anomaly

detect-duplicate-telemetry-events.py

raw_job_events = [
    {"event_id": "job-001", "job_id": "batch-100", "type": "started"},
    {"event_id": "job-001", "job_id": "batch-100", "type": "started"},
    {"event_id": "job-002", "job_id": "batch-101", "type": "started"},
]
unique_events = {
    event["event_id"]: event
    for event in raw_job_events
}
deduplication_policy = "telemetry-event-id-v1"
duplicate_rows = len(raw_job_events) - len(unique_events)

print("naive started-job count:", len(raw_job_events))
print("deduplicated started-job count:", len(unique_events))
print("duplicate rows:", duplicate_rows)

Output

naive started-job count: 3
deduplicated started-job count: 2
duplicate rows: 1
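
Keying a dictionary by `event_id` silently keeps the last row seen for each ID. A slightly more defensive variant, sketched below under the assumption that retried deliveries should be identical, keeps the first row and reports IDs whose duplicates disagree. The `deduplicate_events` helper and the conflicting `job-002` row are hypothetical additions for illustration.

```python
def deduplicate_events(events):
    """Keep the first row per event_id and collect IDs with conflicting payloads."""
    unique = {}
    conflicts = set()
    for event in events:
        event_id = event["event_id"]
        if event_id not in unique:
            unique[event_id] = event
        elif unique[event_id] != event:
            # Same ID, different content: not a harmless retry.
            conflicts.add(event_id)
    return list(unique.values()), sorted(conflicts)


raw_job_events = [
    {"event_id": "job-001", "job_id": "batch-100", "type": "started"},
    {"event_id": "job-001", "job_id": "batch-100", "type": "started"},
    {"event_id": "job-002", "job_id": "batch-101", "type": "started"},
    # Hypothetical conflicting row: same event_id, different job_id.
    {"event_id": "job-002", "job_id": "batch-999", "type": "started"},
]

unique_events, conflicting_ids = deduplicate_events(raw_job_events)
print("deduplicated started-job count:", len(unique_events))
print("conflicting event ids:", conflicting_ids)
```

An exact repeat such as `job-001` is dropped quietly, while a conflicting ID is surfaced for an ingestion audit instead of being resolved by whichever row arrived last.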

Gate and publish a forecast candidate

Run the gate. Every check must pass before the candidate can move on to shadow evaluation.

check-forecast-release-gates.py

gates = {
    "time_order": max(time_train) < min(time_test),
    "calibration_before_holdout": max(calibration) < min(holdout),
    "normal_day_mae": mean_absolute_error(normal_errors) <= 3.0,
    "calibration_coverage_recorded": 0.0 <= calibration_coverage <= 1.0,
    "holdout_coverage_recorded": 0.0 <= holdout_coverage <= 1.0,
    "interval_policy_versioned": bool(interval_policy),
    "deduplication_policy_versioned": bool(deduplication_policy),
    "alerts_have_context": all(
        {
            "series", "forecast_version", "training_cutoff", "interval_policy",
            "date", "forecast", "actual", "forecast_error", "lower", "upper",
            "launch_event", "action", "owner", "resolution",
        } <= alert.keys()
        for alert in alerts
    ),
    "owner_assigned": bool(owner),
}

for name, passed in gates.items():
    print(f"{name}: {passed}")
print("release gate:", all(gates.values()))

Output

time_order: True
calibration_before_holdout: True
normal_day_mae: True
calibration_coverage_recorded: True
holdout_coverage_recorded: True
interval_policy_versioned: True
deduplication_policy_versioned: True
alerts_have_context: True
owner_assigned: True
release gate: True

Publish a receipt. The hash lets a later monitoring job identify the exact baseline, interval, and input policies it evaluates.

publish-forecast-candidate-receipt.py

from hashlib import sha256
from json import dumps

receipt = {
    "series": "fc-a.batch_jobs",
    "forecast": forecast_version,
    "training_cutoff": training_cutoff,
    "interval_policy": {
        "version": interval_policy,
        "target_coverage": target_coverage,
        "absolute_error_band": error_band,
        "calibration_window": [str(rows[7]["date"]), str(rows[20]["date"])],
        "calibration_coverage": round(calibration_coverage, 3),
        "holdout_window": [str(rows[21]["date"]), str(rows[27]["date"])],
        "holdout_coverage": round(holdout_coverage, 3),
    },
    "input_policy": {
        "deduplication": deduplication_policy,
    },
    "evaluation": {
        "normal_day_mae": round(mean_absolute_error(normal_errors), 1),
        "alert_count": len(alerts),
    },
    "alerts": alerts,
    "release_checks": gates,
    "owner": owner,
    "status": "candidate_for_shadow" if all(gates.values()) else "blocked",
}
payload = dumps(receipt, sort_keys=True, separators=(",", ":"))

print(dumps(receipt, indent=2, sort_keys=True))
print("receipt sha256:", sha256(payload.encode()).hexdigest()[:12])

Output

{
  "alerts": [
    {
      "action": "review",
      "actual": 160,
      "date": "2026-01-30",
      "forecast": 132,
      "forecast_error": 28,
      "forecast_version": "weekly-lag-v1",
      "interval_policy": "empirical-absolute-error-p95-v1",
      "launch_event": true,
      "lower": 129,
      "owner": "capacity-ops",
      "resolution": "pending",
      "series": "fc-a.batch_jobs",
      "training_cutoff": "2026-01-25",
      "upper": 135
    }
  ],
  "evaluation": {
    "alert_count": 1,
    "normal_day_mae": 1.5
  },
  "forecast": "weekly-lag-v1",
  "input_policy": {
    "deduplication": "telemetry-event-id-v1"
  },
  "interval_policy": {
    "absolute_error_band": 3,
    "calibration_coverage": 1.0,
    "calibration_window": [
      "2026-01-12",
      "2026-01-25"
    ],
    "holdout_coverage": 0.857,
    "holdout_window": [
      "2026-01-26",
      "2026-02-01"
    ],
    "target_coverage": 0.95,
    "version": "empirical-absolute-error-p95-v1"
  },
  "owner": "capacity-ops",
  "release_checks": {
    "alerts_have_context": true,
    "calibration_before_holdout": true,
    "calibration_coverage_recorded": true,
    "deduplication_policy_versioned": true,
    "holdout_coverage_recorded": true,
    "interval_policy_versioned": true,
    "normal_day_mae": true,
    "owner_assigned": true,
    "time_order": true
  },
  "series": "fc-a.batch_jobs",
  "status": "candidate_for_shadow",
  "training_cutoff": "2026-01-25"
}
receipt sha256: a40b84aed22b
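
The hash is reproducible only because the payload is serialized canonically. The sketch below uses small stand-in receipts rather than the full one. It shows that `sort_keys=True` with fixed separators gives the same digest regardless of key insertion order, and that changing a recorded value changes the digest. The `receipt_digest` helper is a hypothetical name.

```python
from hashlib import sha256
from json import dumps


def receipt_digest(receipt):
    """Hash a canonical JSON encoding: sorted keys, no optional whitespace."""
    payload = dumps(receipt, sort_keys=True, separators=(",", ":"))
    return sha256(payload.encode()).hexdigest()


# Stand-in receipts with the same content in a different key order.
receipt_a = {"forecast": "weekly-lag-v1", "absolute_error_band": 3}
receipt_b = {"absolute_error_band": 3, "forecast": "weekly-lag-v1"}
# Same keys, but a different recorded band.
receipt_c = {"forecast": "weekly-lag-v1", "absolute_error_band": 4}

same_content = receipt_digest(receipt_a) == receipt_digest(receipt_b)
changed_band = receipt_digest(receipt_a) == receipt_digest(receipt_c)

print("same content, same digest:", same_content)
print("changed band, same digest:", changed_band)
```

A later monitoring job could recompute the digest from the stored receipt and compare it with the published value. The 12-character prefix is convenient for display, though storing the full digest is likely the safer choice.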

Explain the alert without looking back

Explain the complete path in your own words:

Why does a daily batch-job series need chronological splits?
What repeated pattern does a same-weekday baseline capture?
Why is January 30's +28 a forecast error rather than a training residual?
What does the empirical ±3 band mean, and what doesn't it guarantee?
Why do alert precision and recall need a separate reviewed dataset?
Why should input deduplication happen before model retraining?

Practice

Change Friday's observed volume from 160 to 134. Predict whether the alert fires before rerunning the lab.
Change target_coverage from 0.95 to 0.50 so that nearest_rank_quantile receives probability=0.50. Explain why the narrower band sends more rows to review.
Add a second model-serving cluster with a different weekend pattern. Explain why a shared global error band may hide local problems.
Add a two-day forecast horizon. Identify which earlier observations each prediction may use.
Extend resolved_events with two noisy alerts. Compute the new alert precision before running the cell.

Practice answer sketches

Prompt	Reasoning check
Friday becomes `134`	Forecast error becomes `134 - 132 = 2`, which stays inside the `±3` band. No alert fires.
Quantile becomes `0.50`	Band shrinks from `±3` to `±2`. Tuesday's `+3` and Friday's `+28` now enter review.
Add another model-serving cluster	Calibrate and report by cluster before sharing a band. Different local schedules can disappear inside an aggregate.
Predict two days ahead	For a target at time $t$, use information available no later than $t - 2$. Recompute each fold with that cutoff.
Add two noisy alerts	True positives stay at `2`, false positives rise from `1` to `3`, and precision falls to `2 / 5 = 0.4`.
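
The three numeric sketches can be verified without rerunning the whole lab. The block below is a standalone check that copies the calibration and holdout errors from the outputs printed in this lesson. It assumes the same nearest-rank quantile as the lab.

```python
from math import ceil

# Errors copied from the calibration and holdout outputs in this lesson.
calibration_errors = [2, -2, 3, -2, 2, 2, 2, 2, 0, 1, 0, -2, -1, 0]
holdout_errors = [-1, 3, -2, 1, 28, 1, -1]


def nearest_rank_quantile(values, probability):
    ordered = sorted(values)
    rank = max(0, ceil(probability * len(ordered)) - 1)
    return ordered[rank]


absolute_errors = [abs(error) for error in calibration_errors]
band_95 = nearest_rank_quantile(absolute_errors, 0.95)
band_50 = nearest_rank_quantile(absolute_errors, 0.50)

# Friday at 134 against a forecast of 132.
friday_alert = abs(134 - 132) > band_95

# Holdout errors that fall outside the narrower band.
review_errors = [error for error in holdout_errors if abs(error) > band_50]

# Two true positives, one original noisy alert, two added noisy alerts.
precision = 2 / (2 + 3)

print("p95 band:", band_95, "p50 band:", band_50)
print("Friday=134 alerts:", friday_alert)
print("errors in review at p50:", review_errors)
print("precision with two extra noisy alerts:", precision)
```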

What strong answers show

Evidence	A strong explanation demonstrates
temporal split	proves each evaluated day occurs after its training and calibration inputs
baseline	computes a same-weekday forecast and beats it before proposing extra model complexity
error metric	calculates MAE in batch jobs and discusses asymmetric capacity cost
uncertainty	treats the tiny empirical band as a teaching device and measures rolling-window coverage in production
anomaly handling	stores reproducible alert context and separates unusual observations from root causes
alert evaluation	measures reviewed alert precision and recall independently from forecast MAE
data quality	checks duplicate or stale inputs before retraining from anomalous observations

When the policy breaks

Symptom	Cause	Fix
Test results collapse at launch	random split leaked future behavior	use chronological rolling-origin validation
Alerts fire every Monday	forecast ignores weekly seasonality	add a same-weekday baseline first
Interval looks precise but misses peaks	band was calibrated in sample or on too little history	measure out-of-sample coverage by horizon and slice
Planned launch creates an incident page	alert threshold has no business context	route known events to capacity review
Model learns a fake demand jump	duplicated events entered training rows	deduplicate and audit ingestion before retraining
Operations ignores alerts	receipt lacks ownership or resolution	log policy version, context, owner, and outcome

Next Step

Continue to Monitoring Predictive Models






Mastery Check

Why isn't a random split acceptable just because every row belongs to the same cluster?
Why doesn't the tiny ±3 band prove that future batch-job counts will land inside it 95% of the time?
Why shouldn't the alert policy declare an incident as soon as a count crosses the error band?
