What Does an AI Engineer Actually Do?

AI engineering sits between foundation models and product engineering. The day-to-day work behind useful LLM systems is prompts, RAG, tools, evals, serving, and lightweight adaptation.

LeetLLM Team · February 19, 2026 · Updated June 11, 2026 · 15 min read

Foundation models become working product features through AI engineering. The engineers doing that work usually don't train the frontier model from scratch. They connect it to product data, define tool contracts, control latency and cost, build evals, and make failures observable.

Think about an account-support assistant. The model alone can't know live account status, can't decide which policy document the user may see, and can't safely call billing APIs without guardrails. The AI engineer designs that path: retrieve allowed evidence, call approved tools, validate output, measure quality, and keep the feature inside budget.

That's the role in one sentence: build reliable software around probabilistic models.

Why the role exists now

Figure: AI product work shifted from delivering one model toward operating the product loop around it: retrieval, tools, evals, serving, adaptation, and measurement.

Before foundation-model APIs and strong open-weight models became practical, many AI product efforts centered on collecting data, training a model, and deploying that model. That work still matters, but many product teams can now start with a capable hosted or open-weight model and spend most engineering effort on the surrounding system.

Swyx popularized the "AI Engineer" framing in 2023 as applied AI work moved toward software built on foundation models.[1] The term stuck because it names a real ownership boundary. The model is a component. The product is the deliverable.

Ownership boundary

A useful distinction: data scientists usually ship analysis and decisions, ML engineers usually ship trained or deployed models, and AI engineers usually ship model-backed product workflows. The AI engineer's question is, "Can this feature solve the user's problem safely, cheaply, and measurably?"

The roles overlap. A strong AI engineer still needs statistics, software engineering, and model intuition. What differs is where the day-to-day work lands.

Role emphasis	Primary output	Typical ownership question
Data science	Analysis, experiments, and decision support	Does the evidence support this decision?
ML engineering	Trained models and production ML pipelines	Can this model train, deploy, and stay reliable?
AI engineering	Model-backed product workflows	Can this feature use models safely, cheaply, and measurably?

Figure: Product need, Prompt + output contract, Evidence + tools, and Model call.

Scope rule: Titles overlap. Compare the artifact and ownership boundary in a job description before deciding which skill set it rewards.

What AI engineers build

Figure: AI engineer skill map covering the request path (prompt contract, retrieval, tools, serving, evaluation) and a quality loop (tracing, adaptation, regression tests). Use the skill map like a debugger: locate the failing stage first, then choose the smallest prompt, retrieval, tool, serving, adaptation, or eval fix that changes the metric.

Most AI engineering work falls into six connected areas.

Prompt contracts and structured output

Prompting isn't "asking nicely." It's the contract between product logic and model behavior. AI engineers write system instructions, examples, tool descriptions, and output schemas so model responses can be used by normal software.

Structured outputs matter because downstream code needs typed fields, not vibes. If a model extracts account_status, next_action, and confidence, the backend needs to validate those fields before rendering or calling another service.[2]

Failure mode: a prompt works in a notebook, then production users send weird inputs and JSON parsing breaks. Use a schema, validation, retries with bounded repair, and tests that include messy examples.
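A minimal sketch of schema validation with bounded repair, assuming the three fields named above. `call_model` is a hypothetical stand-in for whatever model client you use, and the allowed `next_action` values are invented for illustration.

```python
import json
from typing import Callable

# Allowed values are illustrative assumptions, not taken from any real API.
ALLOWED_ACTIONS = {"none", "escalate", "refund"}


def validate(raw: str) -> dict:
    """Parse model text and check the three fields; raise ValueError on any problem."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    if not isinstance(data.get("account_status"), str):
        raise ValueError("account_status must be a string")
    action = data.get("next_action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError("next_action is not an allowed value")
    conf = data.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        raise ValueError("confidence must be a number in [0, 1]")
    return data


def extract(call_model: Callable[[str], str], prompt: str, max_repairs: int = 2) -> dict:
    """Call the model, then retry with the validation error at most max_repairs times."""
    message = prompt
    for _ in range(max_repairs + 1):
        raw = call_model(message)
        try:
            return validate(raw)
        except ValueError as exc:
            message = f"{prompt}\n\nPrevious output was invalid ({exc}). Return only valid JSON."
    raise RuntimeError("model output failed validation after bounded repair")
```

The repair loop is bounded on purpose: after `max_repairs` attempts the caller gets an explicit error it can route to a fallback instead of an unbounded retry bill.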

RAG and retrieval

Retrieval-augmented generation (RAG) gives the model relevant external evidence at answer time. Demo RAG can be a PDF folder and a vector database. Production RAG needs parsing, chunking, embedding, exact-match search, permissions, reranking, and evals.

Figure: Production RAG architecture. Production RAG has two lanes: prepare trustworthy indexed evidence offline (parsing, chunking, embedding, indexing, access metadata), then filter permissions before retrieving, reranking, and assembling the tiny context packet for one query.

Dense retrieval helps with semantic matches. BM25 protects exact terms such as product names, error codes, and IDs.[3][4] Many serious systems use hybrid retrieval because users mix fuzzy language with exact identifiers.
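One common way to combine a dense ranking with a BM25 ranking is reciprocal rank fusion (RRF). The article does not name a fusion method, so treat this as one illustrative option rather than the recommended one. The document IDs are made up, and `k=60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs: each list adds 1 / (k + rank) to a doc's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending, then by doc ID so ties are deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for one query from two retrievers.
dense = ["faq-refunds", "policy-cancel", "blog-pricing"]  # semantic matches
bm25 = ["err-E4012", "policy-cancel", "faq-refunds"]      # exact-term matches
fused = reciprocal_rank_fusion([dense, bm25])
```

Documents that both retrievers return rise to the top, while a document found by only one retriever (such as an exact error-code match) still stays in the candidate list.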

Security is part of retrieval, not an afterthought. If private chunks enter the prompt, the model has already seen them. Guardrails after generation can't undo that. AI engineers store access metadata with chunks and test that forbidden documents never reach context assembly.
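A small sketch of filtering by permission before ranking, assuming access metadata is stored as a set of group names on each chunk. The `Chunk` type, group names, and word-overlap ranking are hypothetical placeholders for a real retriever.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]  # access metadata stored with the chunk


def allowed_chunks(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks that share at least one group with the user."""
    return [c for c in chunks if c.allowed_groups & user_groups]


def assemble_context(chunks: list[Chunk], user_groups: set[str], query: str) -> str:
    """Filter by permission first, then rank, so forbidden text never reaches ranking."""
    candidates = allowed_chunks(chunks, user_groups)
    # Toy ranking by shared words; a real system would call its retriever here.
    terms = set(query.lower().split())
    ranked = sorted(candidates, key=lambda c: -len(terms & set(c.text.lower().split())))
    return "\n".join(f"[{c.doc_id}] {c.text}" for c in ranked)
```

A regression test for this path can be as blunt as asserting that a forbidden document ID never appears in the assembled context for a user outside its groups.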

Agents and tools

Agents are model-driven loops that can call tools: search, database lookup, code execution, ticket creation, or API calls. Control is the difficult part. The model proposes actions; software decides what tools exist, what parameters are valid, what permissions apply, and when to stop.

Figure: ReAct versus Plan-and-Execute timing. ReAct chooses again after each tool observation. Plan-and-Execute commits to a sequence before running tools.

ReAct is useful when tool output should change the next step.[5] Plan-and-Execute works better when the workflow has a stable structure. Model Context Protocol (MCP) matters because it standardizes how tools and resources are exposed to AI clients. In its 2025 announcement, Anthropic said it was donating MCP to the Agentic AI Foundation (AAIF), a directed fund under the Linux Foundation that it co-founded with Block and OpenAI.[6]

Failure mode: the agent loops forever, calls a nonexistent tool, or takes a side effect without enough evidence. Use schema validation, max-iteration limits, allowlists, permission checks, idempotency, and human review for risky actions.
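The controls listed above can be sketched as a bounded loop in which software, not the model, owns the allowlist and the stopping rule. `propose` stands in for the model call, and the `lookup_order` tool and action format are hypothetical.

```python
from typing import Callable

# Allowlist: tool name -> (required parameter names, implementation).
TOOLS: dict[str, tuple[set[str], Callable[..., str]]] = {
    "lookup_order": ({"order_id"}, lambda order_id: f"order {order_id}: shipped"),
}


def run_agent(propose: Callable[[list[str]], dict], max_steps: int = 5) -> str:
    """Run a bounded tool loop; every proposed action is checked before execution."""
    observations: list[str] = []
    for _ in range(max_steps):
        action = propose(observations)
        if action.get("type") == "final":
            return str(action.get("answer", ""))
        name, args = action.get("tool"), action.get("args", {})
        if not isinstance(name, str) or name not in TOOLS:
            # Nonexistent tools become an observation, never an execution.
            observations.append(f"error: unknown tool {name!r}")
            continue
        required, impl = TOOLS[name]
        if not isinstance(args, dict) or set(args) != required:
            observations.append(f"error: {name} expects exactly {sorted(required)}")
            continue
        observations.append(impl(**args))
    return "Stopped: step limit reached without a final answer."
```

Rejected proposals still consume a step, so a model that keeps asking for a missing tool runs into the iteration limit instead of looping forever.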

Inference, serving, and cost

Every large language model (LLM) feature has a cost and latency shape. Provider APIs usually bill by input, cached input, and output tokens, so prompt length and response length directly affect spend.[7] Self-hosting adds GPU memory, batching, scheduling, routing, and cache pressure.
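To make the cost shape concrete, here is a back-of-the-envelope calculator. The per-million-token prices are made up for illustration, and the sketch assumes cached tokens are a subset of the input tokens billed at a lower rate, which may not match every provider's billing rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """US dollars per one million tokens. Values used below are invented."""
    input: float
    cached_input: float
    output: float


def request_cost(prices: Prices, input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Cost of one request; cached_tokens is the part of the input billed at the cached rate."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh = input_tokens - cached_tokens
    total = fresh * prices.input + cached_tokens * prices.cached_input + output_tokens * prices.output
    return total / 1_000_000


hypothetical = Prices(input=2.00, cached_input=0.50, output=8.00)
cold = request_cost(hypothetical, input_tokens=6_000, cached_tokens=0, output_tokens=500)
warm = request_cost(hypothetical, input_tokens=6_000, cached_tokens=5_000, output_tokens=500)
print(f"cold ${cold:.4f} vs warm ${warm:.4f} per request")
```

Under these assumed prices, caching a 5,000-token prefix roughly halves the per-request cost, which is why a stable prompt prefix is often the first cost lever to check.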

The KV cache is often the serving bottleneck because it stores the attention keys and values for previous tokens and grows with sequence length and concurrency. PagedAttention stores KV cache in fixed-size blocks to reduce fragmentation, while vLLM, SGLang, TensorRT-LLM, TGI, and llama.cpp make different serving trade-offs.[8][9][10][11][12]
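A rough size estimate shows why the cache dominates memory. The formula counts one key and one value vector per layer, per KV head, per cached token. The model shape below is hypothetical, and the estimate ignores block-level overhead from any particular serving stack.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_value: int = 2) -> int:
    """Keys plus values (the factor of 2) for every layer, KV head, and cached token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len * batch


# Hypothetical model shape: 32 layers, 8 KV heads of dimension 128, 16-bit values.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1, batch=1)
total = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=16)
print(per_token // 1024, "KiB per token")                     # 128 KiB per token
print(total / 2**30, "GiB for 16 sequences of 8192 tokens")   # 16.0 GiB for 16 sequences of 8192 tokens
```

Because the total scales linearly with both sequence length and the number of concurrent sequences, halving context length or concurrency halves the cache footprint.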

Failure mode: the feature is accurate but too slow or too expensive. The fix might be shorter context, prompt caching, semantic caching, model routing, quantization, batching, or a smaller model on easy requests.

Lightweight adaptation

Prompting and RAG should usually come first. Sometimes they stop improving the metric. Then model adaptation becomes reasonable.

AI engineers are more likely to use LoRA, QLoRA (quantized LoRA), instruction data curation, synthetic examples, and adapter rollout than full pretraining.[13][14] The decision question is practical: is the failure about missing facts, unstable behavior, format adherence, or domain style?

Fresh facts usually belong in RAG. Durable behavior can justify tuning. Format problems may be better solved with structured output before any weight update.
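Part of why LoRA counts as lightweight is simple arithmetic: for a frozen weight matrix of shape d_out x d_in, it trains two low-rank factors with rank * (d_out + d_in) parameters in total. The 4096 x 4096 matrix and rank 8 below are example values, not a recommendation.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA learns B (d_out x rank) and A (rank x d_in) while the base matrix stays frozen."""
    return rank * (d_out + d_in)


full = 4096 * 4096                                  # parameters in one full weight matrix
lora = lora_trainable_params(4096, 4096, rank=8)    # trainable parameters with LoRA
print(f"{lora:,} of {full:,} parameters ({lora / full:.2%})")
```

For this example the adapter trains 65,536 parameters against 16,777,216 in the full matrix, or 1/256 of the weights for that layer.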

Evaluation and regression testing

This is the biggest difference between demos and products. An AI engineer needs to answer, "Did the change make quality better, worse, or just different?"

That usually means a golden dataset, deterministic checks where possible, LLM-as-judge where needed, human calibration for the judge, and release gates that catch regressions. Benchmarks such as MMLU, HumanEval, and SWE-bench are useful context, but they rarely replace your product evals.[15][16][17]

Common mistake: spot-check ten examples and ship. Better pattern: collect real user cases, label expected behavior, run them every time the prompt, retrieval config, model, or tool loop changes, and inspect failures before rollout.

Evaluation rule: Public benchmarks describe a model or scaffold under one protocol. Product release decisions need task-local cases, failure labels, and a regression threshold.
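A minimal sketch of a release gate over task-local cases, using only a deterministic substring check. The case fields (`must_contain`, `failure_label`) and the 0.02 tolerance are assumptions for illustration; a real suite would add judge-based checks where string matching is too crude.

```python
from typing import Callable


def run_gate(cases: list[dict], system: Callable[[str], str],
             baseline_pass_rate: float, max_drop: float = 0.02) -> dict:
    """Run deterministic checks over a golden set and compare with a stored baseline."""
    if not cases:
        raise ValueError("golden set is empty")
    failures = []
    for case in cases:
        output = system(case["input"])
        # Deterministic check: every required substring must appear in the output.
        if not all(s.lower() in output.lower() for s in case["must_contain"]):
            failures.append({"id": case["id"], "label": case["failure_label"], "output": output})
    pass_rate = 1 - len(failures) / len(cases)
    return {
        "pass_rate": pass_rate,
        "failures": failures,  # inspect these before rollout, not only the aggregate
        "release_ok": pass_rate >= baseline_pass_rate - max_drop,
    }
```

Run the same function whenever the prompt, retrieval config, model, or tool loop changes, and store the report next to the revision that produced it.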

What a typical week looks like

At a startup, one engineer may own the whole loop: reproduce a bad answer, trace retrieval, update prompts, compare embedding models, expand the eval set, deploy, and watch spend.

Figure: Startup AI engineer weekly loop. Startup speed still needs measurement: trace the bad answer, repair the smallest stage (prompt, retrieval, or embeddings), add regression cases, then deploy behind a gate and feed production failures back into the next trace.

At a larger product company, the path is more structured: product goal, model contract, RAG or tool path, eval suite, backend integration, A/B test, and production monitoring.

Figure: Product company AI engineering workflow. The AI engineer connects product intent, model behavior, backend constraints, release gates, rollout, and production monitoring.

At a frontier AI lab or AI infrastructure company, the scope is narrower and deeper: tool-use infrastructure, eval platforms, model-serving systems, data pipelines, or post-training workflows. The work may look closer to systems engineering or ML infrastructure than product feature work.

Three ownership examples

Job titles hide the real boundary. Follow one artifact from request to production to see what the AI engineer owns.

Support answer with citations

Product asks for answers grounded in account policy. The AI engineer turns that request into a retrieval and response contract: allowed document scopes, required citations, refusal behavior, latency budget, and an eval set with answerable and unanswerable cases.

The work crosses several layers. Ingestion must preserve document version and access metadata. Retrieval must filter before ranking. Prompt assembly must fit the evidence budget. The response schema must distinguish answer, citation, and abstention. Release checks must catch unsupported answers and permission leaks.

Completion isn't "the chatbot answered my test question." Completion means the service passes retrieval and answer evals, forbidden documents never enter model context, traces expose failures, and a rollback path exists.
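One way to make the response contract checkable is a typed response plus a validator that runs before anything is rendered. The `SupportAnswer` shape and rule wording are hypothetical; the point is that answer, citation, and abstention are separate fields that code can test.

```python
from dataclasses import dataclass, field


@dataclass
class SupportAnswer:
    abstained: bool
    answer: str = ""
    citations: list[str] = field(default_factory=list)  # doc IDs backing the answer


def check_answer(resp: SupportAnswer, context_doc_ids: set[str]) -> list[str]:
    """Return contract violations; an empty list means the response may be shown."""
    problems = []
    if resp.abstained:
        if resp.answer or resp.citations:
            problems.append("abstention must not carry an answer or citations")
        return problems
    if not resp.answer.strip():
        problems.append("non-abstaining response needs an answer")
    if not resp.citations:
        problems.append("answer has no citations")
    # A citation is only valid if that document was actually in the model context.
    unknown = sorted(set(resp.citations) - context_doc_ids)
    if unknown:
        problems.append(f"citations not in context: {unknown}")
    return problems
```

The same validator can run inside the eval suite, so unsupported answers fail a release check instead of surfacing in production.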

Tool that changes state

Suppose an assistant may create a support ticket. The model can propose create_ticket, but trusted application code owns authorization, schema validation, idempotency, rate limits, and execution. The AI engineer defines that boundary and tests it with duplicate requests, missing fields, unauthorized users, and tool timeouts.

A polished natural-language response can't compensate for a duplicated side effect. Success requires the correct ticket state, a durable action receipt, and a response that reflects what the tool actually returned.
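A sketch of that boundary with an in-memory stand-in for the ticket backend. The class, field names, and receipt format are hypothetical; rate limits and timeouts are left out to keep the example short.

```python
class TicketService:
    """In-memory stand-in for a ticket backend; all names here are hypothetical."""

    def __init__(self, authorized_users: set[str]):
        self.authorized_users = authorized_users
        self.receipts: dict[str, dict] = {}  # idempotency key -> action receipt

    def create_ticket(self, user_id: str, idempotency_key: str, args: dict) -> dict:
        # Authorization is decided by application code, never by the model.
        if user_id not in self.authorized_users:
            raise PermissionError("user may not create tickets")
        # Schema validation: both fields must be non-empty strings.
        for name in ("subject", "body"):
            if not isinstance(args.get(name), str) or not args[name].strip():
                raise ValueError(f"missing or empty field: {name}")
        # Idempotency: a repeated key returns the stored receipt, no second ticket.
        if idempotency_key in self.receipts:
            return self.receipts[idempotency_key]
        receipt = {"ticket_id": f"T-{len(self.receipts) + 1}", "subject": args["subject"]}
        self.receipts[idempotency_key] = receipt
        return receipt
```

The assistant's reply should be built from the returned receipt, so a retried request reports the existing ticket instead of claiming a new one.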

Model-routing change

A team wants to send easy requests to a smaller model. The AI engineer defines the routing signal, fallback path, quality threshold, latency and cost metrics, and evaluation slices. They also test low-confidence cases and provider failures.

The router ships only when the smaller path preserves required quality and the fallback contains failures. Average latency alone can hide a bad tail or a weak slice, so acceptance criteria include per-route quality and fallback rate.
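A minimal routing sketch under stated assumptions: `difficulty` is a hypothetical scoring function that returns a value in [0, 1], and `small` and `large` stand in for two model clients. Fallbacks are labeled as their own route so the fallback rate can be tracked.

```python
from typing import Callable


def route(request: str, difficulty: Callable[[str], float],
          small: Callable[[str], str], large: Callable[[str], str],
          threshold: float = 0.5) -> tuple[str, str]:
    """Return (route_name, answer). Easy requests try the small model first."""
    if difficulty(request) < threshold:
        try:
            return "small", small(request)
        except Exception:
            # Provider failure on the small path: fall back and record it as its own route.
            return "fallback", large(request)
    return "large", large(request)


def per_route_counts(routes: list[str]) -> dict[str, int]:
    """Count requests per route so fallback rate is visible next to quality metrics."""
    counts: dict[str, int] = {}
    for name in routes:
        counts[name] = counts.get(name, 0) + 1
    return counts
```

Logging the route name with every eval result makes it possible to report quality per route instead of one blended average.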

Tools you should recognize

Tool names change quickly. Stable categories matter more:

Model APIs: OpenAI, Anthropic, Gemini, Mistral, and Groq.
Open-weight models: Llama, Qwen, DeepSeek, Mistral, and Kimi.
Serving stacks: vLLM, TensorRT-LLM, SGLang, TGI, and llama.cpp.
Retrieval systems: Qdrant, Weaviate, Pinecone, and pgvector.
Orchestration: LangGraph, LlamaIndex, Haystack, or custom code.
Evaluation: Braintrust, LangSmith, or custom evals.
Observability: OpenTelemetry, Helicone, or LangSmith.

Learn the concepts behind those names: pricing, rate limits, context windows, structured output, licenses, quantization, batching, KV cache, metadata filters, reranking, tool retries, judges, regression gates, traces, token spend, and failure analysis.

Default to boring technology when possible. A stack your team can debug at 2 a.m. beats a trendy abstraction nobody understands.

Diagnose one bad answer

Take a support assistant that cites the wrong cancellation policy. Editing the prompt first is tempting, but the failure may have happened earlier.

Start from the trace and reconstruct one request:

1. Confirm user identity and document access scope.
2. Read the normalized query sent to retrieval.
3. Inspect retrieved document IDs, versions, and scores.
4. Inspect reranked evidence and final context packet.
5. Compare generated claims with cited passages.
6. Check schema validation, fallback, and rendered response.

If the correct policy never appeared in initial retrieval, prompt work can't recover it. Check parsing, chunk boundaries, metadata filters, hybrid search, and query rewriting. Add the failing request to retrieval evals before changing the index or retriever.

If retrieval found the correct policy but reranking dropped it, compare ranking features and score distributions. Fix or replace the reranker, then verify that unrelated query slices don't regress.

If correct evidence reached the model but the answer contradicted it, tighten the response contract, require claim-level citations, add an abstention path, or test another model. Keep the same evidence packet while comparing generation changes.

If the model response was correct but the user saw the wrong text, inspect parsing, caching, and frontend rendering. AI failures are often ordinary distributed-systems failures near a model call.

Finish with three artifacts: a regression case, a trace showing the repaired path, and a metric change over the relevant eval slice. Without those, the team has a plausible patch but no proof.
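The diagnosis order above can be sketched as a function over one recorded trace. The trace field names are hypothetical; what matters is that stages are checked in request order, so the first failure found is the earliest one.

```python
def first_failing_stage(trace: dict, correct_doc_id: str) -> str:
    """Walk the request path in order and name the first stage that lost the right answer."""
    if correct_doc_id not in trace["retrieved_ids"]:
        return "retrieval"   # check parsing, chunking, filters, hybrid search, query rewriting
    if correct_doc_id not in trace["context_ids"]:
        return "reranking"   # found at first, dropped before context assembly
    if not trace["answer_supported"]:
        return "generation"  # evidence reached the model but the answer contradicted it
    if trace["rendered_text"] != trace["model_text"]:
        return "delivery"    # parsing, caching, or frontend rendering changed the text
    return "none"
```

A trace that returns "none" after the fix, for a request that previously returned a stage name, is a compact form of the repaired-path artifact.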

Career progression

Titles vary, so judge roles by ownership. Junior and mid-level engineers usually own feature slices, prompt changes, eval harnesses, or simple RAG endpoints. Senior engineers own end-to-end systems and trade-offs across product, infrastructure, quality, and cost. Staff and principal engineers standardize shared platforms, governance, vendor strategy, and serving architecture. AI platform leaders own internal APIs, observability, security, and budget systems across teams.

Don't over-index on title. A "Software Engineer, AI" can own more real AI system surface area than an "AI Engineer" role that only tweaks prompts.

How to break in

If you're a software engineer

You already know production software. Add transformer fundamentals, one real RAG project with retrieval evals and citations, inference economics, one tool-using agent with permissions and stopping rules, and one eval suite that catches regressions.

Strong portfolio signal: "I found this failure, wrote this eval, shipped this fix, and moved this metric."

If you're a data scientist

Your advantage is measurement: experimentation, labels, and statistical thinking map directly to evaluation work. Your gap is usually systems depth, so build APIs with typed code, observability, deployment practice, and backend integration skills.

If you're a new grad

Focus on proof. Build two or three small systems that show the whole loop: document QA over real docs with retrieval metrics and citations, a tool-using agent with a safe mock API and failure tests, or a model-routing/caching demo that compares cost, latency, and quality.

Write short technical notes about what broke. Clear failure analysis beats a polished demo video.

Portfolio acceptance criteria

A portfolio project should prove an engineering loop beyond displaying a chat interface. Before calling a project complete, require evidence in five areas.

Behavior

A written task contract names supported inputs, outputs, refusals, and side effects.
A fixed eval set covers normal cases, edge cases, and expected failures.
Results include per-case evidence alongside any aggregate score.

Safety and control

Retrieval applies access checks before context assembly.
Tool calls use schemas, authorization, idempotency, and bounded retries.
Risky actions have an approval or simulation path.

Operations

Traces connect request, retrieval, model, tool, and response stages.
Dashboards or logs expose latency, errors, token use, and fallback rate.
Deployment has a rollback path and a documented failure drill.

Reproducibility

Repository includes setup instructions and pinned dependencies.
Eval commands produce saved outputs from a clean environment.
Model, prompt, dataset, and index revisions are recorded with each run.

Communication

README states what failed, what changed, and which metric moved.
Architecture diagram marks trusted boundaries and external dependencies.
Trade-off notes explain one rejected design and why it lost.

One strong project can satisfy these criteria without using every popular framework. Reviewers can inspect the contract, run the eval, reproduce a failure, and see how the system behaves when a dependency breaks.

What separates the role

AI engineers build products with foundation models; ML engineers more often build or maintain models.
The core stack is prompt contracts, retrieval, tools, serving, adaptation, evaluation, and observability.
The main differentiator isn't demo speed. It's whether you can prove quality, control cost, and debug failures.
Breaking in requires one end-to-end project with clear trade-offs, not a list of tool names.
Fundamentals still matter because attention, context windows, and serving mechanics shape product decisions.

If you're ready to start, the AI Engineering Curriculum begins with Git, shell, and Linux for reproducible AI work, then builds toward RAG, agents, evaluation, serving, and system design. Scaled dot-product attention comes later as a transformer deep dive.

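References

[1] The Rise of the AI Engineer. swyx · 2023 · https://www.latent.space/p/ai-engineer
[2] Structured outputs. OpenAI · 2024 · https://developers.openai.com/api/docs/guides/structured-outputs
[3] Dense Passage Retrieval for Open-Domain Question Answering. Karpukhin, V., et al. · 2020 · EMNLP 2020 · https://arxiv.org/abs/2004.04906
[4] The Probabilistic Relevance Framework: BM25 and Beyond. Robertson, S., & Zaragoza, H. · 2009 · Foundations and Trends in Information Retrieval · https://doi.org/10.1561/1500000019
[5] ReAct: Synergizing Reasoning and Acting in Language Models. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. · 2023 · ICLR 2023 · https://arxiv.org/abs/2210.03629
[6] Donating the Model Context Protocol and establishing the Agentic AI Foundation. Anthropic · 2025 · https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation
[7] OpenAI API Pricing. OpenAI · 2026 · https://developers.openai.com/api/docs/pricing
[8] Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon, W., et al. · 2023 · SOSP 2023 · https://arxiv.org/abs/2309.06180
[9] vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention. vLLM Team · 2024 · https://github.com/vllm-project/vllm
[10] SGLang: Efficient Execution of Structured Language Model Programs. Zheng, L., Yin, L., Xie, Z., et al. · 2023 · arXiv:2312.07104 · https://arxiv.org/abs/2312.07104
[11] TensorRT-LLM: A High-Performance Inference Framework for LLMs. NVIDIA · 2024 · https://github.com/NVIDIA/TensorRT-LLM
[12] llama.cpp: Inference of LLaMA model in pure C/C++. Gerganov, G. · 2023 · https://github.com/ggml-org/llama.cpp
[13] LoRA: Low-Rank Adaptation of Large Language Models. Hu, E. J., et al. · 2021 · ICLR · https://arxiv.org/abs/2106.09685
[14] QLoRA: Efficient Finetuning of Quantized Language Models. Dettmers, T., et al. · 2023 · NeurIPS · https://arxiv.org/abs/2305.14314
[15] Measuring Massive Multitask Language Understanding (MMLU). Hendrycks, D., et al. · 2021 · ICLR 2021 · https://arxiv.org/abs/2009.03300
[16] Evaluating Large Language Models Trained on Code (HumanEval). Chen, M., et al. · 2021 · arXiv preprint · https://arxiv.org/abs/2107.03374
[17] SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Jimenez, C. E., et al. · 2024 · ICLR 2024 · https://arxiv.org/abs/2310.06770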
