A brief history of modern AI: tracing the evolution from the original Transformer to GPT-3, the open-source explosion, and the rise of frontier reasoning models.
The Transformer was introduced in 2017 for machine translation.[1] But how did we get from translating French to AI assistants that can write code, answer questions, and call tools?
This article traces the evolution of large language models (LLMs), focusing on the architectural decisions that made modern systems possible.
💡 Key insight: The defining realization of the last decade was the discovery of scaling laws: language-model performance improves predictably as you increase model size, data, and compute, as long as those three ingredients stay in balance.[2][3]
| Era | What changed | Why it matters for engineers |
|---|---|---|
| Transformer | Attention replaced recurrent sequence processing. | Long-range token relationships became easier to learn. |
| GPT path | Decoder-only models made generation the main interface. | One prompt format can express many tasks. |
| Scaling laws | Size, data, and compute became predictable levers. | Model quality became an engineering budget problem. |
| Instruction tuning | Human-preferred answers replaced raw document continuation. | Chat APIs became useful products, not only research demos. |
Use this table as your anchor for the rest of the article: modern LLMs followed the decoder-only path, then became useful products after scale and instruction tuning.
The original 2017 Transformer had two parts: an Encoder (to read the input) and a Decoder (to generate the output). Soon after, researchers realized the two halves could stand alone: BERT kept only the encoder, while the GPT line kept only the decoder.
For a few years, encoder-only BERT-style models dominated many NLP benchmarks.[4] But decoder-only models won the general-assistant track, because generation is a flexible interface. If a model can generate text, you can ask for translation, summarization, coding, planning, or classification in the same format: prompt in, text out.
In 2020, OpenAI published work on scaling laws for neural language models.[2] The core finding: loss falls predictably, as a power law, in three factors: model size (parameters), dataset size (training tokens), and compute.
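To make "predictably" concrete, the paper fits each factor with its own power law, measured while the other two factors are not bottlenecks. The fitted forms look roughly like this (exponents as reported in the paper; loss in nats per token):

```latex
% Kaplan et al. [2]: held-out loss as a power law in each resource,
% when the other two resources are not bottlenecks.
L(N) \approx (N_c / N)^{\alpha_N}, \quad \alpha_N \approx 0.076  % N = parameters
L(D) \approx (D_c / D)^{\alpha_D}, \quad \alpha_D \approx 0.095  % D = training tokens
L(C) \approx (C_c / C)^{\alpha_C}, \quad \alpha_C \approx 0.050  % C = compute, optimally allocated
```

The small exponents are the practical point: each constant reduction in loss costs a multiplicative increase in parameters, tokens, or FLOPs, which is exactly why model quality became an engineering budget problem.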
OpenAI then trained GPT-3, a 175-billion-parameter decoder-only model.[6] It was more than 100 times larger than the 1.5-billion-parameter GPT-2.[5]
GPT-3 changed the field because it demonstrated in-context learning. You didn't need to retrain the model for every new task. You could show a few examples in the prompt, and the model often inferred the pattern on the fly. This was the practical birth of prompt engineering.
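A minimal sketch of what that looked like in practice: the "training data" is just a handful of examples pasted into the context window. `build_few_shot_prompt` below is a hypothetical helper, not an official API; the resulting string can be sent to any text-completion model.

```python
# Few-shot prompting in the GPT-3 style: no gradient updates anywhere --
# the task pattern lives entirely in the prompt.

def build_few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Format input/output pairs so the model can infer the pattern."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")  # model completes from here
    return "\n\n".join(blocks)

prompt = build_few_shot_prompt(
    examples=[
        ("The movie was fantastic.", "positive"),
        ("I want my money back.", "negative"),
    ],
    query="Best purchase I've made all year.",
)
print(prompt)  # paste into any completion endpoint; the model infers the task
```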
GPT-3 was powerful, but it was still a document completer. If you prompted it with "Write a recipe for cake", it might continue with another prompt instead of answering directly, because the base objective was "continue the text."
To fix this, researchers used RLHF (Reinforcement Learning from Human Feedback): humans compared pairs of model responses, a reward model learned to predict which response people preferred, and the language model was then optimized against that reward.[7] This turned a raw autocomplete engine into a more helpful conversational assistant.
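Concretely, the human comparisons train the reward model with a pairwise logistic loss over (preferred, rejected) responses to the same prompt, as in the InstructGPT paper:[7]

```latex
% InstructGPT reward-model objective [7]
\mathcal{L}_{\mathrm{RM}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim D}
  \big[ \log \sigma\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \big]
```

Here x is the prompt, y_w the response humans preferred, y_l the rejected one, and σ the logistic sigmoid. The language model is then fine-tuned (with PPO in the paper) to produce responses that score highly under r_θ without drifting too far from its pretrained behavior.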
In late 2022, OpenAI wrapped an instruction-tuned model in a chat interface and called it ChatGPT. For most users, that was the moment LLMs became a product instead of a research demo.
While other labs built proprietary model families, Meta took a different approach.
In early 2023, Meta released LLaMA, a family of capable foundation models.[8] Its weights quickly spread through the research and developer community.
Researchers realized you didn't always need 175 billion parameters to get useful performance. Chinchilla-style scaling showed that many models were undertrained for their size: smaller models trained on more tokens could be far more efficient.[3] That insight helped drive fine-tuning, quantization, and local inference engines.
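A back-of-the-envelope version of that insight, under two common rules of thumb rather than the paper's exact fit: training FLOPs ≈ 6 × parameters × tokens, and a compute-optimal ratio of roughly 20 training tokens per parameter.

```python
# Back-of-the-envelope Chinchilla arithmetic -- a sketch, not the paper's fit.
# Assumptions: training FLOPs ~= 6 * params * tokens, and compute-optimal
# training uses roughly 20 tokens per parameter (Hoffmann et al., 2022).

def chinchilla_optimal(train_flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (params, tokens) at ~20 tokens per parameter."""
    tokens_per_param = 20.0
    params = (train_flops / (6.0 * tokens_per_param)) ** 0.5
    return params, tokens_per_param * params

# GPT-3's training budget was roughly 6 * 175e9 * 300e9 ~= 3.1e23 FLOPs.
params, tokens = chinchilla_optimal(3.1e23)
print(f"~{params/1e9:.0f}B params on ~{tokens/1e9:.0f}B tokens")
# -> about 51B params on about 1.0T tokens, versus the actual 175B params
#    on 300B tokens: GPT-3 was far larger, and far less trained, than
#    this rule of thumb suggests -- "undertrained for its size".
```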
The field then shifted from "make every model bigger" to "spend compute more intelligently."
Mixture-of-Experts (MoE) models activate only part of the network for each token. Mixtral, for example, routes each token to a small subset of expert feed-forward networks.[9] This gives the model more total capacity while keeping active compute lower than a dense model of the same total size.
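A toy version of the routing idea, in NumPy rather than anything from the Mixtral codebase: each token is scored against every expert, but only the top two actually run (Mixtral uses top-2 routing over 8 experts per layer[9]).

```python
# Toy top-2 MoE routing -- a sketch of the idea, not Mixtral's implementation.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" here is a single weight matrix; in a real model each expert
# is a full feed-forward block.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Mix the outputs of the top-k experts for one token's hidden state x."""
    logits = x @ router_w                 # one routing score per expert
    top = np.argsort(logits)[-top_k:]     # indices of the top-2 experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                  # softmax over the selected experts only
    # Total capacity: n_experts weight matrices. Active compute: top_k of them.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

print(moe_layer(rng.standard_normal(d_model)).shape)  # (16,)
```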
Classic LLMs spend roughly the same forward-pass budget per output token, whether they're greeting you or solving a hard proof. Reasoning models add extra test-time compute before producing the final answer. OpenAI's o1-preview made this approach visible in product form in 2024.[10]
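OpenAI has not published o1's training recipe, so the sketch below is only an illustration of the test-time-compute idea using a separate, published technique: self-consistency, which samples several reasoning paths and majority-votes the final answer. The `generate` callable and the "Answer:" convention are assumptions for the sketch, not part of any real API.

```python
# Self-consistency: one simple, public way to spend extra test-time compute.
from collections import Counter
from typing import Callable

def self_consistency(generate: Callable[[str], str], prompt: str, n: int = 8) -> str:
    """Sample n reasoning paths and return the majority final answer."""
    answers = []
    for _ in range(n):
        completion = generate(prompt)  # one independently sampled reasoning path
        answers.append(completion.rsplit("Answer:", 1)[-1].strip())
    # n forward passes instead of one: compute is traded for reliability.
    return Counter(answers).most_common(1)[0][0]
```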
Next, continue to Prompt Engineering Fundamentals, where these models become tools you can steer.
References
[1] Vaswani, A., et al. (2017). Attention Is All You Need.
[2] Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models.
[3] Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022.
[4] Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
[5] Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners.
[6] Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.
[7] Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). NeurIPS 2022.
[8] Touvron, H., et al. (2023). LLaMA: Open and Efficient Foundation Language Models.
[9] Jiang, A. Q., et al. (2024). Mixtral of Experts.
[10] OpenAI (2024). o1-preview.
[11] Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training.