Master the local engineering environment production AI systems depend on: version control for code/data/models, shell one-liners for GPUs and datasets, Linux fundamentals, and reproducible setups that survive laptop changes and team handoff.
Before Python, Docker, or tests, you need a repo that survives a fresh clone. This baseline includes safe Git defaults, one reproducible eval command, shell checks that tell you what machine and dataset you're using, and Linux habits that keep long jobs alive.
One tiny access-request eval file runs through the whole lesson. Another machine should be able to clone the repo, run one command, and get the same 0.667 result instead of "it worked on my laptop." That property, a clean clone producing identical behavior, is what Git's snapshot model is built to give you.[1]
Create a new directory and initialize Git as you would for any real AI project.
```bash
mkdir access-rag && cd access-rag
git init
```

The .git directory is the repository's memory. Everything that follows will be tracked or explicitly ignored.
Create .gitignore with the patterns that real LLM projects need:
```gitignore
# Python
__pycache__/
*.py[cod]
*$py.class
.venv/
env/
ENV/

# Environment & secrets (do not commit these)
.env
.env.local
*.pem
secrets/

# Large model and vector artifacts
models/
*.gguf
*.bin
*.safetensors
*.pt
*.pth
chroma/
faiss_index/
*.db
*.sqlite3

# OS and editor noise
.DS_Store
.idea/
.vscode/
*.swp

# Evaluation caches that should be regenerated
eval_cache/
runs/
wandb/
mlruns/
```

Track large files with Git LFS (Large File Storage) so the repo stays small while the actual model weights and vector indexes travel with the project when needed. GitHub warns on files over 50 MiB and hard-blocks any single file over 100 MiB; LFS replaces the file in history with a small pointer and stores the bytes separately.[2] Git LFS is a separate tool, so check for it first. If it's not installed yet, don't commit model files; leave the rule documented and install LFS before adding large artifacts. Note that the .gitignore above already ignores *.gguf, *.bin, and *.safetensors, so a weight file you do decide to version through LFS has to be added with git add -f, or have its pattern removed from .gitignore first.
```bash
cat > .gitattributes << 'EOF'
# Install Git LFS before committing model weights:
# git lfs track "*.gguf" "*.safetensors" "*.bin"
EOF

if command -v git-lfs >/dev/null 2>&1; then
  git lfs install
  git lfs track "*.gguf" "*.safetensors" "*.bin"
else
  echo "Git LFS is not installed. Safe for now: do not commit model weights yet."
fi

git add .gitattributes
```

Commit the skeleton.
```bash
git add .gitignore .gitattributes
git commit -m "chore: initial AI project skeleton with safe .gitignore and LFS"
```

This repo can be cloned anywhere without immediately leaking keys or filling the disk with 7 GB of unneeded model files.
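The size limits quoted earlier (a warning above 50 MiB, rejection above 100 MiB) can be checked locally before a push. The sketch below is one possible way to do that; the names `classify_size` and `oversized_files` are hypothetical helpers, not part of Git or Git LFS.

```python
from pathlib import Path

WARN_BYTES = 50 * 1024 * 1024    # GitHub warns above 50 MiB
BLOCK_BYTES = 100 * 1024 * 1024  # GitHub rejects single files above 100 MiB


def classify_size(num_bytes: int) -> str:
    """Map a file size to 'ok', 'warn', or 'block' (hypothetical helper)."""
    if num_bytes > BLOCK_BYTES:
        return "block"
    if num_bytes > WARN_BYTES:
        return "warn"
    return "ok"


def oversized_files(root: str = ".") -> list:
    """Return (path, verdict) pairs for files outside .git that are not 'ok'."""
    findings = []
    for path in Path(root).rglob("*"):
        # Skip Git internals and anything that is not a regular file.
        if ".git" in path.parts or not path.is_file():
            continue
        verdict = classify_size(path.stat().st_size)
        if verdict != "ok":
            findings.append((str(path), verdict))
    return findings


# Size-only demonstration, so the sketch runs without scanning a real repo.
for size in (10, 60 * 1024 * 1024, 101 * 1024 * 1024):
    print(size, classify_size(size))
```

Calling `oversized_files(".")` from the repo root before `git add` would list candidates for LFS tracking; whether to wire that into the pre-commit hook is a design choice left open here.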
Place the three-row access-request evaluation file that the rest of the curriculum will reuse.
```bash
mkdir -p eval
cat > eval/access_requests.jsonl << 'EOF'
{"prompt": "Access request 101 status?", "expected": "approved", "prediction": "approved"}
{"prompt": "Access request 102 status?", "expected": "blocked", "prediction": "escalated"}
{"prompt": "Access request 103 status?", "expected": "restored", "prediction": "restored"}
EOF
```

This tiny file is the contract. Later chapters (Python scorer, NumPy tensor experiments, PyTorch training loop, RAG pipeline, agent) will be measured against these three rows first.
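As a rough preview of what "measured against these three rows" could mean, the sketch below computes exact-match accuracy over the fixture. The function name `exact_match_accuracy` and the inlined `FIXTURE` string are illustrative assumptions; this is not the scorer the later chapter builds.

```python
import json

# The three fixture rows, inlined so the sketch runs without the filesystem.
FIXTURE = """\
{"prompt": "Access request 101 status?", "expected": "approved", "prediction": "approved"}
{"prompt": "Access request 102 status?", "expected": "blocked", "prediction": "escalated"}
{"prompt": "Access request 103 status?", "expected": "restored", "prediction": "restored"}
"""


def exact_match_accuracy(jsonl_text: str) -> float:
    """Fraction of rows whose prediction equals expected (hypothetical helper)."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("no eval rows found")
    hits = sum(row["prediction"] == row["expected"] for row in rows)
    return hits / len(rows)


score = exact_match_accuracy(FIXTURE)
print(f"Exact-match accuracy: {score:.3f}")  # 2 of 3 rows match -> 0.667
```

Row 102 is the single miss (expected "blocked", predicted "escalated"), which is where the 0.667 comes from.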
Create a tiny executable that the pre-commit hook and clean-clone reproduction command will run.
```bash
mkdir -p scripts
cat > scripts/run_eval.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail

EVAL_FILE="eval/access_requests.jsonl"
if [[ ! -f "$EVAL_FILE" ]]; then
  echo "ERROR: $EVAL_FILE missing. Did you forget to commit the fixture or pull the latest repo?"
  exit 1
fi

# Placeholder for the real Python scorer that the next chapter will build.
# For now we count lines and print a deterministic "score".
rows=$(wc -l < "$EVAL_FILE" | tr -d ' ')
echo "Eval rows: $rows"
echo "Exact-match accuracy on tiny fixture: 0.667 (2/3)"
echo "Gate passed. You may commit."
EOF
chmod +x scripts/run_eval.sh
```

Create a repo-local reproduction command. This is important: shell aliases and Git hooks are local machine state, but repro.sh travels with the repo.
```bash
cat > repro.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail

./scripts/run_eval.sh
EOF
chmod +x repro.sh
```

Now wire the same gate as a pre-commit hook. Don't try to commit .git/hooks/pre-commit; files under .git/ are Git internals, not normal tracked project files. Commit the hook source under scripts/, then install it into .git/hooks/ on each clone.
```bash
cat > scripts/pre-commit-ai-eval.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail

echo "Running AI eval gate before commit..."
./scripts/run_eval.sh
echo "Eval gate passed."
EOF

cat > scripts/install_hooks.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail

mkdir -p .git/hooks
cp scripts/pre-commit-ai-eval.sh .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit
echo "Installed .git/hooks/pre-commit"
EOF

chmod +x scripts/pre-commit-ai-eval.sh scripts/install_hooks.sh
./scripts/install_hooks.sh
```

Test it.
```bash
./repro.sh
git add scripts/run_eval.sh scripts/pre-commit-ai-eval.sh scripts/install_hooks.sh repro.sh eval/access_requests.jsonl
git commit -m "feat: add three-row eval and pre-commit gate that protects 0.667"
```

```text
Eval rows: 3
Exact-match accuracy on tiny fixture: 0.667 (2/3)
Gate passed. You may commit.
Running AI eval gate before commit...
Eval rows: 3
Exact-match accuracy on tiny fixture: 0.667 (2/3)
Gate passed. You may commit.
Eval gate passed.
```

The first three lines come from ./repro.sh; the rest is the hook running during git commit. If the fixture file disappears, the script exits non-zero and the commit is rejected. Once the real scorer replaces the placeholder, a reported regression will block the commit the same way. This is the first concrete "production check" in the curriculum.
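A pre-commit hook blocks a commit whenever it exits with a non-zero status, so a regression gate reduces to turning a score into an exit code. The sketch below shows one way that comparison might look once a real scorer exists; `BASELINE`, `gate_exit_code`, and the tolerance value are assumptions for illustration.

```python
BASELINE = 2 / 3  # the 0.667 the three-row fixture currently scores


def gate_exit_code(score: float, baseline: float = BASELINE,
                   tolerance: float = 1e-9) -> int:
    """Return 0 if the score has not dropped below the baseline, else 1."""
    # The small tolerance keeps floating-point noise from failing the gate.
    return 0 if score + tolerance >= baseline else 1


for candidate in (2 / 3, 1.0, 1 / 3):
    status = gate_exit_code(candidate)
    verdict = "pass" if status == 0 else "reject commit"
    print(f"score={candidate:.3f} exit={status} ({verdict})")
```

A script using this would finish with `sys.exit(gate_exit_code(score))`, which is what lets the shell hook's `set -e` stop the commit.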
Add a few functions to your ~/.zshrc or ~/.bashrc that you'll use in each AI project for the rest of your career.
```bash
# GPU snapshot (works on NVIDIA, falls back gracefully)
# --query-gpu + --format=csv is the scriptable, stable nvidia-smi interface. [3]
gpu() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv,noheader
  else
    echo "No NVIDIA GPU or nvidia-smi not in PATH"
  fi
}

# Dataset size at a glance
ds() {
  du -sh "${1:-.}" 2>/dev/null | awk '{print $1 " " $2}'
  echo "JSONL rows: $(find "${1:-.}" -name '*.jsonl' -exec wc -l {} + 2>/dev/null | tail -1 | awk '{print $1}')"
}

# One-command reproduction of the current eval
repro() {
  if [[ -x ./repro.sh ]]; then
    ./repro.sh
  elif [[ -x ./scripts/run_eval.sh ]]; then
    ./scripts/run_eval.sh
  elif [[ -x ./reproduce.sh ]]; then
    ./reproduce.sh
  else
    echo "No reproducible entrypoint found (looked for repro.sh, scripts/run_eval.sh, or reproduce.sh)"
    return 1
  fi
}
```

After sourcing, gpu, ds, and repro become muscle memory. You type one word and immediately know whether the machine has the resources the workload expects. In a clean clone, use ./repro.sh; shell shortcuts should make the common path faster, not hide the real entry point.
Two shell reflexes separate AI engineers from people who guess. The first is inspecting a dataset that's too big to open. Never run cat train.jsonl on a multi-gigabyte file: it floods the terminal and stalls a remote host. Stream it instead: each tool reads a little and passes it on, so memory stays flat no matter how large the file is.
```bash
head -n 1 eval/access_requests.jsonl            # peek at the schema of one row
wc -l eval/access_requests.jsonl                # count rows without loading the file
grep -c '"expected"' eval/access_requests.jsonl # how many rows have the field
```

The pipe | chains these into one pass. grep '"restored"' eval/access_requests.jsonl | wc -l filters, then counts, without ever holding the whole file in memory.
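The same streaming idea carries over to Python: iterating over a file object yields one line at a time, so a filter-then-count pass keeps memory flat. The sketch below mirrors the grep | wc -l pipeline under that assumption; `count_matching` and `SAMPLE` are hypothetical names, and an in-memory `io.StringIO` stands in for a large file.

```python
import io
from typing import Iterable

# Stand-in for a large JSONL file; a real file object from open() is
# iterated line by line in exactly the same way.
SAMPLE = """\
{"prompt": "Access request 101 status?", "expected": "approved", "prediction": "approved"}
{"prompt": "Access request 102 status?", "expected": "blocked", "prediction": "escalated"}
{"prompt": "Access request 103 status?", "expected": "restored", "prediction": "restored"}
"""


def count_matching(lines: Iterable[str], needle: str) -> int:
    """Count lines containing needle, consuming the input lazily."""
    return sum(1 for line in lines if needle in line)


# Like grep, this counts matching lines, not occurrences: row 103 contains
# "restored" twice but contributes 1.
print(count_matching(io.StringIO(SAMPLE), '"restored"'))
print(count_matching(io.StringIO(SAMPLE), '"expected"'))
```

With a real dataset the call would be `count_matching(open("train.jsonl"), needle)`, where the path is whatever file you are inspecting.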
The second reflex is reclaiming a GPU that a crashed job is still holding. A PyTorch script can die and leave its process resident, so nvidia-smi shows VRAM "used" by a job that no longer exists. Find the owning process, ask it to exit cleanly, and only force-kill if it refuses.
```bash
nvidia-smi    # read the PID in the bottom "Processes" table
kill 12345    # SIGTERM (15): let the process flush and release VRAM
sleep 5       # give the process a few seconds to exit on its own
kill -0 12345 2>/dev/null && kill -9 12345   # escalate to SIGKILL only if still alive
```

Reaching for kill -9 first is a common interview tell. SIGTERM (signal 15) lets the process clean up (close files, free CUDA memory), while SIGKILL (signal 9) can't be caught or handled and risks leaving lock files or corrupt checkpoints behind, so it's a last resort.[4]
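The terminate-first, kill-last order can also be scripted. The sketch below applies it to a child process started from Python, using `subprocess.Popen.terminate()` (SIGTERM on POSIX) and `Popen.kill()` (SIGKILL on POSIX). The helper name `stop_gracefully` and the five-second grace period are assumptions, and the sleeping child is only a stand-in for a hung training job.

```python
import subprocess
import sys


def stop_gracefully(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Ask the process to exit; force-kill only if it ignores the request."""
    proc.terminate()  # polite request: the process may clean up and exit
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()   # last resort: cannot be caught or handled
        return proc.wait()


# Stand-in for a stuck job: a child interpreter that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_gracefully(child)
print("child exited with code", exit_code)
```

For a PID read from nvidia-smi rather than a process you spawned, `os.kill(pid, signal.SIGTERM)` is the corresponding first step.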
AI work is often measured in hours, not seconds. You need to know how to:
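- Keep a job running after you log out: nohup python train.py > train.log 2>&1 &
- Work inside a session that survives a dropped connection: tmux new -s training, tmux attach -t training, Ctrl-b d to detach.
- Lower a job's CPU priority so it doesn't starve the rest of the machine: nice -n 10 python train.py
- Find what is filling the disk: du -ah /workspace | sort -rh | head -20
- See which process is holding the GPU: nvidia-smi + ps aux | grep python

Four of these (tmux, nohup, nice, and the nvidia-smi + ps dance) prevent the majority of "my training died when I closed the laptop" and "I have no idea which process is eating 40 GB of VRAM" disasters.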
Create an activation script that each teammate and each CI job can source. This first chapter doesn't require PyTorch yet, so the script reports CUDA status only if torch is already installed.
```bash
cat > requirements.txt << 'EOF'
# Empty in this first chapter.
# Later chapters will add pinned runtime packages here.
EOF

cat > activate.sh << 'EOF'
#!/usr/bin/env bash

# This file is sourced, so failures should return to the caller instead of
# closing the interactive shell. A return inside fail() only leaves the
# function, so each call site adds "|| return 1" to stop the sourced script.
fail() {
  echo "ERROR: $1"
  return 1
}

PYTHON_BIN="${PYTHON_BIN:-python3}"
if ! command -v "$PYTHON_BIN" >/dev/null 2>&1; then
  PYTHON_BIN="python"
fi
if ! command -v "$PYTHON_BIN" >/dev/null 2>&1; then
  fail "install python3 or set PYTHON_BIN=/path/to/python" || return 1
fi

# 1. Create or reuse a local virtualenv
if [[ ! -d .venv ]]; then
  "$PYTHON_BIN" -m venv .venv || fail "could not create .venv" || return 1
fi
source .venv/bin/activate || fail "could not activate .venv" || return 1

# 2. Install what this repo needs (skip if requirements.txt has only comments)
pip install --upgrade pip || fail "could not upgrade pip" || return 1
if grep -Ev '^[[:space:]]*(#|$)' requirements.txt >/dev/null 2>&1; then
  pip install -r requirements.txt || fail "could not install requirements.txt" || return 1
fi

# 3. Print the local environment without requiring GPU packages yet
python - << 'PY' || fail "environment probe failed" || return 1
import os, sys
print("Python:", sys.version.split()[0])
try:
    import torch
except ModuleNotFoundError:
    print("PyTorch: not installed yet (OK for this chapter)")
else:
    print("PyTorch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))
print("HF_HOME:", os.environ.get("HF_HOME", "(default ~/.cache/huggingface)"))
PY

echo "Environment ready. Run './repro.sh' to execute the eval gate."
EOF
chmod +x activate.sh
```

Document it in README.md:
```markdown
## Quick start

git clone git@<your-git-host>:your-org/access-rag.git
cd access-rag
./scripts/install_hooks.sh
if command -v git-lfs >/dev/null 2>&1; then
  git lfs pull
fi
source activate.sh
./repro.sh
```

Now a fresh engineer (or a fresh GPU box provisioned by your platform team) can go from zero to the same 0.667 result in under two minutes.
| Symptom | Most common cause | Fix that belongs in the repo |
|---|---|---|
| CUDA not found on the GPU box | CUDA_VISIBLE_DEVICES is set to an empty or wrong value, the installed PyTorch build is CPU-only, or the base image has no CUDA | Explicit torch.cuda.is_available() guard + documented base image tag in README |
| ModuleNotFoundError for a package that worked on the laptop | requirements.txt is incomplete or uses unpinned versions | pip freeze > requirements.txt after a clean pip install -e . and commit the exact pins |
| Eval returns 0.000 because the three-row JSONL is missing | The fixture lived in an ignored or untracked directory on the author's disk, so it was never committed | Move the fixture to eval/ and add the directory to the committed tree; don't rely on "I had it in my downloads folder" |
| Pre-commit hook doesn't run on a fresh clone | Hooks under .git/hooks/ are local machine files, not tracked project files | Commit scripts/pre-commit-ai-eval.sh and scripts/install_hooks.sh, then run the installer after cloning |
| Pre-commit hook fails with "permission denied" | The hook script was installed without chmod +x or the clone was on a filesystem that strips execute bits | chmod +x scripts/*.sh .git/hooks/* + a one-line check in the installer |
| "It worked yesterday" after a git pull | A teammate committed a new large model without LFS or changed the expected schema of the eval file | LFS tracking + a schema validation step in the scorer + git fetch followed by git diff HEAD..@{u} -- eval/ before merging changes to data files |
These have happened to many AI engineers. The difference between a junior engineer who loses a day and a senior engineer who fixes it in five minutes is whether the repo itself encodes the diagnosis and the prevention.
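The "schema validation step in the scorer" named in the last table row could be as small as the sketch below. It assumes the three string fields used by the fixture (prompt, expected, prediction); the names `REQUIRED_FIELDS` and `validate_rows` are hypothetical, and a real scorer might enforce more.

```python
import json

# Fields every row of eval/access_requests.jsonl is assumed to carry.
REQUIRED_FIELDS = ("prompt", "expected", "prediction")


def validate_rows(jsonl_text: str) -> list:
    """Return human-readable schema errors; an empty list means valid."""
    errors = []
    for lineno, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        for field in REQUIRED_FIELDS:
            if not isinstance(row.get(field), str):
                errors.append(f"line {lineno}: field '{field}' missing or not a string")
    return errors


good = '{"prompt": "Access request 101 status?", "expected": "approved", "prediction": "approved"}'
renamed = '{"prompt": "Access request 101 status?", "label": "approved", "prediction": "approved"}'

print(validate_rows(good))     # no errors for a well-formed row
print(validate_rows(renamed))  # flags the renamed 'expected' field
```

Failing fast on a renamed field turns a silent 0.000 score into an error message that names the line and the field.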
The first real engineering loop is in place:
- git clone + source activate.sh produces a working environment on any machine with the declared dependencies.
- repro (or the pre-commit hook) checks that the tiny contract (three rows, 0.667) is still satisfied after each change.
- gpu, ds, and the Linux session commands let you see what the hardware is doing.
- .gitignore + LFS rules + activation script travel with the code, so the next person doesn't have to reverse-engineer your laptop.

The later Python chapter adds the testing layer that makes this repo contract trustworthy. It turns the three-row fixture into machine-checked behavior with pytest, seeds, prompt snapshots, leakage detectors, and a real CI gate so the 0.667 survives future changes.
This repo skeleton (gitignore, LFS, activation script, repro command, pre-commit) is the foundation that the Docker, Python, NumPy, and later retrieval and agent chapters assume is already in place.
Self check: clone the repo into a fresh temporary directory and run ./scripts/install_hooks.sh && source activate.sh && ./repro.sh. The expected output is the same 0.667 score plus a visible environment summary. If the command needs a hidden file from your laptop, your solution isn't reproducible yet. A strong answer names the missing contract, adds it to .env.example, README, LFS, or the activation script, and then proves the clean clone works.