Module 1: RAG Fundamentals
Retrieval-Augmented Generation (RAG) is a technique that enhances AI responses by first retrieving relevant information from a knowledge base, then using that information to generate accurate, contextual answers.
How RAG Works
- Step 1: Query Processing - User asks a question
- Step 2: Retrieval - System searches knowledge base for relevant documents
- Step 3: Context Assembly - Retrieved documents are added to the prompt
- Step 4: Generation - AI generates answer using retrieved context
- Step 5: Response - Answer is returned to user with optional citations
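To make the flow concrete, here is a minimal sketch of the five steps in Python. The toy keyword-overlap retriever and the stubbed generate() call are illustrative assumptions - a real system would use embeddings, a vector database, and an actual LLM API.

```python
# Minimal RAG loop: retrieve, assemble context, generate.
# The retriever is a toy keyword-overlap scorer; real systems use embeddings.

KNOWLEDGE_BASE = [
    "Plan upgrades take effect immediately and are prorated.",
    "API rate limits are 100 requests per minute on the Pro plan.",
    "Single sign-on (SSO) is available on Enterprise plans only.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Step 2: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def generate(prompt: str) -> str:
    """Step 4 placeholder: swap in your LLM call (OpenAI, Anthropic, local model)."""
    return f"[LLM answer grounded in the context below]\n{prompt}"

def answer(query: str) -> str:
    context = retrieve(query)                       # Step 2: Retrieval
    prompt = (                                      # Step 3: Context Assembly
        "Answer using only the context below and cite the snippet you used.\n\n"
        + "\n".join(f"- {c}" for c in context)
        + f"\n\nQuestion: {query}"
    )
    return generate(prompt)                         # Step 4: Generation

print(answer("What are the API rate limits?"))      # Step 1 in, Step 5 out
```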
Why RAG Matters
✓ Benefits
- Reduces hallucinations
- Access to current data
- Company-specific knowledge
- No model retraining needed
- Transparent sourcing
⚠ Challenges
- Requires vector database
- Quality depends on chunks
- Additional latency
- Infrastructure complexity
- Retrieval accuracy critical
💡 Real-World Example
Scenario: Customer support chatbot for a SaaS company
Without RAG: AI gives generic answers or outdated information
With RAG: AI searches current documentation, finds exact feature details, and provides accurate answers with links to docs
Module 2: Agentic AI Systems
Agentic AI refers to AI systems that can autonomously plan, decide, and execute multi-step tasks to achieve goals. Unlike simple chatbots, these systems can break down complex requests, use tools, and iterate on their approach.
Key Characteristics
- 🎯 Goal-Oriented
- 🔄 Iterative
- 🛠️ Tool-Using
- 🧠 Decision-Making
Agentic Workflow Example
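In place of a workflow diagram, here is a toy agent loop in Python showing the plan → act → observe cycle, a step budget, and an audit trail. The two tools, the hard-coded "plan", and the stopping condition are illustrative assumptions, not a real framework.

```python
# Toy agentic loop: pick a tool, execute it, log it, check if the goal is met.

def search_docs(query: str) -> str:
    return f"(stub) top documentation hit for '{query}'"

def create_ticket(summary: str) -> str:
    return f"(stub) ticket #1234 created: {summary}"

TOOLS = {"search_docs": search_docs, "create_ticket": create_ticket}
MAX_STEPS = 5        # budget limit: prevents runaway loops and costs
audit_log = []       # audit trail: every action is recorded

def run_agent(goal: str) -> str:
    observations = []
    for step in range(MAX_STEPS):
        # In a real agent, the LLM decides the next tool and its arguments.
        # The "plan" here is hard-coded so the sketch stays self-contained.
        if not observations:
            tool, args = "search_docs", {"query": goal}
        else:
            tool, args = "create_ticket", {"summary": goal}

        result = TOOLS[tool](**args)        # act: execute the chosen tool
        audit_log.append({"step": step, "tool": tool, "args": args, "result": result})
        observations.append(result)

        if tool == "create_ticket":         # naive "goal achieved" check
            return result
    return "Stopped: step budget exhausted"

print(run_agent("Customer reports login failures after password reset"))
print(audit_log)
```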
⚠ Important Considerations
Human Oversight: Critical decisions should require approval
Budget Limits: Set spending caps to prevent runaway costs
Audit Trails: Log all actions for transparency and debugging
Module 3: Chunking Strategies
Chunking is one of the most critical factors in RAG performance. Poor chunking leads to irrelevant retrievals, incomplete context, and inaccurate answers. Good chunking ensures semantic coherence and optimal retrieval.
Chunking Strategies
| Strategy | How It Works | Best For | Typical Size |
|---|---|---|---|
| Fixed Size | Split by character/token count | Uniform documents, simple setup | 500-1000 tokens |
| Semantic | Split at meaning boundaries | Natural language, narratives | Variable |
| Structural | Use document structure (headers, sections) | Technical docs, manuals | Variable |
| Sliding Window | Overlapping chunks | Context continuity needed | 800 tokens + 200 overlap |
| Recursive | Hierarchical splitting | Complex documents | Variable |
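As a concrete reference for the Fixed Size and Sliding Window rows, here is a small Python sketch that splits on whitespace "tokens" with overlap. The sizes mirror the table's example values; a production pipeline would count tokens with the same tokenizer used by the embedding model.

```python
# Sliding-window chunking: fixed-size chunks with overlap for continuity.

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    tokens = text.split()                       # whitespace "tokens" for simplicity
    step = chunk_size - overlap                 # how far the window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):   # last window reached the end
            break
    return chunks

doc = "word " * 2000                            # stand-in for a real document
pieces = chunk_text(doc)
print(len(pieces), "chunks;", len(pieces[0].split()), "tokens in the first chunk")
```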
Chunk Size Impact
Too Small (< 200 tokens)
❌ Loses context
❌ Fragments information
❌ More retrievals needed
✓ Precise targeting
Optimal (400-800 tokens)
✓ Balanced context
✓ Good semantic coherence
✓ Efficient retrieval
✓ Cost-effective
Too Large (> 1500 tokens)
❌ Dilutes relevance
❌ Higher costs
❌ Slower processing
✓ Full context preserved
🎯 Best Practices
- Add Metadata: Include document title, section, date in each chunk
- Preserve Context: Include parent section headers in chunks
- Test & Iterate: Evaluate retrieval quality with real queries
- Consider Overlap: 10-20% overlap helps maintain continuity
- Match Domain: Technical docs need different strategy than chat logs
Module 4: Tools & Function Calling
Tools are capabilities AI can access (search, calculator, database). Function calling is the mechanism where AI requests to use these tools with structured parameters.
Native Function Calling APIs
💡 What are Native APIs?
Some AI providers (OpenAI, Anthropic, Google) offer built-in function calling support. The model natively understands when and how to call functions, outputting structured JSON that your code can execute.
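A minimal sketch of what this looks like with the OpenAI Python SDK. The get_order_status tool, its stub implementation, and the model name are illustrative assumptions; only the tools/tool_calls mechanics come from the SDK itself.

```python
# Native function calling: the model returns a structured tool call,
# and your code executes it. Requires OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the shipping status of an order by ID",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

def get_order_status(order_id: str) -> str:
    return f"Order {order_id}: shipped, arriving Thursday"   # stub backend

response = client.chat.completions.create(
    model="gpt-4o",        # any tool-capable model; the name is illustrative
    messages=[{"role": "user", "content": "Where is my order 1234?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:                          # the model asked to use a tool
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)  # structured JSON arguments
    print(get_order_status(**args))             # your code executes the function
```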
Open-Source Models & Frameworks
⚠ The Challenge
Most open-source models (Llama, Mistral, etc.) don't have native function calling. They're trained for text generation, not structured tool use. Solution: Use frameworks that add this capability!
Popular Frameworks
LangChain
Approach: Wraps models with agents and tool abstractions
How: Prompts model to output specific format, parses response
Best For: Complex chains, multiple tools, production apps
LlamaIndex
Approach: Data-focused framework with built-in tools
How: Query engines + tool abstractions
Best For: RAG applications, document Q&A
Instructor
Approach: Structured output validation using Pydantic
How: Forces model outputs into schemas
Best For: Data extraction, API responses
Native APIs vs Frameworks
| Aspect | Native Function Calling | Framework-Based |
|---|---|---|
| Models | GPT-4, Claude, Gemini | Any model (Llama, Mistral, etc.) |
| Accuracy | Very high (trained for it) | Moderate (depends on prompting) |
| Reliability | Consistent structured output | May need retry logic |
| Setup | Simple - just define functions | More complex - framework config |
| Cost | API pricing (per token) | Self-hosted or cheaper APIs |
| Control | Less (provider-dependent) | Full control over logic |
| Best For | Production, reliability critical | Cost optimization, customization |
How Frameworks Work Behind the Scenes
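In place of a diagram, here is a rough sketch of what frameworks like LangChain do conceptually: describe the tools in the prompt, ask the model for JSON, then parse and dispatch. The prompt wording, tool set, and stubbed model call are assumptions for illustration, not any framework's actual internals.

```python
# Prompt-based tool calling, the way frameworks implement it for models
# without native function calling: prompt -> JSON -> parse -> dispatch.
import json

TOOLS = {
    "calculator": lambda expression: str(eval(expression)),  # demo only: never eval untrusted input
    "search": lambda query: f"(stub) results for '{query}'",
}

TOOL_PROMPT = """You can use these tools:
- calculator(expression): evaluate arithmetic
- search(query): search the web

Respond ONLY with JSON in the form {"tool": "<name>", "args": {...}}.
User request: {request}"""

def call_model(prompt: str) -> str:
    # Stand-in for an open-source model call (Llama, Mistral, ...).
    return '{"tool": "calculator", "args": {"expression": "17 * 23"}}'

def run(request: str) -> str:
    raw = call_model(TOOL_PROMPT.replace("{request}", request))
    try:
        parsed = json.loads(raw)       # real frameworks add retries and repair here
    except json.JSONDecodeError:
        return "Model output was not valid JSON - retry with a stricter prompt"
    return TOOLS[parsed["tool"]](**parsed["args"])

print(run("What is 17 times 23?"))     # -> 391
```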
🎯 Choosing Your Approach
Use Native Function Calling when:
- Reliability is critical (financial transactions, medical)
- You need consistent structured outputs
- Budget allows for API costs
- Quick development time is priority
Use Frameworks with Open-Source when:
- Cost optimization is important
- You need full control and customization
- Data privacy requires on-premise hosting
- You have expertise to handle edge cases
Real-World Example
Scenario: E-commerce customer service chatbot
Native API Approach (GPT-4):
- Cost: $0.03 per conversation
- Setup: 2 days
- Accuracy: 98% correct tool calls
Framework Approach (Llama 2 + LangChain):
- Cost: $0.002 per conversation (15x cheaper!)
- Setup: 1-2 weeks
- Accuracy: 92% correct tool calls (needs tuning)
Module 5: Model Context Protocol (MCP)
Model Context Protocol (MCP) is an open standard for connecting AI models to external tools and data sources. Think of it as USB for AI - a universal way to plug in any tool to any AI system.
Why MCP Matters
Without MCP
❌ Custom integration per tool
❌ Vendor lock-in
❌ Duplicate work
❌ Hard to maintain
With MCP
✓ Standard interface
✓ Portable tools
✓ Write once, use anywhere
✓ Community ecosystem
MCP Architecture
🚀 MCP Use Cases
- Database Access: Connect to MySQL, Postgres, MongoDB
- File Systems: Read/write local or cloud files
- APIs: Integrate with Slack, GitHub, Salesforce
- Internal Tools: Custom business logic and data
- Browser Automation: Web scraping and testing
💡 Getting Started with MCP
1. Choose or Build MCP Server: Use existing servers or create your own
2. Configure AI Client: Point your AI application to the MCP server
3. Test Integration: Verify tool discovery and execution
4. Deploy: Run MCP servers alongside your AI application
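A minimal server for step 1, assuming the official Python mcp SDK and its FastMCP helper (pip install mcp); the HR lookup tool itself is a hypothetical stub.

```python
# Minimal MCP server exposing a single tool that any MCP-capable client
# (step 2) can discover and call. The HR data is a stub.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("hr-tools")

@mcp.tool()
def get_employee(employee_id: str) -> str:
    """Return basic details for an employee ID from the HR database."""
    fake_db = {"E100": "Ada Lovelace, Engineering, started 2021"}
    return fake_db.get(employee_id, "Employee not found")

if __name__ == "__main__":
    mcp.run()
```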
Module 6: Prompt Engineering
Prompt engineering is crafting instructions that guide AI to produce desired outputs. It's your most powerful and cost-effective tool for AI customization.
Core Prompting Patterns
Pattern 1: Clear Role & Context
❌ Poor Prompt
"Write about AI"
✓ Good Prompt
"You are a technical writer creating documentation for software engineers. Explain how transformer models work, assuming the reader has basic ML knowledge but hasn't worked with transformers before."
Pattern 2: Specific Instructions
❌ Poor Prompt
"Summarize this report"
✓ Good Prompt
"Summarize this report in five bullet points covering the key findings, the main risks, and the recommended next steps. Keep each bullet under 20 words."
Pattern 3: Output Structure
❌ Poor Prompt
"List the problems"
✓ Good Prompt
"Return the problems as a numbered list. For each one include a one-line description, the affected component, and a severity rating (low/medium/high)."
Pattern 4: Think Step-by-Step
❌ Without Reasoning
"Is this code secure?"
✓ With Reasoning
"Analyze this code for security issues. First, identify potential vulnerabilities. Then, assess their severity. Finally, suggest fixes. Explain your reasoning for each issue."
Common Anti-Patterns
⚠ What NOT to Do
- Too Vague: "Make it better" - Better how? What criteria?
- Conflicting Instructions: "Be brief but comprehensive" - Pick one!
- Assuming Context: "Fix the bug" - What bug? In what code?
- Ignoring Constraints: Not specifying length, format, or style
- No Examples: Complex tasks need examples of desired output
- Implicit Negatives: "Don't mention X" often makes AI focus on X
Advanced Patterns
- Persona Pattern: "As a [role], consider [perspective]..."
- Template Pattern: Provide filled example, ask AI to follow format
- Refinement Pattern: "First draft, then critique, then final version"
- Constraints Pattern: "Without using X, solve Y"
- Comparison Pattern: "Compare A vs B on dimensions X, Y, Z"
Module 7: Chain of Thought (CoT)
Chain of Thought is a prompting technique that encourages AI to show its reasoning process step-by-step before providing a final answer. This dramatically improves accuracy on complex reasoning tasks.
The Power of CoT
Without CoT
Q: If a store has 15% off and an additional $10 coupon, which is better on a $80 item?
A: The 15% discount is better.
❌ No reasoning shown - the answer can't be verified, and on problems like this direct answers are often wrong
With CoT
Q: Same question, followed by: "Let's work through this step-by-step:"
A:
1. 15% off $80 = $80 × 0.15 = $12 discount, pays $68
2. $10 coupon = pays $70
3. $68 < $70
Therefore: 15% discount is better
✓ Correct with clear reasoning
CoT Trigger Phrases
- "Let's think step by step."
- "Work through this before answering."
- "Show your reasoning, then give the final answer."
- "First list what you know, then solve the problem."
When to Use CoT
- Math & Calculations: Multi-step arithmetic, word problems
- Logical Reasoning: "If X then Y" scenarios, deductions
- Complex Analysis: Code debugging, root cause analysis
- Decision Making: Weighing multiple factors, trade-offs
- Planning: Project planning, strategy development
🎯 CoT Best Practices
- Explicit Request: Directly ask for step-by-step reasoning
- Provide Structure: Suggest the reasoning framework to use
- Show Examples: Demonstrate desired reasoning in few-shot examples
- Verify Steps: Ask AI to check its own work before final answer
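One way to bake the "explicit request" and "verify steps" practices above into a reusable wrapper; the template wording is just one option.

```python
# Wrap any question in a CoT template that asks for steps and a self-check.

COT_TEMPLATE = (
    "{question}\n\n"
    "Work through this step by step:\n"
    "1. List the relevant facts and constraints.\n"
    "2. Do the reasoning or calculation one step at a time.\n"
    "3. Check your work against the original question.\n"
    "4. Only then give the final answer on a line starting with 'Answer:'."
)

def cot_prompt(question: str) -> str:
    return COT_TEMPLATE.format(question=question)

print(cot_prompt(
    "If a store has 15% off and an additional $10 coupon, "
    "which is better on an $80 item?"
))
```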
💡 Real Impact
Studies show CoT can improve accuracy on reasoning tasks by 20-50% compared to direct answering, especially on complex problems.
Module 8: Few-Shot Learning
Few-shot learning means providing the AI with a few examples of the desired input-output pattern. The AI learns from these examples and applies the pattern to new inputs.
Types of Shot Learning
| Type | Examples Given | When to Use | Effectiveness |
|---|---|---|---|
| Zero-Shot | 0 - Just instructions | Simple, well-known tasks | ⭐⭐ |
| One-Shot | 1 example | Format clarification | ⭐⭐⭐ |
| Few-Shot | 2-5 examples | Most tasks, pattern learning | ⭐⭐⭐⭐ |
| Many-Shot | 10+ examples | Complex patterns, edge cases | ⭐⭐⭐⭐⭐ |
Few-Shot Example Pattern
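As a stand-in for the example pattern, here is a sketch that builds a few-shot classification prompt; the ticket labels and examples are invented for illustration.

```python
# Few-shot prompt: three labeled examples, then the new input for the model
# to complete. Keep the format identical across examples.

EXAMPLES = [
    ("The app crashes every time I open settings.", "bug"),
    ("Could you add a dark mode option?", "feature_request"),
    ("How do I export my data to CSV?", "question"),
]

def few_shot_prompt(new_ticket: str) -> str:
    lines = ["Classify each support ticket as bug, feature_request, or question.", ""]
    for text, label in EXAMPLES:
        lines += [f"Ticket: {text}", f"Label: {label}", ""]
    lines += [f"Ticket: {new_ticket}", "Label:"]     # the model completes this line
    return "\n".join(lines)

print(few_shot_prompt("The invoice PDF shows the wrong company address."))
```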
Crafting Effective Examples
- Diverse Examples: Cover different scenarios, not just similar cases
- Include Edge Cases: Show how to handle ambiguous situations
- Show Reasoning: Include "why" for complex classifications
- Consistent Format: Keep structure identical across examples
- Representative Sample: Examples should mirror real-world distribution
⚠ Common Mistakes
- Too Similar: All examples showing same pattern variation
- Unbalanced: 4 positive examples, 1 negative (biases AI)
- Wrong Examples: Showing incorrect outputs teaches wrong pattern
- Too Many: Beyond 10-15 examples, consider fine-tuning instead
🎯 When Few-Shot Shines
- Custom Formatting: Company-specific document styles
- Domain Language: Industry jargon and terminology
- Classification: Categorizing with custom labels
- Extraction: Pulling structured data from unstructured text
- Style Matching: Emulating specific writing tone or format
Module 9: Prompt Engineering vs Fine-tuning vs RAG
These three methods solve different problems. Understanding when to use each can save significant time and money.
Detailed Comparison
| Factor | Prompt Engineering | RAG | Fine-tuning |
|---|---|---|---|
| Setup Time | Minutes | Days to weeks | Weeks to months |
| Cost | $0 - $50 | $500 - $5K/month | $10K - $100K+ |
| Data Needed | 0-10 examples | Documents/knowledge base | 1000+ examples |
| Iteration Speed | Instant | Hours | Days |
| Best For | Behavior, format, reasoning | Knowledge access, current info | Specialized style, domain language |
| Maintenance | Easy updates | Update documents | Retrain periodically |
Decision Framework
1. Start with prompt engineering - it is fast, cheap, and often sufficient.
2. Add RAG when the model needs knowledge it doesn't have: your documents, current data, company specifics.
3. Consider fine-tuning only when you need specialized style or domain language that prompts and retrieval can't deliver - and you have 1000+ quality examples.
Real-World Scenarios
Scenario 1: Customer Support Bot
Solution: RAG + Prompt Engineering
Why: Need product knowledge (RAG) + professional tone (prompts). Fine-tuning not needed.
Cost: ~$800/month
Scenario 2: Code Reviewer
Solution: Prompt Engineering only
Why: Good prompts can specify coding standards. No special knowledge needed.
Cost: ~$20/month
Scenario 3: Medical Report Generator
Solution: Fine-tuning + RAG
Why: Highly specialized medical language (fine-tune) + patient data access (RAG)
Cost: ~$15K setup + $2K/month
Module 10: Evaluation & Metrics
You can't improve what you don't measure. Evaluation metrics help you understand if your AI system is working well and where to focus improvements.
Key Metrics
1. Accuracy Metrics
Precision
Of AI's positive predictions, how many were correct?
Recall
Of all actual positives, how many did AI find?
F1 Score
Balanced measure of precision and recall
Exact Match
Percentage of perfect answers
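A quick worked example of these four metrics on a toy set of predictions (labels invented for illustration), treating "bug" as the positive class.

```python
# Precision, recall, F1, and exact match on a toy labeled set.

expected  = ["bug", "bug", "question", "feature", "bug"]       # ground truth
predicted = ["bug", "question", "question", "feature", "bug"]  # model output
positive  = "bug"                                              # class of interest

tp = sum(1 for e, p in zip(expected, predicted) if p == positive and e == positive)
fp = sum(1 for e, p in zip(expected, predicted) if p == positive and e != positive)
fn = sum(1 for e, p in zip(expected, predicted) if p != positive and e == positive)

precision = tp / (tp + fp)          # of predicted positives, how many were correct
recall    = tp / (tp + fn)          # of actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)
exact     = sum(e == p for e, p in zip(expected, predicted)) / len(expected)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} exact_match={exact:.2f}")
```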
2. RAG-Specific Metrics
| Metric | What It Measures | Good Score |
|---|---|---|
| Retrieval Precision | % of retrieved chunks that are relevant | > 70% |
| Retrieval Recall | % of relevant docs that were retrieved | > 80% |
| Answer Relevance | Does answer address the question? | > 85% |
| Faithfulness | Is answer grounded in retrieved docs? | > 90% |
| Context Precision | Are relevant chunks ranked high? | > 75% |
3. Hallucination Detection
⚠ Types of Hallucinations
- Factual: Inventing facts not in source material
- Contextual: Correct facts but wrong context
- Temporal: Outdated information presented as current
- Conflation: Mixing details from different sources
Evaluation Framework
- Create Test Set: 100-500 representative queries with known good answers
- Automated Metrics: Run tests after each change to catch regressions
- Human Evaluation: Sample 50-100 responses monthly for quality check
- A/B Testing: Compare variants with real users (10-20% traffic)
- Monitor Production: Track metrics on live traffic continuously
Practical Evaluation Setup
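A minimal sketch of the "test set + automated metrics" steps above. The queries, the stubbed rag_answer(), and the keyword-based grader are placeholder assumptions - swap in your real pipeline and an exact-match or LLM-as-judge scorer.

```python
# Tiny regression-style eval: run each test query and check that required
# facts appear in the answer. Run it after every change to catch regressions.

TEST_SET = [
    {"query": "Which plans include SSO?", "must_contain": ["enterprise"]},
    {"query": "What is the Pro plan rate limit?", "must_contain": ["100", "minute"]},
]

def rag_answer(query: str) -> str:
    # Stand-in for your actual RAG pipeline.
    return "SSO is available on Enterprise plans; Pro allows 100 requests per minute."

def evaluate() -> float:
    passed = 0
    for case in TEST_SET:
        answer = rag_answer(case["query"]).lower()
        if all(term.lower() in answer for term in case["must_contain"]):
            passed += 1
        else:
            print("FAIL:", case["query"])
    return passed / len(TEST_SET)

print(f"pass rate: {evaluate():.0%}")
```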
💡 Continuous Improvement
Week 1: Establish baseline metrics
Week 2: Identify worst-performing queries
Week 3: Improve (better chunks, prompts, etc.)
Week 4: Re-evaluate and iterate
Goal: 5-10% improvement per iteration cycle
Module 11: On-Premise vs Cloud AI
Where you run your AI affects security, cost, performance, and maintenance. This is a strategic decision with long-term implications.
Comprehensive Comparison
| Factor | On-Premise | Cloud |
|---|---|---|
| Initial Cost | $50K-500K+ (hardware) | $0 (pay-as-you-go) |
| Per-Query Cost | $0.0001-0.001 (at scale) | $0.001-0.02 |
| Scalability | Limited by hardware | Instant, unlimited |
| Data Control | Complete, never leaves premises | Depends on provider terms |
| Latest Models | Delayed, manual updates | Immediate access |
| Maintenance | Your team (24/7) | Provider handles |
| Expertise Required | High (ML engineers, DevOps) | Medium (API integration) |
| Time to Production | 3-6 months | Days to weeks |
Decision Tree
1. Strict data-residency or compliance requirements? → On-premise (or hybrid).
2. Unpredictable or spiky load, small team, need to ship fast? → Cloud.
3. Very high, steady query volume where per-query cost dominates? → On-premise starts to pay off.
4. Mixed requirements? → Hybrid (see the example below).
🎯 Hybrid Approach Example
Sensitive Operations: Customer PII analysis → On-premise
General Operations: Public FAQs, general queries → Cloud
Development: Testing and experimentation → Cloud
Production: Core business logic → On-premise
Module 12: DevOps for AI & Cost Management
Deploying AI is just the beginning. Proper DevOps and cost management are critical for sustainable, efficient AI operations.
Cost Components Breakdown
| Component | % of Total | Optimization Impact |
|---|---|---|
| Model API Calls | 50-70% | High - Caching can save 40-60% |
| Vector Database | 15-25% | Medium - Index optimization helps |
| Embedding Generation | 10-15% | Medium - Batch processing saves 20% |
| Storage & Infrastructure | 5-10% | Low - Regular cleanup needed |
Top 10 Cost Optimization Strategies
- Implement Caching: Cache identical/similar queries - biggest win (40-60% savings); see the sketch after this list
- Choose Right Model: Use smaller models for simple tasks (GPT-3.5 vs GPT-4)
- Prompt Optimization: Remove unnecessary words, compress context
- Batch Processing: Process non-urgent requests in batches (20% cheaper)
- Set Token Limits: Prevent runaway generation with max_tokens
- Smart Retrieval: Retrieve fewer chunks when possible (3 instead of 10)
- Rate Limiting: Prevent abuse and unexpected spikes
- Monitoring & Alerts: Set spending alerts before costs spiral
- Optimize Embeddings: Cache embeddings, use cheaper models for embeddings
- Regular Audits: Weekly review of high-cost queries and users
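A minimal sketch of strategy #1: an exact-match cache keyed by a hash of the normalized prompt. The in-memory dict and stubbed model call are simplifications - production systems typically use Redis with a TTL, plus semantic (embedding-based) caching to catch near-duplicate queries.

```python
# Exact-match response cache: identical prompts never hit the model twice.
import hashlib

_cache: dict[str, str] = {}

def cache_key(prompt: str) -> str:
    normalized = " ".join(prompt.lower().split())    # strip case/whitespace noise
    return hashlib.sha256(normalized.encode()).hexdigest()

def call_model(prompt: str) -> str:
    return f"(expensive model call for: {prompt})"   # stand-in for the real API

def cached_completion(prompt: str) -> str:
    key = cache_key(prompt)
    if key not in _cache:
        _cache[key] = call_model(prompt)             # only pay for cache misses
    return _cache[key]

cached_completion("What is your refund policy?")
cached_completion("what is your refund policy?   ")  # cache hit: no API cost
print(len(_cache), "model call(s) made")             # -> 1
```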
Real Cost Example
Case Study: Customer Support Chatbot
Volume: 10,000 conversations/month
Avg tokens/conversation: 2,000 (input + output)
Before Optimization:
- Model: GPT-4 for everything
- No caching
- Retrieving 10 chunks per query
- Cost: $5,200/month
After Optimization:
- GPT-3.5 for 70% of queries, GPT-4 for complex only
- 50% cache hit rate
- 5 chunks per query average
- Prompt compression
- Cost: $850/month (84% reduction!)
Essential DevOps Practices
Monitoring
- Response latency
- Error rates
- Token usage
- Cost per user
- Cache hit rates
Logging
- All queries & responses
- Retrieved chunks
- Model versions
- User feedback
- Error traces
Testing
- Regression tests
- Load testing
- A/B experiments
- Hallucination checks
- Performance benchmarks
Module 13: Reasoning vs Non-Reasoning Models
Reasoning models represent a fundamental shift in how AI approaches complex problems. Unlike traditional models that generate answers immediately, reasoning models think through problems step-by-step before responding.
Architecture Differences
Traditional (Non-Reasoning) Models
The prompt goes in and the answer is generated directly, token by token - there is no separate thinking phase.
Reasoning Models
The prompt goes in, the model first generates hidden reasoning tokens (exploring approaches, verifying, revising), and only then produces the final answer.
Key Architectural Components
Reasoning Tokens
Hidden tokens generated during "thinking" phase
Purpose: Work through problem internally
Cost: Uses more compute
Search & Verification
Model explores multiple solution paths
Purpose: Find best approach
Method: Tree search or beam search
Self-Correction
Model checks own work and revises
Purpose: Catch errors before responding
Impact: Dramatic accuracy improvement
Detailed Comparison
| Aspect | Non-Reasoning Models | Reasoning Models |
|---|---|---|
| Examples | GPT-4, Claude 3.5, Llama 3 | OpenAI o1, o3, DeepSeek R1 |
| Response Time | 1-3 seconds | 10-60+ seconds |
| Token Usage | Lower (direct generation) | Higher (reasoning + answer) |
| Math Accuracy | 60-70% on complex problems | 90-95% on complex problems |
| Coding Tasks | Good for standard patterns | Excellent for complex algorithms |
| Training | Supervised fine-tuning | Reinforcement learning + search |
| Reasoning Visibility | Only in output text | Can expose thinking process |
| Best For | General conversation, creative writing | Math, logic, complex problem-solving |
How Reasoning Models Work Internally
💡 The Process Explained
Step 1: Problem Understanding
Model generates hidden tokens analyzing the problem structure, constraints, and requirements.
Step 2: Solution Exploration
Model explores multiple approaches: "What if I try method A? What about method B?" Each path is evaluated.
Step 3: Self-Verification
Model checks its work: "Does this solution satisfy all constraints? Let me verify each step."
Step 4: Refinement
If issues found, model backtracks and tries different approach. Repeats until confident.
Step 5: Final Generation
Only after thorough reasoning, model generates the final response to the user.
Example: Same Problem, Different Models
GPT-4 (Non-Reasoning)
Problem: "If 5 machines make 5 widgets in 5 minutes, how long for 100 machines to make 100 widgets?"
Response (2 seconds):
"It would take 20 minutes."
❌ Incorrect - fell into intuitive trap
o1 (Reasoning)
Same Problem
Thinking (15 seconds):
"Wait, let me think... if 5 machines make 5 widgets in 5 minutes, that means each machine makes 1 widget in 5 minutes. So 100 machines would each make 1 widget in 5 minutes..."
Response:
"It would take 5 minutes."
✅ Correct - reasoned through the problem
Training Differences
- Traditional Models: Trained on (prompt, response) pairs. Learn to predict what comes next based on patterns in training data.
- Reasoning Models: Trained using reinforcement learning with reward signals for correct reasoning steps. Learn to explore solution space systematically.
- Key Innovation: Reasoning models receive rewards not just for correct final answers, but for correct intermediate reasoning steps.
- Result: Model learns to "think" through problems rather than pattern-match to memorized solutions.
Cost-Performance Trade-offs
Real-World Scenario: Code Generation
Task: Generate complex algorithm for graph optimization
GPT-4 Approach:
- Cost: $0.03 per attempt
- Time: 3 seconds
- Success rate: 60%
- Total cost (avg): $0.05 (with retries)
o1 Approach:
- Cost: $0.15 per attempt
- Time: 30 seconds
- Success rate: 95%
- Total cost (avg): $0.16 (rarely needs retry)
Analysis: The reasoning model costs roughly 3x more per task but almost always succeeds on the first attempt. Once failed attempts, retries, and engineer review time are counted, it is often the more cost-effective choice for complex tasks where accuracy matters.
When to Use Each Type
🎯 Use Non-Reasoning Models For:
- General conversation: Customer support, chatbots
- Creative writing: Marketing copy, stories, content
- Simple tasks: Summarization, translation, formatting
- High volume: When speed and cost matter more than perfection
- Real-time responses: Interactive applications
🧠 Use Reasoning Models For:
- Complex math: Multi-step calculations, proofs
- Advanced coding: Algorithm design, bug fixing
- Logic puzzles: Planning, scheduling, optimization
- Scientific reasoning: Hypothesis generation, analysis
- High-stakes decisions: Where accuracy is critical
💡 Hybrid Approach
Many production systems use both:
- Fast model (GPT-4): Initial response, simple queries
- Reasoning model (o1): Triggered for complex problems detected by fast model
- Result: 95% of queries handled quickly and cheaply, 5% get deep reasoning when needed
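A minimal sketch of that routing idea; the keyword heuristic and model labels are illustrative placeholders - real routers often use a small classifier or let the fast model flag queries that need deeper reasoning.

```python
# Route simple queries to a fast model, hard ones to a reasoning model.

HARD_SIGNALS = ("prove", "optimize", "algorithm", "step by step", "schedule")

def needs_reasoning(query: str) -> bool:
    q = query.lower()
    return any(signal in q for signal in HARD_SIGNALS) or len(q.split()) > 80

def route(query: str) -> str:
    return "reasoning-model" if needs_reasoning(query) else "fast-model"

print(route("What's your refund policy?"))                       # -> fast-model
print(route("Optimize this delivery schedule across 40 stops"))  # -> reasoning-model
```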
🎮 Demo 1: RAG (Retrieval-Augmented Generation)
What You'll Learn
See how RAG retrieves relevant chunks from a knowledge base and uses them to answer questions accurately. Compare the difference between RAG and non-RAG responses.
🎮 Demo 2: Prompt Engineering
What You'll Learn
Experiment with different prompts and see how they affect AI output quality, structure, and detail.
💡 Try These Improvements:
- Add a specific role: "You are a financial analyst for tech startups..."
- Specify output format: "Provide exactly 3 bullet points..."
- Include constraints: "Focus only on revenue trends and product performance..."
- Request reasoning: "Think step-by-step and explain your analysis..."
🎮 Demo 3: Function Calling
What You'll Learn
See how function calling enables AI to access real-time data and external tools. Compare responses with and without tool access.
🎯 Key Insights:
- With Tools: AI recognizes need for current data, calls function, receives real data, provides accurate answer
- Without Tools: AI only knows what was in training data, must decline or speculate
- Real Impact: Function calling bridges the gap between AI knowledge and real-world, current information
🎮 Demo 4: Model Context Protocol (MCP)
What You'll Learn
See how MCP provides a universal interface for AI to discover and use tools. This demo simulates connecting to an MCP server with multiple tools.
🖥️ Simulated MCP Server (available tools)
- Returns employee details from the HR database
- Returns meetings scheduled for a date
- Returns current weather conditions
🎯 MCP Advantages:
- Tool Discovery: AI automatically discovers available tools from MCP server
- Universal Interface: Same MCP tools work with any AI model that supports MCP
- No Custom Code: Don't need to write integration code for each AI provider
- Composable: Easily add/remove tools by connecting/disconnecting MCP servers
📝 Final Assessment
Question 1: RAG
What is the main benefit of using RAG over retraining an AI model?
Question 2: Chunking
What is the optimal chunk size for most RAG applications?
Question 3: MCP
What problem does Model Context Protocol (MCP) solve?
Question 4: Prompt Engineering
Which is an example of a good prompt pattern?
Question 5: Chain of Thought
When should you use Chain of Thought prompting?
Question 6: Few-Shot Learning
How many examples typically constitute "few-shot" learning?
Question 7: Customization Methods
Which approach should you try FIRST for AI customization?
Question 8: Evaluation
What is an acceptable hallucination rate for production AI systems?
Question 9: Deployment
When is on-premise AI deployment most appropriate?
Question 10: Cost Optimization
What is the most effective cost reduction strategy?