Module 1: RAG Fundamentals
Retrieval-Augmented Generation (RAG) is a technique that enhances AI responses by first retrieving relevant information from a knowledge base, then using that information to generate accurate, contextual answers.
How RAG Works
- Step 1: Query Processing - User asks a question
- Step 2: Retrieval - System searches knowledge base for relevant documents
- Step 3: Context Assembly - Retrieved documents are added to the prompt
- Step 4: Generation - AI generates answer using retrieved context
- Step 5: Response - Answer is returned to user with optional citations
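To make the flow concrete, here is a minimal sketch of the five steps in Python. The toy keyword-overlap retriever and the stubbed generate() call are illustrative assumptions - a real system would use embeddings, a vector database, and an actual LLM API.

```python
# Minimal RAG loop: retrieve, assemble context, generate.
# The retriever is a toy keyword-overlap scorer; real systems use embeddings.

KNOWLEDGE_BASE = [
    "Plan upgrades take effect immediately and are prorated.",
    "API rate limits are 100 requests per minute on the Pro plan.",
    "Single sign-on (SSO) is available on Enterprise plans only.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Step 2: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def generate(prompt: str) -> str:
    """Step 4 placeholder: swap in your LLM call (OpenAI, Anthropic, local model)."""
    return f"[LLM answer grounded in the context below]\n{prompt}"

def answer(query: str) -> str:
    context = retrieve(query)                       # Step 2: Retrieval
    prompt = (                                      # Step 3: Context Assembly
        "Answer using only the context below and cite the snippet you used.\n\n"
        + "\n".join(f"- {c}" for c in context)
        + f"\n\nQuestion: {query}"
    )
    return generate(prompt)                         # Step 4: Generation

print(answer("What are the API rate limits?"))      # Step 1 in, Step 5 out
```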
Why RAG Matters
✓ Benefits
- Reduces hallucinations
- Access to current data
- Company-specific knowledge
- No model retraining needed
- Transparent sourcing
⚠ Challenges
- Requires vector database
- Quality depends on chunks
- Additional latency
- Infrastructure complexity
- Retrieval accuracy critical
💡 Real-World Example
Scenario: Customer support chatbot for a SaaS company
Without RAG: AI gives generic answers or outdated information
With RAG: AI searches current documentation, finds exact feature details, and provides accurate answers with links to docs
Module 2: Agentic AI Systems
Agentic AI refers to AI systems that can autonomously plan, decide, and execute multi-step tasks to achieve goals. Unlike simple chatbots, these systems can break down complex requests, use tools, and iterate on their approach.
Key Characteristics
- 🎯 Goal-Oriented
- 🔄 Iterative
- 🛠️ Tool-Using
- 🧠 Decision-Making
Agentic Workflow Example
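In place of a workflow diagram, here is a toy agent loop in Python showing the plan → act → observe cycle, a step budget, and an audit trail. The two tools, the hard-coded "plan", and the stopping condition are illustrative assumptions, not a real framework.

```python
# Toy agentic loop: pick a tool, execute it, log it, check if the goal is met.

def search_docs(query: str) -> str:
    return f"(stub) top documentation hit for '{query}'"

def create_ticket(summary: str) -> str:
    return f"(stub) ticket #1234 created: {summary}"

TOOLS = {"search_docs": search_docs, "create_ticket": create_ticket}
MAX_STEPS = 5        # budget limit: prevents runaway loops and costs
audit_log = []       # audit trail: every action is recorded

def run_agent(goal: str) -> str:
    observations = []
    for step in range(MAX_STEPS):
        # In a real agent, the LLM decides the next tool and its arguments.
        # The "plan" here is hard-coded so the sketch stays self-contained.
        if not observations:
            tool, args = "search_docs", {"query": goal}
        else:
            tool, args = "create_ticket", {"summary": goal}

        result = TOOLS[tool](**args)        # act: execute the chosen tool
        audit_log.append({"step": step, "tool": tool, "args": args, "result": result})
        observations.append(result)

        if tool == "create_ticket":         # naive "goal achieved" check
            return result
    return "Stopped: step budget exhausted"

print(run_agent("Customer reports login failures after password reset"))
print(audit_log)
```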
⚠ Important Considerations
Human Oversight: Critical decisions should require approval
Budget Limits: Set spending caps to prevent runaway costs
Audit Trails: Log all actions for transparency and debugging
Module 3: Chunking Strategies
Chunking is one of the most critical factors in RAG performance. Poor chunking leads to irrelevant retrievals, incomplete context, and inaccurate answers. Good chunking ensures semantic coherence and optimal retrieval.
Chunking Strategies
| Strategy | How It Works | Best For | Typical Size |
|---|---|---|---|
| Fixed Size | Split by character/token count | Uniform documents, simple setup | 500-1000 tokens |
| Semantic | Split at meaning boundaries | Natural language, narratives | Variable |
| Structural | Use document structure (headers, sections) | Technical docs, manuals | Variable |
| Sliding Window | Overlapping chunks | Context continuity needed | 800 tokens + 200 overlap |
| Recursive | Hierarchical splitting | Complex documents | Variable |
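As a concrete reference for the Fixed Size and Sliding Window rows, here is a small Python sketch that splits on whitespace "tokens" with overlap. The sizes mirror the table's example values; a production pipeline would count tokens with the same tokenizer used by the embedding model.

```python
# Sliding-window chunking: fixed-size chunks with overlap for continuity.

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    tokens = text.split()                       # whitespace "tokens" for simplicity
    step = chunk_size - overlap                 # how far the window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):   # last window reached the end
            break
    return chunks

doc = "word " * 2000                            # stand-in for a real document
pieces = chunk_text(doc)
print(len(pieces), "chunks;", len(pieces[0].split()), "tokens in the first chunk")
```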
Chunk Size Impact
Too Small (< 200 tokens)
❌ Loses context
❌ Fragments information
❌ More retrievals needed
✓ Precise targeting
Optimal (400-800 tokens)
✓ Balanced context
✓ Good semantic coherence
✓ Efficient retrieval
✓ Cost-effective
Too Large (> 1500 tokens)
❌ Dilutes relevance
❌ Higher costs
❌ Slower processing
✓ Full context preserved
🎯 Best Practices
- Add Metadata: Include document title, section, date in each chunk
- Preserve Context: Include parent section headers in chunks
- Test & Iterate: Evaluate retrieval quality with real queries
- Consider Overlap: 10-20% overlap helps maintain continuity
- Match Domain: Technical docs need different strategy than chat logs
Module 4: Tools & Function Calling
Tools are capabilities AI can access (search, calculator, database). Function calling is the mechanism where AI requests to use these tools with structured parameters.
Native Function Calling APIs
💡 What are Native APIs?
Some AI providers (OpenAI, Anthropic, Google) offer built-in function calling support. The model natively understands when and how to call functions, outputting structured JSON that your code can execute.
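A minimal sketch of what this looks like with the OpenAI Python SDK. The get_order_status tool, its stub implementation, and the model name are illustrative assumptions; only the tools/tool_calls mechanics come from the SDK itself.

```python
# Native function calling: the model returns a structured tool call,
# and your code executes it. Requires OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the shipping status of an order by ID",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

def get_order_status(order_id: str) -> str:
    return f"Order {order_id}: shipped, arriving Thursday"   # stub backend

response = client.chat.completions.create(
    model="gpt-4o",        # any tool-capable model; the name is illustrative
    messages=[{"role": "user", "content": "Where is my order 1234?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:                          # the model asked to use a tool
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)  # structured JSON arguments
    print(get_order_status(**args))             # your code executes the function
```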
Open-Source Models & Frameworks
⚠ The Challenge
Most open-source models (Llama, Mistral, etc.) don't have native function calling. They're trained for text generation, not structured tool use. Solution: Use frameworks that add this capability!
Popular Frameworks
LangChain
Approach: Wraps models with agents and tool abstractions
How: Prompts model to output specific format, parses response
Best For: Complex chains, multiple tools, production apps
LlamaIndex
Approach: Data-focused framework with built-in tools
How: Query engines + tool abstractions
Best For: RAG applications, document Q&A
Instructor
Approach: Structured output validation using Pydantic
How: Forces model outputs into schemas
Best For: Data extraction, API responses
Native APIs vs Frameworks
| Aspect | Native Function Calling | Framework-Based |
|---|---|---|
| Models | GPT-4, Claude, Gemini | Any model (Llama, Mistral, etc.) |
| Accuracy | Very high (trained for it) | Moderate (depends on prompting) |
| Reliability | Consistent structured output | May need retry logic |
| Setup | Simple - just define functions | More complex - framework config |
| Cost | API pricing (per token) | Self-hosted or cheaper APIs |
| Control | Less (provider-dependent) | Full control over logic |
| Best For | Production, reliability critical | Cost optimization, customization |
How Frameworks Work Behind the Scenes
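In place of a diagram, here is a rough sketch of what frameworks like LangChain do conceptually: describe the tools in the prompt, ask the model for JSON, then parse and dispatch. The prompt wording, tool set, and stubbed model call are assumptions for illustration, not any framework's actual internals.

```python
# Prompt-based tool calling, the way frameworks implement it for models
# without native function calling: prompt -> JSON -> parse -> dispatch.
import json

TOOLS = {
    "calculator": lambda expression: str(eval(expression)),  # demo only: never eval untrusted input
    "search": lambda query: f"(stub) results for '{query}'",
}

TOOL_PROMPT = """You can use these tools:
- calculator(expression): evaluate arithmetic
- search(query): search the web

Respond ONLY with JSON in the form {"tool": "<name>", "args": {...}}.
User request: {request}"""

def call_model(prompt: str) -> str:
    # Stand-in for an open-source model call (Llama, Mistral, ...).
    return '{"tool": "calculator", "args": {"expression": "17 * 23"}}'

def run(request: str) -> str:
    raw = call_model(TOOL_PROMPT.replace("{request}", request))
    try:
        parsed = json.loads(raw)       # real frameworks add retries and repair here
    except json.JSONDecodeError:
        return "Model output was not valid JSON - retry with a stricter prompt"
    return TOOLS[parsed["tool"]](**parsed["args"])

print(run("What is 17 times 23?"))     # -> 391
```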
🎯 Choosing Your Approach
Use Native Function Calling when:
- Reliability is critical (financial transactions, medical)
- You need consistent structured outputs
- Budget allows for API costs
- Quick development time is priority
Use Frameworks with Open-Source when:
- Cost optimization is important
- You need full control and customization
- Data privacy requires on-premise hosting
- You have expertise to handle edge cases
Real-World Example
Scenario: E-commerce customer service chatbot
Native API Approach (GPT-4):
- Cost: $0.03 per conversation
- Setup: 2 days
- Accuracy: 98% correct tool calls
Framework Approach (Llama 2 + LangChain):
- Cost: $0.002 per conversation (15x cheaper!)
- Setup: 1-2 weeks
- Accuracy: 92% correct tool calls (needs tuning)
Module 5: Model Context Protocol (MCP)
Model Context Protocol (MCP) is an open standard for connecting AI models to external tools and data sources. Think of it as USB for AI - a universal way to plug in any tool to any AI system.
Why MCP Matters
Without MCP
❌ Custom integration per tool
❌ Vendor lock-in
❌ Duplicate work
❌ Hard to maintain
With MCP
✓ Standard interface
✓ Portable tools
✓ Write once, use anywhere
✓ Community ecosystem
MCP Architecture
🚀 MCP Use Cases
- Database Access: Connect to MySQL, Postgres, MongoDB
- File Systems: Read/write local or cloud files
- APIs: Integrate with Slack, GitHub, Salesforce
- Internal Tools: Custom business logic and data
- Browser Automation: Web scraping and testing
💡 Getting Started with MCP
1. Choose or Build MCP Server: Use existing servers or create your own
2. Configure AI Client: Point your AI application to the MCP server
3. Test Integration: Verify tool discovery and execution
4. Deploy: Run MCP servers alongside your AI application
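A minimal server for step 1, assuming the official Python mcp SDK and its FastMCP helper (pip install mcp); the HR lookup tool itself is a hypothetical stub.

```python
# Minimal MCP server exposing a single tool that any MCP-capable client
# (step 2) can discover and call. The HR data is a stub.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("hr-tools")

@mcp.tool()
def get_employee(employee_id: str) -> str:
    """Return basic details for an employee ID from the HR database."""
    fake_db = {"E100": "Ada Lovelace, Engineering, started 2021"}
    return fake_db.get(employee_id, "Employee not found")

if __name__ == "__main__":
    mcp.run()
```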
Module 6: Prompt Engineering
Prompt engineering is crafting instructions that guide AI to produce desired outputs. It's your most powerful and cost-effective tool for AI customization.
Core Prompting Patterns
Pattern 1: Clear Role & Context
❌ Poor Prompt
"Write about AI"
✓ Good Prompt
"You are a technical writer creating documentation for software engineers. Explain how transformer models work, assuming the reader has basic ML knowledge but hasn't worked with transformers before."
Pattern 2: Specific Instructions
❌ Poor Prompt
"Summarize this report"
✓ Good Prompt
"Summarize this report in five bullet points covering the key findings, the main risks, and the recommended next steps. Keep each bullet under 20 words."
Pattern 3: Output Structure
❌ Poor Prompt
"List the problems"
✓ Good Prompt
"Return the problems as a numbered list. For each one include a one-line description, the affected component, and a severity rating (low/medium/high)."
Pattern 4: Think Step-by-Step
❌ Without Reasoning
"Is this code secure?"
✓ With Reasoning
"Analyze this code for security issues. First, identify potential vulnerabilities. Then, assess their severity. Finally, suggest fixes. Explain your reasoning for each issue."
Common Anti-Patterns
⚠ What NOT to Do
- Too Vague: "Make it better" - Better how? What criteria?
- Conflicting Instructions: "Be brief but comprehensive" - Pick one!
- Assuming Context: "Fix the bug" - What bug? In what code?
- Ignoring Constraints: Not specifying length, format, or style
- No Examples: Complex tasks need examples of desired output
- Implicit Negatives: "Don't mention X" often makes AI focus on X
Advanced Patterns
- Persona Pattern: "As a [role], consider [perspective]..."
- Template Pattern: Provide filled example, ask AI to follow format
- Refinement Pattern: "First draft, then critique, then final version"
- Constraints Pattern: "Without using X, solve Y"
- Comparison Pattern: "Compare A vs B on dimensions X, Y, Z"
Module 7: Chain of Thought (CoT)
Chain of Thought is a prompting technique that encourages AI to show its reasoning process step-by-step before providing a final answer. This dramatically improves accuracy on complex reasoning tasks.
The Power of CoT
Without CoT
Q: If a store has 15% off and an additional $10 coupon, which is better on a $80 item?
A: The 15% discount is better.
❌ No reasoning shown - the answer can't be verified, and on problems like this direct answers are often wrong
With CoT
Q: Same question, followed by: "Let's work through this step-by-step:"
A:
1. 15% off $80 = $80 × 0.15 = $12 discount, pays $68
2. $10 coupon = pays $70
3. $68 < $70
Therefore: 15% discount is better
✓ Correct with clear reasoning
CoT Trigger Phrases
- "Let's think step by step."
- "Work through this before answering."
- "Show your reasoning, then give the final answer."
- "First list what you know, then solve the problem."
When to Use CoT
- Math & Calculations: Multi-step arithmetic, word problems
- Logical Reasoning: "If X then Y" scenarios, deductions
- Complex Analysis: Code debugging, root cause analysis
- Decision Making: Weighing multiple factors, trade-offs
- Planning: Project planning, strategy development
🎯 CoT Best Practices
- Explicit Request: Directly ask for step-by-step reasoning
- Provide Structure: Suggest the reasoning framework to use
- Show Examples: Demonstrate desired reasoning in few-shot examples
- Verify Steps: Ask AI to check its own work before final answer
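One way to bake the "explicit request" and "verify steps" practices above into a reusable wrapper; the template wording is just one option.

```python
# Wrap any question in a CoT template that asks for steps and a self-check.

COT_TEMPLATE = (
    "{question}\n\n"
    "Work through this step by step:\n"
    "1. List the relevant facts and constraints.\n"
    "2. Do the reasoning or calculation one step at a time.\n"
    "3. Check your work against the original question.\n"
    "4. Only then give the final answer on a line starting with 'Answer:'."
)

def cot_prompt(question: str) -> str:
    return COT_TEMPLATE.format(question=question)

print(cot_prompt(
    "If a store has 15% off and an additional $10 coupon, "
    "which is better on an $80 item?"
))
```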
💡 Real Impact
Studies show CoT can improve accuracy on reasoning tasks by 20-50% compared to direct answering, especially on complex problems.
Module 8: Few-Shot Learning
Few-shot learning means providing the AI with a few examples of the desired input-output pattern. The AI learns from these examples and applies the pattern to new inputs.
Types of Shot Learning
| Type | Examples Given | When to Use | Effectiveness |
|---|---|---|---|
| Zero-Shot | 0 - Just instructions | Simple, well-known tasks | ⭐⭐ |
| One-Shot | 1 example | Format clarification | ⭐⭐⭐ |
| Few-Shot | 2-5 examples | Most tasks, pattern learning | ⭐⭐⭐⭐ |
| Many-Shot | 10+ examples | Complex patterns, edge cases | ⭐⭐⭐⭐⭐ |
Few-Shot Example Pattern
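As a stand-in for the example pattern, here is a sketch that builds a few-shot classification prompt; the ticket labels and examples are invented for illustration.

```python
# Few-shot prompt: three labeled examples, then the new input for the model
# to complete. Keep the format identical across examples.

EXAMPLES = [
    ("The app crashes every time I open settings.", "bug"),
    ("Could you add a dark mode option?", "feature_request"),
    ("How do I export my data to CSV?", "question"),
]

def few_shot_prompt(new_ticket: str) -> str:
    lines = ["Classify each support ticket as bug, feature_request, or question.", ""]
    for text, label in EXAMPLES:
        lines += [f"Ticket: {text}", f"Label: {label}", ""]
    lines += [f"Ticket: {new_ticket}", "Label:"]     # the model completes this line
    return "\n".join(lines)

print(few_shot_prompt("The invoice PDF shows the wrong company address."))
```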
Crafting Effective Examples
- Diverse Examples: Cover different scenarios, not just similar cases
- Include Edge Cases: Show how to handle ambiguous situations
- Show Reasoning: Include "why" for complex classifications
- Consistent Format: Keep structure identical across examples
- Representative Sample: Examples should mirror real-world distribution
⚠ Common Mistakes
- Too Similar: All examples showing same pattern variation
- Unbalanced: 4 positive examples, 1 negative (biases AI)
- Wrong Examples: Showing incorrect outputs teaches wrong pattern
- Too Many: Beyond 10-15 examples, consider fine-tuning instead
🎯 When Few-Shot Shines
- Custom Formatting: Company-specific document styles
- Domain Language: Industry jargon and terminology
- Classification: Categorizing with custom labels
- Extraction: Pulling structured data from unstructured text
- Style Matching: Emulating specific writing tone or format
Module 9: Prompt Engineering vs Fine-tuning vs RAG
These three methods solve different problems. Understanding when to use each can save significant time and money.
Detailed Comparison
| Factor | Prompt Engineering | RAG | Fine-tuning |
|---|---|---|---|
| Setup Time | Minutes | Days to weeks | Weeks to months |
| Cost | $0 - $50 | $500 - $5K/month | $10K - $100K+ |
| Data Needed | 0-10 examples | Documents/knowledge base | 1000+ examples |
| Iteration Speed | Instant | Hours | Days |
| Best For | Behavior, format, reasoning | Knowledge access, current info | Specialized style, domain language |
| Maintenance | Easy updates | Update documents | Retrain periodically |
Decision Framework
1. Start with prompt engineering - it is fast, cheap, and often sufficient.
2. Add RAG when the model needs knowledge it doesn't have: your documents, current data, company specifics.
3. Consider fine-tuning only when you need specialized style or domain language that prompts and retrieval can't deliver - and you have 1000+ quality examples.
Real-World Scenarios
Scenario 1: Customer Support Bot
Solution: RAG + Prompt Engineering
Why: Need product knowledge (RAG) + professional tone (prompts). Fine-tuning not needed.
Cost: ~$800/month
Scenario 2: Code Reviewer
Solution: Prompt Engineering only
Why: Good prompts can specify coding standards. No special knowledge needed.
Cost: ~$20/month
Scenario 3: Medical Report Generator
Solution: Fine-tuning + RAG
Why: Highly specialized medical language (fine-tune) + patient data access (RAG)
Cost: ~$15K setup + $2K/month
Module 10: Evaluation & Metrics
You can't improve what you don't measure. Evaluation metrics help you understand if your AI system is working well and where to focus improvements.
Key Metrics
1. Accuracy Metrics
Precision
Of AI's positive predictions, how many were correct?
Recall
Of all actual positives, how many did AI find?
F1 Score
Balanced measure of precision and recall
Exact Match
Percentage of perfect answers
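A quick worked example of these four metrics on a toy set of predictions (labels invented for illustration), treating "bug" as the positive class.

```python
# Precision, recall, F1, and exact match on a toy labeled set.

expected  = ["bug", "bug", "question", "feature", "bug"]       # ground truth
predicted = ["bug", "question", "question", "feature", "bug"]  # model output
positive  = "bug"                                              # class of interest

tp = sum(1 for e, p in zip(expected, predicted) if p == positive and e == positive)
fp = sum(1 for e, p in zip(expected, predicted) if p == positive and e != positive)
fn = sum(1 for e, p in zip(expected, predicted) if p != positive and e == positive)

precision = tp / (tp + fp)          # of predicted positives, how many were correct
recall    = tp / (tp + fn)          # of actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)
exact     = sum(e == p for e, p in zip(expected, predicted)) / len(expected)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} exact_match={exact:.2f}")
```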
2. RAG-Specific Metrics
| Metric | What It Measures | Good Score |
|---|---|---|
| Retrieval Precision | % of retrieved chunks that are relevant | > 70% |
| Retrieval Recall | % of relevant docs that were retrieved | > 80% |
| Answer Relevance | Does answer address the question? | > 85% |
| Faithfulness | Is answer grounded in retrieved docs? | > 90% |
| Context Precision | Are relevant chunks ranked high? | > 75% |
3. Hallucination Detection
⚠ Types of Hallucinations
- Factual: Inventing facts not in source material
- Contextual: Correct facts but wrong context
- Temporal: Outdated information presented as current
- Conflation: Mixing details from different sources
Evaluation Framework
- Create Test Set: 100-500 representative queries with known good answers
- Automated Metrics: Run tests after each change to catch regressions
- Human Evaluation: Sample 50-100 responses monthly for quality check
- A/B Testing: Compare variants with real users (10-20% traffic)
- Monitor Production: Track metrics on live traffic continuously
Practical Evaluation Setup
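A minimal sketch of the "test set + automated metrics" steps above. The queries, the stubbed rag_answer(), and the keyword-based grader are placeholder assumptions - swap in your real pipeline and an exact-match or LLM-as-judge scorer.

```python
# Tiny regression-style eval: run each test query and check that required
# facts appear in the answer. Run it after every change to catch regressions.

TEST_SET = [
    {"query": "Which plans include SSO?", "must_contain": ["enterprise"]},
    {"query": "What is the Pro plan rate limit?", "must_contain": ["100", "minute"]},
]

def rag_answer(query: str) -> str:
    # Stand-in for your actual RAG pipeline.
    return "SSO is available on Enterprise plans; Pro allows 100 requests per minute."

def evaluate() -> float:
    passed = 0
    for case in TEST_SET:
        answer = rag_answer(case["query"]).lower()
        if all(term.lower() in answer for term in case["must_contain"]):
            passed += 1
        else:
            print("FAIL:", case["query"])
    return passed / len(TEST_SET)

print(f"pass rate: {evaluate():.0%}")
```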
💡 Continuous Improvement
Week 1: Establish baseline metrics
Week 2: Identify worst-performing queries
Week 3: Improve (better chunks, prompts, etc.)
Week 4: Re-evaluate and iterate
Goal: 5-10% improvement per iteration cycle
Module 11: On-Premise vs Cloud AI
Where you run your AI affects security, cost, performance, and maintenance. This is a strategic decision with long-term implications.
Comprehensive Comparison
| Factor | On-Premise | Cloud |
|---|---|---|
| Initial Cost | $50K-500K+ (hardware) | $0 (pay-as-you-go) |
| Per-Query Cost | $0.0001-0.001 (at scale) | $0.001-0.02 |
| Scalability | Limited by hardware | Instant, unlimited |
| Data Control | Complete, never leaves premises | Depends on provider terms |
| Latest Models | Delayed, manual updates | Immediate access |
| Maintenance | Your team (24/7) | Provider handles |
| Expertise Required | High (ML engineers, DevOps) | Medium (API integration) |
| Time to Production | 3-6 months | Days to weeks |
Decision Tree
1. Strict data-residency or compliance requirements? → On-premise (or hybrid).
2. Unpredictable or spiky load, small team, need to ship fast? → Cloud.
3. Very high, steady query volume where per-query cost dominates? → On-premise starts to pay off.
4. Mixed requirements? → Hybrid (see the example below).
🎯 Hybrid Approach Example
Sensitive Operations: Customer PII analysis → On-premise
General Operations: Public FAQs, general queries → Cloud
Development: Testing and experimentation → Cloud
Production: Core business logic → On-premise
Module 12: DevOps for AI & Cost Management
Deploying AI is just the beginning. Proper DevOps and cost management are critical for sustainable, efficient AI operations.
Cost Components Breakdown
| Component | % of Total | Optimization Impact |
|---|---|---|
| Model API Calls | 50-70% | High - Caching can save 40-60% |
| Vector Database | 15-25% | Medium - Index optimization helps |
| Embedding Generation | 10-15% | Medium - Batch processing saves 20% |
| Storage & Infrastructure | 5-10% | Low - Regular cleanup needed |
Top 10 Cost Optimization Strategies
- Implement Caching: Cache identical/similar queries - biggest win (40-60% savings); see the sketch after this list
- Choose Right Model: Use smaller models for simple tasks (GPT-3.5 vs GPT-4)
- Prompt Optimization: Remove unnecessary words, compress context
- Batch Processing: Process non-urgent requests in batches (20% cheaper)
- Set Token Limits: Prevent runaway generation with max_tokens
- Smart Retrieval: Retrieve fewer chunks when possible (3 instead of 10)
- Rate Limiting: Prevent abuse and unexpected spikes
- Monitoring & Alerts: Set spending alerts before costs spiral
- Optimize Embeddings: Cache embeddings, use cheaper models for embeddings
- Regular Audits: Weekly review of high-cost queries and users
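A minimal sketch of strategy #1: an exact-match cache keyed by a hash of the normalized prompt. The in-memory dict and stubbed model call are simplifications - production systems typically use Redis with a TTL, plus semantic (embedding-based) caching to catch near-duplicate queries.

```python
# Exact-match response cache: identical prompts never hit the model twice.
import hashlib

_cache: dict[str, str] = {}

def cache_key(prompt: str) -> str:
    normalized = " ".join(prompt.lower().split())    # strip case/whitespace noise
    return hashlib.sha256(normalized.encode()).hexdigest()

def call_model(prompt: str) -> str:
    return f"(expensive model call for: {prompt})"   # stand-in for the real API

def cached_completion(prompt: str) -> str:
    key = cache_key(prompt)
    if key not in _cache:
        _cache[key] = call_model(prompt)             # only pay for cache misses
    return _cache[key]

cached_completion("What is your refund policy?")
cached_completion("what is your refund policy?   ")  # cache hit: no API cost
print(len(_cache), "model call(s) made")             # -> 1
```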
Real Cost Example
Case Study: Customer Support Chatbot
Volume: 10,000 conversations/month
Avg tokens/conversation: 2,000 (input + output)
Before Optimization:
- Model: GPT-4 for everything
- No caching
- Retrieving 10 chunks per query
- Cost: $5,200/month
After Optimization:
- GPT-3.5 for 70% of queries, GPT-4 for complex only
- 50% cache hit rate
- 5 chunks per query average
- Prompt compression
- Cost: $850/month (84% reduction!)
Essential DevOps Practices
Monitoring
- Response latency
- Error rates
- Token usage
- Cost per user
- Cache hit rates
Logging
- All queries & responses
- Retrieved chunks
- Model versions
- User feedback
- Error traces
Testing
- Regression tests
- Load testing
- A/B experiments
- Hallucination checks
- Performance benchmarks
Module 13: Reasoning vs Non-Reasoning Models
Reasoning models represent a fundamental shift in how AI approaches complex problems. Unlike traditional models that generate answers immediately, reasoning models think through problems step-by-step before responding.
Architecture Differences
Traditional (Non-Reasoning) Models
The prompt goes in and the answer is generated directly, token by token - there is no separate thinking phase.
Reasoning Models
The prompt goes in, the model first generates hidden reasoning tokens (exploring approaches, verifying, revising), and only then produces the final answer.
Key Architectural Components
Reasoning Tokens
Hidden tokens generated during "thinking" phase
Purpose: Work through problem internally
Cost: Uses more compute
Search & Verification
Model explores multiple solution paths
Purpose: Find best approach
Method: Tree search or beam search
Self-Correction
Model checks own work and revises
Purpose: Catch errors before responding
Impact: Dramatic accuracy improvement
Detailed Comparison
| Aspect | Non-Reasoning Models | Reasoning Models |
|---|---|---|
| Examples | GPT-4, Claude 3.5, Llama 3 | OpenAI o1, o3, DeepSeek R1 |
| Response Time | 1-3 seconds | 10-60+ seconds |
| Token Usage | Lower (direct generation) | Higher (reasoning + answer) |
| Math Accuracy | 60-70% on complex problems | 90-95% on complex problems |
| Coding Tasks | Good for standard patterns | Excellent for complex algorithms |
| Training | Supervised fine-tuning | Reinforcement learning + search |
| Reasoning Visibility | Only in output text | Can expose thinking process |
| Best For | General conversation, creative writing | Math, logic, complex problem-solving |
How Reasoning Models Work Internally
💡 The Process Explained
Step 1: Problem Understanding
Model generates hidden tokens analyzing the problem structure, constraints, and requirements.
Step 2: Solution Exploration
Model explores multiple approaches: "What if I try method A? What about method B?" Each path is evaluated.
Step 3: Self-Verification
Model checks its work: "Does this solution satisfy all constraints? Let me verify each step."
Step 4: Refinement
If issues found, model backtracks and tries different approach. Repeats until confident.
Step 5: Final Generation
Only after thorough reasoning, model generates the final response to the user.
Example: Same Problem, Different Models
GPT-4 (Non-Reasoning)
Problem: "If 5 machines make 5 widgets in 5 minutes, how long for 100 machines to make 100 widgets?"
Response (2 seconds):
"It would take 20 minutes."
❌ Incorrect - fell into intuitive trap
o1 (Reasoning)
Same Problem
Thinking (15 seconds):
"Wait, let me think... if 5 machines make 5 widgets in 5 minutes, that means each machine makes 1 widget in 5 minutes. So 100 machines would each make 1 widget in 5 minutes..."
Response:
"It would take 5 minutes."
✅ Correct - reasoned through the problem
Training Differences
- Traditional Models: Trained on (prompt, response) pairs. Learn to predict what comes next based on patterns in training data.
- Reasoning Models: Trained using reinforcement learning with reward signals for correct reasoning steps. Learn to explore solution space systematically.
- Key Innovation: Reasoning models receive rewards not just for correct final answers, but for correct intermediate reasoning steps.
- Result: Model learns to "think" through problems rather than pattern-match to memorized solutions.
Cost-Performance Trade-offs
Real-World Scenario: Code Generation
Task: Generate complex algorithm for graph optimization
GPT-4 Approach:
- Cost: $0.03 per attempt
- Time: 3 seconds
- Success rate: 60%
- Total cost (avg): $0.05 (with retries)
o1 Approach:
- Cost: $0.15 per attempt
- Time: 30 seconds
- Success rate: 95%
- Total cost (avg): $0.16 (rarely needs retry)
Analysis: The reasoning model costs roughly 3x more per task but almost always succeeds on the first attempt. Once failed attempts, retries, and engineer review time are counted, it is often the more cost-effective choice for complex tasks where accuracy matters.
When to Use Each Type
🎯 Use Non-Reasoning Models For:
- General conversation: Customer support, chatbots
- Creative writing: Marketing copy, stories, content
- Simple tasks: Summarization, translation, formatting
- High volume: When speed and cost matter more than perfection
- Real-time responses: Interactive applications
🧠 Use Reasoning Models For:
- Complex math: Multi-step calculations, proofs
- Advanced coding: Algorithm design, bug fixing
- Logic puzzles: Planning, scheduling, optimization
- Scientific reasoning: Hypothesis generation, analysis
- High-stakes decisions: Where accuracy is critical
💡 Hybrid Approach
Many production systems use both:
- Fast model (GPT-4): Initial response, simple queries
- Reasoning model (o1): Triggered for complex problems detected by fast model
- Result: 95% of queries handled quickly and cheaply, 5% get deep reasoning when needed
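A minimal sketch of that routing idea; the keyword heuristic and model labels are illustrative placeholders - real routers often use a small classifier or let the fast model flag queries that need deeper reasoning.

```python
# Route simple queries to a fast model, hard ones to a reasoning model.

HARD_SIGNALS = ("prove", "optimize", "algorithm", "step by step", "schedule")

def needs_reasoning(query: str) -> bool:
    q = query.lower()
    return any(signal in q for signal in HARD_SIGNALS) or len(q.split()) > 80

def route(query: str) -> str:
    return "reasoning-model" if needs_reasoning(query) else "fast-model"

print(route("What's your refund policy?"))                       # -> fast-model
print(route("Optimize this delivery schedule across 40 stops"))  # -> reasoning-model
```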
🎮 Demo 1: RAG (Retrieval-Augmented Generation)
What You'll Learn
See how RAG retrieves relevant chunks from a knowledge base and uses them to answer questions accurately. Compare the difference between RAG and non-RAG responses.
🎮 Demo 2: Prompt Engineering
What You'll Learn
Experiment with different prompts and see how they affect AI output quality, structure, and detail.
💡 Try These Improvements:
- Add a specific role: "You are a financial analyst for tech startups..."
- Specify output format: "Provide exactly 3 bullet points..."
- Include constraints: "Focus only on revenue trends and product performance..."
- Request reasoning: "Think step-by-step and explain your analysis..."
🎮 Demo 3: Function Calling
What You'll Learn
See how function calling enables AI to access real-time data and external tools. Compare responses with and without tool access.
🎯 Key Insights:
- With Tools: AI recognizes need for current data, calls function, receives real data, provides accurate answer
- Without Tools: AI only knows what was in training data, must decline or speculate
- Real Impact: Function calling bridges the gap between AI knowledge and real-world, current information
🎮 Demo 4: Model Context Protocol (MCP)
What You'll Learn
See how MCP provides a universal interface for AI to discover and use tools. This demo simulates connecting to an MCP server with multiple tools.
🖥️ Simulated MCP Server (available tools)
- Returns employee details from the HR database
- Returns meetings scheduled for a date
- Returns current weather conditions
🎯 MCP Advantages:
- Tool Discovery: AI automatically discovers available tools from MCP server
- Universal Interface: Same MCP tools work with any AI model that supports MCP
- No Custom Code: Don't need to write integration code for each AI provider
- Composable: Easily add/remove tools by connecting/disconnecting MCP servers
📝 Final Assessment
Question 1: RAG
What is the main benefit of using RAG over retraining an AI model?
Question 2: Chunking
What is the optimal chunk size for most RAG applications?
Question 3: MCP
What problem does Model Context Protocol (MCP) solve?
Question 4: Prompt Engineering
Which is an example of a good prompt pattern?
Question 5: Chain of Thought
When should you use Chain of Thought prompting?
Question 6: Few-Shot Learning
How many examples typically constitute "few-shot" learning?
Question 7: Customization Methods
Which approach should you try FIRST for AI customization?
Question 8: Evaluation
What is an acceptable hallucination rate for production AI systems?
Question 9: Deployment
When is on-premise AI deployment most appropriate?
Question 10: Cost Optimization
What is the most effective cost reduction strategy?