Chapter 2: Understanding AI Model Architectures
Why Model Architecture Matters for Prompting
Imagine trying to give instructions to different types of workers: an essay writer, a translator, an artist, and a research assistant who can consult multiple sources. You wouldn’t give them the same instructions, right? You’d tailor your request based on their skills and how they work.
Similarly, different AI models have different internal “designs” or architectures. These architectures dictate how they process information, what kind of tasks they excel at, and what their limitations are. Understanding the basics of these architectures helps you, the prompt engineer, craft instructions (prompts) that play to the model’s strengths and avoid its weaknesses.
Knowing the architecture helps you answer questions like:
- Why does this model sometimes forget information I gave it earlier? (Context window limits)
- Why is this model great at summarizing but struggles with creative writing? (Encoder-Decoder vs. Decoder-only Transformers)
- Why do I need such descriptive words to get a good image from this AI? (Diffusion models need detailed guidance)
- How can I ask this model to analyze both text and an image simultaneously? (Multimodal capabilities)
By tailoring your prompts to the model’s architecture, you can significantly improve the quality, relevance, and efficiency of the responses you receive.
Major AI Model Families and Their Prompting Implications
Let’s explore the most common architectures relevant to prompt engineering:
1. Transformer-Based Language Models (Decoder-Only Focus)
- Examples: OpenAI’s GPT series (GPT-3, GPT-4), Anthropic’s Claude, Google’s Gemini (in part), Meta’s LLaMA.
- Core Idea: These models are like incredibly sophisticated auto-complete engines. They read your prompt (the input sequence) and predict the most statistically likely next word (or “token”), then the next, and so on, generating the response sequentially.
- Key Architectural Feature: The Transformer architecture, particularly the self-attention mechanism. This allows the model to weigh the importance of different words in the input prompt when generating each new word. It can “pay attention” to relevant earlier parts of the text.
- Characteristics:
- Sequential Generation: Builds the output token by token.
- Context Window: Has a limit on how much text (prompt + generated response) it can “remember” or consider at once (measured in tokens, roughly equivalent to parts of words). Older models might have had a few thousand tokens; newer ones like GPT-4 Turbo or Claude 3 have windows exceeding 100,000 tokens.
- In-Context Learning: Very good at learning patterns or tasks from examples provided directly within the prompt (few-shot learning).
- Prompt Engineering Implications:
- Context Window Awareness: Be mindful of the token limit. For long tasks, you might need to break them down or summarize previous parts. Put the most crucial instructions or context early in the prompt, as some models pay more attention to the beginning and end.
- Clarity is Key: Since it predicts word by word, clear and unambiguous instructions reduce the chance of it going off-topic.
- Leverage In-Context Learning: Use few-shot examples (Input/Output pairs) to demonstrate the desired format, style, or task.
- Explicit Formatting: Tell the model exactly how you want the output structured (e.g., “Respond in JSON format,” “Use bullet points,” “Write in a formal tone”).
```mermaid
%%{ init: { 'theme': 'base', 'themeVariables': { 'primaryColor': '#EDE9FE', 'primaryTextColor': '#5B21B6', 'lineColor': '#A78BFA', 'textColor': '#1F2937', 'fontSize': '14px' }}}%%
graph TD
    A[Input Prompt] --> B{Transformer Block};
    B -- Attends to Input & Past Output --> C(Predict Next Token);
    C --> D[Append Token to Output];
    D -- Loop for next token --> B;
    D --> E[Final Output Sequence];
    style A fill:#FFFBEB,stroke:#FBBF24,stroke-width:2px;
    style E fill:#ECFDF5,stroke:#34D399,stroke-width:2px;
    style B fill:#EFF6FF,stroke:#60A5FA,stroke-width:2px;
    style C fill:#FEF2F2,stroke:#F87171,stroke-width:2px;
    style D fill:#FEF2F2,stroke:#F87171,stroke-width:2px;
```
Example Prompt:
You are a helpful Python programming assistant.
Your task is to convert the following JavaScript function into equivalent Python code.
Ensure the Python code is well-commented, explaining the logic.
JavaScript Input:
```javascript
function calculateFactorial(n) {
  if (n < 0) {
    return "Factorial is not defined for negative numbers";
  } else if (n === 0) {
    return 1;
  } else {
    let result = 1;
    for (let i = 1; i <= n; i++) {
      result *= i;
    }
    return result;
  }
}
```
Python Output:
Here's the equivalent Python code for the given JavaScript `calculateFactorial` function, with clear comments explaining each step:

```python
def calculate_factorial(n):
    """
    Calculate the factorial of a non-negative integer n.

    Args:
        n (int): The number to calculate the factorial for.

    Returns:
        int or str: The factorial of n if n is non-negative;
        otherwise, an error message.
    """
    # Check if the input is negative
    if n < 0:
        return "Factorial is not defined for negative numbers"
    # Base case: factorial of 0 is 1
    elif n == 0:
        return 1
    else:
        result = 1
        # Multiply all integers from 1 to n
        for i in range(1, n + 1):
            result *= i
        return result

# Example usage:
print(calculate_factorial(5))   # Output: 120
print(calculate_factorial(-2))  # Output: "Factorial is not defined for negative numbers"
```
Self-Correction: Initially, I just asked for conversion. Adding the role, task clarity, the commenting requirement, and clear delimiters (code fences) makes the prompt much more effective for a Transformer model.
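In practice, prompts like this are usually sent through an API rather than typed into a chat window. The snippet below is a minimal sketch, assuming the `openai` Python SDK with an API key configured and using "gpt-4o" purely as a placeholder model name; it shows how the decoder-only advice above (a role, few-shot input/output pairs, and an explicit output format) translates into a chat-completion call.

```python
# Minimal few-shot prompting sketch using the OpenAI Python SDK.
# Assumptions: the `openai` package is installed, OPENAI_API_KEY is set,
# and "gpt-4o" is a placeholder for any decoder-only chat model you can access.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        # Role + explicit output format, as recommended above.
        {"role": "system", "content": "You are a sentiment classifier. Respond only with valid JSON."},
        # Few-shot examples: input/output pairs demonstrating the task and format.
        {"role": "user", "content": "Sentence: I love this phone!"},
        {"role": "assistant", "content": '{"sentiment": "positive"}'},
        {"role": "user", "content": "Sentence: The battery died after a week."},
        {"role": "assistant", "content": '{"sentiment": "negative"}'},
        # The actual query, following the same pattern as the examples.
        {"role": "user", "content": "Sentence: The screen is bright but the speakers are tinny."},
    ],
)

print(response.choices[0].message.content)
```

The same few-shot pattern carries over to other decoder-only chat models; only the client call and model name change.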
2. Encoder-Decoder Models (Seq2Seq)
- Examples: Google’s T5, Facebook’s BART, earlier machine translation models.
- Core Idea: These models have two main parts. The Encoder reads the entire input sequence (like a sentence in English) and compresses it into an internal representation of its meaning (in older models a single fixed-size "thought vector", in Transformer-based ones a set of contextual embeddings). The Decoder then takes this representation and generates the output sequence (like the translated sentence in French).
- Characteristics:
- Input Processing: Reads the whole input before starting output.
- Task-Specific Fine-Tuning: Often pre-trained and then fine-tuned for specific sequence-to-sequence tasks like translation, summarization, or text classification.
- Prompt Engineering Implications:
- Task Prefixes: Often perform best when the prompt starts with a specific prefix indicating the task (e.g., `translate English to French:`, `summarize:`, `question:`). These prefixes were often used during their fine-tuning.
- Conciseness: Since the encoder summarizes the input, overly long or rambling prompts might lose nuance. Get straight to the point.
- Domain Alignment: Performance is usually best when the prompt matches the tasks the model was fine-tuned for. Using a translation model for creative writing might yield poor results.
```mermaid
graph TD
    A[Input Sequence<br><i>e.g., English Sentence</i>] --> B{Encoder};
    B -- Creates --> C[Context Vector<br><i>Internal Representation</i>];
    C --> D{Decoder};
    D -- Generates Step-by-Step --> E[Output Sequence<br><i>e.g., French Sentence</i>];
    style A fill:#FFFBEB,stroke:#FBBF24,stroke-width:2px;
    style B fill:#EFF6FF,stroke:#60A5FA,stroke-width:2px;
    style C fill:#F3E8FF,stroke:#A78BFA,stroke-width:2px;
    style D fill:#FEF2F2,stroke:#F87171,stroke-width:2px;
    style E fill:#ECFDF5,stroke:#34D399,stroke-width:2px;
```
Example Prompt:
summarize: The international conference on renewable energy concluded yesterday after three days of intensive discussions. Over 500 delegates from 80 countries attended. Key topics included advancements in solar panel efficiency, breakthroughs in battery storage technology, challenges in grid integration for variable renewables like wind and solar, and policy frameworks needed to accelerate the transition away from fossil fuels. Several major investment commitments were announced, totaling over $5 billion for new wind and solar projects globally. The next conference is scheduled for 2026 in Berlin.
Note: Ensuring the input text is coherent and directly related to the `summarize:` task prefix improves the likelihood of a good summary.
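To see a task prefix in action locally, here is a minimal sketch assuming the Hugging Face `transformers` library with a PyTorch backend; `t5-small` is used only because it is a lightweight checkpoint, and the generation settings are illustrative. The `summarize:` prefix matches one of the tasks T5 saw during training.

```python
# Sketch: using a task prefix with an encoder-decoder model (T5).
# Assumptions: `transformers`, `sentencepiece`, and PyTorch are installed;
# "t5-small" is a small example checkpoint, not a recommendation.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

article = (
    "The international conference on renewable energy concluded yesterday "
    "after three days of intensive discussions. Over 500 delegates from 80 "
    "countries attended..."  # truncated for brevity
)

# The "summarize:" prefix signals the task, mirroring T5's fine-tuning setup.
inputs = tokenizer("summarize: " + article, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_new_tokens=60)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```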
3. Diffusion Models
- Examples: Midjourney, Stable Diffusion, DALL·E 2/3, Google’s Imagen.
- Core Idea: Primarily used for image generation. They work by starting with random noise (like TV static) and gradually refining it over many steps, guided by the text prompt, until it forms a coherent image matching the description.
- Key Architectural Feature: Uses a text encoder (often based on models like CLIP or T5) to understand the prompt and guide the denoising process. The model learns to reverse a process that adds noise to images.
- Characteristics:
- Iterative Refinement: Generates images step-by-step, not instantly.
- Sensitivity to Descriptors: Highly influenced by specific nouns, adjectives, artistic styles, lighting terms, and composition keywords in the prompt.
- Text-to-Image Link: The quality of the text encoding directly impacts the faithfulness of the generated image to the prompt.
- Prompt Engineering Implications:
- Be Descriptive and Specific: Use rich, sensory language. Instead of “a cat,” try “A fluffy ginger Maine Coon cat lounging lazily in a sunbeam on a rustic wooden floor, photorealistic style.”
- Include Style and Medium: Specify artistic styles (“in the style of Studio Ghibli,” “impressionist painting,” “cyberpunk art”), camera angles (“wide angle shot,” “close-up portrait”), lighting (“cinematic lighting,” “soft morning light”), and composition.
- Use Keywords: These models respond well to comma-separated keywords and phrases.
- Negative Prompts: Many tools allow "negative prompts" to specify what not to include (e.g., `--no blurry, text, watermark`).
- Iteration is Crucial: Expect to refine your prompt multiple times, adding, removing, or rephrasing terms to get the desired visual.
```mermaid
%%{ init: { 'theme': 'base', 'themeVariables': { 'primaryColor': '#EDE9FE', 'primaryTextColor': '#5B21B6', 'lineColor': '#A78BFA', 'textColor': '#1F2937', 'fontSize': '14px' }}}%%
graph TD
    A[Text Prompt] --> B{Text Encoder};
    C[Initial Noise<br><i>Random Pixels</i>] --> D{Denoising Steps<br><i>Iterative Refinement</i>};
    B -- Guides --> D;
    D -- Step 1 --> D1(...);
    D1 -- Step 2 --> D2(...);
    D2 -- Step N --> E[Generated Image];
    style A fill:#FFFBEB,stroke:#FBBF24,stroke-width:2px;
    style B fill:#EFF6FF,stroke:#60A5FA,stroke-width:2px;
    style C fill:#F3F4F6,stroke:#9CA3AF,stroke-width:2px;
    style D fill:#FEF2F2,stroke:#F87171,stroke-width:2px;
    style E fill:#ECFDF5,stroke:#34D399,stroke-width:2px;
```
Example Prompt (Midjourney style):
A breathtaking landscape vista of the Scottish Highlands, dramatic storm clouds gathering over rugged green mountains, a lone stag silhouetted on a ridge, shafts of golden sunlight breaking through the clouds, epic fantasy painting, style of Albert Bierstadt, cinematic lighting, hyperdetailed, 8k resolution --ar 16:9 --style raw
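If you want to experiment with descriptive and negative prompts programmatically rather than in a chat interface, the sketch below uses the Hugging Face `diffusers` library. The checkpoint name, step count, and guidance scale are illustrative assumptions, not recommendations, and a GPU is assumed.

```python
# Sketch: descriptive prompt + negative prompt with a diffusion model.
# Assumptions: `diffusers`, `transformers`, and `torch` are installed,
# a CUDA GPU is available, and the checkpoint name below is accessible
# (swap in whichever Stable Diffusion checkpoint you use).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe(
    prompt=(
        "A breathtaking landscape vista of the Scottish Highlands, dramatic storm "
        "clouds over rugged green mountains, a lone stag silhouetted on a ridge, "
        "epic fantasy painting, cinematic lighting, hyperdetailed"
    ),
    negative_prompt="blurry, text, watermark",  # what to steer away from
    num_inference_steps=30,  # number of iterative denoising steps
    guidance_scale=7.5,      # how strongly the text prompt steers the denoising
).images[0]

image.save("highlands.png")
```

Note how the negative prompt plays the same role as Midjourney's `--no` flag: it tells the denoising process which concepts to suppress.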
4. Multimodal Models
- Examples: OpenAI’s GPT-4V (Vision), Google’s Gemini, Anthropic’s Claude 3 (Vision capabilities).
- Core Idea: These models are designed to process and understand information from multiple types (modalities) of input simultaneously, most commonly text and images.
- Key Architectural Feature: Often use separate encoders for each modality (e.g., a Vision Transformer for images, a standard Transformer for text) and then mechanisms (like cross-attention) to fuse the information from these different streams.
- Characteristics:
- Cross-Modal Understanding: Can relate concepts between text and images (e.g., answer questions about an image, describe an image in text, generate text based on visual content).
- Combined Input: Can accept prompts that include both text instructions and image data.
- Prompt Engineering Implications:
- Clear Task Specification: Explicitly state how the different modalities should interact. “Describe this image,” “What is the main object in the provided picture?”, “Based on the chart image, summarize the key trends.”
- Referencing Inputs: Clearly refer to the different inputs if necessary (e.g., “Compare the product in the image to the description in the text below.”).
- Specify Output Modality: Indicate whether you expect a text response, an edited image (if supported), etc.
```mermaid
%%{ init: { 'theme': 'base', 'themeVariables': { 'primaryColor': '#EDE9FE', 'primaryTextColor': '#5B21B6', 'lineColor': '#A78BFA', 'textColor': '#1F2937', 'fontSize': '14px' }}}%%
graph TD
    subgraph Inputs
        A[Text Input];
        B[Image Input];
    end
    subgraph Encoders
        C{Text Encoder};
        D{Image Encoder<br><i>e.g., Vision Transformer</i>};
    end
    subgraph Processing
        E{Fusion Mechanism<br><i>e.g., Cross-Attention</i>};
        F{Decoder / Generator};
    end
    G[Final Output<br><i>e.g., Text Description, Answer</i>];
    A --> C;
    B --> D;
    C --> E;
    D --> E;
    E --> F;
    F --> G;
    style A fill:#FFFBEB,stroke:#FBBF24,stroke-width:2px;
    style B fill:#E0F2FE,stroke:#38BDF8,stroke-width:2px;
    style C fill:#EFF6FF,stroke:#60A5FA,stroke-width:2px;
    style D fill:#EFF6FF,stroke:#60A5FA,stroke-width:2px;
    style E fill:#F3E8FF,stroke:#A78BFA,stroke-width:2px;
    style F fill:#FEF2F2,stroke:#F87171,stroke-width:2px;
    style G fill:#ECFDF5,stroke:#34D399,stroke-width:2px;
```
Example Prompt (Assuming an image of a busy city street is also provided):
You are a visual analysis assistant. Based *only* on the provided image: 1. Describe the overall scene and time of day. 2. Identify the primary mode of transportation visible. 3. List three distinct types of buildings or structures you can see. Respond in a numbered list format.
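As a sketch of how such a combined text-plus-image prompt might be sent programmatically, here is an example using the OpenAI Python SDK's chat format with an image URL. The model name and image URL are placeholders; other multimodal APIs differ in detail but follow the same pattern of pairing text instructions with image data.

```python
# Sketch: sending text instructions plus an image to a multimodal model.
# Assumptions: the `openai` package is installed, OPENAI_API_KEY is set,
# "gpt-4o" stands in for any vision-capable chat model, and the URL below
# is a hypothetical publicly accessible image.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder vision-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Based only on the provided image: "
                        "1. Describe the overall scene and time of day. "
                        "2. Identify the primary mode of transportation visible. "
                        "3. List three distinct types of buildings or structures you can see. "
                        "Respond in a numbered list format."
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/busy-street.jpg"},  # hypothetical image
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```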
Key Architectural Considerations for Prompting: A Summary Table
| Model Type | Primary Use Cases | Key Prompting Strategies | Why it Works (Architecture Link) |
|---|---|---|---|
| Transformer (Decoder-Only) (e.g., GPT-3/4, Claude, Gemini, LLaMA) | Text generation, Q&A, chat, coding, creative writing | Mind the context window; give clear, unambiguous instructions; use few-shot examples; specify the output format explicitly | Sequential token-by-token generation benefits from clear step-by-step guidance. The self-attention mechanism effectively utilizes examples and instructions provided within the prompt's context window. |
| Encoder-Decoder (Seq2Seq) (e.g., T5, BART, older translation models) | Translation, summarization, text classification, question answering | Use task prefixes (e.g., `summarize:`); keep prompts concise; align the task with what the model was fine-tuned for | The encoder compresses the entire input into an internal representation before the decoder begins generation. Prefixes often align with how the model was trained/fine-tuned for specific sequence-to-sequence tasks. |
| Diffusion Models (e.g., Midjourney, Stable Diffusion, DALL·E 2/3, Imagen) | Image generation from text | Be descriptive and specific; include style, medium, lighting, and composition keywords; use negative prompts; iterate | A text encoder interprets the prompt to guide an iterative denoising process starting from random noise. Specific and detailed terms are crucial for steering the visual output accurately. |
| Multimodal Models (e.g., GPT-4V, Gemini, Claude 3 Vision) | Vision-language tasks (image description, visual Q&A), processing combined text/image input | State clearly how the modalities should interact; reference each input explicitly; specify the expected output modality | Separate encoders process different input types (e.g., Vision Transformer for images, text Transformer for text). Fusion mechanisms combine this information, requiring clear instructions on how the modalities should interact. |
Understanding Context Windows and Memory
A critical limitation tied to architecture is the context window.
- What it is: The maximum amount of information (measured in tokens) a model can “see” or process at any given moment. This includes both your input prompt and the response it has generated so far.
- Why it matters: If a conversation or a document exceeds the context window, the model effectively “forgets” the earliest parts. This can lead to:
- Repetitive responses.
- Loss of instructions or persona given early on.
- Inability to refer back to information outside the window.
- Examples:
- A model with an 8,000-token window might struggle to summarize a 15,000-token document accurately in one go.
- In a long chatbot conversation, the bot might forget your name or the initial topic discussed.
- Strategies:
- Prioritize: Place the most critical instructions or information at the end of the prompt (closer to where generation starts) or repeat key instructions periodically.
- Summarization: For long interactions, periodically summarize the key points and feed that summary back into the prompt.
- Chunking: Break down large documents or tasks into smaller pieces that fit within the context window.
- Retrieval-Augmented Generation (RAG): Use external tools to fetch relevant snippets from large documents just-in-time and insert them into the prompt (covered later).
- Choose Models Wisely: If dealing with very long contexts, select models specifically designed with large context windows (e.g., Claude 3 models, GPT-4 Turbo).
Some systems also implement explicit memory features (like ChatGPT’s Memory) which try to retain key facts about the user or conversation across sessions, mitigating context window limitations to some extent, but this is often managed outside the core model architecture itself.
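To put the counting and chunking strategies into practice, here is a minimal sketch using the `tiktoken` library. The encoding name and the 8,000-token budget are assumptions; match them to whichever model you actually target.

```python
# Sketch: counting tokens and chunking a long document to fit a context window.
# Assumptions: `tiktoken` is installed, "cl100k_base" approximates your target
# model's tokenizer, and 8,000 tokens is an illustrative budget.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Return the number of tokens this text occupies under the chosen encoding."""
    return len(encoding.encode(text))

def chunk_text(text: str, max_tokens: int = 8000) -> list[str]:
    """Split text into pieces that each fit within max_tokens."""
    token_ids = encoding.encode(text)
    return [
        encoding.decode(token_ids[i : i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]

long_document = "..."  # placeholder for a document that exceeds the window
print(count_tokens(long_document))
for chunk in chunk_text(long_document):
    # Each chunk can be summarized separately, and the partial summaries
    # combined in a final prompt (the summarization strategy above).
    pass
```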
Practical Exercise
- Choose a Task: Select a simple task, like “Write a short poem about a rainy day” or “Explain the concept of photosynthesis in one paragraph.”
- Select Two Different Model Types: If possible, access two different types of models (e.g., a GPT-style model via OpenAI API or ChatGPT, and an image model like Midjourney or Stable Diffusion, or perhaps a model known for summarization if available).
- Craft Initial Prompts: Write a basic prompt for your chosen task, tailored slightly for each model type based on what we discussed (e.g., more descriptive for image, maybe more direct for text).
- Observe and Compare: Run the prompts and compare the outputs.
- How did the text model handle the poem/explanation request?
- How did the image model interpret the concept (if applicable)? Could you prompt it differently to get a better visual representation?
- (If using two text models): Did they respond differently in style or detail?
- Refine for One Model: Choose one model and try to improve the output by refining the prompt. Apply at least two techniques discussed:
- Add a role (e.g., “You are a melancholic poet…”).
- Specify format (e.g., “Write a four-line stanza”).
- Add descriptive details or constraints.
- Provide a simple example (few-shot).
- Reflect: What changes had the biggest impact on the output? How did understanding the likely architecture help you refine the prompt?
Summary
A model’s architecture is like its cognitive toolkit – it defines how it perceives, processes, and generates information. Transformer models excel at sequential text tasks using attention, Encoder-Decoders are suited for transformation tasks like translation, Diffusion models build images from noise guided by text, and Multimodal models bridge different data types. Understanding these differences, particularly limitations like context windows, allows you to craft prompts that are significantly more effective, leading to more accurate, controlled, and useful AI outputs.
In the next chapter of this course, we'll build on this architectural understanding to establish the core principles that underpin effective prompting, regardless of the specific model you're interacting with.
External Sources:
- Jay Alammar: "The Illustrated Transformer" – https://jalammar.github.io/illustrated-transformer/
- Lilian Weng: "What Are Diffusion Models?" – https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
- DataNorth: "Context Length in LLMs: What Is It and Why It Is Important" – https://datanorth.ai/blog/context-length