Chapter 281: Writing Optimized ESP32 Code

Chapter Objectives

By the end of this chapter, you will be able to:

  • Understand the key factors affecting performance on an ESP32: CPU speed, memory access, and I/O.
  • Configure ESP-IDF compiler settings to optimize for speed or binary size.
  • Strategically place code and data in faster memory regions like IRAM and DRAM.
  • Analyze the trade-offs between different optimization techniques.
  • Apply best practices for writing efficient C code for resource-constrained systems.
  • Recognize performance differences across various ESP32 variants.

Introduction

In embedded systems, every CPU cycle and every byte of memory counts. While the ESP32 is a powerful microcontroller, its resources are finite. Unoptimized code can lead to sluggish performance, missed real-time deadlines, excessive power consumption, and an inability to add new features. In a commercial product, these issues can mean the difference between success and failure.

This chapter shifts our focus from simply making code work to making it work efficiently. We will explore the art and science of optimization in the context of ESP-IDF. This involves instructing the compiler to be smarter, telling the chip where to place critical code, and writing our own logic in a way that respects the hardware’s architecture. Optimization is not about making every line of code as fast as possible; it’s about identifying critical bottlenecks and applying targeted, effective solutions.

Theory

Optimizing for an embedded system like the ESP32 is a multi-layered process. It spans from high-level algorithmic choices down to low-level hardware-specific tweaks.

graph TD
    subgraph "Optimization Cycle"
        direction LR
        A[Start: Application Works] --> B{"Measure Performance<br><b>(Profile Your Code)</b>"};
        B --> C{Identify Bottleneck<br><i>Is it CPU or I/O bound?</i>};
        C --> D[Apply Targeted Optimization<br>e.g., IRAM_ATTR, -Os, Algorithm change];
        D --> E{Test & Verify<br><i>Did it improve?<br>Does it still work?</i>};
        E -- Yes --> F[Integrate Change];
        E -- No --> C;
        F --> B;
    end

    G[End: Performance Goals Met]

    F --> G

    %% Styling
    classDef startNode fill:#EDE9FE,stroke:#5B21B6,stroke-width:2px,color:#5B21B6;
    classDef processNode fill:#DBEAFE,stroke:#2563EB,stroke-width:1px,color:#1E40AF;
    classDef decisionNode fill:#FEF3C7,stroke:#D97706,stroke-width:1px,color:#92400E;
    classDef validationNode fill:#FEE2E2,stroke:#DC2626,stroke-width:1px,color:#991B1B;
    classDef endNode fill:#D1FAE5,stroke:#059669,stroke-width:2px,color:#065F46;

    class A,F startNode;
    class B,D processNode;
    class C,E decisionNode;
    class G endNode;

1. Understanding Performance Bottlenecks

Before you can optimize, you must first identify what is slowing your application down. A bottleneck is the part of the system that limits the overall performance. In ESP32 applications, bottlenecks typically fall into two categories:

  • CPU-Bound: The task is limited by the speed of the processor. This often occurs in applications involving heavy computation, such as digital signal processing (DSP), cryptographic operations, or complex state machines. The CPU is running at maximum capacity, and other resources (like I/O) are waiting for it to finish.
  • I/O-Bound: The task is limited by the speed of an input/output peripheral. This happens when the CPU spends most of its time waiting for data from a sensor, writing to a display, sending data over a network, or accessing flash memory. The CPU is often idle, waiting for the peripheral to complete its operation.

The first rule of optimization is to measure, don’t guess. Profiling your application (covered in Chapter 282) is the correct way to find bottlenecks. Once identified, you can apply the appropriate optimization strategy.

2. Compiler Optimizations

The GCC compiler used by ESP-IDF is a powerful tool with a sophisticated optimization engine. By providing it with specific instructions, or “flags,” you can significantly alter the generated machine code.

Optimization Levels

ESP-IDF allows you to set a global optimization level through menuconfig. The most common levels are:

| Flag | Name | Primary Goal | Best For |
|------|------|--------------|----------|
| `-O0` | No Optimization | Fastest compile time, direct source mapping | Initial development and debugging. Guarantees that what you see in the source code is what gets executed. |
| `-O1` | Basic Optimization | Improve performance without long compile times | Rarely used directly; a step up from `-O0`. |
| `-O2` | Performance Optimization | Execution speed | CPU-bound tasks where performance is critical and a larger binary size is acceptable. |
| `-Os` | Size Optimization | Smallest binary size | Most production firmware. Reduces flash usage and speeds up Over-the-Air (OTA) updates. |
| `-O3` | Aggressive Optimization | Maximum execution speed | Highly computational, specialized applications (e.g., DSP) where every nanosecond counts, at the cost of significantly larger code. |
| `-Og` | Debug Optimization | Balance speed and debuggability | Debugging code that is too slow at `-O0` but too complex to trace at `-O2`. The default in recent ESP-IDF versions. |

Tip: For production builds, -Os (Optimize for size) is almost always the best starting point. Flash space is often more constrained than CPU cycles on the ESP32. For computationally intensive applications where speed is paramount, -O2 or even -O3 may be justified.

How to Change Compiler Optimization Level:
  1. Run idf.py menuconfig in your project’s terminal.
  2. Navigate to Component config —> Compiler options.
  3. Select Optimization Level.
  4. Choose your desired level (e.g., Optimize for size (-Os)).
  5. Save and exit. The project will be recompiled with the new settings.
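The same choice can also be committed to version control so every fresh checkout builds with it, without a manual menuconfig pass. A minimal `sdkconfig.defaults` sketch (these `CONFIG_COMPILER_OPTIMIZATION_*` option names are the ones used by recent ESP-IDF releases; verify them against your IDF version, as Kconfig names occasionally change):

```plaintext
# sdkconfig.defaults -- placed in the project root and committed to git.
# Select exactly one optimization level:
CONFIG_COMPILER_OPTIMIZATION_SIZE=y        # -Os (typical for production)
# CONFIG_COMPILER_OPTIMIZATION_PERF=y      # -O2 (speed)
# CONFIG_COMPILER_OPTIMIZATION_DEFAULT=y   # -Og (debug, the IDF default)
# CONFIG_COMPILER_OPTIMIZATION_NONE=y      # -O0 (no optimization)
```

Delete the generated `sdkconfig` (or run `idf.py fullclean`) after editing the defaults file so the new values are picked up.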

3. Memory Optimization: The “Attribute” System

One of the most powerful optimization techniques on the ESP32 is controlling where your code and data are stored. The processor can access some memory regions much faster than others.

  • DRAM (Data RAM): This is the primary RAM for data (variables, heap, task stacks). It’s fast.
  • IRAM (Instruction RAM): This RAM is exclusively for executing code. It is the fastest memory for code execution because it’s tightly coupled with the CPU core(s).
  • Flash: This is non-volatile memory where your program binary is stored. Code is typically executed from Flash via a cache. A cache miss occurs if the required instruction is not in the cache, forcing a slow read from the physical flash chip. This can introduce unpredictable delays.

sequenceDiagram
    actor CPU
    participant Cache
    participant Flash

    title CPU Instruction Fetch

    rect rgba(219, 234, 254, 0.5)
        note over CPU, Cache: Scenario 1: Cache Hit (Fast Path)
        CPU->>Cache: Request Instruction at Address X
        Cache-->>CPU: Found! Return instruction immediately
    end

    rect rgba(254, 243, 199, 0.5)
        note over CPU, Flash: Scenario 2: Cache Miss (Slow Path)
        CPU->>Cache: Request Instruction at Address Y
        Cache->>CPU: Not in cache! Stall CPU.
        Cache->>Flash: Request memory block containing Address Y
        activate Flash
        Flash-->>Cache: Return entire memory block
        deactivate Flash
        Cache->>Cache: Store new block (evicting old one)
        Cache-->>CPU: Return Instruction Y. Resume CPU.
    end

| Attribute | Target Memory | Applies To | Primary Use Case |
|-----------|---------------|------------|------------------|
| `IRAM_ATTR` | Instruction RAM (IRAM) | Functions | Mandatory for ISRs. Used for time-critical code that must execute as fast as possible, avoiding flash cache misses. |
| `DRAM_ATTR` | Data RAM (DRAM) | Initialized data | Forcing initialized, non-constant data into the main data RAM instead of being mapped from flash. |
| `RTC_DATA_ATTR` | RTC Slow Memory | Data | Preserving variables and application state across deep sleep cycles, as this memory region remains powered on. |
| (Default) | Flash | Functions & constants | The standard location for most code and all `const` data. Accessed via a hardware cache. |

ESP-IDF provides macros (attributes) to tell the linker where to place specific functions or data variables.

Placing Functions in IRAM (IRAM_ATTR)

By default, most of your application code is placed in Flash. To place a function in the faster IRAM, use the IRAM_ATTR macro.

C
void IRAM_ATTR my_fast_function(void) {
    // This code will be placed in IRAM.
}

When to use IRAM_ATTR:

  1. Interrupt Service Routines (ISRs): ISRs should be in IRAM. During flash write and erase operations the flash cache is disabled, so an ISR stored in flash cannot execute at those times; in ESP-IDF, handlers registered with the ESP_INTR_FLAG_IRAM flag are required to live in IRAM for exactly this reason. Placing ISRs in IRAM ensures they execute quickly and deterministically, even while flash is busy.
  2. Time-Critical Code: Any function that has a strict timing requirement and cannot tolerate delays from flash cache misses. Examples include software-based protocol implementations or high-frequency control loops.
  3. Frequently Called Utility Functions: Small, frequently called functions can sometimes benefit from being in IRAM to avoid repeated cache misses.

Warning: IRAM is a very limited resource (e.g., 128KB on the original ESP32). Do not abuse IRAM_ATTR. Placing your entire program in IRAM is not possible and defeats the purpose. Use it only for small, critical sections of code.
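As a concrete sketch of the ISR case, here is how an IRAM-safe GPIO interrupt handler might look in ESP-IDF. The pin number and handler logic are illustrative; the driver calls (`gpio_config`, `gpio_install_isr_service`, `gpio_isr_handler_add`) are the standard API from `driver/gpio.h`:

```c
#include <stdint.h>
#include "driver/gpio.h"
#include "esp_intr_alloc.h"

static volatile uint32_t s_edge_count = 0;  // shared with the main task

// The handler itself lives in IRAM, so it can run even while the
// flash cache is disabled (e.g., during an NVS write).
static void IRAM_ATTR button_isr_handler(void *arg)
{
    s_edge_count++;  // keep ISRs short: no logging, no malloc, no flash access
}

void setup_button_interrupt(void)
{
    gpio_config_t cfg = {
        .pin_bit_mask = 1ULL << GPIO_NUM_0,  // illustrative pin choice
        .mode = GPIO_MODE_INPUT,
        .pull_up_en = GPIO_PULLUP_ENABLE,
        .intr_type = GPIO_INTR_NEGEDGE,
    };
    gpio_config(&cfg);

    // ESP_INTR_FLAG_IRAM tells the driver this handler may be invoked
    // while flash is inaccessible -- which is why it must be IRAM_ATTR.
    gpio_install_isr_service(ESP_INTR_FLAG_IRAM);
    gpio_isr_handler_add(GPIO_NUM_0, button_isr_handler, NULL);
}
```

Note that with `ESP_INTR_FLAG_IRAM`, every function the handler calls must also be IRAM-resident, including any data it touches that would otherwise live in flash.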

Placing Data in DRAM (DRAM_ATTR)

This macro forces data to be placed in DRAM. It’s typically used for data that is initialized at compile time but needs to be modified at runtime.

C
DRAM_ATTR static uint8_t my_buffer[] = {0x01, 0x02, 0x03};

Constants that are never modified are best left in flash, which the compiler does by default for const data.

Placing Data in RTC Memory (RTC_DATA_ATTR)

This special attribute places data in the RTC slow memory. This memory region remains powered on during deep sleep, allowing your device to preserve state without writing to NVS.

C
RTC_DATA_ATTR int boot_count = 0; // This variable will survive a deep sleep cycle.

4. Algorithmic and C-Level Optimization

No amount of compiler or hardware tweaking can fix a fundamentally inefficient algorithm.

  • Data Types: Use the smallest data type that suits your needs. Don’t use a uint32_t if a uint8_t will suffice. This saves RAM and can lead to faster processing.
  • Fixed-Point vs. Floating-Point: Avoid floating-point arithmetic (float, double) in performance-critical code, especially on variants without a hardware Floating-Point Unit (FPU). Integer and fixed-point math are significantly faster. For example, instead of tracking voltage as 3.3 volts, you could track it as 3300 millivolts.
  • Loop Optimization: Unroll small, critical loops manually if the compiler fails to do so. Reduce the complexity of operations inside loops. For example, calculate constant values outside the loop.
  • Division and Modulo: Division and modulo operations are very slow on most microcontrollers. Where possible, replace division by a power of two with a bitwise right shift (e.g., x / 8 becomes x >> 3).
  • Heap Allocation: Avoid malloc() and free() in loops or real-time tasks. Dynamic memory allocation can be slow and lead to heap fragmentation, which can cause your application to fail unpredictably over time. Prefer static allocation or memory pools for objects with a long lifetime.

Practical Examples

Example 1: Measuring Performance Gain with IRAM_ATTR

Let’s write a simple test to observe the performance difference between a function running from Flash and the same function running from IRAM. We will use esp_timer_get_time() to measure the execution time of a computation-heavy loop.

Code
C
#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_timer.h"
#include "esp_log.h"

static const char *TAG = "OPTIMIZATION_TEST";

// A function with some computational work, placed in Flash by default.
void performance_critical_task_in_flash(void) {
    // This loop is intentionally inefficient to make timing easier.
    for (volatile int i = 0; i < 50000; i++) {
        // Volatile prevents the compiler from optimizing the loop away.
    }
}

// The same function, but now placed in IRAM.
void IRAM_ATTR performance_critical_task_in_iram(void) {
    for (volatile int i = 0; i < 50000; i++) {
    }
}

void app_main(void) {
    // --- Test 1: Function in Flash ---
    // We run it once to "warm up" the cache.
    performance_critical_task_in_flash(); 
    
    int64_t start_time_flash = esp_timer_get_time();
    performance_critical_task_in_flash();
    int64_t end_time_flash = esp_timer_get_time();

    ESP_LOGI(TAG, "Time taken (Flash execution): %lld microseconds", end_time_flash - start_time_flash);

    // --- Test 2: Function in IRAM ---
    int64_t start_time_iram = esp_timer_get_time();
    performance_critical_task_in_iram();
    int64_t end_time_iram = esp_timer_get_time();

    ESP_LOGI(TAG, "Time taken (IRAM execution): %lld microseconds", end_time_iram - start_time_iram);

    // --- Test 3: Demonstrate Cache Miss Penalty ---
    // To show the effect of a cache miss, we can invalidate the cache before calling.
    // NOTE: This is an advanced technique for demonstration only.
    // On ESP32, cache is invalidated automatically on many operations. 
    // This test is more illustrative of the concept.
    
    ESP_LOGI(TAG, "Running Flash version again after IRAM call (likely cache miss)");
    start_time_flash = esp_timer_get_time();
    performance_critical_task_in_flash();
    end_time_flash = esp_timer_get_time();
    ESP_LOGI(TAG, "Time taken (Flash, after IRAM call): %lld microseconds", end_time_flash - start_time_flash);
}
Build and Run
  1. Create a new project: idf.py create-project optimization-example
  2. Copy the code above into your main/optimization-example.c.
  3. Connect your ESP32 and run idf.py flash monitor.
Observe

You will see output similar to this (exact numbers will vary based on ESP32 variant and system load):

Plaintext
I (315) OPTIMIZATION_TEST: Time taken (Flash execution): 250 microseconds
I (325) OPTIMIZATION_TEST: Time taken (IRAM execution): 110 microseconds
I (335) OPTIMIZATION_TEST: Running Flash version again after IRAM call (likely cache miss)
I (345) OPTIMIZATION_TEST: Time taken (Flash, after IRAM call): 255 microseconds

The key takeaway is that the IRAM execution is significantly faster and more consistent. The execution from Flash is subject to cache performance.

Variant Notes

The effectiveness and necessity of these techniques can vary across the ESP32 family.

| Feature | ESP32 | ESP32-S2 | ESP32-S3 | ESP32-C3 |
|---------|-------|----------|----------|----------|
| CPU Core(s) | 2x Xtensa LX6 | 1x Xtensa LX7 | 2x Xtensa LX7 | 1x RISC-V |
| Hardware FPU | Yes (Single Precision) | No (Emulated) | Yes (Single Precision) | No (Emulated) |
| AI / DSP Acceleration | No | No | Yes (Vector Instructions) | No |
| Typical IRAM | 128 KB | 128 KB | 32 KB per core | 10 KB |
| Typical DRAM | 320 KB | 320 KB | 384 KB | 400 KB |
| Performance Focus | Balanced Performance | Low Power, Secure I/O | AI/ML & DSP | Low Power, Cost-Effective |

  • CPU and FPU:
    • ESP32: Dual-core Xtensa LX6 with a single-precision hardware FPU. Floating-point math is reasonably fast.
    • ESP32-S2: Single-core Xtensa LX7. No hardware FPU. Floating-point operations are emulated in software and are very slow. Integer/fixed-point math is critical for performance on this chip.
    • ESP32-S3: Dual-core Xtensa LX7 with FPU and additional vector instructions for AI/DSP acceleration. This is the most powerful variant for heavy computation.
    • ESP32-C3: Single-core 32-bit RISC-V core. No hardware FPU. Similar performance constraints to the S2 regarding floating-point math.
    • ESP32-C6 / H2: Single-core 32-bit RISC-V core. These “efficiency” cores are designed for low power, not raw performance. Optimization is key.
  • IRAM/DRAM Size: The amount of available IRAM and DRAM differs between variants. Always check the datasheet for your specific chip. What fits on an ESP32-S3 might not fit on an ESP32-C3.
  • Cache: The size and architecture of the instruction and data caches vary. Newer chips like the ESP32-S3 have more advanced cache configurations, which can improve performance when running code from flash compared to the original ESP32.

Common Mistakes & Troubleshooting Tips

| Mistake / Issue | Symptom(s) | Troubleshooting / Solution |
|-----------------|------------|----------------------------|
| Linker error: "IRAM segment is full" | The project fails to build/link. The error message explicitly mentions IRAM overflow. | You have used `IRAM_ATTR` on too many functions. Only apply it to the most critical functions (ISRs, high-frequency loops). Profile your code to find true bottlenecks. |
| Guru Meditation Error (LoadProhibited) | The device crashes and reboots, showing a backtrace with a "LoadProhibited" or "StoreProhibited" error. | Often caused by an ISR calling code or data in flash while the flash cache is disabled. Ensure the ISR and any function it calls are marked with `IRAM_ATTR`. |
| Benchmark code gives impossibly fast results | Your timing test reports execution times of 0 or a few microseconds for a complex calculation. | The compiler optimized your test away because the result was unused. Declare loop counters and result variables as `volatile` to force the compiler to execute the code. |
| Real-time task misses deadlines | Jittery motor control, dropped sensor readings, or garbled communication-protocol data. | A high-priority task is being blocked. A common cause is using `printf` or `ESP_LOGI` inside a critical loop or ISR. Remove logging from performance-critical sections; use GPIO toggling and an oscilloscope for high-speed timing. |
| Heap fragmentation / out of memory over time | Application runs fine initially but crashes randomly after hours or days. `heap_caps_get_free_size` shows memory decreasing. | Repeatedly using `malloc()`/`free()` in a long-running task. Statically allocate buffers where possible, use memory pools for objects with a defined lifetime, and avoid dynamic allocation in loops. |

Exercises

  1. Fixed-Point Conversion: Take a function that uses float variables to perform a simple calculation (e.g., converting temperature from Celsius to Fahrenheit). Rewrite it using integer math (fixed-point) by working in thousandths of a degree. Benchmark both versions on your ESP32 variant and compare the results.
  2. Compiler Flag Impact: Take the IRAM_ATTR example project. Compile and flash it using the -O0 (No optimization), -Os (Size), and -O2 (Speed) optimization levels. Record both the final binary size (idf.py size) and the execution time for each level. Create a table to summarize your findings.
  3. Identify and Optimize: Find a piece of code in one of your existing projects that runs inside a loop. Analyze it for potential optimizations: can you move a calculation outside the loop? Can you replace a slow function with a faster one? Implement the change and measure the result.
  4. Stack vs. Heap: Write a task that creates a large array (~1KB). First, declare it as a local variable on the stack. Use uxTaskGetStackHighWaterMark() to see the stack usage. Second, refactor the code to allocate the array on the heap using malloc(). Discuss the pros and cons of each approach in your code comments.
  5. Explore menuconfig: Go through the menuconfig -> Component config -> Compiler options menu. Research what the “Enable C++ exceptions” and “Stack smashing protection” options do and in what scenarios you might enable or disable them.

Summary

  • Optimization is essential for creating robust, efficient, and feature-rich embedded applications on the ESP32.
  • Measure first. Use profiling tools to find CPU or I/O bottlenecks before attempting to optimize.
  • Leverage compiler optimizations. Use -Os for most production applications to balance speed and size. Use menuconfig to easily configure this.
  • Use memory attributes strategically. Place ISRs and time-critical functions in IRAM using IRAM_ATTR. Be mindful that IRAM is a scarce resource.
  • Write efficient C code. Choose appropriate data types, prefer integer/fixed-point math over floating-point (especially on chips without an FPU), and manage memory carefully.
  • Be aware of variant differences. Performance characteristics, especially concerning the FPU and memory sizes, vary significantly across the ESP32 family.

Further Reading

  • ESP-IDF Programming Guide: Performance Optimization: The official documentation from Espressif is the best place to start.
  • GCC Optimization Options: A deep dive into all the compiler flags available in GCC.
  • Agner Fog’s Software Optimization Resources: An excellent and detailed resource on writing fast C++ code, with principles that apply directly to C on microcontrollers.
