Chapter 282: Profiling and Benchmarking ESP32 Applications

Chapter Objectives

By the end of this chapter, you will be able to:

  • Differentiate between benchmarking and profiling and know when to use each.
  • Use the esp_timer API for high-resolution, manual benchmarking of code segments.
  • Understand the architecture and setup of ESP-IDF’s Application Level Tracing feature.
  • Use GDB to retrieve and analyze trace data to understand program flow.
  • Set up and use SEGGER SystemView for advanced, graphical system-level profiling.
  • Identify performance bottlenecks and task behavior from profiling data.
  • Recognize variant-specific considerations for profiling tools.

Introduction

In the previous chapter, we explored how to optimize code. We discussed techniques like using IRAM, choosing compiler flags, and writing efficient C. However, a crucial question remains: what do we optimize? Applying optimizations blindly is a waste of development time and can even make code harder to maintain. The first rule of optimization is: don’t guess, measure.

This chapter introduces the tools and techniques for measuring your application’s performance. We will cover benchmarking, which is the precise timing of small, specific code blocks, and profiling, which gives a high-level view of the entire system’s dynamic behavior. By mastering these measurement techniques, you can pinpoint the exact locations of performance bottlenecks, verify the impact of your optimizations, and build truly efficient and reliable ESP32 applications.

Theory

1. Benchmarking vs. Profiling: Two Sides of the Same Coin

Though often used interchangeably, benchmarking and profiling are distinct processes that answer different questions.

  • Benchmarking is micro-analysis. It answers the question: “How long does this specific piece of code take to run?” It’s like using a stopwatch to time a single runner. You use benchmarking to measure the execution time of a particular function, algorithm, or I/O operation. It is ideal for comparing two different implementations of the same logic (e.g., a floating-point vs. a fixed-point calculation).
  • Profiling is macro-analysis. It answers the question: “How does my entire system behave over time?” It’s like watching a video of the entire race, showing how all runners interact, who is leading at different times, and who is idle. Profiling helps you understand which tasks are consuming the most CPU time, how often interrupts are firing, and when resources are being contended. It is the primary tool for finding system-level bottlenecks.
Aspect Benchmarking (Micro-Analysis) Profiling (Macro-Analysis)
Question Answered “How long does this specific function take to run?” “How does my entire system behave over time?”
Scope Narrow. Measures a single function, algorithm, or operation. Broad. Observes all tasks, interrupts, and resource interactions.
Analogy Using a stopwatch to time one specific part of a race. Watching a video replay of the entire race from multiple angles.
Primary Tool(s) Manual timing with esp_timer_get_time(). SEGGER SystemView, Application Level Tracing (apptrace).
Use Case Comparing two algorithms (e.g., float vs. integer math), verifying an optimization’s impact. Finding system-level bottlenecks, diagnosing task starvation, deadlocks, or priority inversion.

An effective workflow often involves both:

  1. Profile the system to identify a task or function that is consuming excessive resources.
  2. Benchmark specific parts of that function to pinpoint the slow sections.
  3. Apply an optimization.
  4. Benchmark again to confirm the improvement.

2. Manual Benchmarking with esp_timer

The simplest way to benchmark is to record the time before and after a code segment executes. ESP-IDF provides a high-resolution wall clock timer, esp_timer, perfect for this task.

The key function is esp_timer_get_time(), which returns the number of microseconds since the esp_timer was initialized (typically at boot).

C
#include "esp_timer.h"

int64_t start_time = esp_timer_get_time();

// ... code to be benchmarked ...

int64_t end_time = esp_timer_get_time();
int64_t execution_time = end_time - start_time;

printf("The operation took %lld microseconds.\n", execution_time);

Tip: When benchmarking code that runs from flash, it’s good practice to run the function once before starting the timer. This “warms up” the flash cache, ensuring that your measurement reflects the function’s steady-state performance rather than including the initial cache-miss penalty.

3. Application Level Tracing

For more advanced profiling, ESP-IDF provides a built-in Application Level Tracing library (apptrace). This powerful feature allows you to log events from your firmware with minimal overhead and send them to a host PC for analysis without interfering with your main application logic (e.g., through a UART printf).

It works by using a dedicated JTAG interface. Your code writes trace messages to a small block of RAM (the trace buffer). When the buffer is full, the application is briefly halted, and a host-side debugger (OpenOCD, managed by the VS Code extension) reads the data from the buffer via JTAG and saves it to a file.

sequenceDiagram
    participant App as ESP32 Application
    participant Buffer as Trace Buffer (RAM)
    participant CPU
    participant JTAG as JTAG Probe
    participant Host as Host PC (OpenOCD)

    App->>Buffer: Writes trace event data
    Note right of Buffer: Buffer fills up...
    
    critical Buffer is Full
        CPU-->>CPU: Halt Execution
        Host->>JTAG: Request trace data
        JTAG->>Buffer: Read entire buffer
        Buffer-->>JTAG: Send data
        JTAG-->>Host: Transfer data to file
        CPU-->>CPU: Resume Execution
    end
    
    App->>Buffer: Continue writing trace events...

This mechanism is extremely useful for understanding the sequence and timing of events in complex, multi-task applications.

4. System-Level Profiling with SEGGER SystemView

While apptrace is good for logging custom events, SEGGER SystemView provides a complete, graphical visualization of the system’s runtime state. It hooks directly into the FreeRTOS kernel to record events like:

  • Task starts, stops, and switches.
  • Queue sends and receives.
  • Semaphore/Mutex takes and gives.
  • ISR entry and exit.
  • Kernel timer events.

The result is a detailed timeline that can be viewed in the SystemView desktop application. This is arguably the most powerful tool for diagnosing complex real-time issues, such as priority inversion, task starvation, or deadlocks.

ESP-IDF has built-in support for SystemView, which works similarly to apptrace by sending the trace data over JTAG to the host.

Practical Examples

Example 1: Benchmarking Integer vs. Floating-Point Math

This example benchmarks two functions to concretely demonstrate the performance difference between floating-point and integer arithmetic, especially on a chip without an FPU.

Code
C
#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_timer.h"
#include "esp_log.h"

static const char *TAG = "BENCHMARK";

#define ITERATIONS 10000

// --- Floating-Point Calculation ---
float floating_point_calc(float input) {
    float result = input;
    for (int i = 0; i < ITERATIONS; i++) {
        result = (result / 1.001f) + 0.5f;
    }
    return result;
}

// --- Integer (Fixed-Point) Calculation ---
// We simulate the same logic using integers scaled by 1000
int32_t integer_calc(int32_t input_scaled) {
    int32_t result_scaled = input_scaled;
    for (int i = 0; i < ITERATIONS; i++) {
        // Equivalent to (result / 1.001) + 0.5
        result_scaled = (result_scaled * 1000) / 1001;
        result_scaled += 500;
    }
    return result_scaled;
}

void app_main(void) {
    vTaskDelay(pdMS_TO_TICKS(1000)); // Wait a moment for system to stabilize

    ESP_LOGI(TAG, "Starting benchmarks...");

    // --- Benchmark Floating-Point ---
    int64_t start_fp = esp_timer_get_time();
    floating_point_calc(1000.0f);
    int64_t end_fp = esp_timer_get_time();
    ESP_LOGI(TAG, "Floating-point calculation took: %lld microseconds", end_fp - start_fp);

    // --- Benchmark Integer ---
    int64_t start_int = esp_timer_get_time();
    integer_calc(1000 * 1000);
    int64_t end_int = esp_timer_get_time();
    ESP_LOGI(TAG, "Integer calculation took: %lld microseconds", end_int - start_int);
}
Build and Run
  1. Create a new project.
  2. Copy the code into your main file.
  3. Flash and monitor: idf.py flash monitor.
Observe
  • On an ESP32-S3 (with FPU): The times will be relatively close, with the integer math possibly being slightly faster.
  • On an ESP32-C3 (no FPU): You will see a dramatic difference. The integer calculation will be an order of magnitude faster than the floating-point version, which must be emulated in software. This demonstrates the importance of choosing the right algorithm for your hardware.

Example 2: Setting up SystemView for Task Analysis

This example requires more setup but provides invaluable insight. We will create two tasks and watch their interaction in SystemView.

Pre-requisites
  1. A JTAG debugger (e.g., ESP-PROG, or a board with built-in JTAG like ESP-WROVER-KIT).
  2. SEGGER SystemView software installed on your PC (available for free from SEGGER’s website).
Setup Steps
  1. Configure the project in menuconfig:
    • idf.py menuconfig
    • Navigate to Component config —> Application Level Tracing.
    • Enable Trace Memory.
    • Set Trace memory size (e.g., 16384 bytes).
    • Go into FreeRTOS options.
    • Enable Enable SystemView tracing.
    • Enable Enable SEGGER SystemView module.
    • Save and exit.
  2. Add the code:
C
#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/queue.h"
#include "esp_log.h"
#include "SEGGER_SYSVIEW.h"

static const char* TAG1 = "SENDER";
static const char* TAG2 = "RECEIVER";
static QueueHandle_t data_queue;

void sender_task(void *pvParameters) {
    int count = 0;
    while(1) {
        ESP_LOGI(TAG1, "Sending value: %d", count);
        // SystemView will automatically trace the xQueueSend call
        xQueueSend(data_queue, &count, portMAX_DELAY);
        count++;
        // Give receiver task time to run
        vTaskDelay(pdMS_TO_TICKS(1000)); 
    }
}

void receiver_task(void *pvParameters) {
    int received_value;
    while(1) {
        // SystemView will automatically trace the xQueueReceive call
        if (xQueueReceive(data_queue, &received_value, portMAX_DELAY)) {
            ESP_LOGI(TAG2, "Received value: %d", received_value);
            // Simulate some work
            vTaskDelay(pdMS_TO_TICKS(200)); 
        }
    }
}

void app_main(void) {
    data_queue = xQueueCreate(5, sizeof(int));
    // NOTE: For this example, we start SystemView recording immediately.
    // In a real app, you might start/stop it around specific operations.
    SEGGER_SYSVIEW_Conf();
    SEGGER_SYSVIEW_Start();

    xTaskCreate(sender_task, "SenderTask", 2048, NULL, 5, NULL);
    xTaskCreate(receiver_task, "ReceiverTask", 2048, NULL, 5, NULL);
}

Build, Flash, and Profile
  1. Build and flash the code normally: idf.py build flash.
  2. Start the profiling session from VS Code. Go to the ESP-IDF extension icon in the sidebar.
  3. Under the SystemView Tracing section, click Start. This will launch OpenOCD and the SystemView application.
  4. Let the trace run for a few seconds, then click Stop in the SystemView application.
Observe

In the SystemView timeline, you will see two tasks: SenderTask and ReceiverTask. You will see the SenderTask run, post to the queue (which will be marked), and then block. The ReceiverTask will then unblock, run, process the data, and block again. This visual confirmation is essential for debugging complex task interactions.

Variant Notes

  • JTAG Requirement: Both apptrace and SystemView tracing require a JTAG connection. While all ESP32 variants support JTAG, its availability differs.
    • ESP32-WROVER-KIT: Includes an onboard JTAG debugger, making it very easy to get started.
    • ESP32-S3-DevKitC: Has built-in USB-JTAG. You can use a single USB cable for both flashing and JTAG debugging.
    • Other DevKits (ESP32, C3, C6, etc.): Typically require an external JTAG probe like the ESP-PROG. You must wire the JTAG pins (TDI, TDO, TCK, TMS) from the probe to the ESP32 chip.
Board / Variant JTAG Method Required Hardware
ESP32-S3-DevKitC (and other S3 boards) Built-in USB-JTAG A single USB-C cable connected to the “USB” port. No extra hardware needed.
ESP32-WROVER-KIT On-board Debugger A single Micro-USB cable. The board includes the FT2232HL chip that provides the JTAG interface.
Most other DevKits (ESP32, ESP32-C3, etc.) External JTAG Probe An external debugger (e.g., ESP-PROG, J-Link) and jumper wires to connect the TCK, TMS, TDI, TDO, and GND pins.
  • Performance Overhead: Tracing is not free. It adds a small amount of CPU and memory overhead. On single-core chips (S2, C3, C6, H2), this overhead can be more noticeable than on dual-core chips (ESP32, S3), as the tracing mechanism competes for the same core resources as your application.

Common Mistakes & Troubleshooting Tips

Mistake / Issue Symptom(s) Troubleshooting / Solution
SystemView fails to connect or shows “ERROR: libusb_open() failed” The SystemView application opens but shows no data, and the VS Code terminal shows JTAG or USB errors. JTAG is not configured correctly. Solution: Ensure you have selected the correct launch configuration in VS Code for your board (e.g., “ESP32-S3-DevKitC-1 (JTAG)”). Verify the USB driver (WinUSB/libusbK) is installed correctly.
SystemView trace shows lots of “Overflow” events The trace timeline has red “Overflow” markers, and some task events are missing. The system is generating events faster than they can be read by the host. Solution: Increase the trace buffer size in menuconfig under Component config -> Application Level Tracing -> Trace memory size.
Application is much slower when profiling The device behaves sluggishly or misses deadlines only when SystemView tracing is enabled. The profiling itself is consuming too much CPU time. This is more common on single-core chips. Solution: Try reducing the number of events being traced (if possible) or accept the overhead for debugging. Remember to disable tracing for production performance tests.
The IDLE task uses >90% of CPU time In SystemView’s analysis window, the IDLE task appears to be the busiest task. This is not a mistake; it’s a good sign! Solution: This means your application tasks are efficient and the CPU has plenty of free time. A problem is when the IDLE task has 0% CPU time, which means your CPU is completely saturated.

Exercises

  1. Benchmark I/O: Write a small application that writes a 1KB block of data to the NVS (Non-Volatile Storage). Use esp_timer to benchmark how long the nvs_set_blob and nvs_commit functions take to execute.
  2. Profile a Web Server: Take the simple HTTP server example from the ESP-IDF examples directory. Add SystemView tracing. Make several requests to the server from your web browser and analyze the trace. Identify which tasks are created and how much time is spent in the network stack vs. your request handler.
  3. Find the Inefficiency: The code below sorts an array using a notoriously inefficient algorithm. Use the esp_timer benchmarking method to measure its performance. Then, replace it with a more efficient algorithm (e.g., you can find a simple qsort implementation online or use the C standard library one) and measure the improvement:
C
// Inefficient bubble sort
void bubble_sort(int arr[], int n) {
    int i, j;
    for (i = 0; i < n - 1; i++) {
        for (j = 0; j < n - i - 1; j++) {
            if (arr[j] > arr[j + 1]) {
                // Swap
                int temp = arr[j];
                arr[j] = arr[j + 1];
                arr[j + 1] = temp;
            }
        }
    }
}

Summary

  • Benchmarking measures the speed of specific code blocks and is done using esp_timer.
  • Profiling analyzes the whole system’s behavior over time and is best done with tools like SEGGER SystemView.
  • The golden rule is measure, don’t guess. Use these tools to find real bottlenecks before trying to optimize.
  • SystemView provides an unparalleled graphical view of FreeRTOS task interactions, making it invaluable for debugging complex real-time systems.
  • Effective profiling on the ESP32 requires a JTAG debugger, which is built-in on some development kits (like ESP32-S3) and requires an external probe for others.

Further Reading

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top