Chapter 282: Profiling and Benchmarking ESP32 Applications
Chapter Objectives
By the end of this chapter, you will be able to:
- Differentiate between benchmarking and profiling and know when to use each.
- Use the
esp_timer
API for high-resolution, manual benchmarking of code segments. - Understand the architecture and setup of ESP-IDF’s Application Level Tracing feature.
- Use GDB to retrieve and analyze trace data to understand program flow.
- Set up and use SEGGER SystemView for advanced, graphical system-level profiling.
- Identify performance bottlenecks and task behavior from profiling data.
- Recognize variant-specific considerations for profiling tools.
Introduction
In the previous chapter, we explored how to optimize code. We discussed techniques like using IRAM, choosing compiler flags, and writing efficient C. However, a crucial question remains: what do we optimize? Applying optimizations blindly is a waste of development time and can even make code harder to maintain. The first rule of optimization is: don’t guess, measure.
This chapter introduces the tools and techniques for measuring your application’s performance. We will cover benchmarking, which is the precise timing of small, specific code blocks, and profiling, which gives a high-level view of the entire system’s dynamic behavior. By mastering these measurement techniques, you can pinpoint the exact locations of performance bottlenecks, verify the impact of your optimizations, and build truly efficient and reliable ESP32 applications.
Theory
1. Benchmarking vs. Profiling: Two Sides of the Same Coin
Though often used interchangeably, benchmarking and profiling are distinct processes that answer different questions.
- Benchmarking is micro-analysis. It answers the question: “How long does this specific piece of code take to run?” It’s like using a stopwatch to time a single runner. You use benchmarking to measure the execution time of a particular function, algorithm, or I/O operation. It is ideal for comparing two different implementations of the same logic (e.g., a floating-point vs. a fixed-point calculation).
- Profiling is macro-analysis. It answers the question: “How does my entire system behave over time?” It’s like watching a video of the entire race, showing how all runners interact, who is leading at different times, and who is idle. Profiling helps you understand which tasks are consuming the most CPU time, how often interrupts are firing, and when resources are being contended. It is the primary tool for finding system-level bottlenecks.
Aspect | Benchmarking (Micro-Analysis) | Profiling (Macro-Analysis) |
---|---|---|
Question Answered | “How long does this specific function take to run?” | “How does my entire system behave over time?” |
Scope | Narrow. Measures a single function, algorithm, or operation. | Broad. Observes all tasks, interrupts, and resource interactions. |
Analogy | Using a stopwatch to time one specific part of a race. | Watching a video replay of the entire race from multiple angles. |
Primary Tool(s) | Manual timing with esp_timer_get_time(). | SEGGER SystemView, Application Level Tracing (apptrace). |
Use Case | Comparing two algorithms (e.g., float vs. integer math), verifying an optimization’s impact. | Finding system-level bottlenecks, diagnosing task starvation, deadlocks, or priority inversion. |
An effective workflow often involves both:
- Profile the system to identify a task or function that is consuming excessive resources.
- Benchmark specific parts of that function to pinpoint the slow sections.
- Apply an optimization.
- Benchmark again to confirm the improvement.
2. Manual Benchmarking with esp_timer
The simplest way to benchmark is to record the time before and after a code segment executes. ESP-IDF provides a high-resolution wall clock timer, esp_timer
, perfect for this task.
The key function is esp_timer_get_time()
, which returns the number of microseconds since the esp_timer
was initialized (typically at boot).
#include "esp_timer.h"
int64_t start_time = esp_timer_get_time();
// ... code to be benchmarked ...
int64_t end_time = esp_timer_get_time();
int64_t execution_time = end_time - start_time;
printf("The operation took %lld microseconds.\n", execution_time);
Tip: When benchmarking code that runs from flash, it’s good practice to run the function once before starting the timer. This “warms up” the flash cache, ensuring that your measurement reflects the function’s steady-state performance rather than including the initial cache-miss penalty.
3. Application Level Tracing
For more advanced profiling, ESP-IDF provides a built-in Application Level Tracing library (apptrace
). This powerful feature allows you to log events from your firmware with minimal overhead and send them to a host PC for analysis without interfering with your main application logic (e.g., through a UART printf
).
It works by using a dedicated JTAG interface. Your code writes trace messages to a small block of RAM (the trace buffer). When the buffer is full, the application is briefly halted, and a host-side debugger (OpenOCD, managed by the VS Code extension) reads the data from the buffer via JTAG and saves it to a file.
sequenceDiagram participant App as ESP32 Application participant Buffer as Trace Buffer (RAM) participant CPU participant JTAG as JTAG Probe participant Host as Host PC (OpenOCD) App->>Buffer: Writes trace event data Note right of Buffer: Buffer fills up... critical Buffer is Full CPU-->>CPU: Halt Execution Host->>JTAG: Request trace data JTAG->>Buffer: Read entire buffer Buffer-->>JTAG: Send data JTAG-->>Host: Transfer data to file CPU-->>CPU: Resume Execution end App->>Buffer: Continue writing trace events...
This mechanism is extremely useful for understanding the sequence and timing of events in complex, multi-task applications.
4. System-Level Profiling with SEGGER SystemView
While apptrace
is good for logging custom events, SEGGER SystemView provides a complete, graphical visualization of the system’s runtime state. It hooks directly into the FreeRTOS kernel to record events like:
- Task starts, stops, and switches.
- Queue sends and receives.
- Semaphore/Mutex takes and gives.
- ISR entry and exit.
- Kernel timer events.
The result is a detailed timeline that can be viewed in the SystemView desktop application. This is arguably the most powerful tool for diagnosing complex real-time issues, such as priority inversion, task starvation, or deadlocks.
ESP-IDF has built-in support for SystemView, which works similarly to apptrace
by sending the trace data over JTAG to the host.
Practical Examples
Example 1: Benchmarking Integer vs. Floating-Point Math
This example benchmarks two functions to concretely demonstrate the performance difference between floating-point and integer arithmetic, especially on a chip without an FPU.
Code
#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_timer.h"
#include "esp_log.h"
static const char *TAG = "BENCHMARK";
#define ITERATIONS 10000
// --- Floating-Point Calculation ---
float floating_point_calc(float input) {
float result = input;
for (int i = 0; i < ITERATIONS; i++) {
result = (result / 1.001f) + 0.5f;
}
return result;
}
// --- Integer (Fixed-Point) Calculation ---
// We simulate the same logic using integers scaled by 1000
int32_t integer_calc(int32_t input_scaled) {
int32_t result_scaled = input_scaled;
for (int i = 0; i < ITERATIONS; i++) {
// Equivalent to (result / 1.001) + 0.5
result_scaled = (result_scaled * 1000) / 1001;
result_scaled += 500;
}
return result_scaled;
}
void app_main(void) {
vTaskDelay(pdMS_TO_TICKS(1000)); // Wait a moment for system to stabilize
ESP_LOGI(TAG, "Starting benchmarks...");
// --- Benchmark Floating-Point ---
int64_t start_fp = esp_timer_get_time();
floating_point_calc(1000.0f);
int64_t end_fp = esp_timer_get_time();
ESP_LOGI(TAG, "Floating-point calculation took: %lld microseconds", end_fp - start_fp);
// --- Benchmark Integer ---
int64_t start_int = esp_timer_get_time();
integer_calc(1000 * 1000);
int64_t end_int = esp_timer_get_time();
ESP_LOGI(TAG, "Integer calculation took: %lld microseconds", end_int - start_int);
}
Build and Run
- Create a new project.
- Copy the code into your
main
file. - Flash and monitor:
idf.py flash monitor
.
Observe
- On an ESP32-S3 (with FPU): The times will be relatively close, with the integer math possibly being slightly faster.
- On an ESP32-C3 (no FPU): You will see a dramatic difference. The integer calculation will be an order of magnitude faster than the floating-point version, which must be emulated in software. This demonstrates the importance of choosing the right algorithm for your hardware.
Example 2: Setting up SystemView for Task Analysis
This example requires more setup but provides invaluable insight. We will create two tasks and watch their interaction in SystemView.
Pre-requisites
- A JTAG debugger (e.g., ESP-PROG, or a board with built-in JTAG like ESP-WROVER-KIT).
- SEGGER SystemView software installed on your PC (available for free from SEGGER’s website).
Setup Steps
- Configure the project in
menuconfig
:idf.py menuconfig
- Navigate to
Component config
—>Application Level Tracing
. - Enable
Trace Memory
. - Set
Trace memory size
(e.g., 16384 bytes). - Go into
FreeRTOS
options. - Enable
Enable SystemView tracing
. - Enable
Enable SEGGER SystemView module
. - Save and exit.
- Add the code:
#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/queue.h"
#include "esp_log.h"
#include "SEGGER_SYSVIEW.h"
static const char* TAG1 = "SENDER";
static const char* TAG2 = "RECEIVER";
static QueueHandle_t data_queue;
void sender_task(void *pvParameters) {
int count = 0;
while(1) {
ESP_LOGI(TAG1, "Sending value: %d", count);
// SystemView will automatically trace the xQueueSend call
xQueueSend(data_queue, &count, portMAX_DELAY);
count++;
// Give receiver task time to run
vTaskDelay(pdMS_TO_TICKS(1000));
}
}
void receiver_task(void *pvParameters) {
int received_value;
while(1) {
// SystemView will automatically trace the xQueueReceive call
if (xQueueReceive(data_queue, &received_value, portMAX_DELAY)) {
ESP_LOGI(TAG2, "Received value: %d", received_value);
// Simulate some work
vTaskDelay(pdMS_TO_TICKS(200));
}
}
}
void app_main(void) {
data_queue = xQueueCreate(5, sizeof(int));
// NOTE: For this example, we start SystemView recording immediately.
// In a real app, you might start/stop it around specific operations.
SEGGER_SYSVIEW_Conf();
SEGGER_SYSVIEW_Start();
xTaskCreate(sender_task, "SenderTask", 2048, NULL, 5, NULL);
xTaskCreate(receiver_task, "ReceiverTask", 2048, NULL, 5, NULL);
}
Build, Flash, and Profile
- Build and flash the code normally:
idf.py build flash
. - Start the profiling session from VS Code. Go to the
ESP-IDF
extension icon in the sidebar. - Under the
SystemView Tracing
section, clickStart
. This will launch OpenOCD and the SystemView application. - Let the trace run for a few seconds, then click
Stop
in the SystemView application.
Observe
In the SystemView timeline, you will see two tasks: SenderTask
and ReceiverTask
. You will see the SenderTask
run, post to the queue (which will be marked), and then block. The ReceiverTask
will then unblock, run, process the data, and block again. This visual confirmation is essential for debugging complex task interactions.
Variant Notes
- JTAG Requirement: Both
apptrace
and SystemView tracing require a JTAG connection. While all ESP32 variants support JTAG, its availability differs.- ESP32-WROVER-KIT: Includes an onboard JTAG debugger, making it very easy to get started.
- ESP32-S3-DevKitC: Has built-in USB-JTAG. You can use a single USB cable for both flashing and JTAG debugging.
- Other DevKits (ESP32, C3, C6, etc.): Typically require an external JTAG probe like the ESP-PROG. You must wire the JTAG pins (TDI, TDO, TCK, TMS) from the probe to the ESP32 chip.
Board / Variant | JTAG Method | Required Hardware |
---|---|---|
ESP32-S3-DevKitC (and other S3 boards) | Built-in USB-JTAG | A single USB-C cable connected to the “USB” port. No extra hardware needed. |
ESP32-WROVER-KIT | On-board Debugger | A single Micro-USB cable. The board includes the FT2232HL chip that provides the JTAG interface. |
Most other DevKits (ESP32, ESP32-C3, etc.) | External JTAG Probe | An external debugger (e.g., ESP-PROG, J-Link) and jumper wires to connect the TCK, TMS, TDI, TDO, and GND pins. |
- Performance Overhead: Tracing is not free. It adds a small amount of CPU and memory overhead. On single-core chips (S2, C3, C6, H2), this overhead can be more noticeable than on dual-core chips (ESP32, S3), as the tracing mechanism competes for the same core resources as your application.
Common Mistakes & Troubleshooting Tips
Mistake / Issue | Symptom(s) | Troubleshooting / Solution |
---|---|---|
SystemView fails to connect or shows “ERROR: libusb_open() failed” | The SystemView application opens but shows no data, and the VS Code terminal shows JTAG or USB errors. | JTAG is not configured correctly. Solution: Ensure you have selected the correct launch configuration in VS Code for your board (e.g., “ESP32-S3-DevKitC-1 (JTAG)”). Verify the USB driver (WinUSB/libusbK) is installed correctly. |
SystemView trace shows lots of “Overflow” events | The trace timeline has red “Overflow” markers, and some task events are missing. | The system is generating events faster than they can be read by the host. Solution: Increase the trace buffer size in menuconfig under Component config -> Application Level Tracing -> Trace memory size. |
Application is much slower when profiling | The device behaves sluggishly or misses deadlines only when SystemView tracing is enabled. | The profiling itself is consuming too much CPU time. This is more common on single-core chips. Solution: Try reducing the number of events being traced (if possible) or accept the overhead for debugging. Remember to disable tracing for production performance tests. |
The IDLE task uses >90% of CPU time | In SystemView’s analysis window, the IDLE task appears to be the busiest task. | This is not a mistake; it’s a good sign! Solution: This means your application tasks are efficient and the CPU has plenty of free time. A problem is when the IDLE task has 0% CPU time, which means your CPU is completely saturated. |
Exercises
- Benchmark I/O: Write a small application that writes a 1KB block of data to the NVS (Non-Volatile Storage). Use
esp_timer
to benchmark how long thenvs_set_blob
andnvs_commit
functions take to execute. - Profile a Web Server: Take the simple HTTP server example from the ESP-IDF examples directory. Add SystemView tracing. Make several requests to the server from your web browser and analyze the trace. Identify which tasks are created and how much time is spent in the network stack vs. your request handler.
- Find the Inefficiency: The code below sorts an array using a notoriously inefficient algorithm. Use the
esp_timer
benchmarking method to measure its performance. Then, replace it with a more efficient algorithm (e.g., you can find a simple qsort implementation online or use the C standard library one) and measure the improvement:
// Inefficient bubble sort
void bubble_sort(int arr[], int n) {
int i, j;
for (i = 0; i < n - 1; i++) {
for (j = 0; j < n - i - 1; j++) {
if (arr[j] > arr[j + 1]) {
// Swap
int temp = arr[j];
arr[j] = arr[j + 1];
arr[j + 1] = temp;
}
}
}
}
Summary
- Benchmarking measures the speed of specific code blocks and is done using
esp_timer
. - Profiling analyzes the whole system’s behavior over time and is best done with tools like SEGGER SystemView.
- The golden rule is measure, don’t guess. Use these tools to find real bottlenecks before trying to optimize.
- SystemView provides an unparalleled graphical view of FreeRTOS task interactions, making it invaluable for debugging complex real-time systems.
- Effective profiling on the ESP32 requires a JTAG debugger, which is built-in on some development kits (like ESP32-S3) and requires an external probe for others.
Further Reading
- ESP-IDF Documentation: Application Level Tracing:
- ESP-IDF Documentation: SystemView Tracing:
- SEGGER SystemView Official Page: