Chapter 293: Remote Device Monitoring

Chapter Objectives

By the end of this chapter, you will be able to:

  • Explain the role of remote monitoring in managing a fleet of IoT devices.
  • Differentiate between the key monitoring concepts: telemetry, logging, and diagnostics.
  • Use ESP-IDF APIs to gather critical system health metrics like heap memory usage, Wi-Fi signal strength, and CPU utilization.
  • Implement a system to periodically publish device health metrics (telemetry) over MQTT.
  • Build a controllable, on-demand remote logging system that redirects ESP_LOG output over the network.
  • Understand the performance, cost, and security considerations associated with remote monitoring.

Introduction

Your device has been provisioned, commissioned, and is now operating in the field. From your perspective as a developer, it has become a “black box.” What is it doing right now? Is it healthy? Is it struggling to maintain a Wi-Fi connection? Is its memory running low? Without a window into its real-time operation, you are flying blind. An issue that could be detected and fixed pre-emptively might instead escalate into a total device failure, leading to a poor user experience and costly support incidents.

This is where remote monitoring comes in. It is the practice of systematically collecting data from deployed devices to provide a clear, real-time view of your entire fleet’s health and performance. It allows you to move from a reactive support model (“a customer called to say their device is broken”) to a proactive one (“we detected that 5% of devices in a certain region have poor connectivity and can push an update to fix it”).

In this chapter, we will learn how to build the essential mechanisms for remote monitoring, turning our black boxes into transparent, manageable assets.

Theory

Effective remote monitoring isn’t about collecting as much data as possible; it’s about collecting the right data in the right way. We can categorize this data into three pillars.

%%{ init: { 'theme': 'base', 'themeVariables': { 'fontFamily': 'Open Sans' } } }%%
graph TD
    subgraph "Remote Device Monitoring"
        direction LR
        A[<b>Telemetry</b><br><i>Metrics</i>]
        B[<b>Logging</b><br><i>Events</i>]
        C[<b>Diagnostics</b><br><i>Commands</i>]
    end

    subgraph "Characteristics"
        direction LR
        A_Desc["-Lightweight & Numerical<br>- Periodic (e.g., every 5 mins)<br>- For Dashboards & Alerts<br>- <i>'What is the device's health?'</i>"]
        B_Desc["-Verbose & Textual<br>- Event-Driven (e.g., on error)<br>- For Deep-Dive Debugging<br>- <i>'Why did this event happen?'</i>"]
        C_Desc["-Active & On-Demand<br>- State Queries & Actions<br>- For Interactive Troubleshooting<br>- <i>'Can you run this test now?'</i>"]
    end

    A --> A_Desc
    B --> B_Desc
    C --> C_Desc

    classDef telemetry fill:#DBEAFE,stroke:#2563EB,stroke-width:2px,color:#1E40AF;
    classDef logging fill:#FEF3C7,stroke:#D97706,stroke-width:2px,color:#92400E;
    classDef diagnostics fill:#FEE2E2,stroke:#DC2626,stroke-width:2px,color:#991B1B;
    classDef desc fill:#F9FAFB,stroke:#D1D5DB,color:#374151;

    class A telemetry;
    class B logging;
    class C diagnostics;
    class A_Desc,B_Desc,C_Desc desc;

The Three Pillars of Monitoring

  1. Telemetry (Metrics): This is the foundation of monitoring. Telemetry consists of lightweight, typically numerical data points that represent the device’s vital signs. This data is collected periodically (e.g., every 5-10 minutes) and sent to a cloud backend.
    • Common Metrics: Free Heap Memory, Minimum Free Heap, Wi-Fi RSSI, CPU Utilization, Uptime, Internal Temperature, Restart Count.
    • Purpose: Ideal for creating dashboards that visualize the overall health of the fleet. It’s used for trend analysis (e.g., “Is memory usage slowly decreasing over time, indicating a leak?”) and for automated alerting (e.g., “Notify me if any device’s free heap drops below 10KB”).
    • Analogy: Telemetry is like the vital signs monitor next to a hospital bed, showing heart rate, blood pressure, and oxygen levels at a regular cadence.
  2. Logging: This is the same structured, textual logging (ESP_LOGI, ESP_LOGE, etc.) we use for debugging with a serial monitor, but redirected over the network. Because logs can be very verbose, sending all of them all the time is impractical and expensive.
    • Purpose: Used for deep-dive debugging of a specific device’s behavior. When a device is acting strangely, you can enable remote logging to see the exact sequence of events and errors leading up to the problem.
    • Best Practice: Remote logging should be disabled by default. It should be controllable, allowing a developer to remotely enable or disable it for a specific device, and often to set the desired log level (e.g., “Show me INFO level logs and above”).
    • Analogy: If telemetry shows a patient’s heart rate is erratic, logging is like getting the detailed ECG printout to analyze the specific arrhythmia.
  3. Diagnostics: While telemetry and logging are passive (the device sends data about what it’s doing), diagnostics are active. A developer sends a command to a device, instructing it to run a specific test or report a specific piece of state information.
    • Common Commands: “Ping server X,” “Report current task list and CPU usage,” “Get NVS statistics,” “Reboot in 10 seconds.”
    • Purpose: Allows for interactive troubleshooting of a live device without needing to deploy new firmware.
    • Security Risk: This is the most powerful and therefore most dangerous pillar. Diagnostic commands that can alter device state must be rigorously secured to ensure only authorized users can issue them.
    • Analogy: Diagnostics are like the doctor actively examining the patient—asking them to take a deep breath, checking their reflexes, or ordering a specific blood test.
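
The security point above can be made concrete with a small, table-driven dispatcher. The following is a minimal sketch, not a definitive implementation: the command names, the `diag_command_t` type, and the `caller_is_authorized` flag are all hypothetical, and how authorization is actually established (for example broker ACLs or signed payloads) is outside its scope.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical handler type: writes a short response, returns true on success. */
typedef bool (*diag_handler_t)(char *response, size_t response_len);

typedef struct {
    const char *name;       /* command string expected in the payload */
    bool alters_state;      /* true if the command changes device state */
    diag_handler_t handler;
} diag_command_t;

typedef enum { DIAG_OK, DIAG_UNKNOWN, DIAG_DENIED, DIAG_FAILED } diag_result_t;

static bool diag_get_status(char *response, size_t response_len) {
    snprintf(response, response_len, "{\"status\":\"ok\"}");
    return true;
}

static bool diag_reboot(char *response, size_t response_len) {
    /* A real device would schedule a restart here; this sketch only reports it. */
    snprintf(response, response_len, "{\"status\":\"reboot scheduled\"}");
    return true;
}

/* Allow-list: anything not in this table is rejected. */
static const diag_command_t DIAG_COMMANDS[] = {
    { "get_status", false, diag_get_status },
    { "reboot",     true,  diag_reboot },
};

diag_result_t diag_dispatch(const char *cmd, bool caller_is_authorized,
                            char *response, size_t response_len) {
    size_t count = sizeof(DIAG_COMMANDS) / sizeof(DIAG_COMMANDS[0]);
    for (size_t i = 0; i < count; i++) {
        const diag_command_t *c = &DIAG_COMMANDS[i];
        if (strcmp(cmd, c->name) != 0) {
            continue;
        }
        /* State-changing commands require an authorized caller. */
        if (c->alters_state && !caller_is_authorized) {
            return DIAG_DENIED;
        }
        return c->handler(response, response_len) ? DIAG_OK : DIAG_FAILED;
    }
    return DIAG_UNKNOWN;
}
```

Keeping the commands in an explicit table means an unexpected payload can only ever result in `DIAG_UNKNOWN`, and marking each entry as state-changing or read-only makes the authorization rule visible in one place.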

Architectural & Implementation Details

  • Protocol & Data Format: MQTT is the standard choice for monitoring due to its efficiency and publish/subscribe architecture. JSON is the most common data format for its flexibility and ease of parsing on the cloud side.
    • Telemetry Topic: devices/{device-id}/telemetry
    • Logs Topic: devices/{device-id}/logs
    • Diagnostics Topics: devices/{device-id}/diag/command (to device) and devices/{device-id}/diag/response (from device).
  • Redirecting Logs: The key to remote logging is the ESP-IDF function esp_log_set_vprintf(). This function allows you to replace the default log-to-UART function with your own custom handler. Warning: Never perform a blocking network operation (like publishing to MQTT) directly inside your vprintf handler. The handler runs in the context of whichever task called ESP_LOGx, and those calls come from many places, including timing-sensitive code paths. Blocking here can cause deadlocks and watchdog resets. The correct approach is to pass the log message to a queue, and have a separate, lower-priority task that sends the queued messages over the network.
  • Gathering Metrics: The ESP-IDF provides a rich set of APIs for collecting telemetry.
    • heap_caps_get_free_size(MALLOC_CAP_DEFAULT): Gets the current free heap size in bytes.
    • heap_caps_get_minimum_free_size(MALLOC_CAP_DEFAULT): An incredibly useful metric that tells you the lowest the free heap has been since boot. A value approaching zero indicates the device is at risk of memory exhaustion.
    • esp_wifi_sta_get_ap_info(): While the station is connected, this fills a wifi_ap_record_t structure whose rssi field holds the signal strength of the connected access point.
    • esp_reset_reason(): Reports why the device last rebooted. Essential for detecting crash loops.
    • uxTaskGetSystemState(): A FreeRTOS function that populates an array of TaskStatus_t structures with the status of every task in the system. It requires CONFIG_FREERTOS_USE_TRACE_FACILITY, and can be used to calculate CPU utilization if runtime stats are also enabled (CONFIG_FREERTOS_GENERATE_RUN_TIME_STATS).

| Metric | ESP-IDF API Function | Description & Usefulness |
|---|---|---|
| Current Free Heap | heap_caps_get_free_size(MALLOC_CAP_DEFAULT) | Returns the current available memory in bytes. A fundamental metric for real-time health monitoring. |
| Minimum Free Heap | heap_caps_get_minimum_free_size(MALLOC_CAP_DEFAULT) | Returns the smallest the free heap has been since boot. Critical for detecting slow memory leaks or peak memory usage issues. |
| Wi-Fi Signal Strength | esp_wifi_sta_get_ap_info() | Retrieves info about the connected AP, including RSSI (Received Signal Strength Indicator). Essential for diagnosing connectivity problems. |
| Last Reset Reason | esp_reset_reason() | Indicates why the device last rebooted (e.g., power-on, watchdog timer, brownout). Invaluable for identifying and tracking crashes. |
| System Uptime | esp_timer_get_time() | Returns the time in microseconds since boot. Useful for tracking device stability and correlating events over time. |
| Task States & CPU | uxTaskGetSystemState() | Provides a snapshot of all tasks and their states. Can be used to calculate CPU utilization if runtime stats are enabled. |
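
The topic scheme listed above (devices/{device-id}/telemetry, /logs, /diag/...) can be built with one small helper. This is a sketch under stated assumptions: `build_topic` is a hypothetical name, and rejecting device IDs that contain MQTT's level separator or wildcard characters is an extra precaution rather than something the scheme itself requires.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Builds "devices/<device_id>/<suffix>" into out.
 * Returns false if the ID is unusable or the result would not fit. */
bool build_topic(char *out, size_t out_len,
                 const char *device_id, const char *suffix) {
    /* '/', '+' and '#' have special meaning in MQTT topics. */
    if (device_id == NULL || device_id[0] == '\0' ||
        strpbrk(device_id, "/+#") != NULL) {
        return false;
    }
    int n = snprintf(out, out_len, "devices/%s/%s", device_id, suffix);
    /* snprintf returns the length it wanted to write; >= out_len means truncation. */
    return n > 0 && (size_t)n < out_len;
}
```

Checking the return value of snprintf matters here: a silently truncated topic would publish to the wrong place instead of failing visibly.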

Practical Example: Telemetry and Remote Logging

Let’s build a system that implements the first two pillars: periodic telemetry and on-demand remote logging. We will assume you have a working MQTT client connection.

1. Part 1: Periodic Health Telemetry

This task will wake up every minute, gather key metrics, and publish them as a JSON object.

C
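#include <stdio.h>
#include <stdlib.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_system.h"
#include "esp_log.h"
#include "esp_timer.h"       // esp_timer_get_time()
#include "esp_wifi.h"
#include "esp_heap_caps.h"
#include "cJSON.h"
#include "mqtt_client.h"     // esp_mqtt_client_* API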

// Assume mqtt_client is initialized and connected elsewhere
extern esp_mqtt_client_handle_t mqtt_client;
extern char device_id[]; // A string holding the unique device ID

static const char *TAG = "MONITORING";
#define TELEMETRY_INTERVAL_MS (60 * 1000)
#define TELEMETRY_TOPIC_FORMAT "devices/%s/telemetry"

void telemetry_task(void *pvParameters) {
    char topic[128];
    snprintf(topic, sizeof(topic), TELEMETRY_TOPIC_FORMAT, device_id);

    while (1) {
        vTaskDelay(pdMS_TO_TICKS(TELEMETRY_INTERVAL_MS));

        // 1. Gather Metrics
        int64_t uptime_ms = esp_timer_get_time() / 1000;
        size_t free_heap = heap_caps_get_free_size(MALLOC_CAP_DEFAULT);
        size_t min_free_heap = heap_caps_get_minimum_free_size(MALLOC_CAP_DEFAULT);
        
        int8_t rssi = 0;
        wifi_ap_record_t ap_info;
        if (esp_wifi_sta_get_ap_info(&ap_info) == ESP_OK) {
            rssi = ap_info.rssi;
        }

        // 2. Create JSON Payload
        cJSON *root = cJSON_CreateObject();
        if (root == NULL) {
            ESP_LOGW(TAG, "Out of memory building telemetry payload");
            continue;
        }
        cJSON_AddNumberToObject(root, "uptime_ms", uptime_ms);
        cJSON_AddNumberToObject(root, "free_heap_bytes", free_heap);
        cJSON_AddNumberToObject(root, "min_free_heap_bytes", min_free_heap);
        cJSON_AddNumberToObject(root, "wifi_rssi", rssi);
        
        char *json_payload = cJSON_PrintUnformatted(root);

        // 3. Publish to MQTT
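        if (json_payload) {
            // len = 0 lets the client use strlen(); QoS 1, not retained
            int msg_id = esp_mqtt_client_publish(mqtt_client, topic, json_payload, 0, 1, 0);
            if (msg_id < 0) {
                ESP_LOGW(TAG, "Telemetry publish failed");
            } else {
                ESP_LOGI(TAG, "Published telemetry: %s", json_payload);
            }
            free(json_payload);
        }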

        cJSON_Delete(root);
    }
}

// In app_main:
// xTaskCreate(telemetry_task, "telemetry_task", 4096, NULL, 5, NULL);
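
Because this payload is small and has a fixed shape, it can also be formatted into a stack buffer without any heap allocation. The sketch below is an alternative to the cJSON calls above, not a replacement the chapter prescribes; `format_telemetry` is a hypothetical helper, and if CONFIG_NEWLIB_NANO_FORMAT is enabled in your project, 64-bit format specifiers may not be supported.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Formats the four metrics as a JSON object into out.
 * Returns the payload length, or -1 if it did not fit. */
int format_telemetry(char *out, size_t out_len, int64_t uptime_ms,
                     size_t free_heap, size_t min_free_heap, int rssi) {
    int n = snprintf(out, out_len,
                     "{\"uptime_ms\":%" PRId64 ",\"free_heap_bytes\":%zu,"
                     "\"min_free_heap_bytes\":%zu,\"wifi_rssi\":%d}",
                     uptime_ms, free_heap, min_free_heap, rssi);
    return (n > 0 && (size_t)n < out_len) ? n : -1;
}
```

The returned length can be passed as the len argument of esp_mqtt_client_publish(). This approach only suits payloads whose values never need JSON escaping (numbers, here); for free-form strings cJSON remains the safer choice.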

2. Part 2: On-Demand Remote Logging

%%{ init: { 'theme': 'base', 'themeVariables': { 'fontFamily': 'Open Sans' } } }%%
sequenceDiagram
    participant App
    participant ESPLog
    participant Handler
    participant Queue
    participant Uploader
    participant MQTT

    App->>+ESPLog: ESP_LOGI(TAG, "User logged in")
    ESPLog->>+Handler: remote_log_vprintf("I (123) TAG: User logged in", ...)
    Note right of Handler: CRITICAL:<br/>Do NOT block here!
    Handler->>Queue: xQueueSend(log_queue, "...")
    Handler-->>-ESPLog: returns
    ESPLog-->>-App: returns
    
    loop Periodically Checks Queue
        Uploader->>+Queue: xQueueReceive(log_queue, ...)
        alt Remote Logging Enabled
            Queue-->>-Uploader: Returns log message
            Uploader->>+MQTT: PUBLISH devices/{id}/logs
            MQTT-->>-Uploader: ACK
        else Remote Logging Disabled
            Queue-->>Uploader: Blocks or returns empty
            Note over Uploader: Message is discarded
        end
    end

This is a more complex system involving a vprintf handler, a queue, and an uploader task.

C
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include "freertos/queue.h"

#define LOG_QUEUE_LENGTH 16
#define LOG_MSG_MAX_LEN 256
#define LOG_TOPIC_FORMAT "devices/%s/logs"

static QueueHandle_t log_queue = NULL;
static volatile bool remote_logging_enabled = false; // Controllable flag, written from the MQTT event handler

// --- The Uploader Task ---
// This task blocks waiting for logs on the queue and sends them.
void log_uploader_task(void *pvParameters) {
    char log_message[LOG_MSG_MAX_LEN];
    char topic[128];
    snprintf(topic, sizeof(topic), LOG_TOPIC_FORMAT, device_id);

    while (1) {
        // Block until a message arrives in the queue
        if (xQueueReceive(log_queue, &log_message, portMAX_DELAY) == pdTRUE) {
            // Only publish if the feature is enabled
            if (remote_logging_enabled) {
                esp_mqtt_client_publish(mqtt_client, topic, log_message, 0, 0, 0);
            }
        }
    }
}

// --- The Custom vprintf Handler ---
// ESP_LOGx macros call this in the context of whichever task is logging,
// so it must never block. (ESP_LOGx itself is not meant to be used from ISRs.)
// It prints locally, then formats a copy of the message and pushes it to the queue.
int remote_log_vprintf(const char *format, va_list args) {
    // A va_list can only be consumed once, so copy it before the first use
    va_list args_copy;
    va_copy(args_copy, args);

    // 1. First, send to the default UART console so we don't lose local debugging
    int ret = vprintf(format, args);

    // 2. Format the message for the queue.
    // The ESP_LOGx prefix ("I (1234) TAG: ") is already part of the format string.
    // Note: this buffer lives on the stack of the task that is logging.
    char buffer[LOG_MSG_MAX_LEN];
    int len = vsnprintf(buffer, sizeof(buffer), format, args_copy);
    va_end(args_copy);

    if (len > 0) {
        // vsnprintf returns the untruncated length; clamp it to what was stored
        if (len >= (int)sizeof(buffer)) {
            len = sizeof(buffer) - 1;
        }
        // Remove trailing newline if it exists
        if (buffer[len - 1] == '\n') {
            buffer[len - 1] = '\0';
        }
        // Send the formatted string to the queue (non-blocking; dropped if full)
        xQueueSend(log_queue, buffer, 0);
    }

    return ret;
}

void setup_remote_logging(void) {
    log_queue = xQueueCreate(LOG_QUEUE_LENGTH, LOG_MSG_MAX_LEN);
    if (log_queue == NULL) {
        ESP_LOGE(TAG, "Failed to create log queue");
        return;
    }

    xTaskCreate(log_uploader_task, "log_uploader", 4096, NULL, 4, NULL);
    
    // Set our custom function as the handler for all ESP_LOGx calls
    esp_log_set_vprintf(remote_log_vprintf);
    
    ESP_LOGI(TAG, "Remote logging system initialized.");
}

// To control it, subscribe to a topic like "devices/{device-id}/logs/control"
// and in the MQTT event handler, set `remote_logging_enabled` to true or false
// based on the payload (e.g., "ENABLE" or "DISABLE").
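
The control step described in the comment above might look like the following sketch. `apply_log_control` is a hypothetical helper; it takes the payload as a pointer plus length because ESP-MQTT delivers event data with a separate data_len and without a terminating NUL, so strcmp() on the raw payload would be unsafe.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Updates *enabled when the payload is exactly "ENABLE" or "DISABLE".
 * Returns false (and leaves the flag untouched) for anything else. */
bool apply_log_control(const char *data, size_t data_len, bool *enabled) {
    static const char ENABLE_CMD[] = "ENABLE";
    static const char DISABLE_CMD[] = "DISABLE";

    if (data_len == sizeof(ENABLE_CMD) - 1 &&
        memcmp(data, ENABLE_CMD, data_len) == 0) {
        *enabled = true;
        return true;
    }
    if (data_len == sizeof(DISABLE_CMD) - 1 &&
        memcmp(data, DISABLE_CMD, data_len) == 0) {
        *enabled = false;
        return true;
    }
    return false;
}
```

In the MQTT_EVENT_DATA case of the event handler, after confirming that the topic is the log control topic, this could be called with event->data and event->data_len, writing the result into a local bool that is then assigned to remote_logging_enabled.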

3. Build and Run

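  1. Add cJSON and the MQTT client to your component’s dependencies in CMakeLists.txt, for example PRIV_REQUIRES json mqtt (in ESP-IDF releases that bundle cJSON, the component is named json).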
  2. Integrate the code above into a project with a working MQTT client.
  3. Create the telemetry task and call setup_remote_logging() during initialization.
  4. Observe the telemetry messages appearing on your MQTT broker every minute.
  5. Publish “ENABLE” to the log control topic. You should now see all ESP_LOG messages also appear on the log topic.
  6. Publish “DISABLE” and observe that the logs stop being sent over MQTT (though they still appear on the serial monitor).

Variant Notes

| Feature / Metric | ESP32 (Classic) | ESP32-S2/S3/C3 | ESP32-H2/C6 | Notes |
|---|---|---|---|---|
| Core Logic (vprintf, Queues) | Yes | Yes | Yes | The fundamental software architecture is identical across all variants. |
| Dual-Core CPU Monitoring | Yes | ESP32-S3 Only | No | Allows monitoring CPU load on each core independently on dual-core chips. |
| Internal Temperature Sensor | No | Yes | Yes | A valuable metric for detecting overheating, available on newer chip series. |
| Wi-Fi RSSI | Yes | Yes | C6 Only | Standard metric for any Wi-Fi enabled device. |
| 802.15.4 (Thread/Zigbee) Metrics | No | No | Yes | For mesh networks, you would monitor metrics like Parent Node ID and Link Quality. |

  • Core Logic: The concepts and the core ESP-IDF APIs (esp_log_set_vprintf, heap_caps, FreeRTOS queues) are identical across all ESP32 variants (ESP32, S2, S3, C-series, H-series). The example code will work on any of them.
  • On-Chip Temperature Sensor: The ESP32-S2, S3, C3, and subsequent chips include an internal temperature sensor. This is an excellent telemetry metric to add, especially for devices in enclosed spaces, to monitor for overheating. You can read it using the temperature_sensor driver.
  • CPU Utilization: The method for calculating CPU usage is the same (using FreeRTOS runtime stats), but the results differ. On dual-core variants (ESP32, ESP32-S3), you can monitor the load on each core independently, which is useful for diagnosing if one core is overloaded while the other is idle.
  • Connectivity Metrics: For the ESP32-H2 and ESP32-C6, which support 802.15.4 (Thread/Zigbee), your telemetry might include mesh-specific metrics instead of or in addition to Wi-Fi RSSI. This could include the device’s role (Router, End Device), parent node ID, and signal strength to the parent.
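
The arithmetic behind the CPU utilization figure is the same on every variant: take two snapshots of the run-time counters and divide the deltas. The helper below is a sketch with a hypothetical name; it assumes 32-bit counters such as the ulRunTimeCounter field of TaskStatus_t and the total run time that uxTaskGetSystemState() reports through its third parameter. On dual-core parts, how that total relates to per-core time depends on the port, so treat the result as a share of the reported total.

```c
#include <stdint.h>

/* Percentage of the reported total run time that one task consumed
 * between two snapshots. Unsigned subtraction keeps the deltas correct
 * across a single wrap-around of a 32-bit counter. */
uint32_t cpu_percent(uint32_t task_prev, uint32_t task_now,
                     uint32_t total_prev, uint32_t total_now) {
    uint32_t task_delta = task_now - task_prev;
    uint32_t total_delta = total_now - total_prev;
    if (total_delta == 0) {
        return 0; /* no time elapsed between snapshots */
    }
    /* Widen before multiplying so task_delta * 100 cannot overflow. */
    return (uint32_t)(((uint64_t)task_delta * 100u) / total_delta);
}
```

Working from deltas rather than absolute counters gives utilization over the last telemetry interval instead of an average since boot, which is usually what a dashboard needs.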

Common Mistakes & Troubleshooting Tips

| Mistake / Issue | Symptom(s) | Troubleshooting / Solution |
|---|---|---|
| Blocking in vprintf Handler | Device randomly reboots; Watchdog Timer errors (Task watchdog got triggered); system becomes unresponsive. | Never call blocking functions (like MQTT publish, vTaskDelay) inside the vprintf handler. Use a non-blocking FreeRTOS queue send (xQueueSend with a timeout of 0) to pass the log message to a separate, lower-priority task for network transmission. |
| cJSON Memory Leak | Device works for a while, then crashes; free heap memory steadily decreases over time (visible in telemetry); crashes often occur during JSON creation/publishing. | Always free the memory allocated by cJSON_PrintUnformatted(). After publishing the JSON string, call free(json_payload). Also ensure the root cJSON object is deleted with cJSON_Delete(root). |
| Log Flooding | High data usage and cost from your cloud provider; MQTT broker or backend service becomes slow or unresponsive; device seems busy and may miss other critical operations. | Remote logging must be disabled by default. Implement a control mechanism (e.g., a separate MQTT topic) to enable/disable logging and set log verbosity (ESP_LOG_INFO, ESP_LOG_DEBUG, etc.) on a per-device, on-demand basis. |
| Losing Local Serial Logs | After implementing remote logging, nothing appears on the serial monitor anymore, making local debugging impossible. | Your custom vprintf handler must explicitly call the original output function. Call vprintf(format, args) at the start of your handler to ensure logs are still printed to UART. |
| Log Queue Overflows | Logs are missing from the remote output, especially during bursts of activity; the xQueueSend function might return errQUEUE_FULL. | (1) Ensure the log uploader task has sufficient priority. (2) Consider increasing the queue size. (3) Implement log batching: collect several logs from the queue and send them in a single MQTT message to improve efficiency. |
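
Setting log verbosity remotely, as suggested for log flooding, needs a way to tell which level a formatted line has. The following sketch assumes the default ESP-IDF line layout, a level letter followed by " (timestamp) TAG: ..." and optionally preceded by an ANSI color sequence; `should_forward` and `level_rank` are hypothetical helpers, and a different log format would need a different parser.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Maps a level letter to a rank: 0 = Error ... 4 = Verbose. -1 if unknown. */
static int level_rank(char letter) {
    const char *order = "EWIDV";
    const char *p = (letter != '\0') ? strchr(order, letter) : NULL;
    return p ? (int)(p - order) : -1;
}

/* Returns true if the line is at least as severe as max_letter,
 * e.g. max_letter 'I' forwards Error, Warning and Info lines. */
bool should_forward(const char *line, char max_letter) {
    /* Skip a leading ANSI color sequence such as "\033[0;32m". */
    if (line[0] == '\033' && line[1] == '[') {
        const char *m = strchr(line, 'm');
        if (m == NULL) {
            return false;
        }
        line = m + 1;
    }
    int rank = level_rank(line[0]);
    int limit = level_rank(max_letter);
    /* Require "X (" so ordinary text starting with a level letter is not misread. */
    if (rank < 0 || limit < 0 || line[1] != ' ' || line[2] != '(') {
        return false;
    }
    return rank <= limit;
}
```

The uploader task could call this on each dequeued message before publishing, with max_letter set from the control topic, so that a device under investigation can be switched between, say, warnings only and full debug output.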

Exercises

  1. Restart Reason Telemetry: Enhance the telemetry task. On the very first run after booting, it should call esp_reset_reason() to determine why the device started up. It should then publish a separate, one-time “boot” event message containing the restart reason, derived from the returned esp_reset_reason_t value (e.g., "reason": "ESP_RST_POWERON" or "reason": "ESP_RST_BROWNOUT"). This is invaluable for tracking down unexpected crashes.
  2. Log Batching: Modify the log_uploader_task. Instead of sending one MQTT message per log, have it wait up to 1 second or until it has collected 5 logs from the queue. It should then concatenate these logs into a single string (separated by newlines) and publish them in one MQTT message. This is a much more efficient use of network resources.
  3. CPU Usage Telemetry: Enable CONFIG_FREERTOS_GENERATE_RUN_TIME_STATS and CONFIG_FREERTOS_USE_TRACE_FACILITY in menuconfig. Enhance the telemetry task to call uxTaskGetSystemState(). Iterate through the resulting task array, calculate the percentage of CPU time each task has used since the last telemetry send, and add this to the JSON payload.

Summary

  • Remote monitoring is essential for managing deployed IoT devices and consists of three pillars: Telemetry, Logging, and Diagnostics.
  • Telemetry provides periodic, lightweight vital signs (heap, RSSI) for fleet-wide dashboards and alerting.
  • Logging provides verbose, event-driven details for deep-dive debugging and should be remotely controllable to manage costs.
  • Diagnostics allow for active, on-demand troubleshooting by sending commands to devices.
  • The esp_log_set_vprintf() function is the key to remote logging, but must be used with a non-blocking queue to avoid system instability.
  • The ESP-IDF provides rich APIs like heap_caps_get_free_size and esp_reset_reason to gather critical health metrics.
  • A well-designed monitoring strategy is secure, cost-conscious, and provides actionable insights into your device fleet.
