Chapter 155: CAN Error Handling and Recovery

Chapter Objectives

By the end of this chapter, you will be able to:

  • Identify and understand the five basic types of CAN bus errors: Bit Error, Stuff Error, CRC Error, Form Error, and Acknowledgement (ACK) Error.
  • Explain the purpose and structure of CAN Active and Passive Error Frames.
  • Describe the three CAN controller error states: Error Active, Error Passive, and Bus-Off.
  • Understand the role and behavior of the Transmit Error Counter (TEC) and Receive Error Counter (REC).
  • Utilize TWAI alerts to detect various error conditions, including bus errors and state changes.
  • Implement bus-off recovery procedures using the ESP-IDF TWAI driver functions.
  • Retrieve and interpret TWAI status information, such as error counters and the current controller state.
  • Appreciate the importance of fault confinement in maintaining CAN network stability.

1. Introduction

The Controller Area Network (CAN) protocol is renowned for its robustness and reliability, which are paramount in automotive and industrial control systems where communication failures can have serious consequences. A significant part of this robustness comes from its sophisticated built-in mechanisms for error detection, signaling, and fault confinement. Individual nodes can detect errors, inform other nodes, and even isolate themselves from the bus if they become a persistent source of problems.

In the previous chapters, we’ve explored how to configure the ESP32’s TWAI peripheral and exchange CAN messages. However, real-world communication is rarely perfect. Physical layer issues (like faulty wiring, improper termination, or electromagnetic interference), mismatched baud rates, or malfunctioning nodes can all introduce errors onto the bus.

This chapter focuses on how these errors are handled within the CAN protocol and how you, as an embedded systems developer using the ESP32, can leverage the TWAI driver’s features to monitor bus health, react to error conditions, and implement recovery strategies. Understanding these mechanisms is crucial for building truly resilient and dependable CAN-based applications.

2. Theory

CAN’s error handling is a multi-layered approach involving error detection by any node, error signaling to all nodes, and fault confinement to prevent a persistently faulty node from disrupting the entire network.

2.1. Types of CAN Errors

Any node participating in CAN communication actively monitors the bus for errors. There are five primary types of errors that can be detected:

  1. Bit Error:
    • Detection: A transmitter monitors the bus level while sending. If the bit level it reads back from the bus is different from the bit it intended to send, a Bit Error is detected.
    • Exception: This rule does not apply during the arbitration field (where a node intentionally stops transmitting if it sends a recessive bit but detects a dominant one) or during the ACK slot (where the transmitter sends a recessive bit and expects a dominant bit from a receiver).
    • Analogy: Imagine speaking a word and hearing yourself say a different word due to noise or a problem with your microphone.
  2. Stuff Error (Bit Stuffing Error):
    • Detection: To ensure enough signal transitions for synchronization, CAN uses bit stuffing: after five consecutive bits of the same polarity, the transmitter inserts a complementary bit. If a receiver detects six consecutive bits of the same polarity in a part of the frame that should be stuffed (SOF, Arbitration, Control, Data, CRC fields), a Stuff Error is detected.
    • Analogy: Imagine a rule in a written language where no letter can be repeated more than five times consecutively without inserting a special symbol. Seeing six identical letters in a row would be a “stuff error.”
  3. CRC Error (Cyclic Redundancy Check Error):
    • Detection: The transmitter calculates a 15-bit CRC checksum based on the content of the SOF, Arbitration, Control, and Data fields and appends it to the message. All receivers also calculate the CRC on the incoming message. If a receiver’s calculated CRC does not match the CRC received in the frame, a CRC Error is detected.
    • Analogy: Similar to a checksum on a downloaded file. If the calculated checksum doesn’t match the provided one, the file is likely corrupted.
  4. Form Error (Format Error):
    • Detection: Certain parts of a CAN frame have a fixed format (e.g., the CRC Delimiter, ACK Delimiter, and End of Frame field must all be recessive). If a node detects a dominant bit in one of these fields where a recessive bit is expected, a Form Error is detected.
    • Analogy: Finding a punctuation mark in the middle of a word where it doesn’t belong – it violates the expected structure.
  5. Acknowledgement (ACK) Error:
    • Detection: After a transmitter successfully sends a message (up to the CRC delimiter), it transmits a recessive bit in the ACK Slot. At least one receiver that has correctly received the message must acknowledge it by transmitting a dominant bit in this ACK Slot. If the transmitter does not detect this dominant bit (i.e., the ACK Slot remains recessive), an ACK Error is detected.
    • Analogy: Sending a registered letter and not receiving a signed confirmation of receipt. It implies no one (correctly) received it.
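The CRC and bit-stuffing checks above are easy to model in software. The following host-side C sketch is only an illustration: the TWAI hardware performs these checks itself, and the helper names are hypothetical, not part of ESP-IDF. It works on an array of bits, one bit per byte in transmission order. can_crc15() uses the CAN 2.0 CRC-15 generator polynomial (0x4599) and is meant to be fed the destuffed bits from SOF through the Data field; has_stuff_violation() applies the six-equal-bits rule to the raw, stuffed bits of the stuffed region.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CAN 2.0 CRC-15 generator: x^15 + x^14 + x^10 + x^8 + x^7 + x^4 + x^3 + 1.
 * The x^15 term is implicit, leaving 0x4599. */
#define CAN_CRC15_POLY 0x4599u

/* Runs bits[0..n_bits-1] (one bit per byte, 0 or 1, in transmission order)
 * through the 15-bit CRC shift register, starting from 0. */
uint16_t can_crc15(const uint8_t *bits, size_t n_bits)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < n_bits; i++) {
        /* Feedback bit = incoming bit XOR current register MSB (bit 14). */
        unsigned crc_next = (bits[i] & 1u) ^ ((crc >> 14) & 1u);
        crc = (uint16_t)((crc << 1) & 0x7FFFu); /* shift left, keep 15 bits */
        if (crc_next) {
            crc ^= CAN_CRC15_POLY;
        }
    }
    return crc;
}

/* Returns true if six consecutive bits of equal polarity appear, i.e. the
 * bit stuffing rule was violated (a Stuff Error when seen in a stuffed field). */
bool has_stuff_violation(const uint8_t *bits, size_t n_bits)
{
    size_t run = 0;
    for (size_t i = 0; i < n_bits; i++) {
        run = (i > 0 && bits[i] == bits[i - 1]) ? run + 1 : 1;
        if (run >= 6) {
            return true;
        }
    }
    return false;
}
```

Because the register computes the remainder of the message polynomial multiplied by x^15, a receiver that runs can_crc15() over the received bits followed by the 15 received CRC bits (most significant first) gets 0 for an intact frame; any nonzero result corresponds to a CRC Error.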

2.2. Error Frames

When a node detects one of the errors mentioned above, it signals it by transmitting an Error Frame, starting at the next bit (for a CRC Error, only after the ACK Delimiter). If the node was transmitting, this aborts its current data or remote frame. The Error Frame notifies all other nodes on the bus that an error has occurred, causing them to discard the potentially corrupted message.

An Error Frame consists of two parts:

  1. Error Flag: Each node's Error Flag is 6 consecutive bits of the same polarity, which deliberately violates the bit stuffing rule so that every other node recognizes it. Because other nodes may only detect the error when they see this violation and then send their own flags, the superimposed flags on the bus can last from 6 to 12 bit times.
    • Active Error Flag: Composed of 6 consecutive dominant bits. Sent by nodes in the “Error Active” state.
    • Passive Error Flag: Composed of 6 consecutive recessive bits. Sent by nodes in the “Error Passive” state. Since any dominant bit overrides a recessive one, a passive flag cannot destroy frames or error flags sent by other nodes.
  2. Error Delimiter: Following the Error Flag, there are 8 recessive bits.

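After the Error Frame and the following Interframe Space, the original transmitter automatically retransmits the interrupted message (unless single-shot transmission was requested). An Error Passive transmitter must additionally wait for the Suspend Transmission time described in Section 2.4.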

2.3. Error Counters (TEC and REC)

To implement fault confinement, each CAN node maintains two internal error counters:

  • Transmit Error Counter (TEC): Primarily reflects errors related to message transmission.
  • Receive Error Counter (REC): Primarily reflects errors related to message reception.

The exact rules are defined in the CAN specification (Bosch CAN 2.0 and ISO 11898-1). The most important ones are:

Incrementing Counters:

  • Receiver detects an error: REC increases by 1 (e.g., CRC, Form, or Stuff Error on a frame it is receiving). Exception: a Bit Error detected while sending an Active Error Flag or Overload Flag increases REC by 8.
  • Receiver sees a dominant bit as the first bit after sending its Error Flag: REC increases by 8.
  • Transmitter sends an Error Flag: TEC increases by 8 (e.g., after a Bit Error during its own transmission, or an ACK Error).
  • Exceptions where TEC does not change: an Error Passive transmitter detects an ACK Error and sees no dominant bit while sending its Passive Error Flag; or a transmitter detects a Stuff Error during arbitration because a stuff bit it sent as recessive was read back as dominant.
  • Transmitter detects a Bit Error while sending an Active Error Flag or Overload Flag: TEC increases by 8.
  • A node tolerates up to 7 consecutive dominant bits after its flag. After the 14th consecutive dominant bit following an Active Error Flag or Overload Flag, or the 8th following a Passive Error Flag, and after each further sequence of 8 dominant bits, every transmitter adds 8 to TEC and every receiver adds 8 to REC.

Decrementing Counters:

  • Successful transmission (ACK received and no error until End of Frame is complete): TEC decreases by 1 (if TEC > 0).
  • Successful reception (no error up to the ACK slot, and the ACK bit sent successfully): REC decreases by 1 if it is between 1 and 127, stays at 0 if it is 0, and is set to a value between 119 and 127 if it is above 127.

Because a transmit error costs 8 while a successful transmission recovers only 1, TEC climbs whenever there is more than one failed transmission per eight successful ones. A node that consistently causes errors (e.g., due to a hardware fault) therefore moves toward Error Passive and Bus-Off much faster than a node that is only occasionally affected by errors from the bus. A summary table:

| Event / Condition | Effect on TEC (Transmit Error Counter) | Effect on REC (Receive Error Counter) |
|---|---|---|
| Transmitter sends an Error Flag (e.g., Bit Error during own transmission, ACK Error) | +8 (except the Error Passive ACK-error and arbitration stuff-bit cases) | No change |
| Receiver detects an error (e.g., CRC, Form, or Stuff Error on a frame it is receiving) | No change | +1 |
| Receiver sees a dominant bit right after its Error Flag, or detects a Bit Error while sending an Active Error Flag or Overload Flag | No change | +8 |
| Transmitter detects a Bit Error while sending an Active Error Flag or Overload Flag | +8 | No change |
| Successful transmission of a message by this node | -1 (if TEC > 0) | No change |
| Successful reception of a message by this node (no error up to the ACK slot, ACK sent) | No change | -1 if REC is 1 to 127; set to 119 to 127 if REC > 127 |
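To see how the counters interact, here is a simplified host-side C model of the rules above. The names are hypothetical and this is not the TWAI driver's implementation: it omits the overload-flag and dominant-bit-sequence rules, and for the REC > 127 case it picks 127 from the 119 to 127 range the specification allows.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one node's fault-confinement counters. */
typedef struct {
    uint16_t tec; /* Transmit Error Counter */
    uint16_t rec; /* Receive Error Counter */
} can_counters_t;

/* After a successful reception with REC > 127, the spec allows REC to be set
 * to any value from 119 to 127; this sketch picks 127. */
#define REC_AFTER_PASSIVE_RX 127u

static bool is_error_passive(const can_counters_t *c)
{
    return c->tec >= 128 || c->rec >= 128;
}

/* Transmitter sends an Error Flag (general rule: TEC += 8). */
void on_tx_error(can_counters_t *c)
{
    c->tec += 8;
}

/* Transmitter detects an ACK Error. Models the exception: an Error Passive
 * transmitter that sees no dominant bit during its Passive Error Flag leaves
 * TEC unchanged. Assumes nothing else drives the bus dominant during that
 * flag, as on a bus with no other active node. */
void on_tx_ack_error(can_counters_t *c)
{
    if (!is_error_passive(c)) {
        c->tec += 8;
    }
}

/* Successful transmission: TEC -= 1, never below 0. */
void on_tx_success(can_counters_t *c)
{
    if (c->tec > 0) {
        c->tec--;
    }
}

/* Receiver detects an error (general rule: REC += 1). */
void on_rx_error(can_counters_t *c)
{
    c->rec += 1;
}

/* Successful reception: step down, or drop back below 128 if above 127. */
void on_rx_success(can_counters_t *c)
{
    if (c->rec > 127) {
        c->rec = REC_AFTER_PASSIVE_RX;
    } else if (c->rec > 0) {
        c->rec--;
    }
}
```

For example, calling on_tx_ack_error() repeatedly on a fresh node leaves TEC at 128 rather than driving it to Bus-Off, which matches a lone transmitter with nobody to acknowledge its frames. Alternating one on_tx_error() with seven on_tx_success() calls raises TEC by 1 per cycle, illustrating the 8:1 asymmetry.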

2.4. CAN Node Error States

Based on the values of TEC and REC, a CAN node operates in one of three error states:

  1. Error Active:
    • Condition: TEC < 128 AND REC < 128.
    • Behavior: The node participates normally in bus communication. When it detects an error, it sends an Error Frame that begins with an Active Error Flag (6 dominant bits), which actively signals the error to all other nodes.
  2. Error Passive:
    • Condition: TEC >= 128 OR REC >= 128 (but not Bus-Off).
    • Behavior: The node still participates in communication but is considered “less healthy.”
      • When it detects an error, it sends a Passive Error Flag (6 recessive bits). Because any dominant bit overrides a recessive one, a Passive Error Flag cannot destroy frames or error flags sent by other nodes.
      • After transmitting a message, an Error Passive node must wait an additional “Suspend Transmission” time (8 recessive bits) after the Interframe Space before it can initiate a new transmission. This effectively lowers its bus bandwidth usage.
  3. Bus-Off:
    • Condition: TEC >= 256 (REC alone never causes Bus-Off).
    • Behavior: The node is considered to be a major source of errors and is automatically disconnected from the bus by its own controller.
      • It does not transmit anything: no Data, Remote, Error, or Overload Frames, and no ACK bits. Its output is held recessive so it cannot influence the bus.
      • It keeps monitoring the bus (it must, in order to detect the recessive bit sequences required for recovery) but no longer takes part in communication.
      • To rejoin the bus, the node must complete a bus-off recovery: it must observe 128 occurrences of 11 consecutive recessive bits on the bus, after which it returns to Error Active with TEC and REC reset to 0. In the ESP-IDF TWAI driver, this sequence is started explicitly with twai_initiate_recovery().
stateDiagram-v2
    %% Mermaid State Diagram for CAN Error States (Fault Confinement)
    
    direction TB

    state "Error Active" as Active
    state "Error Passive" as Passive  
    state "Bus-Off" as BusOff

    [*] --> Active : Initialization / Low Error Count

    Active --> Passive : TEC >= 128 OR REC >= 128
    Passive --> Active : TEC < 128 AND REC < 128
    Passive --> BusOff : TEC >= 256
    BusOff --> Active : Successful Recovery Sequence

    %% State descriptions
    Active : TEC < 128 AND REC < 128
    Active : Normal operation
    Active : Sends Active Error Flags (dominant)
    Active : TEC++/REC++ on errors
    Active : TEC--/REC-- on success

    Passive : TEC >= 128 OR REC >= 128  
    Passive : Sends Passive Error Flags (recessive)
    Passive : Suspend Transmission delay after sending
    Passive : TEC++/REC++ on errors
    Passive : TEC--/REC-- on success

    BusOff : TEC >= 256
    BusOff : Node is offline (cannot transmit)
    BusOff : Requires recovery
    BusOff : TEC and REC reset to 0 after recovery

    %% Add notes for transitions
    note right of Active
        Error accumulation
    end note
    
    note left of Passive
        Successful operations reduce errors
    end note
    
    note right of Passive
        Severe error accumulation (transmit side)
    end note
    
    note left of BusOff
        Recovery requires 128 x 11 recessive bits,
        started in ESP-IDF with twai_initiate_recovery()
    end note

    %% Styling
    classDef activeStyle fill:#D1FAE5,stroke:#059669,color:#065F46
    classDef passiveStyle fill:#FEF3C7,stroke:#D97706,color:#92400E  
    classDef busOffStyle fill:#FEE2E2,stroke:#DC2626,color:#991B1B
    
    class Active activeStyle
    class Passive passiveStyle
    class BusOff busOffStyle

This state mechanism is a key part of CAN’s fault confinement, preventing a single malfunctioning node from continuously disrupting the entire network.
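The state thresholds and the Bus-Off recovery condition can be expressed directly in code. This host-side C sketch uses hypothetical names; on the ESP32, the TWAI controller tracks these conditions in hardware and the driver reports them through twai_state_t and alerts.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the three fault-confinement states. */
typedef enum {
    NODE_ERROR_ACTIVE,
    NODE_ERROR_PASSIVE,
    NODE_BUS_OFF
} node_state_t;

/* Derive the error state from the counters (REC alone never causes Bus-Off). */
node_state_t classify_state(uint16_t tec, uint16_t rec)
{
    if (tec >= 256) {
        return NODE_BUS_OFF;
    }
    if (tec >= 128 || rec >= 128) {
        return NODE_ERROR_PASSIVE;
    }
    return NODE_ERROR_ACTIVE;
}

#define RECOVERY_SEQUENCES 128u /* required occurrences */
#define RECOVERY_SEQ_BITS  11u  /* recessive bits per occurrence */

/* Bus-Off recovery monitor: counts occurrences of 11 consecutive recessive bits. */
typedef struct {
    uint8_t run;         /* recessive bits in the current sequence (0..10) */
    uint8_t occurrences; /* completed 11-bit sequences so far (0..128) */
} recovery_monitor_t;

/* Feed one sampled bus bit (1 = recessive, 0 = dominant). Returns true once
 * 128 sequences have been observed, i.e. the node may rejoin as Error Active
 * with TEC and REC reset to 0. */
bool recovery_feed_bit(recovery_monitor_t *m, int bus_bit)
{
    if (m->occurrences >= RECOVERY_SEQUENCES) {
        return true;
    }
    if (bus_bit) {
        m->run++;
        if (m->run == RECOVERY_SEQ_BITS) {
            m->run = 0;
            m->occurrences++;
        }
    } else {
        m->run = 0; /* a dominant bit restarts only the sequence in progress */
    }
    return m->occurrences >= RECOVERY_SEQUENCES;
}
```

The monitor counts completed sequences cumulatively, so a dominant bit only restarts the sequence in progress. On an idle bus, recovery therefore takes at least 128 x 11 = 1408 bit times, about 11.3 ms at 125 kbit/s.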

2.5. TWAI Driver Support for Error Handling

The ESP-IDF TWAI driver provides several mechanisms to monitor and manage error conditions:

  • Alerts (alerts_enabled in twai_general_config_t): You can enable various alerts to be notified of specific events. Relevant error-related alerts include:
    • TWAI_ALERT_BUS_ERROR: A bus error (Bit, Stuff, CRC, Form, ACK) has occurred.
    • TWAI_ALERT_ARB_LOST: The controller lost arbitration during transmission.
    • TWAI_ALERT_ERR_PASS: The controller has entered the Error Passive state.
    • TWAI_ALERT_BUS_OFF: The controller has entered the Bus-Off state.
    • TWAI_ALERT_TX_FAILED: The previous transmission failed. ESP-IDF documents this alert for single-shot transmissions; ordinary frames are retransmitted automatically by the controller after an error.
    • TWAI_ALERT_ABOVE_ERR_WARN: TEC or REC has exceeded the error warning limit (96 by default). TWAI_ALERT_BELOW_ERR_WARN signals that both counters have dropped back below it.
    • TWAI_ALERT_BUS_RECOVERED: Bus-off recovery has completed.
  These alerts can be read using twai_read_alerts().
  • Status Information (twai_get_status_info()): This function retrieves the current status of the TWAI controller.
C
esp_err_t twai_get_status_info(twai_status_info_t *status_info);
  • The twai_status_info_t structure contains the following members (the exact set of fields can vary between ESP-IDF releases):
C
typedef struct {
    twai_state_t state;             /*!< Current state of TWAI controller (Stopped, Running, Bus-Off, Recovering) */
    uint32_t msgs_to_tx;            /*!< Number of messages queued for transmission */
    uint32_t msgs_to_rx;            /*!< Number of messages in RX queue waiting to be read */
    uint32_t tx_error_counter;      /*!< Current value of Transmit Error Counter */
    uint32_t rx_error_counter;      /*!< Current value of Receive Error Counter */
    uint32_t tx_failed_count;       /*!< Number of messages that failed to transmit */
    uint32_t rx_missed_count;       /*!< Number of messages that were lost due to a full RX queue */
    uint32_t arb_lost_count;        /*!< Number of times arbitration was lost */
    uint32_t bus_error_count;       /*!< Total number of bus errors encountered */
} twai_status_info_t;

| Member | Type | Description |
|---|---|---|
| state | twai_state_t | Current state of the TWAI controller (e.g., TWAI_STATE_STOPPED, TWAI_STATE_RUNNING, TWAI_STATE_BUS_OFF, TWAI_STATE_RECOVERING). |
| msgs_to_tx | uint32_t | Number of messages currently queued in the software transmit queue waiting for transmission. |
| msgs_to_rx | uint32_t | Number of messages currently in the software receive queue waiting to be read by the application. |
| tx_error_counter | uint32_t | Current value of the Transmit Error Counter (TEC). |
| rx_error_counter | uint32_t | Current value of the Receive Error Counter (REC). |
| tx_failed_count | uint32_t | Cumulative count of messages that failed to transmit. |
| rx_missed_count | uint32_t | Cumulative count of messages that were received by the hardware but lost because the RX queue was full. |
| arb_lost_count | uint32_t | Cumulative count of times arbitration was lost during transmission attempts. |
| bus_error_count | uint32_t | Cumulative count of bus errors detected (Bit, Stuff, CRC, Form, ACK errors). |
  • Bus-Off Recovery:
    • twai_initiate_recovery(): Starts the bus-off recovery process and is only valid while the driver is in the Bus-Off state. The driver enters TWAI_STATE_RECOVERING, clears the TX queue, and waits for 128 occurrences of 11 consecutive recessive bits on the bus. When recovery completes, the driver enters TWAI_STATE_STOPPED; enable TWAI_ALERT_BUS_RECOVERED to be notified.
    • twai_start(): Only valid in the Stopped state, so it cannot be used directly on a Bus-Off controller. After recovery has completed, call twai_start() to return to TWAI_STATE_RUNNING. Alternatively, a Bus-Off driver can be uninstalled with twai_driver_uninstall() and then reinstalled and started, which gives the application explicit control over how the controller rejoins the bus.

3. Practical Examples

Example 1: Monitoring TWAI Status and Error Counters

This example demonstrates how to periodically read and log the TWAI controller’s status, including its state and error counters.

Prerequisites:

  • ESP-IDF v5.x project.
  • TWAI driver configured (e.g., TWAI_MODE_NORMAL or TWAI_MODE_SELF_TEST) and started.

Code Snippet:

C
#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "driver/twai.h"
#include "esp_log.h"

static const char *TAG = "TWAI_STATUS_MONITOR";

#define TWAI_TX_GPIO_NUM CONFIG_EXAMPLE_TWAI_TX_GPIO
#define TWAI_RX_GPIO_NUM CONFIG_EXAMPLE_TWAI_RX_GPIO

// Kconfig (ensure these are in your project's Kconfig.projbuild or sdkconfig.defaults)
// CONFIG_EXAMPLE_TWAI_TX_GPIO=21
// CONFIG_EXAMPLE_TWAI_RX_GPIO=22

static void twai_status_monitor_task(void *pvParameters)
{
    twai_status_info_t status_info;
    const TickType_t xDelay = pdMS_TO_TICKS(1000); // Check status every 1 second

    while (1) {
        if (twai_get_status_info(&status_info) == ESP_OK) {
            const char *state_str;
            switch (status_info.state) {
                case TWAI_STATE_STOPPED:    state_str = "STOPPED"; break;
                case TWAI_STATE_RUNNING:    state_str = "RUNNING"; break;
                case TWAI_STATE_BUS_OFF:    state_str = "BUS-OFF"; break;
                case TWAI_STATE_RECOVERING: state_str = "RECOVERING"; break;
                default:                    state_str = "UNKNOWN"; break;
            }

            ESP_LOGI(TAG, "TWAI Status: State=%s, TXQ=%lu, RXQ=%lu, TEC=%lu, REC=%lu, TX_failed=%lu, RX_missed=%lu, Arb_lost=%lu, Bus_errs=%lu",
                     state_str,
                     status_info.msgs_to_tx,
                     status_info.msgs_to_rx,
                     status_info.tx_error_counter,
                     status_info.rx_error_counter,
                     status_info.tx_failed_count,
                     status_info.rx_missed_count,
                     status_info.arb_lost_count,
                     status_info.bus_error_count);
            
            if (status_info.state == TWAI_STATE_BUS_OFF) {
                ESP_LOGW(TAG, "Controller is BUS-OFF! Recovery might be needed.");
            }

        } else {
            ESP_LOGE(TAG, "Failed to get TWAI status.");
        }
        vTaskDelay(xDelay);
    }
}

void app_main(void)
{
    ESP_LOGI(TAG, "TWAI Error Handling Example: Status Monitor");

    // Basic TWAI configuration (use SELF_TEST for simplicity if no external bus)
    twai_general_config_t g_config = TWAI_GENERAL_CONFIG_DEFAULT(TWAI_TX_GPIO_NUM, TWAI_RX_GPIO_NUM, TWAI_MODE_SELF_TEST);
    twai_timing_config_t t_config = TWAI_TIMING_CONFIG_125KBITS();
    twai_filter_config_t f_config = TWAI_FILTER_CONFIG_ACCEPT_ALL();

    ESP_LOGI(TAG, "Installing TWAI driver...");
    if (twai_driver_install(&g_config, &t_config, &f_config) != ESP_OK) {
        ESP_LOGE(TAG, "Failed to install TWAI driver.");
        return;
    }
    ESP_LOGI(TAG, "Starting TWAI driver...");
    if (twai_start() != ESP_OK) {
        ESP_LOGE(TAG, "Failed to start TWAI driver.");
        twai_driver_uninstall();
        return;
    }
    ESP_LOGI(TAG, "TWAI driver started.");

    xTaskCreate(twai_status_monitor_task, "twai_status_task", 4096, NULL, 5, NULL);

    // To observe TEC/REC changes, you'd need to induce errors.
    // In SELF_TEST mode without external factors, TEC/REC should ideally remain 0.
    // If you have an external bus, try disconnecting termination or another node to see errors.
}

Build and Run:

  1. Set the TX/RX GPIO numbers in Kconfig (menuconfig).
  2. Build, flash, monitor.
  3. Observe: The task will periodically print the TWAI status. In a healthy self-test loop or a well-behaved bus, TEC and REC should be low or zero, and state should be RUNNING.

Example 2: Handling Bus-Off Alert and Initiating Recovery

graph TD
    %% Mermaid Flowchart for Bus-Off Detection and Recovery
    %% Styles
    classDef start fill:#EDE9FE,stroke:#5B21B6,stroke-width:2px,color:#5B21B6; 
    classDef process fill:#DBEAFE,stroke:#2563EB,stroke-width:1px,color:#1E40AF; 
    classDef decision fill:#FEF3C7,stroke:#D97706,stroke-width:1px,color:#92400E; 
    classDef alert fill:#FEE2E2,stroke:#DC2626,stroke-width:1px,color:#991B1B; 
    classDef recovery fill:#D1FAE5,stroke:#059669,stroke-width:1px,color:#065F46;
    classDef monitor fill:#E0E7FF,stroke:#4338CA,color:#3730A3; 

    A[Application Running with TWAI Active]:::start
    A --> B["Enable <span style='font-family:monospace;font-size:0.8em;'>TWAI_ALERT_BUS_OFF</span> and <span style='font-family:monospace;font-size:0.8em;'>TWAI_ALERT_BUS_RECOVERED</span> in <span style='font-family:monospace;font-size:0.8em;'>g_config.alerts_enabled</span>"]:::process
    B --> C{"Alert Handler Task:<br>Call <span style='font-family:monospace;font-size:0.8em;'>twai_read_alerts(&alerts, timeout)</span>"}:::monitor
    
    C --> D{Alert Triggered?}:::decision
    D -- "No (Timeout or Other Alerts)" --> C
    D -- "Yes" --> E{<span style='font-family:monospace;font-size:0.8em;'>alerts & TWAI_ALERT_BUS_OFF</span>?}:::decision
    
    E -- "Yes (Bus-Off Detected!)" --> F["Log Bus-Off Event"]:::alert
    F --> G["Call <span style='font-family:monospace;font-size:0.8em;'>twai_initiate_recovery()</span>"]:::recovery
    G --> H{Recovery Initiated Successfully?}:::decision
    H -- "Yes" --> I["TWAI Controller Enters Recovery State<br>(<span style='font-family:monospace;font-size:0.8em;'>TWAI_STATE_RECOVERING</span>: waits for 128 x 11 recessive bits)"]:::monitor
    H -- "No (Recovery Initiation Failed)" --> J["Log Recovery Initiation Failure<br>(May need driver uninstall/reinstall)"]:::alert
    I --> M["On <span style='font-family:monospace;font-size:0.8em;'>TWAI_ALERT_BUS_RECOVERED</span>: state is STOPPED<br>Call <span style='font-family:monospace;font-size:0.8em;'>twai_start()</span> to return to RUNNING"]:::recovery
    M --> C

    E -- "No (Other Alert)" --> K["Handle Other Enabled Alerts<br>(e.g., ERR_PASS, BUS_ERROR)"]:::process
    K --> C

    subgraph "Background Process"
        L["Status Monitor Task (Optional):<br>Periodically call <span style='font-family:monospace;font-size:0.8em;'>twai_get_status_info()</span><br>to observe TEC, REC, State"]:::monitor
    end
    
    A -.-> L

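This example enables the TWAI_ALERT_BUS_OFF alert. When the alert is triggered, it calls twai_initiate_recovery(). Because the driver ends recovery in the Stopped state, the controller must then be restarted with twai_start() before it can communicate again.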

C
#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "driver/twai.h"
#include "esp_log.h"

static const char *TAG = "TWAI_BUS_OFF_RECOVERY";

#define TWAI_TX_GPIO_NUM CONFIG_EXAMPLE_TWAI_TX_GPIO
#define TWAI_RX_GPIO_NUM CONFIG_EXAMPLE_TWAI_RX_GPIO

// Kconfig
// CONFIG_EXAMPLE_TWAI_TX_GPIO=21
// CONFIG_EXAMPLE_TWAI_RX_GPIO=22

static void twai_alert_handler_task(void *pvParameters)
{
    uint32_t alerts_triggered;
    twai_status_info_t status_info;

    ESP_LOGI(TAG, "TWAI Alert Handler Task started.");

    while (1) {
        // Block until an alert is triggered (portMAX_DELAY waits indefinitely; it is already in ticks)
        if (twai_read_alerts(&alerts_triggered, portMAX_DELAY) == ESP_OK) {
            ESP_LOGI(TAG, "Alerts triggered: 0x%08lX", alerts_triggered);

            if (alerts_triggered & TWAI_ALERT_BUS_OFF) {
                ESP_LOGE(TAG, "ALERT: Bus-Off event detected!");
                ESP_LOGI(TAG, "Attempting to initiate recovery...");
                // Recovery waits for 128 occurrences of 11 recessive bits and clears the TX queue
                if (twai_initiate_recovery() == ESP_OK) {
                    ESP_LOGI(TAG, "Bus-Off recovery initiated.");
                } else {
                    ESP_LOGE(TAG, "Failed to initiate Bus-Off recovery.");
                }
            }
            if (alerts_triggered & TWAI_ALERT_BUS_RECOVERED) {
                // Recovery has completed and the driver is now STOPPED: restart it to resume communication
                ESP_LOGI(TAG, "ALERT: Bus recovered. Restarting TWAI driver...");
                if (twai_start() != ESP_OK) {
                    ESP_LOGE(TAG, "Failed to restart TWAI driver after recovery.");
                }
            }
            if (alerts_triggered & TWAI_ALERT_ERR_PASS) {
                ESP_LOGW(TAG, "ALERT: Entered Error Passive state.");
            }
            if (alerts_triggered & TWAI_ALERT_ABOVE_ERR_WARN) {
                ESP_LOGW(TAG, "ALERT: An error counter exceeded the error warning limit.");
            }
            if (alerts_triggered & TWAI_ALERT_BUS_ERROR) {
                ESP_LOGW(TAG, "ALERT: Bus error detected.");
                if (twai_get_status_info(&status_info) == ESP_OK) {
                    ESP_LOGW(TAG, "Current TEC: %lu, REC: %lu", status_info.tx_error_counter, status_info.rx_error_counter);
                }
            }
            // Add more alert checks as needed
        }
    }
}

void app_main(void)
{
    ESP_LOGI(TAG, "TWAI Bus-Off Recovery Example");

    twai_general_config_t g_config = TWAI_GENERAL_CONFIG_DEFAULT(TWAI_TX_GPIO_NUM, TWAI_RX_GPIO_NUM, TWAI_MODE_NORMAL);
    // Enable alerts for Bus-Off, recovery completion, Error Passive, bus errors, and the error warning limit
    g_config.alerts_enabled = TWAI_ALERT_BUS_OFF | TWAI_ALERT_BUS_RECOVERED | TWAI_ALERT_ERR_PASS |
                              TWAI_ALERT_BUS_ERROR | TWAI_ALERT_ABOVE_ERR_WARN;
    
    twai_timing_config_t t_config = TWAI_TIMING_CONFIG_125KBITS();
    twai_filter_config_t f_config = TWAI_FILTER_CONFIG_ACCEPT_ALL();

    ESP_LOGI(TAG, "Installing TWAI driver with alerts enabled...");
    if (twai_driver_install(&g_config, &t_config, &f_config) != ESP_OK) {
        ESP_LOGE(TAG, "Failed to install TWAI driver.");
        return;
    }
    ESP_LOGI(TAG, "Starting TWAI driver...");
    if (twai_start() != ESP_OK) {
        ESP_LOGE(TAG, "Failed to start TWAI driver.");
        twai_driver_uninstall();
        return;
    }
    ESP_LOGI(TAG, "TWAI driver started.");

    xTaskCreate(twai_alert_handler_task, "twai_alert_task", 4096, NULL, 10, NULL);
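    // Optional: also run twai_status_monitor_task() from Example 1.
    // Copy that function into this file first, then uncomment the next line.
    // xTaskCreate(twai_status_monitor_task, "twai_status_task", 4096, NULL, 5, NULL);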

    ESP_LOGI(TAG, "System running. To test Bus-Off, you would need to create severe and persistent bus errors.");
    ESP_LOGI(TAG, "For example, by shorting CAN_H/CAN_L or removing termination on a live bus (use a test setup, not a critical system!).");
    ESP_LOGI(TAG, "Transmitting with no other node to ACK raises TEC only to 128 (Error Passive), not to Bus-Off.");

    // Example: Simulate ACK errors (if this is the only active node).
    // TEC rises by 8 per failed attempt until the node becomes Error Passive (TEC = 128). After that,
    // ACK errors no longer increment TEC, so this alone normally does NOT reach Bus-Off.
    // Use with caution and on a test bus.
    /*
    vTaskDelay(pdMS_TO_TICKS(5000)); // Wait for tasks to start
    ESP_LOGI(TAG, "Attempting to transmit repeatedly to potentially trigger ACK errors / Bus-Off...");
    twai_message_t dummy_msg;
    dummy_msg.identifier = 0x7FF; // Low priority ID
    dummy_msg.flags = 0;
    dummy_msg.data_length_code = 1;
    dummy_msg.data[0] = 0x55;
    for (int i = 0; i < 500; i++) { // Transmit many times
        esp_err_t tx_res = twai_transmit(&dummy_msg, pdMS_TO_TICKS(10));
        if (tx_res != ESP_OK && tx_res != ESP_ERR_TIMEOUT) {
             ESP_LOGE(TAG, "TX error during stress test: %s", esp_err_to_name(tx_res));
             // Check status_info.state here via a shared mechanism or log
             twai_status_info_t current_status;
             if (twai_get_status_info(&current_status) == ESP_OK && current_status.state == TWAI_STATE_BUS_OFF) {
                 ESP_LOGE(TAG, "Entered BUS-OFF during stress test. Recovery should be triggered by alert task.");
                 break;
             }
        }
        vTaskDelay(pdMS_TO_TICKS(5)); // Small delay between transmissions
    }
    ESP_LOGI(TAG, "Finished transmission stress test.");
    */
}

Build and Run:

  • This example is best tested on a physical CAN bus where you can induce errors.
  • If you run the commented-out transmission loop with TWAI_MODE_NORMAL and no other node to acknowledge, the ESP32 will see ACK errors. Its TEC rises by 8 per failed attempt until it reaches 128 and the node becomes Error Passive. From then on, ACK errors no longer increment TEC, so this test alone normally stops at Error Passive rather than Bus-Off. To reach Bus-Off, create errors that the transmitter detects as Bit Errors, for example by shorting CAN_H to CAN_L on a test bus.
  • Observe: Logs from twai_alert_handler_task reporting TWAI_ALERT_BUS_OFF and the recovery attempt. If you also run twai_status_monitor_task, it shows the state changing to BUS-OFF, then RECOVERING, then STOPPED once recovery completes, and RUNNING again after twai_start().

4. Variant Notes

  • Core Error Mechanisms: The fundamental CAN error types (Bit, Stuff, CRC, Form, ACK), error states (Active, Passive, Bus-Off), and TEC/REC behavior are part of the CAN standard and are implemented consistently by the TWAI peripheral across all ESP32 variants (ESP32, S2, S3, C3, C6, H2).
  • Alerts and Status: The specific alerts (TWAI_ALERT_...) and the twai_status_info_t structure provided by the ESP-IDF driver are also generally consistent for these core error handling features.
  • Bus-Off Recovery Implementation: twai_initiate_recovery() (followed by twai_start() once recovery completes) and reinstalling the driver are the standard recovery paths. The timing of the hardware’s bus-off recovery sequence (128 occurrences of 11 consecutive recessive bits) is defined by the CAN specification.
  • Hardware Auto-Recovery: Some CAN controllers offer automatic bus-off recovery. The ESP-IDF TWAI driver does not rejoin the bus on its own after Bus-Off; the application initiates recovery with twai_initiate_recovery() (or a driver reinstall), which keeps recovery under deterministic application control.

5. Common Mistakes & Troubleshooting Tips

Mistake: Ignoring Error Counters and Status
Symptom(s): The application is unaware of degrading bus health; the node may suddenly go Bus-Off without prior warning to the application. TEC/REC values are not monitored.
Troubleshooting / Solution:
  • Monitor Status: Periodically call twai_get_status_info() to check TEC, REC, and controller state (TWAI_STATE_RUNNING, TWAI_STATE_BUS_OFF, etc.).
  • Enable Alerts: In twai_general_config_t, set alerts_enabled to include flags like TWAI_ALERT_BUS_ERROR, TWAI_ALERT_ERR_PASS, TWAI_ALERT_BUS_OFF, and TWAI_ALERT_ABOVE_ERR_WARN.
  • Alert Handling Task: Implement a task that calls twai_read_alerts() and reacts to triggered alerts.

Mistake: No Bus-Off Recovery Implemented
Symptom(s): The node enters the Bus-Off state and remains offline indefinitely; nothing attempts to rejoin the bus.
Troubleshooting / Solution:
  • Detect Bus-Off: Handle the TWAI_ALERT_BUS_OFF alert or detect the Bus-Off state via twai_get_status_info().
  • Initiate Recovery: Call twai_initiate_recovery(). The driver state becomes TWAI_STATE_RECOVERING and the TX queue is cleared.
  • Restart After Recovery: When recovery completes (TWAI_ALERT_BUS_RECOVERED), the driver is in TWAI_STATE_STOPPED; call twai_start() to resume communication.
  • Reinstall Driver: Alternatively, or if recovery fails, call twai_driver_uninstall() (allowed in the Stopped or Bus-Off state), re-configure, then call twai_driver_install() and twai_start().
  • Retry Limits: Consider limiting automatic recovery attempts to prevent continuous cycling if the bus has a persistent fault.

Mistake: Persistent Physical Layer Problems
Symptom(s): The node repeatedly goes Bus-Off despite recovery attempts. High TEC/REC values. Frequent bus error alerts.
Troubleshooting / Solution:
  • Check Physical Bus: Inspect wiring for shorts, opens, or bad connections.
  • Termination: Verify that 120 Ohm termination resistors are fitted at BOTH ends of the main bus trunk (not on stubs). Check for missing or extra terminations.
  • Transceiver: Ensure the CAN transceiver is powered correctly and functioning.
  • Noise (EMI): Investigate potential sources of electromagnetic interference near the CAN bus lines. Shielding or twisted pair wiring might be necessary.
  • Node Hardware: Check whether the ESP32 board or its CAN-related components are faulty.

Mistake: Mismatched Baud Rates Between Nodes
Symptom(s): Frequent Form Errors, CRC Errors, Bit Errors, ACK Errors. Communication is unreliable or fails completely. Nodes may enter Error Passive or Bus-Off.
Troubleshooting / Solution:
  • Verify All Nodes: Ensure ALL nodes on the CAN bus are configured for the EXACT same nominal baud rate.
  • Bit Timing: Also verify that bit timing parameters (TSEG1, TSEG2, SJW, Sample Point) are compatible across all nodes. Use a CAN bit timing calculator to check settings.

Mistake: Misinterpreting twai_transmit() Failures in Error States
Symptom(s): The application keeps trying to transmit while the controller is Bus-Off, leading to repeated ESP_ERR_INVALID_STATE errors.
Troubleshooting / Solution:
  • Check State Before TX: Before calling twai_transmit(), especially after known errors, check the controller state via twai_get_status_info().
  • Handle Invalid State: If the state is TWAI_STATE_BUS_OFF, TWAI_STATE_RECOVERING, or TWAI_STATE_STOPPED, transmission will fail. Wait for recovery to complete and call twai_start(), or reinstall the driver.
  • TX Queue Full: Distinguish ESP_ERR_TIMEOUT (the TX queue stayed full for the whole timeout; waiting may help) from ESP_ERR_INVALID_STATE (the controller is not running and needs recovery or a restart).
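The Retry Limits tip can be implemented as a small rate limiter that the alert handler consults before calling twai_initiate_recovery(). The sketch below is plain C with hypothetical names and policy values (at most 3 recoveries per minute, then a 5-minute hold-off, similar to Exercise 1). It takes the current time in milliseconds as a parameter so it can be tested on a host; on the ESP32 you might pass a value derived from esp_timer_get_time() / 1000.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policy values: at most 3 recoveries per minute, then hold off. */
#define MAX_RECOVERIES 3u
#define WINDOW_MS      60000u  /* 1 minute */
#define BACKOFF_MS     300000u /* 5 minutes */

typedef struct {
    uint32_t history[MAX_RECOVERIES]; /* times (ms) of recent allowed recoveries */
    uint32_t count;                    /* valid entries in history */
    uint32_t backoff_until;            /* earliest time (ms) recovery may resume */
    bool in_backoff;
} recovery_limiter_t;

/* Call when Bus-Off is detected at now_ms. Returns true if the application may
 * start recovery now (on the ESP32: call twai_initiate_recovery()). */
bool recovery_allowed(recovery_limiter_t *l, uint32_t now_ms)
{
    if (l->in_backoff) {
        /* Signed difference tolerates wrap-around of the millisecond counter. */
        if ((int32_t)(now_ms - l->backoff_until) < 0) {
            return false;
        }
        l->in_backoff = false;
        l->count = 0;
    }

    /* Forget recoveries that are older than the window. */
    uint32_t kept = 0;
    for (uint32_t i = 0; i < l->count; i++) {
        if (now_ms - l->history[i] < WINDOW_MS) {
            l->history[kept++] = l->history[i];
        }
    }
    l->count = kept;

    if (l->count >= MAX_RECOVERIES) {
        /* Too many Bus-Off events in the window: treat as a persistent fault. */
        l->in_backoff = true;
        l->backoff_until = now_ms + BACKOFF_MS;
        return false;
    }

    l->history[l->count++] = now_ms;
    return true;
}
```

While the limiter is holding off, the node stays Bus-Off and no new TWAI_ALERT_BUS_OFF will arrive, so the application has to re-check later, for example from a periodic status task that still sees TWAI_STATE_BUS_OFF via twai_get_status_info().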

6. Exercises

  1. Enhanced Bus-Off Recovery:
    • Modify Example 2 (twai_alert_handler_task). If twai_initiate_recovery() is called and the node subsequently goes Bus-Off again within a short period (e.g., 3 times in 1 minute), implement a strategy where the application stops trying to recover automatically for a longer duration (e.g., 5 minutes) and logs a critical “Persistent Bus Failure” message. This prevents rapid, continuous recovery attempts on a fundamentally broken bus.
  2. TEC/REC Threshold Warning:
    • Using the twai_status_monitor_task from Example 1, add logic to issue a specific warning log if either TEC or REC exceeds a threshold (e.g., 64, which is halfway to the Error Passive limit of 128). This can serve as an early indicator of degrading bus quality. The TWAI_ALERT_ABOVE_ERR_WARN alert (triggered at the error warning limit, 96 by default) can also be used for this.
  3. Simulate ACK Errors and Observe TEC:
    • Set up your ESP32 in TWAI_MODE_NORMAL. Ensure no other CAN nodes are connected and acknowledging, or deliberately remove bus termination from one end (on a safe test bus only!).
    • Write a task that attempts to transmit a CAN message repeatedly (e.g., 100 times with a small delay between each).
    • In the twai_status_monitor_task, observe how the TEC increases due to ACK errors. Does the node transition to Error Passive? Can you make it go Bus-Off? (Be careful with stressing hardware if issues persist).
    • Important Safety Note: Modifying bus termination should only be done on an isolated test setup, not a production or critical CAN network.

7. Summary

  • CAN incorporates robust error detection for Bit, Stuff, CRC, Form, and ACK errors.
  • Detected errors lead to the transmission of Error Frames (Active or Passive) to alert other nodes.
  • Transmit Error Counter (TEC) and Receive Error Counter (REC) track fault levels.
  • Nodes transition between Error Active, Error Passive, and Bus-Off states based on TEC/REC values, implementing fault confinement.
  • The ESP-IDF TWAI driver allows monitoring of these states and counters via twai_get_status_info().
  • Error conditions can be detected using TWAI alerts (e.g., TWAI_ALERT_BUS_OFF, TWAI_ALERT_BUS_ERROR).
  • Bus-Off recovery is critical: initiate it with twai_initiate_recovery() and call twai_start() once recovery completes, or uninstall and reinstall the driver.
  • Persistent physical layer problems or configuration mismatches (like baud rates) are common sources of CAN errors.

8. Further Reading

  • ESP-IDF TWAI API Reference:
  • Bosch CAN Specification (Version 2.0 Part A/B):
    • The original specification provides the definitive details on error detection, error signaling, and fault confinement. Search for “Bosch CAN Specification.”
  • Application Notes on CAN Error Handling:
    • Many microcontroller vendors (e.g., Microchip, NXP, TI) publish detailed application notes on CAN error handling principles and best practices.
