Auto-Transform Your C Code to MISRA Compliance with H2LooP Code-Sanitizr (in Beta)

MISRA rules are a set of coding standards developed by the Motor Industry Software Reliability Association (MISRA) to ensure safety, security, and reliability in software systems, particularly embedded systems in the automotive and other industries. We have developed an in-house trained Small Language Model (SLM), H2LooP Code-Sanitizr (currently in beta), which takes your non-compliant C code and transforms it into compliant code within seconds.

This report benchmarks multiple language models, ranging from large-scale frontier Gemini variants to compact Small Language Models (SLMs) like Gemma3-4B and the specialized H2LooP Code-Sanitizr (Beta), on the automatic correction of C code violating MISRA C:2012 rules. The goal is to assess how effectively compact, domain-specialized SLMs can deliver safe, rule-aware code corrections comparable to those of models more than 100x larger - particularly for safety-critical automotive and embedded systems where MISRA compliance is mandatory - without exposing your codebase or IP to cloud LLM providers.

Task Definition for Models:

Models were provided with incorrect C code snippets and MISRA C:2012 rule violations detected using the H2LooP Toolchain. They were then prompted to generate corrected code that addresses these errors.

1. Dataset and Evaluation Setup

Split	Samples
Training	1350
Validation	150
Evaluation	163

Evaluation Metrics

Errors Solved per Row — Net change in number of MISRA rule violations per sample (positive = fix, negative = regression).
Rule-wise Solved Distribution — Aggregated rule-level corrections across required/mandatory rules.
Character Delta (%) — Code modification magnitude, reflecting how aggressively a model edits original code.
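
The two per-sample metrics can be sketched in a few lines of C. This is only an illustration under stated assumptions: the report does not give the exact formula for Character Delta, so the sketch assumes it is the Levenshtein edit distance between the original and corrected code, expressed as a percentage of the original length. The function names are hypothetical and are not part of the H2LooP Toolchain.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Net violations fixed for one sample: positive = fix, negative = regression. */
int errors_solved(int violations_before, int violations_after)
{
    return violations_before - violations_after;
}

/* Levenshtein distance using two rolling rows.
 * Returns (size_t)-1 if memory allocation fails. */
size_t edit_distance(const char *a, const char *b)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);
    size_t *prev = malloc((lb + 1U) * sizeof *prev);
    size_t *curr = malloc((lb + 1U) * sizeof *curr);
    size_t result = (size_t)-1;

    if ((prev != NULL) && (curr != NULL)) {
        for (size_t j = 0U; j <= lb; ++j) {
            prev[j] = j;                      /* "" -> b[0..j) needs j inserts */
        }
        for (size_t i = 1U; i <= la; ++i) {
            curr[0] = i;                      /* a[0..i) -> "" needs i deletes */
            for (size_t j = 1U; j <= lb; ++j) {
                size_t sub = prev[j - 1U] + ((a[i - 1U] == b[j - 1U]) ? 0U : 1U);
                size_t del = prev[j] + 1U;
                size_t ins = curr[j - 1U] + 1U;
                size_t best = (sub < del) ? sub : del;
                curr[j] = (best < ins) ? best : ins;
            }
            size_t *tmp = prev;               /* latest row becomes "prev" */
            prev = curr;
            curr = tmp;
        }
        result = prev[lb];
    }
    free(prev);
    free(curr);
    return result;
}

/* ASSUMPTION: Character Delta = edit distance / original length * 100.
 * Returns -1.0 for an empty original or on allocation failure. */
double char_delta_percent(const char *original, const char *corrected)
{
    size_t len = strlen(original);
    size_t dist = edit_distance(original, corrected);

    if ((len == 0U) || (dist == (size_t)-1)) {
        return -1.0;
    }
    return 100.0 * (double)dist / (double)len;
}
```

Under this assumed definition, a sample that goes from three violations to one scores 2 errors solved, and a correction that inserts four characters into a six-character snippet has a character delta of about 66.7%.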

H2LooP Toolchain

The H2LooP Toolchain is an integral component of the evaluation pipeline.
Its primary role is to process raw C source files, using the abstract syntax tree (AST) to accurately locate and extract the erroneous target functions. It then identifies and maps the MISRA C:2012 rule violations within those functions.
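
The mapping step can be pictured as a filter over reported violations by function line range. The sketch below is a hypothetical illustration only: the `Violation` and `FunctionSpan` types and the `violations_in_function` helper are invented for this example and do not describe the toolchain's actual internals or data model.

```c
#include <stddef.h>

/* Hypothetical record for one reported violation (illustrative names). */
typedef struct {
    const char *rule;   /* MISRA C:2012 rule identifier, e.g. "9.1" */
    int line;           /* 1-based line number in the source file */
} Violation;

/* Hypothetical line span of one function, as an AST walk might report it. */
typedef struct {
    const char *name;
    int first_line;
    int last_line;
} FunctionSpan;

/* Copies the violations that fall inside the function's line span into
 * `out` (at most `cap` entries) and returns the total number of matches. */
size_t violations_in_function(const FunctionSpan *fn,
                              const Violation *all, size_t n,
                              Violation *out, size_t cap)
{
    size_t found = 0U;

    for (size_t i = 0U; i < n; ++i) {
        if ((all[i].line >= fn->first_line) && (all[i].line <= fn->last_line)) {
            if (found < cap) {
                out[found] = all[i];
            }
            ++found;
        }
    }
    return found;
}
```

With violations reported on lines 5, 12, and 30 and a function spanning lines 10 to 20, only the violation on line 12 is attributed to that function.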

2. Models Evaluated

Model	Description
H2LooP Code-Sanitizr (Beta)	Lightweight model trained within the H2LooP framework to automatically correct and sanitize C code for MISRA C:2012 compliance.
Expert Corrected Code	Manually corrected by domain experts.
Gemini 2.5 Flash (with H2LooP Toolchain)	Flagship workhorse model from Google integrated with the H2LooP rule-aware toolchain.
Gemini 2.5 Flash (without H2LooP Toolchain)	Same model evaluated without external rule context.
Gemini 2.5 Pro (Thinking, with H2LooP Toolchain)	State-of-the-art Gemini variant leveraging rule-level guidance via the H2LooP toolchain.
Gemini 2.5 Pro (Thinking, without H2LooP Toolchain)	Same Pro model evaluated without external rule context.
Gemma3-4B (with/without H2LooP Toolchain)	Compact model of similar size to ours, serving as the comparable SLM baseline.

H2LooP Code-Sanitizr (Beta): A compact, rule-aware model trained within the H2LooP framework. It identifies MISRA errors with the H2LooP Toolchain, corrects them with minimal code edits, and leaves a brief comment on the reasoning behind each change. It is designed for interpretable, fine-grained compliance correction.
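
To make "minimal edit plus a brief reasoning comment" concrete, here is a hand-written illustration. It is not actual Sanitizr output, and the function names are hypothetical. It uses MISRA C:2012 Rule 15.6, which requires the body of a selection or iteration statement to be a compound statement.

```c
#include <stdint.h>

/* Before: the `if` body is a single unbraced statement (violates Rule 15.6). */
int32_t clamp_speed_before(int32_t speed, int32_t limit)
{
    int32_t result = speed;

    if (speed > limit)
        result = limit;
    return result;
}

/* After: only the braces and a short rationale comment are added;
 * the behavior of the function is unchanged. */
int32_t clamp_speed_after(int32_t speed, int32_t limit)
{
    int32_t result = speed;

    if (speed > limit)
    {
        /* MISRA C:2012 Rule 15.6: body of a selection statement
         * shall be a compound statement. */
        result = limit;
    }
    return result;
}
```

Both versions return the same values for every input. The correction touches only the flagged statement, which is the kind of small edit footprint the character delta metric is meant to capture.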

Please note, H2LooP Code-Sanitizr (Beta) is also referred to as Sanitizr or H2LooP Code-Sanitizr throughout this report.

3. Aggregate Evaluation Summary

Figure 4. Relationship between model sizes and their normalized efficiency.

This plot illustrates the efficiency–scalability trade-off across models, using the Performance Index and Efficiency Index, which are formally defined in the subsections below. While the Gemini 2.5 models paired with the H2LooP Toolchain lead in absolute performance (indices near 100), the H2LooP Code-Sanitizr (Beta) achieves rule-correction capability comparable to the same models without the toolchain at a fraction of the parameter scale, reflecting far better performance-per-parameter efficiency.

Overall, the plot reinforces that domain-specialized adaptation can yield near-top-tier reliability without massive model size or computational overhead.

Evaluation Samples: 163

Average Errors Solved per Sample

Model	Avg Errors Solved
Expert Corrected Code	1.54
Gemini 2.5 Pro (Thinking, with H2LooP Toolchain)	1.31
Gemini 2.5 Flash (with H2LooP Toolchain)	1.25
Gemini 2.5 Pro (Thinking, without H2LooP Toolchain)	0.55
H2LooP Code-Sanitizr (Beta)	0.66
Gemini 2.5 Flash (without H2LooP Toolchain)	0.63
Gemma3-4B (with H2LooP Toolchain)	-0.05
Gemma3-4B (without H2LooP Toolchain)	0.00

Insights:

The H2LooP Code-Sanitizr (Beta) achieves an average of 0.66 errors solved per sample, representing the strongest performance among compact, specialized models.
The similarly sized Gemma3-4B lags far behind (−0.05 with and 0.00 without the H2LooP Toolchain), indicating that the model lacks sufficient rule understanding without targeted adaptation.
Gemini 2.5 Pro and Flash perform much worse without the H2LooP Toolchain: Pro drops from 1.31 to 0.55 and Flash from 1.25 to 0.63.

Average Character Delta (% vs Incorrect Code)

Model	% Character Delta
Expert Corrected Code	12.71%
Gemini 2.5 Pro (Thinking, with H2LooP Toolchain)	13.20%
Gemini 2.5 Flash (with H2LooP Toolchain)	30.72%
Gemini 2.5 Pro (Thinking, without H2LooP Toolchain)	12.28%
Gemini 2.5 Flash (without H2LooP Toolchain)	24.35%
H2LooP Code-Sanitizr (Beta)	12.34%
Gemma3-4B (with H2LooP Toolchain)	5.26%
Gemma3-4B (without H2LooP Toolchain)	0.29%

Insights:

Gemini 2.5 Flash performs broader, higher-magnitude edits (24–31%), indicating more aggressive rewrites, whereas Gemini 2.5 Pro stays near the expert baseline (12–13%).
The H2LooP Code-Sanitizr (Beta) shows a compact edit footprint (~8% delta), close to the expert baseline (12.7%), implying targeted and efficient corrections similar to those made by domain experts.
Gemma3-4B models, both with and without the H2LooP toolchain, make minimal modifications, often insufficient to correct rule violations.

Normalized Performance and Efficiency Indices (0–100 Scale)

To provide a unified comparison across models, two normalized indices were computed:

Performance Index (0 = worst, 100 = best): based on Average Errors Solved, normalized so that the weakest model (Gemma3-4B with H2LooP Toolchain, −0.05) maps to 0 and the strongest (Gemini 2.5 Pro with H2LooP Toolchain, 1.31) maps to 100.
Efficiency Index (0 = worst, 100 = best): based on Errors Solved ÷ (% Character Delta × Parameter Size), i.e. fixes per percent of code changed per unit of model size, normalized across all models.

Normalization formulas:

Performance Index

Performance Index = [(solved - min_solved) / (max_solved - min_solved)] × 100

Efficiency Index

Efficiency Index = normalize(solved / (char_delta × param_size)), scaled to 0–100

Model	Avg Errors Solved	% Char Δ	Performance Index	Efficiency Index
Gemma3-4B (with H2LooP Toolchain)	−0.05	5.26	0.0	0.0
H2LooP Code-Sanitizr (Beta)	0.53	8.34	42.6	66.8
Gemini 2.5 Flash (with H2LooP Toolchain)	1.25	30.7	95.6	8.8
Gemini 2.5 Pro (Thinking, with H2LooP Toolchain)	1.31	13.2	100.0	0.8
Gemini 2.5 Flash (without H2LooP Toolchain)	0.63	24.4	50.0	13.5
Gemini 2.5 Pro (Thinking, without H2LooP Toolchain)	0.55	12.3	44.1	5.2
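
The Performance Index column can be reproduced from the Avg Errors Solved column with the min-max formula given earlier. The helper below is a minimal sketch. The function name is hypothetical, and the bounds are the table's own extremes (−0.05 and 1.31).

```c
#include <assert.h>

/* Min-max normalization of an "errors solved" score onto a 0-100 scale.
 * Requires max_solved > min_solved. */
double performance_index(double solved, double min_solved, double max_solved)
{
    assert(max_solved > min_solved);
    return ((solved - min_solved) / (max_solved - min_solved)) * 100.0;
}
```

For example, `performance_index(0.53, -0.05, 1.31)` evaluates to (0.58 / 1.36) × 100 ≈ 42.6, which matches the Sanitizr row, and `performance_index(1.25, -0.05, 1.31)` gives ≈ 95.6 for Gemini 2.5 Flash with the toolchain. The Efficiency Index is not reproduced here because it depends on parameter sizes that are not listed in this report.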

Normalized Performance vs Efficiency Indices

Interpretation:

H2LooP Code-Sanitizr reaches ≈ 43% of the top-performing model's absolute correction capability (Performance Index 42.6 vs 100). Critically, its efficiency score is more than 80 times higher (66.8 vs 0.8) than that of the highest-performing model, indicating better value and deployability.
While the highest-performing models achieve greater absolute accuracy, their large parameter sizes and/or code edit footprints render them significantly less efficient according to the defined metric. The Sanitizr balances precision, minimal code alteration, and compact size.
Gemma3-4B, a similarly sized model, sits at the lower bound of both indices, which highlights its minimal correction ability and underscores the effect of Sanitizr's domain adaptation.

The H2LooP Code-Sanitizr (Beta) raises SLM correction ability (Performance Index) from 0 to ~43 on the 0–100 scale while demonstrating far higher efficiency (Efficiency Index ≈ 67), achieving meaningful compliance improvements through compact, targeted edits.

4. Rule-Family-Level Performance

The following analysis illustrates performance by MISRA rule families, highlighting which categories of safety rules each model handles best. Each family represents a distinct safety domain — from initialization and typing to pointer handling and control-flow integrity. Performance metrics show average correction rates between 0 (no improvement) and 100 (expert parity).

Family	Representative Rules	Top Performer(s)	Highlights
9.x	9.1–9.5	Gemini 2.5 Pro / H2LooP Code-Sanitizr	Strong performance on initialization and assignment corrections.
10.x	10.1, 10.4	Gemini 2.5 Pro / H2LooP Code-Sanitizr	Accurate expression evaluation and operator handling.
11.x	11.1–11.9	Gemini 2.5 Pro	Perfect pointer conversion handling; Sanitizr approaches consistency.
13.x	13.2, 13.6	Gemini 2.5 Pro	Stable expression-side-effect correction; partial Sanitizr recovery.
15.x	15.2–15.7	Gemini 2.5 Pro / H2LooP Code-Sanitizr	Reliable control-flow structure validation.
18.x	18.1–18.8	Gemini 2.5 Pro / H2LooP Code-Sanitizr	Challenging family; Sanitizr shows promising pointer-safety gains.
2.x	2.2	Gemini 2.5 Pro / H2LooP Code-Sanitizr	Dead-code (unused code) corrections.
8.x	8.4	Gemini 2.5 Pro / H2LooP Code-Sanitizr	Object definition and visibility corrections dominate the dataset.

Each rule family addresses a distinct safety or reliability concern in C programming. Below, we analyze average model performance across these families, combining expert fix rates, Gemini-class performance, and H2LooP Code-Sanitizr (Beta) results.

Family 9.x — Initialization and Assignment

Description: Requires explicit initialization of all variables and data structures before use to prevent undefined runtime behavior.

Importance: Critical for system predictability and safe startup states in automotive or safety-critical firmware.
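
As an illustration of this family, the sketch below shows a typical Rule 9.1 violation (an automatic variable read before it has been set) and its fix. The function is a hand-written, hypothetical example, not a sample from the evaluation dataset.

```c
#include <stddef.h>
#include <stdint.h>

/* Non-compliant (Rule 9.1): `sum` is read by `+=` before it has been set,
 * so the result is indeterminate.
 *
 *     uint32_t sum;
 *     for (size_t i = 0U; i < count; ++i) { sum += values[i]; }
 */

/* Compliant: the accumulator gets an explicit initial value before use. */
uint32_t sum_values(const uint32_t *values, size_t count)
{
    uint32_t sum = 0U;

    for (size_t i = 0U; i < count; ++i)
    {
        sum += values[i];
    }
    return sum;
}
```

The fix is a single initializer, and it also makes the empty-input case well defined: with `count == 0` the function returns 0 instead of an indeterminate value.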