Prompt Level Comparison

TAB Dataset — GPT-4o — 10 documents

Overall Metrics

Metric                Level 1 (Naive)  Level 2 (Intermediate)  Level 3 (CoT Expert)  Level 3_fix1 (CoT Expert)
Overall Recall        93.1%            98.5%                   95.0%                 96.9%
Word Retention        79.9%            79.2%                   79.9%                 79.5%
Structure Similarity  100.0%           100.0%                  100.0%                100.0%
Entities Masked       446              472                     455                   464
Entities Missed       33               7                       24                    15
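
The Overall Recall row is consistent with recall being computed as masked / (masked + missed). The following minimal sketch reproduces the percentages from the two count rows under that assumption. The helper `recall` and the `counts` dictionary are hypothetical names introduced for illustration and are not taken from compare_levels.py.

```python
# Counts copied from the Entities Masked / Entities Missed rows above.
# Assumption: recall = masked / (masked + missed); the report does not state its formula.
counts = {
    "Level 1 (Naive)": (446, 33),
    "Level 2 (Intermediate)": (472, 7),
    "Level 3 (CoT Expert)": (455, 24),
    "Level 3_fix1 (CoT Expert)": (464, 15),
}


def recall(masked: int, missed: int) -> float:
    """Fraction of annotated entities that were masked."""
    total = masked + missed
    return masked / total if total else 0.0


for level, (masked, missed) in counts.items():
    print(f"{level}: {recall(masked, missed):.1%} of {masked + missed} entities")
```

Every level sums to the same 479 annotated entities, so the recall figures are directly comparable across levels.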

Recall by Entity Type

CODE

Level 1 (Naive): 96.2% (51/53)
Level 2 (Intermediate): 96.2% (51/53)
Level 3 (CoT Expert): 86.8% (46/53)
Level 3_fix1 (CoT Expert): 86.8% (46/53)

DATETIME

Level 1 (Naive): 99.0% (192/194)
Level 2 (Intermediate): 100.0% (194/194)
Level 3 (CoT Expert): 99.0% (192/194)
Level 3_fix1 (CoT Expert): 99.5% (193/194)

DEM

Level 1 (Naive): 62.5% (20/32)
Level 2 (Intermediate): 100.0% (32/32)
Level 3 (CoT Expert): 96.9% (31/32)
Level 3_fix1 (CoT Expert): 100.0% (32/32)

LOC

Level 1 (Naive): 100.0% (21/21)
Level 2 (Intermediate): 100.0% (21/21)
Level 3 (CoT Expert): 100.0% (21/21)
Level 3_fix1 (CoT Expert): 100.0% (21/21)

MISC

Level 1 (Naive): 66.7% (26/39)
Level 2 (Intermediate): 92.3% (36/39)
Level 3 (CoT Expert): 87.2% (34/39)
Level 3_fix1 (CoT Expert): 94.9% (37/39)

ORG

Level 1 (Naive): 94.5% (52/55)
Level 2 (Intermediate): 98.2% (54/55)
Level 3 (CoT Expert): 85.5% (47/55)
Level 3_fix1 (CoT Expert): 92.7% (51/55)

PERSON

Level 1 (Naive): 98.6% (72/73)
Level 2 (Intermediate): 98.6% (72/73)
Level 3 (CoT Expert): 98.6% (72/73)
Level 3_fix1 (CoT Expert): 98.6% (72/73)

QUANTITY

Level 1 (Naive): 100.0% (12/12)
Level 2 (Intermediate): 100.0% (12/12)
Level 3 (CoT Expert): 100.0% (12/12)
Level 3_fix1 (CoT Expert): 100.0% (12/12)
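
As a cross-check, the per-type counts listed above can be summed to confirm that they add up to the overall masked totals (446, 472, 455, 464 out of 479). The sketch below does only that arithmetic. The `per_type` structure and variable names are illustrative assumptions and do not reflect how compare_levels.py stores its results.

```python
# Per-type counts copied from the "Recall by Entity Type" section.
# Each entry: (annotated total, [masked at Level 1, Level 2, Level 3, Level 3_fix1]).
levels = ["Level 1", "Level 2", "Level 3", "Level 3_fix1"]
per_type = {
    "CODE": (53, [51, 51, 46, 46]),
    "DATETIME": (194, [192, 194, 192, 193]),
    "DEM": (32, [20, 32, 31, 32]),
    "LOC": (21, [21, 21, 21, 21]),
    "MISC": (39, [26, 36, 34, 37]),
    "ORG": (55, [52, 54, 47, 51]),
    "PERSON": (73, [72, 72, 72, 72]),
    "QUANTITY": (12, [12, 12, 12, 12]),
}

# Total annotated entities across all types.
gold_total = sum(total for total, _ in per_type.values())

# Masked entities per level, summed over entity types.
masked_totals = [
    sum(masked[i] for _, masked in per_type.values())
    for i in range(len(levels))
]

for level, masked in zip(levels, masked_totals):
    print(f"{level}: {masked}/{gold_total} masked, {gold_total - masked} missed")
```

The sums match the Entities Masked and Entities Missed rows of the overall table, which suggests Overall Recall is a count-weighted (micro) aggregate over entity types and not an average of the per-type percentages.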
Generated by compare_levels.py