LLM Anonymization on the TAB Dataset

Adapting "Large Language Models are Advanced Anonymizers" (ICLR 2025) to the Text Anonymization Benchmark

Table of Contents

  1. Background & Motivation
  2. The Original Project
  3. The TAB Dataset
  4. How We Adapted the Pipeline
  5. Architecture & Code Structure
  6. Prompt Engineering (3 Levels)
  7. Evaluation Metrics
  8. Worked Example
  9. How to Run
  10. Summary

1 Background & Motivation

Text anonymization is a critical task in NLP and privacy engineering. The goal is to remove or replace personally identifiable information (PII) from text while preserving the document's meaning and utility.

Recent research has shown that Large Language Models (LLMs) like GPT-4 are surprisingly effective at this task, sometimes outperforming traditional rule-based and NER-based approaches.

Our Goal: Take an existing LLM anonymization system designed for Reddit comments and adapt it to the TAB (Text Anonymization Benchmark): a dataset of European court case documents with gold-standard PII annotations.

2 The Original Project

The base project implements the paper "Large Language Models are Advanced Anonymizers" (Staab et al., ICLR 2025). It works on Reddit comments and performs attribute-level anonymization.

Original Pipeline

Reddit Comments
Infer Attributes
(age, gender, income...)
LLM Anonymization
(rewrite text)
Utility Scoring
(BLEU, ROUGE, LLM)
Eval Attack
(can attributes still
be inferred?)

What It Does

Aspect          Details
Input           Reddit user's comment history
PII Types       Age, gender, income, location, education, occupation, relationship status
Anonymization   LLM rewrites comments to prevent attribute inference (e.g., "Zürich" → "an expensive city")
Evaluation      Run inference attacks on anonymized text; if the model can still guess attributes, anonymization failed
Key Insight     GPT-4 anonymization reduces correct attribute inference by ~60% while preserving text quality

3 The TAB Dataset

The Text Anonymization Benchmark (TAB) is an open-source corpus developed by Pilán et al. (2022) for evaluating text anonymization systems. It consists of 1,268 English-language court cases from the European Court of Human Rights (ECHR).

Dataset Structure

Split   Documents   Purpose
Train   ~1,000      Training anonymization models
Dev     ~70         Hyperparameter tuning
Test    127         Final evaluation

Entity Types Annotated

Type       Description              Example
PERSON     Names of individuals     Mr Galip Yalman
LOC        Locations                Ankara, Turkey
ORG        Organizations            European Commission of Human Rights
DATETIME   Dates and times          15 March 1999
CODE       Case/reference numbers   36110/97
DEM        Demographic info         Turkish, Kurdish
QUANTITY   Amounts, numbers         EUR 5,000
MISC       Other identifiers        Various

Annotation Labels

Each entity is labeled with an identifier type that indicates the masking decision:
  • DIRECT — Uniquely identifies a person → must always be masked
  • QUASI — Could identify in combination → should be masked
  • NO_MASK — Not identifying → keep as is

4 How We Adapted the Pipeline

The original project does attribute-level anonymization on short Reddit comments. TAB requires entity-level anonymization on long legal documents. Here is how we bridged the gap:

Key Differences

Aspect          Original (Reddit)                                TAB Adaptation (Court Cases)
Input           Short Reddit comments (~50–200 words)            Long court documents (~5,000 chars avg)
PII Type        Inferred attributes (age, gender...)             Named entities (PERSON, LOC, ORG...)
PII Source      LLM infers attributes from text                  Gold annotations provided in dataset
Anonymization   Generalize text ("Zürich" → "expensive city")    Replace entities with placeholders ([PERSON], [LOC]...)
Challenge       Prevent attribute inference                      Replace all PII while preserving legal reasoning
Evaluation      Adversarial inference attack                     Entity recall/precision vs gold standard

New Pipeline

TAB Court
Document
Chunk
Document
(≤3500 chars)
Feed Entities
+ Text to LLM
LLM Replaces
Entities with
Placeholders
Evaluate
Recall &
Preservation

What We Reused vs. What We Built New

Reused from Original

  • OpenAI API integration (openai.ChatCompletion)
  • Prompt engineering philosophy (3 levels)
  • Config-driven YAML approach
  • Retry logic for API errors
  • Incremental JSONL output format
  • Response parsing with # separator
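The reused `#` separator convention takes only a couple of lines to parse. A sketch, assuming the model follows the prompt and puts any explanation before the first `#` (the function name is ours, not the original project's):

```python
def extract_anonymized(response_text):
    """Return the text after the first '#'; fall back to the whole response."""
    before, sep, after = response_text.partition("#")
    return after.strip() if sep else before.strip()
```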

Built New for TAB

  • TAB data loader and parser
  • Document chunking (for long texts)
  • Entity-aware prompts (legal domain)
  • Entity-level evaluation metrics
  • Auto-download of TAB dataset
  • Resume support for long runs
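Resume support can be as simple as re-reading the incremental JSONL output and skipping documents already processed. A minimal sketch, assuming each result line carries a doc_id field (the helper name is hypothetical):

```python
import json
import os

def completed_doc_ids(results_path):
    """Collect doc_ids already written to the incremental JSONL output."""
    done = set()
    if os.path.exists(results_path):
        with open(results_path) as f:
            for line in f:
                if line.strip():
                    done.add(json.loads(line)["doc_id"])
    return done
```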

5 Architecture & Code Structure

Files Created

llm-anonymization/
├── run_tab.py              ← Main self-contained runner (all-in-one)
├── configs/anonymization/
│   └── tab.yaml           ← Configuration file
├── src/tab/               ← Module directory (NEW)
│   ├── __init__.py
│   ├── tab_loader.py     ← TAB dataset parser & downloader
│   ├── tab_anonymize.py  ← Anonymization pipeline
│   └── tab_evaluation.py ← Evaluation metrics
└── data/tab/              ← Downloaded TAB data (auto)
    ├── echr_train.json
    ├── echr_dev.json
    └── echr_test.json

Key Component: Data Loader

The TAB dataset uses a specific JSON format where annotations are nested under annotator keys:

// TAB JSON structure
{
  "doc_id": "001-61807",
  "text": "PROCEDURE\nThe case originated...",
  "annotations": {
    "annotator1": {
      "entity_mentions": [
        {
          "entity_type": "CODE",
          "span_text": "36110/97",
          "identifier_type": "DIRECT",
          "start_offset": 54,
          "end_offset": 62
        }
      ]
    }
  }
}

Our parser extracts all entity mentions from this nested structure:

def parse_document(doc_json):
    annotations = []
    for ann_key, ann_data in doc_json["annotations"].items():
        # annotations → annotatorN → entity_mentions → [list]
        for mention in ann_data.get("entity_mentions", []):
            annotations.append(EntityMention(
                entity_type=mention["entity_type"],
                span_text=mention["span_text"],
                identifier_type=mention["identifier_type"],
                start_offset=mention["start_offset"],
                end_offset=mention["end_offset"],
            ))
    return TABDocument(doc_id=doc_json["doc_id"],
                       text=doc_json["text"],
                       annotations=annotations)

Key Component: Document Chunking

Court documents average ~5,000 characters, which is too long for a single LLM prompt. We split at paragraph boundaries into chunks of ≤3,500 characters, tracking which annotations belong to each chunk:

def chunk_document(doc, max_chars=3500):
    if len(doc.text) <= max_chars:
        return [(doc.text, doc.annotations)]  # Fits in one chunk

    chunks, current, chunk_start, offset = [], [], 0, 0
    for paragraph in doc.text.split("\n"):
        if current and (offset + len(paragraph) - chunk_start) > max_chars:
            chunk_text = "\n".join(current)
            in_chunk = [a for a in doc.annotations
                        if chunk_start <= a.start_offset < chunk_start + len(chunk_text)]
            chunks.append((chunk_text, in_chunk))
            current, chunk_start = [], offset
        current.append(paragraph)
        offset += len(paragraph) + 1  # +1 for the newline removed by split
    if current:
        chunks.append(("\n".join(current),
                       [a for a in doc.annotations if a.start_offset >= chunk_start]))
    return chunks

6 Prompt Engineering (3 Levels)

Following the original paper's approach, we designed three prompt levels of increasing sophistication: Level 1 (naive), Level 2 (intermediate), and Level 3 (chain-of-thought + expert). The prompts live in src/tab/tab_anonymize.py.

Level 1 — Naive

System: "You are a helpful assistant that anonymizes legal documents by replacing personal identifiers with appropriate placeholders."

Instruction: Simply asks to replace identifiers with [PERSON], [LOC], etc.
Output: Write # then anonymized text.

Level 2 — Intermediate

System: Expert legal document anonymizer role. Emphasizes preserving structure, meaning, and legal reasoning.

Instruction: Specifies all 8 entity categories and tells the model to leave the surrounding text unchanged.
Output: Write # then anonymized text.

Level 3 — Advanced (Chain-of-Thought)

System: Expert with deep ECHR experience. Includes:
  • All 8 replacement categories with examples
  • DIRECT vs QUASI identifier distinction
  • Co-reference instructions ([PERSON_1], [PERSON_2])
  • Rules about what to keep vs replace
Input: Document excerpt + list of pre-identified entities from TAB annotations
Output: First explain changes, then # followed by anonymized text (Chain-of-Thought)

Prompt Structure (Level 3)

System Prompt:
  You are an expert legal document anonymizer...
  Replacement categories: [PERSON], [LOC], [ORG], [DATETIME]...
  Rules: Replace DIRECT always, QUASI when risky...

User Prompt:
  Header: Below is an excerpt from an ECHR case...
  Document excerpt:
    <actual text chunk>
  Identified entities:
    - "Mr Galip Yalman" (PERSON, DIRECT)
    - "36110/97" (CODE, DIRECT)
    - "Ankara" (LOC, QUASI)
  Footer: First explain, then write # and anonymized text.
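Assembled as OpenAI-style chat messages, the structure above looks roughly like this. The prompt wording is abbreviated and the entities are passed as plain dicts for brevity; the exact strings and types are in src/tab/tab_anonymize.py:

```python
def build_messages(system_prompt, chunk_text, entities):
    """Build the chat messages for one document chunk (Level 3 structure)."""
    entity_lines = "\n".join(
        f'- "{e["span_text"]}" ({e["entity_type"]}, {e["identifier_type"]})'
        for e in entities
    )
    user_prompt = (
        "Below is an excerpt from an ECHR case.\n\n"
        f"Document excerpt:\n{chunk_text}\n\n"
        f"Identified entities:\n{entity_lines}\n\n"
        "First explain your changes, then write # followed by the anonymized text."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
```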

7 Evaluation Metrics

We evaluate two aspects: privacy protection (did we mask the right things?) and text utility (is the text still useful?).

Privacy Metrics

Metric               What it Measures                                      How it's Computed
Entity Recall        % of PII entities successfully masked                 Check if original span text is absent from anonymized output
Precision Estimate   % of inserted placeholders that match real entities   Ratio of correctly masked entities to total placeholders
F1 Score             Harmonic mean of precision and recall                 Standard F1 formula
Per-Type Recall      Recall broken down by entity type                     Separate counts for PERSON, LOC, ORG, etc.
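Given the masked/missed counts and the number of inserted placeholders, the three headline scores fall out of the standard formulas. A sketch, assuming the precision estimate is computed as correctly masked entities over total placeholders as described above:

```python
def privacy_scores(masked, missed, placeholders_inserted):
    """Recall over gold entities, precision estimate over placeholders, and F1."""
    total = masked + missed
    recall = masked / total if total else 0.0
    precision = masked / placeholders_inserted if placeholders_inserted else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}
```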

Utility Metrics

Metric                 What it Measures                             How it's Computed
Word Retention         How much original vocabulary is preserved    Set intersection of original vs anonymized words
Structure Similarity   Whether paragraph structure is preserved     Ratio of paragraph counts
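Both utility metrics are cheap heuristics. A minimal sketch, assuming whitespace tokenization and newline-delimited paragraphs (function names are ours):

```python
def word_retention(original, anonymized):
    """Fraction of the original vocabulary that survives anonymization."""
    orig = set(original.lower().split())
    anon = set(anonymized.lower().split())
    return len(orig & anon) / len(orig) if orig else 1.0

def structure_similarity(original, anonymized):
    """Ratio of paragraph counts (smaller over larger)."""
    n_orig = len(original.split("\n"))
    n_anon = len(anonymized.split("\n"))
    return min(n_orig, n_anon) / max(n_orig, n_anon)
```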

Evaluation Logic

def evaluate_entity_detection(doc, anonymized_text):
    masked = missed = 0
    for entity in doc.entities_to_mask:
        # If the original span text STILL appears → missed
        if entity.span_text in anonymized_text:
            missed += 1
        else:
            masked += 1  # Successfully anonymized!

    recall = masked / (masked + missed)  # Higher = better privacy
    return recall

8 Worked Example

Here's what anonymization looks like on a real TAB document excerpt:

Original Text

The case originated in an application (no. 36110/97) against the Republic of Turkey lodged with the European Commission of Human Rights under former Article 25 of the Convention... The applicant, Mr Galip Yalman, is a Turkish national who was born in 1940 and lives in Ankara.

Anonymized Text

The case originated in an application (no. [CODE]) against the Republic of Turkey lodged with the European Commission of Human Rights under former Article 25 of the Convention... The applicant, [PERSON_1], is a [DEM] national who was born in [DATETIME] and lives in [LOC].

Result: All 5 identifying entities were replaced with appropriate category placeholders. Legal reasoning and document structure remain fully intact.

9 How to Run

Prerequisites

# Only two dependencies needed
pip install openai==0.28.1
pip install pyyaml  # only if using --config flag

Step 1: View Dataset Statistics (No API Key Needed)

cd llm-anonymization
python run_tab.py --stats_only --split test

Step 2: Run Anonymization

# Anonymize 5 documents with GPT-4o
python run_tab.py --model gpt-4o --split test --max_docs 5

# Compare prompt levels
python run_tab.py --model gpt-4o --prompt_level 1 --max_docs 10
python run_tab.py --model gpt-4o --prompt_level 3 --max_docs 10

# Use a config file
python run_tab.py --config configs/anonymization/tab.yaml

Step 3: Evaluate Results

python run_tab.py --evaluate --results_path anonymized_results/tab/results.jsonl --split test

Output Files

File                                     Content
anonymized_results/tab/results.jsonl     One JSON object per line: original, anonymized, ground truth, metadata
anonymized_results/tab/evaluation.json   Full evaluation with per-document and aggregate metrics

10 Summary

What We Did

  1. Studied the original LLM anonymization pipeline (Reddit attribute-level anonymization)
  2. Analyzed the TAB dataset format (ECHR court cases with entity annotations)
  3. Built a data loader that downloads and parses TAB's nested JSON format
  4. Designed entity-aware prompts at 3 sophistication levels for legal documents
  5. Implemented document chunking to handle long court cases within LLM context limits
  6. Created evaluation metrics (entity recall, precision, per-type analysis, text preservation)
  7. Wrapped everything in a self-contained script with resume support and auto-download

Key Technical Decisions

Decision                            Reason
Self-contained run_tab.py           Original project has heavy dependencies (PyTorch, sentence-transformers); TAB works with just openai.
3 prompt levels                     Mirrors the original paper's approach; allows comparing simple vs sophisticated prompts.
Feed ground-truth entities to LLM   TAB provides annotations; using them lets the LLM focus on replacement quality.
Paragraph-based chunking            Preserves semantic boundaries; doesn't split sentences.
Span-based recall evaluation        If the original entity text still appears in output, the entity was NOT masked; simple but effective.

Potential Extensions