LLM Anonymization on the TAB Dataset

Adapting "Large Language Models are Advanced Anonymizers" (ICLR 2025) to the Text Anonymization Benchmark

Table of Contents

  1. Background & Motivation
  2. The Original Project
  3. The TAB Dataset
  4. How We Adapted the Pipeline
  5. Architecture & Code Structure
  6. Prompt Engineering (3 Levels)
  7. Evaluation Metrics
  8. Worked Example
  9. How to Run
  10. Summary

1 Background & Motivation

Text anonymization is a critical task in NLP and privacy engineering. The goal is to remove or replace personally identifiable information (PII) from text while preserving the document's meaning and utility.

Recent research has shown that Large Language Models (LLMs) like GPT-4 are surprisingly effective at this task, sometimes outperforming traditional rule-based and NER-based approaches.

Our Goal: Take an existing LLM anonymization system designed for Reddit comments and adapt it to the TAB (Text Anonymization Benchmark): a dataset of European court case documents with gold-standard PII annotations.

2 The Original Project

The base project implements the paper "Large Language Models are Advanced Anonymizers" (Staab et al., ICLR 2025). It works on Reddit comments and performs attribute-level anonymization.

Original Pipeline

Reddit Comments
Infer Attributes
(age, gender, income...)
LLM Anonymization
(rewrite text)
Utility Scoring
(BLEU, ROUGE, LLM)
Eval Attack
(can attributes still
be inferred?)

What It Does

Aspect          Details
Input           Reddit user's comment history
PII Types       Age, gender, income, location, education, occupation, relationship status
Anonymization   LLM rewrites comments to prevent attribute inference (e.g., "Zürich" → "an expensive city")
Evaluation      Run inference attacks on anonymized text; if the model can still guess attributes, anonymization failed
Key Insight     GPT-4 anonymization reduces correct attribute inference by ~60% while preserving text quality

3 The TAB Dataset

The Text Anonymization Benchmark (TAB) is an open-source corpus developed by Pilán et al. (2022) for evaluating text anonymization systems. It consists of 1,268 English-language court cases from the European Court of Human Rights (ECHR).

Dataset Structure

Split   Documents   Purpose
Train   ~1,000      Training anonymization models
Dev     ~70         Hyperparameter tuning
Test    127         Final evaluation

Entity Types Annotated

Type       Description              Example
PERSON     Names of individuals     Mr Galip Yalman
LOC        Locations                Ankara, Turkey
ORG        Organizations            European Commission of Human Rights
DATETIME   Dates and times          15 March 1999
CODE       Case/reference numbers   36110/97
DEM        Demographic info         Turkish, Kurdish
QUANTITY   Amounts, numbers         EUR 5,000
MISC       Other identifiers        Various

Annotation Labels

Each entity is labeled with an identifier type that indicates the masking decision:
  • DIRECT — Uniquely identifies a person → must always be masked
  • QUASI — Could identify in combination → should be masked
  • NO_MASK — Not identifying → keep as is

4 How We Adapted the Pipeline

The original project does attribute-level anonymization on short Reddit comments. TAB requires entity-level anonymization on long legal documents. Here is how we bridged the gap:

Key Differences

Aspect          Original (Reddit)                                TAB Adaptation (Court Cases)
Input           Short Reddit comments (~50–200 words)            Long court documents (~5,000 chars avg)
PII Type        Inferred attributes (age, gender...)             Named entities (PERSON, LOC, ORG...)
PII Source      LLM infers attributes from text                  Gold annotations provided in dataset
Anonymization   Generalize text ("Zürich" → "expensive city")    Replace entities with placeholders ([PERSON], [LOC]...)
Challenge       Prevent attribute inference                      Replace all PII while preserving legal reasoning
Evaluation      Adversarial inference attack                     Entity recall/precision vs gold standard

New Pipeline

TAB Court
Document
Chunk
Document
(≤3500 chars)
Feed Entities
+ Text to LLM
LLM Replaces
Entities with
Placeholders
Evaluate
Recall &
Preservation

What We Reused vs. What We Built New

Reused from Original

  • OpenAI API integration (openai.ChatCompletion)
  • Prompt engineering philosophy (3 levels)
  • Config-driven YAML approach
  • Retry logic for API errors
  • Incremental JSONL output format
  • Response parsing with # separator
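The reused `#` separator convention takes only a couple of lines to parse. A sketch, assuming the model follows the prompt and puts any explanation before the first `#` (the function name is ours, not the original project's):

```python
def extract_anonymized(response_text):
    """Return the text after the first '#'; fall back to the whole response."""
    before, sep, after = response_text.partition("#")
    return after.strip() if sep else before.strip()
```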

Built New for TAB

  • TAB data loader and parser
  • Document chunking (for long texts)
  • Entity-aware prompts (legal domain)
  • Entity-level evaluation metrics
  • Auto-download of TAB dataset
  • Resume support for long runs
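Resume support can be as simple as re-reading the incremental JSONL output and skipping documents already processed. A minimal sketch, assuming each result line carries a doc_id field (the helper name is hypothetical):

```python
import json
import os

def completed_doc_ids(results_path):
    """Collect doc_ids already written to the incremental JSONL output."""
    done = set()
    if os.path.exists(results_path):
        with open(results_path) as f:
            for line in f:
                if line.strip():
                    done.add(json.loads(line)["doc_id"])
    return done
```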

5 Architecture & Code Structure

Files Created

llm-anonymization/
├── run_tab.py              ← Main self-contained runner (all-in-one)
├── configs/anonymization/
│   └── tab.yaml           ← Configuration file
├── src/tab/               ← Module directory (NEW)
│   ├── __init__.py
│   ├── tab_loader.py     ← TAB dataset parser & downloader
│   ├── tab_anonymize.py  ← Anonymization pipeline
│   └── tab_evaluation.py ← Evaluation metrics
└── data/tab/              ← Downloaded TAB data (auto)
    ├── echr_train.json
    ├── echr_dev.json
    └── echr_test.json

Key Component: Data Loader

The TAB dataset uses a specific JSON format where annotations are nested under annotator keys:

// TAB JSON structure
{
  "doc_id": "001-61807",
  "text": "PROCEDURE\nThe case originated...",
  "annotations": {
    "annotator1": {
      "entity_mentions": [
        {
          "entity_type": "CODE",
          "span_text": "36110/97",
          "identifier_type": "DIRECT",
          "start_offset": 54,
          "end_offset": 62
        }
      ]
    }
  }
}

Our parser extracts all entity mentions from this nested structure:

def parse_document(doc_json):
    annotations = []
    for ann_key, ann_data in doc_json["annotations"].items():
        # annotations → annotatorN → entity_mentions → [list]
        for mention in ann_data.get("entity_mentions", []):
            annotations.append(EntityMention(
                entity_type=mention["entity_type"],
                span_text=mention["span_text"],
                identifier_type=mention["identifier_type"],
                start_offset=mention["start_offset"],
                end_offset=mention["end_offset"],
            ))
    return TABDocument(doc_id=doc_json["doc_id"],
                       text=doc_json["text"],
                       annotations=annotations)

Key Component: Document Chunking

Court documents average ~5,000 characters, which is too long for a single LLM prompt. We split at paragraph boundaries into chunks of ≤3,500 characters, tracking which annotations belong to each chunk:

def chunk_document(doc, max_chars=3500):
    if len(doc.text) <= max_chars:
        return [(doc.text, doc.annotations)]  # Fits in one chunk

    chunks, current, chunk_start, offset = [], [], 0, 0
    for paragraph in doc.text.split("\n"):
        if current and (offset + len(paragraph) - chunk_start) > max_chars:
            chunk_text = "\n".join(current)
            in_chunk = [a for a in doc.annotations
                        if chunk_start <= a.start_offset < chunk_start + len(chunk_text)]
            chunks.append((chunk_text, in_chunk))
            current, chunk_start = [], offset
        current.append(paragraph)
        offset += len(paragraph) + 1  # +1 for the newline removed by split
    if current:
        chunks.append(("\n".join(current),
                       [a for a in doc.annotations if a.start_offset >= chunk_start]))
    return chunks

6 Prompt Engineering (3 Levels)

Following the original paper's approach, we designed three prompt levels of increasing sophistication: Level 1 (naive), Level 2 (intermediate), and Level 3 (chain-of-thought + expert). The prompts live in src/tab/tab_anonymize.py.

Level 1 — Naive

System: "You are a helpful assistant that anonymizes legal documents by replacing personal identifiers with appropriate placeholders."

Instruction: Simply asks to replace identifiers with [PERSON], [LOC], etc.
Output: Write # then anonymized text.

Level 2 — Intermediate

System: Expert legal document anonymizer role. Emphasizes preserving structure, meaning, and legal reasoning.

Instruction: Specifies all 8 entity categories and tells the model to leave the surrounding text unchanged.
Output: Write # then anonymized text.

Level 3 — Advanced (Chain-of-Thought)

System: Expert with deep ECHR experience. Includes:
  • All 8 replacement categories with examples
  • DIRECT vs QUASI identifier distinction
  • Co-reference instructions ([PERSON_1], [PERSON_2])
  • Rules about what to keep vs replace
Input: Document excerpt + list of pre-identified entities from TAB annotations
Output: First explain changes, then # followed by anonymized text (Chain-of-Thought)

Prompt Structure (Level 3)

System Prompt:
  You are an expert legal document anonymizer...
  Replacement categories: [PERSON], [LOC], [ORG], [DATETIME]...
  Rules: Replace DIRECT always, QUASI when risky...

User Prompt:
  Header: Below is an excerpt from an ECHR case...
  Document excerpt:
    <actual text chunk>
  Identified entities:
    - "Mr Galip Yalman" (PERSON, DIRECT)
    - "36110/97" (CODE, DIRECT)
    - "Ankara" (LOC, QUASI)
  Footer: First explain, then write # and anonymized text.
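Assembled as OpenAI-style chat messages, the structure above looks roughly like this. The prompt wording is abbreviated and the entities are passed as plain dicts for brevity; the exact strings and types are in src/tab/tab_anonymize.py:

```python
def build_messages(system_prompt, chunk_text, entities):
    """Build the chat messages for one document chunk (Level 3 structure)."""
    entity_lines = "\n".join(
        f'- "{e["span_text"]}" ({e["entity_type"]}, {e["identifier_type"]})'
        for e in entities
    )
    user_prompt = (
        "Below is an excerpt from an ECHR case.\n\n"
        f"Document excerpt:\n{chunk_text}\n\n"
        f"Identified entities:\n{entity_lines}\n\n"
        "First explain your changes, then write # followed by the anonymized text."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
```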

7 Evaluation Metrics

We evaluate two aspects: privacy protection (did we mask the right things?) and text utility (is the text still useful?).

Privacy Metrics

Metric               What it Measures                                      How it's Computed
Entity Recall        % of PII entities successfully masked                 Check if original span text is absent from anonymized output
Precision Estimate   % of inserted placeholders that match real entities   Ratio of correctly masked entities to total placeholders
F1 Score             Harmonic mean of precision and recall                 Standard F1 formula
Per-Type Recall      Recall broken down by entity type                     Separate counts for PERSON, LOC, ORG, etc.
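Given the masked/missed counts and the number of inserted placeholders, the three headline scores fall out of the standard formulas. A sketch, assuming the precision estimate is computed as correctly masked entities over total placeholders as described above:

```python
def privacy_scores(masked, missed, placeholders_inserted):
    """Recall over gold entities, precision estimate over placeholders, and F1."""
    total = masked + missed
    recall = masked / total if total else 0.0
    precision = masked / placeholders_inserted if placeholders_inserted else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}
```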

Utility Metrics

Metric                 What it Measures                             How it's Computed
Word Retention         How much original vocabulary is preserved    Set intersection of original vs anonymized words
Structure Similarity   Whether paragraph structure is preserved     Ratio of paragraph counts
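Both utility metrics are cheap heuristics. A minimal sketch, assuming whitespace tokenization and newline-delimited paragraphs (function names are ours):

```python
def word_retention(original, anonymized):
    """Fraction of the original vocabulary that survives anonymization."""
    orig = set(original.lower().split())
    anon = set(anonymized.lower().split())
    return len(orig & anon) / len(orig) if orig else 1.0

def structure_similarity(original, anonymized):
    """Ratio of paragraph counts (smaller over larger)."""
    n_orig = len(original.split("\n"))
    n_anon = len(anonymized.split("\n"))
    return min(n_orig, n_anon) / max(n_orig, n_anon)
```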

Evaluation Logic

def evaluate_entity_detection(doc, anonymized_text):
    masked = missed = 0
    for entity in doc.entities_to_mask:
        # If the original span text STILL appears → missed
        if entity.span_text in anonymized_text:
            missed += 1
        else:
            masked += 1  # Successfully anonymized!

    recall = masked / (masked + missed)  # Higher = better privacy
    return recall

8 Worked Example

Here's what anonymization looks like on a real TAB document excerpt:

Original Text

The case originated in an application (no. 36110/97) against the Republic of Turkey lodged with the European Commission of Human Rights under former Article 25 of the Convention... The applicant, Mr Galip Yalman, is a Turkish national who was born in 1940 and lives in Ankara.

Anonymized Text

The case originated in an application (no. [CODE]) against the Republic of Turkey lodged with the European Commission of Human Rights under former Article 25 of the Convention... The applicant, [PERSON_1], is a [DEM] national who was born in [DATETIME] and lives in [LOC].

Result: All 5 identifying entities were replaced with appropriate category placeholders. Legal reasoning and document structure remain fully intact.

9 How to Run

Prerequisites

# Only two dependencies needed
pip install openai==0.28.1
pip install pyyaml  # only if using --config flag

Step 1: View Dataset Statistics (No API Key Needed)

cd llm-anonymization
python run_tab.py --stats_only --split test

Step 2: Run Anonymization

# Anonymize 5 documents with GPT-4o
python run_tab.py --model gpt-4o --split test --max_docs 5

# Compare prompt levels
python run_tab.py --model gpt-4o --prompt_level 1 --max_docs 10
python run_tab.py --model gpt-4o --prompt_level 3 --max_docs 10

# Use a config file
python run_tab.py --config configs/anonymization/tab.yaml

Step 3: Evaluate Results

python run_tab.py --evaluate --results_path anonymized_results/tab/results.jsonl --split test

Output Files

File                                     Content
anonymized_results/tab/results.jsonl     One JSON object per line: original, anonymized, ground truth, metadata
anonymized_results/tab/evaluation.json   Full evaluation with per-document and aggregate metrics

10 Summary

What We Did

  1. Studied the original LLM anonymization pipeline (Reddit attribute-level anonymization)
  2. Analyzed the TAB dataset format (ECHR court cases with entity annotations)
  3. Built a data loader that downloads and parses TAB's nested JSON format
  4. Designed entity-aware prompts at 3 sophistication levels for legal documents
  5. Implemented document chunking to handle long court cases within LLM context limits
  6. Created evaluation metrics (entity recall, precision, per-type analysis, text preservation)
  7. Wrapped everything in a self-contained script with resume support and auto-download

Key Technical Decisions

Decision                            Reason
Self-contained run_tab.py           Original project has heavy dependencies (PyTorch, sentence-transformers); TAB works with just openai.
3 prompt levels                     Mirrors the original paper's approach; allows comparing simple vs sophisticated prompts.
Feed ground-truth entities to LLM   TAB provides annotations; using them lets the LLM focus on replacement quality.
Paragraph-based chunking            Preserves semantic boundaries; doesn't split sentences.
Span-based recall evaluation        If the original entity text still appears in output, the entity was NOT masked; simple but effective.

Potential Extensions