Detect PII / PHI in a Document Corpus
Skill: Convert document text into a structured PII/PHI entity report
Region: United States
Category: Legal / eDiscovery
Does: Takes a document text corpus and produces a structured PII/PHI entity-detection report that classifies Social Security numbers, financial-account numbers, dates of birth, medical-record numbers, biometrics, and contact data per the HIPAA Safe Harbor and CCPA categories, to drive redaction, privacy review, and protective-order compliance.
Authority: NIST SP 800-122 (PII) · HIPAA Safe Harbor §164.514(b)(2) (18 identifiers) · CCPA §1798.140(v) · FRCP 5.2 (privacy redaction)
This detects and classifies sensitive identifiers; it does not itself redact (see build-redaction-instruction-set). Treat detection as decision support that requires human review: false negatives create exposure, and false positives waste review time. Store detected values in the report as hashes and offsets, never as cleartext.
When this applies
- Before production, to find personal data requiring FRCP 5.2 redaction (SSNs, financial accounts, DOB, minors' names).
- On HIPAA-regulated collections, to locate PHI (the 18 Safe Harbor identifiers) for de-identification or protective-order handling.
- For CCPA/GDPR cross-border productions, to quantify and locate personal data subject to minimization.
HIPAA Safe Harbor — the 18 identifiers (§164.514(b)(2))
Names · Geographic subdivisions smaller than a state · All date elements (except year) tied to an individual ·
Phone · Fax · Email · SSN · Medical record number · Health-plan beneficiary number ·
Account numbers · Certificate/license numbers · Vehicle identifiers and license plates ·
Device identifiers and serial numbers · URLs · IP addresses · Biometric identifiers (finger and voice prints) ·
Full-face photos and comparable images · Any other unique identifying number or code
Detect these plus NIST SP 800-122 PII (e.g. SSN, passport, financial accounts) and CCPA §1798.140(v) categories (identifiers, financial, biometric, geolocation, etc.).
Output structure (JSON per entity)
{
  "document_id": "DOC000123",
  "bates": "ABC000451",
  "page": 1,
  "span": { "start": 1204, "end": 1215 },
  "entity_type": "US_SSN",
  "entity_value_hash": "sha256:9f86d0...",
  "value_redacted_preview": "***-**-6789",
  "confidence": 0.98,
  "detector": "regex+checksum",
  "jurisdiction_flags": ["FRCP_5.2", "HIPAA_SafeHarbor", "CCPA"],
  "redaction_priority": "high"
}
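As a minimal sketch of how such a record might be assembled without ever writing the cleartext value into the report, the Python below builds one entity for an SSN hit. The function names, the `key` parameter, and the `hmac-sha256:` prefix are illustrative assumptions, not part of this skill's defined interface. A keyed hash is used here because an unkeyed SHA-256 of a nine-digit SSN can be reversed by enumerating all one billion candidates; whether a keyed or plain hash is appropriate is a decision for the case team.

```python
import hashlib
import hmac
import json


def mask_ssn(value: str) -> str:
    """Return a masked preview that keeps only the last four digits."""
    digits = "".join(c for c in value if c.isdigit())
    return "***-**-" + digits[-4:]


def hash_value(value: str, key: bytes) -> str:
    """Keyed one-way hash of the digits-only value (hypothetical label prefix)."""
    normalized = "".join(c for c in value if c.isdigit())
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return "hmac-sha256:" + digest


def build_entity_record(document_id, bates, page, start, end, raw_value, key):
    """Assemble one report entry; raw_value is used only to derive hash and preview."""
    return {
        "document_id": document_id,
        "bates": bates,
        "page": page,
        "span": {"start": start, "end": end},
        "entity_type": "US_SSN",
        "entity_value_hash": hash_value(raw_value, key),
        "value_redacted_preview": mask_ssn(raw_value),
        "confidence": 0.98,            # illustrative score, not a calibrated value
        "detector": "regex+checksum",
        "jurisdiction_flags": ["FRCP_5.2", "HIPAA_SafeHarbor", "CCPA"],
        "redaction_priority": "high",
    }


# Usage with a well-known dummy SSN; the key would come from a secrets store.
record = build_entity_record(
    "DOC000123", "ABC000451", 1, 1204, 1215, "123-45-6789", key=b"example-key"
)
print(json.dumps(record, indent=2))
```

Because the hash is computed over the digits-only form, "123-45-6789" and "123456789" map to the same hash, which lets the same identifier be matched across documents without exposing it.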
Detection rules
- Layer detectors: pattern/regex + validation (SSN area/group sanity, Luhn check on card numbers, ABA routing checksum) + context keywords ("SSN", "DOB", "account no.") + NER for names/addresses. Validation cuts false positives sharply.
- Classify by jurisdiction: tag each hit with the regimes that implicate it (FRCP_5.2, HIPAA_SafeHarbor, CCPA, GDPR); the same SSN may carry multiple flags and drive different handling.
- Assign redaction priority: high (SSN, full financial account, biometrics, minor's identity), medium (DOB, MRN, partial account, address), low (business contact info that may be non-sensitive).
- Never store cleartext sensitive values in the report — keep a one-way hash plus a masked preview and character offsets so redaction can locate them.
- Record provenance (detector, confidence) so reviewers can prioritize low-confidence hits and the process is defensible.
- Treat output as a review queue, not an automatic action: counsel confirms before redaction or production.
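The layering described in these rules can be sketched as follows. This is a simplified illustration, not a production detector: the function names, the 40-character context window, and the confidence scores are assumptions, only the hyphenated SSN format is matched, and the NER layer is omitted. The SSN check rejects values the Social Security Administration does not issue (area 000, 666, or 900 to 999; group 00; serial 0000); the Luhn and ABA functions implement the standard checksums for payment-card and routing numbers.

```python
import re

SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")
CONTEXT_RE = re.compile(r"\b(ssn|social security)\b", re.IGNORECASE)


def ssn_is_plausible(area: str, group: str, serial: str) -> bool:
    """Structural sanity check on the three SSN parts."""
    if area in ("000", "666") or area >= "900":
        return False
    return group != "00" and serial != "0000"


def luhn_ok(number: str) -> bool:
    """Luhn checksum, as used on payment-card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def aba_ok(routing: str) -> bool:
    """ABA routing-number checksum with weights 3, 7, 1 repeated."""
    digits = [int(c) for c in routing if c.isdigit()]
    if len(digits) != 9:
        return False
    weights = (3, 7, 1) * 3
    return sum(w * d for w, d in zip(weights, digits)) % 10 == 0


def detect_ssns(text: str, window: int = 40):
    """Pattern match, then validation, then a context-keyword boost."""
    hits = []
    for m in SSN_RE.finditer(text):
        if not ssn_is_plausible(*m.groups()):
            continue            # validation layer drops impossible values
        before = text[max(0, m.start() - window):m.start()]
        has_context = bool(CONTEXT_RE.search(before))
        hits.append({
            "span": {"start": m.start(), "end": m.end()},
            "entity_type": "US_SSN",
            "detector": "regex+validation+context" if has_context else "regex+validation",
            "confidence": 0.98 if has_context else 0.80,  # illustrative, uncalibrated
        })
    return hits


sample = "Patient SSN: 123-45-6789; ref 000-12-3456; order 987-65-4321."
for hit in detect_ssns(sample):
    print(hit["span"], hit["detector"], hit["confidence"])
```

In the sample string, only the first number survives validation; the other two match the pattern but fail the area check, which is the kind of false positive the validation layer is meant to remove. Hits without a nearby keyword are kept at lower confidence so they land in the human review queue instead of being dropped.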
Worked example (one document, summarized)
DOC000123 (Bates ABC000451–453) — 4 entities detected:
p1 US_SSN "***-**-6789" conf 0.98 [FRCP_5.2, HIPAA, CCPA] → high
p1 DOB "**/**/1984" conf 0.95 [FRCP_5.2, HIPAA, CCPA] → medium
p2 FINANCIAL_ACCOUNT "****1234" (Luhn ok) conf 0.92 [FRCP_5.2, CCPA] → high
p3 EMAIL "j***@acme.com" conf 0.99 [HIPAA, CCPA] → low
→ feeds build-redaction-instruction-set: redact the SSN (last four digits may remain), redact the full account number, truncate the DOB to year only
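One way to turn the four entities above into a review queue is to sort by redaction priority and, within each priority, put the least certain hits first so reviewers see them early. The sketch below is an assumption about ordering, not a prescribed workflow; `PRIORITY_RANK` and `review_queue` are hypothetical names.

```python
# Lower rank sorts first; labels follow the redaction_priority field.
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}

entities = [
    {"page": 1, "entity_type": "US_SSN", "confidence": 0.98, "redaction_priority": "high"},
    {"page": 1, "entity_type": "DOB", "confidence": 0.95, "redaction_priority": "medium"},
    {"page": 2, "entity_type": "FINANCIAL_ACCOUNT", "confidence": 0.92, "redaction_priority": "high"},
    {"page": 3, "entity_type": "EMAIL", "confidence": 0.99, "redaction_priority": "low"},
]


def review_queue(items):
    """Order by priority, then by ascending confidence within a priority."""
    return sorted(
        items,
        key=lambda e: (PRIORITY_RANK[e["redaction_priority"]], e["confidence"]),
    )


for e in review_queue(entities):
    print(e["redaction_priority"], e["entity_type"], e["confidence"])
```

With this ordering the financial account (0.92) is reviewed before the SSN (0.98), followed by the DOB and then the email address. Nothing in the queue is redacted automatically; it only sets the order in which counsel confirms each hit.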
Validation checklist
- Detectors layered (pattern + checksum/validation + context + NER); high-risk types checksum-validated
- HIPAA Safe Harbor 18 identifiers covered where PHI is in scope; NIST/CCPA categories covered
- Each entity carries document_id, bates, page, offset span, type, hashed value, confidence, detector
- Jurisdiction flags assigned (FRCP 5.2 / HIPAA / CCPA / GDPR) and redaction priority set
- No cleartext sensitive values stored; only hash + masked preview
- Low-confidence and context-only hits routed for human review (not auto-redacted)
- Output structured for ingestion by the redaction-instruction step and by review platforms (Relativity, DISCO, Everlaw)
- Sampling/QC done to estimate recall (missed identifiers) before relying on results for production
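For the sampling/QC item, recall can be estimated from a manually reviewed random sample: reviewers mark every identifier they find, and each one is counted as caught or missed by the detector. The sketch below reports the point estimate with a Wilson score interval. The choice of the Wilson interval, the 95% level (z = 1.96), and the function name are assumptions for illustration; the acceptable recall threshold is a matter for counsel and any governing protocol.

```python
import math


def recall_with_wilson_interval(true_positives: int, false_negatives: int, z: float = 1.96):
    """Return (recall, lower, upper) using the Wilson score interval."""
    n = true_positives + false_negatives
    if n == 0:
        raise ValueError("sample contains no confirmed identifiers")
    p = true_positives / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)


# Example: reviewers confirmed 200 identifiers in the sample; the detector caught 190.
recall, low, high = recall_with_wilson_interval(true_positives=190, false_negatives=10)
print(f"recall={recall:.3f}, 95% interval=({low:.3f}, {high:.3f})")
```

In this example the point estimate is 0.95, with an interval of roughly 0.91 to 0.97. The lower bound is the more conservative figure to weigh when deciding whether the detector's misses are tolerable for production.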
Last updated: 2026-05-31 — detection is decision-support requiring attorney review; confirm the identifier set and handling against current NIST SP 800-122, HIPAA §164.514(b)(2), CCPA §1798.140(v), FRCP 5.2, and any case-specific protective order before production.