
Detect PII / PHI in a Document Corpus

Skill: Convert document text into a structured PII/PHI entity report

Region: United States
Category: Legal / eDiscovery
Does: Takes a document text corpus and produces a structured PII/PHI entity-detection report. The report classifies Social Security numbers, financial-account numbers, dates of birth, medical-record numbers, biometrics, and contact data per the HIPAA Safe Harbor and CCPA categories, to drive redaction, privacy review, and protective-order compliance.
Authority: NIST SP 800-122 (PII) · HIPAA Safe Harbor §164.514(b)(2) (18 identifiers) · CCPA §1798.140(v) · FRCP 5.2 (privacy redaction)

This skill detects and classifies sensitive identifiers; it does not itself redact (see build-redaction-instruction-set). Treat detection as decision support subject to human review: false negatives create exposure, and false positives waste review time. In the report, store detected values as hashes and offsets (with at most a masked preview), never as cleartext.


When this applies


HIPAA Safe Harbor — the 18 identifiers (§164.514(b)(2))

Names · Geographic subdivisions smaller than a state · All date elements (except year)
tied to an individual · Phone · Fax · Email · SSN · Medical record number ·
Health-plan beneficiary number · Account numbers · Certificate/license numbers ·
Vehicle identifiers/plates · Device identifiers/serials · URLs · IP addresses ·
Biometric identifiers (finger/voice prints) · Full-face photos & comparable images ·
Any other unique identifying number/code

Detect these plus NIST SP 800-122 PII (e.g. SSN, passport, financial accounts) and CCPA §1798.140(v) categories (identifiers, financial, biometric, geolocation, etc.).
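
A first detection pass over a few of the pattern-shaped identifiers above (SSN, email, IP address) can be sketched with regular expressions. This is a minimal illustration, not the skill's actual rule set: the `PATTERNS` table and `find_candidates` helper are hypothetical names, the patterns are deliberately narrow, and identifiers such as names or biometric data need other detectors. The SSN pattern rejects area numbers 000, 666, and 900–999, group 00, and serial 0000, which are never issued.

```python
import re

# Hypothetical pattern table: illustrative only, not an exhaustive rule set.
PATTERNS = {
    # Dashed SSN; excludes area 000/666/9xx, group 00, serial 0000.
    "US_SSN": re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b"),
    # Simplified email shape; real-world addresses can be more exotic.
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    # Dotted-quad IPv4 with each octet limited to 0-255.
    "IP_ADDRESS": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
}


def find_candidates(text):
    """Return (entity_type, start, end) tuples, ordered by start offset."""
    hits = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((entity_type, match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    sample = "SSN 123-45-6789, contact jane@example.com from 10.0.0.1"
    for entity_type, start, end in find_candidates(sample):
        # Print type and offsets only; the matched value stays out of the log.
        print(entity_type, start, end)
```

Returning offsets rather than matched strings keeps cleartext out of intermediate results; every hit is still a candidate that needs human review.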


Output structure (JSON per entity)

{
  "document_id": "DOC000123",
  "bates": "ABC000451",
  "page": 3,
  "span": { "start": 1204, "end": 1215 },
  "entity_type": "US_SSN",
  "entity_value_hash": "sha256:9f86d0...",
  "value_redacted_preview": "***-**-6789",
  "confidence": 0.98,
  "detector": "regex+checksum",
  "jurisdiction_flags": ["FRCP_5.2", "HIPAA_SafeHarbor", "CCPA"],
  "redaction_priority": "high"
}
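
One way to assemble a record of this shape without ever writing the cleartext value into it is sketched below. The helper names `mask_preview` and `build_entity_record` are assumptions made for illustration, and the "keep the last four characters" masking rule is inferred from the sample preview rather than stated by the source.

```python
import hashlib
import json


def mask_preview(value, keep_last=4):
    """Mask letters and digits, leaving separators and the last characters."""
    if len(value) <= keep_last:
        # Too short to show a tail safely: mask everything.
        return "*" * len(value)
    head, tail = value[:-keep_last], value[-keep_last:]
    return "".join("*" if ch.isalnum() else ch for ch in head) + tail


def build_entity_record(text, start, end, *, document_id, bates, page,
                        entity_type, confidence, detector,
                        jurisdiction_flags, redaction_priority):
    """Build one report entry; the raw value is hashed and never stored."""
    value = text[start:end]
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return {
        "document_id": document_id,
        "bates": bates,
        "page": page,
        "span": {"start": start, "end": end},
        "entity_type": entity_type,
        "entity_value_hash": "sha256:" + digest,
        "value_redacted_preview": mask_preview(value),
        "confidence": confidence,
        "detector": detector,
        "jurisdiction_flags": list(jurisdiction_flags),
        "redaction_priority": redaction_priority,
    }


if __name__ == "__main__":
    page_text = "SSN 123-45-6789 on file"
    record = build_entity_record(
        page_text, 4, 15,
        document_id="DOC000123", bates="ABC000451", page=3,
        entity_type="US_SSN", confidence=0.98, detector="regex+checksum",
        jurisdiction_flags=["FRCP_5.2", "HIPAA_SafeHarbor", "CCPA"],
        redaction_priority="high",
    )
    print(json.dumps(record, indent=2))
```

A plain SHA-256 of a low-entropy value such as an SSN can be recovered by enumerating the value space, so a keyed hash (for example `hmac` with a secret held outside the report) may be a safer choice where the report itself is shared.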

Detection rules


Worked example (one document, summarized)

DOC000123 (Bates ABC000451–453) — 4 entities detected:
  p1  US_SSN             "***-**-6789"    conf 0.98  [FRCP_5.2, HIPAA, CCPA]  → high
  p1  DOB                "**/**/1984"     conf 0.95  [HIPAA, CCPA]            → medium
  p2  FINANCIAL_ACCOUNT  "****1234"       conf 0.92  [FRCP_5.2, CCPA]         → high  (Luhn ok)
  p3  EMAIL              "j***@acme.com"  conf 0.99  [HIPAA, CCPA]            → low
→ feeds build-redaction-instruction-set: redact the SSN (last four digits may remain), redact the full account number, reduce the DOB to year only
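
The "Luhn ok" note in the example refers to a Luhn checksum pass. A minimal sketch of that check follows; `luhn_valid` is a hypothetical helper name. Luhn validation is meaningful for payment card numbers, while bank account numbers generally carry no Luhn check digit, so a failed check lowers confidence only for card-shaped candidates.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Ignore spaces, dashes, and any other non-digit separators.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


if __name__ == "__main__":
    # A widely published test card number, not a real account.
    print(luhn_valid("4111 1111 1111 1111"))
```

A passing checksum only shows that the digit string is well formed; it does not confirm that the number belongs to a live account, so the hit still goes to review.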

Validation checklist


Last updated: 2026-05-31 — detection is decision-support requiring attorney review; confirm the identifier set and handling against current NIST SP 800-122, HIPAA §164.514(b)(2), CCPA §1798.140(v), FRCP 5.2, and any case-specific protective order before production.