What does the Detect PII / PHI in a Document Corpus skill do?

It converts document text into a structured PII/PHI entity-detection report covering SSNs, financial accounts, dates of birth, medical-record numbers, biometrics, and contact data, classified per HIPAA Safe Harbor (18 identifiers), NIST SP 800-122, CCPA, and FRCP 5.2. Each detected entity carries a hashed value, character offsets, jurisdiction flags, and a redaction priority for the downstream redaction step.

How do I run the Detect PII / PHI in a Document Corpus skill?

Copy the skill and paste it into your own Claude or ChatGPT, then provide your source documents. FinchContext skills run inside your own AI assistant, so your data never leaves your control.

Which region does the Detect PII / PHI in a Document Corpus skill cover?

United States. It targets the U.S. privacy and redaction authorities described on this page (NIST SP 800-122, HIPAA Safe Harbor, CCPA, and FRCP 5.2).

Detect PII / PHI in a Document Corpus

Skill: Convert document text into a structured PII/PHI entity report

Region: United States
Category: Legal / eDiscovery
Does: Takes a document text corpus and produces a structured PII/PHI entity-detection report (classifying Social Security numbers, financial-account numbers, dates of birth, medical-record numbers, biometrics, and contact data per the HIPAA Safe Harbor and CCPA categories) to drive redaction, privacy review, and protective-order compliance.
Authority: NIST SP 800-122 (PII) · HIPAA Safe Harbor §164.514(b)(2) (18 identifiers) · CCPA §1798.140(v) · FRCP 5.2 (privacy redaction)

This skill detects and classifies sensitive identifiers; it does not itself redact (see build-redaction-instruction-set). Treat detection as decision support with human review: false negatives create exposure, and false positives waste review time. In the report, store detected values as hashes and offsets, never as cleartext.

When this applies

Before production, to find personal data requiring FRCP 5.2 redaction (SSNs, financial accounts, DOB, minors' names).
On HIPAA-regulated collections, to locate PHI (the 18 Safe Harbor identifiers) for de-identification or protective-order handling.
For CCPA/GDPR cross-border productions, to quantify and locate personal data subject to minimization.

HIPAA Safe Harbor — the 18 identifiers (§164.514(b)(2))

Names · Geographic subdivisions < state · All date elements (except year) tied to an
individual · Phone · Fax · Email · SSN · Medical record number · Health-plan beneficiary #
· Account numbers · Certificate/license numbers · Vehicle identifiers/plates · Device
identifiers/serials · URLs · IP addresses · Biometric identifiers (finger/voice prints)
· Full-face photos & comparable images · Any other unique identifying number/code

Detect these plus NIST SP 800-122 PII (e.g. SSN, passport, financial accounts) and CCPA §1798.140(v) categories (identifiers, financial, biometric, geolocation, etc.).

Output structure (JSON per entity)

{
  "document_id": "DOC000123",
  "bates": "ABC000451",
  "page": 3,
  "span": { "start": 1204, "end": 1215 },
  "entity_type": "US_SSN",
  "entity_value_hash": "sha256:9f86d0...",
  "value_redacted_preview": "***-**-6789",
  "confidence": 0.98,
  "detector": "regex+checksum",
  "jurisdiction_flags": ["FRCP_5.2", "HIPAA_SafeHarbor", "CCPA"],
  "redaction_priority": "high"
}
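
As an illustration of the record above, the following minimal Python sketch builds one entity from a page of text. The helper names (`mask_ssn`, `build_entity`) and the sample text are hypothetical, and the sketch assumes `span.end` is exclusive, which matches the 11-character span in the sample record. The cleartext value is hashed and masked inside the function and is never written to the record.

```python
import hashlib
import json


def mask_ssn(value: str) -> str:
    """Keep only the last four digits, e.g. '***-**-6789'."""
    digits = [c for c in value if c.isdigit()]
    return "***-**-" + "".join(digits[-4:])


def build_entity(document_id, bates, page, text, start, end,
                 entity_type, confidence, detector, flags, priority):
    """Build one report record; the cleartext value is not stored."""
    value = text[start:end]  # end offset is exclusive (assumption)
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return {
        "document_id": document_id,
        "bates": bates,
        "page": page,
        "span": {"start": start, "end": end},
        "entity_type": entity_type,
        "entity_value_hash": "sha256:" + digest,
        "value_redacted_preview": mask_ssn(value),
        "confidence": confidence,
        "detector": detector,
        "jurisdiction_flags": flags,
        "redaction_priority": priority,
    }


# Made-up sample text for illustration only.
page_text = "Patient SSN: 123-45-6789 on file."
start = page_text.index("123-45-6789")
record = build_entity(
    "DOC000123", "ABC000451", 3, page_text, start, start + 11,
    "US_SSN", 0.98, "regex+checksum",
    ["FRCP_5.2", "HIPAA_SafeHarbor", "CCPA"], "high",
)
print(json.dumps(record, indent=2))
```

A plain SHA-256 over a nine-digit value can be reversed by enumerating all candidates, so a keyed hash (for example HMAC-SHA-256 with a secret held outside the report) may be a safer choice; the page itself only specifies a one-way hash.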

Detection rules

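Layer detectors: pattern/regex + validation (SSN area/group sanity, Luhn check on card numbers, ABA routing checksum) + context keywords ("SSN", "DOB", "account no.") + NER for names/addresses. Validation rejects pattern matches that cannot be real identifiers (for example, a 16-digit string that fails the Luhn check), which reduces false positives.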
Classify by jurisdiction: tag each hit with which regimes implicate it (FRCP_5.2, HIPAA_SafeHarbor, CCPA, GDPR) — the same SSN may carry multiple flags and drive different handling.
Assign redaction priority: high (SSN, full financial account, biometrics, minor's identity), medium (DOB, MRN, partial account, address), low (business contact info that may be non-sensitive).
Never store cleartext sensitive values in the report — keep a one-way hash plus a masked preview and character offsets so redaction can locate them.
Record provenance (detector, confidence) so reviewers can prioritize low-confidence hits and the process is defensible.
Treat output as a review queue, not an automatic action — counsel confirms before redaction/production.
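
The layering described in the rules above can be sketched in Python as follows. This is a simplified illustration, not a production detector: the function names, the 20-character context window, and the two confidence values are assumptions chosen for the example, and NER is left out. The validators follow the standard published checks (unassigned SSN ranges, the Luhn algorithm, and the ABA 3-7-1 weighted checksum).

```python
import re

SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")
CONTEXT_RE = re.compile(r"\b(ssn|social security)\b", re.IGNORECASE)


def ssn_is_plausible(area: str, group: str, serial: str) -> bool:
    """Reject ranges never issued: area 000, 666, 900-999; group 00; serial 0000."""
    if area == "000" or area == "666" or area >= "900":
        return False
    return group != "00" and serial != "0000"


def luhn_ok(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def aba_ok(routing: str) -> bool:
    """ABA routing checksum: weights 3, 7, 1 repeated over nine digits."""
    if len(routing) != 9 or not routing.isdigit():
        return False
    d = [int(c) for c in routing]
    total = (3 * (d[0] + d[3] + d[6])
             + 7 * (d[1] + d[4] + d[7])
             + (d[2] + d[5] + d[8]))
    return total % 10 == 0


def detect_ssns(text: str, window: int = 20):
    """Pattern match, then validation, then a context-keyword boost."""
    hits = []
    for m in SSN_RE.finditer(text):
        if not ssn_is_plausible(*m.groups()):
            continue  # validation layer drops impossible values
        before = text[max(0, m.start() - window):m.start()]
        has_context = bool(CONTEXT_RE.search(before))
        hits.append({
            "span": {"start": m.start(), "end": m.end()},
            "entity_type": "US_SSN",
            "value_redacted_preview": "***-**-" + m.group(3),
            # Confidence values are illustrative assumptions.
            "confidence": 0.98 if has_context else 0.80,
            "detector": "regex+validation+context" if has_context
                        else "regex+validation",
        })
    return hits


sample = "Employee SSN: 123-45-6789. Ref 000-12-3456. Order 219-09-9999 shipped."
hits = detect_ssns(sample)
for h in hits:
    print(h)
```

In this sample, the second candidate is discarded by validation alone, and the third survives validation but gets a lower confidence because no keyword appears nearby, which is the kind of hit the rules route to human review.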

Worked example (one document, summarized)

DOC000123 (Bates ABC000451–453) — 4 entities detected:
  p1  US_SSN            "***-**-6789"   conf 0.98  [FRCP_5.2, HIPAA, CCPA]  → high
  p1  DOB               "**/**/1984"    conf 0.95  [FRCP_5.2, HIPAA, CCPA]  → medium
  p2  FINANCIAL_ACCOUNT "****1234" (Luhn ok) conf 0.92 [FRCP_5.2, CCPA]    → high
  p3  EMAIL             "j***@acme.com" conf 0.99  [HIPAA, CCPA]           → low
→ feeds build-redaction-instruction-set: redact the SSN (last 4 ok), full account, DOB→year
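
The flag and priority assignments in this example can be expressed as a small lookup table. The Python sketch below is one possible arrangement under stated assumptions: the `RULES` table, the `phi_in_scope` switch, and the sort order are hypothetical, and DOB is tagged FRCP_5.2 here because the page lists dates of birth among the FRCP 5.2 redaction items. Within each priority level, lower-confidence hits are sorted first so reviewers see them early.

```python
# Hypothetical lookup: entity type -> (jurisdiction flags, redaction priority).
RULES = {
    "US_SSN": (["FRCP_5.2", "HIPAA_SafeHarbor", "CCPA"], "high"),
    "FINANCIAL_ACCOUNT": (["FRCP_5.2", "CCPA"], "high"),
    "DOB": (["FRCP_5.2", "HIPAA_SafeHarbor", "CCPA"], "medium"),
    "EMAIL": (["HIPAA_SafeHarbor", "CCPA"], "low"),
}
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}


def classify(entity_type: str, phi_in_scope: bool = True):
    """Return (flags, priority); drop the HIPAA flag when PHI is out of scope."""
    flags, priority = RULES[entity_type]
    if not phi_in_scope:
        flags = [f for f in flags if f != "HIPAA_SafeHarbor"]
    return list(flags), priority


def review_queue(hits):
    """hits: iterable of (entity_type, confidence) pairs."""
    rows = []
    for entity_type, confidence in hits:
        flags, priority = classify(entity_type)
        rows.append({
            "entity_type": entity_type,
            "confidence": confidence,
            "jurisdiction_flags": flags,
            "redaction_priority": priority,
        })
    # Highest priority first; within a priority, lowest confidence first.
    return sorted(rows, key=lambda r: (PRIORITY_RANK[r["redaction_priority"]],
                                       r["confidence"]))


hits = [("US_SSN", 0.98), ("DOB", 0.95),
        ("FINANCIAL_ACCOUNT", 0.92), ("EMAIL", 0.99)]
queue = review_queue(hits)
for row in queue:
    print(row["redaction_priority"], row["entity_type"], row["jurisdiction_flags"])
```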

Validation checklist

Detectors layered (pattern + checksum/validation + context + NER); high-risk types checksum-validated
HIPAA Safe Harbor 18 identifiers covered where PHI is in scope; NIST/CCPA categories covered
Each entity carries document_id, bates, page, offset span, type, hashed value, confidence, detector
Jurisdiction flags assigned (FRCP 5.2 / HIPAA / CCPA / GDPR) and redaction priority set
No cleartext sensitive values stored; only hash + masked preview
Low-confidence and context-only hits routed for human review (not auto-redacted)
Output structured for ingestion by the redaction-instruction step or a review platform (Relativity, DISCO, Everlaw)
Sampling/QC done to estimate recall (missed identifiers) before relying on results for production
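
Several of the checklist items above can be checked mechanically. The Python sketch below is a hypothetical helper, not part of the skill: `check_record` tests for the required fields, a full-length hash, and a masked preview, and flags low-confidence hits for review, while `estimate_recall` computes a point estimate from a manually reviewed QC sample. The 0.90 review threshold is an assumed value to tune per matter.

```python
import re

REQUIRED = ("document_id", "bates", "page", "span", "entity_type",
            "entity_value_hash", "confidence", "detector",
            "jurisdiction_flags", "redaction_priority")
HASH_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


def check_record(record: dict, review_threshold: float = 0.90):
    """Return (problems, needs_review) for one entity record."""
    problems = [f"missing field: {f}" for f in REQUIRED if f not in record]
    span = record.get("span")
    if not (isinstance(span, dict)
            and isinstance(span.get("start"), int)
            and isinstance(span.get("end"), int)
            and span["start"] < span["end"]):
        problems.append("span must have integer start < end")
    if not HASH_RE.match(str(record.get("entity_value_hash", ""))):
        problems.append("entity_value_hash is not a full sha256 digest")
    # The preview is optional here, but if present it must be masked.
    if "*" not in str(record.get("value_redacted_preview", "*")):
        problems.append("preview is not masked")
    needs_review = record.get("confidence", 0.0) < review_threshold
    return problems, needs_review


def estimate_recall(found_in_sample: int, missed_in_sample: int) -> float:
    """Recall = identifiers detected / all identifiers in the reviewed sample."""
    total = found_in_sample + missed_in_sample
    if total == 0:
        raise ValueError("sample contains no identifiers")
    return found_in_sample / total


good = {
    "document_id": "DOC000123", "bates": "ABC000451", "page": 3,
    "span": {"start": 1204, "end": 1215}, "entity_type": "US_SSN",
    "entity_value_hash": "sha256:" + "a" * 64,  # placeholder digest
    "value_redacted_preview": "***-**-6789", "confidence": 0.98,
    "detector": "regex+checksum",
    "jurisdiction_flags": ["FRCP_5.2"], "redaction_priority": "high",
}
bad = dict(good, value_redacted_preview="123-45-6789", confidence=0.60)
del bad["bates"]

print(check_record(good))
print(check_record(bad))
print(estimate_recall(found_in_sample=47, missed_in_sample=3))
```

The sample counts passed to `estimate_recall` are made up for illustration; a small sample gives only a rough estimate, so the figure is better read as a floor for concern than as a guarantee.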

Last updated: 2026-05-31 — detection is decision-support requiring attorney review; confirm the identifier set and handling against current NIST SP 800-122, HIPAA §164.514(b)(2), CCPA §1798.140(v), FRCP 5.2, and any case-specific protective order before production.