This recipe identifies Personally Identifiable Information (PII) across datasets by scanning column names for common indicators such as name, email, phone, address, date of birth, social security number, and bank details. Running it classifies each column in your data ecosystem as PII or Non-PII, giving you clear direction for governing them. Note: Metadata is not sent to AI. The recipe uses Python logic to identify PII.
Before adopting advanced AI workflows, teams need a reliable, explainable first-pass scanner that finds obvious PII (emails, SSNs, phone numbers, names) using regexes, metadata, and profiling. This recipe provides a deterministic, auditable foundation for PII inventory, risk triage, and remediation planning.
Step 1 — Read catalog and profiling metadata
Load tables, columns, example values, and profiling stats to establish scan scope.
Step 2 — Apply deterministic regex scanners
Run a battery of regex patterns for emails, phone numbers, SSNs, credit card formats, IPs, and common identifiers.
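A minimal sketch of such a regex battery is shown here. The patterns, the `PII_PATTERNS` and `scan_value` names, and the Luhn checksum filter for card numbers are illustrative assumptions, not the recipe's actual rules; the formats are US-centric and would need tuning for other regions.

```python
import re

# Illustrative patterns only (assumptions, not the recipe's real rule set).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "us_phone": re.compile(r"(?:\+1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"),
    "ipv4": re.compile(
        r"(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)"
    ),
    # 13 to 19 digits, optionally separated by single spaces or hyphens.
    "credit_card": re.compile(r"\d(?:[ -]?\d){12,18}"),
}


def luhn_ok(text):
    """Return True if the digits in text pass the Luhn checksum."""
    digits = [int(ch) for ch in text if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_value(value):
    """Return the sorted names of all patterns that match the whole value."""
    text = str(value).strip()
    hits = [name for name, pat in PII_PATTERNS.items() if pat.fullmatch(text)]
    # Drop card-shaped numbers that fail the checksum to reduce false positives.
    if "credit_card" in hits and not luhn_ok(text):
        hits.remove("credit_card")
    return sorted(hits)
```

Using `fullmatch` rather than `search` means a value is flagged only when the entire cell looks like the identifier, which keeps free-text columns from matching on an incidental substring.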
Step 3 — Use metadata cues and name heuristics
Combine column name matching (e.g., contains 'email', 'ssn', 'phone', 'dob') with profiling signals to boost confidence.
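One way to sketch the name heuristic is to match hints on token boundaries, so that `hotel_id` does not trigger the `tel` hint. The hint lists, the function names, and the confidence formula below are hypothetical choices made for illustration; the recipe's real weights are not documented here.

```python
import re

# Hypothetical hint lists; extend them to fit your own naming conventions.
NAME_HINTS = {
    "email": ("email", "e_mail"),
    "ssn": ("ssn", "social_security"),
    "phone": ("phone", "mobile", "tel"),
    "dob": ("dob", "date_of_birth", "birth_date"),
    "address": ("address", "street", "postal_code", "zip"),
}


def normalize(column_name):
    """Convert camelCase and punctuation-separated names to snake_case."""
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", column_name)
    return re.sub(r"[^A-Za-z0-9]+", "_", s).strip("_").lower()


def name_hints(column_name):
    """Return sorted PII categories whose hints appear as whole tokens."""
    padded = "_" + normalize(column_name) + "_"
    return sorted(
        category
        for category, hints in NAME_HINTS.items()
        if any("_" + hint + "_" in padded for hint in hints)
    )


def confidence(name_hit, match_rate):
    """Combine a name hit with the share of values matching a pattern (0 to 1).

    Assumed formula: a name hit halves the remaining uncertainty.
    """
    return 1.0 - (1.0 - match_rate) * 0.5 if name_hit else match_rate
```

With this assumed formula, a column named `customer_email` whose values match the email pattern half the time scores 0.75, while the same match rate under an uninformative column name stays at 0.5.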
Step 4 — Compute column-level PII flags and counts
Aggregate match counts, sample values, and compute percentage of rows matching PII patterns.
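The aggregation for a single column might look like the following sketch. The `summarize_column` helper, its output keys, and the choice to exclude null and blank values from the denominator are assumptions made for illustration.

```python
import re

# Example pattern for the demonstration; any compiled regex can be passed in.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def summarize_column(values, pattern, max_samples=3):
    """Count pattern matches in a column and report their prevalence.

    Nulls and blank strings are excluded from the denominator (an assumption),
    so the percentage reflects populated rows only.
    """
    populated = [
        str(v).strip() for v in values if v is not None and str(v).strip()
    ]
    matches = [v for v in populated if pattern.fullmatch(v)]
    pct = round(100.0 * len(matches) / len(populated), 2) if populated else 0.0
    return {
        "populated_rows": len(populated),
        "match_count": len(matches),
        "match_pct": pct,
        "samples": matches[:max_samples],
        "is_pii_candidate": len(matches) > 0,
    }


summary = summarize_column(["a@x.com", "b@y.org", "n/a", None, ""], EMAIL)
```

Storing raw sample values in a report copies PII into a new location, so in practice you may prefer to mask or omit `samples`.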
Step 5 — Assign risk category and severity
High: direct identifier matches with high prevalence; Medium: partial matches or low prevalence; Low: possible indirect identifiers.
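The categories above can be expressed as a small rule function. The source does not state what counts as "high prevalence", so the 50% default below is a placeholder assumption, as are the `identifier_kind` labels and the function name.

```python
def assign_risk(identifier_kind, match_pct, high_prevalence_pct=50.0):
    """Map a column's identifier kind and match prevalence to a risk level.

    identifier_kind: "direct" (e.g. email, SSN) or "indirect" (e.g. zip code).
    match_pct: percentage of populated rows matching a PII pattern (0 to 100).
    high_prevalence_pct: assumed cut-off for "high prevalence"; tune as needed.
    """
    if identifier_kind == "direct" and match_pct >= high_prevalence_pct:
        return "High"
    if identifier_kind == "direct" and match_pct > 0:
        return "Medium"  # partial matches or low prevalence
    if identifier_kind == "indirect":
        return "Low"  # possible indirect identifier
    return "None"
```

Keeping the threshold as a parameter makes the rule easy to audit and lets stewards tighten it for sensitive systems without changing the logic.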
Step 6 — Generate top tables and connection summaries
List tables and systems with the highest PII concentration to prioritize remediation.
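As a rough sketch, the ranking can be built by grouping column-level results by connection and table. The input record shape (`connection`, `table`, `risk` keys) and the sort order (high-risk columns first, then total PII columns) are assumptions for illustration.

```python
from collections import defaultdict


def top_pii_tables(candidates, limit=5):
    """Rank tables by number of high-risk columns, then by total PII columns."""
    totals = defaultdict(lambda: {"pii_columns": 0, "high_risk_columns": 0})
    for row in candidates:
        key = (row["connection"], row["table"])
        totals[key]["pii_columns"] += 1
        if row["risk"] == "High":
            totals[key]["high_risk_columns"] += 1
    ranked = sorted(
        totals.items(),
        key=lambda kv: (-kv[1]["high_risk_columns"], -kv[1]["pii_columns"], kv[0]),
    )
    return [
        {"connection": conn, "table": table, **stats}
        for (conn, table), stats in ranked[:limit]
    ]
```

Ties are broken alphabetically by connection and table name so that repeated runs produce the same ordering.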
Step 7 — Output PII datasets and reports
Produce PII_candidate_columns, PII_summary_by_table, and dashboards for stewards and compliance teams.
| Insight Category | What the recipe discovered | Business Impact |
|---|---|---|
| Direct identifiers | Columns matching email regex found in 18 tables with >2% prevalence. | Requires fast-track masking or access restriction for affected tables. |
| High-risk tables | Customer_profile table contains multiple direct identifiers and high row-level prevalence. | Prioritize encryption and audit controls for this table. |
| False positive indicators | Several numeric codes matched credit-card-like patterns due to short length; flagged for manual review. | Reduces unnecessary remediation work by surfacing likely false positives early. |
Make sure the following ingredients are available in your workspace: