PII Identifier V1

Description

This recipe identifies Personally Identifiable Information (PII) across datasets by scanning column names for common indicators such as name, email, phone, address, date of birth, social security number, and bank details. Running it classifies each column as PII or Non-PII across your data ecosystem, giving clear direction for governing those columns. Note: Metadata is not sent to AI; the recipe uses Python logic to identify PII.

Details

PII Identifier V1

Baseline PII discovery that uses heuristics and metadata signals to locate obvious sensitive attributes across the catalog.

Why this recipe matters

Before adopting advanced AI workflows, teams need a reliable, explainable first-pass scanner that finds obvious PII (emails, SSNs, phone numbers, names) using regexes, metadata, and profiling. This recipe provides a deterministic, auditable foundation for PII inventory, risk triage, and remediation planning.

Business value

Quickly surface high-confidence PII candidates for immediate remediation.
Reduce manual discovery effort by providing an auditable first-pass scan.
Enable rapid prioritization for masking, encryption, or access controls.
Improve audit readiness and reduce exposure windows.
Provide deterministic rules that are easy to validate for compliance teams.

Who benefits

Privacy & Compliance Teams
Data Governance & Stewards
Security & Risk Teams
Data Engineers
Audit & Legal Teams

What you receive

Regex-based PII matches
Column-level PII flags
PII candidate dataset
Risk scoring & counts
Top PII tables summary

How the recipe works (guided flow)

Step 1 — Read catalog and profiling metadata
Load tables, columns, example values, and profiling stats to establish scan scope.

Step 2 — Apply deterministic regex scanners
Run a battery of regex patterns for emails, phone numbers, SSNs, credit card formats, IPs, and common identifiers.
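
A minimal sketch of what such a battery of patterns could look like in Python. The pattern set, the `PATTERNS` name, and the `scan_value` helper are assumptions made for illustration; the recipe's actual rules are not published here.

```python
import re

# Hypothetical patterns for illustration only; not the recipe's actual rule set.
PATTERNS = {
    "email": re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{7,}\d$"),
    "ipv4": re.compile(
        r"^(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)$"
    ),
    "credit_card": re.compile(r"^(?:\d[ -]?){13,19}$"),
}


def scan_value(value):
    """Return the names of every pattern that matches the whole value."""
    text = str(value).strip()
    return [name for name, pattern in PATTERNS.items() if pattern.match(text)]
```

A value can satisfy more than one pattern: with the loose phone pattern above, an SSN-shaped string such as `123-45-6789` is reported as both `us_ssn` and `phone`. Overlaps like this are one reason the later steps combine regex hits with other signals.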

Step 3 — Use metadata cues and name heuristics
Combine column name matching (e.g., contains 'email', 'ssn', 'phone', 'dob') with profiling signals to boost confidence.
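
The combination of name hints and profiling signals might be sketched as follows. The keyword list comes from the examples above; the `name_boost` weight, the function names, and the additive scoring are assumptions, not the recipe's documented logic.

```python
# Keywords taken from the examples in this step; a real rule set would be longer.
NAME_KEYWORDS = ("email", "ssn", "phone", "dob")


def name_hints(column_name):
    """Return the keywords contained in the lowercased column name."""
    lowered = column_name.lower()
    return [keyword for keyword in NAME_KEYWORDS if keyword in lowered]


def column_confidence(column_name, match_rate, name_boost=0.25):
    """Combine a regex match rate (0.0 to 1.0) with a name-based boost.

    name_boost is a hypothetical weight chosen for this sketch.
    """
    boost = name_boost if name_hints(column_name) else 0.0
    return min(1.0, match_rate + boost)
```

Plain substring matching is simple to audit but can over-match (for example, `dob` also appears in `adobe_id`), which is why the name hint only boosts the score in this sketch rather than deciding it alone.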

Step 4 — Compute column-level PII flags and counts
Aggregate match counts, sample values, and compute percentage of rows matching PII patterns.
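
As a rough illustration of this aggregation, the sketch below counts matches over a column's values and reports the match percentage. The `column_stats` helper, the simplified email pattern, and the choice to exclude null and blank values from the denominator are all assumptions.

```python
import re

# Simplified email pattern, used here only to make the sketch self-contained.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def column_stats(values, pattern, max_samples=3):
    """Aggregate match count, match percentage, and sample matches for one column."""
    # Assumption: null and blank values are excluded from the denominator.
    non_null = [str(v).strip() for v in values if v is not None and str(v).strip()]
    matches = [v for v in non_null if pattern.match(v)]
    pct = 100.0 * len(matches) / len(non_null) if non_null else 0.0
    return {
        "rows_scanned": len(non_null),
        "match_count": len(matches),
        "match_pct": round(pct, 2),
        "samples": matches[:max_samples],
    }


stats = column_stats(["a@b.com", "x", None, "c@d.org"], EMAIL)
```

Here `stats["match_count"]` is 2 out of 3 scanned rows, so `stats["match_pct"]` is 66.67.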

Step 5 — Assign risk category and severity
High: direct identifier matches with high prevalence; Medium: partial matches or low prevalence; Low: possible indirect identifiers.
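
One way to express these categories in code is shown below. The set of direct identifier types and the 50% prevalence threshold are hypothetical; the recipe does not state the cut-offs it uses.

```python
# Hypothetical set of types treated as direct identifiers in this sketch.
DIRECT_IDENTIFIERS = {"email", "us_ssn", "phone", "credit_card"}


def risk_category(pii_type, match_pct, high_prevalence_pct=50.0):
    """Map a detected PII type and its match percentage to a risk category.

    high_prevalence_pct is an assumed threshold, not a documented value.
    """
    if pii_type in DIRECT_IDENTIFIERS:
        # Direct identifier: severity depends on how prevalent the matches are.
        return "High" if match_pct >= high_prevalence_pct else "Medium"
    # Anything else is treated as a possible indirect identifier.
    return "Low"
```

For example, `risk_category("email", 80.0)` gives `"High"`, `risk_category("email", 2.0)` gives `"Medium"`, and `risk_category("zip_code", 90.0)` gives `"Low"`.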

Step 6 — Generate top tables and connection summaries
List tables and systems with the highest PII concentration to prioritize remediation.
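
The ranking could be as simple as counting flagged columns per table or per connection. In the sketch below, the record layout (`table`, `connection`, `column` keys) and the `top_pii` helper are assumed for illustration.

```python
from collections import Counter


def top_pii(flagged_columns, key="table", limit=5):
    """Rank tables (or connections) by their number of flagged PII columns."""
    counts = Counter(column[key] for column in flagged_columns)
    return counts.most_common(limit)


# Illustrative records; the field names are assumptions.
flagged = [
    {"connection": "crm", "table": "customer_profile", "column": "email"},
    {"connection": "crm", "table": "customer_profile", "column": "ssn"},
    {"connection": "sales", "table": "orders", "column": "phone"},
]

top_tables = top_pii(flagged, key="table", limit=1)
top_connections = top_pii(flagged, key="connection")
```

With this input, `top_tables` is `[("customer_profile", 2)]` and `top_connections` is `[("crm", 2), ("sales", 1)]`.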

Step 7 — Output PII datasets and reports
Produce PII_candidate_columns, PII_summary_by_table, and dashboards for stewards and compliance teams.

Sample insights (illustrative)

Insight Category	What the recipe discovered	Business Impact
Direct identifiers	Columns matching email regex found in 18 tables with >2% prevalence.	Requires fast-track masking or access restriction for affected tables.
High-risk tables	Customer_profile table contains multiple direct identifiers and high row-level prevalence.	Prioritize encryption and audit controls for this table.
False positive indicators	Several numeric codes matched credit-card-like patterns due to short length; flagged for manual review.	Reduces unnecessary remediation work by surfacing likely false positives early.
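
False positives like the credit-card case above are commonly reduced with a length check plus the Luhn checksum. The sketch below shows that general technique; it is not confirmed to be part of this recipe, and the 13 to 19 digit range is an assumption based on typical card number lengths.

```python
def luhn_valid(number):
    """Return True if the digits pass a length check and the Luhn checksum."""
    digits = [int(ch) for ch in str(number) if ch in "0123456789"]
    # Short numeric codes are rejected before the checksum is computed.
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        # Double every second digit, counting from the right.
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

A well-known test number such as `4111 1111 1111 1111` passes, while a short code such as `123456` is rejected on length alone.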

Before you run this recipe

Make sure the following ingredients are available in your workspace:

Connected datasets are crawled and profiled
Columns have distinct counts and top values generated
