This recipe identifies Personally Identifiable Information (PII) across datasets by scanning column names for common indicators such as name, email, phone, address, date of birth, social security number, and bank details. Running it classifies each column in your data ecosystem as PII or Non-PII, giving clear direction for governing them. Note: The recipe takes roughly 2 to 3 hours to run, depending on the volume of cataloged data.
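To make the idea of scanning column names for indicators concrete, here is a minimal Python sketch. It is a plain keyword heuristic, not the recipe's AI classifier; the keyword list `PII_KEYWORDS` and the function names are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical indicator keywords (an assumption; the recipe's real list is not shown here).
PII_KEYWORDS = [
    "name", "email", "phone", "address", "dob", "birth",
    "ssn", "social_security", "bank", "account",
]

# One case-insensitive pattern that matches any keyword as a substring.
_PATTERN = re.compile("|".join(PII_KEYWORDS), re.IGNORECASE)


def looks_like_pii(column_name: str) -> bool:
    """Return True when the column name contains any indicator keyword."""
    return _PATTERN.search(column_name) is not None


def classify_columns(column_names):
    """Label each column name as 'PII' or 'Non-PII'."""
    return {c: ("PII" if looks_like_pii(c) else "Non-PII") for c in column_names}


labels = classify_columns(["customer_email", "order_total", "DOB", "ship_address"])
```

Substring matching is deliberately crude: a column such as `filename` would be flagged because it contains `name`, which is one reason a classifier that goes beyond keywords is useful.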
Organizations store PII across hundreds of tables without visibility into where sensitive information resides. Manual identification is slow, inconsistent, and unreliable. This recipe uses AI-driven classification and metadata processing to generate a repeatable, scalable, and accurate PII discovery framework.
Step 1 — Read metadata from oetable, oecolumn, and connectioninfo
Load all tables, columns, and connection details to establish the full scan scope.
Step 2 — Read AI-enriched PII classifications
Consume the AI classifier output for each column (PII Identifier).
Step 3 — Compute scannedcolumns dataset
Join tables and columns to create a unified structural inventory.
Step 4 — Filter only PII-tagged columns
Use the AI classification to keep only PII-tagged columns and produce the PIIColumns dataset.
Step 5 — Aggregate PII counts per table
Identify tables with the highest PII concentration.
Step 6 — Aggregate PII counts per connection
Determine which systems pose elevated privacy risks.
Step 7 — Generate bar charts and metrics
Chart the top 10 tables and top 10 connections by PII count, and report total PII counts and maximum-distribution metrics.
Step 8 — Output datasets and governance reports
Deliver scannedcolumns, PIIColumns, charts, and summary metrics for stewardship and audits.
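The join, filter, and aggregation steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the recipe's implementation: the field names (`table_id`, `connection_id`, `name`, `pii_identifier`), the helper functions, and the sample rows are all hypothetical stand-ins for the contents of oetable, oecolumn, and connectioninfo.

```python
from collections import Counter


def build_scanned_columns(tables, columns, connections):
    """Step 3: join each column to its table and connection."""
    table_by_id = {t["table_id"]: t for t in tables}
    conn_by_id = {c["connection_id"]: c for c in connections}
    scanned = []
    for col in columns:
        table = table_by_id.get(col["table_id"])
        if table is None:
            continue  # skip columns whose parent table is missing
        conn = conn_by_id.get(table["connection_id"], {})
        scanned.append({
            "connection": conn.get("name", "unknown"),
            "table": table["name"],
            "column": col["name"],
            "pii_identifier": col.get("pii_identifier", "Non-PII"),
        })
    return scanned


def filter_pii(scanned):
    """Step 4: keep only rows tagged as PII."""
    return [row for row in scanned if row["pii_identifier"] == "PII"]


def top_counts(rows, key, n=10):
    """Steps 5 and 6: count rows per key and return the top n."""
    return Counter(row[key] for row in rows).most_common(n)


# Hypothetical sample metadata (assumed shapes, not real catalog rows).
tables = [
    {"table_id": 1, "name": "customers", "connection_id": 10},
    {"table_id": 2, "name": "orders", "connection_id": 10},
    {"table_id": 3, "name": "employees", "connection_id": 20},
]
columns = [
    {"table_id": 1, "name": "email", "pii_identifier": "PII"},
    {"table_id": 1, "name": "phone", "pii_identifier": "PII"},
    {"table_id": 2, "name": "order_total", "pii_identifier": "Non-PII"},
    {"table_id": 3, "name": "ssn", "pii_identifier": "PII"},
]
connections = [
    {"connection_id": 10, "name": "crm_db"},
    {"connection_id": 20, "name": "hr_db"},
]

scanned_columns = build_scanned_columns(tables, columns, connections)
pii_columns = filter_pii(scanned_columns)
pii_per_table = top_counts(pii_columns, "table")
pii_per_connection = top_counts(pii_columns, "connection")
```

Counting by table name alone assumes names are unique across connections; if they are not, counting on a `(connection, table)` pair would avoid merging unrelated tables.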
| Insight Category | What the recipe discovered | Business Impact |
|---|---|---|
| High-risk tables | Tables containing the highest number of PII fields. | Prioritize masking, encryption, and access review. |
| Risky connections | Connections that hold the highest concentration of PII columns. | Require stricter access controls and periodic audits. |
| Category insights | Direct Identifiers dominate PII categories across sources. | Supports prioritization of tokenization and anonymization strategies. |
Make sure the following ingredients are available: