Loading...
In large data environments, it’s common to have multiple tables that contain the same or nearly identical structures — often due to migrations, backups, or data team silos. These duplicate or redundant tables can cause confusion, increase storage costs, and lead to inconsistent reporting. This recipe helps data stewards, analysts, and engineers quickly identify such duplicates by comparing the column structure (names and data types) of all tables in OvalEdge. By calculating a similarity percentage between table pairs, the recipe highlights which tables are structurally identical (95–100% match), enabling teams to clean up redundancy and maintain a consistent data catalog.
Many organizations accumulate multiple tables that represent the same data but differ only in naming conventions, environments, or incomplete migrations. These duplicates lead to confusion, inconsistent reporting, and unnecessary storage costs. This recipe identifies tables with highly similar column structures, enabling data teams to eliminate redundancy, improve trust, and simplify data landscapes.
Step 1 — Read table and column metadata
Load all active tables and their columns from the catalog, capturing table ID, table name, column name, and column data type.
Step 2 — Filter to structurally meaningful tables
Restrict the analysis to tables with a minimum column count (≥10 columns) to eliminate noise from trivial or incomplete structures.
Step 3 — Normalize column profiles for eligible tables
Create a normalized column profile for each eligible table, retaining only the columns that will participate in structural comparison.
Step 4 — Compare column structures across table pairs
For every distinct pair of eligible tables, compare columns using exact matches on column name and data type and count the number of matching columns.
Step 5 — Calculate structural containment match percentage
Compute a structural match percentage using the formula:
match_percent = (matching_columns × 100) / LEAST(table1_columns, table2_columns)
Step 6 — Identify high-confidence duplicate candidates
Flag table pairs with a match percentage between 95–100% as strong duplicate or near-duplicate candidates.
Step 7 — Output ranked duplicate mappings
Produce a ranked report of duplicate table pairs including table names, matching column count, total column counts, and calculated match percentage.
| Insight Category | What the recipe discovered | Business Impact |
|---|---|---|
| High similarity tables | Two customer tables share 92% column similarity with identical primary keys. | Indicates redundant storage and potential reporting conflicts. |
| Redundant environment copies | Three staging tables structurally match production versions. | Migration cleanup opportunity to reduce storage footprint. |
| Legacy duplicates | An older finance table structure matches the newer curated version. | Signals candidates for archival or deprecation. |
Make sure the following ingredients are available: