Others • Others

Duplicate Table Identification Based on Column Structure

Description

In large data environments, it’s common to have multiple tables that contain the same or nearly identical structures — often due to migrations, backups, or data team silos. These duplicate or redundant tables can cause confusion, increase storage costs, and lead to inconsistent reporting. This recipe helps data stewards, analysts, and engineers quickly identify such duplicates by comparing the column structure (names and data types) of all tables in OvalEdge. By calculating a similarity percentage between table pairs, the recipe highlights which tables are structurally identical (95–100% match), enabling teams to clean up redundancy and maintain a consistent data catalog.

Details

Automatically detect duplicate or redundant tables by comparing column structure, similarity, and metadata patterns across the catalog.

Why this recipe matters

Many organizations accumulate multiple tables that represent the same data but differ only in naming conventions, environments, or incomplete migrations. These duplicates lead to confusion, inconsistent reporting, and unnecessary storage costs. This recipe identifies tables with highly similar column structures, enabling data teams to eliminate redundancy, improve trust, and simplify data landscapes.

Business value

Reduces storage and compute costs by highlighting redundant datasets.
Prevents reporting inconsistencies caused by duplicate sources.
Improves trust and transparency in the data landscape.
Accelerates data catalog cleanup and consolidation projects.
Supports modernization and migration efforts with accurate duplication insights.

Who benefits

Data Engineering Teams
Data Architects
Data Governance Teams
Catalog Owners & Stewards
Analytics & BI Teams

What you receive

Duplicate table pairs

Column similarity scores

Metadata-based match indicators

Structural fingerprinting analysis

Duplicate reduction report

How the recipe works (guided flow)

Step 1 — Read table and column metadata
Load all active tables and their columns from the catalog, capturing table ID, table name, column name, and column data type.

Step 2 — Filter to structurally meaningful tables
Restrict the analysis to tables with a minimum column count (≥10 columns) to eliminate noise from trivial or incomplete structures.

Step 3 — Normalize column profiles for eligible tables
Create a normalized column profile for each eligible table, retaining only the columns that will participate in structural comparison.

Step 4 — Compare column structures across table pairs
For every distinct pair of eligible tables, compare columns using exact matches on column name and data type and count the number of matching columns.

Step 5 — Calculate structural containment match percentage
Compute a structural match percentage using the formula:
match_percent = (matching_columns × 100) / LEAST(table1_columns, table2_columns)

Step 6 — Identify high-confidence duplicate candidates
Flag table pairs with a match percentage between 95–100% as strong duplicate or near-duplicate candidates.

Step 7 — Output ranked duplicate mappings
Produce a ranked report of duplicate table pairs including table names, matching column count, total column counts, and calculated match percentage.

Sample insights (illustrative)

Insight Category	What the recipe discovered	Business Impact
High similarity tables	Two customer tables share 92% column similarity with identical primary keys.	Indicates redundant storage and potential reporting conflicts.
Redundant environment copies	Three staging tables structurally match production versions.	Migration cleanup opportunity to reduce storage footprint.
Legacy duplicates	An older finance table structure matches the newer curated version.	Signals candidates for archival or deprecation.

Before you run this recipe

Make sure the following ingredients are available:

Connected datasets are crawled and profiled
Columns have distinct counts and top values generated

Back to Recipes

Others • Others

Duplicate Table Identification Based on Column Structure

Description

Details

Duplicate Table Identification Based on Column Structure

Automatically detect duplicate or redundant tables by comparing column structure, similarity, and metadata patterns across the catalog.

Why this recipe matters

Business value

Reduces storage and compute costs by highlighting redundant datasets.
Prevents reporting inconsistencies caused by duplicate sources.
Improves trust and transparency in the data landscape.
Accelerates data catalog cleanup and consolidation projects.
Supports modernization and migration efforts with accurate duplication insights.

Who benefits

Data Engineering Teams
Data Architects
Data Governance Teams
Catalog Owners & Stewards
Analytics & BI Teams

What you receive

Duplicate table pairs

Column similarity scores

Metadata-based match indicators

Structural fingerprinting analysis

Duplicate reduction report

How the recipe works (guided flow)

Sample insights (illustrative)

Insight Category	What the recipe discovered	Business Impact
High similarity tables	Two customer tables share 92% column similarity with identical primary keys.	Indicates redundant storage and potential reporting conflicts.
Redundant environment copies	Three staging tables structurally match production versions.	Migration cleanup opportunity to reduce storage footprint.
Legacy duplicates	An older finance table structure matches the newer curated version.	Signals candidates for archival or deprecation.

Before you run this recipe

Make sure the following ingredients are available:

Connected datasets are crawled and profiled
Columns have distinct counts and top values generated