Data deduplication
Finding and removing duplicate rows from a dataset, either exact duplicates or rows that match on a chosen set of key columns.
Deduplication is the cleanup step that collapses repeated records into one. There are two flavors: exact deduplication, where an entire row must match another byte for byte, and key-based deduplication, where rows count as duplicates if they agree on chosen key columns, such as email, even if other fields differ. Key-based deduplication is what most real datasets need, because the same customer often appears with slightly different formatting in the non-key fields. If the key column itself varies in case or whitespace, it has to be normalized first, or the rows will not be recognized as duplicates.
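As a minimal sketch of the two flavors, assuming rows are held as plain Python dictionaries: the sample data, the column names, and the helper names dedupe_exact and dedupe_by_key are all hypothetical and chosen only for illustration. Both helpers keep the first occurrence they see.

```python
# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"email": "ana@example.com", "name": "Ana Ruiz"},
    {"email": "ana@example.com", "name": "Ana Ruiz"},  # exact copy of the first row
    {"email": "ana@example.com", "name": "ANA RUIZ"},  # same key, different formatting
    {"email": "bo@example.com", "name": "Bo Lind"},
]


def dedupe_exact(records):
    """Keep the first copy of each row whose fields all match."""
    seen = set()
    result = []
    for record in records:
        # A sorted tuple of (column, value) pairs is hashable and
        # independent of the order in which the columns were written.
        fingerprint = tuple(sorted(record.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            result.append(record)
    return result


def dedupe_by_key(records, key_columns):
    """Keep the first row seen for each combination of key column values."""
    seen = set()
    result = []
    for record in records:
        key = tuple(record[column] for column in key_columns)
        if key not in seen:
            seen.add(key)
            result.append(record)
    return result


exact = dedupe_exact(rows)               # drops only the identical copy
by_email = dedupe_by_key(rows, ["email"])  # one row per email address

print(len(rows), len(exact), len(by_email))  # 4 3 2
```

Exact deduplication leaves both spellings of the same customer in place, while deduplicating on email collapses them into one row.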
Duplicates creep in through merged exports, repeated imports, and join mistakes, and they quietly inflate counts and totals. With key-based deduplication the keep rule matters too: keeping the first versus the last occurrence can change which version of a record survives. Deciding the key and the keep rule up front is the difference between a clean deduplication and silent data loss.
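The effect of the keep rule can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the records, the plan column, and the helper dedupe_keep are hypothetical, and the input order is assumed to be the order in which records arrived.

```python
def dedupe_keep(records, key_columns, keep="first"):
    """Deduplicate on key_columns, keeping the first or last row per key."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    survivors = {}
    for record in records:
        key = tuple(record[column] for column in key_columns)
        # "first": store a row only when the key is new.
        # "last": overwrite, so the latest row for the key wins.
        if keep == "last" or key not in survivors:
            survivors[key] = record
    # Output order follows the first appearance of each key.
    return list(survivors.values())


# Hypothetical records: the same customer appears twice with different plans.
records = [
    {"email": "ana@example.com", "plan": "free"},
    {"email": "bo@example.com", "plan": "free"},
    {"email": "ana@example.com", "plan": "pro"},  # later version of the first row
]

first = dedupe_keep(records, ["email"], keep="first")
last = dedupe_keep(records, ["email"], keep="last")

print(len(records), len(first))            # 3 2
print(first[0]["plan"], last[0]["plan"])   # free pro
```

The raw row count overstates the number of customers by one, and the two keep rules disagree about which plan survives for the duplicated customer, so the choice should follow from which version of a record is considered authoritative.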