Glossary

Data deduplication

Finding and removing duplicate rows from a dataset, either exact duplicates or rows that match on a chosen set of key columns.

Deduplication is the cleanup step that collapses repeated records into one. There are two flavors: exact deduplication, where an entire row must match another byte for byte, and key-based deduplication, where rows count as duplicates if they agree on a chosen set of columns, such as email, even if other fields differ. Key-based deduplication is what most real datasets need, because the same customer often appears with slightly different formatting.
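To make the two flavors concrete, here is a minimal sketch in plain Python. It assumes rows are dictionaries with string values; the function names, the sample data, and the trim-and-lowercase normalization are illustrative assumptions, not part of any particular tool.

```python
def dedupe_exact(rows):
    """Exact deduplication: drop a row only if every field matches an earlier row."""
    seen = set()
    result = []
    for row in rows:
        # Sort the items so column order does not affect the comparison.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            result.append(row)
    return result


def dedupe_by_key(rows, key_columns):
    """Key-based deduplication: rows are duplicates if they agree on key_columns.

    Assumption: key values are strings, compared after trimming whitespace
    and lowercasing, so formatting differences do not hide a match.
    """
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[column].strip().lower() for column in key_columns)
        if key not in seen:
            seen.add(key)
            result.append(row)  # keeps the first occurrence
    return result


# Hypothetical sample data.
customers = [
    {"name": "Ana Ruiz", "email": "ana@example.com"},
    {"name": "Ana Ruiz", "email": "ana@example.com"},      # exact duplicate
    {"name": "Ana M. Ruiz", "email": "Ana@Example.com "},  # same email, different formatting
    {"name": "Ben Ode", "email": "ben@example.com"},
]

exact = dedupe_exact(customers)
by_email = dedupe_by_key(customers, ["email"])
print(len(exact), len(by_email))
```

Exact deduplication removes only the second row, because the third differs in both name and email formatting. Keying on the normalized email treats all three Ana rows as one customer.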

Duplicates creep in through merged exports, repeated imports, and join mistakes, and they quietly inflate counts and totals. Which occurrence you keep matters too: keeping the first versus the last can change which version of a record survives. Deciding the key and the keep rule up front is the difference between clean deduplication and silent data loss.
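The effect of the keep rule, and of duplicates on totals, can be sketched in a few lines of plain Python. The function name `dedupe_keep`, the `keep` parameter, and the sample orders are hypothetical, chosen only for illustration.

```python
def dedupe_keep(rows, key_columns, keep="first"):
    """Key-based deduplication with an explicit keep rule ("first" or "last")."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    survivors = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        # "first": store a row only the first time its key appears.
        # "last": overwrite, so the most recent row for each key wins.
        if keep == "last" or key not in survivors:
            survivors[key] = row
    # Output order follows the first appearance of each key.
    return list(survivors.values())


# Hypothetical export in which order A1 was imported twice.
orders = [
    {"order_id": "A1", "amount": 100, "status": "pending"},
    {"order_id": "A2", "amount": 50, "status": "paid"},
    {"order_id": "A1", "amount": 100, "status": "paid"},
]

raw_total = sum(order["amount"] for order in orders)

keep_first = dedupe_keep(orders, ["order_id"], keep="first")
keep_last = dedupe_keep(orders, ["order_id"], keep="last")
clean_total = sum(order["amount"] for order in keep_first)

print(raw_total, clean_total)      # 250 150
print(keep_first[0]["status"])     # pending
print(keep_last[0]["status"])      # paid
```

Both keep rules give the same corrected total here, but they disagree on the status of order A1. That is the kind of difference that goes unnoticed unless the rule is chosen deliberately.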

Related

Compare two spreadsheets

Drop two files into SheetCompare and see every changed cell. Free, private, and runs in your browser.