deduplication : Java Glossary


deduplication

One of the big lessons you learn over a life in programming is that it is far easier to keep junk out of a database than to remove it later. A database with lax validation is much more difficult to code for and keep bug-free than one with ruthlessly strict validation. In particular, this means ensuring there are no duplicates. The problem is that duplicates are rarely identical: they contain conflicting information because they don’t get updated in sync. If there are no duplicates, not even intentional ones, nothing can get out of sync.

Most deduping is case-sensitive, though some techniques can be done in a case-insensitive way.
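As a minimal sketch of one case-insensitive approach (not necessarily the technique the author has in mind), a TreeSet ordered by String.CASE_INSENSITIVE_ORDER treats "Apple" and "apple" as the same key. The class and method names here are hypothetical, and the input is assumed to contain no nulls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class CaseInsensitiveDedup {
    /**
     * Drops case-insensitive duplicates, keeping the first spelling seen
     * and preserving the original order. Assumes no null elements.
     */
    static List<String> dedupIgnoringCase(List<String> input) {
        // The comparator decides equality for the set, so "Apple" and "apple" collide.
        Set<String> seen = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        List<String> result = new ArrayList<>();
        for (String s : input) {
            if (seen.add(s)) {   // add() returns false when an equal key is already present
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> fruit = Arrays.asList("Apple", "apple", "BANANA", "Banana", "cherry");
        System.out.println(dedupIgnoringCase(fruit)); // prints [Apple, BANANA, cherry]
    }
}
```

Swapping the TreeSet for a plain HashSet gives the ordinary case-sensitive behaviour.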

DeDuping a Collection

There are several techniques for deduping a Collection:

Scan ArrayList bottom to top.
Copy non-duplicate elements of a Collection to a new empty Collection.
Use a HashSet.
Use Iterator.remove().

Scanning bottom to top was fastest and Iterator.remove() was slowest.

Here is the sample code for five methods, plus a method that does not work but that newbies often try.
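What follows is a minimal sketch of the four techniques listed above, not the author's own listing. It assumes String elements with no nulls, and it assumes the list is sorted first so that duplicates sit next to each other (the HashSet version does not need this). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

final class DedupSketch {
    /** Sort, then scan from the highest index down, removing any element equal to its predecessor. */
    static void dedupBottomToTop(ArrayList<String> list) {
        Collections.sort(list);
        for (int i = list.size() - 1; i > 0; i--) {
            if (list.get(i).equals(list.get(i - 1))) {
                list.remove(i); // only the already-deduped tail has to shift down
            }
        }
    }

    /** Copy the non-duplicate elements of an already sorted list to a new list. */
    static ArrayList<String> dedupByCopy(List<String> sorted) {
        ArrayList<String> out = new ArrayList<>(sorted.size());
        String previous = null;
        for (String s : sorted) {
            if (!s.equals(previous)) {
                out.add(s);
                previous = s;
            }
        }
        return out;
    }

    /** Let a HashSet discard the duplicates. The original order is lost. */
    static Set<String> dedupWithHashSet(Collection<String> input) {
        return new HashSet<>(input);
    }

    /** Walk an already sorted list with an Iterator, removing each repeat in place. */
    static void dedupWithIteratorRemove(List<String> sorted) {
        String previous = null;
        for (Iterator<String> it = sorted.iterator(); it.hasNext();) {
            String s = it.next();
            if (s.equals(previous)) {
                it.remove();
            } else {
                previous = s;
            }
        }
    }

    public static void main(String[] args) {
        ArrayList<String> names =
                new ArrayList<>(Arrays.asList("pear", "apple", "pear", "fig", "apple"));
        dedupBottomToTop(names);
        System.out.println(names); // prints [apple, fig, pear]
    }
}
```

If the original order matters, a LinkedHashSet in place of the HashSet keeps elements in first-seen order.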

DeDuping a File

Normally you just read the sorted file sequentially, comparing each record with the previous one, and write the non-duplicates to a temporary file, much like deduping a collection by copying. When you are done, you delete the original file and rename the temporary file to the original name. I usually do this with HunkIO.createTempFile() and HunkIO.deleteAndRename().
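The steps above can be sketched roughly as follows. This version deliberately uses the standard java.nio.file calls Files.createTempFile and Files.move as stand-ins for the HunkIO methods, whose signatures are not shown here. It assumes one record per line, UTF-8 text, and a file that is already sorted; the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class FileDedupSketch {
    /** Removes adjacent duplicate lines from an already sorted text file, in place. */
    static void dedupSortedFile(Path original) throws IOException {
        // Create the temp file in the same directory so the final rename stays on one volume.
        Path dir = original.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "dedup", ".tmp");

        try (BufferedReader in = Files.newBufferedReader(original, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            String previous = null;
            String line;
            while ((line = in.readLine()) != null) {
                // Because the file is sorted, a duplicate always equals the line just before it.
                if (!line.equals(previous)) {
                    out.write(line);
                    out.newLine(); // writes the platform line separator
                    previous = line;
                }
            }
        }

        // Replace the original with the deduped copy (delete and rename in one step).
        Files.move(temp, original, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

A call such as FileDedupSketch.dedupSortedFile(Paths.get("names.txt")) would rewrite that file without its repeated lines; the file name is only an example. If an exception is thrown part way through, this sketch leaves the temporary file behind, so production code would want to clean it up.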

ArrayList
CSVDeDup utility
Data DeDuplication for Dummies free eBook
DeDup utility
HashSet
Iterator
validation

	This page is posted on the web at:	http://mindprod.com/jgloss/deduplication.html