deduplication : Java Glossary


deduplication

One of the big lessons you learn over a life in programming is that it is far easier to keep junk out of a database than to remove it later. A database with lax validation is much more difficult to code for, and to keep bug-free, than one with ruthlessly strict validation. In particular, this means ensuring there are no duplicates. The problem is that duplicates are rarely identical: they contain conflicting information because they don’t get updated in sync. If there are no duplicates, not even intentional ones, nothing can get out of sync.

Most deduping is case-sensitive, though some techniques can be done in a case-insensitive way.
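As a minimal sketch of one case-insensitive approach, a TreeSet built with String.CASE_INSENSITIVE_ORDER treats "Apple" and "APPLE" as the same element and keeps whichever spelling it saw first. The class and method names here are hypothetical, and the result comes back in sorted order rather than the original order.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

final class CaseInsensitiveDedupSketch {
    // Returns the strings with case-insensitive duplicates removed.
    // The first spelling encountered is the one kept; output is sorted.
    static List<String> dedupIgnoringCase(Collection<String> source) {
        TreeSet<String> unique = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        unique.addAll(source); // add() leaves the existing element in place on a match
        return new ArrayList<>(unique);
    }
}
```

Note that String.CASE_INSENSITIVE_ORDER ignores locale, so it may not be adequate for text with locale-specific casing rules.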

DeDuping a Collection

There are several techniques for deduping a Collection:

  1. Scan ArrayList bottom to top.
  2. Copy non-duplicate elements of a Collection to a new empty Collection.
  3. Use a HashSet.
  4. Use Iterator.remove().

When timed, scanning bottom to top was fastest and Iterator.remove() was slowest.

Here is sample code for five methods, plus one method that does not work, which newbies often try.
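A minimal sketch of the four techniques listed above might look like the following. It is an illustration only, not the original sample listing: the class and method names are hypothetical. It assumes the adjacent-comparison variants work on sorted data (the bottom-to-top scan sorts first; the copy and Iterator variants expect the caller to have sorted already), and it uses LinkedHashSet, a HashSet subclass, so that first-seen order is preserved.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;

final class DedupSketch {
    // 1. Sort, then scan bottom to top, removing each element equal to its predecessor.
    //    Working from the end means every removal shifts as few elements as possible.
    static <T extends Comparable<? super T>> void scanBottomToTop(ArrayList<T> list) {
        Collections.sort(list);
        for (int i = list.size() - 1; i > 0; i--) {
            if (list.get(i).compareTo(list.get(i - 1)) == 0) {
                list.remove(i); // removes by index
            }
        }
    }

    // 2. Copy the non-duplicate elements of an already sorted list to a new list.
    static <T> List<T> copySortedUnique(List<T> sorted) {
        List<T> result = new ArrayList<>(sorted.size());
        T previous = null;
        boolean first = true;
        for (T item : sorted) {
            if (first || !Objects.equals(item, previous)) {
                result.add(item);
            }
            previous = item;
            first = false;
        }
        return result;
    }

    // 3. Let a hash-based Set discard the duplicates; no sorting needed.
    static <T> List<T> viaHashSet(Collection<T> source) {
        return new ArrayList<>(new LinkedHashSet<>(source));
    }

    // 4. Walk an already sorted list and drop repeats in place with Iterator.remove().
    static <T> void viaIteratorRemove(List<T> sorted) {
        Iterator<T> it = sorted.iterator();
        if (!it.hasNext()) {
            return;
        }
        T previous = it.next();
        while (it.hasNext()) {
            T current = it.next();
            if (Objects.equals(current, previous)) {
                it.remove();
            } else {
                previous = current;
            }
        }
    }
}
```

On an ArrayList, each Iterator.remove() call shifts every later element down by one, which is one plausible reason for it being slow on large lists with many duplicates.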

DeDuping a File

Normally you just read the sorted file sequentially, comparing this record with the previous one, then write the non-dups to a temporary file, in a manner similar to deduping a collection by copying. When you are done you delete the original file, and rename the temporary file to the original file. I usually do this with: HunkIO.createTempFile() and HunkIO.deleteAndRename().
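The steps above can be sketched with the standard java.nio.file API instead of HunkIO, whose method signatures are not shown here. This is a sketch under stated assumptions: the file is already sorted, each line is one record, the encoding is UTF-8, and the class and method names are hypothetical. Files.move with REPLACE_EXISTING stands in for the delete-and-rename step.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class FileDedupSketch {
    // Removes adjacent duplicate lines from an already sorted text file, in place.
    static void dedupSortedFile(Path original) throws IOException {
        // Create the temp file beside the original so the final move stays on one volume.
        Path dir = original.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "dedup", ".tmp");
        try (BufferedReader in = Files.newBufferedReader(original, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            String previous = null;
            String line;
            while ((line = in.readLine()) != null) {
                // Write the record only if it differs from the one just before it.
                if (!line.equals(previous)) {
                    out.write(line);
                    out.newLine();
                }
                previous = line;
            }
        }
        // Replace the original with the deduped copy.
        Files.move(temp, original, StandardCopyOption.REPLACE_EXISTING);
    }
}
```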


This page is posted
on the web at:

http://mindprod.com/jgloss/deduplication.html

Contact Roedy. Please feel free to link to this page without explicit permission.
