One of the big lessons you learn over a life in programming is that it is far easier to keep junk out of a database than to remove it later. A database with lax validation is much harder to code for and keep bug-free than one with ruthlessly strict validation. In particular, this means ensuring there are no duplicates. The problem is that duplicates are rarely identical: they contain conflicting information because they do not get updated in sync. If there are no duplicates, even intentional ones, they cannot get out of sync.
Most deduping is case-sensitive, though some techniques can be done in a case-insensitive way.
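As a sketch of the case-insensitive variant: a TreeSet built with String.CASE_INSENSITIVE_ORDER treats "Apple" and "APPLE" as the same key, so it can act as the "already seen" filter. The class and method names here are illustrative, not from the original sample code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class CaseInsensitiveDedup
    {
    /** Remove duplicates ignoring case, keeping the first spelling seen. */
    public static List<String> dedupIgnoreCase( List<String> in )
        {
        // TreeSet ordered case-insensitively: add() returns false
        // when an equivalent key (ignoring case) is already present.
        Set<String> seen = new TreeSet<>( String.CASE_INSENSITIVE_ORDER );
        List<String> out = new ArrayList<>();
        for ( String s : in )
            {
            if ( seen.add( s ) )
                {
                out.add( s );
                }
            }
        return out;
        }

    public static void main( String[] args )
        {
        // prints [Apple, Banana]
        System.out.println( dedupIgnoreCase( List.of( "Apple", "apple", "Banana", "APPLE" ) ) );
        }
    }
```

Note the first spelling encountered wins; if you want a canonical case instead, lowercase each string before insertion.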
There are several techniques for deduping a Collection. In timing tests, scanning bottom to top was fastest and Iterator.remove() was slowest.
Here is the sample code for five methods, plus a method that does not work, which newbies often try.
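Since the original sample code is not reproduced here, the following is a rough sketch of three of the techniques mentioned, plus the classic broken newbie approach. Method names are my own; the timing claims above are the author's, not verified by this sketch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;

public class DedupMethods
    {
    /** 1. Copy into an order-preserving Set: simplest approach. */
    public static <T> List<T> dedupByCopy( List<T> in )
        {
        return new ArrayList<>( new LinkedHashSet<>( in ) );
        }

    /** 2. Iterator.remove() on a SORTED list: compare each element
     *  with the previous one, removing through the iterator. */
    public static <T> void dedupByIterator( List<T> sorted )
        {
        T prev = null;
        for ( Iterator<T> it = sorted.iterator(); it.hasNext(); )
            {
            T curr = it.next();
            if ( curr.equals( prev ) )
                {
                it.remove();  // safe: removal goes through the iterator
                }
            else
                {
                prev = curr;
                }
            }
        }

    /** 3. Scan bottom to top on a SORTED list: index arithmetic stays
     *  valid because each removal only shifts elements above the cursor. */
    public static <T> void dedupBottomToTop( List<T> sorted )
        {
        for ( int i = sorted.size() - 1; i > 0; i-- )
            {
            if ( sorted.get( i ).equals( sorted.get( i - 1 ) ) )
                {
                sorted.remove( i );
                }
            }
        }

    /** The method that does NOT work: structurally modifying the list
     *  inside a for-each loop throws ConcurrentModificationException. */
    public static <T> void dedupBroken( List<T> list )
        {
        for ( T t : list )
            {
            if ( list.indexOf( t ) != list.lastIndexOf( t ) )
                {
                list.remove( list.lastIndexOf( t ) );  // CME on next iteration
                }
            }
        }
    }
```

The bottom-to-top scan avoids the index-shifting bug that a top-down scan with remove(int) suffers, which is presumably part of why it benchmarks well.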
Normally you just read the sorted file sequentially, comparing each record with the previous one, and write the non-duplicates to a temporary file, in a manner similar to deduping a collection by copying. When you are done, you delete the original file and rename the temporary file to the original name. I usually do this with HunkIO.createTempFile() and HunkIO.deleteAndRename().
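HunkIO is the author's own utility library; the same temp-file-then-rename pattern can be sketched with only standard java.nio.file classes. The file layout here is assumed (one record per line, already sorted), and the names are illustrative.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class FileDedup
    {
    /** Dedup a file whose lines are already sorted, replacing it in place. */
    public static void dedupSortedFile( Path file ) throws IOException
        {
        // Create the temp file in the same directory so the final
        // move is a cheap same-filesystem rename.
        Path temp = Files.createTempFile( file.getParent(), "dedup", ".tmp" );
        try ( BufferedReader in = Files.newBufferedReader( file );
              BufferedWriter out = Files.newBufferedWriter( temp ) )
            {
            String prev = null;
            for ( String line; ( line = in.readLine() ) != null; )
                {
                if ( !line.equals( prev ) )  // keep only first of each run
                    {
                    out.write( line );
                    out.newLine();
                    prev = line;
                    }
                }
            }
        // delete the original and rename the temp file over it
        Files.move( temp, file, StandardCopyOption.REPLACE_EXISTING );
        }
    }
```

Because the input is sorted, all duplicates are adjacent, so one record of look-behind state is enough; no in-memory Set is needed, which is what makes this work for files too big to load.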
This page is posted at http://mindprod.com/jgloss/deduplication.html