EIDE flaw : Java Glossary


EIDE flaw
Many EIDE (Extended Integrated Drive Electronics) controllers, even on motherboards from major manufacturers like Intel, contain either of two flawed chip designs, the RZ-1000 or the CMD-640/CMD-640B. Both will subtly corrupt your data during simultaneous I/O. This is a far more serious problem than the famous "I am Pentium of Borg; you will be approximated" flaw in the Pentium floating point unit.

Introduction

There are serious flaws affecting about 1/3 of all PCI (Peripheral Component Interconnect) motherboards built prior to 1998. The flaws affect any motherboard or EIDE controller paddleboard containing the PC-Tech RZ-1000 PCI EIDE controller chip or the CMD (Command) PCIO (Program Controlled Input/Output) 640 PCI EIDE controller chip.

The flaws affect motherboards from ASUSTeK, AT&T (American Telephone & Telegraph), DEC (Digital Equipment Corporation), Dell, Gateway, Intel, Micron, NEC (Nippon Electric Corporation), Zeos and others. Since Intel makes so many of the motherboards sold under other brand names, the flaws affect many machines, both 486 and Pentium PCI.

The flaws show up most frequently when you run a true multitasking operating system such as OS/2 Warp or NT. They also show up under Windows For WorkGroups in 32-bit mode during tape or floppy backup and restore. In theory the flaws could do damage under DOS (Disk Operating System), DESQview, Windows and Windows For WorkGroups in 16-bit mode, but so far there have been no damage reports. Windows-95 contains code to bypass the flaws.

The RZ-1000 has two flaws. The CMD-640 has those same two flaws plus three others. To make matters worse, most motherboard manufacturers using these two flawed chips connected them up incorrectly. There are software bypasses for these flaws. However, the Warp fix for the CMD-640 reduces disk performance by 15 to 50%. The RZ-1000 fix has negligible impact on disk I/O, though it can slow down background processes.

I would advise buying new hardware to bypass the CMD-640 flaws and living with software fixes to bypass the RZ-1000 flaws.

What are the symptoms?

When you are using an IDE (Integrated Drive Electronics) or EIDE hard disk attached to the EIDE motherboard port, the flaws subtly corrupt your files by randomly changing bytes every once in a while. The flaws introduce bugs into EXE files, subtle errors into your spreadsheets, stray characters into your word processing documents, changes to the deductions in last year’s tax return files and random changes to engineering design files.

This corruption happens when you are simultaneously using your EIDE or IDE hard disk and some other device, most commonly the floppy drive or mag tape backup.

The same sorts of problem may occur on reading a CD-ROM (Compact Disc — Read Only Memory) drive attached to an EIDE port.

Is it Serious?

These flaws are nasty. They are causing hundreds of times more havoc than the infamous Pentium divide flaw ever did. I am Pentium of Borg. You will be approximated.

Not only does this corruption occur, but it occurs quietly, often going unnoticed.

If the system crashes, you usually put the blame on the operating system software, or the application. It might actually be a faulty RZ-1000 or CMD-640 EIDE controller chip nailing you.

When a directory becomes corrupted, you may not notice it until the damage is irreparable. If a spreadsheet application reads a comma-delimited ASCII (American Standard Code for Information Interchange) file, it may simply miss a few bytes in a number. That error may go unnoticed and could cascade through the rest of the spreadsheet.

If you have had unexplained crashes in OS/2, you have probably experienced the problem and should make a thorough check for hidden corruption. Remember that the bug may only slightly alter your data and the corruption may not be obvious.

Keep in mind that not every problem is the RZ-1000’s or the CMD-640’s fault. Overheating, unrelated hardware faults and design flaws, or software bugs can cause similar symptoms. DMA (Direct Memory Access) channel conflicts also cause similar symptoms. Happily, EIDEtest and CDTest can unmask all manner of simultaneous I/O faults.

Unfortunately, correcting the problem just stops further file corruption. It will not help to clean up the existing damage to your files. Right now, the focus is on bypassing the flaws. Preventing further corruption is child’s play compared with the nightmare of trying to track down all the existing random errors in files. Backups even from day one may be corrupt. If you have either of the flawed chips, you will probably never be able to completely eliminate the effects of past corruption.

How Do You Tell If You Have The Flawed Chips?

There are four categories of motherboard:
  1. Definitely safe. Motherboards may still have flaws, but all software in use bypasses them.
  2. Probably safe. In theory there could be problems, but no one has reported any so far.
  3. Possibly dangerous. You will have to run EIDEtest, CDtest, or IOTest to find out.
  4. Probably dangerous. You will still have to run the tests to find out for sure.

Definitely Safe

Definitely safe includes older machines with ISA (Industry Standard Architecture), EISA (Extended Industry Standard Architecture), or MCA (Micro-Channel Architecture) buses. The flaws only affect machines with the new PCI bus or the VESA (Video Electronics Standards Association) VL bus. PCI machines that use the new Triton chipset from Intel do not have the flaws.

PCI machines with Intel BIOSes (Basic Input Output System) that run only DOS, DESQview, Windows 3.1 or Windows-95 are safe. If you have a non-Intel BIOS, run only DOS, DESQview, Windows 3.1 or Windows-95, and never use the fast mode simultaneous disk I/O feature on floppy or tape backup/restore, you are safe.

You still might want to test your machine. There are similar problems with other causes that the tests will unmask.

Probably Safe

If you have a non-Intel BIOS and run only DOS, DESQview, Windows 3.1, or Windows for WorkGroups 3.11 in 16-bit disk access mode, you probably will not see the problem, even though you may have one of the faulty chips.

Possibly Dangerous

Most auxiliary chipsets (e.g., OPTI Viper, SMC, Mercury and Neptune) used on PCI motherboards do not include a built-in EIDE controller. Such motherboards use a separate EIDE controller chip, often the flawed RZ-1000 or CMD-640. If you use a separate no-name EIDE paddleboard, it will likely use one of the flawed chips. In theory, the flaws could affect DOS, Windows and Windows For WorkGroups with 16-bit disk access during floppy/tape backup and restore, though no one has reported problems yet. Windows For WorkGroups with 32-bit disk access is dangerous if you have the flaws.

Probably Dangerous

PCI motherboards (both 486 and Pentium) with the older Mercury and Neptune chipsets are likely to have the flawed chips. The Mercury chipset was popular in P60 and P66 systems and the Neptune in P70, P90 and P100 systems. Mercury chipsets are labeled with an MX suffix and Neptune with NX. If you are using NT, OS/2 Warp or Linux, you are likely to have already experienced extensive file corruption if either of the flawed chips is present. Check the list later in the article for motherboards known to carry the flawed chips.

Testing For The Flaws

Scot Llewelyn, one of the eight authors of PowerQuest’s PartitionMagic, discovered one of the RZ-1000 flaws and made it public. Prior to that, only employees of PC-Tech, Intel and Microsoft were aware of how to bypass the flaws. In the process of tracking the RZ-1000 problems down, Internet comp.os.os2.bugs participants discovered a second flawed chip, the CMD-640.

Scot did most of the initial work documenting the first RZ-1000 flaw. He wrote a program called IOtest that can detect the flaws if:

  1. You are using OS/2 Warp.
  2. You are willing to go through the hassle of creating a separate small partition to run the test. You can use his program, PartitionMagic, to make room to create one.
  3. You have an EIDE hard disk attached to your EIDE port. It cannot detect the problem if you only have an EIDE CD-ROM, or if the EIDE port is currently unused.

Scot originally called his test program DMAtest because he erroneously thought simultaneous DMA was the sole culprit. Do not confuse PowerQuest DMAtest with Gazelle’s DMAtest which only tests if the floppy drive will work happily simultaneously with the hard disk.

The world needed an easier-to-use test that would run under DESQview, Windows, Windows For WorkGroups, Windows 95, NT and OS/2. So I wrote EIDEtest to test for the flaws without requiring you to create a special partition or buy OS/2 Warp. I also wrote CDTest to test for the flaws when you have an EIDE CD-ROM drive.

You can also get both programs from me by snail mail.

If these tests fail, it proves you have a serious problem, but not necessarily that you have the RZ-1000 or CMD-640 chip.

If the tests pass, you still may have a problem since, especially under DOS, DESQview and Windows, the flaws may only show up very rarely. If you run the tests under Windows-95 they will always pass, even if you have a defective chip because the operating system already bypasses the flaws. If you suspect trouble, run the tests several times.
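
The general idea behind this kind of test can be sketched in a few lines. This is not the source of EIDEtest, CDTest or IOtest. It is only an illustration, with hypothetical names (`make_pattern`, `background_io`, `count_mismatches`), of writing a known pattern, reading it back while a second stream of I/O is running, and counting any bytes that differ. On a modern operating system the reads will usually be served from the file cache, so the sketch shows the principle rather than exercising a real controller.

```python
import os
import tempfile
import threading


def make_pattern(size: int) -> bytes:
    """Build a deterministic byte pattern so any changed byte is detectable."""
    return bytes((i * 7 + 3) % 256 for i in range(size))


def background_io(path: str, stop: threading.Event) -> None:
    """Keep a second stream of file I/O busy until asked to stop."""
    chunk = b"\xAA" * 4096
    while not stop.is_set():
        with open(path, "wb") as f:
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())


def count_mismatches(directory: str, size: int = 64 * 1024, passes: int = 5) -> int:
    """Write a pattern, re-read it `passes` times during concurrent I/O,
    and return how many bytes differed from what was written."""
    pattern = make_pattern(size)
    data_path = os.path.join(directory, "pattern.bin")
    noise_path = os.path.join(directory, "noise.bin")
    with open(data_path, "wb") as f:
        f.write(pattern)

    stop = threading.Event()
    worker = threading.Thread(target=background_io, args=(noise_path, stop))
    worker.start()
    mismatches = 0
    try:
        for _ in range(passes):
            with open(data_path, "rb") as f:
                data = f.read()
            # Count changed bytes, plus any bytes missing or extra.
            mismatches += sum(a != b for a, b in zip(pattern, data))
            mismatches += abs(len(pattern) - len(data))
    finally:
        stop.set()
        worker.join()
    return mismatches


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print("mismatched bytes:", count_mismatches(tmp))
```

As with the real tests, a clean run of such a check proves little when the fault is intermittent, which is why repeated runs matter.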

Visual Inspection

You can also have a look at your motherboard. Between the PCI slots, at the edge of the motherboard, look for a rectangular chip about 1 × 2 cm (0.39 × 0.79 in) that says RZ-1000 near the top of the chip. There are variations on the chip name, e.g., RZ-1000BP. Unfortunately, the markings are not always present, especially in ASUSTeK motherboards which may have the CMD PCIO 640A or CMD PCIO 640B chip. As of 1995-10, all versions of the RZ-1000 and CMD-640 are defective, even new ones.

Direct Tests

The OS/2 Warp Bonus Pack Sysinfo version 3.02 utility (the upgraded downloaded version) will report on your EIDE controller. The signature for the RZ-1000 looks like this:
manufacturer: PC TECHNOLOGY INC
class code: 0001
Vendor ID: 1042
Device ID: 1000
Revision ID: 0001
For the CMD-640B it will look like this:
manufacturer: CMD TECHNOLOGY INC
class code: 0001
Vendor ID: 1095
Device ID: 0640
Revision ID: 0002
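
Both signatures come down to a Vendor ID and Device ID pair. As a minimal sketch, assuming the IDs are printed as hexadecimal strings, a lookup could be written as follows. The function name `identify_eide_chip` and the table are illustrative only and are not part of Sysinfo or any vendor tool.

```python
from typing import Optional

# Hypothetical lookup table: (Vendor ID, Device ID) pairs quoted in this
# article, mapped to the flawed chip they identify.
FLAWED_EIDE_CHIPS = {
    (0x1042, 0x1000): "PC-Tech RZ-1000",
    (0x1095, 0x0640): "CMD-640",
}


def identify_eide_chip(vendor_id: str, device_id: str) -> Optional[str]:
    """Return the flawed chip's name, or None if the IDs match neither chip.

    Both IDs are hexadecimal strings, as a controller report would print them.
    """
    key = (int(vendor_id, 16), int(device_id, 16))
    return FLAWED_EIDE_CHIPS.get(key)


if __name__ == "__main__":
    print(identify_eide_chip("1042", "1000"))  # the RZ-1000 signature
    print(identify_eide_chip("1095", "0640"))  # the CMD-640 signature
    print(identify_eide_chip("FFFF", "0000"))  # an unrelated pair gives None
```

A result of None only means the controller is neither of these two chips. It does not prove the machine is free of other simultaneous I/O faults.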

The Warp disk driver IBM1S506.ADD with the /V switch will tell you if you have the RZ-1000 or CMD-640 chip.

Intel has written a new test, CtrlTest.exe, that looks directly for either of the two faulty chips. However, it is filed under its old name, RZTest.exe.

The Windows-95 Control Panel will also report on the EIDE controller chip.

Where Have Flaws Been Found?

Via email, on BIX (Byte Information Exchange), on the Internet and in comp.os.os2.bugs, people have reported finding flaws in the following specific motherboards.
Motherboard Chip Reporters
Acculogic VL Paddleboard CMD-640 Mark Lord (mlord@bnr.ca) tentative
Acer Power P75 CMD-640 John Harvey, Beta Machinery Calgary
ACMA P590 ? Bob Smith
AST Bravo MS-T P/75 CMD-640 Mike Coplien (kcoplien@facstaff.wisc.edu)
ASUSTeK PCI/I P54SP4 CMD-640 Marco Trunzer (ujjm@rzstud1.rz.uni-karlsruhe.de)
Maurice Schekkerman (schekker@prl.philips.nl)
Mike Coplien (kcoplien@facstaff.wisc.edu)
Robert Schultz (robert.schultz@execnet.com)
Thomas L. Kusterer (kustetl1@aplcomm.jhuapl.edu)
AT&T Globalyst 590 RZ-1000 Brian Myrick (brian@jagonet.com)
AT&T Globalyst 600 RZ-1000 Brian Myrick (brian@jagonet.com)
AT&T Globalyst 630 CMD-640 Mike Coplien (kcoplien@facstaff.wisc.edu)
CMD CSA-62101Kx VL2 IDE paddleboard CMD-640B George Voros (george.voros@ghbbs.com)
Compaq Presario CMD-640 Walter Wu (wu000016@mc.duke.edu)
Compaq Prolinea CMD-640 Walter Wu (wu000016@mc.duke.edu)
DEC Celebris 590 CMD-640 Fred Thomsen (fthomsen@lexis.pop.upenn.edu)
DEC Starion 700I CMD-640 Mike Coplien (kcoplien@facstaff.wisc.edu)
DEC Venturis 466 CMD-640 Mike Coplien (kcoplien@facstaff.wisc.edu)
DEC Venturis 560 CMD-640 Fred Thomsen (fthomsen@lexis.pop.upenn.edu)
Dell Dimension XPS P100 RZ-1000 Scot Llewelyn (scotl@itsnet.com)
Dell Dimension XPS P75 RZ-1000 Steve Ertman (sertman@ocean.fsu.edu)
Dell Dimension XPS P90 RZ-1000 Dong Chen (D_Chen@netcom.com)
Larry Lai (lai@iastate.edu)
Lawrence Rounds (ljrounds@netcom.com)
Mike Griggs (mpg@iadfw.net)
Mike Heath (heath@rohan.sdsu.edu)
Moira Watson (watson6@uwindsor.ca)
Nathaniel Beck @weber.ucsd.edu
Pete (pag@interramp.com)
Shallenberg (bobshall@subtone.wanet.com)
Wijadi Jodi (r2nw@dax.cc.uakron.edu)
Dell Optiplex 575 CMD-640 Mike Coplien (kcoplien@facstaff.wisc.edu)
Dell Optiplex XM 590 CMD-640 Aron Eisenpress (afecu@cunyvm.cuny.edu)
Dell XPS-133c neither Blake Scholl (bscholl@one.net)
EliteGroup S154P-AIO CMD-640 Ulf Volz (volz@student.uni-kl.de)
EliteGroup UM8810P-AIO CMD-640 Bodo Huckestein (bh@thp.Uni-Koeln.DE) Guy Kapteijns (W.Kapteijns@kub.nl)
Escom P5/60 (Intel Premiere ATLX) CMD-640 Detlef Meier (detlef.meier@materna.de)
Rogier van Wanroij (wanroij@cs.utwente.nl)
Escom P60I CMD-640 Tim Schofield (schofieldt@logica.com)
Escom P90 RZ-1000 Karl Knoflach (151579kk@student.eur.nl ) (Xav@mantra01.demon.co.uk)
Gateway 2000 P5-60, Intel Mercury Rev 3 RZ-1000 Angus Black (angus@spanner.hiway.co.uk)
Gary Farr (garyfarr@ix.netcom.com)
Daron Davis (daron_davis@dca.com)
Jerry Lynch (lynch.94@osu.edu)
Keith Patterson (dinosaur@buffnet.net)
Rick Gregory (rfg@us.dynix.com)
Roy L. Smith (smittyry@ix.netcom.com)
Gateway 2000 P5-66 RZ-1000 Randy Nerwick (nerwick@netcom.com)
Gateway 2000 P5-90 RZ-1000 Alan Murphy (alan@jac.co.uk) Roy L. Smith (smittyry@ix.netcom.com)
Gigabyte GA586-AP (ALI chipset) CMD-640 Yacov Jegher (jegher@accent.net)
HP (Hewlett Packard) Vectra 590 CMD-640 Javier Vizcaino (jvizcain@msn.com)
Intel Hendrix CMD-640 Clif Purkiser Intel Corp (support@cs.intel.com)
Intel Insight P5-60, Premiere PCI II Baby AT (Advanced Technology), Neptune Chipset RZ-1000 Jim Arnone (arnone@primenet.com)
Intel Plato 90 RZ-1000 Adrian Teo (adriant@singnet.com.sg)
Alain Rassel (Alain.Rassel@restena.lu)
Chris Norman (cnorman@oboe.aix.calpoly.edu)
Clif Purkiser Intel Corp (support@cs.intel.com)
Kevin Chua (chua@server.uwindsor.ca)
Kevin T. Van Maren (vanmaren@cs.utah.edu)
Kim Hvarre (kims@crash.ping.dk)
Martin Kogelbauer (e8826847@student.tuwien.ac.at)
Rick Nelson (rnelson2@ccmail.unl.edu)
Richard Techmanski (richt@netcom.com)
Intel Premiere RZ-1000 Clif Purkiser Intel Corp (support@cs.intel.com)
Intel Premiere LPX CMD-640 Clif Purkiser Intel Corp (support@cs.intel.com)
Intel Premiere MM CMD-640 Clif Purkiser Intel Corp (support@cs.intel.com)
Intel Robin LC CMD-640 Clif Purkiser Intel Corp (support@cs.intel.com)
Knowledgebase P90 laptop CMD-640 Andy Longton (alongton@clark.net)
Micron P75 CMD-640 Leroy Latta (latta@ibm.net)
Micron P5-90 CMD-640 Primary fails, secondary is OK. Eric Johnson (johnson@scripps.edu) Jim Short (jdshort@primenet.com) Mike Coplien (kcoplien@facstaff.wisc.edu)
Micronics M54Pi CMD-640 Adam Haar (s9406709@yallara.cs.rmit.edu.au)
Midwest Micro P90 CMD-640 (412d25$e8j@clarknet.clark.net)
NEC Image P90 CMD-640 Mike Coplien (kcoplien@facstaff.wisc.edu)
Packard Bell Legend 100CD CMD-640 James Treworgy (jamie@access.digex.net)
PCI-EIDE local clone, Phoenix BIOS 4.04, ALI chipset CMD-640 (whelk@ios.com)
Quantex P5/90 PM-2 RZ-1000 Jay Schamus (jaylord@rcinet.com)
S1366 PCI EIDE paddleboard CMD-640B Ross Fleming (rossflem@serv.net)
Scandic UMC VIO8810A CMD-640B Daniel Spangberg (daniels@kemi.uu.se)
Soyo SY-4SA2 486 prior to B5 ? Jeffrey Hurwit (jhurwit@netcom.com)
Tagram SQ-588 CMD-640 Kurt Krasinski (kurt.krasinski@aquila.com)
Unknown 486 DX SMC37650 Eric Stephen Mountain (esm1@oak70.doc.ic.ac.uk )
Unknown 90 MHz ? Andreas (abenamou@galaxy.csc.calpoly.edu) Carol Lim (law30185@nus.sg)
Viglen P90 (Intel Plato) RZ-1000 Phil Buckley (phil@starbug.swstyle.co.uk)
Vobis RZ-1000 Thomas Wagner (twagner@bix.com)
Vobis 4886DX2-66 CMD-640 Guy Kapteijns (W.Kapteijns@kub.nl)
Zenon P90 RZ-1000 Aria Novianto (novianap@cs.purdue.edu)
ZEOS Pantera RZ-1000 Paul Whitelock (paulw9DDFL3r.DDI@netcom.com)

Known Good Motherboards

The following motherboards have been tested with EIDEtest or CDtest and found to be OK. Not to worry, there are many more good boards than I have listed here:
Motherboard Chip Reporters
Arsys P200-PCI Triton/sis Robert Aboud (raboud@pacific.telebyte.com)
ASUSTeK PCI/I-P54TP4 Triton Roedy Green
Dell Dimension XPS P90c ? Note: older versions of this board were flawed. Dave Nuttall (dnuttall@texas.net)
Intel Zappa Triton Ron McGlade (ronmc@primenet.com)
Micronics 486 VLB ? Bob Meredith (meredith@interactive.net)
Seanix Opti Viper Bill Unruh (unruh@physics.ubc.ca)
Soyo SY-4SA2 486/B5 SYS Jeffrey Hurwit (jhurwit@netcom.com)

What Can You Do If You Have A Flaw?

  1. Pester the manufacturer. Unfortunately, the EIDE controller chips are soldered in. The only way to repair a flaw is to replace the whole motherboard, recycling the socketed chips — the CPU (Central Processing Unit), DRAM (Dynamic RAM) and SRAM (Static Random Access Memory) cache. It would be very expensive for computer and motherboard manufacturers to fix a flaw.
    After a month of stonewalling, Dell has announced it will offer a BIOS upgrade to turn off the prefetch buffers.
    According to lovergin@ens.lifl.fr, one retailer, La Cle Informatique, in France is offering to replace the defective Vobis motherboards it sold.
    You can contact Dell at support@us.dell.com or (800) 624-9896.
    Intel is now acknowledging the problem. For a short while, Intel offered to replace defective motherboards, then they reneged. You can contact them at support@cs.intel.com or call their tech support line (800) 628-8686. Select options 1-3-1. You can find international contact numbers via: http://www.intel.com/feedback.htm.
    You can call ASUSTeK at (408) 956-9077.
    Call PC-Tech at (612) 345-4555.
    Call CMD Technology at (714) 454-0800, (800) 426-3832 or (714) 455-1656 FAX (Facsimile).
  2. Buy a new unpopulated Triton PCI motherboard and recycle the CPU, DRAM and SRAM cache chips from the old motherboard. Unfortunately, the Triton chipset has design shortcuts that hamper performance in simultaneous I/O situations. At least they don’t corrupt data.
  3. Run the controller in degraded mode. Some BIOSes have a feature to disable the EIDE prefetch buffer. Vendors may offer a BIOS upgrade to allow you to manually disable prefetch. The BIOS may also turn it off automatically if either of the defective chips is present. This will bypass both RZ-1000 flaws and two of the five CMD-640 flaws. Art Scott (scotta@pilot.msu.edu) suggests that you can sometimes tweak the performance of the RZ-1000 back up by changing the advanced BIOS setting for the maximum number of cycles a PCI device can hold the PCI bus before the next board gets a turn, lowering it from 66 to 33.
  4. Buy a PCI EIDE paddleboard controller such as the DTC (Data Technology Corp) 2130S, the Tekram 290N/290S, the Promise 2300+ or the BusLogic BT-910 to replace the one on the motherboard. You must disable the EIDE controller on the motherboard. This fix will waste one of your precious slots. Be careful. You could be leaping out of the RZ-1000 frying pan into the CMD-640 fire since paddleboards often use the CMD-640.
  5. Buy a SCSI (Small Computer System Interface) hard disk and CD-ROM, and avoid using the EIDE ports entirely. Under OS/2 and Linux, SCSI gives better performance, but costs more. DOS, Windows, Windows For WorkGroups and Windows-95 are unable to exploit the advanced features of SCSI, but at least avoid the EIDE flaws when you go pure SCSI.
  6. Find a software work-around. There are fixes for Warp to bypass all the flaws in the RZ-1000 and CMD-640. Fixpack 10 is the first fixpack to bypass the flaws. Now that Intel and IBM (International Business Machines) have finally revealed the technical details, all the operating system writers can patch their EIDE drivers to bypass the flaws. There are also fixes for NT 3.1 and 3.5. See below for details.
  7. Get a BIOS upgrade. For DOS, DESQview and Windows 3.1, to bypass the flaws you may need a new BIOS, which means a new EPROM chip. If you have a flash BIOS, you can update it simply by downloading a file. Most BIOSes already have code to bypass the flaws for DOS, DESQview and Windows. However, more advanced operating systems bypass the BIOS, so even a smart BIOS will not protect you. On the other hand, the BIOS CMOS (Complementary Metal Oxide Semiconductor) settings may allow you to disable prefetch, which protects you even in true multitasking operating systems.
  8. Cut the trace. Cut the trace on the motherboard from the floppy changeline to the EIDE controller. However, this just bypasses one of the CMD-640’s five flaws and one of the RZ-1000’s two flaws.
  9. Use the Secondary EIDE Controller. Some motherboards such as the Micron P5-90 M54Pi-N 11P use different kinds of controller on the primary and secondary EIDE ports. The primary may be flawed, but the secondary OK.

Whatever method you use to bypass the flaws, retest with EIDEtest and CDTest afterwards to be sure your fix worked and you caught all the problems.

Cleaning Up The Mess

Once you have bypassed the flaws, you can start working on the problem of cleaning up your files.

The first thing to do is to re-install your operating system and all your application programs. This will replace any damaged EXE and DLL (Dynamic Link Library) files.

Catching errors in your data files is more difficult. Keep your eyes peeled for any improbable spreadsheet results. You may have to hire a programmer to write you some comb programs to sniff through your databases, looking for suspicious values.

If you routinely use the verify feature of Lotus Magellan for DOS, it can detect changes to files that should not have changed. This may help you uncover some of the damage. The flaws are not polite enough to redate the files they corrupt.
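
The same kind of check can be sketched without Magellan. The example below is a minimal illustration, not a description of how Magellan works. The function names and the in-memory snapshot format are assumptions made for the example. It records a checksum and modification time for every file, then flags files whose contents changed although their timestamp did not.

```python
import hashlib
import os


def file_digest(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(root: str) -> dict:
    """Map each file under root to (modification time in ns, digest)."""
    result = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            result[path] = (os.stat(path).st_mtime_ns, file_digest(path))
    return result


def silently_changed(old: dict, new: dict) -> list:
    """List files whose contents differ while the timestamp is unchanged."""
    suspects = []
    for path, (mtime, digest) in old.items():
        if path in new:
            new_mtime, new_digest = new[path]
            if new_mtime == mtime and new_digest != digest:
                suspects.append(path)
    return sorted(suspects)
```

A file you edited on purpose gets a new timestamp and is not reported. Only files that changed behind your back show up. Such a check can only catch damage that happens after the first snapshot was taken.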

If you have backups from before the time you bought the faulty machine, you can restore them and re-key everything.

Most people will not be so fortunate. All their backups will also be corrupt.

Most people with flaws will just have to put up with random errors dotting their data files ever after.

Operating System Summary

Operating System Work Around
Netware
Unixware 1.1
NEXTSTEP
Banyan
Solaris 2.4+
SCO Unix 3.1+
Windows-95
Windows-98
Windows-2000
- No problems reported.
DOS
DESQview
Windows 3.1
No problems reported so far. If you do have trouble:
  • Turn off EIDE prefetch in CMOS settings.
  • Upgrade BIOS chip.
  • Turn off simultaneous disk/floppy/tape I/O in your backup programs.
Windows For WorkGroups
  • Turn off 32-bit disk access mode.
  • Turn off EIDE prefetch in CMOS settings.
  • Upgrade BIOS chip.
  • Turn off simultaneous disk/floppy/tape I/O in your backup programs.
Windows NT 3.1
  • Turn off EIDE prefetch in CMOS settings.
  • Apply ATDISK.SYS fix.
Windows NT 3.5
  • Turn off EIDE prefetch in CMOS settings.
  • Apply the 640XNT35.ZIP fix.
OS/2 2.1
  • Disable prefetch buffer in CMOS settings.
  • Load the IBMINT13.I13 driver instead of the IBM1S506.ADD driver. This trick will only work if your BIOS has flaw bypass code. It will be slow.
  • Upgrade to Warp.
OS/2 Warp 3 Apply Fixpack 10; it contains all the special fixes.

If, for some reason, you are unwilling to apply Fixpack 10, you can do the following:

  • Disable prefetch buffer in CMOS settings.
  • Apply the RZ-1000 portion of pj19409.zip if you have the RZ-1000.
  • Apply the CMD portion of pj19409.zip including IBMIDECD.FLT if you have the CMD-640.
  • If that does not work, try basedev=CMD640x.add /16BIT.
  • In a pinch, if you cannot do either of the first two things, add the line BASEDEV=IBMINT13.I13 to config.sys and remove the line BASEDEV=IBM1S506.ADD. The IBMINT13.I13 device driver lives in C:\OS2\BOOT, on the first install diskette, and on the CD-ROM in \OS2IMAGE\DISK_1. This trick will work only if your BIOS has flaw-bypass code. It will be slow.
Linux All current Linux kernels have a workaround that can be compiled in (you may have to compile your own kernel though).

For older versions:

  • Disable prefetch buffer in CMOS settings.
  • To bypass the CMD-640 flaws use the boot time kernel parameter: hda=serialize.
  • To bypass the prefetch flaws, keep the default setting of the external hdparm (Hard Disk Parameter) utility, which suppresses interrupts during I/O.
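
The OS/2 fallback listed earlier in this summary, swapping the IBM1S506.ADD driver for IBMINT13.I13 in config.sys, is a plain text edit. As a hedged sketch, the hypothetical helper below works on the text of the file rather than on a live system. It drops any BASEDEV line for IBM1S506.ADD and makes sure a BASEDEV=IBMINT13.I13 line is present. The helper name is an assumption, and it writes plain newline line endings.

```python
def use_int13_driver(config_text: str) -> str:
    """Return config.sys text that loads IBMINT13.I13 instead of IBM1S506.ADD."""
    old = "BASEDEV=IBM1S506.ADD"
    new = "BASEDEV=IBMINT13.I13"
    kept = []
    have_new = False
    for line in config_text.splitlines():
        # Compare case-insensitively and ignore spaces around the '='.
        compact = line.strip().upper().replace(" ", "")
        if compact.startswith(old):
            continue  # drop the IBM1S506.ADD line, including any switches
        if compact.startswith(new):
            have_new = True
        kept.append(line)
    if not have_new:
        kept.append(new)
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    sample = "FILES=20\nBASEDEV=IBM1S506.ADD /V\nBASEDEV=IBMKBD.SYS\n"
    print(use_int13_driver(sample))
```

Keep a copy of the original config.sys before trying anything like this, since a bad BASEDEV line can leave the system unbootable.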

Reporting Your Findings

Whether or not you find any flaws, please email your feedback to Roedy Green at Canadian Mind Products so I can add your board to the appropriate list. Please include:
  1. Test results. (I would like to hear about machines both with and without flaws.)
  2. Brand and model of your motherboard.
  3. Brand and model of your entire system.
  4. Which chip did you find, the RZ-1000, the CMD-640, the SMC 37650? What did SYSINFO 3.02 report about your EIDE controller chip?
  5. Have you noticed data file corruption?
  6. Which tests and versions did you use? (IOtest, EIDEtest, CDtest, RZtest, CtrlTest or visual inspection)
  7. What activities did you run in the background during the test?
  8. Which operating system and version did you use to run the test? (e.g. Warp Connect blue spine)
  9. Which fixpacks and patches did you apply before running the test?
  10. Brand and model of EIDE hard disk
  11. Brand and model of EIDE CD-ROM
  12. Markings on the suspect chip, e.g., RZ-1000BP, CMD PCIO640B, SMC 37650.
  13. Vendor’s name
  14. Vendor’s response on informing him of your problem.

Whose Fault Is It?

The wags will have fun tormenting Intel for using the flawed RZ-1000 and CMD-640 in its motherboard designs, even though Intel did not manufacture either of the two faulty chips. Intel is not the only company to manufacture motherboards with the faulty chips, but Intel will bear the brunt of the bad publicity.

PC-Tech manufactured the faulty RZ-1000 EIDE controller chip used in many PCI motherboards. PC-Tech is a subsidiary of ZEOS, the clonemaker. In turn, Micron Electronics owns ZEOS. PC-Tech has offices just down the street from ZEOS in Minnesota. Intel bought the chips from PC-Tech and in turn many clone makers bought motherboards from Intel. Other motherboard manufacturers also used the faulty chips. In a similar way, Intel and other companies also used the CMD-640 chip from the CMD Technology Corporation of Irvine, California.

PC-Tech, Intel and the clone makers all failed to test their designs properly. The software makers did not test their software on enough machines to show up the problem before releasing it.

Even worse, in some motherboard designs, Intel used the CMD-640 chip. This goof was inexcusable, since the chip, by deliberate design, is incapable of simultaneous I/O.

How did the flawed CMD-640 chip and the RZ-1000 slip through Quality Assurance testing? My guess is no one did real world testing; technicians only tested under laboratory conditions using only simple operating systems like DOS. They might have ignored flaws that happened only sporadically, blaming it on a faulty chip rather than a faulty design. It is very hard to catch a flaw that only manifests rarely.

CMD, PC-Tech, Intel and Microsoft have known about how to bypass these problems for quite some time. IBM was aware there was a problem but was unaware of the solution. For obvious reasons, these companies were reluctant to inform the public of the danger of the ongoing subtle corruption.

No one who understood the RZ-1000 and CMD-640 flaws publicised their findings. If PC-Tech, Intel and Microsoft had not been so secretive, they could have averted the damage. Perhaps they were silent because the flaws primarily hurt the customers of a competitor, IBM.

The collective damage done by withholding information about the flaws is huge, certainly many millions of dollars for those large companies whose backups are corrupt as well. It will be interesting to see if anyone launches a damage lawsuit against CMD, PC-Tech, Intel or Microsoft. If they do, it might make both hardware and software makers more careful about releasing improperly tested products.

IBM is not totally innocent either. According to Massimiliano Vispi (massiv@mix.it), on 1994-06-17, IBM posted a document:

http://ps.boulder.ibm.com/pbin-usa-ps/pub_huic_getrec.pl?DVantero.swm.boulder.ibm.com+DBos2+DA22398+STH085835+USPublic
that stated:
"Another case has been where the PCTech chip RZ-1000 used for IDE operations on the PCI bus is in use (PJ15378). On Intel Pentium motherboards with PCI/IDE on board slot, data is sometimes lost. This is a hardware error. This is PJ15378."

Sam Detweiler of IBM explained that this referred only to the trailing 2 byte loss RZ-1000 problem. IBM was not aware of the concurrent floppy problem with prefetch at that time.

Discussions with Intel and PC-Tech led IBM to believe that rewriting the interrupt handler to avoid reading the IDE status register recursively would solve the problem. PC-Tech never did explain the precise failure mechanism.

IBM says the CMD-640 problem also appeared in 1994-10 with the Vobis systems. CMD did not inform IBM of the problem.

Prefetch also affected the CMD chips (640, 640A and 640B). CMD built their own driver based on IBM code to handle the serialization problem. They did not fix the prefetch problem in their driver so it appears they too were unaware of it at this time.

There is potential here for some massive lawsuits. No wonder the companies who knew about the flaws have been so tight-lipped. Think of the damage if Boeing or GM (General Motors) had its plans for coming products stored on flawed machines. Literally, these flaws could cause plane crashes.

Intel’s Spin

There are three levels of Intel Inside.
  1. Weak. Your motherboard has an Intel CPU but a support chipset from another manufacturer.
  2. Medium. Your motherboard has an Intel CPU and Intel support chipset such as the Neptune or Triton, but some other company built the BIOS and motherboard.
  3. Strong. Your motherboard has an Intel CPU, Intel support chipset, Intel motherboard and Intel BIOS.

Intel literature on the RZ-1000 and CMD-640 only refers to (3). Intel cannot very well speak for (1) and (2) where the PCI EIDE controller design was out of their control, even though these machines bear the "Intel Inside" logo.

Intel does not make this distinction clear in their literature.

According to Intel, "This problem is a consequence of the RZ-1000’s inability to fully compensate for all the implications of running an IDE hard disk as an extension of the PCI bus, instead of running as an extension of the AT bus which it was originally designed to do."

Intel would have us believe the problems are not flaws per se, but rather a limitation that the programmers forgot to take into consideration.

The truth is grey. UART (Universal Asynchronous Receiver/Transmitter) chips have similar flaws. Programmers have gradually learned to code around them. We don’t insist that all COM (serial communications) port hardware be recalled. We now tend to blame a programmer if he does not bypass the known UART flaws.

Given that software work-arounds are now possible, the primary blame for any perpetuation of the problem shifts to the software authors.

However, there are many other EIDE chip designs that do not have this limitation. Since the chips are supposedly generic implementations of the ATA (Advanced Technology Attachment) interface standard, I cannot so lightly excuse these flaws.

Speculation

Because setting the flaws right would be so expensive, I suspect that clone makers and motherboard manufacturers will continue to refuse to replace the defective equipment. At best they may offer BIOS upgrades to bypass the flaws. Microsoft has already added code to Windows-95 to bypass the flaws. Clone makers will rely on software vendors to write drivers that bypass the flaws for Warp, NT, Linux and the various UNIXes.

Now that the OS/2 fixes are out, the pressure to set things right will dwindle. Since DOS, Windows in 16-bit mode and Windows-95 are immune, little pressure to correct the problem is likely to come from those camps.

The motherboard manufacturer has five options:

  1. Replace the motherboard. Recalls on a mass scale would be extremely costly for the motherboard manufacturers, so you can count on them to fight. ($400 parts + $250 labour)
  2. Provide a replacement paddleboard EIDE controller that takes up a PCI slot. ($75)
  3. Provide a new BIOS chip that bypasses potential problems for DOS and Windows. The BIOS could also turn off prefetch which would rescue multitasking operating systems that do not use the BIOS for I/O. ($10)
  4. Tell the users to upgrade to software that bypasses the flaws and to turn off simultaneous disk/tape/floppy I/O in any backup software run under DOS, DESQview or Windows. Users won’t like the performance hit, however. ($0)
  5. Stonewall and refuse to even acknowledge the problem. This will be more difficult now that Intel and Dell have publicly admitted the problem. ($0)

Intel has already set the precedent by offering to replace defective Pentiums, even though software can bypass its divide flaw. The RZ-1000 flaws are far more serious and the CMD-640 flaws are even more serious still.

Keeping this under wraps is going to be hard for the clone builders. Brooke Crothers of Infoworld did several stories based on my compilations. I have been in contact with Jerry Pournelle of Byte. I sent email to John Dvorak. Even Dean Takahashi of the San Jose Mercury Daily News did a story. In the 1995-11 editions, a 1000-word abridged version of this essay appeared across Canada in The Computer Paper and Toronto Computes. The stonewall is coming tumbling down. As one individual pointed out, "I read your postings on the Internet, and see them the next day quoted in my daily newspaper."

What Are the Flaws?

IBM confirmed the RZ-1000 has two different flaws:

  1. In prefetch mode, multi-sector reads often fail.
  2. The chip erroneously responds to floppy status commands and corrupts hard disk or CD-ROM I/O in the process.

IBM confirmed the CMD-640 has five different flaws:

  1. It has the same prefetch problem as the RZ-1000.
  2. It has the same floppy status problem as the RZ-1000.
  3. It does not support simultaneous I/O on the primary and secondary EIDE ports.
  4. Confusion over legacy and PCI mode.
  5. Does not support 32-bit writes.

The Flaws Under A Microscope

After the manner of Ionesco, Roedy Green said, "All great programmers are paranoid." Programmers have to anticipate problems that could happen only once in a trillion machine cycles, since such problems would still show up on average every three hours. EIDE problems sometimes go days without manifesting. Sometimes they show up within seconds, depending on the unrelated I/O activity in the machine.
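The three-hour figure is simple arithmetic. As a rough check, here is a small Python sketch. The clock rate of about 93 MHz is my assumption, picked to be typical of a Pentium-era machine; the function name is purely illustrative.

```python
# Mean time between events that occur once per N machine cycles.
# CLOCK_HZ is an assumed figure for illustration only.
CYCLES_PER_EVENT = 1_000_000_000_000  # one trillion machine cycles
CLOCK_HZ = 93_000_000                 # assumed clock rate of about 93 MHz


def hours_between_events(cycles_per_event: int, clock_hz: int) -> float:
    """Return the average number of hours between two such events."""
    seconds = cycles_per_event / clock_hz
    return seconds / 3600.0


if __name__ == "__main__":
    hours = hours_between_events(CYCLES_PER_EVENT, CLOCK_HZ)
    print(f"{hours:.2f} hours between events")
```

On a faster clock the interval shrinks in proportion, so a one-in-a-trillion problem becomes more frequent, not less.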

I have read about ten conflicting explanations from authorities on the cause of the problems. Much of the confusion arises because there are so many different flaws, all generating similar symptoms. I based the following explanations on postings from Sam Detweiler of IBM’s Warp Device Driver section (sdetweil@vnet.ibm.com).

The RZ-1000 and CMD-640 both have the prefetch flaw and the floppy status flaw. The CMD-640 has three additional flaws. I will focus on the three most important.

Flaw 1: Prefetch Buffer Flaw

The RZ-1000 and CMD-640 both have the prefetch flaw.

Data moves from the hard disk to RAM (Random Access Memory) via a bit bucket brigade. The RZ-1000 grabs data 16 bits at a time from a buffer in the integrated controller in the hard disk and hands it off 32 bits at a time to the PCI bus. The CPU sits in a tight loop grabbing data from the PCI bus and storing it in RAM. In prefetch mode, the RZ-1000 keeps ahead of the CPU, requesting two 16-bit chunks from the hard disk, in order to have a 32-bit chunk ready when the CPU asks.

When you disable the prefetch buffer, you turn off the parallelism and run in a degraded lock-step mode. In degraded mode, the RZ-1000 waits until the CPU asks for a 32-bit chunk. Then it puts the CPU on hold while it asks the hard disk for two 16-bit chunks. It glues them together and puts them on the PCI bus and allows the CPU to continue.
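The gluing step in lock-step mode can be pictured with a short Python sketch. This is only an illustration: the function names are hypothetical, and putting the first chunk in the low half of the 32-bit value is my assumption, based on the little-endian convention of the PC rather than on any chip documentation.

```python
def glue_words(first: int, second: int) -> int:
    """Combine two 16-bit chunks into one 32-bit chunk (first chunk in the low half)."""
    if not (0 <= first <= 0xFFFF and 0 <= second <= 0xFFFF):
        raise ValueError("each chunk must fit in 16 bits")
    return (second << 16) | first


def lock_step_read(disk_words):
    """Degraded mode: each 32-bit request fetches two 16-bit chunks on demand."""
    words = list(disk_words)
    if len(words) % 2:
        raise ValueError("need an even number of 16-bit chunks")
    # The CPU is on hold for each pair; nothing is fetched ahead of the request.
    return [glue_words(words[i], words[i + 1]) for i in range(0, len(words), 2)]


if __name__ == "__main__":
    for value in lock_step_read([0x3412, 0x7856, 0x0001, 0x0002]):
        print(f"{value:08X}")
```

In prefetch mode the same gluing happens, but the chip does it ahead of time so the 32-bit chunk is already waiting when the CPU asks.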

I advise all but the most dedicated technophiles to skip the next paragraph.

If the RZ-1000 is running with prefetch enabled, it erroneously considers a sector read complete as soon as it has grabbed the last 16 bits from the hard disk and stuffed them into the prefetch FIFO (First In First Out) buffer. It should not consider the read complete until the CPU has stuffed all the data into RAM. The RZ-1000 then starts to read the next sector. If the current read operation is interrupted, or delayed by simultaneous DMA from some unrelated device, before the last two bytes are read from the FIFO, and the next sector is prefetched into the FIFO before the current data transfer completes, then the chip will erroneously signal yet another Data Available Interrupt. Because OS/2 has already signalled EOI (End Of Interrupt) to the PIC (Programmable Interrupt Controller) and enabled interrupts, it recurses into the disk driver interrupt handler. The driver then reads the status register. Unfortunately, because of a cheap design shortcut, the FIFO is used both for data and status. The CPU reads the data in front of the status as if it were the status. This causes the interrupted data transfer to later read the following status as if it were data, resulting in corruption. Both the RZ-1000 and CMD-640 fail in exactly the same way.
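For those who did not skip that paragraph, the swap of data and status can be shown with a toy model in Python. It is not a description of the chip's real internals: the single shared queue, the status value and the data value are all hypothetical, chosen only to show how one premature read shifts everything by one slot.

```python
from collections import deque

STATUS_READY = 0x0058  # hypothetical status value, for illustration only
LAST_WORD = 0xBEEF     # hypothetical final data word of the sector


def simulate(nested_interrupt: bool):
    """Return (data word seen by the transfer, status seen by the handler)."""
    # One shared queue holds the unread tail of the sector, then the status.
    fifo = deque([LAST_WORD, STATUS_READY])
    status_seen = None
    if nested_interrupt:
        # The handler re-enters before the transfer has finished and reads
        # "status", but the unread data word is still at the front.
        status_seen = fifo.popleft()
    data_seen = fifo.popleft()  # the interrupted transfer resumes
    if status_seen is None:
        status_seen = fifo.popleft()  # normal case: status is read after the data
    return data_seen, status_seen


if __name__ == "__main__":
    for nested in (False, True):
        data, status = simulate(nested)
        print(f"nested={nested}: data={data:04X} status={status:04X}")
```

With no nested interrupt the transfer gets its data word and the handler gets the status. With a nested interrupt the two are exchanged, which is the corruption described above.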

There are two software techniques to bypass this flaw:

  1. Never schedule more than one I/O at a time. Use strict polled mode with no interrupts. Turn off all unrelated interrupts during I/O. This is the DOS/Windows approach. The disadvantage is poor performance and possible lost incoming modem characters.
  2. Turn off the prefetch buffer. According to Intel and IBM, in a lightly loaded system, there is sufficient spare capacity on the PCI bus so running in degraded mode only slows the disk down by 1%. However, programs making extensive use of the PCI bus such as LANs (Local Area Networks) or video bit-map painting will also slow down. Both Intel and IBM tell us that turning off prefetch to bypass the flaw has negligible effect on performance. Yet in the Plato BIOS rev 12, Intel says that enabling the prefetch buffers will "significantly increase PCI IDE Hard Disk performance." They can’t have it both ways.

Flaw 2: Floppy Status

The RZ-1000 and CMD-640 both have the floppy status flaw.

This flaw is the result of an incredible chain of blunders.

The original blunder, in the design of the MFM (Modified Frequency Modulation) interface (the predecessor to IDE), was using different bits of the same I/O port, 3F7, for two unrelated purposes: detecting the floppy changeline and reporting hard disk status. Modern EIDE controllers are no longer supposed to do this, but some chips carry on in the old tradition and provide legacy logic. Motherboard manufacturers then often blunder by attaching the floppy changeline to the EIDE controller. This way both the EIDE controller and the floppy controller think they are in charge of reporting floppy changeline status. On top of that, the designers of the RZ-1000 and CMD-640 chips both blundered by trying to save a little silicon by using the same registers to store both hard disk status and data.

For the insatiably curious, here is precisely how the corruption occurs. Simultaneous I/Os to both the hard disk and floppy disk are running. The floppy controller generates an I/O complete interrupt. The floppy driver then checks the floppy status. Part of reading floppy status is checking the changeline bit, contained in the ambiguous port 3F7.

If the motherboard manufacturer goofed and hooked up the floppy changeline to the EIDE controller, the RZ-1000 erroneously responds to the floppy status request. It is in charge of the hard disk, not the floppy. It is the floppy controller’s job to respond. The RZ-1000 feeds two data bytes from its FIFO out as floppy status. These data were supposed to go to the hard disk driver. Thus the chip loses two bytes from the hard disk transfer, corrupting data. Turning off prefetch also solves this problem. Unlike the first flaw, only simultaneous floppy I/O can trigger this problem. Simultaneous I/O of any kind can trigger the first flaw.
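The loss of two bytes per stray floppy status read can be pictured with a toy Python model. The function and its parameters are hypothetical, and where in the transfer the bytes go missing is simplified: here they are always taken from the front of the queue.

```python
from collections import deque


def read_sector(sector: bytes, floppy_status_reads: int = 0) -> bytes:
    """Toy model: each stray floppy status read drains two bytes from the FIFO."""
    fifo = deque(sector)
    for _ in range(floppy_status_reads):
        for _ in range(2):
            if fifo:
                fifo.popleft()  # handed out as bogus floppy status
    return bytes(fifo)  # what the hard disk driver actually receives


if __name__ == "__main__":
    sector = bytes(range(256)) * 2  # a 512-byte sector of sample data
    clean = read_sector(sector)
    damaged = read_sector(sector, floppy_status_reads=1)
    print(len(clean), len(damaged))
```

The driver still expects a full sector, so every byte after the gap lands in the wrong place.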

Flaw 3: No Simultaneous I/O

Only the CMD-640 has this flaw. The CMD-640 can’t do more than one I/O at a time. This flaw was so obvious everyone found out about it long ago. No EIDE controller, even a fully functioning one, can run master and slave simultaneously. However, two separate EIDE controllers are supposed to allow primary and secondary channels to run at once. The CMD-640 has dual controllers on one chip. However, because it lacks a second register set, the primary and secondary channels will not work simultaneously, unlike every other design. For example, you can’t run your EIDE hard disk and EIDE CD-ROM at the same time.

Simultaneous I/O speed is the reason we put two EIDE devices on separate channels, both as masters, rather than making one a master and one a slave on the same channel.

IBM has a bypass for this blunder. When Warp detects a CMD-640, it never schedules more than one I/O at a time, reducing the operating system to DOS-like performance. Independent experiments show the degradation from using the CMD fix is 15 to 50%.
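Why serializing hurts can be seen from a back-of-envelope model in Python. This is a sketch under simple assumptions: two transfers, one per channel, that would otherwise overlap perfectly. The function names are hypothetical and the durations are illustrative, not measurements.

```python
def overlapped_time(primary_ms: float, secondary_ms: float) -> float:
    """Two independent channels: both transfers run at once."""
    return max(primary_ms, secondary_ms)


def serialized_time(primary_ms: float, secondary_ms: float) -> float:
    """One-at-a-time bypass: the second transfer waits for the first."""
    return primary_ms + secondary_ms


def slowdown_percent(primary_ms: float, secondary_ms: float) -> float:
    """Share of the serialized time that overlapping would have saved."""
    slow = serialized_time(primary_ms, secondary_ms)
    if slow == 0:
        return 0.0
    fast = overlapped_time(primary_ms, secondary_ms)
    return 100.0 * (slow - fast) / slow


if __name__ == "__main__":
    for pair in [(100, 100), (300, 100), (100, 0)]:
        print(pair, f"{slowdown_percent(*pair):.1f}%")
```

In this model the penalty is worst when both channels are equally busy and vanishes when only one channel has work to do, which is why the cost of the bypass depends so heavily on the workload.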

Background

If you read the literature on this problem, you will see various daunting technical terms. Here is a rough explanation.

There are six kinds of I/O used in PCs (Personal Computers).

  1. PIO (Programmed Input/Output). The CPU spoon-feeds each byte to the I/O port. The port can usually accept data as fast as the CPU can feed it. Typical IDE drives work this way under DOS. For slower devices, the CPU polls the status to see if the device is ready for yet another byte.
  2. Scheduled I/O. This is a variant of PIO where the operating system feeds the I/O device some bytes, then calculates how long it should take for the I/O device to digest them, then it goes away for a while to do something else, then it comes back when it figures the I/O should be complete and feeds the device a few more bytes. This is how Warp usually controls parallel port printers.
  3. Interrupt I/O. Every time the port is ready to eat another byte, it raises an interrupt and the CPU feeds it some more. This is the typical way COM ports work and how Warp uses printers with the /IRQ option. Warp EIDE drivers combine methods (1) and (3). The hard disk interrupts when it has completed the read into its on-board buffer. Then the CPU fetches data out of the buffer with PIO mode.
  4. Third party DMA. The DMA controller on the motherboard copies data from RAM to the port and generates an interrupt when it is done with a block. Floppy drives and inexpensive mag tape backup drives use this method. Because of the unfortunate original AT design compromises, this method is exceedingly slow. Third Party DMA is never used for PCI bus devices though it is still used for ISA or motherboard-based floppy controllers on PCI motherboards.
  5. First party DMA, sometimes called Bus Mastering. A DMA controller on the device copies data from RAM to the port and generates an interrupt when done. High-end SCSI cards, such as the Adaptec 2940 or 2940W, use this ultimate way to fly.
  6. Memory mapped I/O. The CPU copies data to a magic region of RAM which is actually on the I/O device. LAN (Local Area Network) cards or REGEN (Regenerate) VRAM (Video RAM) on video cards use this technique.
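To make the cost of polling in method (1) concrete, here is a toy simulation in Python. The SlowDevice class and its tick counts are hypothetical. It only counts how many times the CPU checks a busy device. Under interrupt I/O, method (3), those same ticks would be free for application work.

```python
class SlowDevice:
    """Simulated port that needs `busy_ticks` ticks to digest each byte."""

    def __init__(self, busy_ticks: int):
        self.busy_ticks = busy_ticks
        self.remaining = 0
        self.received = []

    def ready(self) -> bool:
        return self.remaining == 0

    def write(self, byte: int) -> None:
        self.received.append(byte)
        self.remaining = self.busy_ticks

    def tick(self) -> None:
        if self.remaining:
            self.remaining -= 1


def polled_write(device: SlowDevice, data: bytes) -> int:
    """PIO with polling: the CPU checks the status every tick until ready."""
    wasted_polls = 0
    for byte in data:
        while not device.ready():
            wasted_polls += 1  # CPU time burned waiting on the device
            device.tick()
        device.write(byte)
    return wasted_polls


if __name__ == "__main__":
    device = SlowDevice(busy_ticks=3)
    print(polled_write(device, b"abc"), device.received)
```

The slower the device, the more polls are wasted per byte, which is why slow devices are better served by interrupts.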

In a true multi-tasking system, such as OS/2, the CPU goes off and works on behalf of applications when the port is busy and trusts an interrupt to bring it back when the device needs more service. It schedules several I/Os simultaneously. In contrast, DOS and Windows never do more than one I/O at a time. Further, under DOS/Windows the CPU idles while waiting for its single I/O to complete rather than working on applications.

Learning More

You can use the Internet to learn more about this problem. If you do not have Internet access, I can provide you these files on diskette. See below for details. When accessing files on the Internet generally you must use lower case.

Test Programs

The Canadian Mind Products EIDEtest and CDtest programs run under DOS, DESQview, Windows, Windows For WorkGroups, Windows 95, NT, Win2K, Win/XP, OS/2 and Warp. They ensure your hard disk and CD-ROM will function without interference from background I/O activity. These indirectly detect the flawed RZ-1000 and CMD-640 chips. Intel’s RZ-1000 and CMD-640 chip detect program is RZtest.exe, which expands to form CtrlTest.exe. Beware! The CtrlTest.Doc documentation contains an MSWord macro virus. Unfortunately PowerQuest and IBM have withdrawn their test software.

Fixes

Warp Fixpack 10. This bypasses the flaws for both the RZ-1000 and CMD-640 faulty EIDE chips. It also fixes numerous other bugs in Warp. It comes as a set of six files totalling about 8 MB. Make sure you get it from an official IBM CSD (Corrective Service Diskette) site because there are leaked pre-released buggy copies floating about the net. Before applying it, verify that the readme.1st on the first fixpack disk is dated 1995-09-21 17:40. The package as a whole should be dated 1995-09-22 or later. This fixpack applies to all versions of Warp including Warp Connect. It contains in itself all earlier fixpacks. You don’t need to apply any previous fixpacks first. If you have the CMD-640, it is especially important you carefully read the installation instructions. You need to manually modify config.sys. Do a complete backup first. Many people are having a variety of troubles with Fixpack 10, often traced to failure to carefully follow the installation instructions, including a COMMIT step. These separate fixes are no longer available. New versions of the OS (Operating System) have the fix built-in.

If you don’t want to install the entire Fixpack 10, you can install these Warp bypasses for the RZ-1000 and the CMD flaws. Warning. This file has been updated several times without changing the name. Make sure you get the most recent. The installation instructions are tricky. Follow them carefully.

CMD fixes for the CMD-640 chip under various operating systems. Expand with PkUnZip -d 640X_USR.403
CMD’s BBS (Bulletin Board System) at (714) 454-1134. File 640X_USR.403
Warp bypass for the early CMD-640 chip flaws. It has been superseded by pj19409.zip. You no longer need to install it before pj19409.zip. The cmd640x.zip fix is no longer available.

Information on the Premiere/PCI II motherboard, commonly referred to as 'Plato', can be obtained from Intel’s Faxback Service at 800-525-3019 or 503-264-6835 in the US or +44(0)1793-496646 in the UK. Press option 2 for "components, boards, platforms and tools for OEMs (Original Equipment Manufacturers) and developers" and follow the prompts. Request a 'SYSTEMs' catalog. From this, you can reference documents and their associated FAXBACK document number.

You can upgrade to the latest BIOS version to see if that resolves your motherboard issues.

You may also want to call the Intel Technical Support line at 1-800-628-8686 for help with your processor issues.


This page is posted
on the web at:

http://mindprod.com/jgloss/eideflaw.html

Canadian Mind Products
Contact Roedy. Please feel free to link to this page without explicit permission.