Encoding Identification

Disclaimer

This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything else useful to implementing this project. Everything I have prepared to help you is right here.

This project outline is not like the artificial, tidy little problems you are spoon-fed in school, where all the facts you need are included, nothing extraneous is mentioned, and the answer is fully specified, along with hints to nudge you toward a single expected canonical solution. This project is much more like the real world of messy problems, where it is up to you to fully define the end point, or a series of ever more difficult versions of this project, and to research the information yourself to solve them.

Everything I have to say to help you with this project is written below. I am not prepared to help you implement it or give you any additional materials. I have too many other projects of my own.

Though I am a programmer by profession, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project in any way you please and to keep all the profits from your endeavour.

Please do not email me about this project without reading the disclaimer above.

possibilities

Unfortunately, the encoding scheme used is not usually embedded as a signature in the document. See encoding identification for a fuller description of the problem.

Your job is to look at the document and make an educated guess at the encoding scheme used to encode it. You might provide a list of guesses in descending order of probability for someone to make the final decision manually.

There are two parts to the project.

The viewer, which displays either the entire document in a given encoding, or just selected parts of the document that would render differently in different likely encodings. This is a simple text viewer with no HTML (Hypertext Markup Language) rendering. It can strip tags from HTML and XML (Extensible Markup Language).

The guesser.

Look for the optional content-type meta tag in HTML and the optional encoding tag in XML.
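A minimal sketch of this check follows. It assumes the document is in an ASCII-compatible encoding, so the declaration itself can be read by decoding the first few kilobytes as ISO-8859-1 (which maps every byte to exactly one character and so cannot fail). The class name `DeclaredEncoding`, the 4096-byte limit and the regular expressions are illustrative choices, not a full HTML or XML parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls a declared encoding name out of the head of a document. */
final class DeclaredEncoding {
    // Matches e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
    private static final Pattern XML_DECL = Pattern.compile(
            "<\\?xml[^>]*\\bencoding\\s*=\\s*[\"']([A-Za-z0-9._:-]+)[\"']");

    // Matches e.g. <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
    // and also the shorter form <meta charset="utf-8">
    private static final Pattern HTML_META = Pattern.compile(
            "<meta[^>]*\\bcharset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    private DeclaredEncoding() {}

    /** Returns the encoding name the document claims for itself, or empty if it claims none. */
    static Optional<String> find(byte[] document) {
        // Only the head of the document is inspected (assumed limit: 4096 bytes).
        int n = Math.min(document.length, 4096);
        String head = new String(document, 0, n, StandardCharsets.ISO_8859_1);
        for (Pattern p : new Pattern[] {XML_DECL, HTML_META}) {
            Matcher m = p.matcher(head);
            if (m.find()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }
}
```

A declared encoding is only a claim. Documents are often re-saved in another encoding without the tag being updated, so the guesser should treat it as strong evidence rather than proof.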

Note how various common words from various foreign languages are encoded.
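One way to try this idea is to encode a few probe words in each candidate encoding and count how often the resulting byte sequences turn up in the document. The sketch below assumes you supply the probe words and candidate charset names yourself; the class name `ProbeWords` is hypothetical.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: scores candidate encodings by searching for encoded probe words. */
final class ProbeWords {
    private ProbeWords() {}

    /** For each candidate charset, counts occurrences of the probe words as encoded in that charset. */
    static Map<String, Integer> score(byte[] doc, List<String> words, List<String> charsetNames) {
        Map<String, Integer> hits = new LinkedHashMap<>();
        for (String name : charsetNames) {
            Charset cs = Charset.forName(name);
            int count = 0;
            for (String word : words) {
                // Skip words the charset cannot represent; getBytes would silently substitute '?'.
                if (cs.canEncode() && cs.newEncoder().canEncode(word)) {
                    count += countOccurrences(doc, word.getBytes(cs));
                }
            }
            hits.put(name, count);
        }
        return hits;
    }

    /** Naive byte-sequence search; adequate for short probe words. */
    static int countOccurrences(byte[] haystack, byte[] needle) {
        if (needle.length == 0) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                count++;
            }
        }
        return count;
    }
}
```

Probe words made only of ASCII letters are useless here, because most encodings store them identically. Pick common words with accented letters, such as the German "f\u00fcr". Even then, closely related encodings such as ISO-8859-1 and windows-1252 will tie on most words.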

The presence of BOMs (Byte Order Marks).
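Checking for a byte order mark is the cheapest test of all. Here is a minimal sketch covering the UTF-8, UTF-16 and UTF-32 marks; the class name `BomSniffer` is hypothetical.

```java
import java.util.Optional;

/** Hypothetical helper: guesses an encoding from a leading byte order mark, if any. */
final class BomSniffer {
    private BomSniffer() {}

    /** Returns the encoding implied by a leading BOM, or empty if the data has none. */
    static Optional<String> detect(byte[] data) {
        // Test the 4-byte UTF-32 marks before the 2-byte UTF-16 marks,
        // because the UTF-32LE mark begins with the UTF-16LE mark FF FE.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return Optional.of("UTF-32BE");
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return Optional.of("UTF-32LE");
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return Optional.of("UTF-8");
        if (startsWith(data, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        if (startsWith(data, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        return Optional.empty();
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The absence of a BOM proves nothing, since most UTF-8 documents are written without one. A UTF-16LE document that happens to start with a NUL character would also be misread as UTF-32LE, so the result is still only a guess.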

The frequencies of various letters/bytes compared with sample documents of known encoding. This is a measure both of language and encoding.
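A rough sketch of the byte-frequency comparison: build a 256-entry histogram of relative byte frequencies for the unknown document, then compare it with histograms built the same way from sample documents of known encoding. Cosine similarity is used here as the comparison measure purely as an assumption; other measures, such as chi-squared, may work better. The class name `ByteProfile` is hypothetical.

```java
/** Hypothetical helper: byte-frequency profiles and a similarity measure between them. */
final class ByteProfile {
    private ByteProfile() {}

    /** Relative frequency of each of the 256 possible byte values. */
    static double[] histogram(byte[] data) {
        double[] freq = new double[256];
        if (data.length == 0) {
            return freq;
        }
        for (byte b : data) {
            freq[b & 0xFF]++;
        }
        for (int i = 0; i < freq.length; i++) {
            freq[i] /= data.length;
        }
        return freq;
    }

    /** Cosine similarity of two profiles: 1.0 means identical shape, 0.0 means nothing in common. */
    static double cosine(double[] a, double[] b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / Math.sqrt(normA * normB);
    }
}
```

To rank candidates, compute `cosine(histogram(unknown), profile)` against each stored profile and sort by descending score. Single-byte frequencies are a blunt instrument. Counting pairs of adjacent bytes would likely separate encodings better, at the cost of needing much larger samples.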

Look for symbols used in the context of a currency marker.
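For example, the euro sign is byte 0x80 in windows-1252 but byte 0xA4 in ISO-8859-15, while in ISO-8859-1 0xA4 is the rarely used generic currency sign and 0x80 is a control character. A suspicious byte sitting right next to a number therefore hints at one encoding over another. The sketch below counts such occurrences; the class name `CurrencyHint` and the rule of allowing one optional space are illustrative assumptions.

```java
/** Hypothetical helper: counts a candidate currency byte that appears next to a number. */
final class CurrencyHint {
    private CurrencyHint() {}

    /**
     * Counts occurrences of the byte value marker (0..255) that sit directly before
     * or after an ASCII digit, allowing one optional space in between.
     */
    static int countNearDigit(byte[] doc, int marker) {
        int count = 0;
        for (int i = 0; i < doc.length; i++) {
            if ((doc[i] & 0xFF) != marker) {
                continue;
            }
            boolean digitBefore = isDigitAt(doc, skipSpace(doc, i - 1, -1));
            boolean digitAfter = isDigitAt(doc, skipSpace(doc, i + 1, +1));
            if (digitBefore || digitAfter) {
                count++;
            }
        }
        return count;
    }

    /** If position i holds a space, moves one further position in the direction of step. */
    private static int skipSpace(byte[] doc, int i, int step) {
        return (i >= 0 && i < doc.length && doc[i] == ' ') ? i + step : i;
    }

    private static boolean isDigitAt(byte[] doc, int i) {
        return i >= 0 && i < doc.length && doc[i] >= '0' && doc[i] <= '9';
    }
}
```

A nonzero `countNearDigit(doc, 0x80)` points toward windows-1252, and a nonzero `countNearDigit(doc, 0xA4)` toward ISO-8859-15. Either is only a hint to feed into the overall ranking.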

Various ad hoc schemes you come up with to distinguish two similar encodings.

Rules that help track the source of a document from its name and knowing that a given source usually emitted a given encoding or set of encodings over a given date range.

You also might want to tackle this as a neural net problem, teaching it with thousands of documents of known encoding.

If you have control over the source of the documents, you can sidestep the problem by embedding the encoding as the first field followed by a line terminator. Better still, settle on UTF-8 or UTF-16BE as your encoding and be done with the problem.
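A sketch of the embedded-encoding idea, under an assumed file layout: the first line holds the charset name in plain ASCII, terminated by a single line feed, and everything after it is the text in that charset. The class name `SelfDescribingText` and the exact layout are illustrative, not a standard format.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: text whose first line names the encoding of the rest. */
final class SelfDescribingText {
    private SelfDescribingText() {}

    /** Writes the charset name as an ASCII header line, then the text in that charset. */
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] header = (cs.name() + "\n").getBytes(StandardCharsets.US_ASCII);
        out.write(header, 0, header.length);
        byte[] body = text.getBytes(cs);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    /** Reads the header line to learn the charset, then decodes the remainder with it. */
    static String read(byte[] data) {
        int newline = 0;
        while (newline < data.length && data[newline] != '\n') {
            newline++;
        }
        if (newline == data.length) {
            throw new IllegalArgumentException("missing encoding header line");
        }
        String name = new String(data, 0, newline, StandardCharsets.US_ASCII).trim();
        Charset cs = Charset.forName(name); // throws if the name is unknown or malformed
        return new String(data, newline + 1, data.length - newline - 1, cs);
    }
}
```

Because the header is always ASCII, the reader can find it before it knows anything about the body, even when the body is UTF-16BE.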

File Format Identification

Files sometimes identify their own format with a signature in the first few bytes.

File Signatures
type	signature
class	CAFEBABE in hex
*.gif	GIF87a or GIF89a
*.jpg	FFD8 in hex
*.png	89504E470D0A1A0A in hex
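A minimal sketch of matching these signatures against the first bytes of a file. It covers only the four entries in the table; the class name `FileSignature` and the returned labels are illustrative choices.

```java
/** Hypothetical helper: identifies a file format from its leading signature bytes. */
final class FileSignature {
    private FileSignature() {}

    /** Returns "class", "gif", "jpg" or "png" for a recognised signature, else "unknown". */
    static String identify(byte[] head) {
        if (startsWith(head, 0xCA, 0xFE, 0xBA, 0xBE)) {
            return "class";
        }
        if (startsWith(head, 'G', 'I', 'F', '8', '7', 'a')
                || startsWith(head, 'G', 'I', 'F', '8', '9', 'a')) {
            return "gif";
        }
        if (startsWith(head, 0xFF, 0xD8)) {
            return "jpg";
        }
        if (startsWith(head, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) {
            return "png";
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Only the first eight bytes of the file need to be read to run this check. A real tool would likely hold the signatures in a table so that new formats can be added without changing the code.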

standard footer
	This page is posted on the web at:	http://mindprod.com/project/encodingidentification.html
	Optional Replicator mirror of mindprod.com on local hard disk J:	J:\mindprod\project\encodingidentification.html
	Please read the feedback from other visitors, or send your own feedback about the site. Contact Roedy. Please feel free to link to this page without explicit permission.
	Canadian Mind Products
