Encoding Identification
by Roedy Green ©1996-2009 Canadian Mind Products
This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested
student project in Java programming. This essay gives a rough overview of how it might work. I have no source, object,
specifications, file layouts or anything else useful to implementing this project. Everything I have to say to help you
with this project is written below. I am not prepared to help you implement it; I have too many other projects of my own.
I do contract work for a living, which could include writing a program such as
this. However, I don’t do people’s homework
for them. That just robs them of an education.
You have my full permission to implement this project in any way you please and
to keep all the profits from your endeavor.
As the world has become a global village, the problem of file encodings has become more acute. Now a file created on one
side of the planet may be read on another. It is not obvious which encoding scheme was used. There are hundreds of possibilities.
Unfortunately, the encoding scheme used is not usually embedded as a signature in the document. See encoding
identification for a fuller description of the problem.
Your job is to look at the document and make an educated guess at the encoding scheme used to encode it. You might
provide a list of guesses in descending order of probability for someone to make the final decision manually.
There are two parts to the project.
- The viewer, which displays either the entire document in a given encoding, or just selected parts of the document that
would render differently in different likely encodings. This is a simple text viewer with no HTML rendering. It can
strip tags for HTML and XML.
- The guesser.
The guesser can use the following clues:
- Look for the optional content-type meta tag in HTML and the optional encoding attribute of the XML declaration.
- Note how common words from various foreign languages are encoded.
- Look for BOMs (Byte Order Marks) at the start of the file.
- The frequencies of various letters/bytes compared with sample known documents. This is a measure both of language and
encoding.
- Look for symbols used in the context of a currency marker.
- Various ad hoc schemes you come up with to distinguish two similar encodings.
- Rules that help track the source of a document from its name, and knowing that a given source usually emitted a given
encoding or set of encodings over a given date range.
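The BOM clue is the easiest one to code, so here is a minimal sketch of it. The class name BomSniffer and the method
detect are made up for illustration. It only recognises the UTF-8, UTF-16 and UTF-32 marks, and it assumes you have
already read the first few bytes of the file into an array. A missing BOM tells you nothing, so a null result just means
you must fall back on the other clues.

```java
/** Hypothetical helper: guesses an encoding from a leading byte order mark. */
final class BomSniffer {

    /**
     * Returns the encoding name implied by a BOM at the start of the data,
     * or null when no recognised BOM is present.
     */
    static String detect(byte[] head) {
        // Test UTF-32 before UTF-16: the UTF-32LE mark FF FE 00 00
        // begins with the UTF-16LE mark FF FE.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Note the ambiguity: FF FE 00 00 could also be a UTF-16LE file whose first character is a NUL. The sketch treats it as
UTF-32LE, which is only a guess, like everything else the guesser does.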
The guesser can also guess the language(s) used simply by looking for common but unique words in each language.
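Here is a rough sketch of the common-word approach, assuming the text has already been decoded to a String. The class
LanguageGuesser and its four tiny word lists are invented for illustration (Set.of needs Java 9 or later). A real
guesser would need much longer lists, chosen so that no word appears in more than one language.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: guesses a language by counting marker words. */
final class LanguageGuesser {

    // Tiny illustrative lists; each word appears in only one list.
    private static final Map<String, Set<String>> MARKERS = new LinkedHashMap<>();
    static {
        MARKERS.put("English", Set.of("the", "and", "with", "which"));
        MARKERS.put("French", Set.of("les", "est", "avec", "dans"));
        MARKERS.put("German", Set.of("der", "und", "nicht", "das"));
        MARKERS.put("Spanish", Set.of("los", "pero", "para", "muy"));
    }

    /** Returns the language with the most marker-word hits, or null if there are none. */
    static String guess(String text) {
        Map<String, Integer> hits = new LinkedHashMap<>();
        // Split on runs of non-letters so punctuation and digits are ignored.
        for (String word : text.toLowerCase(Locale.ROOT).split("\\P{L}+")) {
            for (Map.Entry<String, Set<String>> e : MARKERS.entrySet()) {
                if (e.getValue().contains(word)) {
                    hits.merge(e.getKey(), 1, Integer::sum);
                }
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

Since the hit counts are already in a map, it would be a small step to return them all, sorted, so that a document in
several languages shows up as such.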
You might also want to tackle this as a neural net problem, teaching the net with thousands of documents with known
encodings.
If you have control over the source of the documents, you can sidestep the problem by embedding the encoding as the
first field followed by a line terminator. Better still, settle on UTF-8 or UTF-16BE as your encoding and be done with
the problem.
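A minimal sketch of the embedded-encoding idea follows. The class SelfDescribingText and the exact layout (the charset
name in ASCII, then a single newline byte, then the body) are my assumptions for illustration, not an established file
format. The reader splits at the byte level, so the body may be in any encoding Java supports, including ones like
UTF-16BE that are not ASCII-compatible.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical format: first line names the encoding, the rest is the text. */
final class SelfDescribingText {

    /** Encodes text with an ASCII header line naming the charset used for the body. */
    static byte[] write(String text, Charset cs) {
        byte[] header = (cs.name() + "\n").getBytes(StandardCharsets.US_ASCII);
        byte[] body = text.getBytes(cs);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(header, 0, header.length);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    /** Reads the header line, then decodes the remaining bytes with the named charset. */
    static String read(byte[] data) {
        int nl = 0;
        while (nl < data.length && data[nl] != '\n') {
            nl++;
        }
        if (nl == data.length) {
            throw new IllegalArgumentException("missing encoding header line");
        }
        String name = new String(data, 0, nl, StandardCharsets.US_ASCII).trim();
        // Charset.forName throws if the name is illegal or unsupported.
        Charset cs = Charset.forName(name);
        return new String(data, nl + 1, data.length - nl - 1, cs);
    }
}
```

For example, SelfDescribingText.read(SelfDescribingText.write(text, StandardCharsets.UTF_16BE)) should hand back the
original text without the reader having to guess anything.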
File Format Identification
There is a related, broader problem: identifying what format a file is in. In general, there is no way to tell what
format a file is in, or what program can process it. You must simply remember or guess from the extension. Unfortunately,
many extensions like doc give little clue. This is a royal mess and one of the side effects of the fact that
males designed most of computing. I can’t imagine such a thing would have happened if Martha Stewart had had a hand in it.
Files would have been automatically neatly labeled with the format, creating program and encoding.
Files are sometimes labeled with what they are by a signature in the first few bytes.
| File Signatures |
| type | signature |
| *.class | CAFEBABE in hex |
| *.gif | GIF87a or GIF89a |
| *.jpg | FFD8 in hex |
| *.png | 89504E470D0A1A0A in hex |
You can collect these signatures and use them to guess what you have. Unfortunately, there is no central registry of
signatures and most file formats don’t have them. You will have to discover them yourself for the sorts of files
in your universe using a hex viewer.
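Once you have collected some signatures, matching them is simple. Here is a sketch that uses only the signatures in the
table above. The class SignatureSniffer is a made-up name, and the GIF signatures are written out as the hex values of
their ASCII letters. It assumes you pass in the first few bytes of the file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: identifies a file type from its leading signature bytes. */
final class SignatureSniffer {

    // Hex signature -> file type, taken from the table above.
    private static final Map<String, String> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("CAFEBABE", "class");
        SIGNATURES.put("474946383761", "gif");      // "GIF87a" in ASCII
        SIGNATURES.put("474946383961", "gif");      // "GIF89a" in ASCII
        SIGNATURES.put("89504E470D0A1A0A", "png");
        SIGNATURES.put("FFD8", "jpg");
    }

    /** Returns the file type whose signature starts the data, or null if none matches. */
    static String identify(byte[] head) {
        // Only the first 8 bytes matter: that is the longest signature in the table.
        int n = Math.min(head.length, 8);
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < n; i++) {
            hex.append(String.format("%02X", head[i] & 0xFF));
        }
        String prefix = hex.toString();
        for (Map.Entry<String, String> e : SIGNATURES.entrySet()) {
            if (prefix.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }
}
```

A short signature such as FFD8 will occasionally match a file that is not a JPEG at all, so treat the answer as a
guess to be confirmed, not as proof.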