Screen scraping (now more often called web scraping) also refers to extracting information from HTML (Hypertext Markup Language) pages on the web. Unless the authors permit reuse, you may be violating copyright by doing that. I got in trouble by screenscraping foreign exchange rates off the Oanda site. Even material that looks fair game for reuse, e.g. prices, is not necessarily so. It is a legal minefield. It seems that manually extracting information is considered less sinful than using a program to do it, but you can still get in trouble.
Even if you screen-scrape for non-commercial purposes, even if you don’t repost the data and even if you don’t put much load on their server, they can still get irate, block you and send lawyer letters. I think the main reason is that they put up their site primarily to serve ads (the data offered are just bait), and you obviously are not reading the ads if you are screenscraping. If you stick to government sources, you will likely be safe.
Before you launch into a screen scraping project, do a thorough search of the site for a downloadable version of the data, sometimes offered in CSV (Comma-Separated Value) format, spreadsheet format or SOAP/XML (eXtensible Markup Language). This is ever so much more convenient and stable, not to mention quick. You just download the information you need, not a ton of copy to induce sales. If there is no such download, it never hurts to ask the source to provide one. It somehow never occurs to data providers that data are almost useless unless they are in computer-friendly format, i.e. not HTML.
Once you have your downloaded page, String.indexOf and regexes are useful tools to extract the data. Usually the data are too malformed to use a straightforward HTML parser. TagSoup can be useful to tidy up mangled HTML syntax prior to simple-minded programs sifting through the data.
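As a rough illustration of the String.indexOf and regex approach, here is a minimal sketch. The class name PageExtractor, the rate cell markup and the sample values are all hypothetical; a real page will have its own markers, which you find by studying its source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageExtractor {
    // Hypothetical markup: assumes each value sits in <td class="rate">...</td>.
    private static final Pattern RATE = Pattern.compile(
            "<td class=\"rate\">\\s*([0-9]+(?:\\.[0-9]+)?)\\s*</td>",
            Pattern.CASE_INSENSITIVE);

    /** Returns the trimmed text between the first start marker and the next end marker, or null if either is missing. */
    static String firstBetween(String html, String start, String end) {
        int from = html.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += start.length();
        int to = html.indexOf(end, from);
        if (to < 0) {
            return null;
        }
        return html.substring(from, to).trim();
    }

    /** Collects every number that the RATE pattern finds, in page order. */
    static List<String> allRates(String html) {
        List<String> rates = new ArrayList<>();
        Matcher m = RATE.matcher(html);
        while (m.find()) {
            rates.add(m.group(1));
        }
        return rates;
    }
}
```

indexOf is handy when you want one value that follows a distinctive marker; the regex is handier when the same pattern repeats many times on the page. Both break whenever the site changes its layout, so it pays to fail loudly when a marker is not found.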
JavaScript is a royal pain in the ass. It is as if its primary purpose is to foil screen scraping. There are two different pages: the raw source the server sends, and the page as the browser displays it after JavaScript has run.
Searching for strings that JavaScript generates will get you nowhere. Instead of looking for the generated strings, you have to look for the raw data JavaScript uses, e.g. error message numbers. In theory, there should be some way to run JavaScript outside the browser on the page so your screen scraper too can see the decoded version. (See HtmlUnit.)
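A sketch of looking for the raw data rather than the generated text might run along these lines. It assumes, purely for illustration, that the page source contains a script statement of the form var errorCode = 17; which the page's JavaScript later turns into a visible message. The variable name and the class name RawScriptData are hypothetical.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RawScriptData {
    // Hypothetical variable name; inspect the real page source to find what the script actually uses.
    private static final Pattern ERROR_CODE =
            Pattern.compile("\\bvar\\s+errorCode\\s*=\\s*(\\d{1,9})\\s*;");

    /** Returns the numeric code embedded in the page's script, or empty if the statement is absent. */
    static OptionalInt errorCode(String pageSource) {
        Matcher m = ERROR_CODE.matcher(pageSource);
        if (m.find()) {
            return OptionalInt.of(Integer.parseInt(m.group(1)));
        }
        return OptionalInt.empty();
    }
}
```

The scraper then maps the number to a meaning itself instead of hunting for the message text, which never appears in the downloaded source.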
In the olden days, screen scraping meant running scripted client software that interacted with legacy green-screen applications, e.g. CICS 3270 terminal apps, and (through the scripting) returned data to a host component. The host component could make the data available to non-legacy apps through ODBC (Open Data Base Connectivity), JDBC (Java Data Base Connectivity), etc.
The screen scraper program has to fool the host into thinking it is talking to one of its usual hardware terminals with an operator sitting at it. It must compose queries in the format the usual hardware would produce and interpret the formatted data coming back, parsing it to extract the data and leave behind the formatting.
However, today screen scraping is much simpler. You have to emulate a browser, and the server sends you HTML.
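Emulating a browser can be as little as sending an HTTP GET with browser-like headers. The following is a minimal sketch using the java.net.http client (Java 11 or later). The class name BrowserLikeFetch and the User-Agent string are assumptions for illustration, not anything a particular site requires.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class BrowserLikeFetch {
    // Assumed User-Agent; some servers reject requests that do not look like a browser.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleScraper/1.0)";

    /** Builds a GET request carrying browser-like headers. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .header("Accept", "text/html")
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
    }

    /** Downloads the page body as a String, failing on any status other than 200. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("HTTP " + response.statusCode() + " for " + url);
        }
        return response.body();
    }
}
```

A call such as BrowserLikeFetch.fetch("https://example.com/") (a placeholder URL) returns the raw HTML, ready for the extraction step. Sites that need cookies or a login take more work than this sketch covers.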
Before you leap into writing an old-tyme screen scraper, investigate thoroughly all the possible terminals you might emulate that will work with the existing app. You might find some simpler to emulate than others.
Some of the old terminals had quite complex protocols, e.g. SDLC (Synchronous Data Link Control), so you usually don’t want to write that part from scratch. Look for a third-party library to handle the low-level protocol details.
Screen scraping can also refer to capturing a bit image off the screen the program is running on using Robot.createScreenCapture.
To convert the pixels back to text is not quite as difficult as you might think. You can do a primitive OCR (Optical Character Recognition) that just compares clip regions with a cast of prototype characters set in the same font and size, looking for an exact match. You might want to adjust colours to pure black and white before you start. This is quite a bit easier than real OCR, where you have to deal with imprecisely formed characters.
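The exact-match idea can be sketched roughly as follows. The class name PrimitiveOcr, the brightness threshold of 128 and the '?' result for an unknown glyph are all assumptions made for this illustration. It also assumes the prototypes were rendered in the same font and size and clipped to the same dimensions as the glyph being tested.

```java
import java.awt.image.BufferedImage;
import java.util.Arrays;
import java.util.Map;

class PrimitiveOcr {
    /** Reduces an image to pure black and white: true marks a dark (ink) pixel. */
    static boolean[][] binarize(BufferedImage img) {
        int width = img.getWidth();
        int height = img.getHeight();
        boolean[][] ink = new boolean[height][width];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Assumed threshold: average brightness below 128 counts as ink.
                ink[y][x] = (r + g + b) / 3 < 128;
            }
        }
        return ink;
    }

    /** Returns the character whose prototype matches the glyph pixel for pixel, or '?' if none does. */
    static char recognize(boolean[][] glyph, Map<Character, boolean[][]> prototypes) {
        for (Map.Entry<Character, boolean[][]> entry : prototypes.entrySet()) {
            if (Arrays.deepEquals(glyph, entry.getValue())) {
                return entry.getKey();
            }
        }
        return '?';
    }
}
```

A linear scan over the prototypes is the simplest possible lookup. It is fine for a small character set, but it only works when rendering is pixel-identical, so anti-aliased text generally needs the colour adjustment first and may still fail to match.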
To separate characters you have to look for a vertical strip of white. To rapidly find the matching character you could use several methods:
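The separation step, looking for vertical strips of white, might be sketched like this. It assumes the line of text has already been reduced to a boolean grid in which true marks an ink pixel. The class name GlyphSplitter and the choice to return column ranges are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

class GlyphSplitter {
    /**
     * Finds runs of adjacent columns that contain ink, separated by all-white columns.
     * Each result is {startColumn, endColumnExclusive}.
     */
    static List<int[]> splitColumns(boolean[][] ink) {
        List<int[]> spans = new ArrayList<>();
        if (ink.length == 0) {
            return spans;
        }
        int width = ink[0].length;
        int start = -1; // -1 means we are currently inside a white strip
        for (int x = 0; x < width; x++) {
            boolean blank = true;
            for (boolean[] row : ink) {
                if (row[x]) {
                    blank = false;
                    break;
                }
            }
            if (!blank && start < 0) {
                start = x; // first inked column of a new character
            } else if (blank && start >= 0) {
                spans.add(new int[] {start, x}); // white strip ends the character
                start = -1;
            }
        }
        if (start >= 0) {
            spans.add(new int[] {start, width}); // character touching the right edge
        }
        return spans;
    }
}
```

Each returned range can then be clipped out and compared against the prototypes. Characters that touch or overlap horizontally, as in some kerned or italic fonts, will come out as a single span, so this works best with monospaced or widely spaced text.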
This page is posted at http://mindprod.com/jgloss/screenscraping.html
Canadian Mind Products