Web pages contain various pieces of information that surround the body, or main content, of the article. One can call them ancillary blocks: information presented alongside the main content. Examples include navigation sidebars, related-items components, ads, headers, and footers. When users try to find information with a keyword search, they are mostly looking for results that match the actual content. When an information access tool also returns matches from the ancillary blocks, the user has to do more work to get the desired result (adding more keywords to the search or paging through a long list of results).

LATimes page with Ancillary blocks highlighted

To improve the quality of the returned results, the ancillary blocks can be removed at indexing time. Several approaches to removing non-informative content from web pages have been discussed in the literature (the entropy-threshold-based approach by Lin and Ho, and ContentExtractor and FeatureExtractor by Debnath et al.). There is also a very straightforward approach, used by the Perl module HTML-Content-Extractor, that has been ported to PHP and is described here.

The product I am currently working on is based on Java and Python, and unfortunately there is no Python port. So I wrote one in Python that uses the ideas above. In short, it does the following three things:

1. Heuristics - Looking across web pages, it is evident that certain HTML tags do not contribute any informative content. These include tags like img, script, link, input, hr, etc. These can be deleted from the DOM.

2. Feature-based block detection and removal - Identifying ancillary blocks based on their text-to-link ratio.

3. Remove all the HTML tags from the remaining DOM.
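
To make the three steps concrete, here is a minimal sketch using only the Python standard library. It is not the actual port, just an illustration under some assumptions of mine: the names (NOISE_TAGS, BLOCK_TAGS, MAX_LINK_RATIO, main_text) are hypothetical, the set of block tags and the 0.5 threshold are guesses rather than tuned values, and the "text-to-link ratio" is approximated as the share of a block's text that sits inside anchor tags. It also does not handle implicitly closed tags or other malformed markup.

```python
from html.parser import HTMLParser

NOISE_TAGS = {"img", "script", "link", "input", "hr"}     # step 1: tags from the list above
VOID_TAGS = {"img", "link", "input", "hr", "br", "meta"}  # tags that never have a closing tag
BLOCK_TAGS = {"div", "ul", "ol", "table", "td", "nav", "aside", "p"}  # assumed block candidates
MAX_LINK_RATIO = 0.5  # assumed threshold: drop blocks whose text is mostly link text


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []  # Node instances or plain strings


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree from the parser events."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def text_lengths(node, in_link=False):
    """Return (total_chars, link_chars) for the text under node, skipping noise tags."""
    total = link = 0
    for child in node.children:
        if isinstance(child, str):
            n = len(child.strip())
            total += n
            link += n if in_link else 0
        elif child.tag not in NOISE_TAGS:
            t, l = text_lengths(child, in_link or child.tag == "a")
            total += t
            link += l
    return total, link


def extract(node, out):
    """Collect text, skipping noise tags (step 1) and link-heavy blocks (step 2)."""
    for child in node.children:
        if isinstance(child, str):
            if child.strip():
                out.append(child.strip())
        elif child.tag in NOISE_TAGS:
            continue
        else:
            if child.tag in BLOCK_TAGS:
                total, link = text_lengths(child)
                if total and link / total > MAX_LINK_RATIO:
                    continue
            extract(child, out)


def main_text(html):
    """Step 3: return only the remaining text, with all tags gone."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    out = []
    extract(builder.root, out)
    return " ".join(out)
```

Calling main_text(page_source) on a page whose navigation is a list of links inside a div should drop that div and keep the paragraphs of the story. A wrapper div that contains both the navigation and the article is kept, because its overall link share is low, and the link-heavy block inside it is still removed when the walk reaches it.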

So far, I have only tried it on content pages from CNN and LA Times. As I try more pages and find other enhancements or issues, I’ll keep posting.