Plucking Out Article Content from an Arbitrary Web Page
2009-03-22
For Defogger, the user should be able to paste in a URL (or click on a bookmarklet when on another web page) and get back a defogged, enhanced view of the content. Using OpenCalais, Defogger can figure out the important people, organizations, events, and relationships contained in a news article. While OpenCalais does accept HTML as input, it assumes that everything in the HTML is fair game. For example, putting this AP article hosted on Google through OpenCalais will return “Google” as a company mentioned in the article, when in fact Google is only detected because its name is contained in all the extra clutter around the content of the article.
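As a rough sketch of the OpenCalais side of Defogger, the request looks something like the following. The endpoint and the licenseID/content/paramsXML parameters are from the Enlighten REST interface as I remember it, and defog is just a stand-in helper name, so double-check against the Calais docs before relying on any of it:

require 'net/http'
require 'uri'

# placeholder -- use your own OpenCalais API key
CALAIS_LICENSE_ID = 'your-license-id'

# post plain text to OpenCalais and return the raw RDF response;
# the processing directive values here are from memory of the docs
def defog(text)
  params_xml = <<-XML
    <c:params xmlns:c="http://s.opencalais.com/1/pred/">
      <c:processingDirectives c:contentType="TEXT/RAW" c:outputFormat="XML/RDF"/>
    </c:params>
  XML
  Net::HTTP.post_form(URI.parse('http://api.opencalais.com/enlighten/rest/'),
                      'licenseID' => CALAIS_LICENSE_ID,
                      'content'   => text,
                      'paramsXML' => params_xml).body
end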
So I needed a way to remove that extra clutter and pass on exactly what mattered to OpenCalais. Interestingly, I had seen this problem already solved with a bookmarklet called Readability, “a simple tool that makes reading on the Web more enjoyable by removing the clutter around what you’re reading”. Try out the bookmarklet on some websites to see for yourself. It seems almost magical, and comes in handy all the time, especially for long articles, or for reading on the iPhone.
Luckily for me, Readability is open source, with the code hosted on Google Code. It was just up to me to translate the JavaScript algorithm into Ruby. So here’s my version of the code:
require 'open-uri'
require 'nokogiri'

def pluck_article(url)
  # get the raw HTML
  doc = Nokogiri::HTML(open(url))
  # get the paragraphs
  paragraphs = doc.search('p')
  # assign points to the parent nodes for each paragraph
  parents = {}
  paragraphs.each do |paragraph|
    points = calculate_points(paragraph)
    if parents.has_key?(paragraph.parent)
      parents[paragraph.parent] += points
    else
      parents[paragraph.parent] = points
    end
  end
  # get the parent node with the highest point total
  winner = parents.sort{ |a,b| a[1] <=> b[1] }.last[0]
  # return the plucked HTML content
  "<h1>" + doc.search('title').inner_html + "</h1>" + winner.inner_html
end

def calculate_points(paragraph, starting_points = 0)
  # reward for being a new paragraph
  points = starting_points + 20
  # look at the id and class of paragraph and parent
  classes_and_ids = (paragraph.get_attribute('class') || '') + ' ' +
                    (paragraph.get_attribute('id') || '') + ' ' +
                    (paragraph.parent.get_attribute('class') || '') + ' ' +
                    (paragraph.parent.get_attribute('id') || '')
  # deduct severely and return if clearly not content
  if classes_and_ids =~ /comment|meta|footer|footnote/
    points -= 3000
    return points
  end
  # reward if probably content
  if classes_and_ids =~ /post|hentry|entry|article/
    points += 50
  end
  # look at the actual text of the paragraph
  content = paragraph.content
  # deduct if very short
  if content.length < 20
    points -= 50
  end
  # reward if long
  if content.length > 100
    points += 50
  end
  # deduct if no periods, question marks, or exclamation points
  unless content.include?('.') or content.include?('?') or content.include?('!')
    points -= 100
  end
  # reward for periods and commas
  points += content.count('.') * 10
  points += content.count(',') * 20
  points
end
I’ve commented the algorithm thoroughly, and Ruby is quite readable, so it should be easy to follow. Essentially, the code takes advantage of several common conventions used on modern websites:
- An article (or blog post or press release) is coded in HTML as a lot of text enclosed in <p> tags.
- Those <p> tags are grouped together, usually under a containing <div> element.
- Written English contains a lot of commas.

Using those observations, along with a few other tricks, the code assigns points to every element on the page that contains paragraphs, and the containing element with the highest point total “wins”. Using it is as simple as calling pluck_article('http://www.path.to/some/article').
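Putting the pieces together with the hypothetical defog helper sketched earlier, the whole pipeline is just a few lines (stripping the tags first, since I only want OpenCalais to see the article text):

html = pluck_article('http://www.path.to/some/article')
text = Nokogiri::HTML(html).content  # drop the markup, keep the text
rdf  = defog(text)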
I’ve tried the method above on the sites of the New York Times, Google AP News, the Wall Street Journal, the Huffington Post, Athletics Nation, and even a press release on Nancy Pelosi’s site. All worked pretty well. The one platform where the algorithm breaks (both for my implementation above and for Readability) is Scoop, the blogging platform behind Daily Kos and MyDD. Specifically, the parsing doesn’t work on blog posts that have a long extended entry (usually diaries). This is because Scoop generates a blog post’s main entry and extended entry as two separate <div> elements, and the extended entry <div> ends up “winning”. That’s not at all semantically correct, but I don’t expect it to ever be fixed. Likewise, SoapBlox sites (like OpenLeft) have the same main-entry problem, as SoapBlox is a direct port of Scoop.
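One direction I may explore for Scoop: keep the runner-up parent as well when it is a sibling of the winner with a comparable score, since a split main/extended entry produces exactly that shape. This is an untested sketch (pluck_with_extended_entry is just a made-up name), reusing calculate_points from above:

def pluck_with_extended_entry(url)
  doc = Nokogiri::HTML(open(url))
  parents = Hash.new(0)
  doc.search('p').each { |p| parents[p.parent] += calculate_points(p) }
  # take the two highest-scoring parents
  ranked = parents.sort_by { |_, points| -points }.first(2)
  winner_node, winner_points = ranked[0]
  keep = [winner_node]
  if ranked[1]
    runner_node, runner_points = ranked[1]
    # keep the runner-up only if it sits beside the winner and scored
    # at least half as well -- likely the other half of a split entry
    if runner_node.parent == winner_node.parent && runner_points * 2 >= winner_points
      keep << runner_node
    end
  end
  # emit in document order so the main entry precedes the extended entry
  siblings = winner_node.parent.children.to_a
  body = keep.sort_by { |n| siblings.index(n) }.map { |n| n.inner_html }.join
  "<h1>" + doc.search('title').inner_html + "</h1>" + body
end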
I’ll sleep on the problem with Scoop/SoapBlox, but I’m pretty happy with the solution for now. Next up for Defogger: building the page templates using Haml, Sass, and Compass.