Plucking Out Article Content from an Arbitrary Web Page
2009-03-22
For Defogger, the user should be able to paste in a URL (or click on a bookmarklet when on another web page) and get back a defogged, enhanced view of the content. Using OpenCalais, Defogger can figure out the important people, organizations, events, and relationships contained in a news article. While OpenCalais does accept HTML as input, it assumes that everything in the HTML is fair game. For example, putting this AP article hosted on Google through OpenCalais will return “Google” as a company mentioned in the article, when in fact Google is only detected because its name is contained in all the extra clutter around the content of the article.
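As a rough sketch of the OpenCalais side of Defogger, the request looks something like the following. The endpoint and the licenseID/content/paramsXML parameters are from the Enlighten REST interface as I remember it, and defog is just a stand-in helper name, so double-check against the Calais docs before relying on any of it:

require 'net/http'
require 'uri'

# placeholder -- use your own OpenCalais API key
CALAIS_LICENSE_ID = 'your-license-id'

# post plain text to OpenCalais and return the raw RDF response;
# the processing directive values here are from memory of the docs
def defog(text)
  params_xml = <<-XML
    <c:params xmlns:c="http://s.opencalais.com/1/pred/">
      <c:processingDirectives c:contentType="TEXT/RAW" c:outputFormat="XML/RDF"/>
    </c:params>
  XML
  Net::HTTP.post_form(URI.parse('http://api.opencalais.com/enlighten/rest/'),
                      'licenseID' => CALAIS_LICENSE_ID,
                      'content'   => text,
                      'paramsXML' => params_xml).body
end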
So I needed a way to remove that extra clutter and pass on exactly what mattered to OpenCalais. Interestingly, I had seen this problem already solved with a bookmarklet called Readability, “a simple tool that makes reading on the Web more enjoyable by removing the clutter around what you’re reading”. Try out the bookmarklet on some websites to see for yourself. It seems almost magical, and comes in handy all the time, especially for long articles, or for reading on the iPhone.
Luckily for me, Readability is open source, with the code hosted on Google Code. It was just up to me to translate the JavaScript algorithm into Ruby. So here’s my version of the code:
require 'open-uri'
require 'nokogiri'

def pluck_article(url)
  # get the raw HTML
  doc = Nokogiri::HTML(open(url))
  # get the paragraphs
  paragraphs = doc.search('p')
  # assign points to the parent nodes for each paragraph
  parents = {}
  paragraphs.each do |paragraph|
    points = calculate_points(paragraph)
    if parents.has_key?(paragraph.parent)
      parents[paragraph.parent] += points
    else
      parents[paragraph.parent] = points
    end
  end
  # get the parent node with the highest point total
  winner = parents.sort{ |a,b| a[1] <=> b[1] }.last[0]
  # return the plucked HTML content
  "<h1>" + doc.search('title').inner_html + "</h1>" + winner.inner_html
end

def calculate_points(paragraph, starting_points = 0)
  # reward for being a new paragraph
  points = starting_points + 20
  # look at the id and class of paragraph and parent
  classes_and_ids = (paragraph.get_attribute('class') || '') + ' ' +
                    (paragraph.get_attribute('id') || '') + ' ' +
                    (paragraph.parent.get_attribute('class') || '') + ' ' +
                    (paragraph.parent.get_attribute('id') || '')
  # deduct severely and return if clearly not content
  if classes_and_ids =~ /comment|meta|footer|footnote/
    points -= 3000
    return points
  end
  # reward if probably content
  if classes_and_ids =~ /post|hentry|entry|article/
    points += 50
  end
  # look at the actual text of the paragraph
  content = paragraph.content
  # deduct if very short
  if content.length < 20
    points -= 50
  end
  # reward if long
  if content.length > 100
    points += 50
  end
  # deduct if no periods, question marks, or exclamation points
  unless content.include?('.') or content.include?('?') or content.include?('!')
    points -= 100
  end
  # reward for periods and commas
  points += content.count('.') * 10
  points += content.count(',') * 20
  points
end
I’ve commented the algorithm thoroughly, and Ruby is quite readable, so it should be easy to follow. Essentially, the code takes advantage of several common conventions used on modern websites:
- An article (or blog post or press release) is coded in HTML as a lot of text enclosed in <p> tags.
- Those <p> tags are grouped together, usually under a containing <div> element.
- Written English contains a lot of commas.

Using those observations, along with a few other tricks, the code assigns points to every element on the page that contains paragraphs, and the containing element with the highest point total “wins”. Using it is as simple as calling pluck_article('http://www.path.to/some/article').
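Putting the pieces together with the hypothetical defog helper sketched earlier, the whole pipeline is just a few lines (stripping the tags first, since I only want OpenCalais to see the article text):

html = pluck_article('http://www.path.to/some/article')
text = Nokogiri::HTML(html).content  # drop the markup, keep the text
rdf  = defog(text)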
I’ve tried the method above on the sites of the New York Times, Google AP News, the Wall Street Journal, the Huffington Post, Athletics Nation, and even a press release on Nancy Pelosi’s site. All worked pretty well. The one platform where the algorithm breaks (both for my implementation above and for Readability) is Scoop, the blogging platform behind Daily Kos and MyDD. Specifically, the parsing doesn’t work on blog posts that have a long extended entry (usually diaries). This is because Scoop generates a blog post’s main entry and extended entry as two separate <div> elements, and the extended entry <div> ends up “winning”. That’s not at all semantically correct, but I don’t expect it to ever be fixed. Likewise, SoapBlox sites (like OpenLeft) have the same main-entry problem, as SoapBlox is a direct port of Scoop.
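One direction I may explore for Scoop: keep the runner-up parent as well when it is a sibling of the winner with a comparable score, since a split main/extended entry produces exactly that shape. This is an untested sketch (pluck_with_extended_entry is just a made-up name), reusing calculate_points from above:

def pluck_with_extended_entry(url)
  doc = Nokogiri::HTML(open(url))
  parents = Hash.new(0)
  doc.search('p').each { |p| parents[p.parent] += calculate_points(p) }
  # take the two highest-scoring parents
  ranked = parents.sort_by { |_, points| -points }.first(2)
  winner_node, winner_points = ranked[0]
  keep = [winner_node]
  if ranked[1]
    runner_node, runner_points = ranked[1]
    # keep the runner-up only if it sits beside the winner and scored
    # at least half as well -- likely the other half of a split entry
    if runner_node.parent == winner_node.parent && runner_points * 2 >= winner_points
      keep << runner_node
    end
  end
  # emit in document order so the main entry precedes the extended entry
  siblings = winner_node.parent.children.to_a
  body = keep.sort_by { |n| siblings.index(n) }.map { |n| n.inner_html }.join
  "<h1>" + doc.search('title').inner_html + "</h1>" + body
end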
I’ll sleep on the problem with Scoop/SoapBlox, but I’m pretty happy with the solution for now. Next up for Defogger: building the page templates using Haml, Sass, and Compass.