Mining the NY Times Archives


Dave Winer looks to the recently released New York Times archives as rich loam of fertile content upon which many applications can be built. In another life, as a product manager for Factiva, I came to appreciate the meta-data the Times would attach to its content as something Factiva would leverage for its clients. Factiva provided investment banks and corporate libraries with content feeds from major news outlets and used the meta-data on its sources (often adding meta-data of its own) so that clients would get precisely the content they were interested in and avoid wading through the irrelevant results that blunt keyword searches often produce.

If the global PR officers of Ford or Sharp were looking for breaking news stories, keyword searches on the internet would be nearly useless, as they would pull in stories of used Ford cars for sale or someone’s “sharp” looking suit. These clients would pay for the meta-data, and Factiva’s taxonomy consultants would offer numerous tips and tricks to hone their filters until they found exactly what was required.
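To make the contrast concrete, here is a minimal sketch in Python. The two article records and the "Sharp Corp" organization tag are made up for illustration; they are not real Factiva or New York Times data, and the function names are my own.

```python
# Hypothetical sample records: invented headlines and an invented
# organization tag, used only to show the difference between approaches.
articles = [
    {"headline": "Sharp unveils new television line", "org": ["Sharp Corp"]},
    {"headline": "A sharp looking suit for fall", "org": []},
]


def keyword_match(items, word):
    """Blunt keyword search: case-insensitive substring match on the headline."""
    word = word.lower()
    return [a for a in items if word in a["headline"].lower()]


def org_match(items, org):
    """Meta-data filter: keep only articles tagged with the organization."""
    return [a for a in items if org in a["org"]]


print(len(keyword_match(articles, "sharp")))   # both articles match
print(len(org_match(articles, "Sharp Corp")))  # only the tagged article
```

The keyword search cannot tell the company from the adjective; the tag can, which is what the clients were paying for.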

With this in mind, I took a quick look at the page source of the New York Times stories and found that they contain much of the meta-data that I remember.

Today’s story on Iranian President Ahmadinejad’s speech at the UN contains the following meta tags:

  • byl=Warren Hoge
  • des=International Relations;Embargoes and Economic Sanctions;Atomic Weapons
  • per=Ahmadinejad, Mahmoud
  • org=United Nations;Security Council
  • geo=Iran

A business article on the arrival of the Microsoft game Halo 3 has the following:

  • byl=Seth Schiesel
  • des=Computer and Video Games;Computers and the Internet
  • per=Gates, Bill
  • org=Microsoft Corp;Sony Corp;Nintendo Company Limited
  • ticker=Microsoft Corp|MSFT|NASDAQ;Best Buy Company Incorporated|BBY|NYSE;Sony Corp|SNE|NYSE;Nintendo Company Limited|NTDOY|other-OTC;GameStop Corporation|GME|NYSE;Circuit City Stores Inc|CC|NYSE
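If you want to pull these fields out of a page yourself, a rough sketch using only the Python standard library might look like the following. I am assuming the fields appear as ordinary `<meta name="..." content="...">` elements; the sample markup is reconstructed from the values listed above and is not copied from the actual page source. `MetaTagParser` and `extract_meta` are names I made up.

```python
from html.parser import HTMLParser

# Field names taken from the two examples above.
FIELDS = {"byl", "des", "per", "org", "geo", "ticker"}


class MetaTagParser(HTMLParser):
    """Collect name/content pairs from <meta> tags for the known fields."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        if name in FIELDS:
            # content may be missing or empty; normalize to a stripped string
            self.meta[name] = (attrs.get("content") or "").strip()


def extract_meta(html):
    """Return a dict of the known meta-data fields found in an HTML string."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta


# Assumed markup, built from the values listed for the Halo 3 article.
sample = """
<html><head>
<meta name="byl" content="Seth Schiesel">
<meta name="per" content="Gates, Bill">
<meta name="org" content="Microsoft Corp;Sony Corp;Nintendo Company Limited">
<meta name="description" content="ignored, not one of the known fields">
</head><body></body></html>
"""

meta = extract_meta(sample)
print(meta["per"])
print(meta["org"])
```

If the real pages use a different attribute layout, only `handle_starttag` would need to change.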

From this we can see elements of the taxonomy poke through.

  • byl – the byline, i.e. the author of the story
  • des – the descriptors, i.e. how the New York Times classifies the story
  • per – nodes for individuals
  • org – company or organizational nodes
  • geo – nodes for geographic locations
  • ticker – public company stock symbols and their listing exchanges
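Judging from the two examples, multiple values in a field are separated by semicolons, and each ticker entry looks like company|symbol|exchange. That layout is my inference from two articles, not a documented format, so the sketch below skips any ticker entry that does not fit it. The helper names are hypothetical.

```python
def split_values(value):
    """Split a semicolon-separated field into a list of trimmed values."""
    return [v.strip() for v in value.split(";") if v.strip()]


def parse_ticker(value):
    """Parse 'Company|SYMBOL|EXCHANGE;...' into a list of dicts.

    Entries that do not have exactly three pipe-separated parts are
    skipped, since the layout is only an assumption.
    """
    listings = []
    for entry in split_values(value):
        parts = [p.strip() for p in entry.split("|")]
        if len(parts) != 3:
            continue
        company, symbol, exchange = parts
        listings.append(
            {"company": company, "symbol": symbol, "exchange": exchange}
        )
    return listings


des = "International Relations;Embargoes and Economic Sanctions;Atomic Weapons"
ticker = (
    "Microsoft Corp|MSFT|NASDAQ;Sony Corp|SNE|NYSE;"
    "Nintendo Company Limited|NTDOY|other-OTC"
)

print(split_values(des))
for listing in parse_ticker(ticker):
    print(listing["symbol"], listing["exchange"])
```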

I’ve only just started playing around with this, but by combining text from the meta-data fields with your favorite search engine you can already start to sort results in interesting ways:

  1. Articles about Mahmoud Ahmadinejad
  2. Articles about Gates, Bill
  3. News about Nintendo
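Searches like these can be put together along the following lines. This is only a sketch: it quotes the field value as an exact phrase and restricts results with the `site:` operator that the major search engines support. The search URL is a placeholder and the `q` parameter name is an assumption, so substitute whatever your search engine expects. Whether an engine actually indexes the meta-data values is something you would have to test.

```python
from urllib.parse import urlencode


def build_query(value, site="nytimes.com"):
    """Quote a meta-data value as an exact phrase and restrict it to one site."""
    return f'"{value}" site:{site}'


def search_url(base_url, query, param="q"):
    """Build a search URL; base_url and param depend on the search engine."""
    return f"{base_url}?{urlencode({param: query})}"


query = build_query("Gates, Bill")
print(query)

# Placeholder host, not a real search engine.
url = search_url("https://search.example.com/search", query)
print(url)
```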

It’s still early days: the search engines do not appear to have crawled the archives completely, and a quick check shows that older articles lack most of this meta-data. It will be interesting to see what insights skillful use of the meta-data fields will yield over the next few weeks and what applications can be built on top of them.



  1. “as something Factiva would leverage for it’s clients”

    please. simple grammar here. it’s “its.” can’t bloggers learn simple rules of English?


  2. Another interesting archive is their free one that dates back to 1851. Granted it doesn’t appear to have the same meta data but apparently all the data is available for analysis by anyone. I found examples here:
    Looks like it has a vast amount of text and images available for free.

  3. A typo in: "…I came to appreciate the meta-date the Times would attach …" meta-date should be meta-data.
