Mining the NYT Archives

Dave Winer looks to the recently released New York Times archives as a rich loam of fertile content upon which many applications can be built. In another life, as a product manager for factiva.com, I came to appreciate the meta-data the Times attached to its content, which Factiva leveraged for its clients. Factiva provided investment banks and corporate libraries with content feeds from major news outlets and used the meta-data from its sources (often adding meta-data of its own) so that clients would get precisely the content they were interested in and avoid wading through the irrelevant results that blunt keyword searches often produce.

If the global PR officers of Ford or Sharp were looking for breaking news stories, keyword searches on the internet would be nearly useless, as they would pull in stories of used Ford cars for sale or someone’s “sharp” looking suit. These clients would pay for the meta-data, and Factiva’s taxonomy consultants would offer numerous tips & tricks to hone their filters to find exactly what was required.

With this in mind, I took a quick look at the page source of New York Times stories and found that they contain much of the meta-data that I remember.

Today’s story on Iranian President Ahmadinejad’s speech at the UN contains the following meta tags:

  • byl=Warren Hoge
  • des=International Relations;Embargoes and Economic Sanctions;Atomic Weapons
  • per=Ahmadinejad, Mahmoud
  • org=United Nations;Security Council
  • geo=Iran

A business article on the arrival of the Microsoft game Halo 3 has the following:

  • byl=Seth Schiesel
  • des=Computer and Video Games;Computers and the Internet
  • per=Gates, Bill
  • org=Microsoft Corp;Sony Corp;Nintendo Company Limited
  • ticker=Microsoft Corp|MSFT|NASDAQ;Best Buy Company Incorporated|BBY|NYSE;Sony Corp|SNE|NYSE;Nintendo Company Limited|NTDOY|other-OTC;GameStop Corporation|GME|NYSE;Circuit City Stores Inc|CC|NYSE
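For anyone who wants to pull these fields out programmatically rather than by viewing source, here is a minimal sketch using only Python's standard library. It assumes the fields appear as ordinary `<meta name="..." content="...">` tags; the sample markup below is hypothetical, modeled on the Halo 3 values listed above, and is not copied from the actual page.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content is not None:
            self.tags[name] = content.strip()


def extract_meta(html):
    """Return a dict mapping meta tag names to their content strings."""
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.tags


# Hypothetical markup: the tag layout is an assumption, the values
# come from the Halo 3 example above.
SAMPLE_HTML = """
<html><head>
<meta name="byl" content="Seth Schiesel">
<meta name="des" content="Computer and Video Games;Computers and the Internet">
<meta name="per" content="Gates, Bill">
<meta name="org" content="Microsoft Corp;Sony Corp;Nintendo Company Limited">
</head><body></body></html>
"""

meta = extract_meta(SAMPLE_HTML)
for name, content in meta.items():
    print(f"{name}={content}")
```

Feeding a saved copy of a real article page to `extract_meta` should work the same way, provided the page uses the name/content layout assumed here.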

From this we can see elements of the nytimes.com taxonomy poke through.

  • byl – the byline of the author of the story
  • des – the descriptors, showing how the story is classified by the New York Times
  • per – nodes for individuals
  • org – company or organizational nodes
  • geo – geographic nodes
  • ticker – public company stock symbols and their listing exchange
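Judging from the examples above, multiple values within a field are separated by semicolons, and each ticker entry packs company name, symbol, and exchange together with pipes. A rough sketch of unpacking them follows; the function names and dictionary keys are my own for illustration, and the delimiter rules are inferred from these two articles only, not from any published specification.

```python
def split_values(raw):
    """Split a semicolon-delimited field into a list of trimmed values."""
    return [part.strip() for part in raw.split(";") if part.strip()]


def parse_tickers(raw):
    """Parse 'Company|SYMBOL|EXCHANGE' entries into dictionaries.

    Entries that do not have exactly three pipe-separated parts are skipped.
    """
    listings = []
    for entry in split_values(raw):
        fields = [field.strip() for field in entry.split("|")]
        if len(fields) != 3:
            continue
        company, symbol, exchange = fields
        listings.append(
            {"company": company, "symbol": symbol, "exchange": exchange}
        )
    return listings


# Values taken from the two example articles above.
DES = "International Relations;Embargoes and Economic Sanctions;Atomic Weapons"
TICKER = (
    "Microsoft Corp|MSFT|NASDAQ;"
    "Best Buy Company Incorporated|BBY|NYSE;"
    "Sony Corp|SNE|NYSE"
)

descriptors = split_values(DES)
tickers = parse_tickers(TICKER)

for listing in tickers:
    print(listing["symbol"], "on", listing["exchange"])
```

Note that the per field uses a "Last, First" form, so commas cannot be treated as separators; only the semicolon is split on here.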

I’ve only just started playing around with this, but using text from the meta-data fields and your favorite search engine, you can already start to sort results in interesting ways.

  1. Articles about Mahmoud Ahmadinejad
  2. Articles about Gates, Bill
  3. News about Nintendo
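One way to build searches like these is to combine a quoted meta-data value with a site restriction. The sketch below assumes a search engine that supports the `site:` operator and a `q` query parameter (Google is used as the default base URL). The helper names are hypothetical, and this is not necessarily how the queries behind the links above were constructed.

```python
from urllib.parse import quote_plus


def build_query(value, site="nytimes.com"):
    """Build a site-restricted, exact-phrase query from a meta-data value."""
    return f'site:{site} "{value}"'


def search_url(query, base="https://www.google.com/search?q="):
    """URL-encode the query and append it to the search engine's base URL."""
    return base + quote_plus(query)


# Values taken from the per and org fields in the examples above.
for value in ["Ahmadinejad, Mahmoud", "Gates, Bill", "Nintendo Company Limited"]:
    query = build_query(value)
    print(query)
    print(search_url(query))
```

Quoting the value keeps the "Last, First" form intact, which matches how names appear in the per field.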

It’s still early days: the search engines do not appear to have crawled the archives completely, and a quick check shows that older articles lack most of this meta-data. It will be interesting to see what insights skillful use of the meta-data fields will yield over the next few weeks and what applications can be built on top of them.


Comments

16 responses to “Mining the NYT Archives”

  1. ian

    Fixed. Thanks Ed.

  2. Scripting News for 9/28/07 « Scripting News Annex

    […] Kennedy does a view-source on the NY Times and finds there’s a lot of metadata in there. […]

  3. Aaron Straup Cope

    You might be interested to know all of that information has been harvested daily for about 3 years now, just waiting for someone to play with it :

    http://aaronland.info/nytimes/
    http://aaronland.info/nytimes/related/

    Cheers,

  4. ian

    Very cool Aaron – look forward to digging into this!

  5. ed mccauly

    “as something Factiva would leverage for it’s clients”

    please. simple grammar here. it’s “its.” can’t bloggers learn simple rules of English?

    sigh

  6. ian

    Fixed. Thanks Ed.

  7. » links for 2007-09-29

    […] Mining the NY Times Archives — everwas Great set of hacks and keywords for searching the NYT (via Dave Winer). (tags: lifehacks newyorktimes Search Metadata) […]

  8. Center for Citizen Media: Blog » Blog Archive » Bringing the New York Times’ Cornucopia to All

    […] he inspires others to do some spelunking of their own, the result is that people outside the Times are doing crucial R&D for the world’s most […]

  9. The New York Times river flows, but whereto? « Alexander van Elsas’s Weblog on new media & technologies and their effect on social behavior

    […] HTML news and using that data he can create mashups based upon , for example, outlines or keywords. Others have experimented with it as well allowing searches such as articles on Bill […]

  10. Robert L

    Another interesting archive is their free one that dates back to 1851. Granted it doesn’t appear to have the same meta data but apparently all the data is available for analysis by anyone. I found examples here: http://play.6ix.us/nyt/tm/
    Looks like it has a vast amount of text and images available for free.

  11. ian

    @ Robert, very cool – thanks for sharing!

  12. Another Cool New York Times Hack — everwas

    […] Langman left a comment on my previous post about meta-data at nytimes.com with a link to a couple of cool mashups that use keywords on the older archive of New York Times […]

  13. Greg Cohn

    great find Ian!

  14. Heather S

    A typo in: "…I came to appreciate the meta-date the Times would attach …" meta-date should be meta-data.

  15. iankennedy

    Thanks Heather, all fixed.

  16. New York Times API Recap

    […] The TimesTags API opens the taxonomy of 27,000 tags used to identify Times Topics. This classification system is organized into four dictionaries – descriptive, people, organizations, and geography. Dave Winer’s list of recent topics shows a sample of the kinds of individual and topic tags returned. Programmers can use it standalone to search for terms based on character strings in one or more dictionaries, or as an input in the faceted search of the Article API, as described by Ian Kennedy. […]
