Mining the NY Times Archives


Dave Winer looks to the recently released New York Times archives as rich loam of fertile content upon which many applications can be built. In another life, as a product manager for Factiva, I came to appreciate the meta-data the Times attached to its content as something Factiva could leverage for its clients. Factiva provided investment banks and corporate libraries with content feeds from major news outlets and used the meta-data on its sources (often adding meta-data of its own) so that clients would get precisely the content they were interested in and avoid wading through the irrelevant results that blunt keyword searches often produce.

If the global PR officers of Ford or Sharp were looking for breaking news stories, keyword searches on the internet would be nearly useless, as they would pull in stories of used Ford cars for sale or someone’s “sharp” looking suit. These clients would pay for the meta-data, and Factiva’s taxonomy consultants would offer numerous tips & tricks to hone their filters to find exactly what was required.

With this in mind, I took a quick look at the source on the New York Times stories and found that they contain much of the meta-data that I remember.

Today’s story on Iranian President Ahmadinejad’s speech at the UN contains the following meta tags:

  • byl=Warren Hoge
  • des=International Relations;Embargoes and Economic Sanctions;Atomic Weapons
  • per=Ahmadinejad, Mahmoud
  • org=United Nations;Security Council
  • geo=Iran

A business article on the arrival of the Microsoft game Halo 3 has the following:

  • byl=Seth Schiesel
  • des=Computer and Video Games;Computers and the Internet
  • per=Gates, Bill
  • org=Microsoft Corp;Sony Corp;Nintendo Company Limited
  • ticker=Microsoft Corp|MSFT|NASDAQ;Best Buy Company Incorporated|BBY|NYSE;Sony Corp|SNE|NYSE;Nintendo Company Limited|NTDOY|other-OTC;GameStop Corporation|GME|NYSE;Circuit City Stores Inc|CC|NYSE

From this we can see elements of the taxonomy poke through.

  • byl – the byline, i.e. the author of the story
  • des – the description, i.e. how the story is classified by the New York Times
  • per – nodes for individuals
  • org – company or organizational nodes
  • ticker – public company stock symbols and their listing exchange
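For anyone who wants to poke at these fields programmatically, here is a minimal Python sketch. It assumes the fields show up in the page source as ordinary `<meta name="..." content="...">` tags, with values separated by semicolons and ticker entries split by pipes, as in the examples above. The sample HTML and the function names are made up for illustration; the real markup may differ.

```python
from html.parser import HTMLParser

# Field names taken from the examples above.
FIELDS = {"byl", "des", "per", "org", "geo", "ticker"}


class MetaTagParser(HTMLParser):
    """Collects the taxonomy fields from <meta name=... content=...> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name")
        content = attr.get("content")
        if name in FIELDS and content:
            # Multiple values are assumed to be separated by semicolons.
            self.fields[name] = [v.strip() for v in content.split(";") if v.strip()]


def extract_fields(html):
    """Return a dict mapping field name to a list of values."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.fields


def parse_tickers(values):
    """Split 'Company|SYMBOL|EXCHANGE' entries into dicts."""
    tickers = []
    for value in values:
        parts = value.split("|")
        if len(parts) == 3:
            company, symbol, exchange = parts
            tickers.append({"company": company, "symbol": symbol, "exchange": exchange})
    return tickers


# Hypothetical page source, modeled on the Halo 3 example above.
SAMPLE = """<head>
<meta name="byl" content="Seth Schiesel">
<meta name="org" content="Microsoft Corp;Sony Corp;Nintendo Company Limited">
<meta name="ticker" content="Microsoft Corp|MSFT|NASDAQ;Sony Corp|SNE|NYSE">
</head>"""

fields = extract_fields(SAMPLE)
print(fields["org"])
print(parse_tickers(fields["ticker"]))
```

Once the values are in a dict like this, filtering a pile of stories by person, organization, or ticker is a simple lookup rather than a keyword guess.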

I’ve only just started playing around with this, but using text from the meta-data fields and your favorite search engine, you can already start to sort results in interesting ways.

  1. Articles about Mahmoud Ahmadinejad
  2. Articles about Gates, Bill
  3. News about Nintendo

It’s still early days: the search engines do not appear to have crawled the archives completely, and a quick check shows that older articles lack most of this meta-data. It will be interesting to see what insights skillful use of the meta-data fields will yield over the next few weeks and what applications can be built on top of them.


MyBlogLog, more than just faces on a page

Most people first notice MyBlogLog as the service behind the Recent Readers widget you see on various sidebars like the one over at TechCrunch, Yodel, or PassiveAggressiveNotes. Once you spend a little time with the service you’ll see that it’s a whole lot more. Yeah, we have stats. They’re basic but that’s the idea. Keep it short and sweet and give you a quick snapshot of the basics. Where your visitors came from, what they looked at, and where they went.

One feature that I feel is an overlooked gem is the “Hot in My Communities” area. On every person’s profile you’ll see this feature in the upper right-hand corner. What’s here is a ranking of the ten most popular links from all the pages in your communities. For those who are not MyBlogLog users, communities are websites that you join during your travels across the web. You can either explicitly join a community or set a preference for the number of pages of a site you need to view before you are automatically joined (the default is 10). The idea is to keep the communities you join as you would a list of friends on a social networking site: ones that align with your interests. The Hot in My Communities feature is one reason why. We normalize this list against each site’s overall traffic and popularity so that even if a site doesn’t get thousands of pageviews, a link that is “popping” on it will show up in your Hot in My Communities list.
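To picture what that kind of normalization does, here is a rough Python sketch. It is only an illustration and not the formula MyBlogLog actually uses: it assumes each link is scored by its share of the host site’s total pageviews, and the function name, data layout, and example sites are all made up.

```python
def hot_links(clicks_by_site, pageviews_by_site, top_n=10):
    """Rank links by their share of each site's traffic (an assumed scoring rule)."""
    scored = []
    for site, links in clicks_by_site.items():
        total = pageviews_by_site.get(site, 0)
        if total <= 0:
            continue  # no traffic data for this site, so nothing to normalize against
        for url, clicks in links.items():
            # A link's score is its clicks relative to the whole site's pageviews,
            # so a small site can outrank a big one when a link is "popping".
            scored.append((clicks / total, url))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored[:top_n]]


# Hypothetical numbers: the big site has more raw clicks, the small one a bigger share.
clicks = {
    "big.example": {"http://big.example/a": 500},
    "small.example": {"http://small.example/b": 40},
}
pageviews = {"big.example": 100000, "small.example": 200}

print(hot_links(clicks, pageviews))
```

With raw click counts the big site would always win; dividing by site traffic is what lets the small site’s link rise to the top of the list.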

This feature is really cool and acts like a little recommendation engine. You should see a mix of sites that you’ve already visited and links that are most likely going to be interesting because they’re what others who share your reading habits are visiting. If your My Communities area is like a bookshelf of sites you visit on a regular basis, then the Hot in My Communities area is like a dynamic list of pages that have been dog-eared for reference. While a visit to a site like Techmeme gives you a view of what’s hot for everyone, Hot in My Communities is a filtered view of what’s interesting based on sites that you follow. A personalized view.

Now that we have the integration of Yahoo IDs behind us, the team has been going after some low-hanging fruit that we have wanted to get to for a long time. Today we released a feature that allows you to better manage your communities via the Hot in My Communities area. As you look through the list of links served up for your browsing pleasure, you’ll notice a question mark (?) next to each link. If you click on the question mark, you’ll get some information about which of your communities gave you this link. Just as you would manage your subscriptions in a feed reader, if you’re consistently getting irrelevant links within your list, you can choose to prune your Communities list and remove those that are not interesting to you. Conversely, if you are browsing someone else’s profile and you see a link you like, you can click through and join that community. Robyn has a nice write-up on the feature over on the MyBlogLog log.

Happy Browsing!

Current Events gets into finance

Rafat Ali and Staci Kramer over at paidContent have added a Finance tab to their site and along with it launched a financial index of the top 100 new media sites. Dow Jones has quite a nice little business licensing its various indexes to financial firms and mutual funds that want to benchmark themselves for their clients. Could this be a new source of business for paidContent?

Current Events

Nutrition education lunch boxes have high levels of lead


In an ironic twist of fate, 56,000 lunch boxes distributed by California’s Department of Public Health with the slogan “Eat Fruits & Vegetables and be Active” were found to contain high levels of lead. Yes, these were manufactured in China.

A full recall is underway.

Long Journey West to the Farallons


Photo by Todd Sampson

Our first trip on Todd’s sailboat out beyond the Golden Gate was very relaxing. There was pretty much no wind, so we ended up motoring almost the entire 90 miles out and back to the Farallon Islands. Not that I mind – I was a bit nervous, having never ventured out in open water. The seas are quite a bit rougher than the Long Island Sound that I’m used to, and the fog in the morning was a bit creepy. The emergence of a warship slipping out of the fog like a silent warrior was a wake-up call for everyone. On board along with Todd were my brother-in-law Dav and my MyBlogLog co-worker, Chris Goffinet.

It was the perfect dry run in preparation for another trip sometime when we have some wind. We did get a chance to see some wildlife: seals, dolphins, pelicans, whales, and even a sunfish. We also learned a very valuable lesson that, for some odd reason (maybe a practical joke?), is not written up in any of the books we read about the Farallon Islands. Never, ever, anchor off the Farallons in calm seas unless you want to be covered in flies for the entire return trip.

It was so nice to spend time on the water, and even spending the night on the boat the evening before our 5am departure was a reminder of trips I used to take with my father. It was great to get away. It was just 24 hours on the water, but I feel like we went on a long camping trip. We went somewhere, saw things you don’t normally see, and have stories to tell about it.

Thanks Todd, I look forward to more trips and explorations and hopefully a tad more of a breeze next time!


The Wall Comes Down

Everyone wondered if the New York Times would be able to pull off their Times Select premium news experiment. Despite projections of up to $10 million in annual subscription revenues, as of Wednesday morning most areas of the site will be free of charge. This is excellent news for bloggers, who will now be able to point to articles on the site and know their readers can follow their references without having to pay a subscription fee.

Back when Times Select launched almost two years ago, there was talk of driving subscriptions via an affiliate program. I guess that never really took off, and now Vivian L. Schiller, senior vice president and general manager of the site, admits that, “What wasn’t anticipated was the explosion in how much of our traffic would be generated by Google and Yahoo.”

It’s widely known that more traffic comes into the site via search engine links and blog referrals than via the front door, and if you’re not converting successfully via these entry points, then you’re better off monetizing the traffic via advertising.

It’ll be interesting to see if this puts pressure on the Wall Street Journal to open up, as Rupert Murdoch, their new owner, has hinted.

I still think that the optimal combination of free vs. premium is the one that I outlined two years ago when Times Select launched.

Restricting access during the period when these pieces are the most valuable will drive subscriptions to TimesSelect. It makes less sense to keep these pieces under lock and key throughout the time when people are mildly curious to see what all the fuss is about and have the time to sample a frequently referenced article without having to commit to an annual subscription. I would prefer to see the program re-jigged so that TimesSelect members get first dibs on grokking the perspective of the day but after 48 hours the doors are open for any and all up until the 3 month mark when they drop back to a view which restricts non-subscribers to only the first few paragraphs.

Open access to popular pieces for a three month period would help move low cost advertising inventory and allow the fence-sitters to properly experience the quality of the Times’ news stream should they later decide they want to get access to this stuff prior to the 48 hour embargo for non-subscribers.

Call it Kennedy’s Rolling Window of news & perspective. The cheap seats only let you see what’s directly in front of the window while subscribers get to see not only what’s coming down the pike but also dig back and review what’s gone by.
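For the mechanically minded, the rolling window boils down to a small rule. Here is a Python sketch under a few assumptions of mine: the “3 month mark” is approximated as 90 days, non-subscribers see nothing at all during the first 48 hours, and the names (`access_level`, the three return values) are invented for illustration.

```python
from datetime import datetime, timedelta

SUBSCRIBER_LEAD = timedelta(hours=48)  # subscribers get first dibs
OPEN_WINDOW = timedelta(days=90)       # "3 month mark", approximated as 90 days


def access_level(published, now, is_subscriber):
    """Return 'full', 'locked', or 'excerpt' for a reader at a given time."""
    if is_subscriber:
        return "full"  # subscribers can always look ahead and dig back
    age = now - published
    if age < SUBSCRIBER_LEAD:
        return "locked"  # assumption: nothing is shown during the embargo
    if age < OPEN_WINDOW:
        return "full"  # the doors are open for any and all
    return "excerpt"  # back to only the first few paragraphs


published = datetime(2007, 9, 18, 9, 0)
for days in (1, 10, 120):
    when = published + timedelta(days=days)
    print(days, access_level(published, when, is_subscriber=False))
```

The non-subscriber’s view slides along with the calendar, while the subscriber’s view never changes, which is the whole pitch for subscribing.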


Climbing back up the rankings

One of the most frustrating things about moving your blog to a new domain is watching your various rankings drop off a cliff, along with everything that comes with them. Despite all the attention to detail (301 redirects, revisions on all your various social networking profiles, re-writing URLs), you basically cease to exist as far as the search engines are concerned. Here we are, a month later, and I’m still crawling my way back to relevance.

Reputation and influence are not portable.

SEO is just a passing hobby of mine. Feeling inspired after the last Webmaster World conference in Las Vegas, I experimented a bit on my old domain and tried to see if I could get myself ranked for “social media advertising” and was pleasantly surprised when it only took me a few weeks to reach the #1 spot for the phrase on all four major search engines. I later realized that the term was not as popular as “social media marketing” so I shifted to focus on that term. I soon ranked highly for that term as well. That was back in January.

I later lost interest and didn’t really think of it until I moved everything over to this new blog on this new domain on August 5th. Right before I pulled the plug on the old blog, I took a snapshot of my rankings on various services and have been tracking my comeback and it’s been pretty slow going.

Here’s a summary of the highest point I had reached on the old domain and where I am now on the new one.

Rankings for “social media advertising”

  • old domain – either 1 or 2 for Google, Yahoo, and MSN. #6 for Ask
  • new domain – not even in the first 50 results

Rankings for “social media marketing”

  • old domain – in the top 30 for Google, Yahoo and MSN, #5 for Ask
  • new domain – #31 for Google, nowhere on Yahoo, MSN. Ask still has my old domain listed at #17

Technorati Authority. Stowe Boyd had a series of posts where he tracked his rise back up the rankings after he moved his blog, which is interesting for comparison, except that was back before Technorati calculated an authority value.

  • old domain – 56
  • new domain – 12

Google Pagerank

  • old domain – 4
  • new domain – not even rated yet, must still be in the “sandbox”

Yahoo! Site Explorer Inlinks – I didn’t measure the inlinks on the old domain, but I’ve been watching the inlinks climb up and am now at 866.

Google Webmaster Tools Inlinks – I also never got this on the old site but just saw it jump from just a handful to 4,593 on 9/4.


  • old domain – 126 members in its heyday
  • new domain – only 5 members have discovered this new site

It’ll be interesting to see if these numbers change much over time. If you feel like giving me a little boost, feel free to link to this site using the phrase “social media marketing” or “social media advertising” and see if that will change things.

Photo by Todd Sampson



Some internal discussion here about people who grab brand name Twitter handles so they can later sell them to the highest bidder. Like the domain squatters of old. Remember Joshua Quittner’s stunt? Yahoo colleague Ryan Kuder came up with the term.

squitter n. an individual who grabs a brand name Twitter handle for future profit.

Let’s see how long these stay open: is taken.

Apple Price Drop: It was all part of the plan

Steve Jobs ain’t no dummy. Robert Cringely writes,

Apple introduced the iPhone at $599 to milk the early adopters and somewhat limit demand then dropped the price to $399 (the REAL price) to stimulate demand now that the product is a critical success and relatively bug-free. At least 500,000 iPhones went out at the old price, which means Apple made $100 million in extra profit.

Had nobody complained, Apple would have left it at that. But Jobs expected complaints and had an answer waiting — the $100 Apple store credit. This was no knee-jerk reaction, either. It was already there just waiting if needed. Apple keeps an undeserved $50 million and customers get $50 million back. Or do they? Some customers will never use their store credit. Those who do use it will nearly all buy something that costs more than $100. And, most importantly, those who bought their iPhones at an AT&T store will have to make what might be their first of many visits to an Apple Store. That is alone worth the $50 per customer this escapade will eventually cost Apple, taking into account unused credits and Apple Store wholesale costs.

The Puppet Master 

Not only that, think of all the free publicity this stunt created. All eyes will be on what gets put into the next version of the Apple OS, due to ship next month. . .

Google Reader adds Search. Why Competition is Good.

Google Reader Search Box

Just a few days after I posted about the new Bloglines beta and how nice it was to re-discover their search-your-feeds feature, Google announced that they’ve finally added a search box to their Reader.

I’m staying with Bloglines right now for the novelty of it, but I’ve noticed:

  1. Things that get marked “read” don’t always seem to stay that way; I’m still trying to figure out the pattern
  2. I miss having decent mobile access ( doesn’t really do it for me) to read feeds on the go

Competition is a wonderful motivator. There was a whole lot of hoopla around the Google Reader API that was going to allow developers to extend Google Reader in all sorts of interesting ways. Niall Kennedy did a whole dissection of the API and reverse engineered Google Reader so people could see how it was put together. That was in December 2005 – now I can only find one example of a product that uses this promised but yet to be released API. I guess things got re-prioritized – we’ve all been there.

At least there is OPML which makes it easy to jump around. It sure is nice to have the freedom to experiment and walk if something better comes along.