Popping filter bubbles at SmartNews

It’s now just over a month since I joined SmartNews and I am digging into what’s under the hood and the mad science that drives the deceptively simple interface of the SmartNews product.


On the surface, SmartNews is a news aggregator. Our server pulls in URLs from a variety of feeds and custom crawls, but the magic happens when we try to make sense of what we index, refining the 10 million+ stories down to the several hundred most important stories of the day. That’s the technical challenge.

The BHAG (big hairy audacious goal) is to address the increased polarization of society. The filter bubble that results from getting your news from social networks is caused by the echo chamber effect of a news feed optimized to show you more of what you engage with and less of what you do not. Personalization is excellent for increasing relevance in things like search, where you need to narrow results to find what you’re looking for, but it is dangerously limiting for a news product, where a narrowly personalized experience has what Filter Bubble author Eli Pariser called “negative implications for civic discourse.”

So how do you crawl 10 million URLs daily and figure out which stories are important enough for everyone to know? Enter Machine Learning.

I’m still a newbie to this but am beginning to appreciate the promise of applying machine learning to the problem above. New to machine learning too? Here’s a compelling example of what you can do, illustrated in a recent presentation by Samiur Rahman, an engineer at Mattermark, which uses machine learning to match news to its company profiles.

Samiur Rahman on Machine Learning

The word relationship map above was the result of a machine learning algorithm being set loose on a corpus of 100,000 documents overnight. By scanning all the sentences in the documents, looking at which words appeared in those sentences, and noting the frequency and proximity of those words, the algo was able to learn that Japan : sushi as USA : pizza, and that Einstein : scientist as Picasso : painter.

Those of you paying close attention will notice that some of the relationships are slightly off. France : tapas? Google : Yahoo? Spotting those is the power of the human mind at work. We’re great at pattern matching. Machine learning algorithms are just algorithms, and they need continual tuning. Koizumi : Japan? Well, that shows you the limitations of working with a dated corpus of documents.

But take a step back and think about it. In 24 hours, a well-written algorithm can take a blob of text and parse it for meaning and use that to teach itself something about the world in which those documents were created.
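To make the word-relationship idea concrete, here is a minimal sketch of how analogies like Japan : sushi as USA : pizza can fall out of simple co-occurrence counts. This is not the algorithm from the presentation. The six-sentence corpus and the function names (cooccurrence_vectors, cosine, analogy) are made up for illustration, and for simplicity it counts words that share a sentence without weighting by proximity.

```python
from itertools import combinations
from math import sqrt


def cooccurrence_vectors(sentences):
    """Map each word to {other_word: times they appear in the same sentence}."""
    vectors = {}
    for sentence in sentences:
        words = sentence.lower().split()
        for left, right in combinations(words, 2):
            if left == right:
                continue
            vectors.setdefault(left, {})
            vectors.setdefault(right, {})
            vectors[left][right] = vectors[left].get(right, 0.0) + 1.0
            vectors[right][left] = vectors[right].get(left, 0.0) + 1.0
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = sqrt(sum(value * value for value in u.values()))
    norm_v = sqrt(sum(value * value for value in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' using the vector b - a + c."""
    keys = set(vectors[a]) | set(vectors[b]) | set(vectors[c])
    target = {
        key: vectors[b].get(key, 0.0)
        - vectors[a].get(key, 0.0)
        + vectors[c].get(key, 0.0)
        for key in keys
    }
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(target, vectors[word]))


# Hypothetical toy corpus; a real one would be many thousands of documents.
corpus = [
    "japan eats sushi",
    "usa eats pizza",
    "japan country",
    "usa country",
    "sushi food",
    "pizza food",
]

vectors = cooccurrence_vectors(corpus)
print(analogy(vectors, "japan", "sushi", "usa"))  # pizza
```

With a corpus this small the result is fragile, which is the same reason the real map has a few odd pairings: the relationships are only as good as the text they were learned from.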

Now jump over to SmartNews and understand that our algorithms are processing 10 million news stories each day and figuring out the most important news of the moment. Not only are we looking for what’s important, we’re also determining which section to feature the story in, how prominently, where to cut the headline, and how best to crop the thumbnail photo.

The algorithm is continually being trained and the questions that it kicks back are just as interesting as the choices it makes.

Discovery, diversity, and relevance push and pull against one another, and all three are inputs into the ever-evolving algorithm. Today I learned about “exploration vs. exploitation”. How do we tell our users the most important stories of the day in a way that covers the bases but also teaches them something new?
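For fellow newbies, one textbook way to balance exploration and exploitation is an "epsilon-greedy" rule: most of the time show what has performed best so far (exploit), but a small fraction of the time show something at random (explore). The sketch below is only an illustration of that general idea, not how SmartNews does it. The class name, the section names, and the use of click rate as the signal are all assumptions of mine.

```python
import random


class EpsilonGreedy:
    """Pick an item: explore at random with probability epsilon, else exploit."""

    def __init__(self, items, epsilon=0.1, seed=None):
        self.items = list(items)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.shows = {item: 0 for item in self.items}
        self.clicks = {item: 0 for item in self.items}

    def click_rate(self, item):
        """Observed clicks per showing; 0.0 for items never shown."""
        shown = self.shows[item]
        return self.clicks[item] / shown if shown else 0.0

    def choose(self):
        # Explore: occasionally surface something regardless of its track record.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.items)
        # Exploit: otherwise show the item with the best click rate so far.
        return max(self.items, key=self.click_rate)

    def record(self, item, clicked):
        """Update the counts after showing an item to a reader."""
        self.shows[item] += 1
        if clicked:
            self.clicks[item] += 1


# Hypothetical sections and feedback, purely for illustration.
picker = EpsilonGreedy(["politics", "science", "sports"], epsilon=0.1, seed=42)
picker.record("science", clicked=True)
picker.record("politics", clicked=False)
print(picker.choose())
```

Setting epsilon to 0 gives a pure filter bubble (only what already works gets shown), while setting it to 1 ignores relevance entirely. The interesting territory is in between.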

This is a developing story. Stay tuned!

Getting the Band Together Again at SmartNews

Following a month off after my unexpected liberation from Gigaom, I started this week as Director of Media & Technology Partnerships at SmartNews. I feel very fortunate to have discovered this company at a time when I believe I have a lot to offer.

First, some recent coverage,

While researching the company, I was delighted to learn they had hired Rich Jaroslovsky. Rich and I crossed paths a few times when I was working at Dow Jones as he was getting wsj.com off the ground. We both have a fascination with technology’s impact on media and I shared his mission to bring The Wall Street Journal online. We had since gone our separate ways but I always admired his love and respect for good journalism as a writer, editor, and business guy.

Rich explained to me that SmartNews thinks of itself as a machine learning company with a news front-end, which is right in the nexus of what makes me tick. The co-founders, Ken Suzuki and Kaisei Hamamoto, are super-sharp engineers who see news discovery as an interesting problem to solve and hugely important for society to get right. To give you a sense of how they think, as they look for real estate for their San Francisco office, Ken and Kaisei each created their own interactive maps showing the locations of high tech startups and compared notes to determine that the area of 2nd and Howard was the ideal spot to focus their search.

I made my pitch (excerpted below) and here I am!

Two of the hardest challenges for the publishing industry are distribution and advertising. When publishers moved online, they had to reinvent their traditional distribution channels and navigate a new landscape.

Initially it was the portals such as Yahoo and AOL that would curate the best of the web. Advertising was also sold this way, manually curated and matched to broad channels of interest maintained by the portals.

As technology improved, search engines such as Google automated discovery, matching a reader’s interests to a publisher’s content. Advertising was automated and optimized via keyword matching and auction systems to extract maximum value. Distributed widgets allowed publishers to embed advertising into their sites, and a combination of publisher tags and indexing allowed them to take advantage of an ad network’s inventory.

Social media platforms have recently taken over as a source of traffic for publishers and content snippets shared via these networks represent the fastest growing segment of inbound readers for a publisher.

A common thread to success across all these channels is attractive representation of a publisher’s content within each distribution channel. Whether it’s metadata, SEO, or “social media optimization,” each new distribution channel has spawned a new method of representing your content to the service that is doing the crawling and aggregation.

For a new distribution channel, both the crawling and aggregation algorithms are key to successful presentation of content and relevant advertising to the reader.

Technology has enabled effortless distribution of news, so the looming challenge is not so much the distribution of content as its discovery and presentation. Social media burnout is real, and personalization algorithms are still very basic, often pushing more and more similar content to the reader. The result is a “filter bubble” which shows readers only what they want to see or, worse, what they already know.

Working with publishers to find them new sources of readership, and with readers to teach them something they didn’t know, is an important goal that aligns with my interests. The fact that the team is based in Japan, a country with a strong culture of news readership, is attractive to me as I am a big fan of introducing Japan to the rest of the world.