everwas

a blog by Ian Kennedy

Tag: Machine Learning

Ghost in the Machine

There are at least two sides to every story. The Planned Parenthood videos were a polarizing topic that monopolized the news cycle several weeks ago. How do you teach an algorithm a point of view? How do you optimize for discovery and strike the right balance for diversity while avoiding duplication?

SmartNews is a news aggregation app driven by machine learning algorithms. The platform is tuned for discovery (as opposed to personalization). After using it regularly, I began collecting screenshots of my favorite examples when the app taught me something new or showed me two items side-by-side that suggested a subtle intelligence.

Two candidates and their technology.

The science and application of artificial intelligence to personalization is well understood. From Amazon’s people-that-bought-this-also-bought-that to Pandora’s Music Genome Project, software has for years been recommending what you’ll like next based on what you’ve liked so far.

The new frontier in artificial intelligence is machine learning. Companies such as Spotify and Netflix are hard at work trying to predict future tastes based on an evolving understanding of collective tastes. Sure, learning assumes knowledge of the past, but projecting that learning into the future is much harder as you build a model based on an understanding of something that does not exist. Rather than showing you something we know you’ll like based on what you liked in the past, machine learning discovers things you didn’t know you would like.

First a little context. SmartNews, while deceptively simple, has a lot going on under the hood. At any time, the SmartNews app shows around 250 headlines across 8 categories. These headlines are selected from millions of stories that are scanned each day. In order to ensure that the stories featured in the app are the most important and interesting, a number of things must take place.

After harvesting URLs, the text of each article is run through a classifier that examines things such as the headline, author byline, publication date, images and video embeds. These pieces are analyzed by a semantic engine that extracts data so the algorithm can map the article to a topic cluster and place it into the appropriate subject category. (I wrote about how this is done in an earlier post.)

Importance estimation is where we rank an article and determine where it will go in the app relative to other articles. Does it go towards the top of a section or towards the bottom? If the top, does it deserve featured treatment? Maybe it’s so topical it needs to be pushed to the Top page, which is reserved for only the most important stories of the moment.

Finally, diversification ensures there is a good mix of stories in each category. If there are 40 stories about guacamole and peas, here’s where we determine which to show and which to push to the background. If there’s a new development on a story, the update will push its way in and take prominence over an older story.
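The ranking and diversification steps above can be sketched roughly as follows. This is a minimal illustration, not how SmartNews actually does it: the `Story` fields, the `importance` score, and the `max_per_cluster` cap are all assumptions made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    title: str
    cluster: str       # hypothetical topic-cluster label from the classifier
    importance: float  # hypothetical importance score, higher is better


def diversify(stories, slots, max_per_cluster=2):
    """Fill `slots` positions by importance, capping stories per topic cluster."""
    picked = []
    per_cluster = defaultdict(int)
    for story in sorted(stories, key=lambda s: s.importance, reverse=True):
        if len(picked) == slots:
            break
        if per_cluster[story.cluster] >= max_per_cluster:
            continue  # cluster is saturated; push this story to the background
        picked.append(story)
        per_cluster[story.cluster] += 1
    return picked


stories = [
    Story("Guacamole with peas?", "guacamole", 0.95),
    Story("The pea guacamole backlash", "guacamole", 0.90),
    Story("Chefs weigh in on peas", "guacamole", 0.85),
    Story("Drought update", "climate", 0.80),
    Story("Transfer window roundup", "soccer", 0.60),
]
front = diversify(stories, slots=4)
print([s.title for s in front])
```

With a cap of two per cluster, the third guacamole story is skipped even though it outscores the climate and soccer stories, which is the trade-off diversification is making.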

These are just details to give you context. The most amazing thing to me is when the app surfaces a “hidden gem” that I would not normally run across if I were using an RSS reader hard-coded to a collection of feeds, or a social network that is limited to news shared by my friends.

The best way to appreciate SmartNews as a discovery engine is to use it daily, but if you haven’t had a chance, here are a few more of my favorite Gems:

While the Center for Medical Progress’ undercover video interviews with Planned Parenthood staffers may have been shocking, the representation of two points of view helped me see both sides of the issue. What was interesting was that the Cosmopolitan article (a source I normally do not read) had the best measured rebuttal.

Much of the climate change news ends up in the Science category. As that story grows in relevance to us all, more publications dig into it. If you haven’t read this terrifying Rolling Stone piece, read it now.

Here’s an example of a developing story getting an update. ESPN reports that WWE is cutting its relationship with Hulk Hogan over his offensive comments. People Magazine follows up with the story of his apology. Oh, also notice that the algorithm put both stories into the Entertain section.

As news of the killing of Cecil the Lion went viral, the algorithm was smart enough to surface a side of the story from a local Minnesota paper.

The screenshot above, more than any of the others, shows the freaky intelligence working behind the scenes. It is like those times when an algorithmically generated playlist just nails the transition from one song into the next. Drawing the connection between gun violence in the US and how such an environment might have prepared an off-duty soldier to do the right thing shows how a well-designed system can be greater than the sum of its component parts.

Do you use SmartNews? Have you had the same experience? Send along some of your own Hidden Gems and I’ll add them to the gallery.

September 15, 2015
Spotify’s Mixtape Algorithm

With the launch of Apple Music’s “For You” feature, Spotify’s hand has been forced, and it has unveiled its own personalization engine in response. Discover Weekly launched today via a series of well-timed pieces across the tech press. The PR push is on to explain it to everyone currently evaluating Apple Music on a 3-month trial.

Spotify describes Discover Weekly as, “like having your best friend make you a personalised mixtape every single week.” More specifically, “Updated every Monday morning, Discover Weekly brings you two hours of custom-made music recommendations, tailored specifically to you and delivered as a unique Spotify playlist.”

Spotify, to date, has relied mostly on the social sharing of tracks and manually curated playlists (more than 2 billion!) to enhance the experience of the Spotify subscriber. The coverage today highlights the contribution of Echo Nest, a music intelligence and data platform acquired by Spotify in March of 2014. Reading a number of posts we learn the following:

Spotify’s internal tool that they use to build playlists has the wonderful moniker, Truffle Pig.
Inside Spotify’s Hunt for the Perfect Playlist

and,

The Echo Nest’s job within Spotify is to endlessly categorize and organize tracks. The team applies a huge number of attributes to every single song: Is it happy or sad? Is it guitar-driven? Are the vocals spoken or sung? Is it mellow, aggressive, or dancy? On and on the list goes. Meanwhile, the software is also scanning blogs and social networks—ten million posts a day, Lucchese says—to see the words people use to talk about music. With all this data combined, The Echo Nest can start to figure out what a “crunk” song sounds like, or what we mean when we talk about “dirty south” music.

Smart.

maybe some of the songs are bad, or the lead-off song isn’t representative of the rest of the playlist—we’ll try to refine that and give it a shot.” Playlists are made by people, but they live and die by data.

This is another way of underlining the best practices of machine learning. An algorithm is really only as good as its training set.

In order to keep a burst of listens from drifting your taste profile towards a fleeting interest, something re/code’s Kafka calls “the Minions Problem”, Spotify separates isolated wandering from the core.

Spotify says it solves the Minions Problem by identifying “taste clusters” and looking for outliers. So if you normally listen to 30-year-old indie rock but suddenly have a burst of Christmas music in your listening history, it won’t spend the next few weeks feeding you Frank Sinatra and Bing Crosby. The same goes for kids’ music, which is apparently why Spotify knows I didn’t really like “Happy” that much — it was just in the “Despicable Me 2” soundtrack.
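A rough sketch of the taste-cluster idea described above. Spotify has not published the details, so everything here is an assumption: the cluster labels, the `min_share` threshold, and the rule that a cluster only counts as “core” if it makes up a minimum share of your plays.

```python
from collections import Counter


def core_clusters(history, min_share=0.15):
    """Return taste clusters that make up at least `min_share` of all plays.

    `history` is a list of (track, cluster) pairs; clusters below the
    threshold are treated as outliers (a fleeting burst, not core taste).
    """
    counts = Counter(cluster for _, cluster in history)
    total = sum(counts.values())
    return {c for c, n in counts.items() if n / total >= min_share}


def filter_candidates(candidates, history, min_share=0.15):
    """Keep only candidate tracks whose cluster is part of the core taste."""
    core = core_clusters(history, min_share)
    return [track for track, cluster in candidates if cluster in core]


# Hypothetical listening history: mostly indie rock, some folk,
# and a short burst of Christmas music.
history = (
    [("indie_track", "indie_rock")] * 40
    + [("folk_track", "folk")] * 12
    + [("xmas_track", "christmas")] * 5
)
candidates = [("new_xmas_song", "christmas"), ("new_indie_song", "indie_rock")]
print(filter_candidates(candidates, history))
```

The Christmas burst is about 9% of plays here, so it falls under the threshold and never reaches the recommendations.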

Spotify has built its discovery algorithm on the listening behaviors of its 75 million users while Apple has advertised a more top-heavy approach using designated curators that publish playlists for a mass audience. I have to wonder what happened to all the Genius data gathered from analyzing everyone’s iTunes collections, and whether we’ll see it used to balance out Apple’s approach.

I’ve heard that Spotify is working on a “family plan” that would let me break out the collective profile built up on my Spotify account that I share with my kids. That will yield more relevant personal recommendations so I don’t get the hip-hop heavy playlist that greeted me due to my son’s heavy rotation.

I think it’s still very early days and consumers will ultimately benefit in the music recommendation race that has just begun.

July 20, 2015
The importance of context

Surprisingly, the YouTube recommendation algorithm doesn’t draw inputs from far beyond the confines of YouTube itself. You might think that mining our Google search histories for clues about what videos we’d like would pay off. Nope, Goodrow says.

“The challenge is that web search history is very very broad.” Just because you Googled for help with your taxes doesn’t mean you want to watch YouTube videos about the ins and outs of U.S. tax law.

– To Take on HBO and Netflix, YouTube had to Rewire Itself, Fast Company

Not surprising at all actually. Just because everything on the internet can be connected doesn’t mean it has to be connected. When the internet is your world, zooming in on contexts and measuring behaviors in those contexts becomes paramount.

May 18, 2015
Popping filter bubbles at SmartNews

It’s now just over a month since I joined SmartNews and I am digging into what’s under the hood and the mad science that drives the deceptively simple interface of the SmartNews product.

On the surface, SmartNews is a news aggregator. Our server pulls in URLs from a variety of feeds and custom crawls, but the magic happens when we try to make sense of what we index and refine the 10 million+ stories down to the several hundred most important stories of the day. That’s the technical challenge.

The BHAG is to address the increased polarization of society. The filter bubble that results from getting your news from social networks is caused by the echo chamber effect of a news feed optimized to show you more of what you engage with and less of what you do not. Personalization is excellent for increasing relevance in things like search, where you need to narrow results to find what you’re looking for. But it is dangerously limiting for a news product, where a narrowly personalized experience has what Filter Bubble author Eli Pariser called “negative implications for civic discourse.”

So how do you crawl 10 million URLs daily and figure out which stories are important enough for everyone to know? Enter Machine Learning.

I’m still a newbie to this but am beginning to appreciate the promise of applying machine learning to the problem above. New to machine learning too? Here’s a compelling example of what you can do, illustrated in a recent presentation by Samiur Rahman, an engineer at Mattermark, which uses machine learning to match news to their company profiles.

The word relationship map above was the result of a machine learning algorithm being set loose on a corpus of 100,000 documents overnight. By scanning all the sentences in the documents and looking at the occurrence of words that appeared in those sentences and noting the frequency and proximity of those words, the algo was able to learn that Japan : sushi as USA : pizza, and that Einstein : scientist as Picasso : painter.
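Analogies like these are commonly produced by doing arithmetic on word vectors: take the vector for “sushi”, subtract “Japan”, add “USA”, and look for the nearest word. I’m assuming that is roughly what is going on in the presentation. The sketch below uses tiny hand-made vectors (the `VECTORS` table and its three axes are invented for illustration), whereas real vectors are learned from the corpus.

```python
import math

# Hand-made 3-d vectors for illustration only; real word vectors are learned
# from a corpus and have far more dimensions.
# Hypothetical axes: [country-ness, food-ness, Japan (+1) vs USA (-1)]
VECTORS = {
    "japan": (1.0, 0.0, 1.0),
    "usa": (1.0, 0.0, -1.0),
    "france": (1.0, 0.0, 0.0),
    "sushi": (0.0, 1.0, 1.0),
    "pizza": (0.0, 1.0, -1.0),
    "ramen": (0.0, 1.0, 0.8),
}


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c):
    """Answer 'a : b as c : ?' by finding the word nearest to b - a + c."""
    target = tuple(
        vb - va + vc for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])
    )
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))


print(analogy("japan", "sushi", "usa"))
```

The input words are excluded from the candidates, otherwise the nearest neighbor is often just one of the words you started with.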

Those of you paying close attention will notice that some of the relationships are slightly off – France : tapas? Google : Yahoo? Spotting that is the power of the human mind at work; we’re great at pattern matching. Machine learning algorithms are just that, algorithms, and they need continual tuning. Koizumi : Japan? Well, that shows you the limitations of working with a dated corpus of documents.

But take a step back and think about it. In 24 hours, a well-written algorithm can take a blob of text and parse it for meaning and use that to teach itself something about the world in which those documents were created.

Now jump over to SmartNews and understand that our algorithms are processing 10 million news stories each day and figuring out the most important news of the moment. Not only are we looking for what’s important, we’re also determining which section to feature the story in, how prominently, where to cut the headline and how to best crop the thumbnail photo.

The algorithm is continually being trained and the questions that it kicks back are just as interesting as the choices it makes.
- A story about President Obama playing a round of golf. Is it a sports story or is it a political story?
- A Medium post titled simply “2016” is an important political announcement.
- The Nation, normally covering politics from a progressive viewpoint, has things to say about sports too.
Discovery, diversity, and relevance push and pull against each other, and all are inputs into the ever-evolving algorithm. Today I learned about “exploration vs. exploitation”. How do we tell our users the most important stories of the day in a way that covers the bases but also teaches them something new?
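One textbook way to frame exploration vs. exploitation is the epsilon-greedy strategy: most of the time show the story that has performed best so far, but with a small probability show something else to learn whether it deserves more attention. This is a generic sketch, not a description of the SmartNews algorithm; the `click_rates` numbers and the `epsilon` value are made up.

```python
import random


def choose_story(click_rates, epsilon, rng):
    """Epsilon-greedy choice over stories.

    With probability `epsilon`, explore: pick any story at random.
    Otherwise, exploit: pick the story with the best observed click rate.
    """
    if rng.random() < epsilon:
        return rng.choice(sorted(click_rates))
    return max(click_rates, key=click_rates.get)


# Hypothetical observed click rates per story.
click_rates = {"election": 0.12, "climate": 0.08, "local_gem": 0.03}

rng = random.Random(2015)  # seeded so the run is repeatable
picks = [choose_story(click_rates, epsilon=0.2, rng=rng) for _ in range(1000)]
for story in click_rates:
    print(story, picks.count(story))
```

The top story still wins most of the slots, but the hidden gem gets shown often enough to prove itself.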
March 18, 2015