On the surface, SmartNews is a news aggregator. Our server pulls in urls from a variety of feeds and custom crawls but the magic happens when we try and make sense of what we index to refine the 10 million+ stories down to several hundred most important stories of the day. That’s the technical challenge.
The BHAG is to address the increased polarization of society. The filter bubble that results from getting your news from social networks is caused by the echo chamber effect of a news feed optimized to show you more of what you engage with and less of what you do not. Personalization is excellent for increasing relevance in things like search where you need to narrow results to find what you’re looking for but personalization is dangerously limiting for a news product where a narrowly personalized experience has what Filter Bubble author Eli Pariser called the “negative implications for civic discourse.”
So how do you crawl 10 million URLs daily and figure out which stories are important enough for everyone to know? Enter Machine Learning.
I’m still a newbie to this but am beginning to appreciate the promise of the application of machine learning to provide a solution to the problem above. New to machine learning too? Here’s a compelling example of what you can do illustrated in a recent presentation by Samiur Rahman, and engineer at Mattermark that uses machine learning to match news to their company profiles.
The word relationship map above was the result of a machine learning algorithm being set loose on a corpus of 100,000 documents overnight. By scanning all the sentences in the documents and looking at the occurrence of words that appeared in those sentences and noting the frequency and proximity of those words, the algo was able to learn that Japan: sushi as USA : pizza, and that Einstein : scientist as Picasso : painter.
Those of you paying close attention will notice that some the relationships are off slightly – France : tapas? Google : Yahoo? This is the power of the human mind at work. We’re great with pattern matches. Machine learning algorithms are just that, something that needs continual tuning. Koizumi : Japan? Well that shows you the limitations of working with a dated corpus of documents.
But take a step back and think about it. In 24 hours, a well-written algorithm can take a blob of text and parse it for meaning and use that to teach itself something about the world in which those documents were created.
Now jump over to SmartNews and understand that our algorithms are processing 10 million news stories each day and figuring out the most important news of the moment. Not only are we looking for what’s important, we’re also determining which section to feature the story, how prominently, where to cut the headline and how to best crop the thumbnail photo.
The algorithm is continually being trained and the questions that it kicks back are just as interesting as the choices it makes.
- A story about President Obama playing a round of golf. Is it a sports story or is it a political story?
- A Medium post titled simply “2016” is an important political announcement.
- The Nation, normally covering politics from a progressive viewpoint, has things to say about sports too.
The push and pull between discovery, diversity, and relevance are all inputs into the ever-evolving algorithm. Today I learned about “exploration vs. exploitation”. How do we tell our users the most important stories of the day in a way that covers the bases but also teaches you something new?
This is a developing story, stay tuned!