Category: Work

  • Preserving Publisher Rights in the Era of AI Chatbots


    Last September, I gave a talk at the Media Party conference in New York proposing a method to track the origin of text as it travels through a Large Language Model (LLM). Tracking provenance is important because it lets us evaluate reputation and assign credit, so that licensing revenues can be properly allocated to the publishers that provide source material to an LLM.

    What follows are the slides from the talk with some annotations to help explain.

    The rough outline of the proposal is a simple type of HTML markup that allows the publisher or author of a page to mark unique phrases, facts, quotes, or figures for which they would like to retain credit. This markup, if retained along with the indexed text, would allow an LLM to store and trace the origin of these unique phrases back to the originating URL or domain, tracking the “knowledge” as it travels from the originating website to an LLM and then back out via a genAI chatbot in the form of an “answer.”

    Setting some historical context, I explained how incentives can shape ecosystems. The pageview-and-advertising economy of online publishing incentivizes publishers to seek out traffic and has given rise to an ecosystem that puts Google and their “ten blue links” at the center. A link drives traffic, and traffic drives ad impressions, which equals revenue in this ecosystem.

    This well-established ecosystem is being upended by AI chatbots which efficiently extract knowledge from a page and serve it back to the user without generating a pageview. This cuts out an important way for publishers to make money, grow audience, and promote their brand.

    To get a jump on this new ecosystem, large publishers are cutting deals with the AI companies but only the biggest will have the resources to benefit from such arrangements. Smaller publishers will be left out.

    SimpleFeed (where I work) released a simple WordPress plugin that monitors your site to see who is crawling it and allows the site admin to block selected bots. The idea is to educate smaller site owners about how much indexing is going on and build awareness of how the LLMs are interacting with their content.

    According to Cloudflare, bots make up 30% of a site’s traffic, and this figure will surely increase.

    Referrals from social networks are falling. This puts pressure on site owners who wish to control who comes to crawl their site. Who do you let in, and who do you block? The point of publishing something is to distribute your information far and wide but, right now, many are defending their sites from aggressive crawlers strip-mining their content without compensation.

    If we play this situation out to its conclusion, the largest publishers will survive on whatever licensing terms they can secure while the smaller sites get starved of traffic and miss out on any significant licensing revenue. The result is that we lose the diversity of the web. This leads to the gentrification of everything going into and coming out of the LLMs. This is what is called an ecosystem collapse.

    Tim O’Reilly is my North Star when it comes to understanding technological tectonic shifts. Much of my thinking here is inspired by an O’Reilly piece, How to Fix “AI’s Original Sin,” in which he writes about how incentives can influence ecosystem design and how the pageview incentives of the past result in the block-and-tackle behavior of publishers towards the LLM platforms today.

    The challenge for the LLM platforms is to break out of this cycle by creating a system for “detecting content ownership and providing compensation” so that everyone can share in the enormous, untapped potential anticipated for the LLM platforms. In O’Reilly’s words,

    This is one of the great business opportunities of the next few years, awaiting the kind of breakthrough that pay-per-click search advertising brought to the World Wide Web.

    In the world of digital art (audio, photos, videos), the people and companies behind Content Credentials are already hard at work creating this system.

    If a picture is worth 1,000 words, there must be value assigned to text. And if something has value, it’s worth tracking. I propose a few elements worth tracking: quotes, statistics, and even unique phrases.

    The next few slides told the story of how, when blogs and blogging were just getting started, there was a huge problem with comment spam. This was largely the result of incentives to get a high reputation site to link back to the commenter’s website to help improve their ranking in Google’s search results.

    Over the course of a few days (the internet was a smaller place back then), engineers at Google and Six Apart (where I worked at the time) agreed to negate the relevance of the link back to the commenter’s site on a comment and dealt a blow to the comment spam problem. A small group of engineers extended the web and, in a very simple way, removed the incentives that rewarded bad behavior.

    I told this story because I see the rel= link qualifier as something that could be used to markup text and prove provenance. I proposed something called a “knowledge unit” or KU for short.

    The syntax of the markup works alongside HTML: just wrap anything you want to track in the rel=ku markup and, as long as the consuming LLM keeps that markup intact, that text will be tagged as originating from the URL cited in the markup.
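    The talk didn’t pin down an exact syntax, so as a sketch, assume a knowledge unit is an <a> tag carrying rel="ku" and an href pointing at the originating page (the tag choice, attribute layout, and the page content below are my assumptions, not a published spec). A consuming indexer might extract the tagged phrases and their origins like this:

```python
from html.parser import HTMLParser

class KnowledgeUnitParser(HTMLParser):
    """Collect (phrase, origin URL) pairs from elements tagged rel="ku".

    Assumes flat (non-nested) KU spans, which keeps the sketch simple.
    """
    def __init__(self):
        super().__init__()
        self.units = []            # extracted (phrase, origin_url) pairs
        self._current_href = None  # set while inside a rel="ku" element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("rel") == "ku":
            self._current_href = attrs.get("href")
            self._buffer = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._current_href is not None:
            self.units.append(("".join(self._buffer).strip(), self._current_href))
            self._current_href = None

# An invented example page marking one statistic as a knowledge unit.
page = '''<p>As one (made-up) study found,
<a rel="ku" href="https://example.com/coffee-study">83% of writers
drink coffee while editing</a>, a figure worth citing.</p>'''

parser = KnowledgeUnitParser()
parser.feed(page)
print(parser.units)
```

    An LLM pipeline that preserved these pairs through indexing could then attribute the phrase back to its origin whenever it surfaces in an answer.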

    This provenance can be used to track the number of times a particular knowledge unit is mentioned in an LLM’s response. This enables a fundamentally different ecosystem from that of pageviews in that there is no need to constantly re-post something you wrote years ago to keep it fresh, relevant, and trending in Google’s search results. Hard work to produce durable knowledge should pay dividends far into the future.

    More akin to the Wikipedia reputation model, a good, unique fact can continue to be cited over time; revenues should flow towards durable knowledge units and hopefully reward those who gather and present unique knowledge rather than the hot takes and re-writes that are rewarded in today’s pageview economy.

    Taken a step further, we would return to a web from before ad targeting and enragement metrics, to a world where we reward those who teach us something new.

    This new internet no longer drives you to “acquire” a “user” to package up and sell to an advertiser. Publishers no longer need to lock their stories behind a paywall to prevent non-monetized access. In this new ecosystem, the incentive is to share knowledge, getting paid directly for the broad distribution and citation of your work.

    This is just the germ of an idea that may well be totally naive. While I do like the bottom-up simplicity of the markup approach, it requires everyone to adopt it and trust each other to collectively make it work.

    What is to keep bad actors from hijacking knowledge units and claiming something as their own? Page index timestamps will need to be the arbiter of provenance, I suppose, but how do you guarantee delivery of your post over others?

    Also, why would the LLMs adopt a system that would fundamentally make their indexes more complex and expensive? My hope is that the LLMs eventually see that strip-mining the web is unsustainable. Just as in agriculture, an ecosystem that does not replenish its resources, both large and small, will not remain diverse, healthy, and long-lasting.

    If you’ve made it this far, I’m super-interested in your thoughts and encourage you to get in touch.

  • Honing Your AI Spidey Sense


    A checklist to help you spot AI-generated misinformation

    Going through some papers from this year’s Online News Association conference in Atlanta, I found this handout put together by the folks at Verify.

    I couldn’t find any reference to it online so I thought I’d post it here for posterity.

    Images

    • Zoom In! Look for distorted details
    • Textures: Too smooth or unrealistic?
    • Face/Body Features: Hands, Feet, Teeth, Ears
    • Analyze intricate images (e.g. flags, paintings)
    • Look for repeating patterns
    • Analyze shadows
    • Depth of field issues
    • Remnants (Extraneous pieces of data)

    Audio

    • Too pristine
    • Voice doesn’t make common sounds (um, uh, like)
    • No ambient background sound
    • Voice mispronounces common words
    • Voice is flat/unemotional. Doesn’t take breaths

    Video

    • Voice Doesn’t match mouth movement
    • Subject’s movements unrealistic
    • Background too static
    • Subject is not in proportion to other elements
    • Glitches in the movement

    Journalism

    • Go to the source: Who’s named in the content?
    • Credits: Who posted it, who is credited as the photographer?
    • Reverse image search to look for similar posts
    • Do other credible reports exist?
    • Does the context make sense?
    • Is this realistic?
    • Google it!
    • Who could benefit/be harmed by the claim?

  • Access as a Service


    Tim O’Reilly popularized the term “Web 2.0” to explain the network effects of the participatory web enabled by dynamic web pages tied to personalization. He is excellent at summarizing large technical trends in a way that not only makes them relatable but also provides a useful framework when I need to explain these concepts to others.

    So it was with great anticipation that I saw that O’Reilly had posted his thoughts on the intersection of copyright and AI.

    The Risk

    If the long-term health of AI requires the ongoing production of carefully written and edited content—as the currency of AI knowledge certainly does—only the most short-term of business advantage can be found by drying up the river AI companies drink from. Facts are not copyrightable, but AI model developers standing on the letter of the law will find cold comfort in that if news and other sources of curated content are driven out of business.

    How to Fix “AI’s Original Sin”

    The Opportunity

    While large licensing deals are being cut by publishers that have the leverage and lawyers to negotiate massive, one-time deals, these are ultimately short-lived and only serve to build up the large AI providers that can afford to subsidize premium materials for their users. These deals just make the rich even richer.

    The longer term, sustainable opportunity he proposes is in allowing the internet-of-many to share in the revenues enabled by the output from these large AI systems.

    But what is missing is a more generalized infrastructure for detecting content ownership and providing compensation in a general purpose way. This is one of the great business opportunities of the next few years, awaiting the kind of breakthrough that pay-per-click search advertising brought to the World Wide Web.

    How to Fix “AI’s Original Sin”

    The Challenge

    Build a shared provenance and attribution service that keeps track of all documents available to AI systems and the permissions and royalty payment requirements around those documents.

    O’Reilly alludes to the UNIX/Linux filesystem architecture, with file permissions set at the user, group, and global levels, as a potential model for controlling what publishers allow to AI vendors seeking out material for their training sets.

    If we expand this analogy out to internet scale, could we apply the architecture of hosts tables and the modern Domain Name Service to provide a dynamic infrastructure: a public “lookup” service so any particular AI could locate the origin of any attributable fact, quote, or yet-to-be-determined “knowledge unit,” along with the license fee, should an AI wish to leverage that data?

    In UNIX, the chmod command is used to change permissions. Could setting copyright permissions via a specialized version of “chmod” be the key to a new way to control access and compensate publishers at scale?
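    To make the analogy concrete, here is a toy sketch; every permission name below is invented for illustration, not an existing standard. It reuses the UNIX octal-mode convention, swapping read/write/execute for train/quote/summarize, and user/group/other for owner/licensed-partners/everyone:

```python
# Hypothetical "chmod for copyright": three permission bits
# (train, quote, summarize) applied to three audiences
# (owner, licensed partners, everyone), mirroring UNIX user/group/other.
TRAIN, QUOTE, SUMMARIZE = 4, 2, 1

def ai_chmod(mode: int) -> dict:
    """Decode an octal mode like 0o741 into per-audience permissions."""
    audiences = ("owner", "licensed", "everyone")
    decoded = {}
    for audience, shift in zip(audiences, (6, 3, 0)):
        bits = (mode >> shift) & 0o7
        decoded[audience] = {
            "train": bool(bits & TRAIN),
            "quote": bool(bits & QUOTE),
            "summarize": bool(bits & SUMMARIZE),
        }
    return decoded

# 0o741: the owner may do anything, licensed partners may train,
# and everyone else may only summarize.
perms = ai_chmod(0o741)
print(perms["everyone"])
```

    The appeal of the octal convention is that a publisher’s whole policy for a document compresses into one short, machine-readable number, just as it does for files on disk.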

    Food for thought.

  • The Three Laws of AI

    When my previous company started using technologies such as machine learning to automate tasks such as curation, Rich Jaroslovsky, an experienced newsman who pioneered using web technology to build the online version of The Wall Street Journal, circulated a memo with three simple guidelines that are applicable to anyone thinking of using AI to automate their newsroom.

    SmartNews was at the forefront of using technology to process, curate, and rank large volumes of news stories so many of the hiccups we’re seeing in the application of AI to publishing today were front of mind for the company years ago.

    Rich’s memo was a riff on Isaac Asimov’s Three Laws of Robotics reworked for today’s world where AI is being applied to any number of tasks in pursuit of scale and efficiency. This simple set of rules is useful as a checklist to help people think through the responsible application of autonomous technology.

    I’d encourage anyone who builds products that use AI to link to these rules from your product requirements template. I can say from experience that building features with these three simple tenets in mind will save your organization a lot of headaches going forward.

    Rich Jaroslovsky’s Three Laws of Automation

    1. It has to be highly automated. Our technology is what makes us scalable, and allows us to accomplish so much with so few people. I realize there is often a manually intensive phase when a new feature is being tested. But even in the testing phase, the question of how the task can be automated should be front-of-mind — and should be implemented when the feature is moved into full production, not as a “we’ll get to it” enhancement at some point in the distant future.
    2. It has to provide visibility. That is, we have to know what the system is actually doing — what content it is sending out — at any given time. It’s not enough to learn after the fact, and then have to grapple with unintended consequences. For us non-engineers, at least, it’s much less important that we have visibility into the why or the how; visibility into the what is critical.
    3. It has to allow for intervention when we spot problems — the ability to stop something bad from happening when we see it is happening, or is going to happen. This is much different from the concept of “human control,” where actions only take place if they are approved; such a model flies in the face of Rule #1. But it isn’t good enough to say we’ll just depend on the technology, wash our hands of the consequences, and figure we’ll fix it later if it is doing bad things.

    What are your thoughts? Are there examples you’d care to share that are instructive on what can go wrong if you don’t heed these rules? I’m building my own list of how un-supervised AI has caused problems in publishing but if you’ve got some other stories, share them in the comments so we can all learn together.

  • OpenAI has an App Store


    OpenAI’s DevDay keynote had the look and feel of all Silicon Valley product announcements – a well-scripted parade of announcements, a couple of live demos, and even a “one more thing” that is revealed with low-key fanfare but, by its placement at the end of the talk, signals to the world that this is the game-changer.

    That thing was the app store for custom AI chatbots. To make it easier to grok and talk about, OpenAI has co-opted the acronym for the rather technical mouthful that is “Generative Pre-trained Transformers” and made it into a product name. Custom versions of ChatGPT are now GPTs. This makes it easier for the broader public to understand and a whole lot easier for marketers to fold into their campaigns. In the same way “There’s an app for that” became a catchphrase for Apple’s app ecosystem, I can see “Just GPT it!” becoming shorthand for leveraging AI to do some grunt work for you.

    That’s my 30,000 foot view before diving in and playing around more. Stratechery has a much more informed deep dive on the significance of what was announced and I recommend reading Ben Thompson’s analysis which includes important observations around the significance of OpenAI using Microsoft’s infrastructure and what that partnership means for the market going forward.

    As a teaser, I found this passage thought-provoking,

    This has two implications. First, while this may have been OpenAI’s first developer conference, I remain unconvinced that OpenAI is going to ever be a true developer-focused company. I think that was Altman’s plan, but reality in the form of ChatGPT intervened: ChatGPT is the most important consumer-facing product since the iPhone, making OpenAI The Accidental Consumer Tech Company. That, by extension, means that integration will continue to matter more than modularization, which is great for Microsoft’s compute stack and maybe less exciting for developers.

    The OpenAI Keynote
  • Video: Simon Willison on AI


    Simon Willison has been hacking on technology for years and writing about it on his excellent blog, where he posts how to recreate his innovations and follow along on his adventures. He was a speaker at this year’s WordPress WordCamp US 2023 conference and gave a talk that I would highly recommend to anyone who wants to spend an hour catching up on all the latest developments in the world of genAI and LLMs.

    Posting this here today because I expect that I’ll be sending this link to people for weeks to come.

    Large Language Models are the technology behind ChatGPT, Google’s Bard and more. They are weird and somewhat intimidating pieces of technology: we’re still trying to figure out how they work and what they can do, in a field that changes radically on an almost weekly basis.

    In this talk I’ll break down how they work, what they’re useful for, what you can build with them and how to dodge their many pitfalls.

    Making Large Language Models work for you
    Simon Willison on AI at WordCamp 2023

    If you’d rather just read Simon’s talk, he’s created an annotated version of his talk here.

  • Sitemaps for AI


    Last week, I was double-booked in conferences. Wednesday & Thursday I was in Philadelphia for the beginning of the Online News Association conference, a gathering of journalists who work with words online. Friday & Saturday, I was in Washington DC for WordCamp, a gathering of people who work with WordPress, the CMS software that powers many of the websites journalists use to publish their news online.

    Hopping between these two worlds, the editorial and the technical, gave me a unique perspective on the change sweeping online media. Everyone agrees that AI chat bots, specifically generative AI from Large Language Models (LLMs), will have an enormous impact on what we read online. But, depending on who you’re talking to, it’s going to result in either something horrible or something wonderful.

    It’s still very early but a long Amtrak train ride home gave me some time to project out where we’re headed and ponder what we might need to make it work in a way that both publishers and AI Chat Bot companies feel comfortable.

    Those who fear AI view it as something that will strip-mine websites of their facts and process them into the bland, robotic responses that power AI chat bots. This characterization echoes the publishing industry’s initial reactions to Google search. In 2006, French and Belgian newspapers demanded to be removed from Google News, only to come back begging for inclusion in 2011 after they experienced a precipitous drop in traffic.

    Are we seeing the same thing play out with AI? Isn’t an AI chat bot just the conversational form of the Google SERP? Microsoft Bing Chat and Google Bard are crawling the web for tidbits to power their conversational engines. Concern about Bing and Bard abstracting facts without sending users back to a publisher’s site exposes a flaw in the publishing business model, where a website is compensated by readers looking for answers on a page adorned with advertisements designed to distract and harvest attention.

    Dare Obasanjo on threads.net

    It’s time to upgrade this business model. Instead of asking people to browse a bunch of search links, AI chat bots bring information to the reader, aggregated and summarized in a conversational tone. To a certain extent, this is an evolution of what has been happening for years.

    Google Knowledge Graph

    When Google Knowledge Graph launched in 2012, many publishers felt the Knowledge Panel (as it came to be known) did not provide enough attribution. Sound familiar?

    If the reader no longer goes to the publisher’s site, they will end up spending time with the product providing the answers, not the source. Back then it was Google, today, it’s the AI Chat Bot.

    The AI chat bot is the latest step in a journey that started a long time ago. Bringing answers into a conversational UI is just an improvement on the user experience for those in search of quick answers to their questions.

    Bing Chat AI

    This new conversational UI is under rapid development; I’m not even sure a conversational interface is where we’ll end up. Microsoft is leading the way with Bing Chat AI results sprinkled with attributions that give credit and link back to the source material. From what I can tell, Microsoft is also paying for this attribution in an early experiment in what I would call the “licensing of facts.” Google’s Bard is following Microsoft’s lead and is also starting to add attribution to its SGE results, something that was missing at launch. I’d be curious to know if they are paying publishers for these links.

    Microsoft is embedding Bing AI not only into their Edge browser but has also announced extensions for Chrome and Firefox. Bing Chat is also available as an Enterprise service as well as on their mobile app and Skype.

    The pressure is on and Google is responding in kind with their version of generative AI chat, SGE, which is running in Google Labs.

    If generative AI is the next generation of search, I can think of a number of things that are needed to build a relationship between the publisher and the AI vendor that is transparent, trustworthy, and thus sustainable. Allow me to riff a bit.

    Honor Robots.txt
    OpenAI already announced that they would honor robots.txt and not crawl sites that declare themselves off limits. This can be extended to specify which sections of your site you want to make available to the AI chat bots. The New York Times, CNN, and others are already adopting this method to control what they make available.

    This is a step in the right direction as it builds trust, but more granular control over what is made available for the crawl is necessary. Within a restaurant review, maybe the address and phone number will be valued one way while the reviewer’s opinion is valued another.
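    For example, OpenAI’s crawler announces itself with the GPTBot user agent, and Google publishes a Google-Extended token for controlling AI training use, so a site can already express coarse-grained preferences in robots.txt (the paths below are placeholders):

```
# Allow GPTBot everywhere except the premium section
User-agent: GPTBot
Disallow: /premium/

# Opt out of Google AI training entirely
User-agent: Google-Extended
Disallow: /
```

    Robots.txt only expresses allow/deny per path, though; it cannot value one kind of fact on a page differently from another.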

    Sitemaps for AI
    A sitemap is a file that tells a web crawler where to look for new pages. A sitemap for AI could be an intentional declaration by a site owner of the specific facts and information you want to make available and the link you want served up for attribution. Addresses could be fielded and formatted one way, quotes another, so that they travel along with the name of the person quoted.

    Ads.txt was developed to make programmatic advertising more transparent. What I’m thinking of is something in between a sitemaps.xml and ads.txt, a lightweight, machine-readable way for publishers to declare what they make available to the Chat Bot crawlers.
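    As a purely hypothetical sketch of such a declaration, where the file name, directives, and syntax are all invented for illustration, it could be as simple as:

```
# ai-sitemap.txt -- what this site offers AI crawlers, and on what terms
contact: licensing@example.com
attribution-url: https://example.com/about

# path pattern        content type     terms
allow /reviews/*      facts:address    attribution=optional
allow /reviews/*      quotes           attribution=required
deny  /premium/*
```

    Like ads.txt, it would live at a well-known location and be cheap to parse; like sitemap.xml, it would point crawlers at exactly what the publisher wants indexed.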

    Real time Fact Exchange
    The technology that enables the real-time auction of ad impressions in milliseconds is some of the most impressive technology developed for the internet in the past couple of decades. The incredible revenue machines of the ad industry have fueled the advancements in this technology.

    It’s time for a similar exchange for facts, which will be the new commodity. When looking for answers via a chat bot that has access to everything, maybe the deciding factor is the quality of the information or the party making it available. If every fact is distinct in the aforementioned sitemap for AI, why not also attach a value to that fact to inform the AI chat bot which information it can afford to share? If it’s a high-value reader, then more expensive information from higher-quality sites might be presented. We are already headed down this path as both search results and social media links that go to paywalled sites attempt to capture subscriber budget.

    Is it finally time to create a marketplace of micro-transactions brokered by the chat bot UI? Instead of subscribing to a bunch of subscription sites, maybe the AI is where you “pay” for tidbits of information, with either advertising or payment tiers, and that revenue is shared by the chat bot companies with the companies providing the information.

    In order for the chat bot AI ecosystem to grow, publishers need to be fairly compensated and the chat bot vendors need a marketplace for the content required to provide a quality experience. Maybe the Real Time Fact Exchange is far-fetched, but I would never have thought the simple banner ad would evolve into the complex ecosystem we have today.

  • Notes from the (media) party


    A couple of weeks ago, I had the good fortune to attend the Media Party conference in Chicago. As with previous, early-stage “what is this technology?” conferences, I found the three days in Chicago a great way to connect with others who are also stumbling around and learning about Generative AI (genAI), Large Language Models (LLMs) and other AI-based technologies and techniques that are poised to forever change the way we work and communicate.

    The biggest takeaway from the conference for me is that we are all still learning the practical applications of genAI and that no one is an expert. Most of the subject-matter experts do not have experience with real-world applications, and those of us working at the intersection of media and technology are only now beginning to understand the complexities of building production-ready genAI systems (how do you QA unexpected results?).

    There were no dumb questions – everyone had something to add to the conversation so, in that sense, the conversations were refreshingly equitable. I mentioned to more than a few people that the collaborative atmosphere at the conference (there were about 100-150 of us there) reminded me of the BloggerCon conferences from the early-2000s when blogging was getting started.

    While there were the expected skeptics tolling the bell of caution, warning that genAI was going to steamroll journalists out of existence,

    Martha Williams, World News Media Network

    there was also a faction of proponents that ranged from the embrace-or-become-extinct clan to the this-tech-will-give-me-superpowers crowd. The message that had the most resonance with me was from Jennifer Brandel who coined the term AE (Actual Experience) as the thing that journalists, particularly local news journalists, bring to the table that is often forgotten.

    Jennifer Brandel’s AE bingo card

    Indeed, what people are craving, particularly post-Covid, is human connection to a community. As information sources, local news organizations are well-positioned to be the focal point of their community in a way that an AI can never replicate. This past weekend, I took a long bike ride through the side streets of Brooklyn and Queens and saw pick-up basketball games complete with DJs and announcers (“uh oh, looks like the eighth graders are here to play!”) that showed off the best of community in action.

    Maybe we are at the tail end of an old model of journalism that is heading for “hospice.” The new genAI systems have trained on and perfected the efficient delivery of commoditized “news,” so the new type of journalism that is only now organizing itself will be one that is resistant to automation.

    What follows are some unstructured notes and a collection of shared links that I found useful.

    Word Embeddings & LLMs – an update to concepts I first encountered in 2015 when I learned about Word2Vec

    The Practical Guides for Large Language Models – besides a continually updated table of LLMs, their license restrictions, and the corpus of data used in each training set, this guide also references this cool evolutionary tree of LLMs.

    Beginner’s prompt handbook: ChatGPT for local news publishers – an excellent place to get started. Also Using GPT on Library Collections

    Mike Reilley from the Journalist’s Toolbox put together this toolkit on AI in the Newsroom

    Media Party Chicago schedule – list of all speakers and sessions.

    Thank you to everyone who put this event together. It’s particularly valuable to learn about a new technology collaboratively. There is another Media Party taking place in Buenos Aires in October; if you are in the area and interested in the intersection of AI and journalism, it’s worth checking out.

  • White-label AI Bots


    I’ve been playing around with a hosted chat AI offered by Chat Thing that was recently announced on Product Hunt. Seth Godin has indexed 5M words from his blog [Seth’s Blog bot] and Dave Winer uploaded his 30+ years of daily posts from scripting.com [Scripting News bot]. Both bots are instructive and give you a real-world example of how these bots can let your readers pull up and share “observational snippets” gleaned from the archives. I decided to play.

    Here are some screenshots. You can see from the responses that it really is a new way to search. Here I ask the bot how Seth Godin, a marketing genius, would run a presidential campaign.

    Transcript from Seth’s Blog bot

    Here is the post the bot is referring to (it would be nice if it provided a link as a footnote). Incidentally, searching Seth’s blog for “presidential campaigns” yields a different result that may be tangentially relevant but is not as specific a response as what came back from the bot.

    On the Scripting News bot, I compared what OpenAI’s ChatGPT knew against his white-labeled bot to see if I could find out Dave’s favorite basketball team.

    OpenAI really had no clue. I know from the WaPo story that scripting.com was used as training material, but apparently it hadn’t retained any particular tidbit of knowledge about his basketball preferences.

    Transcript from ChatGPT at OpenAI

    Over on the Scripting News bot, I had a much richer exchange. Chat Thing uses OpenAI as the backend, but they’ve figured out how to “focus” it on the data added to the index, in this case, all of scripting.com.

    Transcript of conversation with Scripting News bot

    Again, it would be great if it linked directly to the source articles. I’ve put that in as a feature request on Chat Things’ Discord Server.

    It’s still a bit buggy (sometimes it echoes back an earlier response, like a broken record) but the team is moving fast and adding new features almost daily.

    Two weeks ago you had to export your archives and convert them to Markdown before you could upload them to get indexed. Today they announced that you can add your site to be crawled and add your RSS feed to keep the index fresh.

    Chat Thing data connection sources

    As of today, the RSS feed connection just pulls in links from your RSS feed. Hopefully they’ll get more precise in the future and let you upload just the relevant sections of your feed, or use an API to add specific tables from a database. It would be nice to have more control over what gets indexed into the training set.

    As Seth says, “You’ll have no trouble tricking it” and we all know how generative AIs hallucinate; there are a lot of kinks to be worked out but these early experiments offer up an entirely new way to unlock the value of archives that we haven’t seen since the early days of search.