How Search Engines, Aggregators & Blogs Use News Content

Ahead of next week’s FTC workshop on journalism and the internet, I gave an informal briefing to several people from the FTC about the differences between how search engines, aggregators and news blogs all gather content automatically and through human editing. They seemed to find it useful, so I thought I’d share it more widely and expand it a bit.

Search Engines

Most major search engines such as Google, Yahoo and Microsoft’s Bing are “crawler-based,” which means they use crawling software to automatically visit pages from across the web.

When a search engine’s crawler comes to a page, it makes a copy of that page, storing that in what’s called an “index.” You can think of an index as being like a giant book.

When someone searches at a search engine, software called an “algorithm” effectively flips through the book, finds pages that match the person’s query and lists the pages that the software deems most relevant.
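To make the book analogy concrete, here is a minimal sketch in Python of an index and a search over it. Everything in it is an assumption for illustration: the page URLs, the page text and the toy relevance score are invented, and real search engines rank pages using far more signals than this.

```python
from collections import defaultdict

# Hypothetical pages a crawler might have copied (URL -> page text).
crawled_pages = {
    "example.com/a": "new breast cancer screening guidelines released",
    "example.com/b": "summer books worth reading",
    "example.com/c": "cancer research funding guidelines",
}

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, pages, query):
    """Return URLs containing every query word, most relevant first."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))

    # Toy relevance: the share of a page's words that are query words.
    def score(url):
        page_words = pages[url].lower().split()
        return sum(page_words.count(w) for w in words) / len(page_words)

    return sorted(matches, key=lambda u: (-score(u), u))

index = build_index(crawled_pages)
results = search(index, crawled_pages, "cancer guidelines")
```

The index is built once, when pages are crawled. Each search then only looks words up in it rather than rereading every page, which is the "flipping through the book" step.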

In the case of news articles, these are included in a major search engine by virtue of being web pages. Perform a search, and you might get general results that mix news articles, Wikipedia pages, shopping sites, blog posts and more all together. For example, look at the results for a search on breast cancer guidelines from Google:

breast cancer guidelines - Google Search

At the top is what’s called a news “OneBox,” a special display of content from Google News, which I’ll explain more below. But below that are “regular” results, which have information from a variety of web sites. Included within these results are two news articles that I’ve indicated, from the Seattle Times and from USA Today.

It’s important to understand that people who tap into a regular search engine for news may be in a variety of search modes. These are my own definitions, and there may be more than these:

1) Breaking News Mode: They’ve heard about something breaking, such as the rumor earlier this year that Jeff Goldblum had died (yes, it was just a rumor), so they conduct a search looking for news and information about the topic.

2) Researching News Mode: They’re interested in a news topic generally, but it’s not a breaking event. For example, someone might be looking into issues they heard about with inauguration ticket problems from when President Barack Obama was sworn in earlier this year. A search for that on Google brings up these results currently:

inauguration ticket problems - Google Search

This is no longer a breaking story. However, you get plenty of news reports about it showing up nearly a year later, matches that lead to the Wall Street Journal’s news blog, the Washington Post, The Huffington Post, ABC News and CNN, to name a few.

This is an example of where the “search tail” or the “long tail” kicks in. While the popularity of a search topic may drop off dramatically hours or days after breaking news first appears, plenty of people still search on those topics, in smaller numbers, over time. And because there are a lot of different news topics, a newspaper or other news publication can receive a substantial number of visitors from search engines for “stale” or “old news” topics.

3) Just Searching Mode: News publishers may also receive traffic via regular search engines from people who are not deliberately seeking news content. For instance, consider this search for best summer books:

best summer books - Google Search

There’s no reason to assume a searcher is looking for content specifically from news publications in response to that search. There are easily thousands of non-news web sites that cover books. And yet, news publications have high visibility. NPR appears twice at the top, with USA Today and The Guardian showing further below on the page.

Finally, I said that the search engines make a copy of pages they visit. That can suggest to some, especially in the heated environment at the moment with some news publishers attacking Google as a supposed content thief, that search engines are publishing their news content without permission.

Yes, all the major search engines provide a way for you to read a news article (or any web page) right on their own search engine, without leaving it. These are called “cached copies,” and you access them using the “cached” links that Google, Yahoo and Bing all offer. Here’s how the feature appears at Bing, below the page’s description:

best summer books - Bing

The feature appears similarly at Google and Yahoo.

Very few people, to my knowledge, access pages this way. Moreover, publishers are totally in control to turn off this feature. They don’t have to be cached. In addition, publishers can opt out of being listed within search results entirely — no cached copy, no listing, nothing.
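The opt-out controls are not named above, so treat this as a sketch of the mechanisms commonly used for them: a robots.txt file, which tells crawlers what they may visit, and a robots meta tag, whose "noindex" and "noarchive" values ask for no listing and no cached copy respectively. The crawler name ExampleBot, the URLs and the helper listing_policy are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep one crawler out of /news/, allow everyone else.
robots_txt = """\
User-agent: ExampleBot
Disallow: /news/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

story_url = "https://example.com/news/story.html"
examplebot_allowed = parser.can_fetch("ExampleBot", story_url)
otherbot_allowed = parser.can_fetch("OtherBot", story_url)

def listing_policy(meta_robots_content):
    """Interpret a robots meta tag value such as "noarchive" or "noindex, nofollow"."""
    directives = {d.strip().lower() for d in meta_robots_content.split(",")}
    return {
        # "noindex" asks search engines not to list the page at all.
        "list_in_results": "noindex" not in directives,
        # "noarchive" asks them to list the page but not show a cached copy.
        "show_cached_copy": not directives & {"noindex", "noarchive"},
    }

cache_off = listing_policy("noarchive")
fully_out = listing_policy("noindex, nofollow")
```

Here the publisher blocks one crawler from its news section while staying open to all others. A page marked "noarchive" would still be listed, just without a cached link.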

It sure would be easier if none of the search engines cached pages, of course, at least from the perspective of having to explain how the services don’t really use content but rather point at it on other sites.

My post, Search Engines, Permissions & Moving Forward In Copyright Battles, goes into more depth about cached pages, how caching has been ruled legal so far in the US, the type of copying search engines do to make content searchable and how that gets confused with reprinting. I recommend reading it if you want to learn more about the matter.

News Search Engines

Many news search engines operate similarly to regular search engines. They have crawlers that find pages, which go into an index, which is made searchable. For example, here are matches for “emission talks” from Google News:

emission talks - Google News

The key difference between news search engines and regular ones is that news search engines cover fewer sources. Rather than looking across the entire web for content, they’ll only visit a select list of news sites, a few thousand sources to maybe up to 30,000. They’ll also visit these sites constantly through the day, staying alert for when news is posted.
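The "select list" idea can be sketched as a filter applied before crawling. This is only an illustration under assumptions: the source list and URLs are made up, and how any real news search engine chooses or stores its sources is not something I know in detail.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; a real news search engine would track thousands of sources.
NEWS_SOURCES = {"example.com", "example.org"}

def is_news_source(url, sources=NEWS_SOURCES):
    """True if the URL's host is a listed source or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == s or host.endswith("." + s) for s in sources)

# URLs a crawler has discovered; only the ones from listed sources get visited.
discovered = [
    "https://www.example.com/politics/story",
    "https://blog.example.net/post",
    "https://example.org/world/1",
    "https://notexample.com/x",
]
to_crawl = [u for u in discovered if is_news_source(u)]
```

A regular search engine would skip this filter and follow every link it is allowed to. A news search engine would instead revisit the short list over and over through the day.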

Unlike with regular search, none of the news search engines offered by Google (Google News), Yahoo (Yahoo News) or Bing (Bing News) presents cached copies of news articles. However, Google and Yahoo allow you to read news right on their sites in another way, through licensing agreements. For example, the AP has deals with both companies to host AP stories within their respective news services.

As with regular search engines, publishers can opt out of inclusion at any time.

News Aggregators

News aggregators, commonly shortened to aggregators, is a terrible name for incredibly useful services that bring together headlines from multiple news sources all into one place. The news stories are all “aggregated” together, hence the aggregator name.

The major news search engines I named above also have aggregator sides. Remember how you could search on Google News about emission talks? Well, you can also just browse news headlines and discover that topic:

Google News

That screenshot is the Google News home page. Google uses an algorithm to look at all the news stories out there and assembles them together to effectively make a custom newspaper for its visitors. Yahoo News does the same, as does Bing to some degree.

Ah, so this is how search engines are ripping off newspapers! They’re using content from all these newspapers to steal visitors away from the newspapers’ own web sites! Yes and no.

The search engines are only showing headlines and sometimes short summaries, along with thumbnail images, to create their blended news pages. You can’t read the actual stories, except in the few cases where there’s an explicit licensing agreement.

Instead, people click from the news aggregation site to the news sources themselves. The search engines typically contend such linking is fair use. For the most part, news publications, with some notable exceptions, have been happy being listed, in exchange for the traffic they receive.
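As a rough sketch of what an aggregator displays, the following turns a full story into a headline, a short summary and a link back to the source. The Story class, the teaser function and the word limit are all assumptions of mine, not how Google News or any other service actually builds its listings.

```python
from dataclasses import dataclass

@dataclass
class Story:
    headline: str
    body: str
    url: str

def teaser(story, max_words=25):
    """Build the listing an aggregator shows: headline, short snippet, link."""
    words = story.body.split()
    snippet = " ".join(words[:max_words])
    if len(words) > max_words:
        snippet += "..."  # the rest of the story lives only at the source
    return {"headline": story.headline, "snippet": snippet, "link": story.url}

# Hypothetical story from a hypothetical publisher.
story = Story(
    headline="Emission talks resume",
    body="Negotiators from several countries met again on Monday to discuss emission targets.",
    url="https://example.com/emission-talks",
)
listing = teaser(story, max_words=5)
```

The listing never contains the full body, so a reader who wants the story has to follow the link to the publisher's own site.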

Beyond the major search engines, other aggregators exist. For example, I constantly use the Techmeme aggregator to keep up with news in the tech space:


Whereas Google picks stories for its aggregator purely on an automated basis, Techmeme uses a combination of automation and human editors. So does Yahoo News, by the way.

The Drudge Report is another aggregator:


I’m not a regular reader, but to my understanding, Drudge is primarily human-powered, where editors are manually scanning news sites and deciding what to feature.

AllThingsD, which is owned by News Corp, a company that has spoken out against aggregators, ironically offers another example of aggregation: the Voices section, where AllThingsD editors pick interesting articles from across the web. Here’s an example of where my publication appeared in the Voices section recently:

AllThingsD & Voices

Personally, I find it flattering to get mentioned in Voices, not to mention I’ll take the visitors who may discover my site when it is featured there. I also appreciate that AllThingsD understands the odd situation it’s in, when its corporate owner is speaking out against something its editors find useful. The site maintains a page addressing issues on this, saying in part:

We are fully aware of the controversies around how linking and aggregating is done on the Web and we, in no way, are attempting to “scrape” original content created by others. Instead, regarding third-party posts, we are trying to point readers of this site to other posts from around the Web that we admire and are trying to do so in the quickest manner possible.

The Internet is full of terrific content that is not ours and we want to help our readers find it by making editorial suggestions–Look, Mom, no algorithm!–of posts we think are worth their time.

Aggregators run by the major search engines respect the ability of publishers to opt out automatically from being included. Those that are human-powered may require a publication to ask that it not be linked to. They might also not honor such requests, arguing that they can link to any public document on the web that they’d like.

Personal Aggregators / Newsreaders

If aggregators sound cool to you as a reader (or potentially evil to some publishers), the world gets even more complex when we talk about personal aggregators or newsreaders. These are services that allow you to take in content feeds from the sources you select, in order to form your own super personalized newspaper.

For example, here are headlines from a variety of news sources that automatically flow into my personal aggregator, Google Reader:

Google Reader

All the publications you see shown above — the New York Times, the Wall Street Journal and the Los Angeles Times — explicitly put their headlines out through a feed (also called RSS) in order to have individuals subscribe to their latest news.
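For those curious what a feed looks like underneath, here is a minimal sketch that reads headlines out of an RSS 2.0 document using Python's standard library. The feed content is invented for illustration, and a real newsreader would fetch the feed from the publisher's site and handle many more cases than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, as a publisher might offer it.
rss = """<rss version="2.0">
  <channel>
    <title>Example Times</title>
    <link>https://example.com/</link>
    <item>
      <title>Emission talks resume</title>
      <link>https://example.com/emission-talks</link>
      <description>Negotiators met again on Monday.</description>
    </item>
    <item>
      <title>Best summer books</title>
      <link>https://example.com/summer-books</link>
      <description>Our picks for the beach.</description>
    </item>
  </channel>
</rss>"""

def read_feed(xml_text):
    """Return one entry per feed item: source name, headline and link."""
    channel = ET.fromstring(xml_text).find("channel")
    source = channel.findtext("title")
    return [
        {
            "source": source,
            "headline": item.findtext("title"),
            "link": item.findtext("link"),
        }
        for item in channel.findall("item")
    ]

headlines = read_feed(rss)
```

A personal aggregator does this for every feed you subscribe to and merges the results into one list, which is the personalized newspaper described above.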

Even the Associated Press, which seems to view its headlines and story summaries as content that should be licensed by companies such as Google, freely invites those with personal aggregators to take its feeds:

AP & RSS Feeds

As long as it’s for personal, non-commercial use, the AP is fine with its feeds being used. Commercial entities are supposed to seek permission, as the AP says:

AP provides these RSS feeds to individuals for personal, noncommercial use under the following terms and conditions. All others, including AP members or Press Association subscribers must obtain express written permission prior to use of these RSS feeds. AP provides these RSS feeds at no charge to you for your personal, noncommercial use.

In my experience, however, the AP is fairly unusual in being this exclusionary with feed content.

I think a key point to personal aggregators is that there are some publishers who wish aggregators never existed. That somehow, someway, they wish they could push that genie back into the bottle. That if there were never aggregators, more people would somehow come directly to them each day.

I have a future “Ode To An Aggregator” post where I’ll give my own view on why I think aggregators likely give publications gains, not losses, in terms of visits. The short story is that we’re well past a world where people start their day with a single publication. Or if they do, aggregators serve an important opportunity for other publications to also be seen alongside someone’s primary news choice.

Even if public aggregation sites were suddenly outlawed, it’s hard to imagine that personal aggregation would disappear. The future seems to be for personalized mix-and-match news reading to continue.

Blogs, News Blogs & News Sites

Bloggers are probably the biggest challenge to news sites that believe they’re being “ripped off” somehow in an internet-related way. Bloggers is a big word, however. I’ll try to define it a bit more, along with some use situations.

At the most basic level, a “blog” is simply a web site that is characterized by using publishing software with certain characteristics: each new page or “post” typically pushes previous posts further down on the home page of the site. Archives are built, allowing you to find pages by date or category. Comments are often allowed. Typically, “trackbacks” may be shown, reflecting other blogs that link to a particular article.
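Those characteristics map onto a very simple data structure, sketched here in Python. The posts, field names and helper functions are hypothetical and are not taken from any particular blogging software.

```python
from collections import defaultdict
from datetime import date

# Hypothetical posts, stored in the order they were written.
posts = [
    {"title": "Search engines explained", "published": date(2009, 11, 20), "category": "search"},
    {"title": "Aggregators explained", "published": date(2009, 11, 25), "category": "aggregators"},
    {"title": "More on search", "published": date(2009, 12, 1), "category": "search"},
]

def home_page(posts):
    """Newest first: each new post pushes earlier ones further down."""
    return sorted(posts, key=lambda p: p["published"], reverse=True)

def archive_by(posts, key_func):
    """Group post titles under whatever key_func returns (a month, a category)."""
    archive = defaultdict(list)
    for post in home_page(posts):
        archive[key_func(post)].append(post["title"])
    return dict(archive)

front = [p["title"] for p in home_page(posts)]
by_category = archive_by(posts, lambda p: p["category"])
by_month = archive_by(posts, lambda p: p["published"].strftime("%Y-%m"))
```

Comments and trackbacks would simply be further lists attached to each post.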

Some bloggers cover personal topics and have no ads. Some bloggers cover personal topics and yet are commercial in nature. Some blogs cover news, but only link to what others have written. Some blogs cover news, writing both original content and linking to others. And this is only part of the spectrum.

When I hear publishers speak out against blogs, what I imagine they are angry about is something like Gawker or Andrew Sullivan: highly visible sites that often highlight stories from other publications, sometimes quoting from them and adding their own commentary. Some publishers view these sites as “cherry-picking” all the good stuff and costing them visitors who won’t bother reading the original source.

For a good case study of this debate, see The Death of Journalism (Gawker Edition) from Ian Shapira, a Washington Post reporter who was initially pleased to have his article featured by Gawker but then reconsidered after an editor suggested that his story was “stolen.” Also see The Time Gawker Put the Washington Post Out of Business, which, among other things, highlights how, ironically, the Washington Post is pushing bloggers to, well, blog about its stories.

Of course, blogs like Gawker and Andrew Sullivan may also have original content. So might the Huffington Post, which can be notorious in some quarters for seemingly appropriating a story elsewhere and making it seem like a HuffPo article.

I deal with getting the balance right every day at my own publication, Search Engine Land. There, we review over 200 sources each day to spot interesting news. Each day, we feature some items that have been spotted by others. However, we also regularly have our own original content that in turn might be cited by others.

News publishers with a conservative view on fair use would seem to wish news blogs and non-mainstream news sites away. The reality is that they are unlikely to go. And proposals for new “hot news” laws are complicated when mainstream publications themselves borrow from each other or get tipped to stories from blogs.

Moving Forward

Right now, we seem to be moving further toward an impasse. Two major news publishers, the AP and News Corporation, continue to make noises that they feel even merely linking to their content is potentially a copyright violation. Both have suggested they’ll withdraw their content from Google or other search engines that don’t comply with their demands. There’s also the suggestion that News Corp may try to enlist other news publishers to pull out of Google in particular.

As my Thoughts On A “Killer” Bing-News Corp Deal & The Myth Of An “OPEC For News” and Josh Cohen Of Google News On Paywalls, Partnerships & Working With Publishers articles touch on more, news blogs in particular pose a challenge to any would-be news boycott. News is not solely discovered by mainstream news publications, nor legally is it clear they could completely shut down summaries of news out there. In addition, some mainstream news publications are perfectly happy to continue partnering with blogs or Google. Reuters, for example, is a believer in the “link economy.”

I’d like to see less antagonism and more discussion among interested parties to find a way that everyone can get along. Personally, I’ve felt both the AP and News Corporation have been overly hostile. Just last month, we had the managing editor of the Wall Street Journal accusing people of being net neanderthals. In turn, that can create hostility on the other “side” of things.

Part of the solution might be some type of common subscription that blogs could take out entitling them to make use of content from mainstream publications: the ability to quote, summarize and so on without fear of legal threats. It may be that people would simply be buying what’s already covered under fair use. But I think plenty of bloggers also appreciate the symbiotic relationship they have with mainstream media and would like to support them in some way, especially if they can more easily use their content. Of course, such fees would have to be reasonable. Blogs aren’t gold mines. But perhaps something could be done.

From the publishers, I think the aggregators, bloggers and others need something back. First and foremost, probably a little respect, an acknowledgment that they’re not all just rip-off artists that provide no value. In an economy where the mainstream media keeps laying off staff, many blogs are actually hiring journalists to do original journalism. They’re succeeding not by ripping off the mainstream media but by being fast, nimble and adapting to an internet world in the way mainstream publications should have.

Respect also means more credit, so that if a story tip comes off a blog (or forum or elsewhere), this is made clear. Crystal clear, too, such as in the form of an outbound link. Let’s see an end to the days of the mainstream media not linking out.

Anyway, those are some concluding remarks that I’ll continue to ponder.