I’ve been promising a long look at copyright, search engines in general and Google in particular in the wake of recent conflicts between Google and some newspapers and other content owners. A quote in a New York Times article today — implying that Google doesn’t get copyright permission — finally pushed me forward with my own look. Below, a revisit to the important difference between indexing and reprinting, how robots.txt works right now as a permissions system, why Google should stop scanning in-copyright books and the leadership role it could play by dropping cached pages.
A Struggle Over Dominance and Definition is the New York Times article that looks at Google and whether it is a media company that conflicts with other media owners, especially in terms of using content from others without permission. It’s a good article, covering common themes that have been going on for literally years.
Is Google a media company? The word from Google in the article remains a firm no, pretty similar to how Google cofounder Sergey Brin talked about still being a technology company when I put the same question to him back in 2003.
Regardless of what Google thinks, I consider them a media company, whether they own the content or not. My Schmidt: Google Still A Tech Company Despite The Billboards article from early this year looks in more depth at my reasons why.
Is Google a copyright violator? The answer is largely unknown, given that the laws have yet to catch up with actions. Google will say no; some say yes; it can also depend on the case, and it ultimately remains for a lot of courts to decide.
The Setup: Search Engines Asking For Permission In Action
Let’s start off with the case of indexing for inclusion in the core search engine. I’ll use a quote in today’s New York Times article from Gavin O’Reilly of the World Association Of Newspapers:
Gavin K. O’Reilly, the president of the World Association of Newspapers, argues that what is missing is that any search engine ought to be asking “explicit permission” to use copyrighted material, and that this should be part of the vaunted automation that has made search the phenomenon it is.
I met Gavin personally in September, when I was on a panel with him and several people at the Frankfurt Book Fair looking at the issue of search engines and copyright.
Most of the session was a setup for Gavin to roll out a proposed Automated Content Access Protocol (ACAP) that his group backs as a solution to the problems search engines supposedly have with copyright.
It was a receptive audience, given that from the discussion and questions, many were clearly upset with the part of Google’s library scanning project that indexes in-copyright books without publisher permission. The presentation (you’ll find it here) generally made it seem like publishers had relatively little control over what search engines can index.
Google was on the panel, but they had no equal time, much less formal presentation time, to explain the existing automated ways to stay out of search engines. The Google panelist did make some remarks about things like robots.txt. I went into more depth on it myself.
I agreed that something like ACAP or an expanded robots.txt system would be a real plus, but I disagreed with the implication that search engines weren’t somehow asking permission already.
In fact, the major search engines all ask for permission to index a web site. They ask for this on a routine basis. They ask for a robots.txt file. It is a fairly simple way for any publisher to say no to having their content used.
Here’s an example of that asking from earlier this week, out of the log files at my consulting web site, Calafia.com:
220.127.116.11 - - [10/Nov/2006:23:36:38 -0800] "GET /robots.txt HTTP/1.1" 200 24 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
There are two key elements in that line. The first is the request for my robots.txt file (GET /robots.txt). This is a place where, as I’ll explain more below, people can grant or deny permission for pages to be indexed by search engines. The second is Googlebot identifying itself as the one asking for that file. That’s Google explicitly seeking permission to index a page. Other search engines including Yahoo and Microsoft Windows Live Search did the same.
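To see which crawlers are asking on your own site, you can scan your access log for requests for /robots.txt. The following is a minimal sketch, assuming the log uses the common "combined" format shown above. The sample lines and their 192.0.2.x addresses are made up for illustration, and the function name is my own.

```python
import re

# Combined Log Format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def robots_requests(lines):
    """Yield (user agent, timestamp) for every request for /robots.txt."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group("path") == "/robots.txt":
            yield match.group("agent"), match.group("time")


# Hypothetical log lines; 192.0.2.x is a reserved documentation range.
sample_log = [
    '192.0.2.10 - - [10/Nov/2006:23:36:38 -0800] "GET /robots.txt HTTP/1.1" 200 24 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.11 - - [10/Nov/2006:23:37:02 -0800] "GET /index.html HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

found = list(robots_requests(sample_log))
for agent, when in found:
    print(when, agent)
```

Only the first sample line is reported, since the second is an ordinary page request rather than a request for the permissions file.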
With this setup of my own, I’m now going to step back and revisit how search engines work. Those experienced with them can skip ahead. Those new to them, or those wanting a refresher, please read on.
Making An Index
Search engines like Google, Yahoo, Windows Live and Ask use “crawlers” or “spiders” to build what’s called an index. These are like hyperactive web surfers that read pages all over the web.
You can think of an index as being like a big book of the web. These search engine spiders visit pages and “index” them, which effectively adds them to the book. Then when we search for something like “travel,” it is as if search engines use special software to sift through all those billions of pages and pick out the ones that match.
Generally, the pages they list contain the actual words we searched for. That’s not always the case. How people link can have a role (as with the miserable failure query). With some search engines, a search for a singular word might also bring up plural matches. But by and large, they flip through that big book they’ve created and find the pages from across the web that have the words you looked for.
Now the big caveat. I say it’s like a big book of the web, but that’s not really correct. It’s more like a big spreadsheet of the web. As I’ve explained before in my Indexing Versus Caching & How Google Print Doesn’t Reprint article:
The index literally breaks apart the page. It stores where words were located, were they in bold, what other words were they near, were the words in a hyperlink and so on.
Nothing in the index is anything you as a human being could read. I’ve described the index in searching classes to being like a “big book of the web.” But it’s not, really. It’s more like a giant spreadsheet, where all the words of a page are in one row of the spreadsheet, each word to a different column, then the next page in the row below that, and so on. It’s not something a human being would read.
In fact, it’s even more complicated than that. Dan Thies dives deeper on how even a spreadsheet model is too simplistic. But the most important point from all this is that the index is not something a human can read. It is not a copy of a page. Let me put it again out there on its own:
An index is not a copy of a page.
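A toy example may make the difference concrete. This is only a sketch of the general idea of a word-to-location index, not how Google or any other search engine actually stores its data. The page text and example.com URLs are invented.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to {url: [positions]} instead of storing readable pages."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        words = re.findall(r"[a-z0-9']+", text.lower())
        for position, word in enumerate(words):
            index[word][url].append(position)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    matches = [set(index[w]) if w in index else set() for w in words]
    return set.intersection(*matches)


# Invented pages for illustration only.
pages = {
    "http://example.com/a": "Cheap travel deals for summer travel",
    "http://example.com/b": "Summer reading list",
}
index = build_index(pages)

print(dict(index["travel"]))
print(search(index, "summer travel"))
```

What gets stored under "travel" is a list of places the word appeared, which is enough to answer a search but is not something a person could sit down and read as a page.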
Of course, to put a page in an index, you have to read it. Some might argue that act of reading is copying. Others will argue the act of reading any web page, even by a human, is copying since the browser has to make a copy of the page on your local computer to display it.
Let’s take the most conservative view, that indexing IS copying. If so, every major search engine already gets permission before doing such copying.
Asking Permission To Index
Way back when the internet was just getting started — in 1994 to be specific — there was concern about search engines, in particular their spiders. The concern wasn’t over copyright infringement. The concern was that the spiders were so aggressive or misbehaved that spidering activity could bring down web servers. A need for a “No Indexing” mechanism developed. That need turned into the robots.txt protocol.
More about robots.txt can be found at the Web Robots Pages, maintained by the person who brought the system into being, Martijn Koster. For history buffs, I recommend reading Bots: The Origin Of A New Species by Andrew Leonard. Pages 120-140 provide some classic history over concern about spiders and how robots.txt emerged and gained support. Those wanting history online can read about discussions of the protocol in archived messages of the WWW Robots Mailing List.
For anyone who DOES NOT want to be indexed by any search engine, the system is very simple. You make a file called robots.txt that you place on your web server. In the file, you place these lines:
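User-agent: *
Disallow: /

Those two lines are the standard "everyone stay out of everything" form of the file. As a rough sketch of how a well-behaved crawler reads them, here is Python's standard urllib.robotparser module applied to that file. The crawler names and the example.com URL are used only for illustration.

```python
from urllib.robotparser import RobotFileParser

# The block-everything robots.txt: applies to all crawlers, disallows all paths.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(BLOCK_ALL.splitlines())

# Illustrative crawler names; any compliant crawler gets the same answer.
for agent in ("Googlebot", "Slurp", "msnbot"):
    print(agent, parser.can_fetch(agent, "http://example.com/any/page.html"))
```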
That’s it. Put a file up with those lines, and your pages don’t get into any of the major search engines. You don’t need to call anyone up at Microsoft. You don’t need to send a threatening letter to Yahoo. You don’t need to fight a court case against Google, as Belgian newspapers recently did. You can get out and stay out simply by using that file.
Need even more help? Aside from the robots.txt site, each major search engine provides more detailed instructions:
- Google: How can I remove content from Google’s index?
- Yahoo: How can I have my web site or web pages removed from the search engine?
- Microsoft Windows Live: Control which pages of your website are indexed
- Ask.com: Webmaster Help
But It’s Not Legal!
So why did the Belgian papers go to court, rather than use a simple system more than a decade old to stay out? My opinion, based on talking with the spokesperson of the Copiepresse group that led that case, is that the battle is not about staying out but about trying to force search engines to pay content owners for inclusion in their services. I formed that opinion because, when I talked with Copiepresse, they offered an illogical, circular argument about how indexing should be handled.
First I was told, somewhat similar to Gavin’s “explicit permission” suggestion, that Google (and other search engines) should not even index documents without permission. Instead, they should somehow manage to come up with a way to contact every owner of every site beforehand. As I quoted Margaret Boribon, secretary general of Copiepresse before:
“I’m sure they can find a very easy system to send an email or a document to alert the site and ask for permission or maybe a system of opt-in or opt-out,” she said.
I’m sure they can’t, I explained. It would be an impossible task to do manually, given that some domains actually host hundreds of sites that lack any contact details whatsoever. A manual permission system wouldn’t work.
Boribon was somewhat sympathetic to this. A machine-to-machine automatic connection would be fine, she said. Great — that’s exactly what robots.txt is, a machine-to-machine permissions system. No, no, I was told — that’s not a legal system. You get more of that in this interview here with her at Groklaw:
We cannot choose between being dispossessed of our content or erased. It is not acceptable. It is not Google who can make the laws governing our content. That is not acceptable. And all the standards and techniques they use, as brilliant as they may be, are techniques which belong to them, but which have no legal value. None whatsoever. They are not standardized, they have no legal status, there is no law which says: if you are not opposed, it’s normal that we take; there is no law which says that.
Actually, robots.txt is standardized to some degree, especially in terms of keeping all content out. Nor was it created by Google. It existed before Google and, I’d wager, well before any of the online editions of the Belgian newspapers.
More important, arguing robots.txt isn’t legal suggests there IS something legal out there. There’s not. There’s no automatic legal system to deal with this. Gavin’s proposed Automated Content Access Protocol system will be no more legal than the existing robots.txt system. We simply don’t have the legal framework behind either system to give them support.
Moreover, even if the law in one country eventually does come down for (or against) one of these systems, every other country still has its own laws. Robots.txt could become a legal way to grant copyright permission in the US but not in Belgium.
But It’s Opt-Out!
Another concern over robots.txt and search engines in general is that they operate under an opt-out system. Unless you say no, you’ll be included.
It’s possible to argue the opposite, that search engines operate on an opt-in system. That’s because they do indeed ask for permission on a regular basis to index material.
My example above showed Google asking for permission. Microsoft, Yahoo and even French-based Voila were other search engine visitors that day that asked for my permission. And I granted that permission by not denying access via robots.txt.
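The "ask, then read silence as yes" behavior can be sketched with Python's standard urllib.robotparser module. This is an illustration under assumptions: ExampleNewsBot is a made-up crawler name, the URL is a placeholder, and real crawlers may differ in the details of how they treat a missing or empty file.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, agent, url):
    """Ask the robots.txt rules whether this crawler may fetch this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


url = "http://example.com/story.html"
no_rules = ""  # a site that has said nothing at all
deny_one = "User-agent: ExampleNewsBot\nDisallow: /\n"  # hypothetical crawler name

print(allowed(no_rules, "Googlebot", url))       # the question was asked, nobody said no
print(allowed(deny_one, "ExampleNewsBot", url))  # the named crawler is refused
print(allowed(deny_one, "Googlebot", url))       # everyone else is still permitted
```

Whether you call that opt-in or opt-out depends on whether you count the crawler's request for the file as the asking.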
If you go with that argument, then robots.txt indeed has search engines asking for permission and having it granted to them before indexing any documents. They are using a well-established system explicitly to gain this permission. In fact, the reason it has not been supported in the courts until now is that, until now, the system has worked to keep out those who wish to stay out. It’s only coming under attack now as more traditional publishers (in particular news publishers) seek to protect business models they feel are under threat.
But It’s Not Flexible Enough!
Robots.txt is not perfect. It has inconsistencies between search engines (for example, Yahoo only recently added wildcard support). Any search engine can expand support without consulting some standards body. There’s no automated way to give spiders access to password-protected areas that you might want indexed and listed without revealing the access details to unregistered or unpaid visitors to a web site.
Far more serious are the rogue spiders that don’t respect it at all. That’s not a situation with any of the major search engines, but an improved robots.txt system might allow a way for good bots to be certified, helping webmasters put up blocks against uncertified bad bots.
These problems are things I hope get corrected. ACAP potentially could turn into Robots.txt 2.0. The search engines themselves could come together to improve the existing system. But then again, robots.txt and the related meta robots tag can provide fairly precise control over what is — and isn’t — indexed.
For example, one argument I’ve heard is that robots.txt doesn’t help prevent image indexing. Not so. Put your images in particular directories (as most are already), then use robots.txt to block those.
How about the claim that robots.txt won’t allow you to be specific about particular pages? Again, not so. It can do this. Alternatively, the meta robots tag can be placed on any page you don’t want indexed.
The pitch I heard for ACAP at the Frankfurt Book Fair painted a fairly poor picture of the support existing systems can provide. I felt that was unfortunate, making the search engines seem worse than they are. Again, robots.txt can and should evolve, but the major legal complaints I’ve heard so far could be dealt with using existing systems.
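Both kinds of control can be sketched in a few lines. This is a simplified illustration: the directory and file names are invented, the MetaRobots class and may_index function are my own, and only the noindex value of the meta robots tag is checked.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block an image directory and one specific page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Disallow: /archive/2006/story.html
"""


class MetaRobots(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def may_index(html):
    """A page is indexable unless its meta robots tag says noindex."""
    reader = MetaRobots()
    reader.feed(html)
    return "noindex" not in reader.directives


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Googlebot-Image", "http://example.com/images/photo.jpg"))
print(parser.can_fetch("Googlebot", "http://example.com/archive/2006/story.html"))
print(parser.can_fetch("Googlebot", "http://example.com/archive/2006/other.html"))
print(may_index('<html><head><meta name="robots" content="noindex"></head></html>'))
```

The robots.txt rules work at the level of directories or individual paths, while the meta tag travels with the page itself, which helps when you can't edit the server's robots.txt file.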
Cached Pages: Actual Reprinting
As best I recall, Google was the first major search engine to provide a “cached pages” feature. This is where you can see an exact copy of a page that Google has stored from when it visited a web site. Since Google introduced it, all the major search engines provide a similar feature.
I said the index wasn’t human readable, and it’s not. But as part of the indexing process, Google (and the other major search engines) does make a copy of the page that’s stored separately from the index for purposes of providing a cached copy.
I and many others were long iffy on whether it was legal for Google to effectively reprint pages in this way, when it started. Google’s argument had been that it was fair use. In the US, they’ve since won legal support of that argument.
Despite that win, removing cached pages was something I put on my 25 Things I Hate About Google list from earlier this year:
Stop caching pages: I was all for opt-out with cached pages until a court gave you far more right to reprint anything than anyone could have expected. Now you’ve got to make it opt-in. You helped create the caching mess by just assuming it was legal to reprint web pages online without asking, using opt-out as your cover. Now you’ve had that backed up legally, but that doesn’t make it less evil.
At the Search Engine Watch Forums, Caching Made Legal – Do You Agree? I Don’t! has a much longer argument from me about this. It might also seem an odd position to take, given that I have no problem with indexing.
After all, Google and the other major search engines have systems allowing you to prevent pages from being cached. Anyone who doesn’t want to be cached can use this. Why is opt-out OK with indexing but not with caching?
To me, caching goes a step beyond indexing. It is actual reprinting and should require the search engines to only do it if — yes — explicit permission is granted via robots.txt files or related meta tags.
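The difference between the two policies is small in code, which is part of why I think the switch is reasonable to ask for. In this sketch, noarchive is the meta robots value used today to opt out of caching. The "archive" value is purely hypothetical, invented here to stand in for whatever explicit "yes" signal the search engines might agree on.

```python
def parse_directives(content):
    """Split a meta robots content string into a set of lowercase directives."""
    return {part.strip().lower() for part in content.split(",") if part.strip()}


def may_cache_opt_out(directives):
    """Opt-out: show a cached copy unless the publisher said no."""
    return "noarchive" not in directives


def may_cache_opt_in(directives):
    """Opt-in: show a cached copy only if the publisher said yes.

    "archive" is a hypothetical directive invented for this sketch.
    """
    return "archive" in directives and "noarchive" not in directives


# Invented examples of what a page's meta robots tag might contain.
pages = {
    "no tag at all": "",
    "publisher said no": "noarchive",
    "publisher said yes": "index, archive",
}

results = {}
for label, content in pages.items():
    directives = parse_directives(content)
    results[label] = (may_cache_opt_out(directives), may_cache_opt_in(directives))
    print(label, results[label])
```

The only page the two policies disagree on is the one where the publisher has said nothing, and that is the case I’m arguing about.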
I’d like all the major search engines to make this change as soon as possible. I’d especially love to see Google take the leadership role here. I want the company to say that while they believe caching is perfectly legal, as a good corporate citizen, they’re going to take an extra step here to ensure publishers aren’t upset.
Keep in mind that if Google makes this move, search engines will still operate as before. You can still search and find matching pages, which is all the vast majority of people do. Anecdotally, few access cached pages. But losing them would be a huge PR boon for Google.
Huge? Yes, absolutely. It is incredibly difficult to defend the company, or any search engine, against charges that they reprint material when a cached page shows that they effectively do. You can roll out all the “it’s easy to opt-out” arguments you want. Bringing up a copy on Google loses them serious support.
Indeed, one of the reasons Google lost in Belgium was because of cached pages. From the ruling:
Considering that his research has led him to prove that, while an article is still online on the site of the Belgian publisher, Google redirects directly, via the underlying hyperlinks, to the page where the article can be found, but as soon as the article can no longer be seen on the site of the Belgian newspaper publisher, it is possible to obtain the contents of it via the “Cached” hyperlink which then goes back to the contents of the article that Google has registered in the “cached” memory of the gigantic data base which Google keeps within its enormous number of servers;
Google wasn’t there to defend itself. Had it been, it would have likely explained that anyone can prevent caching through the use of meta tags and that even without those, if an article comes offline, then a cached copy will disappear eventually at Google, from a few days to a month or so. Instead, the plaintiff witness gets to paint cached copies in the worst light.
Even in the best light, cached pages still make Google in particular and search engines in general look bad. Lose them, unless a publisher specifically requests this type of reprinting take place.
The use of thumbnail images is another issue. There have been a couple of lawsuits about image search engines in the US, and my last understanding was that showing thumbnails so far has legal support. That’s the US, of course. That’s also in terms of showing images such as when you do a search specifically for images.
Google does something different with thumbnails. It gathers them up not for image search purposes but to make its news portal seem better. Visit Google News, and images enhance the experience there.
That’s something Google should stop. Yes, there are ways to keep images out of Google using robots.txt. But that system was designed to keep them out of actual image search engines. Google News isn’t an image search engine. It’s a step beyond to assume a “yes” to image search means also yes to using images in other ways. Moreover, there’s no way I know of for someone to say yes to inclusion to Google Image Search but no for images to be used with Google News. It’s either allow the images in both places or not.
Images, in particular, are sensitive. There’s no real incentive for many people to click through from a thumbnail to a larger image, as someone might from a story headline to the actual story. My view is that showing images should require explicit permission through an automated means, rather than an opt-out. That’s true whether it be for Google News or for image search in general.
I know image search is useful. But my understanding is that most people are using image search to gather images for use on web sites, reports and other things without getting the permission of the artists or photographers. The sites with images themselves seem to get no strong return, unlike the case with web search.
How about video search? In that case, a spider-based video search service wouldn’t be so egregious because to actually view a video, you’d need to do the click-through and watch the content on the site.
FYI, while that’s what I’d hope happens with images, it’s still worth noting that objections over images being in Google (or elsewhere) could easily be handled with the existing robots.txt system. Just put up a block, and you’re done.
Google’s Library Project
I took two things away from the Frankfurt Book Fair relating to Google’s book project. First was absolute amazement that the publishing industry is so scared of Google. Giant hall after hall after hall was filled with publishers. Book publishers of all types were everywhere, from the large to the small.
Print isn’t dead. Print is huge, giant, enormous!
Understand my perspective in particular. I’ve overseen the largest conferences in search ever held, with the biggest attracting about 6,000 people and over 100 vendors, and it can be an amazing mass of people.
Well, our search expo hall could have fit within a corner of one of the book fair’s expo halls, with so much room to spare that the book fair would hardly have noticed us. Print is huge and Google but a booth, and a relatively small one, among many, many at the fair.
The second thing was a change of heart about Google’s indexing program. I’ve argued pretty strongly that indexing books isn’t making copies of them, so publishers shouldn’t be objecting. Google absolutely is not reprinting books that are in copyright on the web, despite what you often mistakenly hear.
Still, Google shouldn’t be scanning them, not the in-copyright books, not without permission. First and foremost, this is because unlike with the web, there’s no automated way to ask permission. I fully support web indexing, but I support it because there’s an easy way to get permission. That’s not the case with books in copyright. Google can’t ask if indexing is OK. Since they can’t ask, I don’t think they should do it.
As with cached pages, I think Google should back down. Google briefly paused scanning once before. I think they should pause again and say that while they feel they’re on solid legal ground, to be a good corporate citizen they’re putting things on hold until they can either work out an automated way to seek permission or negotiate deals.
Indexing & Inclusion Through Negotiation
Part of Google’s copyright battle woes comes out of its culture. Born a search engine, like search engines before it, Google operated in an opt-out world. That world was fine when dealing with site owners who, to this day, still want the traffic Google sends to them.
Things changed as Google’s “organize the world’s information” ambitions got bigger. Organize the world’s video by taping TV broadcasts over the air, and you anger a very strong television and film industry. That’s the same industry you need when it turns out much of what people want in video search is prime time, professionally produced broadcast content. In addition, since much of this isn’t hosted elsewhere, you can’t point at it as with web pages (nor insert your own video ads as easily). For success, you back away from your original opt-out culture and instead start cutting deals.
Deal cutting only seems to have accelerated as Google seeks to cut off concerns over copyright as it acquires YouTube. But honestly, those video problems aren’t Google’s real challenge. It’s clearly putting huge effort into deal making and resolution because of the money it seeks to make there. Google paid far more for YouTube than any other purchase it made. Video, and related copyright concerns, will be a problem that gets solved, because everyone sees lots of money in doing so.
Some of that attention needs to flow into the other trouble areas, the trouble areas that erupted as Google pushed other industries into a search world before they were ready. For books, that means cutting the scanning of in-copyright books and doing deals if Google wants that content, as painful as it might be for the company. In the long run, it’s the right thing.
News search is a more difficult area. Dropping thumbnails and cached pages may help. However, there’s no particular reason for Google to cut deals simply because newspapers worry that Google News might hurt their model.
Google News is simply a summary of content from selected sites. If titles and descriptions can’t be shown for a site, you might as well shut down the web. Linking to a web site and describing what it is about is done by thousands of blogs each day. Why should Google News be restricted from doing the same, especially when the legality of what’s happening seems far stronger?
In addition, it remains difficult to see how the existing publications aren’t gaining from this. Consider the Belgian papers. They’d long been indexed by Google, but they only decided they had a problem with this when Google News Belgium launched. Then it was court time. From what Copiepresse said:
It was the launching of the Google News service, presenting themselves as an information portal, which started our actions, those of the WAN, and that of the AFP. It’s not the search engine we blame. It’s a fabulous tool, we completely agree. Now, I would say, as a citizen, leaving aside the problems of copyrights, of my members, etc. — as a simple citizen, I have difficulties when I find myself facing a monopoly, a near-monopoly such as this one, because the influence that can have in terms of indexing or non-indexing of information, it’s not neutral — politically, globally, it is frankly not neutral. I mean, the attitude of Google and other search engines to the Chinese government accepting censorship, or selling keywords or ad pages to the National Front… where are the ethics in all that? I want to say that I don’t particularly want Google to lay down worldwide law on the Internet. That’s not OK. There have to be alternatives. There has to be fair competition. There has to be respect for content and the legal frameworks of the different parts of the world. Google cannot self-proclaim itself Emperor of the Internet. It’s not possible. There are major political consequences in all that.
Alternatives to the Belgian papers, as far as I can tell, potentially means banding together in hopes of forcing Google to pay them to be included in its news listings.
What happens if the key Belgian papers aren’t in Google News Belgium? It’s a less useful service that might not grow. As a result, perhaps the individual papers might launch their own collective news service and attract traffic that Google might otherwise get.
If so, what happens to that traffic that’s different than at Google News Belgium? Little, really. People at either place will see articles they are interested in and in some cases, click through to read the articles. That is, click through to the Belgian newspapers.
And that’s unfair to the Belgian papers how? And if there are smaller papers that want to be in the collective portal, do they get in and get treated as fairly as all the others, with vested interests at stake?
It’s pretty easy to assume that as bad as the Belgian paper group paints Google, the group itself could be just as bad. But more important is the underlying point: Google (like other news search engines) can only exist if it sends these places traffic. If they die, Google News dies, since it has no reporters of its own. It’s in Google’s interest to see these other sites do well.
The Wrap Up
I’ve hit a lot of areas, so I thought some bullet points would help close things off. First, some recent articles that are very good on legal issues, especially in terms of Google:
- Copyright tussles for Google, News.com, Aug. 7, 2006
- We’re Google. So Sue Us, New York Times, Oct. 23, 2006
Next, a recap of my major points. Overall, I want
- Widespread acceptance of robots.txt or a similar system as a means for giving or denying inclusion into crawler-based search engines and similar resources that provide what would be considered fair-use titles and descriptions of stories. That means an end to pretending lawsuits were required to stay out of Google (as with Belgium) and consideration of such systems if new laws are enacted (as might NOT be happening in Australia).
- Search engines to require content owners to specifically opt-in to touchy areas such as image search, image thumbnail usage and cached pages even if there are automated ways to opt-out. Don’t make these legal battles. Don’t hurt the core support of indexing either by continuing to be opt-out in these areas.
- Google to drop scanning of in-copyright books without permission (and those suing them to stop, in return).
- A new system to be developed with the search engines and a broad range of publishers for online indexing. That’s not ACAP, in the sense that ACAP had no specific solutions when it rolled out. Moreover, ACAP really represents the interests of a minority of publishers on the web, news publishers. Web publishers are online merchants and small bloggers and forum owners and those with personal home pages and B2B businesses and Fortune 1000 sites and local merchants with single pages and more. No, every constituency can’t be represented. But any new system needs more broad-based participation.
As we continue on into a new era of search, where we go well beyond web content, the challenges and legalities are only going to get more complex. I’m not foolish enough to believe that court cases will go away. But I can hope that some changes on Google’s part in particular, and by search engines in general, can help ease the transition as we go forward. More important, such changes would make it easier to defend them on the things that many people have supported for years, such as the core of web search indexing.
What do you think? I’d love to hear your comments below.