Search Engines, Permissions & Moving Forward In Copyright Battles

by Danny Sullivan on November 13, 2006

in Newspapers, Search

I’ve been
promising a long look at copyright, search engines in general and
Google in particular in the wake of recent conflicts between Google and some
newspapers and other content owners. A quote in a New York Times article today
– implying that Google
doesn’t get copyright permission — finally pushed me forward with my own look.
Below, a revisit to the important difference between indexing and reprinting,
how robots.txt works right now as a permissions system, why Google should stop
scanning in-copyright books and the leadership role it could play by dropping
cached pages.


A Struggle Over Dominance and Definition
is the New York Times article that looks at
Google and whether it is a media company that conflicts with other media owners,
especially in terms of using content from others without permission. It’s a good
article, covering common themes that have been going on for literally years.

Is Google a media company? The word from Google in the article remains a firm no,
pretty similar to how Google cofounder Sergey Brin talked about still being a
technology company
when I put
the same question to him back in 2003.

Regardless of what Google thinks, I consider
them a media company, whether they own the content or not.
My Schmidt:
Google Still A Tech Company Despite The Billboards
article from early this
year looks in more depth at my reasons why.

Is Google a copyright violator? The answer is largely unknown, given
that the laws have yet to catch up with actions. Google will say no; some say
yes; it can also depend on the case, and it ultimately remains for a lot of courts to decide.

The Setup: Search Engines Asking For Permission In Action

Let’s start off with the case of indexing for inclusion in the core search
engine. I’ll use a quote in today’s New York Times article from Gavin O’Reilly of
the World Association Of Newspapers:

Gavin K. O’Reilly, the president of the World Association of Newspapers,
argues that what is missing is that any search engine ought to be asking
"explicit permission" to use copyrighted material, and that this should be
part of the vaunted automation that has made search the phenomenon it is.

I met Gavin personally in September, when I was on a panel with him and several
other people at the Frankfurt Book Fair looking at the issue of
search engines and copyright.

Most of the session
was a setup for Gavin to roll out a proposed Automated Content Access Protocol
(ACAP) that
his group backs as a solution to the problems search engines supposedly have
with copyright.

It was a receptive audience, given that from the discussion and questions, many were
clearly upset with the part of Google’s library scanning project that indexes
in-copyright books without publisher permission. The presentation (you’ll find it
here) generally made
it seem like publishers had relatively little control over what search engines
can index.

Google was on the panel, but they had no equal time, much less formal
presentation time, to explain the existing automated ways to stay out of search
engines. The Google panelist did make some remarks about things like robots.txt. I went into more depth on it myself.

I agreed that something like ACAP or an expanded robots.txt system would be a
real plus, but I disagreed with the implication that search engines weren’t
somehow asking permission already.

In fact, the major search engines all ask for permission to index a web site.
They ask for this on a routine basis. They ask for a robots.txt file. It is a
fairly simple way for any publisher to say no to having their content used.

Here’s an example of that asking from earlier this week, out of the log files
at my consulting web site, Calafia.com:

66.249.72.12 - - [10/Nov/2006:23:36:38 -0800] "GET /robots.txt HTTP/1.1" 200 24 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Two elements of that line are key. The first is the request for my robots.txt
file (GET /robots.txt). This is a place where, as I’ll explain more below, people
can give or take away permission for pages to be indexed by search engines. The
second is the user agent, Googlebot, asking for that file. That’s Google
explicitly seeking permission to index a page. Other search engines including
Yahoo and Microsoft Windows Live Search did the same.
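
For anyone who wants to spot this kind of asking in their own logs, here is a
minimal Python sketch. It assumes the server writes the common Apache "combined"
log format, as in the line above; the function names are my own illustration and
not part of any search engine's tooling.

```python
import re

# Apache "combined" log format, matching the Calafia.com line quoted above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the fields of one combined-format log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def is_permission_request(entry):
    """True when the parsed entry is a GET request for /robots.txt."""
    return (
        entry is not None
        and entry["method"] == "GET"
        and entry["path"] == "/robots.txt"
    )


line = (
    '66.249.72.12 - - [10/Nov/2006:23:36:38 -0800] "GET /robots.txt HTTP/1.1" '
    '200 24 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
entry = parse_log_line(line)
if is_permission_request(entry):
    print(entry["agent"], "asked for", entry["path"])
```

Run over a full log file, a filter like this would list every crawler that asked
for robots.txt on a given day.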

With this setup of my own, now I’m going to step back and revisit how search
engines work. Those experienced with them can skip ahead. Those new to them,
or those wanting a refresher, please read on.

Making An Index

Search engines like Google, Yahoo, Windows Live and Ask use "crawlers" or
"spiders" to build what’s called an index. These are like hyperactive web
surfers that read pages all over the web.

You can think of an index as being like a big book of the web. These search
engine spiders visit pages and "index" them, which effectively adds them to the
book. Then when we search for something like "travel," it is as if search
engines use special software to sift through all those billions of pages and
pick out the ones that match.

Generally, the pages they list contain the actual words we searched for.
That’s not always the case. How people link can have a
role (as with the
miserable failure
query). A search for a singular word might also bring up
plural matches with some search engines. But by and large, they flip through
that big book they’ve created and find the pages from across the web that have
the words you looked for.

Now the big caveat. I say it’s like a big book of the web, but that’s not
really correct. It’s more like a big spreadsheet of the web. As I’ve explained
before in my
Indexing Versus Caching & How Google Print Doesn’t Reprint
article:

The index literally breaks apart the page. It stores where words were
located, were they in bold, what other words were they near, were the words in
a hyperlink and so on.

Nothing in the index is anything you as a human being could read. I’ve
described the index in searching classes to being like a "big book of the
web." But it’s not, really. It’s more like a giant spreadsheet, where all the
words of a page are in one row of the spreadsheet, each word to a different
column, then the next page in the row below that, and so on. It’s not
something a human being would read.

In fact, it’s even more complicated than that. Dan Thies dives deeper on how
even a spreadsheet model is too simplistic. But the most important point from
all this is that the index is not something a human can read. It is not a copy
of a page. Let me put that out there again on its own:

An index is not a copy of a page.
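
To make the idea concrete, here is a toy sketch in Python of the kind of
structure an index resembles. It is only an illustration under my own
assumptions (the page addresses and text are made up, and real search engine
indexes are far more elaborate), but it shows words mapped to locations rather
than readable pages.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the pages, and the word positions, where it appears."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word.strip('.,!?";:')][url].append(position)
    return index


def search(index, word):
    """Return the addresses of pages that contained the word, in sorted order."""
    return sorted(index.get(word.lower(), {}))


# Hypothetical pages, invented for this example.
pages = {
    "example.com/a": "Cheap travel deals and travel tips.",
    "example.com/b": "Gardening tips for spring.",
}
index = build_index(pages)
print(dict(index["travel"]))   # which pages hold the word, and at which positions
print(search(index, "tips"))   # pages matching a one-word query
```

What gets stored for "travel" is a list of positions on a page, not the
sentence it came from.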

Of course, to put a page in an index, you have to read it. Some might argue
that act of reading is copying. Others will argue the act of reading any web
page, even by a human, is copying since the browser has to make a copy of the
page on your local computer to display it.

Let’s take the most conservative view, that indexing IS copying. If so, every
major search engine already gets permission before doing such copying.

Asking Permission To Index

Way back when the internet was just getting started — in 1994 to be specific
– there was concern about search engines, in particular their spiders. The
concern wasn’t over copyright infringement. The concern was that the spiders
were so aggressive or misbehaved that spidering activity could bring down web
servers. A need for a "No Indexing" mechanism developed. That need turned into
the robots.txt protocol.

More about robots.txt can be found at the Web Robots Pages, maintained by the
person who brought the system into being, Martijn Koster. For history buffs, I
recommend reading Bots: The Origin Of A New Species by Andrew Leonard. Pages
120-140 provide some classic history over concern about spiders and how
robots.txt emerged and gained support. Those wanting history online can read
about discussions of the protocol in archived messages of the WWW Robots Mailing
List.

For anyone who DOES NOT want to be indexed by any search engine, the system
is very simple. You make a file called robots.txt that you place on your web
server. In the file, you place these lines:

User-agent: *
Disallow: /

That’s it. Put a file up with those lines, and your pages don’t get into any
of the major search engines. You don’t need to call anyone up at Microsoft. You
don’t need to send a threatening letter to Yahoo. You don’t need to fight a court
case against Google, as Belgian newspapers recently did.
You can get out and stay out simply by using that file.
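
If you want to confirm that those two lines do what I say, a well-behaved
crawler's decision can be sketched with the robots.txt parser in Python's
standard library. This is a minimal sketch, not how any particular search engine
implements it; the example.com address and the `allowed` helper are assumptions
for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line "keep everyone out" file from above.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# For contrast: an empty Disallow value blocks nothing.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "http://www.example.com/any/page.html"  # hypothetical page
print(allowed(BLOCK_ALL, "Googlebot", page))
print(allowed(ALLOW_ALL, "Googlebot", page))
```

With the blocking file in place, the check comes back negative for every
crawler name and every page on the site.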

Need even more help? Aside from the robots.txt site, each major search engine
provides more detailed instructions.

But It’s Not Legal!

So why did the Belgian papers go to court, rather than use a simple system
more than a decade old to stay out? My opinion, based on talking with the
spokesperson of the Copiepresse
group that led that case, is that the battle is not about staying out but about
trying to force search engines to pay content owners for inclusion in their
services. I formed that opinion because, in talking with me, Copiepresse offered
an illogical, circular argument about indexing.

First I was told, somewhat similar to Gavin’s "explicit permission"
suggestion, that Google (and other search engines) should not even index
documents without permission. Instead, they should somehow manage to come up
with a way to contact every owner of every site beforehand. As I quoted Margaret
Boribon, secretary general of Copiepresse
before:

"I’m sure they can find a very easy system to send an email or a document to
alert the site and ask for permission or maybe a system of opt-in or opt-out,"
she said.

I’m sure they can’t, I explained. It would be an impossible task to do
manually, given that some domains actually host hundreds of sites that lack any
contact details whatsoever. A manual permission system wouldn’t work.

Boribon was somewhat sympathetic to this. A machine-to-machine automatic
connection would be fine, she said. Great — that’s exactly what robots.txt is,
a machine-to-machine permissions system. No, no, I was told — that’s not a
legal system. You get more of that in this interview
here
with her at Groklaw:

We cannot choose between being dispossessed of our content or erased. It is
not acceptable. It is not Google who can make the laws governing our content.
That is not acceptable. And all the standards and techniques they use, as
brilliant as they may be, are techniques which belong to them, but which have no
legal value. None whatsoever. They are not standardized, they have no legal
status, there is no law which says: if you are not opposed, it’s normal that we
take; there is no law which says that.

Actually, robots.txt is standardized to some degree, especially in terms of
keeping all content out. Nor was it created by Google. It existed before Google
and, I’d wager, well before any of the online editions of the Belgian
newspapers.

More important, arguing robots.txt isn’t legal suggests there IS something
legal out there. There’s not. There’s no automatic legal system to deal with
this. Gavin’s proposed Automated Content Access Protocol system will be no more
legal than the existing robots.txt system. We simply don’t have the legal
framework behind either system to give them support.

Moreover, even if the law eventually does come down for (or against) one of
these systems in one country, every other country still has its own laws.
Robots.txt could become a legal way to grant copyright permission in the US but
not in Belgium.

But It’s Opt-Out!

Another concern over robots.txt and search engines in general is that they
operate under an opt-out system. Unless you say no, you’ll be included.

It’s possible to argue the opposite, that search engines operate on an opt-in
system. That’s because they do indeed ask for permission on a regular basis to
index material.

My example above showed Google asking for permission. Microsoft, Yahoo and
even French-based Voila were other search engine visitors that day that asked
for my permission. And I granted that permission by not denying access via
robots.txt.

If you go with that argument, then robots.txt indeed has search engines
asking for permission and having it granted to them before indexing any
documents. They are using a well-established system explicitly to gain this
permission. In fact, the reason it has not been supported in the courts until
now is that, until now, the system has worked to keep out those who wish to stay
out. It’s only coming under attack now as more traditional publishers (in
particular news publishers) seek to protect business models they feel are under
threat.

But It’s Not Flexible Enough!

Robots.txt is not perfect. It has inconsistencies between search engines (for
example, Yahoo
only recently
added wildcard support). Any search engine can expand support
without consulting some standards body. There’s no automated way to give spiders
access to password-protected areas you might want them to index and list but not
reveal access details to unregistered or unpaid visitors to a web site.

Far more serious are the rogue spiders that don’t respect it at all. That’s
not a situation with any of the major search engines, but an improved robots.txt
system might allow a way for good bots to be certified, helping webmasters put
up blocks against uncertified bad bots.

These problems are things I hope get corrected. ACAP potentially could turn
into Robots.txt 2.0. The search engines themselves could come together to
improve the existing system. But then again, robots.txt and the related meta
robots tag can provide fairly precise control over what is — and isn’t –
indexed.

For example, one argument I’ve heard is that robots.txt doesn’t help prevent
image indexing. Not so. Put your images in particular directories (as most are
already), then use robots.txt to block those.

How about the claim that robots.txt won’t allow you to be specific about
particular pages? Again, not so. It can do this. Alternatively, the meta
robots tag can be placed on any page you don’t want indexed.
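
Here is a small sketch of both kinds of control, again using Python's standard
robots.txt parser. The directory name, the page path and the example.com address
are made up for illustration; substitute whatever your own site uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block an image directory and one specific page,
# and leave everything else open to indexing.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Disallow: /archive/2006/private-story.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_index(path, user_agent="Googlebot"):
    """Return True if the rules above let user_agent fetch the given path."""
    return parser.can_fetch(user_agent, "http://www.example.com" + path)


for path in (
    "/images/photo.jpg",
    "/archive/2006/private-story.html",
    "/archive/2006/public-story.html",
    "/index.html",
):
    print(path, may_index(path))
```

The page-level alternative is a tag such as
<meta name="robots" content="noindex"> in the head of any page you want kept
out.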

The pitch I heard for ACAP at the Frankfurt Book Fair painted a fairly poor
picture of the support existing systems can provide. I felt that was
unfortunate, making the search engines seem worse than they are. Again,
robots.txt can and should evolve, but the major legal complaints I’ve heard so
far could be dealt with using existing systems.

Cached Pages: Actual Reprinting

As best I recall, Google was the first major search engine to provide a
"cached pages" feature. This is where you can see an exact copy of a page that
Google has stored from when it visited a web site. Since Google introduced it,
all the major search engines provide a similar feature.

I said the index wasn’t human readable, and it’s not. But as part of the
indexing process, Google (and the other major search engines) does make a copy
of the page that’s stored separately from the index for purposes of providing a
cached copy.

I and many others were long iffy on whether it was legal for Google to
effectively reprint pages in this way, when it started. Google’s argument had
been that it was fair use. In the US, they’ve since
won legal support
of that argument.

Despite that win, removing cached pages was something I put on my
25 Things I Hate
About Google
list from earlier this year:

Stop caching pages: I was all for opt-out with cached pages until a
court gave
you
far more right to reprint anything than anyone could have expected. Now
you’ve got to make it opt-in. You helped create the caching mess by just
assuming it was legal to reprint web pages online without asking, using opt-out
as your cover. Now you’ve had that backed up legally, but that doesn’t make it
less evil.

At the Search Engine Watch Forums,
Caching Made
Legal - Do You Agree? I Don’t!
has a much longer argument from me about
this. It might also seem an odd position to take, given that I have no problem
with indexing.

After all, Google and the other major search engines have a system allowing you
to prevent pages from being cached. Anyone who doesn’t want to be cached can use
this. Why is opt-out OK with indexing but not with caching?

To me, caching goes a step beyond indexing. It is actual reprinting and
should require the search engines to only do it if — yes — explicit permission
is granted via robots.txt files or related meta tags.
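
The difference between the two models can be sketched in a few lines of Python.
The "noarchive" value is the existing way to tell search engines not to show a
cached copy. The "archive" value in the opt-in function is purely hypothetical, a
token I have invented here to show what explicit permission might look like; no
search engine defines it that way.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def robots_directives(html):
    """Return the set of meta robots directives found in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def may_cache_opt_out(html):
    """Opt-out model: cache unless the page says noarchive."""
    return "noarchive" not in robots_directives(html)


def may_cache_opt_in(html):
    """Opt-in model (hypothetical): cache only if the page says 'archive'."""
    return "archive" in robots_directives(html)


silent_page = "<html><head><title>Story</title></head><body>Text</body></html>"
print(may_cache_opt_out(silent_page), may_cache_opt_in(silent_page))
```

A page that says nothing gets cached under opt-out and left alone under opt-in,
which is the whole of the change I am asking for.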

I’d like all the major search engines to make this change as soon as
possible. I’d especially love to see Google take the leadership role here. I
want the company to say that while they believe caching is perfectly legal, as a
good corporate citizen, they’re going to take an extra step here to ensure
publishers aren’t upset.

Keep in mind that if Google makes this move, search engines will still
operate as before. You can still search and find matching pages, which is all
the vast majority of people do. Anecdotally, few access cached pages. But losing
them would be a huge PR boon for Google.

Huge? Yes, absolutely. It is incredibly difficult to defend the company, or
any search engine, against charges that they reprint material when a
cached page shows that they effectively do. You can roll out all the "it’s easy
to opt-out" arguments you want. Bringing up a copy on Google loses them serious
support.

Indeed, one of the reasons Google lost in Belgium was because of cached
pages. From the ruling:

Considering that his research has led him to prove that, while an article is
still online on the site of the Belgian publisher, Google redirects directly,
via the underlying hyperlinks, to the page where the article can be found, but
as soon as the article can no longer be seen on the site of the Belgian
newspaper publisher, it is possible to obtain the contents of it via the
“Cached” hyperlink which then goes back to the contents of the article that
Google has registered in the “cached” memory of the gigantic data base which
Google keeps within its enormous number of servers;

Google wasn’t there to defend itself. Had it been, it would have likely
explained that anyone can prevent caching through the use of meta tags and that
even without those, if an article comes offline, then a cached copy will
disappear eventually at Google, from a few days to a month or so. Instead, the
plaintiff witness gets to paint cached copies in the worst light.

Even in the best light, cached pages still make Google in particular
and search engines in general look bad. Lose them, unless a publisher
specifically requests this type of reprinting take place.

Thumbnail Images

The use of thumbnail images is another issue. There have been a couple of
lawsuits about image search engines in the US, and my last understanding was
that showing thumbnails so far has legal support. That’s the US, of course.
That’s also in terms of showing images such as when you do a search specifically
for images.

Google does something different with thumbnails. It gathers them up not for
image search purposes but to make its news portal seem better. Visit
Google News, and images enhance the
experience there.

That’s something Google should stop. Yes, there are ways to keep images out
of Google using robots.txt. But that system was designed to keep them out of
actual image search engines. Google News isn’t an image search engine. It’s a
step beyond to assume a "yes" to image search means also yes to using images in
other ways. Moreover, there’s no way I know of for someone to say yes to
inclusion in Google Image Search but no to images being used with Google News.
It’s either allow the images in both places or not.

Images, in particular, are sensitive. There’s no real incentive for many
people to click through from a thumbnail to a larger image, as someone might
from a story headline to the actual story. My view is that showing images should
require explicit permission through an automated means, rather than an opt-out.
That’s true whether it be for Google News or for image search in general.

I know image search is useful. But my understanding is that most people are
using image search to gather images for use on web sites, reports and other
things without getting the permission of the artists or photographers. The sites
with images themselves seem to get no strong return, unlike the case with web
search.

How about video search? In that case, a spider-based video search service
wouldn’t be so egregious because to actually view a video, you’d need to do the
click-through and watch the content on the site.

FYI, while that’s what I’d hope happens with images, it’s still worth noting
that objections over images being in Google (or elsewhere) could easily be
handled with the existing robots.txt system. Just put up a block, and you’re
done.

Google’s Library Project

I took two things away from the Frankfurt Book Fair relating to Google’s book
project. First was absolute amazement that the publishing industry is so scared
of Google. Giant hall after hall after hall was filled with publishers. Book
publishers of all types were everywhere, from the large:


Wiley At Frankfurt Book Fair

to the small:


Small Vendors At Frankfurt Book Fair

Print isn’t dead. Print is huge, giant, enormous!

Understand my perspective in particular. With
Search Engine Strategies, I oversee the largest conference about search that
I know of. At our biggest, we attract about 6,000 people and over 100 vendors,
and it can be an amazing mass of people.

Well, our expo hall could have fit within a corner of one of the book fair’s
expo halls with plenty of room for the book fair to hardly notice us. Print is
huge and Google but a booth — and a relatively small one — among many, many at
the fair:


Google Booth At Frankfurt Book Fair

The second thing was a change of heart about Google’s indexing program. I’ve
argued pretty
strongly
that indexing books isn’t making copies of them, so publishers
shouldn’t be objecting. Google absolutely is not reprinting books that are in
copyright on the web, despite what you often mistakenly hear.

Still, Google shouldn’t be scanning them, not the in-copyright books, not
without permission. First and
foremost, this is because unlike with the web, there’s no automated way to ask
permission. I fully support web indexing, but I support it because there’s an
easy way to get permission. That’s not the case with books in copyright. Google
can’t ask if indexing is OK. Since they can’t ask, I don’t think they should do
it.

As with cached pages, I think Google should back down. Google briefly paused
scanning once before. I think they should do so again, saying that they feel
they’re on solid legal ground but that, again to be a good corporate citizen,
they’re putting things on hold until they can either work out an automated way
to seek permission or negotiate deals.

Indexing & Inclusion Through Negotiation

Part of Google’s copyright battle woes come out of its culture. Born a search
engine, like search engines before it, Google operated in an opt-out world. That
world was fine when dealing with site owners, who to this day still want the
traffic Google sends to them.

Things changed as Google’s "organize the world’s information" ambitions got
bigger. Organize the world’s video by taping TV broadcasts over the air, and you
anger a very strong television and film industry. That’s the same industry you
need when it turns out much of what people want in video search is prime time, professionally produced broadcast
content. In addition, since much of this isn’t hosted elsewhere, you can’t point
at it as with web pages (nor insert your own video ads as easily). For success,
you back away from your original opt-out culture and instead start cutting
deals.

Deal cutting only seems to have accelerated as Google seeks to cut off concerns over copyright as it acquires YouTube.
But honestly, those video problems aren’t Google’s real challenge. It’s clearly
putting huge effort into deal making and resolution because of the money it
seeks to make there. Google paid far more for YouTube than any other purchase it
made. Video — and related copyright concerns — will be a problem that gets
solved, because everyone sees lots of money in doing so.

Some of that attention needs to flow into the other trouble areas, the
trouble areas that erupted as Google pushed other industries into a search world
before they were ready. For books, that means cutting the scanning of
in-copyright books and doing deals if Google wants that content, as painful as it
might be for the company. In the long run, it’s the right thing.

News search is a more difficult area. Dropping thumbnails and cached pages
may help. However, there’s no particular reason for Google to cut deals simply
because newspapers worry that Google News might hurt their model.

Google News is simply a summary of content from selected sites. If titles and
descriptions can’t be shown for a site, you might as well shut down the web.
Linking to a web site and describing what it is about is done by thousands of
blogs each day. Why should Google News be restricted from doing the same,
especially when the legality of what’s happening seems far stronger?

In addition, it remains difficult to see how the existing publications aren’t
gaining from this. Consider the Belgian papers. They’d long been indexed by Google,
but they only decided they had a problem with this when Google News Belgium
launched. Then it was court time. From what Copiepresse
said:

It was the launching of the Google News service, presenting themselves as
an information portal, which started our actions, those of the WAN, and that
of the AFP. It’s not the search engine we blame. It’s a fabulous tool, we
completely agree. Now, I would say, as a citizen, leaving aside the problems
of copyrights, of my members, etc. — as a simple citizen, I have difficulties
when I find myself facing a monopoly, a near-monopoly such as this one,
because the influence that can have in terms of indexing or non-indexing of
information, it’s not neutral — politically, globally, it is frankly not
neutral. I mean, the attitude of Google and other search engines to the
Chinese government accepting censorship, or selling keywords or ad pages to
the National Front… where are the ethics in all that? I want to say that I
don’t particularly want Google to lay down worldwide law on the Internet.
That’s not OK. There have to be alternatives. There has to be fair
competition. There has to be respect for content and the legal frameworks of
the different parts of the world. Google cannot self-proclaim itself Emperor
of the Internet. It’s not possible. There are major political consequences in
all that.

"Alternatives," to the Belgian papers, as far as I can tell, potentially means
banding together in hopes of forcing Google to pay them to be included in its
news listings.

What happens if the key Belgian papers aren’t in Google News Belgium? It’s a
less useful service that might not grow. As a result, perhaps the individual
papers might launch their own collective news service and attract traffic that
Google might otherwise get.

If so, what happens to that traffic that’s different than at Google News
Belgium? Little, really. People at either place will see articles they are
interested in and in some cases, click through to read the articles. That is,
click through to the Belgian newspapers.

And that’s unfair to the Belgian papers how? And if there are smaller papers
that want to be in the collective portal, do they get in and get treated as
fairly as all the others, with vested interests at stake?

It’s pretty easy to assume that as bad as the Belgian paper group paints
Google, the group itself could be just as bad. But more important is the
underlying point: Google (like other news search engines) can only exist if it
sends these places traffic. If they die, Google News dies, since it has no
reporters of its own. It’s in Google’s interest to see these other sites do well.

The Wrap Up

I’ve hit a lot of areas, so I thought some bullet points would help close
things off. First, some recent articles that are very good on legal issues,
especially in terms of Google:

Next, a recap of my major points. Overall, I want:

  • Widespread acceptance of robots.txt or a similar system as a means for
    giving or denying inclusion into crawler-based search engines and similar
    resources that provide what would be considered fair-use titles and
    descriptions of stories. That means an end to pretending lawsuits were
    required to stay out of Google (as with Belgium) and consideration of such
    systems if new laws are enacted (as might NOT be happening in Australia).
     
  • Search engines to require content owners to specifically opt-in to touchy
    areas such as image search, image thumbnail usage and cached pages even if
    there are automated ways to opt-out. Don’t make these legal battles. Don’t
    hurt the core support for indexing either by continuing to be opt-out in
    these areas.
     
  • Google to drop scanning of in-copyright books without permission (and
    those suing them to stop, in return).
     
  • A new system to be developed with the search engines and a broad range of
    publishers for online indexing. That’s not ACAP, in the sense that ACAP had
    no specific solutions when it rolled out. Moreover, ACAP really represents
    the interests of a minority of publishers on the web, news publishers. Web
    publishers are online merchants and small bloggers and forum owners and those
    with personal home pages and B2B businesses and Fortune 1000 sites and local
    merchants with single pages and more. No, every constituency can’t be
    represented. But any new system needs more broad-based participation.

As we continue on into a new era of search, where we go well beyond web
content, the challenges and legalities are only going to get more complex. I’m
not foolish enough to believe that court cases will go away. But I can hope that
some changes on Google’s part in particular, and by search engines in general,
can help ease the transition as we go forward. More important, such changes
would make it easier to defend them on the things that many people have
supported for years, such as the core of web search indexing.

What do you think? I’d love to hear your comments below.


{ 4 comments… read them below or add one }

1 Joe Dolson November 13, 2006 at 4:33 pm

Wow - that’s a great article. It seems to me that Google has tended to be a bit too free with others’ intellectual property. Their goal is admirable: to make access to all information easier. Nonetheless, the owner of the copyrighted property needs to have a say in it.
In the case of indexing a page, Google is, in my mind, in the clear: there are standardized and reasonably efficient means to restrict access. In regards to caching, although Google is perhaps invasive, there are, again, means to restrict access.
Regardless, Google’s insistence that their use of copyrighted material (such as scanning of in-copyright books) is a benefit to all needs to be addressed.

2 Brian M November 13, 2006 at 5:08 pm

Excellent posting!
I agree with you on everything, except for one exception about the “opt-in” request for a robots.txt file. Yes, the search engine asks for that file, but if the file does not exist, then the search engine freely indexes every page it can find on the site. So, this is not exactly an “opt-in” situation. It is more of an “opt-in to opt-out” situation, since you have to know that you must create a robots.txt file in order to opt-out.
It would be much safer (with possibly fewer successful lawsuits) if the search engine left the site when the robots.txt file did not exist (or the server was unable to successfully transmit it, etc.). Not everyone knows about the robots.txt file, but they would certainly learn about it in a real hurry if they wanted their pages to be included…
Brian M

3 geraldb28 November 13, 2006 at 10:33 pm

As a publisher more interested in getting INTO Google’s various Book and Scholar programs… I’m actually OK with what they’re doing on all counts. Also, as a publisher I’m keenly aware of what’s at stake to the publishers is MONEY. The libraries who ask Google to perform the scanning of their collections as Michigan has done are cutting significant revenues out of the publishers annual budgets. How? By bypassing the Copyright Clearance Center at the photocopier. See, every time a student copies a paper from a book… The CCC gets a fee from the library and the publisher gets a big fat check from the CCC every quarter.
Google’s online ambition stops that dead in its tracks. If you were Elsevier or Prentice Hall… you would fight a pretty big, pitched battle.

4 bood guy November 15, 2006 at 6:51 pm

Mostly agree, but I don’t think the following is a problem:
“There’s no automated way to give spiders access to password-protected areas you might want them to index and list but not reveal access details to unregistered or unpaid visitors to a web site.”
In fact, there should be no such way. Listing content that’s not actually available when you click through is a frustrating and annoying experience (that sometimes happens in Google News, unfortunately).
As long as it’s freely available content, what the Google SERP does is guide me to useful information. Fine. But when it’s protected content the search engine is suggesting, it simply teases me, and acts as a direct sales maker for those sites. And that’s definitely NOT fine.
Just an example:
What if I created good content and achieved the no. 1 position for some relevant, high-traffic keywords - then hide all that content behind a payment scheme? Should Google still give me the no. 1 spot? No, because the general usefulness of my content has been seriously compromised by not being freely available any more.
Actually, I wouldn’t even have earned most of my incoming links had that content been protected from the beginning.
If there was such an “automated way”, the most obvious (and deceptive) link bait would be to go free with my content for a limited period, get all the links I need to obtain high rankings, then hide it all behind a password and make people (lots of them, because I’d have huge traffic!) pay for it.
[btw, the preview option shows all the above in a single, hardly readable paragraph when in fact it's broken into several]
