Crawlers, Spiders, Robots
So there are over 4 million sites on the Web now, and new ones are added
every single day. Great! The problem is, how do you find exactly what
you are looking for? It sounds like the proverbial needle-in-a-haystack dilemma.
The Web, being a collection of webpages that reside on millions of
computers all over the world, is not organized in the orderly fashion
we might hope for. There are no catalogs listing titles, authors,
and topics in any particular alphabetical, chronological, or numerical
order. This is the main reason why search engines were developed.
A search engine does not exactly go forth and search these millions
of computers for the information you asked for. Instead, search engines are programs
that search through databases of HTML documents that are indexed by
keywords. Search engines rely on software programs called robots to build
these databases. Web robots are often referred to as crawlers, spiders,
wanderers, worms, ants, or bots for short. Don't be misled
by their names, because robots don't literally move from one site to another.
Rather, the software stays on its own computer, requests a page from a site,
scans it for links to other sites, and then requests those pages in turn. Robots
of major search sites can visit a million or more sites a day. They build databases by indexing the contents
of Web sites. Depending on how it was programmed, an indexing robot may parse
a page's title, its description, its first few paragraphs, its meta
tags, or even the entire body of the document. So if I use the words
"Internet" and "robots" in this page 50 times, there is a good chance this
page will be pulled up by a search engine in response to a request for
"Internet robots." But then this page would obviously not make any sense
to anybody. This is where meta tags come in handy.
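The visit-scan-move-on routine described above can be sketched in a few lines of Python. This is only a rough illustration, not how any real search engine works: the PAGES dictionary is a made-up stand-in for the Web (a real robot would fetch pages over the network), and the names LinkAndTextParser and crawl are invented for this example.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "Web": address -> HTML source.
# A real robot would request these pages over the network instead.
PAGES = {
    "a.html": '<title>Internet robots</title><p>Robots index the Web.</p>'
              '<a href="b.html">next</a>',
    "b.html": '<title>Spiders</title><p>Spiders follow links.</p>'
              '<a href="a.html">back</a>',
}


class LinkAndTextParser(HTMLParser):
    """Collects the links and the visible words of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        # Remember where every <a href="..."> points.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Split text into lowercase words, dropping simple punctuation.
        self.words.extend(w.strip('.,!?"').lower() for w in data.split())


def crawl(start):
    """Visit pages reachable from start; return keyword -> set of addresses."""
    index = {}
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        html = PAGES.get(url)
        if html is None:
            continue  # page not found, skip it
        parser = LinkAndTextParser()
        parser.feed(html)
        for word in parser.words:
            if word:
                index.setdefault(word, set()).add(url)
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return index


index = crawl("a.html")
```

Looking up a word in index then plays the part of a search: index["robots"] gives the set of pages that contain that word. Note that this sketch indexes every word it sees, which is exactly the approach that word-stuffing can fool.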
Tags are codes that tell browsers how to display text, images, and other
files in a web page. For example, in <I> this </I> the "bracketed" letters
are the tags that instruct your browser to display the word "this"
in italics. Meta tags are different because they provide information
that is not displayed on the web page itself. This includes the author,
content, and description of the page. Robots and search engines use keywords
and descriptions in meta tags to index HTML documents.
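To see how a robot might read meta tags, here is a small Python sketch. It assumes the common name/content form of meta tags; the sample page, the author name in it, and the names MetaTagParser and read_meta are all invented for illustration.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


# Hypothetical sample page, made up for this example.
SAMPLE = """<html><head>
<meta name="author" content="Jane Doe">
<meta name="description" content="An introduction to Web robots">
<meta name="keywords" content="robots, spiders, crawlers">
</head><body><i>this</i></body></html>"""


def read_meta(html):
    """Return a dictionary of meta tag names and their contents."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta


meta = read_meta(SAMPLE)
# Turn the comma-separated keywords into a clean list.
keywords = [k.strip() for k in meta.get("keywords", "").split(",") if k.strip()]
```

None of the values collected in meta appear on the displayed page, yet a robot can store them alongside the page's address in its database.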
Robots serve many purposes other than indexing. There are robots that
do nothing but check or validate links and web pages, robots that monitor
new sites, and robots that verify mirror sites -- websites that are replicated
on other networks or servers.
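A link-checking robot can be sketched along the same lines. This is a simplified illustration that works on a made-up site held in memory, so the SITE dictionary and the names HrefCollector and find_broken_links are assumptions for this example; a real link checker would request each address over the network and look at the server's response.

```python
from html.parser import HTMLParser

# Hypothetical site: address -> HTML source.
SITE = {
    "index.html": '<a href="about.html">About</a> <a href="missing.html">Old</a>',
    "about.html": '<a href="index.html">Home</a>',
}


class HrefCollector(HTMLParser):
    """Collects the target of every <a href="..."> on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_broken_links(site):
    """Return (page, link) pairs where the link points to no known page."""
    broken = []
    for url, html in site.items():
        collector = HrefCollector()
        collector.feed(html)
        for href in collector.hrefs:
            if href not in site:
                broken.append((url, href))
    return broken


broken = find_broken_links(SITE)
```

Here broken lists each dead link together with the page it was found on, which is the kind of report a site owner would want from such a robot.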
Exercise
Press the keys Ctrl and U at the same time to view the page
source of this document, or click the View menu in the
menu bar above and select Page Source. This command will
open another window that will show you the HTML tags of this page. The
first lines that start with "meta name" are the meta tags: