Crawler – Short Conceptual Explanation

A crawler, also known as a spider or a bot, is a program that systematically browses the Internet, reading web pages and indexing the information it finds. A spider looks at the keywords, content, and links contained on each page and stores them in a database where a snapshot of that page can be retrieved at a later time. This process is used by search engines, like Google, so the most relevant information can be retrieved when searching for a term or phrase on the Internet.


What and how a crawler adds to its indexed pages

A web crawler starts with a list of URLs to visit on the Internet, often referred to as seeds. Each URL is scanned to determine the type of information it contains. A bot will catalogue the keywords and phrases used throughout the page, as well as the links that appear in the content. Those links are added to the list of URLs still to be visited. The bot takes a snapshot of a page as it exists at a single moment in time. Once the information is collected, it is added to a database, sometimes referred to as a depository.
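
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine works: FAKE_WEB is a made-up in-memory stand-in for the Internet so the example runs without network access, and the names LinkParser, crawl, and depository are assumptions chosen for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a> crawler basics',
    "http://example.com/a": '<a href="/b">B</a> spiders and bots',
    "http://example.com/b": '<a href="/">home</a> indexing pages',
}


class LinkParser(HTMLParser):
    """Collects the href of every anchor tag plus the page's text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)


def crawl(seeds, fetch=FAKE_WEB.get):
    """Visit every page reachable from the seeds and return a snapshot of each."""
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, to avoid repeats
    depository = {}           # url -> catalogued words and links
    while frontier:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:      # page could not be retrieved
            continue
        parser = LinkParser()
        parser.feed(html)
        parser.close()
        words = " ".join(parser.text).lower().split()
        links = [urljoin(url, href) for href in parser.links]
        depository[url] = {"words": words, "links": links}
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return depository


snapshot = crawl(["http://example.com/"])
for url, record in snapshot.items():
    print(url, record["words"], record["links"])
```

Starting from the single seed, the sketch reaches all three pages because each one is linked from a page that was already visited. A real crawler would replace the fetch argument with an HTTP client.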

A spider is only able to collect a certain amount of information at any given time. It’s important that it prioritizes which pages to crawl first, because there are billions of indexed pages on the Internet, with even more that haven’t been indexed. The ultimate goal of a crawler is to discover and index as many pages as possible.
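
One common way to express this kind of prioritization is a priority queue, where the most important URL is always visited next. The sketch below assumes that each URL already comes with a numeric score; how that score is computed is not covered here, and the PriorityFrontier class and the example scores are hypothetical.

```python
import heapq
import itertools


class PriorityFrontier:
    """Queue of URLs that always hands back the highest-scored one first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: earlier additions first
        self._queued = set()

    def add(self, url, score):
        if url in self._queued:            # ignore URLs that were already queued
            return
        self._queued.add(url)
        # heapq is a min-heap, so the score is negated to pop the highest first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


frontier = PriorityFrontier()
frontier.add("http://example.com/archive/2009", score=0.1)  # assumed scores
frontier.add("http://example.com/", score=0.9)
frontier.add("http://example.com/news", score=0.6)

budget = 2  # the crawler can only visit this many pages right now
crawl_order = [frontier.pop() for _ in range(min(budget, len(frontier)))]
print(crawl_order)
```

With a budget of two pages, the low-scored archive page stays in the queue for a later crawl.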

What a crawler does with the information it finds

Although a crawler can be used by businesses or researchers to catalogue websites, web crawlers are mostly utilized by Internet search engines. The information that a bot catalogues about each web page gets deposited into a huge database from which it can later be retrieved.

For example, a user can access a search engine like Google on the Internet. Then, that user can type in a word or a phrase that they’d like to know more about. A user might type ‘what is a web crawler’ into the search bar. The search engine will search its huge depository, looking for pages that contain information that is most relevant to that search.
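
To make the lookup step concrete, here is a small sketch of an inverted index, a structure that maps each word to the pages containing it. The page texts in PAGES are invented for the example, and the scoring (simply counting matching words) is a deliberate simplification; real search engines rank results with far more signals.

```python
from collections import defaultdict

# Hypothetical catalogued pages: url -> text captured by the crawler.
PAGES = {
    "http://example.com/crawler": "a web crawler is a bot that indexes web pages",
    "http://example.com/spider": "a spider is another name for a web crawler",
    "http://example.com/recipes": "bread recipes for beginners",
}


def build_index(pages):
    """Map each word to the pages it appears on, with an occurrence count."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index


def search(index, query):
    """Return matching URLs, best match first (ties broken alphabetically)."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] += count
    return sorted(scores, key=lambda url: (-scores[url], url))


index = build_index(PAGES)
results = search(index, "what is a web crawler")
print(results)
```

The recipes page shares no words with the query, so it does not appear in the results at all.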

Constant rebuilding of the database

To ensure that a search engine provides the most relevant information possible, a crawler must do more than visit and catalogue new web pages. It must also revisit pages that have been catalogued in the past to determine whether anything has changed that would affect the relevancy of the information.

Crawls are being conducted all the time to identify new pages and update the information on existing pages.
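
One simple way to detect whether a revisited page has changed is to compare a fingerprint of its content against the one stored from the previous crawl. The sketch below uses a SHA-256 hash for that purpose; the recrawl function and the layout of the depository dictionary are assumptions made for this illustration.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Return a short, fixed-size digest that changes when the content changes."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def recrawl(depository, url, fetched_content):
    """Compare a freshly fetched page with the stored copy and update if needed."""
    new_hash = fingerprint(fetched_content)
    old = depository.get(url)
    if old is None:
        status = "new"
    elif old["hash"] != new_hash:
        status = "changed"
    else:
        status = "unchanged"
    if status != "unchanged":
        depository[url] = {"hash": new_hash, "content": fetched_content}
    return status


depository = {}
print(recrawl(depository, "http://example.com/", "version one"))
print(recrawl(depository, "http://example.com/", "version one"))
print(recrawl(depository, "http://example.com/", "version two"))
```

An exact hash treats any difference, even a changed timestamp in a footer, as a change. Crawlers that need to ignore trivial edits would have to use a more tolerant comparison.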


Types of crawls

How often crawls take place depends on the type of crawl being performed. Deep crawls are more comprehensive and are meant to catalogue a page as if it is being catalogued for the first time.

Fresh crawls, in contrast, don’t go as deep. They can be performed more often, so they are able to keep the database more up-to-date. However, they index less, which means the sites they crawl may not be as searchable.
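
One possible way to picture the difference is as a limit on how many links away from the starting page a crawl is allowed to go. This is only an interpretation for illustration: the LINKS site map, the crawl_to_depth function, and the max_depth parameter are all hypothetical, and real search engines may define deep and fresh crawls differently.

```python
from collections import deque

# Hypothetical site structure: page -> pages it links to.
LINKS = {
    "/": ["/news", "/about"],
    "/news": ["/news/1", "/news/2"],
    "/about": ["/about/team"],
    "/news/1": [],
    "/news/2": [],
    "/about/team": ["/about/team/history"],
    "/about/team/history": [],
}


def crawl_to_depth(start, max_depth=None):
    """Breadth-first crawl; max_depth=None means follow links without limit."""
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links from pages at the depth limit
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                visited.append(link)
                queue.append((link, depth + 1))
    return visited


fresh = crawl_to_depth("/", max_depth=1)  # shallow pass, cheap to repeat often
deep = crawl_to_depth("/")                # full pass over every reachable page
print(len(fresh), len(deep))
```

The shallow pass touches three pages while the full pass touches all seven, which shows the trade-off between how often a crawl can run and how much it indexes.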

Crawling policy

How a web crawler behaves is governed by a combination of policies. Some are meant to ensure efficiency, while others are meant to protect the site being crawled. They include:

  • A selection policy that identifies the list of pages to be indexed. Because there are so many pages on the web, and a bot can only scan so quickly, it is important to select pages that include the most relevant content on the Internet.
  • A re-visit policy that allows a database to remain fresh, which means the local copy is as accurate and as current as possible.
  • A politeness policy that ensures a crawler does not send requests to a particular server faster than it can handle, so the site continues to function normally even during a crawl.
  • A parallelization policy that maximizes the download rate by allowing a bot to run multiple processes at the same time while avoiding repeated downloads of the same page.
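
As an illustration of the politeness policy, the sketch below combines two common ingredients: honoring a site's robots.txt file, parsed with Python's standard urllib.robotparser module, and spacing out requests to the same host. The robots.txt content, the ExampleBot user agent, and the PolitenessGate class are made up for this example, and the gate takes the current time as an argument so the logic can be followed without actually waiting.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt as a site owner might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


class PolitenessGate:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay
        self._next_allowed = {}  # host -> earliest allowed request time

    def wait_time(self, url, now, delay=None):
        """Return how many seconds to wait before fetching url, and book the slot."""
        host = urlparse(url).netloc
        delay = self.default_delay if delay is None else delay
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + delay
        return start - now


agent = "ExampleBot"  # assumed crawler name
delay = robots.crawl_delay(agent)
gate = PolitenessGate()

for url, now in [("http://example.com/index.html", 0.0),
                 ("http://example.com/private/data", 0.1),
                 ("http://example.com/news", 0.5)]:
    if not robots.can_fetch(agent, url):
        print("skip", url)
        continue
    print("fetch", url, "after waiting", gate.wait_time(url, now, delay))
```

Only requests to the same host are delayed. Requests to different hosts can proceed in parallel, which is where the parallelization policy comes in.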


Practical applications of web crawler technology

Knowing how web crawlers and search engines work, web designers and content writers can use this process to their advantage.

Not only does a web crawler catalogue keywords and revisit websites looking for updated information, it also modifies its selection policy to favor websites that are updated frequently. A website that is updated more frequently is more likely to be indexed appropriately, which increases the likelihood and frequency of showing up in an online search.
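
A rough sketch of how a crawler might favor frequently updated sites is to shorten the revisit interval whenever a page has changed and lengthen it whenever it has not. The halving and doubling factors and the one-hour and thirty-day limits below are arbitrary assumptions for illustration, not values used by any known search engine.

```python
def next_interval(current_hours, changed, min_hours=1.0, max_hours=24.0 * 30):
    """Return the number of hours to wait before revisiting a page.

    The interval is halved when the last visit found a change and doubled
    when it did not, then clamped to the [min_hours, max_hours] range.
    """
    factor = 0.5 if changed else 2.0
    return min(max_hours, max(min_hours, current_hours * factor))


interval = 24.0
for changed in [True, True, False]:  # outcome of three successive visits
    interval = next_interval(interval, changed)
    print(interval)
```

Under this scheme, a site that changes on every visit is soon revisited hourly, while a static site drifts toward the thirty-day maximum.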

The information that is stored by a web crawler is not only used for search engine results. The data that is contained in a depository has many other applications as well.

Data mining is an application of crawler technology that allows a user to gather predictive information on a wide variety of topics. For example, insurance companies are able to determine the spending and saving patterns of customers, while presidential campaigns use mining techniques to plan pre-election campaigning by collecting information on voters and the behavior patterns of their constituencies.