Sometimes referred to as 'spiders' or crawlers, automated search engine robots seek out web pages for the user. Just how do they accomplish this, and does it matter? What is the real purpose of these robots?
These spiders actually have a rather limited understanding and far less power than you might expect from minions of such great and mighty names as Google and Yahoo. A lot of things are beyond them, such as frames, visuals such as movies or pictures, and scripting via JavaScript. Nor can they peek into password-protected parts of sites or click buttons. Well, that's what they can't do. What CAN they do?
When you submit a new URL on a search engine's 'submit a URL' page, it is added to the queue of pages the spiders are due to 'crawl', or visit, and the robot works through that list in order the next time it goes out on the web. Even if you never submit a URL, a robot will usually find your page anyway, because links from other websites lead it there. That is why building your link popularity, and getting links back to your site from other topical sites, is important: it helps the spiders find you faster. The first thing a robot does when it arrives is check for a file called 'robots.txt', which tells it which areas of the website it is not allowed to visit. Usually these off-limits areas hold binaries or other files that the robot does not need and has no reason to report back.
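To make the robots.txt check concrete, here is a minimal sketch of how a well-behaved robot might decide which pages it may visit, using Python's standard urllib.robotparser module. The file contents, the www.example.com addresses and the helper names build_parser and allowed_urls are made up for illustration; real search engine robots use their own software.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the paths are invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /downloads/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Load robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list) -> list:
    """Keep only the URLs this user agent is permitted to crawl."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    # Hypothetical crawl queue for an imaginary site.
    queue = [
        "http://www.example.com/about.html",
        "http://www.example.com/downloads/setup.exe",
        "http://www.example.com/cgi-bin/search",
    ]
    for url in allowed_urls(rules, "Googlebot", queue):
        print("may crawl:", url)
```

Because the sample rules apply to every user agent ("*"), any robot that honours the file would skip the two disallowed folders and crawl only the about page.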
When the robots return, the information they gathered is assimilated into the search engine's database. A complex algorithm then interprets this data and ranks web sites according to how relevant they are to the topics people search for. Some of the bots are quite easy to notice: Google's is the appropriately named Googlebot, whereas Inktomi uses a more ambiguous bot named Slurp. Others may be difficult to identify at all.
A robot 'reads' your site by collecting data on any visible text, on the tags in your page's code, and on any links available. These are the things that determine what the search engines 'think' your content is about, so these are the things you really need to pay attention to when building a site that you want to rank visibly in search results.
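As a rough illustration of what 'reading' a page involves, the sketch below uses Python's standard html.parser module to pull out the title, meta tags, links and visible text from a small page. The PageReader class, the read_page helper and the sample page are all hypothetical; this is a simplified picture, not how any particular search engine works.

```python
from html.parser import HTMLParser


class PageReader(HTMLParser):
    """Collects the title, meta tags, links, and visible text of a page."""

    SKIPPED = {"script", "style"}  # contents are not visible text

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.links = []
        self.text = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        words = data.strip()
        if not words:
            return
        if self._in_title:
            self.title += words
        elif not self._skip_depth:
            self.text.append(words)


def read_page(html: str) -> PageReader:
    """Parse one page and return the reader holding what was collected."""
    reader = PageReader()
    reader.feed(html)
    reader.close()
    return reader


# Hypothetical page used only for this example.
SAMPLE_PAGE = """<html><head><title>Garden Tools</title>
<meta name="description" content="Hand tools for small gardens">
<script>var hidden = "not read";</script></head>
<body><h1>Spades and Forks</h1>
<p>Read our <a href="/guides/spades.html">spade guide</a>.</p>
</body></html>"""

if __name__ == "__main__":
    page = read_page(SAMPLE_PAGE)
    print("title:", page.title)
    print("meta:", page.meta)
    print("links:", page.links)
    print("text:", " ".join(page.text))
```

Note that the script contents never reach the collected text, which mirrors the earlier point that scripting is outside what the spiders understand.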
If you're interested in seeing which pages the spiders have visited on your website, you can check your server logs or the reports from your log statistics. From this information you'll know which spiders have visited, where they went, when they came, and which pages they crawl most often. Some are easy to identify, such as Google's 'Googlebot', while others are harder: 'Slurp' from Inktomi, for example. You can also find out whether any spiders are draining your bandwidth, so that you can block them from your site. The internet has plenty of information on identifying these bad bots. There are also certain things that can prevent good spiders from crawling your site, such as the site being down or huge amounts of traffic. This can keep your site from being re-indexed, though most spiders will eventually come by again to try re-accessing the page.
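For readers who want to check their own logs, here is a small sketch that counts spider visits and bandwidth per page. It assumes the server writes the common 'combined' access log format and that a spider can be recognised by a marker in its user agent string; the sample log lines, the BOT_MARKERS table and the function names are invented for illustration, so adjust them to your own server.

```python
import re
from collections import Counter, defaultdict

# Assumes the 'combined' access log format; your server's format may differ.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Spider name -> lowercase marker expected in its user agent string.
BOT_MARKERS = {"Googlebot": "googlebot", "Slurp": "slurp"}


def identify_bot(agent: str):
    """Return the spider's name, or None for ordinary visitors."""
    lowered = agent.lower()
    for name, marker in BOT_MARKERS.items():
        if marker in lowered:
            return name
    return None


def summarize(lines):
    """Count pages crawled and bytes served for each known spider."""
    pages = defaultdict(Counter)
    bytes_sent = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        bot = identify_bot(match["agent"])
        if bot is None:
            continue
        pages[bot][match["path"]] += 1
        if match["size"] != "-":
            bytes_sent[bot] += int(match["size"])
    return pages, bytes_sent


# Made-up log lines for illustration only.
SAMPLE_LOG = [
    '66.249.66.1 - - [14/Jul/2010:04:50:12 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [14/Jul/2010:04:50:15 +0000] "GET /products.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [15/Jul/2010:09:10:01 +0000] "GET /index.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '72.30.0.1 - - [15/Jul/2010:11:22:33 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
    '192.0.2.7 - - [15/Jul/2010:12:00:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1) Firefox/3.6"',
]

if __name__ == "__main__":
    pages, bytes_sent = summarize(SAMPLE_LOG)
    for bot, counts in pages.items():
        print(bot, "bytes served:", bytes_sent[bot])
        for path, hits in counts.most_common():
            print("  ", path, hits)
```

A spider whose byte total dwarfs everyone else's is a candidate for the bandwidth drain mentioned above, though a user agent string can be faked, so treat the names as a first hint rather than proof.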
Justin Harrison is an internationally recognised Internet Marketing expert who provides world class Search Engine Optimization to website owners. For more information visit: http://www.seorankings.co.za