What is Googlebot?

Ankit Meena February 16, 2017

  • Googlebot is the web crawler used by Google.
  • Google uses it to find and retrieve webpages.
  • The information gathered by Googlebot is used to update the Google index.

Googlebot visits billions of webpages and is constantly crawling pages all over the web.

What is a web crawler?

  • Web crawlers (also known as bots, robots or spiders) are a type of software designed to follow links, gather information and then send that information somewhere.

What does Googlebot do?


  • Googlebot retrieves the content of webpages (the words, code and resources that make up the webpage).
  • If the content it retrieves has links to other things, that is noted.
  • It then sends the information to Google.

Googlebot and your website

The information that Googlebot sends back to Google computers updates the Google index.

The Google index is where webpages are compared and ranked.

  • In order for your webpages to be found in Google, they must be visible to Googlebot.
  • In order for your webpages to rank optimally, all webpage resources must be accessible by Googlebot.

The difference between Googlebot and the Google index


Googlebot

  • Googlebot retrieves content from the web.
  • Googlebot does not judge the content in any way; it only retrieves it.
  • The only questions Googlebot asks are “Can I access this content?” and “Is there any further content that I can access?”

The Google index

  • The Google index takes the content it receives from Googlebot and uses it to rank pages.

The first step of being ranked by Google is to be retrieved by Googlebot.

Ensuring Googlebot can see your pages

Since Googlebot is the way Google updates its index, it is essential that Googlebot can see your pages.

The fundamental first questions a webmaster should ask are…

  1. Can Googlebot “see” my pages?
  2. Can Googlebot access all my content and links completely?
  3. Can Googlebot access all of my page resources?

Let’s look at each of those closer…

1. Can Googlebot “see” my pages?


To get an idea of what Google sees from your site, do the following Google search…

site:yourwebsite.com

By putting “site:” in front of your domain name, you are asking Google to list the pages it has indexed for your site.

Tip: Make sure there is no space between “site:” and your domain name when you do this. Here is an example using this site…

site:Theinsightspost.com

If you see fewer pages than you would expect, you will likely need to ensure that you are not blocking Googlebot with your robots.txt file (the robots.txt file is discussed further down this page).

2. Can Googlebot access all my content and links completely?


The next step is to ensure Google is seeing your content and links correctly.

Just because Googlebot can see your pages does not mean that Google has a perfect picture of exactly what those pages are.

[Image: a webpage containing a single image]

Googlebot does not see a website the same way humans do. In the above image there is a webpage with one image on it. Humans can see the image, but what Googlebot sees is only the code calling that image.

Googlebot may be able to access that webpage (the html file), but not be able to access the image found on that webpage for various reasons.

In that scenario the Google index will not include that image, meaning that Google has an incomplete understanding of your webpage.

How Googlebot “sees” a webpage

Googlebot does not see complete web pages; it only sees the individual components of those pages.


If any of those components are not accessible to Googlebot, it will not send them to the Google index.

To use our earlier example, here is Googlebot seeing a webpage (the HTML and CSS) but not seeing the image.

[Image: Googlebot retrieving the HTML and CSS of a page but not its image]

It isn’t just images. There are many pieces to a webpage. For Google to be able to rank your webpages optimally, Google needs the complete picture.

There are many scenarios where Googlebot might not be able to access web content. Here are a few common ones…

  • Resource blocked by robots.txt (see the example after this list)
  • Page links not readable or incorrect
  • Overreliance on Flash or other technology that web crawlers may have issues with
  • Bad HTML or coding errors
  • Overly complicated dynamic links
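
For example, here is a hypothetical robots.txt rule (the /assets/ folder name is made up for illustration) that would block every crawler, Googlebot included, from the folder holding a site’s CSS, JavaScript and images. The pages themselves stay crawlable, but their resources do not…

    # /assets/ holds this site's CSS, JavaScript and images
    User-agent: *
    Disallow: /assets/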

If you have a Google account, use the “Fetch and Render” tool found in Google Search Console. This tool will provide you with a live example of exactly what Google sees for an individual page.

3. Can Googlebot access all of my page resources?


If CSS and JavaScript files are blocked by your robots.txt file, it can cause severe misunderstandings about your webpage content (much worse than just a missing image).

It is increasingly true that a webpage may actually look different, or contain different content, if its page resources are not loaded.

An example to illustrate this would be a mobile page that uses CSS or JavaScript to determine what to show depending on what device is viewing the page. If Googlebot cannot access the CSS or JavaScript of that page, it may not realize the page works on mobile devices, as sketched below.
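
A minimal sketch of that idea (the file name and class names are hypothetical): the HTML references a stylesheet, and a media query inside it decides what phone-sized screens see. If robots.txt blocks the /css/ folder, Googlebot receives the HTML but never learns about the mobile layout…

    <!-- In the page's HTML -->
    <link rel="stylesheet" href="/css/responsive.css">

    /* In /css/responsive.css */
    @media (max-width: 480px) {
      .desktop-nav { display: none; }   /* hide the desktop menu on phones */
      .mobile-nav  { display: block; }  /* show the mobile menu instead */
    }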

In this scenario and others like it, Google will “see” your page, and may even understand it, but it will not know enough to realize that the page can rank in many more scenarios than what the HTML alone presents.

Can I control Googlebot?

Yes.

Googlebot follows the instructions it receives via the robots.txt standard, and there are even advanced, Google-specific ways to control it.

Some ways you can control Googlebot are…

  • Using a robots.txt file
  • Including robot instructions in the metadata of your webpages (meta tag example below)
  • Including robot instructions in your HTTP headers (X-Robots-Tag example below)
  • Using sitemaps
  • Using Google Search Console
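
To illustrate the metadata and header approaches, here is a standard robots meta tag (noindex and nofollow are just two examples of the directives you can send)…

    <!-- In the <head> of a webpage: ask robots not to index the page or follow its links -->
    <meta name="robots" content="noindex, nofollow">

…and the equivalent instruction sent as an HTTP response header, which also works for non-HTML files such as PDFs…

    X-Robots-Tag: noindex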

The most common way by far is using the robots.txt file.

What is a robots.txt file?


The robots.txt file controls how search engine spiders like Googlebot see and interact with your webpages.

In short, a robots.txt file tells Googlebot what to do when it visits your pages by listing files and folders that you do not want Googlebot to access.
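
A minimal robots.txt might look like this (the /private/ folder and the sitemap URL are placeholders)…

    # Rules for all crawlers, Googlebot included
    User-agent: *
    Disallow: /private/

    # Optional: tell crawlers where your sitemap lives
    Sitemap: https://example.com/sitemap.xml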

To see your robots.txt file (or to see whether you have one), visit yourdomain.com/robots.txt in a browser; the file always lives at the root of a domain.


Sitemaps and Googlebot


Sitemaps are a way that you can help Googlebot understand your website, or as Google says…

“A sitemap is a file where you can list the web pages of your site to tell Google and other search engines about the organization of your site content. Search engine web crawlers like Googlebot read this file to more intelligently crawl your site.”

Google states that sitemaps are best used in certain scenarios, specifically…

  • Your site is really large.
  • Your site has a large archive of content pages that are isolated or not well linked to each other.
  • Your site is new and has few external links to it.
  • Your site uses rich media content, is shown in Google News, or uses other sitemaps-compatible annotations.

Sitemaps are being used for many things now, but as far as Googlebot goes, a sitemap is basically a list of URLs and other data that Googlebot may use as guidance when visiting your webpages.
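
The core of a sitemap is exactly that list. Here is a minimal example in the standard sitemaps.org XML format (the URL and date are placeholders)…

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/</loc>
        <lastmod>2017-02-16</lastmod>
      </url>
    </urlset>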

Google explains how to build sitemaps in its Search Console help documentation.

Googlebot and the Google search console


Another place you can control Googlebot is Google Search Console.

If Googlebot is accessing your web server too fast, you can change the crawl rate.

You can also see an overview of how Googlebot is accessing your website, test your robots.txt, see Googlebot crawl errors, and perform “fetch and render” requests which will help you understand how Google is seeing your webpages.

How many Googlebots / Google web crawlers are there?

There are nine different types of Google web crawlers.


  • Googlebot (Google Web search)
  • Google Smartphone
  • Google Mobile (Feature phone)
  • Googlebot Images
  • Googlebot Video
  • Googlebot News
  • Google AdSense
  • Google Mobile AdSense
  • Google AdsBot (landing page quality check)

If you want details on each, make sure to visit the Google crawlers help page provided by Google (it lists details about each web crawler Google uses).

What is the Googlebot User-agent?

Since there are several Googlebots, there are actually several Googlebot User-agents. Let’s look at the main ones:

Googlebot (Google web search)

User-agent names: Googlebot
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Googlebot Smartphone

User-agent names: Googlebot
Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Googlebot Image

User-agent names: Googlebot-Image (Googlebot)

Googlebot-Image/1.0

Googlebot Video

User-agent names: Googlebot-Video (Googlebot)

Googlebot-Video/1.0
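
These User-agent names are what you target in a robots.txt file. For example, to keep Googlebot-Image out of a folder of photos while leaving the main web crawler alone (the /photos/ path is a placeholder)…

    # Applies only to Google's image crawler
    User-agent: Googlebot-Image
    Disallow: /photos/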

The Google crawlers help page provides User-agent information about all of the Google web crawlers and is the place to look for the most up-to-date and reliable information.

Googlebot and languages / locations


If your pages show different languages or content depending on the location or language of the request, Googlebot may not always see all your content (Google recommends using hreflang).
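
hreflang annotations are link tags that tell Google which language (and optionally which region) each version of a page targets. A typical set, with placeholder URLs, looks like this…

    <link rel="alternate" hreflang="en" href="https://example.com/en/" />
    <link rel="alternate" hreflang="it" href="https://example.com/it/" />
    <link rel="alternate" hreflang="x-default" href="https://example.com/" />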

But this article is about Googlebot, and what Googlebot has started doing for language- and location-based content is interesting.

Let’s take a look…

Users with different languages or locations

When users visit your page and you have a location- or language-based solution for serving different content, a user in Italy will see the Italian content and a user in America will see the English content.

Googlebot is based in America, so how does that work? How will Googlebot see that Italian content?

Locale-aware crawling by Googlebot

Googlebot employs two main techniques (that Google tells us about) to create locale-aware crawling…

  • Geo-distributed crawling: Googlebot appears to be using IP addresses based outside the USA, in addition to the longstanding IP addresses Googlebot uses that appear to be based in the USA.
  • Language-dependent crawling: Googlebot crawls with an Accept-Language field set in the HTTP header (a sample request is sketched below).
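
As a sketch, a language-dependent request for an Italian page might look like this (the host and path are placeholders; Google does not publish the exact headers Googlebot sends)…

    GET /prodotti HTTP/1.1
    Host: example.com
    User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    Accept-Language: it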

So in other words, Googlebot employs methods to crawl the web as a user from anywhere, but (and this is a big “but”), Google still recommends using hreflang.

Always check the locale-aware Googlebot crawling page in Google’s official help pages before making decisions!
