Googlebot is constantly crawling the web, visiting billions of webpages.
The information that Googlebot sends back to Google computers updates the Google index.
The Google index is where webpages are compared and ranked.
The first step of being ranked by Google is to be retrieved by Googlebot.
Since Googlebot is the way Google updates its index, it is essential that Googlebot can see your pages.
The fundamental first questions a webmaster should ask are:

1. Can Googlebot see my pages?
2. Is Google seeing my content and links correctly?
Let’s look at each of those more closely.
To get an idea of what Google sees from your site, do the following Google search…
By putting “site:” in front of your domain name, you are asking Google to list the pages it has indexed for your site.
Tip: Make sure there is no space between “site:” and your domain name when you do this.
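Here is an example, with the placeholder domain example.com standing in for a real site:

```
site:example.com
```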
If you see fewer pages than you would expect, you will likely need to ensure that you are not blocking Googlebot with your robots.txt file (the robots.txt file is discussed further down this page).
The next step is to ensure Google is seeing your content and links correctly.
Just because Googlebot can see your pages does not mean that Google has a perfect picture of exactly what those pages are.
Googlebot does not see a website the same way humans do. Imagine a webpage with one image on it: humans can see the image, but what Googlebot sees is only the code calling that image.
Googlebot may be able to access that webpage (the HTML file), but not be able to access the image found on that webpage, for various reasons.
In that scenario the Google index will not include that image, meaning that Google has an incomplete understanding of your webpage.
Googlebot does not see complete web pages; it only sees the individual components of each page.
If any of those components are not accessible to Googlebot, it will not send them to the Google index.
To use our earlier example: Googlebot retrieves the webpage (the HTML and CSS) but cannot retrieve the image.
It isn’t just images. There are many pieces to a webpage. For Google to be able to rank your webpages optimally, Google needs the complete picture.
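As a minimal illustration (the file path is hypothetical), the HTML below is all Googlebot gets on its first fetch; the image is a separate resource it must be able to request on its own:

```html
<!-- The HTML only references the image; the image file itself is a
     separate request that can fail or be blocked independently. -->
<html>
  <body>
    <h1>Our product</h1>
    <img src="/images/product-photo.jpg" alt="Product photo">
  </body>
</html>
```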
There are many scenarios where Googlebot might not be able to access web content. Here are a few common ones:

- A resource is blocked by a robots.txt file (yours, or one on a third-party server hosting the resource).
- The server is slow, overloaded, or returning errors when Googlebot requests the resource.
- The content only appears after a user action or login that a crawler cannot perform.
If you have a Google account, use the “fetch and render” tool found in Google Search Console. This tool will provide you with a live example of exactly what Google sees for an individual page.
It is increasingly common for a webpage to be different, or present different content, when its page resources are not loaded. In this scenario and others like it, Google will “see” your page, and may even understand it, but it may not know enough about the page to realize that it could rank in many more scenarios than what the HTML alone presents.
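A common version of this is content injected by JavaScript. As a minimal sketch (the file name is hypothetical), if Googlebot cannot fetch the script below, all that reaches the index is an empty shell:

```html
<!-- If /app.js is blocked or fails to load, the real page content never appears -->
<html>
  <body>
    <div id="app"></div>
    <script src="/app.js"></script>
  </body>
</html>
```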
Googlebot follows the instructions it receives via the robots.txt standard, and there are even advanced, Google-specific ways to control it.
Some ways you can control Googlebot are:

- the robots.txt file
- sitemaps
- Google Search Console settings
The most common way by far is using the robots.txt file.
The robots.txt file controls how search engine spiders like Googlebot see and interact with your webpages.
In short, a robots.txt file tells Googlebot what to do when it visits your pages by listing files and folders that you do not want Googlebot to access.
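As a minimal sketch (the folder name is hypothetical), a robots.txt file that keeps Googlebot out of one folder while leaving everything else open could look like this:

```
# Hypothetical example: keep Googlebot out of /private/
User-agent: Googlebot
Disallow: /private/

# All other crawlers: no restrictions (an empty Disallow allows everything)
User-agent: *
Disallow:
```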
To see your robots.txt file (or to see if you have one), request /robots.txt at the root of your domain (for example, yourdomain.com/robots.txt in your browser).
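You can also check a robots.txt file programmatically. Here is a minimal sketch using Python's standard urllib.robotparser; the domain and path are placeholders:

```python
from urllib import robotparser

# Placeholder domain; substitute your own site.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

# Would Googlebot be allowed to fetch this (hypothetical) URL?
print(rp.can_fetch("Googlebot", "https://example.com/private/page.html"))
```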
Here are a few resources from Google that cover robots instructions:
Sitemaps are a way that you can help Googlebot understand your website, or as Google says…
“A sitemap is a file where you can list the web pages of your site to tell Google and other search engines about the organization of your site content. Search engine web crawlers like Googlebot read this file to more intelligently crawl your site.”
Google states that sitemaps are best used in certain scenarios, specifically for sites that are really large, sites with large archives of pages that are isolated or not well linked to one another, and sites that are new with few external links pointing to them.
Sitemaps are used for many things now, but as far as Googlebot goes, a sitemap is basically a list of URLs and other data that Googlebot may use as guidance when visiting your webpages.
Google explains how to build sitemaps here.
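As a minimal sketch following the sitemaps.org protocol (the URLs and date are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page; <lastmod> is optional -->
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
```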
Another place you can control Googlebot is Google Search Console.
If Googlebot is accessing your web server too fast, you can change the crawl rate.
You can also see an overview of how Googlebot is accessing your website, test your robots.txt, see Googlebot crawl errors, and perform “fetch and render” requests, which will help you understand how Google is seeing your webpages.
There are nine different types of Google webcrawlers.
If you want details on each, visit the Google crawlers help page, which lists details about every webcrawler Google uses.
Since there are several Googlebots, there are actually several Googlebot user-agents. Let’s look at the main ones:
Googlebot (Google web search)
User-agent names: Googlebot
Full user-agent string: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Googlebot Images
User-agent names: Googlebot-Image (Googlebot)

Googlebot Video
User-agent names: Googlebot-Video (Googlebot)
The Google crawlers help page provides User-agent information about all of the Google webcrawlers and is the place you should look for the most updated and reliable information.
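Those user-agent names are what you target in robots.txt. For example, a hypothetical policy that blocks image crawling while leaving regular web crawling open:

```
# Keep Googlebot-Image out entirely (hypothetical policy)
User-agent: Googlebot-Image
Disallow: /

# Regular Googlebot may crawl everything
User-agent: Googlebot
Disallow:
```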
If your pages show different languages or content depending on the location or language of the request, Googlebot may not always see all your content (Google recommends using hreflang).
But this article is about Googlebot, and what Googlebot has started doing for language and location based content is interesting.
Let’s take a look…
If you serve different content based on the location or language of the visitor, a user in Italy will see the Italian content and a user in America will see the English content.
Googlebot is based in America, so how does that work? How will Googlebot see that Italian content?
Googlebot employs two main techniques (that Google tells us about) for locale-aware crawling:

- Geo-distributed crawling: Googlebot crawls from IP addresses located outside the United States, in addition to its usual US-based addresses.
- Language-dependent crawling: Googlebot sends an Accept-Language HTTP header with its requests.
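To make the second technique concrete, here is a minimal sketch (the content and port are hypothetical) of a server choosing its response language from the Accept-Language header that a locale-aware crawl can send:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical content keyed by primary language tag
CONTENT = {
    "it": b"<html><body>Ciao!</body></html>",
    "en": b"<html><body>Hello!</body></html>",
}

class LocaleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. "it-IT,it;q=0.9,en;q=0.8" -> take the first primary tag ("it")
        accept = self.headers.get("Accept-Language", "en")
        lang = accept.split(",")[0].split(";")[0].split("-")[0].strip().lower()
        body = CONTENT.get(lang, CONTENT["en"])
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), LocaleHandler).serve_forever()
```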
So, in other words, Googlebot employs methods to crawl the web as if it were a user from anywhere, but (and this is a big “but”) Google still recommends using hreflang.
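For reference, hreflang annotations are link elements in a page's head; a minimal sketch with placeholder URLs:

```html
<link rel="alternate" hreflang="en" href="https://example.com/en/" />
<link rel="alternate" hreflang="it" href="https://example.com/it/" />
<link rel="alternate" hreflang="x-default" href="https://example.com/" />
```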
Always check the locale-aware Googlebot crawling page in Google's official help pages before making decisions!