Indexing
After a page is discovered, Google tries to understand what the page is about. This process is called indexing. Google analyzes the content of the page, catalogs images and video files embedded on the page, and otherwise tries to understand the page. This information is stored in the Google index, a huge database hosted on many computers.
To improve your page indexing:
- Create short, meaningful page titles.
- Use page headings that convey the subject of the page.
- Use text rather than images to convey content. Google can understand some image and video, but not as well as it can understand text. At minimum, annotate your video and images with alt text and other attributes as appropriate.
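The checklist above can be spot-checked with a small script. The following is a minimal sketch, assuming you already have the page's HTML as a string; the names `PageAudit` and `audit` are hypothetical helpers written for this illustration, and the check says nothing about how Google itself evaluates a page.

```python
from html.parser import HTMLParser


class PageAudit(HTMLParser):
    """Collects the page title, headings, and images that lack alt text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.images_missing_alt = []
        self._capture = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" or tag in ("h1", "h2", "h3"):
            self._capture = tag
            if tag != "title":
                self.headings.append("")
        elif tag == "img" and not (attrs.get("alt") or "").strip():
            # Empty alt is treated as missing here; purely decorative
            # images may legitimately use alt="".
            self.images_missing_alt.append(attrs.get("src") or "")

    def handle_endtag(self, tag):
        if tag == self._capture:
            self._capture = None

    def handle_data(self, data):
        if self._capture == "title":
            self.title += data
        elif self._capture:
            self.headings[-1] += data


def audit(html):
    """Parse the HTML and return the populated PageAudit object."""
    parser = PageAudit()
    parser.feed(html)
    parser.close()
    return parser


SAMPLE = """<html><head><title>Bicycle repair in Paris</title></head>
<body><h1>Bicycle repair shops</h1>
<img src="shop.jpg" alt="Front of the repair shop">
<img src="map.png"></body></html>"""

report = audit(SAMPLE)
print("Title:", report.title)
print("Headings:", report.headings)
print("Images without alt text:", report.images_missing_alt)
```

A page with an empty title, no headings, or a long list of images without alt text is a candidate for the improvements listed above.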
Serving (and ranking)
When a user types a query, Google tries to find the most relevant answer from its index based on many factors. Google tries to determine the highest-quality answers and factors in other considerations that provide the best user experience and most appropriate answer, such as the user's location, language, and device (desktop or phone). For example, searching for “bicycle repair shops” would show different answers to a user in Paris than it would to a user in Hong Kong. Google doesn't accept payment to rank pages higher, and ranking is done programmatically.
To improve your serving and ranking:
- Make your page fast to load, and mobile-friendly.
- Put useful content on your page and keep it up to date.
- Follow the Google Webmaster Guidelines, which help ensure a good user experience.
- Read more tips and best practices in our SEO starter guide.
- You can find more information about how Search works, including the guidelines that we provide to our quality raters to ensure that we're providing good results.
An even longer version
Want more in-depth information about how Search works? Read our Advanced guide to how Google Search works.
Advanced: How Search Works
Understanding how Google Search crawls, indexes, and serves content is important when you're debugging issues and anticipating Search behavior on your site.
Crawling
Crawling is the process by which Googlebot visits new and updated pages to be added to the Google index.
We use a huge set of computers to fetch (or “crawl”) billions of pages on the web. The program that does the fetching is called Googlebot (also known as a robot, bot, or spider). Googlebot uses an algorithmic process to determine which sites to crawl, how often, and how many pages to fetch from each site.
Google's crawl process begins with a list of web page URLs, generated from previous crawl processes, augmented by Sitemap data provided by website owners. When Googlebot visits a page it finds links on the page and adds them to its list of pages to crawl. New sites, changes to existing sites, and dead links are noted and used to update the Google index.
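The link-following loop described above can be pictured as a queue of URLs that grows as pages are fetched. The following is a toy sketch, not Googlebot's actual algorithm: the `TOY_WEB` dictionary, the `crawl` function, and the breadth-first ordering are all assumptions made for illustration, and no network requests are performed.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A URL that is absent from the mapping stands in for a dead link.
TOY_WEB = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/gone"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/sitemap-only": [],
}


def crawl(seed_urls, sitemap_urls, fetch_links):
    """Breadth-first crawl. Returns (URLs crawled in order, dead links)."""
    frontier = deque()
    seen = set()
    # Start from previously known URLs, augmented by sitemap data.
    for url in list(seed_urls) + list(sitemap_urls):
        if url not in seen:
            seen.add(url)
            frontier.append(url)

    crawled, dead = [], []
    while frontier:
        url = frontier.popleft()
        links = fetch_links(url)
        if links is None:  # the page could not be fetched
            dead.append(url)
            continue
        crawled.append(url)
        # Newly discovered links join the list of pages to crawl.
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled, dead


crawled, dead = crawl(
    seed_urls=["https://example.com/"],
    sitemap_urls=["https://example.com/sitemap-only"],
    fetch_links=TOY_WEB.get,
)
print("Crawled:", crawled)
print("Dead links:", dead)
```

Note that `https://example.com/sitemap-only` is reached only because it was supplied as sitemap data; no page links to it.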
During the crawl, Google renders the page using a recent version of Chrome. As part of the rendering process, it runs any page scripts it finds. If your site uses dynamically-generated content, be sure that you follow the JavaScript SEO basics.
Primary crawl / secondary crawl
Google uses two different crawlers for crawling websites: a mobile crawler and a desktop crawler. Each crawler type simulates a user visiting your page with a device of that type.
Google uses one crawler type (mobile or desktop) as the primary crawler for your site. All pages on your site that are crawled by Google are crawled using the primary crawler. The primary crawler for all new websites is the mobile crawler.
In addition, Google recrawls a few pages on your site with the other crawler type (mobile or desktop). This is called the secondary crawl and is done to see how well your site works with the other device type.
How does Google know which pages not to crawl?
- Pages blocked in robots.txt won't be crawled, but still might be indexed if linked to by another page. Google can infer the content of the page from a link pointing to it, and index the page without parsing its contents.
- Google can't crawl any pages not accessible by an anonymous user. Thus, any login or other authorization protection will prevent a page from being crawled.
- Pages that have already been crawled and are considered duplicates of another page are crawled less frequently.
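To illustrate the robots.txt case in the list above, here is a minimal sketch that uses Python's standard `urllib.robotparser` module to test whether a URL is blocked for a given user agent. The robots.txt content and the `is_crawlable` helper are hypothetical, and the standard-library parser is not Google's parser, so results may differ in edge cases such as wildcard rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in ("https://example.com/public/page", "https://example.com/private/page"):
    print(url, "crawlable:", is_crawlable(ROBOTS_TXT, "Googlebot", url))
```

A result of False only means the page will not be fetched. As noted above, the URL might still be indexed if another page links to it.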
Improve your crawling
Use these techniques to help Google discover the right pages on your site:
- Submit a sitemap.
- Submit crawl requests for individual pages.
- Use simple, human-readable, and logical URL paths for your pages and provide clear and direct internal links within the site.
- If you use URL parameters on your site for navigation (for instance, to indicate the user's country on a global shopping site), use the URL parameters tool to tell Google about important parameters.
- Use robots.txt wisely: Use it to indicate which pages you'd prefer Google to know about or crawl first and to protect your server load, not as a method to block material from appearing in the Google index.
- Use hreflang to point to alternate versions of your page in other languages.
- Identify your canonical page and alternate pages.
- View your crawl and index coverage using the Index Coverage Report.
- Be sure that Google can access the key pages, and also the important resources (images, CSS files, scripts) needed to render the page properly.
- Confirm that Google can access and render your page properly by running the URL Inspection tool on the live page.
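As an example of the first technique in the list above, the following sketch builds a small XML sitemap with Python's standard library. It assumes the sitemaps.org 0.9 format with `loc` and optional `lastmod` elements; the `build_sitemap` helper and the example URLs and dates are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from an iterable of (url, lastmod) pairs.

    lastmod is an ISO date string such as "2020-01-31", or None to omit it.
    Returns the XML document as UTF-8 bytes.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


sitemap = build_sitemap([
    ("https://example.com/", "2020-01-31"),
    ("https://example.com/dresses/summer/1234", None),
])
print(sitemap.decode("utf-8"))
```

The resulting file would typically be saved as sitemap.xml on the site and then submitted to Google.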
Indexing
Googlebot processes each page it crawls to understand the content of the page. This includes processing the textual content, key content tags and attributes (such as <title> tags and alt attributes), images, videos, and more. Googlebot can process many, but not all, content types. For example, we cannot process the content of some rich media files.
Somewhere between crawling and indexing, Google determines if a page is a duplicate or canonical of another page. If the page is considered a duplicate, it will be crawled much less frequently. Similar pages are grouped into a document, which is a group of one or more pages that includes the canonical page (the most representative of the group) and any duplicates found (which might simply be alternate URLs to reach the same page or might be alternate mobile or desktop versions of the same page).
Note that Google doesn't index pages with a noindex directive (header or tag). However, it must be able to see the directive; if the page is blocked by a robots.txt file, a login page, or another mechanism, the page might be indexed even if Google didn't visit it.
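The interaction between noindex and robots.txt can be sketched in code. The example below is a simplified illustration, assuming the directive arrives either in an `X-Robots-Tag` response header or in a robots meta tag; the helper names `has_noindex` and `may_be_indexed` are hypothetical, and the parsing ignores details such as user-agent prefixes in the header value.

```python
from html.parser import HTMLParser


class _RobotsMeta(HTMLParser):
    """Collects directives from <meta name="robots"> or <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "meta" and name in ("robots", "googlebot"):
            content = attrs.get("content") or ""
            self.directives.update(p.strip().lower() for p in content.split(","))


def has_noindex(headers, html):
    """Return True if a noindex directive is present in the header or the HTML."""
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_value = value
    header_directives = {p.strip().lower() for p in header_value.split(",")}

    meta = _RobotsMeta()
    meta.feed(html)
    meta.close()
    return "noindex" in header_directives or "noindex" in meta.directives


def may_be_indexed(blocked_by_robots_txt, headers, html):
    """A blocked page is never fetched, so its noindex directive cannot be seen."""
    if blocked_by_robots_txt:
        return True  # directive is invisible; the URL might still be indexed
    return not has_noindex(headers, html)


PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(may_be_indexed(False, {}, PAGE))  # directive is visible to the crawler
print(may_be_indexed(True, {}, PAGE))   # directive is hidden behind robots.txt
```

The second call shows why blocking a page in robots.txt and marking it noindex work against each other.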
Improve your indexing
There are many techniques to improve Google's ability to understand the content of your page:
- Keep pages that you want to hide out of the index by using the noindex tag. Don't use noindex on a page that is blocked by robots.txt; if you do so, the noindex tag won't be seen and the page might still be indexed.
- Use structured data.
- Follow the Google Webmaster Guidelines.
- Read our SEO starter guide and advanced user guide for more tips.
What is a “document”?
Internally, Google represents the web as an enormous set of documents. Each document represents one or more web pages. These pages are either identical or very similar but are essentially the same content, reachable by different URLs. The different URLs in a document can lead to the same page (for instance, example.com/dresses/summer/1234 and example.com?product=1234 might show the same page), or the same page with small variations intended for users on different devices (for example, example.com/mypage for desktop users and m.example.com/mypage for mobile users).
Google chooses one of the URLs in a document and defines it as the document's canonical URL. The document's canonical URL is the one that Google crawls and indexes most often; the other URLs are considered duplicates or alternates, and may occasionally be crawled, or served according to the user's request. For instance, if a document's canonical URL is the mobile URL, Google will still probably serve the desktop (alternate) URL for users searching on the desktop.
Most reports in Search Console attribute data to the document's canonical URL. Some tools (such as the URL Inspection tool) support testing alternate URLs, but inspecting the canonical URL provides information about the alternate URLs as well.
You can tell Google which URL you prefer to be canonical, but Google may choose a different canonical for various reasons.
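The idea of grouping URLs into documents can be illustrated with a toy model. The sketch below groups pages whose normalized text is identical and picks one URL per group as canonical. It is only an assumption-laden illustration: real duplicate detection also handles pages that are very similar rather than identical, the shortest-URL fallback is an arbitrary choice made for this example, and this toy always honors the owner's preference, whereas Google may choose a different canonical. The names `Document` and `group_into_documents` are hypothetical.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Document:
    canonical: str
    alternates: list = field(default_factory=list)


def group_into_documents(pages, preferred=()):
    """Group URLs with the same normalized content into documents.

    pages: dict mapping URL -> page text.
    preferred: URLs the site owner declared as canonical (treated as a hint).
    """
    groups = defaultdict(list)
    for url, text in pages.items():
        # Normalize case and whitespace so trivial differences are ignored.
        normalized = " ".join(text.lower().split())
        fingerprint = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[fingerprint].append(url)

    documents = []
    for urls in groups.values():
        hinted = [u for u in urls if u in preferred]
        # Arbitrary fallback for this sketch: the shortest URL wins.
        canonical = hinted[0] if hinted else min(urls, key=len)
        documents.append(Document(canonical, [u for u in urls if u != canonical]))
    return documents


PAGES = {
    "example.com/dresses/summer/1234": "Summer dress",
    "example.com?product=1234": "summer   dress",
    "example.com/about": "About us",
}

for doc in group_into_documents(PAGES, preferred=["example.com/dresses/summer/1234"]):
    print(doc.canonical, "alternates:", doc.alternates)
```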
Here is a summary of terms, and how they are used in Search Console:
- Document: A collection of similar pages. Has a canonical URL, and possibly alternate URLs, if your site has duplicate pages. URLs in the document can be from the same or different organizations (the organization is the root domain, for example, “google” in www.google.com). Google chooses the best URL to show in search results according to the platform (mobile/desktop), user language or location, and many other variables. Google discovers related pages on your site by organic crawling, or by site-implemented features such as redirects or link tags. Related pages on other organizations can only be marked as alternates if explicitly coded by your site (through redirects or link tags).
- URL: The URL used to reach a given piece of content on a site.
- Page: A given web page, reached by one or more URLs. There can be different versions of a page, depending on the user's platform (mobile, desktop, tablet, and so on).
- Version: One variation of the page, typically categorized as “mobile”, “desktop”, and “AMP” (although AMP can itself have mobile and desktop versions). Each version can have a different URL (example.com vs m.example.com) or the same URL (if your site uses dynamic serving or responsive web design, the same URL can show different versions of the same page) depending on your site configuration. Language variations are not considered different versions, but different documents.
- Canonical page or URL: The URL that Google considers as most representative of the document. Google always crawls this URL; duplicate URLs in the document are occasionally crawled as well.
- Alternate/duplicate page or URL: The document URL that Google might occasionally crawl. Google also serves these URLs if they are appropriate to the user and request (for example, an alternate URL for desktop users will be served for desktop requests rather than a canonical mobile URL).
- Site: Usually used as a synonym for a website (a conceptually related set of web pages), but sometimes used as a synonym for a Search Console property, although the property can be defined as only part of a site. A site can span subdomains (and even domains, for properly linked AMP pages).
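The relationship between canonical and alternate URLs in the terms above can be modeled with a couple of small data classes. This is a hypothetical sketch for illustration only: the `Version` and `Document` classes and the `url_to_serve` function are assumptions, and the selection rule considers only the device type, whereas the text above notes that language, location, and many other variables also play a part.

```python
from dataclasses import dataclass, field


@dataclass
class Version:
    url: str
    device: str  # for example "mobile" or "desktop"


@dataclass
class Document:
    canonical: Version
    alternates: list = field(default_factory=list)


def url_to_serve(document, device):
    """Return the URL whose version matches the device, else the canonical URL."""
    if document.canonical.device == device:
        return document.canonical.url
    for version in document.alternates:
        if version.device == device:
            return version.url
    return document.canonical.url


doc = Document(
    canonical=Version("m.example.com/mypage", "mobile"),
    alternates=[Version("example.com/mypage", "desktop")],
)
print(url_to_serve(doc, "desktop"))
print(url_to_serve(doc, "mobile"))
```

In this example the canonical URL is the mobile one, yet a desktop request is still answered with the desktop alternate.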
Serving results
When a user enters a query, our machines search the index for matching pages and return the results we believe are the most relevant to the user. Relevancy is determined by hundreds of factors, and we always work on improving our algorithm. Google considers the user experience in choosing and ranking results, so be sure that your page loads fast and is mobile-friendly.
Improve your serving
There are many ways to improve how Google serves the content of your page:
- If your results are aimed at users in specific locations or languages, you can tell Google your preferences.
- Be sure that your page loads fast and is mobile-friendly.
- Follow the Webmaster Guidelines to avoid common pitfalls and improve your site's ranking.
- Consider implementing Search result features for your site, such as recipe cards or article cards.
- Implement AMP for faster loading pages on mobile devices. Some AMP pages are also eligible for additional search features, such as the top stories carousel.
- Google's algorithm is constantly being improved; rather than trying to guess the algorithm and design your page for that, work on creating good, fresh content that users want, and following our guidelines.