When "index" is mentioned in relation to the web, the term usually refers to a search engine index, most often Google's. In this case, the index is the huge amount of data that the search engine has stored on its own servers for billions of URLs in order to generate suitable search results.
How does a website get indexed?
At the very beginning of the Internet, backlinks were the most important way - and one of the few - of finding and visiting other websites. For this reason, universal link directories were created to give users an overview of websites on all possible topics.
One of the most famous directories was DMOZ. As search engines, primarily Google, grew in importance, link directories became less and less relevant. Back then it was important to register your website in the various search engine indexes so that it could appear in search results. Once the website was registered, the so-called search engine robot received an order to crawl it, i.e. to capture its content.
Registration is not needed (anymore) - the bot comes by itself
Data such as the layout, a screenshot and relevance information were then stored on the servers of the respective search engine. To keep the results up to date, the robot visited the website regularly. This is still the case today.
However, search engine robots are now so active that it is usually enough to put a website online or, for example, to link to it on a social network for it to be indexed by search engines.
Google first changed its indexing speed in 2003 with the so-called "Fritz Update", after which the index was updated daily. With further updates to relevance and real-time search, Google gained the ability to adjust its search index almost in "real time" when needed.
Which pages/content/media are indexed?
Due to the different types of vertical search, different indexes are also generated by different bots. Google, for example, has its own crawlers for news, images and mobile content. Besides text content, images, videos, URLs and sound recordings also end up in search engine indexes.
Task of the algorithms
If you imagine a search engine index as a giant mountain of data measured in terabytes, you can appreciate how difficult it is to sort this mass of data. In search engines, this sorting is done by algorithms. They decide which content is displayed for which search query. The algorithms determine a website's relevance to a search query based on more than 300 different factors (backlinks are still the most important). The index data provides the basis for this.
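As a toy illustration of the principle (not Google's actual algorithm, whose factors are proprietary), a minimal sketch of how an index maps terms to documents and how a query could be answered from it:

```python
# Toy inverted index: maps each term to the set of documents containing it.
# Illustrative sketch only - real search engine indexes and ranking
# algorithms are vastly more complex.

def build_index(docs):
    """docs: dict of doc_id -> text. Returns term -> set of doc_ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """Return doc_ids containing every query term (simple AND search)."""
    results = None
    for term in query.lower().split():
        docs = index.get(term, set())
        results = docs if results is None else results & docs
    return results or set()

docs = {
    "page1": "search engine index basics",
    "page2": "how a search engine crawls the web",
    "page3": "cooking recipes",
}
idx = build_index(docs)
print(sorted(search(idx, "search engine")))  # ['page1', 'page2']
```

A real engine would then rank the matching documents by relevance signals; here the index only answers which pages match at all.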
Excluding a website from indexing
If you don't want your web content to be indexed by search engines, you can integrate the following meta tag in the <head> area of the corresponding page:

<meta name="robots" content="noindex">

You can also block the crawlers of individual search engines from indexing. For example, if you want to prevent Googlebot from indexing, this meta specification should be used:

<meta name="googlebot" content="noindex">
In addition, there is the possibility that the crawler does not index the page but still follows the links on it. Then this meta specification should be used:

<meta name="robots" content="noindex, follow">
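The effect of such directives can also be checked programmatically. A minimal sketch using Python's standard html.parser to extract robots meta directives from a page's HTML (the class name is mine, not part of any official API):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives of <meta name="robots"> (or googlebot) tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("name", "").lower() in ("robots", "googlebot"):
            self.directives.extend(
                d.strip().lower() for d in attrs.get("content", "").split(","))

html = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
parser = RobotsMetaParser()
parser.feed(html)
print(parser.directives)  # ['noindex', 'follow']
```

In practice you would feed the parser the fetched HTML of the page you want to audit.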
In addition, Google webmaster tools offer the option to remove individual pages of a website from the index. This requires a valid Google account and a verified website.
Impact on search depth and the ease of indexing
The prerequisite for a website to be indexed is that it can be crawled. Webmasters therefore have to make sure that all content that should be indexed remains easily accessible to robots. Flash content or JavaScript, for example, are poorly suited to simple indexing.
It is also recommended to keep the menu structure or page hierarchy as flat as possible so that the bot can make the best use of the time available to crawl the website. One of the factors that determines the length and depth of a crawl is the PageRank introduced by Google: it can be assumed that the depth and duration of the crawl depend on the PageRank level - the higher the rank, the better.
Define indexed content with the Searchmetrics Suite
If you want to know how many URLs of your domain have been indexed by search engines, you can find this information in the webmaster tools of Google or Bing. This requires a Google or Microsoft account, depending on the provider. A quicker alternative that anyone can use is the so-called site query: the word "site" followed by a colon is placed before the URL in the search bar of Google or Bing.
Example: "site:samplepage.com"
Web pages indexed for this URL are then displayed as results.
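A site query is just an ordinary search with a site: prefix, so the query URL can be built with Python's standard library (the result count must still be read from the search page itself):

```python
from urllib.parse import urlencode

def site_query_url(domain, engine="https://www.google.com/search"):
    """Build a search URL for a site: query on the given domain."""
    return engine + "?" + urlencode({"q": f"site:{domain}"})

url = site_query_url("samplepage.com")
print(url)  # https://www.google.com/search?q=site%3Asamplepage.com
```

Note that site-query result counts are rough estimates; the webmaster tools figures are more reliable.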
Alternatively, the Site Experience in the Searchmetrics Suite also offers the option to query the number of indexed URLs of a domain. The Index vs. Noindex section provides an overview of all crawled URLs and their index and follow status.
The Searchmetrics Suite distinguishes four categories:
- Index-Follow: These pages are indexed by the respective search engine and the crawler is instructed to follow all the links on the page. This is the default setting, even if you don't specify it in your code.
- Index-Nofollow: These pages are in the index, but the crawler is instructed not to follow the links on the page. A typical example is a blog article that should appear in the index, but whose links in the blog comments should not be followed, so as not to affect the ranking.
- Noindex-Follow: These pages are excluded from indexing, but the crawler is instructed to follow the links on the page. This makes sense, for example, for redundant category pages that should not be included in the index, but whose linked content the search engine should still reach.
- Noindex-Nofollow: These pages are excluded from indexing, and the crawler is instructed not to follow the links on the page. This instruction is useful, for example, for pages with user-generated content.
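The four categories reduce to two independent boolean flags. A small sketch (the helper name is mine) that classifies a robots meta content string, with defaults matching the index-follow default described above:

```python
def classify(robots_content=""):
    """Map a robots meta content string to (indexed, followed) flags.
    Absent directives default to index, follow."""
    directives = {d.strip().lower() for d in robots_content.split(",") if d.strip()}
    indexed = "noindex" not in directives
    followed = "nofollow" not in directives
    return indexed, followed

assert classify("") == (True, True)                     # Index-Follow (default)
assert classify("nofollow") == (True, False)            # Index-Nofollow
assert classify("noindex, follow") == (False, True)     # Noindex-Follow
assert classify("noindex, nofollow") == (False, False)  # Noindex-Nofollow
```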
The overview chart is followed by a table in the bottom half of the page that lists the individual crawled pages and provides detailed information on their indexing and follow status. It can also display the number of outgoing/incoming links, as well as the SPS and CheiRank.