Crawling is the process of collecting documents from the web. For search engines, crawling has two purposes: refreshing pages that have changed since the previous crawl, and discovering new pages to expand the index.
The major problem that crawlers face is the growth of the web. Should a search engine crawl new pages or refresh old ones? There are too many pages to crawl them all, so search engines must choose wisely. The priority is to crawl documents that change frequently and documents of high quality as often as possible.
Crawling Priority
Search engines assign a crawling priority to every page. Crawling priority is a number that denotes how important it is to crawl a page. Pages with a higher crawling priority are crawled before pages with a lower one.
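The ordering described above can be sketched with a priority queue. This is only an illustration, not how any real search engine implements its crawl queue: the function name `crawl_order`, the example URLs, and the priority values are all made up for the example.

```python
import heapq

def crawl_order(pages):
    """Return URLs ordered from highest to lowest crawling priority.

    `pages` maps a URL to a hypothetical priority number.
    """
    # heapq is a min-heap, so the priority is negated to pop the largest first.
    heap = [(-priority, url) for url, priority in pages.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
    return order

# Hypothetical pages and priorities, chosen only for illustration.
pages = {
    "example.com/archive/2009/old-post/": 0.1,
    "example.com/": 0.9,
    "example.com/news/": 0.6,
}
order = crawl_order(pages)
print(order)
```

The home page, which has the highest number, comes out of the queue first, and the low-priority archive page comes out last.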
Main Factors that determine Google's Crawling Priority
PageRank - pages with a higher PageRank have a higher crawling priority
Number of slashes ('/') in the URLs - pages with fewer slashes in their URLs have a higher crawling priority because they tend to change more often. In other implementations, Google uses the number of slashes ('/') in the URLs of the pages that link to a page. Getting a link from a page with a lot of slashes in its URL results in a lower crawling priority.
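To make the two factors concrete, here is a rough sketch that counts the slashes in a URL path and combines that count with PageRank into a single number. The formula in `toy_priority` is an assumption invented for this example; the actual way Google weighs these factors is not known, and the helper names are hypothetical.

```python
from urllib.parse import urlsplit

def slash_count(url):
    """Count the '/' characters in the path part of a URL."""
    # urlsplit only recognizes the host when a scheme is present,
    # so add one for bare URLs such as "mydomain.com/articles/".
    if "://" not in url:
        url = "http://" + url
    return urlsplit(url).path.count("/")

def toy_priority(pagerank, url):
    """Illustrative score only: higher PageRank raises it, more slashes lower it."""
    return pagerank / (1 + slash_count(url))

shallow = "mydomain.com/article_title/"
deep = "mydomain.com/articles/section/1/article/2/article_title/"

print(slash_count(shallow), slash_count(deep))
print(toy_priority(0.5, shallow) > toy_priority(0.5, deep))
```

With the same PageRank, the shallow URL gets the higher score under this toy formula, which matches the direction of the effect described above.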
New Sites and Crawling
It is really frustrating to release a new site and discover that, in the following three months, Google has crawled just 5% of its pages.
In order for a new page A to get crawled:
1. Google must crawl/index a page B that contains a link to page A
2. Google will discover a new page A sooner if page B itself has a high crawling priority (if you get a link from a low PageRank page, Google might crawl it 6 weeks later and only then find out about the existence of your site)
The worst situation happens when you have a new site with a lot of pages that are more than 2 clicks away from the home page. These new pages might get crawled months later because the pages that link to the third-level and deeper pages are also new: they have to be found and crawled first, and they themselves have a low crawling priority.
Tips to get a new site crawled faster
1. Start working on getting incoming links. That is the fastest way to get your home page crawled. Obtaining many incoming links will raise the PageRank of your home page, which will propagate to the second-level, third-level and deeper pages and move them up in the crawling queue.
2. Use an internal linking structure that minimizes the path (the number of links it takes to get) from your home page to the majority of your pages.
3. Tip 2 can be reinforced with site maps. Site maps can link directly to fourth-level, fifth-level and deeper pages.
4. Avoid having pages with too many slashes in the URLs, such as mydomain.com/articles/section/1/article/2/article_title/
5. On sites with a tree-linking structure such as directories, you can rotate the third-level categories displayed on your home page. Usually on such sites, the home page lists the main categories and under each main category there are links to some of its subcategories. Rotate the subcategory links so that each subcategory periodically gets a direct link from the home page.
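Tips 2 and 3 are about the number of clicks between the home page and everything else. One way to check this on your own site is a breadth-first search over the internal link graph. The sketch below assumes you already have the links as a dictionary; the function name `click_depths` and the example paths are hypothetical.

```python
from collections import deque

def click_depths(links, home):
    """Return the minimum number of clicks from `home` to each reachable page.

    `links` maps a page to the list of pages it links to.
    """
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            # The first visit in a breadth-first search is the shortest path.
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# A made-up tree-structured site without a site map.
site = {
    "/": ["/cat-a"],
    "/cat-a": ["/cat-a/sub"],
    "/cat-a/sub": ["/cat-a/sub/item"],
}
before = click_depths(site, "/")

# The same site with a site map linked from the home page.
site_with_map = dict(site)
site_with_map["/"] = ["/cat-a", "/sitemap"]
site_with_map["/sitemap"] = ["/cat-a/sub", "/cat-a/sub/item"]
after = click_depths(site_with_map, "/")

print(before["/cat-a/sub/item"], after["/cat-a/sub/item"])
```

In this made-up example the deepest page is three clicks from the home page without the site map and two clicks with it. Pages missing from the result are not reachable from the home page at all.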