Web Crawlers - Top 10 Most Popular
When it comes to the World Wide Web, there are both bad bots and good bots. You definitely want to avoid bad bots as these consume your CDN bandwidth, take up server resources, and steal your content. On the other hand, good bots (also known as web crawlers) should be handled with care as they are a vital part of getting your content indexed by search engines such as Google, Bing, and Yahoo. In this blog post, we will take a look at the top ten most popular web crawlers.
What are web crawlers?
Web crawlers are computer programs that browse the Internet methodically and automatically. They are also known as robots, ants, or spiders.
Crawlers visit websites and read their pages and other information to create entries for a search engine's index. The primary purpose of a web crawler is to provide users with a comprehensive and up-to-date index of all available online content.
In addition, web crawlers can also gather specific types of information from websites, such as contact information or pricing data. By using web crawlers, businesses can keep their online presence (i.e. SEO, frontend optimization, and web marketing) up-to-date and effective.
Search engines like Google, Bing, and Yahoo use crawlers to properly index the pages they download so that users can find them faster and more efficiently when searching. Without web crawlers, there would be nothing to tell them that your website has new and fresh content. Sitemaps can also play a part in that process. So web crawlers are, for the most part, a good thing.
However, there can also be issues with scheduling and load, as a crawler might be constantly polling your site. This is where a robots.txt file comes into play. This file can help control the crawling traffic and ensure that it doesn't overwhelm your server.
Web crawlers identify themselves to a web server using the User-Agent request header in an HTTP request, and each crawler has its own unique identifier. Most of the time, you will need to examine your web server's access logs to see web crawler traffic.
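For example, a quick way to get a feel for which crawlers visit you most often is to tally User-Agent values straight from an access log. Below is a minimal Python sketch, assuming a typical combined log format where the User-Agent is the last quoted field; the log path and the keyword list are placeholders you would adapt to your own setup.

```python
import re
from collections import Counter

# Substrings that identify a few well-known crawlers (extend as needed).
CRAWLER_KEYWORDS = ["Googlebot", "bingbot", "Slurp", "DuckDuckBot",
                    "Baiduspider", "YandexBot", "Applebot"]

counts = Counter()

# "access.log" is a placeholder; point this at your own access log.
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        quoted = re.findall(r'"([^"]*)"', line)
        if not quoted:
            continue
        user_agent = quoted[-1]  # in the combined log format, the UA is the last quoted field
        for keyword in CRAWLER_KEYWORDS:
            if keyword.lower() in user_agent.lower():
                counts[keyword] += 1
                break

for crawler, hits in counts.most_common():
    print(f"{crawler}: {hits} requests")
```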
Robots.txt
By placing a robots.txt file at the root of your web server, you can define rules for web crawlers, such as allowing or disallowing certain assets from being crawled. Well-behaved web crawlers follow the rules defined in this file. You can apply general rules to all bots or get more granular and target a specific crawler by its User-Agent string.
Example 1
This example instructs all search engine robots not to index any of the website's content. This is defined by disallowing the root / of your website.
User-agent: *
Disallow: /
Example 2
This example achieves the opposite of the previous one. In this case, the instructions are still applied to all user agents. However, there is nothing defined within the Disallow instruction, meaning that everything can be indexed.
User-agent: *
Disallow:
To see more examples, make sure to check out our in-depth post on how to use a robots.txt file.
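If you want to check programmatically how a given crawler would interpret your rules, Python's standard library also includes a simple robots.txt parser. A minimal sketch (example.com is a placeholder domain):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (example.com is a placeholder).
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()

# Ask whether specific user agents may fetch specific paths.
print(parser.can_fetch("Googlebot", "https://example.com/no-index/your-page.html"))
print(parser.can_fetch("*", "https://example.com/"))
```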
Top 10 good web crawlers and bots
There are hundreds of web crawlers and bots scouring the Internet, but below is a list of 10 popular web crawlers and bots that we have collected based on ones that we see on a regular basis within our web server logs.
1. Googlebot
As the world's largest search engine, Google relies on web crawlers to index the billions of pages on the Internet. Googlebot is the web crawler Google uses to do just that.
Googlebot actually comes in two types of crawlers: a desktop crawler that imitates a person browsing on a computer, and a mobile crawler that imitates a person browsing on a smartphone such as an iPhone or Android device.
The user agent string of the request can help you determine which subtype of Googlebot made it. Googlebot Desktop and Googlebot Smartphone will most likely both crawl your website. However, both crawler types obey the same product token (user agent token) in robots.txt, so you cannot use robots.txt to selectively target either Googlebot Smartphone or Googlebot Desktop.
Googlebot is a very effective web crawler that can index pages quickly and accurately. However, it does have some drawbacks. For example, Googlebot does not always crawl all the pages on a website (especially if the website is large and complex).
In addition, Googlebot does not always crawl pages in real-time, which means that some pages may not be indexed until days or weeks after they are published.
User-Agent: Googlebot
Full User-Agent string: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Googlebot example in robots.txt
This example displays a little more granularity about the instructions defined. Here, the instructions are only relevant to Googlebot. More specifically, it is telling Google not to index a specific page (/no-index/your-page.html).
User-agent: Googlebot
Disallow: /no-index/your-page.html
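Keep in mind that any client can put Googlebot in its User-Agent string. Google's documented way to verify that a visitor really is Googlebot is a reverse DNS lookup on the requesting IP (the hostname should belong to googlebot.com or google.com), followed by a forward lookup to confirm the hostname resolves back to the same IP. A rough Python sketch of that check (the sample IP is only an illustration):

```python
import socket

def is_real_googlebot(ip_address: str) -> bool:
    """Verify a claimed Googlebot visit via reverse + forward DNS lookups."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip_address)  # reverse lookup
    except socket.herror:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        resolved_ips = socket.gethostbyname_ex(hostname)[2]  # forward lookup
    except socket.gaierror:
        return False
    return ip_address in resolved_ips

# Example with an address pulled from your own logs (this one is illustrative).
print(is_real_googlebot("66.249.66.1"))
```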
Besides its main web search crawler, Google actually has 9 additional web crawlers:
Web crawler | User-Agent string |
---|---|
Googlebot News | Googlebot-News |
Googlebot Images | Googlebot-Image/1.0 |
Googlebot Video | Googlebot-Video/1.0 |
Google Mobile (feature phone) | SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html) |
Google Smartphone | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) |
Google Mobile AdSense | (compatible; Mediapartners-Google/2.1; +http://www.google.com/bot.html) |
Google AdSense | Mediapartners-Google |
Google AdsBot (PPC landing page quality) | AdsBot-Google (+http://www.google.com/adsbot.html) |
Google app crawler (fetch resources for mobile) | AdsBot-Google-Mobile-Apps |
You can use the URL Inspection tool (the successor to the old Fetch as Google tool) in Google Search Console to test how Google crawls or renders a URL on your site. You can see whether Googlebot can access a page on your site, how it renders the page, and whether any page resources (such as images or scripts) are blocked to Googlebot.
You can also see the Googlebot crawl stats per day, the number of kilobytes downloaded, and the time spent downloading a page.
See Googlebot robots.txt documentation.
2. Bingbot
Bingbot is a web crawler deployed by Microsoft in 2010 to supply information to its Bing search engine. It replaced what used to be the MSN bot.
User-Agent: Bingbot
Full User-Agent string: Mozilla/5.0 (compatible; Bingbot/2.0; +http://www.bing.com/bingbot.htm)
Bing also has a tool very similar to Google's, called Fetch as Bingbot, within Bing Webmaster Tools. Fetch as Bingbot allows you to request that a page be crawled and shown to you as Bing's crawler would see it. You will see the page code as Bingbot would see it, helping you understand whether it sees your page as you intended.
See Bingbot robots.txt documentation.
3. Slurp Bot
Yahoo Search results come from the Yahoo web crawler Slurp and Bing's web crawler, as much of Yahoo Search is powered by Bing. Sites should allow Yahoo Slurp access in order to appear in Yahoo Mobile Search results.
Additionally, Slurp does the following:
- Collects content from partner sites for inclusion within sites like Yahoo News, Yahoo Finance, and Yahoo Sports.
- Accesses pages from sites across the Web to confirm the accuracy of and improve Yahoo's personalized content for its users.
User-Agent: Slurp
Full User-Agent string: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
See Slurp robots.txt documentation.
4. DuckDuckBot
DuckDuckBot is the web crawler for DuckDuckGo, a search engine that has become quite popular as it is known for privacy and not tracking you. It now handles over 93 million queries per day. DuckDuckGo gets its results from a variety of sources, including hundreds of vertical sources delivering niche Instant Answers, DuckDuckBot (their crawler), and crowd-sourced sites such as Wikipedia. They also have more traditional links in the search results, which they source from Yahoo! and Bing.
User-Agent: DuckDuckBot
Full User-Agent string: DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)
It respects WWW::RobotRules and originates from these IP addresses:
- 72.94.249.34
- 72.94.249.35
- 72.94.249.36
- 72.94.249.37
- 72.94.249.38
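Because a User-Agent string alone can be spoofed, one simple sanity check is to compare the requesting IP against the published list above. A minimal Python sketch; the list can change over time, so treat it as illustrative rather than authoritative:

```python
# DuckDuckBot source addresses from the list above (subject to change).
DUCKDUCKBOT_IPS = {
    "72.94.249.34",
    "72.94.249.35",
    "72.94.249.36",
    "72.94.249.37",
    "72.94.249.38",
}

def is_duckduckbot(user_agent: str, remote_ip: str) -> bool:
    """Accept a request as DuckDuckBot only if both the UA and the source IP match."""
    return "DuckDuckBot" in user_agent and remote_ip in DUCKDUCKBOT_IPS

# Example with values you would normally pull from the incoming request.
print(is_duckduckbot("DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)",
                     "72.94.249.34"))
```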
5. Baiduspider
Baiduspider is the official name of the Chinese Baidu search engine's web crawling spider. It crawls web pages and returns updates to the Baidu index. Baidu is the leading Chinese search engine, with an 80% share of mainland China's overall search engine market.
User-Agent: Baiduspider
Full User-Agent string: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Besides its main web search crawler, Baidu actually has 6 additional web crawlers:
Web crawler | User-Agent string |
---|---|
Image Search | Baiduspider-image |
Video Search | Baiduspider-video |
News Search | Baiduspider-news |
Baidu wishlists | Baiduspider-favo |
Baidu Union | Baiduspider-cpro |
Business Search | Baiduspider-ads |
Other search pages | Baiduspider |
See Baidu robots.txt documentation.
6. Yandex Bot
YandexBot is the web crawler for one of the largest Russian search engines, Yandex.
User-Agent: YandexBot
Full User-Agent string: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
There are many different User-Agent strings under which YandexBot can show up in your server logs.
7. Sogou Spider
Sogou Spider is the web crawler for Sogou.com, a leading Chinese search engine that was launched in 2004.
User-Agent
Sogou Pic Spider/3.0( http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou head spider/3.0( http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou Orion spider/3.0( http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou-Test-Spider/4.0 (compatible; MSIE 5.5; Windows 98)
8. Exabot
Exabot is the web crawler for Exalead, a search engine based in France. Exalead was founded in 2000 and has more than 16 billion pages indexed.
User-Agent
Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Exabot-Thumbnails)
Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
See Exabot robots.txt documentation.
9. Facebook external hit
Facebook allows its users to send links to interesting web content to other Facebook users. Part of how this works on the Facebook system involves the temporary display of certain images or details related to the web content, such as the title of the webpage or the embed tag of a video. The Facebook system retrieves this information only after a user provides a link.
One of their main crawling bots is Facebot, which is designed to help improve advertising performance.
User-Agent
facebot
facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
See Facebot robots.txt documentation.
10. Applebot
The computer technology brand Apple uses the web crawler Applebot, in particular for Siri and Spotlight Suggestions, to provide personalized services to its users.
User-Agent: Applebot
Full User-Agent string: Mozilla/5.0 (Device; OS_version) AppleWebKit/WebKit_version (KHTML, like Gecko) Version/Safari_version Safari/WebKit_version (Applebot/Applebot_version)
Other popular web crawlers
Apache Nutch
Apache Nutch is an open-source web crawler written in Java. It is released under the Apache License and is managed by the Apache Software Foundation. Nutch can run on a single machine, but it is more commonly used in a distributed environment. In fact, Nutch was designed from the ground up to be scalable and easily extensible.
Nutch is very flexible and can be used for various purposes. For example, Nutch can be used to crawl the entire Internet or only specific websites. In addition, Nutch can be configured to index pages in real-time or on a schedule.
One of the main benefits of Apache Nutch is its scalability. Nutch can be easily scaled to accommodate large volumes of data and traffic. For example, a large ecommerce website may use Apache Nutch to crawl and index its product catalog. This would allow customers to search for products on their website using the company's internal search engine.
In addition, Apache Nutch can be used to gather data about websites. Companies could use Apache Nutch to crawl competitor websites and collect information about their products, prices, and contact information. This information could then be used to improve their online presence. However, Apache Nutch does have some drawbacks. For example, it can be challenging to configure and use. In addition, Apache Nutch is not as widely used as other web crawlers, which means less support is available for it.
Screaming Frog
Screaming Frog SEO Spider is a desktop program (PC or Mac) that crawls websites' links, images, CSS, scripts, and apps from an SEO perspective.
It fetches key onsite elements for SEO, presents them in tabs by types, and allows you to filter for common SEO issues or slice and dice the data how you like by exporting it into Excel.
You can view, analyze and filter the crawl data as it's gathered and extracted in real-time from the simple interface.
The program is free for small sites (up to 500 URLs). Larger sites require a license.
Screaming Frog uses the Chromium WRS to crawl dynamic websites that are rich in JavaScript, such as Angular, React, and Vue.js. WordPress sitemap creation, XPath extraction, and site architecture visualization are other top features.
The platform serves corporations like Apple, Amazon, Disney, and even Google. Screaming Frog is also a popular tool among agency owners and SEOs who manage SEO for multiple clients.
Deepcrawl
Deepcrawl is a cloud-based web crawler that allows users to crawl websites and collect data about their structure, content, and performance.
Deepcrawl provides users with several features and options, including the ability to crawl JavaScript-based websites, customize the crawling process, and generate detailed reports.
One of Deepcrawl's most unique features is its ability to crawl websites built with JavaScript. This is possible because Deepcrawl uses a headless browser (i.e. Chrome) to render the website's content before crawling it.
This means that Deepcrawl can crawl and collect data about websites that other crawlers would not always be able to reach.
Beyond flexible APIs, Deepcrawl's data integrates with Google Analytics, Google Search Console, and other popular tools. This allows users to easily compare their website's data with their competitors. It also allows them to connect business data (e.g. sales data) with their website's data to get a complete picture of how their website is performing.
Deepcrawl works best for companies with large websites with a lot of content and pages. The platform is less well-suited for small websites or those that do not change very often.
There are three different products that Deepcrawl offers:
- Automation Hub: This product integrates with your CI/CD pipeline and automatically crawls your website with 200+ SEO QA testing rules.
- Analytics Hub: This product allows you to surface actionable insights from your website data and improve your website's SEO.
- Monitoring Hub: This product monitors your website for changes and alerts you when new issues arise.
Businesses use these three products to improve their website's SEO, monitor it for changes, and collaborate with dev teams.
Octoparse
Octoparse is a user-friendly client-based web crawling software that lets you extract data from all over the Internet. The program is particularly developed for people who are not programmers and has a simple point-and-click interface.
With Octoparse, you can run scheduled cloud extractions to extract dynamic data, create workflows to extract data from websites automatically, and use its web scraping API to access data.
Its IP proxy servers let you crawl websites without being blocked, and its built-in Regex feature cleans data automatically.
And with its pre-built scraper templates, you can start extracting data from popular websites like Yelp, Google Maps, Facebook, and Amazon within minutes. You can also build your own scraper if there isn't one readily available for your target websites.
HTTrack
You can use HTTrack's free software to download entire sites to your PC. With support for Windows, Linux, and other Unix systems, this open-source tool is used by millions.
HTTrack's website copier lets you download a website to your computer so that you can browse it offline. The program can also be used to mirror websites, meaning that you can create an exact copy of a website on your server.
The program is easy to use and has many features, including the ability to resume interrupted downloads, update existing websites, and create static copies of dynamic websites.
You can get the files, photos, and HTML code from its mirrored website and resume interrupted downloads.
While HTTrack can be used to download any type of website, it's particularly useful for downloading websites that are no longer online.
HTTrack is a great tool for anyone who wants to download an entire website or mirror a website. However, note that the program can also be used to make unauthorized copies of websites. As such, you should only use HTTrack if you have permission from the website owner.
SiteSucker
SiteSucker is a macOS application that downloads websites. It asynchronously copies the site's webpages, images, PDFs, style sheets, and other files to your local hard drive, duplicating the site's directory structure.
You can also use SiteSucker to download specific files from websites, such as MP3 files.
The program can be used to create local copies of websites, making it ideal for offline browsing.
It's also useful for downloading entire sites so you can view them on your computer without an Internet connection.
One drawback to SiteSucker is that it cannot handle JavaScript (though it can handle Flash). Nevertheless, it's still useful for downloading websites to your Mac.
Webz.io
Users can use the Webz.io web application to get real-time data by crawling online sources worldwide into various tidy formats. This web crawler allows you to crawl data and extract keywords in multiple languages based on numerous criteria from a diverse range of sources.
The Archive allows users to access historical data. Users can easily index and search the structured data crawled by Webz.io using its intuitive interface/API. You can save the scraped data in JSON, XML, and RSS formats. Plus, Webz.io supports up to 80 languages with its crawling data results.
Webz.io's freemium business model should suffice for businesses with basic crawling requirements. For businesses that need a more robust solution, Webz.io also offers support for media monitoring, cybersecurity threats, risk intelligence, financial analysis, web intelligence, and identity theft protection.
They even support dark web API solutions for business intelligence.
UiPath
UiPath is a Windows application that can be used to automate repetitive tasks. It's beneficial for web scraping, as it can extract data from websites automatically.
The program is easy to use and doesn't require any programming knowledge. It features a visual drag-and-drop interface that makes it easy to create automation scripts.
With UiPath, you can extract tabular and pattern-based data from websites, PDFs, and other sources. The program can also be used to automate tasks such as filling out online forms and downloading files.
The commercial version of the tool provides additional crawling capabilities, and this approach works very well when dealing with complicated UIs. The screen scraping tool can extract data from tables as individual words or groups of text, as well as blocks of text such as RSS feeds.
Also, you don't need any programming skills to create intelligent web agents, but if you're a .NET developer, you'll have complete control over the data.
Bad bots
While most web crawlers are benign, some can be used for malicious purposes. These malicious web crawlers, or "bots," can be used to steal information, launch attacks, and commit fraud. It has also been increasingly found that these bots ignore robots.txt directives and proceed directly to scan websites.
Some prominent bad bots are listed below:
- PetalBot
- SEMrushBot
- Majestic
- DotBot
- AhrefsBot
Protecting your site from malicious web crawlers
To protect your website from bad bots, you can use a web application firewall (WAF). A WAF is a piece of software that sits between your website and the Internet, filtering traffic before it reaches your site.
A CDN can also help to protect your website from bots. A CDN is a network of servers that deliver content to users based on their geographic location.
When a user requests a page from your website, the CDN will route the request to the server closest to the user's location. This can help to reduce the risk of bots attacking your website, as they will have to target each CDN server individually.
KeyCDN has a great feature that you can enable in your dashboard called Block Bad Bots. KeyCDN uses a comprehensive list of known bad bots and blocks them based on their User-Agent string.
When a new Zone is added, the Block Bad Bots feature is set to disabled. This setting can be set to enabled instead if you want bad bots to be blocked automatically.
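If you also want a basic safety net at the origin, the same idea can be applied in your application layer by rejecting requests whose User-Agent matches a deny list. Below is a minimal sketch written as Python WSGI middleware; the keywords are based on the bots listed above (Majestic's crawler identifies itself as MJ12bot), and the matching is deliberately simple, so treat it as a starting point rather than a replacement for a WAF or the Block Bad Bots feature.

```python
# Minimal WSGI middleware that returns 403 for a deny list of bot User-Agents.
BAD_BOT_KEYWORDS = ("petalbot", "semrushbot", "mj12bot", "dotbot", "ahrefsbot")

class BlockBadBots:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(keyword in user_agent for keyword in BAD_BOT_KEYWORDS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return self.app(environ, start_response)

# Example: wrap any WSGI application (shown here with a trivial app).
def hello_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]

app = BlockBadBots(hello_app)
```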
Bot resources
Perhaps you are seeing some user-agent strings in your logs that have you concerned. Caio Almeida also has a pretty good list on his crawler-user-agents GitHub project.
Summary
There are hundreds of different web crawlers out there, but hopefully, you are now familiar with a couple of the more popular ones. Again, you want to be careful when blocking any of these, as doing so could cause indexing issues. It is always good to check your web server logs to see how often they are crawling your site.
What's your favorite web crawler? Let us know in the comments below.