Advanced Insights into Googlebot Crawling
Here are some interesting aspects of Googlebot's crawling of news websites that are useful to know when you want to optimise crawl efficiency.

A few months ago I wrote a guest newsletter for my friends Shelby & Jessie: Crawl Budget 101 for news publishers. If you haven't yet read that guest article, you should do so now, as this is a follow-up piece serving as a deeper dive into the intricacies of Googlebot.

Improving Googlebot Crawl Rate

A common question is how to improve the rate at which Googlebot crawls your content. This is a simple question with complicated answers, as the tactics you can employ vary depending on your circumstances and end goals.

First of all, let's dig a little into how Google decides which pages should be crawled. There is a concept called URL Importance which plays a big role in Google's crawl scheduling. Simplified, URLs that are seen as more important are crawled more often. So what makes for an important URL? Generally, two factors apply:
If a URL has many links pointing to it from other sources, and the content of the page served on that URL changes frequently (say, on a daily basis or more), then Google will likely choose to crawl that URL often. The homepage and key section pages of news websites tend to fit both these criteria. That's why news homepages and section pages are crawled very often, sometimes as much as several times a minute.

Google crawls these pages very aggressively because it wants to find newly published articles as soon as possible, so they can be indexed and served in Google's news-specific ranking elements. Users depend on Google to find the latest stories on developing news topics, which is why Google puts in extra effort to quickly crawl and index news articles.

So one way to improve crawling of your website is to increase the importance of your homepage and section pages. Get more links pointing to these pages, for example with site-wide top navigation that features your homepage and key sections. And make sure that your homepage and section pages prominently feature the newest articles as soon as they're published, so that Google knows to crawl often to find new articles.

It's All About The First Crawl

One aspect of Googlebot's crawling of news websites that's very important to understand is that Google doesn't quickly re-crawl already crawled article URLs. I believe there are underlying infrastructure reasons for this, which I explain in this talk I gave at YoastCon. In summary, I suspect Google has multiple layers of crawling, and its most urgent crawler (what I call its 'realtime crawler') will crawl new URLs almost as soon as they're available. However, it won't revisit those URLs once they've been crawled. Any subsequent re-crawls of URLs are done by a less urgent crawling system (the 'regular crawler').

This has a rather important implication for news: when you publish an article and it's available on your website, Google will crawl it almost immediately, and it will not be recrawled until hours or days later. So if you publish an article and then update it (change the headline, fix some typos, add some SEO magic, whatever), Google most likely has already crawled and indexed the first version of the article and will not see your changes until much later. At that stage the article isn't news any more and will have dropped out of Top Stories.

So you get one chance to ensure your article is properly optimised for maximum visibility in Google's news-specific ecosystem, and one chance only. And that is the moment you first publish it. This is why it's so incredibly important to make SEO part of your editorial workflow and ensure articles are optimised before they are published. Any improvements made to an article after its publication are unlikely to have any impact on the article's visibility. Unless, of course, you change the URL - because then Googlebot will treat it as an entirely new article.

Robots.txt as Rank Management

We see the robots.txt file as a mechanism to control crawling. By default, Googlebot (and other crawlers) assume that every publicly accessible URL is freely crawlable, and will do their best to crawl all URLs on a website. With robots.txt disallow rules, you can prevent crawlers from accessing URLs that match a specific pattern. For example, if you want to prevent crawling of all URLs that start with a particular path, you can add a disallow rule targeting that path pattern (see the sketch below).
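A minimal sketch of such a rule, assuming a hypothetical /search/ path (the specific path from the original example isn't preserved here):

User-agent: *
Disallow: /search/

The wildcard user-agent applies the rule to all compliant crawlers, and the disallow line blocks any URL whose path begins with /search/.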
However, there is an often overlooked additional dimension to robots.txt disallow rules: they are also a mechanism to prevent ranking in specific Google verticals.

When Google crawls your website, it will do so primarily with the Googlebot Smartphone user-agent. There used to be a different user-agent for crawling approved news websites: Googlebot-News. But since 2011, news websites have been crawled with Google's regular Googlebot crawler, and Googlebot-News isn't used anymore. Yet, robots.txt disallow rules can still be specified for Googlebot-News:
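A sketch of what such a rule looks like; the site-wide Disallow: / is my assumption here, and a narrower path pattern works the same way:

User-agent: Googlebot-News
Disallow: /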
The effect of this rule is not to prevent crawling. It will have no impact on Googlebot's crawl activity on your site, because crawling doesn't happen with Googlebot-News. What actually happens is that this Googlebot-News disallow rule will stop your content from showing up in Google News. So it is, in effect, a rank-prevention mechanism. This goes against the purpose of the robots.txt web standard. It makes sense inasmuch as it supports the purpose of a historic but now deprecated user-agent, but technically it's not a proper use for robots.txt.

Stop LLMs from Crawling

As an additional note on robots.txt, there could be ways to prevent your content from being used by LLMs to train their generative AI. Google introduced the GoogleOther user-agent, which they strongly hint is what their LLM uses to crawl content. Additionally, we know that OpenAI uses the Common Crawl dataset to train their LLMs, and we can block the Common Crawl bot from our site with rules for the CCBot user-agent. So, theoretically, with these disallow rules we can prevent some LLMs from being trained on our website's content:
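A sketch of those rules, using the GoogleOther and CCBot user-agent tokens mentioned above; the blanket Disallow: / is my assumption and could be narrowed to specific paths:

User-agent: GoogleOther
Disallow: /

User-agent: CCBot
Disallow: /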
This is not foolproof of course, and mostly theoretical. In practice, LLMs have been harvesting the web's copyrighted content for years to train themselves on, with little to no advance warning that this was happening. There is some noise being made about creating specific blocking mechanisms for large language models and other AI systems, but nothing has yet emerged that definitively prevents your content from being used to build someone else's proprietary AI product (except locking it behind a very hard paywall).

Google Detects Site Changes

One element of Google's crawl scheduling system is a clever piece of engineering, intended to allow Google to quickly come to grips with big changes on websites. Whenever Google detects that there have been major changes on a website - for example, a new section has launched or the whole CMS has changed with new designs and content - the crawl rate will temporarily increase to enable Google to find all changes as fast as possible and update its index accordingly. This is reflected in the Crawl Stats report as a spike in crawl requests. The bigger the changes, the longer the spike in crawl requests can last. After a while, Googlebot's crawl activity will go back down to normal levels once Google is confident its index has been sufficiently updated. You'll see such crawl spikes whenever a site migrates to a new domain, when URLs are updated site-wide, and/or when there is a significant change to the underlying codebase of the website.

Crawl Challenges

In addition to the frequent crawl issues I outlined in my original guest article, there are some crawl challenges that many news publishers struggle with.

Internal Links with Parameters

One issue I regularly see is websites marking up internal links with URL parameters (also known as query strings) that allow them to track when the link is clicked. This is a terrible idea and utterly self-defeating.

First, because these links with tracking parameters are new URLs, Google will crawl them whenever it sees them. This can consume quite a lot of crawl effort on Google's end. Yet, these URLs aren't new content in any way; they're just the same articles, with tracking parameters added to their URLs. So there is no new content for Google to index, and crawl effort spent on these URLs is wasted.

Secondly, these URLs will of course have a canonical tag pointing back to the clean URL, which is a canonicalisation signal. You know what else is a (very strong) canonicalisation signal? Internal links. When a website links to its internal URLs with tracking parameters, Google sees those internal links as canonicalisation signals and may choose to index the URL with the tracking parameter. That URL can be shown in Google's results, and when users click on those results the visit will be registered in your web analytics as an internal click - not a Google click. Which, of course, makes the whole point of tracking your internal clicks with URL parameters entirely unreliable, and rather silly.

So please, I implore you, do not use tracking parameters on internal links. There are other, much more effective ways to monitor how users move through your website that don't rely on creating ridiculous amounts of crawl waste and introducing entirely unreliable data into your web analytics.

Old Content & Pagination

Here are two common questions about crawl optimisation: should we delete old articles that no longer drive traffic, and should we limit how deeply Google can crawl our paginated archives?
Both revolve around the same underlying question: should we keep old content?

Personally, I am not in favour of bluntly deleting older articles. Those news stories on your website from many years ago might not drive traffic, but they serve an important purpose: they show your topic authority. When you start deleting old articles, you are erasing part of your journalistic history and risk undermining the perceived authority you have on topics that you cover regularly. When Google sees a topic page on your website that has only 10 articles visible, while another website might have the same topic page with over 200 visible articles, guess which one Google will see as the more authoritative publisher on that topic?

We want Google to see a substantial number of articles on a topic page to prove that you have a history of writing content on that topic. But we also don't want Google to spend huge amounts of effort crawling old pages that don't drive traffic. Rest assured, Google is quite smart about prioritising its crawling and unlikely to waste precious crawl resources on pages that haven't changed in years. So it's probably not an issue in the first place.

If you want to limit crawling of older content, you can restrict pagination on your topic pages to, say, 10 pages and serve a 404 on page 11 and beyond. That means you will end up with orphaned articles once they drop off the 10th page, but as Google never actually forgets a URL, this isn't a particularly big issue. Alternatively, you can implement pagination with a single 'Next Page' link at the bottom of the topic page and keep paginating for as long as there are articles to show. With only one pagination link on every paginated page, the crawl priority of deeper paginated pages will decrease (thanks to the PageRank damping effect) and Google will de-prioritise crawling those URLs accordingly.

In short, don't worry too much about Google's crawl effort on older URLs. It's usually an imagined problem and rarely a real one.

GSC Crawl Stats

Most of you will have your website verified as a property in Google Search Console, and you will see the Crawl Stats report for that main website - i.e. https://www.example.com. But have you verified the entire domain in GSC? With domain-wide verification, you get to see crawl stats reports in GSC for all subdomains associated with the main domain. This can give you much more detailed insights into Googlebot's crawling of your entire website, plus sometimes you find some unexpected subdomains in there that Google may be crawling (for example, secret staging sites that Google probably shouldn't be crawling).

NESS 2023

In my previous newsletter I announced the 2023 edition of the News & Editorial SEO Summit, our virtual conference dedicated entirely to SEO for news publishers. We were incredibly proud of the previous two editions, for which we managed to get truly awesome speakers on board. And once again, somehow, we pulled together another epic speaker roster with some huge names from SEO and publishing.

This year, the News & Editorial SEO Summit will take place on Wednesday 11th and Thursday 12th October 2023. We're in the process of finalising the schedule and will serve up a range of amazing talks on AI & SGE, algorithm updates, topic authority, technical SEO, content syndication, SEO in the newsroom, and much more. Get your tickets now, as early bird prices will only last until the end of August.
Miscellanea

It's been a while since my last proper newsletter, so this round-up of interesting articles and resources will be a long one:

Podcasts:
Interesting Articles:
Official Google Docs:
The latest in SEO:
Lastly, I've finally given up on the churning cesspool that Twitter/X has become. I'm sure I'll miss the audience I built up over there, but we all have limits to what we'll endure. Currently my active socials are LinkedIn, Mastodon, and Bluesky. We'll see what else may come around that provides meaningful engagement without the overwhelming stench of toxicity, and that doesn't serve as a privacy-invading surveillance platform channelling sensitive data to questionable entities (which is why I am not and will never be on any Meta-owned property).

As always, thanks for reading and subscribing. Leave a comment if you have any thoughts or questions about Googlebot's crawling, and I'll see you at the next one. If you liked this article from SEO for Google News, please share it with anyone you think may find it useful.