Why Articles Should Be Optimised Before Publishing
One of Google's quirks means that once an article has been crawled and indexed, any changes won't necessarily be picked up by Google until it's too late.

I have to start this newsletter with a disclaimer: much of what follows is speculation, based on my experiences and those of other SEOs I've spoken with over the years, and entirely unconfirmed by Google. So what follows is not 'established SEO knowledge' by any means. It's theory and hypothesis that matches observable facts, but could still be totally wrong. Of course I don't think it's wrong (otherwise I wouldn't be sharing it with all of you), but I'm not convinced it's 100% right either.

With that disclaimer out of the way, let's dig in.

The Scenario

Anyone who's worked in online publishing for any length of time will have first- or second-hand knowledge of something like this:

An article is published and contains an error. It may contain a blatant spelling mistake in the headline, a poor choice of phrase, a factual error, or something else. Immediately after it's been published, the article is amended and the error fixed. But, for some reason, Google has indexed the initial, erroneous version, and it just will not update its index and show the corrected article.

Hours pass, and the initial version of the article containing the error remains in Google's index and shows in search results. And then, finally, many hours later, Google re-indexes the article and the corrected version starts showing. At this stage the article is considered old for a news story: it no longer shows up in Top Stories, and has fallen down the rankings in Google News as well.

Sound familiar?

The Cause

To explain why Google is often slow to update its index and show an article's most recent version, we need to understand how its crawling systems work.
We tend to base our understanding of Googlebot on what we see in the relevant reports in Google Search Console: there's a mobile crawler, a desktop crawler, a page resource crawler, and some other miscellaneous crawlers looking at specific file types like images and videos. And of course Google's unruly AdsBot, which obeys its own set of rules.

But I don't believe this is an accurate representation of Googlebot's actual crawling system. I believe Googlebot, independent of its user-agents, is actually a tiered crawling system with at least three distinct crawling processes:

1. The Realtime Crawler
2. The Regular Crawler
3. The Legacy Crawler
Allow me to explain.

1. Realtime Crawler

The first and most aggressive crawling process is what I call the Realtime Crawler. This crawling process is focused on crawling VIPs: Very Important Pages. These VIPs are high-value webpages that have many inbound links, change very frequently, and are regularly and consistently shown on the first page of Google's results. VIPs are pages like popular ecommerce homepages (think Amazon.com and Etsy.com), classified portals (job boards, property sites, etc.), and other pages that are very popular and have a high turnover of content.

Chief among these VIPs are news website homepages and key section pages. Some news homepages are crawled by Googlebot as often as once every five seconds. This is because these news sites have a lot of incoming link value, they have top rankings in many Google search results, and there's a high probability that Google will find a new article whenever it crawls the site's homepage or one of its main section pages. Some publishing sites produce staggering amounts of content (more than 500 articles a day is not uncommon), and Google is eager to crawl and index this so it can make sure its news surfaces, like Top Stories, contain the latest stories.

So this Realtime Crawler, which crawls homepages and key section pages very frequently, is eager to find new URLs to crawl and send on to Google's indexing process. But, and this is crucial, once the Realtime Crawler finds a new URL and crawls it, it then passes that URL on to the second crawling process - the Regular Crawler. The Realtime Crawler doesn't revisit a newly discovered article. It is focused on crawling VIPs, and once it finds new URLs it will crawl them almost instantly but then promptly forget about them. It's then left to the Regular Crawler to recrawl those URLs and pick up any changes.
Such a division of crawl activity allows Google to optimise its Realtime Crawling process for speed, ensuring worthwhile new content is rapidly discovered and added to Google's index. At the same time, Google's second-tier Regular Crawling process doesn't have to be super fast and can carefully manage its crawl queue to focus on URLs that deserve to be recrawled.

2. Regular Crawler

The Regular Crawler is Google's main crawling process that does most of the work. The web has trillions of URLs, and the Regular Crawler's job is to decide which of those should be recrawled to check for changes. Many signals go into the Regular Crawler's crawl queue, and new URLs get added to the crawl queue (which is in fact multiple crawl queues with various purposes, focusing on different signals and running as multithreaded processes across multiple data centres) when they're first crawled. The Realtime Crawler will do a lot of discovery and send loads of new URLs to the Regular Crawler for crawling, and of course the Regular Crawler also does a lot of discovery itself. So these crawl queues - basically lists of URLs for Googlebot to crawl in sequence - are constantly being updated and changed as new URLs are added and new signals are taken into account.

I believe the Regular Crawler is also the crawling process that fetches page resources as part of Google's indexing rendering process.

The Regular Crawler is not as urgent as the Realtime Crawler. Once a newly published URL has been crawled and indexed, it is added to the Regular Crawler's crawl queue and will be recrawled at some stage - often many hours, if not days, after the initial crawl. The speed with which an already crawled URL, such as a news article, will be recrawled depends on many different signals.
These include the timestamp shown with the article on a site's homepage or section page, its <lastmod> value in relevant XML sitemaps, and whether or not the article is submitted in Google Search Console for reindexing.

3. Legacy Crawler

The third tier of crawl processes is what I call the Legacy Crawler. This is a crawling process that focuses on old and unimportant URLs. Google has been around for 25 years now, and it has crawled an astonishing number of URLs in that time. As part of its mission to make the world's information accessible, Google doesn't like to 'forget' URLs.

Think of what happens to a news story after it disappears from a site's homepage and section pages: it fades into the archives of the publisher, still accessible and part of the site's history, but no longer news and with no need to be crawled very often. Initially the Regular Crawler will revisit the URL, especially if changes have been made to the article. But after a while, a few months down the line, there's no point in continuing to revisit the article. If it's not evergreen and won't be updated, why recrawl it?

This is where the Legacy Crawling process comes in. Google will keep the article in its index, but at the same time will want to make sure the article still exists and hasn't changed. So the Legacy Crawler will occasionally recrawl the article, even when there are no signals that the article should be recrawled.

The Legacy Crawler also recrawls URLs that once served content but now give a 404 Not Found or 410 Gone error. Google wants to make sure these errors persist and the URL hasn't been reinstated, so it'll sometimes recrawl those old URLs to make sure there's still a Page Not Found error shown there.

The Legacy Crawler has its own crawl queue and a very low sense of urgency. It'll crawl at its leisure, checking if old pages still exist and ensuring URLs that have been known to Google for years are still live.
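To make the tiered model concrete, here's a toy sketch of such a scheduler in Python. To be clear: this is purely illustrative - the tier names, revisit intervals, and handoff logic are my own guesses at the behaviour described above, not Google's actual implementation.

```python
import heapq

# Hypothetical revisit intervals in seconds for each tier. The real
# values are unknown; these only capture the relative urgency.
TIER_INTERVALS = {
    "realtime": 5,            # VIPs: recrawled every few seconds
    "regular": 6 * 3600,      # ordinary URLs: recrawled after hours
    "legacy": 90 * 24 * 3600, # archive URLs: recrawled after months
}

class TieredCrawlQueue:
    """Toy model of a tiered crawl scheduler: each URL belongs to one
    tier, and its tier decides how soon it is due for a recrawl."""

    def __init__(self):
        self.now = 0
        self._heap = []  # entries: (next_crawl_time, url, tier)

    def add(self, url, tier):
        heapq.heappush(self._heap, (self.now, url, tier))

    def discover(self, url):
        # A new article found on a VIP: crawled once right away, then
        # owned by the slower regular tier (the realtime tier forgets it).
        self.add(url, "regular")

    def crawl_next(self):
        # Pop the most urgent URL, 'crawl' it, and reschedule it
        # according to its own tier's revisit interval.
        due, url, tier = heapq.heappop(self._heap)
        self.now = max(self.now, due)
        heapq.heappush(self._heap, (self.now + TIER_INTERVALS[tier], url, tier))
        return url, tier

q = TieredCrawlQueue()
q.add("https://news.example/", "realtime")       # VIP homepage
q.discover("https://news.example/article-123")   # newly published article
```

In this sketch the homepage keeps getting recrawled every few seconds, while the newly discovered article is handed to the slower regular tier and won't be revisited for hours - which is exactly the gap that lets an early error linger in the index.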
The Impact

The three distinct tiers of Googlebot crawling - Realtime, Regular, and Legacy - have a profound impact on news publishers and how they should approach their SEO.

When a news article is published, Google's Realtime Crawler will find it almost immediately. The delay between publishing and crawling is very short - usually a few minutes at most - after which the article will be indexed and can be shown in Top Stories.

Any changes made to an article after Google's initial crawling and indexing may not be picked up by Googlebot's Regular Crawler until much later. Hours could pass before the Regular Crawler decides to recrawl the article and Google's index is updated to reflect any changes made. At this stage the article isn't news any more. For its Top Stories boxes, Google has a strong preference for newer articles. An article that is hours old may not appear in Top Stories anymore if Google has many newer articles to show there instead. The article, with its updates and improvements, is relegated to the News tab or the Google News vertical, where it attains a fraction of the traffic it could've achieved in Top Stories, before finally disappearing from Google's news surfaces entirely.

One Chance

The consequence for news publishers is that you need to get an article's SEO right before it is published. You get one chance at achieving Top Stories prominence, and that chance is the moment you click 'publish'. Any changes or improvements made to an article after it's been published are not guaranteed to be picked up by Google in time to make a difference to its Top Stories visibility. In fact, chances are those changes won't be seen by Google until long after the article has ceased being news.

This is why SEO needs to be an integral part of your publishing workflow. Optimising an article for Google after it's been published is usually fruitless, as these optimisations won't be seen in time to make a difference to the article's performance.
No Chance?

Does that mean that once an article is published, there's no way to improve it for SEO? No opportunity to get a second chance at Top Stories? Not quite. We do have some methods at our disposal to improve a published article's recrawl rate.

First, we can send signals to Google that the article has changed, in the hope that the Regular Crawler revisits it with some urgency. These signals are:

1. An updated timestamp shown with the article on the site's homepage and section pages
2. An updated <lastmod> value for the article's URL in the relevant XML sitemaps
3. A manual reindexing request for the article's URL in Google Search Console
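One such signal, the article's <lastmod> value in the XML sitemap, is the most direct machine-readable option. As a sketch (the URL and timestamps here are illustrative, not from any real site), a sitemap entry might look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://news.example/politics/article-123</loc>
    <!-- Update <lastmod> whenever the article meaningfully changes,
         so crawlers can prioritise the URL for a recrawl -->
    <lastmod>2023-10-03T14:30:00+00:00</lastmod>
  </url>
</urlset>
```

A <lastmod> that is newer than Googlebot's last crawl of that URL can nudge the Regular Crawler to move the article up its queue, though, as said, there's no guarantee.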
These signals can help Googlebot understand that a known article has changed and should probably be recrawled. But even with all these signals, there is no guarantee that Googlebot will actually recrawl and re-index the article. So we have one final trick up our sleeve: change the URL. If you absolutely want to guarantee Googlebot will recrawl and re-index the article, you can use the foolproof method of changing its URL. This will make Google see it as an entirely new article, and crawl and index it immediately.

Breaking News

When a major news event happens, the journalistic instinct is to immediately cover it, even if there is limited information available. Get the story out there, even if it's just a headline, a single line of text, and a 'More to follow' disclaimer. Then, as new facts are uncovered and verified, you expand the article and provide more coverage of the event.

But, if you accept the tiered model of Googlebot crawling, you will understand that this is not an ideal approach to breaking news. Instead you may need to find ways to update the breaking news story on your website in such a way that Googlebot will recrawl and reindex it constantly - by feeding those aforementioned signals into Google, or by republishing the story with a new URL every time there is a major update to report.

Don't Be First - Be Best

I would also emphasise that being the first to publish a breaking news story isn't necessarily good for that story's potential Google traffic. Consider the typical popularity curve of a news topic: once a news event happens, the public becomes aware of it gradually, and the volume of searches on Google will steadily increase until it peaks and drops off. When is the ideal time to publish your story on this news event? Is it as soon as possible after the event happens? By the time the search volume on this news event reaches peak popularity, your story is relatively old.
Publishers that were late to the party have newer stories, which often also contain more information. And as we know, Google has a strong preference for newer articles in Top Stories, especially if those newer articles are more detailed and contain better information.

So you may just want to hold off on publishing your breaking news story - or, if you absolutely have to publish immediately, you will want to publish a new article (or change the existing article's URL) when there's a major development to report. Over the years I have heard dozens of publishers complain that their breaking news loses out on traffic because Google prefers to fill its Top Stories boxes with later articles from other publishers. Some ways to prevent this from happening are to publish your article a bit later to coincide with the projected peak in search volume, keep your coverage updated and publish new articles on the event, and/or encourage Googlebot to recrawl and reindex your content.

What about Live Articles?

I believe articles that have the LiveBlogPosting structured data are probably an exception to the Realtime/Regular crawling division. Live coverage articles (that are recognised as such by Google) will remain in the Realtime Crawler's crawl queue for around 24 hours or until the coverage has ended (as defined in the coverageEndTime attribute). This allows Google to recrawl live articles frequently and ensure the article's presence in Top Stories is accurate and up to date.

You can use this for breaking news as well. Instead of a standard article, consider a Live article for a breaking story. That way you can be among the first to publish on a breaking news event and still reap the benefits of frequent recrawling and reindexing in Google's search ecosystem.

Want More Like This?

If you enjoyed this newsletter, I will be presenting more about Google's crawling and indexing systems at the 2023 News and Editorial SEO Summit next week.
In my talk I’ll give an overview of the current state of technical SEO, including details on Googlebot’s tiered crawling and (spoiler!) tiered indexing infrastructure, and the latest developments in tech SEO for publishing sites. I’ll be joined at NESS by some of the best and brightest in SEO and publishing:
Check out the full schedule on the NewsSEO.io website, where you can also buy tickets to this online event. If you haven't yet bought your NESS23 ticket, you can use the barry25 coupon code at checkout to get 25% off your purchase.

Miscellaneous

As usual I'll end with my customary roundup of interesting articles and resources published in the last while.

Official Google docs:
Interesting Articles:
Latest in SEO:
Self-promotion:
That's it for another edition. As always, thanks for reading and subscribing. Feel free to leave a comment if you have any questions, and I'll see you at the next one! If you liked this article from SEO for Google News, please share it with anyone you think may find it useful.