How to Find All Current and Archived URLs on a Website

There are many good reasons you might need to find all the URLs on a website, but your exact goal will determine what you're searching for. For instance, you might want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.

In this article, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.

Archive.org
Archive.org is an invaluable, donation-funded tool for SEO tasks. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limits:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limits mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
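
If you'd rather skip the scraping plugin, the Wayback Machine also exposes a public CDX API that returns archived URLs programmatically and isn't bound by the 10,000-URL UI cap. Here's a minimal sketch in Python; the endpoint and parameters come from the Internet Archive's CDX documentation, but the filtering choices (collapsing duplicate captures, dropping resource files) are my own assumptions about what you'd want to keep.

```python
import requests

# Wayback Machine CDX API: returns every archived capture for a domain.
# "collapse=urlkey" deduplicates repeat captures of the same URL.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

params = {
    "url": "example.com/*",   # swap in your own domain
    "output": "json",
    "fl": "original",         # only return the original URL field
    "collapse": "urlkey",
    "limit": 50000,
}

rows = requests.get(CDX_ENDPOINT, params=params, timeout=60).json()

# The first row is the header ("original"); the rest are one-item rows.
urls = [row[0] for row in rows[1:]]

# Drop obvious resource files (a rough heuristic, not exhaustive).
skip_extensions = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico")
page_urls = [u for u in urls if not u.lower().split("?")[0].endswith(skip_extensions)]

print(f"{len(page_urls)} candidate page URLs")
```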

Moz Pro
While you'd usually use a link index to find external pages linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
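
For very large sites, the Moz Links API can page through far more rows than a UI export. The sketch below assumes the v2 links endpoint with HTTP Basic authentication; treat the endpoint, parameter names, and response shape as assumptions to verify against Moz's current developer documentation before relying on it.

```python
import requests

# Assumed Moz Links API v2 endpoint -- verify against Moz's developer docs.
MOZ_ENDPOINT = "https://lsapi.seomoz.com/v2/links"
ACCESS_ID = "your-access-id"    # placeholder credentials
SECRET_KEY = "your-secret-key"

payload = {
    "target": "example.com/",   # swap in your own domain
    "target_scope": "root_domain",
    "limit": 50,
}

resp = requests.post(MOZ_ENDPOINT, json=payload, auth=(ACCESS_ID, SECRET_KEY), timeout=60)
resp.raise_for_status()

# Collect the linked-to pages on your site from each returned link record.
# The "results"/"target" field names are assumptions about the response shape.
target_urls = {link.get("target") for link in resp.json().get("results", [])}
print(sorted(target_urls))
```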

Google Search Console
Google Search Console offers several valuable tools for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.

Performance → Search results:


This export gives you a list of pages receiving search impressions. While the export itself is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
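
If you're comfortable with a little Python, the Search Analytics endpoint of that API pages well past the UI export cap. A minimal sketch, assuming you've already completed the OAuth setup and that the property is verified in your account:

```python
from googleapiclient.discovery import build

# Assumes `creds` is an authorized credentials object for the Search
# Console API (OAuth setup omitted for brevity).
service = build("searchconsole", "v1", credentials=creds)

site_url = "https://example.com/"   # your verified property
all_pages, start_row = [], 0

while True:
    response = service.searchanalytics().query(
        siteUrl=site_url,
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,      # API maximum per request
            "startRow": start_row,  # paginate until no rows come back
        },
    ).execute()
    rows = response.get("rows", [])
    if not rows:
        break
    all_pages.extend(row["keys"][0] for row in rows)
    start_row += len(rows)

print(f"{len(all_pages)} pages with impressions")
```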

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to build distinct URL lists, effectively surpassing the 100k limit. For instance, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they still provide valuable insights.
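
You can also pull the same report programmatically through the GA4 Data API, which makes the filtered-segment approach easy to script. A minimal sketch using the official google-analytics-data Python client; the property ID and the /blog/ filter are placeholders:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

# Assumes Application Default Credentials with access to the GA4 property.
client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property="properties/123456789",        # your GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    # Restrict to blog URLs, mirroring the segment built in the UI above.
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                match_type=Filter.StringFilter.MatchType.CONTAINS,
                value="/blog/",
            ),
        )
    ),
    limit=100000,
)

paths = [row.dimension_values[0].value for row in client.run_report(request).rows]
print(f"{len(paths)} blog paths")
```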

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Challenges:

Data size: Log files can be huge, and many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process (see the sketch after this list).
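
If you'd rather not reach for a dedicated log analyzer, a short script can get you a URL list. This sketch assumes the common Apache/Nginx combined log format and a simple user-agent substring check for Googlebot; adjust the regex and the filename to match your server's actual output.

```python
import re
from urllib.parse import urlsplit

# Matches the common Apache/Nginx combined log format; adjust if your
# server logs in a different format.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        m = LOG_LINE.search(line)
        if not m:
            continue
        # Keep GET requests from Googlebot; drop the query string.
        if m["method"] == "GET" and "Googlebot" in m["agent"]:
            paths.add(urlsplit(m["path"]).path)

print(f"{len(paths)} unique paths requested by Googlebot")
```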
Merge, and good luck
Once you've gathered URLs from all of these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
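
As a starting point, here's a minimal Jupyter-style sketch using pandas. The normalization rules (lowercasing the host, stripping fragments and trailing slashes) and the CSV file names are my own assumptions; adapt them to how your site actually treats URLs.

```python
from urllib.parse import urlsplit, urlunsplit

import pandas as pd

def normalize(url: str) -> str:
    """Normalize a URL so near-duplicates collapse together."""
    parts = urlsplit(url.strip())
    return urlunsplit((
        parts.scheme.lower() or "https",
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",
        parts.query,
        "",   # drop fragments entirely
    ))

# One CSV per source, each with a "url" column (file names are placeholders).
sources = ["archive_org.csv", "moz.csv", "gsc.csv", "ga4.csv", "logs.csv"]
frames = [pd.read_csv(path) for path in sources]

urls = pd.concat(frames, ignore_index=True)
urls["url"] = urls["url"].astype(str).map(normalize)
urls = urls.drop_duplicates(subset="url").sort_values("url")

urls.to_csv("all_urls_deduped.csv", index=False)
print(f"{len(urls)} unique URLs")
```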

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
