A recent side project of mine has been to write a scalable crawler which looks for broken resources (links, stylesheets, …) on a website. This project is meant to replace an existing crawler written in PHP with a more efficient implementation in golang.
Part of writing a crawler includes parsing URLs on pages. Thankfully golang has the url.Parse function which makes this job easy, though there are a couple of caveats to look out for.
<a href=""> tags have a space before the actual URL, which causes the golang URL parser to fail.
My suspicion is that this mostly happens on hand-written html pages.
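As a minimal sketch of the workaround, the snippet below trims surrounding whitespace before handing the attribute value to url.Parse. The helper name parseHref and the example URL are assumptions made for illustration, not part of the crawler itself.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseHref is a hypothetical helper: it strips leading and trailing
// whitespace from an href attribute value before parsing it.
func parseHref(href string) (*url.URL, error) {
	return url.Parse(strings.TrimSpace(href))
}

func main() {
	raw := "  https://example.com/page "

	// Parsing the raw attribute value; the leading space is expected
	// to make this fail for an absolute URL.
	if _, err := url.Parse(raw); err != nil {
		fmt.Println("raw parse failed:", err)
	}

	// Parsing the trimmed value.
	u, err := parseHref(raw)
	if err != nil {
		fmt.Println("trimmed parse failed:", err)
		return
	}
	fmt.Println("trimmed parse ok:", u.String())
}
```

strings.TrimSpace only removes whitespace at both ends, so anything inside the URL is left untouched.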
The second caveat, trailing slashes, is quite a big one since it isn’t obvious that a trailing slash can make such a difference when parsing (relative) URLs.
When a URL has a trailing slash and you combine it with a relative URL, the relative URL just gets appended to the original URL behind the slash.
However, if there’s no such trailing slash, the relative URL replaces the last path segment of the original URL.
https://example.com/foo/ --> bar.html # https://example.com/foo/bar.html
https://example.com/foo --> bar.html # https://example.com/bar.html
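The two cases can be reproduced with the ResolveReference method from net/url. This is only a small sketch; the helper name resolve is an assumption for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve is a hypothetical helper that resolves ref against base
// and returns the resulting absolute URL as a string.
func resolve(base, ref string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(r).String(), nil
}

func main() {
	bases := []string{
		"https://example.com/foo/", // trailing slash: bar.html is appended
		"https://example.com/foo",  // no trailing slash: "foo" is replaced
	}
	for _, base := range bases {
		abs, err := resolve(base, "bar.html")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s --> bar.html # %s\n", base, abs)
	}
}
```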
This can get tricky if a link to a website has no trailing slash, but the website redirects the user agent to a URL with a trailing slash.
In such a case it’s important to use the redirected URL to parse relative URLs, instead of the original request URL (in golang this would be the URL field of the Request struct attached to the response, which reflects the last request made after following redirects).
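A rough sketch of this, assuming a local test server that redirects /foo to /foo/: the helper names resolveAgainstFinalURL and newRedirectingServer are made up for this example, and the httptest server only stands in for a real website.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// resolveAgainstFinalURL (hypothetical helper) fetches pageURL and resolves
// href against the URL of the last request, i.e. after any redirects.
func resolveAgainstFinalURL(client *http.Client, pageURL, href string) (*url.URL, error) {
	resp, err := client.Get(pageURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	ref, err := url.Parse(href)
	if err != nil {
		return nil, err
	}
	// resp.Request is the request that produced this response, so its URL
	// is the redirected one rather than the URL originally requested.
	return resp.Request.URL.ResolveReference(ref), nil
}

// newRedirectingServer (hypothetical) starts a test server where /foo
// redirects to /foo/.
func newRedirectingServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/foo", func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/foo/", http.StatusMovedPermanently)
	})
	mux.HandleFunc("/foo/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newRedirectingServer()
	defer srv.Close()

	abs, err := resolveAgainstFinalURL(srv.Client(), srv.URL+"/foo", "bar.html")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(abs.Path)
}
```

Resolving against the originally requested URL instead would drop the foo segment and point at /bar.html.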