That is to say, with a whole bunch of wobbly and spinny bits whose purpose nobody quite understands, but without which the mechanism suddenly fails in spectacularly unpredictable ways.
You guessed it… Regular Expressions.
The greatest, most terrible tool ever invented for quick-and-dirty validation in web front-ends around the Interspaces.
I’ve been working to improve the first-pass URL validation logic in the web front-end. I started by trying to read the existing regex, but it looked like a cat had mashed random symbol keys for about 200 characters, and I knew it wasn’t allowing all the URLs we’d like to accept.
I decided to go back to first principles: RFC 3986 – URI Generic Syntax. The first shock was learning that the following is a perfectly legal URL:
And I haven’t even used Unicode characters anywhere in that example yet.
First, the temptation is to flip to the back of the RFC, translate the ABNF grammar into a regex, and be done with it. Alas, I didn’t think I could accurately transcribe that many permutations without slipping up… and regexes are hard enough when you have a clear idea of exactly what you are parsing.
Second came the important realisation that the check doesn’t have to disallow everything that isn’t a valid URL. This is about helping users by catching the most important mistakes they might make. If anyone decides to actually use an IPv6 literal as a host identifier, it really isn’t important to check whether exactly the right number of hex words were used.
So, squinting just right at the RFC, it is easy enough to come to the following right-to-almost-right rules:
- The character class [\w\$-\.!;=@~] is a great approximation of the permissible alphabet for most of the textual parts of a URL; in some places it allows slightly too much, but it rejects all the characters that really do not belong.
- “#” is only permitted exactly once, after which all further text is considered the fragment identifier.
- “?” is not permitted until the query portion at the end of the URL, but can occur as many times after that as you want.
- Being lenient about square brackets makes it easier to capture the part between the “//” and the first following “/”; making the expression more specific mainly helps break the results down into more logical parts.
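A footnote to that first rule: inside a character class, \$-\. is a range, not three separate characters, which is part of why the class is only an approximation. A quick sketch (Python here purely for illustration) of exactly which printable characters it admits:

```python
import re
import string

# The approximate "URL alphabet" from the rules above. Note that \$-\.
# inside a character class is a RANGE from '$' (0x24) to '.' (0x2E), so it
# also admits % & ' ( ) * + , - in between.
ALPHABET = re.compile(r"[\w\$-\.!;=@~]")

allowed = "".join(c for c in string.printable if ALPHABET.match(c))
# '%' and '+' slip in via the range; '/', '?', '#' and ':' stay out.
```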
What I have landed on for now is the following (finessed so that the capturing groups try to catch the logical parts of a URL):
(?:(https?|ftp):)?          # URL Scheme Identifier: http, https, ftp
(?:\/\/                     # Literal //
  ([\w\$-\.!:;=~]*@)?       # Followed by optional username:password@
  ([\w\$-\.!;=~]*           # Followed by hostname
  |\[[a-fA-F0-9\:\.]*\])    # Or IPv6 address
  (\:\d*)?                  # Followed by optional :port
|\/[\w\$-\.!;=@~]+          # Or literal / and a path segment
|[\w\$-\.!;=@~]+            # Or no slashes and a path segment
|)                          # Or... nothing at all!
((?:\/[\w\$-\.!:;=@~]*)*)   # Rest of the URL path
(\?[^#]*)?                  # Optional query: ?...
(#.*)?                      # Optional fragment: #...
- The scheme: http, https or ftp
- Either “//” followed by a host (authority), or otherwise the first part of the path
- The username:password@ of the authority, or nothing if absent
- The hostname from the authority
- The :port of the authority
- All of the URL path if there was a host (authority), or otherwise the remainder of the path after the first level
- The ?query portion of the URL
- The #fragment portion of the URL
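To sanity-check the grouping, here is the same pattern as a Python re.VERBOSE sketch (Python purely for illustration: the \/ escapes are unnecessary there, and the literal # in the fragment has to be escaped because # starts a comment in verbose mode; the sample URL is just an invented test value):

```python
import re

URL_RE = re.compile(r"""
    (?:(https?|ftp):)?          # 1: scheme
    (?://                       #    literal //
      ([\w$-.!:;=~]*@)?         # 2: optional username:password@
      ([\w$-.!;=~]*             # 3: hostname
      |\[[a-fA-F0-9:.]*\])      #    ...or IPv6 address
      (:\d*)?                   # 4: optional :port
    |/[\w$-.!;=@~]+             #    or literal / and a path segment
    |[\w$-.!;=@~]+              #    or no slashes and a path segment
    |)                          #    or nothing at all
    ((?:/[\w$-.!:;=@~]*)*)      # 5: rest of the URL path
    (\?[^#]*)?                  # 6: optional ?query
    (\#.*)?                     # 7: optional #fragment
""", re.VERBOSE)

m = URL_RE.fullmatch("https://user:pw@example.com:8080/a/b?x=1#frag")
# groups: 1 "https", 2 "user:pw@", 3 "example.com", 4 ":8080",
#         5 "/a/b", 6 "?x=1", 7 "#frag"
```

Note how the IPv6 alternative keeps the promise from the rules: http://[::1]:80/ passes, while nothing in the pattern will ever accept a space.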
Clearly, some more post-processing is needed to extract the actual values if you want them. And I strongly recommend using a proper Uri class if you really want to process the content, rather than just getting a quick yes/no on whether a URL seems plausibly valid.
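In Python, for instance, the standard library’s urllib.parse is that “proper Uri class” and does the real decomposition for you (again, just an illustration with a made-up URL):

```python
from urllib.parse import urlsplit

# A real parser gives you typed, decoded components for free.
parts = urlsplit("https://user:pw@example.com:8080/a/b?x=1#frag")

print(parts.scheme, parts.hostname, parts.port)   # https example.com 8080
print(parts.path, parts.query, parts.fragment)    # /a/b x=1 frag
```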
Next stop… email addresses – RFC 5322.
As agonising as all this sounds, even to me, I am actually having a great deal of fun right now.