Towards Next Generation URLs


Joe Lima & Thomas Powell


For many years we have heard about the impending death of URLs that are difficult to type, remember and preserve. The use of URLs has actually improved little thus far, but changes are afoot in both development practices and Web server technology that should help advance URLs to the next generation.

Dirty URLs

Complex, hard-to-read URLs are often dubbed dirty URLs because they tend to be littered with punctuation and identifiers that are at best irrelevant to the ordinary user. URLs such as http://www.example.com/cgi-bin/gen.pl?id=4&view=basic are commonplace in today's dynamic Web. Unfortunately, dirty URLs have a variety of troubling aspects, including:

Dirty URLs are difficult to type.

The length, use of punctuation, and complexity of these URLs makes typos commonplace.

Dirty URLs do not promote usability.

Because dirty URLs are long and complex, they are difficult to repeat or remember and provide few clues for average users as to what a particular resource actually contains or the function it performs.

Dirty URLs are a security risk.

The query string which follows the question mark (?) in a dirty URL is often modified by hackers in an attempt to perform a front door attack into a Web application. The very file extensions used in complex URLs such as .asp, .jsp, .pl, and so on also give away valuable information about the implementation of a dynamic Web site that a potential hacker may utilize.

Dirty URLs impede abstraction and maintainability.

Because dirty URLs generally expose the technology used (via the file extension) and the parameters used (via the query string), they do not promote abstraction. Instead of hiding such implementation details, dirty URLs expose the underlying "wiring" of a site. As a result, changing from one technology to another is a difficult and painful process filled with the potential for broken links and numerous required redirects.

Why Use Dirty URLs?

Given the numerous problems with dirty URLs, one might wonder why they are used at all. The most obvious reason is simply convention -- using them has been, and so far still is, an accepted practice in Web development. This fact aside, dirty URLs do have a few real benefits, including:

They are portable.

A dirty URL generally contains all the information necessary to reconstruct a particular dynamic query. For example, consider how a query for "web server software" appears in Google — http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=Web+server+software. Given this URL, you can rerun the query at any time in the future. Though difficult to type, it is easily bookmarked.

They can discourage unwanted reuse.

The negative aspects of a dirty URL can be regarded as positive when the intent is to discourage the user from typing a URL, remembering it, or saving it as a bookmark. The intimidating look and length of a dirty URL can be a signal to both user and search engine to stay away from a page that is bound to change. This is often simply a welcome side effect, rather than a conscious access control policy — frequently nothing is done to prevent actual use of the URL by means of session variables or referring URL checks.

Cleaning URLs

The disadvantages of dirty URLs far outweigh their advantages in most situations. If the last 30 or 40 years of software development history are any indication of where development for the Web is headed, abstraction and data hiding will inevitably increase as Web sites and applications continue to grow in complexity. Thus, Web developers should work toward cleaner URLs by using the following techniques:

Keep them short and sweet.

The first path to better URLs is to design them properly from the start. Try to make the site directories and file names short but meaningful. Obviously, /products is better than /p, but resist the urge to get too descriptive. Having www.xyz.com/productcatalog doesn't add much meaning (if a user looks for a product catalog, they might well expect to find it at or near the top-level products page), but it does needlessly restrict what the page can reasonably contain in the future. It's also harder to remember or guess at. Shoot for the shortest identifiers consistent with a general description of the page's (or directory's) contents or function.

Avoid punctuation in file names.

Often designers use names like product_spec_sheet.html or product-spec-sheet.html. The underscore is often difficult to notice and type, and these connectors are usually a sign of a carelessly designed site structure. They are only required because the last rule wasn't followed.

Use lower case and try to address case sensitivity issues.

Given the last tip, you might instead name a file ProductSpecSheet.html. However, casing in URLs is troubling because depending on the Web server's operating system, file names and directories may or may not be case sensitive. For example, http://www.xyz.com/Products.html and http://www.xyz.com/products.html are two different files on a UNIX system but the same file on a Windows system. Add to this the fact that www.xyz.com and WWW.XYZ.COM are always the same domain, and the potential for confusion becomes apparent. The best solution is to make all file and directory names lowercase by default and, in a case sensitive server operating environment, to ensure that URLs will be correctly processed no matter what casing is used. This is not easy to do under Apache on Unix/Linux systems, although URL rewriting and spellchecking can help (discussed below).
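One way to picture case-insensitive handling on a case sensitive system is a lookup table keyed on the lowercased path. The following is only a sketch: the function names and the sample paths are hypothetical, and a real server would do this inside a rewriting or spellchecking module rather than in application code.

```python
def build_case_index(paths):
    """Map each lowercased path to the form that actually exists on disk."""
    index = {}
    for path in paths:
        # Keep the first spelling seen if two paths differ only by case.
        index.setdefault(path.lower(), path)
    return index


def resolve_path(requested, index):
    """Return the on-disk path for a request, ignoring case, or None."""
    return index.get(requested.lower())


# Hypothetical site contents, all lowercase as the tip recommends.
site_index = build_case_index(["/products.html", "/press/newproduct"])
```

With this table in place, resolve_path("/Products.html", site_index) and resolve_path("/PRODUCTS.HTML", site_index) both lead to the single lowercase file.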

Do not expose technology via directory names.

Directory names commonly or easily associated with a given server-side technology unnecessarily disclose implementation details and discourage permanent URLs. More generic paths should be used. For example, instead of /cgi-bin, use a /scripts directory, instead of /css, use /styles, instead of /javascript, use /scripts, and so on.

Plan for host name typos.

The reality of end user navigation is that around half of all site traffic is from direct type or bookmarked access. If users want to go to Amazon's web site, they know to type in www.amazon.com. However, accidentally typing ww.amazon.com or wwww.amazon.com is fairly easy if a user is in a hurry. Adding a few entries to a site's domain name service to map w, ww, and wwww to the main site, as well as the common www.site.com and site.com, is well worth the few minutes required to set them up.
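The DNS entries only get the mistyped host to the server; the site still has to decide which host names count as the main site. A minimal sketch of that decision follows, assuming a hypothetical domain xyz.com and treating only the prefixes named above (w, ww, www, wwww) plus the bare domain as aliases.

```python
# Host prefixes treated as typos of "www" (taken from the tip above).
WWW_TYPOS = {"w", "ww", "www", "wwww"}


def canonical_host(host, domain="xyz.com"):
    """Return the canonical www host for known aliases, or None otherwise."""
    host = host.lower().rstrip(".")
    if host == domain:
        return "www." + domain
    suffix = "." + domain
    if host.endswith(suffix):
        prefix = host[: -len(suffix)]
        if prefix in WWW_TYPOS:
            return "www." + domain
    # Unknown subdomains (for example a mail host) are left alone.
    return None
```

A request handler could redirect to the returned host whenever it differs from the one the user typed.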

Plan for domain name typos.

If possible, secure common "fat finger" typos of domain names. Given the proximity of the "z" and "x" keys on a standard computer QWERTY keyboard, it is no wonder Amazon also has contingency domains like amaxon.com. Google allows for such variations as gooogle.com and gogle.com. Unfortunately, many Web traffic aggregators will purchase the typo domains for common sites, but most organizations should find some of their typo domains readily available. Organizations with names that are difficult to spell, like "Ximed," might want to have related domains like "Zimed" or "Zymed" for users who know the name of the organization but not the correct spelling. The particular domains needed for a company should reveal themselves during the course of regular offline correspondence with customers.

Support multiple domain forms.

If an organization has many forms to its name, such as International Business Machines and IBM, it is wise to register both forms. Some companies will register their legal form as well, so XYZ, LLC or ABC, Inc. might register xyzllc.com and abcinc.com as well as primary domains. While it seems like a significant investment, if you use one of the new breed of low-cost registrars (like itsyourdomain.com), the price per year for numerous domains for a site is quite reasonable. Given alternate domain extensions like .net, .org, .biz and so on, the question arises -- where to stop? Anecdotally, the benefits are significantly reduced with new alternate domain forms (like .biz, .cc, and so on), so it is better to stick with the common domain form (.com) and any regional domains that are appropriate (e.g. co.uk).

Add guessable entry point URLs.

Since users guess domain names, it is not a stretch for users -- particularly power users -- to guess directory paths in URLs. For example, a user trying to find information about Microsoft Word might type http://www.microsoft.com/word. Mapping multiple URLs to common guessable site entry points is fairly easy to do. Many sites have already begun to create a variety of synonym URLs for sections. For example, to access the careers section of the site, the canonical URL might be http://www.xyz.com/careers. However, adding in URLs like http://www.xyz.com/career, http://www.xyz.com/jobs, or http://www.xyz.com/hr is easy and vastly improves the chances that the user will hit the target. You could even go so far as to add hostname remapping so that http://investor.xyz.com, http://ir.xyz.com, http://investors.xyz.com, and so on all go to http://www.xyz.com/investor. The effort made to think about URLs in this fashion not only improves their usability, but should also promote long term maintainability by encouraging the modularization of site information.
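The synonym idea can be sketched as two small lookup tables, one for guessable paths and one for guessable host names. The tables below reuse the example URLs from the paragraph above; the function name and the choice to return a redirect target (rather than serving the content directly) are assumptions made for illustration.

```python
CANONICAL_SITE = "http://www.xyz.com"

# Guessable paths mapped to the canonical section path.
PATH_SYNONYMS = {"/career": "/careers", "/jobs": "/careers", "/hr": "/careers"}

# Guessable host names mapped to a section of the main site.
HOST_SYNONYMS = {
    "investor.xyz.com": "/investor",
    "ir.xyz.com": "/investor",
    "investors.xyz.com": "/investor",
}


def entry_point(host, path):
    """Return the canonical URL to redirect to, or None if no synonym applies."""
    host = host.lower()
    if host in HOST_SYNONYMS:
        return CANONICAL_SITE + HOST_SYNONYMS[host]
    normalized = path.rstrip("/").lower() or "/"
    target = PATH_SYNONYMS.get(normalized)
    if target is not None:
        return CANONICAL_SITE + target
    return None
```

Keeping the synonyms in one table also documents which section names the site has committed to, which supports the maintainability point made above.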

Where possible, remove query strings by pre-generating dynamic pages.

Often, complex URLs like http://www.xyz.com/press/releasedetail.asp?pressid=5 result from an inappropriate use of dynamic pages. Many developers use server-side scripting technologies like ASP/ASP.NET, ColdFusion, PHP, and so on to generate "dynamic" pages which are actually static. For example in the previous URL, the ASP script drills press release content out of a database using a primary key of 5 and generates a page. However, in nearly all cases, this type of page is static both in content and presentation. The generation of the page dynamically at user view time wastes precious server resources, slows the page down, and adds unnecessary complexity to the URL. Some dynamic caches and content distribution networks will alleviate the performance penalty here, but the unnecessarily complex URLs remain. It is easy to directly pre-generate a page to its static form and clean its URL. Thus, http://www.xyz.com/press/releasedetail.asp?pressid=5 might become http://www.xyz.com/press/pressrelease5 or something much more descriptive like http://www.xyz.com/press/03-02-2003 -- or even better like http://www.xyz.com/press/newproduct. The issue of when to generate a page, either at request time or beforehand, is not much different than the question of whether a program should be interpreted or compiled.
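Pre-generation can be as simple as a script that walks the records once and writes one static file per record. The sketch below is illustrative only: the record list stands in for rows that would really come from the press release database, and the field names, template, and output layout are all hypothetical.

```python
import tempfile
from html import escape
from pathlib import Path

# Hypothetical records standing in for rows fetched from the press database.
PRESS_RELEASES = [
    {"slug": "newproduct", "title": "New Product", "body": "Placeholder text."},
]

PAGE_TEMPLATE = (
    "<html><head><title>{title}</title></head>"
    "<body><h1>{title}</h1><p>{body}</p></body></html>"
)


def pregenerate(releases, out_dir):
    """Write one static HTML file per release and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for release in releases:
        page = PAGE_TEMPLATE.format(
            title=escape(release["title"]), body=escape(release["body"])
        )
        target = out_dir / (release["slug"] + ".html")
        target.write_text(page, encoding="utf-8")
        written.append(target)
    return written


def demo():
    """Generate the sample pages into a scratch directory and list them."""
    with tempfile.TemporaryDirectory() as scratch:
        paths = pregenerate(PRESS_RELEASES, Path(scratch) / "press")
        return [path.name for path in paths]
```

The file on disk is newproduct.html; whether the public URL shows that extension depends on the extension-removal tip discussed later in the article.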

Rewrite query strings.

In the cases where pages should be dynamic, it is still possible to clean up their query strings. Simple cleaning usually remaps the ?, &, and + symbols in a URL to more readily typeable characters. Thus, a URL like http://www.xyz.com/presssearch.asp?key=New+Robot&year=2003&view=print might become something like http://www.xyz.com/presssearch.asp/key/New-Robot/year/2003/view/print. While this makes the page "look" static, it is indeed still dynamic. The look of the URL is a little less intimidating to users and may be more search engine friendly as well (search engines have been known to halt at the ? character). In conjunction with the next tip, this might even discourage URL parameter manipulation by potential site hackers who can't tell the difference between a dynamic page and a static one. The challenge with URL rewriting is that it takes some significant planning to do well, and the primary tools used for these purposes -- rule-based URL rewriters like mod_rewrite for Apache and ISAPI Rewrite for IIS -- have daunting rule syntax for developers unseasoned in the use of regular expressions. However, the effort to learn how to use these tools properly is well worth it.
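The remapping itself is mechanical, which is why rule-based rewriters can do it. As a rough illustration in Python rather than in mod_rewrite or ISAPI Rewrite rule syntax, the two hypothetical functions below convert the example URL to its path form and recover the parameters on the way back in. Replacing spaces with hyphens follows the example above and is lossy, so the reverse step deliberately leaves hyphens alone.

```python
from urllib.parse import parse_qsl, quote, unquote, urlsplit


def query_to_path(url):
    """Rewrite ?key=value&... into /key/value/... path segments."""
    parts = urlsplit(url)
    segments = []
    for key, value in parse_qsl(parts.query):
        segments.append(quote(key, safe=""))
        # parse_qsl has already turned "+" into a space; show it as a hyphen.
        segments.append(quote(value.replace(" ", "-"), safe=""))
    path = parts.path.rstrip("/")
    if segments:
        path += "/" + "/".join(segments)
    return f"{parts.scheme}://{parts.netloc}{path}"


def path_to_params(path, script="/presssearch.asp"):
    """Recover a parameter dict from the segments that follow the script name."""
    if not path.startswith(script):
        return {}
    tokens = [unquote(token) for token in path[len(script):].split("/") if token]
    # Pair up alternating tokens: key, value, key, value, ...
    return dict(zip(tokens[0::2], tokens[1::2]))
```

The server-side script would call something like path_to_params on the incoming path and then proceed exactly as it did with the original query string.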

Remove extensions from files in URL and source.

Probably the most interesting URL improvement that can be made involves the concept of content negotiation. Despite being a long-supported HTTP specification, content negotiation is rarely used on the Web today. The basic idea of content negotiation is that the browser transmits information about the resources it wants or can accept (MIME types preferred, language used, character encodings supported, etc.) to the server, and this information is then used, along with server configuration choices, to dynamically determine the actual content and format that should be transmitted back to the browser. Metaphorically, the browser and the server hold a negotiation over which of the available representations of a given resource is the best one to deliver, given the preferences of each side. What this means is that a user can request a URL like http://www.xyz.com/products, and the language of the content returned can be determined automatically -- resulting in the content being delivered from either a file like products-en.html for English speaking users or one like products-es.html for Spanish speakers. Technology choices such as file format (PNG or GIF, XHTML or HTML) can also be determined via content negotiation, allowing a site to support a range of browser capabilities in a manner transparent to the end user.

Content negotiation not only allows developers to present alternate representations of content but has a significant side effect of allowing URLs to be completely abstract. For example, a URL like http://www.xyz.com/products/robot, where robot is not a directory but an actual file, is completely legal when content negotiation is employed. The actual file used, be it robot.html, robot.cfm, robot.asp, etc., is determined using the negotiation rules. Abstracting away from the file extension details has two significant benefits. First, security is significantly improved as potential hackers can't immediately identify the Web site's underlying technology. Second, by abstracting the extension from the URL, the technology can be changed by the developer at will. If you consider URLs to be effectively function calls to a Web application, cleaned URLs introduce the very basics of data hiding.
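To make the negotiation step concrete, here is a deliberately simplified sketch of language selection based on the Accept-Language request header. It is not the algorithm used by mod_negotiation or any other server module; the function names are hypothetical, the variant file names come from the products example above, and only the primary language subtag and the q value are considered.

```python
def parse_accept_language(header):
    """Return language tags from the header, most preferred first."""
    choices = []
    for position, item in enumerate(header.split(",")):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:
            # Sort by descending quality, then by order of appearance.
            choices.append((-quality, position, tag.strip().lower()))
    return [tag for _, _, tag in sorted(choices)]


def negotiate(accept_language, variants, default="en"):
    """Pick the file for the best matching language, or the default variant."""
    for tag in parse_accept_language(accept_language):
        primary = tag.split("-")[0]
        if primary in variants:
            return variants[primary]
    return variants[default]


# Variant files for the abstract URL /products (names from the example above).
PRODUCT_VARIANTS = {"en": "products-en.html", "es": "products-es.html"}
```

A browser sending "es-MX,es;q=0.9,en;q=0.5" would receive products-es.html, while a browser asking only for a language the site lacks would fall back to the English file.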

URLs can be cleaned server-side using a Web server extension that implements content negotiation, such as mod_negotiation for Apache or PageXchanger for IIS. However, getting a filter that can do the content negotiation is only half of the job. The underlying URLs present in HTML or other files must have their file extensions removed in order to realize the abstraction and security benefits of content negotiation. Removing the file extensions in source code is easy enough using search and replace in a Web editor like Dreamweaver MX or HomeSite. Some tools like w3Compiler also are being developed to improve page preparation for negotiation and transmission. One word of assurance: don't jump to the conclusion that your files won't be named page.html anymore. Remember that, on your server, the precious extensions are safe and sound. Content negotiation only means that the extensions disappear from source code, markup, and typed URLs.

Automatically spell check directory and file names entered by users.

The last tip is probably the least useful, but it is the easiest to do: spell check your file and directory names. On the off chance that a user spells a file name wrong, makes a typo in extension or path, or encounters a broken link, recovery is easy enough with a spelling check. Given that the typo will start to generate a 404 in the server, a spelling module can jump in and try to match the file or directory name most likely typed. If file and directory names are relatively unique in a site, this last ditch effort can match correctly for numerous typos. If not, you get the 404 as expected. Creating simple "Did you mean X?"-style URLs requires the simple installation of a server filter like mod_speling for Apache or URLSpellCheck for IIS. The performance hit is not an issue, given that the correction filter is only called upon a 404 error, and it is better to deliver the intended page than to serve a 404 just to save a minor amount of work on error page delivery. In short, there is no reason this shouldn't be done, and it is surprising that this feature is not built into all modern Web servers.
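The matching step can be approximated with a generic fuzzy string comparison. The sketch below uses Python's standard difflib module and is only an analogy for what a server filter does; it makes no claim about how mod_speling or URLSpellCheck actually match names. The function name, the sample paths, and the 0.8 similarity cutoff are assumptions.

```python
import difflib


def suggest_path(requested, known_paths, cutoff=0.8):
    """Return the known path closest to a mistyped one, or None."""
    # Compare in lowercase so that case differences never block a match.
    lowered = {path.lower(): path for path in known_paths}
    matches = difflib.get_close_matches(
        requested.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


# Hypothetical list of paths that exist on the site.
KNOWN_PATHS = ["/products", "/press", "/careers"]
```

Called only from the 404 handler, as the tip suggests, a function like this costs nothing on successful requests, and a None result simply lets the normal 404 page go out.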

Conclusions

Most of the tips presented here are fairly straightforward, with the partial exception of URL cleaning and rewriting. All of them can be accomplished with a reasonable amount of effort. The result of this effort should be cleaned URLs that are short, understandable, permanent, and devoid of implementation details. This should significantly improve the usability, maintainability and security of a Web site. The potential objections that developers and administrators might have against next generation URLs will probably have to do with any performance problems they might encounter using server filters to implement them or issues involving search engine compatibility. As to the former, many of the required technologies are quite mature in the Apache world, and their newer IIS equivalents are usually explicitly modeled on the Apache exemplars, so that bodes well. As to the search engine concerns, fortunately, Google so far has not shown any issue at all with cleaned URLs. At this point, the main thing standing in the way of the adoption of next generation URLs is the simple fact that so few developers know they are possible, while some who do are too comfortable with the status quo to explore them in earnest. This is a pity, because while these improved URLs may not be the mythical URN-style keyword always promised to be just around the corner, they can substantially improve the Web experience for both users and developers alike in the long run.

Thomas Powell is founder of PINT, Inc. and a lecturer in the Computer Science department at University of California San Diego. His articles have appeared in several magazines and sites, including Network World, Internet Week and ZDNet. He has also published numerous books on Web technology and design, including the best-selling Web Design: The Complete Reference. Visit pint.com.

Joe Lima is the Director of Product Development for Port80 Software. He has worked for a variety of Internet, wireless and software development companies, specializing in research and development for server-centric technologies. Visit port80software.com.

Interesting article, but ...

Submitted by neuro on May 29, 2003 - 05:56.

I must admit, long and possibly messy URLs sometimes give me a pain in the head, but it's usually possible to clean some of them up off your own back - google for deamazonise to read my blog post about that.

The thing about showing .cgi, .pl, .py, .whatever at the end of a CGI or other interactive script/page is that it's hardly ever necessary. I never now put the file extension on the end of a script, although I retain .html, .pdf, .doc, .whatever at the end of *documents*, as it helps the reader determine *before* clicking what the link will do - will it fire up Acrobat? Flash? Quicktime? Real Player? Word? It helps :)

evolt.org is a good example of a site that doesn't expose the backend too much through the URL, or have very badly designed URLs.


An additional benefit of clean URLs

Submitted by port80 on May 29, 2003 - 11:08.

After the article was written we came upon an additional benefit of clean URLs. Though it was too late to add it to the article, we think it an important enough point to warrant a comment. It is this: URL abstraction allows one to arbitrarily extend a site's structure, without breaking the interface, by turning any leaf in that structure into a node, off of which new leaves and nodes can grow. For instance, a URL that terminates in /productname can start off pointing to a page called productname.html but can later point to a directory instead -- and thus to the default page in that directory (/productname/index.html). The site can now expand at this new node by adding to (but not breaking) the old URL scheme. This confirms and encourages the idea of the URL as the site's public interface. Our thanks to Tommy Sundstrom for pointing out this additional benefit of extensionless URLs. TAP JFL


Re: An additional benefit of clean URLs

Submitted by neuro on May 30, 2003 - 00:33.

The problem being that /productname and /productname/ can be two different things, or can cause problems in later life with specific web servers using more bandwidth to perform the redirect transition from /productname to /productname/ (e.g. Apache)


Sorting order: Dates and Numbers - Important too

Submitted by g1smd on May 30, 2003 - 06:51.


>> http://www.xyz.com/press/03-02-2003 <<

One usability thing that can help end users with dated material that they might save, as well as helping the developer when compiling and uploading the content, is simply to use the Year-Month-Day date order for naming the files, then the saved pages automatically sort into date order if you sort them in name order. [ISO 8601] [RFC 3339]

This also allows for the content structure to extend in an easily managed way. I've seen some sites with a directory structure organised as one for each year named using the full four digits for the year, this then subdivided into 12, one for each month (don't forget to number them as 01 to 12 not as 1 to 12 -- see below), and these then further subdivided from 01 to (28, 30, or) 31 for the individual days.

A related naming pattern is to think in advance the maximum number that might be reached for simple serial numbered items and to then pad them with leading zeroes. This will then avoid sorting orders like 1, 10, 100, 101, 11, 12, ..., 19, 2, 20, 21, 22, ..., 29, 3, 30, 31, 32 etc. Padding with a leading zero or two gives the intended 001, 002, 003, 004, ..., 998, 999 effect required.

Small things, but ones that can help avoid mistakes. If you have a lot of information to organise then having it sorted into some sort of logical order can avoid simple problems going unnoticed. Storing dated material using either the Month-Day-Year or the Day-Month-Year notation throws that logical simplicity away.
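The sorting behaviour described in this comment is easy to check. The short sketch below compares plain and zero-padded serial numbers and shows that Year-Month-Day names sort chronologically; the helper names and the sample values are made up for the demonstration.

```python
from datetime import date


def serial_name(number, width=3):
    """Pad a serial number with leading zeroes to a fixed width."""
    return str(number).zfill(width)


def dated_name(day):
    """Name a dated item in Year-Month-Day (ISO 8601) order."""
    return day.isoformat()


numbers = [1, 2, 10, 100]
unpadded = sorted(str(n) for n in numbers)       # sorted as text, not as numbers
padded = sorted(serial_name(n) for n in numbers)  # text order now matches numeric order

days = [date(2003, 3, 2), date(2003, 2, 3), date(2002, 12, 31)]
iso_sorted = sorted(dated_name(d) for d in days)  # text order matches date order
```

Here unpadded comes out as 1, 10, 100, 2, while padded comes out as 001, 002, 010, 100, and the ISO names list the 2002 item first.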


Sorting order

Submitted by luminosity on June 2, 2003 - 04:56.

Not to mention that dates sorted in that order are universally recognisable, as opposed to something like 02-03-1985. Is that the 2nd of March or the 3rd of February?


Interesting Article

Submitted by swhiz10 on June 3, 2003 - 03:00.

I am more curious about the URL manipulation. Could you elaborate on it a bit more?


Good article

Submitted by tupholme on June 4, 2003 - 01:33.

A good article summarising most of what can currently be done with URLs.

neuro makes a good point that is not covered in the article, in that some Web servers will assume that a URL without a trailing slash is a filename rather than a directory name, even when there are no dots in it (rightly so in my opinion). Users, however, nearly always fail to appreciate this and so it is worth making sure that your server configuration does the redirection automatically. (And correspondingly making sure that your own links are correct, so that you don't incur those redirections unnecessarily.) Needless to say, you should never have a situation where you WANT a file and a directory with the same name in the same place!

The article could do with a few more references. There have been previous articles on evolt about rewriting query strings, and if the authors agree with them they should be linked here for those that didn't see them the first time around. Also, more technical readers may want to look at RFC 1738, which defines URLs.


Good article -- references available at other site

Submitted by port80 on June 4, 2003 - 14:50.

The original article is posted with extensive references near the bottom at http://www.port80software.com/support/articles/nextgenerationurls.


confusion and redirects

Submitted by skquinn on June 14, 2003 - 01:04.

"The problem being that /productname and /productname/ can be two different things, or can cause problems in later life with specific web servers using more bandwidth to perform the redirect transition from /productname to /productname/ (e.g. Apache)"
I don't see how this is a problem. Generally if you have /productname you won't also have /productname/ and vice versa. The only possible problem is the bandwidth chewed up by redirects but HTTP code 301 means moved permanently and good software will act upon it.


Want more info on removing extensions

Submitted by lisrael on June 20, 2003 - 02:31.

The article was fairly good, although I've read a very similar one before somewhere, maybe W3C. What I was really hoping for the whole time I read it never materialized. The idea of removing extensions from web file names has caught my interest in the past, and I was hoping for some actual help on that. While this article and others promote that idea heartily, unfortunately, they are very short on the specifics of how to remove file name extensions. For web authors like myself to even have a clue about how to do this, we need step-by-step instructions on how to implement it (ideally, for both Unix/Apache and Windows/IIS servers). This could easily be an article in itself, and a good one.


ditching the "filename extensions" on URLs

Submitted by skquinn on June 20, 2003 - 04:57.

I assume you mean being able to use a URL without the "filename extension" on it, the filename itself needs to have the extension even if the URL doesn't (the filename would be "stuff.html" even though the URL you would refer to would be "http://www.example.com/stuff").

In Apache it's a one-line change. Add "MultiViews" to the "Options" directive of every site that you want to do this for. With IIS, I have not really used it, so I would not know where to begin (and given that Microsoft's programmers have made IE dependent on URLs having "filename extensions" in certain cases I would not be surprised if it was not possible).

Note that you can still put the "filename extension" on the URL, in fact if you are using content negotiation, this is how you would access a specific version of the resource (for example, a PNG, JPEG, and Targa version of the same image).


Typing? Remembering?

Submitted by efge on June 24, 2003 - 04:32.

Dirty URLs are difficult to type.
Do you seriously think that in real life people will type URLs like http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=Web+server+software ?

Avoid punctuation in file names.
I fail to see why this would be careless or unclean. I'd rather see Towards_Next_Generation_URLs than towardsnextgenerationurls.

And regarding typing them, see above, you don't type or remember the URL of a document, at most you do that for the website and maybe a few toplevel pages (à la your word example).


Search Engines and standard formats

Submitted by edmar on July 14, 2003 - 12:18.

Something missing from the article and comments is how this topic affects search engine ranking. The clean URLs definitely help usability for end-users but they can also be a bonus to your SEO efforts as the URLs will have more relevant keywords. Keywords in directories have higher values than those in query strings. As for page naming, however, both users and search engines will have an easier time with "my-complex-page-name" than figuring out the keywords from "mycomplexpagename". The latter is too code-oriented.

One concern might be how the engines view extension-less pages. This comes from when I recently tried to recreate a site locally using a number of site grabbers. The site employed some type of system whereby the extension was hidden, and for each of those pages the grabbers just made an empty directory. Obviously fewer resources went into creating the grabbers' spidering engines than into search bots', but it would be good to confirm the results.

People too have got used to extensions so might be confused as to why they are missing on your pages. In relaying a URL to a coworker, it would be normal to say ".html". Any time I have tried stopping at the slash, I invariably get questioned about the rest of the file name. Just like ".com" for domains, people relate to ".html" as a web page. You can still do the content negotiation without removing the extension, so better not to confuse people.


URLs vs filenames

Submitted by skquinn on July 14, 2003 - 13:37.

People too have got used to extensions so might be confused as to why they are missing on your pages. In relaying a URL to a coworker, it would be normal to say ".html".

Today, it would be normal to use HTML. In 10, 20, 30, or more years, HTML may well be as antiquated as Gopher. The nice thing about a URL that is not strictly based on a filename, is you can refer to it as, for example, http://www.example.com/catalog/widget1234 and the underlying document can be HTML, plain text, RTF, or whatever format of the year. I actually took advantage of this on a site I made recently, moving from a plain text version to an HTML version keeping the same URL. If I decide not to use HTML anymore, converting back to plain text is as simple as switching back the script that generates the file and rm file.html.

Any time I have tried stopping at the slash, I invariably get questioned about the rest of the file name.

This doesn't really make much sense to me. URLs aren't filenames; they never were.

Just like ".com" for domains, people relate to ".html" as a web page.

I've seen URL "filename extensions" of .asp, .php, .php3, .shtml, .jsp, and who knows how many others. Security would be a good enough reason to leave off extensions; .asp is usually a dead giveaway your Web server is a Windows box (though ASP solutions exist for other platforms as well, and there are other places that the Web server's identity would need to be hidden).

You can still do the content negotiation without removing the extension, so better not to confuse people.

I don't see what's so confusing about it. What I find confusing is sites that convert everything from .html to .shtml and leave God knows how many links dangling, without so much as a code 301 redirect (I forget the example I had for this right off hand, but there was a company that did this).



Evolt.org is an all-volunteer resource for web developers made up of a discussion list, a browser archive, and member-submitted articles. This article is the property of its author; please do not redistribute or use it elsewhere without checking with the author.