Web development tips
Ten (or a few more) files every Web site needs
Covering the basics at your new, or old, site
There are a few standard files that every Web site should really have, but that most neglect. Most of these are matters of convention, not of technical requirement, but your site is worse off without them. Users who make a wild guess at the URL for what they want should usually succeed. This tip discusses each of these standard files briefly.
Exactly how a given resource is provided depends on the Web server and Web application layers you use. In a "traditional," mostly static server like Apache, these resources are likely to be literal files on a server. But in a different configuration, they might actually be entries in a database, lines in a configuration file, classes in a server process, and so on. This tip focuses on what a user ultimately sees, not what you need to do to make it happen.
404.html
When users use your Web site, they will inevitably seek resources that do not exist. Typos in URLs are probably the most common cause, but link rot, back-end misconfiguration, URL mangling at various points, and other causes contribute. When a resource is unavailable, it is nice to provide a fallback page that helps users navigate to something more useful. A generic "not found" is enough for users to know a resource is unavailable, but it does nothing to help them figure out "what next."
A warning when you create a custom 404.html (or whatever mechanism your Web server uses to deliver a custom "not found" message): Far too many Web sites are misconfigured to deliver "soft 404" messages. In other words, they deliver a page with a regular "200 OK" header that merely says "not available" somewhere in the text, perhaps (but not always) mentioning "404 Error" somewhere in there. You should not do this! Instead, give your users—and their Web browsers and other tools—a break and use accurate status headers!
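To make the difference from a "soft 404" concrete, here is a minimal sketch using Python's standard http.server module. The page table, the wording of the fallback page, and the names respond and serve are assumptions made up for illustration; a real site would configure this in its own Web server or framework.

```python
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical pages this toy site actually serves.
KNOWN_PAGES = {
    "/": b"<h1>Home</h1>",
    "/about.html": b"<h1>About</h1>",
}

# Custom "not found" page that points users somewhere useful.
NOT_FOUND_BODY = (
    b"<h1>Page not found</h1>"
    b'<p>Try the <a href="/">home page</a> or the '
    b'<a href="/about.html">about page</a>.</p>'
)


def respond(path):
    """Return (status, body) for a request path.

    Unknown paths get the helpful page together with a real 404 status,
    never a 200 with "not found" buried in the text.
    """
    if path in KNOWN_PAGES:
        return HTTPStatus.OK, KNOWN_PAGES[path]
    return HTTPStatus.NOT_FOUND, NOT_FOUND_BODY


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = respond(self.path)
        self.send_response(status)  # accurate status line, not a blanket 200
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the toy server until interrupted."""
    HTTPServer(("localhost", port), Handler).serve_forever()
```

Calling serve() and requesting a missing path should show the friendly page in the browser while tools that look at the status line still see 404.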
about.html
So why did you create your Web site, anyway? Yes, you have a front page that may answer that question. More likely, though, it does not, and instead lets users log in, "sells" your site, shows something splashy, and so on. Probably there is a way for users to navigate from the home page to the "about" page. Go ahead and make that information available right from http://mysite.example.com/about.html. Someone will look there for it.
A good about.html page provides a quick overview of what your site does, maybe why you created it, why users might care, and probably has a few links to navigate back to the core functions of your site. This page need not, and usually should not, be extremely fancy. Just let it be factual and concise so that users can proceed to take advantage of all the neat things your site offers.
contact.html (and contacts.html)
So who are you? As with about.html, users can probably reach this information after enough clicks from your existing home page. Do not make users work too hard for it: Put it at http://mysite.example.com/contact.html. While you are at it, use contacts.html for the same page, too, and throw in the .htm extensions as well. Names are cheap. Of course, you can also leave the information at the end of those clicks in your whiz-bang navigation screens; a little redundancy in finding resources is not bad.
Copyright information
To whom does this stuff belong? Probably the content belongs to you. But who are you again? An individual? A corporation? A set of collaborators? A government organization? If your content is in the public domain or under a free content license of some sort, it is probably even more important to let users know that. Nowadays, everything is born privately copyrighted: If your material follows different rules, let users know. Not enough Web sites bother with this resource, but why not add it to yours? Someone will look for it.
Obviously, different pages or resources might have different copyright information. Let this general page provide some information for users on how to determine those individual differences, if that is relevant. If there are trademark issues, mention those as well.
index.html (and index.htm)
Not every Web server uses an actual index.html file to describe its home page. Depending on your setup, you might have URL rewriting, dynamic generation by pathname, and so on. Users don't care! Just make http://mysite.example.com/index.html point to your home page, even if you have to use a simple HTML redirect to make that happen.
Oh, and while you are at it, you might as well make the old Windows-crippled .htm extension work too. And if you are feeling particularly generous, even let index.cgi get to the same place also.
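For a mostly static setup, the "simple HTML redirect" can be a tiny stub page per alias. The sketch below generates such stubs in Python; the function names and the alias list are hypothetical, and it assumes the real home page is served at "/" by something other than these stub files (otherwise the redirect would loop). Where your server supports it, a true HTTP redirect or a symbolic link does the same job.

```python
from html import escape

# Hypothetical alias names for the home page.
INDEX_ALIASES = ["index.html", "index.htm"]


def redirect_stub(target):
    """Return a small HTML page that sends the browser on to `target`."""
    safe = escape(target, quote=True)
    return (
        "<!DOCTYPE html>\n"
        "<html><head>\n"
        f'<meta http-equiv="refresh" content="0; url={safe}">\n'
        "<title>Redirecting</title>\n"
        "</head><body>\n"
        # Plain link as a fallback for clients that ignore meta refresh.
        f'<p>Continue to <a href="{safe}">{safe}</a>.</p>\n'
        "</body></html>\n"
    )


def build_stubs(aliases, target):
    """Map each alias file name to the redirect page it should contain."""
    return {name: redirect_stub(target) for name in aliases}


stubs = build_stubs(INDEX_ALIASES, "/")
```

The same helper would cover the contact page aliases mentioned earlier, for example build_stubs(["contacts.html", "contact.htm", "contacts.htm"], "/contact.html").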
index.rss
A lot of Web content is available through RSS. Doing that will not make sense for every Web site, but it will for many of them. It is perfectly reasonable to make RSS content depend on user-specific configuration options, or logging in, or paying for particular information. One size does not necessarily fit all with RSS.
Nonetheless, if there is something you can generically provide as RSS, go ahead and do so. Maybe all you give out under index.rss is "teaser" content, along with a recurring "story" about how to take advantage of the full RSS feed(s). Or even just a story about why RSS is not relevant to your Web site.
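A generic "teaser" feed of the kind described above can be very small. The following sketch builds a minimal RSS 2.0 document with Python's standard xml.etree.ElementTree; the function name, the site title, and the sample items are invented for illustration, and a production feed would likely carry more fields (dates, GUIDs, and so on).

```python
import xml.etree.ElementTree as ET


def build_teaser_feed(site_title, site_url, items):
    """Build a minimal RSS 2.0 feed.

    `items` is a list of (title, link, summary) tuples.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link, and description on the channel.
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = f"Teaser feed for {site_title}"
    for title, link, summary in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
        ET.SubElement(item, "description").text = summary
    return ET.tostring(rss, encoding="unicode")


feed = build_teaser_feed(
    "My Site",
    "http://mysite.example.com/",
    [
        (
            "Getting the full feed",
            "http://mysite.example.com/about.html",
            "How to take advantage of the complete RSS feeds.",
        ),
    ],
)
```

Serving the resulting string at index.rss gives guessers something useful even when the real feeds require a login.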
Privacy policy
If you intend to collect any information from users (even only usernames or traffic logs), please let them know what you intend to do with that information. The legal issues around the rights and obligations of Web site creators and users are complex, and I am not a lawyer, let alone your lawyer. Still, users will feel better knowing you have thought about their privacy. And maybe this is a good time to talk with your lawyer about exactly what you plan to do with user data.
robots.txt
If you do not want all the resources on your Web site to be indexed by automatic tools, say so in a robots.txt file. Heck, if you do want everything to be indexed, say that too. A Robots Exclusion Standard directive is a request, not an enforcement mechanism: If you really do not want something to be visible, either do not put it on your Web site at all, or make sure it lives behind adequate permission protection. But all the major and legitimate Web crawling engines obey the requests in robots.txt. Make your intentions clear.
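One way to see how a well-behaved crawler reads these requests is Python's standard urllib.robotparser module. In this sketch the two robots.txt bodies, the /drafts/ path, and the ExampleBot agent name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks all robots to skip one directory.
EXCLUDE_DRAFTS = """\
User-agent: *
Disallow: /drafts/
"""

# An empty Disallow line says explicitly that everything may be indexed.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def make_checker(robots_txt):
    """Return a function that reports whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url, agent="ExampleBot"):
        return parser.can_fetch(agent, url)

    return allowed


polite = make_checker(EXCLUDE_DRAFTS)
open_site = make_checker(ALLOW_ALL)
```

Here polite("http://mysite.example.com/drafts/x.html") reports that the page should be skipped, while open_site accepts the same URL. Nothing stops a rude client from fetching it anyway, which is why real protection belongs behind permissions.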
security.html
The use of a security.html resource is not uniform. However, if your site can raise security concerns (for example, if you collect any sort of sensitive information from users), documenting your security procedures, at least in broad outline, is a good idea. Give some contact information on this page in case users have questions, or perhaps have useful improvements to suggest. Users should be able to find this information through your site's normal navigation. But while you are at it, put a copy of that resource at this URL too.
Site map
Exactly how you show a map of your overall Web site is not well standardized. Providing something along these lines is always useful, but the level of detail available depends on how dynamic your site is (and in what ways). Moreover, what you want to show users can depend on the purpose of the site. For example, it might not be appropriate for all users to know that Resource X exists at all if they do not have permission to use it. Use your judgment, but consider providing something.
For many sites, a sitemap is simply a way of being friendly to robots such as search engines. Google has published a convention that piggybacks on the robots.txt convention. In brief, you can create an XML file that documents all the resources that your site provides. This acts as an "inclusion list" to complement the "exclusion list" of robots.txt.
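As a rough illustration of that XML "inclusion list," the sketch below writes a minimal sitemap with Python's standard xml.etree.ElementTree, using the namespace defined by the Sitemaps protocol. The function name and the page list are assumptions; optional fields such as last-modified dates are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal sitemap XML document listing `urls`."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url  # <loc> is the only required child
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


sitemap_xml = build_sitemap(
    [
        "http://mysite.example.com/",
        "http://mysite.example.com/about.html",
        "http://mysite.example.com/contact.html",
    ]
)
```

The tie-in with robots.txt is a single line there of the form "Sitemap: " followed by the sitemap's full URL, which tells crawlers where to find the file.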
E-mail addresses
Not everything happens on the Web. In fact, just in case the navigation tools on your Web site do not quite live up to your hopes (or maybe your users have a brain glitch in discerning your elegant design), it is nice to let users reach you by e-mail too.
By all means, prominently publicize contact information at contact.html and elsewhere on your Web site. But as a fallback, make sure mail sent to a few general e-mail addresses gets to the right person. These include at least, firstname.lastname@example.org, email@example.com, and firstname.lastname@example.org. For the real old-timers, you might want to let email@example.com go somewhere meaningful too (but probably not to "root" for security reasons). While you are at it, throw in e-mail forwarding for a dozen more words that seem obvious to the purpose of your site. E-mail addresses are almost as cheap as symbolic links in your Web server directory.
- Wikipedia, as usual, also has a nice introduction to Google's Sitemaps convention.
- Find a friendly introduction to the Robots Exclusion Standard on Wikipedia.
- Configuring a custom 404 message in Apache is straightforward (as much as anything in Apache is). For other Web servers, consult that server's documentation.