cobwebs: Search through an HTML source tree and look for bad hyperlinks.
This program is run from the root directory of a Web site's HTML source
tree. It goes through and finds all HTML source files.
It reads and parses each file, (usually) detecting all the hyperlinks,
and checks them, including the following "protocols":
mailto: ftp:// gopher:// telnet:// rlogin://
It is, of course, not perfect, but it does find a lot of cobwebs. ;-)
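The file-finding step described above can be sketched roughly as follows. This is only an illustration in Python, not the program's own code: the function name find_html_files is hypothetical, the defaults are the -m glob and -file_skip pattern listed under Options, and it is an assumption that the skip pattern is matched against the path relative to the root.

```python
import fnmatch
import os
import re


def find_html_files(root=".", match="*.htm*", file_skip=r"SCCS/|RCS/|Bak/"):
    """Return sorted paths (relative to root, "/"-separated) of matching files."""
    skip = re.compile(file_skip)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root).replace(os.sep, "/")
        rel_dir = "" if rel_dir == "." else rel_dir + "/"
        # Prune directories whose relative path matches the skip pattern,
        # so os.walk never descends into them (assumed behaviour).
        dirnames[:] = [d for d in dirnames if not skip.search(rel_dir + d + "/")]
        for name in filenames:
            rel_path = rel_dir + name
            # Keep files whose name matches the glob and whose path is not skipped.
            if fnmatch.fnmatch(name, match) and not skip.search(rel_path):
                found.append(rel_path)
    return sorted(found)
```

Calling find_html_files(".") from the HTML root would list the files a default run is expected to consider, leaving out anything under SCCS/, RCS/ or Bak/.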
Usage: cobwebs [options] {file | dir} ...
You list the files you want checked and/or directories recursively searched
on the command line. These paths should be relative to the Server root
dir of the Website's HTML source tree. Default is ".", i.e. search the
whole tree.
So you can just cd to the HTML root and type "cobwebs" to run the
check. However, since that is rather quiet, it is better to first type
"cobwebs -v" to get some output as the program is running.
Options:
-h, -help This help.
-v Verbose mode. Print each piece of information to the screen
as it is found, in addition to writing it to the log file.
-verbose works too.
-d Turn on extra debugging printing.
-debug works too.
-l Place output in file "file" and errors in file "file.errs".
-logfile works too.
-root The given directory should be the root of the Web site source.
This simply does a chdir() to that directory before starting.
-m Check files matching the given glob. Default: *.htm*
-match works too.
-file_skip
Skip (e.g. backup) files and dirs matching the given Perl regular
expression. Default: SCCS/|RCS/|Bak/
-url_skip
Skip checking any URL matching the given pattern.
-check_backups Don't skip the backup files matching SCCS/|RCS/|Bak/.
-check_comments Check hyperlinks found inside HTML comments too.
-keep_dots Do not try to remove upward references like:
a/b/../c.gif => a/c.gif
This is useful to see if your ../ links just work by
the luck of the browser or http server or if they really
point to where you intend.
-use_httpd
Test even local files by HTTP. The argument should
refer to the server of the HTML tree being checked
(e.g. http://host.domain).
-show_header Include the returned HTTP header in the report even for OK links.
-head_only Use only the HTTP "HEAD" method, no GET. (Good netizen mode)
(GET is used to check "#" anchors though)
-timeout Set timeout to wait for hung connections. The timeout
is given in seconds. Default is to not check for hung
connections.
-alarm Use alarm() to wait for the hung connection. (Unix only)
-select Use select() on the file handle to wait for the hung
connection (used on Windows).
Both will wait "-timeout" seconds. Note that select()
is mixed with buffered socket handle reads, so
there may be problems on some platforms.
If you don't know which of -alarm or -select to pick,
don't specify either and one will be automatically
selected whenever you use the -timeout switch.
-shutdown Call shutdown(2) after sending the HTTP request.
-shutdown_match Call shutdown(2) if URL matches pattern.
-jobs When verifying links, create this many
processes (on Unix) to check the links in parallel.
-links_only Do not check any hyperlinks; just report a list of the
hyperlink URLs found.
-local_only Do not check hyperlinks that require a remote
connection. (i.e. only checks the local filesystem)
-no_anchors Do not check the Anchor links
-no_images Do not check the Image links
-no_actions Do not check the Form Action links
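
To make the effect of -keep_dots concrete, the upward-reference removal that it turns off (a/b/../c.gif => a/c.gif) might look roughly like the sketch below. This is a Python illustration under assumptions, not the program's actual routine: the name remove_upward_refs is hypothetical, and leaving a leading "../" untouched is an assumed choice.

```python
def remove_upward_refs(path):
    """Collapse "dir/.." pairs in a "/"-separated URL path."""
    kept = []
    for part in path.split("/"):
        # A ".." cancels the previous real directory component, if there is one.
        if part == ".." and kept and kept[-1] not in ("", ".."):
            kept.pop()
        else:
            # Ordinary components, and any ".." that cannot be resolved, are kept.
            kept.append(part)
    return "/".join(kept)


print(remove_upward_refs("a/b/../c.gif"))  # a/c.gif
print(remove_upward_refs("../c.gif"))      # ../c.gif (nothing to cancel)
```

With -keep_dots the path would be used as written, so a link such as a/b/../c.gif is only reported as good if the directory b really exists.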