cobwebs: Search through an HTML source tree and look for bad hyperlinks.

This program is run from the root directory of a Web site's HTML source
tree. It goes through and finds all HTML source files. It reads and parses
each file, (usually) detecting all the hyperlinks and the following
"protocols":

    mailto:  ftp://  gopher://  telnet://  rlogin://

It is, of course, not perfect, but it does find a lot of cobwebs. ;-)

Usage: cobwebs [options] {file | dir} ...

You list the files you want checked and/or the directories to be searched
recursively on the command line. These paths should be relative to the
server root dir of the Web site's HTML source tree. The default is ".",
i.e. search the whole tree. So you can just cd to the HTML root and type
"cobwebs" to run the check. Since that is rather quiet, however, it is
better to first type "cobwebs -v" to get some output as the program is
running.

Options:

  -h, -help         This help.
  -v                Verbose mode. Print each piece of information to the
                    screen as it is found, in addition to writing it to
                    the log file. -verbose works too.
  -d                Turn on extra debugging printing. -debug works too.
  -l                Place output in file "file" and errors in file
                    "file.errs". -logfile works too.
  -root             The directory given should be the root of the Web
                    site source. This simply does a chdir() to it before
                    starting.
  -m                Check files matching the given glob.
                    Default: *.htm*  -match works too.
  -file_skip        Skip (e.g. backup) files and dirs matching the given
                    perl regular expression. Default: SCCS/|RCS/|Bak/
  -url_skip         Skip checking any URL matching the given pattern.
  -check_backups    Don't skip the backup files matching SCCS/|RCS/|Bak/.
  -check_comments   Check hyperlinks found inside HTML comments too.
  -keep_dots        Do not try to remove upward references like:
                        a/b/../c.gif => a/c.gif
                    This is useful to see if your ../ links just work by
                    the luck of the browser or http server or if they
                    really point to where you intend.
  -use_httpd        Test even local files by HTTP. The argument should
                    refer to the server of the HTML tree being checked
                    (e.g.
                    http://host.domain)
  -show_header      Include the returned HTTP header in the report even
                    for OK links.
  -head_only        Use only the HTTP "HEAD" method, no GET. (Good netizen
                    mode.) GET is still used to check "#" anchors, though.
  -timeout          Set the timeout to wait for hung connections, in
                    seconds. Default is to not check for hung connections.
  -alarm            Use alarm() to wait for the hung connection. (Unix
                    only)
  -select           Use select() on the file handle to wait for the hung
                    connection (used on Windows).

                    Both will wait "-timeout" seconds. Note that select()
                    is mixed with buffered sockethandle reads, so there
                    may be problems on some platforms. If you don't know
                    which of -alarm or -select to pick, don't specify
                    either and one will be automatically selected whenever
                    you use the -timeout switch.
  -shutdown         Call shutdown(2) after sending the HTTP request.
  -shutdown_match   Call shutdown(2) if URL matches pattern.
  -jobs             When doing the link verifying, create this many
                    processes (on Unix) checking the links in parallel.
  -links_only       Do not check any hyperlinks, just report a list of the
                    hyperlink URLs found.
  -local_only       Do not check hyperlinks that require a remote
                    connection (i.e. only check the local filesystem).
  -no_anchors       Do not check the Anchor links.
  -no_images        Do not check the Image links.
  -no_actions       Do not check the Form Action links
                    (ACTIONS NOT IMPLEMENTED)
  -mailto           Check mailto: URLs. Looks for a working e-mail
                    address. Hostname guesses for the mail server are:
                        mail mailhost mx
                    (will be prefixed to domain)
  -telnet           Check telnet: URLs. Looks for a working telnet
                    connection. (connection only, no dialog)
  -rlogin           Check rlogin: URLs. Looks for a working rlogin
                    connection. (connection only, no dialog or auth)
  -windows          Machine is Windows. (Requires Win32 perl in PATH)
  -unix             Machine is Unix.
  -sysv             Machine has SVR4 style sockets (e.g. Solaris).
                    Default is BSD style sockets (e.g. Linux).
  -who              E-mail the errors found to the given recipients.
                    Hacks: If one of the recipients is "GT:n" where "n" is
                    an integer, then the mail is only sent if there are
                    more than "n" errors. Also, if one of the recipients
                    is "WRAP" then the e-mail body is passed thru
                    "cobwebs -wrap" to make the output a bit more
                    readable.
  -proxy_host       Don't make direct connections when checking remote
                    URLs, use the proxy instead.
  -proxy_port       The proxy's port number.
  -user_agent       Send the given string to remote http servers after the
                    GET line. Percent-signs (%) will be converted to
                    newlines except % which goes to %.
  -no_user_agent    Don't send any User-Agent lines to remote servers.
  -internal_find    Use the Perl5 File::Find module instead of the Unix
                    /bin/find utility.
  -wrap             If the wide one-line-per-url output is too difficult
                    for you to view, run an output file thru
                    "cobwebs -wrap" to wrap the output onto multiple lines
                    and also "format" it a bit for better readability.
                    E.g.:
                        cobwebs -wrap cob.out
                    where "cob.out" is a one-line-per-url output file. The
                    formatted output goes to stdout.

Notes:

Local references to people's home websites, e.g. /~fred/index.html, will
not be expanded. Remote ones, e.g. http://foo.com/~fred/index.html, of
course will be expanded correctly. To check the local ones, you could use
-use_httpd. TODO: mention -use_httpd http://server/~fred/

"cobwebs" has been ported to work on Win95 and WinNT using Perl5.
Limitations: it cannot fork() parallel jobs or send email (TBD).
A way to start it on Windows would be something like:

    perl cobwebs ...

provided perl is in the PATH and "cobwebs" is in the current directory. See
the Win32 Perl docs for more info on launching scripts.

This program is mainly intended to be run as a batch job, e.g. via
crontab(1). It is not blazingly fast. The -jobs switch can be used (on
Unix) to spawn processes verifying links in parallel.

Yes, there are a lot of options, and they have long names, so why not make
yourself a little script that sets your basic options? You can then easily
comment out functionality when you want to, and even type in extra options
at the command line, e.g.:

    #!/bin/sh
    root=$HOME/www_src
    uskip="javascript:|JavaScript:"
    extras="-mailto -telnet -rlogin"
    httpd="-use_httpd http://www.karlrunge.com/"
    timeout="-timeout 240"
    misc="-verbose"
    # The option variables are left unquoted on purpose so they split
    # into separate words; extra command-line options pass through "$@".
    cobwebs $misc $timeout $extras $httpd -url_skip "$uskip" -root "$root" "$@"
    exit $?

A script like the above could be the basis for a daily or weekly cron job
(see the -who e-mail flag).

cobwebs v0.2 Copyright (c) 1997, 2000 Karl J. Runge
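
For a cron-driven wrapper, the "GT:n" rule described under -who (mail only when there are more than "n" errors) can be sketched in plain sh. This is an illustration only, not how cobwebs implements it: should_mail is a hypothetical helper, and passing the error count and the recipients as separate arguments is an assumption made for the example.

```shell
#!/bin/sh
# Sketch only: should_mail is a hypothetical helper, not part of cobwebs.
# Usage: should_mail ERROR_COUNT RECIPIENT...
# Returns 0 (send the mail) unless a "GT:n" recipient is present and
# ERROR_COUNT is not greater than n.
should_mail() {
    errors=$1
    shift
    for rcpt in "$@"; do
        case $rcpt in
            GT:*)
                # Strip the "GT:" prefix to get the threshold n.
                threshold=${rcpt#GT:}
                # Mail only when there are more than n errors.
                [ "$errors" -gt "$threshold" ] || return 1
                ;;
        esac
    done
    return 0
}

# Example: 5 errors against a threshold of 3, so the mail would go out.
if should_mail 5 fred GT:3 WRAP; then
    echo "send"
else
    echo "skip"
fi
```

Recipients other than "GT:n" (including "WRAP") are ignored by this check, so a wrapper could call it before deciding whether to mail the report at all.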