#! /usr/bin/env python

# Original code by Guido van Rossum; extensive changes by Sam Bayer,
# including code to check URL fragments.

"""Web tree checker.

This utility is handy to check a subweb of the world-wide web for
errors.  A subweb is specified by giving one or more ``root URLs''; a
page belongs to the subweb if one of the root URLs is an initial
prefix of it.

File URL extension:

In order to ease the checking of subwebs via the local file system,
the interpretation of ``file:'' URLs is extended to mimic the behavior
of your average HTTP daemon: if a directory pathname is given, the
file index.html in that directory is returned if it exists, otherwise
a directory listing is returned.  Now, you can point webchecker to the
document tree in the local file system of your HTTP daemon, and have
most of it checked.  In fact the default works this way if your local
web tree is located at /usr/local/etc/httpd/htdocs (the default for
the NCSA HTTP daemon and probably others).

Report printed:

When done, it reports pages with bad links within the subweb.  When
interrupted, it reports on the pages that it has checked so far.

In verbose mode, additional messages are printed during the
information gathering phase.  By default, it prints a summary of its
work status every 50 URLs (adjustable with the -r option), and it
reports errors as they are encountered.  Use the -q option to disable
this output.

Checkpoint feature:

Whether interrupted or not, it dumps its state (a Python pickle) to a
checkpoint file, and the -R option allows it to restart from the
checkpoint (assuming that the pages on the subweb that were already
processed haven't changed).  Even when it has run till completion, -R
can still be useful -- it will print the reports again, and -Rq prints
the errors only.  In this case, the checkpoint file is not written
again.  The checkpoint file can be set with the -d option.
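
For example (hypothetical checkpoint file name and URL, for illustration
only), a run started with

    webchecker.py -d mysite.pickle http://www.example.com/docs/

can later be resumed, reusing the work already done, with

    webchecker.py -R -d mysite.pickle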

The checkpoint file is written as a Python pickle.  Remember that
Python's pickle module is currently quite slow.  Give it the time it
needs to load and save the checkpoint file.  When interrupted while
writing the checkpoint file, the old checkpoint file is not
overwritten, but all work done in the current run is lost.

Miscellaneous:

- You may find the (Tk-based) GUI version easier to use.  See wcgui.py.

- Webchecker honors the "robots.txt" convention.  Thanks to Skip
Montanaro for his robotparser.py module (included in this directory)!
The agent name is hardwired to "webchecker".  URLs that are disallowed
by the robots.txt file are reported as external URLs.

- Because the SGML parser is a bit slow, very large SGML files are
skipped.  The size limit can be set with the -m option.

- When the server or protocol does not tell us a file's type, we guess
it based on the URL's suffix.  The mimetypes.py module (also in this
directory) has a built-in table mapping most currently known suffixes,
and in addition attempts to read the mime.types configuration files in
the default locations of Netscape and the NCSA HTTP daemon.

- We follow links indicated by <A>, <FRAME> and <IMG> tags.  We also
honor the <BASE> tag.

- We now check internal NAME anchor links, as well as toplevel links.

- Checking external links is now done by default; use -x to *disable*
this feature.  External links are now checked during normal
processing.  (XXX The status of a checked link could be categorized
better.  Later...)

- If external links are not checked, you can use the -t flag to
provide specific overrides to -x.

Usage: webchecker.py [option] ... [rooturl] ...

Options:

-R        -- restart from checkpoint file
-d file   -- checkpoint filename (default %(DUMPFILE)s)
-m bytes  -- skip HTML pages larger than this size (default %(MAXPAGE)d)
-n        -- reports only, no checking (use with -R)
-q        -- quiet operation (also suppresses external links report)
-r number -- number of links processed per round (default %(ROUNDSIZE)d)
-t root   -- specify root dir which should be treated as internal (can repeat)
-v        -- verbose operation; repeating -v will increase verbosity
-x        -- don't check external links (these are often slow to check)
-a        -- don't check name anchors

Arguments:

rooturl   -- URL to start checking
             (default %(DEFROOT)s)
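
Example invocations (the URLs below are placeholders, for illustration only):

    webchecker.py -v http://www.example.com/docs/
    webchecker.py -x -t http://www.example.com/other/ http://www.example.com/docs/

The second form disables external link checking (-x) but still treats the
extra root given with -t as internal.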

"""


__version__ = "$Revision$"


import sys
import os
from types import *
import StringIO
import getopt
import pickle

import urllib
import urlparse
import sgmllib
import cgi

import mimetypes
import robotparser

# Extract real version number if necessary
if __version__[0] == '$':
    _v = __version__.split()
    if len(_v) == 3:
        __version__ = _v[1]


# Tunable parameters
DEFROOT = "file:/usr/local/etc/httpd/htdocs/"   # Default root URL
CHECKEXT = 1                            # Check external references (1 deep)
VERBOSE = 1                             # Verbosity level (0-3)
MAXPAGE = 150000                        # Ignore files bigger than this
ROUNDSIZE = 50                          # Number of links processed per round
DUMPFILE = "@webchecker.pickle"         # Pickled checkpoint
AGENTNAME = "webchecker"                # Agent name for robots.txt parser
NONAMES = 0                             # Skip checking of name anchors (-a)


# Global variables


def main():
    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    dumpfile = DUMPFILE
    restart = 0
    norun = 0

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:t:vxa')
    except getopt.error, msg:
        sys.stdout = sys.stderr
        print msg
        print __doc__%globals()
        sys.exit(2)

    # The extra_roots variable collects extra roots.
    extra_roots = []
    nonames = NONAMES

    for o, a in opts:
        if o == '-R':
            restart = 1
        if o == '-d':
            dumpfile = a
        if o == '-m':
            maxpage = int(a)
        if o == '-n':
            norun = 1
        if o == '-q':
            verbose = 0
        if o == '-r':
            roundsize = int(a)
        if o == '-t':
            extra_roots.append(a)
        if o == '-a':
            nonames = not nonames
        if o == '-v':
            verbose = verbose + 1
        if o == '-x':
            checkext = not checkext

    if verbose > 0:
        print AGENTNAME, "version", __version__

    if restart:
        c = load_pickle(dumpfile=dumpfile, verbose=verbose)
    else:
        c = Checker()

    c.setflags(checkext=checkext, verbose=verbose,
               maxpage=maxpage, roundsize=roundsize,
               nonames=nonames
               )

    if not restart and not args:
        args.append(DEFROOT)

    for arg in args:
        c.addroot(arg)

    # The -t flag is only needed if external links are not to be
    # checked.  So -t values are ignored unless -x was specified.
    if not checkext:
        for root in extra_roots:
            # Make sure it's terminated by a slash,
            # so that addroot doesn't discard the last
            # directory component.
            if root[-1] != "/":
                root = root + "/"
            c.addroot(root, add_to_do = 0)

    try:

        if not norun:
            try:
                c.run()
            except KeyboardInterrupt:
                if verbose > 0:
                    print "[run interrupted]"

        try:
            c.report()
        except KeyboardInterrupt:
            if verbose > 0:
                print "[report interrupted]"

    finally:
        if c.save_pickle(dumpfile):
            if dumpfile == DUMPFILE:
                print "Use ``%s -R'' to restart." % sys.argv[0]
            else:
                print "Use ``%s -R -d %s'' to restart." % (sys.argv[0],
                                                           dumpfile)


def load_pickle(dumpfile=DUMPFILE, verbose=VERBOSE):
    if verbose > 0:
        print "Loading checkpoint from %s ..." % dumpfile
    f = open(dumpfile, "rb")
    c = pickle.load(f)
    f.close()
    if verbose > 0:
        print "Done."
        print "Root:", "\n ".join(c.roots)
    return c


class Checker:

    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    nonames = NONAMES

    validflags = tuple(dir())

    def __init__(self):
        self.reset()

    def setflags(self, **kw):
        for key in kw.keys():
            if key not in self.validflags:
                raise NameError, "invalid keyword argument: %s" % str(key)
        for key, value in kw.items():
            setattr(self, key, value)

    def reset(self):
        self.roots = []
        self.todo = {}
        self.done = {}
        self.bad = {}

        # Add a name table, so that the name URLs can be checked.  Also
        # serves as an implicit cache for which URLs are done.
        self.name_table = {}

        self.round = 0
        # The following are not pickled:
        self.robots = {}
        self.errors = {}
        self.urlopener = MyURLopener()
        self.changed = 0

    def note(self, level, format, *args):
        if self.verbose > level:
            if args:
                format = format%args
            self.message(format)

    def message(self, format, *args):
        if args:
            format = format%args
        print format

    def __getstate__(self):
        return (self.roots, self.todo, self.done, self.bad, self.round)

    def __setstate__(self, state):
        self.reset()
        (self.roots, self.todo, self.done, self.bad, self.round) = state
        for root in self.roots:
            self.addrobot(root)
        for url in self.bad.keys():
            self.markerror(url)

    def addroot(self, root, add_to_do = 1):
        if root not in self.roots:
            troot = root
            scheme, netloc, path, params, query, fragment = \
                    urlparse.urlparse(root)
            i = path.rfind("/") + 1
            if 0 < i < len(path):
                path = path[:i]
                troot = urlparse.urlunparse((scheme, netloc, path,
                                             params, query, fragment))
            self.roots.append(troot)
            self.addrobot(root)
            if add_to_do:
                self.newlink((root, ""), ("<root>", root))

    def addrobot(self, root):
        root = urlparse.urljoin(root, "/")
        if self.robots.has_key(root): return
        url = urlparse.urljoin(root, "/robots.txt")
        self.robots[root] = rp = robotparser.RobotFileParser()
        self.note(2, "Parsing %s", url)
        rp.debug = self.verbose > 3
        rp.set_url(url)
        try:
            rp.read()
        except (OSError, IOError), msg:
            self.note(1, "I/O error parsing %s: %s", url, msg)

    def run(self):
        while self.todo:
            self.round = self.round + 1
            self.note(0, "\nRound %d (%s)\n", self.round, self.status())
            urls = self.todo.keys()
            urls.sort()
            del urls[self.roundsize:]
            for url in urls:
                self.dopage(url)

    def status(self):
        return "%d total, %d to do, %d done, %d bad" % (
            len(self.todo)+len(self.done),
            len(self.todo), len(self.done),
            len(self.bad))

    def report(self):
        self.message("")
        if not self.todo: s = "Final"
        else: s = "Interim"
        self.message("%s Report (%s)", s, self.status())
        self.report_errors()

    def report_errors(self):
        if not self.bad:
            self.message("\nNo errors")
            return
        self.message("\nError Report:")
        sources = self.errors.keys()
        sources.sort()
        for source in sources:
            triples = self.errors[source]
            self.message("")
            if len(triples) > 1:
                self.message("%d Errors in %s", len(triples), source)
            else:
                self.message("Error in %s", source)
            # Call self.format_url() instead of referring
            # to the URL directly, since the URLs in these
            # triples are now (URL, fragment) pairs.  The value
            # of the "source" variable comes from the list of
            # origins, and is a URL, not a pair.
            for url, rawlink, msg in triples:
                if rawlink != self.format_url(url): s = " (%s)" % rawlink
                else: s = ""
                self.message(" HREF %s%s\n msg %s",
                             self.format_url(url), s, msg)

    def dopage(self, url_pair):

        # All printing of URLs uses format_url(); argument changed to
        # url_pair for clarity.
        if self.verbose > 1:
            if self.verbose > 2:
                self.show("Check ", self.format_url(url_pair),
                          " from", self.todo[url_pair])
            else:
                self.message("Check %s", self.format_url(url_pair))
        url, local_fragment = url_pair
        if local_fragment and self.nonames:
            self.markdone(url_pair)
            return
        try:
            page = self.getpage(url_pair)
        except sgmllib.SGMLParseError, msg:
            msg = self.sanitize(msg)
            self.note(0, "Error parsing %s: %s",
                      self.format_url(url_pair), msg)
            # Don't actually mark the URL as bad -- it exists, we
            # just can't parse it!
            page = None
        if page:
            # Store the page which corresponds to this URL.
            self.name_table[url] = page
            # If there is a fragment in this url_pair, and it's not
            # in the list of names for the page, call setbad(), since
            # it's a missing anchor.
            if local_fragment and local_fragment not in page.getnames():
                self.setbad(url_pair, ("Missing name anchor `%s'" % local_fragment))
            for info in page.getlinkinfos():
                # getlinkinfos() now returns the fragment as well,
                # and we store that fragment here in the "todo" dictionary.
                link, rawlink, fragment = info
                # However, we don't want the fragment as the origin, since
                # the origin is logically a page.
                origin = url, rawlink
                self.newlink((link, fragment), origin)
        else:
            # If no page has been created yet, we want to
            # record that fact.
            self.name_table[url_pair[0]] = None
        self.markdone(url_pair)

    def newlink(self, url, origin):
        if self.done.has_key(url):
            self.newdonelink(url, origin)
        else:
            self.newtodolink(url, origin)

    def newdonelink(self, url, origin):
        if origin not in self.done[url]:
            self.done[url].append(origin)

        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        self.note(3, " Done link %s", self.format_url(url))

        # Make sure that if it's bad, the origin gets added.
        if self.bad.has_key(url):
            source, rawlink = origin
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def newtodolink(self, url, origin):
        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        if self.todo.has_key(url):
            if origin not in self.todo[url]:
                self.todo[url].append(origin)
            self.note(3, " Seen todo link %s", self.format_url(url))
        else:
            self.todo[url] = [origin]
            self.note(3, " New todo link %s", self.format_url(url))

    def format_url(self, url):
        link, fragment = url
        if fragment: return link + "#" + fragment
        else: return link

    def markdone(self, url):
        self.done[url] = self.todo[url]
        del self.todo[url]
        self.changed = 1

    def inroots(self, url):
        for root in self.roots:
            if url[:len(root)] == root:
                return self.isallowed(root, url)
        return 0

    def isallowed(self, root, url):
        root = urlparse.urljoin(root, "/")
        return self.robots[root].can_fetch(AGENTNAME, url)

    def getpage(self, url_pair):
        # Incoming argument name is a (URL, fragment) pair.
        # The page may have been cached in the name_table variable.
        url, fragment = url_pair
        if self.name_table.has_key(url):
            return self.name_table[url]

        scheme, path = urllib.splittype(url)
        if scheme in ('mailto', 'news', 'javascript', 'telnet'):
            self.note(1, " Not checking %s URL" % scheme)
            return None
        isint = self.inroots(url)

        # Ensure that openpage gets the URL pair to
        # print out its error message and record the error pair
        # correctly.
        if not isint:
            if not self.checkext:
                self.note(1, " Not checking ext link")
                return None
            f = self.openpage(url_pair)
            if f:
                self.safeclose(f)
            return None
        text, nurl = self.readhtml(url_pair)

        if nurl != url:
            self.note(1, " Redirected to %s", nurl)
            url = nurl
        if text:
            return Page(text, url, maxpage=self.maxpage, checker=self)

    # These next three functions take (URL, fragment) pairs as
    # arguments, so that openpage() receives the appropriate tuple to
    # record error messages.
    def readhtml(self, url_pair):
        url, fragment = url_pair
        text = None
        f, url = self.openhtml(url_pair)
        if f:
            text = f.read()
            f.close()
        return text, url

    def openhtml(self, url_pair):
        url, fragment = url_pair
        f = self.openpage(url_pair)
        if f:
            url = f.geturl()
            info = f.info()
            if not self.checkforhtml(info, url):
                self.safeclose(f)
                f = None
        return f, url

    def openpage(self, url_pair):
        url, fragment = url_pair
        try:
            return self.urlopener.open(url)
        except (OSError, IOError), msg:
            msg = self.sanitize(msg)
            self.note(0, "Error %s", msg)
            if self.verbose > 0:
                self.show(" HREF ", url, " from", self.todo[url_pair])
            self.setbad(url_pair, msg)
            return None

    def checkforhtml(self, info, url):
        if info.has_key('content-type'):
            ctype = cgi.parse_header(info['content-type'])[0].lower()
            if ';' in ctype:
                # handle content-type: text/html; charset=iso8859-1 :
                ctype = ctype.split(';', 1)[0].strip()
        else:
            if url[-1:] == "/":
                return 1
            ctype, encoding = mimetypes.guess_type(url)
        if ctype == 'text/html':
            return 1
        else:
            self.note(1, " Not HTML, mime type %s", ctype)
            return 0

    def setgood(self, url):
        if self.bad.has_key(url):
            del self.bad[url]
            self.changed = 1
            self.note(0, "(Clear previously seen error)")

    def setbad(self, url, msg):
        if self.bad.has_key(url) and self.bad[url] == msg:
            self.note(0, "(Seen this error before)")
            return
        self.bad[url] = msg
        self.changed = 1
        self.markerror(url)

    def markerror(self, url):
        try:
            origins = self.todo[url]
        except KeyError:
            origins = self.done[url]
        for source, rawlink in origins:
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def seterror(self, url, triple):
        try:
            # Because of the way the URLs are now processed, I need to
            # check to make sure the URL hasn't been entered in the
            # error list.  The first element of the triple here is a
            # (URL, fragment) pair, but the URL key is not, since it's
            # from the list of origins.
            if triple not in self.errors[url]:
                self.errors[url].append(triple)
        except KeyError:
            self.errors[url] = [triple]

    # The following used to be toplevel functions; they have been
    # changed into methods so they can be overridden in subclasses.

    def show(self, p1, link, p2, origins):
        self.message("%s %s", p1, link)
        i = 0
        for source, rawlink in origins:
            i = i+1
            if i == 2:
                p2 = ' '*len(p2)
            if rawlink != link: s = " (%s)" % rawlink
            else: s = ""
            self.message("%s %s%s", p2, source, s)

    def sanitize(self, msg):
        if isinstance(IOError, ClassType) and isinstance(msg, IOError):
            # Do the other branch recursively
            msg.args = self.sanitize(msg.args)
        elif isinstance(msg, TupleType):
            if len(msg) >= 4 and msg[0] == 'http error' and \
               isinstance(msg[3], InstanceType):
                # Remove the Message instance -- it may contain
                # a file object which prevents pickling.
                msg = msg[:3] + msg[4:]
        return msg

    def safeclose(self, f):
        try:
            url = f.geturl()
        except AttributeError:
            pass
        else:
            if url[:4] == 'ftp:' or url[:7] == 'file://':
                # Apparently ftp connections don't like to be closed
                # prematurely...
                text = f.read()
        f.close()

    def save_pickle(self, dumpfile=DUMPFILE):
        if not self.changed:
            self.note(0, "\nNo need to save checkpoint")
        elif not dumpfile:
            self.note(0, "No dumpfile, won't save checkpoint")
        else:
            self.note(0, "\nSaving checkpoint to %s ...", dumpfile)
            newfile = dumpfile + ".new"
            f = open(newfile, "wb")
            pickle.dump(self, f)
            f.close()
            try:
                os.unlink(dumpfile)
            except os.error:
                pass
            os.rename(newfile, dumpfile)
            self.note(0, "Done.")
            return 1
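
# A minimal sketch (not part of the original tool) of driving Checker from
# Python code instead of the command line; the root URL is a placeholder:
#
#   c = Checker()
#   c.setflags(verbose=2, checkext=0)
#   c.addroot("http://www.example.com/docs/")
#   c.run()
#   c.report()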


class Page:

    def __init__(self, text, url, verbose=VERBOSE, maxpage=MAXPAGE, checker=None):
        self.text = text
        self.url = url
        self.verbose = verbose
        self.maxpage = maxpage
        self.checker = checker

        # The parsing of the page is done in the __init__() routine in
        # order to initialize the list of names the file contains.
        # The parser is stored in an instance variable, and the URL is
        # passed to MyHTMLParser().
        size = len(self.text)
        if size > self.maxpage:
            self.note(0, "Skip huge file %s (%.0f Kbytes)", self.url, (size*0.001))
            self.parser = None
            return
        self.checker.note(2, " Parsing %s (%d bytes)", self.url, size)
        self.parser = MyHTMLParser(url, verbose=self.verbose,
                                   checker=self.checker)
        self.parser.feed(self.text)
        self.parser.close()

    def note(self, level, msg, *args):
        if self.checker:
            apply(self.checker.note, (level, msg) + args)
        else:
            if self.verbose >= level:
                if args:
                    msg = msg%args
                print msg

    # Method to retrieve names.
    def getnames(self):
        if self.parser:
            return self.parser.names
        else:
            return []

    def getlinkinfos(self):
        # The parsing was done in __init__(); the parser stored there
        # indicates whether it succeeded.

        # If no parser was stored, fail.
        if not self.parser: return []

        rawlinks = self.parser.getlinks()
        base = urlparse.urljoin(self.url, self.parser.getbase() or "")
        infos = []
        for rawlink in rawlinks:
            t = urlparse.urlparse(rawlink)
            # DON'T DISCARD THE FRAGMENT! Instead, include
            # it in the tuples which are returned.  See Checker.dopage().
            fragment = t[-1]
            t = t[:-1] + ('',)
            rawlink = urlparse.urlunparse(t)
            link = urlparse.urljoin(base, rawlink)
            infos.append((link, rawlink, fragment))

        return infos
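
# Illustrative note (example URL is a placeholder): for a page at
# http://www.example.com/dir/ containing <A HREF="a.html#sec1">, the
# getlinkinfos() method above would yield the tuple
# ('http://www.example.com/dir/a.html', 'a.html', 'sec1').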


class MyStringIO(StringIO.StringIO):

    def __init__(self, url, info):
        self.__url = url
        self.__info = info
        StringIO.StringIO.__init__(self)

    def info(self):
        return self.__info

    def geturl(self):
        return self.__url


class MyURLopener(urllib.FancyURLopener):

    http_error_default = urllib.URLopener.http_error_default

    def __init__(*args):
        self = args[0]
        apply(urllib.FancyURLopener.__init__, args)
        self.addheaders = [
            ('User-agent', 'Python-webchecker/%s' % __version__),
            ]

    def http_error_401(self, url, fp, errcode, errmsg, headers):
        return None

    def open_file(self, url):
        path = urllib.url2pathname(urllib.unquote(url))
        if os.path.isdir(path):
            if path[-1] != os.sep:
                url = url + '/'
            indexpath = os.path.join(path, "index.html")
            if os.path.exists(indexpath):
                return self.open_file(url + "index.html")
            try:
                names = os.listdir(path)
            except os.error, msg:
                exc_type, exc_value, exc_tb = sys.exc_info()
                raise IOError, msg, exc_tb
            names.sort()
            s = MyStringIO("file:"+url, {'content-type': 'text/html'})
            s.write('<BASE HREF="file:%s">\n' %
                    urllib.quote(os.path.join(path, "")))
            for name in names:
                q = urllib.quote(name)
                s.write('<A HREF="%s">%s</A>\n' % (q, q))
            s.seek(0)
            return s
        return urllib.FancyURLopener.open_file(self, url)


class MyHTMLParser(sgmllib.SGMLParser):

    def __init__(self, url, verbose=VERBOSE, checker=None):
        self.myverbose = verbose # now unused
        self.checker = checker
        self.base = None
        self.links = {}
        self.names = []
        self.url = url
        sgmllib.SGMLParser.__init__(self)

    def check_name_id(self, attributes):
        """ Check the name or id attributes on an element.
        """
        # We must rescue the NAME or id (name is deprecated in XHTML)
        # attributes from the anchor, in order to
        # cache the internal anchors which are made
        # available in the page.
        for name, value in attributes:
            if name == "name" or name == "id":
                if value in self.names:
                    self.checker.message("WARNING: duplicate ID name %s in %s",
                                         value, self.url)
                else: self.names.append(value)
                break

    def unknown_starttag(self, tag, attributes):
        """ In XHTML, you can have id attributes on any element.
        """
        self.check_name_id(attributes)

    def start_a(self, attributes):
        self.link_attr(attributes, 'href')
        self.check_name_id(attributes)

    def end_a(self): pass

    def do_area(self, attributes):
        self.link_attr(attributes, 'href')
        self.check_name_id(attributes)

    def do_body(self, attributes):
        self.link_attr(attributes, 'background', 'bgsound')
        self.check_name_id(attributes)

    def do_img(self, attributes):
        self.link_attr(attributes, 'src', 'lowsrc')
        self.check_name_id(attributes)

    def do_frame(self, attributes):
        self.link_attr(attributes, 'src', 'longdesc')
        self.check_name_id(attributes)

    def do_iframe(self, attributes):
        self.link_attr(attributes, 'src', 'longdesc')
        self.check_name_id(attributes)

    def do_link(self, attributes):
        for name, value in attributes:
            if name == "rel":
                parts = value.lower().split()
                if (parts == ["stylesheet"]
                        or parts == ["alternate", "stylesheet"]):
                    self.link_attr(attributes, "href")
                    break
        self.check_name_id(attributes)

    def do_object(self, attributes):
        self.link_attr(attributes, 'data', 'usemap')
        self.check_name_id(attributes)

    def do_script(self, attributes):
        self.link_attr(attributes, 'src')
        self.check_name_id(attributes)

    def do_table(self, attributes):
        self.link_attr(attributes, 'background')
        self.check_name_id(attributes)

    def do_td(self, attributes):
        self.link_attr(attributes, 'background')
        self.check_name_id(attributes)

    def do_th(self, attributes):
        self.link_attr(attributes, 'background')
        self.check_name_id(attributes)

    def do_tr(self, attributes):
        self.link_attr(attributes, 'background')
        self.check_name_id(attributes)

    def link_attr(self, attributes, *args):
        for name, value in attributes:
            if name in args:
                if value: value = value.strip()
                if value: self.links[value] = None

    def do_base(self, attributes):
        for name, value in attributes:
            if name == 'href':
                if value: value = value.strip()
                if value:
                    if self.checker:
                        self.checker.note(1, " Base %s", value)
                    self.base = value
        self.check_name_id(attributes)

    def getlinks(self):
        return self.links.keys()

    def getbase(self):
        return self.base
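
# A small illustrative sketch (not part of the original module) of using
# MyHTMLParser stand-alone to collect links and anchor names; the URL and
# markup are made up:
#
#   p = MyHTMLParser("http://www.example.com/")
#   p.feed('<A HREF="sub/page.html" NAME="top">x</A>')
#   p.close()
#   p.getlinks()    # -> ['sub/page.html']
#   p.names         # -> ['top']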


if __name__ == '__main__':
    main()