[objavi2.git] / tests / lxml_parse.py
blob 71e18bf43ac779dbf1a10cf5933930160d229e0f
#!/usr/bin/python

html2 = """
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<title>Gimp: Help</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body>
<h1>More Help
</h1>
<p>For more help with GIMP you can try these avenues:
</p>
<h2>GIMP Documentation&#160;
</h2>
<p>should first look at the very good documentation at the developers site:
</p>
<p><!--
Copyright (c) 14 May 2006 AdamHyde
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2
or any later version published by the Free Software Foundation.
<p />
For a full text of the licence please see:
<a href="http://www.gnu.org/licenses/fdl.txt" target="_top">http://www.gnu.org/licenses/fdl.txt</a>
-->
</p>
<p> link
<br>
</p>
<h2>GIMP Tutorials
</h2>
<p><a href="http://www.gimpguru.org/Tutorials/" target="_blank">http://www.gimpguru.org/Tutorials/&#160;</a>
</p>
<p><a href="http://gug.sunsite.dk/" target="_blank">http://gug.sunsite.dk/</a>
<br>
</p>
<p><a href="http://www.gimp-tutorials.com/" target="_blank">http://www.gimp-tutorials.com/</a>
</p>
<p><a href="http://www.gimp-tutorials.com/" target="_blank"></a><a href="http://gimpology.com/" target="_blank">http://gimpology.com/</a>
</p>
<p><a href="http://gimpology.com/" target="_blank"></a><a href="http://gimp-savvy.com/" target="_blank">http://gimp-savvy.com/</a>
<br>
</p>
<h2>Online Forums&#160;
</h2>
<p>You can also try searching through the forums for information. &#160;
</p>
<p>link
</p>
<p>The forums contain a lot of postings from users on many topics. You can use the search system to locate topics or just browse the categories. If you don't find what you want then try subscribing to the forums and posting your question to the relevant category. There are a few things to keep in mind when asking a question in a forum or to a mailing list. First, be as clear as you can with your question and provide any information that you might think would help some to try to help you. You might, for example, include information about the operating system you are using, or various specifics that relate to what you are trying to achieve. Additionally, it is always good practice to also post back to any forum or mailing list if you manage to solve your query and include clear information on how you solved the puzzle. This is so that someone else that may have the same issue can resolve it using what you have found out. If possible post back to the same thread (discussion topic) so that anyone searching through the forum can follow the discussion including the solution.
</p>
<h2><strong>Web Search</strong>
</h2>
<p>Searching the web is always useful. If you are looking for problems arising from errors reported by the software then try entering the error text into the search engine. Be sure to edit out any information that doesn't look generic when doing this. Some search engines also enable you to try searches of mailing lists, online groups etc, this can also provide good results.<strong>
<br></strong>
</p>
<h2>Mailing Lists
</h2>
<p>Mailing lists are good places to look through for answers to questions. The archives are located here :
</p>
<p>link
</p>
<p>You can browse the archives (although this can take a while). You can also subscribe to the mailing lists and ask a question:
</p>
<p>link
</p>
<p>Please note the suggestions about posting to forums and mailing lists in the above section.
</p>
<h2><strong>IRC</strong>
</h2>
<p><strong>IRC</strong> is a type of online chat. it is not the easiest to use if you are not familiar with it but it is a very good system. There are a variety of softwares for all operating systems that enable you to use <strong>IRC</strong>. The <strong>IRC</strong> channel is where a number of the developers are online and some 'superusers'. So logging into this channel can be useful but it is very important that you know exactly what you are trying to find out before trying this route. The protocol for using the channel is jus tot log in, and ask the question immediately. Don't try and be too chatty as you are probably going to be ignored. It is also preferable if you have done some research using the other methods above before trying the channel. The details for the <strong>IRC</strong> channel are:
</p>
<p>
</p>
<ul>
<li>IRC network: <code>detail</code> </li>
<li>Channel: <code>#detail</code>
<br> </li>
</ul>


</body></html>
"""

html = """
<html><body>
<!-- a comment -->
</body></html>
"""
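
# Note: html5parser needs html5lib installed. In lxml 5.2 and later,
# lxml.html.clean is provided by the separate lxml_html_clean package.
import lxml, lxml.html, lxml.html.clean
from lxml.html import html5parser

# Parse with lxml's own HTML parser, then with the html5lib-backed parser.
# The second assignment replaces the first, so the cleaner below runs on
# the html5parser tree.
tree = lxml.html.document_fromstring(html)
tree = html5parser.document_fromstring(html)

# comments=False leaves HTML comments in place; page_structure=False keeps
# structural tags such as <html>, <head> and <title>.
cleaner = lxml.html.clean.Cleaner(scripts=True,
                                  javascript=True,
                                  comments=False,
                                  style=True,
                                  links=True,
                                  meta=True,
                                  page_structure=False,
                                  processing_instructions=True,
                                  embedded=True,
                                  frames=True,
                                  forms=True,
                                  annoying_tags=True,
                                  #allow_tags=OK_TAGS,
                                  remove_unknown_tags=True,
                                  safe_attrs_only=True,
                                  add_nofollow=False
                                  )

#cleaner = lxml.html.clean.Cleaner()

# Calling the cleaner on a tree modifies that tree in place.
cleaner(tree)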