HtmlCleaner - HTML parser in Java
HtmlCleaner is an HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed, and unsuitable for further processing. HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those that most web browsers use to create the Document Object Model. However, the user may provide a custom tag and rule set for tag filtering and balancing.
http://htmlcleaner.sourceforge.net/
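To make the idea of tag balancing concrete, here is a minimal sketch in plain Java (11 or later). It is not HtmlCleaner's algorithm or API: the class name `TagBalancer`, the `balance` method, and the small set of void tags are assumptions made for illustration only. It closes unclosed elements, drops stray closing tags, and self-closes void elements such as `br`. It does not quote attributes, escape stray `<` characters, or apply browser-style nesting rules, so its output is not guaranteed to be well-formed XML for arbitrary input.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy tag balancer: an illustration of the idea, NOT HtmlCleaner's implementation. */
final class TagBalancer {
    // Elements that never take a closing tag (a small, assumed subset).
    private static final Set<String> VOID_TAGS =
            Set.of("br", "hr", "img", "input", "meta", "link");

    // Matches <name ...>, </name>, and <name ... />; attributes are kept verbatim.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*?)(/?)>");

    static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // stack of currently open elements
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start()); // copy text between tags unchanged
            pos = m.end();
            String name = m.group(2).toLowerCase();
            String attrs = m.group(3);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosed = !m.group(4).isEmpty();
            if (closing) {
                if (!open.contains(name)) {
                    continue; // stray closing tag with no matching open tag: drop it
                }
                // Close everything opened after the matching element, then the element.
                String top;
                do {
                    top = open.pop();
                    out.append("</").append(top).append('>');
                } while (!top.equals(name));
            } else if (selfClosed || VOID_TAGS.contains(name)) {
                out.append('<').append(name).append(attrs.stripTrailing()).append("/>");
            } else {
                open.push(name);
                out.append('<').append(name).append(attrs).append('>');
            }
        }
        out.append(html.substring(pos)); // trailing text after the last tag
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>'); // close what is still open
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: <p><b>hi</b></p>line1<br/>line2
        System.out.println(balance("<p><b>hi</p>line1<br>line2"));
    }
}
```

A real cleaner additionally needs per-tag rules (for example, that a new `li` implicitly closes the previous one), which is where a configurable tag and rule set comes in.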
Related Products
Yghtmlparser - Rapid Java HTML Parser Project
Introduction: This is a private project to research and develop a Java HTML parser. There are a number of open-source HTML parsers written in Java, but most of them cannot parse some web pages correctly because of the ambiguity of HTML syntax, and some are too heavyweight to use. Developing an HTML parser definitely differs from developing an XML parser, because an HTML parser MUST resolve the ambiguous syntax by itself. For example, the 'BR' tag is usually used only as an open tag but i
Php-mime-mail-parser - PHP Mime Mail Parser
This project strives to create a fast and efficient PHP Mime Mail Parser class using PHP's MailParse extension.
Example usage:
    <?php
    require_once('MimeMailParser.class.php');
    $path = 'path/to/mail.txt';
    $Parser = new MimeMailParser();
    $Parser->setPath($path);
    $to = $Parser->getHeader('to');
    $from = $Parser->getHeader('from');
    $subject = $Parser->getHeader('subject');
    $text = $Parser->getMessageBody('text');
    $html = $Parser->getMessageBody('html');
    $attachments = $Parser->getAttachments();
    ?>
There are three i
Ganon - Fast (HTML DOM) parser written in PHP
Ganon: The Ganon library gives access to HTML/XML documents in a very simple object-oriented way. It eases modifying the DOM and makes finding elements easy with CSS3-like queries. Ganon is: a universal tokenizer; an HTML/XML/RSS DOM parser; able to manipulate elements and their attributes; tolerant of invalid HTML; UTF-8 capable; able to perform advanced CSS3-like queries on elements (like jQuery, with namespaces supported); an HTML beautifier (like HTML Tidy); a CSS and JavaScript minifier; Sort attributes, change c
Htmlanalyzer - An analyzer for HTML
Introduction: Welcome to HTML Analyzer, a fast, well-structured, and simple analyzer. HTML Analyzer is an open-source analyzer used to analyze HTML pages, either on the local host or on the Internet. It is mainly used to extract data from HTML pages and is implemented in C++. HTML Analyzer consists of two parts: a lexer and a parser. The lexer extracts a sequence of separate words from an HTML page, while the parser analyzes the structure of the source. Using HTML Analyzer, you can parse the URL of the pages in the
Jssaxparser - A SAX 2 parser written in Javascript
JavaScript SAX 2 Parser: A lightweight JavaScript SAX 2 parser that reads XML text and triggers standardized SAX 2 events.
Introduction: The parser is able to read XML and its associated DTD. It fires the events of contentHandler, errorHandler, dtdHandler, entityResolver, declarationHandler, and lexicalHandler, conforming to the specification at http://www.saxproject.org/.
How to use it: import the library:
    <script type="text/javascript" src="../jssaxparser/sax.js"></script>
    <script type="text/javascript" sr
Polparser - Lightweight generic text parser in Obj-C
PolParser is a lightweight generic text parser in Obj-C for Mac OS X Leopard and later. PolParser creates a tree by parsing the input text. It currently supports various text formats such as XML, RSS, Atom, HTML, Apple Property Lists, and CSV, as well as source code for C-style languages such as C, C++, and Obj-C, and it is quite easy to add support for new text formats or languages. The fact that PolParser generates a tree makes it considerably easier to use than NSScanner & friends for complex parsing and ba
Pyinstaweb - Python binding for instaweb to parse malformed HTML with Beautiful Soup
pyinstaweb is a Python binding for instaweb to parse malformed HTML with Beautiful Soup. You can parse HTML through the Beautiful Soup interface, based on pyinstaweb:
    dom = BeautifulSoup.BeautifulSoup(html, builder=pyinstaweb.HTMLParserBuilder)
or use pyinstaweb's filter mechanism directly:
    class TestFilter(object):
        def __init__(self):
            self.text = []
        def handle_data(self, data):
            self.text.append((data, data.contents))
    filter = TestFilter()
    with HtmlParser(<url>) as parser:
        adapter = HtmlFilterAdapter(parser,
Sharpparser - HTML parser library for parsing HTML using the HTML5 parsing algorithm. The library pr
An HTML parser library for parsing HTML using the HTML5 parsing algorithm. The library provides a jQuery-like syntax and is written in C#. The goal of the project is to provide an HTML parser that is compliant with the HTML5 parsing algorithm. This yields a valid parse whose returned structure you can trust, which in turn allows you to write better HTML scrapers, web crawlers, microformat readers, etc. A further goal of this project is to provide a jQuery-like syntax that allows yo
Xmlcc - A platform independent object-oriented C++ library for generating, writing and parsing XML a
XMLCC: XMLCC is a C++ library for handling XML using design patterns, especially the Composite pattern. About: XMLCC allows generating XML structures using a hierarchical object-oriented model that can easily be written to an XML file. Parsing is available through several parsers: a DOM-like parser that builds the complete object-oriented model, which can afterwards be searched for XML tags, or a SAX-like parser that can be specialized to an XML structure by implementing an API. Both parsers are char by ch
Xpath4sax - XPath for SAX XML Parser
A quick XPath analyser built on a SAX parser. Some XPath syntax is not supported, but the commonly used syntax is present. Several XPath expressions can be matched in the same pass.
    XPathXMLHandler handler = new XPathXMLHandler() {
        @Override
        public void findXPathNode(SAXXPath xpath, Object node) {
            System.out.println("node=" + node);
        }
    };
    handler.setXPaths(XPathXMLHandler.toXPaths("//b[@at_a='s3']/c"));
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(new InputSource(new StringReader(xml)), handler
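
The general idea of matching paths during a single SAX pass can be sketched with the JDK alone. The example below does not use the Xpath4sax API: the class `SaxPathMatcher` and its `find` method are hypothetical names, and it only handles simple absolute element paths such as `/a/b/c`, with no `//` steps or predicates like `[@at_a='s3']`. It keeps a stack of open element names and compares the current path against the target on every start tag.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: matches a simple absolute element path while SAX streams the document. */
final class SaxPathMatcher extends DefaultHandler {
    private final String targetPath;   // e.g. "/a/b/c"
    private final String attribute;    // attribute whose value is collected on a match
    private final Deque<String> stack = new ArrayDeque<>(); // names of currently open elements
    private final List<String> values = new ArrayList<>();

    private SaxPathMatcher(String targetPath, String attribute) {
        this.targetPath = targetPath;
        this.attribute = attribute;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        stack.addLast(qName);
        // Rebuild the path from the root down to the element that just opened.
        String current = "/" + String.join("/", stack);
        if (current.equals(targetPath)) {
            String value = atts.getValue(attribute);
            if (value != null) {
                values.add(value);
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        stack.removeLast();
    }

    /** Returns the values of {@code attribute} on every element found at {@code path}. */
    static List<String> find(String xml, String path, String attribute) {
        SaxPathMatcher handler = new SaxPathMatcher(path, attribute);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return handler.values;
    }

    public static void main(String[] args) {
        String xml = "<a><b><c n=\"1\"/></b><b><c n=\"2\"/><d n=\"3\"/></b></a>";
        System.out.println(find(xml, "/a/b/c", "n")); // prints: [1, 2]
    }
}
```

Because only the stack of open elements is held in memory, this style of matching works on documents too large to load as a DOM, at the cost of supporting a much smaller path language.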