HtmlCleaner - HTML parser in Java

HtmlCleaner is an HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed, and unsuitable for further processing. HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those most web browsers use to build the Document Object Model. However, users may provide a custom tag and rule set for tag filtering and balancing.
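
To illustrate what "balancing" tags into well-formed XML can look like, here is a minimal sketch in plain Java. It is not HtmlCleaner's implementation or API: the class name TagBalancer, the method balance, and the small set of void tags are hypothetical, and the sketch only handles tags without attributes. It closes unclosed elements, drops stray closing tags, and self-closes void elements such as BR.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of HtmlCleaner.
class TagBalancer {
    // A small, illustrative subset of elements that never take a closing tag.
    private static final Set<String> VOID_TAGS =
            Set.of("br", "hr", "img", "input", "meta", "link");

    // Matches simple attribute-free tags such as <p>, </b>, <br> or <br/>.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)\\s*/?>");

    static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // stack of currently open elements
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(escape(html.substring(pos, m.start())));
            pos = m.end();
            String name = m.group(2).toLowerCase();
            boolean closing = !m.group(1).isEmpty();
            if (VOID_TAGS.contains(name)) {
                // Void elements are emitted self-closed; a stray </br> is dropped.
                if (!closing) {
                    out.append('<').append(name).append("/>");
                }
            } else if (!closing) {
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {
                // Close everything opened after the matching tag, then the tag itself.
                String top;
                do {
                    top = open.pop();
                    out.append("</").append(top).append('>');
                } while (!top.equals(name));
            }
            // Otherwise: a closing tag with no matching open tag is dropped.
        }
        out.append(escape(html.substring(pos)));
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    // Simplification: every '&', '<' and '>' in text is treated as a literal character.
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // prints <p>Hello <b>world</b></p><br/>
        System.out.println(balance("<p>Hello <b>world</p><br>"));
    }
}
```

A real cleaner such as HtmlCleaner also has to deal with attributes, comments, scripts, and per-tag nesting rules, which this sketch leaves out on purpose.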



http://htmlcleaner.sourceforge.net/






Related Products

Yghtmlparser - Rapid Java HTML Parser Project

Introduction: This is a private project to research and develop a Java HTML parser. There are a number of open-source HTML parsers written in Java, but most of them cannot parse some web pages correctly because of the ambiguity of HTML syntax, and some of them are too heavy to use. Developing an HTML parser definitely differs from developing an XML parser because an HTML parser MUST resolve the ambiguous syntax by itself. For example, 'BR' tag is usually used only open tag but i


Php-mime-mail-parser - PHP Mime Mail Parser

This project strives to create a fast and efficient PHP Mime Mail Parser Class using PHP's MailParse Extension.

Example Usage:

    <?php
    require_once('MimeMailParser.class.php');
    $path = 'path/to/mail.txt';
    $Parser = new MimeMailParser();
    $Parser->setPath($path);
    $to = $Parser->getHeader('to');
    $from = $Parser->getHeader('from');
    $subject = $Parser->getHeader('subject');
    $text = $Parser->getMessageBody('text');
    $html = $Parser->getMessageBody('html');
    $attachments = $Parser->getAttachments();
    ?>

There are three i


Ganon - Fast (HTML DOM) parser written in PHP

Ganon: The Ganon library gives access to HTML/XML documents in a very simple object oriented way. It eases modifying the DOM and makes finding elements easy with CSS3-like queries. Ganon is:

- A universal tokenizer
- A HTML/XML/RSS DOM Parser
- Ability to manipulate elements and their attributes
- Supports invalid HTML
- Supports UTF8
- Can perform advanced CSS3-like queries on elements (like jQuery -- namespaces supported)
- A HTML beautifier (like HTML Tidy)
- Minify CSS and Javascript
- Sort attributes, change c


Htmlanalyzer - An analyzer for HTML

Introduction: Welcome to HTML Analyzer! A fast, well-structured and simple analyzer. HTML Analyzer is an open source analyzer used to analyze HTML pages either on the local host or on the Internet. Mainly used to extract data from HTML pages, it is implemented in C++. HTML Analyzer includes two parts: a Lexer and a Parser. The Lexer extracts a sequence of separate words from an HTML page, while the Parser analyzes the structure of the source program. Using HTML Analyzer, you can parse the URL of the pages in the


Jssaxparser - A SAX 2 parser written in Javascript

JavaScript SAX 2 Parser: A lightweight JavaScript SAX 2 parser which reads XML text and triggers standardized SAX 2 events. Introduction: The parser is able to read XML and its associated DTD. It will throw the events of contentHandler, errorHandler, dtdHandler, entityResolver, declarationHandler, and lexicalHandler, conforming to the specification at http://www.saxproject.org/ . How to use it: Import library: <script type="text/javascript" src="../jssaxparser/sax.js"></script><script type="text/javascript" sr


Polparser - Lightweight generic text parser in Obj-C

PolParser is a lightweight generic text parser in Obj-C for Mac OS X Leopard and later. PolParser creates a tree by parsing the input text. It currently supports various text formats like XML, RSS, Atom, HTML, Apple Property Lists, CSV... as well as source code for C-style languages like C, C++, Obj-C..., and it's quite easy to add support for new text formats or languages. The fact that PolParser generates a tree makes it much easier to use than NSScanner & friends for complex parsing and ba


Pyinstaweb - Python binding for instaweb to parse malformed HTML with Beautiful Soup

pyinstaweb is a Python binding for instaweb to parse malformed HTML with Beautiful Soup. You can parse HTML through the Beautiful Soup interface, based on pyinstaweb:

    dom = BeautifulSoup.BeautifulSoup(html, builder=pyinstaweb.HTMLParserBuilder)

or use pyinstaweb's filter mechanism directly:

    class TestFilter(object):
        def __init__(self):
            self.text = []
        def handle_data(self, data):
            self.text.append((data, data.contents))

    filter = TestFilter()
    with HtmlParser(<url>) as parser:
        adapter = HtmlFilterAdapter(parser,


Sharpparser - HTML parser library for parsing HTML using the HTML5 parsing algorithm. The library pr

An HTML parser library for parsing HTML using the HTML5 parsing algorithm. The library provides a jQuery-like syntax and is written in C#. The goal of the project is to provide an HTML parser that is compliant with the HTML5 parsing algorithm. This will provide a valid parse that lets you trust the returned structure, which in turn allows you to write better HTML scrapers, web crawlers, microformat readers, etc. Furthermore, the goal of this project is to provide a jQuery-like syntax that allows yo


Xmlcc - A platform independent object-oriented C++ library for generating, writing and parsing XML a

XMLCC: XMLCC is a C++ library for handling XML using Design Patterns, especially the Composite Pattern. About: XMLCC allows for generating XML structures using a hierarchical object-oriented model that can be written to an XML file easily. Parsing is available through several parsers: a DOM-like parser that builds the complete object-oriented model, which can be searched for XML tags afterwards, or a SAX-like parser that can be specialized to an XML structure by implementing an API. Both parsers are char by ch


Xpath4sax - XPath for SAX XML Parser

A quick XPath analyser with a SAX parser. Some syntaxes are not supported, but all the commonly used ones are present. It is possible to catch many XPath expressions at the same time.

    XPathXMLHandler handler = new XPathXMLHandler() {
        @Override
        public void findXPathNode(SAXXPath xpath, Object node) {
            System.out.println("node=" + node);
        }
    };
    handler.setXPaths(XPathXMLHandler.toXPaths("//b[@at_a='s3']/c"));
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(new InputSource(new StringReader(xml)), handler

