AngleSharp is a .NET library for parsing angle-bracket-based hypertext such as HTML, SVG, and MathML. It also supports XML without validation, and it can parse CSS. The included parser is built on the official W3C specification, so it produces a portable HTML5 DOM representation of the given source code and gives results compatible with evergreen browsers. Standard DOM features such as querySelector and querySelectorAll also work for tree traversal.

AngleSharp - The ultimate angle brackets parser library parsing HTML5, MathML, SVG and CSS to construct a DOM based on the official W3C specifications

TagSoup is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
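To illustrate what event-based (SAX-style) parsing of messy HTML looks like, here is a minimal sketch. It does not use TagSoup or Java; it assumes Python's standard-library `html.parser` as a stand-in, and the names `EventCollector` and `collect_events` are made up for this example. The point is only that the parser reports start tags and text as callbacks and does not give up when tags are left unclosed.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Records parser callbacks as (event, data) tuples, SAX-style.

    Hypothetical helper for illustration; not part of TagSoup.
    """

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text so the event list stays readable.
        if data.strip():
            self.events.append(("text", data.strip()))


def collect_events(markup):
    """Feed markup to the parser and return the recorded events."""
    parser = EventCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at end of input
    return parser.events


# Neither <p> nor <b> is ever closed; events are still reported.
events = collect_events("<p>Hello <b>world<p>Next")
for event in events:
    print(event)
```

Because the input never closes its elements, no "end" events appear in the output; a consumer that needs a balanced tree has to repair the structure itself, which is the job tools like TagSoup take on.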

TagSoup - SAX-compliant parser in Java

Jodd is a developer-friendly set of Java microframeworks, tools and utilities, under 1.7 MB. Built with common sense to make things simple, but not simpler. Its features include a slick IoC container, elegant MVC framework, unique AOP engine, thin DB-object mapper, standalone transaction manager, focused validation tool, versatile HTML parsers, pages decorator, super properties, powerful BeanUtil, timeless JDateTime, easy email, many super utilities... and more.
Tools and utilities: 
<ul>
<li><code>jodd-core</code> contains many utilities, including <code>JDateTime</code>.</li>
<li><code>jodd-bean</code>, our infamous <code>BeanUtil</code>, type inspectors and converters.</li>
<li><code>jodd-props</code> is the super-replacement for Java <code>Properties</code>.</li>
<li><code>jodd-mail</code> for easier email sending.</li>
<li><code>jodd-upload</code>, handles HTTP uploads.</li>
<li><code>jodd-servlet</code> with many servlet utilities, including a nice tag library.</li>
<li><code>jodd-http</code>, tiny HTTP client.</li>
</ul>
Micro-frameworks:

<ul>
<li><code>jodd-madvoc</code> - slick MVC framework.</li>
<li><code>jodd-petite</code> - pragmatic DI container.</li>
<li><code>jodd-lagarto</code> - HTML parser with <code>Jerry</code> and <code>CSSelly</code>.</li>
<li><code>jodd-decora</code> - pages decorator.</li>
<li><code>jodd-htmlstapler</code> - static page resources handler.</li>
<li><code>jodd-proxetta</code> - dynamic proxies and <code>Paramo</code>.</li>
<li><code>jodd-db</code> - thin database layer and object mapper.</li>
<li><code>jodd-json</code> - JSON parser and serializer.</li>
<li><code>jodd-vtor</code> - validation framework.</li>
<li><code>jodd-jtx</code> - transactions management.</li>
</ul>

Jodd - The Unbearable Lightness of Java

parse5 provides nearly everything you may need when dealing with HTML. It's the fastest spec-compliant HTML parser for Node to date. It parses HTML the way the latest version of your browser does. It has proven itself reliable in such projects as jsdom, Angular2, Polymer and many more.

parse5 - HTML parsing / serialization toolset for Node.js

Hpricot is a fast, flexible HTML parser. Hpricot can be handy for reading broken XML files, since many of the same techniques can be used. If a quote is missing, Hpricot tries to figure it out. If tags overlap, Hpricot works on sorting them out. Source code location: http://github.com/hpricot/hpricot
  <UL>
	<LI>Hpricot is a standalone library. It requires no other libraries. Just Ruby!</LI>
	<LI>Hpricot works hard to sort out bad HTML and pays a small penalty in order to get that right.</LI>
	<LI>If you can see it in Firefox, then Hpricot should parse it.</LI>
	<LI>Primarily, Hpricot is used for reading HTML and tries to sort out troubled HTML by having some idea of what good HTML is.</LI>
 </UL>

Hpricot - HTML parser for Ruby

TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification, and can be used to parse either well-formed XML, or unstructured and malformed HTML from the web. The library also provides useful functions to extract information from an HTML document, making it ideal for screen-scraping. The library provides a basic data type for a list of unstructured tags, a parser to convert HTML into this tag type, and useful functions and combinators for finding and extracting information.

TagSoup - HTML/XML parser for Haskell

JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document being processed, which effectively lets you use JTidy as a DOM parser for real-world HTML.

JTidy - HTML parser and pretty printer in Java

Nokogiri (鋸) is an HTML, XML, SAX, DOM parser. Nokogiri's many features include searching documents via XPath or CSS3 selectors, an XML/HTML builder, and an XSLT transformer. Nokogiri parses and searches XML/HTML using native libraries (either C or Java, depending on your Ruby), which means it's fast and standards-compliant.

Nokogiri - HTML, XML, SAX, and Reader parser with XPath and CSS selector support

Beautiful Soup is a Python HTML/XML parser designed for quick-turnaround projects like screen-scraping. Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose URLs match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."
  <UL>
	<LI>It won't choke if you give it bad markup</LI>
	<LI>It provides Pythonic idioms for navigating, searching, and modifying a parse tree</LI>
	<LI>It converts incoming documents to Unicode and outgoing documents to UTF-8.</LI>
 </UL>
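The link queries described above can be sketched roughly as follows. This is not Beautiful Soup's API: to keep the example dependency-free it assumes Python's standard-library `html.parser`, and `LinkCollector`, `find_links`, `css_class`, `url_contains` and the `SAMPLE` markup are all hypothetical names invented for the illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the attributes of every <a> start tag as a dict."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs))


def find_links(markup, css_class=None, url_contains=None):
    """Return href values of links, optionally filtered.

    css_class: keep only links whose class attribute lists this class.
    url_contains: keep only links whose href contains this substring.
    """
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    result = []
    for attrs in collector.links:
        classes = (attrs.get("class") or "").split()
        href = attrs.get("href") or ""
        if css_class is not None and css_class not in classes:
            continue
        if url_contains is not None and url_contains not in href:
            continue
        result.append(href)
    return result


# Deliberately sloppy markup: the second <a> and the <p> are never closed.
SAMPLE = """
<p><a class="externalLink" href="http://foo.com/a">A</a>
<a href="/local">B
<a class="nav externalLink" href="http://bar.org/">C</a>
"""

print(find_links(SAMPLE))                            # all links
print(find_links(SAMPLE, css_class="externalLink"))  # by class
print(find_links(SAMPLE, url_contains="foo.com"))    # by URL substring
```

A tree-building library such as Beautiful Soup goes further than this sketch: it keeps the whole parse tree, so queries like "the table heading that's got bold text" can look at an element's children rather than at start tags alone.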

Beautiful Soup - Python HTML/XML parser

HtmlCleaner is an HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those most web browsers use to create the Document Object Model. However, users may provide a custom tag and rule set for tag filtering and balancing.
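As a rough idea of what tag balancing involves, the sketch below closes elements that were left open and drops stray end tags so the output is well-formed. It is a simplified illustration, not HtmlCleaner's algorithm or API: it assumes Python's standard-library `html.parser`, it does not reorder elements or apply browser-like rules, and the `Balancer` class, the `balance` function and the small `VOID` set are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Assumption: a small, incomplete set of elements that never have content.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class Balancer(HTMLParser):
    """Re-emits markup while tracking open elements on a stack."""

    def __init__(self):
        super().__init__()
        self.out = []    # output fragments
        self.stack = []  # names of currently open elements

    def handle_starttag(self, tag, attrs):
        rendered = "".join(f" {k}={quoteattr(v or '')}" for k, v in attrs)
        if tag in VOID:
            self.out.append(f"<{tag}{rendered}/>")  # self-close void elements
        else:
            self.out.append(f"<{tag}{rendered}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag with no matching open element: drop it
        # Close everything opened after the matching element, then the
        # element itself, so the output nests properly.
        while True:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))


def balance(markup):
    """Return markup with every opened element closed in nesting order."""
    balancer = Balancer()
    balancer.feed(markup)
    balancer.close()
    while balancer.stack:  # close whatever is still open at end of input
        balancer.out.append(f"</{balancer.stack.pop()}>")
    return "".join(balancer.out)


print(balance("<div><b>bold<i>both</b> tail<br><p>para</span></div>"))
```

The output is only a well-formed XML document when the input has a single root element and valid attribute names; handling multiple roots, comments, and element-specific nesting rules is where real cleaners do most of their work.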

HtmlCleaner - HTML parser in Java

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers. By default, the document will be an xml.etree element instance. Whenever possible, html5lib chooses the accelerated ElementTree implementation (i.e. xml.etree.cElementTree on Python 2.x). Two other tree types are supported: xml.dom.minidom and lxml.etree.

html5lib - Standards-compliant library for parsing and serializing HTML documents and fragments in Python

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and fix up many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements, automatically closes elements with optional end tags, and can handle mismatched inline element tags.

NekoHTML is written using the Xerces Native Interface (XNI), which is the foundation of the Xerces2 implementation. This enables you to use the NekoHTML parser with existing XNI tools without modifying or rewriting code.

Neko HTML Parser - simple HTML scanner
