iText is a popular and widely used PDF library for generating PDF documents dynamically. It is especially useful to web developers who need to generate PDF documents and reports from data in an XML file or a database and serve them to the browser. It supports bookmarks, watermarks, encryption, form filling and a lot more.
Apache PDFBox is an open source Java library for working with PDF documents. It allows creation of new PDF documents, manipulation of existing documents and extraction of content from documents. It supports bookmarks, fonts, text extraction, encryption, PDF printing and a lot more.
Ghostscript is a rendering and conversion engine for page description languages, including PostScript and PDF. It can convert PostScript language files to many raster formats, view them on displays, and print them on printers that don't have PostScript language capability built in.
Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. It can convert documents in markdown, reStructuredText, textile, HTML, DocBook, or LaTeX to HTML formats, word processor formats, PDF and other markup formats.
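As a rough illustration of the command-line tool, the sketch below builds a pandoc invocation from Python. The helper names `pandoc_command` and `convert` and the file names are hypothetical; `-f`, `-t` and `-o` are pandoc's options for input format, output format and output file. Running `convert` assumes pandoc is installed and on the PATH.

```python
import subprocess


def pandoc_command(source, target, from_format=None, to_format=None):
    """Build the argument list for a pandoc conversion (hypothetical helper)."""
    cmd = ["pandoc", source]
    if from_format:
        cmd += ["-f", from_format]   # input format, e.g. "markdown"
    if to_format:
        cmd += ["-t", to_format]     # output format, e.g. "html"
    cmd += ["-o", target]            # output file
    return cmd


def convert(source, target, **formats):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(source, target, **formats), check=True)


# Show the command that would be run for a markdown-to-HTML conversion.
print(" ".join(pandoc_command("notes.md", "notes.html", "markdown", "html")))
```

When the formats are omitted, pandoc guesses them from the file extensions, so `convert("notes.md", "notes.html")` is usually enough.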
Google Refine is a power tool for working with messy data: cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase. Google Refine is a web application, but it runs on your own machine and is used only by you. Its reconciliation support helps to link text names in your data to database identifiers.
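To make the idea of reconciliation concrete, here is a toy sketch that links free-text names to identifiers by normalizing them and then fuzzy matching. It is not how Google Refine implements reconciliation; the function names, the `cutoff` value and the identifiers are all assumptions made for illustration.

```python
import difflib
import unicodedata


def normalize(name):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents.lower())
    return " ".join(cleaned.split())


def reconcile(names, known_entities, cutoff=0.85):
    """Map each raw name to an identifier, or None if nothing is close enough.

    known_entities maps a canonical label to its (hypothetical) identifier.
    """
    index = {normalize(label): ident for label, ident in known_entities.items()}
    result = {}
    for raw in names:
        match = difflib.get_close_matches(normalize(raw), list(index), n=1, cutoff=cutoff)
        result[raw] = index[match[0]] if match else None
    return result


# Hypothetical identifiers, for illustration only.
known = {"Apple Inc.": "id:002", "International Business Machines": "id:001"}
print(reconcile(["  APPLE, Inc. ", "Aple Inc", "Unknown Co"], known))
```

A real reconciliation service would also rank several candidates and let the user confirm ambiguous matches rather than silently picking the closest one.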
SMILA is an extensible framework for building search solutions to access unstructured information in the enterprise. Besides providing essential infrastructure components and services, SMILA also delivers ready-to-use add-on components, such as connectors to the most relevant data sources. Building on the framework lets developers concentrate on creating higher-value solutions, such as semantics-driven applications.
Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It can crawl and extract information from file systems, websites, mailboxes and mail servers. It supports various file formats such as Office, PDF, Zip and a lot more. Metadata is also extracted from image files. Aperture has a strong focus on semantics: extracted metadata can be mapped to predefined properties.
GATE excels at text analysis of all shapes and sizes. It supports diverse language processing tasks with components such as parsers, morphological analyzers, taggers, information retrieval tools and information extraction components for various languages, among many others. It provides support to measure, evaluate, model and persist the data structures involved. It can analyze text or speech. It has built-in support for machine learning and supports other machine learning implementations via plugins.
The boilerpipe library provides algorithms to detect and remove the surplus clutter (boilerplate, templates) around the main textual content of a web page. The library already provides specific strategies for common tasks (for example: news article extraction).
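The general idea of separating main content from clutter can be sketched with a simple heuristic: split the page into text blocks and keep only those that are long enough and not dominated by link text. This is a deliberately simplified illustration using Python's standard `html.parser`, not boilerpipe's own algorithms; the class and function names and the `min_words` and `max_link_density` thresholds are assumptions.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks, counting words inside <a> per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # (text, word_count, link_word_count)
        self._words = []
        self._link_words = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        if self._words:
            self.blocks.append((" ".join(self._words), len(self._words), self._link_words))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()       # a new block starts here

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()       # the current block ends here

    def handle_data(self, data):
        if self._skip:
            return              # ignore script and style content
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()           # keep any trailing text as a final block


def extract_main_text(html, min_words=10, max_link_density=0.33):
    """Keep blocks that are long enough and mostly non-link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = [text for text, n, links in parser.blocks
            if n >= min_words and links / n <= max_link_density]
    return "\n".join(kept)
```

Navigation bars tend to be short and almost entirely link text, so they fall below both thresholds, while article paragraphs pass. The thresholds would need tuning for real pages.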
PDFClown is a PDF library that helps to generate, read and edit PDF documents. It can also split and merge PDF documents. It supports images, fonts, barcodes, bookmarks, annotations, form fields (such as checkboxes, buttons and list boxes), compression and text extraction.