Nutch
Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.
Grub
Grub Next Generation is a distributed web crawling system (clients/servers) that helps build and maintain an index of the Web. It uses a client-server architecture in which clients crawl the web and send updates to the server. The peer-to-peer grubclient software crawls during computer idle time.
Open Search Server
Open Search Server is both a modern crawler and search engine and a suite of high-powered full-text search algorithms. It is built on open source technologies such as Lucene, zkoss, Tomcat, POI, and TagSoup. Open Search Server is a stable, high-performance piece of software.
Arachnode.net
An open source .NET web crawler written in C# using SQL Server 2005/2008. Arachnode.net is a complete and comprehensive .NET web crawler for downloading, indexing, and storing Internet content, including e-mail addresses, files, hyperlinks, images, and Web pages.
ASPseek
ASPseek is Internet search engine software developed by SWsoft. ASPseek consists of an indexing robot, a search daemon, and a CGI search frontend. It can index up to a few million URLs, search for words and phrases, use wildcards, and perform Boolean searches. Search results can be limited to a given time period, a site, or a Web space (a set of sites), and sorted by relevance (PageRank is used) or by date.
mnoGoSearch
mnoGoSearch for UNIX consists of a command-line indexer and a search program that can run under the Apache Web Server or any other HTTP server supporting the CGI interface. mnoGoSearch for UNIX is distributed as source code and can be compiled with a number of databases, depending on the user's choice. It is known to work on a wide variety of modern Unix operating systems, including Linux, FreeBSD, Mac OS X, Solaris, and others.
Crawler4j
Crawler4j is an open source Java crawler that provides a simple interface for crawling the web. Using it, you can set up a multi-threaded web crawler in 5 minutes!
Heritrix
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix is designed to respect the robots.txt exclusion directives and META robots tags, and collect material at a measured, adaptive pace unlikely to disrupt normal website activity.
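As a rough illustration of what respecting robots.txt exclusion directives can look like in practice, here is a minimal Python sketch using only the standard library. It is not Heritrix code and does not reproduce Heritrix's adaptive pacing; it simply filters URLs against a robots.txt file and attaches a fixed delay. The robots.txt content, the agent name "example-bot", the example.com URLs, and the helper names build_parser and crawl_plan are all assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl_plan(parser, agent, urls, default_delay=1.0):
    """Return (url, delay_seconds) pairs for the URLs the agent may fetch."""
    delay = parser.crawl_delay(agent)
    if delay is None:
        # No Crawl-delay given for this agent: fall back to an assumed default.
        delay = default_delay
    return [(url, float(delay)) for url in urls if parser.can_fetch(agent, url)]


parser = build_parser(ROBOTS_TXT)
plan = crawl_plan(
    parser,
    "example-bot",
    [
        "http://example.com/index.html",
        "http://example.com/private/data.html",
    ],
)
```

A crawler built on this would sleep for the returned delay between requests to the same host; an adaptive crawler would presumably adjust that delay based on observed server response times instead of keeping it fixed.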
Andjing - PHP web crawler/spider
Andjing Web Crawler 0.01 pre-alpha. Andjing is a basic web crawler/spider written in PHP that runs in a CLI environment. Requirements: PHP, MySQL. To do: switch the database from MySQL to SQLite to use fewer CPU resources. What you can do: modify this application into a powerful email harvester and/or content crawler. Application usage: extract the files; create the database and table from the included SQL dump file; edit config.php as needed; run C:\andjing>php.exe andjing.php http://some
Ccrawler - Web Crawler Engine, with web categorization extension
C Crawler is a web crawler built in C# on the .NET Framework 3.5. It includes a simple web content categorizer extension, which can separate web pages according to their content ...