Cascalog - Data processing on Hadoop
Cascalog is a fully-featured data processing and querying library for Clojure. The main use cases for Cascalog are processing Big Data on top of Hadoop or doing analysis on your local computer from the Clojure REPL. Cascalog is a replacement for tools like Pig, Hive, and Cascading.
https://github.com/nathanmarz/cascalog
Related Products
OpenMobster - Open Source Mobile Cloud Platform
OpenMobster is an open source Enterprise Backend for Mobile Apps. It provides a bi-directional data synchronization service that lets mobile apps synchronize their locally stored database with Enterprise services in the Cloud such as server apps, CRM, ERP, etc. It supports a platform-agnostic Cloud-initiated Push Notification System. It has a framework for creating end-to-end Location Aware Apps.
Handbrake - The open source video transcoder
HandBrake is a tool for converting video from nearly any format to a selection of modern, widely supported codecs. HandBrake can process most common multimedia files and any DVD or BluRay sources that do not contain any kind of copy protection.
Hadoop4win - Hadoop for Windows using Cygwin
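hadoop4win: Hadoop for Windows using Cygwin. This software project is sponsored by the National Center for High-performance Computing (NCHC). Software overview: hadoop4win, as the name suggests, means "Hadoop for Windows". It provides a batch installer that makes it easy to install Hadoop on the Windows platform. The installer is based mainly on drbl-winroll, a work by 孫振凱 of the NCHC DRBL and Clonezilla team; the installation portions of that program were extracted and rewritten into the steps hadoop4win needs. hadoop4win currently consists of four major software components: Cygwin - provides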
Hbase-jdo - Simple util with hbase
What is HBase-util? HBase-util is an open source module that stores bean classes directly in HBase tables (http://hbase.org/) running on the Hadoop Distributed FileSystem (http://hadoop.apache.org/core/). This project was contributed to Apache HBase (http://wiki.apache.org/hadoop/Hbase). This is not JDO (persistence API), just a simple module for HBase. HBase-util makes HBase easier to handle, and this project can help you execute Java programs simply. http://code.google.com/p/simple-jav
Pigpy - Pig report management tool
pypig - a Python tool to manage Pig reports. Pig provides an amazing set of tools to create complex relational processes on top of Hadoop, but it has a few missing pieces: looping constructs for easily creating multiple similar reports; caching of intermediate calculations; data management and cleanup code; and easy testing for report correctness. pypig is an attempt to fill in these holes by providing a Python module that knows how to talk to a Hadoop cluster and can create and manage complex re
GoldenOrb - Scalable Graph Analysis
GoldenOrb is a cloud-based project for massive-scale graph analysis, built upon Apache Hadoop and modeled after Google's Pregel architecture. It provides solutions to complex data problems, removes limits to innovation, and contributes to the emerging ecosystem that spans all aspects of big data analysis. It enables users to run analytics on entire data sets instead of samples.
Mrcl - Hadoop + CUBLAS
Mr.CL Project Overview: We combine the power of two major tools for data processing: Hadoop and NVIDIA CUDA. Hadoop provides scalability over multiple nodes, and CUDA speeds up block-level calculations. The goal of this project was to improve distributed matrix multiplication on Hadoop. Hadoop emphasizes data locality using HDFS, but matrix multiplication has to access remote data unless columns or rows are aligned to each machine. This makes matrix multiplication infeasible on MapReduce schem
Priter - Distributed Computing Framework for Prioritized Iteration
What is PrIter? PrIter is a modified version of the Hadoop MapReduce framework that supports prioritized iterative computation, which covers a large collection of iterative algorithms, including PageRank and shortest path. PrIter runs on a cluster of commodity PCs or in the Amazon EC2 cloud. It ensures faster convergence of the iterative process by reorganizing the update order of data items. PrIter also supports online queries and periodically generates a top-k result snapshot. For details, please re
Parbash - Parallel BASH
Parallel BASH is a modified version of BASH intended for text processing on computer clusters. It enables use of common UNIX text processing tools (e.g., awk, perl, grep) across multicore or distributed systems. It is particularly suited for scalable processing of large (multi-GB or larger) files. parbash interprets scripts in the same way as BASH does except when a structure similar to the following is encountered: cat hdfs:/student_marks | grep ^A | sort | uniq -c > hdfs:/out In this case, parb
Sqoop - Transfers data between Hadoop and Datastores
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.