Sqoop - Transfers data between Hadoop and Datastores
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop to import data from external structured datastores into the Hadoop Distributed File System (HDFS) or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.
pypig - a python tool to manage Pig reports. Pig provides an amazing set of tools for creating complex relational processes on top of Hadoop, but it has a few missing pieces:
1. Looping constructs for easily creating multiple similar reports
2. Caching of intermediate calculations
3. Data management and cleanup code
4. Easy testing for report correctness
pypig is an attempt to fill these holes by providing a Python module that knows how to talk to a Hadoop cluster and can create and manage complex reports.
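The "looping constructs" gap can be illustrated with plain Python string templating that stamps out several similar Pig reports; this is only a sketch of the idea, not pypig's actual API, and the table names, fields, and paths are invented:

```python
# Hypothetical illustration of looping over similar Pig reports. The report
# template, field names, and paths are made up; pypig's real API is not shown.
PIG_TEMPLATE = """
data = LOAD 'input/{table}' USING PigStorage() AS (user:chararray, amount:double);
by_user = GROUP data BY user;
report = FOREACH by_user GENERATE group, SUM(data.amount);
STORE report INTO 'reports/{table}_totals';
"""

def build_reports(tables):
    """Return one Pig script per table, varying only the I/O paths."""
    return {t: PIG_TEMPLATE.format(table=t) for t in tables}

scripts = build_reports(["sales", "refunds", "fees"])
print(scripts["sales"])
```

Without a helper like this, each near-identical report has to be copied and edited by hand, which is exactly the maintenance burden pypig targets.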
Mr.CL - Project Overview. We combine the power of two major data-processing tools: Hadoop and NVIDIA CUDA. Hadoop provides scalability over multiple nodes, while CUDA speeds up block-level calculations. The goal of this project was to improve distributed matrix multiplication on Hadoop. Hadoop emphasizes data locality through HDFS, but matrix multiplication has to access remote data unless columns or rows are aligned to each machine, which makes matrix multiplication infeasible on the MapReduce scheme.
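The remote-data problem can be seen in the standard MapReduce matrix-multiplication scheme, sketched below in pure Python (this is the textbook formulation, not necessarily Mr.CL's exact implementation): every A[i][k] must be shuffled to every column j and every B[k][j] to every row i, so partial products can meet at reducer key (i, j):

```python
# Pure-Python sketch of the textbook one-pass MapReduce matrix multiply.
# The all-to-all shuffle in the map phase is the remote-data access the
# paragraph above describes; Mr.CL's CUDA offload is not shown here.
from collections import defaultdict

def matmul_mapreduce(A, B):
    n, m, p = len(A), len(B), len(B[0])
    # Map phase: route each matrix entry to every (i, j) cell that needs it.
    groups = defaultdict(list)
    for i in range(n):
        for k in range(m):
            for j in range(p):
                groups[(i, j)].append(("A", k, A[i][k]))
    for k in range(m):
        for j in range(p):
            for i in range(n):
                groups[(i, j)].append(("B", k, B[k][j]))
    # Reduce phase: join A and B values on k and sum the products.
    C = [[0] * p for _ in range(n)]
    for (i, j), vals in groups.items():
        a = {k: v for tag, k, v in vals if tag == "A"}
        b = {k: v for tag, k, v in vals if tag == "B"}
        C[i][j] = sum(a[k] * b[k] for k in a)
    return C

print(matmul_mapreduce([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Each entry is replicated across the shuffle proportionally to the matrix dimensions, which is why doing the per-block arithmetic faster (e.g., on a GPU) only helps once the data-movement cost is also addressed.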
hadoop4win - Hadoop for Windows using Cygwin. This software project is sponsored by the National Center for High-performance Computing (NCHC). As its name suggests, hadoop4win ("Hadoop for Windows") provides batch installers that make it simple to install Hadoop on the Windows platform. The installers are adapted from the drbl-winroll work of a member of NCHC's DRBL/Clonezilla team, with the installation steps extracted and rewritten for hadoop4win's needs. hadoop4win currently comprises four major software components, the first of which is Cygwin.
parbash (Parallel BASH) is a modified version of BASH intended for text processing on computer clusters. It enables the use of common UNIX text-processing tools (e.g., awk, perl, grep) across multicore or distributed systems, and is particularly suited to scalable processing of large (multi-GB or larger) files. parbash interprets scripts in the same way as BASH does, except when it encounters a structure similar to the following: cat hdfs:/student_marks | grep ^A | sort | uniq -c > hdfs:/out. In this case, parbash executes the pipeline across the cluster rather than as a single local process.
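What that example pipeline computes can be expressed locally in a few lines of Python, which shows the semantics parbash distributes (the hdfs:/ paths are from the example above; the sample data here is invented):

```python
# Local equivalent of: grep ^A | sort | uniq -c
# i.e., keep lines starting with "A", then count duplicates in sorted order.
from collections import Counter

def grep_sort_uniq_c(lines):
    kept = [ln for ln in lines if ln.startswith("A")]   # grep ^A
    counts = Counter(kept)                              # uniq -c (after sort)
    return [(n, line) for line, n in sorted(counts.items())]

marks = ["A 90", "B 75", "A 90", "A 81"]
print(grep_sort_uniq_c(marks))  # [(1, 'A 81'), (2, 'A 90')]
```

The appeal of parbash is that the shell form stays unchanged while the sort/count work is spread over the cluster instead of a single machine.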
cc2svn is a tool that converts ClearCase view files, with all history and given labels, into an SVN dump. The dump can be loaded into SVN with 'svnadmin load <repo-path> < svndump.txt'. Features: transfers the history of changes for files, preserving the date, author, and comment of each revision; converts all/some/no branches (configurable); converts all/some/no labels (configurable); incremental dump mode; retry/ignore failed ClearCase commands; a cache for ClearCase files; tested on Linux/Solaris with Python 2.5/2.6.
Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault-tolerant data processing workflows on an Apache Hadoop cluster. It is a thin Java library and API that sits on top of Hadoop's MapReduce layer and is executed from the command line like any other Hadoop application.
This Python package includes a wide variety of utilities, focused primarily on numerical Python, statistics, and file input/output. News: 2012-04-27: tagged version 0.5.1. 2012-01-15: tagged version 0.5.0. 2012-01-15: the comoving volume returned by the cosmology class is now for the whole sky, whereas previously it was per steradian; to get the old behavior, divide by 4*pi. Getting the code: to use stable versions, get one of the downloads.
Transfer Entropy Toolbox - A suite of MATLAB/C and C++ tools for computing standard and extended versions of Thomas Schreiber's transfer entropy on sparse, binary time series. What is Transfer Entropy (TE)? From Schreiber, 2000: "An information theoretic measure is derived that quantifies the statistical coherence between systems evolving in time. The standard time delayed mutual information fails to distinguish information that is actually exchanged from shared information due to common history and input signals."
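For intuition, here is a minimal plug-in estimator of the standard transfer entropy for binary series with history length k = l = 1; this is only an illustrative sketch of Schreiber's definition, not the toolbox's implementation, and the extended variants it supports are not shown:

```python
# Plug-in estimator of TE(X -> Y) in bits for binary sequences, history 1:
#   TE = sum over (y+, y, x) of p(y+, y, x) * log2( p(y+ | y, x) / p(y+ | y) )
from collections import Counter
from math import log2
import random

def transfer_entropy(x, y):
    """Estimate TE(X -> Y) from two equal-length binary sequences."""
    triples = Counter(zip(y[1:], y[:-1], x[:-1]))   # (y_{n+1}, y_n, x_n)
    pairs   = Counter(zip(y[1:], y[:-1]))           # (y_{n+1}, y_n)
    hist    = Counter(zip(y[:-1], x[:-1]))          # (y_n, x_n)
    single  = Counter(y[:-1])                       # y_n
    n = len(x) - 1
    te = 0.0
    for (yp, yn, xn), c in triples.items():
        p_joint = c / n
        p_cond_full = c / hist[(yn, xn)]            # p(y+ | y, x)
        p_cond_hist = pairs[(yp, yn)] / single[yn]  # p(y+ | y)
        te += p_joint * log2(p_cond_full / p_cond_hist)
    return te

# y copies x with a one-step delay, so information flows x -> y but not back.
random.seed(0)
x = [random.randint(0, 1) for _ in range(5000)]
y = [0] + x[:-1]
print(transfer_entropy(x, y) > transfer_entropy(y, x))  # True
```

The asymmetry in the example is the point of TE: plain time-delayed mutual information would be large in both directions here, while TE isolates the direction in which information is actually transferred.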
Project Goal: Today, applications have two choices when it comes to the transport channel: TCP or UDP. While UDP is an unreliable protocol, TCP provides reliability, flow control, and congestion control. However, TCP has an explicit connection establishment phase, which some applications may not find desirable. Your job in this project is to design and implement a file transfer application that has all the good features of TCP without the connection establishment phase. Your application, reliable UDP, should provide these features on top of UDP.
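One classic way to get reliability without a handshake is stop-and-wait: number each packet, resend until its ACK arrives, and deduplicate at the receiver. The sketch below simulates this over a deterministic lossy "channel" rather than real UDP sockets; the packet format, loss pattern, and assumption of lossless ACKs are all simplifications, not the assignment's required design:

```python
# Stop-and-wait reliability sketch over a simulated lossy channel.
# No connection setup: the first data packet (seq 0) can arrive cold.
def lossy_channel(drop_every=3):
    """Return a send function that silently drops every Nth packet."""
    state = {"count": 0}
    def send(packet, deliver):
        state["count"] += 1
        if state["count"] % drop_every != 0:     # simulate packet loss
            deliver(packet)
    return send

def transfer(data, chunk_size=4, drop_every=3):
    """Resend each numbered chunk until it is acknowledged (ACKs lossless here)."""
    send = lossy_channel(drop_every)
    received = []
    expected = {"seq": 0}

    def receiver(packet):
        seq, chunk = packet
        if seq == expected["seq"]:               # accept in-order data only
            received.append(chunk)
            expected["seq"] += 1
        acked[0] = seq                           # acknowledge latest packet

    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for seq, chunk in enumerate(chunks):
        while True:
            acked = [None]
            send((seq, chunk), receiver)         # may be silently dropped
            if acked[0] == seq:                  # ACK arrived: next chunk
                break                            # otherwise retransmit
    return b"".join(received)

print(transfer(b"reliable udp without handshakes"))
```

A full solution would also need flow and congestion control (e.g., a sliding window and adaptive timeouts) to match the "good features of TCP" the goal describes; stop-and-wait only covers reliability.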
Cascalog is a fully-featured data processing and querying library for Clojure. The main use cases for Cascalog are processing Big Data on top of Hadoop or doing analysis on your local computer from the Clojure REPL. Cascalog is a replacement for tools like Pig, Hive, and Cascading.