MapReduce TPC-H Generator

This project simplifies generating TPC-H data sets at large scale factors on a Hadoop cluster.

To get set up, run:

$ make

This downloads the TPC-H dbgen program, compiles it, and uses Maven to build the MapReduce application that wraps it.

To generate the data sets, run the following (for example, with scale = 200 and parallelism = 100):

$ hadoop jar target/tpch-gen-1.0-SNAPSHOT.jar -d /user/hive/external/200/ -p 100 -s 200

The job relies on the parallel generation support already built into dbgen, without modifying it, and uses that support to run the command across multiple machines.
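
As a rough illustration of that parallelism, the sketch below prints one dbgen invocation per map task. It assumes the job splits the work with dbgen's standard chunking flags (-C for the total number of chunks, -S for the chunk to build). The function name dbgen_commands and the ./dbgen path are hypothetical, and the exact command line this project builds may differ.

```shell
#!/bin/sh
# Illustrative only: prints the commands, it does not run dbgen.
# Assumption: each map task builds one chunk of the data set.
#   -f  overwrite existing output files
#   -s  scale factor
#   -C  total number of chunks
#   -S  which chunk this invocation builds (1-based)
dbgen_commands() {
  scale=$1
  parallel=$2
  step=1
  while [ "$step" -le "$parallel" ]; do
    echo "./dbgen -f -s $scale -C $parallel -S $step"
    step=$((step + 1))
  done
}

# Example: scale 200 split into 3 chunks (the README example uses 100).
dbgen_commands 200 3
```

Under this assumption, the -p value passed to the Hadoop job would correspond to the chunk count, and each map task would run one of the printed lines.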

Each map task generates multiple files, and the output is organized so that each table has its own subdirectory.

This assumes that all machines in the cluster are identical in operating system, architecture, and installed libraries.