
Added most of TPC-H, some queries need to be fixed.

Major improvements to the build and generate scripts.
cartershanklin 11 years ago
parent
commit
2b4fa2e639

+ 31 - 27
README.md

@@ -12,7 +12,8 @@ Prerequisites
 =============
 
 You will need:
-* A Linux-based HDP cluster (or Sandbox).
+* A Hadoop 2.2 or later cluster or Sandbox.
+* Hive 13 or later.
 * Between 15 minutes and 6 hours to generate data (depending on the Scale Factor you choose and available hardware).
 
 Install and Setup
@@ -20,53 +21,56 @@ Install and Setup
 
 All of these steps should be carried out on your Hadoop cluster.
 
-- Optional: Install a Tez capable version of Hive.
-
-  If you want to compare and contrast Hive on Map/Reduce versus Hive on Tez, install a version of Hive that works with Tez. For now that means installing the [Stinger Phase 3 Preview](http://www.hortonworks.com). Hive 13 and beyond, when they are released, will include Tez support by default.
-
 - Step 1: Prepare your environment.
 
-  Before you begin ensure ```gcc``` is installed and available on your system path. If you system does not have it, install it using yum or apt-get.
+  In addition to Hadoop and Hive 13+, ensure ```gcc``` is installed and available on your system path before you begin. If your system does not have it, install it using yum or apt-get.
 
-- Step 2: Compile and package the data generator.
+- Step 2: Decide which test suite(s) you want to use.
 
-  ```./build.sh``` downloads, compiles and packages the data generator.
+  hive-testbench comes with data generators and sample queries based on both the TPC-DS and TPC-H benchmarks. You can choose to use either or both of these benchmarks for experimentation. More information about these benchmarks can be found at the Transaction Processing Performance Council homepage (http://www.tpc.org).
 
-- Step 3: Decide how much data you want to generate.
+- Step 3: Compile and package the appropriate data generator.
 
-  You need to decide on a "Scale Factor" which represents how much data you will generate. Scale Factor roughly translates to gigabytes, so a Scale Factor of 100 is about 100 gigabytes. One terabyte is Scale Factor 1000. Decide how much data you want and keep it in mind for the next step. If you have a cluster of 4-10 nodes or just want to experiment at a smaller scale, scale 200 (200GB) of data is a good starting point. If you have a large cluster, you may want to choose Scale 1000 (1TB) or more.
+  For TPC-DS, ```./tpcds-build.sh``` downloads, compiles and packages the TPC-DS data generator.
+  For TPC-H, ```./tpch-build.sh``` downloads, compiles and packages the TPC-H data generator.
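+
+  Either build should leave a generator jar under the corresponding ```target/``` directory. As a quick sanity check, these are the paths the setup scripts below look for:
+
+  	```
+  	ls tpcds-gen/target/tpcds-gen-1.0-SNAPSHOT.jar
+  	ls tpch-gen/target/tpch-gen-1.0-SNAPSHOT.jar
+  	```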
 
-- Step 4: Generate and load the data.
+- Step 4: Decide how much data you want to generate.
 
-  The ```tpcds-setup.sh``` script generates and loads data for you. General usage is ```tpcds-setup.sh scale [directory] [mode]```. Only the scale is mandatory. The directory argument causes data to be generated in a specific location. Mode can be partitioned or unpartitioned. Partitioned causes data to be partitioned by day. Unpartitioned creates one flat schema and is faster to generate.
+  You need to decide on a "Scale Factor", which represents how much data you will generate. Scale Factor roughly translates to gigabytes, so a Scale Factor of 100 is about 100 gigabytes and one terabyte is Scale Factor 1000. Decide how much data you want and keep it in mind for the next step. If you have a cluster of 4-10 nodes or just want to experiment at a smaller scale, Scale Factor 1000 (1 TB) is a good starting point. If you have a large cluster, you may want to choose Scale Factor 10000 (10 TB) or more. The notion of Scale Factor is similar between TPC-DS and TPC-H.
 
-  - Option 1: Generate data on a Hadoop cluster.
+- Step 5: Generate and load the data.
 
-    Use this approach if you want to try Hive out at scale. This approach assumes you have multiple physical Hadoop nodes with plenty of RAM. All tables will be created and large tables will be partitioned by date and bucketed which improves performance among queries that take advantage of partition pruning or SMB joins.
+  The scripts ```tpcds-setup.sh``` and ```tpch-setup.sh``` generate and load data for TPC-DS and TPC-H, respectively. General usage is ```tpcds-setup.sh scale_factor [directory]``` or ```tpch-setup.sh scale_factor [directory]```.
 
-    Example: ```./tpcds-setup.sh 200```
+  Some examples:
+
+  * Build 1 TB of TPC-DS data: ```./tpcds-setup.sh 1000```
+  * Build 1 TB of TPC-H data: ```./tpch-setup.sh 1000```
+  * Build 100 TB of TPC-DS data: ```./tpcds-setup.sh 100000```
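+
+  If you omit the directory argument, the raw text data lands in ```/tmp/tpcds-generate``` or ```/tmp/tpch-generate``` on HDFS (the defaults in the setup scripts). To generate somewhere else, pass a directory as the second argument; the path below is only an illustration:
+
+  	```
+  	./tpcds-setup.sh 1000 /user/hive/external/tpcds
+  	```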
 
-  - Option 2: Generate data on a Sandbox.
+- Step 6: Run queries.
 
-    Use this approach if you want to try Hive or Hive/Tez in a Sandbox environment. This approach creates an unpartitioned schema by default, which is faster to generate. This option is appropriate for smaller data scales, say 20GB or smaller.
+  More than 50 sample TPC-DS queries and all TPC-H queries are included for you to try. You can use ```hive```, ```beeline``` or the SQL tool of your choice. The testbench also includes a set of suggested settings.
 
-    Example: ```./tpcds-setup-sandbox.sh 10```
+  This example assumes you have generated 1 TB of TPC-DS data during Step 5:
 
-- Step 5: Run queries.
+  	```
+  	cd sample-queries-tpcds
+  	hive -i testbench.settings
+  	hive> use tpcds_bin_partitioned_orc_1000;
+  	hive> source query55.sql;
+  	```
 
-  More than 50 sample TPC-DS queries are included for you to try out. You can use ```hive```, ```beeline``` or the SQL tool of your choice.
+  Note that the database is named based on the Scale Factor chosen in Step 4. At Scale Factor 10000, your database will be named tpcds_bin_partitioned_orc_10000. At Scale Factor 1000 it would be named tpcds_bin_partitioned_orc_1000. You can always run ```show databases``` to get a list of available databases.
 
-  Example:
+  Similarly, if you generated 1 TB of TPC-H data during Step 5:
 
   	```
-  	cd sample-queries
-  	hive
-  	hive> use tpcds_bin_partitioned_orc_200;
-  	hive> source query12.sql;
+  	cd sample-queries-tpch
+  	hive -i testbench.settings
+  	hive> use tpch_bin_partitioned_orc_1000;
+  	hive> source tpch_query1.sql;
   	```
 
-  Note that the database is named based on the Data Scale chosen in step 3. At Data Scale 200, your database will be named tpcds_bin_partitioned_orc_200. At Data Scale 50 it would be named tpcds_bin_partitioned_orc_50. You can always ```show databases``` to get a list of available databases.
-
 Feedback
 ========
 

+ 96 - 0
ddl-tpch/bin_flat/alltables.sql

@@ -0,0 +1,96 @@
+create database if not exists ${DB};
+use ${DB};
+
+drop table if exists lineitem;
+create external table lineitem 
+(L_ORDERKEY INT,
+ L_PARTKEY INT,
+ L_SUPPKEY INT,
+ L_LINENUMBER INT,
+ L_QUANTITY DOUBLE,
+ L_EXTENDEDPRICE DOUBLE,
+ L_DISCOUNT DOUBLE,
+ L_TAX DOUBLE,
+ L_RETURNFLAG STRING,
+ L_LINESTATUS STRING,
+ L_SHIPDATE STRING,
+ L_COMMITDATE STRING,
+ L_RECEIPTDATE STRING,
+ L_SHIPINSTRUCT STRING,
+ L_SHIPMODE STRING,
+ L_COMMENT STRING)
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE 
+LOCATION '${LOCATION}/lineitem';
+
+drop table if exists part;
+create external table part (P_PARTKEY INT,
+ P_NAME STRING,
+ P_MFGR STRING,
+ P_BRAND STRING,
+ P_TYPE STRING,
+ P_SIZE INT,
+ P_CONTAINER STRING,
+ P_RETAILPRICE DOUBLE,
+ P_COMMENT STRING) 
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE 
+LOCATION '${LOCATION}/part/';
+
+drop table if exists supplier;
+create external table supplier (S_SUPPKEY INT,
+ S_NAME STRING,
+ S_ADDRESS STRING,
+ S_NATIONKEY INT,
+ S_PHONE STRING,
+ S_ACCTBAL DOUBLE,
+ S_COMMENT STRING) 
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE 
+LOCATION '${LOCATION}/supplier/';
+
+drop table if exists partsupp;
+create external table partsupp (PS_PARTKEY INT,
+ PS_SUPPKEY INT,
+ PS_AVAILQTY INT,
+ PS_SUPPLYCOST DOUBLE,
+ PS_COMMENT STRING)
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
+LOCATION '${LOCATION}/partsupp';
+
+drop table if exists nation;
+create external table nation (N_NATIONKEY INT,
+ N_NAME STRING,
+ N_REGIONKEY INT,
+ N_COMMENT STRING)
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
+LOCATION '${LOCATION}/nation';
+
+drop table if exists region;
+create external table region (R_REGIONKEY INT,
+ R_NAME STRING,
+ R_COMMENT STRING)
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
+LOCATION '${LOCATION}/region';
+
+drop table if exists customer;
+create external table customer (C_CUSTKEY INT,
+ C_NAME STRING,
+ C_ADDRESS STRING,
+ C_NATIONKEY INT,
+ C_PHONE STRING,
+ C_ACCTBAL DOUBLE,
+ C_MKTSEGMENT STRING,
+ C_COMMENT STRING)
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
+LOCATION '${LOCATION}/customer';
+
+drop table if exists orders;
+create external table orders (O_ORDERKEY INT,
+ O_CUSTKEY INT,
+ O_ORDERSTATUS STRING,
+ O_TOTALPRICE DOUBLE,
+ O_ORDERDATE STRING,
+ O_ORDERPRIORITY STRING,
+ O_CLERK STRING,
+ O_SHIPPRIORITY INT,
+ O_COMMENT STRING)
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
+LOCATION '${LOCATION}/orders';

+ 8 - 0
ddl-tpch/bin_flat/customer.sql

@@ -0,0 +1,8 @@
+create database if not exists ${DB};
+use ${DB};
+
+drop table if exists customer;
+
+create table customer
+stored as ${FILE}
+as select * from ${SOURCE}.customer;

+ 8 - 0
ddl-tpch/bin_flat/lineitem.sql

@@ -0,0 +1,8 @@
+create database if not exists ${DB};
+use ${DB};
+
+drop table if exists lineitem;
+
+create table lineitem
+stored as ${FILE}
+as select * from ${SOURCE}.lineitem;

+ 8 - 0
ddl-tpch/bin_flat/nation.sql

@@ -0,0 +1,8 @@
+create database if not exists ${DB};
+use ${DB};
+
+drop table if exists nation;
+
+create table nation
+stored as ${FILE}
+as select * from ${SOURCE}.nation;

+ 8 - 0
ddl-tpch/bin_flat/orders.sql

@@ -0,0 +1,8 @@
+create database if not exists ${DB};
+use ${DB};
+
+drop table if exists orders;
+
+create table orders
+stored as ${FILE}
+as select * from ${SOURCE}.orders;

+ 8 - 0
ddl-tpch/bin_flat/part.sql

@@ -0,0 +1,8 @@
+create database if not exists ${DB};
+use ${DB};
+
+drop table if exists part;
+
+create table part
+stored as ${FILE}
+as select * from ${SOURCE}.part;

+ 8 - 0
ddl-tpch/bin_flat/partsupp.sql

@@ -0,0 +1,8 @@
+create database if not exists ${DB};
+use ${DB};
+
+drop table if exists partsupp;
+
+create table partsupp
+stored as ${FILE}
+as select * from ${SOURCE}.partsupp;

+ 8 - 0
ddl-tpch/bin_flat/region.sql

@@ -0,0 +1,8 @@
+create database if not exists ${DB};
+use ${DB};
+
+drop table if exists region;
+
+create table region
+stored as ${FILE}
+as select * from ${SOURCE}.region;

+ 8 - 0
ddl-tpch/bin_flat/supplier.sql

@@ -0,0 +1,8 @@
+create database if not exists ${DB};
+use ${DB};
+
+drop table if exists supplier;
+
+create table supplier
+stored as ${FILE}
+as select * from ${SOURCE}.supplier;

+ 5 - 18
settings/init.sql

@@ -1,27 +1,14 @@
-set hive.enforce.bucketing=true;
-set hive.enforce.sorting=true;
-set hive.exec.dynamic.partition.mode=nonstrict;
-set hive.exec.max.dynamic.partitions.pernode=1000000;
-set hive.exec.max.dynamic.partitions=1000000;
-set hive.exec.max.created.files=1000000;
 set hive.map.aggr=true;
-set hive.optimize.bucketmapjoin=true;
-set hive.optimize.bucketmapjoin.sortedmerge=true;
-set hive.mapred.reduce.tasks.speculative.execution=false;
+set mapreduce.reduce.speculative=false;
 set hive.auto.convert.join=true;
-set hive.auto.convert.sortmerge.join=true;
-set hive.auto.convert.sortmerge.join.noconditionaltask=true;
-set hive.auto.convert.join.noconditionaltask=true;
-set hive.auto.convert.join.noconditionaltask.size=10000000000;
 set hive.optimize.reducededuplication.min.reducer=1;
 set hive.optimize.mapjoin.mapreduce=true;
+set hive.stats.autogather=true;
 
 set mapred.reduce.parallel.copies=30;
-set mapred.reduce.tasks=16;
 set mapred.job.shuffle.input.buffer.percent=0.5;
 set mapred.job.reduce.input.buffer.percent=0.2;
-set mapred.map.child.java.opts=-server -Xmx2248m -Djava.net.preferIPv4Stack=true;
-set mapred.reduce.child.java.opts=-server -Xmx4500m -Djava.net.preferIPv4Stack=true;
+set mapred.map.child.java.opts=-server -Xmx2800m -Djava.net.preferIPv4Stack=true;
+set mapred.reduce.child.java.opts=-server -Xmx3800m -Djava.net.preferIPv4Stack=true;
 set mapreduce.map.memory.mb=3072;
-set mapreduce.reduce.memory.mb=6144;
-set hive.optimize.tez=true;
+set mapreduce.reduce.memory.mb=4096;

+ 5 - 6
settings/load-flat.sql

@@ -5,10 +5,9 @@ set hive.exec.max.dynamic.partitions.pernode=1000000;
 set hive.exec.max.dynamic.partitions=1000000;
 set hive.exec.max.created.files=1000000;
 
-set mapred.min.split.size=240000000;
-set mapred.max.split.size=240000000;
-set mapred.min.split.size.per.node=240000000;
-set mapred.min.split.size.per.rack=240000000;
+set mapreduce.input.fileinputformat.split.minsize=240000000;
+set mapreduce.input.fileinputformat.split.maxsize=240000000;
+set mapreduce.input.fileinputformat.split.minsize.per.node=240000000;
+set mapreduce.input.fileinputformat.split.minsize.per.rack=240000000;
 set hive.exec.parallel=true;
-set hive.stats.autogather=false;
-set hive.optimize.tez=false;
+set hive.stats.autogather=true;

+ 8 - 11
settings/load-partitioned.sql

@@ -9,15 +9,12 @@ set hive.exec.reducers.max=2000;
 set hive.stats.autogather=true;
 
 set mapred.job.reduce.input.buffer.percent=0.0;
-set mapred.min.split.size=240000000;
-set mapred.min.split.size.per.node=240000000;
-set mapred.min.split.size.per.rack=240000000;
+set mapreduce.input.fileinputformat.split.minsize=240000000;
+set mapreduce.input.fileinputformat.split.minsize.per.node=240000000;
+set mapreduce.input.fileinputformat.split.minsize.per.rack=240000000;
 
-set mapred.map.child.java.opts=-server -Xmx2500m -Djava.net.preferIPv4Stack=true;
-set mapred.reduce.child.java.opts=-server -Xms1024m -Xmx7900m -Djava.net.preferIPv4Stack=true;
-set mapreduce.map.memory.mb=3072;
-set mapreduce.reduce.memory.mb=8192;
-
-set io.sort.mb=800;
-
-set hive.optimize.tez=false;
+-- set mapred.map.child.java.opts=-server -Xmx2800m -Djava.net.preferIPv4Stack=true;
+-- set mapred.reduce.child.java.opts=-server -Xms1024m -Xmx3800m -Djava.net.preferIPv4Stack=true;
+-- set mapreduce.map.memory.mb=3072;
+-- set mapreduce.reduce.memory.mb=4096;
+-- set io.sort.mb=800;

+ 29 - 0
tpcds-build.sh

@@ -0,0 +1,29 @@
+#!/bin/sh
+
+# Check for all the stuff I need to function.
+for f in gcc; do
+	which $f > /dev/null 2>&1
+	if [ $? -ne 0 ]; then
+		echo "Required program $f is missing. Please install it and try again."
+		exit 1
+	fi
+done
+
+# Check if Maven is installed and install it if not.
+which mvn > /dev/null 2>&1
+if [ $? -ne 0 ]; then
+	echo "Maven not found, automatically installing it."
+	curl -O http://www.us.apache.org/dist/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz 2> /dev/null
+	if [ $? -ne 0 ]; then
+		echo "Failed to download Maven, check Internet connectivity and try again."
+		exit 1
+	fi
+	tar -zxf apache-maven-3.0.5-bin.tar.gz > /dev/null
+	CWD=$(pwd)
+	export MAVEN_HOME="$CWD/apache-maven-3.0.5"
+	export PATH=$PATH:$MAVEN_HOME/bin
+fi
+
+echo "Building TPC-DS Data Generator"
+(cd tpcds-gen; make)
+echo "TPC-DS Data Generator built, you can now use tpcds-setup.sh to generate data."

+ 1 - 1
tpcds-gen/Makefile

@@ -12,7 +12,7 @@ tpcds_kit.zip:
 	curl --output tpcds_kit.zip http://www.tpc.org/tpcds/dsgen/dsgen-download-files.asp?download_key=NaN
 
 target/lib/dsdgen.jar: target/tools/dsdgen
-	cd target/; mkdir -p lib/; gjar cvf lib/dsdgen.jar tools/
+	cd target/; mkdir -p lib/; ( jar cvf lib/dsdgen.jar tools/ || gjar cvf lib/dsdgen.jar tools/ )
 
 target/tools/dsdgen: target/tpcds_kit.zip
 	test -d target/tools/ || (cd target; unzip tpcds_kit.zip; cd tools; cat ../../*.patch | patch -p0 )

+ 55 - 20
tpcds-setup.sh

@@ -1,12 +1,23 @@
 #!/bin/bash
 
 function usage {
-	echo "Usage: tpcds-setup.sh scale [temp directory]"
+	echo "Usage: tpcds-setup.sh scale_factor [temp_directory]"
 	exit 1
 }
 
+function runcommand {
+	if [ "X$DEBUG_SCRIPT" != "X" ]; then
+		$1
+	else
+		$1 2>/dev/null
+	fi
+}
+
+BOLD=`tput bold`
+NORMAL=`tput sgr0`
+
 if [ ! -f tpcds-gen/target/tpcds-gen-1.0-SNAPSHOT.jar ]; then
-	echo "Build the data generator with build.sh first"
+	echo "Please build the data generator with ./tpcds-build.sh first"
 	exit 1
 fi
 which hive > /dev/null 2>&1
@@ -22,47 +33,71 @@ FACTS="store_sales store_returns web_sales web_returns catalog_sales catalog_ret
 # Get the parameters.
 SCALE=$1
 DIR=$2
+BUCKETS=13
+RETURN_BUCKETS=13
+if [ "X$DEBUG_SCRIPT" != "X" ]; then
+	set -x
+fi
 
-# Ensure arguments exist.
+# Sanity checking.
 if [ X"$SCALE" = "X" ]; then
 	usage
 fi
 if [ X"$DIR" = "X" ]; then
 	DIR=/tmp/tpcds-generate
 fi
-
-# Sanity checking.
 if [ $SCALE -eq 1 ]; then
 	echo "Scale factor must be greater than 1"
 	exit 1
 fi
 
-BUCKETS=13
-
-set -x
-set -e
-
-hadoop dfs -mkdir -p ${DIR}
-hadoop dfs -ls ${DIR}/${SCALE} || (cd tpcds-gen; hadoop jar target/*.jar -d ${DIR}/${SCALE}/ -s ${SCALE})
-hadoop dfs -ls ${DIR}/${SCALE} || ( echo "No data available" )
-hadoop dfs -ls ${DIR}/${SCALE}
+# Do the actual data load.
+hdfs dfs -mkdir -p ${DIR}
+hdfs dfs -ls ${DIR}/${SCALE} > /dev/null
+if [ $? -ne 0 ]; then
+	echo "${BOLD}Generating data at scale factor $SCALE.${NORMAL}"
+	(cd tpcds-gen; hadoop jar target/*.jar -d ${DIR}/${SCALE}/ -s ${SCALE})
+fi
+hdfs dfs -ls ${DIR}/${SCALE} > /dev/null
+if [ $? -ne 0 ]; then
+	echo "${BOLD}Data generation failed, exiting.${NORMAL}"
+	exit 1
+fi
+echo "${BOLD}TPC-DS text data generation complete.${NORMAL}"
 
 # Create the text/flat tables as external tables. These will be later be converted to ORCFile.
-hive -i settings/load-flat.sql -f ddl-tpcds/text/alltables.sql -d DB=tpcds_text_${SCALE} -d LOCATION=${DIR}/${SCALE}
+echo "${BOLD}Loading text data into external tables.${NORMAL}"
+runcommand "hive -i settings/load-flat.sql -f ddl-tpcds/text/alltables.sql -d DB=tpcds_text_${SCALE} -d LOCATION=${DIR}/${SCALE}"
 
-# Create the partitioned tables.
+# Create the partitioned and bucketed tables.
+i=1
+total=24
 for t in ${FACTS}
 do
-	hive -i settings/load-partitioned.sql -f ddl-tpcds/bin_partitioned/${t}.sql \
+	echo "${BOLD}Optimizing table $t ($i/$total).${NORMAL}"
+	COMMAND="hive -i settings/load-partitioned.sql -f ddl-tpcds/bin_partitioned/${t}.sql \
 	    -d DB=tpcds_bin_partitioned_orc_${SCALE} \
 	    -d SOURCE=tpcds_text_${SCALE} -d BUCKETS=${BUCKETS} \
-	    -d FILE=orc
+	    -d RETURN_BUCKETS=${RETURN_BUCKETS} -d FILE=orc"
+	runcommand "$COMMAND"
+	if [ $? -ne 0 ]; then
+		echo "Command failed, try 'export DEBUG_SCRIPT=ON' and re-run"
+		exit 1
+	fi
+	i=`expr $i + 1`
 done
 
 # Populate the smaller tables.
 for t in ${DIMS}
 do
-	hive -i settings/load-partitioned.sql -f ddl-tpcds/bin_partitioned/${t}.sql \
+	echo "${BOLD}Optimizing table $t ($i/$total).${NORMAL}"
+	COMMAND="hive -i settings/load-partitioned.sql -f ddl-tpcds/bin_partitioned/${t}.sql \
 	    -d DB=tpcds_bin_partitioned_orc_${SCALE} -d SOURCE=tpcds_text_${SCALE} \
-	    -d FILE=orc
+	    -d FILE=orc"
+	runcommand "$COMMAND"
+	if [ $? -ne 0 ]; then
+		echo "Command failed, try 'export DEBUG_SCRIPT=ON' and re-run"
+		exit 1
+	fi
+	i=`expr $i + 1`
 done

+ 29 - 0
tpch-build.sh

@@ -0,0 +1,29 @@
+#!/bin/sh
+
+# Check for all the stuff I need to function.
+for f in gcc; do
+	which $f > /dev/null 2>&1
+	if [ $? -ne 0 ]; then
+		echo "Required program $f is missing. Please install it and try again."
+		exit 1
+	fi
+done
+
+# Check if Maven is installed and install it if not.
+which mvn > /dev/null 2>&1
+if [ $? -ne 0 ]; then
+	echo "Maven not found, automatically installing it."
+	curl -O http://www.us.apache.org/dist/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz 2> /dev/null
+	if [ $? -ne 0 ]; then
+		echo "Failed to download Maven, check Internet connectivity and try again."
+		exit 1
+	fi
+	tar -zxf apache-maven-3.0.5-bin.tar.gz > /dev/null
+	CWD=$(pwd)
+	export MAVEN_HOME="$CWD/apache-maven-3.0.5"
+	export PATH=$PATH:$MAVEN_HOME/bin
+fi
+
+echo "Building TPC-H Data Generator"
+(cd tpch-gen; make)
+echo "TPC-H Data Generator built, you can now use tpch-setup.sh to generate data."

+ 22 - 0
tpch-gen/Makefile

@@ -0,0 +1,22 @@
+
+all: target/lib/dbgen.jar target/tpch-gen-1.0-SNAPSHOT.jar
+
+target/tpch-gen-1.0-SNAPSHOT.jar: $(shell find -name '*.java')
+	mvn package
+
+target/tpch_kit.zip: tpch_kit.zip
+	mkdir -p target/
+	cp tpch_kit.zip target/tpch_kit.zip
+
+tpch_kit.zip:
+	curl --output tpch_kit.zip http://www.tpc.org/tpch/spec/tpch_2_16_0.zip
+
+target/lib/dbgen.jar: target/tools/dbgen
+	cd target/; mkdir -p lib/; ( jar cvf lib/dbgen.jar tools/ || gjar cvf lib/dbgen.jar tools/ )
+
+target/tools/dbgen: target/tpch_kit.zip
+	test -d target/tools/ || (cd target; unzip tpch_kit.zip -x __MACOSX/; ln -sf $$PWD/*/dbgen/ tools)
+	cd target/tools/; make -f makefile.suite clean; make -f makefile.suite CC=gcc DATABASE=ORACLE MACHINE=LINUX WORKLOAD=TPCH
+
+clean:
+	mvn clean

+ 20 - 0
tpch-gen/README.md

@@ -0,0 +1,20 @@
+Mapreduce TPC-H Generator
+=========================
+
+This simplifies creating TPC-H datasets at large scale on a Hadoop cluster.
+
+To get set up, you need to run
+
+	$ make 
+
+This will download the TPC-H dbgen program, compile it, and use Maven to build the MR app wrapped around it.
+
+To generate the data-sets, you need to run (say, for scale = 200, parallelism = 100)
+
+	$ hadoop jar target/tpch-gen-1.0-SNAPSHOT.jar -d /user/hive/external/200/ -p 100 -s 200
+
+This uses the parallelism built into the dbgen program, without modification, to run the generation across multiple machines.
+
+The command generates multiple files for each map task, resulting in each table having its own subdirectory.
+
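+For example, with the invocation above, listing the output directory
+
+	$ hadoop fs -ls /user/hive/external/200/
+
+should show one subdirectory per TPC-H table (customer, lineitem, nation, orders, part, partsupp, region, supplier).
+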
+This assumes that all machines in the cluster are identical in OS, architecture, and libraries.

+ 85 - 0
tpch-gen/ddl/text.sql

@@ -0,0 +1,85 @@
+create external table lineitem 
+(L_ORDERKEY INT,
+ L_PARTKEY INT,
+ L_SUPPKEY INT,
+ L_LINENUMBER INT,
+ L_QUANTITY DOUBLE,
+ L_EXTENDEDPRICE DOUBLE,
+ L_DISCOUNT DOUBLE,
+ L_TAX DOUBLE,
+ L_RETURNFLAG STRING,
+ L_LINESTATUS STRING,
+ L_SHIPDATE STRING,
+ L_COMMITDATE STRING,
+ L_RECEIPTDATE STRING,
+ L_SHIPINSTRUCT STRING,
+ L_SHIPMODE STRING,
+ L_COMMENT STRING)
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE 
+LOCATION '${LOCATION}/lineitem';
+
+create external table part (P_PARTKEY INT,
+ P_NAME STRING,
+ P_MFGR STRING,
+ P_BRAND STRING,
+ P_TYPE STRING,
+ P_SIZE INT,
+ P_CONTAINER STRING,
+ P_RETAILPRICE DOUBLE,
+ P_COMMENT STRING) 
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE 
+LOCATION '${LOCATION}/part/';
+
+create external table supplier (S_SUPPKEY INT,
+ S_NAME STRING,
+ S_ADDRESS STRING,
+ S_NATIONKEY INT,
+ S_PHONE STRING,
+ S_ACCTBAL DOUBLE,
+ S_COMMENT STRING) 
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE 
+LOCATION '${LOCATION}/supplier/';
+
+create external table partsupp (PS_PARTKEY INT,
+ PS_SUPPKEY INT,
+ PS_AVAILQTY INT,
+ PS_SUPPLYCOST DOUBLE,
+ PS_COMMENT STRING)
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
+LOCATION '${LOCATION}/partsupp';
+
+create external table nation (N_NATIONKEY INT,
+ N_NAME STRING,
+ N_REGIONKEY INT,
+ N_COMMENT STRING)
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
+LOCATION '${LOCATION}/nation';
+
+create external table region (R_REGIONKEY INT,
+ R_NAME STRING,
+ R_COMMENT STRING)
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
+LOCATION '${LOCATION}/region';
+
+create external table customer (C_CUSTKEY INT,
+ C_NAME STRING,
+ C_ADDRESS STRING,
+ C_NATIONKEY INT,
+ C_PHONE STRING,
+ C_ACCTBAL DOUBLE,
+ C_MKTSEGMENT STRING,
+ C_COMMENT STRING)
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
+LOCATION '${LOCATION}/customer';
+
+create external table orders (O_ORDERKEY INT,
+ O_CUSTKEY INT,
+ O_ORDERSTATUS STRING,
+ O_TOTALPRICE DOUBLE,
+ O_ORDERDATE STRING,
+ O_ORDERPRIORITY STRING,
+ O_CLERK STRING,
+ O_SHIPPRIORITY INT,
+ O_COMMENT STRING)
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
+LOCATION '${LOCATION}/orders';

+ 86 - 0
tpch-gen/pom.xml

@@ -0,0 +1,86 @@
+<?xml version="1.0" encoding="UTF-8"?>
+
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
+                        http://maven.apache.org/maven-v4_0_0.xsd">
+
+  <modelVersion>4.0.0</modelVersion>
+
+  <groupId>org.notmysock.tpch</groupId>
+  <artifactId>tpch-gen</artifactId>
+  <version>1.0-SNAPSHOT</version>
+  <packaging>jar</packaging>
+
+  <name>tpch-gen</name>
+  <url>http://maven.apache.org</url>
+
+  <dependencies>
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-client</artifactId>
+      <version>2.2.0</version>
+      <scope>compile</scope>
+    </dependency>
+    <dependency>
+      <groupId>commons-cli</groupId>
+      <artifactId>commons-cli</artifactId>
+      <version>1.1</version>
+      <scope>compile</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.mockito</groupId>
+      <artifactId>mockito-core</artifactId>
+      <version>1.8.5</version>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
+      <groupId>junit</groupId>
+      <artifactId>junit</artifactId>
+      <version>4.7</version>
+      <scope>test</scope>
+    </dependency>
+  </dependencies>
+
+  <build>
+    <plugins>
+      <plugin>
+        <artifactId>maven-compiler-plugin</artifactId>
+        <configuration>
+          <source>1.6</source>
+          <target>1.6</target>
+        </configuration>
+      </plugin>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-jar-plugin</artifactId>
+        <configuration>
+          <archive>
+            <manifest>
+              <addClasspath>true</addClasspath>
+			  <classpathPrefix>lib/</classpathPrefix>
+              <mainClass>org.notmysock.tpch.GenTable</mainClass>
+            </manifest>
+          </archive>
+        </configuration>
+      </plugin>
+	  <plugin>
+		<groupId>org.apache.maven.plugins</groupId>
+		<artifactId>maven-dependency-plugin</artifactId>
+        <executions>
+		  <execution>
+            <id>copy-dependencies</id>
+            <phase>package</phase>
+            <goals>
+			  <goal>copy-dependencies</goal>
+            </goals>
+            <configuration>
+                <outputDirectory>${project.build.directory}/lib</outputDirectory>
+            </configuration>
+          </execution>
+        </executions>
+      </plugin>
+    </plugins>
+  </build>
+
+</project>

+ 242 - 0
tpch-gen/src/main/java/org/notmysock/tpch/GenTable.java

@@ -0,0 +1,242 @@
+package org.notmysock.tpch;
+
+import org.apache.hadoop.conf.*;
+import org.apache.hadoop.fs.*;
+import org.apache.hadoop.hdfs.*;
+import org.apache.hadoop.io.*;
+import org.apache.hadoop.util.*;
+import org.apache.hadoop.filecache.*;
+import org.apache.hadoop.mapreduce.*;
+import org.apache.hadoop.mapreduce.lib.input.*;
+import org.apache.hadoop.mapreduce.lib.output.*;
+import org.apache.hadoop.mapreduce.lib.reduce.*;
+
+import org.apache.commons.cli.*;
+import org.apache.commons.*;
+
+import java.io.*;
+import java.nio.*;
+import java.util.*;
+import java.net.*;
+import java.math.*;
+import java.security.*;
+
+
+public class GenTable extends Configured implements Tool {
+	
+	private static enum TableMappings {
+		ALL("all"),
+		CUSTOMERS("c"),
+		SUPPLIERS("s"),
+		NATION("l"),
+		ORDERS("o"),
+		PARTS("p");
+		
+		/*
+		-T c   -- generate customers ONLY
+		-T l   -- generate nation/region ONLY
+		-T o   -- generate orders/lineitem ONLY
+		-T p   -- generate parts/partsupp ONLY
+		-T s   -- generate suppliers ONLY
+		*/
+		
+		
+		final String option;
+		
+		TableMappings(String option) {
+			this.option = option;
+		}
+	}
+	
+    public static void main(String[] args) throws Exception {
+        Configuration conf = new Configuration();
+        int res = ToolRunner.run(conf, new GenTable(), args);
+        System.exit(res);
+    }
+
+    @Override
+    public int run(String[] args) throws Exception {
+        String[] remainingArgs = new GenericOptionsParser(getConf(), args).getRemainingArgs();
+
+        CommandLineParser parser = new BasicParser();
+        getConf().setInt("io.sort.mb", 4);
+        org.apache.commons.cli.Options options = new org.apache.commons.cli.Options();
+        options.addOption("s","scale", true, "scale");
+        options.addOption("t","table", true, "table");
+        options.addOption("d","dir", true, "dir");
+        options.addOption("p", "parallel", true, "parallel");
+        CommandLine line = parser.parse(options, remainingArgs);
+
+        if(!(line.hasOption("scale") && line.hasOption("dir"))) {
+          HelpFormatter f = new HelpFormatter();
+          f.printHelp("GenTable", options);
+          return 1;
+        }
+        
+        int scale = Integer.parseInt(line.getOptionValue("scale"));
+        String table = "all";
+        if(line.hasOption("table")) {
+          table = line.getOptionValue("table");
+          table = TableMappings.valueOf(table.toUpperCase()).option;
+        }
+        Path out = new Path(line.getOptionValue("dir"));
+
+        int parallel = scale;
+
+        if(line.hasOption("parallel")) {
+          parallel = Integer.parseInt(line.getOptionValue("parallel"));
+        }
+
+        if(parallel == 1 || scale == 1) {
+          System.err.println("The MR task does not work for scale=1 or parallel=1");
+          return 1;
+        }
+
+        Path in = genInput(table, scale, parallel);
+
+        Path dbgen = copyJar(new File("target/lib/dbgen.jar"));
+        URI dsuri = dbgen.toUri();
+        URI link = new URI(dsuri.getScheme(),
+                    dsuri.getUserInfo(), dsuri.getHost(), 
+                    dsuri.getPort(),dsuri.getPath(), 
+                    dsuri.getQuery(),"dbgen");
+        Configuration conf = getConf();
+        conf.setInt("mapred.task.timeout",0);
+        conf.setInt("mapreduce.task.timeout",0);
+        DistributedCache.addCacheArchive(link, conf);
+        Job job = new Job(conf, "GenTable+"+table+"_"+scale);
+        job.setJarByClass(getClass());
+        job.setNumReduceTasks(0);
+        job.setMapperClass(dbgen.class);
+        job.setOutputKeyClass(Text.class);
+        job.setOutputValueClass(Text.class);
+
+        job.setInputFormatClass(NLineInputFormat.class);
+        NLineInputFormat.setNumLinesPerSplit(job, 1);
+
+        FileInputFormat.addInputPath(job, in);
+        FileOutputFormat.setOutputPath(job, out);
+
+        // use multiple output to only write the named files
+        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
+        MultipleOutputs.addNamedOutput(job, "text", 
+          TextOutputFormat.class, LongWritable.class, Text.class);
+
+        boolean success = job.waitForCompletion(true);
+
+        // cleanup
+        FileSystem fs = FileSystem.get(getConf());
+        
+        fs.delete(in, false);
+        fs.delete(dbgen, false);
+
+        return success ? 0 : 1;
+    }
+
+    public Path copyJar(File jar) throws Exception {
+      MessageDigest md = MessageDigest.getInstance("MD5");
+      InputStream is = new FileInputStream(jar);
+      try {
+        is = new DigestInputStream(is, md);
+        // Read the stream to EOF so the digest actually covers the jar contents.
+        byte[] buf = new byte[8192];
+        while (is.read(buf) != -1) {
+          // reading updates the digest as a side effect
+        }
+      }
+      finally {
+        is.close();
+      }
+      BigInteger md5 = new BigInteger(1, md.digest());
+      String md5hex = md5.toString(16);
+      Path dst = new Path(String.format("/tmp/%s.jar",md5hex));
+      Path src = new Path(jar.toURI());
+      FileSystem fs = FileSystem.get(getConf());
+      fs.copyFromLocalFile(false, /*overwrite*/true, src, dst);
+      return dst; 
+    }
+
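+    /**
+     * Writes one dbgen command line per parallel child into a temporary file
+     * on HDFS. With NLineInputFormat set to one line per split, each map task
+     * executes exactly one of these dbgen invocations.
+     */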
+    public Path genInput(String table, int scale, int parallel) throws Exception {
+        long epoch = System.currentTimeMillis()/1000;
+
+        Path in = new Path("/tmp/"+table+"_"+scale+"-"+epoch);
+        FileSystem fs = FileSystem.get(getConf());
+        FSDataOutputStream out = fs.create(in);
+        for(int i = 1; i <= parallel; i++) {
+          if(table.equals("all")) {
+            out.writeBytes(String.format("$DIR/dbgen/tools/dbgen -b $DIR/dbgen/tools/dists.dss -f -s %d -C %d -S %d\n", scale, parallel, i));
+          } else {
+        	out.writeBytes(String.format("$DIR/dbgen/tools/dbgen -b $DIR/dbgen/tools/dists.dss -f -s %d -C %d -S %d -T %s\n", scale, parallel, i, table));           
+          }
+        }
+        out.close();
+        return in;
+    }
+
+    static String readToString(InputStream in) throws IOException {
+      InputStreamReader is = new InputStreamReader(in);
+      StringBuilder sb=new StringBuilder();
+      BufferedReader br = new BufferedReader(is);
+      String read = br.readLine();
+
+      while(read != null) {
+        //System.out.println(read);
+        sb.append(read);
+        read =br.readLine();
+      }
+      return sb.toString();
+    }
+
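+    /**
+     * Mapper that runs a single dbgen command line. "$DIR" is replaced with
+     * the task's working directory, where the dbgen archive from the
+     * distributed cache is expected under ./dbgen; the resulting *.tbl files
+     * are then streamed into one subdirectory per table via MultipleOutputs.
+     */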
+    static final class dbgen extends Mapper<LongWritable,Text, Text, Text> {
+      private MultipleOutputs mos;
+      protected void setup(Context context) throws IOException {
+        mos = new MultipleOutputs(context);
+      }
+      protected void cleanup(Context context) throws IOException, InterruptedException {
+        mos.close();
+      }
+      protected void map(LongWritable offset, Text command, Mapper.Context context) 
+        throws IOException, InterruptedException {
+        String parallel="1";
+        String child="1";
+
+        String[] cmd = command.toString().split(" ");
+
+        for(int i=0; i<cmd.length; i++) {
+          if(cmd[i].contains("$DIR")) {
+            cmd[i] = cmd[i].replace("$DIR",(new File(".")).getAbsolutePath());
+          }
+          if(cmd[i].equals("-C")) {
+            parallel = cmd[i+1];
+          }
+          if(cmd[i].equals("-S")) {
+            child = cmd[i+1];
+          }
+        }
+
+        Process p = Runtime.getRuntime().exec(cmd, null, new File("."));
+        int status = p.waitFor();
+        if(status != 0) {
+          String err = readToString(p.getErrorStream());
+          throw new InterruptedException("Process failed with status code " + status + "\n" + err);
+        }
+
+        File cwd = new File(".");
+        final String suffix = String.format(".tbl.%s", child);
+
+        FilenameFilter tables = new FilenameFilter() {
+          public boolean accept(File dir, String name) {
+            return name.endsWith(suffix) || name.endsWith(".tbl");
+          }
+        };
+
+        for(File f: cwd.listFiles(tables)) {
+          BufferedReader br = new BufferedReader(new FileReader(f));          
+          String line;
+          String name = f.getName().replace(suffix,"/data").replace(".tbl", "/data");
+          while ((line = br.readLine()) != null) {
+            // process the line.
+            mos.write("text", line, null, name);
+          }
+          br.close();
+          f.deleteOnExit();
+        }
+      }
+    }
+}

+ 86 - 0
tpch-setup.sh

@@ -0,0 +1,86 @@
+#!/bin/bash
+
+function usage {
+	echo "Usage: tpch-setup.sh scale_factor [temp_directory]"
+	exit 1
+}
+
+function runcommand {
+	if [ "X$DEBUG_SCRIPT" != "X" ]; then
+		$1
+	else
+		$1 2>/dev/null
+	fi
+}
+
+BOLD=`tput bold`
+NORMAL=`tput sgr0`
+
+if [ ! -f tpch-gen/target/tpch-gen-1.0-SNAPSHOT.jar ]; then
+	echo "Please build the data generator with ./tpch-build.sh first"
+	exit 1
+fi
+which hive > /dev/null 2>&1
+if [ $? -ne 0 ]; then
+	echo "Script must be run where Hive is installed"
+	exit 1
+fi
+
+# Tables in the TPC-H schema.
+TABLES="part partsupp supplier customer orders lineitem nation region"
+
+# Get the parameters.
+SCALE=$1
+DIR=$2
+BUCKETS=13
+if [ "X$DEBUG_SCRIPT" != "X" ]; then
+	set -x
+fi
+
+# Sanity checking.
+if [ X"$SCALE" = "X" ]; then
+	usage
+fi
+if [ X"$DIR" = "X" ]; then
+	DIR=/tmp/tpch-generate
+fi
+if [ $SCALE -eq 1 ]; then
+	echo "Scale factor must be greater than 1"
+	exit 1
+fi
+
+# Do the actual data load.
+hdfs dfs -mkdir -p ${DIR}
+hdfs dfs -ls ${DIR}/${SCALE} > /dev/null
+if [ $? -ne 0 ]; then
+	echo "${BOLD}Generating data at scale factor $SCALE.${NORMAL}"
+	(cd tpch-gen; hadoop jar target/*.jar -d ${DIR}/${SCALE}/ -s ${SCALE})
+fi
+hdfs dfs -ls ${DIR}/${SCALE} > /dev/null
+if [ $? -ne 0 ]; then
+	echo "${BOLD}Data generation failed, exiting.${NORMAL}"
+	exit 1
+fi
+echo "${BOLD}TPC-H text data generation complete.${NORMAL}"
+
+# Create the text/flat tables as external tables. These will be later be converted to ORCFile.
+echo "${BOLD}Loading text data into external tables.${NORMAL}"
+runcommand "hive -i settings/load-flat.sql -f ddl-tpch/bin_flat/alltables.sql -d DB=tpch_text_${SCALE} -d LOCATION=${DIR}/${SCALE}"
+
+# Create the partitioned and bucketed tables.
+i=1
+total=8
+for t in ${TABLES}
+do
+	echo "${BOLD}Optimizing table $t ($i/$total).${NORMAL}"
+	COMMAND="hive -i settings/load-flat.sql -f ddl-tpch/bin_flat/${t}.sql \
+	    -d DB=tpch_bin_partitioned_orc_${SCALE} \
+	    -d SOURCE=tpch_text_${SCALE} -d BUCKETS=${BUCKETS} \
+	    -d FILE=orc"
+	runcommand "$COMMAND"
+	if [ $? -ne 0 ]; then
+		echo "Command failed, try 'export DEBUG_SCRIPT=ON' and re-run"
+		exit 1
+	fi
+	i=`expr $i + 1`
+done