
Simplifications to the process.

cartershanklin 12 years ago
parent
commit
de330e9d99
2 changed files with 10 additions and 16 deletions
  1. README.md: +6 -12
  2. tpcds-setup.sh: +4 -4

README.md: +6 -12

@@ -26,37 +26,31 @@ All of these steps should be carried out on your Hadoop cluster.
 
 - Step 1: Prepare your environment.
 
-  Before you begin, ```gcc```, and maven (```mvn```)must be in your system path. This is needed to compile the data generation program and package it for running inside Hadoop. These only need to be installed on one node of your Hadoop cluster.
-
-  On Ubuntu systems you can install all these via ```sudo apt-get install gcc maven```.
-  On RHEL / CentOS, most of these are availabile, start with ```sudo yum install gcc```. Maven must be installed manually.
+  Before you begin, ensure ```gcc``` is installed and available on your system path. If your system does not have it, install it using yum or apt-get.
 
 - Step 2: Compile and package the data generator.
 
   ```./build.sh``` builds the data generator. Missing dependencies from step 1 will be detected and reported.
 
-- Step 2: Create a working directory in HDFS.
-
-  ```hadoop fs -mkdir /tmp/tpcds-staging```
-  creates a staging directory. This directory can be removed later to free up space.
-
 - Step 3: Decide how much data you want to generate.
 
   You need to decide on a "Scale Factor" which represents how much data you will generate. Scale Factor roughly translates to gigabytes, so a Scale Factor of 100 is about 100 gigabytes. One terabyte is Scale Factor 1000. Decide how much data you want and keep it in mind for the next step. If you have a cluster of 4-10 nodes or just want to experiment at a smaller scale, Scale 200 (about 200GB) is a good starting point. If you have a large cluster, you may want to choose Scale 1000 (1TB) or more.
 
 - Step 4: Generate and load the data.
 
+  The ```tpcds-setup.sh``` script generates and loads data for you. General usage is ```tpcds-setup.sh scale [directory] [mode]```. Only the scale is mandatory. The directory argument causes data to be generated in a specific location. Mode can be partitioned or unpartitioned. Partitioned causes data to be partitioned by day. Unpartitioned creates one flat schema and is faster to generate.
+
   - Option 1: Generate data on a Hadoop cluster.
 
     Use this approach if you want to try Hive out at scale. This approach assumes you have multiple physical Hadoop nodes with plenty of RAM. All tables will be created, and large tables will be partitioned by date and bucketed, which improves performance for queries that take advantage of partition pruning or SMB joins.
 
-    Example: ```./tpcds-setup.sh 200 /tmp/tpcds-staging```
+    Example: ```./tpcds-setup.sh 200```
 
   - Option 2: Generate data on a Sandbox.
 
-    Use this approach if you want to try Hive or Hive/Tez out in a Sandbox environment. This is for experimentation only and you should not generate too much data if you choose this route, 20 GB or less would be appropriate. This approach does not partition data.
+    Use this approach if you want to try Hive or Hive/Tez in a Sandbox environment. This approach creates an unpartitioned schema by default, which is faster to generate. This option is appropriate for smaller data scales, say 20GB or smaller.
 
-    Example: ```./tpcds-setup-sandbox.sh 10 /tmp/tpcds-staging```
+    Example: ```./tpcds-setup-sandbox.sh 10```
 
 - Step 5: Run queries.
 
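For reference, a few illustrative invocations of the simplified interface described above (the scale values and directory here are examples only; when omitted, the temp directory and mode fall back to the defaults shown in the script change below):

```
./tpcds-setup.sh 200                                   # scale 200, defaults: /tmp/tpcds-generate, partitioned
./tpcds-setup.sh 200 /tmp/tpcds-generate               # same, with the temp directory given explicitly
./tpcds-setup.sh 20 /tmp/tpcds-generate unpartitioned  # small scale with a flat, unpartitioned schema
```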

tpcds-setup.sh: +4 -4

@@ -1,7 +1,7 @@
 #!/bin/bash
 
 function usage {
-	echo "Usage: tpcds-setup.sh scale directory"
+	echo "Usage: tpcds-setup.sh scale [temp directory] [partitioned|unpartitioned]"
 	exit 1
 }
 
@@ -28,7 +28,7 @@ if [ X"$SCALE" = "X" ]; then
 	usage
 fi
 if [ X"$DIR" = "X" ]; then
-	usage
+	DIR=/tmp/tpcds-generate
 fi
 if [ X"$MODE" = "X" ]; then
 	MODE=partitioned
@@ -82,8 +82,8 @@ else
 		do
 			hive -i settings/load.sql -f ddl/bin_flat/${t}.sql \
 			    -d DB=tpcds_bin_flat_${FILE_FORMATS[$i]}_${SCALE} \
-			    -d SOURCE=tpcds_text_${SCALE} -d BUCKETS=${BUCKETS} \
-			    -d FILE="${file}" -d SERDE=${SERDES[$i]} -d SPLIT=${SPLIT}
+			    -d SOURCE=tpcds_text_${SCALE} -d FILE="${file}" \
+			    -d SERDE=${SERDES[$i]}
 		done
 	i=$((i+1))
 	done
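
The net effect of the script changes is that only the scale remains mandatory. A minimal sketch of the resulting argument handling, assuming the same names and defaults shown in the diff above (illustrative only, not the full script):

```bash
#!/bin/bash
# Sketch of the argument handling this change implies (variable names taken
# from the diff; the rest of the real tpcds-setup.sh is omitted).

SCALE=$1                          # mandatory scale factor, roughly in GB
DIR=${2:-/tmp/tpcds-generate}     # temp directory now defaults instead of aborting
MODE=${3:-partitioned}            # partitioned (by day) or unpartitioned (flat schema)

if [ -z "$SCALE" ]; then
	echo "Usage: tpcds-setup.sh scale [temp directory] [partitioned|unpartitioned]"
	exit 1
fi

echo "Generating scale $SCALE into $DIR using a $MODE schema"
```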