If you have not had the opportunity to test out Apache Zeppelin with Spark, I highly recommend it. It is one of those projects you thank the gods for having placed in your line of sight: for anyone who has attempted (and hopefully succeeded at) visualizing data the traditional way, this is the new and better method.

The traditional approach involves three pieces: a database to store your data, a front-end to visualize it (which mostly comes down to a JavaScript charting library), and a controller in the middle that is the brain behind the operation (i.e. your main backend language, such as NodeJS or PHP).

The headache is that you have to get the three to work together, and that’s about as simple as getting three drunk people to agree on Donald Trump’s policies. Now, don’t get me wrong: Zeppelin-Spark also has databases, visualizers, and controllers; the main difference is that as a developer you do not have to worry about that interaction. It is designed so that you can just be dropped in, for lack of a better term.

Having said that, for any developer who wants to dig into the bones of Zeppelin-Spark and change the way it works, the open source project is marvelously organized and built with modern tools such as NodeJS and AngularJS.

For example, I had the problem of ingesting about 1GB of data (hardly Big Data), doing some basic clean-up, and then using this information to display an analytics dashboard. Normally, half of the work would be in the first two steps (ingesting and clean-up), but with Zeppelin-Spark they took me less than one hour.

The benefit of Zeppelin-Spark is that it works seamlessly with SQL. So the typical approach to creating a visualization is: 1) get the data and register it as a table [this is very easy, as you will see in the code below]; 2) use Zeppelin’s SQL interpreter to run an ordinary query on this table, and that’s all. Zeppelin takes care of the rest, as far as displaying the data.

The user interface is clean and modern, but it can easily be changed since it is built with Bootstrap. I also mentioned AngularJS in the title of this article: for more advanced user functionality, we can add another step between 1 and 2 that handles this, but one step at a time. First let’s understand how to get your data in and make basic visualizations (visit this page to see a tutorial on how to easily install Apache Zeppelin on your PC).

Anything after a “//” or between “/*    */” is a comment, which means it is just text for you to read and not Scala code.  Create a new Note in Zeppelin, then begin writing some code in the first paragraph.

//add Spark SQL classes and functions for use within Scala
import org.apache.spark.sql._
import org.apache.spark.sql.functions._ //for monotonically_increasing_id(), round() and when() used below (Zeppelin may pre-import this)
import sqlContext.implicits._ //for the $"column" syntax used below (also usually pre-imported by Zeppelin)

/*
Notice how we use method chaining in Scala; in Java the code would look much bulkier and longer. We use the read function to look for a file and specify the file type, in this case a CSV file, with a couple of options:

1) header (true/false) - this tells Spark whether the data has column headers on the first line, and if so to use them as the column names in the DataFrame;

2) inferSchema (true/false) - this tells Spark to automatically analyze the file and work out the data type of each column.

Lastly, we use the load function with the path of the CSV file on your system.

Also, to see the difference between a "val" and a "var" in Scala, go here.
 */
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("data/the-data.csv")
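
Before moving on, it is worth a quick sanity check that the load worked. The two lines below are just an optional verification step (not part of the pipeline itself): they print the schema Spark inferred and preview the first few rows.

//optional sanity check: confirm the inferred column names/types and preview a few rows
df.printSchema()
df.show(5)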

/*
Now that the data has been ingested into a Spark DataFrame (the basic Spark data structure), we will do some clean-up and then register it as a temporary SQL table.
*/

// add an id column to the DataFrame
var ar = df.withColumn("id", monotonically_increasing_id())

/*
This set of code looks long and complex, but all it does is use the select command to add some new columns to the DataFrame. First it selects all the current columns using ar("*"), where the asterisk is the familiar "select everything" wildcard from SQL; then it adds three new ones, dollar_growth, volume_growth, and perc_acv_growth, which are all calculated from other columns in the DataFrame.

Notice that it performs the operations, then uses the as() function to give each of the three columns a name. The parentheses are very important, so be careful!
*/
ar = ar.select(
  ar("*"),
  (((ar("dollars_current") - ar("dollars_last")) / ar("dollars_last")) * 100).as("dollar_growth"),
  (((ar("volume_current") - ar("volume_last")) / ar("volume_last")) * 100).as("volume_growth"),
  (((ar("perc_acv_current") - ar("perc_acv_last")) / ar("perc_acv_last")) * 100).as("perc_acv_growth")
)
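
One edge case worth flagging: if any of the *_last columns contain a zero, Spark's division will produce a null growth value for that row rather than an error. If you would rather store an explicit 0.0 in those rows, an optional extra step using Spark's when()/otherwise() functions (shown here for dollar_growth only) looks like this:

//optional: replace a null dollar_growth (caused by a zero denominator) with 0.0
ar = ar.withColumn("dollar_growth", when(ar("dollars_last") === 0, 0.0).otherwise(ar("dollar_growth")))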


/*
Here we round the growth variables we just calculated by using the withColumn() function. Usually this function creates a new column, but since we just want to round already existing columns, we reuse the same names and apply the round() function with the number of decimal places to keep (2 in this case).

In other words, the withColumn() function takes the name of the column as its first parameter and, as its second, the column expression whose values it should store.
*/
ar = ar.withColumn("dollar_growth", round($"dollar_growth", 2))
ar = ar.withColumn("volume_growth", round($"volume_growth", 2))
ar = ar.withColumn("perc_acv_growth", round($"perc_acv_growth", 2))


/*
Now that we've done all our clean-up, we register the DataFrame as a temporary SQL table and we are ready to start visualizing!
*/
ar.registerTempTable("the_data")
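
To confirm the registration worked before building any charts, you can run a throwaway query against the new table from the same Scala paragraph (again, purely an optional check):

//optional: verify the temp table is queryable
sqlContext.sql("SELECT COUNT(*) FROM the_data").show()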

Now, to visualize this data, we will create a new paragraph within this note (see here for more information on paragraphs and notes). There are a couple of options in Zeppelin: 1) since it has built-in interpreters, we can write plain SQL as long as the paragraph is prefaced with “%sql” (the same goes for other languages, such as AngularJS); 2) we can skip the interpreters and use Zeppelin’s z.show() function from Scala. I will show both examples below. If you are just testing this out for the first time, I recommend the “%sql” route because it is very intuitive; later, we will need option #2 for more complex visualizations with AngularJS.

/*
Notice how we use z.show()-- the z variable is the predefined Zeppelin context, which connects all of the interpreters/languages. Here we query the temporary SQL table that we registered under this context above.

Then we use the sql function to run any given query on the columns we want.
*/
z.show(sqlContext.sql("select custom_segment, markets, base_size, periods, flavor  FROM the_data"))
%sql 
SELECT custom_segment, markets, base_size, periods, flavor FROM the_data

Now the query results are fed into the charting libraries that Zeppelin already has installed and are displayed. See below:

[Screenshot: the query results displayed as a table in Zeppelin]

The above is the query shown as a table. You can click the buttons above it to change the format to a different visualization (bar, pie, line charts, etc.) as well as export the data to comma-separated (CSV) or tab-separated (TSV) files.

[Screenshot: the visualization buttons and settings link above the result]

You can click on the “settings” link to the right of the buttons to change exactly which variables are shown, whether you want the SUM, COUNT, AVERAGE, MIN, or MAX of a column, and many more options.
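
The settings panel covers most needs, but you can also push the aggregation into the query itself and let Zeppelin chart the result. The example below is only a sketch: it assumes you want the average growth per market, and markets is just one of many columns you could group by.

//example: aggregate in the query instead of the settings panel
z.show(sqlContext.sql("SELECT markets, AVG(dollar_growth) AS avg_dollar_growth, AVG(volume_growth) AS avg_volume_growth FROM the_data GROUP BY markets"))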

This is the basic intro to Zeppelin-Spark. For more tips on how to make a completely customizable analytics dashboard for big data, keep in touch; we’ll release part 2 very soon.
