Within Predictive Cloud Computing we use IBM BigInsights to process data at rest for deep
insights. I'm Aaron Baughman with the GTS Sports and Entertainment Group, and I will take you briefly through our Hadoop-based code to show you how we build a BigInsights deployment file, how to deploy that file, and how to start working with the Java code. We use IBM's BigInsights to accelerate our
data science objectives on top of Hadoop and the data at rest ecosystem such as ZooKeeper,
Pig, Hive, and Oozie. Within Predictive Cloud Computing we process logs and web crawl information
to find the most popular players based on web page
access, find the most popular pages, and also to find
the total size of requested content at specific time
bins. As a result, processing data at rest within the MapReduce paradigm surfaces latent data correlations. The MapReduce jobs that are defined in an assembly run in sequence. The first job creates a data cache for all of the mappers so that hundreds of processes are not simultaneously accessing a relational database such as DB2. All of the players, the web-crawled pages for a player, and the web crawl times are loaded into memory.
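As a rough illustration of that caching idea, a lookup like the following can be loaded once per JVM so every map task reads from memory instead of opening its own DB2 connection. This is only a sketch; the class name, the file format, and the assumption that the exported crawl data lands as a line-per-path file in HDFS are all hypothetical.

```java
// Hypothetical sketch: load the web-crawl lookup data once per JVM so mappers read from
// memory rather than hitting the relational database concurrently.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class CrawlCache {
    private static volatile Set<String> crawledPaths;   // shared by all map tasks in this JVM

    // Load the exported crawl data (for example, a Sqoop export) from HDFS exactly once.
    public static synchronized Set<String> getCrawledPaths(Configuration conf, Path exportFile)
            throws IOException {
        if (crawledPaths == null) {
            Set<String> paths = new HashSet<>();
            FileSystem fs = exportFile.getFileSystem(conf);
            try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(exportFile)))) {
                for (String line; (line = in.readLine()) != null; ) {
                    paths.add(line.trim());
                }
            }
            crawledPaths = paths;
        }
        return crawledPaths;
    }
}
```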
The algorithm then iterates over each log entry: the time within a particular minute is the time key for the mapper, and a count of one is the value. The format goes by the epoch time, the count, and then the total size of all the requests. If, for the given log time, a player within the tournament is not mentioned on the access path, the resource path becomes a URL key and the value in turn becomes one; the format is the URL and then the count. If the accessed resource does mention a player, the player identifier and the time become the player key, and the value is the count of the player mentions over the minute time span; the format is player ID, epoch time, and then the count. The reduce stage then correlates and counts the number of times a URL is accessed that does not contain a player mention, how often players are accessed over a minute time bin, and the number of log entries over a minute time bin.
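Put together, a single simplified mapper could emit those three key/value shapes roughly as in the sketch below. The combined output stream, the key prefixes, and the parser stand-in are illustrative only; the real project splits this work across separate jobs and classes.

```java
// Simplified, hypothetical sketch of the key/value shapes described above.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogEntryMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        AccessLogEntry entry = AccessLogEntry.parse(line.toString());
        long minute = (entry.epochMillis / 60_000L) * 60_000L;   // bin the timestamp to the minute

        // 1) Per-minute traffic: key = epoch minute, value = bytes of the request
        //    (the reducer can count entries and sum sizes for that minute).
        context.write(new Text("time:" + minute), new LongWritable(entry.responseBytes));

        if (entry.playerId == null) {
            // 2) No player mentioned on the accessed resource: key = URL path, value = 1.
            context.write(new Text("url:" + entry.path), ONE);
        } else {
            // 3) Player mention: key = player id plus epoch minute, value = 1 mention.
            context.write(new Text("player:" + entry.playerId + ":" + minute), ONE);
        }
    }

    // Stand-in for the project's real log parser.
    static class AccessLogEntry {
        long epochMillis;
        long responseBytes;
        String path = "/";      // illustrative defaults so the sketch runs end to end
        String playerId;        // null when no player is mentioned on the resource
        static AccessLogEntry parse(String line) { return new AccessLogEntry(); } // parsing elided
    }
}
```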
To show you some examples, what I have up here is the web console for BigInsights. I can go up to the Applications tab, which drops down and shows all of the running applications. In this case we have a WebLogAnalyzer application that has been deployed. This is a Java application that was built by Maven: we did a Maven install, it created a zip file for us, and then I can go in and upload that zip file. Whenever I do that, the WebLogAnalyzer application pops up in the applications view, and the application has an xml file that defines the various fields that I would want to input.

For example, the Input Directory: the Input Directory maps over to the HDFS file system, so if I do a browse, this brings up my HDFS file system and I can navigate to the directory that I want, and that becomes the data that is going to be input into my mappers. The Output Directory again is HDFS output, and that gives me all of my reduce output. The Persistence Unit is used within JPA so I know, within the persistence.xml file, which database I want to use. Then my Site tells me which event it is: the Australian Open, the Masters, the French Open, the USGA, the Wimbledon tournament, or the US Open. Then I'd look at the year, which could be 2014, 2015, or 2016, because we can even process historical sites if we want. I also provide a JDBC Connection; I do this so that I can dynamically update which database we point to. The Connection Name would be the database user, there is of course the Password, and the Number of Buckets helps us to parallelize the job. There is a schedule in an Advanced section, and this is analogous to Oozie: I can set up where I want the job output to go, which is called BigSheets, and then I can schedule the job with Oozie itself to run periodically.
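As a hedged illustration of how the Persistence Unit and the JDBC fields can come together in code, the unit name selects an entry in persistence.xml while the standard JPA override properties repoint the connection at run time. The helper class below is hypothetical; only the property names are standard JPA.

```java
// Hypothetical helper: pick the persistence unit from persistence.xml and override the
// JDBC connection it points to with the values supplied through the BigInsights form.
import java.util.HashMap;
import java.util.Map;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public final class JobPersistence {
    public static EntityManagerFactory open(String persistenceUnit, String jdbcUrl,
                                            String user, String password) {
        Map<String, String> overrides = new HashMap<>();
        overrides.put("javax.persistence.jdbc.url", jdbcUrl);       // the JDBC Connection field
        overrides.put("javax.persistence.jdbc.user", user);         // the Connection Name field
        overrides.put("javax.persistence.jdbc.password", password); // the Password field
        return Persistence.createEntityManagerFactory(persistenceUnit, overrides);
    }
}
```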
Now over in the code: if I bring up Eclipse, I have a project which you can see here; let me scroll up. It's called WebLogAnalyzer, and this is our base MapReduce project. If I go down and show you my application xml file, this is the xml file that defines the user interface that I just showed you, with the required fields that I must enter in order for a job to run. So first I have a base path of log files, then my Output Directory, so where I want the output to go within HDFS, my persistence unit that I mentioned before, my Site Name, as well as the JDBC Connection, Connection Username and Password, and the Number of Buckets. Then I can look at my workflow, and the workflow is an assembly line that tells MapReduce which jobs to run.
Here within the workflow I have a Start section, and the Start section forks off into a RequestsJob and a Sqoop job. The Sqoop job brings all of the data that I need, such as the web crawl information, into memory so that all of my mappers and reducers don't have to simultaneously hit the JDBC connection. Then I fork off two jobs that can run in parallel. One is the PlayerContentAnalysis job, which tells me the most popular players based upon the accessed web pages that had player mentions, and the other is WebLogs, which tells me the most popular websites that are accessed over time. I then join the two together so I can get a cross product of the results, which gives me a crawler page count as well as a popularity, and I output this into JPA/DB2. The RequestJob does the same: I output that into JPA/DB2 as well.

In the xml file itself, I first have a Setup where I set the various parameters that I want for the job that is going to run, including timeouts, queue names, output directories, and so on. The output directories are chosen by global variables that are created through the web interface I showed you. Then here is the sqoop: if that job finishes successfully, I go to sqoop; if it errors, we still try to go to sqoop, get all the data, and bring it in. These are the different configurations that we pass through here, so I import the driver, the connection, the usernames, and the sites for JDBC. If the sqoop does succeed, we go to the next job; otherwise the job fails, because this is a fatal error. Then the RequestJob goes through and gets how large each of the requests were. I set my map classes so that, through Java reflection, BigInsights will know exactly which Java class to use for the mappers and the reducers, as well as the key classes and the value classes.
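In our case these classes are supplied as workflow properties, but for reference, the equivalent calls in a plain Hadoop Java driver look roughly like this sketch; the driver, mapper, and reducer names here are illustrative placeholders rather than the project's actual classes.

```java
// Sketch of a plain Hadoop driver wiring up the mapper, reducer, and key/value classes.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RequestsJobDriver {

    // Placeholder mapper/reducer standing in for the project's request-size classes.
    public static class RequestSizeMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Use the raw line length as a stand-in for the real request-size parsing.
            context.write(new Text("bytes"), new LongWritable(line.getLength()));
        }
    }

    public static class RequestSizeReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get();
            }
            context.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "RequestsJob");
        job.setJarByClass(RequestsJobDriver.class);

        // These are the classes BigInsights resolves by name through reflection.
        job.setMapperClass(RequestSizeMapper.class);
        job.setReducerClass(RequestSizeReducer.class);
        job.setOutputKeyClass(Text.class);            // key class
        job.setOutputValueClass(LongWritable.class);  // value class

        FileInputFormat.addInputPath(job, new Path(args[0]));    // base path of the log files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```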
Then I keep going through, and finally I get down to a crawlerFork: if the previous job was ok, we enter the crawlerFork; otherwise it is a fatal error and we fail. In the fork I set up two parallel paths, so my WebLogsJob runs in parallel with my PlayerContentAnalysisJob. In my WebLogsJob you can see that all of the different parameters are again set very similarly; it is just that they are going to be different classes, because we need different logic to support the underlying MapReduce operations. If that succeeded, then I want to join the results, so I go to the initialForkJoin; at the same time the PlayerContentAnalysisJob continues and writes into its mapper output directory that has been defined through the GUI, and then it also goes into the initialForkJoin. The join name says we are going to join at the initialForkJoin and then go into the CrawlerJoin, so the initialForkJoin cannot go into the CrawlerJoin until the previous two jobs have finished. When they have finished, we in turn go through and do additional MapReduce passes on the player content and the web content, and then we again output the results. If this succeeded, then we do a CrawlerPopularityJob and continue forward: we define additional mappers such as the CrawlerPopularityMapper in Java, we also define a reducer such as the CrawlerPopularityReducer, and then we have our keys and our values, which is where we accumulate information and data and store it. We also have a CleanupJob. The CleanupJob helps us save space within HDFS: after we finish each of the jobs, we go back and remove some of the data that we have put into HDFS that we no longer need.
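A cleanup step like that can boil down to little more than recursive HDFS deletes; the small sketch below is illustrative, not the project's actual CleanupJob.

```java
// Minimal sketch of a cleanup step: delete intermediate HDFS directories once the
// downstream jobs have consumed them. The argument handling is illustrative.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class CleanupJob {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        for (String dir : args) {          // e.g. the intermediate mapper output directories
            Path p = new Path(dir);
            if (fs.exists(p)) {
                fs.delete(p, true);        // recursive delete to reclaim HDFS space
            }
        }
    }
}
```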
Then there is the fail message: we report that the MapReduce failed, output the error message describing what happened and why it failed, and then we kill the job itself.

While we're here, let me show you some simple MapReduce logic.
Here is one of our WebLogMappers. We use Log4j, so we set up a log factory for the WebLogMapper. We then have a Query_String_Splitter, which is used to help us get the different types of results we are looking for, and our LogParser is how we go line by line through each of the logs and get the data that we need so that we can do our different types of matching. This enumeration helps tell us, for each of the lines, how many were filtered, matched, not matched, or failed to parse. In the mapper, we get the value itself from the value object that is passed in, which is a piece of text, and we check whether this path has been crawled or not: CrawledPath has a static method that looks into memory and determines whether that path has been crawled. If it has been, we add it as a key, so the key is the path that was crawled, and we initialize the value to one because it was crawled one time. Then within our context we just want to know how many were matched, so we increment that counter, and that is how our mapper works.
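A simplified sketch of that mapper logic is shown below; the line parsing, the CrawledPath lookup, and the counter names are stand-ins for the project's real classes.

```java
// Simplified sketch of the WebLogMapper: parse a log line, check the in-memory crawl
// lookup, emit (path, 1), and track line outcomes with Hadoop counters.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.Logger;

public class WebLogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final Logger LOG = Logger.getLogger(WebLogMapper.class);
    private static final IntWritable ONE = new IntWritable(1);

    // Per-line outcomes surfaced as Hadoop counters.
    enum LineCounter { FILTERED, MATCHED, NOT_MATCHED, PARSE_FAILED }

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        String path;
        try {
            path = extractPath(value.toString());             // stand-in for the LogParser
        } catch (IllegalArgumentException e) {
            LOG.warn("Could not parse log line", e);
            context.getCounter(LineCounter.PARSE_FAILED).increment(1);
            return;
        }
        // The static lookup checks in memory whether this path was web crawled.
        if (CrawledPath.isCrawled(path)) {
            context.write(new Text(path), ONE);               // key = crawled path, value = 1 access
            context.getCounter(LineCounter.MATCHED).increment(1);
        } else {
            context.getCounter(LineCounter.NOT_MATCHED).increment(1);
        }
    }

    private static String extractPath(String logLine) {
        String[] fields = logLine.split("\\s+");
        if (fields.length < 7) throw new IllegalArgumentException(logLine);
        return fields[6];                                     // request path in a common access-log layout
    }

    // Hypothetical stand-in for the cached crawl lookup.
    static final class CrawledPath {
        static boolean isCrawled(String path) { return path != null && !path.isEmpty(); }
    }
}
```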
Then there is our Partitioner; this is how we speed up and parallelize across the different types of logs. We look at the number of partitions that we set, which would be the number of bins, and we use the modulus of the hashCode so that we can evenly distribute the keys across partitions and help parallelize the data.
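That partitioner is essentially the standard hash-and-modulus pattern; a minimal sketch, with an illustrative class name, looks like this:

```java
// Hash partitioner: the key's hashCode modulo the partition count spreads keys evenly.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WebLogPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the modulus is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```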
Finally, our Reducer is how we accumulate all of the different counts. I initialize my sum to zero, and for each of the values that has been passed in through the Iterable I get it and add it to the sum, then write out the sum: I still have the same key, but the value is now the accumulated total. As we keep reducing over all of the mapper output, that is how we arrive at the final value.
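The summing reducer just described looks, in sketch form, like the following; again the class name is illustrative.

```java
// Summing reducer: same key in, the accumulated count out.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WebLogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;                              // initialize the sum to zero
        for (IntWritable count : values) {        // iterate the Iterable of mapper counts
            sum += count.get();
        }
        context.write(key, new IntWritable(sum)); // same key, accumulated total
    }
}
```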
Going back to the browser, I would set my parameters and run the job, and the job would execute from the assembly file that I mentioned before, so that the different jobs execute. To show you an example of the output, we'll go over to the Files tab, which shows us HDFS. We'll bring down HDFS, go to our users, and look in the Hadoop directory; once we are in the Hadoop directory we open up data. In the data directory we will look at version 2, and version 2 simply means that we have different code iterations and updates. So let's look at the Wimbledon directory for 2014, or I could look at the Australian Open for 2015. Like I said, we removed a lot of this to save space, but if I open up the day on the 13th you can see the outputs from the different jobs that ran. If I open up the sqoop job I can see that I have my output from the mappers, I have logs, and then whether or not it was a success: if we have _SUCCESS, that means it was successful; if this file doesn't exist, then it was not successful and I would want to go look in the logs to find out why.
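That _SUCCESS marker can also be checked programmatically rather than by browsing the Files tab; here is a small sketch using the standard Hadoop FileSystem API, with a hypothetical helper class.

```java
// Check for the _SUCCESS marker that MapReduce writes into its output directory.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public final class JobOutputCheck {
    public static boolean succeeded(Configuration conf, String outputDir) throws IOException {
        Path marker = new Path(outputDir, "_SUCCESS");       // written by a successful job
        return marker.getFileSystem(conf).exists(marker);    // absent => inspect the job logs
    }
}
```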
Now, if I want to deploy the job with Oozie, I go to the command line as the biadmin user. From the admin home directory I created a LogDaemon directory; the LogDaemon contains custom shell scripts written in Bash, and those shell scripts connect over to Oozie to execute the job itself. Once I'm in my LogDaemon directory, if I do an ls you can see that I have different properties files, and the core file I'm going to run is LogDaemon-noargs.sh. Let's look at that now. What I do is create a lock file so I can make sure that my jobs are not going to overlap; if a job is already running, then I'm not going to execute the job. I also want to have output for the logs, and this is where they are going to go: I have my logs, I set my DIR, and I export the path so I have the Oozie command available. Then I set my Oozie URL, which is just w30084, the machine I'm on, with the standard port for Oozie. Finally, I issue oozie job -run -config with my LogDaemon.properties file. If I look at an example properties file, I can see that I set the definition file on that path, wimbledon2014LogDaemon; this tells Oozie how to run the job. I set the event, which in this case is a Wimbledon event, and then the output event, and these set different parameters that are used in that xml file. If I want to see the different jobs that are running, I can just issue oozie jobs -jobtype coord; this shows me the previous job that I had run, the Australian Open 2015, and it tells you when it started, when it stopped, and its status. None of them are currently running at this time because we don't have any live events.
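For reference, the same submission that the Bash LogDaemon script performs through the Oozie CLI could also be done from Java with the Oozie client API. The URL, application path, and property names in this sketch are illustrative placeholders, not the project's actual values.

```java
// Sketch: submit the coordinator through the Oozie Java client instead of the CLI.
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.OozieClientException;

public final class LogDaemonSubmitter {
    public static String submit() throws OozieClientException {
        OozieClient oozie = new OozieClient("http://w30084:11000/oozie");  // Oozie URL and standard port
        Properties conf = oozie.createConfiguration();
        // Equivalent of the -config LogDaemon.properties file.
        conf.setProperty(OozieClient.COORDINATOR_APP_PATH,
                "hdfs:///user/biadmin/logdaemon/coordinator.xml");         // illustrative path
        conf.setProperty("event", "wimbledon");                            // illustrative parameters
        conf.setProperty("year", "2014");
        return oozie.run(conf);                                            // returns the Oozie job id
    }
}
```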
A job that does start running with Oozie is going to read a configuration file, so let's look at an example, the same one I showed you a minute ago: this is an Oozie configuration file. The application name in Oozie when it launches is going to be Australian Open 2015 Log Daemon Plex3, and the file tells me the frequency at which I want it to run (every day), when it starts, when it ends, and the time zone. If I scroll over a little more I can see my logInput and my logOutput, so the event parameter that I set in that Bash shell script would be here, along with the year, month, and day. This again tells me the frequency and the initial instance of when the input is going to start, and likewise the frequency and start for the output. The workflow is set whenever I upload that zip file to BigInsights: it is expanded, and there is a workflow.xml file that tells Oozie how to run all of the BigInsights jobs. Then I set my path to that file so that Oozie will know where to get it from, and that instructs Oozie which jobs to run, when, and which ones are going to be parallelized. Of course I can also set different database names and other constructs, I set where the jobTracker is and the persistence unit, and then I launch it. That's pretty much how it works. It is very robust and enables us to gain large-scale insights on data that is at rest while we stream log files over into our events. So this is how we use BigInsights and Hadoop for IBM GTS Sports and Special Events within Continuous Availability Services.
