BCS SPA2016

SPA Conference session: Real-World Big Data In Action

One-line description: A workshop in which you will create a local Big Data cluster, load it with real data and run queries against it
 
Session format: Workshop, 150 minutes
 
Abstract: Session attendees will create and configure a multi-node Big Data cluster in the classroom, and load it with a real Big Data dataset (which we will provide). Attendees will then be able to use a variety of tools to run Spark and Hive queries against this dataset.
You will need to bring your own laptop. OS X, Linux and Windows are all supported (although Windows is a little more difficult). Your laptop will need at least 8GB RAM and 5GB free disk space.
 
Audience background: General programming experience; you should be comfortable using a shell and the command line. Some Python knowledge will help in the Spark examples, and you will need a working knowledge of SQL for the Hive examples. Code will be provided.
 
Benefits of participating: You will start to learn how Big Data technologies work, how to configure Hadoop, Spark and Hive servers, and how to load them with data. You will then get a chance to try out some tools to run queries and gain insights from a Big Data dataset.
 
Materials provided: Big Data dataset
Reference materials for tools and languages used
All scripts, configuration files and programs
 
Process: The session will start with a brief overview of Big Data technologies.
We will then walk through the installation and configuration of Apache Hadoop, Spark and Hive on attendees' laptops. (Windows, OS X and Linux are all supported, although Windows is more difficult. Laptops need at least 8GB RAM.)
Once everyone is up and running with their own servers, we will walk you through the first exercise, which is to initialise the Apache Hadoop filesystem and load a real Big Data dataset into it (records of reported London Fire Brigade incidents).
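To give a flavour of what the first exercise involves, here is a minimal sketch in Python (the scripting language used elsewhere in the session). It assumes a pseudo-distributed Hadoop installation with fs.defaultFS pointing at hdfs://localhost:9000 and the hdfs and start-dfs.sh commands on your PATH; the file and directory names are illustrative, not the workshop's actual ones.

    import subprocess

    def run(cmd):
        # Echo and run a command, stopping if it fails.
        print("+ " + " ".join(cmd))
        subprocess.check_call(cmd)

    # One-time format of the NameNode's storage directory.
    run(["hdfs", "namenode", "-format", "-nonInteractive"])

    # Start the HDFS daemons (NameNode and DataNode).
    run(["start-dfs.sh"])

    # Create a target directory and upload the incident CSV into it.
    run(["hdfs", "dfs", "-mkdir", "-p", "/data/lfb"])
    run(["hdfs", "dfs", "-put", "lfb-incidents.csv", "/data/lfb/"])

    # Verify that the file arrived.
    run(["hdfs", "dfs", "-ls", "/data/lfb"])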
The second exercise is to run some Apache Spark queries against the data: How many incidents were there last year? What sorts of incidents occurred, and where? Where are the safest / most dangerous places to live in London?
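As a sketch of what those queries might look like, the PySpark snippet below can be pasted into the pyspark shell or run with spark-submit. It assumes the Spark 2.x DataFrame API, and the column names (CalYear, IncidentGroup, Postcode_district) are assumptions; check the real dataset's header and adjust to match.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lfb-incidents").getOrCreate()

    # Load the CSV we put into HDFS in the first exercise.
    df = spark.read.csv("hdfs:///data/lfb/lfb-incidents.csv",
                        header=True, inferSchema=True)

    # How many incidents were there last year?
    print(df.filter(df.CalYear == 2015).count())

    # What sorts of incidents occurred, and where?
    (df.groupBy("IncidentGroup", "Postcode_district")
       .count()
       .orderBy("count", ascending=False)
       .show(20))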
The third exercise is to run some Apache Hive SQL queries against the same data.
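The Hive exercise asks the same kinds of question in SQL. The sketch below shows the sort of HiveQL involved, run here through PySpark's Hive support so these examples stay in one language; in the exercise itself you would type the statements at the hive (or beeline) prompt. The table and column names are again illustrative rather than the workshop's actual ones.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Declare a Hive table over the CSV already sitting in HDFS.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS incidents (
            incident_number STRING,
            cal_year INT,
            incident_group STRING,
            postcode_district STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/data/lfb'
    """)

    # Which sorts of incident were most common last year, and where?
    spark.sql("""
        SELECT incident_group, postcode_district, COUNT(*) AS n
        FROM incidents
        WHERE cal_year = 2015
        GROUP BY incident_group, postcode_district
        ORDER BY n DESC
        LIMIT 20
    """).show()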
If there is time, and you have enough capacity on your laptop, we have a bigger Big Data dataset (London house prices) which you can also load and query.
We have also set up a couple of Virtual Private Servers in the cloud. If you don't have a powerful enough laptop, or if you can't get the exercise to run, you can use one of these.
 
Detailed timetable: Introduction to Big Data (15 minutes)
Exercise 1: Hadoop (60 minutes including setup)
Exercise 2: Spark (30 minutes)
Exercise 3: Hive (45 minutes)

 
Outputs: We will put all session materials, and links to further information, on the SPA conference page.
 
History: Not previously presented.
 
Presenters
1. Nick Rozanski
ICBC Standard Bank
2. Eoin Woods
Endava
3. Chris Cooper-Bland
Endava (UK) Ltd