Analyze crime data with Apache Spark and Hive ETL, Part 2

Explore the analyses


Content series:

This content is part # of 2 in the series: Analyze crime data with Apache Spark and Hive ETL, Part 2

Stay tuned for additional content in this series.

This content is part of the series:Analyze crime data with Apache Spark and Hive ETL, Part 2

Stay tuned for additional content in this series.

This tutorial series describes the analysis of United Kingdom crime data from inception to final results. Follow along to learn about data download, data transformation, loading into a distributed data warehouse, Apache Hive, and subsequent analysis using Apache Spark. Part 1 of this series describes the Extract, Transform, and Load (ETL) activities, and Part 2 explores the analyses performed.

In this second part of this series, you will walk through an analysis of crime data that shows a foundation of data analytics. Follow along and you will encounter a standard problem in modern big data systems: integrating data from different sources. Different government departments often have different standards for their data products and usually different definitions for even basic data types. Data scientists routinely manage these different definitions and must harmonize them to obtain good results.

Most data scientists like to load data into the data warehouse and retain an exact copy of the imported data. Then new tables with harmonized data definitions are created by transformations within the data warehouse itself. This retains a copy of the original imported data as an audit trace within the data warehouse and also the code that performs the transformation.

Key ideas

Keep a central project architecture for the data lake and map external data providers to your internal standard. A raw copy of ingested data is kept in a set of data tables, and transformations are applied to create a set of data tables with harmonized data definitions. This provides an audit trace of data transformations within the data lake itself.

Normalize the data

For comparing crime rates across counties, it helps to standardize the crime figures. Comparing total crime figures between Croydon and Bromley would misrepresent because the larger population of Croydon would generate higher crime figures anyway. Barnet, with half the population of Croydon, would naturally have lower crime rates. To enable direct comparison of crime rates, you normalize the data. Generally, you want to quote crime per 100,000 population. This allows a direct comparison of crime data among areas with widely different population figures. We immediately see a big difference in crime per 100,000, then further analysis can be created to identify root causes.

The police data does not quote population figures, so we need to acquire population data from the U.K. Census dataset. To do this, you need to download the London Census data. The London Census dataset is in a standard Excel file. The file has two tabs, one containing the LSOA data on population and the second containing a breakdown of people by single year of age. The LSOA is a small geographic area defined by the Census team, typically covering 1,500 people. Save the data file for each tab in CSV format. The second tab contains data broken down into age bands by single year of age. In this tutorial, you will see the first of these tables to compute population data. For future use, you already have the population per age bracket provided and can analyze crime rates per age group.

Import census data into Hive

The source code for the London Census data import is given below. The code is also available on GitHub. As before, the code is stored in a Hive SQL file. This source file is then invoked from the bash command line as follows:
$ hive -f import_census.hql.

The code below shows the Spark-Hive SQL data import utility file.

create table if not exists lsoa_borough ( xblank string, area_code string, year INT, lsoa_name string,  borough_name string, male INT, female INT, population INT) 
row format delimited fields terminated by ',' 
stored as textfile;

load data local inpath '/home/hduser/zdata/uk-crime/spark-lake/census/' into table lsoa_borough;
select year, area_code, lsoa_name, borough_name from lsoa_borough where year=2014 limit 20;

The computation that calculates population by aggregating the LSOA population data is available on GitHub.

Overview of crime and census datasets

On examination of the data download, one fact is immediately apparent: There is no population data, which we need for any crime comparison. The U.K. census data is in an Excel file and provides detailed population data. However, we need to integrate data from different providers, meaning you need to process this data. The datasets provide a detailed view of population counts.

Here, the data is not provided by county, but instead by Lower Layer Super Output Area (LSOA). This is a small geographic area of about 1,500 people. Fortunately, the police provide crime data based on the Census LSOA as well, so we can join the crime dataset data tables directly against the Census data. Data scientists like to enable easy comparison of analytics, and they like to calculate crime per 100,000. This enables easy comparison among counties with varying populations. You will aggregate the crimes into a borough level for London and aggregate population counts from the small LSOA data into more familiar borough areas.

U.K. crime dataset

The U.K. crime dataset is compiled from reports sent by the 43 geographic police forces of England and Wales, the British Transport Police, and the Police Service of Northern Ireland. In addition, the Ministry of Justice supplements the crime reports with court outcomes data. This data is collated and submitted monthly by the regional police forces.

Each month, the regional police forces assemble a Crime and Anti-Social Behaviour (ASB) report file and submit this to central government. The Ministry of Justice then adds court outcomes from its records and references this against the police data as possible. All data is made anonymous before publication.

Police force IT systems overview

Police force IT systems are usually developed internally to meet their own needs. Some forces procure third-party software, but the systems are not identical. Police IT systems fall into four main classes:

  • Command and Control, and Incident Management— Used for handling all incidents handled by the police force. Crime reports, road traffic incidents ASB.
  • Crime Management— Once an incident is confirmed as a crime, the crime management system is used to record all steps of the ongoing investigation.
  • Custody File System— This records details of any arrests made during crime investigations.
  • Case Management— This system manages the case in preparation for court trial, and the result of the trial in some police forces.

These four systems are tightly integrated in some police forces, while in others, they're distinct silos, with little connectivity. Due to the differences between the systems throughout the police forces, the exact steps to create a harmonized report vary.

Analytics for census and crime data

Next, we use Spark-SQL to perform the computations. The Spark-SQL program is written in the Java™ programming language using the Spark Java API. The code is built using the industry-standard build tool Maven. The Maven Project Object Model (POM) file is available from GitHub. The program is invoked from the following bash command line:

            $mvn package
$usr/local/spark/bin/spark-submit –class "DemoHive" --master local[4]

The first command, mvn package, builds the Java program and integrates the Spark and Hive API functions automatically. This results in a Java JAR file as output. The second command runs the program, which interrogates the Hive data warehouse, performs calculations, and displays the results.

The program first interrogates the census population data table and aggregates the population counts for a borough level (about the size of a town). The second step is a Hive right outer join, which computes the crime data and integrates the population data. The third step computes the crime rates, divided by 100,000 population, and displays the resulting table.

Find the Spark Java code for accessing Hive data lake below.

public class DemoHive {
  public static void main(String[] args) throws Exception {

// warehouseLocation points to the default location for managed databases and tables
String warehouseLocation = "file:" + System.getProperty("user.dir") + "/" + "spark-warehouse";   
    System.out.println("user-dir= " + System.getProperty("user.dir"));
SparkSession spark = SparkSession
  .appName("Java Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
    Dataset<Row> sqlboroDF =  spark.sql("select  sc.crime_type,  count(*) as numCrimes, l.borough_name FROM test_street_crime sc LEFT JOIN lsoa_borough l ON (sc.lsoaname = l.lsoa_name) WHERE sc.crime_type='Robbery' GROUP BY l.borough_name, sc.crime_type  ORDER BY borough_name ");   
    // Register the DataFrame as a temporary view      

    Dataset<Row> sqlDF = spark.sql("select borough_name, sum(population) as bpopulation FROM lsoa_borough GROUP BY borough_name");;
    // Register the DataFrame as a temporary view
    spark.sql("select b.crime_type, numCrimes, b.borough_name, bpopulation as population, numCrimes / (bpopulation / 100000 ) as crime_per_100000 from sqlborodf1 b, sqldf1 p where b.borough_name = p.borough_name").show();


Output from Java Spark Hive RDD

Figure 1. U.K. crime data download
Image shows U.K. crime data download
Image shows U.K. crime data download

In this calculation, you see the number of robbery crimes committed in each of the 32 London boroughs for one month in 2014. The second column shows the actual number of crimes, and the fifth column shows crimes per 100,000. Several areas have low robbery crime rates of less than 10, with Kingston having just 7.06 per 100,000, while other areas have higher robbery rates of about 40 crimes. This shows a tenfold greater risk of robbery in highest crime vs. lowest crime areas.

R histogram

Here you use the R component and import the csv file exported by the SparkSQL step above. The histogram shows crime rates in each London borough ordered by crime per 100,000 population.

Figure 2. R histogram
Image shows shows crime rates in each London                 borough ordered by crime per 100,000 population
Image shows shows crime rates in each London borough ordered by crime per 100,000 population

Crime rates per 100,000 permit a direct comparison of crimes. Clearly, comparing total crimes would be a poor measure of the likelihood of becoming a victim as the population counts per borough can vary significantly. Now we have the data for normalized crime rates we have the analytics foundation for more advanced analytics.


This tutorial has shown the integration of U.K. crime data with U.K. Census data. Illustrating data integration from disparate sources, a typical data management issue. Subsequently, you saw the computation of normalized statistics for crime rates enabling easy comparison of crime rates across different geographic areas.

Downloadable resources

Related topics


Sign in or register to add and subscribe to comments.

Zone=Data and analytics
ArticleTitle=Analyze crime data with Apache Spark and Hive ETL, Part 2: Explore the analyses