IBM Support

Spam Classification using Spark MLlib in IBM BigInsights - Hadoop Dev

Technical Blog Post


This blog post shows how to classify SMS messages as spam or ham (legitimate messages) using Spark MLlib.

 
Environment : IBM BigInsights 4.2

 
Step 1: Download the dataset

We use the SMS Spam Collection Data Set from the UCI Machine Learning Repository.

For more details, see:
https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

Download the dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip

Unzip the archive and upload the file (SMSSpamCollection) to HDFS (/tmp).

 
Step 2: Log in to the Spark shell

su hdfs
cd /usr/iop/current/spark-client
./bin/spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m

 
Step 3: At the Scala prompt, run the commands below

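// Read the dataset. Each line has the form "<label>\t<message>".
val inputRdd = sc.textFile("/tmp/SMSSpamCollection")

// Separate the spam and ham records. Match on the label field only:
// a substring test on the whole line would also match message text
// (for example, a ham message that mentions "spam").
val linesWithSpam = inputRdd.filter(line => line.startsWith("spam\t"))
val spam = linesWithSpam.map(x => x.split("\t")(1))

val linesWithHam = inputRdd.filter(line => line.startsWith("ham\t"))
val ham = linesWithHam.map(x => x.split("\t")(1))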
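
The first tab-separated field of each line is the label, so a label check is safest when it looks only at that field. The following is a minimal sketch in plain Scala (no Spark needed) that shows how a substring test on the whole line can mislabel a message. The sample lines are made up for illustration and are not taken from the dataset.

```scala
// Hypothetical sample lines in the "<label>\t<message>" layout.
val sampleLines = Seq(
  "ham\tI hate spam messages",
  "spam\tWin a free prize now",
  "ham\tSee you at the shop"
)

// Substring test on the whole line: also matches the first ham message,
// because its text mentions the word "spam".
val bySubstring = sampleLines.filter(line => line.contains("spam"))

// Split into (label, message) once and test only the label field.
// The limit of 2 keeps any further tabs inside the message text.
val parsed = sampleLines.map { line =>
  val fields = line.split("\t", 2)
  (fields(0), fields(1))
}
val byLabel = parsed.filter { case (label, _) => label == "spam" }

println(bySubstring.size) // 2
println(byLabel.size)     // 1
```

The same split-then-compare idea carries over to an RDD, since filter and map take the same kind of functions there.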
 

 
// Import the required MLlib classes
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// Convert the text to a vector of 100 features based on term frequency.
val tf = new HashingTF(numFeatures = 100)
val spamFeatures = spam.map(message => tf.transform(message.split(" ")))
val hamFeatures = ham.map(message => tf.transform(message.split(" ")))
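
To make the term-frequency hashing step more concrete, here is a rough plain-Scala sketch of the idea: each token is hashed to one of a fixed number of buckets, and the feature vector holds the count of tokens per bucket. The helper names bucketOf and termFrequencies are hypothetical, and the hash used here is Scala's hashCode, which is an assumption for illustration; MLlib's HashingTF may use a different hash function, so its indices can differ.

```scala
// Illustrative only: bucket index from hashCode, folded into [0, numFeatures).
def bucketOf(term: String, numFeatures: Int): Int = {
  val raw = term.hashCode % numFeatures
  if (raw < 0) raw + numFeatures else raw
}

// Count how many tokens fall into each bucket; buckets not present are implicitly 0.
def termFrequencies(terms: Seq[String], numFeatures: Int): Map[Int, Double] =
  terms.groupBy(t => bucketOf(t, numFeatures)).map { case (i, ts) => (i, ts.size.toDouble) }

val message = "free prize free entry"
val features = termFrequencies(message.split(" ").toSeq, numFeatures = 100)

// The counts always sum to the number of tokens, whatever the collisions.
println(features.values.sum) // 4.0
```

With only 100 buckets, different words can land in the same bucket. A larger number of features reduces such collisions at the cost of longer vectors.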

// Label spam as 1 and ham as 0.
val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
val training_data = positiveExamples.union(negativeExamples)
// Cache the labeled data, since it is reused by the split and by training.
training_data.cache()

// Use 60% of the dataset for training and the remaining 40% for testing the model.
val Array(trainset, testset) = training_data.randomSplit(Array(0.6, 0.4))

 

 
// Train a logistic regression model, then use it to make predictions.
val lrLearner = new LogisticRegressionWithSGD()
val model = lrLearner.run(trainset)
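
For intuition about what a binary logistic regression model does at prediction time: it computes a weighted sum of the features, passes it through the sigmoid function, and compares the result with a threshold (0.5 is the usual default). The sketch below is plain Scala, not MLlib code. The predict helper, the weights, and the inputs are all made up for illustration and are not taken from the trained model.

```scala
// Sigmoid squashes any real-valued score into (0, 1).
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// Return 1.0 (spam) when the estimated probability exceeds the threshold, else 0.0 (ham).
def predict(weights: Array[Double], intercept: Double, x: Array[Double],
            threshold: Double = 0.5): Double = {
  val margin = weights.zip(x).map { case (w, xi) => w * xi }.sum + intercept
  if (sigmoid(margin) > threshold) 1.0 else 0.0
}

// Hypothetical 3-feature model and inputs.
val w = Array(1.5, -2.0, 0.5)
println(predict(w, 0.0, Array(2.0, 0.0, 1.0))) // margin 3.5  -> 1.0
println(predict(w, 0.0, Array(0.0, 1.0, 0.0))) // margin -2.0 -> 0.0
```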

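// Pair each prediction with the true label of the test record.
val predictionLabel = testset.map(x => (model.predict(x.features), x.label))

// Fraction of test records whose predicted label matches the true label.
val accuracy = predictionLabel.filter(r => r._1 == r._2).count.toDouble / testset.count

println("Model accuracy : " + accuracy)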
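
Accuracy alone can be misleading when one class is much more common than the other, as ham usually is in SMS data, because a model that predicts ham for everything still scores well. Precision and recall for the spam class give a fuller picture. The following plain-Scala sketch computes them from (prediction, label) pairs. The pairs are hypothetical and are not results from the model above.

```scala
// Hypothetical (prediction, label) pairs; 1.0 = spam, 0.0 = ham.
val pairs = Seq((1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0))

val truePositives  = pairs.count { case (p, l) => p == 1.0 && l == 1.0 }
val falsePositives = pairs.count { case (p, l) => p == 1.0 && l == 0.0 }
val falseNegatives = pairs.count { case (p, l) => p == 0.0 && l == 1.0 }

// Precision: of the messages flagged as spam, how many really are spam.
val precision = truePositives.toDouble / (truePositives + falsePositives)
// Recall: of the real spam messages, how many were flagged.
val recall = truePositives.toDouble / (truePositives + falseNegatives)
// Accuracy on the same sample, for comparison.
val accuracyOfSample = pairs.count { case (p, l) => p == l }.toDouble / pairs.size

println("accuracy=" + accuracyOfSample + " precision=" + precision + " recall=" + recall)
```

On the predictionLabel RDD, the same counts can be obtained with filter followed by count, in the same way the accuracy is computed.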

 

 
With these steps, we have trained a model and used it to classify SMS messages as spam or ham.
