The Scala version can look as follows:

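    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.regression.LabeledPoint

    // Hash each token into one of 100000 term-frequency buckets
    val hash = new HashingTF(numFeatures = 100000)
    // Split each line on tabs; keep (label, lower-cased tokens, TF vector)
    val raw = sc.textFile("data/sms-labeled.txt").distinct().map {
      _.split("\\t+")
    }.map {
      a => (a(0), a(1).split("\\s+").map(_.toLowerCase()))
    }.map {
      t => (t._1, t._2, hash.transform(t._2))
    }.cache

    // Fit IDF on the TF vectors, then rescale them and attach numeric labels (spam = 1, otherwise 0)
    val idf = new IDF().fit(raw.map(_._3))
    val data = raw.map {
      t => LabeledPoint(if (t._1 == "spam") 1 else 0, idf.transform(t._3))
    }

    // 80/20 train/test split with a fixed seed for reproducibility
    val Array(train, test) = data.randomSplit(Array(.8, .2), 102059L)
    val model = NaiveBayes.train(train)
    evaluateModel(model, test)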

The RDD named raw could be kept in a simpler form, but it also carries the tokenized words because some additional computations further down rely on them.
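
To make the HashingTF step less of a black box, here is a minimal plain-Scala sketch of the hashing trick. It is not Spark's implementation: the object `HashingSketch` and its methods are hypothetical names, and the sketch assumes a bucket is simply the non-negative modulus of the term's `hashCode`, whereas Spark may use a different hash function depending on the version.

```scala
// Plain-Scala illustration of the hashing trick (an assumption-based sketch, not Spark's code).
object HashingSketch {
  // Map a term to a bucket in [0, numFeatures) via the non-negative modulus of its hash code.
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Count how many tokens land in each bucket: a sparse term-frequency vector.
  def termFrequencies(tokens: Seq[String], numFeatures: Int): Map[Int, Double] =
    tokens.groupBy(bucket(_, numFeatures)).map { case (index, ts) => (index, ts.size.toDouble) }
}
```

For instance, `HashingSketch.termFrequencies(Seq("free", "call", "free"), 100000)` counts "free" twice in its bucket. Two different terms can share a bucket, which is the price paid for not keeping a vocabulary.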

First of all, the helper method for model evaluation:

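  import org.apache.spark.mllib.classification.NaiveBayesModel
  import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.rdd.RDD

  // Stats, confusionMatrix and printMetrics are helpers that are not shown in this section
  def evaluateModel(model: NaiveBayesModel, test: RDD[LabeledPoint]) = {
    // Batch prediction over all test features (not used further below)
    val predict = model.predict(test.map(_.features))

    // Print the prediction next to the true label for the first 5 test records
    test.take(5).foreach {
      x => println(s"Predicted: ${model.predict(x.features)}, Label: ${x.label}")
    }

    // Pair each prediction with its true label
    val predictionsAndLabels = test.map {
      point => (model.predict(point.features), point.label)
    }

    val stats = Stats(confusionMatrix(predictionsAndLabels))
    println(stats.toString)

    val metrics = new BinaryClassificationMetrics(predictionsAndLabels)
    printMetrics(metrics)
  }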

Typical prediction output for the first 5 records of the test dataset can look like this:

Predicted: 1.0, Label: 1.0
Predicted: 1.0, Label: 1.0
Predicted: 0.0, Label: 0.0
Predicted: 0.0, Label: 0.0
Predicted: 0.0, Label: 0.0

The output of the model evaluation can look like this:

TP: 123.0, TN: 879.0, FP: 24.0, FN: 12.0 
TPR (recall/sensitivity): 0.9111111111111111 
TNR (specificity): 0.973421926910299 
PPV (precision): 0.8367346938775511 
NPV: 0.9865319865319865 
FPR (fall-out): 0.02657807308970095 
FNR: 0.0888888888888889 
FDR: 0.16326530612244894 
ACC (accuracy): 0.9653179190751445 
F1 (F-Measure): 0.8723404255319148 
MCC (Matthews correlation coefficient): 0.8533502082524206 
Threshold: 1.0, Precision: 0.8367346938775511
Threshold: 0.0, Precision: 0.13005780346820808
Threshold: 0.0, Recall: 1.0
Threshold: 1.0, Recall: 0.9111111111111111
Threshold: 0.0, F-score: 0.23017902813299232, Beta = 1
Threshold: 1.0, F-score: 0.8723404255319148, Beta = 1
Threshold: 1.0, F-score: 0.8723404255319148, Beta = 0.5
Threshold: 0.0, F-score: 0.23017902813299232, Beta = 0.5
Area under PR (precision-recall curve) = 0.8797032493151403
Area under ROC (Receiver Operating Characteristic) = 0.9422665190107051
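
The first block of figures above follows directly from the four confusion-matrix counts. The sketch below shows those formulas in plain Scala. `ConfusionStats` is a hypothetical stand-in, since the `Stats` helper used by `evaluateModel` is not shown here and may be implemented differently.

```scala
// Hypothetical stand-in for the Stats helper; derives the usual ratios from the four counts.
final case class ConfusionStats(tp: Double, tn: Double, fp: Double, fn: Double) {
  val tpr: Double = tp / (tp + fn)                   // recall / sensitivity
  val tnr: Double = tn / (tn + fp)                   // specificity
  val ppv: Double = tp / (tp + fp)                   // precision
  val npv: Double = tn / (tn + fn)                   // negative predictive value
  val fpr: Double = fp / (fp + tn)                   // fall-out
  val fnr: Double = fn / (fn + tp)                   // miss rate
  val fdr: Double = fp / (fp + tp)                   // false discovery rate
  val acc: Double = (tp + tn) / (tp + tn + fp + fn)  // accuracy
  val f1: Double  = 2 * tp / (2 * tp + fp + fn)      // harmonic mean of precision and recall
  // Matthews correlation coefficient
  val mcc: Double =
    (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
}
```

With the counts from the listing, `ConfusionStats(123.0, 879.0, 24.0, 12.0).f1` evaluates to roughly 0.872, in line with the F1 line above.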

A supplementary method that performs roughly the same computation as the TF-IDF classes provided by Spark's creators:

  def tfidf(data: RDD[(String, Array[String])])(implicit sc: SparkContext) = {
    val docs = data.count.toDouble
    // TF: term frequencies within each message
    val tfs = data.map {
      t => (t._1, t._2.foldLeft(Map.empty[String, Int])((m, s) => m + (s -> (1 + m.getOrElse(s, 0)))))
    }

    // IDF: ln(docs / (1 + total occurrences of the term)), collected to the driver as a map
    val idfs = data.flatMap(_._2).map((_, 1)).reduceByKey(_ + _).map {
      case (term, count) => (term, math.log(docs / (1 + count)))
    }.collectAsMap

    // TF-IDF: weight each term frequency by the term's IDF
    tfs.map {
      case (m, tf) =>
        (m, tf.map {
          case (term, freq) => (term, freq * idfs.getOrElse(term, 0d))
        })
    }
  }

It can be used, for example, to display the terms with the highest TF-IDF weights that are used in spam messages:

    val termsInSpamMsgs = tfidf(raw.filter(_._1 == "spam").map(t => (t._1, t._2))).sortBy(_._2.values, ascending = false)
    termsInSpamMsgs.take(10).foreach(println)

Output (limited to just 3 records) can look as follows:

(spam,Map(poly -> 11.749883315444682, sleepingwith, -> 5.788429948716486, 4 -> 2.1508437889901, pobox365o4w45wq -> 5.788429948716486, tones -> 3.3035232989284853, all -> 3.223480591254949, gr8 -> 4.402135587596595, to -> 0.09301572373080139, direct -> 4.08368185647806, eg -> 3.9966704794884307, finest, -> 5.788429948716486, 8007 -> 3.48584485572244, crazyin, -> 5.788429948716486, 2u -> 5.788429948716486, rply -> 4.689817660048376, with -> 1.9382823470064272, titles: -> 5.09528276815654, title -> 5.09528276815654, ymca -> 5.788429948716486, :getzed.co.uk -> 5.788429948716486, 300p -> 5.788429948716486, mobs -> 5.788429948716486, breathe1 -> 5.788429948716486))
(spam,Map(it's -> 10.19056553631308, your -> 0.9926394031197446, learn -> 5.788429948716486, txts! -> 5.788429948716486, but -> 4.535666980221118, incredible -> 5.788429948716486, mind. -> 5.788429948716486, blow -> 5.788429948716486, it -> 3.3460829133472814, 18p/txt -> 5.788429948716486, reply -> 1.9382823470064272, that -> 3.3035232989284853, to -> 0.09301572373080139, now -> 2.0995504946025494, you -> 1.0566271117950283, believe -> 5.788429948716486, g -> 5.788429948716486, truly -> 5.382964840608321, things -> 5.788429948716486, will -> 2.768005062572123, from -> 1.7024536361649016, true. -> 5.788429948716486, o2fwd -> 5.788429948716486, won't -> 5.382964840608321, amazing -> 5.382964840608321, only -> 2.456225438541282))
(spam,Map(am -> 7.833255543629789, borin -> 5.788429948716486, xx -> 4.8721392168423305, u -> 3.8137323015460964, & -> 1.981767458946166, luv -> 4.689817660048376, 09099725823 -> 5.788429948716486, calls£1/minmoremobsemspobox45po139wa -> 5.382964840608321, now -> 4.199100989205099, chat -> 2.985069567809951, here -> 4.8721392168423305, cum -> 5.09528276815654, over -> 4.402135587596595, hope -> 4.8721392168423305, claire -> 10.765929681216642, 2nite? -> 5.382964840608321, alone -> 5.09528276815654, 2 -> 1.4576966084301548, havin -> 5.788429948716486, c -> 3.70898840703665, time -> 3.70898840703665, wanna -> 4.08368185647806))
...
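
To see where such weights come from, the same arithmetic can be tried on a handful of in-memory documents. This is a minimal sketch on local collections, assuming the weighting used by the `tfidf` method above (term count within a document times ln(docs / (1 + total occurrences of the term))). `TfIdfSketch` is a hypothetical name and no Spark is involved.

```scala
// Local-collection sketch of the weighting used by the tfidf method above (no Spark).
object TfIdfSketch {
  def tfidf(docs: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val numDocs = docs.size.toDouble
    // Total occurrences of each term across all documents
    val counts: Map[String, Int] =
      docs.flatten.groupBy(identity).map { case (term, occ) => (term, occ.size) }
    // IDF: ln(numDocs / (1 + total occurrences))
    val idfs: Map[String, Double] =
      counts.map { case (term, count) => (term, math.log(numDocs / (1 + count))) }
    // TF-IDF per document: occurrences within the document times the term's IDF
    docs.map { doc =>
      doc.groupBy(identity).map { case (term, occ) => (term, occ.size * idfs.getOrElse(term, 0d)) }
    }
  }
}
```

For four documents `Seq(Seq("free", "call"), Seq("call", "now"), Seq("hello"), Seq("bye"))`, the term "free" occurs once overall and gets weight ln(4 / 2), while "call" occurs twice and gets the smaller weight ln(4 / 3). A term whose total count reaches the number of documents ends up with a negative weight under this formula.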
