Naive Bayes with TF-IDF for Spam Classification
The following example runs on the SMS Spam Collection v.1 data set, a set of tagged SMS messages collected for SMS spam research. It contains 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam. More details can be found on the data set's web page.
NaiveBayes is used as the classifier. Every SMS message, regardless of its label, is transformed into a feature vector. The term frequency (TF) is the number of occurrences of a particular term within a message. The inverse document frequency (IDF) measures how rare a term is across the whole message set: the fewer messages contain the term, the higher its IDF. TF-IDF, short for Term Frequency–Inverse Document Frequency, is the product of these two statistics.
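To make the pipeline concrete, here is a minimal sketch in plain Python of TF-IDF weighting followed by a multinomial Naive Bayes classifier. It is not the implementation used by the example itself. The function names (tokenize, fit_idf, tfidf_vector, train_naive_bayes, predict), the smoothed IDF formula log((N + 1) / (df + 1)), the Laplace smoothing parameter alpha, and the four toy messages are all assumptions made for illustration. The toy messages are invented and do not come from the SMS Spam Collection.

```python
import math
from collections import Counter, defaultdict


def tokenize(message):
    # Simplistic tokenizer (assumption): lowercase and split on whitespace.
    return message.lower().split()


def fit_idf(documents):
    # documents: list of token lists. Returns a dict term -> IDF weight.
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per message
    # Smoothed IDF variant (assumption): log((N + 1) / (df + 1)).
    return {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    # TF is the raw count of the term in the message; unseen terms are dropped.
    tf = Counter(tokens)
    return {t: count * idf[t] for t, count in tf.items() if t in idf}


def train_naive_bayes(vectors, labels, vocabulary, alpha=1.0):
    # Multinomial Naive Bayes over TF-IDF weights with Laplace smoothing.
    class_counts = Counter(labels)
    weight_sums = {label: defaultdict(float) for label in class_counts}
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            weight_sums[label][term] += weight
    model = {}
    for label, count in class_counts.items():
        total = sum(weight_sums[label].values()) + alpha * len(vocabulary)
        log_likelihood = {
            t: math.log((weight_sums[label].get(t, 0.0) + alpha) / total)
            for t in vocabulary
        }
        model[label] = (math.log(count / len(labels)), log_likelihood)
    return model


def predict(model, vec):
    # Pick the label with the highest log posterior (up to a constant).
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(w * log_likelihood[t] for t, w in vec.items())
    return max(model, key=score)


# Invented toy messages, not taken from the SMS Spam Collection.
training = [
    ("spam", "win a free prize now"),
    ("spam", "free cash prize claim now"),
    ("ham", "are we meeting for lunch today"),
    ("ham", "see you at lunch"),
]
labels = [label for label, _ in training]
docs = [tokenize(text) for _, text in training]
idf = fit_idf(docs)
vectors = [tfidf_vector(tokens, idf) for tokens in docs]
model = train_naive_bayes(vectors, labels, vocabulary=set(idf))

spam_guess = predict(model, tfidf_vector(tokenize("claim your free prize"), idf))
ham_guess = predict(model, tfidf_vector(tokenize("lunch today"), idf))
print(spam_guess, ham_guess)
```

On this toy input the first query is classified as spam and the second as ham. With the real data set, the same steps would apply to each of the 5,574 messages, typically with a held-out portion reserved for measuring accuracy.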