Classifying Texts and Documents

How classification is used

Text classification is used for a number of purposes:

  • Spam detection
  • Authorship attribution
  • Sentiment analysis
  • Age and gender identification
  • Determining the subject of a document
  • Language identification

Understanding sentiment analysis

With sentiment analysis, we are concerned with who holds what type of feeling about a specific product or topic.

Sentiment analysis can be applied to a sentence, a clause, or an entire document.

To complicate matters further, a single sentence or document can express different sentiments toward different topics.
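To make the difference between document-level and sentence-level sentiment concrete, here is a minimal sketch in plain Java. The tiny word lists, the naive sentence splitter, and the class name SentenceSentiment are hypothetical choices made for illustration only; real sentiment analyzers rely on trained models or much larger lexicons.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy lexicon-based scorer (hypothetical, for illustration only).
class SentenceSentiment {
    // Assumed word lists; a real lexicon would be far larger.
    private static final Set<String> POSITIVE = Set.of("great", "good", "love", "excellent");
    private static final Set<String> NEGATIVE = Set.of("bad", "poor", "hate", "terrible");

    // Score = number of positive words minus number of negative words.
    static int score(String text) {
        int score = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (POSITIVE.contains(word)) score++;
            if (NEGATIVE.contains(word)) score--;
        }
        return score;
    }

    // Naive sentence splitter: breaks on '.', '!' and '?'.
    static List<String> sentences(String document) {
        List<String> result = new ArrayList<>();
        for (String s : document.split("[.!?]+")) {
            String trimmed = s.trim();
            if (!trimmed.isEmpty()) result.add(trimmed);
        }
        return result;
    }

    public static void main(String[] args) {
        String review = "The camera is great. The battery life is terrible.";
        System.out.println("document: " + score(review));
        for (String s : sentences(review)) {
            System.out.println(score(s) + " : " + s);
        }
    }
}
```

For this review the document-level score is 0, while the two sentences score 1 and -1: the opposing opinions about the camera and the battery cancel out unless each sentence is scored separately.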

Text classification techniques

There are two basic techniques:

  • Rule-based
  • Supervised Machine Learning

Rule-based classification uses a combination of words and other attributes organized around expert-crafted rules. These rules can be very effective, but creating them is a time-consuming process.
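As a rough illustration of the rule-based approach, the sketch below maps hand-written trigger phrases to labels. The rule table, the labels, and the class name RuleBasedClassifier are all hypothetical; they stand in for the expert-crafted rules a real system would need.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Minimal rule-based classifier sketch. Rules and labels are made up
// for illustration and do not come from any library.
class RuleBasedClassifier {

    // Each rule maps a label to trigger phrases; rules are checked in insertion order.
    private static final Map<String, List<String>> RULES = new LinkedHashMap<>();
    static {
        RULES.put("spam", List.of("free money", "click here", "winner"));
        RULES.put("sports", List.of("goal", "match", "league"));
    }

    // Returns the label of the first rule with a matching phrase, or "unknown".
    static String classify(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, List<String>> rule : RULES.entrySet()) {
            for (String phrase : rule.getValue()) {
                if (lower.contains(phrase)) {
                    return rule.getKey();
                }
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(classify("Click here to claim your FREE MONEY"));
        System.out.println(classify("The league match ended with a late goal"));
        System.out.println(classify("Meeting moved to Tuesday"));
    }
}
```

Even this toy version shows the trade-off: the behavior is easy to inspect and adjust, but every new category or phrasing needs another hand-written rule.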

Supervised Machine Learning (SML) takes a collection of annotated training documents and uses it to create a model. The model is normally called the classifier. There are many different machine learning techniques, including Naive Bayes, Support Vector Machines (SVM), and k-nearest neighbors.
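To show what "creating a model from annotated documents" can look like, here is a small multinomial Naive Bayes sketch in plain Java with add-one (Laplace) smoothing. The class name TinyNaiveBayes, the whitespace-and-punctuation tokenizer, and the animal training sentences in the usage note are assumptions made for this example; it is not how any of the libraries in these notes implement their classifiers.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Multinomial Naive Bayes with add-one smoothing (illustrative sketch).
class TinyNaiveBayes {
    private final Map<String, Integer> docCounts = new HashMap<>();               // documents per label
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // word counts per label
    private final Map<String, Integer> totalWords = new HashMap<>();              // tokens per label
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Lowercase and split on non-word characters.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("\\W+");
    }

    // Adds one annotated training document to the counts.
    void train(String label, String text) {
        totalDocs++;
        docCounts.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            if (word.isEmpty()) continue;
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // Returns the label with the highest log-probability, or null if untrained.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // log P(label) + sum of log P(word | label)
            double score = Math.log((double) docCounts.get(label) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String word : tokenize(text)) {
                if (word.isEmpty()) continue;
                int count = counts.getOrDefault(word, 0);
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

Usage: after calling train("dog", "a loyal dog fetches the ball") and train("cat", "the cat purrs and naps"), classify("my dog fetches a ball") returns "dog". Training only updates counts; all of the probability arithmetic happens at classification time.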


// OpenNLP
import java.io.*;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

// Train a classifier
DoccatModel model = null;
try (InputStream dataIn = new FileInputStream("en-animal.train");
        OutputStream dataOut = new FileOutputStream("en-animal.model")) {
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
    model = DocumentCategorizerME.train("en", sampleStream);
    // Write the trained model to en-animal.model
    OutputStream modelOut = new BufferedOutputStream(dataOut);
    model.serialize(modelOut);
} catch (IOException e) {
    // Handle exceptions
}

// Use the model above to classify a document
try (InputStream modelIn = new FileInputStream(new File("en-animal.model"))) {
    // Named loadedModel so it does not clash with the model variable above
    DoccatModel loadedModel = new DoccatModel(modelIn);
    DocumentCategorizerME categorizer = new DocumentCategorizerME(loadedModel);
    // inputText holds the text to classify; outcomes has one probability per category
    double[] outcomes = categorizer.categorize(inputText);
    for (int i = 0; i < categorizer.getNumberOfCategories(); i++) {
        String category = categorizer.getCategory(i);
        System.out.println(category + " - " + outcomes[i]);
    }
} catch (IOException ex) {
    // Handle exceptions
}

// Stanford NLP
import edu.stanford.nlp.classify.Classifier;
import edu.stanford.nlp.classify.ColumnDataClassifier;
import edu.stanford.nlp.ling.Datum;
import edu.stanford.nlp.objectbank.ObjectBank;

// Train a classifier
ColumnDataClassifier cdc = new ColumnDataClassifier("box.prop");
Classifier<String, String> classifier =
        cdc.makeClassifier(cdc.readTrainingExamples("box.train"));

// Test the classifier
for (String line : ObjectBank.getLineIterator("box.test", "utf-8")) {
    Datum<String, String> datum = cdc.makeDatumFromLine(line);
    System.out.println("Datum: {" + line + "}\tPredicted Category: "
            + classifier.classOf(datum));
}

// Predict the category of a single sample (the first column is the empty class label)
String[] sample = {"", "6.90", "9.8", "15.69"};
Datum<String, String> datum = cdc.makeDatumFromStrings(sample);
System.out.println("Category: " + classifier.classOf(datum));

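// Stanford NLP pipeline
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.util.CoreMap;

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, parse, sentiment");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation annotation = new Annotation("Text String");
pipeline.annotate(annotation);

// Labels indexed by the predicted class (0 to 4)
String[] sentimentText = {"Very Negative", "Negative", "Neutral", "Positive", "Very Positive"};
for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
    // Newer CoreNLP releases name this key SentimentAnnotatedTree
    Tree tree = sentence.get(SentimentCoreAnnotations.AnnotatedTree.class);
    int score = RNNCoreAnnotations.getPredictedClass(tree);
    System.out.println(sentimentText[score]);
}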

Using LingPipe to classify text

I have used LingPipe before. The API documentation on its website is quite clear. In my experience it works well for English text, but not for Chinese.

// LingPipe
import java.io.File;
import java.io.IOException;
import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.Compilable;
import com.aliasi.util.Files;

String[] categories = {"soc.religion.christian", "talk.religion.misc", "alt.atheism", ""};
int nGramSize = 6;

// Initialize the DynamicLMClassifier
DynamicLMClassifier<NGramProcessLM> classifier =
        DynamicLMClassifier.createNGramProcess(categories, nGramSize);
String directory = ".../demos";
File trainingDirectory = new File(directory + "/data/fourNewsGroups/4news-train");

// Train on every file found in each category's directory
for (int i = 0; i < categories.length; ++i) {
    File classDir = new File(trainingDirectory, categories[i]);
    String[] trainingFiles = classDir.list();
    for (int j = 0; j < trainingFiles.length; ++j) {
        try {
            File file = new File(classDir, trainingFiles[j]);
            String text = Files.readFromFile(file, "ISO-8859-1");
            Classification classification = new Classification(categories[i]);
            Classified<CharSequence> classified = new Classified<>(text, classification);
            classifier.handle(classified);
        } catch (IOException ex) {
            // Handle exceptions
        }
    }
}

// Compile and save the model once, after all categories have been trained
try {
    AbstractExternalizable.compileTo((Compilable) classifier, new File("classifier.model"));
} catch (IOException ex) {
    // Handle exceptions
}

// LingPipe sentiment
// Reuses the variables declared in the previous snippet
categories = new String[2];
categories[0] = "neg";
categories[1] = "pos";
nGramSize = 8;
classifier = DynamicLMClassifier.createNGramProcess(categories, nGramSize);
directory = "...";
trainingDirectory = new File(directory, "txt_sentoken");

// Train on every review in the neg and pos directories
for (int i = 0; i < categories.length; ++i) {
    Classification classification = new Classification(categories[i]);
    File file = new File(trainingDirectory, categories[i]);
    File[] trainingFiles = file.listFiles();
    for (int j = 0; j < trainingFiles.length; ++j) {
        try {
            String review = Files.readFromFile(trainingFiles[j], "ISO-8859-1");
            Classified<CharSequence> classified = new Classified<>(review, classification);
            classifier.handle(classified);
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}
