Classifying Texts and Documents
Source: Internet. Editor: 程序博客网. Date: 2024/06/05 02:05
How Classification is used
Text classification is used for a number of purposes:
- Spam detection
- Authorship attribution
- Sentiment analysis
- Age and gender identification
- Determining the subject of a document
- Language identification
Understanding sentiment analysis
With sentiment analysis, we are concerned with who holds what type of feeling about a specific product or topic.
Sentiment analysis can be applied to a sentence, a clause, or an entire document.
Further complicating the process, a single sentence or document can express different sentiments about different topics.
Text classifying techniques
There are two basic techniques:
- Rule-based
- Supervised Machine Learning
Rule-based classification uses a combination of words and other attributes organized around expert-crafted rules. These can be very effective, but creating them is a time-consuming process.
Supervised Machine Learning (SML) takes a collection of annotated training documents to create a model. The model is normally called the classifier. There are many different machine learning techniques, including Naive Bayes, Support Vector Machines (SVM), and k-nearest neighbor.
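To make the rule-based idea concrete, here is a minimal sketch in plain Java. The keyword list, the threshold, and the class name `RuleBasedClassifier` are all hypothetical and chosen only for illustration; a real rule set would be written and tuned by a domain expert.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical rule-based spam classifier (illustration only).
// Rule: a text is "spam" when at least THRESHOLD keywords occur in it.
class RuleBasedClassifier {
    // Hand-written keywords; these are assumptions, not a real rule set.
    private static final List<String> SPAM_KEYWORDS =
            Arrays.asList("free", "winner", "prize", "click here");
    private static final int THRESHOLD = 2;

    // Counts how many keywords occur in the text (case-insensitive substring match).
    static int countMatches(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        int matches = 0;
        for (String keyword : SPAM_KEYWORDS) {
            if (lower.contains(keyword)) {
                matches++;
            }
        }
        return matches;
    }

    // Applies the rule and returns the category label.
    static String classify(String text) {
        return countMatches(text) >= THRESHOLD ? "spam" : "ham";
    }

    public static void main(String[] args) {
        System.out.println(classify("You are a WINNER! Click here for your free prize"));
        System.out.println(classify("Meeting moved to 3pm tomorrow"));
    }
}
```

Even this toy version shows the trade-off: the rules are easy to read and audit, but every new pattern has to be added by hand.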
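As a rough illustration of the supervised approach, the sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing in plain Java. It is not taken from any of the libraries used later; the class name `NaiveBayesSketch`, the tokenizer, and the sample categories are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Minimal multinomial Naive Bayes with add-one (Laplace) smoothing.
// Illustration only; real toolkits add better tokenization and features.
class NaiveBayesSketch {
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Lowercases the text and splits on runs of non-word characters.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).trim().split("\\W+");
    }

    // Adds one annotated training document to the model.
    void train(String category, String text) {
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // Returns the category with the highest log-probability, or null if untrained.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            // Start from the log prior P(category).
            double score = Math.log(docCounts.get(category) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(category);
            int total = totalWords.getOrDefault(category, 0);
            for (String token : tokenize(text)) {
                if (token.isEmpty()) {
                    continue;
                }
                int count = counts.getOrDefault(token, 0);
                // Smoothed log likelihood P(token | category).
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }
}
```

Training consists of calling `train(category, text)` once per annotated document; `classify(text)` then picks the category whose word statistics best explain the new text.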
Process
    // OpenNLP: train a classifier
    // These calls appear to follow the OpenNLP 1.5.x API; newer releases
    // changed some of these signatures.
    import java.io.*;
    import opennlp.tools.doccat.*;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    DoccatModel model = null;
    try (InputStream dataIn = new FileInputStream("en-animal.train");
         OutputStream modelOut = new BufferedOutputStream(
                 new FileOutputStream("en-animal.model"))) {
        // One training document per line
        ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
        ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
        model = DocumentCategorizerME.train("en", sampleStream);
        // Save the trained model; try-with-resources flushes and closes the stream
        model.serialize(modelOut);
    } catch (IOException e) {
        // Handle exceptions
    }

    // OpenNLP: use the model trained above to classify a document
    // inputText is the String to classify
    try (InputStream modelIn = new FileInputStream(new File("en-animal.model"))) {
        DoccatModel model = new DoccatModel(modelIn);
        DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
        double[] outcomes = categorizer.categorize(inputText);
        // Print the score of every category
        for (int i = 0; i < categorizer.getNumberOfCategories(); i++) {
            String category = categorizer.getCategory(i);
            System.out.println(category + " - " + outcomes[i]);
        }
    } catch (IOException ex) {
        // Handle exceptions
    }
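The en-animal.train file read by DocumentSampleStream holds one document per line: the category label comes first, followed by whitespace and the document text. The sketch below writes a tiny file in that layout using only the JDK. The sample sentences, the categories "dog" and "cat", and the helper name `formatLine` are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Writes a small training file in a "category<space>text" one-line-per-document layout.
class DoccatTrainingFileSketch {
    // Builds one training line; the text is collapsed onto a single line.
    static String formatLine(String category, String text) {
        if (category.isEmpty() || category.matches(".*\\s.*")) {
            throw new IllegalArgumentException("category must be a single token");
        }
        String oneLine = text.trim().replaceAll("\\s+", " ");
        return category + " " + oneLine;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        // Hypothetical sample documents
        lines.add(formatLine("dog", "The puppy barked at the mail carrier"));
        lines.add(formatLine("cat", "The kitten purred and chased a mouse"));
        Files.write(Paths.get("en-animal.train"), lines, StandardCharsets.UTF_8);
    }
}
```

A useful training set needs far more documents per category than this; the point here is only the line layout.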
    // Stanford NLP: train a classifier
    import edu.stanford.nlp.classify.Classifier;
    import edu.stanford.nlp.classify.ColumnDataClassifier;
    import edu.stanford.nlp.ling.Datum;
    import edu.stanford.nlp.objectbank.ObjectBank;

    ColumnDataClassifier cdc = new ColumnDataClassifier("box.prop");
    Classifier<String, String> classifier =
            cdc.makeClassifier(cdc.readTrainingExamples("box.train"));

    // Test the classifier on every line of the test file
    for (String line : ObjectBank.getLineIterator("box.test", "utf-8")) {
        Datum<String, String> datum = cdc.makeDatumFromLine(line);
        System.out.println("Datum: {" + line + "}\tPredicted Category: "
                + classifier.classOf(datum));
    }

    // Predict the category of a single new sample
    String[] sample = {"", "6.90", "9.8", "15.69"};
    Datum<String, String> datum = cdc.makeDatumFromStrings(sample);
    System.out.println("Category: " + classifier.classOf(datum));
    // Stanford NLP pipeline: sentiment analysis
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation annotation = new Annotation("Text String");
    pipeline.annotate(annotation);

    // The predicted class is an index into these labels
    String[] sentimentText = {"Very Negative", "Negative", "Neutral",
            "Positive", "Very Positive"};
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
        // Sentiment is predicted per sentence from its annotated parse tree
        Tree tree = sentence.get(SentimentCoreAnnotations.AnnotatedTree.class);
        int score = RNNCoreAnnotations.getPredictedClass(tree);
        System.out.println(sentimentText[score]);
    }
Using LingPipe to classify text
I have used LingPipe before. The API documentation on its website is quite clear. It worked well for me on English text, but not on Chinese.
    // LingPipe: train a classifier on newsgroup data
    import java.io.File;
    import java.io.IOException;
    import com.aliasi.classify.Classification;
    import com.aliasi.classify.Classified;
    import com.aliasi.classify.DynamicLMClassifier;
    import com.aliasi.lm.NGramProcessLM;
    import com.aliasi.util.AbstractExternalizable;
    import com.aliasi.util.Compilable;
    import com.aliasi.util.Files;

    String[] categories = {"soc.religion.christian", "talk.religion.misc",
            "alt.atheism", "misc.forsale"};
    int nGramSize = 6;
    // Initialize the DynamicLMClassifier
    DynamicLMClassifier<NGramProcessLM> classifier =
            DynamicLMClassifier.createNGramProcess(categories, nGramSize);
    String directory = ".../demos";
    File trainingDirectory = new File(directory + "/data/fourNewsGroups/4news-train");
    // Outer loop: one subdirectory per category
    for (int i = 0; i < categories.length; ++i) {
        File classDir = new File(trainingDirectory, categories[i]);
        String[] trainingFiles = classDir.list();
        // Inner loop: one training document per file
        for (int j = 0; j < trainingFiles.length; ++j) {
            try {
                File file = new File(classDir, trainingFiles[j]);
                String text = Files.readFromFile(file, "ISO-8859-1");
                Classification classification = new Classification(categories[i]);
                Classified<CharSequence> classified = new Classified<>(text, classification);
                classifier.handle(classified);
            } catch (IOException ex) {
                // Handle exceptions
            }
        }
    }
    // Compile and save the model once, after all categories have been trained
    try {
        AbstractExternalizable.compileTo((Compilable) classifier,
                new File("classifier.model"));
    } catch (IOException ex) {
        // Handle exceptions
    }
    // LingPipe: train a sentiment classifier (negative vs. positive reviews)
    String[] categories = {"neg", "pos"};
    int nGramSize = 8;
    DynamicLMClassifier<NGramProcessLM> classifier =
            DynamicLMClassifier.createNGramProcess(categories, nGramSize);
    String directory = "...";
    File trainingDirectory = new File(directory, "txt_sentoken");
    for (int i = 0; i < categories.length; ++i) {
        Classification classification = new Classification(categories[i]);
        // Each category has its own subdirectory of review files
        File file = new File(trainingDirectory, categories[i]);
        File[] trainingFiles = file.listFiles();
        for (int j = 0; j < trainingFiles.length; ++j) {
            try {
                String review = Files.readFromFile(trainingFiles[j], "ISO-8859-1");
                Classified<CharSequence> classified = new Classified<>(review, classification);
                classifier.handle(classified);
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }
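Both LingPipe examples build their classifiers from character n-grams (nGramSize 6 and 8). To give a feel for why character n-grams can separate categories, the sketch below scores a text against per-category n-gram counts with add-one smoothing. This is a much cruder stand-in than a real n-gram language model and is not how LingPipe is implemented; the class name `CharNGramClassifierSketch` and all sample text are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Bag-of-character-n-grams classifier with add-one smoothing (illustration only).
class CharNGramClassifierSketch {
    private final int n;
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Map<String, Integer> totals = new HashMap<>();
    private final Set<String> seenGrams = new HashSet<>();

    CharNGramClassifierSketch(int n) {
        this.n = n;
    }

    // Counts every character n-gram of the text under the given category.
    void train(String category, CharSequence text) {
        Map<String, Integer> categoryCounts =
                counts.computeIfAbsent(category, k -> new HashMap<>());
        String s = text.toString();
        for (int i = 0; i + n <= s.length(); i++) {
            String gram = s.substring(i, i + n);
            categoryCounts.merge(gram, 1, Integer::sum);
            totals.merge(category, 1, Integer::sum);
            seenGrams.add(gram);
        }
    }

    // Returns the category whose n-gram counts give the text the highest score.
    // Texts shorter than n score 0 for every category, so the result is arbitrary.
    String bestCategory(CharSequence text) {
        String s = text.toString();
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> entry : counts.entrySet()) {
            int total = totals.getOrDefault(entry.getKey(), 0);
            double score = 0.0;
            for (int i = 0; i + n <= s.length(); i++) {
                int count = entry.getValue().getOrDefault(s.substring(i, i + n), 0);
                // The extra +1.0 reserves probability mass for unseen n-grams.
                score += Math.log((count + 1.0) / (total + seenGrams.size() + 1.0));
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

Because it works on characters rather than words, this kind of model needs no tokenizer and tolerates misspellings, though longer n-grams need more training text to be reliable.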