Document classification using OpenNLP

  nlp

Preface

This article shows how to use OpenNLP to classify documents.

DoccatModel

Document classification in OpenNLP is done with a maximum entropy model (Maximum Entropy Model), which corresponds to the DoccatModel class.

    import java.io.IOException;
    import java.util.Set;
    import java.util.SortedMap;

    import org.junit.Assert;
    import org.junit.Test;

    import opennlp.tools.doccat.*;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.ObjectStreamUtils;
    import opennlp.tools.util.TrainingParameters;

    @Test
    public void testSimpleTraining() throws IOException {

        ObjectStream<DocumentSample> samples = ObjectStreamUtils.createObjectStream(
                new DocumentSample("1", new String[]{"a", "b", "c"}),
                new DocumentSample("1", new String[]{"a", "b", "c", "1", "2"}),
                new DocumentSample("1", new String[]{"a", "b", "c", "3", "4"}),
                new DocumentSample("0", new String[]{"x", "y", "z"}),
                new DocumentSample("0", new String[]{"x", "y", "z", "5", "6"}),
                new DocumentSample("0", new String[]{"x", "y", "z", "7", "8"}));

        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, 100);
        params.put(TrainingParameters.CUTOFF_PARAM, 0);

        DoccatModel model = DocumentCategorizerME.train("x-unspecified", samples,
                params, new DoccatFactory());

        DocumentCategorizer doccat = new DocumentCategorizerME(model);

        double[] aProbs = doccat.categorize(new String[]{"a"});
        Assert.assertEquals("1", doccat.getBestCategory(aProbs));

        double[] bProbs = doccat.categorize(new String[]{"x"});
        Assert.assertEquals("0", doccat.getBestCategory(bProbs));

        //test to make sure sorted map's last key is cat 1 because it has the highest score.
        SortedMap<Double, Set<String>> sortedScoreMap = doccat.sortedScoreMap(new String[]{"a"});
        Set<String> cat = sortedScoreMap.get(sortedScoreMap.lastKey());
        Assert.assertEquals(1, cat.size());
    }

To keep the test simple, the training text is constructed by hand as DocumentSample instances.
The categorize method returns an array of probabilities, and getBestCategory returns the best-matching category according to those probabilities.
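Once trained, a model is usually persisted for later use rather than retrained each time. The following sketch uses OpenNLP's standard model serialization (`BaseModel.serialize` and the `DoccatModel(InputStream)` constructor); the class and method names here are illustrative, not part of OpenNLP:

```java
import java.io.*;
import opennlp.tools.doccat.DoccatModel;

// Sketch: persist a trained DoccatModel to disk and reload it later.
public class ModelPersistence {

    // Write the model in OpenNLP's binary model format.
    public static void save(DoccatModel model, File file) throws IOException {
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(file))) {
            model.serialize(out);
        }
    }

    // Read a previously serialized model back from disk.
    public static DoccatModel load(File file) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            return new DoccatModel(in);
        }
    }
}
```

The reloaded model can then be passed to `new DocumentCategorizerME(model)` exactly as in the test above.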

The output is as follows:

    Indexing events with TwoPass using cutoff of 0

        Computing event counts...  done. 6 events
        Indexing...  done.
    Sorting and merging events... done. Reduced 6 events to 6.
    Done indexing in 0.13 s.
    Incorporating indexed data for training...
    done.
        Number of Event Tokens: 6
            Number of Outcomes: 2
          Number of Predicates: 14
    ...done.
    Computing model parameters ...
    Performing 100 iterations.
      1:  ... loglikelihood=-4.1588830833596715    0.5
      2:  ... loglikelihood=-2.6351991759048894    1.0
      3:  ... loglikelihood=-1.9518912133474995    1.0
      4:  ... loglikelihood=-1.5599038834410852    1.0
      5:  ... loglikelihood=-1.3039748361952568    1.0
      6:  ... loglikelihood=-1.1229511041438864    1.0
      7:  ... loglikelihood=-0.9877356230661396    1.0
      8:  ... loglikelihood=-0.8826624290652341    1.0
      9:  ... loglikelihood=-0.7985244514476817    1.0
     10:  ... loglikelihood=-0.729543972551105    1.0
    //...
     95:  ... loglikelihood=-0.0933856684859806    1.0
     96:  ... loglikelihood=-0.09245907503183291    1.0
     97:  ... loglikelihood=-0.09155090064000486    1.0
     98:  ... loglikelihood=-0.09066059844628399    1.0
     99:  ... loglikelihood=-0.08978764309881068    1.0
    100:  ... loglikelihood=-0.08893152970793908    1.0

Summary

OpenNLP's categorize method expects input that has already been tokenized, so it is not very convenient to call on its own. Viewed from a pipeline-based design, however, this makes sense: tokenization is simply a step that precedes classification in the pipeline. This article only walked through the official test source code as an introduction; readers can download a Chinese text-classification training set, train a model on it, and then classify Chinese text.
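For training on a downloaded data set rather than hand-written samples, OpenNLP ships a DocumentSampleStream that parses one sample per line in the form "category" followed by whitespace-separated tokens. A minimal sketch, assuming such a file exists at the path you pass in (the class name and file path here are placeholders):

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import opennlp.tools.doccat.*;
import opennlp.tools.util.*;

// Sketch: train a DoccatModel from a plain-text file.
// Each line of the file: "<category> <token> <token> ..."
public class TrainFromFile {

    public static DoccatModel train(File trainingFile, String lang) throws IOException {
        InputStreamFactory isf = new MarkableFileInputStreamFactory(trainingFile);
        try (ObjectStream<String> lines = new PlainTextByLineStream(isf, StandardCharsets.UTF_8);
             ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines)) {

            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.ITERATIONS_PARAM, 100);
            params.put(TrainingParameters.CUTOFF_PARAM, 0);

            return DocumentCategorizerME.train(lang, samples, params, new DoccatFactory());
        }
    }
}
```

For Chinese text, each line would need to be segmented into tokens beforehand (OpenNLP does not do Chinese word segmentation for you), which is exactly the pipeline ordering discussed above.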
