Part-of-speech tagging with OpenNLP

Introduction

This article shows how to use OpenNLP for part-of-speech tagging.

POS Tagging

Part-of-speech (POS) tagging is the process of assigning a grammatical category, such as noun or verb, to each word in a text. The assigned category is called a tag (or label).

Two POS tag sets are currently popular for Chinese: the Peking University (PKU) tag set and the Penn Treebank (University of Pennsylvania) tag set. Words in modern Chinese are usually divided into 12 parts of speech in two classes. The first class is content words: nouns, verbs, adjectives, numerals, quantifiers (measure words), and pronouns. The second is function words: adverbs, prepositions, conjunctions, auxiliary words (particles), interjections, and onomatopoeia.

Most POS taggers are built either on a hidden Markov model (HMM) decoded with the Viterbi algorithm or on a maximum entropy model.
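To make the HMM plus Viterbi idea concrete, here is a minimal sketch of Viterbi decoding over a toy model. The three tags, the vocabulary, and every probability below are invented for illustration; the class name ViterbiSketch is hypothetical, and this is not how OpenNLP's POSTaggerME is implemented.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ViterbiSketch {
    // Toy model: all tags and probabilities are made up for this example.
    static final String[] TAGS = {"DT", "NN", "VB"};
    static final double[] START = {0.6, 0.3, 0.1};      // P(tag at sentence start)
    static final double[][] TRANS = {                   // P(next tag | previous tag)
        {0.05, 0.90, 0.05},                             // from DT
        {0.10, 0.30, 0.60},                             // from NN
        {0.50, 0.40, 0.10}                              // from VB
    };
    static final Map<String, double[]> EMIT = new HashMap<>(); // P(word | tag)
    static {
        EMIT.put("the",   new double[] {0.90, 0.05, 0.05});
        EMIT.put("dog",   new double[] {0.05, 0.80, 0.15});
        EMIT.put("barks", new double[] {0.05, 0.25, 0.70});
    }
    // Unknown words get the same small probability under every tag.
    static final double[] UNKNOWN = {1e-6, 1e-6, 1e-6};

    /** Returns the most probable tag sequence for the given words. */
    static String[] decode(String[] words) {
        int n = words.length;
        int k = TAGS.length;
        if (n == 0) {
            return new String[0];
        }
        double[][] score = new double[n][k]; // best log-probability ending in tag t at position i
        int[][] back = new int[n][k];        // previous tag on that best path

        double[] e0 = EMIT.getOrDefault(words[0].toLowerCase(), UNKNOWN);
        for (int t = 0; t < k; t++) {
            score[0][t] = Math.log(START[t]) + Math.log(e0[t]);
        }
        for (int i = 1; i < n; i++) {
            double[] e = EMIT.getOrDefault(words[i].toLowerCase(), UNKNOWN);
            for (int t = 0; t < k; t++) {
                int bestPrev = 0;
                double best = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {
                    double s = score[i - 1][p] + Math.log(TRANS[p][t]);
                    if (s > best) {
                        best = s;
                        bestPrev = p;
                    }
                }
                score[i][t] = best + Math.log(e[t]);
                back[i][t] = bestPrev;
            }
        }
        // Pick the best final tag, then follow the back-pointers to the start.
        int last = 0;
        for (int t = 1; t < k; t++) {
            if (score[n - 1][t] > score[n - 1][last]) {
                last = t;
            }
        }
        String[] tags = new String[n];
        for (int i = n - 1; i >= 0; i--) {
            tags[i] = TAGS[last];
            last = back[i][last];
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(decode(new String[] {"the", "dog", "barks"})));
        // prints [DT, NN, VB]
    }
}
```

Working in log space avoids numeric underflow on long sentences, since the path score becomes a sum rather than a product of many small probabilities.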

In OpenNLP, the POSTaggerME class performs basic POS tagging, and the ChunkerME class performs chunking.

POSTaggerME

    // Trains a POS model with the given algorithm (for example ModelType.MAXENT).
    public static POSModel trainPOSModel(ModelType type) throws IOException {
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ALGORITHM_PARAM, type.toString());
        params.put(TrainingParameters.ITERATIONS_PARAM, 100); // number of training iterations
        params.put(TrainingParameters.CUTOFF_PARAM, 5);       // drop features seen fewer than 5 times

        return POSTaggerME.train("eng", createSampleStream(), params,
                new POSTaggerFactory());
    }

    // Reads the annotated training sentences, one "word_tag word_tag ..." sentence per line.
    private static ObjectStream<POSSample> createSampleStream() throws IOException {
        InputStreamFactory in = new ResourceAsStreamFactory(POSTaggerMETest.class,
                "postag/AnnotatedSentences.txt");

        return new WordTagSampleStream(new PlainTextByLineStream(in, StandardCharsets.UTF_8));
    }

    @Test
    public void testPOSTagger() throws IOException {
        POSModel posModel = trainPOSModel(ModelType.MAXENT);

        POSTagger tagger = new POSTaggerME(posModel);

        String[] tags = tagger.tag(new String[] {
                "The",
                "driver",
                "got",
                "badly",
                "injured",
                "."});

        Assert.assertEquals(6, tags.length);
        Assert.assertEquals("DT", tags[0]);
        Assert.assertEquals("NN", tags[1]);
        Assert.assertEquals("VBD", tags[2]);
        Assert.assertEquals("RB", tags[3]);
        Assert.assertEquals("VBN", tags[4]);
        Assert.assertEquals(".", tags[5]);
    }

The test first trains a model. The training data has one sentence per line, with each token written as word_tag:

Last_JJ September_NNP ,_, I_PRP tried_VBD to_TO find_VB out_RP the_DT address_NN of_IN an_DT old_JJ school_NN friend_NN whom_WP I_PRP had_VBD not_RB seen_VBN for_IN 15_CD years_NNS ._.
I_PRP just_RB knew_VBD his_PRP$ name_NN ,_, Alan_NNP McKennedy_NNP ,_, and_CC I_PRP 'd_MD heard_VBD the_DT rumour_NN that_IN he_PRP 'd_MD moved_VBD to_TO Scotland_NNP ,_, the_DT country_NN of_IN his_PRP$ ancestors_NNS ._.
So_IN I_PRP called_VBD Julie_NNP ,_, a_DT friend_NN who's_WDT still_RB in_IN contact_NN with_IN him_PRP ._.
She_PRP told_VBD me_PRP that_IN he_PRP lived_VBD in_IN 23213_CD Edinburgh_NNP ,_, Worcesterstreet_NNP 12_CD ._.
I_PRP wrote_VBD him_PRP a_DT letter_NN right_RB away_RB and_CC he_PRP answered_VBD soon_RB ,_, sounding_VBG very_RB happy_JJ and_CC delighted_JJ ._.

Tag descriptions:

  • DT (Determiner)
  • NN (Noun, singular or mass)
  • VBD (Verb, past tense)
  • RB (Adverb)
  • VBN (Verb, past participle)

ChunkerME

    private Chunker chunker;

    private static String[] toks1 = { "Rockwell", "said", "the", "agreement", "calls", "for",
            "it", "to", "supply", "200", "additional", "so-called", "shipsets",
            "for", "the", "planes", "." };

    private static String[] tags1 = { "NNP", "VBD", "DT", "NN", "VBZ", "IN", "PRP", "TO", "VB",
            "CD", "JJ", "JJ", "NNS", "IN", "DT", "NNS", "." };

    private static String[] expect1 = { "B-NP", "B-VP", "B-NP", "I-NP", "B-VP", "B-SBAR",
            "B-NP", "B-VP", "I-VP", "B-NP", "I-NP", "I-NP", "I-NP", "B-PP", "B-NP",
            "I-NP", "O" };

    // Trains a small chunker model before each test.
    @Before
    public void startup() throws IOException {
        ResourceAsStreamFactory in = new ResourceAsStreamFactory(getClass(),
                "chunker/test.txt");

        // Each training line holds a token, its POS tag, and its chunk tag.
        ObjectStream<ChunkSample> sampleStream = new ChunkSampleStream(
                new PlainTextByLineStream(in, StandardCharsets.UTF_8));

        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, 70); // number of training iterations
        params.put(TrainingParameters.CUTOFF_PARAM, 1);      // keep every feature seen at least once

        ChunkerModel chunkerModel = ChunkerME.train("eng", sampleStream, params, new ChunkerFactory());

        this.chunker = new ChunkerME(chunkerModel);
    }

    @Test
    public void testChunkAsArray() throws Exception {

        String[] preds = chunker.chunk(toks1, tags1);

        Assert.assertArrayEquals(expect1, preds);
    }

This test also trains a model first. The training data has one token per line, followed by its POS tag and its chunk tag:

Rockwell NNP B-NP
International NNP I-NP
Corp. NNP I-NP
's POS B-NP
Tulsa NNP I-NP
unit NN I-NP
said VBD B-VP
it PRP B-NP
signed VBD B-VP
a DT B-NP
tentative JJ I-NP
agreement NN I-NP
extending VBG B-VP
its PRP$ B-NP
contract NN I-NP
with IN B-PP
Boeing NNP B-NP
Co. NNP I-NP
to TO B-VP
provide VB I-VP
structural JJ B-NP
parts NNS I-NP
for IN B-PP
Boeing NNP B-NP
's POS B-NP
747 CD I-NP
jetliners NNS I-NP
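The three-column layout above can be read with a few lines of plain Java. The sketch below is only an illustration of the format: the class name ChunkLines is hypothetical, and treating a blank line as the end of a sentence is an assumption based on the usual CoNLL-style convention, since the sample shows a single sentence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkLines {
    /**
     * Parses "token POS chunk" lines into three parallel arrays:
     * result[0] tokens, result[1] POS tags, result[2] chunk tags.
     * Assumption: a blank line ends the sentence.
     */
    static String[][] parseSentence(List<String> lines) {
        List<String> toks = new ArrayList<>();
        List<String> pos = new ArrayList<>();
        List<String> chunks = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                break; // end of sentence
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length != 3) {
                throw new IllegalArgumentException("Expected 3 columns: " + line);
            }
            toks.add(cols[0]);
            pos.add(cols[1]);
            chunks.add(cols[2]);
        }
        return new String[][] {
            toks.toArray(new String[0]),
            pos.toArray(new String[0]),
            chunks.toArray(new String[0])
        };
    }

    public static void main(String[] args) {
        String[][] s = parseSentence(Arrays.asList(
                "Rockwell NNP B-NP", "said VBD B-VP", "it PRP B-NP"));
        System.out.println(Arrays.toString(s[0])); // prints [Rockwell, said, it]
        System.out.println(Arrays.toString(s[1])); // prints [NNP, VBD, PRP]
        System.out.println(Arrays.toString(s[2])); // prints [B-NP, B-VP, B-NP]
    }
}
```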

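Tag descriptions:

  • B marks the beginning of a chunk
  • I marks a token inside a chunk
  • E marks the end of a chunk (not used in the sample above)
  • NP is a noun phrase (noun chunk)
  • VP is a verb phrase (verb chunk)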
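The per-token chunk tags are easier to read once they are grouped back into phrases. The following sketch does that with plain Java for the B-/I- prefixes and the O tag that appears in the expected output of the test above. The class name BioChunks and the bracketed output format are hypothetical choices for this example; OpenNLP's Chunker interface also offers chunkAsSpans, which returns the chunks as spans instead of per-token tags.

```java
import java.util.ArrayList;
import java.util.List;

public class BioChunks {
    /**
     * Groups tokens into bracketed chunks, e.g. "[NP the agreement]".
     * Tokens tagged "O" (outside any chunk) are emitted unchanged.
     */
    static List<String> group(String[] toks, String[] preds) {
        if (toks.length != preds.length) {
            throw new IllegalArgumentException("Token and tag counts differ");
        }
        List<String> out = new ArrayList<>();
        StringBuilder cur = null; // chunk currently being built, if any
        String curType = null;    // its type, e.g. "NP"
        for (int i = 0; i < toks.length; i++) {
            String p = preds[i];
            boolean inside = p.startsWith("I-") && cur != null
                    && p.substring(2).equals(curType);
            if (inside) {
                cur.append(' ').append(toks[i]);
                continue;
            }
            if (cur != null) { // any other tag closes the open chunk
                out.add(cur.append(']').toString());
                cur = null;
                curType = null;
            }
            if (p.startsWith("B-") || p.startsWith("I-")) {
                // An "I-" tag with no matching open chunk is treated as a start.
                curType = p.substring(2);
                cur = new StringBuilder("[" + curType + " " + toks[i]);
            } else {
                out.add(toks[i]);
            }
        }
        if (cur != null) {
            out.add(cur.append(']').toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String[] toks = {"Rockwell", "said", "the", "agreement", "calls", "."};
        String[] preds = {"B-NP", "B-VP", "B-NP", "I-NP", "B-VP", "O"};
        System.out.println(String.join(" ", group(toks, preds)));
        // prints [NP Rockwell] [VP said] [NP the agreement] [VP calls] .
    }
}
```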

Summary

This article gave a first look at part-of-speech tagging and chunking with OpenNLP. Model training is the key step: training on text from a specific domain can improve tagging accuracy for that domain.
