Resource Light Part of Speech Tagger (Urdu, Sindhi & Punjabi)


In linguistics, part-of-speech (POS) tagging is the process of marking up each word in a text (corpus) as corresponding to a particular part of speech. Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms. POS-tagging algorithms fall into two distinct groups: rule-based and stochastic.
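
To make the two groups concrete, here is a minimal sketch in Python. It is not the tagger described on this page: the toy corpus, the tag names and the suffix rules are all illustrative assumptions, and English words are used only to keep the example readable. The stochastic part is the simplest possible one (pick each word's most frequent tag in the training data); the rule-based part is a handful of hand-written suffix rules used as a fallback for unseen words.

```python
from collections import Counter, defaultdict

# Toy hand-tagged corpus (illustrative data, not taken from the project).
TAGGED = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ended", "VERB")],
    [("dogs", "NOUN"), ("run", "VERB")],
    [("they", "PRON"), ("run", "VERB")],
]


def train_unigram(sentences):
    """Stochastic component: remember the most frequent tag of each word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def rule_based_tag(word):
    """Rule-based component: hypothetical hand-written suffix rules."""
    if word.endswith("ing") or word.endswith("ed"):
        return "VERB"
    if word.endswith("ly"):
        return "ADV"
    return "NOUN"  # default guess for unknown words


def tag(words, model):
    """Use the learned tag when the word was seen, else fall back to rules."""
    return [(w, model[w] if w in model else rule_based_tag(w)) for w in words]


MODEL = train_unigram(TAGGED)
print(tag(["the", "run", "quickly"], MODEL))
```

Practical stochastic taggers also use the surrounding context (for example, the previous tag) rather than the word alone; the unigram model above is only the smallest instance of the idea.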

POS taggers are the backbone of sequence tagging and shallow parsing. Their major applications include noun-phrase (and other) chunking, named entity recognition, information extraction and (deep) parsing.

Pakistani languages lack annotated resources. The goal of a POS tagger is to assign linguistic (mostly grammatical) information to sub-sentential units. The word order and syntax of Indo-Aryan languages such as Urdu, Sindhi and Punjabi are similar, so a tagger model for any one of these languages can help create POS taggers for the other related languages.

The approach we have used is cheap, time-efficient and requires almost no trained human resources. We used the available tagged data for one Pakistani language (Urdu) to create a new tagger for Urdu, and then adapted this tagger for other related Pakistani languages (Sindhi and Punjabi), which do not have any available annotated corpus.

The following modules have been developed in order to achieve this goal:

  • Generic Tagger - Algorithm
  • Tag-Set Converter
  • Tools for Improvement of Tagging
  • Tokenization Tool
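
As a rough illustration of what a tag-set converter does, the sketch below maps tags from one inventory onto another through a lookup table. The tag names, the mapping and the transliterated example words are hypothetical placeholders chosen for this example; they are not the tag sets or the conversion rules used by the project.

```python
# Hypothetical mapping from a fine-grained tag set to a coarser one.
TAG_MAP = {
    "NN": "NOUN",
    "NNP": "PROPN",
    "VB": "VERB",
    "JJ": "ADJ",
    "PSP": "ADP",  # postposition
}


def convert_tags(tagged_tokens, tag_map, unknown="X"):
    """Rewrite each (word, tag) pair using tag_map.

    Tags missing from the mapping are replaced by `unknown` so that
    gaps in the table stay visible instead of passing through silently.
    """
    return [(word, tag_map.get(tag, unknown)) for word, tag in tagged_tokens]


# Transliterated toy input (illustrative only).
SOURCE = [("Lahore", "NNP"), ("mein", "PSP"), ("kitab", "NN"), ("?", "SYM")]
CONVERTED = convert_tags(SOURCE, TAG_MAP)
print(CONVERTED)
```

Keeping the mapping in a plain table means that supporting another tag set only requires a new table, not new code.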

