Enhanced Dictionary (Beta)


The Enhanced Dictionary shows the most frequent usages of each word. Statistics on word co-occurrence and word relations are collected by shallow parsing of an Urdu corpus, the Urdu Monolingual Corpus of Charles University, Prague, which is automatically POS-tagged. We applied shallow parsing to extract pseudo-dependencies of nouns with verbs, noun+verb complex predicates, and adjective+verb sequences. The application built on this system is under alpha testing and enhancement.
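To make the shallow-parsing step concrete, here is a minimal sketch of how noun+verb and adjective+verb sequences might be pulled out of a POS-tagged sentence. It is not the system's actual parser: the tag names (NN, JJ, VB), the look-back window, the function name extract_pairs, and the transliterated example words are all assumptions made for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag); the tag set here is hypothetical


def extract_pairs(tagged: List[Token], window: int = 3) -> List[Tuple[str, str, str]]:
    """Pair each verb with the nearest noun or adjective to its left.

    Urdu is verb-final, so the dependent normally precedes the verb.
    The result is a list of (relation, dependent, verb) triples.
    """
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "VB":
            continue
        # Scan leftwards over at most `window` tokens.
        for j in range(i - 1, max(i - 1 - window, -1), -1):
            prev_word, prev_tag = tagged[j]
            if prev_tag == "NN":
                pairs.append(("noun+verb", prev_word, word))
                break
            if prev_tag == "JJ":
                pairs.append(("adj+verb", prev_word, word))
                break
            if prev_tag == "VB":
                # Another verb intervenes; stop looking for a dependent.
                break
    return pairs


# Illustrative, transliterated input; a real corpus would be in Urdu script.
example = [("larka", "NN"), ("khush", "JJ"), ("hua", "VB")]
example_pairs = extract_pairs(example)
```

Because no full syntactic analysis is done, the pairs are only pseudo-dependencies: the nearest candidate is taken as the dependent, which will sometimes be wrong.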

For nouns, the system counts the adjectives used with each noun and the other nouns that appear with it in different contexts and constructions, then calculates which co-occurrences are frequent and significant. Nouns and adjectives are stemmed to reduce data sparsity, and the words are de-stemmed when the results are displayed.
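The counting, stemming, and de-stemming described above can be sketched as follows. This is a sketch under stated assumptions, not the system's implementation: the suffix-stripping stemmer toy_stem is a hypothetical stand-in, pointwise mutual information (PMI) is assumed as the significance measure because the measure is not named here, and de-stemming is approximated by showing the most frequent surface form of each stem. The example words are transliterated for readability.

```python
import math
from collections import Counter, defaultdict


def toy_stem(word, suffixes=("on", "en", "a", "e", "i")):
    """Hypothetical stemmer: strip one suffix, keeping at least 3 characters."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word


def cooccurrence_stats(pairs):
    """Count stemmed (adjective, noun) pairs and score each pair with PMI.

    Returns (adjective, noun, count, pmi) tuples, most frequent first,
    with stems mapped back to their most common surface form.
    """
    pair_counts = Counter()
    adj_counts, noun_counts = Counter(), Counter()
    adj_surface, noun_surface = defaultdict(Counter), defaultdict(Counter)
    for adj, noun in pairs:
        a, n = toy_stem(adj), toy_stem(noun)
        pair_counts[(a, n)] += 1
        adj_counts[a] += 1
        noun_counts[n] += 1
        adj_surface[a][adj] += 1    # remembered for de-stemming
        noun_surface[n][noun] += 1
    total = sum(pair_counts.values())
    results = []
    for (a, n), c in pair_counts.items():
        # PMI = log2( P(a, n) / (P(a) * P(n)) ), estimated from pair counts.
        pmi = math.log2(c * total / (adj_counts[a] * noun_counts[n]))
        results.append((adj_surface[a].most_common(1)[0][0],
                        noun_surface[n].most_common(1)[0][0], c, pmi))
    results.sort(key=lambda r: (-r[2], -r[3]))
    return results


# Illustrative input; real pairs would come from the parsed corpus.
sample = [("achi", "kitab"), ("achi", "kitab"), ("achi", "kitaben"),
          ("acha", "ghar"), ("bara", "ghar")]
stats = cooccurrence_stats(sample)
```

Stemming merges "kitab" and "kitaben" into one entry, so the pair is counted three times instead of being split across two sparse entries. PMI on its own tends to overrate rare pairs, which is why the sketch sorts by count first and uses PMI only to break ties.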

The system is presented as a proof of concept. A larger effort is in the planning phase, in which a larger corpus will be automatically POS-tagged after improved tokenization (segmentation and multiword detection).



