Word Clusters (Urdu and Sindhi)

Word clustering partitions a set of words into subsets of semantically similar words. It is increasingly used in NLP tasks ranging from word-sense and structural disambiguation to information retrieval and filtering.
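To make the idea of partitioning words by similarity concrete, the following is a minimal sketch, not the method used to build the Urdu and Sindhi clusters (which is not described here). It assumes a simple distributional approach: each word is represented by counts of its neighbouring words, and a word joins an existing cluster when its cosine similarity to that cluster's first member reaches a threshold. The function names (context_vectors, cosine, cluster_words), the window and threshold parameters, and the toy English corpus are all hypothetical placeholders chosen for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=1):
    """Count, for every word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_words(sentences, threshold=0.5, window=1):
    """Greedy one-pass clustering (an illustrative assumption, not a standard recipe).

    Words are visited in alphabetical order; each word joins the cluster whose
    first member is most similar to it, provided the similarity is at least
    `threshold`. Otherwise the word starts a new cluster.
    """
    vectors = context_vectors(sentences, window)
    clusters = []
    for word in sorted(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vectors[word], vectors[cluster[0]])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([word])
        else:
            best.append(word)
    return clusters


# Toy placeholder corpus; a real run would use tokenized Urdu or Sindhi text.
corpus = [
    ["cat", "sleeps"],
    ["dog", "sleeps"],
    ["cat", "eats"],
    ["dog", "eats"],
]
clusters = cluster_words(corpus)
print(clusters)
```

On this toy input the two nouns end up in one cluster and the two verbs in another, because each pair shares exactly the same neighbours. Real corpora would need larger context windows, frequency weighting, and a proper clustering algorithm; those choices are outside what this page specifies.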

We have performed word clustering on a large corpus of Urdu and Sindhi text. The resulting clusters can be used for:

  • helping lexicographers identify normal and conventional usage;
  • helping computational linguists compile lexicons with lexico-semantic knowledge;
  • providing disambiguation cues for:
    • parsing highly ambiguous syntactic structures (such as noun compounds, complex coordinated structures, complement attachment, and subject/object assignment for languages like Italian);
    • sense identification;
  • retrieving texts and/or information from large databases;
  • constraining the language model for speech recognition and optical character recognition (to help disambiguate between phonetically or optically confusable words).

