Shared Task

The CLT14 Shared Task is the first in a series concerning the research and development of human language technology for Pakistani languages. The goal of this series is to provide common and consistent task definitions, datasets and evaluation and, most importantly, to trigger research in this important field.


The Shared Task focuses on converting Free Roman Urdu to standard Urdu script (the modified Arabic script used for Urdu). Free Roman Urdu is Urdu written using the Roman/Latin alphabet. It is used extensively in SMS, chat, emails, on social networking sites such as Facebook and Twitter, and on discussion forums. It does not follow any standard transliteration scheme.

An example of Roman Urdu is: “ye roman urdu hay”. It will be converted to “یہ رومن اردو ہے ”

Participants are invited to develop a Free Roman Urdu transliteration system, generate transliteration output for the test data, and submit the output and a short paper to the organizers.

Sample data of approximately 5,000 words will be provided to show the scope and variety of the test data. However, because of its small size, the sample is not recommended as training data.
Links to some websites containing Free Roman Urdu are given in the linked text file. Developers can use these and any other resources to understand the scope and variety of Free Roman Urdu. The list is not exhaustive.

Participants may not use a third-party transliteration system within the developed system. However, other resources and libraries may be used for intermediate processing. The review of submitted papers will assess the main innovation in the developed system.


The test data, consisting of sentences totalling approximately 5,000 words, will be provided to registered participants. Participants will generate the output file (in the format defined in the sample data files) and return it to the organizers.

The evaluation metric is: No. of matched words / Total no. of words

The evaluation system will have a list of alternate Urdu script spellings of words, and all of those alternate spellings will be accepted. Moreover, a team of language experts will inspect all transliterations that are tagged as wrong by the evaluation software. If a transliteration tagged as wrong is actually a legitimate alternate spelling (one that was not present in the alternate spelling list), it will be re-tagged as correct.
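
The metric above can be sketched in a few lines of Python. This is only an illustration, not the organizers' evaluation software: the function name word_accuracy, the dictionary layout of the alternate spelling list, and the example words are all assumptions made for the sketch. It also assumes that system output and reference are already aligned word by word.

```python
def word_accuracy(system_words, reference_words, alternates=None):
    """Return (number of matched words) / (total number of words).

    `alternates` maps a reference spelling to a collection of other
    acceptable Urdu-script spellings (a hypothetical structure).
    """
    alternates = alternates or {}
    if len(system_words) != len(reference_words):
        raise ValueError("system and reference must have the same number of words")
    if not reference_words:
        raise ValueError("reference is empty")
    matched = 0
    for hyp, ref in zip(system_words, reference_words):
        # A word counts as matched if it equals the reference spelling
        # or any listed alternate spelling of it.
        accepted = {ref} | set(alternates.get(ref, ()))
        if hyp in accepted:
            matched += 1
    return matched / len(reference_words)


reference = ["یہ", "رومن", "اردو", "ہے"]
system = ["یہ", "رومن", "اردو", "ہی"]  # last word differs from the reference
score = word_accuracy(system, reference)  # 3 of 4 words match

# Illustrative alternate-spelling entry, not taken from the organizers' list.
alt_ref = ["لیے"]
alt_sys = ["لئے"]
alt_score = word_accuracy(alt_sys, alt_ref, {alt_ref[0]: {alt_sys[0]}})
```

The manual re-tagging step by language experts is not modelled here; in this sketch it would amount to adding the newly approved spelling to the alternates mapping and recomputing the score.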

System Demo:

Selected systems will be invited for demonstration at the conference. Before the public demonstration, each system will be demonstrated to a panel of software engineers.


A prize of 15,000 PKR* will be awarded to the best transliteration system. The systems will be evaluated for precision as well as innovation.
*The amount of the prize may be increased if sponsorship is obtained.

Input-Output Format & Sample Data:

The sample input and output files are: input.txt, output.txt
The input file has one sentence or text chunk per line.
The output file has one word (or a word containing a space) per line. For each word, the first line has the word in Roman script and the next line has the same word in Urdu script.
A # is used in the output file to mark the end of each line of text from the input file.
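
As a rough aid for checking an output file, the following Python sketch reads the format described above. It rests on assumptions that should be verified against the provided output.txt: that Roman and Urdu lines strictly alternate, that each # sits on a line of its own, and that blank lines can be ignored. The function name parse_output is hypothetical.

```python
def parse_output(text):
    """Parse output-file text into a list of sentences.

    Each sentence is a list of (roman, urdu) pairs. Assumes alternating
    Roman/Urdu lines and a lone '#' line closing each input line.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    sentences, current = [], []
    i = 0
    while i < len(lines):
        if lines[i] == "#":
            # End of one line of the input file.
            sentences.append(current)
            current = []
            i += 1
            continue
        if i + 1 >= len(lines) or lines[i + 1] == "#":
            raise ValueError(f"no Urdu line follows Roman word {lines[i]!r}")
        current.append((lines[i], lines[i + 1]))
        i += 2
    if current:
        # Tolerate a missing final '#'.
        sentences.append(current)
    return sentences


sample = "ye\nیہ\nroman\nرومن\nurdu\nاردو\nhay\nہے\n#\n"
parsed = parse_output(sample)
roman_words = [roman for roman, _ in parsed[0]]
```

Running a check like this on one's own output before submission may catch a missing Urdu line or a misplaced # early.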

The sample data (Roman Urdu along with its Urdu script transcription) of approximately 5,000 words is available in sample.txt. However, because of its small size, this sample is not recommended as training data.

Important Dates:

    Release of Sample Data: 7th April, 2014*
    Registration Deadline: 22nd July, 2014*
    Release of Test Data (to registered participants): 2nd August, 2014*
    Submission of 3 Page Short Paper (without results): 3rd August, 2014*
    Submission of Output: 10th August, 2014*
    Notification of Acceptance: 27th August, 2014*
    Camera Ready Paper Due: 14th September, 2014*

    * indicates that the date has been changed.

Shared Task Subcommittee:

For any query, contact