An n-gram is a contiguous sequence of N words taken from a corpus. N is variable, so an n-gram may stand for a single word or a string of several words.
For example:
Take the sentence "The cow jumps over the moon". If N=2 (known as bigrams), then the n-grams would be:
- the cow
- cow jumps
- jumps over
- over the
- the moon
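The bigram list above can be reproduced with a few lines of Python. This is a minimal sketch, not the tool used to build the corpus lists: the function name `ngrams` is hypothetical, and it assumes that lowercasing and splitting on whitespace is an adequate tokenizer, which holds for this English sentence but may not hold for Urdu, Sindhi, or Punjabi text.

```python
def ngrams(text, n):
    """Return the word n-grams of `text` as space-joined strings."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Assumption: lowercase + whitespace split is enough to tokenize the text.
    tokens = text.lower().split()
    # Slide a window of n tokens across the sentence, one token at a time.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams("The cow jumps over the moon", 2)
print(bigrams)
# ['the cow', 'cow jumps', 'jumps over', 'over the', 'the moon']
```

A sentence with fewer than N words yields an empty list, since no full window fits.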
Language | Source(s) | No. of words (tokens) | No. of unique words (types)
Urdu | 1600 books | 120,756,442 | 306,942
Sindhi | Books | 10,260,412 | 85,331
Punjabi | Wichaar website |