# [ACCEPTED] nltk language model (ngram): calculate the probability of a word from context

Score: 15

I know this question is old, but it pops up every time I google NLTK's NgramModel class. NgramModel's prob implementation is a little unintuitive, the asker is confused, and as far as I can tell the answers aren't great. Since I don't use NgramModel often, this means I get confused too. No more.

The source code lives here: https://github.com/nltk/nltk/blob/master/nltk/model/ngram.py. Here is the definition of NgramModel's prob method:

```python
def prob(self, word, context):
    """
    Evaluate the probability of this word in this context using Katz Backoff.

    :param word: the word to get the probability of
    :type word: str
    :param context: the context the word is in
    :type context: list(str)
    """
    context = tuple(context)
    if (context + (word,) in self._ngrams) or (self._n == 1):
        return self[context].prob(word)
    else:
        return self._alpha(context) * self._backoff.prob(word, context[1:])
```

(Note: `self[context].prob(word)` is equivalent to `self._model[context].prob(word)`.)
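That equivalence works because NgramModel defines `__getitem__` to delegate to the underlying ConditionalProbDist. A toy sketch of that delegation (the class here is an illustration I made up, not the real NgramModel; only the `_model` attribute name mirrors the source):

```python
class ToyNgramModel:
    """Minimal stand-in showing why model[context] and model._model[context] agree."""

    def __init__(self, model):
        # maps a context tuple of unigrams to a probability distribution
        self._model = model

    def __getitem__(self, context):
        # indexing the model just indexes the wrapped distribution
        return self._model[context]

toy = ToyNgramModel({('the',): {'rain': 0.5, 'plains': 0.5}})
print(toy[('the',)] is toy._model[('the',)])  # True
```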

Okay. Now at least we know what to look for. What does context need to be? Let's look at an excerpt from the constructor:

```python
for sent in train:
    # (inner loop over each ngram in sent elided here)
    context = tuple(ngram[:-1])
    token = ngram[-1]
    cfd[context].inc(token)

if not estimator_args and not estimator_kwargs:
    self._model = ConditionalProbDist(cfd, estimator, len(cfd))
else:
    self._model = ConditionalProbDist(cfd, estimator, *estimator_args, **estimator_kwargs)
```

Alright. The constructor creates a conditional probability distribution (self._model) out of a conditional frequency distribution whose "context" is tuples of unigrams. This tells us 'context' should not be a string or a list containing a single multi-word string. 'context' MUST be something iterable containing unigrams. In fact, the requirement is a little stricter: these tuples or lists must be of size n-1. Think of it this way: you told it to be a trigram model, so you'd better give it the appropriate context for trigrams.
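There's a concrete reason the bare-string form fails silently: prob calls `tuple(context)`, and tupling a string splits it into characters, so the lookup key can never match a context of word unigrams. Plain Python, no NLTK needed to see it:

```python
# tuple() over a string yields characters, not words:
print(tuple('the'))    # ('t', 'h', 'e')  -- three bogus "unigrams"
print(tuple(['the']))  # ('the',)         -- one unigram, as intended
```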

Let's see this in action with a simpler example:

```python
>>> import nltk
>>> obs = 'the rain in spain falls mainly in the plains'.split()
>>> lm = nltk.NgramModel(2, obs, estimator=nltk.MLEProbDist)
>>> lm.prob('rain', 'the') #wrong
0.0
>>> lm.prob('rain', ['the']) #right
0.5
>>> lm.prob('spain', 'rain in') #wrong
0.0
>>> lm.prob('spain', ['rain in']) #wrong
'''long exception'''
>>> lm.prob('spain', ['rain', 'in']) #right
1.0
```
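The 0.5 above is nothing exotic: with an MLE estimator it is plain bigram counting. A stdlib sketch of the same computation (no NLTK, no padding, no smoothing; `mle_prob` is my own helper name, not an NgramModel method):

```python
from collections import Counter, defaultdict

obs = 'the rain in spain falls mainly in the plains'.split()

# conditional frequency distribution: context tuple -> token counts,
# mirroring the cfd[context].inc(token) loop in the constructor
counts = defaultdict(Counter)
for context, token in zip(obs, obs[1:]):
    counts[(context,)][token] += 1

def mle_prob(word, context):
    dist = counts[tuple(context)]
    total = sum(dist.values())
    return dist[word] / total if total else 0.0

print(mle_prob('rain', ['the']))  # 0.5: 'the' is followed once by 'rain', once by 'plains'
```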

(As a side note, actually trying to do anything with MLE as your estimator in NgramModel is a bad idea. Things will fall apart. I guarantee it.)

As for the original question, I suppose my best guess at what OP wants is this:

```python
print lm.prob("word", "generates a".split())
print lm.prob("b", "generates a".split())
```

...but there are so many misunderstandings going on here that I can't possibly tell what he was actually trying to do.

Score: 7

Quick fix:

```python
print lm.prob("word", ["This is a context which generates a word"])
# => 0.00493261081006
```


Score: 7

As regards your second question: this happens because `"b"` doesn't occur in the Brown corpus category `news`, as you can verify with:

```python
>>> 'b' in brown.words(categories='news')
False
```

whereas

```python
>>> 'word' in brown.words(categories='news')
True
```

I admit the error message is very cryptic, so you might want to file a bug report with the NLTK authors.

Score: 4

I would stay away from NLTK's NgramModel for the time being. There is currently a smoothing bug that causes the model to greatly overestimate likelihoods when n > 1. If you do end up using NgramModel, you should definitely apply the fix mentioned in the git issue tracker here: https://github.com/nltk/nltk/issues/367
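If you just need smoothed bigram probabilities while that bug stands, a simple add-one (Laplace) baseline is easy to write without NLTK. This is not the Katz backoff that NgramModel implements, only a minimal stopgap sketch; `laplace_bigram` is a name I made up:

```python
from collections import Counter, defaultdict

def laplace_bigram(tokens):
    """Return a prob(word, context) function with add-one smoothing."""
    counts = defaultdict(Counter)
    for ctx, tok in zip(tokens, tokens[1:]):
        counts[ctx][tok] += 1
    vocab = set(tokens)

    def prob(word, context):
        dist = counts[context]
        # add-one: every (context, word) pair gets a pseudo-count of 1
        return (dist[word] + 1) / (sum(dist.values()) + len(vocab))

    return prob

p = laplace_bigram('the rain in spain falls mainly in the plains'.split())
# 'the' is followed by {'rain': 1, 'plains': 1}; vocab size is 7,
# so P(rain | the) = (1 + 1) / (2 + 7) = 2/9
print(p('rain', 'the'))
```

Unlike MLE, this never returns 0.0 for an unseen pair, at the cost of shaving probability mass off observed ones.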
