[ACCEPTED]-How to read values from numbers written as words?-nlp
I was playing around with a PEG parser to do what you wanted (and may post that as a separate answer later) when I noticed that there's a very simple algorithm that does a remarkably good job with common forms of numbers in English, Spanish, and German, at the very least.
Working with English, for example, you need a dictionary that maps words to values in the obvious way:
"one" -> 1, "two" -> 2, ... "twenty" -> 20,
"dozen" -> 12, "score" -> 20, ...
"hundred" -> 100, "thousand" -> 1000, "million" -> 1000000
...and so forth
The algorithm is just:
total = 0
prior = null
for each word w
    v <- value(w) or next if no value defined
    prior <- case
        when prior is null: v
        when prior > v:     prior+v
        else                prior*v
    if w in {thousand,million,billion,trillion...}
        total <- total + prior
        prior <- null
total <- total + prior unless prior is null
For example, this progresses as follows:
total      prior      v          unconsumed string
0          _                     four score and seven
                      4          score and seven
0          4
                      20         and seven
0          80
                      _          seven
0          80
                      7
0          87
87

total      prior      v          unconsumed string
0          _                     two million four hundred twelve thousand eight hundred seven
                      2          million four hundred twelve thousand eight hundred seven
0          2
                      1000000    four hundred twelve thousand eight hundred seven
2000000    _
                      4          hundred twelve thousand eight hundred seven
2000000    4
                      100        twelve thousand eight hundred seven
2000000    400
                      12         thousand eight hundred seven
2000000    412
                      1000       eight hundred seven
2000000    412000
                      1000       eight hundred seven
2412000    _
                      8          hundred seven
2412000    8
                      100        seven
2412000    800
                      7
2412000    807
2412807
And so on. I'm not saying it's perfect, but for a quick-and-dirty approach it does quite well.
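For anyone who wants to try it, the pseudocode above might be sketched in Python roughly as follows. The `VALUES` table, the `MAGNITUDES` set and the function name `words_to_number` are my own illustrative choices, and the dictionary is deliberately incomplete; hyphens are simply treated as spaces, which is an assumption rather than part of the algorithm.

```python
# Illustrative (incomplete) word table; extend it for ordinals, other languages, etc.
VALUES = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
    "thirteen": 13, "fourteen": 14, "fifteen": 15, "sixteen": 16,
    "seventeen": 17, "eighteen": 18, "nineteen": 19, "twenty": 20,
    "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60, "seventy": 70,
    "eighty": 80, "ninety": 90, "dozen": 12, "score": 20, "hundred": 100,
    "thousand": 1000, "million": 10**6, "billion": 10**9, "trillion": 10**12,
}
# Words that close a group and flush the running value into the total.
MAGNITUDES = {"thousand", "million", "billion", "trillion"}


def words_to_number(text):
    total = 0
    prior = None
    # Assumption: hyphens separate words, so "fifty-two" becomes "fifty two".
    for w in text.lower().replace("-", " ").split():
        v = VALUES.get(w)
        if v is None:
            continue  # skip words with no value, such as "and"
        if prior is None:
            prior = v
        elif prior > v:
            prior += v   # e.g. 80 then 7 gives 87
        else:
            prior *= v   # e.g. 4 then 20 gives 80
        if w in MAGNITUDES:
            total += prior
            prior = None
    if prior is not None:
        total += prior
    return total


print(words_to_number("four score and seven"))
print(words_to_number("two million four hundred twelve thousand eight hundred seven"))
```

The two calls reproduce the traces above (87 and 2412807).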
Addressing your specific list on edit:
- cardinal/nominal or ordinal: "one" and "first" -- just put them in the dictionary
- english/british: "fourty"/"forty" -- ditto
- hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand and one hundred" -- works as is
- separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and whatnot -- just define "next word" to be the longest prefix that matches a defined word, or up to the next non-word if none do, for a start
- colloquialisms: "thirty-something" -- works
- fragments: 'one third', 'two fifths' -- uh, not yet...
- common names: 'a dozen', 'half' -- works; you can even do things like "a half dozen"
Number 6 (fragments) is the only one I don't have a ready answer for, and that's because of the ambiguity between ordinals and fractions (in English at least), added to the fact that my last cup of coffee was many hours ago.
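The "longest prefix" rule for separators mentioned in the list could look something like the following Python sketch. The `WORDS` set is a tiny stand-in for the real dictionary and `tokenize` is a hypothetical helper name; unknown runs of letters are kept as single tokens so a later stage can ignore them.

```python
# Tiny stand-in for the full dictionary of known number words (assumption).
WORDS = {"one", "two", "seven", "eleven", "twenty", "thirty", "fifty",
         "seventy", "hundred", "thousand"}


def tokenize(text):
    """Split text into tokens, preferring the longest known word at each position."""
    text = text.lower()
    tokens = []
    i = 0
    while i < len(text):
        if not text[i].isalpha():
            i += 1  # skip spaces, hyphens and other separators
            continue
        # Longest defined word that starts at position i, if any.
        match = max((w for w in WORDS if text.startswith(w, i)),
                    key=len, default=None)
        if match is None:
            # No known word here: consume up to the next non-letter.
            j = i
            while j < len(text) and text[j].isalpha():
                j += 1
            tokens.append(text[i:j])
            i = j
        else:
            tokens.append(match)
            i += len(match)
    return tokens


print(tokenize("elevenhundred fiftytwo"))
print(tokenize("eleven-hundred fifty-two"))
```

Both calls yield the same four tokens, which can then be fed to whatever evaluation step you use.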
It's not an easy issue, and I know of no library to do it. I might sit down and try to write something like this sometime. I'd do it in either Prolog, Java or Haskell, though. As far as I can see, there are several issues:
- Tokenization: sometimes, numbers are written eleven hundred fifty two, but I've seen elevenhundred fiftytwo or eleven-hundred-fifty-two and whatnot. One would have to conduct a survey on what forms are actually in use. This might be especially tricky for Hebrew.
- Spelling mistakes: that's not so hard. You have a limited amount of words, and a bit of Levenshtein-distance magic should do the trick.
- Alternate forms, like you already mentioned, exist. This includes ordinal/cardinal numbers, as well as forty/fourty and...
- ... common names or commonly used phrases and NEs (named entities). Would you want to extract 30 from the Thirty Years War or 2 from World War II?
- Roman numerals, too?
- Colloquialisms, such as "thirty-something" and "three Euro and shrapnel", which I wouldn't know how to treat.
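To make the spelling-mistake point concrete, here is a minimal Python sketch of Levenshtein-based correction against a closed vocabulary. The word list, the helper names `levenshtein` and `correct`, and the threshold of one edit are all assumptions chosen for illustration, not a tuned solution.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


# Closed vocabulary of number words (illustrative, not exhaustive).
NUMBER_WORDS = [
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
    "seventeen", "eighteen", "nineteen", "twenty", "thirty", "forty", "fifty",
    "sixty", "seventy", "eighty", "ninety", "hundred", "thousand", "million",
]


def correct(word, max_distance=1):
    """Return the closest number word, or None if nothing is close enough."""
    best = min(NUMBER_WORDS, key=lambda w: levenshtein(word, w))
    return best if levenshtein(word, best) <= max_distance else None


print(correct("fourty"))
print(correct("banana"))
```

With a threshold of one edit, "fourty" maps to "forty" while unrelated words are rejected; a looser threshold would need some care to avoid false positives on ordinary text.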
If you are interested in this, I could give it a shot this weekend. My idea is probably using UIMA and tokenizing with it, then going on to further tokenize/disambiguate and finally translate. There might be more issues, let's see if I can come up with some more interesting things.
Sorry, this is not a real answer yet, just an extension to your question. I'll let you know if I find/write something.
By the way, if you are interested in the semantics of numerals, I just found an interesting paper by Friederike Moltmann, discussing some issues regarding the logical interpretation of numerals.
I have some code I wrote a while ago: text2num. This does some of what you want, except it does not handle ordinal numbers. I haven't actually used this code for anything, so it's largely untested!
Use the Python pattern-en library:
>>> from pattern.en import number
>>> number('two thousand fifty and a half')
2050.5
You should keep in mind that Europe and America count differently.
European standard:
One Thousand
One Million
One Thousand Millions (British also use Milliard)
One Billion
One Thousand Billions
One Trillion
One Thousand Trillions
Here is a small reference on it.
A simple way to see the difference is the following:
(American counting Trillion) == (European counting Billion)
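If a parser has to cope with both conventions, one option is to keep a separate magnitude table per scale. The sketch below encodes the list above in Python; the table names and the `magnitude` helper are made up for illustration.

```python
# Short scale (American): each new name is 1000 times the previous one.
SHORT_SCALE = {"thousand": 10**3, "million": 10**6,
               "billion": 10**9, "trillion": 10**12}

# Long scale (the "European standard" listed above): each new name is a
# million times the previous one, with "milliard" for a thousand millions.
LONG_SCALE = {"thousand": 10**3, "million": 10**6, "milliard": 10**9,
              "billion": 10**12, "trillion": 10**18}


def magnitude(word, scale="short"):
    """Look up a magnitude word in the chosen scale (hypothetical helper)."""
    table = SHORT_SCALE if scale == "short" else LONG_SCALE
    return table[word]


# The comparison from the line above: American trillion == European billion.
print(magnitude("trillion", "short") == magnitude("billion", "long"))
```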
Ordinal numbers are not applicable because they can't be joined in meaningful ways with other numbers in language (...at least in English)
e.g. one hundred and first, eleven second, etc...
However, there is another English/American caveat with the word 'and'
i.e.
one hundred and one (English)
one hundred one (American)
Also, the use of 'a' to mean one in English
a thousand = one thousand
...On a side note, Google's calculator does an amazing job of this.
one hundred and three thousand times the speed of light
And even...
two thousand and one hundred plus a dozen
...wtf?!?
a score plus a dozen in roman numerals
Here is an extremely robust solution in Clojure.
AFAIK it is a unique implementation approach.
;----------------------------------------------------------------------
; numbers.clj
; written by: Mike Mattie codermattie@gmail.com
;----------------------------------------------------------------------
(ns operator.numbers
  (:require
    [clojure.string :as string] ))
(def number-word-table {
"zero" 0
"one" 1
"two" 2
"three" 3
"four" 4
"five" 5
"six" 6
"seven" 7
"eight" 8
"nine" 9
"ten" 10
"eleven" 11
"twelve" 12
"thirteen" 13
"fourteen" 14
"fifteen" 15
"sixteen" 16
"seventeen" 17
"eighteen" 18
"nineteen" 19
"twenty" 20
"thirty" 30
"forty" 40
"fourty" 40 ; common misspelling, kept so existing input still parses
"fifty" 50
"sixty" 60
"seventy" 70
"eighty" 80
"ninety" 90
})
(def multiplier-word-table {
"hundred" 100
"thousand" 1000
})
(defn sum-words-to-number [ words ]
  (apply + (map (fn [ word ] (number-word-table word)) words)) )
; are you down with the sickness ?
(defn words-to-number [ words ]
  (let
    [ n (count words)
      multipliers (filter (fn [x] (not (false? x))) (map-indexed
                                                      (fn [ i word ]
                                                        (if (contains? multiplier-word-table word)
                                                          (vector i (multiplier-word-table word))
                                                          false))
                                                      words) )
      x (ref 0) ]
    (loop [ indices (reverse (conj (reverse multipliers) (vector n 1)))
            left 0
            combine + ]
      (let
        [ right (first indices) ]
        (dosync (alter x combine (* (if (> (- (first right) left) 0)
                                      (sum-words-to-number (subvec words left (first right)))
                                      1)
                                    (second right)) ))
        (when (> (count (rest indices)) 0)
          (recur (rest indices) (inc (first right))
            (if (= (inc (first right)) (first (second indices)))
              *
              +))) ) )
    @x ))
Here are some examples
(operator.numbers/words-to-number ["six" "thousand" "five" "hundred" "twenty" "two"])
(operator.numbers/words-to-number ["fifty" "seven" "hundred"])
(operator.numbers/words-to-number ["hundred"])
My LPC implementation of some of your requirements (American English only):
internal mapping inordinal = ([]);
internal mapping number = ([]);
#define Numbers ([\
"zero" : 0, \
"one" : 1, \
"two" : 2, \
"three" : 3, \
"four" : 4, \
"five" : 5, \
"six" : 6, \
"seven" : 7, \
"eight" : 8, \
"nine" : 9, \
"ten" : 10, \
"eleven" : 11, \
"twelve" : 12, \
"thirteen" : 13, \
"fourteen" : 14, \
"fifteen" : 15, \
"sixteen" : 16, \
"seventeen" : 17, \
"eighteen" : 18, \
"nineteen" : 19, \
"twenty" : 20, \
"thirty" : 30, \
"forty" : 40, \
"fifty" : 50, \
"sixty" : 60, \
"seventy" : 70, \
"eighty" : 80, \
"ninety" : 90, \
"hundred" : 100, \
"thousand" : 1000, \
"million" : 1000000, \
"billion" : 1000000000, \
])
#define Ordinals ([\
"zeroth" : 0, \
"first" : 1, \
"second" : 2, \
"third" : 3, \
"fourth" : 4, \
"fifth" : 5, \
"sixth" : 6, \
"seventh" : 7, \
"eighth" : 8, \
"ninth" : 9, \
"tenth" : 10, \
"eleventh" : 11, \
"twelfth" : 12, \
"thirteenth" : 13, \
"fourteenth" : 14, \
"fifteenth" : 15, \
"sixteenth" : 16, \
"seventeenth" : 17, \
"eighteenth" : 18, \
"nineteenth" : 19, \
"twentieth" : 20, \
"thirtieth" : 30, \
"fortieth" : 40, \
"fiftieth" : 50, \
"sixtieth" : 60, \
"seventieth" : 70, \
"eightieth" : 80, \
"ninetieth" : 90, \
"hundredth" : 100, \
"thousandth" : 1000, \
"millionth" : 1000000, \
"billionth" : 1000000000, \
])
varargs int denumerical(string num, status ordinal) {
    if(ordinal) {
        if(member(inordinal, num))
            return inordinal[num];
    } else {
        if(member(number, num))
            return number[num];
    }
    int sign = 1;
    int total = 0;
    int sub = 0;
    int value;
    string array parts = regexplode(num, " |-");
    if(sizeof(parts) >= 2 && parts[0] == "" && parts[1] == "-")
        sign = -1;
    for(int ix = 0, int iix = sizeof(parts); ix < iix; ix++) {
        string part = parts[ix];
        switch(part) {
        case "negative" :
        case "minus"    :
            sign = -1;
            continue;
        case ""         :
            continue;
        }
        if(ordinal && ix == iix - 1) {
            if(part[0] >= '0' && part[0] <= '9' && ends_with(part, "th"))
                value = to_int(part[..<3]);
            else if(member(Ordinals, part))
                value = Ordinals[part];
            else
                continue;
        } else {
            if(part[0] >= '0' && part[0] <= '9')
                value = to_int(part);
            else if(member(Numbers, part))
                value = Numbers[part];
            else
                continue;
        }
        if(value < 0) {
            sign = -1;
            value = - value;
        }
        if(value < 10) {
            if(sub >= 1000) {
                total += sub;
                sub = value;
            } else {
                sub += value;
            }
        } else if(value < 100) {
            if(sub < 10) {
                sub = 100 * sub + value;
            } else if(sub >= 1000) {
                total += sub;
                sub = value;
            } else {
                sub *= value;
            }
        } else if(value < sub) {
            total += sub;
            sub = value;
        } else if(sub == 0) {
            sub = value;
        } else {
            sub *= value;
        }
    }
    total += sub;
    return sign * total;
}
Well, I was too late on the answer for this question, but I was working on a little test scenario that seems to have worked very well for me. I used a (simple, but ugly, and large) regular expression to locate all the words for me. The expression is as follows:
(?<Value>(?:zero)|(?:one|first)|(?:two|second)|(?:three|third)|(?:four|fourth)|
(?:five|fifth)|(?:six|sixth)|(?:seven|seventh)|(?:eight|eighth)|(?:nine|ninth)|
(?:ten|tenth)|(?:eleven|eleventh)|(?:twelve|twelfth)|(?:thirteen|thirteenth)|
(?:fourteen|fourteenth)|(?:fifteen|fifteenth)|(?:sixteen|sixteenth)|
(?:seventeen|seventeenth)|(?:eighteen|eighteenth)|(?:nineteen|nineteenth)|
(?:twenty|twentieth)|(?:thirty|thirtieth)|(?:forty|fortieth)|(?:fifty|fiftieth)|
(?:sixty|sixtieth)|(?:seventy|seventieth)|(?:eighty|eightieth)|(?:ninety|ninetieth)|
(?<Magnitude>(?:hundred|hundredth)|(?:thousand|thousandth)|(?:million|millionth)|
(?:billion|billionth)))
Shown here with line breaks for formatting purposes.
Anyways, my method was to execute this RegEx with a library like PCRE, and then read back the named matches. It worked on all of the different examples listed in this question, minus the "One Half" types, as I didn't add them in, but as you can see, it wouldn't be hard to do so. This addresses a lot of issues. For example, it addresses the following items in the original question and other answers:
- cardinal/nominal or ordinal: "one" and "first"
- common spelling mistakes: "forty"/"fourty" (Note that it does not EXPLICITLY address this, that would be something you'd want to do before you passed the string to this parser. This parser sees this example as "FOUR"...)
- hundreds/thousands: 2100 -> "twenty one hundred" and also "two thousand and one hundred"
- separators: "eleven hundred fifty two", but also "elevenhundred fiftytwo" or "eleven-hundred fifty-two" and whatnot
- colloquialisms: "thirty-something" (This also is not TOTALLY addressed, as what IS "something"? Well, this code finds this number as simply "30").
Now, rather than store this monster of a regular expression in your source, I was considering building this RegEx at runtime, using something like the following:
char *ones[] = {"zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve",
"thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
char *tens[] = {"", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"};
char *ordinalones[] = { "", "first", "second", "third", "fourth", "fifth", "", "", "eighth", "ninth", "", "", "twelfth" };
char *ordinaltens[] = { "", "", "twentieth", "thirtieth", "fortieth", "fiftieth", "sixtieth", "seventieth", "eightieth", "ninetieth" };
and so on...
The easy part here is we are only storing the words that matter. In the case of SIXTH, you'll notice that there isn't an entry for it, because it's just its normal number with TH tacked on... But ones like TWELVE need different attention.
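The "only store the irregular forms" idea can be sketched in a few lines of Python. The `IRREGULAR` table and the `ordinal_word` helper are hypothetical names, and the sketch only covers single English words (no compounds such as "twenty-first").

```python
# Only the ordinals that are NOT "cardinal + th" need to be stored.
IRREGULAR = {
    "one": "first", "two": "second", "three": "third", "five": "fifth",
    "eight": "eighth", "nine": "ninth", "twelve": "twelfth",
}


def ordinal_word(cardinal):
    """Derive the ordinal form of a single cardinal word."""
    if cardinal in IRREGULAR:
        return IRREGULAR[cardinal]
    if cardinal.endswith("y"):
        return cardinal[:-1] + "ieth"   # twenty -> twentieth
    return cardinal + "th"              # six -> sixth, hundred -> hundredth


for word in ("six", "twelve", "twenty", "hundred"):
    print(word, "->", ordinal_word(word))
```

The generated forms can then be joined into the regular expression's alternations alongside the cardinals.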
Ok, so now we have the code to build our (ugly) RegEx; now we just execute it on our number strings.
One thing I would recommend is to filter, or eat, the word "AND". It's not necessary, and only leads to other issues.
So, what you are going to want to do is set up a function that passes the named matches for "Magnitude" into a function that looks at all the possible magnitude values, and multiplies your current result by that value of magnitude. Then, you create a function that looks at the "Value" named matches, and returns an int (or whatever you are using), based on the value discovered there.
All VALUE matches are ADDED to your result, while magnitude matches multiply the result by the mag value. So, Two Hundred Fifty Thousand becomes "2", then "2 * 100", then "200 + 50", then "250 * 1000", ending up with 250000...
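As a rough illustration of the add-or-multiply rule, here is a Python sketch that builds the pattern at runtime from word lists and walks the named matches. It uses Python's `re` module rather than PCRE, covers cardinals only, and the names `PATTERN`, `VALUES`, `MAGNITUDES` and `parse` are my own; treat it as a sketch of the idea, not the code described above.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
MAGNITUDES = {"hundred": 100, "thousand": 1000,
              "million": 10**6, "billion": 10**9}

VALUES = {word: i for i, word in enumerate(ONES)}
VALUES.update({word: i * 10 for i, word in enumerate(TENS) if word})

# Longest alternatives first, so "fourteen" is tried before "four".
value_alt = "|".join(sorted(VALUES, key=len, reverse=True))
mag_alt = "|".join(sorted(MAGNITUDES, key=len, reverse=True))
PATTERN = re.compile(r"(?P<Magnitude>%s)|(?P<Value>%s)" % (mag_alt, value_alt))


def parse(text):
    result = 0
    # Anything the pattern does not match (spaces, hyphens, "and") is skipped.
    for m in PATTERN.finditer(text.lower()):
        if m.group("Magnitude"):
            result *= MAGNITUDES[m.group("Magnitude")]  # magnitude: multiply
        else:
            result += VALUES[m.group("Value")]          # value: add
    return result


print(parse("Two Hundred Fifty Thousand"))
```

Because the pattern is searched rather than anchored, run-together forms such as "elevenhundred fiftytwo" are also picked up. Note that a single flat accumulator misreads inputs like "two thousand five hundred" (it yields 200500), so real use would need a per-group subtotal along the lines of the first answer.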
Just for fun, I wrote a vbScript version of this and it worked great with all the examples provided. Now, it doesn't support named matches, so I had to work a little harder getting the correct result, but I got it. Bottom line is, if it's a "VALUE" match, add it to your accumulator. If it's a magnitude match, multiply your accumulator by 100, 1000, 1000000, 1000000000, etc... This will provide you with some pretty amazing results, and all you have to do to adjust for things like "one half" is add them to your RegEx, put in a code marker for them, and handle them.
Well, I hope this post helps SOMEONE out there. If anyone wants, I can post the vbScript pseudo code that I used to test this with; however, it's not pretty code, and NOT production code.
If I may... What is the final language this will be written in? C++, or something like a scripted language? Greg Hewgill's source will go a long way in helping understand how all of this comes together.
Let me know if I can be of any other help. Sorry, I only know English/American, so I can't help you with the other languages.
I was converting ordinal edition statements from early modern books (e.g. "2nd edition", "Editio quarta") to integers and needed support for ordinals 1-100 in English and ordinals 1-10 in a few Romance languages. Here's what I came up with in Python:
def get_data_mapping():
    data_mapping = {
        "1st": 1,
        "2nd": 2,
        "3rd": 3,
        "tenth": 10,
        "eleventh": 11,
        "twelfth": 12,
        "thirteenth": 13,
        "fourteenth": 14,
        "fifteenth": 15,
        "sixteenth": 16,
        "seventeenth": 17,
        "eighteenth": 18,
        "nineteenth": 19,
        "twentieth": 20,
        "new": 2,
        "newly": 2,
        "nova": 2,
        "nouvelle": 2,
        "altera": 2,
        "andere": 2,
        # latin
        "primus": 1,
        "secunda": 2,
        "tertia": 3,
        "quarta": 4,
        "quinta": 5,
        "sexta": 6,
        "septima": 7,
        "octava": 8,
        "nona": 9,
        "decima": 10,
        # italian
        "primo": 1,
        "secondo": 2,
        "terzo": 3,
        "quarto": 4,
        "quinto": 5,
        "sesto": 6,
        "settimo": 7,
        "ottavo": 8,
        "nono": 9,
        "decimo": 10,
        # french
        "premier": 1,
        "deuxième": 2,
        "troisième": 3,
        "quatrième": 4,
        "cinquième": 5,
        "sixième": 6,
        "septième": 7,
        "huitième": 8,
        "neuvième": 9,
        "dixième": 10,
        # spanish
        "primero": 1,
        "segundo": 2,
        "tercero": 3,
        "cuarto": 4,
        "quinto": 5,
        "sexto": 6,
        "septimo": 7,
        "octavo": 8,
        "noveno": 9,
        "decimo": 10
    }
    # create 4th, 5th, ... 19th
    # (range replaces the Python 2 xrange used originally)
    for i in range(16):
        data_mapping[str(4+i) + "th"] = 4+i
    # create 20th, 21st, 22nd, ... 99th
    # (80 iterations so that 99th is included; the original stopped at 98th)
    for i in range(80):
        last_char = str(i)[-1]
        if last_char == "0":
            data_mapping[str(20+i) + "th"] = 20+i
        elif last_char == "1":
            data_mapping[str(20+i) + "st"] = 20+i
        elif last_char == "2":
            data_mapping[str(20+i) + "nd"] = 20+i
        elif last_char == "3":
            data_mapping[str(20+i) + "rd"] = 20+i
        else:
            data_mapping[str(20+i) + "th"] = 20+i
    ordinals = [
        "first", "second", "third",
        "fourth", "fifth", "sixth",
        "seventh", "eighth", "ninth"
    ]
    # create first, second ... ninth
    for c, i in enumerate(ordinals):
        data_mapping[i] = c+1
    # create twenty-first, twenty-second ... ninety-ninth
    for ci, i in enumerate([
        "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy",
        "eighty", "ninety"
    ]):
        for cj, j in enumerate(ordinals):
            data_mapping[i + "-" + j] = 20 + (ci*10) + (cj+1)
        # twentieth, thirtieth, ... ninetieth
        data_mapping[i.replace("y", "ieth")] = 20 + (ci*10)
    return data_mapping