How to use regular expressions to match everything before a certain type of word
Replace
^.*?(?=[A-Z][a-z])
with the empty string. This works for ASCII input. For non-ASCII input (Unicode, other languages), different strategies apply.
Explanation
^ Start of the input
.*? Everything (as little as possible), until
(?= followed by
[A-Z] one of A .. Z and
[a-z] one of a .. z
)
The Java Unicode-enabled variant would be this:
^.*?(?=\p{Lu}\p{Ll})
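As a rough sketch of how the two patterns above might be applied in Java, the following uses `replaceFirst("")` to drop the matched prefix. The class and method names (`StripPrefixDemo`, `stripAscii`, `stripUnicode`) are made up for illustration, and the non-ASCII sample string is an assumed input written with Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.regex.Pattern;

class StripPrefixDemo {
    // ASCII-only lookahead: stop before the first upper-case letter followed by a lower-case one.
    private static final Pattern ASCII = Pattern.compile("^.*?(?=[A-Z][a-z])");
    // Unicode-aware variant using the upper-case (Lu) and lower-case (Ll) letter categories.
    private static final Pattern UNICODE = Pattern.compile("^.*?(?=\\p{Lu}\\p{Ll})");

    // Removes everything before the first capitalized word; returns the input unchanged if none is found.
    static String stripAscii(String s) {
        return ASCII.matcher(s).replaceFirst("");
    }

    static String stripUnicode(String s) {
        return UNICODE.matcher(s).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(stripAscii("THIS IS A TEST - - +++ This is a test"));
        // Assumed sample: A-umlaut, O-umlaut, U-umlaut, then a word starting with U-umlaut.
        System.out.println(stripUnicode("\u00C4\u00D6\u00DC --- \u00DCber alles"));
    }
}
```

When no capitalized word is present, the lookahead never succeeds and the input comes back untouched, so callers that need to detect that case would have to compare the result with the input or use `Matcher.find()` directly.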
Having woken up a bit, you don't need to delete anything, or even create a sub-group - just find the pattern expressed elsewhere in the other answers. Here's a complete example:
import java.util.regex.*;

public class Test
{
    public static void main(String[] args)
    {
        // Match from the first capitalized word to the end of the line.
        Pattern pattern = Pattern.compile("[A-Z][a-z].*");
        String original = "THIS IS A TEST - - +++ This is a test";
        Matcher match = pattern.matcher(original);
        if (match.find())
        {
            System.out.println(match.group());
        }
        else
        {
            System.out.println("No match");
        }
    }
}
EDIT: Original answer
This looks like it's doing the right thing:
import java.util.regex.*;

public class Test
{
    public static void main(String[] args)
    {
        // Group 1 captures everything from the first capitalized word onwards.
        Pattern pattern = Pattern.compile("^.*?([A-Z][a-z].*)$");
        String original = "THIS IS A TEST - - +++ This is a test";
        String replaced = pattern.matcher(original).replaceAll("$1");
        System.out.println(replaced);
    }
}
Basically the trick is not to ignore everything before the proper word - it's to group everything from the proper word onwards, and replace the whole text with that group.
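The above would fail with "*** FOO *** I am fond of peanuts"
because the "I" wouldn't be considered a proper word. If you want to fix that, change the [a-z] to [a-z\s], which will allow for whitespace instead of a letter, and put \b in front of [A-Z]. Without the word boundary, the last letter of an all-caps word followed by a space (the final "O" of "FOO ") would match first. Even with it, a one-letter capital such as the "A" in "THIS IS A TEST" counts as a proper word.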
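To see how the whitespace variation behaves on the peanuts example, here is a small sketch comparing three patterns: the original one, the `[a-z\s]` change on its own, and the same change with a `\b` word boundary in front of the capital. The class name `ProperWordDemo` and the helper `keepFrom` are hypothetical; the `\b` variant is one possible way to keep the tail of an all-caps word from matching, not the only one.

```java
import java.util.regex.Pattern;

class ProperWordDemo {
    // Pattern from the answer: upper-case letter followed by a lower-case letter.
    static final Pattern STRICT = Pattern.compile("^.*?([A-Z][a-z].*)$");
    // Second slot may also be whitespace, with no further constraint.
    static final Pattern RELAXED = Pattern.compile("^.*?([A-Z][a-z\\s].*)$");
    // Same, but the capital must start a word (assumption: \b is an acceptable definition of "word start").
    static final Pattern BOUNDED = Pattern.compile("^.*?(\\b[A-Z][a-z\\s].*)$");

    // Replaces the whole text with group 1; returns the input unchanged if the pattern does not match.
    static String keepFrom(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "*** FOO *** I am fond of peanuts";
        System.out.println(keepFrom(STRICT, text));   // unchanged: no upper+lower pair exists
        System.out.println(keepFrom(RELAXED, text));  // starts at the last "O" of "FOO"
        System.out.println(keepFrom(BOUNDED, text));  // starts at "I"
    }
}
```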
I know my opinion on this really isn't that popular, so you guys can down-vote me into oblivion if you want, but I have to rant a little (and this contains a solution, just not in the way the poster asked for).
I really don't get why people go to regular expressions so quickly.
I've done a lot of string parsing (I used to screen-scrape VT100 menu screens) and I've never found a single case where regular expressions would have been much easier than just writing code. (Maybe a couple would have been a little easier, but not much.)
I kind of understand they are supposed to be easier once you know them--but you see someone ask a question like this and realize they aren't easy for every programmer to just get by glancing at them. If it costs one programmer somewhere down the line 10 minutes of thought, it has a huge net loss over just coding it, even if you took 5 minutes to write 5 lines.
So it's going to need documentation--and if someone who is at that same level comes across it, he won't be able to modify it without knowledge outside his domain, even with documentation.
I mean if the poster had to ask on a trivial case--then there just isn't such a thing as a trivial case.
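public String getRealText(String scanMe) {
    for (int i = 0; i + 1 < scanMe.length(); i++)  // stop one short so charAt(i + 1) stays in range
        if (Character.isUpperCase(scanMe.charAt(i)) && Character.isLowerCase(scanMe.charAt(i + 1)))
            return scanMe.substring(i);
    return null;  // no capitalized word found
}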
I mean it's 5 lines, but it's simple, readable, and faster than most (all?) RE parsers. Once you've wrapped a regular expression in a method and commented it, the difference in size isn't measurable. The difference in time--well, for the poster it would have obviously been a LOT less time--as it might be for the next guy that comes across his code.
And this string operation is one of the ones that are even easier in C with pointers--and it would be even quicker since the testing functions are macros in C.
By the way, make sure you look for a space in the second slot, not just a lower-case letter, otherwise you'll miss any lines starting with the words A or I. If you do that, also check that the capital starts a word, or the last letter of an all-caps word followed by a space will match.
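A sketch of the loop with the space check folded in might look like the following. The class name `ProperTextScanner` is made up, and the `startsWord` test (previous character is not a letter) is an added assumption about what counts as the start of a word, needed so that the "S " at the end of an all-caps word is not mistaken for a one-letter word.

```java
class ProperTextScanner {
    // Returns the text from the first capitalized word onwards, or null if there is none.
    static String getRealText(String scanMe) {
        for (int i = 0; i + 1 < scanMe.length(); i++) {
            char current = scanMe.charAt(i);
            char next = scanMe.charAt(i + 1);
            // Assumption: a word starts at index 0 or right after any non-letter.
            boolean startsWord = i == 0 || !Character.isLetter(scanMe.charAt(i - 1));
            if (startsWord
                    && Character.isUpperCase(current)
                    && (Character.isLowerCase(next) || Character.isWhitespace(next))) {
                return scanMe.substring(i);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(getRealText("*** FOO *** I am fond of peanuts"));
        System.out.println(getRealText("+++ This is a test"));
    }
}
```

One consequence of accepting whitespace in the second slot: a standalone capital inside an all-caps header also qualifies, so "THIS IS A TEST - - +++ This is a test" would be cut at "A TEST ..." rather than at "This". Whether that is acceptable depends on the input.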
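Then you can do something like this:
'.*([A-Z][a-z].*)\s*'
.* matches anything
( [A-Z] #followed by an upper case char
[a-z] #followed by a lower case
.*) #followed by anything
\s* #followed by zero or more white space
Which is what you are looking for, I think. Note that the leading .* is greedy, so if the input contains several capitalized words the group starts at the last one; use .*? to start at the first.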
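The effect of the leading quantifier in the pattern above can be checked with a short sketch. It assumes the pattern is applied to the whole input (Java's `Matcher.matches()`); the class name `LeadingQuantifierDemo`, the helper `capture`, and the sample string are made up for illustration. The third pattern, with a reluctant quantifier inside the group as well, is an added variation showing one way to let the trailing `\s*` actually absorb trailing whitespace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingQuantifierDemo {
    // Returns group 1 if the regex matches the entire input, otherwise null.
    static String capture(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample with two capitalized words and two trailing spaces.
        String input = "+++ This is Not a test  ";

        // Greedy prefix: backtracks from the end, so the group starts at "Not".
        System.out.println("[" + capture(".*([A-Z][a-z].*)\\s*", input) + "]");
        // Reluctant prefix: the group starts at "This", trailing spaces still inside the group.
        System.out.println("[" + capture(".*?([A-Z][a-z].*)\\s*", input) + "]");
        // Reluctant prefix and reluctant group: trailing spaces are left to \s*.
        System.out.println("[" + capture(".*?([A-Z][a-z].*?)\\s*", input) + "]");
    }
}
```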
([A-Z][a-z].+)
would match:
This is a text