# [ACCEPTED]-Complexity of Regex substitution-complexity-theory

From a purely theoretical stance:

The implementation 10 I am familiar with would be to build a Deterministic 9 Finite Automaton to recognize the regex. This 8 is done in O(2^m), m being the size of the 7 regex, using a standard algorithm. Once 6 this is built, running a string through 5 it is linear in the length of the string 4 - O(n), n being string length. A replacement 3 on a match found in the string should be 2 constant time.

So overall, I suppose O(2^m 1 + n).

Other theoretical info of possible interest.

For 65 clarity, assume the standard definition 64 for a regular expression

http://en.wikipedia.org/wiki/Regular_language

from the formal 63 language theory. Practically, this means 62 that the only building material are alphabet 61 symbols, operators of concatenation, alternation 60 and Kleene closure, along with the unit 59 and zero constants (which appear for group-theoretic 58 reasons). Generally it's a good idea not 57 to overload this term despite the everyday 56 practice in scripting languages which leads 55 to ambiguities.

There is an NFA construction 54 that solves the matching problem for a regular expression 53 r and an input text t in O(|r| |t|) time 52 and O(|r|) space, where |-| is the length 51 function. This algorithm was further improved 50 by Myers

http://doi.acm.org/10.1145/128749.128755

to the time and space complexity 49 O(|r| |t| / log |t|) by using automaton 48 node listings and the Four Russians paradigm. This 47 paradigm seems to be named after four Russian 46 guys who wrote a groundbreaking paper which 45 is not online. However, the paradigm is 44 illustrated in these computational biology lecture 43 notes

http://lyle.smu.edu/~saad/courses/cse8354/lectures/lecture5.pdf

I find it hilarious to name a paradigm 42 by the number and the nationality of authors 41 instead of their last names.

The matching 40 problem for regular expressions with added 39 backreferences is NP-complete, which was 38 proven by Aho

http://portal.acm.org/citation.cfm?id=114877

by a reduction from the vertex-cover 37 problem which is a classical NP-complete 36 problem.

To match regular expressions with 35 backreferences deterministically we could employ 34 backtracking (not unlike the Perl regex 33 engine) to keep track of the possible subwords 32 of the input text t that can be assigned 31 to the variables in r. There are only O(|t|^2) subwords 30 that can be assigned to any one variable in 29 r. If there are n variables in r, then there 28 are O(|t|^2n) possible assignments. Once 27 an assignment of substrings to variables 26 is fixed, the problem reduces to the plain 25 regular expression matching. Therefore the worst-case 24 complexity for matching regular expressions 23 with backreferences is O(|t|^2n).

Note however, regular 22 expressions with backreferences are not 21 yet full-featured regexen.

Take, for example, the 20 "don't care" symbol apart from 19 any other operators. There are several polynomial 18 algorithms deciding whether a set of patterns 17 matches an input text. For example, Kucherov 16 and Rusinowitch

http://dx.doi.org/10.1007/3-540-60044-2_46

define a pattern as a word 15 w_1@w_2@...@w_n where each w_i is a word 14 (not a regular expression) and "@" is 13 a variable length "don't care" symbol 12 not contained in either of w_i. They derive 11 an O((|t| + |P|) log |P|) algorithm for 10 matching a set of patterns P against an 9 input text t, where |t| is the length of 8 the text, and |P| is the length of all the 7 words in P.

It would be interesting to know 6 how these complexity measures combine and 5 what is the complexity measure of the matching 4 problem for regular expressions with backreferences, "don't 3 care" and other interesting features 2 of practical regular expressions.

Alas, I 1 haven't said a word about Python... :)

Depends on what you define by regex. If 13 you allow operators of concatenation, alternative 12 and Kleene-star, the time can actually be 11 `O(m*n+m)`

, where `m`

is size of a regex and `n`

is length 10 of the string. You do it by constructing 9 a NFA (that is linear in `m`

), and then simulating 8 it by maintaining the set of states you're 7 in and updating that (in `O(m)`

) for every letter 6 of input.

Things that make regex parsing 5 difficult:

- parentheses and backreferences: capturing is still OK with the aforementioned algorithm, although it would get the complexity higher, so it might be infeasable. Backreferences raise the recognition power of the regex, and its difficulty is well
- positive look-ahead: is just another name for intersection, which raises the complexity of the aforementioned algorithm to
`O(m^2+n)`

- negative look-ahead: a disaster for constructing the automaton (
`O(2^m)`

, possibly PSPACE-complete). But should still be possible to tackle with the dynamic algorithm in something like`O(n^2*m)`

Note that with a concrete implementation, things 4 might get better or worse. As a rule of 3 thumb, simple features should be fast enough, and 2 unambiguous (eg. not like `a*a*`

) regexes are 1 better.

To delve into theprise's answer, for the 5 construction of the automaton, O(2^m) is 4 the worst case, though it really depends 3 on the form of the regular expression (for 2 a very simple one that matches a word, it's 1 in O(m), using for example the Knuth-Morris-Pratt algorithm).

Depends on the implementation. What language/library/class? There 3 may be a best case, but it would be very 2 specific to the number of features in the 1 implementation.

You can trade space for speed by building 5 a nondeterministic finite automaton instead 4 of a DFA. This can be traversed in linear 3 time. Of course, in the worst case this 2 could need O(2^m) space. I'd expect the 1 tradeoff to be worth it.

If you're after matching and substitution, that 10 implies grouping and backreferences.

Here 9 is a perl example where grouping and backreferences 8 can be used to solve an NP complete problem: http://perl.plover.com/NPC/NPC-3SAT.html

This 7 (coupled with a few other theoretical tidbits) means 6 that using regular expressions for matching 5 and substitution is NP-complete.

Note that 4 this is different from the formal definition 3 of a regular expression - which don't have 2 the notion of grouping - and match in polynomial 1 time as described by the other answers.

More Related questions

We use cookies to improve the performance of the site. By staying on our site, you agree to the terms of use of cookies.