
Automatic Generation of Regular Expression: version 1.1

posted Dec 12, 2012, 4:11 AM by Alberto Bartoli   [ updated Dec 12, 2012, 5:08 AM by Eric Medvet ]
We have slightly improved our automatic generator of regular expressions: a few internal optimizations, a few more computing resources, and better load balancing. Below we describe, very informally and very briefly, how the system works; then we list some of the open questions.

The webapp generates regular expressions automatically, using only text extraction examples. An example is a string paired with the substring to be extracted:

String                                              Substring to be extracted
today is 11-12-12 all day                           11-12-12
the world will end on 21-12-12                      21-12-12
5/11/55 is an important date for time travelling    5/11/55
it’s 7:50 a.m. on 3/4/12, I think                   3/4/12
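For concreteness, here is how such an extraction might look in Java, using a hypothetical date pattern of the kind the generator could produce for the examples above (the actual generated regex may differ):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateExtraction {
    public static void main(String[] args) {
        // Hypothetical pattern matching the date-like substrings above.
        Pattern p = Pattern.compile("\\d{1,2}[-/]\\d{1,2}[-/]\\d{2}");
        String[] examples = {
            "today is 11-12-12 all day",
            "the world will end on 21-12-12",
            "5/11/55 is an important date for time travelling",
            "it's 7:50 a.m. on 3/4/12, I think"
        };
        for (String s : examples) {
            Matcher m = p.matcher(s);
            if (m.find()) {
                // Prints: 11-12-12, 21-12-12, 5/11/55, 3/4/12
                System.out.println(m.group());
            }
        }
    }
}
```

Note that on the last example the pattern correctly skips “7:50”, because “:” is not in the separator class.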

It is a research prototype developed by our lab and we believe it is the only existing tool where users provide only examples of the desired behavior. Being a prototype, though, it is far from being "perfect" and we will greatly appreciate any comments or criticism.

How it works

The system internally runs an evolutionary search based on genetic programming (GP). With a fair amount of oversimplification, it works as follows.

It partitions the examples into two subsets, a training set and a validation set. Then it executes a GP search:
  1. Generate a set of 500 regular expressions at random (a “population” of 500 “individuals”, in GP parlance).
  2. For each regular expression r:
    • Apply r to each example in the training set.
    • Compute the edit distance between the string that r extracts from each example and the string that should have been extracted.
    • Associate r with a performance index (called fitness in GP parlance): a combination of (i) the sum of the edit distances computed at the previous step and (ii) the length of r.
  3. Rank all the regular expressions according to their fitness.
  4. Discard those with the worst fitness, say 10%.
  5. Modify some of the surviving ones at random, say 10% (“mutation” in GP parlance).
  6. Generate new regular expressions, say 80% of the population, by mixing pairs of regular expressions picked at random (“crossover” in GP parlance).
  7. Repeat steps 2–6 1000 times (the population size is kept constant at 500; each iteration is called a “generation” in GP parlance).
  8. Compute the fitness of each regular expression on the validation set; that is, assess each candidate regex on examples it has never seen before.
  9. Take the one with the best fitness.
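The fitness computation of step 2 can be sketched as follows. The edit-distance part is the standard Levenshtein distance; combining the two terms as a plain sum is an assumption made here for illustration, not necessarily what the real system does:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Fitness {
    // Standard dynamic-programming (Levenshtein) edit distance.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    // Fitness = sum of edit distances on the training set + regex length
    // (lower is better). The actual weighting of the two terms in the
    // real system is not specified here.
    static int fitness(String regex, String[] strings, String[] expected) {
        Pattern p = Pattern.compile(regex);
        int sum = 0;
        for (int i = 0; i < strings.length; i++) {
            Matcher m = p.matcher(strings[i]);
            String extracted = m.find() ? m.group() : "";
            sum += editDistance(extracted, expected[i]);
        }
        return sum + regex.length();
    }
}
```

A regex that extracts every training substring exactly has fitness equal to its own length, so among perfect extractors the shortest one wins.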
The system repeats the GP search described above 32 times, thereby obtaining 32 regular expressions, one for each GP search. At this point the system ranks these 32 regular expressions according to their fitness on the validation set. The one with the best fitness is taken as the final result and presented to the user.
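As an illustration only, the generational loop can be sketched as a toy Java program that evolves flat regex strings with truncation selection, crossover, and occasional mutation. The real system's representation, operators, and proportions differ, and the alphabet of building blocks below is an arbitrary choice for this sketch:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ToyGpLoop {
    // Two training examples (string, substring to be extracted).
    static final String[] STRINGS = {"today is 11-12-12 all day", "the world will end on 21-12-12"};
    static final String[] EXPECTED = {"11-12-12", "21-12-12"};
    // Building blocks for random regexes (a tiny, arbitrary alphabet).
    static final char[] ALPHABET = "0123456789-/\\d{},[]".toCharArray();
    static final Random RND = new Random(42);

    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    // Sum of edit distances plus regex length; invalid regexes get the worst fitness.
    static int fitness(String regex) {
        try {
            Pattern p = Pattern.compile(regex);
            int sum = 0;
            for (int i = 0; i < STRINGS.length; i++) {
                Matcher m = p.matcher(STRINGS[i]);
                sum += editDistance(m.find() ? m.group() : "", EXPECTED[i]);
            }
            return sum + regex.length();
        } catch (RuntimeException e) {
            return Integer.MAX_VALUE;
        }
    }

    static String randomRegex() {
        StringBuilder sb = new StringBuilder();
        int len = 1 + RND.nextInt(10);
        for (int i = 0; i < len; i++) sb.append(ALPHABET[RND.nextInt(ALPHABET.length)]);
        return sb.toString();
    }

    static String mutate(String r) {
        if (r.isEmpty()) return randomRegex();
        char[] c = r.toCharArray();
        c[RND.nextInt(c.length)] = ALPHABET[RND.nextInt(ALPHABET.length)];
        return new String(c);
    }

    static String crossover(String a, String b) {
        return a.substring(0, RND.nextInt(a.length() + 1)) + b.substring(RND.nextInt(b.length() + 1));
    }

    public static void main(String[] args) {
        int popSize = 50;                    // the real system uses 500
        List<String> pop = new ArrayList<>();
        for (int i = 0; i < popSize; i++) pop.add(randomRegex());
        for (int gen = 0; gen < 30; gen++) { // the real system runs 1000 generations
            pop.sort(Comparator.comparingInt(ToyGpLoop::fitness));
            // Keep the best 10% unchanged; refill by crossover of top-half parents,
            // with occasional mutation of the offspring.
            List<String> next = new ArrayList<>(pop.subList(0, popSize / 10));
            while (next.size() < popSize) {
                String a = pop.get(RND.nextInt(popSize / 2));
                String b = pop.get(RND.nextInt(popSize / 2));
                String child = crossover(a, b);
                if (RND.nextInt(10) == 0) child = mutate(child);
                next.add(child);
            }
            pop = next;
        }
        pop.sort(Comparator.comparingInt(ToyGpLoop::fitness));
        System.out.println("best: " + pop.get(0) + "  fitness: " + fitness(pop.get(0)));
    }
}
```

Invalid intermediate regexes are simply assigned the worst possible fitness, so the search discards them on its own.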

That's it. Keep in mind, though, that this description is grossly oversimplified. In order to obtain useful results, we had to carefully design and analyze a number of issues, including the fitness definition, the handling of multiple objectives, and the choice of the operators that can be used for constructing a regular expression.

The backend is implemented in Java and makes use of a GP API that we developed. We are not planning to make this code publicly available, one reason being that it is not documented and we are not able to provide any serious support.

Open questions

There are a number of open research questions:
  • Is this example-based approach indeed useful?
  • Are the regular expressions generated by our webapp indeed useful?
  • Could we improve the results with a fitness that measures how much a human user would “like” a given regex?
  • How to actually quantify that notion?
  • How to assess whether a set of examples is adequate?
  • Given a very small set of examples, would it be possible to grow the set automatically by finding further suitable examples somewhere on the Internet? How?
As we said above, we will greatly appreciate any comments and criticisms.

Please be patient: regex generation is computationally expensive; we are a very small research group with very few resources, so the app might occasionally be overloaded.