This web page contains additional data about the experiment described in "Can A Machine Replace Humans In Building Regular Expressions? A Case Study": box-and-whiskers diagrams covering all the completed tasks; a statistical significance analysis (p-values); and histograms of average values computed over different portions of the data. The measured quantities are: F-measure on the learning set, F-measure on the testing set, and time taken to construct a regular expression. Data are reported separately for each extraction task.
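As a reminder of what the F-measure quantifies, here is a minimal sketch that assumes the usual definition (harmonic mean of precision and recall). Representing extractions as (start, end) character spans, and the name `f_measure`, are assumptions made for illustration; this page does not specify how matches were counted in the experiment.

```python
def f_measure(extracted, expected):
    """Harmonic mean of precision and recall over two collections of snippets.

    Assumption: a snippet counts as correct only if it is identical to an
    expected one. Returns 0.0 when either collection is empty or when
    nothing matches.
    """
    extracted, expected = set(extracted), set(expected)
    if not extracted or not expected:
        return 0.0
    true_positives = len(extracted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans, not data from the experiment.
found = [(0, 5), (10, 15), (20, 25)]
wanted = [(0, 5), (10, 15), (30, 35), (40, 45)]
score = f_measure(found, wanted)  # precision 2/3, recall 1/2
```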

The box-and-whiskers diagrams show that the results associated with humans vary widely, both in F-measure and in time, while the results obtained with our tool are much more repeatable. The only cases in which our tool shows relatively wide variability are F-measure for References-LeadAuthor and time for WebHTML-HeadingContent.

The F-measure diagram shows that, in each category of humans, there are always some humans who obtain better results than our tool. Not surprisingly, then, the improvement of our tool over the three categories is statistically significant only for some tasks (see the tables of p-values). In other words, while our tool is not systematically better than humans in terms of F-measure on all tasks, it does deliver an F-measure that is comparable to that of humans and that is, on average, even better.

The time diagram, on the other hand, indicates that our tool tends to be systematically faster than humans, and the table of p-values confirms that this difference is statistically significant for most tasks.

The following tables show the p-values obtained with a one-sided Wilcoxon rank-sum test. The alternative hypothesis H1 is: the F-measure obtained by GP is greater than the one obtained by the human.

The following table shows the p-values obtained with a one-sided Wilcoxon rank-sum test. The alternative hypothesis H1 is: the time taken by GP to construct a regular expression is lower than the time taken by the human.
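To illustrate how a one-sided p-value of this kind can be computed, here is a minimal sketch of the rank-sum test using the normal approximation with a tie-corrected variance and no continuity correction. This page does not state which variant of the test (exact or approximate) was used in the experiment, so the approximation is an assumption; the function names and the sample values are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_p_greater(x, y):
    """One-sided p-value for H1: values in x tend to be greater than in y."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = list(x) + list(y)
    n = n1 + n2
    rank_sum_x = sum(average_ranks(pooled)[:n1])
    u = rank_sum_x - n1 * (n1 + 1) / 2  # Mann-Whitney U statistic for x
    mean_u = n1 * n2 / 2
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0  # all values identical: no evidence for H1
    z = (u - mean_u) / math.sqrt(var_u)
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the normal


# Hypothetical values for one task, not data from the experiment.
gp_f, human_f = [0.90, 0.92, 0.95, 0.97], [0.50, 0.60, 0.70, 0.80]
p_f_measure = rank_sum_p_greater(gp_f, human_f)  # H1: GP F-measure greater

gp_time, human_time = [10, 12, 11], [30, 45, 50, 28]
p_time = rank_sum_p_greater(human_time, gp_time)  # H1: GP time lower
```

The "lower time" hypothesis is tested by swapping the arguments, since GP being faster is the same as human times tending to be greater. For samples as small as the ones above, an exact test would generally be preferable to the normal approximation.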