Skip to content
Go back

Making string validation faster by not using a regular expression. A story.

Edit page

A while back, we were performance profiling an application and noticed a big performance bottleneck while mapping objects using AutoMapper. Mapping is of course somewhat expensive, but the numbers we were seeing were way higher than expected: mapping was ridiculously slow! And “just mapping” was not a good explanation for these numbers. Trusting the work of Jimmy and trusting AutoMapper, we expected something else was probably causing this. And it was: a regular expression match was to blame!

The code was to blame

Actually not the regular expression itself, but the regular expression and the number of times it was called. While mapping, the target class’ Identifier property was obviously being set, and validating the incoming value using a regular expression.

Not a super big deal in itself, except that while mapping this validation was executed for a few thousand objects, resulting in around one second (sometimes more) to map these objects. Not a happy place!

Here’s the code that had the bottleneck:

private string _identifier;

public string Identifier
{
    get
    {
        return _identifier;
    }
    set
    {
        if (Regex.IsMatch(value, "^[A-Za-z0-9@/._-]{1,254}$", RegexOptions.Compiled))
        {
            _identifier = value;
        }
        else
        {
            throw new InvalidPropertyValueException(nameof(Identifier), "The identifier is invalid.");
        }
    }
}

Aren’t regular expressions, especially those with RegexOptions.Compiled, supposed to be super fast? And especially in this case, where we’re only validating the string consists of a set of allowed characters, and making sure the string length is between 1 and 254 characters in length?

Some DuckDuckGo-ing (horrible as a verb…) later, we found a few interesting articles:

Unfortunately, none of these seemed to explain the numbers we were seeing. And doing the math on the number of calls times regex execution time, the regex was actually fast enough, the number of calls was the big issue…

So how could we make this faster? Trial and error! We decided to try a couple of solutions to the problem and benchmark them. But before we benchmark: let’s look at the solution candidates first.

Candidates for improvement

To be able to compare results, we decided on a baseline for our benchmark. This baseline would be:

Regex.IsMatch("Some.Sample-Data.To-Valid@te", "^[A-Za-z0-9@/._-]{1,254}$", RegexOptions.Compiled)

It is the same validation we found in production code, so let’s see if we can improve from there.

Candidate 1: don’t use RegexOptions.Compiled

Given RegexOptions.Compiled is considered to not always be the best option, we decided to do a non-compiled version as part of the benchmark:

Regex.IsMatch("Some.Sample-Data.To-Valid@te", "^[A-Za-z0-9@/._-]{1,254}$")

Our common sense told us this would not be better, but the only way to find out is by trying.

Candidate 2: using a Regex instance

While MSDN told us that regular expressions are cached when using Regex.IsMatch() and others, we decided to also try another option: creating an instance of Regex and using that one. Or, in code, we’d create one instance:

private static readonly Regex _precompiledRegex = new Regex("^[A-Za-z0-9@/._-]{1,254}$", RegexOptions.Compiled);

And then run the benchmark using:

_precompiledRegex.IsMatch("Some.Sample-Data.To-Valid@te");

Who knows, this could be better than our baseline!

Candidate 3: compiling the regular expression to an external assembly

This sounded interesting: instead of compiling the regular expression at run time, it’s also possible to use Regex.CompileToAssembly() and create an assembly that hold the precompiled regular expression.

Unfortunately, doing this comes with a little bit of extra work: we have to compile the assembly first. This, in itself, isn’t too hard, but it’s having to do the extra step. But anyway, here’s how:

var compilations = new []
{
    new RegexCompilationInfo(
        pattern: @"^[A-Za-z0-9@/._-]{1,254}$",
        options: RegexOptions.Compiled,
        name: "ValidationPattern",
        fullnamespace: "RegexVsCode.Compiled",
        ispublic: true)
};
            
Regex.CompileToAssembly(compilations, 
    new AssemblyName("RegexVsCode.Compiled, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null"));

After running this, we can find a new assembly RegexVsCode.Compiled.dll in our bin folder, with a class ValidationPattern which holds our precompiled regular expression.

Always fascinated by the inner workings of things, I decided to open and decompile this generated assembly using dotPeek. Three classes seem to be generated:

Decompiled regular expression in dotPeek

A factory (used internally by the Regex engine), a ValidationPattern (as expected) and ValidationPatternRunner1 - the class that holds the regular expression logic. Don’t be scared, but this simple regex (again validating character classes + string length) does seem quite elaborate:

    public override void Go()
    {
      string runtext = this.runtext;
      int runtextstart = this.runtextstart;
      int runtextbeg = this.runtextbeg;
      int runtextend = this.runtextend;
      int runtextpos = this.runtextpos;
      int[] runtrack = this.runtrack;
      int runtrackpos1 = this.runtrackpos;
      int[] runstack = this.runstack;
      int runstackpos = this.runstackpos;
      this.CheckTimeout();
      int num1;
      runtrack[num1 = runtrackpos1 - 1] = runtextpos;
      int num2;
      runtrack[num2 = num1 - 1] = 0;
      this.CheckTimeout();
      int num3;
      runstack[num3 = runstackpos - 1] = runtextpos;
      int num4;
      runtrack[num4 = num2 - 1] = 1;
      this.CheckTimeout();
      int end;
      if (runtextpos <= runtextbeg)
      {
        this.CheckTimeout();
        if (1 <= runtextend - runtextpos)
        {
          end = runtextpos + 1;
          int num5 = 1;
          while (RegexRunner.CharInClass(runtext[end - num5--], "\0\b\0-:@[_`a{"))
          {
            if (num5 <= 0)
            {
              this.CheckTimeout();
              int num6 = runtextend - end;
              int num7 = 253;
              if (num6 >= num7)
                num6 = 253;
              int num8 = num6;
              int num9 = 1;
              int num10 = num6 + num9;
              while (--num10 > 0)
              {
                if (!RegexRunner.CharInClass(runtext[end++], "\0\b\0-:@[_`a{"))
                {
                  --end;
                  break;
                }
              }
              if (num8 > num10)
              {
                int num11;
                runtrack[num11 = num4 - 1] = num8 - num10 - 1;
                int num12;
                runtrack[num12 = num11 - 1] = end - 1;
                runtrack[num4 = num12 - 1] = 2;
                goto label_13;
              }
              else
                goto label_13;
            }
          }
          goto label_16;
        }
        else
          goto label_16;
      }
      else
        goto label_16;
label_13:
      this.CheckTimeout();
      int num13;
      if (end >= runtextend - 1 && (end >= runtextend || (int) runtext[end] == 10))
      {
        this.CheckTimeout();
        int[] numArray = runstack;
        int index = num3;
        int num5 = 1;
        int num6 = index + num5;
        int start = numArray[index];
        this.Capture(0, start, end);
        int num7;
        runtrack[num7 = num4 - 1] = start;
        runtrack[num13 = num7 - 1] = 3;
      }
      else
        goto label_16;
label_15:
      this.CheckTimeout();
      this.runtextpos = end;
      return;
label_16:
      while (true)
      {
        this.runtrackpos = num4;
        this.runstackpos = num3;
        this.EnsureStorage();
        int runtrackpos2 = this.runtrackpos;
        num3 = this.runstackpos;
        runtrack = this.runtrack;
        runstack = this.runstack;
        int[] numArray = runtrack;
        int index = runtrackpos2;
        int num5 = 1;
        num4 = index + num5;
        switch (numArray[index])
        {
          case 1:
            this.CheckTimeout();
            ++num3;
            continue;
          case 2:
            goto label_19;
          case 3:
            this.CheckTimeout();
            runstack[--num3] = runtrack[num4++];
            this.Uncapture();
            continue;
          default:
            goto label_17;
        }
      }
label_17:
      this.CheckTimeout();
      int[] numArray1 = runtrack;
      int index1 = num4;
      int num14 = 1;
      num13 = index1 + num14;
      end = numArray1[index1];
      goto label_15;
label_19:
      this.CheckTimeout();
      int[] numArray2 = runtrack;
      int index2 = num4;
      int num15 = 1;
      int num16 = index2 + num15;
      end = numArray2[index2];
      int[] numArray3 = runtrack;
      int index3 = num16;
      int num17 = 1;
      num4 = index3 + num17;
      int num18 = numArray3[index3];
      if (num18 > 0)
      {
        int num5;
        runtrack[num5 = num4 - 1] = num18 - 1;
        int num6;
        runtrack[num6 = num5 - 1] = end - 1;
        runtrack[num4 = num6 - 1] = 2;
        goto label_13;
      }
      else
        goto label_13;
    }

There is not a lot of documentaton on this Go() method, but reading through it we can see a few things:

That’s… a lot of code to power a simple validation. But nevertheless: it’s all compiled, so performance may just be awesome!

Candidate 4: Custom code

Having seen the compiled code from candidate 3, we decided on writing our own valiation code without using regular expressions. We want to validate the string consists of a set of allowed characters, and making sure the string length is between 1 and 254 characters in length. No backtracking, no capturing, no nothing which the regular expression engine provides us. Just a boolean giving us a clue on whether the value is valid or not.

Our code?

private static bool Matches(string value)
{
    var len = value.Length;
    var matches = len >= 1 && len <= 254;

    if (matches)
    {
        for (int i = 0; i < len; i++)
        {
            matches = char.IsLetterOrDigit(value[i])
                      || value[i] == '@'
                      || value[i] == '/'
                      || value[i] == '.'
                      || value[i] == '_'
                      || value[i] == '-';

            if (!matches) return false;
        }
    }

    return matches;
} 

Simple, readable. If the length is not ok, just return false. In other cases, check each character and when one does not match, return false early without validating the rest of the string. Looks good, easy to read.

Candidate 5: Custom code (just ASCII)

Edit - Some people pointed out in the comments that char.IsLetterOrDigit() works on Unicode and thus the custom code is not 100% similar to the original regular expression. They also pointed out performance could be better, so here’s an addition to the benchmark!

The char.IsLetterOrDigit() is already optimized a bit. It checks for the character class (latin or not) and then determines if it’s a letter or a digit base on the character class. That’s important: it checks the class, not the value! And turns out there are quite a few classes to loop trough.

So instead of using char.IsLetterOrDigit(), which checks on Unicode, let’s take the ASCII table at hand and manually check for character ranges instead.

Of course .NET is based on Unicode, but for the values from the ASCII character set the indexes in the character set match - so it’s safe to do this here.

The code:

private static bool MatchesASCII(string value)
{
    var len = value.Length;
    var matches = len >= 1 && len <= 254;

    if (matches)
    {
        for (int i = 0; i < len; i++)
        {
            matches = (value[i] >= 48 && value[i] <= 57) // 0-9
                      || (value[i] >= 65 && value[i] <= 90) // A-Z
                      || (value[i] >= 97 && value[i] <= 122) // a-z
                      || value[i] == '@'
                      || value[i] == '/'
                      || value[i] == '.'
                      || value[i] == '_'
                      || value[i] == '-';

            if (!matches) return false;
        }
    }

    return matches;
} 

Still simple and readable. If the length is not ok, just return false. In other cases, check each character and when one does not match, return false early without validating the rest of the string. Let’s see how all candidates rank against each other…

Running the benchmark

I’d heard about BenchmarkDotNet before, but never had a good reason to try until now. BenchmarkDotNet makes it incredibly easy to write benchmarks and get results in a structured way. Go check their getting started with BenchmarkDotNet page - it takes a couple of attributes and a method call to run a reliable benchmark.

Our benchmark code

The benchmark we built was this one:

public class RegexVsCodeBenchmark
{
    private static readonly Regex _precompiledRegex = new Regex("^[A-Za-z0-9@/._-]{1,254}$", RegexOptions.Compiled);
    private static readonly ValidationPattern _assemblyCompiledRegex = new ValidationPattern();

    private static bool Matches(string value)
    {
        // code from above candidate 4
    }
	
    private static bool MatchesASCII(string value)
    {
        // code from above candidate 5
    } 

    [Benchmark(Description = "Regex.IsMatch - no options")]
    public bool RegexMatch()
    {
        return Regex.IsMatch("Some.Sample-Data.To-Valid@te", "^[A-Za-z0-9@/._-]{1,254}$");
    }

    [Benchmark(Baseline = true, Description = "Regex.IsMatch - with RegexOptions.Compiled")]
    public bool RegexCompiledMatch()
    {
        return Regex.IsMatch("Some.Sample-Data.To-Valid@te", "^[A-Za-z0-9@/._-]{1,254}$", RegexOptions.Compiled);
    }

    [Benchmark(Description = "Regex instance.IsMatch")]
    public bool RegexPrecompiledMatch()
    {
        return _precompiledRegex.IsMatch("Some.Sample-Data.To-Valid@te");
    }

    [Benchmark(Description = "Assembly-compiled Regex instance.IsMatch")]
    public bool RegexAssemblyCompiledMatch()
    {
        return _assemblyCompiledRegex.IsMatch("Some.Sample-Data.To-Valid@te");
    }

    [Benchmark(Description = "Custom code")]
    public bool CustomCodeMatch()
    {
        return Matches("Some.Sample-Data.To-Valid@te");
    }

    [Benchmark(Description = "Custom code - ASCII only")]
    public bool CustomCodeASCIIMatch()
    {
        return MatchesASCII("Some.Sample-Data.To-Valid@te");
    }
}

We did a long run of this benchmark, to flatten out any compilation or optimization steps when starting the validations.

The results!

The results of our benchmark, run on my laptop:


BenchmarkDotNet=v0.10.3.0, OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8
Frequency=2240912 Hz, Resolution=446.2469 ns, Timer=TSC
  [Host]    : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1637.0
  Clr       : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1637.0
  LongRun   : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1637.0
  RyuJitX64 : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1637.0

Runtime=Clr  
MethodJobJitPlatformMeanStdErrStdDevMedianScaledScaled-StdDev
'Regex.IsMatch - no options'ClrLegacyJitX861,218.3968 ns11.8194 ns54.1633 ns1,217.1536 ns1.180.06
'Regex.IsMatch - with RegexOptions.Compiled'ClrLegacyJitX861,031.7433 ns5.3394 ns20.6795 ns1,035.4243 ns1.000.00
'Regex instance.IsMatch'ClrLegacyJitX86700.5653 ns6.9606 ns41.1793 ns684.1019 ns0.680.04
'Assembly-compiled Regex instance.IsMatch'ClrLegacyJitX86661.8107 ns2.0592 ns7.9751 ns663.8463 ns0.640.01
'Custom code'ClrLegacyJitX86224.7163 ns0.7633 ns2.9562 ns224.3792 ns0.220.01
'Custom code - ASCII only'ClrLegacyJitX8664.3772 ns0.1642 ns0.6142 ns64.3874 ns0.060.00
'Regex.IsMatch - no options'LongRunLegacyJitX861,198.7954 ns1.8423 ns30.8819 ns1,189.2547 ns1.120.03
'Regex.IsMatch - with RegexOptions.Compiled'LongRunLegacyJitX861,068.7349 ns0.5691 ns9.3863 ns1,068.4618 ns1.000.00
'Regex instance.IsMatch'LongRunLegacyJitX86679.9538 ns1.3558 ns22.9687 ns678.8977 ns0.640.02
'Assembly-compiled Regex instance.IsMatch'LongRunLegacyJitX86669.3583 ns1.0979 ns18.5671 ns669.0480 ns0.630.02
'Custom code'LongRunLegacyJitX86213.6533 ns0.4295 ns7.1998 ns213.9596 ns0.200.01
'Custom code - ASCII only'LongRunLegacyJitX8664.9729 ns0.1251 ns2.1187 ns64.6303 ns0.060.00
'Regex.IsMatch - no options'RyuJitX64RyuJitX641,215.8591 ns11.9689 ns56.1390 ns1,223.4556 ns1.280.06
'Regex.IsMatch - with RegexOptions.Compiled'RyuJitX64RyuJitX64947.7736 ns5.1000 ns19.0825 ns947.6938 ns1.000.00
'Regex instance.IsMatch'RyuJitX64RyuJitX64604.0008 ns1.5033 ns5.8223 ns605.4844 ns0.640.01
'Assembly-compiled Regex instance.IsMatch'RyuJitX64RyuJitX64622.9912 ns6.3879 ns33.8018 ns605.1827 ns0.660.04
'Custom code'RyuJitX64RyuJitX64220.1304 ns1.0477 ns4.0575 ns218.1767 ns0.230.01
'Custom code - ASCII only'RyuJitX64RyuJitX6451.4059 ns0.0785 ns0.2603 ns51.4702 ns0.050.00

First of all, there is no real difference between JIT versions. We sort of expected that but still wanted to see if there were any big changes in selecting the JIT version.

There are big differences in our different candidates, though. From slow to fast:

Based on these result, our custom validation logic was added into production code and proved much, much faster!

Conclusion

Based on this post, you may think regular expressions are bad. In reality, they are not. They do seem to come with some considerations to make and some pitfalls you may encounter (see the articles linked higher up in this post), but they are great.

In the case at hand, though, custom code was the better path:

So then what is the takeaway for this post? I’d say there are three:

Enjoy!

Edit - April 29, 2017 - Daniel Hegener took this a bit further and wrote another few alternative solutions that make string validation for the above case really fast. Go check it out!


Edit page
Share this post on:

Previous Post
Building a scheduled task in ASP.NET Core/Standard 2.0
Next Post
Extending .NET CLI with custom tools - dotnet init initializes your NuGet package