Making string validation faster by not using a regular expression. A story.

A while back, we were performance profiling an application and noticed a big performance bottleneck while mapping objects using AutoMapper. Mapping is of course somewhat expensive, but the numbers we were seeing were way higher than expected: mapping was ridiculously slow! And “just mapping” was not a good explanation for these numbers. Trusting the work of Jimmy and trusting AutoMapper, we expected something else was probably causing this. And it was: a regular expression match was to blame!

The code was to blame

Actually not the regular expression itself, but the regular expression and the number of times it was called. While mapping, the target class’ Identifier property was obviously being set, and validating the incoming value using a regular expression.

Not a super big deal in itself, except that while mapping this validation was executed for a few thousand objects, resulting in around one second (sometimes more) to map these objects. Not a happy place!

Here’s the code that had the bottleneck:

private string _identifier;

public string Identifier
{
    get
    {
        return _identifier;
    }
    set
    {
        if (Regex.IsMatch(value, "^[[email protected]/._-]{1,254}$", RegexOptions.Compiled))
        {
            _identifier = value;
        }
        else
        {
            throw new InvalidPropertyValueException(nameof(Identifier), "The identifier is invalid.");
        }
    }
}

Aren’t regular expressions, especially those with RegexOptions.Compiled, supposed to be super fast? And especially in this case, where we’re only validating the string consists of a set of allowed characters, and making sure the string length is between 1 and 254 characters in length?

Some DuckDuckGo-ing (horrible as a verb…) later, we found a few interesting articles:

  • Back in 2005, Jeff Atwood wrote about using RegexOptions.Compiled and that it’s not always the best thing to use. RegexOptions.Compiled emits IL code and precompiles the regular expression, but that comes with a bit of a startup cost. It’s a tradeoff to make, and in our case we did expect this tradeoff to be one we could live with. Validation here happens quite a few times, so it makes sense to compile once and then reap the benefits later.
  • We read the excellent MSDN article “Best Practices for Regular Expressions in the .NET Framework”, describing common pitfalls in working with regular expressions and performance considerations. Make sure to read it, this is a really nice article about the Regex engine in .NET.

Unfortunately, none of these seemed to explain the numbers we were seeing. And doing the math on the number of calls times regex execution time, the regex was actually fast enough, the number of calls was the big issue…

So how could we make this faster? Trial and error! We decided to try a couple of solutions to the problem and benchmark them. But before we benchmark: let’s look at the solution candidates first.

Candidates for improvement

To be able to compare results, we decided on a baseline for our benchmark. This baseline would be:

Regex.IsMatch("[email protected]", "^[[email protected]/._-]{1,254}$", RegexOptions.Compiled)

It is the same validation we found in production code, so let’s see if we can improve from there.

Candidate 1: don’t use RegexOptions.Compiled

Given RegexOptions.Compiled is considered to not always be the best option, we decided to do a non-compiled version as part of the benchmark:

Regex.IsMatch("[email protected]", "^[[email protected]/._-]{1,254}$")

Our common sense told us this would not be better, but the only way to find out is by trying.

Candidate 2: using a Regex instance

While MSDN told us that regular expressions are cached when using Regex.IsMatch() and others, we decided to also try another option: creating an instance of Regex and using that one. Or, in code, we’d create one instance:

private static readonly Regex _precompiledRegex = new Regex("^[[email protected]/._-]{1,254}$", RegexOptions.Compiled);

And then run the benchmark using:

_precompiledRegex.IsMatch("[email protected]");

Who knows, this could be better than our baseline!

Candidate 3: compiling the regular expression to an external assembly

This sounded interesting: instead of compiling the regular expression at run time, it’s also possible to use Regex.CompileToAssembly() and create an assembly that hold the precompiled regular expression.

Unfortunately, doing this comes with a little bit of extra work: we have to compile the assembly first. This, in itself, isn’t too hard, but it’s having to do the extra step. But anyway, here’s how:

var compilations = new []
{
    new RegexCompilationInfo(
        pattern: @"^[[email protected]/._-]{1,254}$",
        options: RegexOptions.Compiled,
        name: "ValidationPattern",
        fullnamespace: "RegexVsCode.Compiled",
        ispublic: true)
};
            
Regex.CompileToAssembly(compilations, 
    new AssemblyName("RegexVsCode.Compiled, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null"));

After running this, we can find a new assembly RegexVsCode.Compiled.dll in our bin folder, with a class ValidationPattern which holds our precompiled regular expression.

Always fascinated by the inner workings of things, I decided to open and decompile this generated assembly using dotPeek. Three classes seem to be generated:

Decompiled regular expression in dotPeek

A factory (used internally by the Regex engine), a ValidationPattern (as expected) and ValidationPatternRunner1 - the class that holds the regular expression logic. Don’t be scared, but this simple regex (again validating character classes + string length) does seem quite elaborate:

    public override void Go()
    {
      string runtext = this.runtext;
      int runtextstart = this.runtextstart;
      int runtextbeg = this.runtextbeg;
      int runtextend = this.runtextend;
      int runtextpos = this.runtextpos;
      int[] runtrack = this.runtrack;
      int runtrackpos1 = this.runtrackpos;
      int[] runstack = this.runstack;
      int runstackpos = this.runstackpos;
      this.CheckTimeout();
      int num1;
      runtrack[num1 = runtrackpos1 - 1] = runtextpos;
      int num2;
      runtrack[num2 = num1 - 1] = 0;
      this.CheckTimeout();
      int num3;
      runstack[num3 = runstackpos - 1] = runtextpos;
      int num4;
      runtrack[num4 = num2 - 1] = 1;
      this.CheckTimeout();
      int end;
      if (runtextpos <= runtextbeg)
      {
        this.CheckTimeout();
        if (1 <= runtextend - runtextpos)
        {
          end = runtextpos + 1;
          int num5 = 1;
          while (RegexRunner.CharInClass(runtext[end - num5--], "\0\b\0-:@[_`a{"))
          {
            if (num5 <= 0)
            {
              this.CheckTimeout();
              int num6 = runtextend - end;
              int num7 = 253;
              if (num6 >= num7)
                num6 = 253;
              int num8 = num6;
              int num9 = 1;
              int num10 = num6 + num9;
              while (--num10 > 0)
              {
                if (!RegexRunner.CharInClass(runtext[end++], "\0\b\0-:@[_`a{"))
                {
                  --end;
                  break;
                }
              }
              if (num8 > num10)
              {
                int num11;
                runtrack[num11 = num4 - 1] = num8 - num10 - 1;
                int num12;
                runtrack[num12 = num11 - 1] = end - 1;
                runtrack[num4 = num12 - 1] = 2;
                goto label_13;
              }
              else
                goto label_13;
            }
          }
          goto label_16;
        }
        else
          goto label_16;
      }
      else
        goto label_16;
label_13:
      this.CheckTimeout();
      int num13;
      if (end >= runtextend - 1 && (end >= runtextend || (int) runtext[end] == 10))
      {
        this.CheckTimeout();
        int[] numArray = runstack;
        int index = num3;
        int num5 = 1;
        int num6 = index + num5;
        int start = numArray[index];
        this.Capture(0, start, end);
        int num7;
        runtrack[num7 = num4 - 1] = start;
        runtrack[num13 = num7 - 1] = 3;
      }
      else
        goto label_16;
label_15:
      this.CheckTimeout();
      this.runtextpos = end;
      return;
label_16:
      while (true)
      {
        this.runtrackpos = num4;
        this.runstackpos = num3;
        this.EnsureStorage();
        int runtrackpos2 = this.runtrackpos;
        num3 = this.runstackpos;
        runtrack = this.runtrack;
        runstack = this.runstack;
        int[] numArray = runtrack;
        int index = runtrackpos2;
        int num5 = 1;
        num4 = index + num5;
        switch (numArray[index])
        {
          case 1:
            this.CheckTimeout();
            ++num3;
            continue;
          case 2:
            goto label_19;
          case 3:
            this.CheckTimeout();
            runstack[--num3] = runtrack[num4++];
            this.Uncapture();
            continue;
          default:
            goto label_17;
        }
      }
label_17:
      this.CheckTimeout();
      int[] numArray1 = runtrack;
      int index1 = num4;
      int num14 = 1;
      num13 = index1 + num14;
      end = numArray1[index1];
      goto label_15;
label_19:
      this.CheckTimeout();
      int[] numArray2 = runtrack;
      int index2 = num4;
      int num15 = 1;
      int num16 = index2 + num15;
      end = numArray2[index2];
      int[] numArray3 = runtrack;
      int index3 = num16;
      int num17 = 1;
      num4 = index3 + num17;
      int num18 = numArray3[index3];
      if (num18 > 0)
      {
        int num5;
        runtrack[num5 = num4 - 1] = num18 - 1;
        int num6;
        runtrack[num6 = num5 - 1] = end - 1;
        runtrack[num4 = num6 - 1] = 2;
        goto label_13;
      }
      else
        goto label_13;
    }

There is not a lot of documentaton on this Go() method, but reading through it we can see a few things:

  • It loops through the input string and checks various conditions
  • It makes a number of cals to this.CheckTimeout(), to verify the current processor tick count against a configured timeout tick count (so there are a few side-tracks in this code)
  • There are a few calls to Capture()and Uncapture(), the Regex-y stuff that keeps track of capture groups etc.

That’s… a lot of code to power a simple validation. But nevertheless: it’s all compiled, so performance may just be awesome!

Candidate 4: Custom code

Having seen the compiled code from candidate 3, we decided on writing our own valiation code without using regular expressions. We want to validate the string consists of a set of allowed characters, and making sure the string length is between 1 and 254 characters in length. No backtracking, no capturing, no nothing which the regular expression engine provides us. Just a boolean giving us a clue on whether the value is valid or not.

Our code?

private static bool Matches(string value)
{
    var len = value.Length;
    var matches = len >= 1 && len <= 254;

    if (matches)
    {
        for (int i = 0; i < len; i++)
        {
            matches = char.IsLetterOrDigit(value[i])
                      || value[i] == '@'
                      || value[i] == '/'
                      || value[i] == '.'
                      || value[i] == '_'
                      || value[i] == '-';

            if (!matches) return false;
        }
    }

    return matches;
} 

Simple, readable. If the length is not ok, just return false. In other cases, check each character and when one does not match, return false early without validating the rest of the string. Looks good, easy to read.

Candidate 5: Custom code (just ASCII)

Edit - Some people pointed out in the comments that char.IsLetterOrDigit() works on Unicode and thus the custom code is not 100% similar to the original regular expression. They also pointed out performance could be better, so here’s an addition to the benchmark!

The char.IsLetterOrDigit() is already optimized a bit. It checks for the character class (latin or not) and then determines if it’s a letter or a digit base on the character class. That’s important: it checks the class, not the value! And turns out there are quite a few classes to loop trough.

So instead of using char.IsLetterOrDigit(), which checks on Unicode, let’s take the ASCII table at hand and manually check for character ranges instead.

Of course .NET is based on Unicode, but for the values from the ASCII character set the indexes in the character set match - so it’s safe to do this here.

The code:

private static bool MatchesASCII(string value)
{
    var len = value.Length;
    var matches = len >= 1 && len <= 254;

    if (matches)
    {
        for (int i = 0; i < len; i++)
        {
            matches = (value[i] >= 48 && value[i] <= 57) // 0-9
                      || (value[i] >= 65 && value[i] <= 90) // A-Z
                      || (value[i] >= 97 && value[i] <= 122) // a-z
                      || value[i] == '@'
                      || value[i] == '/'
                      || value[i] == '.'
                      || value[i] == '_'
                      || value[i] == '-';

            if (!matches) return false;
        }
    }

    return matches;
} 

Still simple and readable. If the length is not ok, just return false. In other cases, check each character and when one does not match, return false early without validating the rest of the string. Let’s see how all candidates rank against each other…

Running the benchmark

I’d heard about BenchmarkDotNet before, but never had a good reason to try until now. BenchmarkDotNet makes it incredibly easy to write benchmarks and get results in a structured way. Go check their getting started with BenchmarkDotNet page - it takes a couple of attributes and a method call to run a reliable benchmark.

Our benchmark code

The benchmark we built was this one:

public class RegexVsCodeBenchmark
{
    private static readonly Regex _precompiledRegex = new Regex("^[[email protected]/._-]{1,254}$", RegexOptions.Compiled);
    private static readonly ValidationPattern _assemblyCompiledRegex = new ValidationPattern();

    private static bool Matches(string value)
    {
        // code from above candidate 4
    }
	
    private static bool MatchesASCII(string value)
    {
        // code from above candidate 5
    } 

    [Benchmark(Description = "Regex.IsMatch - no options")]
    public bool RegexMatch()
    {
        return Regex.IsMatch("[email protected]", "^[[email protected]/._-]{1,254}$");
    }

    [Benchmark(Baseline = true, Description = "Regex.IsMatch - with RegexOptions.Compiled")]
    public bool RegexCompiledMatch()
    {
        return Regex.IsMatch("[email protected]", "^[[email protected]/._-]{1,254}$", RegexOptions.Compiled);
    }

    [Benchmark(Description = "Regex instance.IsMatch")]
    public bool RegexPrecompiledMatch()
    {
        return _precompiledRegex.IsMatch("[email protected]");
    }

    [Benchmark(Description = "Assembly-compiled Regex instance.IsMatch")]
    public bool RegexAssemblyCompiledMatch()
    {
        return _assemblyCompiledRegex.IsMatch("[email protected]");
    }

    [Benchmark(Description = "Custom code")]
    public bool CustomCodeMatch()
    {
        return Matches("[email protected]");
    }

    [Benchmark(Description = "Custom code - ASCII only")]
    public bool CustomCodeASCIIMatch()
    {
        return MatchesASCII("[email protected]");
    }
}

We did a long run of this benchmark, to flatten out any compilation or optimization steps when starting the validations.

The results!

The results of our benchmark, run on my laptop:


BenchmarkDotNet=v0.10.3.0, OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4712HQ CPU 2.30GHz, ProcessorCount=8
Frequency=2240912 Hz, Resolution=446.2469 ns, Timer=TSC
  [Host]    : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1637.0
  Clr       : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1637.0
  LongRun   : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1637.0
  RyuJitX64 : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1637.0

Runtime=Clr  

MethodJobJitPlatformMeanStdErrStdDevMedianScaledScaled-StdDev
'Regex.IsMatch - no options'ClrLegacyJitX861,218.3968 ns11.8194 ns54.1633 ns1,217.1536 ns1.180.06
'Regex.IsMatch - with RegexOptions.Compiled'ClrLegacyJitX861,031.7433 ns5.3394 ns20.6795 ns1,035.4243 ns1.000.00
'Regex instance.IsMatch'ClrLegacyJitX86700.5653 ns6.9606 ns41.1793 ns684.1019 ns0.680.04
'Assembly-compiled Regex instance.IsMatch'ClrLegacyJitX86661.8107 ns2.0592 ns7.9751 ns663.8463 ns0.640.01
'Custom code'ClrLegacyJitX86224.7163 ns0.7633 ns2.9562 ns224.3792 ns0.220.01
'Custom code - ASCII only'ClrLegacyJitX8664.3772 ns0.1642 ns0.6142 ns64.3874 ns0.060.00
'Regex.IsMatch - no options'LongRunLegacyJitX861,198.7954 ns1.8423 ns30.8819 ns1,189.2547 ns1.120.03
'Regex.IsMatch - with RegexOptions.Compiled'LongRunLegacyJitX861,068.7349 ns0.5691 ns9.3863 ns1,068.4618 ns1.000.00
'Regex instance.IsMatch'LongRunLegacyJitX86679.9538 ns1.3558 ns22.9687 ns678.8977 ns0.640.02
'Assembly-compiled Regex instance.IsMatch'LongRunLegacyJitX86669.3583 ns1.0979 ns18.5671 ns669.0480 ns0.630.02
'Custom code'LongRunLegacyJitX86213.6533 ns0.4295 ns7.1998 ns213.9596 ns0.200.01
'Custom code - ASCII only'LongRunLegacyJitX8664.9729 ns0.1251 ns2.1187 ns64.6303 ns0.060.00
'Regex.IsMatch - no options'RyuJitX64RyuJitX641,215.8591 ns11.9689 ns56.1390 ns1,223.4556 ns1.280.06
'Regex.IsMatch - with RegexOptions.Compiled'RyuJitX64RyuJitX64947.7736 ns5.1000 ns19.0825 ns947.6938 ns1.000.00
'Regex instance.IsMatch'RyuJitX64RyuJitX64604.0008 ns1.5033 ns5.8223 ns605.4844 ns0.640.01
'Assembly-compiled Regex instance.IsMatch'RyuJitX64RyuJitX64622.9912 ns6.3879 ns33.8018 ns605.1827 ns0.660.04
'Custom code'RyuJitX64RyuJitX64220.1304 ns1.0477 ns4.0575 ns218.1767 ns0.230.01
'Custom code - ASCII only'RyuJitX64RyuJitX6451.4059 ns0.0785 ns0.2603 ns51.4702 ns0.050.00

First of all, there is no real difference between JIT versions. We sort of expected that but still wanted to see if there were any big changes in selecting the JIT version.

There are big differences in our different candidates, though. From slow to fast:

  • Candidate 1, Regex.IsMatch (not using RegexOptions.Compiled) is clearly slowest. We expected this but still wanted to try.
  • The baseline, Regex.IsMatch (using RegexOptions.Compiled) isn’t significantly faster. It is faster, but the improvement is not spectacular.
  • Candidate 2, using an instance, only takes (roughly) 65% of the time our baseline takes. That’s significantly faster, and would be a good improvement in our codebase.
  • Candidate 3, using a regular expression compiled into an external assembly, is only slightly faster than candidate 2. The difference could be in the startup time (where candidate 2 still has to be compiled at runtime), but we did not measure.
  • Candidate 4, our custom code is fast! Our focused piece of code runs in 21% of the time the baseline took to run. That’s a massive improvement!
  • Candidate 5, our custom code that checks just ASCII characters, seems to outperform all others. It runs in 6% of the time the baseline took to run. Nice!

Based on these result, our custom validation logic was added into production code and proved much, much faster!

Conclusion

Based on this post, you may think regular expressions are bad. In reality, they are not. They do seem to come with some considerations to make and some pitfalls you may encounter (see the articles linked higher up in this post), but they are great.

In the case at hand, though, custom code was the better path:

  • The validation logic was simple, and writing it in code is still very readable
  • We did not need capturing and all other features regular expressions give us

So then what is the takeaway for this post? I’d say there are three:

  • Always be measuring. Use a profiler and regularly measure different code paths in an application. If anything looks out of expected ranges, look at how it can be improved.
  • Always be learning. I had no idea char.IsLetterOrDigit() would be “slow” compared with just checking ASCII characters.
  • Use regular expressions! Just not for validating string length.

Enjoy!

Edit - April 29, 2017 - Daniel Hegener took this a bit further and wrote another few alternative solutions that make string validation for the above case really fast. Go check it out!

Leave a Comment

avatar

19 responses

  1. Avatar for Oded Coster
    Oded Coster April 24th, 2017

    One thing about your custom solution - it uses `char.IsLetterOrDigit`, which returns true for any Unicode character or digit. The regular expression is much more restrictive - only allows for ASCII digits and characters (in addition to the "special" characters @/._-).

    So, semantics have changed. I suspect that if you restrict to the same set in your custom solution you will see a further performance improvement.

  2. Avatar for Maarten Balliauw
    Maarten Balliauw April 24th, 2017

    Good catch, thanks for that!

  3. Avatar for Daniel Crabtree
    Daniel Crabtree April 25th, 2017

    A further performance improvement is quite likely. I recently replaced char.IsLetterOrDigit with a custom check where only ASCII characters were possible and got a nice performance boost.

  4. Avatar for Richard Tallent
    Richard Tallent April 26th, 2017

    Oded beat me to the same comment. I had a similar situation awhile back, but I was using Regex *replacements* to "conform" unusual Unicode characters to simpler ASCII. My custom code ended up being FAR faster than regular expressions:

    https://blog.tallent.us/pos...

    I still loves me some regex and use it heavily in my work (processing regulations), but this experience has definitely made me pay more attention to regex performance and to use more custom code for critical paths.

  5. Avatar for Maarten Balliauw
    Maarten Balliauw April 26th, 2017

    Nice one! Yeah me too, being more cautious now. Regexes are great at what they do, but if simple/readable enough custom code may be much faster indeed.

  6. Avatar for Thomas Levesque
    Thomas Levesque April 26th, 2017

    Probably overkill for such a simple case, but you could also use a parser like Sprache: https://github.com/sprache/...

  7. Avatar for Daniel Hegener
    Daniel Hegener April 29th, 2017

    Nice post! Thanks for that. Kept me up two nights in a row... And it made me start blogging. ;)
    http://harponcsharp.blogspo...

  8. Avatar for Maarten Balliauw
    Maarten Balliauw April 29th, 2017

    Amazing cool! Adding it to the blog post.

  9. Avatar for Bevan Arps
    Bevan Arps May 7th, 2017

    Admittedly I'm indulging my inner pedant, but you're not working with ASCII. .NET is Unicode based through and through. Unless you're processing a byte array, you'll never see actual ASCII. Your custom solution is checking code-point indexes that happen to match ASCII byte values, but are not the same thing. They match because the folks behind Unicode decided to do it that way for backward compatibility.

  10. Avatar for Maarten Balliauw
    Maarten Balliauw May 8th, 2017

    Updated the text a bit :-)

  11. Avatar for Christophe De Mey
    Christophe De Mey May 10th, 2017

    Great article, keep them comming ! :)

  12. Avatar for Vince
    Vince May 12th, 2017

    I enjoyed reading this post as you've explained all the different scenarios and reasoning behind the decisions made.

    Some context on the problem would be nice though. Are you working in a high performance environment (such as trading systems) where 1 second matters? I've done quite a bit of performance tuning over the years and I have rarely found the CPU to be a bottle neck in most applications.

    I'm also really interested in knowing about the return, either monetary or other tangible benefit) to your product.

  13. Avatar for Maarten Balliauw
    Maarten Balliauw May 12th, 2017

    The system was one where a catalog of items was fetched from the database and then rendered in either HTML or JSON. It had to be loaded fresh for every request (caching was not an option, still not sure why not but that was the constraint) and request times were usually around 1-2 seconds. We saw CPU spikes when a lot of these requests came in and investigated why, discovered the issue described in this post and started thinking about ways to give the CPU more time to do other stuff. Response times dropped to under 1 second, improving end user experience (as they too had to wait for this data to arrive in their browser).

  14. Avatar for Maarten Balliauw
    Maarten Balliauw May 12th, 2017

    Forgot: the better way to solve this would be to get the validation out of the way completely in that path, but that would have taken too long to refactor the codebase around two models (one with, one withoud) validation or moving the validation to its own class and calling into it where needed.

  15. Avatar for Vince
    Vince May 12th, 2017

    In that context it makes a lot of sense. 1 second of CPU time per request for that one particular piece of code is a problem.

  16. Avatar for Daniel Coryat
    Daniel Coryat October 29th, 2018

    Great article, helped me a lot..thanks (:

  17. Avatar for Maarten Balliauw
    Maarten Balliauw October 30th, 2018

    You’re welcome :-) Happy to hear!

  18. Avatar for Kunphuzd
    Kunphuzd November 12th, 2018

    “Emits” or “omits”?

  19. Avatar for Maarten Balliauw
    Maarten Balliauw November 13th, 2018

    “Emits”, or “writes” IL code.