Is regex harmful to runtime performance? I tried a few different approaches to parsing version strings, and found regex performance awful. Unless someone can show me what I'm doing wrong, my conclusion is that regex is for development productivity, but should be avoided on performance hot paths.
All my code and raw results are on GitHub. In case I update the benchmarks in the future, you can check the latest results with this link, but this blog post is based on the results from commit hash 31cd9bc4fa6d6126770c516e67d0700ece119101.
Background
I have two previous blog posts about version sorting/comparison performance in .NET. The first was an investigation into different data structures and how they impact sort performance, and the second was a deep dive into NuGet's NuGetVersion class and sorting performance, which resulted in me creating a pull request on the NuGet.Client repo improving NuGetVersion sorting by about 60%.
In case you're not familiar with Semantic Versioning 2, a version string can look like the following:
1.2.3-preview.1+metadata.A.B
The first section is the numeric version at the beginning, 1.2.3. SemVer2 requires these 3 segments, called major, minor, and patch. NuGet version strings are a superset which also supports .NET's System.Version, so a 4th revision segment is allowed, as in 1.2.3.4. NuGet also only requires the major version; the rest are implied to be zero if omitted. Segments cannot be negative and may only contain digits.
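As a minimal sketch of those rules (the helper below is my own illustration, not NuGet's code), the numeric section could be parsed like this:

```csharp
using System;
using System.Globalization;

static class VersionSegments
{
    // Hypothetical helper illustrating the rules above: 1 to 4 dot-separated
    // segments, digits only, and omitted segments implied to be zero.
    public static int[] Parse(string version)
    {
        string[] parts = version.Split('.');
        if (parts.Length < 1 || parts.Length > 4)
            throw new FormatException("Expected 1 to 4 version segments.");

        var segments = new int[4]; // major, minor, patch, revision
        for (int i = 0; i < parts.Length; i++)
        {
            // NumberStyles.None rejects signs and whitespace, so only
            // digit characters are accepted.
            segments[i] = int.Parse(parts[i], NumberStyles.None);
        }
        return segments;
    }
}
```

So "1.2" parses as {1, 2, 0, 0}, while "1.-2" or "1.2.3.4.5" throws a FormatException.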
The second section, -preview.1, is an optional pre-release section. It consists of one or more segments, separated by periods (.), and each segment may contain English letters, digits, or dashes (-). When a segment consists entirely of digits, it must be compared as a number, but when it contains at least one letter, it must be compared lexically. My first blog post in this version performance series showed that pre-parsing the version had a significant benefit over NuGet's implementation, which stores only the string segments and then parses each one as an int during comparison.
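The numeric-versus-lexical rule can be sketched like this (my own illustration, not NuGet's implementation):

```csharp
using System;

static class PrereleaseComparison
{
    // SemVer2 rule: all-digit segments compare as numbers and have lower
    // precedence than segments containing letters; everything else
    // compares lexically (ordinal).
    public static int CompareSegments(string a, string b)
    {
        bool aNumeric = IsAllDigits(a);
        bool bNumeric = IsAllDigits(b);

        if (aNumeric && bNumeric)
            return uint.Parse(a).CompareTo(uint.Parse(b));
        if (aNumeric) return -1; // numeric sorts before alphanumeric
        if (bNumeric) return 1;

        return string.CompareOrdinal(a, b);
    }

    private static bool IsAllDigits(string s)
    {
        if (s.Length == 0) return false;
        foreach (var c in s)
            if (c < '0' || c > '9') return false;
        return true;
    }
}
```

This is why 1.0.0-preview.9 sorts before 1.0.0-preview.10, even though "9" is lexically greater than "10".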
The final section, +metadata.A.B, is an optional build metadata section. It also consists of one or more segments, separated by periods (.), and may also contain only English letters, digits, or dashes (-). Build metadata is ignored in comparisons, so the period separation is largely irrelevant, except that SemVer forbids empty segments, which is important to know for correct input validation.
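A sketch of that validation (a hypothetical helper, not NuGet's code) might look like:

```csharp
static class MetadataValidation
{
    // Build metadata rules described above: dot-separated segments that may
    // only contain English letters, digits, or dashes, and must be non-empty.
    public static bool IsValidMetadata(string metadata)
    {
        foreach (var segment in metadata.Split('.'))
        {
            if (segment.Length == 0)
                return false; // SemVer forbids empty segments, e.g. "a..b"

            foreach (var c in segment)
            {
                bool allowed = (c >= '0' && c <= '9')
                            || (c >= 'A' && c <= 'Z')
                            || (c >= 'a' && c <= 'z')
                            || c == '-';
                if (!allowed)
                    return false;
            }
        }
        return true;
    }
}
```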
Implementations
I tested 3 of my own implementations, and compared to NuGet's own NuGetVersion.Parse
method.
NuGetVersion.Parse
NuGetVersion.Parse
first separates the input string into a 3 item Tuple<string, string[], string>
, representing the version, prerelease, and metadata sections respectively. Each of the strings is obtained using string.Substring
, largely because Span<T>
didn't exist until several years after NuGet, and this version parser, were first written. Also, since NuGet ships in Visual Studio, there are complications in trying to add new assemblies into the product, and using Span<T>
with .NET Framework requires an additional assembly. Fortunately, System.Memory.dll
now ships with Visual Studio, so NuGet can start taking advantage of it, but it would have been difficult in the past.
SubstringParser
Similar to how I created a VersionWithStringArray
that was a "clean-room" implementation of NuGetVersion
for sort performance testing, I created a version parser from scratch using string.Substring
to mimic NuGet and discover whether or not there are fundamental inefficiencies with NuGet's implementation.
I wasn't aware that NuGetVersion.Parse
used System.Version.TryParse
to parse the version components, so that is one potentially significant difference. Particularly on .NET Core 3.1 and .NET 5, which have Span<T>
built in, NuGetVersion
has the potential to only need one memory allocation when using string.Substring
to parse the version section, whereas my implementation requires up to 4 string.Substring
calls. On the other hand, on .NET Framework, NuGetVersion.Parse
will use string.Substring
to get the version section from the original input, and then System.Version.TryParseVersion
uses string.Split
to generate an array of strings, causing up to 5 string allocations and one array, just for the version section.
I haven't dug deep into NuGetVersion.Parse
's implementation, but I imagine that otherwise the prerelease and metadata implementations are fairly similar.
SpanParser
Next, I more-or-less copy-pasted the SubstringParser, replacing as many string.Substring
calls with ReadOnlySpan<char>
equivalents as I could. .NET Core, which has Span<T>
as part of the Base Class Library (BCL), has methods like the static uint.Parse(ReadOnlySpan<char>)
. But the .NET Framework has neither. The System.Memory package can be used to bring in the Span<T>
struct, but the rest of the BCL doesn't have convenience methods. Hence, I needed to use #if NETFRAMEWORK
, #else
and #endif
to compile different code for different target frameworks. Additionally, I had to implement my own uint
parser, otherwise the span parser's performance was the same as SubstringParser. The implementation is simple:
private static uint Parse(ReadOnlySpan<char> input)
{
    checked
    {
        uint result = 0;
        foreach (var c in input)
        {
            if (c < '0' || c > '9')
            {
                throw new FormatException();
            }

            // checked arithmetic throws OverflowException rather than
            // silently wrapping around past uint.MaxValue
            result = result * 10 + (uint)(c - '0');
        }

        return result;
    }
}
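For reference (assuming the Parse method above is in scope), its behavior looks like this:

```csharp
// Digits parse as expected; anything else throws FormatException, and the
// checked block turns silent wrap-around into an OverflowException.
uint max = Parse("4294967295".AsSpan());   // uint.MaxValue
// Parse("12a".AsSpan())        -> FormatException
// Parse("4294967296".AsSpan()) -> OverflowException
```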
As you'll see in the results, this allows the .NET Framework SpanParser to have a similar performance benefit to the SpanParser on .NET Core 3.1 and .NET 5, compared to the SubstringParser.
RegexParser
At the end of the SemVer2 spec, there is a suggested regex string. I took this and modified it to work with NuGet's version strings. This is the shortest parser, by lines of code or bytes. However, while it only uses common/simple regex features, to anyone without regex experience it is probably completely unreadable. I'm by no means a regex expert, but I'm also not a beginner. I feel like reading and understanding a regex is slower and more difficult than reading and understanding procedural code, such as the span or substring parsers above. The regex has fewer characters, but being terse is not the same thing as being understandable. Having said that, once you have adequate regex knowledge, you can certainly implement some parsing tasks more quickly with regex than with procedural code, since it's less to type.
I also had high hopes, because blog posts about .NET Core runtime improvements have mentioned regex in the past as an area that has improved, especially in .NET 5. I used RegexOptions.CultureInvariant|RegexOptions.Singleline|RegexOptions.Compiled, hoping this gives the regex engine the best chance at performance.
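The actual expression in the benchmark is adapted from the spec and far longer; a cut-down sketch of its shape (with the same options) looks something like:

```csharp
using System.Text.RegularExpressions;

static class RegexParserSketch
{
    // Cut-down sketch of a NuGet-style version pattern; the benchmark's real
    // expression is much longer and stricter.
    public static readonly Regex Version = new Regex(
        @"^(?<major>\d+)(\.(?<minor>\d+))?(\.(?<patch>\d+))?(\.(?<revision>\d+))?" +
        @"(-(?<prerelease>[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*))?" +
        @"(\+(?<metadata>[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*))?$",
        RegexOptions.CultureInvariant | RegexOptions.Singleline | RegexOptions.Compiled);
}
```

A successful match exposes each section through named groups, but a failed match tells you nothing about why the input was rejected.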
StateMachineParser
I had this idea before starting any of the semver parsing work, and this is what motivated me the most to write the benchmarks and this blog post. With C# 7.2's ref struct
feature, I thought it should be possible to write a zero allocation parser. However, my design was to have the state machine use a Next(char)
method called for each character in the input string. While this worked brilliantly for the numeric version section, I couldn't benefit from pre-allocating the perfect size prerelease labels array, hence I used a List<string>
which is not only heap-allocated, but also reallocates each time its internal buffer capacity is reached, which could theoretically happen multiple times in a single version parse. Similarly, string lengths are not known in advance, so a StringBuilder
is used, which is heap-allocated and has internal buffers.
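To show the shape of the design (heavily elided, with names of my own invention; the real parser has many more states), a minimal Next(char) state machine might look like:

```csharp
using System;

// Minimal sketch of the Next(char) design described above; only the
// major/minor digit states are shown.
ref struct VersionStateMachine
{
    private enum State { Major, Minor, Invalid }

    private State _state;
    private uint _current;

    public bool IsValid => _state != State.Invalid;
    public uint CurrentSegment => _current;

    public void Next(char c)
    {
        switch (_state)
        {
            case State.Major when c >= '0' && c <= '9':
            case State.Minor when c >= '0' && c <= '9':
                // Accumulate digits; checked arithmetic catches overflow.
                _current = checked(_current * 10 + (uint)(c - '0'));
                break;
            case State.Major when c == '.':
                // Commit _current as the major version, start parsing minor.
                _state = State.Minor;
                _current = 0;
                break;
            default:
                _state = State.Invalid;
                break;
        }
    }
}
```

Because it's a ref struct holding only value-type fields, the machine itself lives on the stack; the allocations discussed above come from collecting the parsed pieces, not the machine.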
One of the other reasons I was excited for this state machine idea is that it makes it very easy to give extremely precise error messages. Imagine for example the input string 1.2.3-pre.+build1
, which is invalid because the .
in the prerelease section must be followed by letters or digits; the segment cannot be empty. With the state machine, it's easy to have an error message saying something like "Prerelease segment 2 at position 10 must contain at least one character". The opposite extreme is the regex parser, where all we know is that the input string did not match the regex, but we have no idea why.
Results
Finally, the results. I learned that BenchmarkDotNet has an [Orderer(BenchmarkDotNet.Order.SummaryOrderPolicy.FastestToSlowest)]
attribute, and by using baseline: true
on the .NET 5 job declaration, I've saved a lot of time not needing to re-order manually for this blog post. The trends for all the runtimes are similar, so here's the .NET 5 results, to keep the table shorter:
Method | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|
StateMachine | 0.89 | 18125.0000 | 125.0000 | 125.0000 | 74.91 MB |
Span | 1.00 | 11857.1429 | 142.8571 | 142.8571 | 49.79 MB |
SubString | 1.44 | 24000.0000 | - | - | 99.08 MB |
NuGetVersion | 1.45 | 20600.0000 | - | - | 85.76 MB |
Regex | 6.08 | 140000.0000 | - | - | 563.92 MB |
From this, we can really see 3 groups. StateMachine and Span have very similar performance. Next, my SubString implementation has the same performance as NuGetVersion, about 40% slower than the span and state machine parsers. Finally, Regex is multiple times slower than any of the others. In fact, .NET 5's regex performance improvements, compared to older runtimes, are quite visible. Comparing the Span and Regex implementations across all 3 runtimes:
Method | Runtime | Ratio |
---|---|---|
Span | .NET Core 3.1 | 0.98 |
Span | .NET Core 5.0 | 1.00 |
Span | .NET 4.7.2 | 1.07 |
Regex | .NET Core 5.0 | 6.08 |
Regex | .NET Core 3.1 | 8.04 |
Regex | .NET 4.7.2 | 8.66 |
We can see that the span parser's performance is very similar across the 3 runtimes, but the regex implementation was about 6x slower than the span parser on .NET 5, and about 8x slower on .NET Core 3.1 and .NET Framework.
At the top of this post I have a link to the full results, so feel free to have a look to compare all the implementations across all the runtimes.
Commentary
I had fun investigating this and trying out different implementations. Parsing seems to be an area where there's a lot of room for creativity and discovering what the tradeoffs are for different approaches.
Is performance inversely proportional to the number of garbage collections?
Looking just at the Gen0 Garbage Collections (GC) for the moment, we see that the state machine parser has over 50% more GCs triggered than the span parser, yet the performance is similar. In fact, on .NET 5 the state machine parser was about 10% faster, despite the higher GC count.
At the other extreme, BenchmarkDotNet reported that the regex parser had the same Gen0 GC count, 140,000 per 1,000 calls, on .NET Core 3.1 and .NET 5. On .NET Framework, this increased to 786,000 Gen0 GCs! Yet .NET Framework and .NET Core 3.1 performance was roughly the same, while .NET 5 was about 30% faster.
Also, the state machine and NuGetVersion parsers have similar GC counts, the span parser has significantly fewer GCs, and my substring parser has somewhat more GCs than the state machine/NuGetVersion parsers. Yet the runtime performance is not at all proportional to the GC count.
So, these benchmarks show that there's no clear relationship between GC count and runtime performance. It's too simplistic to say that reducing memory allocations will always result in faster performance. Yet we've seen many other examples of how reducing memory allocations, and therefore GCs, does improve performance. Indeed, the regex parser, which triggers far more GCs than any other parser, is by far the slowest. Personally, I'll still advocate for reducing allocations as a performance benefit, but I now see that when comparing two similar algorithms, the one with fewer allocations may not be the faster one.
Does GC generation matter?
General advice I've heard is that Gen1 and Gen2 GCs are slow, yet only the fastest parser implementations had any Gen1 and Gen2 collections. I'm not at all an expert, so I'm guessing, but the Span and StateMachine parsers allocate little memory (I think the span parser only allocates what it returns, nothing temporary). So, when the GC triggers, the Gen0 collection doesn't find much to collect, hence it tries the later generations before requesting more memory from the operating system. On the other hand, the SubString, NuGetVersion and Regex parsers allocate more temporary, short-lived objects, so I guess the Gen0 collection frees up enough memory that the GC doesn't need to try later generations.
I believe the GC also tries to timebox itself, so perhaps since the span and state machine parsers don't create many temporary objects, the GC completes so quickly it decides it has enough time to do another generation, to try to free more memory. Whereas for the other implementations, which create many temporary objects, the mark and sweep takes so long that the GC decides it's best to return control to the app to run business logic, rather than keep it paused during the GC.
In any case, these benchmark results are evidence that Gen1 and Gen2 GC counts above zero are not necessarily a bad thing.
Trading off code complexity, performance, and error messages
Performance is only one aspect of software engineering. Another very important aspect is maintenance, or technical debt. The regex parser was just 59 lines, including whitespace, although the expression is difficult to understand unless you already have some experience with regex, and it sits on a single line 376 characters long. By comparison, the substring parser is 152 lines. The span parser could have been about the same, if I didn't want to also test running on the .NET Framework. But, since I did, my span parser is 198 lines. Both the span and substring parsers share one validation method, in a separate 59 line file, which should be included for a fair comparison with the other implementations. Finally, my state machine parser is 322 lines, and I didn't finish writing all the friendly messages for validation errors.
For a scenario like a NuGet server, error messages probably aren't important, because you'd expect the program creating the package to have validated the input string. But for interactive development, it could give the best error messages. Imagine this:
c:\src\MyApp> dotnet add package MyLibrary --version 1.2.3-preview..4
Error: Version string 1.2.3-preview..4 is not valid
^
Prerelease segment 2, at position 14, cannot be empty.
Compilers are increasingly generating better and better error messages when source code has a syntax error, and I think developers would be right to expect a similar improvement in error messages from other development tools. Error messages should state precisely why something failed validation, especially when there are multiple validation rules. A simple "input is not valid" is acceptable for an integer input, but SemVer2 has enough validation rules that I think developers would benefit from more detail. Especially when the version string is being used to create a package, as not all developers are aware of SemVer2's limitations, especially the limitations on the prerelease and build metadata characters.
Further investigations
This blog post isn't close to being exhaustive. There are plenty of things that could be investigated to see if they improve performance, if maximum performance is your goal. But I have many other things I want to spend time on, so I'm not going to investigate them myself at this time. Some ideas:
- Spend much more time on different RegEx implementations. For example, remove most of the error checking from the RegEx implementation, and do validation separately, much like the substring and span implementations. Is the RegEx engine fundamentally slow, or is the expression used in this benchmark too complex, or just poorly written?
- Try a tokenizer parser, like what a compiler would implement, or a JSON/XML parser. This would be like a hybrid of my state machine and span parsers, where each token could be efficiently found, like the span parser, then numbers parsed and semantics validated by a state machine. I believe this could result in a zero-allocation (excluding return value) parser, while retaining my state machine implementation's improved error messages.
- While chatting with my friend, Loïc, about preliminary benchmark results, he had the idea that the parser could return a struct equivalent of a version data structure. This might be interesting for NuGet server implementations, trying to minimise allocations per HTTP request.
- A comparer that compares two of these structs could do a zero allocation compare/sort. Although sorting structs involves copying/swapping the entire struct, whereas sorting reference types is just swapping references, which will fit in a single CPU register. Swapping data types larger than a single CPU register might harm performance more than allocations during parsing.
- One challenge is that the prerelease labels list needs heap allocations to create any kind of collection. If zero allocations is the goal, try validating the string at parse time, but then only store a single span representing the entire prerelease label, not each part. The compare method will need to count the number of . characters, but can then stackalloc Span<T>[count] to represent the prerelease labels. This makes each comparison more CPU bound, but do zero allocations make it worthwhile? Probably not, given the first blog post in the series got nearly double the performance by pre-parsing each prerelease segment as an int, and using more memory to store this. But this also warrants a more realistic benchmark. The sort benchmark sorted the list of all version strings on nuget.org. It's more plausible to sort all the versions of a single package, which is likely to have a much smaller percentage of versions with prerelease labels.
- There are other approaches that a state machine could take. The state machine I wrote for this benchmark was just the first idea that popped into my head. I didn't investigate other attempts, such as using a delegate, rather than a big switch statement, in the Next(char) method. I didn't investigate state machine packages. Maybe a Next(char) method isn't the best trigger to switch states. Are there other ways the state machine could avoid the StringBuilder and List<string> allocations?
- It might be interesting to investigate whether it's possible to tune the GC to run less frequently in the span and state machine benchmarks, so it has more to collect, and perhaps performs fewer Gen1 and Gen2 collections.