Exploring memory allocation and strings

A while back, I wrote about making code allocate less memory (go read it now if you haven’t). In that post, we saw how the Garbage Collector works and how it decides to keep objects around in memory or reclaim them. There’s one specific type we never touched on in that post: strings. Why would we? They look like value types, so they aren’t subject to Garbage Collection, right? Well… Wrong.

Strings are objects like any other object and follow the same rules. In this post, we will look at how they behave in terms of memory allocation. Let’s see what that means.

Strings are objects

When writing code in C#, sometimes it almost looks as if a string is a value type. They look immutable: re-assigning a string just replaces the value we are working with. We write code with string, we can compare strings using == knowing it compares the value of the string and not the reference, … But don’t be fooled! There’s quite some magic happening to make strings easy to work with, but they are in fact objects.

If we look at MSDN, we can read:

A string is an object of type String whose value is text. Internally, the text is stored as a sequential read-only collection of Char objects. (…) The Length property of a string represents the number of Char objects it contains, not the number of Unicode characters.

There we have it: strings are objects. They may hold an immutable array of Char and a length property that is a value type, but the bits of text we are passing around in memory are objects.
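A quick way to convince ourselves is to ask the runtime. The sketch below (the class and method names are made up for illustration) checks that System.String is a reference type, and shows that two strings holding the same characters can be equal by value while still being two separate objects:

```csharp
using System;

public static class StringsAreObjects
{
    // System.String is a class, so IsValueType is false.
    public static bool IsValueType() => typeof(string).IsValueType;

    // Builds two strings with the same characters at runtime and compares them.
    public static (bool sameValue, bool sameReference) Compare()
    {
        var first = new string('-', 25);
        var second = new string('-', 25);

        // == on strings compares the characters, not the references.
        bool sameValue = first == second;

        // ReferenceEquals compares object identity: two allocations, two objects.
        bool sameReference = object.ReferenceEquals(first, second);

        return (sameValue, sameReference);
    }
}
```

Calling StringsAreObjects.Compare() from a console application's Main and printing both values shows the difference between value equality and reference equality.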

Quick note: If you want to learn more about strings in C#, I highly recommend the chapter on strings in "C# in depth" and this tutorial on strings.

When are strings allocated?

Let’s start at the beginning. In any .NET application, a string is allocated when we new a string (which I haven’t seen happen too often), when we create one using quotes, e.g. "this is a new string", or when we load a string from somewhere else, for example a database or a remote HTTP API.

There are some other cases as well, but these are the main ones:

var a = new string('-', 25);

var b = "Hello, World!";

var c = await httpClient.GetStringAsync("http://blog.maartenballiauw.be"); // await the Task<string> to get the actual string

If we profile this, we can see that strings have indeed been allocated:

Figure: System.String allocation in dotMemory

Whoa! That’s a lot of strings! As you can imagine, the .NET runtime also needs a couple of strings to do its thing, and objects such as the HttpClient used in the above code snippet obviously also need to store HTTP requests and responses.

What’s interesting though, is that it seems our application is duplicating string content. Here’s "http://blog.maartenballiauw.be". We can see that string is in memory 6 different times.

Figure: String duplicates!

Just for fun: I also attached the profiler to a running devenv.exe (Visual Studio). If you ever wonder why it consumes so much memory, this is a good start to an explanation... Yes, that is the string "http://schemas.microsoft.com/winfx/2006/xaml/" duplicated 673 times.

Figure: Visual Studio string duplicates

String duplication isn’t bad though. As we’ve seen previously, the .NET Garbage Collector (GC) is quite fast at cleaning up objects, especially when they are short-lived. But just as with any other object type, it may be bad to have lots of duplicate strings when they move to higher heap generations like Gen 2 (or the large object heap if you have very large strings). We wouldn’t want our memory swallowed by a huge amount of unwanted string duplicates or (string) objects that aren’t being collected.

String literals

Are all strings allocated on the managed heap? No. The Common Language Runtime (CLR) does some optimization. Consider the following snippet of code:

var a = "Hello, World!";
var b = "Hello, World!"; 

If we run this piece of code, we will not see the string "Hello, World!" appear in the profiler. The reason for that is that the compiler optimizes this code and places the string "Hello, World!" in the #US metadata stream of the assembly (or more correctly, of the Portable Executable (PE)). When we run our application, the CLR reads these metadata values and loads them into a special place, called the intern pool. Every string literal in our code is placed in this pool, and duplicates simply reference the entry in the pool. If we’d run the following snippet, the result would be True, twice, because both the value and object reference of a and b are equal.

Console.WriteLine(a == b);
Console.WriteLine(Object.ReferenceEquals(a, b));

So in essence, "Hello, World!" is in memory only once - very optimized!
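To see that this sharing really is about literals, we can compare against a string with the same characters that is built at runtime. The following is a small sketch (class and method names are illustrative), assuming a regular JIT-compiled build:

```csharp
using System;

public static class LiteralInterning
{
    public static (bool literalsShareReference, bool runtimeCopyIsSeparate) Check()
    {
        string a = "Hello, World!";
        string b = "Hello, World!";

        // Built at runtime from a char array, so this is a new object
        // that happens to hold the same characters as the literal.
        string copy = new string(a.ToCharArray());

        // Both literals resolve to the same pooled instance.
        bool literalsShareReference = object.ReferenceEquals(a, b);

        // The runtime-built copy is equal by value, but it is a different object.
        bool runtimeCopyIsSeparate = a == copy && !object.ReferenceEquals(a, copy);

        return (literalsShareReference, runtimeCopyIsSeparate);
    }
}
```

Only the literals share an instance. Anything we construct ourselves is a separate allocation, unless we intern it explicitly.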

Quick note: What would be the better thing to use: "" or string.Empty? From what we learned so far, "" would be interned, which means it will only be stored in memory once, right? Right! At the IL level the two do differ: "" compiles to ldstr, which goes through the intern pool, while string.Empty compiles to ldsfld, which just loads an existing object reference. In practice this rarely matters: as pointed out in the comments below, from .NET 4.5 onwards the JIT compiler replaces ldstr "" with a load of String.Empty, so both end up as exactly the same machine code.

Let’s geek out a little bit. We can double-check the string literals using dotPeek, exploring the Portable Executable (PE) metadata tree. The full list of unique strings is added in the #Strings and #US (for User Strings) metadata streams.

Figure: Interned strings in PE header metadata

If you need some bedtime reading, the metadata streams are described in the ECMA-335 standard, section II.24.2.4. Under section III.4.16, we can see the Intermediate Language (IL) instruction ldstr loads a string literal from the metadata.

String interning

With string interning, we can store strings in the intern pool, a set of unique strings we can reference at runtime. We saw that the compiler optimizes string usage by storing string literals in the PE metadata and that the CLR adds those into the intern pool, making sure they are not duplicated.

Why aren’t all strings in our application interned then? There are several reasons for that… But before we answer that, let’s see how we can allocate strings on the intern pool ourselves.

We can intern strings manually by using the String.Intern method. We can check whether there is already an interned string with the same value (or the same “character sequence”, to be correct), using the String.IsInterned method.
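Despite its name, String.IsInterned does not return a bool: it returns the pooled instance when one exists, and null otherwise. Here is a minimal sketch of both methods working together (the class name and the "package-" prefix are made up, and a GUID is used only to get a value that is almost certainly not in the pool yet):

```csharp
using System;

public static class InternPoolLookup
{
    public static (bool pooledBefore, bool pooledAfter, bool sameInstance) Demo()
    {
        // Concatenated at runtime, so this string is not a literal and not pooled.
        string candidate = "package-" + Guid.NewGuid().ToString("N");

        // IsInterned returns null when no string with these characters is pooled.
        bool pooledBefore = string.IsInterned(candidate) != null;

        // Intern adds the string to the pool, or returns the entry that is already there.
        string interned = string.Intern(candidate);

        // From now on, a lookup by value finds the pooled instance.
        string found = string.IsInterned(candidate);
        bool pooledAfter = found != null;
        bool sameInstance = object.ReferenceEquals(interned, found);

        return (pooledBefore, pooledAfter, sameInstance);
    }
}
```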

For example, the following snippet will only keep two strings around:

var url = "http://blog.maartenballiauw.be";

var stringList = new List<string>();
for (int i = 0; i < 100; i++)
{
    stringList.Add(string.Intern(url + "/"));
}

Which strings? "http://blog.maartenballiauw.be" - in the intern pool because it’s a literal - and "http://blog.maartenballiauw.be/" (with the trailing slash) because we’re interning the string. 100 calls? No problem, we’re just adding the reference to that same string into our list. Nice and optimized! Why aren’t all strings interned by default, then?
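We can verify that every entry in the list points at one and the same object. The sketch below wraps the loop in a helper (InternedList, Build and AllSameInstance are illustrative names). One thing worth noting: the concatenation itself still allocates a temporary string on every iteration, so interning here saves long-lived memory, not the short-lived allocations.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class InternedList
{
    public static List<string> Build(string url, int count)
    {
        var list = new List<string>(count);
        for (int i = 0; i < count; i++)
        {
            // url + "/" allocates a temporary string each time around.
            // Intern hands back the single pooled instance; apart from the
            // first temporary (which becomes the pool entry), the
            // temporaries are garbage right away.
            list.Add(string.Intern(url + "/"));
        }
        return list;
    }

    // True when every element is the very same object as the first one.
    public static bool AllSameInstance(List<string> list) =>
        list.All(item => object.ReferenceEquals(item, list[0]));
}
```

Calling InternedList.Build("http://blog.maartenballiauw.be", 100) and passing the result to AllSameInstance confirms that the list holds 100 references to a single string.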

One reason is the classic “CPU vs. memory” debate. When using the intern pool, we’re increasing CPU usage as we’re checking if the string exists in there or not. When not using the intern pool, we’re just consuming memory.

While this is a valid argument, there is a better reason for not auto-interning all strings. Interned strings no longer appear on the heap and are allocated on the intern pool instead. There is no garbage collection on that pool: this means those strings will stay in memory forever (well, for the lifetime of the AppDomain), even if they are no longer being referenced. So use with caution!

With great power comes great responsibility

So what is it? Is string interning good? Or bad? It is fast and can be very good for optimizing memory.

War story! In the MyGet.org code base, we are using string interning for package IDs. We did some profiling on our application and found that there are not that many different package IDs around and in fact, since a package ID can exist in multiple versions, we were seeing a lot of duplicate package IDs in memory. We started interning package IDs, and have seen a nice improvement in memory usage with virtually no impact on CPU usage.
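To give an idea of what that could look like, here is a hypothetical sketch. PackageEntry is an invented type for illustration and is not taken from the MyGet code base; it only assumes that many objects share a small set of distinct IDs:

```csharp
using System;

// Hypothetical type for illustration; not the actual MyGet implementation.
public sealed class PackageEntry
{
    public string Id { get; }
    public string Version { get; }

    public PackageEntry(string id, string version)
    {
        // Few distinct IDs, many versions per ID: keep one shared instance per ID.
        Id = string.Intern(id);

        // Versions are stored as-is in this sketch; whether interning them
        // pays off depends on how often they repeat.
        Version = version;
    }
}
```

Two entries created from separately loaded ID strings with the same characters end up sharing a single Id instance, so the duplicates read from a database or API can be collected as soon as the constructor returns.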

As a rule of thumb, keep this in mind:

  • If an application has a lot of long-lived strings, but not a massive amount of unique strings, interning can improve memory efficiency.

  • If an application has a lot of long-lived strings, but these are almost all distinct values, string interning adds no benefit as the strings have to be stored anyway. Plus it may exhaust memory…

  • If an application has a lot of short-lived strings, trust the Garbage Collector to do its thing fast and efficiently.

  • When in doubt, measure. Use a memory profiler to detect string duplicates and analyze where they come from and how they can be optimized. Do watch out: you may see strings as a potential memory issue but they most probably are not.

Enjoy! And remember, don’t optimize what should not be optimized (but do optimize the rest).

P.S.: Thank you Wesley Cabus for reviewing!


13 responses

  1. David Pine, November 15th, 2016

    Awesome post, loved it! Thanks for sharing. I just wanted to point out a cool trick for linking PDF URLs. You can specify the page you want it to navigate to in the URL itself, so for the ECMA-335 link you could set the href to this http://www.ecma-internation... and it will navigate to section II.24.2.4.

  2. Maarten Balliauw, November 15th, 2016

    Thanks, fixing that :-)

  3. Jonathan Snow, November 16th, 2016

    Great post. Thanks for sharing.

  4. LetMeCodeThis, November 20th, 2016

    In relation to: "There is one downside however: each time we use "", the intern pool is checked which spends some precious CPU time.". Some time ago I've benchmarked that and the result was exactly the same so this made me curious. Looked at IL code and noticed you were right that different op code is being used (ldstr for "" and ldsfld for string.Empty) but from .NET 4.5 ldstr "" is replaced with ldsfld string [mscorlib]System.String::Empty by the jitter resulting with exactly the same asm code being generated.

  5. Maarten Balliauw, November 20th, 2016

    Thanks, made an update to that!

  6. Aron, November 24th, 2016

    Excellent posts! How did you get the Metadata node in the assembly explorer in DotPeek though?

  7. Maarten Balliauw, November 24th, 2016

    Thanks! The Metadata node is in the latest EAP of dotPeek - https://confluence.jetbrain... - it will surface in the next full release as well.

  8. Ivan Teles, March 1st, 2018

    As for the issue of hallucination of memory.
    I created a web app following the Dispose standard after using the features, even so in the task manager on the server I see that w3wp consumes around 1Gb of memory, how would I limit this use? either through windows or IIS, or even through my web app?

    Example: Creation of websites and applications

  9. CompitionPoint, January 27th, 2019

    Thanks, made an update to that!

  10. Shadman Kudchikar, October 13th, 2019

    Just a quick note to let you know that I included your post in my article “C# String.Format() and StringBuilder” https://kudchikarsk.com/c-stringformat-and-stringbuilder/

    We linked to https://blog.maartenballiauw.be/post/2016/11/15/exploring-memory-allocation-and-strings.html

  11. Rahul Kumar, April 15th, 2020

    I want to know the size of an object at runtime, i.e. how much memory it takes. Can you please suggest a way? I have a class with a few properties, and at runtime those property values are set. Now I want to know the size of the class object. Please help.

  12. Maarten Balliauw, April 16th, 2020

    Hi Rahul,

    In dotMemory (www.jetbrains.com/dotmemory), you can check the “Instances” view, which will show the object size of object instances. https://www.jetbrains.com/help/dotmemory/2020.1/Instances.html

  13. Anmol, February 18th, 2021

    This article provides a good and very detailed explanation. Thanks For Sharing .