Running Large Language Models locally – Your own ChatGPT-like AI in C#

For the past few months, a lot of news in tech as well as mainstream media has been about ChatGPT, an Artificial Intelligence (AI) product by the folks at OpenAI. ChatGPT is built on a Large Language Model (LLM) that is fine-tuned for conversation. At the risk of undervaluing the technology, it’s a smart-looking chat bot that you can ask questions about a variety of domains.

Until recently, using these LLMs required relying on third-party services and cloud computing platforms. To integrate any LLM into your own application, or simply to use one, you’d have to swipe your credit card with OpenAI, Microsoft Azure, or others.

However, with advancements in hardware and software, it is now possible to run these models locally on your own machine and/or server.

In this post, we’ll see how you can have your very own AI powered by a large language model running directly on your CPU!

Towards open-source models and execution – A little bit of history…

A few months after OpenAI released ChatGPT, Meta released LLaMA. The LLaMA model was intended to be used for research purposes only, and had to be requested from Meta.

However, someone leaked the weights of LLaMA, and this has spurred a lot of activity on the Internet. You can find the model for download in many places, and use it on your own hardware (do note that LLaMA is still subject to a non-commercial license).

In comes Alpaca, a fine-tuned LLaMA model by Stanford. And Vicuna, another fine-tuned LLaMA model. And WizardLM, and …

You get the idea: LLaMA spit up (sorry for the pun) a bunch of other models that are readily available to use.

While part of the community was training new models, others were working on making it possible to run these LLMs on consumer hardware. Georgi Gerganov released llama.cpp, a C++ implementation that can run the LLaMA model (and derivatives) on a CPU. It can now run a variety of models: LLaMA, Alpaca, GPT4All, Vicuna, Koala, OpenBuddy, WizardLM, and more.

There are also wrappers for a number of languages, including .NET through SciSharp/LLamaSharp.

Let’s put that one to the test!

Getting started with SciSharp/LLamaSharp

Have you heard about the SciSharp Stack? Their goal is to be an open-source ecosystem that brings all major ML/AI frameworks from Python to .NET – including LLaMA (and friends) through SciSharp/LLamaSharp.

LLamaSharp is a .NET binding of llama.cpp and provides APIs to work with the LLaMA models. It works on Windows and Linux, and does not require you to think about the underlying llama.cpp. It does not support macOS at the time of writing.

Great! Now, what do you need to get started?

Since you’ll need a model to work with, let’s get that sorted first.

1. Download a model

LLamaSharp works with several models, but the support depends on the version of LLamaSharp you use. Supported models are linked in the README, so do go explore a bit.

For this blog post, we’ll be using LLamaSharp version 0.3.0 (the latest at the time of writing). We’ll also use the WizardLM model, more specifically the wizardLM-7B.ggmlv3.q4_1.bin model. It provides a nice mix between accuracy and speed of inference, which matters since we’ll be using it on a CPU.

There are a number of more accurate models (or faster, less accurate models), so do experiment a bit with what works best. In any case, make sure you have 2.8 GB to 8 GB of disk space for the variants of this model, and up to 10 GB of memory.
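Once the download finishes, it can help to confirm that the file is where you expect it and roughly the size you expect. The following is a minimal sketch that uses only the .NET base class library. The helper name DescribeModelFile is hypothetical, and the path is an assumption you will need to adjust.

```csharp
using System;
using System.IO;

// Hypothetical helper: reports whether a model file exists and how large it is.
static string DescribeModelFile(string path)
{
    if (!File.Exists(path))
    {
        return $"Model file not found: {path}";
    }

    // FileInfo.Length is the size in bytes; convert to GiB for readability.
    var sizeInGiB = new FileInfo(path).Length / (1024.0 * 1024.0 * 1024.0);
    return $"Found {Path.GetFileName(path)} ({sizeInGiB:F2} GiB)";
}

// Assumed location: adjust to wherever you saved the download.
var modelPath = "wizardLM-7B.ggmlv3.q4_1.bin";
Console.WriteLine(DescribeModelFile(modelPath));
```

A file that is much smaller than the size listed on the download page usually points to an interrupted download.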

2. Create a console application and install LLamaSharp

Using your favorite IDE, create a new console application and copy in the model you have just downloaded. Next, install the LLamaSharp and LLamaSharp.Backend.Cpu packages. If you have a CUDA-capable GPU, you can also use the CUDA backend packages.
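For reference, the relevant parts of the project file could look roughly like the fragment below. This is a sketch, not the actual project file used for this post: the backend package version is assumed to match LLamaSharp 0.3.0. The x64 platform target is included because one reader reported in the comments that an Any CPU build produced empty responses with the CPU backend.

```xml
<!-- Fragment of a hypothetical LocalLLM.csproj; package versions are assumptions -->
<PropertyGroup>
  <PlatformTarget>x64</PlatformTarget>
</PropertyGroup>

<ItemGroup>
  <PackageReference Include="LLamaSharp" Version="0.3.0" />
  <PackageReference Include="LLamaSharp.Backend.Cpu" Version="0.3.0" />
</ItemGroup>
```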

Here’s our project to start with:

[Image: LocalLLM project in JetBrains Rider]

With that in place, we can start creating our own chat bot that runs locally and does not need OpenAI to run.

3. Initializing the LLaMA model and creating a chat session

In Program.cs, start with the following snippet of code to load the model that we just downloaded:

using LLama;

var model = new LLamaModel(new LLamaParams(
    model: Path.Combine("..", "..", "..", "Models", "wizardLM-7B.ggmlv3.q4_1.bin"),
    n_ctx: 512,
    interactive: true,
    repeat_penalty: 1.0f,
    verbose_prompt: false));

This snippet loads the model from the directory where you stored your downloaded model in the previous step. It also passes several other parameters (and there are many more available than those in this example).

For reference:

  • n_ctx – The size of the context window in tokens (in other words, how many tokens the model can take into account at once, prompt and generated text combined).
  • interactive – Specifies you want to keep the context in between prompts, so you can build on previous results. This makes the model behave like a chat.
  • repeat_penalty – Determines the penalty for tokens that were generated recently, which discourages the model from repeating itself. A value of 1.0 means no penalty.
  • verbose_prompt – Toggles printing of the prompt and its tokens before generation starts, which is mostly useful for debugging.

Again, there are many more parameters available, most of which are explained in the llama.cpp repository.
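To make repeat_penalty a bit more concrete, here is a small standalone sketch of the kind of adjustment llama.cpp applies before sampling: tokens that were generated recently get their score (logit) pushed down. The function name and the sample values are made up for illustration, and this is a simplification rather than the library’s actual code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: lowers the score of every token that was generated recently.
static float[] ApplyRepeatPenalty(float[] logits, IEnumerable<int> recentTokens, float penalty)
{
    var adjusted = (float[])logits.Clone();

    // Use a set so a token that occurred several times is only penalized once.
    foreach (var token in new HashSet<int>(recentTokens))
    {
        // Dividing a positive score, or multiplying a negative one, makes the token less likely
        // (assuming penalty > 1).
        adjusted[token] = adjusted[token] > 0
            ? adjusted[token] / penalty
            : adjusted[token] * penalty;
    }

    return adjusted;
}

// Made-up scores for a vocabulary of four tokens; tokens 0 and 2 were generated recently.
var sampleLogits = new[] { 2.0f, 1.0f, -1.0f, 0.5f };
var recentlyGenerated = new[] { 0, 2 };

// A penalty of 1.0 leaves all scores unchanged.
Console.WriteLine(string.Join(" | ", ApplyRepeatPenalty(sampleLogits, recentlyGenerated, 1.0f)));

// A penalty of 2.0 pushes the scores of tokens 0 and 2 down.
Console.WriteLine(string.Join(" | ", ApplyRepeatPenalty(sampleLogits, recentlyGenerated, 2.0f)));
```

With the value of 1.0 used in this post, the scores are left untouched, so the penalty is effectively switched off. Values above 1.0 discourage repetition.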

Next, we can use our model to start a chat session:

var session = new ChatSession<LLamaModel>(model)
    .WithPrompt(...)
    .WithAntiprompt(...);

Of course, these ... don’t compile, but let’s explain first what is needed for a chat session.

The .WithPrompt() (or .WithPromptFile()) method specifies the initial prompt for the model. This can be left empty, but is usually a set of rules for the LLM. Find some example prompts in the llama.cpp repository, or write your own.

The .WithAntiprompt() method specifies the anti-prompt (called the reverse prompt in llama.cpp): one or more strings that, when the model generates them, stop generation and hand control back to the user for input.
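Conceptually, an anti-prompt acts as a stop marker: generation continues until the text produced so far ends with one of the anti-prompts, and then it is the user’s turn. The sketch below imitates that behavior with a plain sequence of strings standing in for model output. TakeUntilAntiprompt is a hypothetical helper, not part of LLamaSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: yields pieces of output until the accumulated text ends with an anti-prompt.
static IEnumerable<string> TakeUntilAntiprompt(IEnumerable<string> tokens, string[] antiprompts)
{
    var seen = new StringBuilder();
    foreach (var token in tokens)
    {
        yield return token;
        seen.Append(token);

        var text = seen.ToString();
        foreach (var antiprompt in antiprompts)
        {
            if (text.EndsWith(antiprompt, StringComparison.Ordinal))
            {
                yield break; // hand control back to the user
            }
        }
    }
}

// Fake output standing in for what a model might generate.
var fakeOutput = new[] { "D'oh", "!", "\n", "User", ": ", "this is never reached" };
foreach (var piece in TakeUntilAntiprompt(fakeOutput, new[] { "User: " }))
{
    Console.Write(piece);
}
```

Note that the anti-prompt itself is part of the output, which is why it doubles as the visible cue that the model is waiting for input.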

Here’s how to set up a chat session with an LLM that is Homer Simpson:

var session = new ChatSession<LLamaModel>(model)
    .WithPrompt("""
        You are Homer Simpson, and respond to User with funny Homer Simpson-like comments.

        User:
        """)
    .WithAntiprompt(new[] { "User: " });

We’ll see in a bit what results this Homer Simpson model gives, but generally you will want to be more detailed in what is expected from the LLM. Here’s an example chat session setup for a model called “LocalLLM” that is helpful as a pair programmer:

var session = new ChatSession<LLamaModel>(model)
    .WithPrompt("""
        You are a polite and helpful pair programming assistant.
        You MUST reply in a polite and helpful manner.
        When asked for your name, you MUST reply that your name is 'LocalLLM'.
        You MUST use Markdown formatting in your replies when the content is a block of code.
        You MUST include the programming language name in any Markdown code blocks.
        Your code responses MUST be using C# language syntax.

        User:
        """)
    .WithAntiprompt(new[] { "User: " });

Now that we have our chat session, we can start interacting with it. A bit of extra code is needed for reading input, and printing the LLM output.

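Console.WriteLine();
Console.Write("User: ");
while (true)
{
    // Read the user's input in green; stop when the input stream ends
    Console.ForegroundColor = ConsoleColor.Green;
    var input = Console.ReadLine();
    if (input is null) break;
    var prompt = input + "\n";

    // Print the model's reply in white as it is generated
    Console.ForegroundColor = ConsoleColor.White;
    foreach (var output in session.Chat(prompt, encoding: "UTF-8"))
    {
        Console.Write(output);
    }
}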

That’s pretty much it. The chat session in the session variable is prompted using its .Chat() method, and all outputs are returned token by token, like any generative model.
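Since the output arrives as a sequence of strings, it is easy to wrap the loop in a small helper, for example to keep the full reply around or to get a rough feel for inference speed. The sketch below works on any IEnumerable<string>. StreamToConsole is a hypothetical name, and the fake stream stands in for the result of session.Chat(...), which is assumed to yield strings as in the loop above.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;

// Hypothetical helper: prints each piece as it arrives and returns the complete reply.
static string StreamToConsole(IEnumerable<string> outputs)
{
    var buffer = new StringBuilder();
    foreach (var output in outputs)
    {
        Console.Write(output);
        buffer.Append(output);
    }
    return buffer.ToString();
}

// Stand-in for: session.Chat(prompt, encoding: "UTF-8")
var fakeStream = new[] { "Mmm", "... ", "donuts", "." };

var stopwatch = Stopwatch.StartNew();
var fullReply = StreamToConsole(fakeStream);
stopwatch.Stop();

Console.WriteLine();
Console.WriteLine($"{fullReply.Length} characters in {stopwatch.ElapsedMilliseconds} ms");
```

Keeping the full reply as a string makes it possible to log conversations or post-process the answer, while the user still sees the text appear as it is generated.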

You want to see this in action, right? Here’s the “Homer Simpson chat”:

[Image: Homer Simpson local large language model]

The more useful “C# pair programmer chat”:

[Image: Helpful C# programming bot large language model]

Pretty nice, no?

On my Windows laptop (i7-10875H CPU @ 2.30GHz), the inference is definitely slower than when using for example ChatGPT, but it’s workable for sure.

Wrapping up

Because of the hardware needs, using LLMs until recently required third-party services and cloud platforms like OpenAI’s ChatGPT.

In this post, we’ve seen some of the history of open-source large language models, and how the models themselves as well as the surrounding community have made it possible to run these models locally.

I’m curious to hear what you will build using this approach!

13 responses

  1. Arjun Krishna, June 16th, 2023

    How do we extend the above code to support QnA on our files?

  2. Alex Resnik, June 18th, 2023

    Hello

    The file wizardLM-7B.ggmlv3.q4_1.bin is too big to download. Please send me an alternate source.

    Thank you

    Alex

  3. Long, June 20th, 2023

    Thank you it very helpful! Can LLamaSharp support fine tuning and other language (not English) ?

  4. Pedro Hernandez, June 24th, 2023

    I loved your article, I hope to see more on this topic, especially with C# and models that can be executed locally. Thank you so much.

  5. Rolf, June 28th, 2023

    Thanks loads! This is still blocked from commercial use though like in a game?

  6. Maarten Balliauw, June 29th, 2023

    “The file wizardLM-7B.ggmlv3.q4_1.bin is too big to download. Please send me an alternate source.”

    That’s how it is :-) You can find some other models at the link mentioned in the blog post, but they all are several GB in size.

  7. Maarten Balliauw, June 29th, 2023

    @Rolf yes, the models are all still under a non-commercial license so do check the licenses

  8. Leon, June 30th, 2023

    Hello Maarten, thank you for this helpful article! When I run the code (with model wizardLM-7B.ggmlv3.q5_1.bin) the ChatSession responds, but not always. Using the same question/prompt results in an answer half of the times, the other times the result from Chat is empty (and no exception either). Do you know what may be the cause of this?

  9. Leon, June 30th, 2023

    Hi Maarten, sorry bothering you twice :-) but I found the cause of not always getting output. I am using the LLamaSharp.Backend.Cpu package and I was using the Any CPU build setting. But now since I changed it to x64 it all works fine and get responses all the time.

  10. PsillyPseudonym, July 5th, 2023

    Thanks for this tutorial!

    I couldn’t find a clear answer to this question: Can you use LLamaSharp in .NetFramwork applications? The example they have on github works fine in a .NetCore application but when I try it in .Netframework, I get the error: “System.TypeInitializationException: ‘The type initializer for ‘LLama.Native.NativeApi’ threw an exception.’ RuntimeError: The native library cannot be found.”

  11. Tom, July 11th, 2023

    So what if we want to run it using the GPU instead?

  12. Mlost, July 13th, 2023

    Hello, great tutorial! Everything is working fine on CPU (R2700), but i’m trying to figure out what to do to put my GPU (3080 10gb) to work. I got already installed Backend.Cuda12 package (is it ok or it suppose to be Cuda11?). Should i use some specific LLamaParams just for gpu (i guess so but which to use? The only one that is obvious is n_gpu_layers); are there some variables (just for cpu) from the tutorial code to delete? Should i start fresh new project without installing Backend.Cpu? I was searching for the solution online but there’s not much about LLamaSharp so haven’t found any (or I was looking at one but didn’t understood, sometimes I’m dum, also not a programmer). Thanks in advance.

  13. Hakan Nykvist, August 5th, 2023

    Thansk, it works well! on my I9 laptop

    Can LLamaSharp support fine tuning and other language (not English) ?