IF YOU ARE NEW TO RUNNING OFFLINE AI MODELS FOR THE LOVE OF GOD READ THIS:
KoboldCPP is a program used for running offline LLMs (AI models).
However, it does not include any LLMs itself, so we will have to download one separately.
Running KoboldCPP and other offline AI services uses up a LOT of computer resources.
We only recommend using this feature if you have a powerful GPU or a second computer to offload the work to.
Ideally you want an NVIDIA GPU for the best performance.
You will most likely have to spend some time testing different models and performance settings to get the best result with your machine. This is still "experimental" technology.
Basic Terminology:
LLM: Large Language Model, the backbone tech of AI text generation.
7B, 13B etc: How many billions of parameters an LLM has. More parameters = "smarter" (subjective), but also more resource intensive.
HuggingFace: Website which hosts a whole heap of free LLMs.
Why run an offline LLM instead of using ChatGPT?
- It's free! Apart from having the expensive hardware to run it in the first place...
- Less censorship with offline LLM models.
KoboldCPP Download
- Go here: https://github.com/LostRuins/koboldcpp and on the right-hand side click the latest release.
- Download the latest koboldcpp.exe file and place it on your desktop.
- Well done, you have KoboldCPP installed! Now we need an LLM.
LLM Download
- Currently KoboldCPP supports both .ggml (soon to be outdated) and .gguf models. For this tutorial we are going to download an LLM called MythoMax. You can use any other compatible LLM; check the Discord's #llm channel to find more.
- Go here: https://huggingface.co/TheBloke/MythoMax-L2-13B-GGML and click Files and Versions.
- Any of the .bin files will work; in this case, download mythomax-l2-13b.ggmlv3.q5_K_M.bin.
Note on qN levels: Different models come in different qN (quantization) levels. The higher the q number, the less compressed the model is: it uses more VRAM but produces better quality text. We found that q4_K_M and q5_K_M are a good sweet spot. Do not go lower than q4_K_S.
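As a rough, back-of-the-envelope way to size these files (approximate numbers only): file size ≈ parameter count × bits per weight ÷ 8. A 13B model at q4_K_M (~4.8 bits per weight) works out to roughly 8 GB, while q5_K_M (~5.5 bits per weight) comes to roughly 9 GB, and you want at least that much VRAM/RAM (plus headroom for context) to run it comfortably.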
TL;DR: download a larger .bin file for big smart AI.
- Once you have downloaded the file, place it on your desktop, or wherever you want to store these files.
KoboldCPP Setup
- Run koboldcpp.exe as Admin.
- Once the menu appears there are 2 Presets we can pick from. Use the one that matches your GPU type:
1. CuBLAS = Best performance for NVIDIA GPUs
2. CLBlast = Best performance for AMD GPUs
- For GPU Layers enter "43". This is how many layers of the LLM will be offloaded to the GPU. Different LLMs have different maximum layer counts (7B models use 35 layers, 13B models use 43, etc.). If your computer is choking when generating an AI response, you can tone this number down. (A command-line equivalent of these settings is sketched just after this setup list.)
- Make sure Launch Browser and Streaming Mode are enabled.
- Click Browse and select the LLM file we downloaded earlier.
- Your menu should look like this (I am using an NVIDIA GPU):
[screenshot of the KoboldCPP launcher menu with the settings above]
- If everything looks good hit Launch.
- A web browser interface should pop up. Just leave the command terminal running and we are all set to connect this to the mod.
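For reference, you can also skip the launcher menu and pass the same settings on the command line. This is only a sketch; flag names vary between KoboldCPP releases (as noted in the comments below, newer builds dropped the Streaming Mode toggle and stream by default), so run koboldcpp.exe --help to check your version:
koboldcpp.exe --model mythomax-l2-13b.ggmlv3.q5_K_M.bin --gpulayers 43 --usecublas --stream
(For AMD, replace --usecublas with --useclblast 0 0.)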
- In the configuration menu for the mod, all we need to do is point to the URL of the KoboldCPP server running on your computer. Depending on where you are hosting the Herika server, there are 2 ways to point to it.
- UWAMP = Just set the configuration as: $KOBOLDCPP_URL="http://localhost:5001";
- DwemerDistro = Because it's hosted on another "Virtual Machine" on your PC, you cannot simply point it to localhost. You will need to use your computer's private IP address. This is easy to find:
Open up a Command Prompt as Admin
Run this command: ipconfig
Under your primary WiFi/Ethernet adapter you should see an IPv4 Address; copy that. (It should look something like 192.168.x.x or 172.16.x.x.)
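The output should include a block like this (your adapter name and address will differ):
Ethernet adapter Ethernet:
   IPv4 Address. . . . . . . . . . . : 192.168.81.32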
Your KoboldCPP_URL configuration should look like this: $KOBOLDCPP_URL="http://192.168.81.32:5001"; (Replace with your own IP address)
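A quick sanity check before touching the game: with KoboldCPP running, try opening that URL in a browser, or (assuming your KoboldCPP build exposes the standard KoboldAI-compatible API) query it from a terminal, replacing the IP with your own:
curl http://192.168.81.32:5001/api/v1/model
If that returns the name of the loaded model, the Herika server should be able to reach KoboldCPP at the same URL.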
- Launch the game, open up the MCM menu and under $SPG set a hotkey for switching between AI models.
- Press the hotkey and you should be in KoboldCPP mode.
- If everything is set up correctly you should get a response from Herika! You can check the KoboldCPP command terminal to see more information about the AI generation.
- Make sure to play around with the KoboldCPP settings and other LLMs to find the best performance for your computer!
Comments
Will the program work well?
I was running into some trouble integrating with the latest KoboldCpp (1.50.1), however. I looked at the terminal for Kobold and saw that it was throwing a Python error indicating an expected string type was getting set to null, and I could also see that the prompt was missing from the payload. I stepped through koboldcpp.php and realized that there are some additional required variables for the connector in the server configuration that are not initialized by default (assuming I followed the instructions correctly).
So, if anyone else missed this too, this was my configuration that ultimately worked:
$CONNECTORS=["koboldcpp"];
...
$CONNECTOR["koboldcpp"]["url"]="http://YOUR_IPV4_ADDRESS:5001";
$CONNECTOR["koboldcpp"]["max_tokens"]=100;
$CONNECTOR["koboldcpp"]["temperature"]=0.98;
$CONNECTOR["koboldcpp"]["rep_pen"]=1.04;
$CONNECTOR["koboldcpp"]["top_p"]=0.9;
$CONNECTOR["koboldcpp"]["MAX_TOKENS_MEMORY"]=512;
$CONNECTOR["koboldcpp"]["template"]="alpaca"; // ADDED - A template is required in the php code for the prompt to be provided in the payload to kobold
$CONNECTOR["koboldcpp"]["use_default_badwordsids"]=false; // ADDED - Was defaulting to null if not provided
$CONNECTOR["koboldcpp"]["eos_token"]=""; // ADDED - Defaults to null, unless another flag is used to set it to "\n", which then throws the NoneType issue trying to invoke .encode()
I'm using the 13B Q8_0 MythoMax GGUF as my LLM (it uses >13 GB of VRAM). A really great model, but then I don't have enough VRAM left for SkyrimVR when running at 43 GPU layers on my 4090(!!), so I cut it back to 28. Unfortunately, for whatever reason, that is the difference between 3-5s responses and 30s+ responses. There's also a 13B Q5_K_M version of the model that uses *only* >9 GB of VRAM, which is a bit faster on 28 layers... I felt like I got less random responses on the Q8.
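(Rough arithmetic on why the layer count hurts so much, approximate numbers only: the Q8_0 file is ~13.8 GB, so offloading 28 of its 43 layers keeps about 28/43 ≈ 65% of the weights, roughly 9 GB, on the GPU; the remaining layers are evaluated on the CPU for every token, which is where the 30s+ responses come from.)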
Basically, all the information about the addresses of the WebUIs, the right IP address to fill into SimpleGateWayer.ini, and the path to the conf.php file can be found in the terminal window that opens when you run the Dwemer Distro!
One hint: very often, using "localhost" instead of the real IP address of your local machine causes trouble with the firewall, since the server application is running on a virtual machine. So better to use the real IP address.
Hope that helps.
Just use it.
You may want to try the ROCm port of KoboldCPP if you have an AMD card, rather than CLBlast; it's much faster. On Windows you shouldn't need to install the ROCm SDK, as the relevant DLLs should be included with the Adrenalin drivers or baked into the .exe. However, only 7000-series and high-end 6000-series cards have full ROCm support (mid-range or lower-end 6000 cards *might* work, but might not).
When the option for Streaming Mode in the launcher was removed, apparently it was actually made the default, so I don't think you need to use 1.42.1 to get it.