A Guide to Self-Hosted LLM Coding Assistants

Assistive coding is one of the most powerful ways to apply large language models (LLMs). A well-trained model integrated into your development environment can supercharge your productivity.

Hosted models as-a-service have become cheaper and more effective as their adoption accelerates. However, there are cases in which you may want to run your own:

  • Privacy: Relying on hosted APIs requires sending your data to a third party to populate context.
  • Cost: Iterating often with a large codebase can incur charges quickly and discourage experimenting with different prompts.
  • New developments: Models with significant new capabilities appear on an almost weekly (or daily!) basis. Running your own open models can keep you ahead of the curve.

This article will walk you through choosing the right self-hosted LLM and integrating it into your development environment.

Doing it Yourself: An Overview

The most popular and effective commercial LLM coding assistant services pair highly-trained models with powerful hardware to deliver accurate suggestions quickly. To build your own, you’ll need the same: a refined model well-suited for coding tasks, compute hardware with sufficient power to generate tokens (or suggestion output) quickly, and an editor or IDE integration to bring it all together.

In this guide we’ll use Ollama to easily run our desired models, and we’ll assume that you have a functional Ollama API service available. If not, many resources are available to help you get started with Ollama; the official quick start documentation is a good starting point.

The primary setup considerations to bear in mind in the context of this article are:

  • More capable models need significant resources. If you intend to run models beyond 40 billion parameters, you likely need powerful local hardware (or optimized cloud resources).
  • Choosing the right model is as much an art as it is a science. This tutorial will help explore models and their capabilities, but experimenting with different models against your use case will help you select the most effective strategy.

We’ll look at editor integrations for VSCode, Emacs, and Neovim, though most editor ecosystems provide some method of integration with LLM APIs. When evaluating an LLM integration, compatibility with the Ollama API is the primary requirement; the models themselves can be swapped out from underneath the Ollama service.

To ensure a fair comparison across both models and editor integrations, we’ll compare models of similar parameter counts, then look at integrations, and finally demonstrate the two paired together.

A Brief Tour of Coding Models

Before diving into code-optimized LLMs, we should establish a baseline prompt to gauge their effectiveness against a uniform set of requirements.

Consider the following example:

Write a Python script that uses the click library to accept a URL argument that the script will POST input to with the request body populated from stdin and then pretty print the resulting response body. Print a friendly message if the response returns an HTTP error status code. Do not provide commentary or any other explanatory text, return code only.

This prompt judges a model’s ability to:

  • Use standard libraries (to read input from standard input)
  • Use third-party modules (with the click library)
  • Write functions that accept and use arguments
  • Perform network HTTP operations
  • Present user-readable output
  • Implement error handling

To follow along, the only prerequisite is a functional Ollama server accessible from your own machine with an ollama run command. If you’re running Ollama on your local machine, then ollama commands should default to localhost as the API endpoint. If you’re running Ollama on a cloud resource or other remote machine, then invoke your commands in the form of:

OLLAMA_HOST=ollama-host:11434 ollama run

Where ollama-host is the hostname or IP address of the remote server. The examples that follow show the results of entering our test prompt at the > prompt of the ollama run command.
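If you’d rather script these comparisons than type at the interactive prompt, the same request can be sent to Ollama’s REST API directly. Here’s a minimal sketch, assuming the ollama-host placeholder and the codeqwen model covered below; adjust both to match your setup:

# Sketch: send the baseline prompt to Ollama's /api/generate endpoint.
# "ollama-host" is a placeholder; substitute your server's hostname or IP.
import requests

PROMPT = (
    "Write a Python script that uses the click library to accept a URL argument "
    "that the script will POST input to with the request body populated from stdin "
    "and then pretty print the resulting response body. Print a friendly message if "
    "the response returns an HTTP error status code. Do not provide commentary or "
    "any other explanatory text, return code only."
)

resp = requests.post(
    "http://ollama-host:11434/api/generate",
    json={"model": "codeqwen", "prompt": PROMPT, "stream": False},
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])  # the model's generated code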

With these criteria in hand, let’s explore!

codeqwen

codeqwen is a coding-optimized model that, at 7B parameters, is small enough to fit into the memory of consumer-grade GPUs. Given our baseline Python program prompt, here’s an example response from codeqwen:

import click
import requests
import json


@click.command()
@click.argument('url')
def post_request(url):
    try:
        response = requests.post(url, data=input())
        response.raise_for_status()  # Raise an HTTPError for 4xx/5xx status codes
        print(json.dumps(json.loads(response.text), indent=2))  # Pretty print the response body
    except requests.HTTPError as http_err:
        print('HTTP error occurred:', http_err)  # Print HTTP error message
    except Exception as err:
        print('An error occurred:', err)  # Print other errors


if __name__ == '__main__':
    post_request()

Despite its relatively lightweight footprint, codeqwen does an excellent job at fulfilling our prompt, honoring our requirements with good patterns:

  • It leverages the requests library without any hints to do so.
  • Error handling catches both HTTP errors and other exceptions.
  • The standard json library pretty-prints responses (with indent=2).

The source format is suitable for use as a script and meets the requirements defined in our baseline prompt.

deepseek-coder

The deepseek-coder model comes in sizes similar to codeqwen’s, ranging from 1B to 33B. Although a v2 deepseek-coder model exists, we’ll contrast the 6.7B deepseek-coder model against codeqwen:7B to compare across similar parameter sizes.

Given the same initial prompt, deepseek-coder:6.7B returns the following Python:

import click
import requests
from pprint import pformat


@click.command()
@click.argument('url')
def post_request(url):
    try:
        resp = requests.post(url, data=click.getchar())
        if 400 <= resp.status_code < 600:
            click.echo("An HTTP error {} occurred.".format(resp.status_code), err=True)
        else:
            click.echo(pformat(resp.json()))
    except requests.exceptions.RequestException as e:
        click.echo('Error: {}'.format(e), err=True)


if __name__ == '__main__':
    post_request()

Like codeqwen:7B, deepseek-coder:6.7B correctly uses the chosen libraries and infers the use of requests as an HTTP client. Some noteworthy differences include:

  • Using the standard-library pprint module to pretty-print the JSON parsed by requests’ .json() method.
  • Manually checking that the response code falls within a successful range. This is slightly less ergonomic than codeqwen’s .raise_for_status() approach.
  • The click.getchar() function only accepts one character, and does not read from standard input.

This solution is mostly functional but misses a subtle success criterion from our original requirements: it reads a single character from an interactive prompt rather than accepting an entire payload from standard input.
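To illustrate how small the gap is, here is a lightly adjusted sketch of the same script that reads the entire payload from standard input. Only the data argument changes; the rest follows deepseek-coder’s structure:

import sys

import click
import requests
from pprint import pformat


@click.command()
@click.argument('url')
def post_request(url):
    try:
        # Read the full request body from stdin instead of a single character.
        resp = requests.post(url, data=sys.stdin.read())
        if 400 <= resp.status_code < 600:
            click.echo("An HTTP error {} occurred.".format(resp.status_code), err=True)
        else:
            click.echo(pformat(resp.json()))
    except requests.exceptions.RequestException as e:
        click.echo('Error: {}'.format(e), err=True)


if __name__ == '__main__':
    post_request()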

codellama

codellama is a Llama 2-based model from Meta tuned for coding and available in several parameter sizes, including 7B, 13B, 34B, and 70B. We’ll again use the 7B model so the comparison with codeqwen and deepseek-coder stays at a similar parameter size. Given the same baseline prompt, it returns the following:

import click
import requests
from rich.console import Console


@click.command()
@click.argument("url")
def main(url: str):
    console = Console()
    try:
        response = requests.post(url, data=console.input())
        if response.status_code == 200:
            console.print(response.json())
        else:
            console.print("Error: {}".format(response.status_code))
    except Exception as e:
        console.print("An error occurred: {}".format(e))

Once again, the differences between this solution and the previous attempts are interesting to explore:

  • The use of type annotations in main() is an optional but generally good practice to improve the readability of the code.
  • The rich library both accepts input and pretty-prints the response, an effective way to have one library serve two purposes. The console.input() method also properly accepts input via either stdin or interactive command line input.
  • Without a call to main() we’ll need to either invoke it manually or wrap it with a standard conditional check against __name__ == "__main__".

Although the generated code is missing the trailing __name__ check, its use of features like type annotations and additional libraries suggests that codellama may be well-suited for code that follows best practices. Larger parameter sizes also offer the potential for more comprehensive code completion.
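For completeness, the guard to append to the end of the generated file would look like this (main refers to the click command codellama defined above):

# Entry-point guard so the click command runs when the file is executed directly.
if __name__ == "__main__":
    main()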

llama3.1

Although not specifically tuned for coding, the llama3.1 model is a recent addition to the family of open-source models. With parameter counts ranging from 8B to 405B, it can compete with even the most capable commercial LLM offerings.

Queried against our baseline prompt, the 8B model returns the following:

import click
import json
from requests import post
from sys import stdin


@click.command()
@click.argument('url')
def main(url):
    """POST input to URL and print response"""
    req = post(url)
    print(json.dumps(req.json(), indent=4))
    if req.status_code < 400:
        return
    click.echo(f"Error {req.status_code}: {req.reason}", err=True)


if __name__ == '__main__':
    main()

Despite being a generalized model, llama3.1 performs relatively well against models trained specifically for coding tasks:

  • The requests library is the only other additional dependency besides click.
  • Unlike every other generated response, this function includes a docstring.
  • The code snippet is complete with a trailing conditional to run as a script when invoked from the command line.

However, the generated code has a glaring bug: it never accepts any input! Despite importing stdin, the call to post() never includes a request body. While the solution is well-formatted and runs, it misses our requirement for payload input entirely.

In an environment with sufficient computing power, llama3.1 may still be worth exploring: its larger variants are more capable than most of the other models we’ve explored so far.

Editor Integration

Integrating your chosen model into an editor is a key step toward making full use of an LLM. Both chat-based assistive coding and autocomplete-style generation become significantly more useful when they are available directly where you write code.

The Ollama API abstraction layer provides a uniform access method to different coding models, which makes editor integration simpler. Even across different coding assistant extensions or packages, as long as each is compatible with the Ollama API, the backend can remain the same without any significant friction.
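As a quick sanity check before wiring up an editor, you can ask the Ollama API which models it currently serves. Here is a minimal sketch, assuming ollama-host stands in for your endpoint:

# Sketch: list the models available from the Ollama server that editor
# integrations will be able to request. "ollama-host" is a placeholder.
import requests

resp = requests.get("http://ollama-host:11434/api/tags", timeout=10)
resp.raise_for_status()
for model in resp.json().get("models", []):
    print(model["name"])  # e.g. codeqwen:latest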

Let’s look at integrations that can present a convenient interface in a variety of editors.

VSCode

Among the Ollama-compatible extensions available on the VSCode Extension Marketplace, we’ll use Ollama Autocoder as an illustrative case for integrating an Ollama model into VSCode.

After installing the extension, open its settings:

Scroll to the Endpoint setting and change it to the Ollama API endpoint for your running service. For example, in this screenshot, replace ollama-host with either your Ollama endpoint IP address or hostname:

After making these changes, you’re ready to try Ollama-powered autocompletion!

Open a Python source code file and begin with an appropriate script shebang like #!/usr/bin/env python. Follow it with a comment containing our test prompt, and position the cursor at the end of the file. The extension will pick up at the cursor location, sending the preceding text to the model as context. Use the command palette (Ctrl-Shift-P) to execute the command Autocomplete with Ollama. The extension should stream generated code:

The preceding recording shows the codeqwen model generating code on a legacy NVIDIA GTX 1070 GPU.
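For reference, the seed buffer used for that completion might look like the following sketch; the exact shebang and comment wording are up to you, since they simply give the model context to complete from:

#!/usr/bin/env python
# Write a Python script that uses the click library to accept a URL argument
# that the script will POST input to with the request body populated from stdin
# and then pretty print the resulting response body. Print a friendly message if
# the response returns an HTTP error status code. Do not provide commentary or
# any other explanatory text, return code only.

# Cursor goes here; run "Autocomplete with Ollama" from the command palette.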

Emacs

Packages like gptel and org-ai offer integration with LLM-based APIs, but here we’ll use ellama for a simple demonstration of code completion with Ollama.

The following elisp snippet uses the use-package macro to install and configure the ellama package and set its provider backend to an Ollama API at http://ollama-host:11434:

(use-package ellama
  :config
  (require 'llm-ollama)
  ;; Construct an llm-ollama provider pointing at the remote Ollama host.
  (setq ellama-provider
        (make-llm-ollama
         :host "ollama-host"
         :port 11434
         :chat-model "codeqwen"
         :embedding-model "codeqwen")))

Place the snippet in ~/.emacs.d/init.el and reload Emacs, or evaluate the use-package form directly. Position the point in a source code file where you’d like to generate code and invoke M-x ellama-code-complete. The following recording demonstrates this in a Python source file preceded by our baseline prompt:

As with the VSCode extension, ellama will rely on the Ollama API endpoint to generate code based on the editor context. Completion input streams into the open buffer until it finishes.

Neovim

The Neovim plugin model.nvim integrates with APIs like Ollama’s and provides streaming completion, much like the VSCode and Emacs integrations above.

To install and configure model.nvim with the lazy.nvim plugin manager, use the following configuration under the section for require("lazy"). This will also prepare the plugin to use the hosted Ollama API endpoint:

require("lazy").setup({
spec = {
"gsuuon/model.nvim",
config = function()
local ollama = require('model.providers.ollama')
require('model').setup({
prompts = {
['ollama:codeqwen'] = {
provider = ollama,
options = {
url = "http://ollama-host:11434",
},
params = {
model = "codeqwen",
},
builder = function(input)
return {
prompt = '<|im_start|>' .. input .. '<|im_stop|><|im_start|>assistant',
stops = { '<|im_stop|>' },
}
end,
}
},
})
end,
},
})

The ['ollama:codeqwen'] entry stores the configuration for our Ollama prompt. Change the url option to point at your Ollama API endpoint, shown in this example as ollama-host.

Start Neovim with nvim, open a new Python source buffer with :e main.py, and populate the buffer with the baseline prompt. Move the cursor to a new line and invoke the command :Model ollama:codeqwen. The plugin will stream the response into the buffer to complete the code:

Edit the plugin configuration as necessary to define additional models to use with the :Model command.

Summary

In this guide, we have:

  • Evaluated a variety of different large language models and their effectiveness at fulfilling coding instructions.
  • Showcased editor extensions for VSCode, Emacs, and Neovim that integrate with the Ollama API.
  • Combined large language models that fit on consumer-grade hardware with those extensions to stream generated code completion.

To continue exploring self-hosted models:

Originally published at https://semaphoreci.com on September 3, 2024.
