Cipher – A Vision for the Future of Tab Completion
Tab-completion LLMs (large language models) are models that provide real-time suggestions while a developer types code. They are often integrated into editors ranging from Vim to IntelliJ as plugins, but are sometimes an integral part of the editor. In this post, I will lay out some thoughts on what I think the future of tab-completion systems will look like.
Current State of Tab Completion LLMs #
Right now, LLM-based tab completion systems are limited by several things:
- Limited context awareness, since even a rough sense of what the user wants and a large compute budget aren’t enough to provide accurate completions without visibility into the surrounding code and the user’s recent work.
- Limited compute budget for inference, since latency needs to be kept low for tab completions to be useful.
- Limited training data availability, since training a model to provide good completions isn’t just a matter of seeing good code, but also having high quality examples of how to be helpful.
Let’s briefly examine the recent history of tab-completion LLMs.
The Simplest Tab-Completion LLMs #
In the early days, tab completion was performed by LLM base models that had been trained on large amounts of code. These models would be provided with up to N tokens of context from the left side of the user’s cursor and would try to predict what comes to the right.
There was no defined prompt format here, and no structure to the result. This was a messy approach, requiring the editor to detect a number of end-conditions to keep the model from rambling, such as the prediction dropping to a lower indentation level, or the prediction running into suggestions that match existing code. Sometimes these models would be limited to generating only until the end of the line. It is very hard to get good results out of such a limited system, because the model has no clue what code the user has already written below the cursor. It could spew a large completion that is largely duplicative of the code that comes after, yet just different enough not to trigger any of the end-conditions, wasting everyone’s time.
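To make those end-conditions concrete, here is a minimal sketch of the editor-side loop, assuming a generate_lines callback that streams completed lines from a base model; the function names and heuristics are hypothetical, not taken from any particular editor.

```python
# A minimal sketch of editor-side stop heuristics for a left-context-only
# base model. `generate_lines` stands in for any code model that streams
# completed lines; the heuristics and names are hypothetical.
from typing import Callable, Iterable

def indent_width(line: str) -> int:
    return len(line) - len(line.lstrip(" \t"))

def complete(generate_lines: Callable[[str], Iterable[str]],
             left_context: str, right_context: str,
             max_lines: int = 8) -> str:
    """Stream raw model output and cut it off before it starts to ramble."""
    last_line = (left_context.splitlines() or [""])[-1]
    cursor_indent = indent_width(last_line)
    kept: list[str] = []
    for line in generate_lines(left_context):
        # End condition 1: the model dedented out of the block being written.
        if line.strip() and indent_width(line) < cursor_indent:
            break
        # End condition 2: the model is re-emitting code that already exists
        # below the cursor.
        if line.strip() and line.strip() in right_context:
            break
        kept.append(line)
        if len(kept) >= max_lines:  # hard cap so the model cannot ramble forever
            break
    return "\n".join(kept)
```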
This brings us to FIM:
FIM LLMs #
After this, tab-completion LLMs were trained on the FIM (fill-in-the-middle) task. These LLMs would receive both the left and right context from the user’s cursor, and try to predict the middle. This helps the LLM to focus, since it knows the scope of the current completion.
To provide an example of this, I will reference one from the codegemma-2b description.
<|fim_prefix|>import datetime
def calculate_age(birth_year):
    """Calculates a person's age based on their birth year."""
    current_year = datetime.date.today().year
    <|fim_suffix|>
    return age<|fim_middle|>
Here, the LLM will know that the cursor is currently on the line before return age, and its goal is to provide code that will fit between the current_year and return age lines. It will return this code completion after the <|fim_middle|> token.
This runs into a few limitations.
- If the LLM sees a mistake in any of the existing code, it is unable to suggest a fix, and any code it provides will be wrong by association – does it write code that assumes the other code gets fixed, or does it write code that tries to muddle through the broken design?
- If the user has just edited a similar calculate_X function and has come here to apply a similar edit, the LLM won’t know that, and won’t be able to use that information to inform its prediction.
- This format does not specifically provide a place to put useful context. The editor might like to include a few useful function or type signatures that are related to the current edit block. The editor could insert that information into the prefix or suffix, but it would have to try to ensure it is syntactically legal for that information to exist where it is placed, to avoid confusing the LLM. This is a cumbersome task when the programming language may not be known to the editor.
To address some of these problems, we now have edit-prediction LLMs:
Edit-Prediction LLMs #
Zed wrote a good blog post that introduced their new Zeta model. Based on my observations of Cursor’s tab-completion system purely as a user, I believe it operates in a similar way to the Zeta model, although Zeta may take the concepts further. Zeta adopts a new prediction format that gives the LLM more flexibility in its response, as well as more context.
I have been unable to find the exact Zeta prompt format, but based on the blog post and the training data, the format appears to be functionally similar to this:
<|edit_events|>
User edited file: "internal/api/middleware.go":

diff
@@ -3,6 +3,7 @@
 func RequestIDMiddleware(next http.Handler) http.Handler {
     return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
         ctx := r.Context()
+        ctx = context.WithValue(ctx, "requestID", reqID)
         next.ServeHTTP(w, r.WithContext(ctx))
     })
 }

User edited file: "internal/api/middleware.go":

diff
@@ -3,7 +3,7 @@
 func RequestIDMiddleware(next http.Handler) http.Handler {
     return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
         ctx := r.Context()
-        ctx = context.WithValue(ctx, "requestID", reqID)
+        ctx = context.WithValue(ctx, "requestID", )
         next.ServeHTTP(w, r.WithContext(ctx))
     })
 }
<|edit_events_end|>
internal/api/middleware.go
}
<|editable_region_start|>
func RequestIDMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ctx := r.Context()
        ctx = context.WithValue(ctx, "requestID", <|user_cursor_is_here|>)
        next.ServeHTTP(w, r.WithContext(ctx))
<|editable_region_end|>
    })
}
The model is able to see where the user’s cursor is, and it is able to see recent changes that the user has made while editing. The model is also expected to respond with the entire block of code, allowing it to not only address the missing code where the user’s cursor is, but also to propose fixes to nearby issues, and to do so contextually based on the work the user has recently been doing.
The Zed editor code that calls this LLM appears to provide an outline of the code, diagnostics from the language server, and a speculated output block… presumably the latter two are intended to give the LLM a second chance to fix its prediction. However, the published training data for Zeta does not make use of these fields, so it’s unclear how they fit into the prompt format.
The Future of Tab Completion #
The current tab-completion LLMs are a good start, but there is much more that can be done. Let’s imagine a system that isn’t constrained by compute budget, training data availability, or context awareness.
Let’s imagine a reasoning, tool-calling tab-completion LLM called Cipher that runs on an infinitely fast server. Cipher will receive a prompt in the following format:
<|dependency_list|>
<|context|>
<|edit_events|>
<|diagnostics|>
<|previous_completion|>
<|current_file|>
<|editable_region_start|>
<|editable_region_end|>
<|current_file_end|>
The <|dependency_list|> block will provide the list of third-party packages (and their versions) that the current project depends on, so that the LLM will have a general sense of what is available. This list will also include information about the programming language version and runtime version (if applicable). For any packages that have been added since the last commit, it could be helpful to make a note of that for the LLM.
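As a rough sketch of how an editor might assemble this block, the snippet below reads a Python project’s requirements.txt; the block layout, the “added since last commit” note, and the build_dependency_list helper are assumptions for illustration only.

```python
# Hypothetical sketch: build a <|dependency_list|> block from requirements.txt.
# The block layout and the "added since last commit" annotation are assumptions.
import platform
from pathlib import Path

def build_dependency_list(project_root: Path, newly_added: set[str]) -> str:
    lines = [f"language: Python {platform.python_version()}"]
    req_file = project_root / "requirements.txt"
    if req_file.exists():
        for raw in req_file.read_text().splitlines():
            dep = raw.strip()
            if not dep or dep.startswith("#"):
                continue  # skip blanks and comments
            name = dep.split("==")[0]
            note = "  (added since last commit)" if name in newly_added else ""
            lines.append(dep + note)
    return "<|dependency_list|>\n" + "\n".join(lines) + "\n"
```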
The <|context|> block will be populated by the IDE with the contents of recently accessed and/or relevant files, except for the current file. This provides a solid foundation for the LLM to understand the structure of the code.
The <|diagnostics|> block will contain any diagnostics from the language server that are applicable to the current file.
The <|previous_completion|> block will either be empty, or it will contain the last <|editable_region|> that the LLM generated, which will happen if the last <|editable_region|> resulted in new diagnostics from the language server. This way, the LLM can see both the original state of the <|editable_region|> and its previous attempt, and it can stick to its previous design while fixing issues.
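A sketch of the feedback loop this block implies might look like the following, where build_prompt, run_model, and diagnostics_for are hypothetical stand-ins for prompt assembly, the inference call, and the language server check.

```python
# A sketch of the <|previous_completion|> feedback loop described above.
# `build_prompt`, `run_model`, and `diagnostics_for` are hypothetical stand-ins.

def predict_edit(build_prompt, run_model, diagnostics_for,
                 editable_region: str, max_attempts: int = 2) -> str:
    previous_completion = ""                   # empty on the first attempt
    completion = editable_region
    for _ in range(max_attempts):
        prompt = build_prompt(
            editable_region=editable_region,
            previous_completion=previous_completion,
            diagnostics=diagnostics_for(editable_region),
        )
        completion = run_model(prompt)
        if not diagnostics_for(completion):    # no new diagnostics: accept it
            return completion
        # Retry with the previous attempt visible, so the model can keep its
        # design while fixing the issues the language server reported.
        previous_completion = completion
    return completion
```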
In training, it will learn to use several simple tools to inform its completion. Here are two of the tools (a sketch of how an editor might back them follows the list):
- describeSymbols("symbol_name", ...), which will return the name, definition, and any documentation for the type that describes these symbols, preferring to resolve ambiguity by first trying to find a symbol of that name within the <|editable_region|>, and then looking at a larger scope if needed. This will ideally use the language server, but it can also fall back on a simple code search.
- usesOfSymbols("symbol_name", ...), which will find other places where these symbols are used. It is often helpful to understand how a function or type is currently being used if we’re about to suggest a modification of that function or type.
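Here is a hypothetical sketch of how an editor might back describeSymbols and usesOfSymbols, preferring a language server client and falling back to a plain text search; the lsp.describe and lsp.references calls are assumed, not a real API.

```python
# Hypothetical editor-side handlers for the two tools. `lsp` is assumed to
# expose definition/documentation and reference queries; the grep fallback is
# a plain text search over the project. Neither API is taken from a real tool.
import subprocess

def describe_symbols(lsp, project_root: str, *names: str) -> dict[str, str]:
    """describeSymbols: name -> definition/docs, preferring the language server."""
    results: dict[str, str] = {}
    for name in names:
        info = lsp.describe(name)              # assumed LSP-backed lookup
        if not info:                           # fall back to a simple code search
            grep = subprocess.run(["grep", "-rn", name, project_root],
                                  capture_output=True, text=True)
            info = grep.stdout[:2000]          # cap how much context we return
        results[name] = info
    return results

def uses_of_symbols(lsp, *names: str) -> dict[str, list[str]]:
    """usesOfSymbols: name -> locations where the symbol is referenced."""
    return {name: lsp.references(name) or [] for name in names}
```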
Through training, it will learn to respond first with an <architect> block in which it reasons about the edit, then to issue any tool calls it needs, and finally to produce the completed <|editable_region|>.
Since even an infinitely fast inference server may not be zero-latency due to network round trips, it would be prudent for the editor to remember the tool calls and, for a period of time, continue to provide the (updated) information resulting from those calls the next time the user is editing that part of the code, reducing the number of tool calls that the LLM needs to wait on.
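That editor-side memory could be as simple as a small cache keyed by file and region with a time-to-live; the shape below is an assumption for illustration, not a description of any existing editor.

```python
# A sketch of remembering tool calls per code region and proactively
# re-running them so the model gets fresh results without waiting.
# The cache shape and TTL are assumptions for illustration.
import time

class ToolResultCache:
    def __init__(self, ttl_seconds: float = 120.0):
        self.ttl = ttl_seconds
        self._entries: dict[tuple[str, int], tuple[float, dict]] = {}

    def record(self, file: str, region_start_line: int, results: dict) -> None:
        self._entries[(file, region_start_line)] = (time.monotonic(), results)

    def prefetch(self, file: str, region_start_line: int, rerun_tool) -> dict:
        """Return refreshed results for previously seen tool calls, if any."""
        entry = self._entries.get((file, region_start_line))
        if entry is None or time.monotonic() - entry[0] > self.ttl:
            return {}
        _, results = entry
        # Re-run the same tool calls so the model sees up-to-date information
        # without having to issue (and wait on) the calls itself.
        return {call: rerun_tool(call) for call in results}
```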
In a zero-latency scenario where the model is running locally, you could imagine an interactive inference process that adds more information dynamically. A model could be trained to take in one more block of context:
<|dependency_list|>
<|context|>
<|edit_events|>
<|diagnostics|>
<|previous_completion|>
<|current_file|>
<|editable_region_start|>
<|editable_region_end|>
<|current_file_end|>
<|lsp_suggestions|> # new!
Here, <|lsp_suggestions|> is at the very end of the prompt. This is important. As the LLM types each character of the <|editable_region|>, the LSP would be invoked to get a list of symbols that the user might want to tab-complete, and the LSP output (including the available symbols and their definitions) would be inserted into the <|lsp_suggestions|> block. This would keep the LLM grounded in what is really available at every step of the process. Since this block is at the end of the prompt, the inference server would be able to cache the rest of the prompt prefix, and it would only need to reprocess the tokens from the <|lsp_suggestions|> block onwards (including the portion of the code completion that the LLM had provided up to this point). For slow LSPs, one could imagine only filling in the <|lsp_suggestions|> block after the LLM suggests certain characters, such as a period.
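Here is a rough sketch of that interactive decode loop, assuming a local model that exposes step-wise decoding over a cached prompt prefix and an LSP client that can answer completion queries; model.prefill, model.decode_step, and lsp.completions_at are all hypothetical names.

```python
# A rough sketch of the dynamic re-prompting loop. `model.prefill`,
# `model.decode_step`, and `lsp.completions_at` are hypothetical APIs.

def generate_with_lsp_grounding(model, lsp, static_prompt: str,
                                trigger_chars: str = ".",
                                max_tokens: int = 256) -> str:
    # Everything up to <|lsp_suggestions|> is processed once and cached.
    prefix_cache = model.prefill(static_prompt)
    produced = ""
    suggestions_block = "<|lsp_suggestions|>\n"
    for _ in range(max_tokens):
        # Only the suggestions block plus the output so far is re-processed.
        token = model.decode_step(prefix_cache, suggestions_block + produced)
        if token == "<|editable_region_end|>":
            break
        produced += token
        if produced and produced[-1] in trigger_chars:
            # Ask the LSP what is actually available at this point and refresh
            # the block so the next tokens stay grounded in real symbols.
            symbols = lsp.completions_at(produced)
            suggestions_block = "<|lsp_suggestions|>\n" + "\n".join(symbols) + "\n"
    return produced
```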
This form of dynamic reprompting seems like it would work great – the LLM would make steady progress with its answer, and it wouldn’t think too hard about the fact that it just happens to intuitively have access to the right information each time it processes a new output token. I don’t believe I’ve heard of anyone doing this yet.
Example: Cipher in Action #
Imagine you’re editing a user.go file. You just updated a helper function in auth.go and installed a new authentication library. Cipher’s prompt would include:
- Recent diffs in <|edit_events|> to show that you introduced a new function signature in auth.go.
- Dependency changes in <|dependency_list|> stating that you’ve added github.com/example/awesome-auth v1.2.0.
- Language server diagnostics in <|diagnostics|> that flag a type mismatch from your new function usage.
By combining this knowledge, Cipher’s <architect> block might reason: “The user likely needs a parameter for the new auth function. Let’s fix the type mismatch and pass the correct argument.” It could then generate the updated snippet in the <|editable_region|>, ensuring the code is consistent with both the new library and your recent changes.
Conclusion #
Tab-completion LLMs have progressed from simple next-token generation to more advanced methods like fill-in-the-middle and edit-prediction. I am impressed at how this technology has improved, and I find it interesting to imagine how it might advance in the future.
Let’s revisit the initial limitations:
Limited context awareness #
The concept of Cipher is laser-focused on giving the LLM enough context to be able to accurately predict a helpful edit.
Limited compute budget for inference #
A very high-speed inference server such as Cerebras Inference could provide the necessary speed for Cipher to work. Cerebras Inference is able to process prompts at 20k tokens/second and produce output at nearly 2,300 tokens per second when running Llama-3.1-8B. If Cerebras were running a fine-tuned 7B/8B Cipher model, and if we set a latency target of 500 milliseconds, then we have to figure out how to make this feasible. If we set aside half of that time for prompt processing and the other half for output processing, this gives us room to hold nearly 4,500 tokens of context and output about 575 tokens to cover both the reasoning and the <|editable_region|>. This still isn’t quite enough on its own, but with prompt caching, we might be able to meet the latency target for subsequent completions. The majority of the prompt should not change while a user is typing in one part of the file, so we might hypothetically need to process only 1,000 tokens of prompt each time after it is cached. That takes 50 milliseconds. The remaining 450 milliseconds would be enough to produce over 1,000 output tokens.
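Laid out explicitly, and treating the quoted throughput figures as assumptions, the budget works out roughly like this (the raw ceiling at the quoted prompt speed is about 5,000 tokens, so the “nearly 4,500” figure above leaves a little headroom):

```python
# The latency budget above, spelled out. The throughput figures are the quoted
# Cerebras Inference numbers for Llama-3.1-8B and are taken as assumptions here.

PROMPT_TOK_PER_S = 20_000      # prompt processing speed
OUTPUT_TOK_PER_S = 2_300       # output generation speed
BUDGET_S = 0.5                 # latency target

# Cold prompt: split the budget evenly between prompt and output.
cold_prompt_tokens = PROMPT_TOK_PER_S * (BUDGET_S / 2)   # 5,000 tokens at the quoted rate
cold_output_tokens = OUTPUT_TOK_PER_S * (BUDGET_S / 2)   # ~575 tokens

# Warm prompt: with caching, only ~1,000 new prompt tokens need processing.
warm_prompt_s = 1_000 / PROMPT_TOK_PER_S                           # 0.05 s
warm_output_tokens = OUTPUT_TOK_PER_S * (BUDGET_S - warm_prompt_s)  # ~1,035 tokens

print(cold_prompt_tokens, cold_output_tokens, warm_output_tokens)
```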
Limited training data availability #
Collecting the necessary training data for a project of this scope is not impossible, but it would be challenging. When creating the dataset, one would need to use a few projects in various programming languages that have LSP support. In generating the training data, one would need to programmatically use the LSP to generate the <|diagnostics|> and <|lsp_suggestions|> blocks for various scenarios that match up with the real data in the rest of the blocks. The existing Zeta training set would be a great starting point, but mainly as a diverse training set for the model to learn how to behave when the LSP is not configured to provide diagnostics or suggestions. Entirely new training data would need to be collected with just LSP diagnostics, and new data collected with both diagnostics and suggestions. I could imagine just having one dataset with diagnostics and suggestions, and then augmenting the data by filtering the existing examples to create ones where the suggestions (or both the diagnostics and the suggestions) have been removed, but I would be worried that showing the model the same examples repeatedly within each epoch might lead to overfitting, so this is something that would need to be tested as an ablation study – it could help, or it could hurt.
A concept like Cipher is likely just barely possible even today, but it would certainly be pushing the boundaries of the technology. I expect advancements in LLM inference efficiency and inference hardware over the next several years to make ideas like Cipher not just possible, but inevitable.