Ollama KV-cache reuse for faster multi-turn conversations
Context caching significantly improves performance for multi-turn conversations by reusing Ollama’s KV-cache. Instead of re-processing the entire conversation history on each message, the system caches context tokens and only processes new content.
| Metric | Without Caching | With Caching | Improvement |
|---|---|---|---|
| 1st message | 100% tokens | 100% tokens | Baseline |
| 2nd message | 100% tokens | ~15% tokens | 85% reduction |
| 5th message | 100% tokens | ~10% tokens | 90% reduction |
| 10th message | 100% tokens | ~7% tokens | 93% reduction |
┌─────────────────────────────────────────────────────────────┐
│ Context Caching Flow │
├─────────────────────────────────────────────────────────────┤
│ │
│ Message 1: "Hello!" │
│ ↓ │
│ Ollama processes full prompt │
│ ↓ │
│ Response + context: [1, 45, 789, ...] saved to cache │
│ │
│ ───────────────────────────────────────────── │
│ │
│ Message 2: "How are you?" │
│ ↓ │
│ System sends context: [1, 45, 789, ...] + "How are you?" │
│ ↓ │
│ Ollama ONLY processes new tokens (uses KV-cache) │
│ ↓ │
│ ~10-20% of original processing time │
│ │
└─────────────────────────────────────────────────────────────┘
import { Config, ContentGeneratorConfig } from 'ollama-code';
const config: ContentGeneratorConfig = {
model: 'llama3.2',
enableContextCaching: true, // Enable KV-cache reuse
sessionId: 'my-session', // Optional: custom session ID
};
The HybridContentGenerator automatically selects the optimal endpoint:
import { HybridContentGenerator } from 'ollama-code';
const generator = new HybridContentGenerator({
model: 'llama3.2',
baseUrl: 'http://localhost:11434',
preferGenerateEndpoint: true, // Prefer /api/generate for caching
sessionId: 'my-session',
});
// First message - uses /api/generate, caches context
const response1 = await generator.generateContent({
contents: [{ role: 'user', parts: [{ text: 'Hello!' }] }],
});
// Second message - reuses cached context
const response2 = await generator.generateContent({
contents: [
{ role: 'user', parts: [{ text: 'Hello!' }] },
{ role: 'model', parts: [{ text: 'Hi there!' }] },
{ role: 'user', parts: [{ text: 'How are you?' }] },
],
});
interface ContentGeneratorConfig {
/** Model to use */
model: string;
/** Enable context caching for KV-cache reuse */
enableContextCaching?: boolean;
/** Session ID for context tracking */
sessionId?: string;
/** Base URL for Ollama API */
baseUrl?: string;
/** Request timeout in milliseconds */
timeout?: number;
/** Context window size override */
contextWindowSize?: number;
}
Manages context tokens per session:
import { ContextCacheManager } from 'ollama-code';
const cache = new ContextCacheManager({
maxSessions: 10, // Maximum cached sessions
ttl: 300000, // Time-to-live (5 minutes)
});
// Set context for a session
cache.setContext('session-1', [1, 2, 3, 4, 5], 'llama3.2', 1);
// Get cached context
const context = cache.getContext('session-1');
// Returns: [1, 2, 3, 4, 5] or null if not found
// Check if context exists
cache.hasContext('session-1'); // true
// Invalidate context
cache.invalidate('session-1');
// Get statistics
const stats = cache.getStats();
// { cachedSessions: 1, hits: 5, misses: 1, averageContextSize: 1000 }
Specialized client for /api/generate with context caching:
import { OllamaContextClient } from 'ollama-code';
const client = new OllamaContextClient('http://localhost:11434', 30000);
// Generate with automatic context caching
const response = await client.generate({
model: 'llama3.2',
sessionId: 'chat-1',
prompt: 'Hello!',
system: 'You are a helpful assistant.',
});
console.log(response.response); // Generated text
console.log(response.context); // Context tokens [1, 45, ...]
console.log(response.contextReused); // false (first message)
// Second message reuses context
const response2 = await client.generate({
model: 'llama3.2',
sessionId: 'chat-1',
prompt: 'What is 2+2?',
});
console.log(response2.contextReused); // true
// Stream with context caching
await client.generateStream(
{
model: 'llama3.2',
sessionId: 'chat-1',
prompt: 'Tell me a story',
},
(chunk) => {
process.stdout.write(chunk.response);
}
);
The HybridContentGenerator intelligently selects the optimal endpoint:
| Request Type | Endpoint | Reason |
|---|---|---|
| Simple conversation | /api/generate |
Context caching available |
| With tools | /api/chat |
Tool calling requires chat endpoint |
| Function calls in history | /api/chat |
Function response handling |
// HybridContentGenerator internal logic
private selectEndpoint(request: GenerateContentParameters): EndpointSelection {
// Must use chat if tools are present
if (request.config?.tools?.length > 0) {
return { endpoint: 'chat', reason: 'Tools require /api/chat' };
}
// Must use chat if function calls in history
if (hasFunctionCallsInHistory(request.contents)) {
return { endpoint: 'chat', reason: 'Function calls in history' };
}
// Use generate for context caching
return { endpoint: 'generate', reason: 'Context caching available' };
}
import { contextCacheManager } from 'ollama-code';
// Initialize session
contextCacheManager.initSession('session-1', 'llama3.2', 'You are helpful.');
// After receiving response with context
contextCacheManager.updateSessionState('session-1', contextTokens, messageCount);
// Clear specific session
contextCacheManager.remove('session-1');
// Clear all sessions
contextCacheManager.clear();
Context is automatically invalidated when:
import { OllamaChat } from 'ollama-code';
const chat = new OllamaChat(config, {
enableContextCaching: true,
sessionId: 'my-chat',
systemInstruction: 'You are a coding assistant.',
}, history);
// Context caching is handled automatically
for await (const event of chat.sendMessageStream(model, params, promptId)) {
if (event.type === 'chunk') {
console.log(event.value.text());
}
}
// Access cached context
const cachedContext = chat.getCachedContext();
console.log('Context length:', cachedContext?.length);
// Good: Consistent session ID
const sessionId = `project-${projectId}-user-${userId}`;
// Bad: Random session ID each time
const sessionId = Math.random().toString();
// When switching models
chat.clearCachedContext();
chat.setSessionId('new-session');
const stats = contextCacheManager.getStats();
console.log(`Hit rate: ${stats.hits / (stats.hits + stats.misses) * 100}%`);
// Long conversations: longer TTL
const cache = new ContextCacheManager({ ttl: 600000 }); // 10 minutes
// Quick interactions: shorter TTL
const cache = new ContextCacheManager({ ttl: 60000 }); // 1 minute
// Debug context state
console.log('Has context:', cache.hasContext(sessionId));
console.log('Context:', cache.getContext(sessionId));
console.log('Session state:', cache.getSessionState(sessionId));
Reduce maxSessions or lower TTL:
const cache = new ContextCacheManager({
maxSessions: 5, // Fewer cached sessions
ttl: 120000, // 2 minute TTL
});
The context caching system has comprehensive test coverage with 118 tests:
| Suite | Tests | Description |
|---|---|---|
| Constructor | 2 | Default values, custom configuration |
| Singleton | 1 | Export verification |
| setContext | 3 | Basic operations, eviction |
| getContext | 4 | Retrieval, TTL, hits/misses |
| hasContext | 3 | Valid, invalid, non-existent |
| invalidate | 2 | Invalidation, non-existent |
| remove | 1 | Removal verification |
| clear | 1 | Clear all |
| isContextCompatible | 3 | Model compatibility |
| getStats | 1 | Statistics tracking |
| estimateMemoryUsage | 1 | Memory estimation |
| createGenerateRequest | 3 | Request building |
| handleGenerateResponse | 2 | Response handling |
| Session state management | 4 | Init, update, system prompt |
| Edge cases | 10 | Empty arrays, unicode, large contexts |
| Eviction policy | 3 | LRU eviction behavior |
| Concurrent access | 2 | Read/write concurrency |
| TTL behavior | 2 | Custom TTL, expiry stats |
| Suite | Tests | Description |
|---|---|---|
| Constructor | 2 | Default and custom settings |
| Generate | 4 | Basic, context reuse, streaming, errors |
| Session management | 3 | Clear session, clear all, separate contexts |
| Error handling | 7 | Network, timeout, JSON, HTTP errors |
| Request options | 5 | Model options, images, format, raw mode |
| Edge cases | 4 | Special chars, unicode, large contexts |
| Suite | Tests | Description |
|---|---|---|
| Endpoint selection | 3 | Generate, chat, function calls |
| Context caching | 1 | Context reuse |
| Streaming | 1 | Stream responses |
| countTokens | 4 | Token counting scenarios |
| embedContent | 4 | Embedding generation |
| Generation config | 6 | temperature, topP, topK, etc. |
| System instruction | 3 | Config, parts, text property |
| Image handling | 1 | Inline images |
| Response handling | 2 | Usage metadata, finish reason |
# Run all context caching tests
cd packages/core
bun run test --run src/cache/contextCacheManager.test.ts
bun run test --run src/core/ollamaContextClient.test.ts
bun run test --run src/core/hybridContentGenerator.test.ts
# Run all tests together
bun run test --run src/cache/contextCacheManager.test.ts src/core/ollamaContextClient.test.ts src/core/hybridContentGenerator.test.ts