NVIDIA Jetson Orin for Chatbot


Short answer: yes, your NVIDIA Jetson is a solid edge box for a private chatbot. Use it as the “LLM brain” in your Empire Node, with optional RAG (local docs), and add Riva for voice (ASR/TTS) if you want a talk-to-me movie prop. Here’s the clean path that actually works on Orin Nano / Orin NX today, plus a drop-in .NET gateway so your MVC apps can call it.


What to run on the Jetson (pick 1 to start)


  1. Ollama (CUDA-accelerated on Jetson) – Easiest. Install it and ollama run models like Llama-3, Mistral, Gemma, or Phi. Jetson AI Lab ships a one-command setup and containers tailored for JetPack 6.x. It’s the lowest-friction path to “it chats right now.” (Jetson AI Lab)
  2. TensorRT-LLM (fast path) – More work to build/convert engines, but the best performance on Orin with INT8/FP8, paged KV cache, and fused attention kernels. NVIDIA has a Jetson branch + guides; support landed with JetPack 6.x (first on AGX Orin, expanded via the Jetson builds). Use it when you want maximum tokens/sec. (GitHub)
  3. Voice I/O (optional) – Add NVIDIA Riva for streaming ASR + TTS locally; it ships Quick Start scripts and runs on Jetson with the right JetPack. That gives you “speak with Flynn” scenes. (NVIDIA Docs)
  4. NIM (microservices) – If you later want an enterprise packaging of these as GPU-accelerated inference microservices, NIM is NVIDIA’s route; check the support matrix for which NIMs are available on Jetson vs x86. (NVIDIA Developer)

Practical notes (so you don’t burn a weekend)


  • JetPack matters. Aim for JetPack 6.x on Orin. Older JetPack + Docker often leads to “GPU not found” for Ollama containers. (NVIDIA Developer Forums)
  • Model sizes: On Orin Nano 8–16GB, stick to 2B–8B class models for snappy UX (Gemma 2 2B/9B, Llama-3.1-8B-Instruct, Phi-3-mini). Use Q4_K_M or similar quantization under Ollama, or INT8/FP8 under TensorRT-LLM. (Rule of thumb: keep VRAM headroom for the KV cache; see the rough estimate after this list.)
  • RAG: Keep it Jetson-friendly: SQLite + local embeddings (e.g., nomic-embed-text or gte-small) and a simple vector DB (FAISS).
  • Offline movie mode: Pre-pull containers and models; preload your RAG index; disable network.
  • Throughput: Expect “usable” chat rates on 7–8B with quantization. For cinematic scenes (typed lines), you’re golden.
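
For the VRAM rule of thumb above, a rough, hand-wavy estimate (assuming Llama-3.1-8B-Instruct at a ~4.8-bit Q4_K_M quant with an FP16 KV cache; exact numbers vary by runtime): weights ≈ 8B params × ~0.6 bytes ≈ 4.5–5 GB, and the KV cache at 4,096 context ≈ 2 (K+V) × 32 layers × 4,096 tokens × 8 KV heads × 128 head-dim × 2 bytes ≈ 0.5 GB. Call it 5–5.5 GB total: it fits an 8 GB Orin Nano, but without much room to spare once the OS and your gateway are loaded.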



Minimal, production-ish scaffold for your stack


Below are file names and full code (commented) for:
• A Docker Compose that runs Ollama on Jetson with GPU, and (optional) Riva for voice.
• A .NET 8 Web API gateway that your MVC/Razor views can hit locally or over LAN.
• A tiny RAG stub (embeddings + FAISS-style) you can swap later.


— FILE: docker-compose.yml
version: "3.8"
services:
  ollama:
    image: ghcr.io/johnrkerl/ollama-jetson:latest
    # NOTE: Use the Jetson AI Lab Ollama container or jetson-containers build if you prefer.
    # Ensure your JetPack 6.x host has NVIDIA Container Runtime set up.
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
      - OLLAMA_KEEP_ALIVE=24h
    volumes:
      - ./ollama:/root/.ollama
    ports:
      - "11434:11434"
    command: ["serve"]


Optional voice: Riva ASR/TTS (heavy; enable after Ollama works)

Add this service under services: in the same docker-compose.yml:

  riva:
    image: nvcr.io/nvidia/riva/riva-speech:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    ports:
      - "50051:50051"
    command: ["start-riva", "--asr", "--tts"]

Bring it up: docker compose up -d

Then on the Jetson:


— FILE: ollama/Modelfile.llama3


# Example: Llama 3 8B Instruct, Q4_K_M quant (adjust model and quant to your taste)


FROM llama3:8b-instruct-q4_K_M
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM You are Flynn’s private Jetson-resident assistant for the Empire Ring movie. Be fast, helpful, and classy.


Build with: ollama create flynn-llama -f ollama/Modelfile.llama3

Run with: ollama run flynn-llama


— FILE: src/JetsonChat.Gateway/JetsonChat.Gateway.csproj


<Project Sdk="Microsoft.NET.Sdk.Web">
  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
    <Nullable>enable</Nullable>
    <ImplicitUsings>enable</ImplicitUsings>
  </PropertyGroup>
  <ItemGroup>
    <!-- Packages used by the gateway (Swagger UI and SQLite for the RAG store); versions are examples, bump to current. -->
    <PackageReference Include="Swashbuckle.AspNetCore" Version="6.5.0" />
    <PackageReference Include="Microsoft.Data.Sqlite" Version="8.0.0" />
  </ItemGroup>
</Project>


— FILE: src/JetsonChat.Gateway/appsettings.json
{
  "Ollama": {
    "BaseUrl": "http://localhost:11434", // If the gateway runs on the Jetson itself
    "Model": "flynn-llama"
  },
  "Rag": {
    "Enabled": true,
    "DbPath": "rag/rag.db"
  },
  "Logging": {
    "LogLevel": { "Default": "Information", "Microsoft.AspNetCore": "Warning" }
  },
  "AllowedHosts": "*"
}


— FILE: src/JetsonChat.Gateway/Program.cs
using System.Net.Http.Json;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Options;


var builder = WebApplication.CreateBuilder(args);


// Bind config
builder.Services.Configure<OllamaOptions>(builder.Configuration.GetSection("Ollama"));
builder.Services.Configure<RagOptions>(builder.Configuration.GetSection("Rag"));
builder.Services.AddSingleton<OllamaClient>();
builder.Services.AddSingleton<RagStore>(); // simple local RAG
builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen();


var app = builder.Build();
app.UseSwagger();
app.UseSwaggerUI();


// Health
app.MapGet("/health", () => Results.Ok(new { ok = true, ts = DateTimeOffset.UtcNow }));


// Chat – streams tokens from Ollama and (optionally) augments with RAG
app.MapPost("/chat", async (ChatRequest req, OllamaClient ollama, RagStore rag, IOptions ragOpts, HttpResponse res) =>
{
string augmented = req.Input;
if (ragOpts.Value.Enabled && !string.IsNullOrWhiteSpace(req.Input))
{
var facts = rag.Retrieve(req.Input, topK: 4);
if (facts.Count > 0)
{
augmented = $"Use the following private context when helpful:\n---\n{string.Join("\n", facts)}\n---\nUser: {req.Input}";
}
}


res.Headers.Append("Content-Type", "text/event-stream");<br>await foreach (var chunk in ollama.Stream(augmented, req.System ?? string.Empty))<br>{<br> await res.WriteAsync(chunk);<br> await res.Body.FlushAsync();<br>}<br>

}).WithDescription("Streams chat tokens from the local Jetson LLM (Ollama).");


// Embeddings endpoint (for RAG indexing)
app.MapPost("/embed", async (EmbedRequest req, OllamaClient ollama, RagStore rag) =>
{
var vec = await ollama.Embed(req.Text);
rag.Upsert(req.Id ?? Guid.NewGuid().ToString("N"), req.Text, vec);
return Results.Ok(new { ok = true });
});


app.Run();


// Options + DTOs
record ChatRequest(string Input, string? System);
record EmbedRequest(string Text, string? Id);
class OllamaOptions { public string BaseUrl { get; set; } = "http://localhost:11434"; public string Model { get; set; } = "flynn-llama"; }
class RagOptions { public bool Enabled { get; set; } = true; public string DbPath { get; set; } = "rag/rag.db"; }
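
For step 5 of the README below (wiring an MVC/Razor view to /chat), one minimal sketch is an MVC-side controller that relays the token stream to the browser. This assumes the MVC app runs on another box and the gateway listens at http://jetson.local:5000; ChatProxyController, its route, and that address are hypothetical, not part of the files above:

using System.Net.Http.Json;
using Microsoft.AspNetCore.Mvc;

public class ChatProxyController : Controller
{
    // Assumed gateway address on the LAN; point it at wherever the Jetson gateway listens.
    private static readonly HttpClient Http = new() { BaseAddress = new Uri("http://jetson.local:5000") };

    // POST /chat-proxy with a raw JSON string body, e.g. "Say hi to Flynn."
    [HttpPost("/chat-proxy")]
    public async Task ChatProxy([FromBody] string input)
    {
        using var upstream = new HttpRequestMessage(HttpMethod.Post, "/chat")
        {
            Content = JsonContent.Create(new { Input = input, System = (string?)null })
        };

        using var resp = await Http.SendAsync(upstream, HttpCompletionOption.ResponseHeadersRead);
        resp.EnsureSuccessStatusCode();

        // Relay tokens to the browser as they arrive from the gateway.
        Response.ContentType = "text/event-stream";
        await using var stream = await resp.Content.ReadAsStreamAsync();
        var buffer = new byte[4096];
        int read;
        while ((read = await stream.ReadAsync(buffer)) > 0)
        {
            await Response.Body.WriteAsync(buffer.AsMemory(0, read));
            await Response.Body.FlushAsync();
        }
    }
}

Your Razor page’s script (or a fetch() call) can then read /chat-proxy with a streaming reader and append tokens to the phone pane as they arrive.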


— FILE: src/JetsonChat.Gateway/OllamaClient.cs
using System.Net.Http.Headers;
using System.Net.Http.Json;
using System.Text;
using System.Text.Json;
using Microsoft.Extensions.Options;


public class OllamaClient
{
    private readonly HttpClient _http = new();
    private readonly OllamaOptions _opt;

    public OllamaClient(IOptions<OllamaOptions> opt)
    {
        _opt = opt.Value;
        _http.BaseAddress = new Uri(_opt.BaseUrl);
        _http.Timeout = TimeSpan.FromMinutes(30);
    }

    // Stream chat tokens from /api/generate (Ollama emits one JSON object per line)
    public async IAsyncEnumerable<string> Stream(string prompt, string system)
    {
        var body = new
        {
            model = _opt.Model,
            prompt,
            system,
            stream = true
        };
        using var req = new HttpRequestMessage(HttpMethod.Post, "/api/generate")
        { Content = new StringContent(JsonSerializer.Serialize(body), Encoding.UTF8, "application/json") };

        using var resp = await _http.SendAsync(req, HttpCompletionOption.ResponseHeadersRead);
        resp.EnsureSuccessStatusCode();

        using var s = await resp.Content.ReadAsStreamAsync();
        using var reader = new StreamReader(s);
        string? line;
        while ((line = await reader.ReadLineAsync()) is not null)
        {
            // C# can't yield inside a try block with a catch, so parse first, then yield outside it.
            string? token = null;
            try
            {
                using var doc = JsonDocument.Parse(line);
                if (doc.RootElement.TryGetProperty("response", out var r))
                    token = r.GetString();
            }
            catch { /* skip non-JSON heartbeats */ }

            if (token is not null)
                yield return token;
        }
    }

    // Embeddings via /api/embeddings (Ollama supports some embed models)
    public async Task<float[]> Embed(string text)
    {
        var body = new { model = "nomic-embed-text", prompt = text };
        var resp = await _http.PostAsJsonAsync("/api/embeddings", body);
        resp.EnsureSuccessStatusCode();
        var json = await resp.Content.ReadFromJsonAsync<EmbedResp>();
        return json?.embedding ?? Array.Empty<float>();
    }

    private sealed class EmbedResp { public float[] embedding { get; set; } = Array.Empty<float>(); }

}
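
Design note: the scaffold creates its own HttpClient, which is fine for a single long-lived singleton. If you prefer to let the host manage the handler lifetime, one optional alternative (a sketch, not part of the files above) is a typed-client registration in Program.cs:

// Alternative wiring, replacing builder.Services.AddSingleton<OllamaClient>():
builder.Services.AddHttpClient<OllamaClient>((sp, http) =>
{
    var opt = sp.GetRequiredService<IOptions<OllamaOptions>>().Value;
    http.BaseAddress = new Uri(opt.BaseUrl);
    http.Timeout = TimeSpan.FromMinutes(30);
});

// OllamaClient would then accept the configured client from DI instead of newing one up:
// public OllamaClient(HttpClient http, IOptions<OllamaOptions> opt) { _http = http; _opt = opt.Value; }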


— FILE: src/JetsonChat.Gateway/RagStore.cs
using System.Data;
using Microsoft.Data.Sqlite;


// Tiny SQLite + cosine sim store (swap later for FAISS if you like)
public class RagStore
{
    private readonly string _path;

    public RagStore(Microsoft.Extensions.Options.IOptions<RagOptions> o)
    {
        _path = o.Value.DbPath;

        // SQLite creates the file but not the folder, so make sure the folder exists first.
        var dir = Path.GetDirectoryName(Path.GetFullPath(_path));
        if (!string.IsNullOrEmpty(dir)) Directory.CreateDirectory(dir);

        using var conn = Open();
        using var cmd = conn.CreateCommand();
        cmd.CommandText =
            @"CREATE TABLE IF NOT EXISTS docs (id TEXT PRIMARY KEY, text TEXT NOT NULL, vec BLOB NOT NULL);
              CREATE INDEX IF NOT EXISTS docs_text ON docs(id);";
        cmd.ExecuteNonQuery();
    }

    public void Upsert(string id, string text, float[] vec)
    {
        using var conn = Open();
        using var cmd = conn.CreateCommand();
        cmd.CommandText = "INSERT INTO docs (id,text,vec) VALUES ($id,$t,$v) ON CONFLICT(id) DO UPDATE SET text=$t, vec=$v;";
        cmd.Parameters.AddWithValue("$id", id);
        cmd.Parameters.AddWithValue("$t", text);
        cmd.Parameters.Add("$v", SqliteType.Blob).Value = FloatArrayToBytes(vec);
        cmd.ExecuteNonQuery();
    }

    public List<string> Retrieve(string query, int topK)
    {
        // Naive scaffold: returns the first N docs. For real retrieval, embed the query
        // (via /embed) and rank by cosine similarity; see the sketch after this class.
        using var conn = Open();
        using var cmd = conn.CreateCommand();
        cmd.CommandText = "SELECT text FROM docs LIMIT $k;";
        cmd.Parameters.AddWithValue("$k", topK);
        using var r = cmd.ExecuteReader();
        var list = new List<string>();
        while (r.Read()) list.Add(r.GetString(0));
        return list;
    }

    private SqliteConnection Open()
    {
        var cs = new SqliteConnectionStringBuilder { DataSource = _path, Mode = SqliteOpenMode.ReadWriteCreate }.ToString();
        var c = new SqliteConnection(cs); c.Open(); return c;
    }

    private static byte[] FloatArrayToBytes(float[] arr)
    {
        var bytes = new byte[arr.Length * sizeof(float)];
        Buffer.BlockCopy(arr, 0, bytes, 0, bytes.Length); return bytes;
    }

}
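
If you want real retrieval instead of the naive LIMIT above, here is a minimal sketch, assuming the query vector comes from the same /embed model and the whole corpus fits in memory. RetrieveByVector, BytesToFloatArray, and Cosine are hypothetical additions to RagStore, not part of the scaffold:

// Hypothetical RagStore members: rank every stored doc by cosine similarity to a query vector.
public List<string> RetrieveByVector(float[] queryVec, int topK)
{
    using var conn = Open();
    using var cmd = conn.CreateCommand();
    cmd.CommandText = "SELECT text, vec FROM docs;";
    using var r = cmd.ExecuteReader();

    var scored = new List<(double Score, string Text)>();
    while (r.Read())
    {
        var text = r.GetString(0);
        var vec = BytesToFloatArray((byte[])r["vec"]);
        scored.Add((Cosine(queryVec, vec), text));
    }
    return scored.OrderByDescending(s => s.Score).Take(topK).Select(s => s.Text).ToList();
}

private static float[] BytesToFloatArray(byte[] bytes)
{
    var arr = new float[bytes.Length / sizeof(float)];
    Buffer.BlockCopy(bytes, 0, arr, 0, bytes.Length);
    return arr;
}

private static double Cosine(float[] a, float[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < Math.Min(a.Length, b.Length); i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-9);
}

The /chat endpoint would then call ollama.Embed(req.Input) first and pass the resulting vector in, instead of using the text-only Retrieve.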


— FILE: src/JetsonChat.Gateway/README.txt
Build & Run (Jetson or remote PC pointing to Jetson):


  1. Start the model server on Jetson:
    • docker compose up -d
    • ollama create flynn-llama -f ollama/Modelfile.llama3
    • ollama run flynn-llama (first run downloads base; subsequent runs are instant)
  2. Start the gateway:
    • dotnet run --project src/JetsonChat.Gateway (the curl examples below assume it listens on port 5000)
  3. Test chat:
    curl -X POST http://localhost:5000/chat -H "Content-Type: application/json" -d '{"Input":"Say hi to Flynn."}'
  4. Add RAG docs:
    curl -X POST http://localhost:5000/embed -H "Content-Type: application/json" -d '{"Id":"policy-1","Text":"Empire Ring house rules..."}'
  5. Wire your MVC/Razor view to call /chat (stream tokens) and render movie UI.



Where this fits your movie + stack


  • Record-ready UI: Your Razor “mock texting tool” can call /chat and render live tokens into the phone pane while you screen-record.
  • Offline reliability: Everything runs local. Preload model + RAG. Perfect for travel scenes.
  • Upgrade path: When you want more speed, convert your chosen model to TensorRT-LLM engines and point the gateway at the TensorRT endpoint instead of Ollama; same app, faster tokens (see the sketch after this list). (GitHub)
  • Voice scenes: Add Riva for wake-word + streaming ASR and TTS to get “Flynn, call Brad” sequences without cloud. (NVIDIA Docs)
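
To keep that backend swap a one-line change, one option is to hide the engine behind a small interface. A sketch; ILlmBackend and TensorRtClient are hypothetical names, not part of the scaffold above:

// Hypothetical abstraction so the gateway doesn't care which engine serves tokens.
public interface ILlmBackend
{
    IAsyncEnumerable<string> Stream(string prompt, string system);
    Task<float[]> Embed(string text);
}

// OllamaClient already has matching methods, so it only needs to declare the interface:
// public class OllamaClient : ILlmBackend { ... }

// A TensorRT-LLM-backed client can later implement the same contract:
// public class TensorRtClient : ILlmBackend { ... }

// Program.cs then binds whichever backend you choose:
// builder.Services.AddSingleton<ILlmBackend, OllamaClient>();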

If you want, I can next:
• Swap the gateway to call TensorRT-LLM directly,
• Add a /voice/chat route that stitches Riva ASR → LLM → Riva TTS,
• Or generate the Razor view that matches your Tron theme and streams tokens from /chat.
 