NVIDIA Jetson Orin for Chatbot


Short answer: yes, your NVIDIA Jetson is a solid edge box for a private chatbot. Use it as the “LLM brain” in your Empire Node, with optional RAG (local docs), and add Riva for voice (ASR/TTS) if you want a talk-to-me movie prop. Here’s the clean path that actually works on Orin Nano / Orin NX today, plus a drop-in .NET gateway so your MVC apps can call it.


What to run on the Jetson (pick 1 to start)


  1. Ollama (CUDA-accelerated on Jetson) – Easiest. Install it and ollama run models like Llama-3, Mistral, Gemma, or Phi. Jetson AI Lab ships a one-command setup and containers tailored for JetPack 6.x. It’s the lowest-friction path to “it chats right now.” (Jetson AI Lab)
  2. TensorRT-LLM (fast path) – More work to build/convert engines, but the best performance on Orin with INT8/FP8, paged KV cache, and fused attention kernels. NVIDIA has a Jetson branch + guides; support landed with JetPack 6.x (first on AGX Orin, expanded via the Jetson builds). Use it when you want maximum tokens/sec. (GitHub)
  3. Voice I/O (optional) – Add NVIDIA Riva for streaming ASR + TTS locally; it ships Quick Start scripts and runs on Jetson with the right JetPack. That gives you “speak with Flynn” scenes. (NVIDIA Docs)
  4. NIM (microservices) – If you later want an enterprise packaging of these as GPU-accelerated inference microservices, NIM is NVIDIA’s route; check the support matrix for which NIMs are available on Jetson vs x86. (NVIDIA Developer)

Practical notes (so you don’t burn a weekend)


  • JetPack matters. Aim for JetPack 6.x on Orin. Older JetPack + Docker often leads to “GPU not found” for Ollama containers. (NVIDIA Developer Forums)
  • Model sizes: On Orin Nano 8–16GB, stick to 2B–8B class models for snappy UX (Gemma 2 2B/9B, Llama-3.1-8B-Instruct, Phi-3-mini). Use Q4_K_M or similar quantization under Ollama, or INT8/FP8 under TensorRT-LLM. (Rule of thumb: keep VRAM headroom for the KV cache; see the rough estimate after this list.)
  • RAG: Keep it Jetson-friendly: SQLite + local embeddings (e.g., nomic-embed-text or gte-small) and a simple vector DB (FAISS).
  • Offline movie mode: Pre-pull containers and models; preload your RAG index; disable network.
  • Throughput: Expect “usable” chat rates on 7–8B with quantization. For cinematic scenes (typed lines), you’re golden.
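
For the VRAM rule of thumb above, a rough, hand-wavy estimate (assuming Llama-3.1-8B-Instruct at a ~4.8-bit Q4_K_M quant with an FP16 KV cache; exact numbers vary by runtime): weights ≈ 8B params × ~0.6 bytes ≈ 4.5–5 GB, and the KV cache at 4,096 context ≈ 2 (K+V) × 32 layers × 4,096 tokens × 8 KV heads × 128 head-dim × 2 bytes ≈ 0.5 GB. Call it 5–5.5 GB total: it fits an 8 GB Orin Nano, but without much room to spare once the OS and your gateway are loaded.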



Minimal, production-ish scaffold for your stack


Below are file names and full code (commented) for:
• A Docker Compose that runs Ollama on Jetson with GPU, and (optional) Riva for voice.
• A .NET 8 Web API gateway that your MVC/Razor views can hit locally or over LAN.
• A tiny RAG stub (embeddings + FAISS-style) you can swap later.


— FILE: docker-compose.yml
version: "3.8"
services:
  ollama:
    image: ghcr.io/johnrkerl/ollama-jetson:latest
    # NOTE: Use the Jetson AI Lab Ollama container or jetson-containers build if you prefer.
    # Ensure your JetPack 6.x host has NVIDIA Container Runtime set up.
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
      - OLLAMA_KEEP_ALIVE=24h
    volumes:
      - ./ollama:/root/.ollama
    ports:
      - "11434:11434"
    command: ["serve"]


Optional voice: Riva ASR/TTS (heavy; enable after Ollama works)

Add this service under services: in the same docker-compose.yml:

  riva:
    image: nvcr.io/nvidia/riva/riva-speech:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    ports:
      - "50051:50051"
    command: ["start-riva", "--asr", "--tts"]

Bring it up: docker compose up -d

Then on the Jetson:


— FILE: ollama/Modelfile.llama3


# Example: Llama 3 8B Instruct, Q4_K_M quant (adjust model and quant to your taste)


FROM llama3:8b-instruct-q4_K_M
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM You are Flynn’s private Jetson-resident assistant for the Empire Ring movie. Be fast, helpful, and classy.


Build with: ollama create flynn-llama -f ollama/Modelfile.llama3

Run with: ollama run flynn-llama


— FILE: src/JetsonChat.Gateway/JetsonChat.Gateway.csproj


<Project Sdk="Microsoft.NET.Sdk.Web">
  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
    <Nullable>enable</Nullable>
    <ImplicitUsings>enable</ImplicitUsings>
  </PropertyGroup>
  <ItemGroup>
    <!-- Packages used by the gateway (Swagger UI and SQLite for the RAG store); versions are examples, bump to current. -->
    <PackageReference Include="Swashbuckle.AspNetCore" Version="6.5.0" />
    <PackageReference Include="Microsoft.Data.Sqlite" Version="8.0.0" />
  </ItemGroup>
</Project>


— FILE: src/JetsonChat.Gateway/appsettings.json
{
  "Ollama": {
    "BaseUrl": "http://localhost:11434", // If the gateway runs on the Jetson itself
    "Model": "flynn-llama"
  },
  "Rag": {
    "Enabled": true,
    "DbPath": "rag/rag.db"
  },
  "Logging": {
    "LogLevel": { "Default": "Information", "Microsoft.AspNetCore": "Warning" }
  },
  "AllowedHosts": "*"
}


— FILE: src/JetsonChat.Gateway/Program.cs
using System.Net.Http.Json;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Options;


var builder = WebApplication.CreateBuilder(args);


// Bind config
builder.Services.Configure<OllamaOptions>(builder.Configuration.GetSection("Ollama"));
builder.Services.Configure<RagOptions>(builder.Configuration.GetSection("Rag"));
builder.Services.AddSingleton<OllamaClient>();
builder.Services.AddSingleton<RagStore>(); // simple local RAG
builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen();


var app = builder.Build();
app.UseSwagger();
app.UseSwaggerUI();


// Health
app.MapGet("/health", () => Results.Ok(new { ok = true, ts = DateTimeOffset.UtcNow }));


// Chat – streams tokens from Ollama and (optionally) augments with RAG
app.MapPost("/chat", async (ChatRequest req, OllamaClient ollama, RagStore rag, IOptions ragOpts, HttpResponse res) =>
{
string augmented = req.Input;
if (ragOpts.Value.Enabled && !string.IsNullOrWhiteSpace(req.Input))
{
var facts = rag.Retrieve(req.Input, topK: 4);
if (facts.Count > 0)
{
augmented = $"Use the following private context when helpful:\n---\n{string.Join("\n", facts)}\n---\nUser: {req.Input}";
}
}


res.Headers.Append("Content-Type", "text/event-stream");<br>await foreach (var chunk in ollama.Stream(augmented, req.System ?? string.Empty))<br>{<br> await res.WriteAsync(chunk);<br> await res.Body.FlushAsync();<br>}<br>

}).WithDescription("Streams chat tokens from the local Jetson LLM (Ollama).");


// Embeddings endpoint (for RAG indexing)
app.MapPost("/embed", async (EmbedRequest req, OllamaClient ollama, RagStore rag) =>
{
var vec = await ollama.Embed(req.Text);
rag.Upsert(req.Id ?? Guid.NewGuid().ToString("N"), req.Text, vec);
return Results.Ok(new { ok = true });
});


app.Run();


// Options + DTOs
record ChatRequest(string Input, string? System);
record EmbedRequest(string Text, string? Id);
class OllamaOptions { public string BaseUrl { get; set; } = "http://localhost:11434"; public string Model { get; set; } = "flynn-llama"; }
class RagOptions { public bool Enabled { get; set; } = true; public string DbPath { get; set; } = "rag/rag.db"; }
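
For step 5 of the README below (wiring an MVC/Razor view to /chat), one minimal sketch is an MVC-side controller that relays the token stream to the browser. This assumes the MVC app runs on another box and the gateway listens at http://jetson.local:5000; ChatProxyController, its route, and that address are hypothetical, not part of the files above:

using System.Net.Http.Json;
using Microsoft.AspNetCore.Mvc;

public class ChatProxyController : Controller
{
    // Assumed gateway address on the LAN; point it at wherever the Jetson gateway listens.
    private static readonly HttpClient Http = new() { BaseAddress = new Uri("http://jetson.local:5000") };

    // POST /chat-proxy with a raw JSON string body, e.g. "Say hi to Flynn."
    [HttpPost("/chat-proxy")]
    public async Task ChatProxy([FromBody] string input)
    {
        using var upstream = new HttpRequestMessage(HttpMethod.Post, "/chat")
        {
            Content = JsonContent.Create(new { Input = input, System = (string?)null })
        };

        using var resp = await Http.SendAsync(upstream, HttpCompletionOption.ResponseHeadersRead);
        resp.EnsureSuccessStatusCode();

        // Relay tokens to the browser as they arrive from the gateway.
        Response.ContentType = "text/event-stream";
        await using var stream = await resp.Content.ReadAsStreamAsync();
        var buffer = new byte[4096];
        int read;
        while ((read = await stream.ReadAsync(buffer)) > 0)
        {
            await Response.Body.WriteAsync(buffer.AsMemory(0, read));
            await Response.Body.FlushAsync();
        }
    }
}

Your Razor page’s script (or a fetch() call) can then read /chat-proxy with a streaming reader and append tokens to the phone pane as they arrive.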


— FILE: src/JetsonChat.Gateway/OllamaClient.cs
using System.Net.Http.Headers;
using System.Net.Http.Json;
using System.Text;
using System.Text.Json;
using Microsoft.Extensions.Options;


public class OllamaClient
{
    private readonly HttpClient _http = new();
    private readonly OllamaOptions _opt;

    public OllamaClient(IOptions<OllamaOptions> opt)
    {
        _opt = opt.Value;
        _http.BaseAddress = new Uri(_opt.BaseUrl);
        _http.Timeout = TimeSpan.FromMinutes(30);
    }

    // Stream chat tokens from /api/generate (Ollama emits one JSON object per line)
    public async IAsyncEnumerable<string> Stream(string prompt, string system)
    {
        var body = new
        {
            model = _opt.Model,
            prompt,
            system,
            stream = true
        };
        using var req = new HttpRequestMessage(HttpMethod.Post, "/api/generate")
        { Content = new StringContent(JsonSerializer.Serialize(body), Encoding.UTF8, "application/json") };

        using var resp = await _http.SendAsync(req, HttpCompletionOption.ResponseHeadersRead);
        resp.EnsureSuccessStatusCode();

        using var s = await resp.Content.ReadAsStreamAsync();
        using var reader = new StreamReader(s);
        string? line;
        while ((line = await reader.ReadLineAsync()) is not null)
        {
            // C# can't yield inside a try block with a catch, so parse first, then yield outside it.
            string? token = null;
            try
            {
                using var doc = JsonDocument.Parse(line);
                if (doc.RootElement.TryGetProperty("response", out var r))
                    token = r.GetString();
            }
            catch { /* skip non-JSON heartbeats */ }

            if (token is not null)
                yield return token;
        }
    }

    // Embeddings via /api/embeddings (Ollama supports some embed models)
    public async Task<float[]> Embed(string text)
    {
        var body = new { model = "nomic-embed-text", prompt = text };
        var resp = await _http.PostAsJsonAsync("/api/embeddings", body);
        resp.EnsureSuccessStatusCode();
        var json = await resp.Content.ReadFromJsonAsync<EmbedResp>();
        return json?.embedding ?? Array.Empty<float>();
    }

    private sealed class EmbedResp { public float[] embedding { get; set; } = Array.Empty<float>(); }

}
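
Design note: the scaffold creates its own HttpClient, which is fine for a single long-lived singleton. If you prefer to let the host manage the handler lifetime, one optional alternative (a sketch, not part of the files above) is a typed-client registration in Program.cs:

// Alternative wiring, replacing builder.Services.AddSingleton<OllamaClient>():
builder.Services.AddHttpClient<OllamaClient>((sp, http) =>
{
    var opt = sp.GetRequiredService<IOptions<OllamaOptions>>().Value;
    http.BaseAddress = new Uri(opt.BaseUrl);
    http.Timeout = TimeSpan.FromMinutes(30);
});

// OllamaClient would then accept the configured client from DI instead of newing one up:
// public OllamaClient(HttpClient http, IOptions<OllamaOptions> opt) { _http = http; _opt = opt.Value; }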


— FILE: src/JetsonChat.Gateway/RagStore.cs
using System.Data;
using Microsoft.Data.Sqlite;


// Tiny SQLite + cosine sim store (swap later for FAISS if you like)
public class RagStore
{
    private readonly string _path;

    public RagStore(Microsoft.Extensions.Options.IOptions<RagOptions> o)
    {
        _path = o.Value.DbPath;

        // SQLite creates the file but not the folder, so make sure the folder exists first.
        var dir = Path.GetDirectoryName(Path.GetFullPath(_path));
        if (!string.IsNullOrEmpty(dir)) Directory.CreateDirectory(dir);

        using var conn = Open();
        using var cmd = conn.CreateCommand();
        cmd.CommandText =
            @"CREATE TABLE IF NOT EXISTS docs (id TEXT PRIMARY KEY, text TEXT NOT NULL, vec BLOB NOT NULL);
              CREATE INDEX IF NOT EXISTS docs_text ON docs(id);";
        cmd.ExecuteNonQuery();
    }

    public void Upsert(string id, string text, float[] vec)
    {
        using var conn = Open();
        using var cmd = conn.CreateCommand();
        cmd.CommandText = "INSERT INTO docs (id,text,vec) VALUES ($id,$t,$v) ON CONFLICT(id) DO UPDATE SET text=$t, vec=$v;";
        cmd.Parameters.AddWithValue("$id", id);
        cmd.Parameters.AddWithValue("$t", text);
        cmd.Parameters.Add("$v", SqliteType.Blob).Value = FloatArrayToBytes(vec);
        cmd.ExecuteNonQuery();
    }

    public List<string> Retrieve(string query, int topK)
    {
        // Naive scaffold: returns the first N docs. For real retrieval, embed the query
        // (via /embed) and rank by cosine similarity; see the sketch after this class.
        using var conn = Open();
        using var cmd = conn.CreateCommand();
        cmd.CommandText = "SELECT text FROM docs LIMIT $k;";
        cmd.Parameters.AddWithValue("$k", topK);
        using var r = cmd.ExecuteReader();
        var list = new List<string>();
        while (r.Read()) list.Add(r.GetString(0));
        return list;
    }

    private SqliteConnection Open()
    {
        var cs = new SqliteConnectionStringBuilder { DataSource = _path, Mode = SqliteOpenMode.ReadWriteCreate }.ToString();
        var c = new SqliteConnection(cs); c.Open(); return c;
    }

    private static byte[] FloatArrayToBytes(float[] arr)
    {
        var bytes = new byte[arr.Length * sizeof(float)];
        Buffer.BlockCopy(arr, 0, bytes, 0, bytes.Length); return bytes;
    }

}
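
If you want real retrieval instead of the naive LIMIT above, here is a minimal sketch, assuming the query vector comes from the same /embed model and the whole corpus fits in memory. RetrieveByVector, BytesToFloatArray, and Cosine are hypothetical additions to RagStore, not part of the scaffold:

// Hypothetical RagStore members: rank every stored doc by cosine similarity to a query vector.
public List<string> RetrieveByVector(float[] queryVec, int topK)
{
    using var conn = Open();
    using var cmd = conn.CreateCommand();
    cmd.CommandText = "SELECT text, vec FROM docs;";
    using var r = cmd.ExecuteReader();

    var scored = new List<(double Score, string Text)>();
    while (r.Read())
    {
        var text = r.GetString(0);
        var vec = BytesToFloatArray((byte[])r["vec"]);
        scored.Add((Cosine(queryVec, vec), text));
    }
    return scored.OrderByDescending(s => s.Score).Take(topK).Select(s => s.Text).ToList();
}

private static float[] BytesToFloatArray(byte[] bytes)
{
    var arr = new float[bytes.Length / sizeof(float)];
    Buffer.BlockCopy(bytes, 0, arr, 0, bytes.Length);
    return arr;
}

private static double Cosine(float[] a, float[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < Math.Min(a.Length, b.Length); i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-9);
}

The /chat endpoint would then call ollama.Embed(req.Input) first and pass the resulting vector in, instead of using the text-only Retrieve.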


— FILE: src/JetsonChat.Gateway/README.txt
Build & Run (Jetson or remote PC pointing to Jetson):


  1. Start the model server on Jetson:
    • docker compose up -d
    • ollama create flynn-llama -f ollama/Modelfile.llama3
    • ollama run flynn-llama (first run downloads base; subsequent runs are instant)
  2. Start the gateway:
    • dotnet run --project src/JetsonChat.Gateway (the curl examples below assume it listens on port 5000)
  3. Test chat:
    curl -X POST http://localhost:5000/chat -H "Content-Type: application/json" -d '{"Input":"Say hi to Flynn."}'
  4. Add RAG docs:
    curl -X POST http://localhost:5000/embed -H "Content-Type: application/json" -d '{"Id":"policy-1","Text":"Empire Ring house rules..."}'
  5. Wire your MVC/Razor view to call /chat (stream tokens) and render movie UI.



Where this fits your movie + stack


  • Record-ready UI: Your Razor “mock texting tool” can call /chat and render live tokens into the phone pane while you screen-record.
  • Offline reliability: Everything runs local. Preload model + RAG. Perfect for travel scenes.
  • Upgrade path: When you want more speed, convert your chosen model to TensorRT-LLM engines and point the gateway at the TensorRT endpoint instead of Ollama; same app, faster tokens (see the sketch after this list). (GitHub)
  • Voice scenes: Add Riva for wake-word + streaming ASR and TTS to get “Flynn, call Brad” sequences without cloud. (NVIDIA Docs)
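
To keep that backend swap a one-line change, one option is to hide the engine behind a small interface. A sketch; ILlmBackend and TensorRtClient are hypothetical names, not part of the scaffold above:

// Hypothetical abstraction so the gateway doesn't care which engine serves tokens.
public interface ILlmBackend
{
    IAsyncEnumerable<string> Stream(string prompt, string system);
    Task<float[]> Embed(string text);
}

// OllamaClient already has matching methods, so it only needs to declare the interface:
// public class OllamaClient : ILlmBackend { ... }

// A TensorRT-LLM-backed client can later implement the same contract:
// public class TensorRtClient : ILlmBackend { ... }

// Program.cs then binds whichever backend you choose:
// builder.Services.AddSingleton<ILlmBackend, OllamaClient>();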

If you want, I can next:
• Swap the gateway to call TensorRT-LLM directly,
• Add a /voice/chat route that stitches Riva ASR → LLM → Riva TTS,
• Or generate the Razor view that matches your Tron theme and streams tokens from /chat.
 