RAG Assistant
Enterprise Document Intelligence Platform
A production-ready Retrieval Augmented Generation system featuring semantic search, conversational AI, and document processing built for regulated environments. The platform combines Weaviate, LangChain, and multi-provider LLM support to deliver fast, accurate answers over private knowledge bases.
Challenge
Build a comprehensive document intelligence platform that enables natural language querying over private document collections while maintaining conversation context, supporting multiple LLM providers, and delivering enterprise-grade security and performance.
Solution
Developed a full-stack RAG application with LangChain integration, Weaviate vector storage, and intelligent conversation management, supporting both CLI and web interfaces with streaming responses and session isolation.
Technology Stack
Modern tooling across the stack ensures performance, resilience, and maintainability.
Frontend Experience
- Next.js 14 with server components and streaming UI
- TypeScript for end-to-end type safety
- Server-Sent Events for real-time chat updates
- Local storage for client-side session persistence
Backend & AI Services
- FastAPI powering conversational and ingestion APIs
- LangChain 0.2 RAG orchestration with custom nodes
- Weaviate v4 vector database with hybrid search
- OpenAI embeddings (text-embedding-3-small)
Document Processing
- Recursive chunking with overlap for contextual recall
- Extensible loaders supporting .txt and .md inputs
- Workflow automation for automatic indexing on deploy
- Metadata preservation for source attribution
LLM Integration
- OpenAI GPT-4 and GPT-3.5 via API key authentication
- Ollama local models with streaming completions
- Jan AI self-hosted providers with unified interface
- Custom providers through OpenAI-compatible endpoints
Core Features
Production capabilities that make the assistant dependable for daily enterprise workflows.
Intelligent Document Processing
Sophisticated ingestion pipeline that keeps context-rich chunks synchronized across deployments.
- Multi-format support with extensible LangChain loaders
- 1000-character chunks with 200-character overlap
- Automatic Railway deployment seeding for clean boots
- Dynamic schema recreation to ensure fresh indexes
- OpenAI embeddings optimized for semantic recall
Pipeline Implementation
TextLoader -> RecursiveCharacterTextSplitter -> OpenAIEmbeddings -> WeaviateVectorStore powers consistent retrieval quality.
Advanced Semantic Search
Hybrid retrieval with proactive fallbacks keeps responses fast and reliable.
- Weaviate near_text semantic search as the primary strategy
- Automatic BM25 keyword fallback for resiliency
- Configurable top-k retrieval (default: 5 results)
- Context trimming to maintain concise prompt payloads
- Full metadata attribution for every source document
Performance Envelope
Delivers sub-500ms retrieval for corpora of 10K+ documents with automated failover to sustain 99.9% success rates.
Conversational Memory Management
Session-aware conversation storage maintains context while respecting token budgets.
- SQLite session storage with automatic isolation per user
- Configurable history retention with summarization backstop
- Token counting safeguards to avoid model limits
- Automatic pruning of stale conversations after 30 days
- Session IDs support multi-user concurrency safely
Architecture Notes
ConversationManager coordinates truncation and summarization to stay within 32K token windows without losing fidelity.
Streaming Response System
SSE-powered experience streams assistant replies as they are generated.
- FastAPI StreamingResponse emitting SSE-compliant chunks
- React client renders incremental updates in real time
- Graceful error handling with retry strategies
- Token-by-token progress indicator keeps users informed
- Automatic persistence of streamed exchanges per session
Under the Hood
Async generators yield `data: {chunk}` payloads that map one-to-one with UI updates for a seamless chat experience.
Data Architecture & Performance
Retrieval, storage, and optimization decisions engineered for reliability at scale.
Vector Storage Strategy
- Primary semantic retrieval backed by near_text queries
- BM25 keyword fallback for resiliency
- Embedding model: text-embedding-3-small
- Context window alignment with 32K token limits
- Observed retrieval latency under 500ms
Intelligent Token Management
- 1Token counting with tiktoken estimations
- 2Progressive context truncation before overflow
- 3LLM-powered summarization when limits approach
- 4Final validation ensuring safe prompt size
Conversation Storage Schema
- SQLite tables keyed by session_id and timestamp
- Automatic migrations maintain schema consistency
- Indexed lookups keep retrieval performant
Advanced Features
Operational depth that unlocks enterprise readiness, governance, and visibility.
Multi-LLM Provider Support
Unified configuration enables rapid switching between providers without code changes.
- Support for OpenAI, Ollama, Jan AI, and custom endpoints
- Provider-specific authentication handled transparently
- Automatic model discovery via /api/models
- Fallback hierarchies to guarantee completions
Enterprise Security Framework
Production-ready guardrails protect sensitive data and ensure safe operations.
- API key authentication enforced via X-API-Key headers
- Session isolation boundaries per conversation
- Explicit CORS policies for trusted origins
- Input sanitization to prevent injection or XSS vectors
- Rate limiting throttles abusive usage patterns
Web Interface Dashboard
Rich frontend experience that mirrors CLI functionality with modern UX.
- Real-time streaming chat with edit history
- Document browser exposing metadata-rich records
- Model selection UI for rapid provider switching
- Session management with clear and persist actions
- Admin tooling for API key configuration and validation
Technical Challenges & Solutions
Each obstacle unlocked stronger engineering patterns that now ship with the platform.
Context Window Management
Challenge
Maintaining answer quality during lengthy interactions while staying under a 32K token ceiling.
Solution
Implemented tiered truncation with automated summarization to preserve salient details without overruns.
Multi-Provider LLM Integration
Challenge
Supporting OpenAI, Ollama, and Jan AI despite differing authentication flows and payload formats.
Solution
Created a unified JanAIPromptNode abstraction that normalizes requests and handles provider-specific nuances.
Production Deployment Complexity
Challenge
Coordinating Weaviate, FastAPI, and Next.js services with synchronized environment configuration and indexing.
Solution
Railway infrastructure scripts automatically seed documents, configure networking, and perform health checks.
Semantic Search Reliability
Challenge
Ensuring consistent retrieval even when upstream providers degrade or encounter errors.
Solution
Dual-strategy retrieval with extensive error handling and logging keeps search dependable under load.
Key Achievements
The RAG Assistant delivers tangible results across engineering and product dimensions.
Technical Implementation
- Complete LangChain-powered RAG pipeline across ingestion and retrieval
- Production FastAPI + Next.js architecture with streaming and auth
- CLI and web interfaces deliver identical capabilities
- Enterprise safeguards including API keys and isolation
- Performance tuning keeps retrieval under half a second
Platform Features
- Semantic search across private document collections
- Context-aware conversational AI with durable memory
- Multi-LLM support with seamless provider switching
- Railway deployment with automated document indexing
- Responsive SSE streaming for premium UX
Future Enhancements
A forward roadmap keeps the platform evolving with the AI landscape.
Planned Features
- LLM-powered query rewriting and contextual refinement
- Agentic multi-step retrieval for complex research tasks
- Expanded document support including PDF, DOCX, and HTML
- Hybrid semantic and keyword search with reranking
- Collaborative annotation and document sharing workflows
Technical Improvements
- LangGraph integration for orchestrating advanced flows
- Distributed vector storage for enterprise-scale datasets
- Monitoring for retrieval performance and user behavior
- Role-based access control with document-level permissions
- Progressive web app features for mobile and offline access
Deployment & Infrastructure
Robust delivery pipelines and environments make the system straightforward to operate.
Production Hosting
- Railway handles FastAPI, Next.js, and Weaviate services
- Environment variables injected automatically per deploy
- Persistent volumes maintain vector indexes reliably
Container Orchestration
- Docker Compose coordinates API, web, and vector services
- Isolated networks secure inter-service communication
- Mounted volumes store documents and embeddings
- Health checks ensure services remain responsive
Development Workflow
- Git-based branching strategy for collaborative work
- .env management with sharable templates
- Docker Compose mirrors production locally
- Multi-layer testing across unit and integration levels
- Automated Railway deployments on main branch merges