Generate database schemas for RAG (retrieval-augmented generation) applications from natural language prompts, complete with matching Pydantic schemas and SQLAlchemy models. GibsonAI produces schemas optimized for vector storage and retrieval workflows.

How it works

Describe your RAG use case in natural language, and GibsonAI generates a MySQL database schema optimized for vector storage and retrieval. The system automatically generates Pydantic schemas and SQLAlchemy models for seamless Python integration.

Key Features

RAG-Optimized Schema Generation

  • Vector Storage: Optimized tables for storing embeddings and vectors
  • Document Management: Efficient document storage and indexing
  • Metadata Handling: Structured metadata storage for enhanced retrieval
  • Relationship Modeling: Proper relationships between documents, chunks, and vectors

Python AI Integration

  • Pydantic Schemas: Type-safe models for data validation
  • SQLAlchemy Models: ORM models for database operations
  • API Generation: RESTful APIs for vector operations
  • Embedding Support: Optimized storage for AI embeddings

Text-to-SQL Analysis

  • Natural Language Queries: Ask questions about your RAG data
  • Gibson Studio: Run generated SQL queries in the intuitive data management UI
  • Vector Analysis: Analyze vector similarity and document relationships
  • Performance Insights: Optimize RAG query performance

Step-by-step guide

1. Generate RAG schema with natural language

# Create a schema for document-based RAG
gibson modify documents "Create a RAG system with documents table containing title, content, and metadata, a chunks table for text segments with embeddings, and a queries table for tracking search queries"

2. Generate Python models

# Generate Pydantic schemas for validation
gibson code schemas

# Generate SQLAlchemy models for database operations
gibson code models

# Generate all Python code
gibson code base

3. Explore RAG data with text-to-SQL

Use Gibson Studio to analyze your RAG system with prompts like these (a sketch of the SQL behind one of them follows the list):

  • "Show me the most frequently queried documents"
  • "Find documents with similar embeddings to a specific vector"
  • "What's the average chunk size across all documents?"
  • "Find documents that haven't been queried in the last month"

4. Access your RAG data

Integration options (a minimal request sketch follows the list):

  • RESTful APIs: Base URL https://api.gibsonai.com
    • SQL queries: /v1/-/query
    • Table operations: /v1/-/[table-name-in-kebab-case]
  • OpenAPI Spec: Available in your project settings
  • Direct Connection: Connection string available in the UI
  • API Documentation: Available in the data API section
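A minimal sketch against the table operations endpoint, assuming a plain GET returns rows as JSON; authentication headers are omitted, so consult the OpenAPI spec in your project settings for the exact contract:

import requests

# Fetch rows from the documents table via the table operations endpoint.
# The response shape is an assumption; check your project's OpenAPI spec.
response = requests.get("https://api.gibsonai.com/v1/-/documents")
documents = response.json()
print(documents)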

Example RAG schema

Generated database schema

-- Documents table for storing source documents
CREATE TABLE documents (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    content TEXT NOT NULL,
    metadata JSON,
    source_url VARCHAR(500),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

-- Chunks table for storing text segments
CREATE TABLE chunks (
    id INT AUTO_INCREMENT PRIMARY KEY,
    document_id INT NOT NULL,
    content TEXT NOT NULL,
    embedding JSON,
    chunk_index INT NOT NULL,
    token_count INT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE
);

-- Queries table for tracking search queries
CREATE TABLE queries (
    id INT AUTO_INCREMENT PRIMARY KEY,
    query_text TEXT NOT NULL,
    query_embedding JSON,
    results_found INT DEFAULT 0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Generated Pydantic schemas

from pydantic import BaseModel
from datetime import datetime
from typing import Optional, Dict, Any, List

class Document(BaseModel):
    id: Optional[int] = None
    title: str
    content: str
    metadata: Optional[Dict[str, Any]] = None
    source_url: Optional[str] = None
    created_at: Optional[datetime] = None
    updated_at: Optional[datetime] = None

class Chunk(BaseModel):
    id: Optional[int] = None
    document_id: int
    content: str
    embedding: Optional[List[float]] = None
    chunk_index: int
    token_count: Optional[int] = None
    created_at: Optional[datetime] = None

class Query(BaseModel):
    id: Optional[int] = None
    query_text: str
    query_embedding: Optional[List[float]] = None
    results_found: int = 0
    created_at: Optional[datetime] = None
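One way the generated schemas might be used in practice is validating an inbound payload before it reaches the database. A minimal sketch, assuming Pydantic v2:

from pydantic import ValidationError

payload = {
    "document_id": 1,
    "content": "Machine learning is a subset of artificial intelligence",
    "chunk_index": 0,
    "token_count": 150,
}

try:
    chunk = Chunk(**payload)  # raises ValidationError on missing or mistyped fields
    print(chunk.model_dump(exclude_none=True))  # use .dict() on Pydantic v1
except ValidationError as exc:
    print(exc)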

Generated SQLAlchemy models

from sqlalchemy import Column, Integer, String, Text, JSON, DateTime, ForeignKey
from sqlalchemy.orm import declarative_base, relationship  # declarative_base moved to sqlalchemy.orm in SQLAlchemy 2.0
from datetime import datetime

Base = declarative_base()

class Document(Base):
    __tablename__ = 'documents'

    id = Column(Integer, primary_key=True, autoincrement=True)
    title = Column(String(255), nullable=False)
    content = Column(Text, nullable=False)
    # 'metadata' is a reserved attribute on declarative models, so the Python
    # attribute is renamed while the database column keeps its original name
    doc_metadata = Column("metadata", JSON)
    source_url = Column(String(500))
    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

    chunks = relationship("Chunk", back_populates="document", cascade="all, delete-orphan")

class Chunk(Base):
    __tablename__ = 'chunks'

    id = Column(Integer, primary_key=True, autoincrement=True)
    document_id = Column(Integer, ForeignKey('documents.id'), nullable=False)
    content = Column(Text, nullable=False)
    embedding = Column(JSON)
    chunk_index = Column(Integer, nullable=False)
    token_count = Column(Integer)
    created_at = Column(DateTime, default=datetime.utcnow)

    document = relationship("Document", back_populates="chunks")

class Query(Base):
    __tablename__ = 'queries'

    id = Column(Integer, primary_key=True, autoincrement=True)
    query_text = Column(Text, nullable=False)
    query_embedding = Column(JSON)
    results_found = Column(Integer, default=0)
    created_at = Column(DateTime, default=datetime.utcnow)
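A minimal sketch of persisting a document together with its chunks through the ORM. The relationship defined above cascades the insert, so adding the parent is enough; the connection string is a placeholder for the one in the GibsonAI UI:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("your-connection-string-from-ui")
Session = sessionmaker(bind=engine)

with Session() as session:
    doc = Document(
        title="Machine Learning Basics",
        content="Machine learning is a subset of artificial intelligence...",
    )
    # Appending to the relationship sets chunk.document_id automatically on flush
    doc.chunks.append(Chunk(content="Machine learning is...", chunk_index=0))
    session.add(doc)  # the cascade persists the chunks with the document
    session.commit()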

RAG integration examples

Storing documents and embeddings

import requests
from typing import List

# Store a document
document_data = {
    "title": "Machine Learning Basics",
    "content": "Machine learning is a subset of artificial intelligence...",
    "metadata": {"category": "education", "tags": ["AI", "ML"]},
    "source_url": "https://example.com/ml-basics"
}

response = requests.post("https://api.gibsonai.com/v1/-/documents", json=document_data)
document = response.json()

# Store chunks with embeddings
chunk_data = {
    "document_id": document["id"],
    "content": "Machine learning is a subset of artificial intelligence",
    "embedding": [0.1, 0.2, 0.3, ...],  # Your embedding vector
    "chunk_index": 0,
    "token_count": 150
}

response = requests.post("https://api.gibsonai.com/v1/-/chunks", json=chunk_data)

Querying with text-to-SQL

# Use text-to-SQL to find similar documents
query = """
SELECT d.title, d.content, c.content as chunk_content
FROM documents d
JOIN chunks c ON d.id = c.document_id
WHERE JSON_EXTRACT(d.metadata, '$.category') = 'education'
ORDER BY d.created_at DESC
LIMIT 10
"""

response = requests.post("https://api.gibsonai.com/v1/-/query", json={"query": query})
results = response.json()

Using SQLAlchemy for vector operations

from sqlalchemy import create_engine, func
from sqlalchemy.orm import sessionmaker

# Document and Chunk are the generated models from the previous section

# Use connection string from GibsonAI UI
engine = create_engine("your-connection-string-from-ui")
Session = sessionmaker(bind=engine)
session = Session()

# Find documents by metadata
documents = session.query(Document).filter(
    func.json_extract(Document.doc_metadata, '$.category') == 'education'
).all()

# Get chunks for a document
document_id = 1
chunks = session.query(Chunk).filter(
    Chunk.document_id == document_id
).order_by(Chunk.chunk_index).all()
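Because embeddings live in JSON columns rather than a native vector type, similarity is computed client-side in this setup. A sketch, continuing the session above, that ranks chunks against a query embedding with cosine similarity (the helper below is hypothetical, not generated code):

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_embedding = [0.1, 0.2, 0.3]  # placeholder: your query vector

# Rank all embedded chunks by similarity to the query
embedded_chunks = session.query(Chunk).filter(Chunk.embedding.isnot(None)).all()
top_chunks = sorted(
    embedded_chunks,
    key=lambda c: cosine_similarity(c.embedding, query_embedding),
    reverse=True,
)[:5]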


What's next?

Try it on GibsonAI!

GibsonAI is an AI-powered developer platform that lets you design, build, and deploy production-grade serverless databases in minutes — using just natural language prompts.

Sign Up