
Intelligence Hub Pricing

Discover pricing plans for our enterprise AI media intelligence solution. Analyze video, audio, images, and documents at the volume that fits your team.
Billing: Monthly or Annual (save up to 17% with annual billing)
Intelligence Hub Standard
$127 / user / month
$1,270 / user / year
AI tokens 40M per month
Storage 10 GB per month

AI CAPABILITIES

Audio & video transcription
Visual scene summarization & description
On-demand multilingual translation
Natural language search across evidence
Audio & video content analysis
Workflow automation with visual graphs
* Billed monthly (auto-renewal)
* Billed as $1,270 yearly (auto-renewal)
Intelligence Hub Pro
$229 / user / month
$2,290 / user / year
AI tokens 90M per month
Storage 20 GB per month

EVERYTHING IN STANDARD, PLUS

2x more AI capacity for complex, multi-hour cases
Cross-media analysis linking audio, video & documents
Concurrent processing of multiple evidence files
Deeper Q&A over interview & interrogation recordings
Automated case chronology from media content
* Billed monthly (auto-renewal)
* Billed as $2,290 yearly (auto-renewal)
Intelligence Hub Max
$415 / user / month
$4,150 / user / year
AI tokens 140M per month
Storage 50 GB per month

EVERYTHING IN PRO, PLUS

High-volume evidence processing across full case archives
AI-driven pattern detection across large media libraries
Bulk ingestion & auto-indexing of surveillance footage
Complex multi-step investigation workflows
Priority processing for time-critical case requirements
* Billed monthly (auto-renewal)
* Billed as $4,150 yearly (auto-renewal)

Frequently Asked Questions

LLM → VLM Pipeline

Every user interaction in Intelligence Hub starts at the LLM layer, the orchestration model that interprets the prompt, decides what context it needs, and plans a response. If the prompt requires content inside a video, audio file, or document, the LLM calls VLM (vision-language model inference) to extract that content: frame-level visual analysis, audio transcription, document OCR, or any multimodal extraction task.

A single user prompt can generate two token events: one at the LLM for reasoning and synthesis, and one at VLM for content extraction. Both legs consume tokens. VLM calls on long video or dense documents are typically the larger cost driver, not the conversational LLM turn.

VIDIZMO runs both layers on your own infrastructure (on-prem or your cloud tenant), so you are consuming your own compute, not paying a per-call SaaS fee. The cost structure is infrastructure utilization, not API metering.

The LLM decides whether VLM is needed based on what the prompt requires. VLM is invoked when the answer cannot be formed from already-indexed text metadata alone.

  • VLM TRIGGERED: "What was said in this interview between 4:20 and 6:00?" Requires audio extraction from a specific segment 
  • VLM TRIGGERED: "Describe what the suspect was wearing." Requires frame-level visual analysis 
  • VLM TRIGGERED: "Summarize this PDF evidence report." Requires document content extraction 
  • LLM ONLY: "List all cases involving Officer Martinez." Answered from structured metadata index 
  • LLM ONLY: "When was this file uploaded?" Answered from database records 

A well-indexed library reduces unnecessary VLM calls because pre-processed transcripts and visual summaries are already in the index. 
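
As a rough sketch of that routing rule (function and field names here are hypothetical, not the actual Intelligence Hub API):

```python
# Minimal sketch of the LLM-layer routing decision (hypothetical names).
# The real orchestration is model-driven; this only illustrates the rule:
# invoke VLM only when indexed text metadata cannot answer the prompt.

def needs_vlm(prompt: str, indexed_fields: dict) -> bool:
    """Return True when answering requires extracting content from media."""
    # Queries about stored metadata (upload time, case links) stay LLM-only.
    metadata_terms = ("uploaded", "case", "officer", "filename")
    if indexed_fields and any(t in prompt.lower() for t in metadata_terms):
        return False
    # Anything asking about what is *inside* the media needs extraction.
    return True

# "When was this file uploaded?" -> LLM only (database/metadata index)
print(needs_vlm("When was this file uploaded?", {"uploaded": "2024-03-14"}))
# "Describe what the suspect was wearing." -> VLM (frame-level analysis)
print(needs_vlm("Describe what the suspect was wearing.", {}))
```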

Token cost does vary significantly by modality, because the input representation size differs:

  • AUDIO / SPEECH: Transcription produces a compact text token stream. A 1-hour interview generates roughly 8,000–15,000 tokens of transcript text. Relatively low token density per minute. 
  • VIDEO FRAMES: Visual analysis encodes each sampled frame as a high-dimensional input. Even at sparse sampling (1 frame per 2 seconds), an hour of video produces roughly 1,800 frame-level inference calls, making it the highest token-cost modality per hour. 
  • DOCUMENTS / PDFs: Processed page by page. Cost scales with page count and OCR complexity. Typically lower per page than video frames but higher than audio for equivalent information density. 

For most law enforcement deployments, audio transcription is the dominant VLM workload by volume. Visual frame analysis is used selectively on flagged evidence. Configuring this policy correctly is the primary way agencies manage VLM compute consumption. 
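
To make those orders of magnitude concrete, here is a back-of-the-envelope estimator. The audio figure comes from the range above; the per-frame and per-page token counts are illustrative assumptions, not published rates:

```python
# Back-of-the-envelope token estimates per modality (illustrative only).
# The audio figure comes from the text above; per-frame and per-page
# token counts are assumed values for this sketch, not published rates.

AUDIO_TOKENS_PER_HOUR = 12_000   # midpoint of the 8k-15k transcript range
FRAMES_PER_HOUR = 3600 // 2      # sparse sampling: 1 frame per 2 seconds
TOKENS_PER_FRAME = 800           # ASSUMED: visual encoder input size
TOKENS_PER_PAGE = 600            # ASSUMED: OCR output per PDF page

def estimate_tokens(audio_hours=0.0, video_hours=0.0, pdf_pages=0):
    return int(
        audio_hours * AUDIO_TOKENS_PER_HOUR
        + video_hours * FRAMES_PER_HOUR * TOKENS_PER_FRAME
        + pdf_pages * TOKENS_PER_PAGE
    )

# One hour of video dwarfs one hour of audio at these assumptions:
print(estimate_tokens(audio_hours=1))   # 12,000
print(estimate_tokens(video_hours=1))   # 1,440,000
print(estimate_tokens(pdf_pages=50))    # 30,000
```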

Token Consumption

Tokens are consumed any time the AI model processes text or multimodal content. Primary triggers: running a generative AI action (summarization, Q&A, report generation), automated classification or tagging pipelines, and any VLM extraction call on video, audio, or document content. 

Tokens are not consumed by storage, playback, or user access to already-processed content. Once a video is transcribed and indexed, browsing or replaying it incurs no additional token cost. 

Because VIDIZMO routes inference through your chosen LLM (whether Azure OpenAI, on-prem Ollama, or another provider), the per-token rate reflects your negotiated infrastructure cost, not a VIDIZMO markup. 

Standard keyword and metadata searches against the VIDIZMO index are token-free. They operate on pre-built search indexes, not live model calls. 

Token consumption begins when a query triggers a generative response: semantic search with natural-language answers, AI-generated summaries of results, or follow-up Q&A on specific content. Customers can configure which search modes are enabled per user role, allowing cost control for large deployments. 
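
One way to picture that per-role gating (mode and role names are hypothetical, not the product's configuration schema):

```python
# Hypothetical per-role search configuration: keyword/metadata search is
# always token-free; generative modes are enabled selectively by role.

SEARCH_MODES = {
    "keyword":     {"token_cost": False},  # pre-built index lookup
    "semantic_qa": {"token_cost": True},   # natural-language answers
    "ai_summary":  {"token_cost": True},   # AI-generated result summaries
}

ROLE_POLICY = {
    "investigator":  {"keyword", "semantic_qa"},
    "records_clerk": {"keyword"},  # no generative modes -> no token spend
    "supervisor":    {"keyword", "semantic_qa", "ai_summary"},
}

def token_costing_modes(role: str) -> set[str]:
    """Modes this role can use that will consume tokens."""
    return {m for m in ROLE_POLICY.get(role, {"keyword"})
            if SEARCH_MODES[m]["token_cost"]}

print(token_costing_modes("records_clerk"))  # set(): clerk queries are free
print(token_costing_modes("supervisor"))     # {'semantic_qa', 'ai_summary'}
```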

Token usage scales with what the prompt asks for, what context the LLM decides to retrieve, and the size of media passed to VLM. A short, narrow prompt costs far less than a broad analytical query, not because of word count, but because of how much content each layer must process. 

  • LOW COST: "What time did the officer arrive on scene?" LLM resolves from metadata or a short transcript segment 
  • MEDIUM COST: "Summarize the key statements from all three witness interviews." LLM orchestrates three VLM extractions then synthesizes 
  • HIGH COST: "Across all 47 BWC files in Case #2024-1188, identify every instance of a weapon visible on screen." LLM fans out to VLM for visual frame analysis across 47 files 

The largest token events are batch analytical prompts that range over many files simultaneously. Per-query cost for routine investigator lookups is low; wide-scope analysis prompts are where consumption spikes. 

By default, Intelligence Hub's retrieval layer performs semantic chunking before calling VLM. It identifies the most relevant time segments or page ranges first, then passes only those chunks for deep extraction. VLM processes a 45-second clip, not a 4-hour surveillance recording, when the query is narrow enough to allow it. 

When a prompt genuinely requires full-file processing, VLM processes the full content. Administrators can set file-length caps or require explicit confirmation for queries that will invoke full-file VLM passes. 
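
A minimal sketch of that narrowing step, with a placeholder scoring function standing in for embedding similarity:

```python
# Sketch of semantic chunking before VLM: score indexed segments against
# the query and pass only the most relevant ones for deep extraction.
# `relevance()` stands in for an embedding-similarity lookup.

from dataclasses import dataclass

@dataclass
class Segment:
    start_s: int
    end_s: int
    summary: str  # pre-indexed text summary of this time range

def relevance(query: str, segment: Segment) -> float:
    # ASSUMED: placeholder for cosine similarity over embeddings.
    return sum(w in segment.summary.lower() for w in query.lower().split())

def select_chunks(query: str, segments: list[Segment], top_k: int = 2):
    ranked = sorted(segments, key=lambda s: relevance(query, s), reverse=True)
    return ranked[:top_k]  # only these go to VLM, not the full recording

segments = [
    Segment(0, 900, "arrival and scene overview"),
    Segment(900, 945, "suspect exits vehicle, clothing visible"),
    Segment(945, 14400, "static surveillance, no activity"),
]
for seg in select_chunks("what was the suspect wearing", segments, top_k=1):
    # Prints the 45-second clip, not the 4-hour recording
    print(f"VLM extracts {seg.start_s}-{seg.end_s}s only")
```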

Pre-indexing content at ingest is the single biggest lever for reducing per-query VLM usage at runtime. Files analyzed at ingest cost far less per query than cold files processed on-demand. 

Tokens are not consumed again for already-processed content: VLM extraction results are cached in the Intelligence Hub index. Once a video segment, audio passage, or document page has been processed and its output stored as embeddings and structured metadata, subsequent queries resolve from the cache. VLM is not re-invoked for the same content unless the source file changes or a re-index is explicitly triggered. 

The most expensive token event per file is the first query (or scheduled ingest-time processing if pre-indexing is enabled). Every query thereafter is incrementally cheaper because the LLM works from pre-extracted, pre-embedded content rather than raw media. 
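
A simplified picture of the cache behavior, keyed on a content hash so a changed source file naturally misses and triggers re-extraction (names are illustrative):

```python
# Illustrative VLM result cache: extraction runs once per (file content,
# segment); later queries on unchanged content resolve from the cache.

import hashlib

_cache: dict[tuple[str, str], str] = {}

def file_fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def extract(data: bytes, segment: str) -> str:
    key = (file_fingerprint(data), segment)
    if key in _cache:
        return _cache[key]                # no VLM call, no token cost
    result = f"VLM output for {segment}"  # stand-in for a real VLM call
    _cache[key] = result                  # stored as embeddings/metadata
    return result

video = b"...raw media bytes..."
extract(video, "04:20-06:00")             # first query pays extraction cost
extract(video, "04:20-06:00")             # repeat query served from cache
extract(video + b"edit", "04:20-06:00")   # changed source: re-extracted
```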

VIDIZMO's pre-sales team produces a usage model based on: average monthly hours of video ingested, percentage of content flagged for AI processing, and expected number of AI-assisted queries per user per day. For agentic workflow deployments, workflow type mix and average file scope per run are added inputs. 

A 30–60 day metered pilot is strongly recommended for agentic deployments. Single-turn query usage is predictable from estimates alone, but agentic costs are harder to model without observing how investigators actually phrase goals and scope requests in practice. 

Bring your average monthly upload volume in hours and your active investigator/analyst headcount to the scoping call. Those two numbers drive 80% of the estimate. 
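
As an illustration of how those inputs combine, the sketch below multiplies them out; the coefficients are assumptions for the example, not VIDIZMO's actual sizing model:

```python
# Rough usage model from the scoping inputs above. The coefficients are
# assumptions for illustration, not VIDIZMO's actual sizing model.

TOKENS_PER_INGESTED_HOUR = 1_500_000  # ASSUMED: transcription + indexing
TOKENS_PER_QUERY = 5_000              # ASSUMED: average single-turn query

def monthly_tokens(ingest_hours: float, flagged_pct: float,
                   users: int, queries_per_user_day: float) -> int:
    ingest = ingest_hours * flagged_pct * TOKENS_PER_INGESTED_HOUR
    queries = users * queries_per_user_day * 30 * TOKENS_PER_QUERY
    return int(ingest + queries)

# 200 h/month ingested, 40% flagged for AI, 25 users at 10 queries/day:
print(f"{monthly_tokens(200, 0.40, 25, 10):,}")  # 157,500,000
```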

Agentic Workflows

In a standard single-turn query, one user prompt produces one LLM call and at most one VLM call. In an agentic workflow, the user submits a goal. The LLM acts as an autonomous agent: it plans a sequence of steps, invokes tools (including VLM) repeatedly, evaluates intermediate results, decides whether to continue or revise, and iterates until it satisfies the goal. Each step is an independent token event. 

A prompt that looks simple to the user, like "build me a timeline of this incident across all evidence files," may trigger 10–30 discrete LLM and VLM steps internally before a response is returned. Every step consumes tokens on both input (context carried forward) and output (the model's reasoning and tool call instructions). 

In agentic mode you are not paying per question; you are paying per agent reasoning step, and cost compounds with the number of files in scope. The sketch below illustrates why. 
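
A stripped-down view of that loop (the helper functions are stand-ins, not real API calls): each plan/act/evaluate iteration is a separate token event, and prior results are carried forward as growing context.

```python
# Skeleton of an agentic run: every plan/act/evaluate iteration is its own
# token event, and prior results are carried forward as growing context.
# `llm_step` and `vlm_extract` are stand-ins, not real API calls.

def llm_step(context: list[str], instruction: str) -> tuple[str, int]:
    tokens = sum(len(c) for c in context) // 4 + 200  # crude token proxy
    return f"plan for: {instruction}", tokens

def vlm_extract(file_id: str) -> tuple[str, int]:
    return f"extracted content of {file_id}", 50_000  # ASSUMED per-file cost

def run_agent(goal: str, files: list[str]) -> int:
    context, total = [goal], 0
    for f in files:                       # at least one VLM call per file
        content, t = vlm_extract(f)
        context.append(content)
        total += t
        _, t = llm_step(context, f"evaluate findings from {f}")
        total += t                        # a reasoning step per result
    _, t = llm_step(context, "synthesize final timeline")
    return total + t

# Ten files -> ten VLM extractions plus eleven LLM steps, one goal.
print(run_agent("build incident timeline", [f"bwc_{i}" for i in range(10)]))
```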

  • INCIDENT RECONSTRUCTION: Agent retrieves BWC, in-car, and interview files, cross-references timestamps, extracts segments via VLM, and synthesizes a chronological narrative 
  • MULTI-FILE Q&A: Agent answers a question requiring reading across many documents or video files, iteratively deciding which sources to consult next based on prior findings 
  • AUTOMATED CASE SUMMARIZATION: Agent traverses an entire case folder, extracts content from each file via VLM, and produces a structured summary 
  • COMPLIANCE AUDITING: Agent scans for policy violations across a batch of files, flagging and documenting findings autonomously 
  • SINGLE-FILE Q&A: Not agentic. One LLM call, at most one VLM call. Predictable and low cost. 

Five variables compound together: 

  • GOAL BREADTH: A narrow goal triggers far fewer steps than a broad one (the highest-leverage variable)
  • FILE COUNT IN SCOPE: Each file the agent retrieves adds at least one VLM call. Ten files can mean 10–30 VLM invocations depending on extraction depth 
  • CONTEXT WINDOW CARRY-FORWARD: Agents carry prior results as context into each subsequent LLM step. A 20-step workflow's final synthesis step may have a context window 5–10x larger than its first 
  • ITERATION AND SELF-CORRECTION: When the agent evaluates an intermediate result as insufficient, it loops, re-querying VLM or retrieving additional files 
  • OUTPUT FORMAT COMPLEXITY: Structured report generation (JSON, timestamped narrative, court-exhibit format) requires more LLM output tokens than plain prose 

Pre-indexed content dramatically reduces per-step VLM cost. The agent still performs the same number of reasoning steps, but retrieval steps resolve from cache rather than invoking live VLM extraction. 

Prompt specificity is a direct cost lever. A vague prompt forces the agent to retrieve broadly before narrowing; a specific prompt lets it go directly to the relevant content. 

  • HIGH COST: "Tell me everything important about this incident." The agent cannot bound scope, retrieves all available files, iterates extensively 
  • MEDIUM COST: "Summarize the use-of-force events in Case #2024-1188." The agent filters by case, still processes multiple files with a clear extraction target 
  • LOWER COST: "What did Officer Torres say in his BWC footage on 03/14/2024, between 14:20 and 14:45?" Routes directly to a single file and specific time window 

Training investigators on goal-scoped prompting (being specific about case number, file type, time range, and the exact question) is one of the most effective cost-management practices for agentic deployments. 

Yes, significantly. In agentic chains the LLM is called at every reasoning and planning step, not just once. A larger model costs more per token but may complete the task in fewer steps. A smaller model costs less per token but may require more iterations. 

  • LIGHT MODEL: Routine lookups, single-file Q&A, metadata queries. Fast and sufficient for the task 
  • FULL MODEL: Multi-file analysis, incident reconstruction, court-ready summarization. Higher per-token cost, fewer iterations needed 
  • ON-PREM VIA OLLAMA: Zero marginal token cost after infrastructure. Preferred for high-volume or CJIS-sensitive workflows 

The optimal configuration for most law enforcement deployments: a light model for daily investigator queries, a full model reserved for supervisor-level analytical workflows. 
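
That split could be expressed as a simple router; the model names and task taxonomy below are illustrative:

```python
# Illustrative light/full model routing by task type. Model names and the
# task taxonomy are assumptions for this sketch.

LIGHT_TASKS = {"metadata_lookup", "single_file_qa"}
FULL_TASKS = {"multi_file_analysis", "incident_reconstruction",
              "court_summary"}

def pick_model(task: str) -> str:
    if task in FULL_TASKS:
        return "full-model"    # costlier per token, fewer iterations
    return "light-model"       # cheaper per token, fine for routine lookups

print(pick_model("single_file_qa"))           # light-model
print(pick_model("incident_reconstruction"))  # full-model
```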

Agentic workflows are priced on consumption: the aggregate token volume across all LLM and VLM steps within the workflow run. No flat per-workflow charge, no per-step fee. The cost is the sum of all LLM input and output tokens across every reasoning and synthesis step, plus all VLM tokens consumed by each multimodal extraction call. 

Because token rates reflect your underlying infrastructure, VIDIZMO does not add a per-token markup. For on-premises deployments, cost is GPU/CPU compute time. There is no runaway spend scenario the way there is with cloud API billing. 

Content Pre-processing

HLS adaptive-bitrate conversion is a pre-processing step that runs before any AI pipeline. It is a pure media encoding operation (FFmpeg-based, CPU/GPU bound) and does not consume LLM or VLM tokens. It is included in the platform license as part of standard ingestion and is not separately metered.

Pipeline sequence at ingest: Ingest → HLS Encode → VLM (transcribe + analyze) → LLM (index + embed) → Searchable

Only the VLM and LLM stages consume AI tokens. Encoding, storage write, and metadata extraction (file hash, duration, codec info) are all token-free.
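
Expressed compactly, with stage names mirroring the sequence above, only two of the five stages are metered:

```python
# The ingest pipeline stages from above, marked by whether they consume
# AI tokens. Only the VLM and LLM stages are metered.

PIPELINE = [
    ("Ingest",                 False),  # storage write, hash, codec info
    ("HLS Encode",             False),  # FFmpeg-based, CPU/GPU bound
    ("VLM transcribe/analyze", True),
    ("LLM index/embed",        True),
    ("Searchable",             False),  # index lookups are token-free
]

for stage, metered in PIPELINE:
    print(f"{stage:<24} tokens: {'yes' if metered else 'no'}")
```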

For DEMS customers with continuous ingest from BWC, in-car, or interview room systems, encoding capacity is provisioned as part of the deployment tier rather than metered per-minute. The license covers a defined ingest throughput level aligned to agency size. There is no surprise per-minute overage charge within the provisioned tier. 

For on-premises deployments, encoding runs on your hardware. VIDIZMO's software license is the cost, with no cloud transcoding fees. 

Whether content is processed at ingest or only when first queried is a configurable policy. VIDIZMO supports two modes: 

  • INGEST-TIME PROCESSING: VIDIZMO’s proprietary transcription models process and transcribe content immediately on upload. Higher upfront compute cost, but all subsequent queries are served from cache, dramatically lowering per-query cost for active evidence libraries. 
  • ON-DEMAND PROCESSING: Transcription is triggered only when a query first touches a file. Zero compute cost for content never queried. The first query on any file absorbs the full extraction cost. 

A hybrid policy is also available: auto-process content tagged as active case evidence, defer processing on background/archive footage. 
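
A hybrid policy of that kind might be expressed as follows; the tag names are hypothetical:

```python
# Hypothetical hybrid pre-processing policy: process active case evidence
# at ingest, defer background/archive footage until it is first queried.

def process_at_ingest(tags: set[str]) -> bool:
    if "active_case" in tags:
        return True   # ingest-time: higher upfront, cheap queries later
    if "archive" in tags or "background" in tags:
        return False  # on-demand: first query absorbs extraction cost
    return False      # default: defer until queried

print(process_at_ingest({"active_case", "bwc"}))  # True
print(process_at_ingest({"archive"}))             # False
```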

Transcription is priced per minute of audio processed. VIDIZMO supports 82 languages and uses a speech-to-text engine that can be customer-selected (Azure Cognitive Services, AWS Transcribe, VIDIZMO's proprietary on-premises transcription models, etc.). The per-minute rate reflects your chosen engine's cost. 

For customers deploying on-premises with VIDIZMO’s proprietary transcription models, transcription cost is effectively zero beyond the compute infrastructure, a significant differentiator for high-volume law enforcement customers managing CJIS data residency requirements. 

Transcripts are stored and re-used. Replaying a video or re-running a search does not re-transcribe the file. The per-minute charge is a one-time processing cost. 

Visual AI analysis is priced per hour of video processed through the analysis pipeline. The pipeline can include any combination of object and scene detection, face detection, license plate detection, activity classification, and custom model inference. Customers choose which analysis modules run on which content categories. You do not pay for modules you don't enable. 

  • PER-HOUR: Full visual AI pipeline run 
  • PER-HOUR: Custom model inference (bring-your-own-model) 
  • INCLUDED: Accessing stored analysis results and metadata search 
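
Module selection and per-hour billing might be modeled like this; module names and rates are assumed for the sketch:

```python
# Illustrative per-hour visual AI billing: only enabled modules accrue
# cost. Module names and rates are assumptions, not published pricing.

MODULE_RATE_PER_HOUR = {  # ASSUMED illustrative rates per hour of video
    "object_scene": 1.00,
    "face_detect": 0.75,
    "license_plate": 0.50,
    "activity": 0.80,
    "custom_model": 1.25,
}

def analysis_cost(hours: float, enabled: set[str]) -> float:
    """Cost accrues per hour only for the modules actually enabled."""
    return hours * sum(MODULE_RATE_PER_HOUR[m] for m in enabled)

# 10 hours of footage with only plate + object/scene detection enabled:
print(analysis_cost(10, {"object_scene", "license_plate"}))  # 15.0
```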

Redactor uses VIDIZMO's own AI detection models, separate from the VLM pipeline. Pricing is based on hours of video processed, not token consumption. 

  • ON-DEMAND PROCESSING: Redaction is triggered manually per file and charged based on the duration of video processed. 
  • AUTOMATIC PROCESSING: Redaction runs automatically on ingest or via policy rules, also charged per hour of video processed. 
  • FREE: Storing, viewing, and sharing the completed redacted file incurs no additional charge. 