Multimodal

The DVAI-Bridge Capacitor plugins accept OpenAI-shaped content parts. What actually runs depends on the backend you started and whether your loaded model has the matching modality. This page documents the shapes, the per-backend support matrix, and the exact error wording you get when a request doesn't fit.

OpenAI content parts

Each messages[i].content is either a plain string or an array of content parts.

ts
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string; detail?: "low" | "high" | "auto" } }
  | { type: "input_audio"; input_audio: { data: string; format: AudioFormat } };

type AudioFormat = "pcm16" | "wav" | "mp3" | "m4a" | "aac" | "flac" | "ogg";

A multimodal request looks like this.

json
{
  "model": "<modelId>",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What is in this picture?" },
        { "type": "image_url", "image_url": { "url": "data:image/png;base64,iVBOR..." } }
      ]
    }
  ]
}
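The same request can be assembled in application code. The sketch below is illustrative only: `buildImageQuestion`, `ChatMessage`, and `ChatRequest` are hypothetical names, and the `system` and `assistant` roles are assumed from the OpenAI convention rather than stated on this page. How the body reaches the plugin is left out.

```typescript
// Content part shapes, mirroring the text and image members of ContentPart above.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string; detail?: "low" | "high" | "auto" } };

// Hypothetical request types; role values beyond "user" are an assumption.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string | ContentPart[];
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
}

// Hypothetical helper: one user turn pairing a text prompt with one image.
function buildImageQuestion(model: string, prompt: string, imageUrl: string): ChatRequest {
  return {
    model,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: prompt },
          { type: "image_url", image_url: { url: imageUrl } },
        ],
      },
    ],
  };
}

// Serialized body, ready to hand to whatever transport the app uses.
const requestBody = JSON.stringify(
  buildImageQuestion("<modelId>", "What is in this picture?", "data:image/png;base64,iVBOR..."),
);
```

Keeping the builder pure makes it easy to unit-test the request shape without a running backend.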

Per-backend modality matrix

Modality | capacitor-llama | capacitor-foundation | capacitor-mediapipe
Text | ✅ | ✅ | ✅
Image | ✅ if mmprojPath loaded | ❌ | ✅ if vision-capable model
Audio | ✅ if model has native audio encoder | ❌ | ❌
Streaming SSE | ✅ | ✅ | ✅

Embeddings: ✅ if embeddingMode: true.

Phase 1 best-tested paths:

  • Text on all three backends.
  • Image on capacitor-mediapipe against vision-capable Gemma .task artifacts. Image on capacitor-llama is wired, but the mmproj path stays gated until Phase 2 verification.
  • Audio on capacitor-llama needs a model whose GGUF has a native audio encoder, such as Gemma 4 multimodal or Phi-4 Multimodal. The pass-through is implemented, but verification waits for Phase 2 hardware.

Treat the matrix as the contract. Treat Phase 1 verification status as a caveat layered on top.
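If an app wants to avoid sending a request that will come back as a 400, the matrix can be mirrored client-side. This is a minimal sketch, not part of the plugins: `LoadedModel`, its field names, and `canServe` are hypothetical, and the app is assumed to track what it loaded.

```typescript
type Backend = "capacitor-llama" | "capacitor-foundation" | "capacitor-mediapipe";
type PartType = "text" | "image_url" | "input_audio";

// What the app knows about the model it started. All field names are hypothetical.
interface LoadedModel {
  backend: Backend;
  hasMmproj?: boolean;       // llama: an mmproj was loaded
  visionCapable?: boolean;   // mediapipe: vision-enabled .task model
  hasAudioEncoder?: boolean; // llama: GGUF has a native audio encoder
}

// Returns true when the matrix says the backend can serve this content part.
function canServe(part: PartType, m: LoadedModel): boolean {
  if (part === "text") return true; // text works on all three backends
  if (part === "image_url") {
    if (m.backend === "capacitor-llama") return m.hasMmproj === true;
    if (m.backend === "capacitor-mediapipe") return m.visionCapable === true;
    return false; // capacitor-foundation returns 400 for images
  }
  // input_audio: only capacitor-llama, and only with a native audio encoder
  return m.backend === "capacitor-llama" && m.hasAudioEncoder === true;
}
```

A check like this only mirrors the contract. The server-side errors remain the source of truth, so the app should still handle a 400.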

Image content parts

Three URL forms are accepted.

1. Data URLs (base64 inline)

ts
{
  type: "image_url",
  image_url: { url: "data:image/png;base64,iVBORw0KGgoAAAANS..." }
}

Best for images already in memory — camera capture, generated previews. The plugin base64-decodes inline.

2. https:// URLs

ts
{
  type: "image_url",
  image_url: { url: "https://example.com/cat.jpg" }
}

The plugin fetches the URL on the native side. No CORS concerns — that's a browser-only constraint. Treat any external fetch as network-dependent and error-prone.

3. file:// URLs

ts
{
  type: "image_url",
  image_url: { url: "file:///data/.../cache/photo.jpg" }
}

Reads directly from app-private storage. Pair with Capacitor's Camera and Filesystem plugins for capture flows.
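A client can tell the three forms apart before sending, for example to warn that an https:// image needs network access. The sketch below is an assumption about how one might do that in app code; `ImageSource` and `classifyImageUrl` are hypothetical names and say nothing about how the plugin parses URLs internally.

```typescript
// Hypothetical tagged union describing the three accepted URL forms.
type ImageSource =
  | { kind: "data"; mimeType: string; base64: string }
  | { kind: "https"; url: string }
  | { kind: "file"; path: string };

// Classifies an image_url value; returns null for any other scheme.
function classifyImageUrl(url: string): ImageSource | null {
  // data:<mime>;base64,<payload>
  const dataMatch = /^data:([^;,]+);base64,([\s\S]*)$/.exec(url);
  if (dataMatch) {
    return { kind: "data", mimeType: dataMatch[1] || "", base64: dataMatch[2] || "" };
  }
  if (/^https:\/\//i.test(url)) {
    return { kind: "https", url };
  }
  if (/^file:\/\//i.test(url)) {
    // Strip the "file://" prefix, keeping the leading slash of the path.
    return { kind: "file", path: url.slice("file://".length) };
  }
  return null;
}
```

Rejecting unknown schemes client-side keeps unsupported URLs from reaching the backend at all.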

Decoded image bytes go to the backend.

  • capacitor-llama: mtmd_helper_eval with the loaded mmproj. Decodes PNG and JPEG internally.
  • capacitor-mediapipe: LlmInferenceSession.addImage(MPImage) on vision-enabled .task models.
  • capacitor-foundation: returns 400. Not in the current API.

Audio content parts

ts
{
  type: "input_audio",
  input_audio: { data: "<base64>", format: "wav" }
}

data is the base64 encoding of the audio bytes: the encoded file for container formats, or raw PCM samples for format: "pcm16". The plugin decodes via platform-native APIs into 16-bit PCM and hands the samples to the backend's audio API.

Format availability per platform

Format | iOS | Android
pcm16 | ✅ direct | ✅ direct
wav | ✅ | ✅
mp3 | ✅ | ✅
m4a / aac | ✅ | ✅
flac | ✅ | ❌ → 400
ogg | ❌ → 400 | ✅

Decoding paths:

  • iOS: AVAudioFile + AVAudioConverter (built-in).
  • Android: MediaExtractor + MediaCodec (built-in).

If you target both platforms, wav, mp3, and m4a are the safe-by-default formats. flac works only on iOS, and ogg works only on Android.
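The platform rules can be encoded once so that recording or upload code picks a format that will not be rejected. This is a sketch under the assumption that the only gaps are ogg on iOS and flac on Android, as described here; `Platform`, `isFormatSupported`, and `portableFormats` are hypothetical names.

```typescript
type AudioFormat = "pcm16" | "wav" | "mp3" | "m4a" | "aac" | "flac" | "ogg";
type Platform = "ios" | "android";

// Formats each platform rejects with a 400, per the availability table.
const UNSUPPORTED: Record<Platform, AudioFormat[]> = {
  ios: ["ogg"],
  android: ["flac"],
};

const ALL_FORMATS: AudioFormat[] = ["pcm16", "wav", "mp3", "m4a", "aac", "flac", "ogg"];

// True when the platform can decode the given format.
function isFormatSupported(format: AudioFormat, platform: Platform): boolean {
  return UNSUPPORTED[platform].indexOf(format) === -1;
}

// Formats that decode on both iOS and Android.
function portableFormats(): AudioFormat[] {
  return ALL_FORMATS.filter(
    (f) => isFormatSupported(f, "ios") && isFormatSupported(f, "android"),
  );
}
```

Checking the format before encoding avoids a round trip that ends in an "Unsupported audio format" response.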

Backend routing for audio:

  • capacitor-llama: mtmd_helper_eval_audio (or current upstream equivalent) for models with a native audio encoder.
  • capacitor-foundation: returns 400. Not in the current API.
  • capacitor-mediapipe: returns 400. No audio-capable tasks in Phase 1.

Error semantics

When a content part can't be served, the plugin returns one of these exact-wording responses. Match on these strings if you build user-facing remediation UI.

Situation | Status | Body
Image content part on llama, no mmproj loaded | 400 | { "error": "Request includes an image but no mmproj was loaded. Set nativeMmprojPath when starting." }
Image content part on foundation | 400 | { "error": "Image input not supported by Apple Foundation Models in this version." }
Audio content part, model without audio encoder | 400 | { "error": "Loaded model has no native audio encoder. Use a multimodal model like Gemma 4 or Phi-4 Multimodal." }
Image fetch from https:// URL fails | 502 | { "error": "Failed to fetch image: <reason>" }
Audio decode fails | 400 | { "error": "Audio decode failed: <reason>" }
Unsupported audio format | 400 | { "error": "Unsupported audio format: <fmt>. Supported on this platform: <list>." }

These wordings are spec-pinned and asserted by the cross-language handler parity tests. They will not change without a CHANGELOG entry.
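Because the strings are pinned, remediation UI can match on them directly. The sketch below maps each documented body to a hypothetical remediation tag; `Remediation` and `classifyError` are illustrative names, and the three templated messages are matched by their fixed prefix since `<reason>`, `<fmt>`, and `<list>` vary.

```typescript
// Hypothetical remediation tags an app might switch on.
type Remediation =
  | "load-mmproj"
  | "image-unsupported-backend"
  | "need-audio-model"
  | "retry-image-fetch"
  | "audio-decode-failed"
  | "change-audio-format"
  | "unknown";

// Maps the "error" string from a response body to a remediation tag.
function classifyError(message: string): Remediation {
  if (message === "Request includes an image but no mmproj was loaded. Set nativeMmprojPath when starting.") {
    return "load-mmproj";
  }
  if (message === "Image input not supported by Apple Foundation Models in this version.") {
    return "image-unsupported-backend";
  }
  if (message === "Loaded model has no native audio encoder. Use a multimodal model like Gemma 4 or Phi-4 Multimodal.") {
    return "need-audio-model";
  }
  // Templated messages: match on the fixed prefix only.
  if (message.indexOf("Failed to fetch image: ") === 0) return "retry-image-fetch";
  if (message.indexOf("Audio decode failed: ") === 0) return "audio-decode-failed";
  if (message.indexOf("Unsupported audio format: ") === 0) return "change-audio-format";
  return "unknown";
}
```

Falling back to "unknown" keeps the UI working if a later release adds a message this table does not list.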

Streaming SSE notes

When stream: true, all three backends emit OpenAI-shaped chunks. There is one documented asymmetry across plugins — see Handler parity for the detail. For application code using an OpenAI SDK or the Vercel AI SDK, the asymmetry is invisible — those clients tolerate both shapes.
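For code that reads the stream without an SDK, a tolerant parser is the safer choice. The sketch below assumes the common OpenAI streaming convention, which this page does not spell out: `data:` lines carrying JSON chunks with `choices[0].delta.content`, ending in `data: [DONE]`. `extractDeltas` is a hypothetical helper that works on already-buffered text. A real client would also need to buffer partial lines across network reads.

```typescript
// Shape assumed for a streamed chunk; every field is optional so that
// variations between backends do not throw.
interface StreamChunk {
  choices?: { delta?: { content?: string } }[];
}

// Pulls the text deltas out of a buffered SSE payload.
function extractDeltas(sseText: string): string[] {
  const out: string[] = [];
  const lines = sseText.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    if (line.indexOf("data:") !== 0) continue; // skip blank lines and comments
    const payload = line.slice("data:".length).trim();
    if (payload === "" || payload === "[DONE]") continue;
    let chunk: StreamChunk | null = null;
    try {
      chunk = JSON.parse(payload) as StreamChunk | null;
    } catch (_err) {
      continue; // ignore malformed chunks rather than abort the stream
    }
    const choice = chunk && chunk.choices ? chunk.choices[0] : undefined;
    const text = choice && choice.delta ? choice.delta.content : undefined;
    if (typeof text === "string" && text.length > 0) out.push(text);
  }
  return out;
}
```

Joining the returned array gives the full assistant text once the stream has ended.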

Phase 1 limitations

  • Image and audio pass-through are implemented behind the HTTP boundary. Per-modality verification on capacitor-llama lands on Phase 2's hardware budget. Expect "wired but lightly tested."
  • Vision on capacitor-mediapipe is the most-tested image path in Phase 1.
  • capacitor-foundation stays text-only until Apple ships a multimodal LanguageModelSession API.

See also