
🤖 MediaPipe Tracking

Real-time hand, face, and body pose tracking via webcam — powered by Google MediaPipe Vision

📋 Overview

The MediaPipe module brings computer-vision tracking to Nova64 carts. Track hands, faces, and body poses using your webcam, then use the landmark data to drive 3D scenes — perfect for AR filters, gesture controls, motion capture, and interactive installations.

💡 Key Features
  • Zero bundle impact — Models (5–20 MB each) are lazy-loaded from CDN only when you call init*Tracking(). Carts that don't use tracking pay nothing.
  • GPU-accelerated — Uses WebGL delegate by default for fast inference on any device.
  • Poll-based API — Detection runs in a background loop; call getHandLandmarks() etc. in your update() to read the latest results.
  • AR camera background — One call to showCameraBackground() sets the webcam feed as your scene background for an AR passthrough effect.
  • Combine freely — You can run hand + face + pose tracking simultaneously.
⚠️ Requirements
  • Camera permission — The browser will prompt the user for webcam access on first call.
  • HTTPS required — getUserMedia only works over HTTPS or on localhost.
  • Internet connection — Models are downloaded from Google's CDN on first use. Subsequent loads use browser cache.
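A cart can feature-detect these requirements before attempting to start tracking. A minimal sketch — cameraSupported is a hypothetical helper, not part of the Nova64 API:

```javascript
// Hypothetical helper (not part of the Nova64 API): check that the
// browser exposes getUserMedia before calling init*Tracking().
// getUserMedia is only available in secure contexts (HTTPS or localhost).
function cameraSupported(nav = typeof navigator !== 'undefined' ? navigator : undefined) {
  return !!nav
    && !!nav.mediaDevices
    && typeof nav.mediaDevices.getUserMedia === 'function';
}
```

With a check like this, a cart can show a "camera unavailable" message instead of failing inside the tracking init call.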

🤚 Hand Tracking

await initHandTracking(options?)

Starts the webcam and initializes the MediaPipe Hand Landmarker. Begins detecting hands immediately. This function is async — the model download may take a few seconds on first call.

options.numHands number - Maximum hands to detect (default: 2)
options.facing string - Camera facing mode: 'user' (default) or 'environment'
options.delegate string - Inference backend: 'GPU' (default) or 'CPU'
Example:
export async function init() {
  await initHandTracking({ numHands: 2 });
  showCameraBackground(); // AR passthrough
}
getHandLandmarks()

Returns the latest detected hand landmarks. Each hand has 21 landmark points with normalized coordinates (0–1 range, relative to the video frame).

Returns: Array<Array<{x, y, z}>> - One array of 21 landmarks per detected hand
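Because landmark coordinates are normalized, they must be mapped into scene space before positioning meshes. A sketch of the mapping used throughout the examples on this page — toScene is a hypothetical helper, and the 8 × 6 extent is an assumption tied to the default camera framing:

```javascript
// Hypothetical helper: convert a normalized landmark (0–1, video space)
// into scene coordinates. Video y grows downward while scene y grows
// upward, hence the sign flip.
function toScene(landmark, width = 8, height = 6) {
  return {
    x: (landmark.x - 0.5) * width,   // re-center, then scale to scene width
    y: -(landmark.y - 0.5) * height, // flip, then scale to scene height
  };
}
```

Typical use: const { x, y } = toScene(hands[0][8]); setPosition(cursor, x, y, -3);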

Hand Landmark Indices

        8   12  16  20
        |   |   |   |
        7   11  15  19
        |   |   |   |
        6   10  14  18
        |   |   |   |
    4   5   9   13  17
     \  |   |  /   /
      3 \   | /   /
       \ \  |/   /
        2  +-------- PALM
         \ |   /
          1|  /
           | /
           0  WRIST
Index  Landmark             Index  Landmark
0      Wrist                11     Middle finger DIP
1      Thumb CMC            12     Middle finger tip
2      Thumb MCP            13     Ring finger MCP
3      Thumb IP             14     Ring finger PIP
4      Thumb tip            15     Ring finger DIP
5      Index finger MCP     16     Ring finger tip
6      Index finger PIP     17     Pinky MCP
7      Index finger DIP     18     Pinky PIP
8      Index finger tip     19     Pinky DIP
9      Middle finger MCP    20     Pinky tip
10     Middle finger PIP
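The indices above make simple finger-state checks straightforward. As a sketch, a finger can be treated as extended when its tip lies farther from the wrist than its PIP joint — isFingerExtended is a hypothetical helper, and the heuristic is approximate (it can misfire at extreme hand angles):

```javascript
// Hypothetical helper: a finger counts as "extended" when its tip
// (e.g. 8) is farther from the wrist (0) than its PIP joint (e.g. 6).
// Tip/PIP pairs per the table above: index (8, 6), middle (12, 10),
// ring (16, 14), pinky (20, 18).
function isFingerExtended(hand, tipIdx, pipIdx) {
  const wrist = hand[0];
  const dist = (a, b) => Math.hypot(a.x - b.x, a.y - b.y);
  return dist(hand[tipIdx], wrist) > dist(hand[pipIdx], wrist);
}
```

For example, isFingerExtended(hands[0], 8, 6) tests the index finger; counting how many of the four pairs pass gives a rough "fingers up" count.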
Example — pinch detection:
export function update(dt) {
  const hands = getHandLandmarks();
  if (hands.length === 0) return;

  const thumb = hands[0][4];  // thumb tip
  const index = hands[0][8];  // index tip
  const dx = thumb.x - index.x;
  const dy = thumb.y - index.y;
  const dist = Math.sqrt(dx * dx + dy * dy);

  if (dist < 0.04) {
    // Pinch detected!
    sfx('coin');
  }
}
getHandGesture()

Returns the recognized gesture for each detected hand. Uses MediaPipe's built-in gesture classifier.

Returns: Array<{name, score} | null> - One result per hand, or null if no gesture recognized

Built-in Gesture Names

Gesture       Description
None          No recognized gesture
Closed_Fist   All fingers curled into a fist
Open_Palm     All fingers extended, palm facing the camera
Pointing_Up   Index finger pointing upward
Thumb_Down    Thumbs-down gesture
Thumb_Up      Thumbs-up gesture
Victory       Peace / victory sign (index + middle fingers)
ILoveYou      ASL "I love you" (thumb + index + pinky extended)
Example:
const gestures = getHandGesture();
if (gestures[0]?.name === 'Closed_Fist') {
  // Fist = attack!
  triggerAttack();
} else if (gestures[0]?.name === 'Open_Palm') {
  // Open palm = shield
  activateShield();
}

😀 Face Tracking

await initFaceTracking(options?)

Starts the webcam and initializes the MediaPipe Face Landmarker with blend shape output. Detects face mesh (478 landmarks) and expression blend shapes.

options.numFaces number - Maximum faces to detect (default: 1)
options.facing string - 'user' (default) or 'environment'
options.delegate string - 'GPU' (default) or 'CPU'
Example:
export async function init() {
  await initFaceTracking({ numFaces: 1 });
  showCameraBackground(); // webcam behind the 3D scene
}
getFaceLandmarks()

Returns the 478-point face mesh for each detected face. Landmarks are normalized coordinates (0–1) relative to the video frame.

Returns: Array<Array<{x, y, z}>> - One array of 478 landmarks per face
🗺️ Key Landmark Indices
  • 1 — Nose tip
  • 33, 263 — Left / right eye outer corner
  • 133, 362 — Left / right eye inner corner
  • 159, 386 — Left / right eye top (pupil area)
  • 13, 14 — Upper / lower lip center
  • 61, 291 — Left / right mouth corner
  • 10 — Forehead center
  • 152 — Chin bottom
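These indices also support simple expression checks without blend shapes. For instance, mouth openness can be estimated from the lip centers, normalized by face height so the value stays stable as the face moves toward or away from the camera — mouthOpenness is a hypothetical helper, and any threshold you apply to it needs tuning:

```javascript
// Hypothetical helper: ratio of the lip gap (landmarks 13–14) to face
// height (forehead 10 to chin 152). Normalizing by face height makes
// the value roughly distance-invariant.
function mouthOpenness(face) {
  const gap = Math.hypot(face[13].x - face[14].x, face[13].y - face[14].y);
  const height = Math.hypot(face[10].x - face[152].x, face[10].y - face[152].y);
  return height > 0 ? gap / height : 0;
}
```

Typical use: if (mouthOpenness(faces[0]) > 0.1) { /* mouth open */ } — the 0.1 cutoff is a rough starting point, not a calibrated value.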
Example — 3D face filter:
let hat;

export async function init() {
  await initFaceTracking();
  showCameraBackground();
  hat = createCone(0.3, 0.8, 0xff0000, [0, 0, 0]);
}

export function update(dt) {
  const faces = getFaceLandmarks();
  if (faces.length === 0) return;

  // Place hat above forehead (landmark 10)
  const forehead = faces[0][10];
  const x = (forehead.x - 0.5) * 8;
  const y = -(forehead.y - 0.5) * 6 + 0.6;
  setPosition(hat, x, y, -3);
}
getFaceBlendShapes()

Returns blend shape coefficients for each detected face — 52 expression values (0–1) for driving avatar animation or expression detection.

Returns: Array<BlendShapeData> - One set of blend shapes per face

Key Blend Shape Names

Category      Blend Shapes
Eyes          eyeBlinkLeft, eyeBlinkRight, eyeWideLeft, eyeWideRight, eyeSquintLeft, eyeSquintRight
Brows         browDownLeft, browDownRight, browInnerUp, browOuterUpLeft, browOuterUpRight
Mouth         mouthSmileLeft, mouthSmileRight, mouthFrownLeft, mouthFrownRight, mouthPucker, jawOpen
Nose / Cheek  noseSneerLeft, noseSneerRight, cheekPuff, cheekSquintLeft, cheekSquintRight
Example — smile detection:
const shapes = getFaceBlendShapes();
if (shapes.length > 0) {
  const bs = shapes[0].categories;
  const smile = bs.find(b => b.categoryName === 'mouthSmileLeft');
  if (smile?.score > 0.5) {
    print('😊 Smiling!', 10, 10, 0x00ff00);
  }
}

🏃 Pose Tracking

await initPoseTracking(options?)

Starts the webcam and initializes the MediaPipe Pose Landmarker. Detects 33 body keypoints for full-body pose estimation.

options.numPoses number - Maximum poses to detect (default: 1)
options.facing string - 'user' (default) or 'environment'
options.delegate string - 'GPU' (default) or 'CPU'
getPoseLandmarks()

Returns 33 body keypoints per detected person. Coordinates are normalized (0–1).

Returns: Array<Array<{x, y, z, visibility}>> - One array of 33 landmarks per person
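Unlike hand and face landmarks, pose landmarks carry a visibility score: joints that are occluded or out of frame report low values, and gating on the score avoids driving meshes with unreliable keypoints. A sketch — visibleEnough is a hypothetical helper, and the 0.5 cutoff is an assumption to tune per cart:

```javascript
// Hypothetical helper: treat a pose landmark as usable only when its
// visibility score clears a threshold. Landmarks without a score
// (e.g. hand/face landmarks) are assumed usable.
function visibleEnough(landmark, minVisibility = 0.5) {
  return typeof landmark.visibility !== 'number'
    || landmark.visibility >= minVisibility;
}
```

In the body-puppet example below, wrapping each setPosition call in if (visibleEnough(p[15])) { … } keeps the hand spheres from jumping when a wrist leaves the frame.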

Pose Landmark Indices

Index  Landmark                          Index  Landmark
0      Nose                              17     Left pinky
1–3    Left eye (inner, center, outer)   18     Right pinky
4–6    Right eye (inner, center, outer)  19–20  Left / right index
7–8    Left / right ear                  21–22  Left / right thumb
9–10   Left / right mouth                23–24  Left / right hip
11–12  Left / right shoulder             25–26  Left / right knee
13–14  Left / right elbow                27–28  Left / right ankle
15–16  Left / right wrist                29–32  Left / right heel + foot index
Example — body puppet:
let head, bodyMesh, leftHand, rightHand;

export async function init() {
  await initPoseTracking();
  showCameraBackground();

  head = createSphere(0.3, 0xff8800, [0, 0, -3]);
  leftHand = createSphere(0.15, 0x00ff88, [0, 0, -3]);
  rightHand = createSphere(0.15, 0x00ff88, [0, 0, -3]);
}

export function update(dt) {
  const poses = getPoseLandmarks();
  if (poses.length === 0) return;

  const p = poses[0];
  // Map nose to head position
  setPosition(head,
    (p[0].x - 0.5) * 8,
    -(p[0].y - 0.5) * 6,
    -3
  );
  // Map wrists to hand spheres
  setPosition(leftHand,  (p[15].x - 0.5) * 8, -(p[15].y - 0.5) * 6, -3);
  setPosition(rightHand, (p[16].x - 0.5) * 8, -(p[16].y - 0.5) * 6, -3);
}

📷 Camera Control

startCamera(facing?)

Starts the webcam. Called automatically by init*Tracking(), but can be called independently to preview the camera feed without tracking.

facing string - 'user' (front camera, default) or 'environment' (rear camera)
stopCamera()

Stops the webcam, releases the media stream, and disposes the video texture. Also calls hideCameraBackground().

getCameraTexture()

Returns a THREE.VideoTexture of the webcam feed. Useful for mapping the camera onto 3D objects (e.g., a TV screen, a mirror, a portal).

Returns: THREE.VideoTexture | null
Example — webcam on a TV screen:
await startCamera();
const tex = getCameraTexture();
// Apply to a plane as a "TV screen"
tv.material.map = tex;
showCameraBackground()

Sets the webcam feed as the scene's background, creating an AR passthrough effect. 3D objects appear overlaid on the real world.

hideCameraBackground()

Removes the webcam background and restores the previous scene background (skybox, color, etc.).

stopTracking()

Complete cleanup — closes all landmark detectors, stops the detection loop, stops the camera, and clears all cached results. Call this on game exit or cart reload.

🎮 Complete Hand-Tracking AR Cart

A full example combining hand tracking with AR camera background, pinch gestures, and 3D particle effects.

examples/my-ar-game/code.js
let cursor, particles = [];

export async function init() {
  await initHandTracking({ numHands: 2 });
  showCameraBackground();

  cursor = createSphere(0.15, 0xff00ff, [0, 0, -3],
    { material: 'holographic' });
  setAmbientLight(0xffffff, 0.8);
}

export function update(dt) {
  const hands = getHandLandmarks();
  if (hands.length === 0) return;

  // Move cursor to index finger tip
  const tip = hands[0][8];
  const x = (tip.x - 0.5) * 8;
  const y = -(tip.y - 0.5) * 6;
  setPosition(cursor, x, y, -3);

  // Pinch = spawn particle
  const thumb = hands[0][4];
  const dist = Math.hypot(thumb.x - tip.x, thumb.y - tip.y);
  if (dist < 0.04) {
    particles.push({
      mesh: createSphere(0.08, Math.random() * 0xffffff, [x, y, -3]),
      life: 2,
    });
  }

  // Fade and clean up particles
  for (let i = particles.length - 1; i >= 0; i--) {
    particles[i].life -= dt;
    if (particles[i].life <= 0) {
      removeMesh(particles[i].mesh);
      particles.splice(i, 1);
    }
  }
}

export function draw() {
  print('AR Hand Tracking', 10, 10, 0x00ffcc);
  const g = getHandGesture();
  if (g[0]) print(`Gesture: ${g[0].name}`, 10, 30, 0xffffff);
}