
🤖 MediaPipe Tracking

Real-time hand, face, and body pose tracking via webcam — powered by Google MediaPipe Vision

📋 Overview

The MediaPipe module brings computer-vision tracking to Nova64 carts. Track hands, faces, and body poses using your webcam, then use the landmark data to drive 3D scenes — perfect for AR filters, gesture controls, motion capture, and interactive installations.

💡 Key Features
  • Zero bundle impact — Models (5–20 MB each) are lazy-loaded from CDN only when you call init*Tracking(). Carts that don't use tracking pay nothing.
  • GPU-accelerated — Uses WebGL delegate by default for fast inference on any device.
  • Poll-based API — Detection runs in a background loop; call getHandLandmarks() etc. in your update() to read the latest results.
  • AR camera background — One call to showCameraBackground() sets the webcam feed as your scene background for an AR passthrough effect.
  • Combine freely — You can run hand + face + pose tracking simultaneously.
⚠️ Requirements
  • Camera permission — The browser will prompt the user for webcam access on first call.
  • HTTPS required — getUserMedia only works over HTTPS or on localhost.
  • Internet connection — Models are downloaded from Google's CDN on first use. Subsequent loads use browser cache.
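A cart can feature-detect these requirements before attempting to start tracking. A minimal sketch — cameraSupported is a hypothetical helper, not part of the Nova64 API:

```javascript
// Hypothetical helper (not part of the Nova64 API): check that the
// browser exposes getUserMedia before calling init*Tracking().
// getUserMedia is only available in secure contexts (HTTPS or localhost).
function cameraSupported(nav = typeof navigator !== 'undefined' ? navigator : undefined) {
  return !!nav
    && !!nav.mediaDevices
    && typeof nav.mediaDevices.getUserMedia === 'function';
}
```

With a check like this, a cart can show a "camera unavailable" message instead of failing inside the tracking init call.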

🤚 Hand Tracking

await initHandTracking(options?)

Starts the webcam and initializes the MediaPipe Hand Landmarker. Begins detecting hands immediately. This function is async — the model download may take a few seconds on first call.

options.numHands number - Maximum hands to detect (default: 2)
options.facing string - Camera facing mode: 'user' (default) or 'environment'
options.delegate string - Inference backend: 'GPU' (default) or 'CPU'
Example:
export async function init() {
  await initHandTracking({ numHands: 2 });
  showCameraBackground(); // AR passthrough
}
getHandLandmarks()

Returns the latest detected hand landmarks. Each hand has 21 landmark points with normalized coordinates (0–1 range, relative to the video frame).

Returns: Array<Array<{x, y, z}>> - One array of 21 landmarks per detected hand
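Because landmark coordinates are normalized, they must be mapped into scene space before positioning meshes. A sketch of the mapping used throughout the examples on this page — toScene is a hypothetical helper, and the 8 × 6 extent is an assumption tied to the default camera framing:

```javascript
// Hypothetical helper: convert a normalized landmark (0–1, video space)
// into scene coordinates. Video y grows downward while scene y grows
// upward, hence the sign flip.
function toScene(landmark, width = 8, height = 6) {
  return {
    x: (landmark.x - 0.5) * width,   // re-center, then scale to scene width
    y: -(landmark.y - 0.5) * height, // flip, then scale to scene height
  };
}
```

Typical use: const { x, y } = toScene(hands[0][8]); setPosition(cursor, x, y, -3);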

Hand Landmark Indices

        8   12  16  20
        |   |   |   |
        7   11  15  19
        |   |   |   |
        6   10  14  18
        |   |   |   |
    4   5   9   13  17
     \  |   |  /   /
      3 \   | /   /
       \ \  |/   /
        2  +-------- PALM
         \ |   /
          1|  /
           | /
           0  WRIST
Index  Landmark             Index  Landmark
0      Wrist                11     Middle finger DIP
1      Thumb CMC            12     Middle finger tip
2      Thumb MCP            13     Ring finger MCP
3      Thumb IP             14     Ring finger PIP
4      Thumb tip            15     Ring finger DIP
5      Index finger MCP     16     Ring finger tip
6      Index finger PIP     17     Pinky MCP
7      Index finger DIP     18     Pinky PIP
8      Index finger tip     19     Pinky DIP
9      Middle finger MCP    20     Pinky tip
10     Middle finger PIP
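The indices above make simple finger-state checks straightforward. As a sketch, a finger can be treated as extended when its tip lies farther from the wrist than its PIP joint — isFingerExtended is a hypothetical helper, and the heuristic is approximate (it can misfire at extreme hand angles):

```javascript
// Hypothetical helper: a finger counts as "extended" when its tip
// (e.g. 8) is farther from the wrist (0) than its PIP joint (e.g. 6).
// Tip/PIP pairs per the table above: index (8, 6), middle (12, 10),
// ring (16, 14), pinky (20, 18).
function isFingerExtended(hand, tipIdx, pipIdx) {
  const wrist = hand[0];
  const dist = (a, b) => Math.hypot(a.x - b.x, a.y - b.y);
  return dist(hand[tipIdx], wrist) > dist(hand[pipIdx], wrist);
}
```

For example, isFingerExtended(hands[0], 8, 6) tests the index finger; counting how many of the four pairs pass gives a rough "fingers up" count.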
Example — pinch detection:
export function update(dt) {
  const hands = getHandLandmarks();
  if (hands.length === 0) return;

  const thumb = hands[0][4];  // thumb tip
  const index = hands[0][8];  // index tip
  const dx = thumb.x - index.x;
  const dy = thumb.y - index.y;
  const dist = Math.sqrt(dx * dx + dy * dy);

  if (dist < 0.04) {
    // Pinch detected!
    sfx('coin');
  }
}
getHandGesture()

Returns the recognized gesture for each detected hand. Uses MediaPipe's built-in gesture classifier.

Returns: Array<{name, score} | null> - One result per hand, or null if no gesture recognized

Built-in Gesture Names

Gesture       Description
None          No recognized gesture
Closed_Fist   All fingers curled into a fist
Open_Palm     All fingers extended, palm facing the camera
Pointing_Up   Index finger pointing upward
Thumb_Down    Thumbs-down gesture
Thumb_Up      Thumbs-up gesture
Victory       Peace / victory sign (index + middle fingers)
ILoveYou      ASL "I love you" (thumb + index + pinky extended)
Example:
const gestures = getHandGesture();
if (gestures[0]?.name === 'Closed_Fist') {
  // Fist = attack!
  triggerAttack();
} else if (gestures[0]?.name === 'Open_Palm') {
  // Open palm = shield
  activateShield();
}

😀 Face Tracking

await initFaceTracking(options?)

Starts the webcam and initializes the MediaPipe Face Landmarker with blend shape output. Detects face mesh (478 landmarks) and expression blend shapes.

options.numFaces number - Maximum faces to detect (default: 1)
options.facing string - 'user' (default) or 'environment'
options.delegate string - 'GPU' (default) or 'CPU'
Example:
export async function init() {
  await initFaceTracking({ numFaces: 1 });
  showCameraBackground(); // webcam behind the 3D scene
}
getFaceLandmarks()

Returns the 478-point face mesh for each detected face. Landmarks are normalized coordinates (0–1) relative to the video frame.

Returns: Array<Array<{x, y, z}>> - One array of 478 landmarks per face
🗺️ Key Landmark Indices
  • 1 — Nose tip
  • 33, 263 — Left / right eye outer corner
  • 133, 362 — Left / right eye inner corner
  • 159, 386 — Left / right eye top (pupil area)
  • 13, 14 — Upper / lower lip center
  • 61, 291 — Left / right mouth corner
  • 10 — Forehead center
  • 152 — Chin bottom
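These indices also support simple expression checks without blend shapes. For instance, mouth openness can be estimated from the lip centers, normalized by face height so the value stays stable as the face moves toward or away from the camera — mouthOpenness is a hypothetical helper, and any threshold you apply to it needs tuning:

```javascript
// Hypothetical helper: ratio of the lip gap (landmarks 13–14) to face
// height (forehead 10 to chin 152). Normalizing by face height makes
// the value roughly distance-invariant.
function mouthOpenness(face) {
  const gap = Math.hypot(face[13].x - face[14].x, face[13].y - face[14].y);
  const height = Math.hypot(face[10].x - face[152].x, face[10].y - face[152].y);
  return height > 0 ? gap / height : 0;
}
```

Typical use: if (mouthOpenness(faces[0]) > 0.1) { /* mouth open */ } — the 0.1 cutoff is a rough starting point, not a calibrated value.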
Example — 3D face filter:
let hat;

export async function init() {
  await initFaceTracking();
  showCameraBackground();
  hat = createCone(0.3, 0.8, 0xff0000, [0, 0, 0]);
}

export function update(dt) {
  const faces = getFaceLandmarks();
  if (faces.length === 0) return;

  // Place hat above forehead (landmark 10)
  const forehead = faces[0][10];
  const x = (forehead.x - 0.5) * 8;
  const y = -(forehead.y - 0.5) * 6 + 0.6;
  setPosition(hat, x, y, -3);
}
getFaceBlendShapes()

Returns blend shape coefficients for each detected face — 52 expression values (0–1) for driving avatar animation or expression detection.

Returns: Array<BlendShapeData> - One set of blend shapes per face

Key Blend Shape Names

Category      Blend Shapes
Eyes          eyeBlinkLeft, eyeBlinkRight, eyeWideLeft, eyeWideRight, eyeSquintLeft, eyeSquintRight
Brows         browDownLeft, browDownRight, browInnerUp, browOuterUpLeft, browOuterUpRight
Mouth         mouthSmileLeft, mouthSmileRight, mouthFrownLeft, mouthFrownRight, mouthPucker, jawOpen
Nose / Cheek  noseSneerLeft, noseSneerRight, cheekPuff, cheekSquintLeft, cheekSquintRight
Example — smile detection:
const shapes = getFaceBlendShapes();
if (shapes.length > 0) {
  const bs = shapes[0].categories;
  const smile = bs.find(b => b.categoryName === 'mouthSmileLeft');
  if (smile?.score > 0.5) {
    print('😊 Smiling!', 10, 10, 0x00ff00);
  }
}

🏃 Pose Tracking

await initPoseTracking(options?)

Starts the webcam and initializes the MediaPipe Pose Landmarker. Detects 33 body keypoints for full-body pose estimation.

options.numPoses number - Maximum poses to detect (default: 1)
options.facing string - 'user' (default) or 'environment'
options.delegate string - 'GPU' (default) or 'CPU'
getPoseLandmarks()

Returns 33 body keypoints per detected person. Coordinates are normalized (0–1).

Returns: Array<Array<{x, y, z, visibility}>> - One array of 33 landmarks per person
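Unlike hand and face landmarks, pose landmarks carry a visibility score: joints that are occluded or out of frame report low values, and gating on the score avoids driving meshes with unreliable keypoints. A sketch — visibleEnough is a hypothetical helper, and the 0.5 cutoff is an assumption to tune per cart:

```javascript
// Hypothetical helper: treat a pose landmark as usable only when its
// visibility score clears a threshold. Landmarks without a score
// (e.g. hand/face landmarks) are assumed usable.
function visibleEnough(landmark, minVisibility = 0.5) {
  return typeof landmark.visibility !== 'number'
    || landmark.visibility >= minVisibility;
}
```

In the body-puppet example below, wrapping each setPosition call in if (visibleEnough(p[15])) { … } keeps the hand spheres from jumping when a wrist leaves the frame.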

Pose Landmark Indices

Index  Landmark                          Index  Landmark
0      Nose                              17     Left pinky
1–3    Left eye (inner, center, outer)   18     Right pinky
4–6    Right eye (inner, center, outer)  19–20  Left / right index
7–8    Left / right ear                  21–22  Left / right thumb
9–10   Left / right mouth                23–24  Left / right hip
11–12  Left / right shoulder             25–26  Left / right knee
13–14  Left / right elbow                27–28  Left / right ankle
15–16  Left / right wrist                29–32  Left / right heel + foot index
Example — body puppet:
let head, bodyMesh, leftHand, rightHand;

export async function init() {
  await initPoseTracking();
  showCameraBackground();

  head = createSphere(0.3, 0xff8800, [0, 0, -3]);
  leftHand = createSphere(0.15, 0x00ff88, [0, 0, -3]);
  rightHand = createSphere(0.15, 0x00ff88, [0, 0, -3]);
}

export function update(dt) {
  const poses = getPoseLandmarks();
  if (poses.length === 0) return;

  const p = poses[0];
  // Map nose to head position
  setPosition(head,
    (p[0].x - 0.5) * 8,
    -(p[0].y - 0.5) * 6,
    -3
  );
  // Map wrists to hand spheres
  setPosition(leftHand,  (p[15].x - 0.5) * 8, -(p[15].y - 0.5) * 6, -3);
  setPosition(rightHand, (p[16].x - 0.5) * 8, -(p[16].y - 0.5) * 6, -3);
}

📷 Camera Control

startCamera(facing?)

Starts the webcam. Called automatically by init*Tracking(), but can be called independently to preview the camera feed without tracking.

facing string - 'user' (front camera, default) or 'environment' (rear camera)
stopCamera()

Stops the webcam, releases the media stream, and disposes the video texture. Also calls hideCameraBackground().

getCameraTexture()

Returns a THREE.VideoTexture of the webcam feed. Useful for mapping the camera onto 3D objects (e.g., a TV screen, a mirror, a portal).

Returns: THREE.VideoTexture | null
Example — webcam on a TV screen:
await startCamera();
const tex = getCameraTexture();
// Apply to a plane as a "TV screen"
tv.material.map = tex;
showCameraBackground()

Sets the webcam feed as the scene's background, creating an AR passthrough effect. 3D objects appear overlaid on the real world.

hideCameraBackground()

Removes the webcam background and restores the previous scene background (skybox, color, etc.).

stopTracking()

Complete cleanup — closes all landmark detectors, stops the detection loop, stops the camera, and clears all cached results. Call this on game exit or cart reload.

🎮 Complete Hand-Tracking AR Cart

A full example combining hand tracking with AR camera background, pinch gestures, and 3D particle effects.

examples/my-ar-game/code.js
let cursor, particles = [];

export async function init() {
  await initHandTracking({ numHands: 2 });
  showCameraBackground();

  cursor = createSphere(0.15, 0xff00ff, [0, 0, -3],
    { material: 'holographic' });
  setAmbientLight(0xffffff, 0.8);
}

export function update(dt) {
  const hands = getHandLandmarks();
  if (hands.length === 0) return;

  // Move cursor to index finger tip
  const tip = hands[0][8];
  const x = (tip.x - 0.5) * 8;
  const y = -(tip.y - 0.5) * 6;
  setPosition(cursor, x, y, -3);

  // Pinch = spawn particle
  const thumb = hands[0][4];
  const dist = Math.hypot(thumb.x - tip.x, thumb.y - tip.y);
  if (dist < 0.04) {
    particles.push({
      mesh: createSphere(0.08, Math.random() * 0xffffff, [x, y, -3]),
      life: 2,
    });
  }

  // Fade and clean up particles
  for (let i = particles.length - 1; i >= 0; i--) {
    particles[i].life -= dt;
    if (particles[i].life <= 0) {
      removeMesh(particles[i].mesh);
      particles.splice(i, 1);
    }
  }
}

export function draw() {
  print('AR Hand Tracking', 10, 10, 0x00ffcc);
  const g = getHandGesture();
  if (g[0]) print(`Gesture: ${g[0].name}`, 10, 30, 0xffffff);
}