Real-time 3D Motion Capture with 2D Webcam Input

Controlling a rigged 3D model with a laptop webcam.

Exploring the potential

Creating 3D depth effects through head-tracking

Using BlazePose eye tracking to estimate head position and derive the projection matrix.

For a few years now I’ve been interested in the potential of combining real-time head tracking with a camera’s projection matrix to create a parallax illusion. By manipulating the projection matrix of a 3D scene as a user moves their face around a screen, it’s possible to create an illusion of depth that makes the screen look like a window. In the past, this could only be achieved with a depth-enabled sensor like the Microsoft Kinect or another RGB-D camera. However, since the release of lightweight pose estimation models such as PoseNet and Mediapipe’s BlazePose, it’s quite trivial to track a user’s position in real time on low-power hardware. These days it can even be done in a browser at passable frame rates.
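The core trick is an off-axis (asymmetric) frustum: rather than a fixed symmetric projection, the frustum is re-skewed each frame so its near plane stays aligned with the physical screen while its apex follows the viewer’s head. Here’s a minimal sketch in Python/NumPy, assuming the head position is given in metres relative to the screen center; the function name, screen dimensions, and clip planes are placeholders, not values from my actual setup:

```python
import numpy as np

def off_axis_projection(head, half_w, half_h, near=0.05, far=100.0):
    """Asymmetric frustum for a screen of size 2*half_w x 2*half_h (metres),
    viewed from head = (x, y, z) in the screen's coordinate frame
    (origin at the screen center, z pointing toward the viewer)."""
    x, y, z = head
    # Scale the screen edges back to the near plane (similar triangles).
    left   = (-half_w - x) * near / z
    right  = ( half_w - x) * near / z
    bottom = (-half_h - y) * near / z
    top    = ( half_h - y) * near / z
    # Standard OpenGL-style frustum matrix built from those bounds.
    return np.array([
        [2*near/(right-left), 0, (right+left)/(right-left), 0],
        [0, 2*near/(top-bottom), (top+bottom)/(top-bottom), 0],
        [0, 0, -(far+near)/(far-near), -2*far*near/(far-near)],
        [0, 0, -1, 0],
    ])

# Example: viewer 60 cm in front of a 30 cm-wide laptop screen,
# head shifted 5 cm to the right.
P = off_axis_projection(head=(0.05, 0.0, 0.6), half_w=0.15, half_h=0.09)
```

In Unity, the equivalent is writing a matrix like this into Camera.projectionMatrix every frame instead of letting the camera build its own symmetric one.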

Tracking a user’s eyes was not too difficult. A naive implementation is to measure the distance between the two eyes in screen space and compare it to the real-world distance between the user’s eyes (interpupillary distance is roughly 63 mm for an average adult). You are essentially computing similar triangles: one formed by the real-world eyes and the camera’s optical center, the other formed by the projected eyes on the image plane. The nearer the face is to the camera, the farther apart the eyes appear on screen, so the ratio of the two distances yields the head’s depth. After some trial and error and heavy use of low-pass filtering, I was able to get something that didn’t feel too unrealistic.
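Concretely, under a pinhole camera model the depth works out to z = f * real_ipd / pixel_ipd, where f is the camera’s focal length in pixels. Here’s a rough sketch of that loop using Mediapipe’s Pose eye landmarks and a simple exponential low-pass filter; the focal length and filter coefficient are placeholder values rather than calibrated ones:

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

REAL_IPD_M = 0.063   # average adult interpupillary distance, metres
FOCAL_PX = 900.0     # placeholder focal length; ideally from calibration
ALPHA = 0.2          # low-pass coefficient: lower = smoother but laggier

def eye_distance_px(landmarks, w, h):
    """Pixel distance between the two eye landmarks."""
    l = landmarks[mp_pose.PoseLandmark.LEFT_EYE]
    r = landmarks[mp_pose.PoseLandmark.RIGHT_EYE]
    dx = (l.x - r.x) * w
    dy = (l.y - r.y) * h
    return (dx * dx + dy * dy) ** 0.5

smoothed_z = None
cap = cv2.VideoCapture(0)
with mp_pose.Pose(model_complexity=0) as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks is None:
            continue
        h, w = frame.shape[:2]
        ipd_px = eye_distance_px(results.pose_landmarks.landmark, w, h)
        # Similar triangles: z / REAL_IPD_M == FOCAL_PX / ipd_px.
        z = FOCAL_PX * REAL_IPD_M / max(ipd_px, 1e-6)
        # Exponential smoothing kills most of the landmark jitter.
        smoothed_z = z if smoothed_z is None else ALPHA * z + (1 - ALPHA) * smoothed_z
```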

The parallax effect in Unity with real-time head tracking.

Real-time motion capture from 2D images

Real-time inverse kinematics from 2D video

The next stage in my experiment was to create an inverse kinematics solver that computes joint rotations from 2D joint positions. Again, I used Mediapipe’s lightweight joint tracking model to get the position of a user’s joints in 2D screen space. To solve the inverse kinematics, I mapped the joint positions to the most significant bones in a rigged 3D model. Any two joint positions can be interpreted as a vector, so each bone’s start and end joints define one of these vectors. From there, for each frame I compare each bone’s vector to the previous frame’s vector, compute the change in angle, and apply that rotation to the rigged model, as sketched below.
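Here’s a rough sketch of that per-frame computation. The bone table is illustrative rather than my exact rig mapping (indices follow Mediapipe Pose, where 11 is the left shoulder, 13 the left elbow, and so on), and the signed angle falls out of the 2D cross and dot products:

```python
import math

# Illustrative bone table: each bone maps to (start, end) landmark indices
# from Mediapipe Pose (e.g. 11 -> 13 is left shoulder to left elbow).
BONES = {
    "upper_arm_L": (11, 13),
    "forearm_L":   (13, 15),
    "upper_arm_R": (12, 14),
    "forearm_R":   (14, 16),
}

def bone_vector(landmarks, bone):
    """Interpret a bone's start and end joints as a 2D vector.
    Assumes landmarks arrive as (x, y) tuples in screen space."""
    i, j = BONES[bone]
    return (landmarks[j][0] - landmarks[i][0],
            landmarks[j][1] - landmarks[i][1])

def signed_angle(v1, v2):
    """Signed angle (radians) that rotates v1 onto v2 in screen space."""
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return math.atan2(cross, dot)

def frame_deltas(prev_landmarks, curr_landmarks):
    """Per-bone rotation deltas between two frames, applied to the rig."""
    return {
        bone: signed_angle(bone_vector(prev_landmarks, bone),
                           bone_vector(curr_landmarks, bone))
        for bone in BONES
    }
```

Accumulating per-frame deltas keeps each update small, but it can drift over time, which is part of why the low-pass filtering from earlier matters here too.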

After a bit of tedious work (particularly ironing out some pretty unnatural rotations), I got to a point where things started to look pretty good.