Simple lip-sync animations in Linux

Commercial Windows programs like CrazyTalk let you turn any image into an animation that lip-syncs to speech audio, so you can create talking characters. In this article, I will outline how to do this using nothing but free Linux tools. The result is more basic but it should be adequate in many cases.

Step 1: Create about 3 frames in The GIMP

Start with the image you want to animate in PNG format, preferably at a fairly low resolution so that the face fits in 100x100 pixels or so (which saves you from making too many frames). The face should have a fully-closed mouth initially, so let's call the image mouth-closed.png. Load it into The GIMP (gimp mouth-closed.png) and use the scale drop-down box (on the status bar) to get it up to 400% or 800% zoom so you can work with individual pixels. Scroll the image to the mouth area.

Enable GIMP's Free Select tool, either by finding it in the toolbox window or by pressing F. This tool lets you draw freehand areas of the image you want to manipulate. For example, you can erase an unwanted background to white by drawing around areas of the background and pressing Control-X to cut them out. However, in this case we want to drag the bottom half of the mouth down, opening it by one pixel, and we'll probably want the inside of the mouth to be black rather than white. Therefore, it is important to set the background colour to black before cutting. If the colours are still at their defaults (black foreground, white background), one way to do this is to use the GIMP toolbox window to swap the foreground and background colours.

Carefully draw a line that horizontally traces out where the lips join. Without releasing the mouse, drag downward a little and continue to draw around the entire lower half of the mouth. You don't need to worry about ending on the exact point where you started, as The GIMP will complete your path with a straight line if necessary. If you make a mistake, click outside the selected area to cancel it and try again.

When you have the lower half of the mouth selected, press Control-X to cut it out, and then press Control-V to paste. Then drag the pasted copy so that it is about one pixel below its original position. You should now have about one pixel of black in the mouth, showing it is partially open. (I say "about" one pixel of black, because it won't be a clear-cut black line; The GIMP will be anti-aliasing it for you.) Click outside the selected area to cancel the selection, and go back to 100% zoom to check how it looks. Then save the image as mouthopen-1.png.

Now repeat the process to get the mouth opened by another pixel. It's better if this time you don't select quite as far as the extreme corners of the partially-opened mouth, because the middle of a mouth moves more than its corners. Save the result as mouthopen-2.png.

If you're working in a low enough resolution, then you should find that those two are enough. But you can try making mouthopen-3.png as well if you like, in which case make sure you add it to the end of the pics list in the script below.

Step 2: Convert the sound's amplitude to an image sequence

Driving the mouth from the amplitude alone is not very professional, because the true shape of a mouth depends on the vowel that is being spoken and not just on the volume of the speech, but for light use you might be surprised how far you can get by simply using the amplitude.

Because we'll be using a simple Python script to convert the amplitude to a lip position, it is very important that the audio file we start with has absolutely no background noise. (If you want background noise in the final result then you'll have to mix it in after running the script below.) If the audio file has been generated by a speech synthesizer (espeak or whatever) then that should be perfect, but if you are going to record it then you'd better make sure to record in a very quiet environment.
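
One way to judge whether a recording is quiet enough is to look at how far the samples stray from the midpoint during a pause in the speech. The following is only a sketch under stated assumptions: it expects the unsigned 8-bit mono data that is produced as rawfile later in this step, and the function name peak_deviation and its parameters are my own invention, not part of any tool mentioned here.

```python
def peak_deviation(data, start_sec, end_sec, rate=4000):
    """Largest distance from the midpoint (128) between two times,
    for unsigned 8-bit mono audio held in a bytes object."""
    chunk = data[int(start_sec * rate):int(end_sec * rate)]
    return max((abs(b - 128) for b in chunk), default=0)

# Made-up data: one second of pure silence, then one second of loud signal.
demo = bytes([128] * 4000 + [128, 200] * 2000)
print(peak_deviation(demo, 0, 1))  # 0
print(peak_deviation(demo, 1, 2))  # 72
```

On a real file you would read the data with open("rawfile", "rb").read() and pass the start and end times of a pause you know is there. A result close to zero suggests the pause is clean; a large one suggests background noise that could make the mouth twitch.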

We need to make sure that our speech file (let's call it speech.wav) is padded with at least 3 seconds of silence at the end. This is because we'll be using MEncoder later, and a bug in some versions of MEncoder can cause the last 3 seconds of audio to be lost. The command below adds 0.1 seconds of silence at the start and 3 seconds at the end. (You can skip this step if your MEncoder is not affected by the bug, in which case just name your speech file padded.wav.)

sox speech.wav padded.wav pad 0.1 3
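
If sox is not to hand, much the same padding can probably be done with Python's standard wave module. This is a minimal sketch, not a replacement I have tested against the MEncoder bug: it assumes an uncompressed PCM WAV file, and the function name pad_wav and its parameters are made up for illustration.

```python
import wave

def pad_wav(src, dst, before=0.1, after=3.0):
    """Copy src to dst with silence added at both ends (PCM WAV only)."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        audio = reader.readframes(reader.getnframes())
    bytes_per_frame = params.sampwidth * params.nchannels
    # 8-bit WAV samples are unsigned, so silence is 0x80;
    # wider samples are signed, so silence is 0x00.
    silent_byte = b"\x80" if params.sampwidth == 1 else b"\x00"

    def silence(seconds):
        return silent_byte * (int(seconds * params.framerate) * bytes_per_frame)

    with wave.open(dst, "wb") as writer:
        writer.setparams(params)
        # The frame count in the header is corrected when the file is closed.
        writer.writeframes(silence(before) + audio + silence(after))
```

Calling pad_wav("speech.wav", "padded.wav") would then stand in for the sox command above.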

You should now have a file padded.wav with the extra silence in it. Next, for our "analytical" purposes, we convert this to unsigned 8-bit, 4 kHz mono (but don't throw away the original!) so that we can read out the amplitudes more easily with a script.

sox padded.wav -1 -u -c 1 -r 4000 -t raw rawfile

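This should make a file called rawfile, which the following Python script converts into an image sequence (actually a sequence of symbolic links to your frames). The script then runs mencoder to make the actual animation. (If your version of sox rejects the -1 -u options, which newer releases may no longer accept, try -b 8 -e unsigned-integer in their place.)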

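# Written for Python 3.
import os

framerate = 10
# rawfile is 4 kHz with one byte per sample
samples_per_frame = 4000 // framerate

# Read the unsigned 8-bit samples as bytes
with open("rawfile", "rb") as f:
    dat = f.read()

# Peak amplitude of each video frame's worth of audio.
# 128 is the midpoint; abs() counts negative peaks too.
frames = []
for i in range(0, len(dat), samples_per_frame):
    samples = [abs(b - 128)
               for b in dat[i:i + samples_per_frame]]
    frames.append(max(samples))

pics = ["mouth-closed.png",
        "mouthopen-1.png",
        "mouthopen-2.png"]
max_mouthOpen = len(pics) - 1

# Amplitude per mouth position; at least 1, so that
# silent audio cannot cause a division by zero
step = max(1, max(frames, default=0) // (max_mouthOpen * 2))
for i in range(len(frames)):
    if i:
        mouth = min(frames[i] // step, max_mouthOpen)
        # frames[i-1] already holds the previous mouth
        # position; move at most one position per frame
        mouth = max(frames[i-1] - 1,
                    min(frames[i-1] + 1, mouth))
    else:
        mouth = 0
    # Overwrite the amplitude with the chosen position
    frames[i] = mouth
    os.symlink(pics[mouth], "frame%09d.png" % i)

os.system(("mencoder 'mf://frame0*.png' " +
          "-audiofile padded.wav -mf type=png " +
          "-mf fps=%d -oac mp3lame -ovc lavc " +
          "-o animation.avi && rm frame0*.png")
          % framerate)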
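
To see what the quantising and smoothing part of the script does, it can help to run that logic on its own with a handful of made-up peak values. The sketch below is my own restatement of it as a function (the names mouth_positions and num_pics are not in the script), so treat it as an illustration rather than a drop-in replacement.

```python
def mouth_positions(peaks, num_pics=3):
    """Map per-frame peak amplitudes to picture indices,
    changing by at most one position from frame to frame."""
    max_open = num_pics - 1
    # Amplitude per mouth position; at least 1 to avoid dividing by zero
    step = max(1, max(peaks, default=0) // (max_open * 2))
    positions = []
    for i, peak in enumerate(peaks):
        if i == 0:
            mouth = 0  # always start with the mouth closed
        else:
            mouth = min(peak // step, max_open)
            prev = positions[-1]
            mouth = max(prev - 1, min(prev + 1, mouth))
        positions.append(mouth)
    return positions

# Hypothetical peaks: silence, two loud frames, a quiet one, silence
print(mouth_positions([0, 100, 100, 10, 0, 0]))  # [0, 1, 2, 1, 0, 0]
```

Note how the loud second frame only opens the mouth to position 1, and the mouth takes two frames to close again. The limit of one position per frame is what stops the mouth from snapping open and shut.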

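Before you run the script, make sure there are no files that match the pattern frame0*.png in the current directory: creating the symbolic links will fail if the names are already taken, and the final rm would delete any such files. The output is saved to animation.avi, which you can then view in mplayer.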
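
A quick way to check for leftover files is to list anything that matches the pattern before you start. This is a small sketch using Python's standard glob module; the function name stale_frames is just an illustrative choice.

```python
import glob
import os

def stale_frames(directory="."):
    """Return the files in directory that match frame0*.png, sorted by name."""
    return sorted(glob.glob(os.path.join(directory, "frame0*.png")))

leftovers = stale_frames()
print("stale frame files:", leftovers if leftovers else "none")
```

If this lists anything, move or delete those files before running the animation script.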

Limitations

Because this approach opens the mouth by only a few pixels, the resulting video is unlikely to scale well. Rather than try to scale the video after it has been produced, try to make sure the original image is of the right dimensions to start with.

Some versions of MEncoder/MPlayer might not manage to keep the audio in sync with the video for long sequences (more than a few seconds). A player with a setting like "override AVI frame rate based on audio" does not have this problem, and neither does YouTube's upload converter.


[BIO] Silas Brown is a legally blind computer scientist based in Cambridge UK. He has been using heavily-customised versions of Debian Linux since 1999.

Copyright © 2010, Silas Brown. Released under the Open Publication License unless otherwise noted in the body of the article. Linux Gazette is not produced, sponsored, or endorsed by its prior host, SSC, Inc.

Published in Issue 181 of Linux Gazette, December 2010

The Free International Online Linux Monthly	ISSN: 1934-371X	Main site: http://linuxgazette.net