

Title:
SYSTEM TO DEFEND AGAINST PUPPETEERING ATTACKS IN AI-BASED LOW BANDWIDTH VIDEO
Document Type and Number:
WIPO Patent Application WO/2024/097701
Kind Code:
A1
Abstract:
A system defends against puppeteering attacks in low bandwidth talking head videoconferencing systems. The defensive system is software that can be embedded within existing platforms or an operating system, and it exploits the fact that the facial expression and pose information sent to the receiver inherently contains biometric information about the driving speaker. The system leverages this information to obtain measurements of the biometric distance between the driving and reconstructed speaker. If the biometric distance becomes large, this indicates that the driving speaker is a different person than the reconstructed speaker. The system then flags the video transmission as a puppeteering attack.

Inventors:
STAMM MATTHEW (US)
VAHDATI DANIAL (US)
NGUYEN TAI (US)
Application Number:
PCT/US2023/078277
Publication Date:
May 10, 2024
Filing Date:
October 31, 2023
Assignee:
UNIV DREXEL (US)
International Classes:
G06V40/40; G06F18/20; G06V10/764; G06V20/40; G06V40/16; G06V10/00
Attorney, Agent or Firm:
SCHOTT, Stephen, B. (US)
Claims:
CLAIMS

1. A system that protects against puppeteering attacks in an AI-based low bandwidth video system, wherein the system extracts pose information from the synthesized video at a video receiver side and compares the extracted pose information to pose information transmitted by a video sender, and wherein a puppeteering attack is detected by identifying significant biometric differences between the pose information transmitted by the sender and the pose information extracted from the video reconstructed by the receiver.

2. The system of claim 1, wherein the system measures depth variation in video frames and accounts for it as part of the system's determination of whether there is a puppeteering attack.

3. The system of claim 1, wherein the system measures reconstruction errors in the reconstructed speaker video frames and accounts for them as part of the system's determination of whether there is a puppeteering attack.

4. The system of claim 1, wherein the difference is measured over a predetermined window size.

5. The system of claim 1, wherein the video system is a low bandwidth talking head videoconferencing system.

6. A system that identifies puppeteering attacks in video systems between people, comprising: obtaining a baseline measurement of the biometric distance between the reconstructed speaker and the driving speaker by estimating facial expression and pose features f'_t from a reconstructed speaker video frame Î_t such that f'_t = h(Î_t), where h(·) is a facial representation function; capturing a difference between the driving and reconstructed speaker's biometric information as d_t = m(f_t, f'_t), where m(·, ·) is a biometric difference measurement; wherein if the difference is above a predetermined threshold, the system identifies a communication as puppeteered, and if the difference is below the predetermined threshold, the communication is authentic.

7. The system of claim 6, wherein the system measures depth variation in video frames and accounts for it as part of the system's determination of authentic vs. puppeteered.

8. The system of claim 6, wherein the system measures reconstruction errors in the reconstructed speaker video frames and accounts for them as part of the system's determination of authentic vs. puppeteered.

9. The system of claim 6, wherein the difference is measured over a predetermined window size.

10. The system of claim 6, wherein the video system is a low bandwidth talking head videoconferencing system.

Description:
System to Defend Against Puppeteering Attacks in AI-Based Low Bandwidth Video

BACKGROUND

[0001] Talking head videos are a type of video where the main focus is on a speaker being filmed from the shoulders up and directly addressing the camera. Advances in AI have allowed for the development of systems that can synthesize realistic talking head videos. In recent years, researchers have made significant progress in creating "one shot" talking head synthesis networks. This technology enables the creation of realistic talking head videos of a speaker using a single image of that speaker. As a result, synthetic talking head videos can now be easily created for purposes ranging from positive uses such as virtual assistants and movie production to potentially malicious uses such as deepfakes.

[0002] Recently, researchers have proposed low bandwidth talking head video systems for use in applications such as videoconferencing and video calls. These systems are based on one-shot talking head synthesis networks. In these systems, a single frame or latent space representation of a speaker is sent from the sender to the receiver. After this initial transmission, the sender transmits information related to facial expression and pose features to the receiver. Using this information and the initial face representation, the receiver synthesizes each frame of the video in real-time. As a result, the speaker on the sender side can "drive" the actions of the synthetic version of themselves on the receiver side in real time. This method significantly reduces the bandwidth needed for videoconferencing, as the vectors of facial pose and expression information are transmitted rather than the entire frame itself. This advancement has the potential to greatly improve the quality and accessibility of videoconferencing, especially in low bandwidth or remote areas.

[0003] Talking head videoconferencing systems are unfortunately susceptible to real-time puppeteering attacks, in which the synthetic video generated at the receiver side does not match the person driving the video. To carry out such attacks, the attacker first sends an image of the target speaker to the receiver during the initialization phase of the video. The system then receives the attacker's facial expression and pose information, which is used to create a synthetic video of the target speaker. This allows the attacker to control a realistic version of the target speaker in real-time, potentially deceiving the viewer on the receiver side.

[0004] The ability to puppeteer a target speaker in real-time using talking head videoconferencing systems poses significant risks. This capability can be used to spread misinformation and disinformation, but it can also enable other criminal activities such as fraud and defamation. Real-time audio deepfakes have already reportedly been used to commit financial crimes. This trend is expected to become more common if videoconferencing systems are unable to protect against puppeteered videos. These videos are likely to be even more convincing than audio-only deepfakes, making them a potent tool for malicious actors. It is crucial to develop effective security measures to prevent puppeteering attacks and ensure the authenticity of input signals in talking head videoconferencing systems.

[0005] Currently, there are no defenses against puppeteering attacks in low-bandwidth talking head videoconferencing systems. Initially, this problem may seem identical to detecting deepfake videos. However, even when these videoconferencing systems are operating as intended, they create a synthetic version of the speaker at the receiver, i.e., the system deepfakes an authentic speaker in order to save bandwidth. As a result, deepfake detectors are ill-suited to protecting against puppeteering attacks.

[0006] Talking Head Video Systems. Talking head video systems are a type of artificial intelligence-driven technology used to generate highly realistic and dynamic facial animations or video sequences. These systems create virtual characters or "talking heads" that can mimic human-like speech, facial expressions, and emotions. The primary goal of talking head video systems is to provide more engaging and interactive experiences in various applications such as virtual assistants, video games, film, and telecommunication. Non-AI-based talking head video systems typically rely on traditional computer graphics and animation techniques, such as keyframe animation, blend shapes, morph targets, and facial motion capture, to generate facial animations and movements. These approaches often require more manual intervention, time, expensive equipment, and expertise compared to AI-based systems. Recent developments in AI-based "talking head" systems involve extracting facial features combined with facial expression or emotion features from both source and target videos. These systems then learn a transfer function that adapts the source's features to fit the target's features, resulting in a more natural and accurate representation. Notable work using this paradigm includes Face2Face, DaGAN, ReenactGAN, SAFA, and X2Face.

[0007] A possible application for talking head video systems lies in enhancing low-bandwidth video transmission systems. In these systems, low bandwidth utilization can be achieved in two ways: 1) by transmitting highly compressed facial embeddings from the sender, which the receiver then uses to reconstruct the face in the video stream; or 2) by having the sender initially send low-level representations of the face and background, followed by facial landmarks for subsequent frames, allowing the receiver to reconstruct the corresponding face and background from this information.

[0008] Deepfakes And Synthetic Image Detectors. In order to combat these concerns, researchers have developed many techniques to detect synthetic media. Some of this work has focused on detecting deepfakes. Deepfake detectors work by leveraging priors about the human face's anatomical structure to identify subtle inconsistencies or artifacts in the generated video. State-of-the-art approaches use deep learning to do this, and they have achieved very strong results on multiple public datasets. Other research has been done to detect synthetic images, as well as to identify video editing and origin. These systems work by looking for either specific forensic traces left by the image generation process or anomalies in the locally edited media.

[0009] However, since these approaches will likely flag authentic, self-reenacted videos as deepfaked or synthesized, they are not effective in identifying misuse of this technology.

SUMMARY OF THE EMBODIMENTS

[0010] The inventors propose a system to defend against puppeteering attacks in low bandwidth talking head videoconferencing systems. The defensive system is software that can be embedded within existing platforms or an operating system, and it exploits the fact that the facial expression and pose information sent to the receiver inherently contains biometric information about the driving speaker. The system leverages this information to obtain measurements of the biometric distance between the driving and reconstructed speaker. If the biometric distance becomes large, this indicates that the driving speaker is a different person than the reconstructed speaker. The system then flags the video transmission as a puppeteering attack.

[0011] The system has several advantages over the previous systems: It requires no modifications to the video encoding and transmission system, nor does it require the additional transmission of side information to detect puppeteering. Instead, it uses information already available at the receiver. Biometric features describing the driving and reconstructed speaker are obtained using components already present in an existing system. Furthermore, the system operates with a low computational cost, making it well suited to real-time puppeteering detection.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] FIG. 1 is a high level overview of the proposed system.

[0013] FIG. 2 is an overview of a low bandwidth talking head videoconference system.

[0014] FIG. 3 is an overview of a puppeteering attack in a low bandwidth talking head videoconferencing system.

[0015] FIG. 4 is an example showing the effect of puppeteering on facial landmark positions.

[0016] FIG. 5 is an overview of the proposed defensive system.

[0017] FIG. 6 is an example of authentic self-reenacted videos as well as puppeteered videos.

[0018] FIG. 7 shows ROC curves showing the performance of the defensive system.

[0019] FIG. 8 shows a plot of puppeteering detection accuracy vs. temporal averaging window size W.

[0020] FIG. 9 shows Table 1, showing the system's puppeteering detection accuracies compared to those of other deepfake detection systems.

[0021] FIG. 10 shows Table 2, showing the defensive system's effectiveness across sexes and ethnicities.

DETAILED DESCRIPTION OF THE EMBODIMENTS

[0022] 1. Problem Formulation

[0023] This section describes the problem of puppeteering in low bandwidth talking head video systems in more detail. The section begins by describing how these systems operate, including system components. Next, it provides details of how puppeteering attacks are launched.

[0024] 1.1 Low-Bandwidth Talking Head Videoconferencing Systems

[0025] Low-bandwidth talking head videoconferencing systems are designed to reduce the amount of information that must be transmitted to a receiver in videoconferencing and similar applications. They do this by encoding a talking head video of a speaker at the sender side, then transmitting the encoded information to a receiver. The receiver decodes the video by using the transmitted information as input to a generator, which creates a synthetic video of the speaker.

[0026] These systems operate by first sending a representation x of the speaker's face to the receiver. This representation is learned from the initial portion of the video, often the first video frame. Typically, x is either a single video frame containing the speaker's neutral face or a representation of the speaker's face in a latent space learned by a generative adversarial network (GAN).

[0027] At the sender's side, facial expression and pose information f_t at time t is extracted from the current frame I_t using a system h(·) such that

$$f_t = h(I_t) \tag{1}$$

[0028] The resulting expression and pose feature vector f_t is then transmitted to the receiver. Additional information, such as features that capture any motion present in the background, may also be captured and transmitted. These additional features are not relevant here, and for simplicity we omit them from further discussion without loss of generality.

[0029] The receiver decodes each video frame by using a system g(·), which takes as input the current expression and pose information from the sender, along with the representation of the speaker's face sent at the beginning of the transmission. This produces a reconstructed frame Î_t containing a synthesized version of the speaker's face with the desired pose and expression such that

$$\hat{I}_t = g(f_t, x) \tag{2}$$

[0030] In many systems, g corresponds to a generator pre-trained as part of a GAN to synthesize a realistic human face. An overview of the complete encoding, transmission, and decoding process at time t can be seen in FIG. 2.
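To make the encode/transmit/decode loop above concrete, the following is a minimal, runnable sketch of the data flow in EQs. (1) and (2). The functions toy_h and toy_g are invented stand-ins for illustration only; a real system would use a facial feature extractor and a pretrained generator such as those in DA-GAN, SAFA, or X2Face.

```python
import numpy as np

def toy_h(frame: np.ndarray) -> np.ndarray:
    """h(.): stand-in feature extractor. Returns a coarse 4x4 grid of local
    pixel means as a small 'expression/pose' vector f_t (a toy, not real
    landmarks or a learned embedding)."""
    hs, ws = frame.shape[0] // 4, frame.shape[1] // 4
    return np.array([frame[i*hs:(i+1)*hs, j*ws:(j+1)*ws].mean()
                     for i in range(4) for j in range(4)])

def toy_g(f_t: np.ndarray, x: np.ndarray) -> np.ndarray:
    """g(.): stand-in generator. 'Synthesizes' a frame by blending the stored
    face representation x with the (tiled) feature vector f_t."""
    return 0.5 * x + 0.5 * np.resize(f_t, x.shape)

# Sender: transmit x once, then only the small f_t vectors (EQ. 1).
frames = [np.random.rand(64, 64) for _ in range(5)]   # stand-in video
x = frames[0]                                         # initial face representation
features = [toy_h(frame) for frame in frames[1:]]     # f_t = h(I_t)

# Receiver: reconstruct each frame from f_t and the stored x (EQ. 2).
reconstructed = [toy_g(f_t, x) for f_t in features]   # Î_t = g(f_t, x)
```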

[0031] 1.2 Puppeteering Attacks

[0032] The low-bandwidth talking head videoconferencing systems described above are vulnerable to puppeteering attacks, in which the reconstructed speaker at the receiver side is actually controlled in real-time by a different person at the sender side. An overview of a puppeteering attack can be seen in FIG. 3.

[0033] In a puppeteering attack, an attacker (Speaker A) on the sender side first obtains a representation x^(B) of a target speaker's face (Speaker B). When the video transmission is initiated, the attacker sends x^(B) to the receiver instead of a representation of their own face. After this, they allow the video system at the sender side to observe their face and produce a facial expression and pose vector f_t^(A), which they send to the receiver. The receiver uses f_t^(A) along with x^(B) to construct a video frame with the face of Speaker B, but with the facial expression and pose of Speaker A. As a result, the viewer at the receiver side sees a video of Speaker B that is actually controlled by the actions of the attacker.
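The attack protocol above can be sketched in a few lines. This is an illustrative outline only; h and send are hypothetical parameters standing in for the video system's feature extractor and transport.

```python
def puppeteering_attack(x_target, attacker_frames, h, send):
    """Sketch of a puppeteering attack: initialize the stream with the
    *target's* face representation x^(B), then drive the reconstruction
    with the attacker's own expression/pose features f_t^(A)."""
    send(("init", x_target))           # x^(B) sent instead of the attacker's face
    for frame in attacker_frames:
        send(("features", h(frame)))   # f_t^(A) extracted from the attacker
```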

[0034] 2. Proposed Approach

[0035] 2.1 Exploiting Biometric Side-Information

[0036] In a puppeteered video, the biometric identity of the driving speaker is different from that of the reconstructed speaker. The system leverages this fact to detect puppeteered videos. While the identity of the driving speaker is not directly observable to the receiver, the receiver does have access to the series of facial expression and pose vectors f_t sent by the driving speaker. These vectors inherently capture biometric information about the driving speaker. By analyzing the reconstructed video and comparing it to the corresponding f_t's, the system is able to identify biometric differences between the driving and reconstructed speaker present in puppeteering attacks.

[0037] To gain further intuition into how this is possible, let us first examine talking head video systems such as X2Face, in which f_t directly corresponds to facial landmark positions of the driving speaker as part of the driving features. In systems such as this, if the driving speaker is the same as the reconstructed speaker, then facial landmark positions extracted from the reconstructed speaker should closely match the facial landmark positions sent by the driving speaker. This can be seen in the top row of FIG. 4, which shows the difference between the landmark positions extracted from a video frame synthesized by X2Face in red and from the driving speaker in blue. Here, the facial landmarks from the driving and reconstructed speaker closely align. In general, there may be small differences between the landmark positions from the driving and reconstructed speaker due to reconstruction error.

[0038] If the driving speaker is different from the reconstructed speaker, then they will not share the same facial geometry. This will cause facial landmark positions extracted from the reconstructed video to differ significantly from those sent in f_t. This can be seen in the bottom row of FIG. 4, which shows the difference between the landmark positions extracted from a puppeteered video frame synthesized by X2Face and those from the driving speaker.

[0039] Some systems, such as SAFA, do not transmit explicit facial landmark locations as f_t. Instead, these systems encode facial expression and pose information through other means, such as a learned embedding. These embeddings, however, implicitly capture facial landmark positions and other biometric information about the speaker. As a result, the system is able to use these features to expose biometric differences between the driving and reconstructed speaker.

[0040] 2.2 Detecting Puppeteering

[0041] Our proposed system detects puppeteering attacks by exploiting the biometric information that f_t captures about the driving speaker, as described above. A diagram providing an overview of the system can be seen in FIG. 5.

[0042] Baseline Biometric Distance Measurement: First, the system obtains a baseline measurement of the biometric distance between the reconstructed speaker and the driving speaker. To do this, we estimate the facial expression and pose features f'_t from the reconstructed video frame Î_t such that:

$$f'_t = h(\hat{I}_t) \tag{3}$$

[0043] Note that h is already available to the receiver, because the receiver also uses it to encode and transmit their own face back to the sender. Next, the system captures the difference between the driving and reconstructed speaker's biometric information as

$$d_t = m(f_t, f'_t) \tag{4}$$

[0044] where m(·, ·) is an appropriate metric that measures the difference between f_t and f'_t. In practice, the inventors have found that using the Euclidean distance, m(f_t, f'_t) = ||f_t − f'_t||_2, is sufficient to achieve strong system performance.
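As a minimal illustration of EQs. (3) and (4), the snippet below measures the biometric difference between the sender's features and features re-extracted from the reconstruction. The Euclidean metric follows the reconstruction above, and the feature values are fabricated purely for the example.

```python
import numpy as np

def biometric_distance(f_t: np.ndarray, f_t_prime: np.ndarray) -> float:
    """d_t = m(f_t, f'_t), with m taken to be the Euclidean distance."""
    return float(np.linalg.norm(f_t - f_t_prime))

# Toy feature vectors: an authentic reconstruction deviates only slightly
# from the sender's features, while a puppeteered one deviates strongly.
f_sender = np.array([0.10, 0.25, 0.40])
f_recon_authentic = np.array([0.11, 0.24, 0.41])    # small reconstruction error
f_recon_puppeteered = np.array([0.30, 0.05, 0.70])  # different facial geometry

print(biometric_distance(f_sender, f_recon_authentic))    # small d_t
print(biometric_distance(f_sender, f_recon_puppeteered))  # large d_t
```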

[0045] Controlling For Depth Variation: When a speaker moves farther from the camera, their face becomes smaller. As a result, the differences between f_t and f'_t caused by puppeteering also become relatively smaller. The opposite is true when the speaker moves closer to the camera. The system accounts for this when differentiating between values of d_t caused by puppeteering and those that naturally occur due to imperfect reconstruction of an authentic speaker.

[0046] To do this, the system makes an initial reference estimate r_0 of the speaker's distance from the camera in the first video frame. In subsequent frames, the system estimates the speaker's depth r_t and calculates a depth-calibrated biometric distance c_t between the driving and reconstructed speaker according to

$$c_t = \frac{r_t}{r_0}\, d_t \tag{5}$$
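The depth calibration can be expressed in one line; note that the form of EQ. (5), scaling by the ratio r_t/r_0, is reconstructed from the surrounding description and should be read as a sketch rather than the definitive formula.

```python
def depth_calibrated_distance(d_t: float, r_t: float, r_0: float) -> float:
    """c_t = (r_t / r_0) * d_t: scale the raw biometric distance up when the
    speaker is farther from the camera than in the reference frame, and down
    when they are closer, compensating for apparent face size."""
    return (r_t / r_0) * d_t

# A speaker twice as far away as in the first frame: the raw distance,
# shrunk by the smaller apparent face, is scaled back up.
print(depth_calibrated_distance(d_t=0.05, r_t=2.0, r_0=1.0))  # 0.1
```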

[0047] Controlling For Natural Reconstruction Errors: As previously noted, a low-bandwidth talking head video system will not perfectly reconstruct an authentic driving speaker at the receiver. As a result, there will be natural variation between f_t and f'_t. This variation will be larger at some times due to temporally isolated conditions that make it difficult for the video system to accurately synthesize the driving speaker. This could be due to sudden motion, irregular facial expressions or poses, or a number of other factors. If the instantaneous biometric difference c_t is used to detect puppeteering, then the system will produce false alarms when this occurs.

[0048] To control for these effects, the system calculates a time-averaged value A_t of the biometric distance between the driving and the reconstructed speaker as

$$A_t = \frac{1}{W} \sum_{k=0}^{W-1} c_{t-k} \tag{6}$$

[0049] where W is the width of a sliding window over which the c_t values are averaged.

[0050] Puppeteering Detection: Finally, the system uses the time-averaged biometric distance A_t to detect puppeteering by comparing it to a detection threshold T. Because puppeteering induces large biometric distances, values of A_t greater than T indicate that the video is puppeteered.
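Putting EQ. (6) and the threshold test together, a streaming detector can be sketched as below. The window size of 30 frames matches the experiments in Section 3.2, but the threshold value here is illustrative; in practice T would be chosen from an ROC analysis such as the one shown in FIG. 7.

```python
from collections import deque

class PuppeteeringDetector:
    """Sliding-window detector: average depth-calibrated distances c_t over
    the last W frames (EQ. 6) and flag puppeteering when A_t exceeds T."""

    def __init__(self, window_size: int = 30, threshold: float = 0.5):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def update(self, c_t: float) -> bool:
        """Ingest one c_t value; return True if the current A_t indicates a
        puppeteering attack. (Early frames average over fewer than W values.)"""
        self.window.append(c_t)
        a_t = sum(self.window) / len(self.window)   # A_t
        return a_t > self.threshold

# Toy stream of c_t values: authentic at first, then a sustained jump.
detector = PuppeteeringDetector(window_size=30, threshold=0.5)
for c_t in [0.10, 0.12, 0.90, 0.95, 0.92]:
    print(detector.update(c_t))   # False, False, False, True, True
```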

[0051] 3. Experiments

[0052] This section presents the details and results of a series of experiments conducted to evaluate the performance of the proposed defensive system.

[0053] 3.1 Dataset

[0054] To conduct the experiments, the inventors created a dataset of talking head videos reconstructed by the receiver, along with the facial expression and pose vectors used to reconstruct them. To do this, the inventors first collected a set of pristine videos of multiple speakers, which were used to drive a talking head video system. The inventors gathered these pristine videos by excerpting segments from celebrity interviews publicly distributed on YouTube. Each pristine video corresponds to a 20 to 30 second clip of a single, front-facing speaker. Three pristine videos with different backgrounds and settings were collected from each of 24 different celebrities, resulting in a total of 72 pristine videos. To ensure diversity in the dataset and to help identify any biases that may be inherent in the system, celebrity speakers were chosen to be equally split across sex (i.e., 12 male and 12 female speakers) as well as across four racial/ethnic groups: Black, White, Hispanic, and Asian (i.e., 6 speakers from each group).

[0055] The set of pristine videos was then used to create both authentic and puppeteered talking head videos, as would be reconstructed by the receiver in a low-bandwidth talking head video system. Reconstructed talking head videos were created using four different networks: DA-GAN, SAFA, X2Face, and ReenactGAN. The set of facial expression and pose features used to create each video was also retained.

[0056] Using each of the four networks, the inventors created a set of authentic self-driven videos as well as a set of puppeteered videos. Authentic videos were created by using each of the 72 pristine videos to drive a self-driven reconstruction. Puppeteered videos of each speaker were created by using a pristine video from a different speaker to drive the system. For each speaker, a set of 18 puppeteered videos was made using two different driving speakers. To produce higher-quality reconstructions, the driving speakers for each puppeteered video were selected to match the race/ethnicity and sex of the target speaker. This process was repeated for each of the 24 speakers, resulting in a set of 432 puppeteered videos per network.

[0057] In total, the dataset included 2016 talking head videos corresponding to approximately 14 hours of total video footage. Examples of this dataset can be seen in FIG. 6.

[0058] 3.2 System Performance

[0059] To assess the system's overall accuracy, the inventors used it to identify puppeteering in each of the videos in the dataset. When conducting these experiments, the system used a window size of W = 30 frames, corresponding to 1 second intervals of each video. Puppeteering detection decisions were assessed at a window level. To compare the system's performance to existing approaches, the inventors also analyzed each video using three leading deepfake detectors: Efficient ViT, Cross-Efficient ViT, and CNN Ensemble. Puppeteering detection accuracies obtained by the system are shown in FIG. 9, Table 1, along with the accuracies obtained by the three deepfake detection networks used for comparison. The system achieved strong puppeteering detection performance across all four talking head video systems, with an average detection accuracy of 98.03%. Additionally, FIG. 7 shows ROC curves capturing the performance of the defensive system on all four talking head systems. These ROC curves demonstrate strong puppeteering detection performance at low false alarm rates. This strong performance for SAFA is possible even though the facial expression and pose vectors f_t used by SAFA do not correspond to explicit facial landmark positions. Instead, they correspond to learned abstract landmark representations. Despite this, the system is still able to use SAFA's f_t's to measure the biometric distance between the driving and reconstructed speaker.

[0060] Comparison With Deepfake Detectors: The results in FIG. 9, Table 1 clearly show that the proposed system significantly outperforms deepfake detectors. The system achieves approximately a 20 percentage point increase in accuracy over the highest performing deepfake detector (Efficient ViT). This is not a surprising result, as deepfake detectors are intentionally built for a different application.

[0061] 3.3 Effect of Window Size

[0062] The inventors conducted additional experiments to examine the effect of the window size W in EQ. (6) on the system's overall accuracy. To do this, they repeated the experiments described in Section 3.2 and let W vary from 1 to 40, i.e., from a single frame to approximately 1.33 seconds. The results of this experiment were used to create the plots in FIG. 8, which show the system's puppeteering detection accuracy vs. window size.

[0063] From this plot, we can see that our system's accuracy increases with W for all talking head video systems until W lies between 25 and 30 frames. After this point, the accuracy holds roughly constant as W is further increased.

[0064] 4. Discussion

[0065] 4.1 Why Deepfake Detectors Perform Poorly

[0066] Deepfake detectors are intentionally built to detect deepfake videos where a speaker's face has been generated to match a target speaker. This is a similar, yet distinct, problem from detecting puppeteering in low-bandwidth talking head systems. While it is clear that a deepfake detector should produce a detection when analyzing a puppeteered video, it is not as clear what these detectors should output when presented with an authentic talking head video.

[0067] In the above experiments, the deepfake detectors' most frequent source of errors was flagging authentic self-driven videos as 'fake.' For example, this accounts for the vast majority of puppeteering detection errors produced by CNN Ensemble. This is reasonable, since the face of the speaker in an authentic talking head video has still been synthesized using essentially the same means used to produce a deepfake. The inventors note, however, that Efficient ViT is able to achieve puppeteering detection performance as high as 79.00%. This is possible because Efficient ViT identifies a large portion of self-driven videos as 'real.'

[0068] 4.2 Influence of Race/Ethnicity and Sex

[0069] To examine the system for implicit biases, the inventors investigated the influence of race/ethnicity and sex on the system's performance. FIG. 10, Table 2 shows the system's accuracy conditioned on the reconstructed speaker's race/ethnicity and sex. From this table, we can see that the average accuracies hold fairly consistent across all groups. The standard deviation of the system's average accuracy for each group was 0.53 percentage points, with all groups' average accuracies lying within two standard deviations of the mean. This indicates that the system is unlikely to produce incorrect decisions more frequently for speakers of a particular race/ethnicity or sex.

[0070] Note that biases inherent in a low-bandwidth talking head videoconferencing system may propagate to the defensive system. This is because the system uses the f_t's produced by the video system to measure the biometric difference between the driving and reconstructed speaker. If the video system produces worse facial expression and pose representations for one sex or racial/ethnic group, then the defensive system will likely perform worse for the same group.

[0071] 5 Conclusion

[0072] The system defends against puppeteering attacks in low bandwidth talking head video systems. The defensive system exploits the fact that the facial expression and pose information sent to the receiver inherently contains biometric information about the driving speaker, which can be used to identify discrepancies between the driving and reconstructed speaker. The system may require no modifications to the video encoding and transmission system and can operate in real time with low computational cost.

[0073] EMBODIMENTS

[0074] 1. A system that protects against puppeteering attacks in an AI-based low bandwidth video system, wherein the system extracts pose information from the synthesized video at a video receiver side and compares the extracted pose information to pose information transmitted by a video sender, and wherein a puppeteering attack is detected by identifying significant biometric differences between the pose information transmitted by the sender and the pose information extracted from the video reconstructed by the receiver.

[0075] 2. A system that identifies puppeteering attacks in video systems between people, comprising: obtaining a baseline measurement of the biometric distance between the reconstructed speaker and the driving speaker by estimating facial expression and pose features f'_t from a reconstructed speaker video frame Î_t such that f'_t = h(Î_t), where h(·) is a facial representation function; capturing a difference between the driving and reconstructed speaker's biometric information as d_t = m(f_t, f'_t), where m(·, ·) is a biometric difference measurement; wherein if the difference is above a predetermined threshold, the system identifies a communication as puppeteered, and if the difference is below the predetermined threshold, the communication is authentic.

[0076] 3. The system of embodiment 1 or 2, wherein the system measures depth variation in video frames and accounts for it as part of the system's determination of authentic vs. puppeteered.

[0077] 4. The system of embodiment 1 or 2, wherein the system measures reconstruction errors in the reconstructed speaker video frames and accounts for them as part of the system's determination of authentic vs. puppeteered.

[0078] 5. The system of embodiment 1 or 2, wherein the difference is measured over a predetermined window size.

[0079] 6. The system of embodiment 1 or 2, wherein the video system is a low bandwidth talking head videoconferencing system.

[0080] While the invention has been described with reference to the embodiments above, a person of ordinary skill in the art would understand that various changes or modifications may be made thereto without departing from the scope of the claims.