Recordings of public violence have never been as readily available as they are today. Live feeds of shootings and attacks have become an ever-increasing problem, with gruesome images of violence only a click away from viewers of all ages. AI has begun to be employed to monitor video surveillance in prisons and psychiatric centres to detect “suspicious behaviour”, but this technique has yet to be exploited for more general monitoring of live broadcast and media sharing sites such as Twitch.tv and Facebook Live. The proposed model could be useful as a “missing piece” in the field of censorship AI and could be used as the basis of a start-up company, offered as a web browser add-on, or sold directly to streaming services to be incorporated into their websites.
A significant amount of research has been done on methods to detect violence in videos, focusing on visual content, audio content or a combination of the two. There has been major success with real-time monitoring of audio profanities and nudity, but most online platforms to date still rely on manual human monitoring and reporting to detect violent content. This report focuses on violent human behaviour such as fighting, as opposed to videos involving weapons, blood or fire, which have previously been classified using simple image classification algorithms.
Violent human behaviour can be classified in real time using pose estimation, an emerging area of research in which 3D stick-man poses of individuals are extracted from 2D pictures and videos. Current applications include the automatic creation of assets for digital media such as video games, analysing and coaching the techniques of athletes and, of specific interest to this report, machine learning using image classification techniques. Difficulties in the process include accounting for lighting, occlusion and variety of clothing. An advantage of deep learning over “hand-crafted” techniques is that it does not require generalisation of frames or prior information, so there is no need for heavy pre-processing of the data.
The objective of this report was to develop a model for an end-to-end trainable deep neural network that classifies live videos to detect violence. The specific aims can be broken down as follows:
● Assess the performance of a convolutional neural network trained on frame differences, combined with a ConvLSTM that classifies the frames, where the output is an overall violence probability score ranging from 0 (extremely unlikely to be violent) to 1 (certain violence).
● Theorise the pros and cons of the system and evaluate the algorithm against known benchmarks.
● Detail how this model can be incorporated into a live-video moderation application for streaming sites and browsers.
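As an illustration of the third aim, the sketch below shows one way a moderation layer might consume the per-clip violence scores in [0, 1] described in the first aim. It is a minimal sketch and not the design used in this report: the class name `ViolenceMonitor`, the rolling window and the threshold value are all hypothetical choices made for the example.

```python
from collections import deque


class ViolenceMonitor:
    """Hypothetical helper: flags a stream when recent scores stay high.

    The window size and threshold are illustrative assumptions, not values
    taken from the report.
    """

    def __init__(self, window: int = 5, threshold: float = 0.8):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.scores = deque(maxlen=window)  # keeps only the latest scores
        self.threshold = threshold

    def update(self, score: float) -> bool:
        """Record a new score and return True if the stream should be flagged."""
        if not 0.0 <= score <= 1.0:
            raise ValueError("score must lie in [0, 1]")
        self.scores.append(score)
        # Wait until the window is full so a single early spike cannot flag.
        window_full = len(self.scores) == self.scores.maxlen
        mean_score = sum(self.scores) / len(self.scores)
        return window_full and mean_score >= self.threshold
```

Averaging over a short window trades a small delay in detection for fewer false alarms caused by a single high-scoring clip; a real deployment would need to tune both parameters against labelled streams.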
A convolutional neural network is used to extract frame-level features from a video in real time using frame differences. The output of the trained convolutional network, which is the desired pose information, is subsequently fed into a series of feed-forward layers to output the probability, and thus the level, of violence in the frames seen so far. The model can be considered a classical black box, a block diagram of which is shown in Figure 1.
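The frame-difference input can be sketched as follows. This is a minimal illustration using NumPy, assuming frames arrive as an array of shape (T, H, W, C) with 8-bit pixel values; the function name `frame_differences` and the scaling to [-1, 1] are assumptions made for the example, not details taken from the report.

```python
import numpy as np


def frame_differences(frames: np.ndarray) -> np.ndarray:
    """Return the difference between each pair of consecutive frames.

    frames: array of shape (T, H, W, C) holding uint8 pixel values (assumed).
    Returns a float32 array of shape (T - 1, H, W, C) with values in [-1, 1].
    """
    if frames.ndim != 4 or frames.shape[0] < 2:
        raise ValueError("expected at least two frames shaped (T, H, W, C)")
    # Convert to float before subtracting: uint8 arithmetic would wrap
    # around whenever a pixel gets darker from one frame to the next.
    scaled = frames.astype(np.float32) / 255.0
    return scaled[1:] - scaled[:-1]
```

A clip of T frames therefore yields T - 1 difference images, which would take the place of the raw frames as input to the convolutional network.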
Although an advantage of convolutional neural networks is the absence of extensive pre-processing of the training data, the way in which the video frames are fed to the model can improve the accuracy of the algorithm. Krizhevsky et al. investigated classification accuracy on the ImageNet dataset using each video frame separately and using the difference between consecutive frames. The classification accuracy rose from 96.0