-
Abstract
Recordings of public violence have never been as readily available as they are today. Live feeds of shootings and attacks have become an ever-increasing problem, with gruesome images of violence only a click away from viewers of all ages. AI has begun to be employed to monitor video surveillance in prisons and psychiatric centres to detect “suspicious behaviour”, but this technique has yet to be exploited for more general monitoring of live broadcast and media sharing sites such as Twitch.tv and Facebook Live. The proposed model could be useful as a “missing piece” in the field of censorship AI and could be used as the basis of a start-up company, offered as a web browser add-on or sold directly to streaming services to be incorporated into their websites.
-
Overview
A significant amount of research has been done on methods to detect violence in videos, focusing on visual content,[1] audio content[2] or a combination of the two.[3] There has been major success with real-time monitoring of audio profanities and nudity, but most online platforms to date still rely on manual human monitoring and reporting to detect violent content.[4] This report will focus on violent human behaviour such as fighting, as opposed to videos involving weapons, blood or fire, which have previously been classified using simple image classification algorithms.[5]
Violent human behaviour can be classified in real time using pose estimation, an emerging area of research in which 3D stick-figure poses of individuals are extracted from 2D pictures and videos. Current applications include the automatic creation of assets for digital media such as video games, analysing and coaching the techniques of athletes and, of specific interest to this report, machine learning using image classification techniques.[6] Difficulties in the process include accounting for lighting, occlusion and variety of clothing. An advantage of using deep learning over “hand-crafted” techniques is that frames do not need to be generalised and no prior information is required, so there is no need for heavy pre-processing of the data.
2.1. Previous developments on pose classification
Bogo and Kanazawa et al. previously reported a convolutional neural network used to predict the positions of individual joints, which then serve as the basis for a Skinned Multi-Person Linear Model optimisation in a classic bottom-up, top-down approach in which the full 3D geometry and body type are also inferred.[7] The computational demand of this approach is reduced by using constraints such as avoiding impossible joint bends, thus minimising the number of possible solutions.
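The idea of constraining a pose search by penalising impossible joint bends can be illustrated with a minimal sketch. It is not the formulation used in the cited work: the joint names, the angle limits and the quadratic penalty are all assumptions chosen for illustration.

```python
# Hypothetical joint limits in radians (illustrative values, not from the cited work).
JOINT_LIMITS = {
    "elbow": (0.0, 2.6),
    "knee": (0.0, 2.4),
}


def bend_penalty(joint: str, angle: float) -> float:
    """Return 0 inside the allowed range and a quadratic cost outside it."""
    lower, upper = JOINT_LIMITS[joint]
    if angle < lower:
        return (lower - angle) ** 2
    if angle > upper:
        return (angle - upper) ** 2
    return 0.0


def pose_penalty(pose: dict) -> float:
    """Sum the bend penalties over every joint angle in a candidate pose."""
    return sum(bend_penalty(joint, angle) for joint, angle in pose.items())
```

An optimiser that adds such a term to its objective is steered away from anatomically implausible candidates, which is how a constraint of this kind can shrink the space of solutions that needs to be explored.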
Real-time pose estimation has been achieved in the work of Güler et al.[8] Their technique, dense human pose estimation, maps the human pixels in a video frame to the 3D surface of the human body with positive results, which were improved by training an additional “inpainting” network that fills in missing values based on the surrounding data. In contrast to the previous research of Bogo and Kanazawa et al., the output is a set of dense correspondences between the 2D images and the 3D models, named DensePose.
-
System description
The objective of this report was to develop a model for an end-to-end trainable deep neural network that classifies live videos to detect violence. The specific aims can be broken down as follows:
● Assess the performance of employing a convolutional neural network to train on frame differences and a ConvLSTM to classify the frames, where the output is an overall violence probability score ranging from 0 (extremely unlikely to be violent) to 1 (certain violence).
● Theorise the pros and cons of the system and evaluate the algorithm against known benchmarks.
● Detail how this model can be incorporated into a live-video moderation application for streaming sites and browsers.
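How a score between 0 and 1 might feed a moderation decision can be sketched as follows. This is an illustrative assumption rather than the report's implementation: the names `violence_score` and `StreamModerator`, the sigmoid mapping, the threshold of 0.8 and the rolling window are all hypothetical, and the raw classifier outputs are stand-ins for whatever the trained network produces.

```python
import math
from collections import deque


def violence_score(logit: float) -> float:
    """Map a raw classifier output to a score in [0, 1] using a stable sigmoid."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


class StreamModerator:
    """Flag a stream when the mean score over recent frames reaches a threshold."""

    def __init__(self, threshold: float = 0.8, window: int = 5):
        # Both values are assumed defaults and would need tuning on real data.
        self.threshold = threshold
        self.scores = deque(maxlen=window)

    def update(self, logit: float) -> bool:
        """Add the newest raw output and report whether to flag the stream."""
        self.scores.append(violence_score(logit))
        mean_score = sum(self.scores) / len(self.scores)
        return mean_score >= self.threshold
```

Averaging over a short window is one possible design choice: it trades a small delay for fewer false alarms caused by a single misclassified frame.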
3.1. Pose classification
A convolutional neural network is used to extract frame-level features from a video in real time using frame differences. The output of the trained convolutional network, which is the desired pose information, is subsequently fed into a series of feed-forward layers to output the probability, and thus the level, of violence in the frames seen so far. The model can be considered a classical black box, a block diagram of which is shown in Figure 1.
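The frame-difference input mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming greyscale frames stored as nested lists of pixel intensities; the function names are hypothetical and a real pipeline would more likely operate on array or tensor types.

```python
def frame_differences(frames):
    """Return the pixel-wise difference between each pair of consecutive frames."""
    diffs = []
    for previous, current in zip(frames, frames[1:]):
        diff = [
            [c - p for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(previous, current)
        ]
        diffs.append(diff)
    return diffs


def motion_energy(diff):
    """Mean absolute difference, a rough measure of how much a frame changed."""
    values = [abs(v) for row in diff for v in row]
    return sum(values) / len(values)


# Three tiny 2x2 greyscale frames (made-up values for illustration).
frames = [
    [[0, 0], [0, 0]],
    [[1, 0], [0, 2]],
    [[1, 5], [0, 2]],
]
diffs = frame_differences(frames)
```

A sequence of N frames yields N - 1 difference images. Static background cancels out in each difference, so the network receives mainly the regions where motion occurred.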
Although an advantage of convolutional neural networks is that the training data does not need extensive pre-processing, the way in which the video frames are fed to the model can improve the accuracy of the algorithm. Classification accuracy was investigated by Krizhevsky et al. for the ImageNet dataset using each video frame separately and the difference between each frame.[9] The classification accuracy rose from 96.0