Machine Learning Application using Pose Estimation to Detect and Moderate Violence in Live Videos

2536 words (10 pages) Essay

8th Feb 2020 Computer Science



  1. Abstract

Recordings of public violence have never been as readily available as they are today. Live feeds of shootings and attacks have become an ever-increasing problem, with gruesome images of violence a click away from viewers of all ages. AI has begun to be employed to monitor video surveillance in prisons and psychiatric centres to detect “suspicious behaviour”, but this technique has yet to be exploited for more general monitoring of live broadcast and media sharing sites such as Facebook Live. The proposed model could serve as a “missing piece” in the field of censorship AI and be used as the basis of a start-up company, either as a web browser add-on or sold directly to streaming services to be incorporated into their websites.

  2. Overview

A significant amount of research has been done on the various methods to detect violence in videos, focusing on visual content,[1] audio content[2] or a combination of the two.[3] There has been major success with real-time monitoring of audio profanities and nudity, but most online platforms to date still rely on manual human monitoring and reporting to detect violent content.[4] This report will focus on violent human behaviour such as fighting, as opposed to videos involving weapons, blood or fire, which have previously been classified using simple image classification algorithms.[5]

Violent human behaviour can be classified in real time using pose estimation, an emerging area of research in which 3D stick-man poses of individuals are extracted from 2D pictures and videos. Current applications include the automatic creation of assets for digital media such as video games, analysing and coaching the techniques of athletes and, of specific interest to this report, machine learning using image classification techniques.[6] Difficulties in the process include accounting for lighting, occlusion and variety of clothing. An advantage of deep learning over “hand-crafted” techniques is that it does not require generalisation of frames or prior information, so there is no need for heavy pre-processing of the data.

2.1.  Previous developments on pose classification 

Bogo and Kanazawa et al. previously reported a convolutional neural network that predicts the positions of individual joints. These predictions serve as the basis for a Skinned Multi-Person Linear Model optimisation in a classic bottom-up, top-down approach, from which the full 3D geometry and body type are also inferred.[7] The computational demand of this approach is reduced by using constraints such as avoiding impossible joint bends, thus minimising the number of possible solutions.
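The idea of discouraging impossible joint bends can be illustrated with a toy penalty term. The sketch below is only in the spirit of the constraint described above and is not the formulation used in [7]; the function names, the joint names and the sign convention (negative angles for natural flexion, positive angles for hyperextension) are assumptions made for illustration.

```python
import math


def joint_bend_penalty(angle_rad):
    """Toy penalty for one hinge joint (knee or elbow).

    Assumed convention: negative angles are natural flexion and positive
    angles are hyperextension, so the exponential stays small for natural
    bends and grows quickly for unnatural ones.
    """
    return math.exp(angle_rad)


def pose_prior(joint_angles):
    """Sum the penalties over all hinge joints of a candidate pose."""
    return sum(joint_bend_penalty(angle) for angle in joint_angles.values())


# Hypothetical candidate poses (angles in radians).
natural_pose = {"left_knee": -1.0, "right_elbow": -0.5}
hyperextended_pose = {"left_knee": 1.0, "right_elbow": -0.5}
```

An optimiser that adds such a term to its objective would prefer `natural_pose` over `hyperextended_pose`, which is one way the space of candidate solutions can be narrowed.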

Real-time pose estimation has been achieved with the work of Güler et al.[8] The technique of dense human pose estimation maps human pixels in a frame of a video to the 3D surface of the human body with positive results, improved by training an additional “inpainting” network that fills in missing values based on the surrounding data. In contrast to the previous research of Bogo and Kanazawa et al., the output poses are dense correspondences between the 2D images and the 3D models, named DensePose.

  3.   System description

The objective of this report was to develop a model for an end-to-end trainable deep neural network that classifies live videos to detect violence. The specific aims can be broken down as follows:

●       Assess the performance of employing a convolutional neural network to train on frame differences and a ConvLSTM to classify the frames, where the output gives an overall violence probability score ranging between 0 for extremely unlikely to be violent and 1 for certain violence.

●       Theorise the pros and cons of the system and evaluate the algorithm against known benchmarks.

●       Detail how this model can be incorporated into a live-video moderation application for streaming sites and browsers.

3.1.  Pose classification

A convolutional neural network is used to extract frame-level features from a video in real time using frame differences. The output of the trained convolutional network, which will be the desired pose information, is subsequently fed into a series of feed-forward layers to output the probability, and thus the level, of violence in the frames seen so far. The model can be considered a classical black box, a block diagram of which can be viewed in Figure 1.
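The data flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the proposed network: frames are assumed to be small greyscale grids, `placeholder_features` stands in for the trained convolutional network, and `violence_score` stands in for the feed-forward layers using a single logistic unit with made-up weights.

```python
import math


def frame_difference(prev_frame, next_frame):
    """Element-wise difference between two equally sized greyscale frames."""
    return [
        [b - a for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]


def difference_stream(frames):
    """Yield the difference of each consecutive pair of frames."""
    for prev_frame, next_frame in zip(frames, frames[1:]):
        yield frame_difference(prev_frame, next_frame)


def placeholder_features(diff):
    """Stand-in for the trained CNN: mean absolute pixel change."""
    pixels = [abs(value) for row in diff for value in row]
    return sum(pixels) / len(pixels)


def violence_score(features, weight=0.1, bias=-3.0):
    """Stand-in for the feed-forward head: a logistic unit mapping to (0, 1).

    The weight and bias are arbitrary illustrative values, not trained ones.
    """
    return 1.0 / (1.0 + math.exp(-(weight * features + bias)))


# Three hypothetical 2x2 frames; only the first transition contains motion.
frames = [[[0, 0], [0, 0]], [[10, 0], [0, 10]], [[10, 0], [0, 10]]]
diffs = list(difference_stream(frames))
scores = [violence_score(placeholder_features(d)) for d in diffs]
```

Feeding differences rather than raw frames means a static scene produces an all-zero input, so the downstream layers only see what has changed between frames.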

Although convolutional neural networks do not require extensive pre-processing of the training data, the way in which the video frames are fed to the model can affect the accuracy of the algorithm. Classification accuracy was investigated by Krizhevsky et al. for the ImageNet dataset using each video frame separately and the difference between each frame.[9] The classification accuracy rose from 96.0

The recognition accuracy and error rate of the algorithm can be evaluated on a number of standard benchmark datasets, for example the Hollywood, YouTube-Actions and Violent-Flows datasets. This can be performed using cross-validation against other well-studied classification techniques such as ViF/SVM,[12] OViF[13] and MoSIFT[14] that have been evaluated on the Violent-Flows dataset. An extremely robust dataset for pose-action classification known as Action Similarity Labelling (ASLAN) was presented by Kliper-Gross et al. in 2012 and has become a standard benchmark dataset for pose estimation in the years following.[15] The dataset includes thousands of videos covering over 400 complex pose-actions with violent and non-violent classes. Models incorporating both convolution and ConvLSTM layers for pose classification exhibit accuracies upwards of approximately 94 % when evaluated against Violent-Flows and ASLAN.[4] Similar results should therefore be expected from the model presented in this report. As mentioned in Section 3, problems have arisen in previous work when videos of sporting events were included in the training dataset. When evaluating this model, accuracy values should be taken both including and excluding sports footage.
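Reporting accuracy both with and without sports footage could be organised along the following lines. This is a sketch only: the record layout (`score`, `violent`, `sports`), the 0.5 decision threshold and the toy records are assumptions for illustration and are not results from any benchmark.

```python
def accuracy(samples, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    if not samples:
        raise ValueError("no samples to evaluate")
    correct = sum(
        1 for s in samples if (s["score"] >= threshold) == s["violent"]
    )
    return correct / len(samples)


def accuracy_with_and_without_sports(samples, threshold=0.5):
    """Report accuracy on the full set and on the non-sports subset."""
    non_sports = [s for s in samples if not s["sports"]]
    return {
        "all": accuracy(samples, threshold),
        "excluding_sports": accuracy(non_sports, threshold),
    }


# Invented toy records, purely to exercise the functions.
toy_samples = [
    {"score": 0.9, "violent": True, "sports": False},
    {"score": 0.2, "violent": False, "sports": False},
    {"score": 0.8, "violent": False, "sports": True},
    {"score": 0.7, "violent": True, "sports": False},
]
report = accuracy_with_and_without_sports(toy_samples)
```

A large gap between the two figures would indicate that sports footage is a major source of misclassification for the model under test.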

  5.   Discussion: Application of the algorithm

The final output of the model described in Section 3 gives a user a continuous probability score for the presence of violence in a livestream in real time. The model has been developed to incorporate modern methods such as ConvLSTM to improve classification accuracy and black-box convolutional neural networks to allow for real-time detection. Assuming that the model performs well against established benchmark datasets, the next step is to research the algorithm’s viability as a commercial product and to specify the niche market, functionality and obstacles that a start-up company using this technology may face.

5.1.  Market analysis

The use of deep neural networks for video violence detection applications is currently in its infancy. The most prominent use of the technology to date is the “AI Guardman” developed by the company Earth Eyes and released in late 2018. The software claims to target shoplifting through CCTV using a pose estimation model based on OpenPose, a predecessor to DensePose, which is discussed in Section 2. Although the source code is not available, the fact that the product is largely based on the OpenPose algorithm implies that it cannot run in real time. To compensate, only a selected number of poses are defined to reflect “suspicious behaviour”, which leads to less accurate results and a higher number of false positives. The software occupies a niche because it does not require sound, which standard CCTV cameras do not record. It can be installed on the CCTV system directly, and alerts are sent to a shop worker’s phone, who can then handle the rest of the matter. The software is simple but is currently plagued by false alarms. As all commercial uses of violence detection are geared towards surveillance, a niche is identified for violence detection in streaming services, optimised for online applications.

5.2.  Implementation

Having identified this algorithm as a unique product, it is important to understand how to implement the model in a valuable product. It is critical to note that violence detection software to date has been confined to surveillance monitoring, where crude analysis can be tolerated because the software acts as a warning system whose alerts are followed up by human investigation. As a result, there has been no need to incorporate other elements such as audio classification.

On its own this algorithm will be able to moderate livestreams based on action recognition, but when paired with already well-researched audio profanity and nudity detection, it provides the missing piece of a robust streaming video moderator. Much work has previously been done on the detection of audio profanities in videos, most notably Bleep, developed for iOS and released in 2015, which can censor swear words from voice calls and videos. The same can be said for nudity detection with NudeNet, released for video censoring in March 2019. Amalgamating these three technologies results in a much-requested feature for streaming services, allowing them to censor and report streams instantaneously.

As mentioned above, streaming websites could benefit from this technology, and a first step for a start-up company would be to pitch the idea to streaming services for incorporation into their website and/or app. The feature could be toggled by adult users and made mandatory for children's accounts. The output of the model gives a probability score for the violence, so thresholds could be put into place. In addition, a browser add-on could be developed for unsupported streaming websites.
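One way to express the toggle and threshold behaviour described above is a small policy check. The sketch below is hypothetical: the `AccountPolicy` fields, the default thresholds of 0.8 and 0.5, and the function name are assumptions for illustration rather than part of the proposed product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccountPolicy:
    """Per-account moderation settings (hypothetical fields)."""
    is_child_account: bool
    filter_enabled: bool = True  # adult users may switch the filter off
    threshold: float = 0.8       # assumed default for adult accounts


def should_censor(score, policy, child_threshold=0.5):
    """Decide whether a stream with the given violence score is censored.

    Children's accounts are always filtered, using a stricter assumed
    threshold; adult accounts are filtered only if the toggle is on.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if policy.is_child_account:
        return score >= child_threshold
    return policy.filter_enabled and score >= policy.threshold
```

For example, a score of 0.6 would be censored for a child's account even if `filter_enabled` were set to `False`, but would pass for an adult account using the assumed default threshold.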

The initial task of a start-up company would then be to build a system in which audio profanity and nudity detection algorithms run in parallel with the violence model while still classifying frames in real time. The idea could then be pitched to established streaming websites, who could host the algorithm server-side. In addition, if there was demand for the product, a browser add-on could be developed, incorporating a user-friendly interface with customisable censoring options such as censoring only nudity or censoring violence above a certain probability threshold.
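Running the three detectors side by side with user-selected options might look roughly like the sketch below. Everything here is a stand-in: the detector functions simply read precomputed scores from a dictionary, whereas real detectors would wrap the violence model, an audio profanity classifier and a nudity classifier. The names and thresholds are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


# Placeholder detectors: each returns a score in [0, 1] for a segment.
def detect_violence(segment):
    return segment.get("violence", 0.0)


def detect_nudity(segment):
    return segment.get("nudity", 0.0)


def detect_profanity(segment):
    return segment.get("profanity", 0.0)


DETECTORS = {
    "violence": detect_violence,
    "nudity": detect_nudity,
    "profanity": detect_profanity,
}


def moderate_segment(segment, thresholds, executor):
    """Run only the detectors the user enabled, concurrently.

    `thresholds` maps a detector name to the score at which content is
    flagged; detectors missing from it are skipped entirely.
    """
    futures = {
        name: executor.submit(detector, segment)
        for name, detector in DETECTORS.items()
        if name in thresholds
    }
    scores = {name: future.result() for name, future in futures.items()}
    flagged = sorted(n for n, s in scores.items() if s >= thresholds[n])
    return scores, flagged
```

A user who wants only nudity censoring would pass `{"nudity": 0.5}`, while a user who also wants violence censored above a chosen level would add a `"violence"` entry with that threshold.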

5.3.  Conclusion

Overall, a literature review of pose estimation and violence detection was conducted to present the notable research in these fields and to highlight the lack of commercial applications aside from surveillance. A model was proposed that uses pose estimation to detect violence in real time, comprising a convolutional neural network and LSTM-divided layers based on current research. The system architecture was discussed, including a complete block diagram for the system. Pros and cons of the algorithm were theorised along with a proposed system evaluation. Finally, the capability of the algorithm to act as a streaming moderator and as the product of a start-up company was discussed.

  6.  References

[1] P. Bilinski and F. Bremond, “Human Violence Recognition and Detection in Surveillance Videos,” AVSS, August, 2016.

[2] T. Giannakopoulos, A. Pikrakis, and S. Theodoridis, “A multi-class audio classification method with respect to violent content in movies using Bayesian Networks,” in 2007 IEEE 9th International Workshop on Multimedia Signal Processing, MMSP 2007 – Proceedings, 2007, pp. 90–93.

[3] E. Acar, F. Hopfgartner, and S. Albayrak, “Breaking down violence detection: Combining divide-et-impera and coarse-to-fine strategies,” Neurocomputing, vol. 208, pp. 225–237, Oct. 2016.

[4] S. Sudhakaran and O. Lanz, “Learning to detect violent videos using convolutional long short-term memory,” in 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2017, 2017.

[5] O. Arriaga, P. Plöger, and M. Valdenegro-Toro, “Image Captioning and Classification of Dangerous Situations,” 2017.

[6] M. Ariz, A. Villanueva, and R. Cabeza, “Robust and accurate 2D-tracking-based 3D positioning method: Application to head pose estimation,” Computer Vision and Image Understanding, vol. 180, Academic Press, 2016, pp. 13–22.

[7] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black, “Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image,” ECCV 2016, 2016, pp. 1–18, arXiv:1607.08128v1 [cs.CV].

[8] R. A. Güler, N. Neverova, and I. Kokkinos, “DensePose: Dense Human Pose Estimation in the Wild,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018, pp. 7297–7306.

[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Adv. Neural Inf. Process. Syst., pp. 1–9, 2012.

[10] Z. Dong, J. Qin, and Y. Wang, “Multi-stream deep networks for person to person violence detection in videos,” in Communications in Computer and Information Science, 2016, vol. 662, pp. 517–531.

[11] C. Olah, A. Mordvintsev, and L. Schubert, “Feature Visualization,” Distill, vol. 2, no. 11, p. e7, Nov. 2017.

[12] T. Hassner, “Violent-Flows – Crowd Violence Non-violence Database and benchmark,” 2014.

[13] Y. Gao, H. Liu, X. Sun, C. Wang, and Y. Liu, “Violence detection using Oriented VIolent Flows,” Image and Vision Computing, vol. 48–49. pp. 37–41, 2016.

[14] E. Bermejo Nievas, O. Deniz Suarez, G. Bueno García, and R. Sukthankar, “Violence Detection in Video Using Computer Vision Techniques,” Springer, Berlin, Heidelberg, 2011, pp. 332–339.

[15] O. Kliper-Gross, T. Hassner, and L. Wolf, “The action similarity labeling challenge,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 3, pp. 615–621, 2012.
