Open Access
Mathematics, volume 12, issue 6, article 899

Boundary-Match U-Shaped Temporal Convolutional Network for Vulgar Action Segmentation

Publication type: Journal Article
Publication date: 2024-03-18
Journal: Mathematics
Scimago: Q2
SJR: 0.475
CiteScore: 4.0
Impact factor: 2.3
ISSN: 2227-7390
Subject areas: General Mathematics; Computer Science (miscellaneous); Engineering (miscellaneous)
Abstract

The advent of deep learning has provided solutions to many challenges posed by the Internet. However, efficient localization and recognition of vulgar segments within videos remain formidable tasks. This difficulty arises from the blurring of spatial features in vulgar actions, which can render them indistinguishable from general actions. Furthermore, issues of boundary ambiguity and over-segmentation complicate the segmentation of vulgar actions. To address these issues, we present the Boundary-Match U-shaped Temporal Convolutional Network (BMUTCN), a novel approach for the segmentation of vulgar actions. The BMUTCN employs a U-shaped architecture within an encoder–decoder temporal convolutional network to bolster feature recognition by leveraging the context of the video. Additionally, we introduce a boundary-match map that fuses action boundary information with greater precision for frames that exhibit ambiguous boundaries. Moreover, we propose an adaptive internal block suppression technique, which substantially mitigates over-segmentation errors while preserving accuracy. Our methodology, tested across several public datasets as well as a bespoke vulgar dataset, has demonstrated state-of-the-art performance on the latter.
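The abstract does not specify how the adaptive internal block suppression works. As a rough, hypothetical illustration of the general idea behind such over-segmentation fixes (not the paper's algorithm), the sketch below merges runs of frame labels shorter than a fixed threshold into a neighbouring segment; the function name and the fixed `min_len` threshold are assumptions for the sake of the example.

```python
def merge_short_segments(labels, min_len=3):
    """Merge runs shorter than min_len frames into the preceding segment
    (or the following one at the start of the sequence)."""
    if not labels:
        return []
    # Run-length encode the frame-wise labels into [label, length] segments.
    segments = []
    for lab in labels:
        if segments and segments[-1][0] == lab:
            segments[-1][1] += 1
        else:
            segments.append([lab, 1])
    # Absorb short runs into a neighbour until none remain.
    changed = True
    while changed and len(segments) > 1:
        changed = False
        for i, (lab, length) in enumerate(segments):
            if length < min_len:
                j = i - 1 if i > 0 else i + 1
                segments[j][1] += length
                del segments[i]
                changed = True
                break
        # Re-merge adjacent runs that now share a label.
        merged = [segments[0]]
        for lab, length in segments[1:]:
            if merged[-1][0] == lab:
                merged[-1][1] += length
            else:
                merged.append([lab, length])
        segments = merged
    return [lab for lab, length in segments for _ in range(length)]
```

Note that the output always has the same length as the input: frames are relabelled, never dropped, which is what keeps frame accuracy while removing spurious segments.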

Cao J., Xu R., Lin X., Qin F., Peng Y., Shao Y.
2023-01-12 citations by CoLab: 12 Abstract  
Temporally detecting and classifying action segments in untrimmed videos is significant for many applications, especially for the detection of vulgar actions such as sucking and caressing in video-platform supervision and surveillance applications. At present, vulgar action segmentation suffers from fuzzy spatial features and complex temporal features, which limit detection accuracy. This paper therefore proposes an effective Adaptive receptive field U-shaped Temporal Convolutional Network (AU-TCN) for the automatic and accurate detection of vulgar actions in video. Firstly, building on the strong temporal feature extraction of current temporal convolutional networks, AU-TCN uses adaptive receptive-field convolution kernels to handle the large differences in average duration between different types of actions in Internet videos, realizing a temporal attention mechanism. Secondly, a U-shaped structure based on the temporal convolutional network is introduced to jointly analyze the high-level and low-level temporal features of the model, compensating for the indistinct spatial features of vulgar actions. Finally, extensive experiments on multiple datasets, including public datasets and a self-built vulgar dataset, verify the effectiveness of the proposed model. Our method achieves state-of-the-art results on the vulgar action dataset, which is of great significance for purifying the Internet environment.
Park J., Kim D., Huh S., Jo S.
Pattern Recognition scimago Q1 wos Q1
2022-09-01 citations by CoLab: 24 Abstract  
• We propose a divide-and-conquer method that first maximizes frame accuracy and then reconstructs the features to reduce over-segmentation.
• The Dilation Passing Network propagates long- and short-range features, enabling a better understanding of the relation between frames.
• The Temporal Reconstruction Network uses a convolutional encoder–decoder to capture local context for temporal consistency among frames.
• Our model achieves meaningful results over the state-of-the-art models on three challenging datasets.
Action segmentation aims to split videos into segments of different actions. Recent work focuses on dealing with long-range dependencies of long, untrimmed videos, but still suffers from over-segmentation and performance saturation due to increased model complexity. This paper addresses the aforementioned issues through a divide-and-conquer strategy that first maximizes the frame-wise classification accuracy of the model and then reduces the over-segmentation errors. This strategy is implemented with the Dilation Passing and Reconstruction Network, composed of the Dilation Passing Network, which primarily aims to increase accuracy by propagating information of different dilations, and the Temporal Reconstruction Network, which reduces over-segmentation errors by temporally encoding and decoding the output features from the Dilation Passing Network. We also propose a weighted temporal mean squared error loss that further reduces over-segmentation. Through evaluations on the 50Salads, GTEA, and Breakfast datasets, we show that our model achieves significant results compared to existing state-of-the-art models.
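The details of the Dilation Passing Network are not given here, but its basic building block, shared with most temporal convolutional networks in this list, is the dilated 1-D convolution, which widens the temporal receptive field without adding parameters. A minimal single-channel sketch in plain Python (a hypothetical helper with zero padding, not the authors' implementation):

```python
def dilated_conv1d(x, kernel, dilation=1):
    """Dilated 1-D convolution, zero-padded so output length == input length.
    x: one feature channel over time; kernel: odd-length list of weights.
    Each output frame mixes inputs spaced `dilation` steps apart."""
    k = len(kernel)
    assert k % 2 == 1, "odd kernel expected for symmetric padding"
    out = []
    for t in range(len(x)):
        s = 0.0
        for i, w in enumerate(kernel):
            idx = t + (i - k // 2) * dilation  # dilated tap position
            if 0 <= idx < len(x):              # zero padding outside the clip
                s += w * x[idx]
        out.append(s)
    return out
```

Stacking such layers with dilations 1, 2, 4, … grows the receptive field exponentially with depth, which is why TCNs can relate frames that are minutes apart.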
Du H., Shi H., Zeng D., Zhang X., Mei T.
ACM Computing Surveys scimago Q1 wos Q1
2022-01-05 citations by CoLab: 79 Abstract  
Face recognition (FR) is one of the most popular and long-standing topics in computer vision. With the recent development of deep learning techniques and large-scale datasets, deep face recognition has made remarkable progress and has been widely used in many real-world applications. Given a natural image or video frame as input, an end-to-end deep face recognition system outputs the face feature for recognition. To achieve this, a typical end-to-end system is built with three key elements: face detection, face alignment, and face representation. Face detection locates faces in the image or frame. Then, face alignment is performed to calibrate the faces to the canonical view and crop them with a normalized pixel size. Finally, in the stage of face representation, the discriminative features are extracted from the aligned face for recognition. Nowadays, all three elements are fulfilled by deep convolutional neural networks. In this survey article, we present a comprehensive review of the recent advances in each element of end-to-end deep face recognition, since thriving deep learning techniques have greatly improved their capability. To start with, we present an overview of end-to-end deep face recognition. Then, we review the advances in each element, respectively, covering many aspects such as the to-date algorithm designs, evaluation metrics, datasets, performance comparison, existing challenges, and promising directions for future research. Also, we provide a detailed discussion about the effect of each element on its subsequent elements and the holistic system. Through this survey, we wish to bring contributions in two aspects: first, readers can conveniently identify methods that serve as strong baselines within each subcategory for further exploration; second, one can employ suitable methods for establishing a state-of-the-art end-to-end face recognition system from scratch.
Alwassel H., Giancola S., Ghanem B.
2021-10-01 citations by CoLab: 78 Abstract  
Due to the large memory footprint of untrimmed videos, current state-of-the-art video localization methods operate atop precomputed video clip features. These features are extracted from video encoders typically trained for trimmed action classification tasks, making such features not necessarily suitable for temporal localization. In this work, we propose a novel supervised pretraining paradigm for clip features that not only trains to classify activities but also considers background clips and global video information to improve temporal sensitivity. Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks: Temporal Action Localization, Action Proposal Generation, and Dense Video Captioning. We also show that our pretraining approach is effective across three encoder architectures and two pretraining datasets. We believe video feature encoding is an important building block for localization algorithms, and extracting temporally-sensitive features should be of paramount importance in building more accurate models. The code and pretrained models are available on our project website.
Ahn H., Lee D.
2021-10-01 citations by CoLab: 43 Abstract  
In this paper, we propose the Hierarchical Action Segmentation Refiner (HASR), which can refine temporal action segmentation results from various models by understanding the overall context of a given video in a hierarchical way. When a backbone model for action segmentation estimates how the given video can be segmented, our model extracts segment-level representations based on frame-level features, and extracts a video-level representation based on the segment-level representations. Based on these hierarchical representations, our model can refer to the overall context of the entire video, and predict how the segment labels that are out of context should be corrected. Our HASR can be plugged into various action segmentation models (MS-TCN, SSTDA, ASRF), and improves the performance of state-of-the-art models on three challenging datasets (GTEA, 50Salads, and Breakfast). For example, on the 50Salads dataset, the segmental edit score improves from 67.9% to 77.4% (MS-TCN), from 75.8% to 77.3% (SSTDA), and from 79.3% to 81.0% (ASRF). In addition, our model can refine the segmentation result from an unseen backbone model, one not referred to when training HASR. This generalization performance would make HASR an effective tool for boosting existing approaches to temporal action segmentation. Our code is available at https://github.com/cotton-ahn/HASR_iccv2021.
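The segmental edit scores quoted above (e.g. 67.9% → 77.4%) use the standard definition: one minus the normalised Levenshtein distance between the run-length-encoded predicted and ground-truth label sequences, expressed as a percentage. A minimal sketch (hypothetical helper, pure Python):

```python
def edit_score(pred, gt):
    """Segmental edit score in percent: compares the ORDER of segments,
    ignoring their durations, so over-segmentation is penalised heavily."""
    def segments(labels):
        # Collapse consecutive repeats: ['a','a','b'] -> ['a','b'].
        segs = []
        for lab in labels:
            if not segs or segs[-1] != lab:
                segs.append(lab)
        return segs
    p, g = segments(pred), segments(gt)
    m, n = len(p), len(g)
    # Standard dynamic-programming Levenshtein distance.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return (1 - d[m][n] / max(m, n, 1)) * 100.0
```

Because durations are ignored, a prediction that fragments one ground-truth segment into several short ones loses edit score even if most frames are classified correctly, which is why this metric is paired with frame accuracy throughout these papers.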
Arif M.
2021-06-01 citations by CoLab: 29 Abstract  
Social media networks are becoming an essential part of life for most of the world’s population. Detecting cyberbullying using machine learning and natural language processing algorithms is getting the attention of researchers. There is a growing need for automatic detection and mitigation of cyberbullying events on social media. In this study, research directions and the theoretical foundation in this area are investigated. A systematic review of the current state-of-the-art research in this area is conducted. A framework considering all possible actors in the cyberbullying event must be designed, including various aspects of cyberbullying and its effect on the participating actors. Furthermore, future directions and challenges are also discussed.
Li Y., Dong Z., Liu K., Feng L., Hu L., Zhu J., Xu L., Wang Y., Liu S.
Neurocomputing scimago Q1 wos Q1
2021-05-04 citations by CoLab: 28 Abstract  
Due to boundary ambiguity and over-segmentation issues, identifying all the frames in long untrimmed videos is still challenging. To address these problems, we present the Efficient Two-Step Network (ETSN) with two components. The first step of ETSN is the Efficient Temporal Series Pyramid Network (ETSPNet), which captures both local and global frame-level features and provides accurate predictions of segmentation boundaries. The second step is a novel unsupervised approach called Local Burr Suppression (LBS), which significantly reduces the over-segmentation errors. Our empirical evaluations on the 50Salads, GTEA, and Breakfast benchmarks demonstrate that ETSN outperforms the current state-of-the-art methods by a large margin.
Ishikawa Y., Kasai S., Aoki Y., Kataoka H.
2021-01-01 citations by CoLab: 107 Abstract  
We propose an effective framework for the temporal action segmentation task, namely an Action Segment Refinement Framework (ASRF). Our model architecture consists of a long-term feature extractor and two branches: the Action Segmentation Branch (ASB) and the Boundary Regression Branch (BRB). The long-term feature extractor provides shared features for the two branches with a wide temporal receptive field. The ASB classifies video frames with action classes, while the BRB regresses the action boundary probabilities. The action boundaries predicted by the BRB refine the output from the ASB, which results in a significant performance improvement. Our contributions are three-fold: (i) We propose a framework for temporal action segmentation, the ASRF, which divides temporal action segmentation into frame-wise action classification and action boundary regression. Our framework refines frame-level hypotheses of action classes using predicted action boundaries. (ii) We propose a loss function for smoothing the transition of action probabilities, and analyze combinations of various loss functions for temporal action segmentation. (iii) Our framework outperforms state-of-the-art methods on three challenging datasets, offering an improvement of up to 13.7% in terms of segmental edit distance and up to 16.1% in terms of segmental F1 score. Our code is publicly available.
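The refinement step that ASRF describes, correcting frame-wise hypotheses with predicted boundaries, can be illustrated with a simplified sketch: split the sequence wherever the boundary probability crosses a threshold, then relabel each resulting segment with its majority frame-wise class. The function name and the fixed threshold are assumptions for illustration; the actual ASRF refinement is more elaborate.

```python
from collections import Counter

def refine_with_boundaries(frame_labels, boundary_probs, thresh=0.5):
    """Split the video at frames whose boundary probability exceeds `thresh`
    and relabel every resulting segment with its majority frame-wise class."""
    assert len(frame_labels) == len(boundary_probs)
    # Cut points: start of video, every confident boundary, end of video.
    cuts = [0] + [t for t in range(1, len(frame_labels))
                  if boundary_probs[t] > thresh] + [len(frame_labels)]
    refined = []
    for start, end in zip(cuts, cuts[1:]):
        majority = Counter(frame_labels[start:end]).most_common(1)[0][0]
        refined.extend([majority] * (end - start))
    return refined
```

For example, an isolated misclassified frame inside a segment is outvoted by its neighbours, so spurious micro-segments disappear as long as the boundary branch does not fire inside the true segment.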
Wang Z., Gao Z., Wang L., Li Z., Wu G.
2020-11-20 citations by CoLab: 90 Abstract  
Identifying human action segments in an untrimmed video is still challenging due to boundary ambiguity and over-segmentation issues. To address these problems, we present a new boundary-aware cascade network by introducing two novel components. First, we devise a new cascading paradigm, called Stage Cascade, to enable our model to have adaptive receptive fields and more confident predictions for ambiguous frames. Second, we design a general and principled smoothing operation, termed local barrier pooling, to aggregate local predictions by leveraging semantic boundary information. Moreover, these two components can be jointly fine-tuned in an end-to-end manner. We perform experiments on three challenging datasets: 50Salads, GTEA, and Breakfast, demonstrating that our framework significantly outperforms the current state-of-the-art methods. The code is available at https://github.com/MCG-NJU/BCN.
Mallmann J., Santin A.O., Viegas E.K., dos Santos R.R., Geremias J.
2020-11-01 citations by CoLab: 20 Abstract  
Convolutional neural network (CNN) models are typically composed of several gigabytes of data, requiring dedicated hardware and significant processing capabilities for proper handling. In addition, video-detection tasks are typically performed offline, and each video frame is analyzed individually, meaning that the video’s categorization (class assignment) as normal or pornographic is only complete after all the video frames have been evaluated. This paper proposes the Private Parts Censor (PPCensor), a CNN-based architecture for transparent and near real-time detection and obfuscation of pornographic video frame regions. Our contribution is two-fold. First, the proposed architecture is the first that addresses the detection of pornographic content as an object detection problem. The objective is to apply user-friendly content filtering such that an inevitable false positive will obfuscate only regions (objects) within the video frames instead of blocking the entire video. Second, the PPCensor architecture is deployed on dedicated hardware, and real-time detection is deployed using a video-oriented streaming proxy. If a pornographic video frame is identified in the video, the system can hide pornographic content (private parts) in real time without user interaction or additional processing on the user’s device. Based on more than 50,000 objects labeled manually, the evaluation results show that the PPCensor is capable of detecting private parts in near real time for video streaming. Compared to cutting-edge CNN architectures for image classification, PPCensor achieved similar results, but operated in real time. In addition, when deployed on a desktop computer, PPCensor handled up to 35 simultaneous connections without the need for additional processing on the end-user device.
Ge S., Li C., Zhao S., Zeng D.
2020-10-01 citations by CoLab: 96 Abstract  
Face recognition has achieved advanced development by using convolutional neural network (CNN) based recognizers. Existing recognizers typically demonstrate powerful capacity in recognizing un-occluded faces, but often suffer from accuracy degradation when directly identifying occluded faces. This is mainly due to insufficient visual and identity cues caused by occlusions. On the other hand, a generative adversarial network (GAN) is particularly suitable when it needs to reconstruct visually plausible occlusions by face inpainting. Motivated by these observations, this paper proposes identity-diversity inpainting to facilitate occluded face recognition. The core idea is integrating a GAN with an optimized pre-trained CNN recognizer, which serves as the third player competing with the generator by distinguishing diversity within the same identity class. To this end, a collection of identity-centered features is applied in the recognizer as supervision, enabling the inpainted faces to cluster towards their identity centers. In this way, our approach can benefit from the GAN for reconstruction and the CNN for representation, and simultaneously addresses two challenging tasks, face inpainting and face recognition. Experimental results compared with four state-of-the-art methods prove the efficacy of the proposed approach.
Papadamou K., Papasavva A., Zannettou S., Blackburn J., Kourtellis N., Leontiadis I., Stringhini G., Sirivianos M.
A large number of the most-subscribed YouTube channels target children of very young age. Hundreds of toddler-oriented channels on YouTube feature inoffensive, well-produced, and educational videos. Unfortunately, inappropriate content that targets this demographic is also common. YouTube's algorithmic recommendation system regrettably suggests inappropriate content because some of it mimics or is derived from otherwise appropriate content. Considering the risk for early childhood development, and an increasing trend in toddlers' consumption of YouTube media, this is a worrisome problem. In this work, we build a classifier able to discern inappropriate content that targets toddlers on YouTube with 84.3% accuracy, and leverage it to perform a large-scale, quantitative characterization that reveals some of the risks of YouTube media consumption by young children. Our analysis reveals that YouTube is still plagued by such disturbing videos and its currently deployed countermeasures are ineffective in terms of detecting them in a timely manner. Alarmingly, using our classifier we show that young children are not only able, but likely, to encounter disturbing videos when they randomly browse the platform starting from benign videos.
Yu H., He F., Pan Y.
2019-12-09 citations by CoLab: 101 Abstract  
Image segmentation plays an important role in computer vision. However, it is extremely challenging due to low resolution, high noise, and blurry boundaries. Recently, region-based models have been widely used to segment such images. The existing models often utilized Gaussian filtering to filter images, which caused the loss of edge gradient information. Accordingly, in this paper, a novel local region model based on an adaptive bilateral filter is presented for segmenting noisy images. Specifically, we firstly construct a range-based adaptive bilateral filter, in which an image's edge structures are well preserved while noise is resisted. Secondly, we present a data-driven energy model, which utilizes local information of regions centered at each pixel of the image to approximate intensities inside and outside of the circular contour. This estimation approach improves the accuracy of noisy image segmentation. Thirdly, under the premise of keeping the image's original shape, a regularization function is used to accelerate the convergence speed and smooth the segmentation contour. Experimental results on both synthetic and real images demonstrate that the proposed model is more efficient and robust to noise than state-of-the-art region-based models.
Liu Y., Ma L., Zhang Y., Liu W., Chang S.
2019-06-01 citations by CoLab: 159 Abstract  
Temporal action proposal generation is an important task, aiming to localize the video segments containing human actions in an untrimmed video. In this paper, we propose a multi-granularity generator (MGG) to perform the temporal action proposal from different granularity perspectives, relying on the video visual features equipped with the position embedding information. First, we propose to use a bilinear matching model to exploit the rich local information within the video sequence. Afterwards, two components, namely segment proposal producer (SPP) and frame actionness producer (FAP), are combined to perform the task of temporal action proposal at two distinct granularities. SPP considers the whole video in the form of feature pyramid and generates segment proposals from one coarse perspective, while FAP carries out a finer actionness evaluation for each video frame. Our proposed MGG can be trained in an end-to-end fashion. Through temporally adjusting the segment proposals with fine-grained information based on frame actionness, MGG achieves the superior performance over state-of-the-art methods on the public THUMOS-14 and ActivityNet-1.3 datasets. Moreover, we employ existing action classifiers to perform the classification of the proposals generated by MGG, leading to significant improvements compared against the competing methods for the video detection task.
