CNN–LSTM with a soft attention mechanism for human action recognition in videos

Carlos Ismael Orozco, María Elena Buemi, Julio Jacobo Berlles

Abstract


Action recognition in videos is currently a topic of interest in computer vision, due to potential applications such as multimedia indexing and surveillance of public spaces, among others. Attention mechanisms have become an important concept within the deep learning approach; their operation attempts to imitate the human visual ability to focus on the relevant parts of a scene in order to extract important information. In this article we propose a soft attention mechanism adapted to the CNN–LSTM architecture. First, a VGG16 convolutional neural network extracts the features of the input video. Then, an LSTM network processes the resulting sequence of attention-weighted features to classify the action. For the training and test phases we use the HMDB-51 and UCF-101 datasets. We evaluate the performance of our system using accuracy as the evaluation metric, obtaining 40.7% (baseline approach) and 51.2% (with attention) for HMDB-51, and 75.8% (baseline approach) and 87.2% (with attention) for UCF-101.
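To make the described pipeline concrete, the following is a minimal sketch of a CNN–LSTM with soft spatial attention for action recognition, written with Keras. It is not the authors' implementation: the number of sampled frames, the single Dense scoring layer used for attention, the LSTM size, and the RMSProp/cross-entropy training setup are illustrative assumptions.

```python
# Minimal illustrative sketch (not the authors' code) of a CNN-LSTM with
# soft spatial attention for video action recognition, using Keras.
# Frame count, layer sizes and the training setup are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

NUM_FRAMES = 16             # frames sampled per clip (assumption)
NUM_CLASSES = 51            # e.g. HMDB-51
FRAME_SHAPE = (224, 224, 3)

# 1) Frozen VGG16 backbone extracts a 7x7x512 convolutional map per frame.
backbone = VGG16(include_top=False, weights="imagenet", input_shape=FRAME_SHAPE)
backbone.trainable = False

frames = layers.Input(shape=(NUM_FRAMES,) + FRAME_SHAPE)
feats = layers.TimeDistributed(backbone)(frames)             # (batch, T, 7, 7, 512)
feats = layers.Reshape((NUM_FRAMES, 7 * 7, 512))(feats)      # (batch, T, 49, 512)

# 2) Soft attention: score the 49 spatial regions of each frame, normalize
#    with a softmax, and form the attention-weighted sum of region features.
scores = layers.Dense(1)(feats)                               # (batch, T, 49, 1)
alphas = layers.Softmax(axis=2)(scores)                       # weights over regions
context = layers.Lambda(
    lambda t: tf.reduce_sum(t[0] * t[1], axis=2))([feats, alphas])  # (batch, T, 512)

# 3) An LSTM aggregates the attended per-frame features and classifies the clip.
hidden = layers.LSTM(512)(context)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(hidden)

model = Model(frames, outputs)
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

In a setup of this kind, the per-region weights `alphas` can also be read out for each frame to visualize which parts of the scene the model attends to when making its prediction.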

Keywords


action recognition; convolutional neural networks; LSTM neural networks; attention mechanism



DOI: https://doi.org/10.37537/rev.elektron.5.1.130.2021



Copyright (c) 2021 Carlos Ismael Orozco, María Elena Buemi, Julio Jacobo Berlles

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.


Revista elektron, ISSN-L 2525-0159