Bot crawler to retrieve data from Facebook based on the selection of posts and the extraction of user profiles
DOI:
https://doi.org/10.17981/ingecuc.18.2.2022.08
Keywords:
Web scraping, web crawling, HTML, social networking, data
Abstract
Introduction— Data, found both inside and outside organizations, are growing exponentially. Today, the information available on the Internet and social networks has become a source of value: by analyzing a specific situation effectively, with suitable techniques and methodologies, content-based solutions can be proposed that support timely, intelligent, and assertive decision-making.
Objective— The main objective of this work is to develop a bot crawler that extracts information from Facebook without access restrictions or requests for credentials, using web crawling and scraping techniques based on the selection of HTML tags, in order to track content and define patterns.
Methodology— The project was developed in four main stages: A) teamwork with SCRUM, B) comparison of web data extraction techniques, C) extraction and validation of permissions to access the data on Facebook, and D) development of the bot crawler.
Results— As a result of this process, a graphical interface was created to review the process of obtaining data derived from user profiles in this social network.
Conclusions— As a result of this process, a graphical interface is created that allows checking the process of obtaining data derived from user profiles of this social network.
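To illustrate what "selection of HTML tags" can mean in practice, the following is a minimal sketch, not the authors' implementation. It uses only Python's standard `html.parser` module on a static string; the markup in `SAMPLE_HTML`, the class names `post` and `author`, and the helper `extract_text` are hypothetical and do not reflect Facebook's actual page structure.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text inside every element matching a tag name and CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # > 0 while the parser is inside a matching element
        self.buffer = []    # text fragments of the element being read
        self.results = []   # one cleaned string per matching element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already inside a match: track nested elements of the same tag
            # so the correct closing tag ends the match.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                # Join the fragments and collapse runs of whitespace.
                self.results.append(" ".join("".join(self.buffer).split()))

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_text(html, tag, css_class):
    """Return the text of each <tag class="... css_class ..."> element in html."""
    parser = TagTextExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<div class="post"><span class="author">Ana</span> shared <b>a link</b></div>
<div class="ad">Sponsored</div>
<div class="post featured"><div>Nested</div> text</div>
"""

if __name__ == "__main__":
    print(extract_text(SAMPLE_HTML, "div", "post"))
```

A crawler built on this idea would feed each fetched page to such a selector and look for recurring patterns in the extracted text. Any real deployment would also need to respect the target site's terms of service and access permissions.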

Copyright (c) 2022 INGE CUC

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Published papers are the exclusive responsibility of their authors and do not necessarily reflect the opinions of the editorial committee.
INGE CUC Journal respects the moral rights of its authors, who must cede to the editorial committee the patrimonial rights of the published material. In turn, the authors declare that the current work is unpublished and has not been previously published.