Development of a Model for Detecting Prompt Injection in Large Language Models Using the BERT Architecture
Milica M. Živanović2
and Marko M. Živanović1 
1 Faculty of Organizational Sciences, University of Belgrade, Jove Ilića 154, Belgrade, 11000, Serbia
2 Faculty of Information Technology, Belgrade Metropolitan University, Tadeuša Košćuška 63, Belgrade, 11000, Serbia
milicazivanovic2411@gmail.com
marko.zivanovic@metropolitan.ac.rs
DOI: 10.46793/BISEC25.186Z
ABSTRACT: Large Language Models (LLMs) have demonstrated remarkable proficiency in both understanding and generating natural language, which has contributed to their rapid adoption across various domains. Yet, their widespread use has also exposed them to emerging security threats, most notably prompt injection attacks. Such attacks can compromise model behavior and potentially reveal sensitive information. This research explores the phenomenon of prompt injection and surveys existing defense mechanisms, with a particular focus on developing and evaluating detection approaches. The study formulates the detection task as a binary text classification problem, distinguishing between malicious and benign prompts. Central to the analysis is the application of the BERT architecture and its lightweight variants. The main objective is to compare the performance of smaller, fine-tuned BERT-based models in identifying malicious inputs. The underlying hypothesis suggests that these compact models, when properly adapted, can outperform larger counterparts in detecting and mitigating injection-based attacks due to their efficiency and adaptability.
KEYWORDS: Large language models, prompt injection, malicious prompt detection, BERT architecture, cybersecurity, classification problem.
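The detection setup described in the abstract can be sketched as a binary sequence classification task with a compact BERT model. The configuration below is a minimal illustrative assumption, not the authors' exact setup: it builds a deliberately tiny, randomly initialized BERT classifier (standing in for the fine-tuned lightweight variants the paper compares) and runs a single dummy prompt through it. A real pipeline would load a pretrained checkpoint, tokenize actual text, and fine-tune on a labeled corpus such as the deepset/prompt-injections dataset cited in the references.

```python
# Hypothetical sketch of prompt-injection detection as binary sequence
# classification with a compact BERT. Dimensions and token IDs are
# illustrative assumptions, not the paper's exact configuration.
import torch
from transformers import BertConfig, BertForSequenceClassification

# A deliberately tiny BERT configuration, standing in for the
# "lightweight variants" compared in the study.
config = BertConfig(
    vocab_size=1000,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=2,  # two classes: benign vs. malicious prompt
)
model = BertForSequenceClassification(config)  # randomly initialized here
model.eval()

# Dummy token IDs for one prompt; a real pipeline would use a tokenizer
# matched to a pretrained checkpoint.
input_ids = torch.tensor([[101, 7, 42, 9, 102]])
with torch.no_grad():
    logits = model(input_ids=input_ids).logits

# By convention here: class 0 = benign, class 1 = malicious.
pred = int(logits.argmax(dim=-1))
print(logits.shape)  # torch.Size([1, 2])
```

Fine-tuning would then minimize cross-entropy over the two labels, exactly as in the Hugging Face sequence-classification workflow the references point to.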
ACKNOWLEDGMENT: The authors express their gratitude to Metropolitan University for the stimulating environment for scientific research and for the financial support provided. They are particularly grateful for the waiver of the registration fee, which directly enabled the publication and presentation of the results of this research.
REFERENCES:
- OWASP. (2023). OWASP Top 10 for LLM Applications. https://llmtop10.com/
- Kaddour, J., Harris, J., Mozes, M., Bradley, H., Raileanu, R., & McHardy, R. (2023). Challenges and Applications of Large Language Models. arXiv preprint arXiv:2307.10169.
- Rahman, M. A., Shahriar, H., Wu, F., & Cuzzocrea, A. (2024). Applying Pre-trained Multilingual BERT in Embeddings for Improved Malicious Prompt Injection Attacks Detection. University of West Florida, Tuskegee University, University of Calabria.
- Yu, J., Lin, X., Yu, Z., & Xing, X. (2024). GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts. arXiv preprint arXiv:2309.10253. https://arxiv.org/pdf/2309.10253
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Deepset. (2024). deepset/prompt-injections [Dataset]. Hugging Face. https://huggingface.co/datasets/deepset/prompt-injections
- Hugging Face. (2025). Sequence classification. https://huggingface.co/docs/transformers/tasks/sequence_classification
- Gupta, M. (2025). What is Prompt Injection? AI has got a new poison, a bigger problem than LLM Hallucinations. Medium. https://medium.com/data-science-in-your-pocket/what-is-prompt-injection-ai-has-got-a-new-poison-3b6455b57b4d
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762. https://arxiv.org/abs/1706.03762
- GeeksforGeeks. (2025, July 17). Explanation of BERT Model – NLP. GeeksforGeeks. https://www.geeksforgeeks.org/nlp/explanation-of-bert-model-nlp/
- Perez, F., & Ribeiro, I. (2022). Ignore Previous Prompt: Attack Techniques for Language Models.
- OWASP Gen AI Security Project. (2025). LLM01: Prompt Injection. OWASP Gen AI Security Project. https://genai.owasp.org/llmrisk/llm01-prompt-injection/
- Jain, N., Schwarzschild, A., Wen, Y., Somepalli, G., Kirchenbauer, J., Chiang, P., Goldblum, M., Saha, A., Geiping, J., & Goldstein, T. (2023). Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614.
- Hung, K.-H., Ko, C.-Y., Rawat, A., Chung, I.-H., Hsu, W. H., & Chen, P.-Y. (2024). Attention Tracker: Detecting prompt injection attacks in LLMs. North American Chapter of the Association for Computational Linguistics.
- Kaggle. (n.d.). Kaggle: Your machine learning and data science community. Kaggle. https://www.kaggle.com
- Hugging Face. (n.d.). Hugging Face – The AI community building the future. Hugging Face. https://huggingface.co
- Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., & Androutsopoulos, I. (2020). LEGAL-BERT: The Muppets Straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020 (Short Papers) (pp. 2898–2904). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.261
- Sentence Transformers. (n.d.). all-MiniLM-L6-v2 [Model]. Hugging Face. https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- Solatorio, A. V. (2024). GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning [Preprint]. arXiv. https://arxiv.org/abs/2402.16829
SOURCE: Proceedings of the 16th International Conference on Business Information Security BISEC'2025