Development of a Model for Detecting Prompt Injection in Large Language Models Using the BERT Architecture

Milica M. Živanović² and Marko M. Živanović¹
¹ Faculty of Organizational Sciences, University of Belgrade, Jove Ilića 154, Belgrade, 11000, Serbia
² Faculty of Information Technology, Belgrade Metropolitan University, Tadeuša Košćuška 63, Belgrade, 11000, Serbia
milicazivanovic2411@gmail.com
marko.zivanovic@metropolitan.ac.rs
DOI: 10.46793/BISEC25.186Z


ABSTRACT: Large Language Models (LLMs) have demonstrated remarkable proficiency in both understanding and generating natural language, which has contributed to their rapid adoption across various domains. Yet, their widespread use has also exposed them to emerging security threats, most notably prompt injection attacks. Such attacks can compromise model behavior and potentially reveal sensitive information. This research explores the phenomenon of prompt injection and surveys existing defense mechanisms, with a particular focus on developing and evaluating detection approaches. The study formulates the detection task as a binary text classification problem, distinguishing between malicious and benign prompts. Central to the analysis is the application of the BERT architecture and its lightweight variants. The main objective is to compare the performance of smaller, fine-tuned BERT-based models in identifying malicious inputs. The underlying hypothesis is that these compact models, when properly adapted, can outperform larger counterparts in detecting and mitigating injection-based attacks due to their efficiency and adaptability.

KEYWORDS: Large language models, prompt injection, malicious prompt detection, BERT architecture, cybersecurity, classification problem.
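To make the detection setup from the abstract concrete, the sketch below fine-tunes a compact BERT variant as a binary prompt classifier on the deepset/prompt-injections dataset cited in the references. The choice of distilbert-base-uncased, the hyperparameters, and the assumption that label 1 marks an injection are illustrative, not the exact configuration evaluated in the study.

# A minimal sketch of the binary classification setup described in the
# abstract: a compact BERT variant fine-tuned to separate malicious
# (injection) prompts from benign ones. Model choice and hyperparameters
# are illustrative assumptions; the dataset is the cited
# deepset/prompt-injections set (assumed: label 1 = injection, 0 = benign).
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("deepset/prompt-injections")  # 'train' / 'test' splits
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Pad/truncate prompts to a fixed length so batches stack cleanly.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

# Two output labels: 0 = benign prompt, 1 = prompt injection.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="injection-detector",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
print(trainer.evaluate())  # loss on the held-out test split

Once trained, such a checkpoint can score incoming prompts with a standard text-classification pipeline before they are forwarded to the protected LLM, which is what makes the compact variants attractive as a low-latency filtering layer.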


ACKNOWLEDGMENT: The authors express their gratitude to Belgrade Metropolitan University for the stimulating environment for scientific research and for the financial support provided. They are particularly grateful for the waiver of the registration fee, which directly enabled the publication and presentation of this research.

REFERENCES:

  1. OWASP. (2023). OWASP Top 10 for LLM Applications. https://llmtop10.com/
  2. Kaddour, J., Harris, J., Mozes, M., Bradley, H., Raileanu, R., & McHardy, R. (2023). Challenges and Applications of Large Language Models. arXiv preprint arXiv:2307.10169.
  3. Rahman, M. A., Shahriar, H., Wu, F., & Cuzzocrea, A. (2024). Applying Pre-trained Multilingual BERT in Embeddings for Improved Malicious Prompt Injection Attacks Detection.
  4. Yu, J., Lin, X., Yu, Z., & Xing, X. (2024). GPTFuzzer: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts. arXiv preprint arXiv:2309.10253. https://arxiv.org/pdf/2309.10253
  5. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
  6. Deepset. (2024). deepset/prompt-injections [Dataset]. Hugging Face. https://huggingface.co/datasets/deepset/prompt-injections
  7. Hugging Face. (2025). Sequence classification. https://huggingface.co/docs/transformers/tasks/sequence_classification
  8. Gupta, M. (2025). What is Prompt Injection? AI has got a new poison: A bigger problem than LLM hallucinations. Medium. https://medium.com/data-science-in-your-pocket/what-is-prompt-injection-ai-has-got-a-new-poison-3b6455b57b4d
  9. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762. https://arxiv.org/abs/1706.03762
  10. GeeksforGeeks. (2025, July 17). Explanation of BERT Model – NLP. https://www.geeksforgeeks.org/nlp/explanation-of-bert-model-nlp/
  11. Perez, F., & Ribeiro, I. (2022). Ignore Previous Prompt: Attack Techniques for Language Models. arXiv preprint arXiv:2211.09527.
  12. OWASP Gen AI Security Project. (2025). LLM01: Prompt Injection. https://genai.owasp.org/llmrisk/llm01-prompt-injection/
  13. Jain, N., Schwarzschild, A., Wen, Y., Somepalli, G., Kirchenbauer, J., Chiang, P., Goldblum, M., Saha, A., Geiping, J., & Goldstein, T. (2023). Baseline Defenses for Adversarial Attacks Against Aligned Language Models. arXiv preprint arXiv:2309.00614.
  14. Hung, K.-H., Ko, C.-Y., Rawat, A., Chung, I.-H., Hsu, W. H., & Chen, P.-Y. (2024). Attention Tracker: Detecting Prompt Injection Attacks in LLMs. North American Chapter of the Association for Computational Linguistics.
  15. Kaggle. (n.d.). Kaggle: Your machine learning and data science community. https://www.kaggle.com
  16. Hugging Face. (n.d.). Hugging Face – The AI community building the future. https://huggingface.co
  17. Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., & Androutsopoulos, I. (2020). LEGAL-BERT: The Muppets straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 2898–2904). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.261
  18. Sentence Transformers. (n.d.). all-MiniLM-L6-v2 [Model]. Hugging Face. https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
  19. Solatorio, A. V. (2024). GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning [Preprint]. arXiv. https://arxiv.org/abs/2402.16829


SOURCE: Proceedings of the 16th International Conference on Business Information Security BISEC'2025