https://arxiv.org/abs/2502.13347v1
Crawl4LLM: Efficient Web Crawling for LLM Pretraining
Web crawl is a main source of large language models' (LLMs) pretraining data, but the majority of crawled web pages are discarded in pretraining due to low data quality. This paper presents Crawl4LLM, an efficient web crawling method that explores the web ..

1. Abstract
For pretraining large language models (LLMs), Crawl4LLM ..