The first stage of LLM development is pre-training. This step uses self-supervised learning to train the model on massive text collections, such as web pages, books, articles, and source code. Before the model sees any text, the text is divided into tokens through tokenization. Tokens are the basic units of text that the model works with, and they are often smaller than complete words.
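As a rough illustration of how text can be split into units smaller than words, the sketch below applies greedy longest-match splitting against a tiny vocabulary. The vocabulary `VOCAB` and the function `tokenize` are hypothetical names made up for this example. Real tokenizers learn their vocabularies from data and use more refined algorithms, so this is a simplification rather than how any particular model tokenizes text.

```python
# Toy vocabulary (hypothetical); real tokenizers learn theirs from data.
VOCAB = {"pre", "train", "ing", "token", "ization", "s", " "}


def tokenize(text, vocab=VOCAB):
    """Split text by greedy longest match against the vocabulary.

    Characters that match no vocabulary entry become single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("pretraining tokens"))
# ['pre', 'train', 'ing', ' ', 'token', 's']
```

Note that the word "pretraining" becomes three tokens, which is the sense in which tokens are often smaller than complete words.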

After tokenization, each token is mapped to a numerical representation that the model can process. Because all text is reduced to the same kind of token sequence, the model can handle multiple languages, writing styles, and formats. The training objective is next-token prediction: given a sequence of tokens, the model must predict the token that follows.
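The next-token objective can be sketched in a few lines of plain Python. The helper names (`build_vocab`, `next_token_pairs`, `cross_entropy`) and the fixed `context_size` are assumptions made for this illustration. Cross-entropy is used here because it is the usual loss for this objective, although the text above does not name a specific loss. No actual model is involved; the scores passed to the loss are placeholders.

```python
import math


def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def next_token_pairs(ids, context_size):
    """Yield (context, target) pairs; each target is the token after its context."""
    for i in range(len(ids) - context_size):
        yield ids[i:i + context_size], ids[i + context_size]


def cross_entropy(logits, target):
    """Negative log-probability of the target token under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
print(ids)  # [0, 1, 2, 3, 0, 4]

pairs = list(next_token_pairs(ids, context_size=2))
print(pairs[0])  # ([0, 1], 2): given "the cat", the target is "sat"

# Placeholder scores: a model that rates every token equally.
uniform_logits = [0.0] * len(vocab)
print(round(cross_entropy(uniform_logits, target=2), 4))  # 1.6094, i.e. log(5)
```

The labels come from the text itself, since each target is simply the token that follows its context. That is why no manually labeled data is needed.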

Through many training iterations, the model develops a broad understanding of language without relying on manually labeled data. It learns syntax, semantics, stylistic patterns, and implicit world knowledge encoded in the training data. This learning happens only during training; the model does not continue to learn at inference time.

The primary challenge in pre-training is scalability. Modern training runs process trillions of tokens and rely on large, distributed GPU clusters. As a result, memory efficiency, data throughput, multiprocessing workflows, and distributed training strategies become central concerns. However, pre-training is not simply about scale. Data quality, diversity, and filtering matter as much as raw volume. Deduplication, removal of low-quality content, and mitigation of bias are critical steps during dataset construction.
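Two of the dataset construction steps mentioned above, deduplication and removal of low-quality content, can be sketched as follows. This is a minimal example under stated assumptions: the function names and the `min_words` threshold are hypothetical, the quality check is a crude length heuristic, and only exact duplicates (after normalization) are caught. Large-scale pipelines often add near-duplicate detection and richer quality signals, which this sketch does not attempt.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup_and_filter(docs, min_words=5):
    """Drop very short documents and exact duplicates, keeping first occurrences.

    min_words is an arbitrary illustrative threshold, not a recommended value.
    """
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # crude low-quality filter: too short to be useful
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content already kept
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "The quick brown fox jumps.",
    "the  quick brown FOX jumps.",  # duplicate after normalization
    "too short",
    "A different document with enough words.",
]
print(dedup_and_filter(docs))
# ['The quick brown fox jumps.', 'A different document with enough words.']
```

Storing hashes instead of full documents keeps the memory cost of the `seen` set small, which matters once the corpus grows large.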

Pre-training is also resource intensive: it requires extensive computational infrastructure, long training times, and careful optimization. Hence, instead of training their own models from scratch, most teams rely on existing pretrained models, including open source models such as IBM Granite® and LLaMA. The output of this phase is a general-purpose language model that is not yet specialized for particular use cases.