Predictive LLMs: Scaling, Reproducibility & DeepSeek
In our blog series on Large Language Models (LLMs), we have focused on how well different LLMs can predict vehicle prices, a continuous (metric) target variable. We provide the models not only with classic tabular data but also with free-text descriptions of each car's condition and the sales conditions. In the last post, we showed that open-source models can now compete well with OpenAI's models. This time, we look at further details: We wanted to know whether a larger open-source model, measured by its number of parameters, also makes better predictions than a smaller one. We also report on our experiences with fine-tuning on multiple GPUs on an AWS machine, discuss challenges in the reproducibility of LLM fine-tuning, and take a look at CPU-based inference. In another experiment, we examined a model from the Chinese company DeepSeek, which received a lot of attention at the beginning of the year.
Vehicle Price Prediction and Model Size
Results
The available open-source LLMs differ, among other things, in their number of parameters. This raised the question of whether larger models also make better predictions. To test this, we generated predictions for our benchmark dataset with Llama-3.1-70B-Instruct and Llama-3.2-1B-Instruct and compared the results to the Llama-3.1-8B-Instruct model we have used before. We thus compare three models from the same model family with different numbers of parameters; 8B stands for 8 billion parameters. Table 1 shows the results. The benchmark dataset contains vehicle prices as the target variable, which are predicted based on various vehicle characteristics. Further details about the dataset are described here.
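To illustrate how a single observation can be presented to such a model, here is a minimal sketch of how tabular features and a free-text description might be combined into a prompt. The field names and wording are hypothetical and not taken from our actual pipeline.

```python
# Hypothetical sketch: turning one car record (tabular features + free text)
# into a prompt for a fine-tuned causal LLM. Field names are illustrative only.
def build_prompt(car: dict) -> str:
    return (
        "Predict the sales price in EUR for the following vehicle.\n"
        f"Brand: {car['brand']}\n"
        f"Model: {car['model']}\n"
        f"First registration: {car['first_registration']}\n"
        f"Mileage (km): {car['mileage_km']}\n"
        f"Description: {car['description']}\n"
        "Price:"
    )

example = {
    "brand": "VW",
    "model": "Golf",
    "first_registration": 2018,
    "mileage_km": 85000,
    "description": "Well maintained, non-smoking vehicle, small scratch on rear bumper.",
}
print(build_prompt(example))
```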
| Model | MAPE (%) | Median APE (%) |
|---|---|---|
| Llama-3.2 1B | 12.5 | 6.3 |
| Llama-3.1 8B | 10.3 | 5.5 |
| Llama-3.1 70B | 11.0 | 6.7 |
Table 1: Results with 6000 training observations
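As a reminder, the two error metrics in Table 1 are the mean and the median of the absolute percentage errors over the benchmark observations. A minimal sketch of how they can be computed:

```python
import numpy as np

def ape(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Absolute percentage error per observation, in percent."""
    return np.abs((y_true - y_pred) / y_true) * 100

# Toy example with three predicted car prices
y_true = np.array([10000.0, 25000.0, 40000.0])
y_pred = np.array([10900.0, 23500.0, 41000.0])

errors = ape(y_true, y_pred)
print(f"MAPE: {errors.mean():.1f} %")          # mean absolute percentage error
print(f"Median APE: {np.median(errors):.1f} %")
```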
In our experiment, it turned out that a larger model does not necessarily lead to better predictions: the 8B model performs best. Several aspects of these results are worth discussing. First, there is the question of reproducibility: if we fine-tune again with the same training data, do we get a model that makes the same predictions? We discuss this topic further below. In addition, new technical questions arose during fine-tuning of the 70B model, as it is too large to fine-tune on our in-house Nvidia L40S GPU.
Multi-GPU Fine-tuning
Instead, we switched to an AWS machine with 8 Nvidia A100 GPUs and used the PyTorch-native torchtune framework to fine-tune the 70B model in a multi-GPU setup. For this, we used a Docker container setup and required software such as the CUDA Toolkit, CUDA drivers, the Nvidia Container Toolkit, and the Nvidia Fabric Manager, which controls communication between the GPUs. It turned out that we needed one of Nvidia's latest GPU models, the A100 with 80 GB of memory each, to carry out the fine-tuning. Our attempt to fine-tune the model on 8 smaller L40S GPUs with 48 GB of memory each did not succeed. In the future, we could also experiment with quantizing models in order to run larger models on smaller hardware. A multi-GPU setup is also required when using the adapted (fine-tuned) model for inference. The Python package vLLM offers a convenient and fast solution for this.
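As a rough illustration of the inference side, here is a minimal sketch of how a fine-tuned LoRA adapter could be served with vLLM across several GPUs via tensor parallelism. The adapter path is a placeholder, the sketch assumes the fine-tuning produced a LoRA adapter, and the exact arguments may differ depending on the vLLM version.

```python
# Sketch: multi-GPU inference with vLLM and a LoRA adapter (paths are placeholders).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # base model
    tensor_parallel_size=8,                      # shard the model across 8 GPUs
    enable_lora=True,                            # allow serving fine-tuned adapters
)

params = SamplingParams(temperature=0.0, max_tokens=16)  # deterministic, short numeric answer
prompts = ["Predict the sales price in EUR for the following vehicle. ... Price:"]

outputs = llm.generate(
    prompts,
    params,
    lora_request=LoRARequest("car-price-adapter", 1, "/path/to/finetuned/adapter"),
)
print(outputs[0].outputs[0].text)
```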
Reproducibility of LLM Fine-tuning
Why Reproducibility Matters
Reproducibility is primarily known as an important criterion in scientific contexts. There, it usually means that the code and, ideally, the data are provided so that other researchers can independently verify how the results were generated, which makes it easier to discover errors. Reproducibility is also relevant from a compliance perspective with regard to the EU AI Act. Data science projects use complex algorithms and models that involve random elements. Often, setting a seed for the random number generator is enough to make the results of model training reproducible. Reproducibility is useful because it allows different hyperparameter combinations to be compared more reliably: without it, it is less clear whether a change in predictive quality is due to the changed hyperparameters or to the randomness in training.
Our Observations
To reproduce an LLM fine-tuning on a GPU in such a way that identical results can be generated, it was not enough in our experiments to set a simple seed in the Python script. Our research revealed that, in addition to the training algorithms, there are further random elements at the hardware level that cannot yet be fully controlled. For PyTorch, some of these aspects are described in more detail here, for example. As a result, repeated fine-tuning with the same data and hyperparameters gives us different results for the prediction quality. Other researchers have made the same observation. In this paper, the authors summarize in their abstract: "Our study shows that fine-tuning of LLMs with the QLoRA method is not repeatable (not stable), such that different fine-tuned runs result in different performance on the holdout test set." A repetition of the fine-tuning under the same conditions thus leads to different results.
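For completeness, this is a minimal sketch of the kind of seeding and determinism settings available in PyTorch. Even with all of them in place, some CUDA operations can remain non-deterministic, which matches our observations.

```python
# Sketch: seeding and determinism flags in PyTorch. Even with these settings,
# some CUDA kernels remain non-deterministic, so results may still differ between runs.
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)                 # Python RNG
    np.random.seed(seed)              # NumPy RNG
    torch.manual_seed(seed)           # CPU and CUDA RNGs
    torch.cuda.manual_seed_all(seed)  # all GPUs

# Additional flags that trade speed for determinism
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed for some deterministic cuBLAS ops
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True, warn_only=True)

set_seed(42)
```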
Interpretation and Recommendation for LLM Applications
If even small differences in prediction quality have a high leverage effect in the application, we recommend a targeted model selection:
- Multiple fine-tuning and evaluation of the resulting models on a validation dataset
- Selection of the model with the best prediction quality
- Final evaluation of the selected model on an independent test dataset
- Use of this model for creating the predictions
This approach can be seen as analogous to conventional model selection, for example when deciding in which functional form a feature enters the model. Here, different parameter values of the LLM are tested to see whether they lead to better predictions; the parameter values are not chosen manually but by the (randomized) training algorithm. For our results, we did not perform such a model selection, so we cannot say for certain how strongly the predictive quality would fluctuate if fine-tuning had been repeated multiple times. However, our results give a good indication of what the individual models can achieve. To ensure reproducibility of the predictions in the sense of the EU AI Act, it is important that the selected fine-tuned model is saved and that a seed is used when generating the predictions. Only then are the predictions exactly reproducible.
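A minimal sketch of such a selection step, assuming several fine-tuned runs have already produced predictions on a shared validation set; the run names and data are illustrative only.

```python
# Sketch: pick the best of several fine-tuning runs by validation MAPE
# (same metric as above). Run names, predictions, and splits are illustrative.
import numpy as np

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def select_best_run(runs: dict, y_val: np.ndarray) -> str:
    """`runs` maps a run name to its predictions on the validation set."""
    scores = {name: mape(y_val, preds) for name, preds in runs.items()}
    print(f"Validation MAPE per run: {scores}")
    return min(scores, key=scores.get)

# Example with dummy predictions from three fine-tuning runs
y_val = np.array([12000.0, 30000.0, 18000.0])
runs = {
    "run_1": np.array([12500.0, 29000.0, 19000.0]),
    "run_2": np.array([11800.0, 30500.0, 18200.0]),
    "run_3": np.array([13000.0, 28000.0, 17500.0]),
}
best_run = select_best_run(runs, y_val)
print(f"Selected: {best_run}")  # evaluate this run once more on an independent test set
```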
Prediction Speed: CPU vs. GPU
In ongoing operation, the location where the model runs may not have a powerful GPU, and it can save costs if one does not have to be purchased. Ideally, fine-tuning is carried out in a central data center that has GPUs; after fine-tuning on the GPU, the model can then potentially be used for inference and other applications with pure CPU computing power. The main question is therefore whether running an LLM on a CPU is possible at all and by how much this extends the runtime for individual predictions. We tested this with the Llama-3.2-1B. On the GPU, creating a prediction for the car price took about 0.13 seconds on average, regardless of whether the transformers or the vLLM Python library was used. On the CPU, creating a prediction took about 4.1 seconds on average, which increased the runtime in our application by roughly a factor of 30. We also saw that quantization of the LLM can provide further speed advantages. Here, the parameter values of the model are converted into a lower-precision data format. With the 16-bit data format, our predictions were 4 times faster than with the 32-bit format. Stronger quantization to 8-bit or 4-bit is already implemented in various software packages, and we plan to test this possibility as well.
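As an illustration, here is a minimal sketch of CPU inference with the Hugging Face transformers library, loading the model once in 32-bit and once in 16-bit precision. The prompt is a placeholder, a fine-tuned checkpoint would be loaded the same way via its local path, and since the post does not specify which 16-bit format was used, bfloat16 is shown here because it is well supported on CPUs.

```python
# Sketch: CPU inference with transformers in float32 vs. a 16-bit format (bfloat16).
# The prompt is a placeholder for our actual prompt format.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Predict the sales price in EUR for the following vehicle. ... Price:"
inputs = tokenizer(prompt, return_tensors="pt")

for dtype in (torch.float32, torch.bfloat16):
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype)
    model.eval()
    start = time.perf_counter()
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=8)
    elapsed = time.perf_counter() - start
    print(dtype, f"{elapsed:.2f} s", tokenizer.decode(output[0], skip_special_tokens=True))
```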
Car Price Prediction with DeepSeek
Hype Around DeepSeek
In January of this year, numerous major news portals reported on the new language models from the Chinese company DeepSeek, which are said to be more cost-effective to develop and operate while offering performance comparable to OpenAI's models. This announcement attracted a lot of attention in the developer community, reason enough for us to take a closer look at a DeepSeek model and test it practically in the context of our application.
Results
Our benchmark dataset gives us a good opportunity to compare a DeepSeek LLM with OpenAI and other models. Table 2 shows the prediction quality for the car price data. We are using a small version of the R1 model, called DeepSeek-R1-Distill-Llama-8B.
| Model | N | MAPE (%) | Median APE (%) |
|---|---|---|---|
| gpt-4 | 600 | 11.3 | 6.6 |
| gpt-3.5-turbo | 600 | 11.8 | 6.6 |
| Llama-3.1 8B | 600 | 14.8 | 9.7 |
| Llama-3.1 70B | 600 | 14.8 | 9.6 |
| Llama-3.2 1B | 600 | 22.8 | 12.3 |
| DeepSeek | 600 | 24.1 | 12.8 |
| Llama-3.1 8B | 6000 | 10.3 | 5.5 |
| DeepSeek | 6000 | 10.8 | 5.5 |
| Llama-3.1 70B | 6000 | 11.0 | 6.7 |
| Llama-3.2 1B | 6000 | 12.5 | 6.3 |
Table 2: Results with the DeepSeek model R1-Distill-Llama-8B and other models in comparison. The results for the Llama-3.1 8B were improved by differently chosen hyperparameters compared to the last blog article.
With 6000 training observations, the DeepSeek model comes close to the Llama-3.1 8B model (10.8% vs. 10.3% MAPE). With 600 training observations, however, its prediction quality falls clearly behind (24.1% vs. 14.8%). With few training observations, the OpenAI models are clearly ahead; with larger training datasets (here 6000 observations), open-source models can achieve similar or better performance.
Summary
Our results show that a larger number of parameters does not necessarily lead to better predictions; in our case, the 8B Llama model even achieved slightly better results than the 70B model. In addition, we discussed that, for optimal prediction quality, it can be helpful to select the best model from multiple fine-tuning runs. This approach makes sense because individual fine-tuning runs can lead to different results due to the high complexity of the LLM architectures as well as hard-to-control random elements in the training algorithms and the hardware used. Creating LLM predictions with a CPU instead of a GPU is fundamentally possible, but in our experiments, it increased the runtime by a factor of roughly 30. Furthermore, we tested an 8B variant of the R1 model from DeepSeek, which achieves a prediction quality comparable to the Llama-3.1 8B, especially with a larger number of training observations.
Outlook
In further blog articles, we would like to share our experiences with multi-GPU fine-tuning in more detail. Another relevant topic for us is the quantization of LLMs, as it could enable their use on more cost-effective hardware. We have also become aware of TabPFN, a promising foundation model developed specifically for predictions on tabular data, and will report on our experiences with it as well.