Monitoring a Fine-Tuning Job
Monitoring the fine-tuning process is essential to ensure that the model is learning effectively and to make adjustments as needed. With Tromero, users can track the progress of a fine-tuning job and gain insight into the training process.
Monitor Through Tromero's Python and TypeScript libraries
Users can retrieve metrics for a fine-tuning job with the following Python snippet:
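import os
from tromero import Tromero
# Read the API key from the TROMERO_KEY environment variable.
client = Tromero(tromero_key=os.getenv("TROMERO_KEY"))
# Replace {model_name} with the name of the fine-tuned model.
response = client.fine_tuning_jobs.get_metrics("{model_name}")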
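`os.getenv` returns `None` when the variable is unset, so a missing key may only surface later as an authentication error. A minimal sketch of a guard that fails early is shown below; `require_api_key` is a hypothetical helper written for illustration and is not part of the Tromero library.

```python
import os
from typing import Mapping, Optional


def require_api_key(name: str = "TROMERO_KEY",
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key stored under `name`, or raise if it is missing.

    Hypothetical helper for illustration only. `env` defaults to the
    process environment and can be overridden for testing.
    """
    source = os.environ if env is None else env
    key = source.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key


# Usage with the client from the snippet above (not executed here):
# client = Tromero(tromero_key=require_api_key())
```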
Monitor Through The UI
1. Access the Training Metrics Page
Navigate to the Training Metrics page to view the status of training jobs.
2. Check Job Status
- Training Jobs: View the list of jobs that are currently training.
- Failed Jobs: Identify and review jobs where the training has failed.
3. Review Metrics
The primary metric collected during training is the loss, which is central to evaluating a model's performance. Two values are reported:
- Training Loss: Indicates how well the model is learning from the training data.
- Validation Loss: Measures the model's performance on a separate validation dataset.
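The difference between the two losses is a quick signal of how well the model generalizes. As a rough sketch, the gap can be computed as follows; the function names and the tolerance of 1.0 are illustrative assumptions, not values used by Tromero.

```python
def generalization_gap(training_loss: float, validation_loss: float) -> float:
    """Return validation loss minus training loss.

    A positive value means the model does worse on held-out data than on
    the data it was trained on.
    """
    return validation_loss - training_loss


def gap_is_large(training_loss: float, validation_loss: float,
                 tolerance: float = 1.0) -> bool:
    """Flag a gap above `tolerance`.

    The default of 1.0 is an arbitrary illustrative threshold and should
    be tuned for the task at hand.
    """
    return generalization_gap(training_loss, validation_loss) > tolerance
```

A large positive gap is worth investigating, since it can indicate that the model is fitting the training data without generalizing.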
4. Analyse Visualizations
The Training Metrics page offers two main types of visualizations to help users understand their model's performance:
- Performance Comparison Chart: This bar chart compares the training and validation loss before and after fine-tuning, showing the reduction in loss as a result of the training.
- Model Performance Over Steps Chart: This line chart displays the loss over the steps of the training process, allowing users to see how the loss decreases as the training progresses.
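Per-step loss values are often noisy, which can make the trend in a line chart hard to read. A minimal sketch of a trailing moving average that smooths a loss series before plotting is shown below; the function name and the sample values are illustrative, and this is not necessarily how the Training Metrics page renders its chart.

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Smooth `values` with a trailing moving average.

    The first `window - 1` outputs average only the values seen so far,
    so the result has the same length as the input.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just fell out of the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


# Illustrative loss values, one per training step.
print(moving_average([4.0, 2.0, 3.0, 1.0], window=2))
```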
Example Metrics Display
- Training Details:
  - Model Name: DoctorAI-1.0
  - Tags: doctor, medical
  - Base Model: Mistral-7B
- Training Status: Running
- Training Duration: 1 hour 30 minutes
- Current Metrics:
  - Training Loss: 5.6282
  - Validation Loss: 10.8603
- Hyperparameters Used:
  - Learning Rate: auto
  - Batch Size: auto
  - Epochs: auto
  - Percentage of Dataset for Eval: 5% (auto)
- Performance Comparison:
  - Training Loss Reduction: 5.2321
  - Validation Loss Reduction: 0.0000
Example Charts
Performance Comparison (Start/End of Fine-Tuning)
Metric | Before Fine-Tuning | After Fine-Tuning |
---|---|---|
Training Loss | 10.860 | 5.628 |
Validation Loss | 10.861 | 5.860 |
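The reduction shown in the comparison is the loss before fine-tuning minus the loss after it. A small sketch of that arithmetic, using the values from the table above (`loss_reduction` is an illustrative name, not a Tromero function):

```python
from typing import Tuple


def loss_reduction(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute reduction, reduction as a fraction of `before`)."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    absolute = before - after
    return absolute, absolute / before


# Values taken from the table above.
train_abs, train_rel = loss_reduction(10.860, 5.628)
val_abs, val_rel = loss_reduction(10.861, 5.860)
print(f"Training loss reduction:   {train_abs:.3f} ({train_rel:.1%})")
print(f"Validation loss reduction: {val_abs:.3f} ({val_rel:.1%})")
```

For the table above this gives a reduction of about 5.232 (roughly 48%) in training loss and about 5.001 (roughly 46%) in validation loss.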
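Monitoring the training and validation loss is essential for understanding how well a user's model is learning and generalizing to new data. Consistently decreasing loss values indicate effective training progress, whereas a training loss that keeps falling while the validation loss stalls or rises suggests overfitting.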
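One simple way to check for a consistent decrease is to measure the fraction of consecutive evaluations at which the loss went down. The sketch below does this; the function name and sample values are illustrative assumptions.

```python
from typing import Sequence


def fraction_decreasing(losses: Sequence[float]) -> float:
    """Return the fraction of consecutive pairs where the loss decreased."""
    if len(losses) < 2:
        raise ValueError("need at least two loss values")
    drops = sum(1 for prev, cur in zip(losses, losses[1:]) if cur < prev)
    return drops / (len(losses) - 1)


# Illustrative series: one steadily falling, one fluctuating.
steady = fraction_decreasing([10.0, 8.0, 6.0, 5.0])
noisy = fraction_decreasing([4.0, 3.0, 3.5, 2.0, 2.5])
```

A value close to 1.0 suggests steady progress, while a value around 0.5 or lower suggests the loss is fluctuating or rising and the run may need attention.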
Best Practices
- Monitor regularly: Check the fine-tuning job metrics throughout training to ensure the model is learning effectively.
- Compare metrics: Compare the metrics across different fine-tuning jobs to identify the most effective hyperparameters.
- Adjust as needed: Adjust the hyperparameters, model architecture, or training data as needed based on the metrics.
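Comparing jobs can be as simple as ranking them by final validation loss. A minimal sketch follows, assuming the metrics for each job have already been collected into records; `JobResult`, its fields, and the sample values are hypothetical and do not reflect the shape of the Tromero response.

```python
from typing import NamedTuple, Sequence


class JobResult(NamedTuple):
    """Hypothetical record of one finished fine-tuning job."""
    name: str
    learning_rate: float
    validation_loss: float


def best_job(results: Sequence[JobResult]) -> JobResult:
    """Return the job with the lowest validation loss."""
    if not results:
        raise ValueError("no results to compare")
    return min(results, key=lambda r: r.validation_loss)


# Made-up values for illustration only.
runs = [
    JobResult("run-a", learning_rate=1e-4, validation_loss=6.10),
    JobResult("run-b", learning_rate=5e-5, validation_loss=5.86),
    JobResult("run-c", learning_rate=1e-5, validation_loss=7.42),
]
winner = best_job(runs)
print(f"{winner.name} (learning rate {winner.learning_rate})")
```

Validation loss is used for the ranking rather than training loss because it reflects performance on data the model did not train on.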
For further assistance, please contact support@tromero.ai; we would be happy to help.