Last Update: July 18, 2024

Monitoring a Fine-Tuning Job

Monitoring the fine-tuning process is essential to confirm that the model is learning effectively and to make adjustments as needed. With Tromero, users can monitor the progress of a fine-tuning job and gain insight into the training process.

Monitor Through Tromero's Python and TypeScript libraries

The following Python snippet retrieves the metrics for a fine-tuning job:

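import os

from tromero import Tromero

# Read the API key from the TROMERO_KEY environment variable.
client = Tromero(tromero_key=os.getenv("TROMERO_KEY"))

# Replace {model_name} with the name of the fine-tuned model.
response = client.fine_tuning_jobs.get_metrics("{model_name}")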
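
To follow a job while it trains, the metrics call above can be repeated at intervals. The sketch below is a minimal illustration, not part of the Tromero library: `poll_metrics` is a hypothetical helper that accepts any zero-argument callable, so it makes no assumption about the shape of the metrics response.

```python
import time
from typing import Any, Callable, List


def poll_metrics(
    fetch: Callable[[], Any],
    polls: int = 3,
    interval_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Any]:
    """Call `fetch` `polls` times, pausing between calls, and collect the results.

    Hypothetical helper: `fetch` is any callable that returns a metrics snapshot.
    """
    if polls < 1:
        raise ValueError("polls must be at least 1")
    snapshots: List[Any] = []
    for attempt in range(polls):
        snapshots.append(fetch())
        # Skip the pause after the final call.
        if attempt < polls - 1:
            sleep(interval_seconds)
    return snapshots
```

With the client from the snippet above, a call might look like `poll_metrics(lambda: client.fine_tuning_jobs.get_metrics("{model_name}"))`. The polling interval and number of polls are illustrative defaults, not recommended values.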

Monitor Through The UI

1. Access the Training Metrics Page

Navigate to the Training Metrics page to view the status of training jobs.

2. Check Job Status

Training Jobs: View the list of jobs that are currently training.
Failed Jobs: Identify and review jobs where the training has failed.

3. Review Metrics

The primary metric collected during training is the loss, which is central to evaluating a model's performance. Two values are reported:

Training Loss: Indicates how well the model fits the training data; lower values mean a closer fit.
Validation Loss: Measures the model's performance on a separate validation dataset that is held out from training.
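
One simple way to read the two values together is the gap between them. The sketch below is an illustrative helper (the function name is hypothetical) that uses the loss values from the Example Metrics Display on this page; a large positive gap means the model fits the training data much better than the held-out data.

```python
def generalization_gap(training_loss: float, validation_loss: float) -> float:
    """Return validation loss minus training loss."""
    return validation_loss - training_loss


# Values taken from the Example Metrics Display on this page.
gap = generalization_gap(training_loss=5.6282, validation_loss=10.8603)
print(f"Generalization gap: {gap:.4f}")
```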

4. Analyze Visualizations

The Training Metrics page offers two main types of visualizations to help users understand their model's performance:

Performance Comparison Chart: This bar chart compares the training and validation loss before and after fine-tuning, showing the reduction in loss as a result of the training.
Model Performance Over Steps Chart: This line chart displays the loss over the steps of the training process, allowing users to see how the loss decreases as the training progresses.
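
Per-step loss values are often noisy, so a smoothed curve can make the trend in a loss-over-steps chart easier to read. The sketch below is an illustration only: it assumes the per-step losses are already available as a plain list of floats and does not describe how the Training Metrics page draws its chart.

```python
from typing import List, Sequence


def moving_average(losses: Sequence[float], window: int = 3) -> List[float]:
    """Trailing moving average; early entries average over the values seen so far."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed: List[float] = []
    running_sum = 0.0
    for i, loss in enumerate(losses):
        running_sum += loss
        # Drop the value that has just left the window.
        if i >= window:
            running_sum -= losses[i - window]
        count = min(i + 1, window)
        smoothed.append(running_sum / count)
    return smoothed
```

A larger `window` gives a smoother curve at the cost of reacting more slowly to real changes in the loss.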

Example Metrics Display

Training Details:
- Model Name: DoctorAI-1.0
- Tags: doctor, medical
- Base Model: Mistral-7B
Training Status: Running
Training Duration: 1 hour 30 minutes
Current Metrics:
- Training Loss: 5.6282
- Validation Loss: 10.8603
Hyperparameters Used:
- Learning Rate: auto
- Batch Size: auto
- Epochs: auto
- Percentage of Dataset for Eval: 5% (auto)
Performance Comparison:
- Training Loss Reduction: 5.2321
- Validation Loss Reduction: 0.0000

Example Charts

Performance Comparison (Start/End of Fine-Tuning)

Metric	Before Fine-Tuning	After Fine-Tuning
Training Loss	10.860	5.628
Validation Loss	10.861	5.860
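
The reduction shown in a comparison like this is the loss before fine-tuning minus the loss after. As a minimal sketch (the helper name is hypothetical), the arithmetic for the training loss row above is:

```python
from typing import Tuple


def loss_reduction(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute reduction, percentage reduction relative to `before`)."""
    if before == 0:
        raise ValueError("before must be non-zero to compute a percentage")
    absolute = before - after
    percent = 100.0 * absolute / before
    return absolute, percent


# Training loss row from the table above.
absolute, percent = loss_reduction(before=10.860, after=5.628)
print(f"Training loss reduction: {absolute:.3f} ({percent:.1f}%)")
```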

Monitoring both the training and validation loss shows how well a model is learning and how well it generalizes to new data. Consistently decreasing values for both indicate effective training; a validation loss that stalls or rises while the training loss keeps falling suggests overfitting.
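
The text above ties effective training to consistently decreasing loss. A minimal sketch for flagging the opposite case is shown below: it reports when the most recent validation losses have all failed to improve on the best earlier value. The function name and the `patience` threshold are assumptions made for illustration, and the input is assumed to be a plain list of validation losses in evaluation order.

```python
from typing import Sequence


def validation_stalled(validation_losses: Sequence[float], patience: int = 3) -> bool:
    """True if the last `patience` values all failed to beat the best earlier loss."""
    if patience < 1:
        raise ValueError("patience must be at least 1")
    # Not enough history yet to make a judgment.
    if len(validation_losses) <= patience:
        return False
    best_before = min(validation_losses[:-patience])
    recent = validation_losses[-patience:]
    return all(loss >= best_before for loss in recent)
```

A `True` result is a prompt to look more closely at the job, not proof that training has failed.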

Best Practices

Monitor regularly: Check the fine-tuning job metrics throughout training to confirm that the model is learning effectively.
Compare metrics: Compare the metrics of different fine-tuning jobs to identify the most effective hyperparameters.
Adjust as needed: Change the hyperparameters, model architecture, or training data when the metrics show that training is not progressing as expected.
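
Comparing jobs can be as simple as ranking them by final validation loss. The sketch below assumes the final validation loss of each job has already been collected into a dictionary; the job names and values in the usage example are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_jobs(final_validation_loss: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (job name, loss) pairs sorted from lowest to highest validation loss."""
    return sorted(final_validation_loss.items(), key=lambda item: item[1])


# Hypothetical job names and losses, for illustration only.
ranking = rank_jobs({"job-a": 5.9, "job-b": 5.4, "job-c": 6.2})
for name, loss in ranking:
    print(f"{name}: {loss:.3f}")
```

Validation loss is used for the ranking because it reflects performance on data the model did not train on.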

For further assistance, please contact support@tromero.ai and we would be happy to help!