Last Update: July 18, 2024

Monitoring a Fine-Tuning Job

Monitoring the fine-tuning process is essential for confirming that the model is learning effectively and for making adjustments as needed. With Tromero, users can monitor the progress of a fine-tuning job and gain insight into the training process.

Monitor Through Tromero's Python and TypeScript libraries

Users can retrieve metrics for a fine-tuning job using the following Python snippet:

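import os

from tromero import Tromero

# Read the API key from the TROMERO_KEY environment variable.
client = Tromero(tromero_key=os.getenv("TROMERO_KEY"))

# Replace {model_name} with the name of the fine-tuned model.
response = client.fine_tuning_jobs.get_metrics("{model_name}")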
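
To follow a job while it trains, the metrics call can be repeated on a timer. The sketch below shows one way to do that. `poll_metrics`, its parameters, and the `is_done` predicate are hypothetical helpers, not part of the Tromero library, and nothing is assumed about the shape of the response: the caller supplies both how to fetch and when to stop.

```python
import time
from typing import Any, Callable, List


def poll_metrics(
    fetch: Callable[[], Any],
    is_done: Callable[[Any], bool],
    interval_seconds: float = 60.0,
    max_polls: int = 10,
) -> List[Any]:
    """Call fetch() repeatedly and collect each response.

    Stops as soon as is_done(response) is true or after max_polls calls.
    """
    history: List[Any] = []
    for attempt in range(max_polls):
        response = fetch()
        history.append(response)
        if is_done(response):
            break
        # Sleep between polls, but not after the final attempt.
        if attempt < max_polls - 1:
            time.sleep(interval_seconds)
    return history
```

With a Tromero client, `fetch` could be `lambda: client.fine_tuning_jobs.get_metrics("{model_name}")`. What `is_done` should check depends on the response format, which this page does not specify.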

Monitor Through The UI

1. Access the Training Metrics Page

Navigate to the Training Metrics page to view the status of your training jobs.

2. Check Job Status

  • Training Jobs: View the list of jobs that are currently training.
  • Failed Jobs: Identify and review jobs where the training has failed.

3. Review Metrics

The primary metric collected during training is loss, which is crucial for evaluating a model's performance. Two values are reported:

  • Training Loss: Indicates how well the model is learning from the training data.
  • Validation Loss: Measures the model's performance on a separate validation dataset.
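
One common way to read the two losses together is to look at the gap between them: a validation loss that sits well above the training loss can be a sign of overfitting. The following is a minimal sketch of that check. The function names, the `max_gap` threshold, and the sample values are all illustrative assumptions, not Tromero output or a Tromero API.

```python
def loss_gap(training_loss: float, validation_loss: float) -> float:
    """Return validation loss minus training loss."""
    return validation_loss - training_loss


def looks_overfit(
    training_loss: float, validation_loss: float, max_gap: float = 1.0
) -> bool:
    """Flag a possible overfit when the gap exceeds max_gap.

    max_gap is an arbitrary threshold chosen for illustration; a sensible
    value depends on the model and the dataset.
    """
    return loss_gap(training_loss, validation_loss) > max_gap


# Illustrative values only.
print(looks_overfit(training_loss=0.42, validation_loss=0.55))
print(looks_overfit(training_loss=0.42, validation_loss=1.90))
```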

4. Analyze Visualizations

The Training Metrics page offers two main types of visualizations to help users understand their model's performance:

  1. Performance Comparison Chart: This bar chart compares the training and validation loss before and after fine-tuning, showing the reduction in loss as a result of the training.
  2. Model Performance Over Steps Chart: This line chart displays the loss over the steps of the training process, allowing users to see how the loss decreases as the training progresses.
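
Per-step loss is usually noisy, so a line chart like the one described above is easier to read after smoothing. The sketch below applies a simple moving average and checks whether the smoothed curve falls at every step. The helper names and the loss values are illustrative assumptions and are not taken from the Tromero UI or library.

```python
from typing import List


def moving_average(values: List[float], window: int) -> List[float]:
    """Average each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def is_decreasing(values: List[float]) -> bool:
    """True when every value is lower than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


# Illustrative per-step losses, not real training output.
losses = [10.9, 9.8, 10.1, 8.7, 8.9, 7.5, 7.8, 6.4]
smoothed = moving_average(losses, window=3)

# The raw series wobbles upward at some steps; the smoothed one does not.
print(is_decreasing(losses), is_decreasing(smoothed))
```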

Example Metrics Display

  • Training Details:
    • Model Name: DoctorAI-1.0
    • Tags: doctor, medical
    • Base Model: Mistral-7B
  • Training Status: Running
  • Training Duration: 1 hour 30 minutes
  • Current Metrics:
    • Training Loss: 5.6282
    • Validation Loss: 10.8603
  • Hyperparameters Used:
    • Learning Rate: auto
    • Batch Size: auto
    • Epochs: auto
    • Percentage of Dataset for Eval: 5% (auto)
  • Performance Comparison:
    • Training Loss Reduction: 5.2321
    • Validation Loss Reduction: 0.0000

Example Charts

Performance Comparison (Start/End of Fine-Tuning)

Metric             Before Fine-Tuning    After Fine-Tuning
Training Loss      10.860                5.628
Validation Loss    10.861                5.860
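
The reduction shown in a comparison like this is the loss before fine-tuning minus the loss after it. As a small worked sketch, the snippet below computes the absolute and percentage reduction from the example figures above. `loss_reduction` is a hypothetical helper written for this illustration, not a Tromero function.

```python
from typing import Tuple


def loss_reduction(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute reduction, reduction as a percentage of `before`)."""
    absolute = before - after
    percent = 100.0 * absolute / before
    return absolute, percent


# Values taken from the example comparison on this page.
for name, before, after in [
    ("Training Loss", 10.860, 5.628),
    ("Validation Loss", 10.861, 5.860),
]:
    absolute, percent = loss_reduction(before, after)
    print(f"{name}: reduced by {absolute:.3f} ({percent:.1f}%)")
```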

Best Practices

  • Monitor regularly: Check the fine-tuning job metrics throughout training to confirm that the model is learning effectively.
  • Compare metrics: Compare the metrics across different fine-tuning jobs to identify the most effective hyperparameters.
  • Adjust as needed: Change the hyperparameters, model architecture, or training data based on what the metrics show.
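
Comparing jobs can be as simple as ranking them by final validation loss. The sketch below assumes the metrics for each job have already been collected into plain dictionaries. The job names, field names, and numbers are hypothetical and do not reflect the format returned by Tromero.

```python
from typing import Any, Dict, List

# Hypothetical results for three fine-tuning jobs.
jobs: List[Dict[str, Any]] = [
    {"name": "job-a", "learning_rate": 1e-4, "validation_loss": 6.12},
    {"name": "job-b", "learning_rate": 5e-5, "validation_loss": 5.86},
    {"name": "job-c", "learning_rate": 1e-5, "validation_loss": 7.40},
]


def rank_by_validation_loss(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the jobs sorted from lowest to highest validation loss."""
    return sorted(results, key=lambda job: job["validation_loss"])


ranked = rank_by_validation_loss(jobs)
best = ranked[0]
print(f"Best job: {best['name']} (learning rate {best['learning_rate']})")
```

Validation loss is used for the ranking rather than training loss because it is measured on data the model did not train on.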

For further assistance, contact support@tromero.ai; we would be happy to help.
