Data Driven Modeling

In TwinFabrica, data-driven modeling improves traditional physics-based models by learning from real-world data. These machine learning models are physically constrained, meaning they respect the laws of physics while trying to "fill in the gaps" left by missing information, such as unknown parameters, unmodeled physical effects, or uncertain boundary conditions.

Training Data Preparation

To train a model, we use experimental data organized into a set of experiments. Each experiment includes time series of input and output measurements.

The model we train is called a discrepancy model. Its goal is to learn the difference between what happens in reality (measured data) and what the physics-based model (ROM) predicts.

Example

Suppose we want to model rotor temperatures in an electric motor. During testing, we collect the following signals for 300 seconds:

  • Torque (Q)
  • Speed (S)
  • Current (I)

If we sample every second, we end up with 300 data points for each signal.

Before training, TwinFabrica splits these long time series into shorter fixed-length chunks, called trajectories, which are used as training samples (see Figure 1).

Figure 1: Sequence cuts for training purpose.
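The chunking step can be sketched in a few lines. This is an illustrative sketch, not TwinFabrica's actual API; the function name and the chunk length of 50 are assumptions.

```python
# Sketch: cut a 300-sample signal into fixed-length trajectories before
# training. Names and the chunk length are illustrative assumptions.

def cut_into_trajectories(signal, chunk_len):
    """Split one time series into non-overlapping fixed-length chunks,
    dropping any incomplete remainder at the end."""
    n_chunks = len(signal) // chunk_len
    return [signal[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]

# 300 s of torque sampled once per second -> 300 data points
torque = [float(t) for t in range(300)]

trajectories = cut_into_trajectories(torque, chunk_len=50)
print(len(trajectories))     # 6 trajectories
print(len(trajectories[0]))  # 50 samples each
```

Each resulting trajectory becomes one training sample, as shown in Figure 1.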

Why does this matter?

If the sampling time is too short (for example, one point every millisecond), the sequences become very long. This makes training slower and harder because the model has too much information per sample.

If the sampling time is too large (for example, one point every 10 seconds), important details might be lost. So, the sampling time needs to match the real system’s dynamics.
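The trade-off above is simple arithmetic: for a fixed test duration, the sampling time directly sets the sequence length. A quick sketch, using the 300 s test from the example:

```python
# How the sampling time drives the sequence length for a 300 s experiment
# (pure arithmetic; the durations mirror the examples above).

def samples_per_experiment(duration_s, sampling_time_s):
    return round(duration_s / sampling_time_s)

print(samples_per_experiment(300, 0.001))  # 300000 points: long, slow to train
print(samples_per_experiment(300, 1.0))    # 300 points: matches the example
print(samples_per_experiment(300, 10.0))   # 30 points: may miss fast dynamics
```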

Once the data is prepared, the discrepancy model learns how to map the selected inputs to the difference between real-world measurements and ROM predictions (see Figure 2).

Figure 2: The discrepancy model.
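The training target of the discrepancy model is just the pointwise residual between measurement and ROM prediction. A minimal sketch, with illustrative values (not TwinFabrica's API):

```python
# Sketch of the discrepancy target: the difference between measured data
# and the ROM prediction. Names and values are illustrative assumptions.

def discrepancy(measured, rom_predicted):
    """Pointwise residual the discrepancy model is trained to reproduce."""
    return [m - p for m, p in zip(measured, rom_predicted)]

measured_temp = [25.0, 30.5, 36.0]  # example rotor temperature measurements
rom_temp      = [25.0, 29.0, 33.0]  # example physics-based ROM estimates

target = discrepancy(measured_temp, rom_temp)
print(target)  # [0.0, 1.5, 3.0]
```

At prediction time, adding the learned discrepancy back onto the ROM output yields the corrected estimate.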

Choosing the Input Data

One of the most important phases when designing a discrepancy model in TwinFabrica is selecting the input data. Keep the following rules in mind:

  1. Use your domain knowledge. Pick inputs that clearly affect the output. For example, speed and torque strongly influence rotor temperature.
  2. Keep inputs minimal. Only include what adds value. If two inputs contain the same information (such as one being a scaled version of another), keep just one.
  3. Prefer real sensor data. Sensor inputs are meaningful, especially if you want to use the model for real-time adjustments.
  4. Include reference signals. ROM predictions of the same variable the model is trying to correct are helpful, since they already contain physically meaningful behavior. If all ROM estimates are similar, including just one is enough.
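Rule 2 can be checked numerically: if one candidate input is just a scaled version of another, their correlation is close to 1 and one of them can be dropped. A pure-Python sketch (names are illustrative):

```python
import math

def correlation(x, y):
    """Pearson correlation between two equal-length signals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

speed     = [100.0, 150.0, 200.0, 250.0]  # example speed signal
speed_rpm = [s * 9.55 for s in speed]     # a scaled copy of the same signal

print(correlation(speed, speed_rpm))  # ~1.0 -> keep only one of the two
```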

Choosing the Hidden Size

The hidden size is the number of units in the model's internal (hidden) layers. This controls how complex the model can be.

Use this as a starting point:

  • If the number of inputs (N) is greater than the number of outputs (O), choose a hidden size (H) such that N ≥ H ≥ O.
  • If there are more outputs than inputs, choose a hidden size close to the number of inputs and outputs.
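The starting-point rule above can be sketched as a small helper. Taking the midpoint of the N ≥ H ≥ O range is an assumption for illustration, not a TwinFabrica default:

```python
# Sketch of the hidden-size starting point. The midpoint choice is an
# illustrative assumption, not a tool default.

def suggest_hidden_size(n_inputs, n_outputs):
    if n_inputs > n_outputs:
        # Pick H in the range N >= H >= O: here, the midpoint
        return (n_inputs + n_outputs) // 2
    # More outputs than inputs: stay close to the input/output counts
    return max(n_inputs, n_outputs)

print(suggest_hidden_size(6, 2))  # 4 (between O=2 and N=6)
print(suggest_hidden_size(2, 5))  # 5 (close to the larger count)
```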

Helpful hints:

  • If validation loss is much higher than training loss, the model might be overfitting. Try reducing the hidden size or removing some inputs.
  • If the model struggles to learn, especially with a wide variety of operating conditions, try increasing the hidden size.
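The first hint can be turned into a tiny monitoring check. The 2x ratio used here is an assumed threshold, not a TwinFabrica setting:

```python
# Heuristic sketch: flag likely overfitting when the validation loss is
# far above the training loss. The 2x ratio is an assumption.

def likely_overfitting(train_loss, val_loss, ratio=2.0):
    """True when validation loss exceeds `ratio` times the training loss."""
    return val_loss > ratio * train_loss

print(likely_overfitting(0.010, 0.050))  # True  -> shrink hidden size or inputs
print(likely_overfitting(0.010, 0.012))  # False -> capacity looks reasonable
```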

Choosing the Batch Size

The batch size sets how many training samples the model uses in each training step. It depends on your dataset size and how much memory your machine has.

  • A small batch size uses less memory but needs more steps to complete training.
  • A large batch size processes more samples per step, which can speed up training, but may use too much memory.

Pick a batch size that fits your computer’s capacity while keeping training efficient.
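The memory/speed trade-off is easy to quantify: for a fixed dataset, the batch size determines how many steps are needed to see every sample once. A quick sketch with assumed numbers:

```python
import math

# Steps needed to see the whole dataset once for a given batch size
# (simple arithmetic; the dataset size is an illustrative assumption).

def steps_per_epoch(n_samples, batch_size):
    return math.ceil(n_samples / batch_size)

print(steps_per_epoch(3000, 32))   # 94 steps, low memory per step
print(steps_per_epoch(3000, 512))  # 6 steps, much more memory per step
```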