What actually are hyperparameters in machine learning and deep learning?

If you’ve ever taken your car to a mechanic, you might have noticed the various settings and adjustments that are needed to keep the car running smoothly. For instance, a mechanic might tune the air-fuel ratio, ignition timing, and tire pressure to optimize the car’s performance. Each of these adjustments affects how efficiently the car operates and how well it performs on the road. Similarly, in machine learning there are tunable settings, called hyperparameters, that are set beforehand to control how the learning process works.
So, basically, hyperparameters are the tunable settings that shape the learning process and determine how well a model functions. The more optimal the hyperparameters, the better the model performs.
Finding the optimal hyperparameters typically involves an iterative process of experimentation. Approaches such as grid search, cross-validation, and Bayesian optimization are commonly used, but that is a topic for another article.
Any value or configuration that is set before the training process begins can be considered a hyperparameter; the “hyper” prefix means “top-level”, as opposed to the ordinary parameters (weights) that the model learns on its own.
Here are some general hyperparameters used in learning algorithms:
1. Learning Rate
Controls the step size at each iteration while moving toward the minimum of the loss function. It determines how quickly old information is overridden with new information and influences how well a model learns.
High Learning Rate: Can lead to faster training, but the updates may overshoot the minimum, causing the loss to oscillate or the training to become unstable.
Low Learning Rate: Can result in slower training, but may lead to better convergence and a more accurate final model (see the sketch below).
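To make this concrete, here is a minimal sketch in plain Python of gradient descent on a toy loss, loss(w) = (w - 3)², whose minimum sits at w = 3. The function and the loss are purely illustrative, not taken from any library.

```python
# Toy gradient descent: the learning rate controls how big each step is.
def gradient_descent(learning_rate, steps=20, w=0.0):
    for _ in range(steps):
        grad = 2 * (w - 3)              # derivative of (w - 3)**2
        w = w - learning_rate * grad    # step toward the minimum at w = 3
    return w

print(gradient_descent(0.01))  # small steps: after 20 steps, w has only crept partway toward 3
print(gradient_descent(0.5))   # well-chosen step: converges to 3 almost immediately
print(gradient_descent(1.1))   # too large: each step overshoots and w drifts further from 3
```

With a step size that is too large, the distance to the minimum grows at every update, so the weight diverges instead of settling.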
2. Epochs
The number of times the model completes a pass through the entire dataset.
High Epoch Value: Can lead to overfitting and increased computational resource usage.
Low Epoch Value: Might result in underfitting, where the model hasn’t learned enough from the data (the toy example below illustrates this).
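As a rough illustration, here is a self-contained toy example in which each epoch is one full pass over a tiny dataset generated from y = 2x; with enough epochs, the single weight w should approach 2. The dataset and the train_for helper are made up for this sketch.

```python
# Tiny dataset following y = 2x; the "model" is a single weight w.
data = [(x, 2 * x) for x in range(1, 6)]

def train_for(num_epochs, lr=0.01):
    w = 0.0
    for _ in range(num_epochs):          # one iteration of this loop = one epoch
        for x, y in data:                # a full pass over every example
            grad = 2 * (w * x - y) * x   # gradient of the squared error
            w -= lr * grad
    return w

print(train_for(1))    # too few epochs: w is still well below the true value 2 (underfitting)
print(train_for(50))   # more epochs: w converges very close to 2
```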
3. Batch Size
The number of training examples used to calculate each update to the model’s weights. In effect, it splits the dataset into chunks that are processed one at a time within an epoch.
Imagine there are 1,000 poems. If you set a batch size of 100 poems, each update to the model’s weights will be based on 100 poems, and one epoch (a complete pass through all 1,000 poems) will take 10 update steps.
Smaller batch sizes: Often generalize better, since the noisier gradient estimates act as a mild regularizer, but training requires more update steps and can be slower.
Larger batch sizes: Can lead to faster training times, but each update uses a smoother gradient and the model may generalize less well (see the sketch below).
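Here is a minimal sketch of the arithmetic above and of how batches are carved out within one epoch; the numbers mirror the poem example and are purely illustrative.

```python
dataset_size = 1000   # e.g. 1,000 poems
batch_size = 100      # hyperparameter

steps_per_epoch = dataset_size // batch_size
print(steps_per_epoch)   # 10 weight updates per epoch

examples = list(range(dataset_size))
for step in range(steps_per_epoch):
    batch = examples[step * batch_size : (step + 1) * batch_size]
    print(f"step {step + 1}: examples {batch[0]}..{batch[-1]}")  # the weights would be updated here
```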
4. Optimizer
Optimizers are the algorithms used to update the model’s weights based on the calculated loss (error) during training. Common optimizers include SGD (Stochastic Gradient Descent) and Adam.
Different optimizers have different convergence properties and can be better suited for specific tasks or datasets.
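As an illustration, here is a hedged PyTorch sketch (assuming PyTorch is installed) showing that switching optimizers is a one-line change; the tiny linear model and random data are placeholders, not a recommended setup.

```python
import torch

model = torch.nn.Linear(10, 1)                  # toy model
x, y = torch.randn(32, 10), torch.randn(32, 1)  # toy data

# Swapping the optimizer (and its own hyperparameters, such as lr) is one line:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(5):
    loss = torch.nn.functional.mse_loss(model(x), y)  # compute the error
    optimizer.zero_grad()                             # clear old gradients
    loss.backward()                                   # backpropagate
    optimizer.step()                                  # update the weights
    print(f"step {step}, loss {loss.item():.4f}")
```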
5. Activation Functions
Activation functions are applied at the output of a neuron to introduce non-linearity into the model. They help the model learn intricate patterns and non-linear relationships; without activation functions, models would be limited to linearly separable problems.
The choice of activation function can depend on the type of problem, the dataset, and the model architecture, and different activation functions serve different purposes.
For example, if the model is used for binary classification, a sigmoid activation function is a common choice for the output layer; for multi-class classification, softmax is typically used.
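For reference, here is a small NumPy sketch of the sigmoid and softmax functions mentioned above, plus ReLU, another widely used activation; the definitions are standard and the input values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes values into (0, 1): binary classification output

def relu(z):
    return np.maximum(0.0, z)         # common non-linearity for hidden layers

def softmax(z):
    e = np.exp(z - np.max(z))         # subtract the max for numerical stability
    return e / e.sum()                # probabilities over classes: multi-class output

logits = np.array([2.0, -1.0, 0.5])
print(sigmoid(logits))
print(relu(logits))
print(softmax(logits), softmax(logits).sum())  # the softmax outputs sum to 1.0
```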
Understanding and optimizing hyperparameters is crucial: it largely determines a model’s overall performance and robustness, and can make the difference between a good model and a great one.