Ensembling & Correlation
Ensembling with Correlation Guidance (ECG)
Usually, at the very beginning of a project, part of the training data is set aside as validation data. Validation data has important uses: hyperparameters are tuned with its help, and it can even be used to determine Ensembling coefficients. But in many cases (for example, Kaggle challenges), when we want to ensemble the results of two or more independent calculations, only the results themselves and the score (public score) of each are available. Determining Ensembling coefficients then has to rely on previous experience and trial and error. Even when we try to ensemble based on the public scores, i.e. give more weight to the better-scoring result, we fail very quickly, and then we are left groping in the dark, trying different coefficients in the hope of getting lucky with a better score.
Just as finding the right coefficient for Ensembling is not methodical, finding two results whose combination can improve the score is not methodical either. Ensembling the best-scoring results, for example, is not always successful; in most cases the score improves only for certain combinations of results. When there is more than one answer column, the darkness deepens, because after finding the right coefficient for Ensembling the first columns, there is no guarantee that the same coefficient is optimal for Ensembling the remaining columns.
In other words, in many cases we are looking for one optimal coefficient for the linear combination of two lists (a pair of lists). But to combine two pairs of lists, we should logically look for two coefficients. For thousands of pairs of lists, then, we should no longer expect a single coefficient to be the optimal one, although in practice it may be possible to get by with only two or three coefficients. Recently, the "Open Problems – Single-Cell Perturbations" challenge was held on Kaggle, which involves predicting more than eighteen thousand columns. Ensembling in such cases becomes very complicated: if one coefficient is chosen for Ensembling the first column, we cannot be sure the same coefficient is optimal for the more than eighteen thousand other columns. For example, a notebook may have predicted ten thousand columns very well but predicted the other columns (for whatever reason) poorly. When we want to use the results of this notebook for Ensembling, it might be better to use at least two different coefficients.
In this article, we want to share with you our experience of using correlation for Ensembling. Knowing the correlation value of the columns, when you want to ensemble them, is like the light of a candle in the dark: it can help you, to some extent, reach your destination. This article was written in January 2024 by Somayyeh Gholami and Mehran Kazeminia.
How does knowing the "Correlation Value" help Ensembling?
Knowing the correlation value is very important. When the correlation between column A and column B is negative, Ensembling should not be performed on these two columns. That is, if the public score of notebook_column_A is better than the public score of notebook_column_B, the first choice is column A alone and the second choice is column B alone; no linear combination of these two columns can be good. Perhaps only philosophers would disagree, arguing that even though the correlation of the two columns is negative, it is abstractly possible that column A and column B are both far from the target while some linear combination of them is close to it.
When the correlation of column A and column B is positive, the issue becomes more difficult. In this case, first check how far the correlation value is from zero and how close it is to one. Second, check how much better (or worse) the public score of notebook_column_A is than the public score of notebook_column_B. Knowing these two things, you can make guesses and then test them by trial and error. For example, if the correlation value is very close to one and the public score of notebook_column_A is much better, column A alone is probably a good choice and Ensembling will not help improve the score. Please note that the public score we have covers all columns together, yet within a single notebook some columns can be good for Ensembling and others bad. Knowing the correlation of corresponding columns in two separate notebooks helps to find the good and the bad columns (for Ensembling).
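As a concrete illustration, the per-column correlations are easy to compute with pandas. Here is a minimal sketch; the file names are hypothetical, and the two submissions are assumed to share the same row order and column names.

import pandas as pd

# Hypothetical submission files from two independent notebooks.
sub_a = pd.read_csv("submission_a.csv", index_col="id")
sub_b = pd.read_csv("submission_b.csv", index_col="id")

# Pearson correlation of every pair of corresponding columns.
col_corr = sub_a.corrwith(sub_b)

print(col_corr.describe())                 # spread of the correlations
print("negative:", (col_corr < 0).mean())  # fraction of columns to exclude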
You should also know that if the public score improves after Ensembling but the private score worsens, so-called overfitting has occurred: the whole model has become unstable due to bad Ensembling. To avoid this risk, pay attention to three important points. First, do not use the results of notebooks that are themselves the product of multiple Ensemblings, because they increase this risk. Second, if the correlation of all columns is very close to one, the effect of Ensembling on the public score is very small, but the private score may still worsen at random; since you cannot observe a drop in the private score, it is best to avoid excessive and fruitless Ensembling. Third, because Kaggle splits the test samples (into public and private scores) and the ordering of the test samples is unknown, choosing different coefficients for different rows can be very dangerous and you should expect overfitting (unless, in very special cases, the sorting and categorization of the test samples is completely clear from the beginning). Specifying different coefficients for different columns, however, carries no risk of this kind: Ensembling is then applied to all public and private test samples without any discrimination, so if the public score improves, the private score improves along with it, and vice versa.
Functions: Tuning & Ensembling
We compiled all the code and data related to this article into a Kaggle notebook, which is linked below. You can try all the examples and topics raised in this article in the Kaggle environment and do further research. The code is also available in our GitHub repository at the following address.
At the beginning of the notebook, we import some libraries and devote the next cells to "Auxiliary Functions", seven simple functions in total. After those come the three "Main Functions", each of which we briefly explain here.
Function 1: generate_corr_coeff(dfx, dfy, corr_limit, coeff)
- This function takes the prediction results of two notebooks; each result can have thousands of columns. The first result (dfx) is called "Main" and the second result (dfy) is called "Support".
- When the correlation of two corresponding columns in "Main" and "Support" is negative, this function ignores the column from "Main". We can also set "corr_limit" to a value less than one (and greater than zero): when the correlation of two corresponding columns exceeds "corr_limit", the function ignores the column from "Support".
- This function performs Ensembling with the coefficient we specify as "coeff", but only for columns whose correlation lies between zero and "corr_limit".
- Obviously, if we swap "Main" and "Support", the result of Ensembling changes, and the score may improve with a new setting of "corr_limit" and "coeff". A minimal sketch of this function follows.
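The sketch below is our reading of the description above, not the notebook's exact code; in particular, the convention that "coeff" weights the Main result is an assumption.

def generate_corr_coeff(dfx, dfy, corr_limit, coeff):
    # dfx = "Main", dfy = "Support"; both share the same columns.
    corr = dfx.corrwith(dfy)
    out = dfx.copy()
    for col in dfx.columns:
        if corr[col] < 0:
            out[col] = dfy[col]  # negative correlation: ignore the Main column
        elif corr[col] > corr_limit:
            pass                 # almost identical columns: ignore the Support column
        else:
            # Blend; weighting Main by coeff is an assumption.
            out[col] = coeff * dfx[col] + (1 - coeff) * dfy[col]
    return out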
Function 2: generate_corr(dfx, dfy, corr_limit)
- This function is very similar to the previous one, except that you do not need to specify "coeff". For every pair of columns whose correlation lies between zero and "corr_limit", it calculates its own Ensembling coefficient: it takes "coeff" equal to the correlation value of the two columns.
- Certainly, the results of this function are not as good as those of the first function, but trying it can give a good picture for choosing "coeff" and "corr_limit". A sketch follows.
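A minimal sketch, under the same assumptions as before, with each column's correlation reused as its blending weight:

def generate_corr(dfx, dfy, corr_limit):
    # Same structure as generate_corr_coeff, but coeff = per-column correlation.
    corr = dfx.corrwith(dfy)
    out = dfx.copy()
    for col in dfx.columns:
        c = corr[col]
        if c < 0:
            out[col] = dfy[col]  # negative correlation: ignore the Main column
        elif c <= corr_limit:
            out[col] = c * dfx[col] + (1 - c) * dfy[col]
        # c > corr_limit: keep the Main column unchanged
    return out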
Function 3: tuning_corr_limit(dfx, dfy)
- Before setting the parameters of the two functions above, you can use this function to see the correlation status of "Main" and "Support" in a table and a graph.
- If the number of prediction columns runs into the thousands, this function has to perform a lot of calculations and takes time to execute. A sketch follows.
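A minimal sketch of such a diagnostic; the exact bands reported by the notebook's function may differ.

import matplotlib.pyplot as plt
import pandas as pd

def tuning_corr_limit(dfx, dfy):
    corr = dfx.corrwith(dfy)
    # Table: share of columns in a few illustrative correlation bands.
    bands = pd.cut(corr, bins=[-1.0, 0.0, 0.85, 0.90, 0.95, 1.0])
    print(bands.value_counts(normalize=True).sort_index())
    # Graph: the full distribution of per-column correlations.
    corr.hist(bins=100)
    plt.xlabel("correlation of corresponding columns")
    plt.ylabel("number of columns")
    plt.show()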
Examples for Ensembling with Correlation Guidance
In the rest of this article, we work through three Ensembling examples using the results of notebooks from the "Open Problems – Single-Cell Perturbations" challenge. The deadline for this challenge on Kaggle recently expired, and, as mentioned before, each prediction contains more than eighteen thousand columns. The large number of columns will illustrate the topics better. We hope that by reading these examples you will find answers to your doubts and questions.
Example 1: Chain Ensembling
For the first example, we ensemble the results of three notebooks in a row, creating a so-called "Ensembling Chain".
- The first is our own notebook, in which we performed "Feature Augmentation". The results below are from before any Ensembling.
- The next notebook makes effective use of a "Neural Network".
- The next notebook makes effective use of "NLP" (SMILES embeddings).
Before starting any Ensembling, we can tune the parameters by comparing the correlation values of all the columns of two results and inspecting the details in a table and graph. (This function takes a long time.)
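In the code below, sub_604, sub_606 and sub_607 are the three notebooks' submissions loaded as DataFrames. A hypothetical loading step (the file names are ours; the numbers presumably reflect each notebook's public score):

import pandas as pd

sub_604 = pd.read_csv("sub_604.csv", index_col="id")
sub_606 = pd.read_csv("sub_606.csv", index_col="id")
sub_607 = pd.read_csv("sub_607.csv", index_col="id")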
# Part: a
tuning_corr_limit(sub_604, sub_607)
gen_1a = generate_corr_coeff(sub_604, sub_607, 0.95, 0.50)
# Public Score: 0.595
# Private Score: 0.808
# Part: b
tuning_corr_limit(gen_1a, sub_606)
gen_1b = generate_corr_coeff(gen_1a, sub_606, 0.85, 0.60)
# Public Score: 0.589
# Private Score: 0.796
Explanations for Example 1
- Only 3.7% of the sub_604 and sub_607 columns have negative correlation, perhaps because the two scores are close to each other. Even so, if these columns are ignored during Ensembling, the Ensembling score improves. Note that the values of this 3.7% of columns must be taken from only one of the results; in the function we wrote, they are taken from the second argument, i.e. the sub_607 result. Whenever you see negative correlations, you should definitely try both orderings to see which result should come first and which second. If the number of columns with negative correlation is high, removing them from Ensembling is very important and has a great impact on the final score. As mentioned earlier, two columns with negative correlation have no chance in Ensembling.
- In calculating gen_1a, the 5% of columns whose correlation is very close to one were excluded from blending, which is why corr_limit is set to 0.95; the values of those columns are taken directly from the sub_604 result. In calculating gen_1b, 15% of the columns with correlation close to one were excluded, which is why corr_limit is set to 0.85; the values of those columns are taken directly from the gen_1a result.
- The coeff value for calculating gen_1a is 0.50 and the coeff value for calculating gen_1b is 0.60. For comparison, the notebook also performs classical Ensembling separately with the same coefficients, under the names ens_1a and ens_1b; a sketch of that baseline follows.
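The classical baseline amounts to a single fixed-weight blend over all columns, something like the following (our sketch; the weighting convention is an assumption):

# Classical Ensembling: one fixed coefficient for every column, no correlation guidance.
ens_1a = 0.50 * sub_604 + (1 - 0.50) * sub_607
ens_1b = 0.60 * ens_1a + (1 - 0.60) * sub_606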
Example 2: Ensembling for challenge winners
The "Open Problems – Single-Cell Perturbations" challenge has ended, and its winners were determined by the best private score. For this reason, in the second example we use the results of the competition winners and try to optimize the private score by Ensembling.
- The results of the first and second teams have not been published, or at least we could not find them, so we examine the results of the third and fourth teams.
- We noticed that after the end of the contest, the "4th Place" team published a better version, with a better private score than their Leaderboard score, so we use their final version.
Before doing anything with these two results, we can compare the correlation values of all their columns. (This function takes a long time.)
tuning_corr_limit(sub_4th, sub_3rd)
gen_2 = generate_corr_coeff(sub_4th, sub_3rd, 0.90, 0.50)
# Private Score: 0.707
Explanations for Example 2
- In this example, the correlation of corresponding columns is never negative. In addition, with "corr_limit" set to 0.90, Ensembling is performed for roughly forty percent of the columns. Because the scores of the two results are close to each other, we set "coeff" to 0.50 so that, if they can, they correct each other.
- The private score of this calculation is 0.707, better than the scores of both the sub_3rd and sub_4th results. It is also much better than the challenge champion's private score of 0.729.
- If the Ensembling is done with the classical method and a coefficient of 0.50, the private score becomes 0.713, which is even worse than the sub_4th result. At first glance, then, simple Ensembling suggests these two results cannot help each other; but with correlation guidance, the score improves.
Example 3: "Impure gold" results
For the third example, let us look at a specific notebook, to make a few things clearer. Its public score is 0.720 and its private score is 0.960. These scores are not good at all; they are even worse than the "sample_submission" scores. If you submit "sample_submission" (i.e. all answers zero), the public score is 0.666 and the private score is 0.902.
- We realized at the beginning of the challenge that this notebook's solution was good and, regardless of its bad score, we ensembled it with our own notebook and the score improved. We call this kind of result "impure gold". We have examined this type of result in another challenge as well.
- We now try ensembling our own notebook's result, sub_606, with this particular notebook, so that you can see the improvement in the score.
Before doing anything, we can compare the correlation values of all their columns. (This function takes a long time.)
tuning_corr_limit(sub_720, sub_606)
gen_3 = generate_corr_coeff(sub_720, sub_606, 1.00, 0.25)
# Public Score: 0.601
# Private Score: 0.792
Explanations for Example 3
- The correlation of 8.4 percent of the columns is negative; these columns cannot be useful for Ensembling. We ignored them and blended the rest of the sub_720 columns, with a coefficient of 0.25, into our notebook's columns. The public score of our notebook changed from 0.606 to 0.601 and the private score from 0.809 to 0.792. That is a significant improvement, even though the sub_720 scores were not good at all.
- Please note that in all the above examples you still need trial and error to find optimal values for "coeff" and "corr_limit", but knowing the correlation of the columns clarifies the decision scene somewhat. It means we no longer have to work in absolute darkness.