Use Correlation Algorithms to Avoid AutoML Picking Inappropriate Features
Please run a correlation algorithm (or improve the existing one) so that features that are strongly correlated with the label are not suggested as inputs. Otherwise, "business analysts" will create models that are not useful, which would degrade the value of the excellent work done on Automated ML in Power BI.
AutoML is advertised as being designed for "business analysts to build machine learning models". In practice, however, users without a data science background can run into real problems.
Here is the case:
Use the Power BI sample projects (Supplier Quality Analysis Sample)
Extend the Metrics query by creating a HasDefect column, which will serve as the label for future predictions. In my case: = Table.AddColumn(#"Added Custom", "HasDefect", each if [Defect Type ID]<=1 then 0 else 1)
Create a dataflow and an entity based on the Metrics query
Proceed to creating an ML model and follow the wizard
For historical outcome field select "HasDefect"
When you reach the "Customize inputs" step, you will get a list of proposed fields, some of which are highly correlated with the label.
The problem is that HasDefect is very closely correlated with the DefectType column, so a user without a data science background will "successfully" train a model with 100% accuracy.
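To illustrate the leak, here is a minimal Python sketch on made-up data. The rows, the "Vendor ID" and "Defect Qty" columns, and the helper name determines_label are hypothetical and are not taken from the sample or from AutoML. It flags any column whose value fully determines the label, which is the situation where a model reaches 100% accuracy without learning anything useful.

```python
import pandas as pd

# Made-up rows that mimic the case above: HasDefect is derived from Defect Type ID.
df = pd.DataFrame({
    "Defect Type ID": [1, 2, 3, 1, 2, 3, 1, 2],
    "Vendor ID":      [10, 10, 11, 11, 12, 12, 13, 13],
    "Defect Qty":     [5, 7, 5, 9, 7, 3, 9, 5],
})
# Same rule as the M expression: 0 when Defect Type ID <= 1, otherwise 1.
df["HasDefect"] = (df["Defect Type ID"] > 1).astype(int)


def determines_label(frame, feature, label):
    """Return True if every value of `feature` maps to exactly one label value."""
    return bool((frame.groupby(feature)[label].nunique() <= 1).all())


# Columns that would let a model simply look up the label.
leaky = [c for c in df.columns
         if c != "HasDefect" and determines_label(df, c, "HasDefect")]
print(leaky)
```

On this toy data only "Defect Type ID" is flagged. A check like this would also flag row identifiers and other unique-valued columns, so it is a starting point for a warning rather than a complete rule.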
A simple Python visual using Pearson correlation shows an evident correlation between HasDefect and DefectType.
Below is the code that generates it. As you can see, I also tried to include the "Defect" string column by encoding it as integers so that it could take part in the correlation: the defect text is a 1-to-1 match with the defect ID, which determines the defect type, which in turn is related to the "HasDefect" label. However, Pearson did not pick up this correlation, because the encoded integers come out in an arbitrary (shuffled) order.
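import matplotlib.pyplot as pyplot
from sklearn import preprocessing

# Encode the "Defect" text column as integers so it can take part in the correlation
le = preprocessing.LabelEncoder()
dataset['Defect'] = le.fit_transform(dataset['Defect'])

# Pearson correlation matrix of the numeric columns, drawn as a heatmap (one possible rendering)
corr = dataset.select_dtypes('number').corr('pearson')
pyplot.matshow(corr)
pyplot.xticks(range(len(corr.columns)), corr.columns, rotation=90)
pyplot.yticks(range(len(corr.columns)), corr.columns)
pyplot.show()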
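The reason Pearson misses the encoded "Defect" column can be reproduced on a few made-up rows. This is only a sketch: the defect names are invented, pandas category codes stand in for LabelEncoder (both assign integers in sorted order), and Cramér's V is my own suggestion for an order-independent measure, not something AutoML is known to use. It assumes both columns have at least two distinct values.

```python
import numpy as np
import pandas as pd

# Made-up data: the defect text fully determines the label.
df = pd.DataFrame({"Defect": ["Bent part", "No defect", "Scratched"] * 4})
df["HasDefect"] = (df["Defect"] != "No defect").astype(int)

# Stand-in for LabelEncoder: alphabetical order gives codes 0, 1, 2,
# which puts the only "no defect" category in the middle.
df["DefectCode"] = df["Defect"].astype("category").cat.codes

# Pearson on the integer codes (Series.corr defaults to Pearson).
pearson = df["DefectCode"].corr(df["HasDefect"])


def cramers_v(x, y):
    """Cramér's V between two categorical series, from the chi-squared statistic."""
    table = pd.crosstab(x, y).to_numpy(dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


v = cramers_v(df["Defect"], df["HasDefect"])
print(round(pearson, 3), round(v, 3))
```

Here Pearson comes out at essentially zero even though the column determines the label, while Cramér's V is 1, because it looks at the category-to-label table and ignores the order of the integer codes.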
More details about this idea (which could also be considered an issue) can be found here: