In this blog post, we will review a CTR (click-through rate) use case by exploring a publicly available dataset of Facebook ads (more details below). The purpose of the use case is to illustrate why explainable AI methods are not just "nice to have" charts in some reports.
What is Explainable AI?
Think of explainable AI (XAI) as AI with a "show your work" feature. It's like asking a friend why they made a certain decision and getting a clear explanation that you can understand.
With XAI, AI systems are designed to explain their decisions in a way that makes sense to humans. Instead of just getting a prediction or result, we also get insights into why the AI made that choice. It's like having a conversation with the AI to understand its reasoning, which is pretty cool and helpful, especially in high-risk domains such as healthcare or credit risk.
In recent years, a plethora of methods has been proposed in the scientific literature that try to explain models and make their decision-making process transparent. Transparency in such processes is mandatory, and under the EU AI Act it is now required by law:
Article 13 of the EU AI Act sets out the requirement of transparency and provision of information for high-risk AI systems, according to which “high-risk AI systems shall be designed and developed in such a way to ensure that their operation is sufficiently transparent to enable providers and users to reasonably understand the system’s functioning.”
More often than not, the constant demand for better results leads to very complex AI models. On the one hand, having "better" results (by better, I mean better according to a specific KPI) may lead to more sales, better customer satisfaction, and so on. On the other hand, this increasing complexity creates so-called "black-box" models, which are so complex that not even humans can really understand how they come up with their predictions. For example, deep neural networks containing millions to billions of parameters can be considered black-box models.
We can first separate XAI methods into local and global ones:
Local XAI methods are like zooming in on a specific decision or prediction made by an AI system. Imagine putting a magnifying glass on that one particular instance and asking the AI, "Hey, why did you do this?" These methods focus on explaining individual outputs or decisions, giving you insights into why the AI made that specific choice. It's helpful when you want to understand the reasoning behind a particular prediction or decision without diving into the entire AI model's complexity. Think of it as getting a detailed explanation for a single action rather than the whole process.
On the other hand, global XAI methods are the complete opposite. They step back and look at the big picture of how the AI system works overall. Instead of focusing on just one decision, global methods aim to explain the AI model's behavior or patterns as a whole. These methods are useful for understanding the overall logic and trends in the AI's decision-making process, helping you grasp its general behavior and performance.
We can also separate XAI methods into model-specific and model-agnostic ones. Model-specific methods are designed to work with a particular type of AI model, such as neural networks or decision trees, providing detailed insights into its unique structure and parameters. In contrast, model-agnostic XAI methods offer more universal explanations that can be applied to any AI model, focusing on general patterns and insights across different architectures. This makes them versatile tools for understanding the decision-making process of various types of AI models. Two popular model-agnostic methods, which we will use later in this post, are:
1. SHAP (SHapley Additive exPlanations) falls under the category of model-agnostic XAI methods. It is designed to provide explanations for the output of any machine learning model by computing Shapley values, which are a concept from cooperative game theory. SHAP can be applied to a wide range of models, including but not limited to linear models, tree-based models (such as decision trees and random forests), support vector machines (SVMs), and deep learning models (like neural networks).
2. LIME (Local Interpretable Model-agnostic Explanations) is also categorized as a model-agnostic XAI method. LIME works by approximating the behavior of a black-box model locally around a specific instance of interest by training an interpretable model, such as a linear regression or decision tree, on perturbed samples generated around that instance. This approach allows LIME to provide explanations for individual predictions of any machine learning model, regardless of its underlying architecture or complexity, making it a versatile tool for local interpretability.
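To make these two approaches concrete, here is a minimal, self-contained sketch of both libraries on a toy tabular model. The dataset and model are placeholders, not the ad data discussed below:

```python
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data and a tree-based "black box" to explain
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]
model = RandomForestClassifier(random_state=0).fit(X, y)

# SHAP: Shapley-value attributions, usable both locally and globally
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # per-feature attributions for every sample

# LIME: fit an interpretable surrogate around one instance and read off its weights
lime_explainer = LimeTabularExplainer(X, feature_names=feature_names, mode="classification")
exp = lime_explainer.explain_instance(X[0], model.predict_proba, num_features=5)
print(exp.as_list())  # (feature condition, local weight) pairs for this one prediction
```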
Use case: Increase CTR using actionable insights from XAI methods
In this use case, we don't use XAI just to understand the model's behavior. We aim to use the explanations to make the advertisements better and increase their CTR. Let's first introduce the dataset. The dataset is a collection of 3.5K social media ads (Facebook) released by the US House of Representatives in May 2018. Table 1 describes the features and Table 2 describes the meta-features.
Table 1
| Name | Type | Example |
| --- | --- | --- |
| AdText | Unstructured text | It is an American history. African-American citizens sat behind signs like these on city buses. |
| Clicks | Number | 32 |
| Impressions | Number | 321 |
| Age | Text | 18 - 65+, 20 - 45 |
| CreationDate | Date | 06/16/15 08:20:31 AM |
| EndDate | Date | 06/17/15 08:20:30 AM |
| Behaviors | Text | New smartphone and tablet users, Multicultural Affinity |
| AdSpend | Number | 599 |
| ExcludedConnections | Text | Exclude people who like Memopolis |
The dataset contains other features as well, but they were not used in this analysis. Meta-features were generated based on the features in Table 1 and are described in Table 2.
Table 2
| Name | Based On | Type | Example | Description |
| --- | --- | --- | --- | --- |
| Days | CreationDate & EndDate | Number | 4 | Number of days the ad was visible |
| total_word_count | AdText | Number | 12 | Number of total words |
| capital_word_count | AdText | Number | 2 | Number of capitalized words |
| noun_count | AdText | Number | 2 | Number of nouns (via POS tagger) |
| verbs_count | AdText | Number | 1 | Number of verbs (via POS tagger) |
| sent_class: pos/neg/neu | AdText | One-hot encoding | [1 0 0] | Sentiment (via transformer) |
| question_count | AdText | Number | 1 | Number of "?" characters |
| exclamation_count | AdText | Number | 1 | Number of "!" characters |
| behaviours_cnt | Behaviors | Number | 3 | Number of different behaviors |
| exclude_cnt | ExcludedConnections | Number | 1 | Number of different excluded groups |
| min_age/max_age | Age | Number | 15, 55 | Minimum/maximum target age |
For the POS tagger, a spaCy model was employed, while for sentiment annotation a Hugging Face transformer was used. The data also contained variations of the same ad with slightly different text. These ads were merged together after being detected by preprocessing the AdText field, which resulted in a dataset of 2.3K samples.
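As a rough sketch, the meta-feature extraction could look like the code below. The specific spaCy model (`en_core_web_sm`) and the default Hugging Face sentiment pipeline are illustrative choices, since the post does not name the exact models used:

```python
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")          # assumed spaCy model for POS tagging
sentiment = pipeline("sentiment-analysis")  # assumed default HF sentiment model

def extract_meta_features(ad_text: str) -> dict:
    doc = nlp(ad_text)
    return {
        "total_word_count": sum(1 for t in doc if t.is_alpha),
        "capital_word_count": sum(1 for t in doc if t.is_alpha and t.text[0].isupper()),
        "noun_count": sum(1 for t in doc if t.pos_ == "NOUN"),
        "verbs_count": sum(1 for t in doc if t.pos_ == "VERB"),
        "question_count": ad_text.count("?"),
        "exclamation_count": ad_text.count("!"),
        "sent_class": sentiment(ad_text)[0]["label"],  # one-hot encoded downstream
    }
```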
The AdSpend feature was removed after the initial analysis for two reasons: 1) the money spent on an ad dominated both the global and the local impact in the decision making, which of course makes sense, since more money means more visibility, more exposure, and so on; and 2) the actionable suggestions should not affect the customer's budget but rather improve the ad itself.
Objective
Employ a classifier to assess the effectiveness of advertisements by distinguishing between high-quality and low-quality ones. When an advertisement is identified as low-quality, utilize explainable AI techniques to propose actionable recommendations for turning it into a high-quality ad. Quality is defined based on a threshold of 50 clicks: ads receiving fewer than 50 clicks are categorized as low-quality (class label 1), and those exceeding 50 clicks are considered high-quality (class label 0). This categorization results in a dataset comprising 711 low-quality ads and 1622 high-quality ads. The idea is to employ the classifier as a referee that assesses the qualitative aspects of an advertisement and determines the degree of quality, using the predicted probabilities as a measure.
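A minimal sketch of this labeling rule; the file name and column names are assumed for illustration:

```python
import pandas as pd

df = pd.read_csv("facebook_ads.csv")           # hypothetical path to the merged dataset
df["label"] = (df["Clicks"] < 50).astype(int)  # 1 = low quality, 0 = high quality
print(df["label"].value_counts())              # should show ~711 low vs. ~1622 high
```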
As the base model, a CatBoostClassifier was employed and fitted on the whole dataset. After training, the model had good overall performance, scoring around 90% in ROC AUC.
CatBoostClassifier is an ensemble method that uses trees as base learners. Therefore, it is easy to extract which features the "model itself" considers most influential. Let's see the top 5 below.
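A minimal sketch of how this could look, assuming the meta-features `X` and binary labels `y` from the previous steps (the exact training configuration is not detailed in this post):

```python
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score

model = CatBoostClassifier(verbose=0, random_state=0)
model.fit(X, y)

# Around 0.90 ROC AUC was reported for this setup
print(roc_auc_score(y, model.predict_proba(X)[:, 1]))

# Built-in importances of the fitted ensemble, sorted in descending order
print(model.get_feature_importance(prettified=True).head(5))
```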
So, total word count, days, capital word count, noun count, and verb count seem to be the top 5 most important features according to the classifier itself. Now let's see what SHAP figures out by probing the model.
As we can see, 3 out of 5 features intersect: SHAP dropped noun and verb count in favor of min and max age.
Now let's take a single sample which belongs to the poor-quality class and is predicted as poor quality. The example is drawn based on its predicted probability, i.e., a high predicted probability means that the model is very confident about the sample's poor quality. For single predictions and explanations we use LIME. The selected instance was (correctly) classified as poor quality with 85% probability. The LIME scores are shown below.
What this figure practically says is that the total word count, min age, and noun count play a positive role in the instance being labeled as poor, while days and max age push the instance toward the good-quality class. By taking a closer look, we can already see some actionable suggestions.
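Selecting such a high-confidence poor-quality instance and explaining it with LIME could look like the following sketch, reusing `model`, `X`, and `y` from above:

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

# Probability of class 1 (poor quality) for every ad
proba_poor = model.predict_proba(X)[:, 1]

# Most confidently predicted true poor-quality ad (~0.85 in the example above)
idx = int(np.argmax(np.where(y == 1, proba_poor, 0)))

lime_explainer = LimeTabularExplainer(
    X.values, feature_names=list(X.columns), mode="classification"
)
exp = lime_explainer.explain_instance(X.values[idx], model.predict_proba, num_features=5)
print(exp.as_list())
```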
The sample is labeled as poor (label 1), and we can see that the total word count is already too high. Let's start reducing this feature and observe how the model responds w.r.t. the poor class.
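A naive sweep over this single feature might look like the sketch below; feature names follow Table 2, and the step size is arbitrary:

```python
# Shrink total_word_count step by step and watch the poor-class probability
instance = X.iloc[idx].copy()
for n_words in range(int(instance["total_word_count"]), 30, -10):
    instance["total_word_count"] = n_words
    p_poor = model.predict_proba(instance.to_frame().T)[0, 1]
    print(f"{n_words} words -> P(poor) = {p_poor:.2f}")
```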
As expected, by reducing only the number of total words, the model's confidence drops up to the point (60 words) where the class label changes (43% probability of being labeled as poor). So, recommending a reduction in word count to the seller is an easy and inexpensive way to turn the ad from poor to good quality. Although this process is simple, it is also quite naive. Instead of tuning a single feature, let's set the tunable features identified by the XAI methods as inputs to an AutoML optimization solver (Bayesian optimization) and let it decide the optimal combination that minimizes the confidence in the poor class. By doing this, we also control which features are appropriate for the user to change, e.g., the seller may not want to change the geographical region but can change the age group. A sketch of such an optimization loop follows, and the results are shown below.
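Here is a hedged sketch of such a search, using Optuna as one possible Bayesian solver; the post does not name the exact AutoML library, and the tunable features and bounds below are illustrative:

```python
import optuna

# Only features the advertiser is willing to change, with assumed bounds
TUNABLE = {
    "total_word_count": (10, 100),
    "min_age": (13, 40),
    "max_age": (30, 65),
    "Days": (1, 30),
}

def objective(trial: optuna.Trial) -> float:
    candidate = X.iloc[idx].copy()
    for feature, (low, high) in TUNABLE.items():
        candidate[feature] = trial.suggest_int(feature, low, high)
    # Minimize the model's confidence that the ad is poor quality (class 1)
    return float(model.predict_proba(candidate.to_frame().T)[0, 1])

optuna.logging.set_verbosity(optuna.logging.WARNING)
study = optuna.create_study(direction="minimize")  # TPE (Bayesian) sampler by default
study.optimize(objective, n_trials=200)
print(study.best_value, study.best_params)
```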
Now the results look much better! The probability of the poor-quality class (label 1) was reduced from 85% to 16%. The search optimization was able to find the combination of perturbations that provides the best suggestion.
LLMs to the rescue!
Now that we know how many words are needed for the model to change its decision, the next question is: can we automatically provide suggestions of alternative text that fits our criteria?
Well, the answer is obviously yes! LLMs such as ChatGPT, Llama, and others can be employed to serve our needs.
There is a plethora of open-source models that can assist us. We just need to provide the right template based on the type of suggestions we get from the XAI methods. For example, if the most influential features are the number of total words and the number of verbs, then the template would be something like:
Act as an advertisement expert. Use 20 words where 7 are verbs to reconstruct the following advertisement into an improved version: "African-American soldiers played a decisive role in the US Army on the western frontier during the Plains Wars, but it's not mentioned in our history books."
The employed open-source LLM responds with the following:
"African-American soldiers were pivotal in shaping US Army history on the western frontier, yet textbooks overlook their significant contributions."
And this example illustrates a very simple case. By scaling the combination of XAI methods and LLMs, we can automatically provide better suggestions for poor advertisements to our customers.
If we want to take it one step further, we can let the customer/user choose from a list of suggestions which improved advertisement serves their needs. The selection (or customer preference) can then be utilized for A/B testing in order to improve the recommendations in the future.
Conclusion
In conclusion, using XAI to increase CTR is a quite easy and straightforward approach. By providing actionable suggestions to sellers, the quality of the ads can improve drastically. The process not only enhances the recommendation process but also provides valuable insights into which specific features should be modified. The results demonstrate a substantial improvement, with the probability of an ad being classified as poor quality reduced from 85% to 16%, underscoring the effectiveness of this refined approach in enhancing ad quality. Furthermore, LLMs present an exciting opportunity to automate the process, seamlessly identifying and enhancing advertisements.
** If you are interested in other ML use-cases, please contact me using the form (and also include a publicly available dataset for this case, I'm always curious to explore new problems).