Predicting Customer Subscription in Bank Marketing Campaigns



A Simple Data Science Project Using R and Logistic Regression

Link to project files: https://drive.google.com/drive/folders/1gLOxi34ijmdN5ZVE-I6WXX-LI_ABFp2R?usp=sharing


For a class project in Computational Mathematics and Statistics, I worked with a dataset on bank marketing campaigns.

The main question I wanted to answer was simple:

Can we use customer and campaign data to understand whether a person is likely to subscribe to a bank term deposit?

A term deposit is a financial product where a customer deposits money for a fixed period of time and earns interest. Banks often promote these products through marketing campaigns, including phone calls.

For this project, I used data from a Portuguese banking institution. The data came from real phone-based marketing campaigns where customers were contacted and asked whether they wanted to subscribe to a term deposit. The final answer was recorded as either yes or no.

The goal of the project was not just to build a model. It was also to understand which factors seemed to influence customer decisions.

Tools Used

For this project, I used the following tools:

R
I used R as the main programming language for the analysis.

RStudio / R Markdown
I used R Markdown to write the project, run the code, show the results, and generate the final report.

tidyverse
I used the tidyverse package for data cleaning, data selection, and visualization.

ggplot2
I used ggplot2, which is part of the tidyverse, to create charts such as boxplots and histograms.

Logistic Regression with glm()
I used the glm() function in R to build a logistic regression model.

PRROC
I used the PRROC package to create a precision-recall curve and evaluate the model on an imbalanced dataset.

Step 1: Understanding the Dataset

The dataset had information about customers and the bank campaign.

Some of the variables included:

  • Customer age

  • Customer account balance

  • Whether the customer had loans

  • Call duration

  • Number of times the customer was contacted

  • Whether the customer subscribed to the term deposit

For this project, I focused on four main independent variables:

  • age

  • balance

  • duration

  • campaign

The target variable was:

  • y, which shows whether the customer subscribed or not

So the main research question became:

What factors influence whether a customer subscribes to a term deposit?

A second important question was:

Does call duration increase the likelihood that a customer subscribes?

Step 2: Cleaning the Data

Before building any model, I first had to clean and prepare the data.

This step is important because a model is only as good as the data we give it.

Tools used in this step

  • R

  • R Markdown

  • tidyverse

The cleaning process included:

  1. Loading the dataset into R

  2. Converting the subscription outcome into a categorical variable

  3. Checking for missing values

  4. Removing unrealistic extreme values

  5. Selecting only the variables needed for the analysis

The dataset did not have missing values in the selected fields, which made the cleaning process easier. After cleaning, the dataset had 45,211 observations and the selected variables were age, balance, duration, campaign, and y.
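The cleaning steps listed above can be sketched in base R roughly as follows. This is a minimal illustration under stated assumptions, not the exact code from the report: the small `raw` data frame is a made-up stand-in for the real file, the `job` column is only there to show column selection, and the cutoffs used to drop implausible rows are hypothetical.

```r
# Step 1 (loading) would be a read.csv() call on the real file.
# Here a toy data frame with made-up values stands in for it.
raw <- data.frame(
  age      = c(34, 51, 28, 45, 150, 39),
  job      = c("admin.", "technician", "student", "services", "retired", "management"),
  balance  = c(1200, -50, 300, 8000, 400, 2500),
  duration = c(180, 620, 95, 240, 310, 0),
  campaign = c(1, 2, 1, 6, 3, 1),
  y        = c("no", "yes", "no", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Step 2: treat the outcome as a categorical variable, with "no" as the baseline.
raw$y <- factor(raw$y, levels = c("no", "yes"))

# Step 3: count missing values in each column.
missing_per_column <- colSums(is.na(raw))
print(missing_per_column)

# Step 4: drop rows with implausible values (these cutoffs are assumptions).
plausible <- raw$age >= 18 & raw$age <= 100 & raw$duration > 0
cleaned <- raw[plausible, ]

# Step 5: keep only the variables used in the analysis.
bank <- cleaned[, c("age", "balance", "duration", "campaign", "y")]
str(bank)
```

The same steps could be written with tidyverse verbs such as `filter()` and `select()`; base R is used here only to keep the sketch free of package dependencies.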

Limiting the analysis to these variables kept the project focused and easier to explain.

Step 3: Exploring the Data

After cleaning the data, I explored it using summary statistics and visualizations.

Tools used in this step

  • R

  • tidyverse

  • ggplot2

The first thing I noticed was that the dataset was imbalanced.

Most customers did not subscribe.

The subscription results were:

  • 39,922 customers said no

  • 5,289 customers said yes

That means only about 11.7% of the customers subscribed, while about 88.3% did not.

This matters because a model can look accurate simply by predicting “no” most of the time.

For example, a lazy model that predicts “no” for every customer would be right about 88.3% of the time without learning anything about who subscribes. That is why accuracy alone is not enough.
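The percentages above follow directly from the two counts. The short sketch below reproduces them and also computes the accuracy of a "model" that always predicts the majority class. The variable names are illustrative.

```r
# Counts reported above.
counts <- c(no = 39922, yes = 5289)

# Share of each outcome, in percent.
shares <- round(100 * counts / sum(counts), 1)
print(shares)

# Accuracy of a rule that predicts the majority class for everyone.
baseline_accuracy <- max(counts) / sum(counts)
print(round(100 * baseline_accuracy, 1))
```

On a data frame, the same shares could be obtained with `prop.table(table(bank$y))`, assuming the outcome column is called `y`.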

Step 4: Visualizing the Variables

The visualizations helped me understand the patterns before building the model.

Tool used in this step

  • ggplot2

Call Duration

One of the most important charts was the boxplot comparing call duration with subscription outcome.

The chart showed that customers who subscribed usually had longer phone calls than those who did not subscribe.

This makes sense. If a customer stays longer on the phone, they may be more interested, more engaged, or asking more questions.

So call duration looked like an important signal.
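A boxplot of this kind can be sketched with ggplot2 as shown below. The data are simulated, the group sizes and average durations are made up, and the axis labels are my own choices, so the picture only illustrates the technique rather than reproducing the chart from the report.

```r
library(ggplot2)

set.seed(1)

# Simulated call durations in seconds; group sizes and means are made up.
calls <- data.frame(
  y = factor(rep(c("no", "yes"), times = c(400, 60)), levels = c("no", "yes")),
  duration = c(rexp(400, rate = 1 / 220), rexp(60, rate = 1 / 540))
)

# One box per outcome, with call duration on the vertical axis.
duration_plot <- ggplot(calls, aes(x = y, y = duration)) +
  geom_boxplot() +
  labs(x = "Subscribed to term deposit", y = "Call duration (seconds)")

print(duration_plot)

# Numeric companion to the chart: median duration per outcome.
median_by_outcome <- tapply(calls$duration, calls$y, median)
print(median_by_outcome)
```

Printing the medians next to the chart is a quick way to check that the visual impression matches the numbers.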

Account Balance

I also looked at account balance.

Customers who subscribed seemed to have slightly higher balances, but the difference was not as strong as call duration. The account balance variable also had many extreme values, meaning some customers had very high balances compared to most others.

Campaign Contacts

The campaign variable showed how many times a customer was contacted during the campaign.

This was interesting because more contact did not necessarily mean better results. In fact, the model later showed that a higher number of contacts was linked with a lower chance of subscription.

That may suggest customer fatigue. In simple terms, if you call people too many times, they may become annoyed and less likely to say yes.
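One simple way to look for this pattern before modeling is to compare the subscription rate across bands of contact counts. The sketch below uses simulated data in which the negative relationship is built in by assumption, and the band boundaries are arbitrary choices for illustration.

```r
set.seed(3)
n <- 3000

# Simulated number of contacts per customer (at least 1).
campaign <- 1 + rpois(n, lambda = 1.8)

# Made-up relationship: each extra contact lowers the chance of a "yes".
subscribed <- runif(n) < plogis(-1.6 - 0.15 * campaign)

# Group contact counts into bands: 1, 2, 3-4, and 5 or more.
contact_band <- cut(campaign, breaks = c(0, 1, 2, 4, Inf),
                    labels = c("1", "2", "3-4", "5+"))

# Share of customers who subscribed within each band, in percent.
rate_by_band <- tapply(subscribed, contact_band, mean)
print(round(100 * rate_by_band, 1))
```

A table like this describes an association only. Customers who are harder to convince may simply get called more often, so it does not prove that extra calls cause refusals.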

Step 5: Choosing the Statistical Method

For the model, I used logistic regression.

Tools used in this step

  • R

  • glm() function

  • Logistic regression

I used logistic regression because the outcome variable had only two possible answers:

  • yes

  • no

Logistic regression is useful when we want to estimate the probability of something happening.

In this case, I wanted to estimate the probability that a customer would subscribe to a term deposit.

The model used this formula:

y ~ age + balance + duration + campaign

This means the model tried to predict subscription outcome using age, balance, call duration, and campaign contacts.
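Fitting this formula with `glm()` might look like the sketch below. The data frame is simulated so the example runs on its own, and every number used to generate it is made up, so the printed coefficients will not match the ones from the project.

```r
set.seed(42)
n <- 2000

# Simulated stand-in for the cleaned data; every number here is made up.
bank <- data.frame(
  age      = round(runif(n, 18, 90)),
  balance  = round(rnorm(n, mean = 1400, sd = 3000)),
  duration = round(rexp(n, rate = 1 / 260)),
  campaign = 1 + rpois(n, lambda = 1.8)
)
true_prob <- plogis(-3.5 + 0.005 * bank$age + 0.00003 * bank$balance +
                      0.004 * bank$duration - 0.12 * bank$campaign)
bank$y <- factor(ifelse(runif(n) < true_prob, "yes", "no"),
                 levels = c("no", "yes"))

# family = binomial makes glm() fit a logistic regression.
# With a factor outcome, the second level ("yes") is the event being modeled.
fit <- glm(y ~ age + balance + duration + campaign,
           data = bank, family = binomial)
print(summary(fit))

# Predicted probability of subscribing for each customer.
predicted_prob <- predict(fit, type = "response")
print(head(round(predicted_prob, 3)))
```

Setting the factor levels so that "no" comes first matters here, because it decides which outcome the coefficients describe.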

Step 6: Model Results

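The logistic regression results showed that all four variables were statistically significant.

That means each variable's relationship with the subscription outcome was unlikely to be due to chance alone. With more than 45,000 observations, though, even very small effects can be statistically significant, so the size and direction of each effect matter as well.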

Here is the simple interpretation.

Age

Age had a positive effect.

This means older customers were slightly more likely to subscribe.

The effect was small, but it was still statistically significant.

Balance

Balance also had a positive effect.

Customers with higher account balances were slightly more likely to subscribe.

Again, the effect was small.

Duration

Call duration had the strongest positive effect.

Longer calls were strongly linked with a higher chance of subscription.

This was the clearest result in the project.

Campaign

Campaign had a negative effect.

This means that as the number of contacts increased, the chance of subscription decreased.

This may suggest that repeatedly contacting customers does not always help. Sometimes, it may actually reduce interest.
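Logistic regression coefficients are on the log-odds scale, so a common way to read them is to convert them to odds ratios with `exp()`. The coefficient values below are hypothetical and are not the estimates from this project. The per-minute conversion assumes call duration is recorded in seconds, as it is in the public version of this dataset.

```r
# Hypothetical log-odds coefficients, NOT the estimates from this project.
example_coefs <- c(age = 0.008, balance = 0.00004,
                   duration = 0.004, campaign = -0.13)

# exp() turns a log-odds coefficient into an odds ratio per one-unit increase.
# Values above 1 raise the odds of "yes"; values below 1 lower them.
odds_ratios <- exp(example_coefs)
print(round(odds_ratios, 4))

# Duration is in seconds, so a per-minute effect is easier to read.
duration_or_per_minute <- exp(60 * example_coefs[["duration"]])
print(round(duration_or_per_minute, 2))
```

With a fitted model object, the same conversion would be `exp(coef(fit))`. A small per-unit coefficient can still add up to a large effect when the variable ranges over hundreds of units, which is why duration can dominate even though its per-second coefficient looks tiny.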

Step 7: Evaluating the Model

After building the model, I tested how well it performed.

Tools used in this step

  • R

  • Confusion matrix

  • PRROC package

  • Precision-recall curve

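The model had an accuracy of about 88.9%.

At first, this sounds very good.

But a model that predicts “no” for every customer would already reach about 88.3% on this dataset, so the improvement is small. When I looked deeper, the story changed.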

The confusion matrix showed:

  • The model was very good at predicting customers who would say no

  • The model struggled to correctly identify customers who would say yes

This happened because the dataset was imbalanced. Since most customers did not subscribe, the model became better at predicting the majority class.
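A confusion matrix like this can be built by turning predicted probabilities into classes and cross-tabulating them against the actual outcome. The sketch below uses simulated data, a 0.5 cutoff chosen as an assumption, and the same rows for fitting and evaluation for brevity, so its numbers will not match the 88.9% reported above.

```r
set.seed(42)
n <- 2000

# Simulated stand-in for the cleaned data; every number here is made up.
bank <- data.frame(
  age      = round(runif(n, 18, 90)),
  balance  = round(rnorm(n, mean = 1400, sd = 3000)),
  duration = round(rexp(n, rate = 1 / 260)),
  campaign = 1 + rpois(n, lambda = 1.8)
)
true_prob <- plogis(-3.5 + 0.005 * bank$age + 0.00003 * bank$balance +
                      0.004 * bank$duration - 0.12 * bank$campaign)
bank$y <- factor(ifelse(runif(n) < true_prob, "yes", "no"),
                 levels = c("no", "yes"))

fit <- glm(y ~ age + balance + duration + campaign,
           data = bank, family = binomial)

# Classify as "yes" when the predicted probability reaches 0.5 (an assumed cutoff).
predicted_prob  <- predict(fit, type = "response")
predicted_class <- factor(ifelse(predicted_prob >= 0.5, "yes", "no"),
                          levels = c("no", "yes"))

# Rows are actual outcomes, columns are predictions.
confusion <- table(actual = bank$y, predicted = predicted_class)
print(confusion)

# Overall accuracy, and the share of each actual class predicted correctly.
accuracy   <- sum(diag(confusion)) / sum(confusion)
recall_no  <- confusion["no", "no"] / sum(confusion["no", ])
recall_yes <- confusion["yes", "yes"] / sum(confusion["yes", ])
print(round(c(accuracy = accuracy, recall_no = recall_no, recall_yes = recall_yes), 3))
```

Comparing `recall_no` with `recall_yes` makes the imbalance problem visible in a way that the single accuracy figure does not.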

This is an important lesson in data science:

High accuracy does not always mean the model is useful.

For a marketing campaign, the bank is usually more interested in finding the people who might say yes. So even though the model had high accuracy, it was not well suited to the business goal.
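Precision and recall focus on exactly that group. Precision is the share of customers flagged as "yes" who actually subscribed, and recall is the share of actual subscribers the model managed to flag. A precision-recall curve, such as the one the PRROC package draws, traces these two quantities across all cutoffs. The sketch below is a hand-rolled base R version that computes them at a few cutoffs only, using simulated scores whose distributions are made up. The helper name `precision_recall` is hypothetical.

```r
set.seed(7)

# Simulated predicted probabilities: 900 non-subscribers and 100 subscribers.
actual <- rep(c("no", "yes"), times = c(900, 100))
score  <- c(rbeta(900, 1, 8), rbeta(100, 3, 4))

# Precision and recall for the "yes" class at a given cutoff.
precision_recall <- function(threshold) {
  flagged  <- score >= threshold
  true_pos <- sum(flagged & actual == "yes")
  c(threshold = threshold,
    precision = if (any(flagged)) true_pos / sum(flagged) else NA_real_,
    recall    = true_pos / sum(actual == "yes"))
}

# Evaluate a handful of cutoffs; a full curve would use many more.
pr_table <- as.data.frame(t(sapply(c(0.1, 0.2, 0.3, 0.5), precision_recall)))
print(round(pr_table, 3))
```

Lowering the cutoff finds more of the real subscribers (higher recall) at the cost of more wasted calls (lower precision), which is the trade-off a bank would need to weigh.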

Step 8: The Biggest Limitation

The biggest limitation in this project was the use of call duration.

Call duration was the strongest predictor, but there is a problem:

You only know the call duration after the call is finished.

That means it is useful for understanding what happened, but it is not very useful for deciding who to call before the campaign starts.

This is an example of what data scientists call data leakage.

Data leakage happens when a model uses information that would not actually be available at the time of prediction.

So while duration helped the model perform better, it also made the model less realistic for real-world use.
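One way to see how much a model leans on duration is to refit it without that variable and compare the two fits. The sketch below does this on simulated data where duration is a strong driver by construction, so the size of the gap is illustrative only.

```r
set.seed(42)
n <- 2000

# Simulated stand-in for the cleaned data; every number here is made up.
bank <- data.frame(
  age      = round(runif(n, 18, 90)),
  balance  = round(rnorm(n, mean = 1400, sd = 3000)),
  duration = round(rexp(n, rate = 1 / 260)),
  campaign = 1 + rpois(n, lambda = 1.8)
)
true_prob <- plogis(-3.5 + 0.005 * bank$age + 0.00003 * bank$balance +
                      0.004 * bank$duration - 0.12 * bank$campaign)
bank$y <- factor(ifelse(runif(n) < true_prob, "yes", "no"),
                 levels = c("no", "yes"))

# Full model, including the variable that is only known after the call.
fit_full <- glm(y ~ age + balance + duration + campaign,
                data = bank, family = binomial)

# Same model restricted to information available before the call.
fit_precall <- update(fit_full, . ~ . - duration)

# Lower AIC and lower residual deviance indicate a better fit.
comparison <- data.frame(
  model    = c("with duration", "without duration"),
  AIC      = c(AIC(fit_full), AIC(fit_precall)),
  deviance = c(deviance(fit_full), deviance(fit_precall))
)
print(comparison)
```

The model without duration will usually fit worse, but it is the only one of the two that could be used to decide whom to call.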

What I Learned From This Project

This project taught me several important lessons.

1. Data cleaning matters

Before modeling, it is important to understand the data, check for missing values, remove unrealistic values, and select useful variables.

2. Visualization helps before modeling

The charts helped me see patterns before running the regression.

For example, the boxplot clearly showed that subscribers had longer call durations.

3. Accuracy can be misleading

Because the dataset was imbalanced, accuracy alone did not tell the full story.

The model looked strong overall, but it struggled to identify actual subscribers.

4. Not every strong predictor is practical

Call duration was powerful, but it is only known after the call.

This taught me that a model can be statistically strong but still have practical limitations.

5. More marketing is not always better

The campaign variable showed a negative relationship with subscription.

This suggests that calling customers too many times may reduce effectiveness.

Practical Meaning

From a business point of view, this project suggests that banks should focus on the quality of customer conversations rather than simply increasing the number of calls.

Longer calls may signal higher engagement, but banks should not rely on call duration to choose whom to contact, because it is only known after the call ends.

A better future version of this project could remove call duration and focus only on information available before the call, such as:

  • age

  • balance

  • job

  • education

  • loan status

  • previous campaign history

That would make the model more useful for real marketing decisions.

Conclusion

This project used R, data visualization, and logistic regression to study customer subscription behavior in bank marketing campaigns.

The main finding was that call duration had the strongest relationship with subscription. Balance and age also had positive effects, while repeated campaign contacts had a negative effect.

However, the project also showed an important limitation: some variables may help explain past outcomes but may not be useful for future prediction.

For me, this project was a good example of how statistics and data science can help us understand real-world behavior. It also showed that building a model is not just about getting a high accuracy score. It is about asking whether the model makes sense in the real world.

Data science is not only about prediction.

It is also about interpretation, decision-making, and understanding the story behind the numbers.
