In an era of fluctuating economic trends and ever-evolving technology, financial markets have become difficult for the average consumer to navigate, even with the pool of resources available today. Although innovation is meant to make a consumer’s life easier, that is not necessarily the case when it comes to accessing even basic tools for navigating a financial market.
Since 2011, more than 700,000 consumer complaints have been registered and made public by the Consumer Financial Protection Bureau (CFPB). This volume offers scope to find hidden patterns among these complaints and to improve how they are handled. To that end, this project focuses on developing a decision-oriented, visual system that serves as a tool for both the consumer and the firm involved, such that:
- The consumer using the service can make an informed decision before engaging with a firm and even after a complaint arises. The user will also have historical information on the company’s past behavior and responses on the specific issue.
- From a company standpoint, the analysis can show the areas where the company is doing well and those that have scope for improvement.
To achieve these goals, the project proposes a classification-based predictive model built with machine learning algorithms in R, together with a visualization tool such as Power BI. The patterns obtained through the predictive model, along with visual analysis, will assist in resolving these complaints in a timely fashion. These predictions and visualizations will not only assist companies and consumers with future planning and financial decisions but will also be a good source of information on the US financial market.
Keywords: Financial market, CFPB, predictive modeling, visualization, consumers, complaints, machine-learning algorithms, R, Power BI.
Conventional economic thinking holds that, in a perfect market, financial market behavior cannot practically be predicted with more than 50 percent accuracy.
However, studies show that with an increase in consumer awareness and assistance from government and private organizations, these financial traits can be predicted as well as improved. In this capstone project, the data is obtained from the Consumer Financial Protection Bureau (CFPB), an agency of the United States government that is responsible for consumer protection in the financial sector. The CFPB has authority over many financial entities, private or public, such as banks, securities firms, mortgage servicers, credit firms and so on.
This organization aims to provide the means through which consumers can take appropriate action against unfair and abusive financial practices. In addition, the CFPB focuses on keeping customers vigilant both before and after their involvement in a financial activity, making them aware of the practices associated with different companies, products, and services.
The Role of CFPB dataset: The CFPB provides a pool of rich data that covers all the significant sectors of the financial market. It is important to both consumers and the companies involved, as it enables analysis on several diverse fronts:
- Data analytics on this dataset provides in-depth insights that can be used as a learning model to mitigate future user complaints and help financial firms identify the key affected areas.
- The dataset also enables an effective visualization tool, from which an average user can easily draw quick conclusions based on the visual patterns obtained.
This proposal gives a comprehensive overview of adopting a data-driven analysis model and a visualization tool that provide customizable information to the user in any chosen financial market area. It further includes a comparison of selected algorithms for the training model and an interactive, intuitive visual analysis. A brief account of earlier work related to the CFPB dataset is also given. Since the CFPB follows a systematic approach for handling a consumer complaint, there is good scope for leveraging this information to develop a more educated consumer as well as better-informed policymakers at public and private firms.
The use of CFPB’s historical data can act as a catalyst to bridge the gap in a user’s awareness both before and after engaging with a certain financial service. Although the CFPB provides a huge amount of diversified information on various topics, navigating to a specific issue of interest is still very confusing for an average user.
Research on specific financial services is available. However, analysis that links the use of these services to consumer education still has a lot of scope. The proposed predictive modeling analysis will help create a well-informed decision-making model, an approach that has not yet been explored to its potential. Moreover, it will enable the user to understand the viability of a product or firm with respect to certain services, which will, in turn, help them understand the course of action to take.
Also, the inadequate means of visual analysis (from both a consumer and a company standpoint) do not let a user explore different aspects of financial services and their impact on the global market.
“To prepare a decisive predictive analysis model alongside an interactive visual analysis tool that focuses on consumer complaints and the company’s effort to resolve them successfully and on time”
The primary focus of the proposal is to leverage both structured and unstructured data analysis approaches within the scope of this project to study the financial market more efficiently. In addition, it aims to enhance analysts’ ability to explore the varying nature of the financial market by providing visual analysis and statistical information on different firms and consumer complaints, which will help reveal new insights. This, in turn, aids the CFPB’s effort toward a more favorable environment for both the consumers and the companies involved.
1. Building a data-driven predictive model using R and its visualization libraries to identify the relevant data and statistical indicators. This will, in turn, help reveal insights that best serve companies and consumers. The objectives are further broken down as:
- Using a supervised classification-based approach to develop a model, focusing on resolving consumer complaints in a timely manner.
- Aiming to educate a consumer on several financial services.
- Applying supportive data analytics algorithms and performance tuning practices. The results will assist a consumer in making informed decisions, whether choosing a product or engaging with a firm’s financial service.
2. Developing visualization dashboards using Power BI that provide insights based on the financial services and consumer complaint database. To evaluate and compare the parameters that affect a company’s behavior towards a consumer complaint, the project makes use of direct or indirect indicators such as complaint count, timely response by companies, consumer disputes of the responses and so on. The idea behind this objective is to leverage the rich historical data that the CFPB provides and help financial firms as well as consumers customize information by product, issue, firm, etc. Fulfilling this objective will in turn help answer many questions based on the data, such as:
- The rise or fall in complaints over the years
- Types of product responsible for the rise in these complaints
- Companies that consumers complain about the most
- Geographical statistics based on the complaints
- Interrelations or dependencies between the complaint types and global financial market trends.
A significant amount of research has already been conducted on the data provided by the CFPB since 2011, the year the agency was formed. The CFPB website has a dedicated section that publishes previous as well as ongoing work and studies referencing the openly available data provided by the agency. This research spans many specific and combined streams such as mortgages, credit card fraud and scams, student loans and so on; this section therefore covers the work most relevant to the proposed implementation of the project.
As mentioned above, the CFPB itself performs extensive research to support well-educated consumer behavior when it comes to financial planning. A 2016 financial report by the agency broadly emphasizes how to use the information that the CFPB makes public as a medium to spread financial literacy among consumers. The research primarily focuses on developing a financially capable consumer.
Following that, the focus shifts to the financial knowledge, skills, and habits a user must develop to effectively recognize these methods and apply them to achieve financial protection. The research results are also important to financial institutions, as they point the way to further explore and educate the user for more sustainable market growth.
In a somewhat similar fashion, Deloitte performed an analysis in 2012 on the Consumer Financial Protection Bureau data and produced several valuable insights into the nature and sources of recent complaints. The outcomes are particularly relevant to the scope of this project:
- The majority of complaints stemmed from troubled mortgages, which is a growing trend.
- Customer misunderstanding may generate more complaints than financial institution error.
- Established neighborhoods were more likely sources of complaints.
- Complaint resolution times have improved.
This research provides some interesting patterns and insights into complaints and issues related to financial practices and the factors contributing to them. Within the confines of this proposal, the study helps identify important factors that should be investigated while analyzing the patterns and building the predictive model. It also indicates the areas that are consistently affected by complaints and need attention from both consumers and the firms involved.
Using Visual Analytics in this context
Although the CFPB itself provides some elementary visualizations, data analysts have gone a step further to enhance a consumer’s analytics proficiency. Many researchers and commercial entities have played an important role over the years in providing visual statistics and information on diverse topics and issues related to financial practices.
Mark Schott presented a lightweight web-based tool for the CFPB database using Shiny, an R package that allows you to build an interactive web application directly from R. The application is built with the idea of providing an exploratory analysis channel for users to delve into the data provided by the CFPB. The tool provides some interesting functionality that effectively uncovers the findings shown in his article.
The component usability and timeline customization in this application have inspired the cross-product comparison analysis within this project. Its limited but effective visuals, such as time series counts, bar charts and mosaic plots, proved useful for elementary analysis. However, the tool is limited in its ability to surface essential and extensive correlations, and it lacks additional filters to cater to wider information needs.
While the web-based tool described above is very useful, its usage can be extended by integrating data from different sources. For instance, a report by Stephen Redmond describes developing a system based on integrating data from multiple sources and applying ETL transformations, visualization and advanced analytics concepts. The practices mentioned in that report are a good example of how analytics and visualization concepts can be leveraged together. Although visualization and analysis tools have advanced rapidly since then, and the report contains some misinformation that must be worked around, it still provides useful guidelines.
There is a clear need today to assess the complexity involved in a financial market and to derive the correlations responsible for it. Although the project does not cover the integration of multiple data sources, consumer awareness gained with respect to one financial service also applies to others.
The project focuses on leveraging the diversified financial information through a predictive model along with visual analysis. This gives a firm a basis for improving the affected areas, whether it be the response to the consumer, the dispute of a settlement or the response time. It also enables a consumer to make an informed decision both before and after engaging with any financial firm, product, or service.
This section focuses on the technical implementation. To begin with, the data is gathered and cleaned, and some initial data exploration is performed. We then discuss the classification-based predictive modeling approach. Following that, there is a brief description of the visual analysis involved and how it is applied. Finally, there is some discussion of the advanced analytics and design methodology.
To achieve the proposed objectives, it is necessary to focus on technologies and tools that support flexible modes of implementation. With this in mind, R is a natural choice, as it is well suited to statistical operations and to implementing algorithms such as Logistic Regression, Random Forest, Naïve Bayes, and KNN.
The proposal also needs an interactive data visualization tool that supports data from different sources and can perform basic functions such as aggregation at a monthly and product level. For this purpose, Power BI is the preferred option.
The dataset is openly available on the Consumer Financial Protection Bureau (CFPB) website, covers the period from 2011 to the present, and contains about 700,000 complaints registered since that time. It contains information about different financial practices, companies, products, and issues related to them.
These consumer complaints are responded to by different organizations and are published on a regular schedule. I believe this data is multifaceted for data analytics and visualization, as it offers scope for both structured and unstructured data analysis, supported by visual analysis. For the scope of this project, we will focus only on recent years. Detailed information can be found at:
Initial Data Exploration
Initial exploration of the data provided some useful insights that helped in deciding the next course of action for the dataset. The table below describes some important fields in the data, along with basic information about each, which we can later use as a baseline for advanced analytics in R.
Unstructured Data Handling Measures
The CFPB dataset contains certain fields that are rich in descriptive text. Issue, Sub-issue and Consumer complaint narrative are the fields that offer scope for extracting more information on the types of problems consumers face. Later, we can also draw correlations between these problems and the company’s response to them. This process will in turn support the idea of targeting the gray areas where companies should improve and of which consumers should be aware.
For the scope of this section, sentiment analysis will be explored using tidy data principles in R. The text from the Consumer complaint narrative field is organized in a tidy data structure (one token per row), and sentiment analysis is implemented as an inner join with a sentiment lexicon. This analysis can be used to identify the words with emotional and opinion content, potentially supporting the overall goals and objectives of the project.
Another sentiment analysis approach that can be explored here is the ‘Score Sentiment Approach’. This approach calculates a score for each complaint and classifies it as positive, neutral or negative.
Score = Number of positive words – Number of negative words
A sentiment lexicon is used to count the positive and negative words. Each word is scored, and the overall count is assigned to each complaint narrative. The primary focus of this phase is to identify whether customer feedback is positive, neutral or negative and thus help companies infer the status of their products or services. These inferences may also help improve the response time to consumer complaints.
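The scoring rule above can be sketched as follows. This is a minimal illustration written in Python rather than the R workflow the proposal plans to use. The lexicon entries, the sample narratives and the function names are all hypothetical; a real analysis would join against a published sentiment lexicon.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only; a real analysis would
# load a published sentiment lexicon instead.
LEXICON = {
    "helpful": "positive",
    "resolved": "positive",
    "thank": "positive",
    "fraud": "negative",
    "unfair": "negative",
    "error": "negative",
    "late": "negative",
}


def tokenize(narrative):
    """Split a narrative into lowercase word tokens (one token per entry)."""
    return re.findall(r"[a-z']+", narrative.lower())


def sentiment_score(narrative):
    """Score = number of positive words - number of negative words."""
    # Equivalent of the inner join: only tokens found in the lexicon are kept.
    matched = [LEXICON[tok] for tok in tokenize(narrative) if tok in LEXICON]
    counts = Counter(matched)
    return counts["positive"] - counts["negative"]


def classify(score):
    """Map a score to a positive / neutral / negative label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up narratives, not taken from the CFPB data.
narratives = [
    "The bank charged a late fee in error and the process felt unfair.",
    "The agent was helpful and the issue was resolved.",
    "I requested a copy of my statement.",
]
labels = [classify(sentiment_score(text)) for text in narratives]
```

Because unmatched words contribute nothing, a narrative with no lexicon words scores zero and is labeled neutral, so the quality of the result depends heavily on the lexicon chosen.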
Note: Text analytics is not the primary focus of the proposed predictive model. The idea is simply to leverage the textual data and use it as an extra piece of information to draw correlations (if any) with the findings from the model.
Feature Analysis and Selection Criteria
The main trade-off in using many variables is that, although the prediction accuracy often tends to increase, the interpretability of the model decreases and it is more likely to overfit. So, the main task is to select the number of features that gives good prediction accuracy while keeping the model interpretable.
For this reason, and given the scope of the dataset, Lasso Regression is selected as the variable selection method for building the model. Here, the less important parameters are given low weight (shrunk to, or close to, zero), making them easy to eliminate. Lasso regression also offers scope for further tuning, using cross-validation among other methods, toward the goal of an optimal number of features with good prediction accuracy and a model that is not overly complex to interpret or use.
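To illustrate how the lasso penalty drives unimportant weights to zero, here is a small sketch of lasso for linear regression, solved by coordinate descent with soft-thresholding. It is written in Python for illustration only; the proposal itself would use an R implementation. The toy data, the feature names and the penalty value `lam` are assumptions and not CFPB values.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/2n) * ||y - Xw||^2 + lam * ||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            w[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return w


# Toy design matrix with three mutually orthogonal, mean-zero columns.
# Hypothetical feature names, for illustration only.
names = ["feature_a", "feature_b", "feature_c"]
X = [
    [1, 1, 1], [1, -1, 1], [-1, 1, 1], [-1, -1, 1],
    [1, 1, -1], [1, -1, -1], [-1, 1, -1], [-1, -1, -1],
]
# The response depends strongly on feature_a, weakly on feature_b,
# and not at all on feature_c.
y = [3.0 * a + 0.1 * b for a, b, _ in X]

weights = lasso_coordinate_descent(X, y, lam=0.5)
selected = [name for name, w_j in zip(names, weights) if w_j != 0.0]
```

On this toy data only `feature_a` keeps a non-zero weight. In practice the penalty would be chosen by cross-validation, as the text notes, rather than fixed by hand.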
The Classification Model
Classification can be performed on this dataset based on three class labels, company_response_to_consumer, timely_response and consumer_disputed; for this, a new column named ‘Efficiency’ has been added. The class value informs us about the company’s performance for a particular product. The following section describes a simple measure to calculate the efficiency factor based on the three class labels mentioned above.
The Efficiency indicator, classified as ‘Excellent, Good, Bad or Worse’, is based on an overall score obtained by considering the following questions as performance indicators:
- How did the Company acknowledge a consumer complaint?
- Was the response on time (within 15 days)? And,
- Is the consumer satisfied with that response?
The Efficiency values obtained here will then be used as a parameter to predict the capabilities of a product and thereby of the firm involved.
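One way the three indicators could be combined into an Efficiency label is sketched below, in Python for illustration. The proposal does not specify the scoring, so the point values and the cut-offs between ‘Excellent’, ‘Good’, ‘Bad’ and ‘Worse’ are assumptions made purely for this example.

```python
# Hypothetical points for how the company closed the complaint.
# Any response not listed here earns 0 points.
RESPONSE_POINTS = {
    "Closed with monetary relief": 2,
    "Closed with non-monetary relief": 2,
    "Closed with explanation": 1,
}


def efficiency(company_response_to_consumer, timely_response, consumer_disputed):
    """Combine the three class labels into one Efficiency label.

    The scoring below is an illustrative assumption:
    response quality (0-2) + timely (0-1) + not disputed (0-1).
    """
    score = RESPONSE_POINTS.get(company_response_to_consumer, 0)
    score += 1 if timely_response == "Yes" else 0
    score += 1 if consumer_disputed == "No" else 0
    if score >= 4:
        return "Excellent"
    if score == 3:
        return "Good"
    if score == 2:
        return "Bad"
    return "Worse"


example = efficiency("Closed with explanation", "Yes", "No")
```

Under these assumed weights, a timely, undisputed explanation is rated ‘Good’, while relief that is timely and undisputed is rated ‘Excellent’.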
Algorithm Selection and Performance Tuning
To achieve a parsimonious model and to leverage the predicted probability score of an event, Logistic Regression is the most suitable choice as a classic predictive modeling technique for binary categorical variables. Preliminary analysis of this dataset suggested building a classification model that uses the associated properties of this algorithm to enhance the proposed predictive model. These properties include variable selection/elimination based on Null Deviance, Residual Deviance and AIC values. The performance tuning uses the ROC curve to adjust the classification threshold used to compute the accuracy of the model.
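The threshold adjustment step can be sketched as follows, in Python for illustration. The example assumes a fitted logistic model has already produced a probability for each complaint. The labels and probabilities are made-up toy values, and choosing the threshold that maximizes the true positive rate minus the false positive rate (Youden's J) is one common criterion assumed here, not one the proposal prescribes.

```python
def roc_points(y_true, y_prob):
    """Return (threshold, TPR, FPR) for each distinct predicted probability."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for threshold in sorted(set(y_prob), reverse=True):
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
        points.append((threshold, tp / positives, fp / negatives))
    return points


def best_threshold(y_true, y_prob):
    """Pick the threshold that maximizes TPR - FPR (Youden's J)."""
    return max(roc_points(y_true, y_prob), key=lambda pt: pt[1] - pt[2])[0]


def accuracy(y_true, y_prob, threshold):
    """Share of cases classified correctly at the given threshold."""
    correct = sum(1 for y, p in zip(y_true, y_prob) if (p >= threshold) == (y == 1))
    return correct / len(y_true)


# Toy values only: 1 could stand for a hypothetical positive class.
y_true = [1, 1, 0, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

threshold = best_threshold(y_true, y_prob)
tuned_accuracy = accuracy(y_true, y_prob, threshold)
```

With imbalanced classes, a threshold tuned this way can differ noticeably from the default of 0.5, which is why the ROC curve is examined before accuracy is reported.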
Besides logistic regression, Random Forest, with a few basic assumptions, also proved efficient at handling large datasets and at estimating the important variables in this data. One such assumption is that a random subset of candidate variables is considered at each node when growing a tree. The values of mtry and ntree are also significant factors when tuning this algorithm. R provides a variety of useful packages such as caret which, when combined with such algorithms, provide essential capabilities like feature selection, feature importance, model tuning, and visualization. Further tuning of Random Forest, via random search, grid search and manual tuning, will play a vital role in choosing a core algorithm for the proposed predictive model.
Alongside these algorithms, the model will be compared against a few others, such as Naïve Bayes, KNN or SVM, which are well suited to support the analysis under these circumstances. A core algorithm among them will be chosen and further tuned to build a more robust model. This approach of selecting and tuning algorithms while building the classification model will be very useful when we drill down to the frequency correlation of each company’s products, sub-products, issues, and sub-issues. The proposed classification-based approach will, in turn, help a company work on and improve the specific issue highlighted by the Efficiency factor, while also serving the purpose of consumer awareness. The following figure depicts the general process followed while building the classification model and carrying out evaluations.
At a high level, the dashboard provides a customizable view for users, so that the implementation can provide different statistics based on user needs. The design therefore provides graphical visualization based on the following component division:
- Component 1: Timeframe Selector - For the scope of this project, the user can select a time granularity (daily, monthly, quarterly, or yearly) at which they want to analyze the statistics.
- Component 2: Multiple Variable Selectors - For the selected time, users can then select different types of parameters to analyze, such as product, sub-product, issue, sub-issue, company, state, city and so on.
- Component 3: Multiple Field of Interest Selector - Once the time and variables are decided, the view can be drilled down to one or more fields of interest within the selected variable, such as Credit card, Mortgage, Auto, Domestic money transfer and other financial services.
The combination of selections from the above components will enable the user to draw a clearer picture of the financial service or product they want to use or are already using. This visualization approach also leverages the previously proposed classification model, where users draw conclusions based on the Efficiency factor. The dashboard designed here can be used to draw correlations (if any) between the suggested classification model and the visual analysis generated. The following figure (2) represents the overview of how these components will be laid out on the dashboard and the graphical representations that they will provide.
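The interaction of the three selectors can be sketched as a filter followed by a grouped count. The sketch is in Python for illustration only, since the dashboard itself is to be built in Power BI. The sample rows, the field names and the function names are hypothetical.

```python
from collections import Counter
from datetime import date

# Made-up complaint rows; field names are assumptions for this sketch.
complaints = [
    {"date_received": date(2017, 1, 15), "product": "Mortgage", "company": "Bank A"},
    {"date_received": date(2017, 2, 3), "product": "Mortgage", "company": "Bank B"},
    {"date_received": date(2017, 2, 20), "product": "Credit card", "company": "Bank A"},
    {"date_received": date(2017, 5, 9), "product": "Mortgage", "company": "Bank A"},
    {"date_received": date(2017, 5, 30), "product": "Student loan", "company": "Bank C"},
]


def period_key(day, granularity):
    """Component 1: map a date to the selected timeframe bucket."""
    if granularity == "yearly":
        return str(day.year)
    if granularity == "quarterly":
        return f"{day.year}-Q{(day.month - 1) // 3 + 1}"
    if granularity == "monthly":
        return f"{day.year}-{day.month:02d}"
    if granularity == "daily":
        return day.isoformat()
    raise ValueError(f"unknown granularity: {granularity}")


def dashboard_counts(rows, granularity, variable, fields_of_interest):
    """Components 2 and 3: count complaints per period for the chosen
    variable, restricted to the selected fields of interest."""
    counts = Counter()
    for row in rows:
        if row[variable] in fields_of_interest:
            key = (period_key(row["date_received"], granularity), row[variable])
            counts[key] += 1
    return dict(counts)


quarterly = dashboard_counts(
    complaints, "quarterly", "product", {"Mortgage", "Credit card"}
)
```

Each (period, field) pair in the result corresponds to one bar or point in a chart, so changing any selector simply re-runs the same filter and count.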
Geo-Indications: The state and zip code information will be used to identify the states and cities associated with different types of consumer complaint indices.
Microsoft Excel: Exploratory analysis, data preparation, and cleaning.
R Studio: Building a classification model, extracting statistical information, implementing the algorithm and using different libraries to analyze data.
Power BI: Developing visuals and insights for the top companies receiving the largest number of complaints.
The dataset obtained is sizeable, with more than 200,000 entries for the past three years, which demands more sophisticated data cleaning and processing techniques. Moreover, the data preparation needs to keep in mind both the classification model and the visualization implementation. Also, there are more than 3500 companies against which consumers have lodged complaints. This narrows our focus to the top companies with the largest number of complaints, and the issues related to those firms.
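Narrowing the data to the most-complained-about companies can be sketched as a frequency count followed by a filter. The example is in Python for illustration; the company names, the `company` field name and the cut-off `n` are hypothetical.

```python
from collections import Counter

# Made-up rows standing in for complaint records.
rows = [
    {"company": "Bank A", "issue": "Billing dispute"},
    {"company": "Bank B", "issue": "Loan servicing"},
    {"company": "Bank A", "issue": "Late fee"},
    {"company": "Bank C", "issue": "Account closing"},
    {"company": "Bank A", "issue": "Loan servicing"},
    {"company": "Bank B", "issue": "Billing dispute"},
]


def top_companies(records, n):
    """Return the n companies with the most complaints, with their counts."""
    return Counter(rec["company"] for rec in records).most_common(n)


def restrict_to_top(records, n):
    """Keep only the records that belong to the top-n companies."""
    keep = {company for company, _ in top_companies(records, n)}
    return [rec for rec in records if rec["company"] in keep]


top_two = top_companies(rows, 2)
subset = restrict_to_top(rows, 2)
```

The cut-off trades coverage for tractability: a small `n` keeps the model and dashboard manageable but leaves out firms with few complaints.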
Financial market predictions are always subject to a lot of uncertainty; thus, a well-chosen subset is needed to drive the business goals and the visual stories related to them. Moreover, as mentioned in the literature review, the project emphasizes the efficiency of, and the indicators affecting, a company’s ability to resolve an issue, so many deciding factors are tied to prior work done in this field using the CFPB database.
One significant limitation of the CFPB data in recent times is the legal standing of the agency’s actions. The methods the agency follows in making complaints public, and other similar actions, are constantly challenged and are the subject of several ongoing legal cases. Even though the organization’s aim is a more favorable environment for consumers, its policies can tarnish a company’s image in the long run. Also, many firms handle their complaints and disputes internally, which might affect several ongoing as well as previous studies done in this area.
Moreover, financial markets and practices are always growing and evolving. No one model or practice can provide a global solution for all companies. The complaints submitted to the CFPB cannot be treated as the only and complete source; it is unlikely that the database completely reflects a financial institution’s complaint records. A more reliable way for a firm to analyze its consumer behavior is to explore its own database. But again, not all companies have the resources, or the need, to do so.
Significance and Implications
- Addressing consumer complaints to their satisfaction correlates with customer loyalty.
- Anticipating emerging trends in consumer complaints.
- Companies can avoid or respond to complaints more efficiently by using the historical trends in consumer behavior towards a financial product or service.
- Improving the complaint-resolution time will play a deciding role for a product or service and thereby for the company’s future.
- Consumers can use these results and visual aids to make thoughtful decisions about choosing products from different companies.
- Implementation of a Classification based Predictive model using machine learning algorithms and techniques along with the code & documentation involved.
- A customizable Power BI dashboard published through powerbi.com with a detailed narrative.
- Complete implementation and documentation of the Power BI workbook.
In this proposal, I have described the problems associated with the financial market and user complaints, and stated the importance of a strong association between predictive modeling and visual analysis. Moreover, an approach to building, interpreting and evaluating a classification model, validated against advanced algorithms like Random Forest and SVM, is proposed.
Following that, the capabilities of a visual analysis tool like Power BI are examined. The proposal also presents some of the challenging problems in the financial market today and practical solutions to address them. Further, it includes a comprehensive overview of adopting predictive modeling and visual analysis techniques that can help detect the types of products and services affecting a consumer’s financial goals and, in many ways, offer ideas to a company’s policymakers.
- The CFPB strategic plan, budget, and performance plan and report. (n.d.). Retrieved from https://files.consumerfinance.gov/f/201602_cfpb_report_strategic-plan-budget-and-performance-plan_FY2016.pdf
- Slack, M. (2012, January 4). Consumer Financial Protection Bureau 101: Why We Need a Consumer Watchdog. Retrieved December 2, 2018, from https://obamawhitehouse.archives.gov/blog/2012/01/04/consumer-financial-protection-bureau-101-why-we-need-consumer-watchdog
- Sabo, T. (n.d.). Applying Text Analytics and Machine Learning to Assess Consumer Financial Complaints, 15.
- Redmond, S. (n.d.). Project Report – Consumer Financial Protection Bureau. Retrieved from https://www.academia.edu/33575702/Project_Report_-_Consumer_Financial_Protection_Bureau
- Schott, M. (n.d.). EDA of Consumer Complaint data from the CFPB. Retrieved December 2, 2018, from https://nycdatascience.com/blog/student-works/exploration-consumer-complaint-data-provided-cfpb/
- Asanka, P., Fonseka, W. R. A., Nadeesha, D. G. M., & Thakshila, P. M. C. (2016). Use of Data Warehousing to Analyze Customer Complaint Data of CFPB of USA.
- Osisanwo, F. Y., Akinsola, J. E. T., Awodele, O., Hinmikaiye, J. O., Olakanmi, O., & Akinjobi, J. (2017). Supervised Machine Learning Algorithms: Classification and Comparison. International Journal of Computer Trends and Technology, 48(3), 128–138. https://doi.org/10.14445/22312803/IJCTT-V48P126
- Olivera, A. R., Roesler, V., Iochpe, C., Schmidt, M. I., Vigo, Á., Barreto, S. M., & Duncan, B. B. (2017). Comparison of machine-learning algorithms to build a predictive model for detecting undiagnosed diabetes – ELSA-Brasil: accuracy study. Sao Paulo Medical Journal, 135(3), 234–246. https://doi.org/10.1590/1516-3180.2016.0309010217
- Cowley, S. (2018, October 3). Consumer Bureau Looks to End Public View of Complaints Database. The New York Times. Retrieved from https://www.nytimes.com/2018/04/25/business/cfpb-complaints-database-mulvaney.html
- Yilmaz, C., Varnali, K., & Kasnakoglu, B. T. (n.d.). How do firms benefit from customer complaints? Retrieved December 2, 2018, from https://www.researchgate.net/publication/281781314_How_do_firms_benefit_from_customer_complaints
- Faed, A., Hussain, O. K., & Chang, E. (2014). A methodology to map customer complaints and measure customer satisfaction and loyalty. Service Oriented Computing and Applications, 8(1), 33–53. https://doi.org/10.1007/s11761-013-0142-6
- Kiefer, D., et al. (2017, July 5). Consumer Financial Protection Bureau’s (CFPB) Consumer Complaint Database. Deloitte United States. Retrieved from www2.deloitte.com/us/en/pages/financial-services/articles/consumer-financial-protection-bureau-cfpb-consumer-complaint-database.html
- Financial literacy annual report. (n.d.). Retrieved from http://www.thecb.state.tx.us/reports/PDF/9047.PDF?CFID=53901583&CFTOKEN=10522576.
- Research & Reports. (n.d.). Retrieved December 2, 2018, from https://www.consumerfinance.gov/data-research/research-reports/?page=4.
- Spiralyze, K. (2016, April 8). Shiny. Retrieved December 2, 2018, from https://www.rstudio.com/products/shiny-2/
- Peng, J., Lee, K. L., & Ingersoll, G. M. (n.d.). An Introduction to Logistic Regression Analysis and Reporting. Retrieved January 5, 2019, from https://www.researchgate.net/publication/242579096_An_Introduction_to_Logistic_Regression_Analysis_and_Reporting