Phase-2 Solution Architecture
College Name: KLS Vishwanathrao Deshpande Institute of Technology, Haliyal.
Group Members:
Name: Mushtaq Ahmed N Jamali
CAN ID Number: CAN_33717654
Name: Kalpesh P. Pavaskar
CAN ID Number: CAN_33710599
Name: Nagendra M. Borekar
CAN ID Number: CAN_33724095
Name: Santosh M. Turamari
CAN ID Number: CAN_33692571
Project Title: Improving Data Accuracy in CRM using AI
1. Solution Architecture Overview
The solution architecture is designed to integrate advanced Artificial Intelligence (AI) and
Machine Learning (ML) techniques within a Customer Relationship Management (CRM)
system to enhance data analysis capabilities. The focus is on developing visualizations to
identify data patterns, highlight anomalies, and assess the feasibility of AI model
implementation. The architecture supports robust data preparation and selection of
appropriate models to achieve the project objectives.
2. Data Visualization
Objectives:
Analyze customer data patterns to uncover insights.
Identify anomalies in customer behavior and transactions.
Assess the feasibility and effectiveness of AI models.
Tools and Techniques:
1. Visualization Libraries:
o Matplotlib and Seaborn for exploratory data analysis.
o Power BI for dynamic, interactive dashboards.
2. Key Visualizations:
o Customer Segmentation: Cluster visualizations to analyze customer groups and their characteristics.
o Anomaly Detection: Boxplots and scatterplots highlighting outliers in customer transaction data.
o Churn Prediction Trends: Churn probability distributions displayed across different customer groups.
o Sentiment Analysis Trends: Word clouds and sentiment polarity distributions from social media feedback.
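The sketch below illustrates two of these visuals (an outlier boxplot and a segmentation scatterplot) using Matplotlib and Seaborn. The file name and column names (monthly_spend, tenure_months, segment) are placeholders for illustration and do not represent the actual CRM schema.

```python
# Minimal sketch of the exploratory visuals described above.
# File and column names are illustrative placeholders, not the real CRM schema.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("crm_customers.csv")  # hypothetical extract of the CRM dataset

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Boxplot to surface outliers in transaction value per customer segment
sns.boxplot(data=df, x="segment", y="monthly_spend", ax=axes[0])
axes[0].set_title("Spend distribution and outliers by segment")

# Scatterplot of spend vs. tenure, colored by segment, for cluster review
sns.scatterplot(data=df, x="tenure_months", y="monthly_spend", hue="segment", ax=axes[1])
axes[1].set_title("Customer segments: spend vs. tenure")

plt.tight_layout()
plt.show()
```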
3. Data Preparation Techniques
1. Data Cleaning
Handling Missing Values: In the "Customer Segmentation and Churn
Analysis" notebook, missing values are addressed by imputing them with
appropriate statistics (e.g., mean, median) or by removing records with
significant missing information to maintain data integrity.
Removing Duplicates: The dataset is examined for duplicate records, which
are then removed to prevent redundancy and ensure the accuracy of analyses,
such as clustering and predictive modeling.
Correcting Data Types: Data types of each column are verified and corrected
as necessary to ensure compatibility with machine learning algorithms. For
instance, categorical variables are encoded properly for model training.
Outlier Detection and Treatment: Statistical methods are employed to
identify outliers that may skew the analysis. Depending on the context,
outliers are either transformed, capped, or removed to enhance model
performance.
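A minimal pandas sketch of these cleaning steps is given below; the file name, column names, and the percentile caps are assumptions for illustration and may differ from the actual notebooks.

```python
# Minimal data-cleaning sketch with pandas, following the steps above.
import pandas as pd

df = pd.read_csv("crm_customers.csv")  # hypothetical CRM extract

# 1. Missing values: impute numeric gaps with the median, drop rows missing the key ID
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())
df = df.dropna(subset=["customer_id"])

# 2. Duplicates: keep the first occurrence of each customer record
df = df.drop_duplicates(subset=["customer_id"], keep="first")

# 3. Data types: cast categorical and date fields explicitly before encoding
df["segment"] = df["segment"].astype("category")
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# 4. Outliers: cap monthly_spend at the 1st/99th percentiles (winsorizing)
low, high = df["monthly_spend"].quantile([0.01, 0.99])
df["monthly_spend"] = df["monthly_spend"].clip(lower=low, upper=high)
```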
2. Normalization and Scaling
Feature Scaling: Continuous variables are scaled using techniques like Min-
Max Scaling to bring all features into a similar range, which is crucial for
algorithms sensitive to feature magnitude, such as K-means clustering.
Standardization: Some models benefit from standardization, where features
are rescaled to have a mean of 0 and a standard deviation of 1, ensuring that
each feature contributes equally to the analysis.
Log Transformation: For features with skewed distributions, log
transformation is applied to stabilize variance and make the data more
normally distributed, aiding in meeting the assumptions of various statistical
models.
Normalization of Text Data: In the "Sentiment Analysis" notebook, text data
is normalized by converting to lowercase, removing punctuation, and
eliminating extra whitespace to ensure uniformity before further processing.
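The following scikit-learn sketch shows Min-Max scaling, standardization, and a log transform applied to assumed numeric columns; it is illustrative only.

```python
# Minimal scaling sketch with scikit-learn; column names are illustrative.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv("crm_customers.csv")
features = ["monthly_spend", "tenure_months", "num_purchases"]

# Min-Max scaling to [0, 1] for distance-based algorithms such as K-means
minmax = pd.DataFrame(MinMaxScaler().fit_transform(df[features]),
                      columns=[f + "_minmax" for f in features], index=df.index)

# Standardization (mean 0, std 1) so each feature contributes equally
standard = pd.DataFrame(StandardScaler().fit_transform(df[features]),
                        columns=[f + "_std" for f in features], index=df.index)

df = df.join([minmax, standard])

# Log transform (log1p handles zeros) to reduce right skew in spend
df["monthly_spend_log"] = np.log1p(df["monthly_spend"])
```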
3. Text Data Processing
Tokenization: Text data from tweets is tokenized into individual words or
tokens using the Natural Language Toolkit (NLTK) to facilitate analysis.
Stopword Removal: Commonly used words that do not contribute significant
meaning (e.g., 'the', 'is') are removed from the text data to focus on the more
informative words.
Stemming and Lemmatization: Words are reduced to their root forms using
stemming or lemmatization techniques to treat different forms of a word as a
single entity, enhancing the consistency of the data.
Vectorization: Processed text data is converted into numerical representations
using methods like Term Frequency-Inverse Document Frequency (TF-IDF)
to enable the application of machine learning algorithms.
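A condensed sketch of this text-processing pipeline, using NLTK for tokenization, stopword removal, and lemmatization, and scikit-learn for TF-IDF vectorization, is shown below. The sample tweets are placeholders for the social media feedback data.

```python
# Minimal text-preprocessing sketch; sample tweets are placeholders.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# Download required NLTK resources (punkt_tab is needed on newer NLTK releases)
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = text.lower()                                   # normalize case
    text = re.sub(r"[^a-z\s]", " ", text)                 # drop punctuation/digits
    tokens = word_tokenize(text)                          # tokenize
    tokens = [t for t in tokens if t not in stop_words]   # remove stopwords
    tokens = [lemmatizer.lemmatize(t) for t in tokens]    # lemmatize
    return " ".join(tokens)

tweets = ["The support team was amazing!", "Worst billing experience ever..."]
cleaned = [preprocess(t) for t in tweets]

# TF-IDF vectorization produces the numeric matrix fed to downstream models
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(cleaned)
```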
4. Anomaly Detection
Isolation Forest: An Isolation Forest algorithm is implemented to detect
anomalies in customer behavior data, identifying customers whose purchasing
patterns significantly deviate from the norm.
Statistical Methods: Techniques such as Z-score analysis are used to detect
outliers in numerical features, flagging data points that fall beyond a certain
number of standard deviations from the mean.
Domain-Specific Rules: Business logic is applied to define thresholds for
what constitutes anomalous behavior based on industry standards and
company policies, allowing for the identification of irregular activities.
Visualization of Anomalies: Tools like Matplotlib and Seaborn are used to
create visualizations (e.g., box plots, scatter plots) that highlight anomalies,
making it easier to interpret and communicate findings to stakeholders.
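The sketch below combines an Isolation Forest with a simple Z-score rule to flag anomalous customers; the feature names, the 1% contamination rate, and the 3-sigma threshold are illustrative assumptions rather than project settings.

```python
# Minimal anomaly-detection sketch on already-cleaned numeric features.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("crm_customers.csv")
features = df[["monthly_spend", "num_purchases", "days_since_last_order"]]

# Isolation Forest: -1 marks customers whose behavior deviates strongly from the norm
iso = IsolationForest(contamination=0.01, random_state=42)
df["iforest_flag"] = iso.fit_predict(features)

# Z-score rule: flag spend values more than 3 standard deviations from the mean
z = (df["monthly_spend"] - df["monthly_spend"].mean()) / df["monthly_spend"].std()
df["zscore_flag"] = z.abs() > 3

anomalies = df[(df["iforest_flag"] == -1) | df["zscore_flag"]]
print(f"{len(anomalies)} potentially anomalous customers flagged")
```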
These processes are integral to the project's goal of enhancing CRM systems through AI and
ML, ensuring that the data used is clean, well-prepared, and suitable for modeling and
analysis.
4. AI Model Selection and Justification
Objectives:
Select models that align with project goals for segmentation, churn prediction, and
sentiment analysis.
Selected Models:
1. Customer Segmentation:
o K-Means Clustering: Effective for identifying customer groups based on
spending behavior and engagement.
o DBSCAN (Density-Based Spatial Clustering of Applications with Noise): An
alternative that handles noise and arbitrarily shaped clusters.
2. Churn Prediction:
o Logistic Regression: Simple yet effective for binary classification.
o XGBoost: Robust and efficient for handling large datasets with high accuracy.
3. Sentiment Analysis:
o NLP Techniques:
TF-IDF (Term Frequency-Inverse Document Frequency): For
feature extraction from text.
Support Vector Machines (SVM): For sentiment classification.
o Deep Learning Models:
Pre-trained BERT (Bidirectional Encoder Representations from
Transformers) for context-aware sentiment analysis.
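As an illustration of how two of the selected models could be applied, the sketch below clusters customers with K-Means and trains an XGBoost churn classifier on synthetic stand-in data; the features, labels, and hyperparameters are placeholders, not project results.

```python
# Minimal sketch of K-Means segmentation and XGBoost churn prediction on synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))                          # stand-in for scaled CRM features
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)  # stand-in churn labels

# Customer segmentation: K-Means on the scaled features
segments = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Churn prediction: XGBoost binary classifier evaluated on a held-out split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X_train, y_train)
print("Churn test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```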
Justification:
Scalability: K-Means, Logistic Regression, and XGBoost train efficiently on large
CRM datasets, while BERT is applied by fine-tuning pre-trained weights rather than
training from scratch.
Accuracy: High performance in predictive tasks (churn, sentiment analysis).
Flexibility: Models like XGBoost and BERT can be fine-tuned to meet specific
business needs.
5. Feasibility Assessment
Challenges:
Handling large volumes of unstructured social media data.
Ensuring real-time processing for dynamic visualizations and insights.
Mitigation Strategies:
Leverage distributed storage and processing using IBM Cloud Object Storage and
Watson Studio.
Use optimized libraries like TensorFlow and PyTorch for efficient model training.