Internship Report Final
CHAPTER 1
INTRODUCTION
1.1 Project Description
Breast cancer is the second leading cause of cancer death in women (after lung cancer). About 246,660 new cases of invasive breast cancer were expected to be diagnosed in women in the US during 2016, and about 40,450 deaths were estimated. Breast cancer is a type of cancer that starts in the breast. Cancer starts when cells begin to grow out of control. Breast cancer cells usually form a tumour that can often be seen on an x-ray or felt as a lump. Breast cancer can spread when the cancer cells get into the blood or lymph system and are carried to other parts of the body. The causes of breast cancer include changes and mutations in DNA. There are many different types of breast cancer; common ones include ductal carcinoma in situ (DCIS) and invasive carcinoma. Others, like phyllodes tumours and angiosarcoma, are less common.
The side effects associated with breast cancer and its treatment include fatigue, headaches, pain and numbness (peripheral neuropathy), and bone loss and osteoporosis. There are many algorithms for the classification and prediction of breast cancer outcomes.
Some algorithm models compare the performance of four classifiers: SVM, Logistic Regression, Random Forest and KNN, which are among the most influential data mining algorithms. Breast cancer can be medically detected early during a screening examination through mammography or by a portable cancer diagnostic tool. Cancerous breast tissues change with the progression of the disease, which can be directly linked to cancer staging. The stage of breast cancer (I–IV) describes how far a patient’s cancer has proliferated.
Statistical indicators such as tumour size, lymph node metastasis and distant metastasis are used to determine stages. To prevent the cancer from spreading, patients may have to undergo breast cancer surgery, chemotherapy, radiotherapy and endocrine therapy. The goal of the project is to identify and classify malignant and benign patients, and to determine how to parametrize our classification techniques so as to achieve high accuracy.
We are looking into multiple datasets and at how Machine Learning algorithms can be used to characterize breast cancer, aiming to reduce error rates while maximizing accuracy. A 10-fold cross-validation test, a standard machine learning technique, is run in Jupyter to evaluate and analyse the data in terms of effectiveness and efficiency.
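As an illustration, the following is a minimal sketch of such a 10-fold cross-validation run. It assumes scikit-learn's bundled copy of the Wisconsin data (load_breast_cancer) as a stand-in for the UCI download, with logistic regression as the classifier; none of these choices are prescribed by the report itself.

# Minimal sketch: 10-fold cross-validation on the Wisconsin data.
# Assumptions: scikit-learn's bundled dataset, logistic regression.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale the features and fit the classifier inside each fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# cv=10 carries out the 10-fold cross-validation described above.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print("Mean 10-fold accuracy: %.4f (+/- %.4f)" % (scores.mean(), scores.std()))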
1.1.3 Limitation
This project is restricted to the study of logistic regression for the classification of breast cancer using the Wisconsin Breast Cancer Dataset (WBCD) from the UCI machine learning online repository. The performance of this model is measured using the precision score, recall score and f1-score only.
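As a hedged sketch of how these three metrics could be computed, the snippet below uses scikit-learn's classification_report on a held-out test set; the 80/20 split and the classifier settings are illustrative assumptions, not the report's prescribed configuration.

# Sketch: precision, recall and f1-score for a fitted classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# classification_report prints precision, recall and f1-score per class
# (in scikit-learn's encoding, 0 = malignant and 1 = benign).
print(classification_report(y_test, y_pred, target_names=["malignant", "benign"]))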
Mammography: The most important screening test for breast cancer is the mammogram. A
mammogram is an X-ray of the breast. It can detect breast cancer up to two years before the
tumor can be felt by you or your doctor.
Women age 40–45 or older who are at average risk of breast cancer should have a
mammogram once a year.
Women at high risk should have yearly mammograms along with an MRI starting at age 30.
Some Risk Factors for Breast Cancer
The following are some of the known risk factors for breast cancer. However, most cases of
breast cancer cannot be linked to a specific cause. Talk to your doctor about your specific
risk.
Age: The chance of getting breast cancer increases as women age. Nearly 80 percent of breast
cancers are found in women over the age of 50.
Personal history of breast cancer: A woman who has had breast cancer in one breast is at an
increased risk of developing cancer in her other breast.
Family history of breast cancer: A woman has a higher risk of breast cancer if her mother,
sister or daughter had breast cancer, especially at a young age (before 40). Having other
relatives with breast cancer may also raise the risk.
Genetic factors: Women with certain genetic mutations, including changes to the BRCA1 and
BRCA2 genes, are at higher risk of developing breast cancer during their lifetime. Other gene
changes may raise breast cancer risk as well.
Childbearing and menstrual history: The older a woman is when she has her first child, the
greater her risk of breast cancer. Also at higher risk are:
● Women who menstruate for the first time at an early age (before 12)
● Women who go through menopause late (after age 55)
● Women who’ve never had children
To provide appropriate treatment to patients, symptoms must be studied properly, and an automatic prediction system is required that will classify a tumour as benign or malignant. There are nearly 100 types of cancer affecting the human body, and among them breast cancer is one of the most prevalent. The main risk factors of breast cancer include sex, obesity, lack of physical exercise, alcohol intake, hormonal imbalance during menopause, ionizing radiation, early first menstruation, having children at a later age or not at all, and older age.
This chapter includes the profile of the organization in India and globally, the project under which the internship was done, and the centre at which the project is incubating.
Our story begins in the year 2011. Back then we used to help students develop their projects, making sure that they gained success and learned while they developed. We are proud to treat our clients with the utmost care, to give timely service, and to keep their websites up and running 24x7; we help our clients reach and serve their own clients, and we make sure the sites conform to the applicable guidelines and government requirements.
When we take on your project, we take the stewardship of the project with you in the director’s seat. As stewards of your project, we consider ourselves successful not when we deliver your final product but when the product meets your business objectives, and that is our vision. We believe that this vision can be reached through the following principles:
● Integrity – Honesty in how we deal with our clients, each other and with the world.
● Candor – Be open and upfront in all our conversations. Keep clients updated on the real situation. Deal with situations early; avoid last-minute surprises.
● Service – Seek to empower and enable our clients. Consider ourselves successful not when we deliver our client’s final product but when the product is launched and meets success.
● Kindness – Go the extra mile. Speak the truth with grace. Deliver more than is expected or promised.
● Competence – Benchmark with the best in the business. Try new and better things. Never rest on laurels. Move out of comfort zones. Keep suggesting new things. Seek to know more.
● Growth – Success is a journey, not a destination. Seek to multiply what we have – wealth, skills, influence, and our client’s business.
Mission:
Vtech Coders Dharwad’s mission is to alter the dynamics of the software industry by providing trusted, supportive and quality software development services to clients that view our partnership as a strategic driver for their success.
WHAT WE OFFER
1. Business Application Development
2. Website Re-Design
3. PHP Development
4. Web Hosting
5. SMS Advertising
6. iOS Mobile Application
7. React Native
8. Internship Programs
1. Billing Software – Flexible billing software for your day-to-day business transactions. In business IT, billing software refers to programs that handle the tracking of billable products and services delivered to a customer or set of customers. Some billing software also tracks work hours for billing purposes. These programs automate much of what used to be a time-consuming process of preparing invoices and other documentation.
2. Vshala (School management system) – V Shala cares for the wellbeing of your school along with its history and heritage; the details and heritage of the school are depicted here to make sure you are not behind the rest. Through it, students and parents stay connected to the school.
VTECH CODERS
Address: Behind Reliance Petrol Bunk, Vivekanand Nagar, 4th Main Road Dharwad,
State: Karnataka
PIN Code: 580004
Office: 0836-3550629
Mobile No: 9886631818, 9986352227, 8618218519
Email Id: vtechcoders@gmail.com
CHAPTER 2
LITERATURE SURVEY
This chapter presents some basic concepts and terminologies, such as data mining and classification techniques. Furthermore, a review of previous related work on this research topic is presented. This review is done to learn the techniques other authors have employed for the classification of breast cancer. The review cuts across other machine learning algorithms that have been used for the classification of breast cancer, not only logistic regression. In the review, prediction accuracy is discussed, as well as the techniques used to improve it.
Ahmad et al. [4] compared the performance of decision tree (C4.5), SVM, and ANN. The
dataset used was obtained from the Iranian centre for breast cancer. Simulation results
showed that SVM was the best classifier followed by ANN and decision tree.
Nematzadeh et al. [12] conducted a comparative study on decision tree, NB, NN and SVM
with three different kernel functions as classifiers to classify WPBC and Wisconsin Breast
Cancer (WBC). The experimental result showed that NN (10-fold) had the highest accuracy
of 98.09% in WBC dataset, while SVM-RBF (10-fold) had the highest accuracy of 98.32% in
WPBC dataset.
Hasan and Tahir [16] proposed an ANN classifier using PCA pre-processed data as an optimal tool to improve differentiation between benign and malignant tumors on the WBC dataset. They
employed the three rules of thumb of PCA namely scree test, cumulative variance and Kaiser
Guttman rule as feature selection. The result obtained showed that the method can distinguish
between benign and malignant cases.
Ojha and Goel [18] used different ML algorithms to predict recurrent cases of breast cancer using the Wisconsin Prognostic Breast Cancer (WPBC) dataset. The evaluation produced SVM and decision tree (C5.0) as the best predictors with 81% accuracy, while fuzzy c-means was found to have the lowest accuracy of 37%.
Ghosh et al. [19] diagnosed and analyzed breast cancer disease using two well-known classifiers, Multilayer Perceptron using Back Propagation Neural Network (MLP BPN) and SVM. The experimental results of their work revealed SVM to be the best classifier.
Osareh and Shadgar [20] investigated the issues of breast cancer diagnosis and prognostic
risk evaluation of recrudescence and metastasis using SVM, K-nearest neighbor (KNN) and
probabilistic neural network (PNN). These classifiers were combined with signal-to-noise
ratio (SNR) feature ranking method, sequential forward selection-based (SFS) feature
selection and PCA feature transformation. The SVM-RBF was found to obtain the best
overall accuracies of 98.80%.
Bazazeh and Shubair [21] investigated SVM, random forest (RF) and Bayesian networks
(BN) for breast cancer diagnosis and performed a comparative analysis on them. The WBC
dataset was used as training set to evaluate the performance of the machine learning
classifiers. The experimental results showed that SVM had the best performance in terms of
accuracy, specificity and precision, while RF had the highest probability of correctly
classifying tumors.
Azmi and Cob [22] built a system that can classify breast cancer tumors by employing a neural network with the feed-forward back propagation algorithm. The dataset used in their work was obtained from the UCI machine learning repository. Experimental results revealed that a neural network with a hidden layer of 7 neurons achieved the best accuracy of 96.63% when compared to the others.
Gayathri and Sumathi [23] conducted a comparative study of the relevance vector machine (RVM) with other ML algorithms used for detecting breast cancer. They used linear discriminant analysis (LDA) to reduce features, and the data was classified by the RVM algorithm. The dataset used in this work is the WBC. The accuracy obtained was 96%, with sensitivity and specificity from the simulation results of 98% and 94% respectively.
Nowadays cancer is one of the major diseases affecting both young and adult humans, and its severity affects a vast number of people. To take precautions, people visit hospitals and need to wait their turn to get consulted, which consumes time and money, yet it is necessary to get treatment as early as possible. There is also a risk of losing data that is maintained manually.
In the current situation, women often hesitate to tell another person or a doctor about their disease, and they may not have time to consult a doctor to learn about it. That is why we are creating a model to predict whether the cancer is harmful or not using machine learning models.
• Age is also a major factor for breast cancer. Breast cancer diagnoses in women in the UK have been reported; according to this report, 50% of cases are diagnosed in women aged 50–69.
• 13% of patients were diagnosed below the age of 50, 40% between the ages of 50 and 69, and 47% over the age of 70, which means increasing age is one of the strongest risk factors for breast cancer.
• If breast cancer is detected at the primary stage, then the chances of survival are higher; the chances of survival are greater in the case of a small-tumour diagnosis.
• Another risk factor is gender. The chances of breast cancer in women are high in comparison to men, and it occurs more frequently in Western populations. It has been observed that breast cancer cases can be higher when the disease is hereditary.
• A regular habit of alcohol intake can also be one of the strongest causes. Women diagnosed with invasive breast cancer have an increased risk of developing another breast cancer. Dense breast tissue on mammography is also emerging as a strong risk factor.
• Being overweight or obese increases the chances of breast cancer. Area is also a major factor: a survey of Australian women diagnosed with breast cancer, 30% of whom lived in major cities, found the chances are higher in urban areas in comparison to rural areas.
To realise any system, such as the project “Breast Cancer Classification Using Machine Learning”, the viability of producing it must be analysed. The feasibility is assessed on technical, operational and economic grounds. A feasibility study is an analysis that considers all of a project's relevant factors, including economic, technical, legal and scheduling considerations, to ascertain the likelihood of completing the project successfully.
Whether a project is feasible or not can depend on several factors, including the project's cost and return on investment, meaning whether the project generates enough revenue or sales from consumers. However, a feasibility study isn't only used for projects looking to measure and forecast financial gains. In other words, “feasible” can mean something different depending on the industry and the project's goal. For example, a feasibility study could help determine whether a hospital can generate enough donations and investment dollars to expand and build a new cancer centre.
Python's dynamic typing and interpreted execution environment make it an ideal language for swiftly building and developing apps.
2.3.2 Jupyter
JupyterLab is the latest web-based interactive development environment for notebooks, code,
and data. Its flexible interface allows users to configure and arrange workflows in data
science, scientific computing, computational journalism, and machine learning. A modular
design invites extensions to expand and enrich functionality.
The Jupyter Notebook is the original web application for creating and sharing
computational documents. It offers a simple, streamlined, document-centric experience.
Jupyter Notebook provides:
● Language choice
● Notebook sharing
● Interactive output
● Big data integration
ML makes computer systems learn and act like humans by progressively improving their performance on a specific task [20]. ML algorithms serve many applications, such as classification, feature selection and clustering. Learning a desired function “f” that connects each attribute set “x” to one of the specified class labels “y” is the problem of classification. The target function is also called a classification model. A classification algorithm (also known as a classifier) is a method for creating classification models from a dataset. Each method employs a learning mechanism that identifies a model so as to best fit the training attributes and the related class labels. Such learning mechanisms should both match the input data and correctly predict the corresponding class labels in the testing dataset. As a result, the aim of learning is to generalize, i.e. to handle any dataset in the future.
Regarding the application of ML algorithms in the classification of cancer, ML algorithms effectively distinguish between benign and malignant cases, assisting the physician in making a diagnosis; moreover, for ML classifiers, identifying the right subset of features is critical. Many ML techniques are commonly used in BC classification, progression monitoring, treatment and prediction, such as Support Vector Machines, Decision Trees, Naïve Bayes, k-Nearest Neighbours, Adaptive Boosting and traditional Neural Networks.
The ML algorithms that I have used in this project are briefly explained below:
Data pre-processing is one of the most important data mining tasks; it includes the preparation and transformation of data into a form suitable for the mining procedure. Data pre-processing aims to reduce the data size, find the relations between data, normalize data, remove outliers and extract features from the data. It includes several techniques like data cleaning, integration, transformation and reduction (Alasadi & Bhaya, 2017).
Feature scaling is a technique that is used to normalize the range of independent variables or
features of data. In data pre-processing, it is also known as data normalization and is usually
employed during the data pre-processing step.
Supervised machine learning is the search for algorithms that reason from externally supplied instances to produce general hypotheses, which then make predictions about future instances. In other words, the goal of supervised learning is to build a concise model of the distribution of class labels in terms of predictor features. The resulting classifier is then used to assign class labels to testing instances where the values of the predictor features are known but the value of the class label is unknown.
In supervised learning as shown in Fig. 2.1, the learner is provided with two sets of data, a
training set, and a test set. The idea is for the learner to “learn” from a set of labelled
examples in the training set so that it can identify unlabelled examples in the test set with the
highest possible accuracy.
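A minimal sketch of this labelled-train/held-out-test protocol is given below, assuming scikit-learn and its bundled breast cancer data; the 25% test fraction and the KNN classifier are illustrative choices.

# Sketch: learn from labelled training examples, score on unseen test data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the labelled data as the unseen test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))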
2.3.4 Classification
We use the training dataset to obtain boundary conditions that can be used to determine each target class. Once the boundary conditions are determined, the next task is to predict the target class; this whole process is known as classification. There exist two types of learning on which classification can be based: supervised and unsupervised.
CHAPTER 3
SYSTEM REQUIREMENT SPECIFICATION
A System Requirement Specification (SRS) defines the product behaviour required to meet the needs of all participants (business, users). A standard SRS includes a purpose, a complete description and specific software requirements. In this section, we discuss the needs and challenges we faced in fulfilling the requirements at launch. Functional requirements describe in detail what our system does, while non-functional requirements govern how it operates. To show the importance of each requirement, they are all listed under their categories.
The majority of software systems begin with a client that wishes either to automate an existing manual system or to build a new one. The developer develops the software framework, which the end user can utilise once it is finished. As a result, there are three major stakeholders in a new framework: the client, the consumers and the developer.
The difficulty is that most clients are unfamiliar with software and the software development process, and developers aren't necessarily familiar with the client's problem or technological domain. This creates a communication chasm between the participants in the development effort. One of the most important objectives of a software requirement specification is to bridge this communication gap. The SRS is the means by which the client and user requirements are precisely defined; in reality, the SRS is the foundation of software development. A good SRS should satisfy all stakeholders, which is a difficult task that requires trade-offs and persuasion.
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast
mass. They describe characteristics of the cell nuclei present in the image.
The separating plane in the 3-dimensional space is that described in [K. P. Bennett and O. L. Mangasarian, "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, pp. 23-34].
The Wisconsin Breast Cancer Data (WBCD) comprises 569 samples; the data were obtained via fine needle aspirates of affected tissue from patients' breasts, with visually assessed nuclear features.
The program uses a curve-fitting algorithm, as shown in Fig. 3.2, to compute ten features from each of the cells in the sample; it then calculates the mean value, standard error and "worst" value (mean of the three largest values) of each feature over the image, returning a 30-dimensional real-valued vector (Rodrigues, 2016).
Fig. 3.4 shows the dataset, and Table 3.4 shows the detailed dataset description of the
attributes.
Out of the 569 observations, 357 (or 62.7%) are labeled benign, while the remaining 212 (or 37.3%) are labeled malignant. Later, when we develop a predictive model and test it on unseen data, we should expect to see a similar proportion of labels. Although our dataset has 30 columns excluding the id and diagnosis columns, they are all in fact very closely related, since they all contain information on the same 10 key attributes and differ only in perspective (i.e., the mean, the standard error, and the mean of the three largest values, denoted as "worst").
Features of Dataset:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from centre to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
Dataset   No. of Attributes   No. of Instances   No. of Classes
WBCD      31                  569                2

Features:
Radius       Numeric   1-30
Texture      Numeric   1-30
Perimeter    Numeric   1-30
Area         Numeric   1-30
Smoothness   Numeric   1-30
The data pre-processing technique employed in this work was done to handle the 16 missing values found in the ‘Bare Nuclei’ attribute of the data. To handle this problem, the mean of the non-missing values was calculated, and this calculated mean was then used to fill in the 16 missing values. In this pre-processing stage, we also employed a feature scaling technique to normalize the dataset.
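A sketch of this mean-imputation step is shown below. It assumes the original WBC file has been saved locally as breast-cancer-wisconsin.data (a hypothetical path) and that missing entries are marked with '?', as in the raw UCI file.

# Sketch: fill the 16 missing 'Bare Nuclei' values with the column mean.
import pandas as pd

columns = ["Id", "Clump Thickness", "Uniformity of Cell Size",
           "Uniformity of Cell Shape", "Marginal Adhesion",
           "Single Epithelial Cell Size", "Bare Nuclei",
           "Bland Chromatin", "Normal Nucleoli", "Mitoses", "Class"]
df = pd.read_csv("breast-cancer-wisconsin.data", header=None, names=columns)

# The raw file marks missing values with '?'; coerce them to NaN first.
df["Bare Nuclei"] = pd.to_numeric(df["Bare Nuclei"], errors="coerce")

# Compute the mean of the non-missing values and fill the gaps with it.
mean_value = df["Bare Nuclei"].mean()
df["Bare Nuclei"] = df["Bare Nuclei"].fillna(mean_value)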
Feature scaling
This method is widely used for normalization in many machine learning algorithms
(e.g., support vector machines, logistic regression, and artificial neural networks) (Grus,
2015). The general method of calculation is to determine the distribution mean and
standard deviation for each feature. Next, we subtract the mean from each feature. Then
we divide the values (mean is already subtracted) of each feature by its standard deviation.
x′ = (x − x̄) / σ
where x is the original feature vector, x̄ = average(x) is the mean of that feature vector, and σ is its standard deviation.
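The formula can be checked with a small sketch: the toy matrix below is illustrative, and scikit-learn's StandardScaler is assumed as the library equivalent of the manual calculation.

# Sketch: standardization x' = (x - mean) / std, by hand and via scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# By hand: subtract each column's mean, divide by its standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The equivalent library call used in most ML pipelines.
X_scaled = StandardScaler().fit_transform(X)
assert np.allclose(X_manual, X_scaled)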
3.4.4 Machine learning classifiers
Random Forest Classifier:
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML. It
is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, predicts the final output.
A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees may predict the correct output, while others may not. But
together, all the trees predict the correct output. Therefore, below are two assumptions for a
better Random-forest classifier:
o There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
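A minimal usage sketch of a random forest classifier follows, assuming scikit-learn and its bundled breast cancer data; the choice of 100 trees is illustrative.

# Sketch: random forest on the breast cancer data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# 100 trees, each grown on a bootstrap sample with random feature subsets,
# which keeps the per-tree predictions weakly correlated (assumption 2 above).
rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))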
K-Nearest Neighbor (KNN):
K-Nearest Neighbor (KNN) is a very simple, easy to understand, versatile machine learning algorithm and one of the most widely used. KNN is used in a variety of applications such as finance, healthcare, political science, handwriting detection, image recognition and video recognition. In credit rating, financial institutions predict the credit rating of customers; in loan disbursement, banking institutions predict whether a loan is safe or risky; in political science, potential voters are classified into two classes, will vote or won't vote. The KNN algorithm is used for both classification and regression problems and is based on a feature-similarity approach.
In KNN, K is the number of nearest neighbors, and the number of neighbors is the core deciding factor. K is generally an odd number when the number of classes is 2. When K=1, the algorithm is known as the nearest-neighbor algorithm; this is the simplest case. Suppose P1 is the point whose label needs to be predicted: you first find the one point closest to P1, and the label of that nearest point is then assigned to P1.
More generally, you find the k points closest to P1 and classify P1 by a majority vote of its k neighbors: each neighbor votes for its class, and the class with the most votes is taken as the prediction. To find the closest points, you compute the distance between points using distance measures such as the Euclidean, Hamming, Manhattan and Minkowski distances. KNN has the following basic steps, implemented in the sketch after this list:
1. Calculate distance
2. Find closest neighbors
3. Vote for labels
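These three steps can be written directly as a short from-scratch sketch; the helper name knn_predict, the toy data and the value of k are illustrative assumptions.

# Sketch: the three KNN steps with Euclidean distance and majority voting.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # 1. Calculate the distance from the query point to every training point.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # 2. Find the k closest neighbors.
    nearest = np.argsort(dists)[:k]
    # 3. Vote for labels: the majority class among the neighbors wins.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Example: classify one point against a toy two-class training set.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([4.5, 5.0]), k=3))  # prints 1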
The user will provide the essential details, like radius, texture, smoothness, etc., that are found through fine needle aspiration, which is a type of biopsy procedure. These details are then analysed by the machine learning model, which gives the user the predicted output: whether they have malignant or benign tumour cells.
The functional requirements define what the system and its elements must accomplish, as laid out in the development plan: they define how the software works. Each function is associated with the setup required to start the work, the processing performed to accomplish the task, and the results obtained.
In this part, the details provided by the user are compared against the available dataset to predict the user's cancer type.
The service required of the system is to predict the type of the cancer cell as Malignant (0) or Benign (1).
Non-functional requirements are assessed against the whole system as a business, and the following are determined:
● Non-functional requirements ensure that the software system adheres to all applicable laws and regulations.
● They ensure a positive user experience and software that is simple to use.
CHAPTER 4
DESIGN
4.1 Design Phases
Collection of data
Data processing techniques and processes are numerous. We collected data for Mumbai's real estate properties from various real estate websites. The data has attributes such as location, carpet area, built-up area, age of the property, zip code, etc. We must collect quantitative data that is structured and categorized. Data collection is needed before any kind of machine learning research is carried out. Dataset validity is a must, otherwise there is no point in analyzing the data.
Data pre-processing
Data pre-processing is the process of cleaning our dataset. There might be missing values or outliers in the dataset; these can be handled by data cleaning. If a variable has many missing values, we will drop those values or substitute them with the average value.
Training the model
Since the data is broken down into two modules, a training set and a test set, we must initially train the model. The training set includes the target variable. The decision tree regressor algorithm is applied to the training dataset; the decision tree builds a regression model in the form of a tree structure.
Testing and integrating with UI
The trained model is applied to the test dataset and house prices are predicted. The trained model is then integrated with the front end using Flask in Python.
Then various regression models like Linear Regression, Lasso Regression and Ridge Regression are applied over the dataset in the Jupyter IDE; we then analyse the dataset, removing unnecessary missing values and outliers, to obtain the cleaned dataset.
After the cleaning of the dataset, the regression models analyse the data and predict results based on the user's requirements; the user then obtains the predicted price for their requirements.
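A hedged sketch of the pipeline this chapter describes is given below; the file name cleaned_properties.csv and the column name price are hypothetical placeholders, and the features are assumed to be already numeric.

# Sketch: split the cleaned data, fit a decision tree regressor, predict prices.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("cleaned_properties.csv")   # hypothetical cleaned dataset
X = df.drop(columns=["price"])               # predictor features
y = df["price"]                              # target variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The decision tree builds a regression model in the form of a tree structure.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print("R-squared on the test set:", tree.score(X_test, y_test))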
CHAPTER 5
IMPLEMENTATION
5.1 Data Collection
The statistics were gathered from Bangalore home prices. The data includes many variables such as area type, availability, location, BHK, society, total square feet, bathrooms and balconies.
Outliers badly affect the mean and standard deviation of the dataset and may statistically give erroneous results. They increase the error variance and reduce the power of statistical tests, and if the outliers are non-randomly distributed, they can decrease normality. So we applied various logics, such as business logic and the bathroom feature, to remove the outliers.
Linear Regression
Linear regression models the relationship between the input (X) and the output (Y). It is one of the most well-known and well-understood machine learning algorithms. Simple linear regression, ordinary least squares, gradient descent and regularization are the main linear regression approaches.
There are two basic types of regression techniques, simple linear regression and multiple linear regression; for more complicated data and analysis we use non-linear regression methods like polynomial regression. Simple linear regression uses only one independent variable to create the prediction output of the dependent variable Y, whereas multiple linear regression uses two or more independent variables to create the prediction outcome.
The simple linear regression equation is Y = a + b·X (fitted in the sketch after this list), where:
● Y = dependent variable
● X = independent variable
● a = the intercept
● b = the slope
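A minimal sketch of fitting this equation with scikit-learn follows; the toy data is illustrative, and the fitted intercept_ and coef_ attributes recover a and b.

# Sketch: simple linear regression Y = a + b*X on toy data.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])   # single independent variable
y = np.array([3.1, 4.9, 7.2, 8.8])   # dependent variable

reg = LinearRegression().fit(X, y)
print("intercept a =", reg.intercept_, "slope b =", reg.coef_[0])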
Regression Trees
Regression trees support both continuous and categorical input variables. In our experiments with various machine learning algorithms on the regression problem, the Decision Tree approach provided the lowest loss. The R-squared value for the Decision Tree is 0.998, indicating that it is an excellent model. The Decision Tree was used to complete the web development.
Lasso Regression
The word “LASSO” denotes Least Absolute Shrinkage and Selection Operator. Lasso
regression follows the regularization technique to create prediction. It is given more priority
over the other regression methods because it gives an accurate prediction. Lasso regression
model uses shrinkage technique. In this technique, the data values are shrunk towards a
central point similar to the concept of mean. The lasso regression algorithm suggests a
simple, sparse models (i.e. models with fewer parameters), which is well-suited for models or
data showing high levels of multicollinearity or when we would like to automate certain parts
of model selection, like variable selection or parameter elimination using feature engineering.
The lasso cost function is the residual sum of squares plus a penalty term, RSS + λ·Σ|βj|, where:
● If λ = 0, all the features are considered and the model is equivalent to linear regression, in which only the residual sum of squares is used to build the predictive model.
Ridge Regression
Ridge Regression is another type of regression algorithm in data science and is usually
considered when there is a high correlation between the independent variables or model
parameters. As the value of correlation increases the least square estimates evaluates
unbiased values. But if the collinearity in the dataset is very high, there can be some bias
value. Therefore, we create a bias matrix in the equation of Ridge Regression algorithm. It is
a useful regression method in which the model is less susceptible to overfitting and hence the
model works well even if the dataset is very small.
The ridge cost function is RSS + λ·Σβj², where λ is the penalty variable; λ is denoted by the alpha parameter in the ridge function. Hence, by changing the value of alpha, we control the penalty term: the greater the value of alpha, the higher the penalty, and therefore the more the magnitude of the coefficients is reduced.
Ridge and lasso regression use two different penalty functions for regularisation: ridge regression uses the L2 technique, while lasso regression uses L1. In ridge regression, the penalty is equal to the sum of the squares of the coefficients, while in lasso the penalty is the sum of the absolute values of the coefficients. In lasso regression, the shrinkage towards zero uses an absolute value (L1 penalty) rather than a sum of squares (L2 penalty).
Since in ridge regression the coefficients can't become exactly zero, we either retain all the coefficients or none of them, whereas the lasso regression technique performs both parameter shrinkage and feature selection simultaneously and automatically, because it nulls out the coefficients of collinear features. This makes selecting variables out of the given n variables easier and more accurate when performing lasso regression.
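The contrast between the two penalties can be seen in a small sketch; make_regression and alpha = 1.0 are illustrative assumptions, and lasso is expected (though not guaranteed) to zero out most of the uninformative coefficients.

# Sketch: L1 (lasso) can null coefficients; L2 (ridge) only shrinks them.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10,
                       n_informative=3, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients set to zero:", np.sum(lasso.coef_ == 0))
print("Ridge coefficients set to zero:", np.sum(ridge.coef_ == 0))  # usually 0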
CHAPTER 6
TESTING
6.1 Test Case-1
The application must show the first rows of the dataset.
Expected Output: Display first 5 rows
Actual Output:
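A sketch of how this test case could be exercised is given below; the file name data.csv is a placeholder for the project's dataset.

# Sketch: load the dataset and display its first five rows.
import pandas as pd

df = pd.read_csv("data.csv")
print(df.head())   # expected output: the first 5 rows of the dataset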
CHAPTER 7
RESULT AND ANALYSIS
The graph plotted using the Ridge Regression model gives the best accuracy score for the analysis of the data; the graph is plotted between the actual price and the predicted price.
In the range 0 to 400 there are many houses available to the customer, for which we have plotted the predicted price against the actual price from the customer's perspective.
CHAPTER 8
FUTURE SCOPE
In the future, we are going to present a comparative study of the system's predicted price and the price from real estate websites such as Housing.com for the same user input. Also, to simplify things for the user, we are going to recommend real estate properties to the user based on the predicted price. The current dataset only includes areas of Mumbai; expanding it to other cities and states of India is the future goal. To make the system even more informative and user-friendly, we will be including Gmap. This will show neighborhood amenities such as hospitals and schools within a region of 1 km around the given location. This can also be included in making predictions, since the presence of such factors increases the valuation of real estate property.
References
[1] Lakshmi, B. N., and G. H. Raghunandhan. "A conceptual overview of data mining." 2011
National Conference on Innovations in Emerging Technology. IEEE, 2011.
[2] Manjula, R., et al. "Real estate value prediction using multivariate regression models."
Materials Science and Engineering Conference Series. Vol. 263. No. 4. 2017.
[3] A. Varma et al., “House Price Prediction Using Machine Learning and Neural Networks,” 2018 Second International Conference on Inventive Communication and Computational Technologies, 2018, pp. 1936–1939.
[4] Arietta, Sean M., et al. "City forensics: Using visual elements to predict non-visual city
attributes." IEEE transactions on visualization and computer graphics 20.12 (2014): 2624-
2633.
[5] Yu, H., and J. Wu. "Real estate price prediction with regression and classification CS 229
Autumn 2016 Project Final Report 1–5." (2016).
[6] Li, Li, and Kai-Hsuan Chu. "Prediction of real estate price variation based on economic
parameters." 2017 International Conference on Applied System Innovation (ICASI). IEEE,
2017.
[7] Nihar Bhagat, Ankit Mohokar, Shreyash Mane "House Price Forecasting using Data
Mining" International Journal of Computer Applications,2016.
[8] N. N. Ghosalkar and S. N. Dhage, "Real Estate Value Prediction Using Linear
Regression," 2018 Fourth International Conference on Computing Communication Control
and Automation (ICCUBEA), Pune, India, 2018, pp. 1-5.