COMPLEX PROBLEMS
Student Name: Harsh  Singh
               Praval Kumar                                     23MAI10028
                                                          UID: 23MAI10018
 Branch: ME-CSE(AIML)                                     Section/Group: 23MAI-1A
 Semester: 1st                                            Date of Performance: 6/11/2023
 Subject Name: AI LAB                                      Subject Code: 23CSH-621
Aim:
Computing the Correlation
You are given the scores of N students in three different subjects - Mathematics, Physics
and Chemistry; all of which have been graded on a scale of 0 to 100. Your task is to compute
the Pearson product-moment correlation coefficient between the scores of different pairs of
subjects (Mathematics and Physics, Physics and Chemistry, Mathematics and Chemistry)
based on this data. This data is based on the records of the CBSE K-12 Examination - a
national school leaving examination in India, for the year 2013.
Pearson product-moment correlation coefficient
This is a measure of linear correlation described well on this Wikipedia page. The formula, in
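brief, is given by:
r = \frac{N \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{N \sum x_i^2 - \left(\sum x_i\right)^2} \, \sqrt{N \sum y_i^2 - \left(\sum y_i\right)^2}}
where x and y denote the two vectors between which the correlation is to be measured, and N
is the number of entries in each vector.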
Input Format
The first row contains an integer N.
This is followed by N rows containing three tab-space ('\t') separated integers, M P C
corresponding to a candidate's scores in Mathematics, Physics and Chemistry respectively.
Each row corresponds to the scores attained by a unique candidate in these three subjects.
Input Constraints
1 <= N <= 5 x 10^5
0 <= M, P, C <= 100
Output Format
The output should contain three lines, with correlation coefficients computed
and rounded off correct to exactly 2 decimal places.
The first line should contain the correlation coefficient between Mathematics and Physics
scores.
The second line should contain the correlation coefficient between Physics and Chemistry
scores.
The third line should contain the correlation coefficient between Chemistry and Mathematics
scores.
Test Cases
There is one sample test case with scores obtained in Mathematics, Physics and Chemistry by
20 students. The hidden test case contains the scores obtained by all the candidates who
appeared for the examination and took all three tests (Mathematics, Physics and Chemistry).
Think: How can you efficiently compute the correlation coefficients within the given time
constraints, while handling the scores of nearly 400k students?
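One possible answer to the efficiency question is to keep only running sums, so that each pair of scores is read exactly once and no per-student deviations are stored. The sketch below is an illustration under that assumption, not the program submitted in this report; the function name streaming_pearson and the sample scores are made up for the example.

```python
import math

def streaming_pearson(pairs):
    """Pearson r from running sums, visiting each (x, y) pair exactly once."""
    n = 0
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0
    for x, y in pairs:
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y
    # With integer scores these sums are exact; rounding only enters at the
    # square root and the final division.
    numerator = n * sum_xy - sum_x * sum_y
    var_terms = (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
    if var_terms == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return numerator / math.sqrt(var_terms)

# Hypothetical scores for three students (not taken from the test data).
maths_scores = [50, 70, 90]
physics_scores = [60, 80, 100]
print("%.2f" % streaming_pearson(zip(maths_scores, physics_scores)))
```

Because the function accepts any iterable of pairs, the three subject pairs could be fed from the same parsed rows without building extra lists of deviations.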
CODE/PROGRAM:
def pearson_correlation(x, y, n):
  # Sample means of the two score lists
  mean_x = sum(x) / n
  mean_y = sum(y) / n
  sum_xy = sum(a * b for a, b in zip(x, y))
  # Sample standard deviations (n - 1 in the denominator)
  Sx = (sum((a - mean_x) ** 2 for a in x) / (n - 1)) ** 0.5
  Sy = (sum((b - mean_y) ** 2 for b in y) / (n - 1)) ** 0.5
  # r = (sum(xy) - n * mean_x * mean_y) / ((n - 1) * Sx * Sy)
  corr = (sum_xy - n * mean_x * mean_y) / ((n - 1) * Sx * Sy)
  return corr
n = int(input())
# Each row holds one student's Mathematics, Physics and Chemistry scores
data = [list(map(float, input().split())) for _ in range(n)]
maths = [row[0] for row in data]
physics = [row[1] for row in data]
chem = [row[2] for row in data]
print("%.2f" % pearson_correlation(maths, physics, n))
print("%.2f" % pearson_correlation(physics, chem, n))
print("%.2f" % pearson_correlation(chem, maths, n))
SCREEN-SHOT:
Aim:
Multiple Linear Regression: Predicting House Prices
Objective
In this challenge, we practice using multiple linear regression to predict housing prices.
Check out the Resources tab for helpful videos!
Task
Charlie wants to buy a house. He does a detailed survey of the area where he wants to live, in
which he quantifies, normalizes, and maps the desirable features of houses to values on a
common scale so the data can be assembled into a table. If Charlie noted F features, each row
contains F space-separated values followed by the house price in dollars per square foot
(making for a total of F + 1 columns). If Charlie makes observations about N houses, his
observation table has N rows. This means that the table has a total of (F + 1) x N entries.
Unfortunately, he was only able to get the price per square foot for certain houses and thus
needs your help estimating the prices of the rest! Given the feature and pricing data for a set
of houses, help Charlie estimate the price per square foot of the houses for which he has
compiled feature data but no pricing.
Important Observation: The prices per square foot form an approximately linear function for
the features quantified in Charlie's table. For the purposes of prediction, you need to figure
out this linear function.
Recommended Technique: Use a regression-based technique. At this point, you are not
expected to account for bias and variance trade-offs.
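To make the "linear function" in the observation above concrete, here is a minimal sketch of an ordinary least-squares fit solved through the normal equations, written with the standard library only. It is an unregularized illustration rather than the Lasso model used in the program for this problem; the helper names fit_linear and predict and the toy data are assumptions made for the example.

```python
def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    # Prepend a constant 1.0 to each row so the first weight is the intercept.
    X = [[1.0] + list(row) for row in rows]
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent; no unique fit")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(weights, row):
    """Evaluate intercept + sum(weight_j * feature_j) for one house."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))

# Toy data generated from price = 1 + 2*a + 3*b (made up for illustration).
features = [(0, 0), (1, 0), (0, 1), (1, 1)]
prices = [1, 3, 4, 6]
weights = fit_linear(features, prices)
print("{:.2f}".format(predict(weights, (2, 2))))
```

A regularized model such as Lasso shrinks the weights toward zero, so its predictions can differ slightly from this plain least-squares fit.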
Input Format
The first line contains two space-separated integers, F (the number of observed features) and
N (the number of rows/houses for which Charlie has noted both the features and price per
square foot).
The N subsequent lines each contain F + 1 space-separated floating-point numbers describing
a row in the table; the first F elements are the noted features for a house, and the very last
element is its price per square foot.
The next line (following the table) contains a single integer, T, denoting the number of houses
for which Charlie noted features but does not know the price per square foot.
The T subsequent lines each contain F space-separated floating-point numbers describing the
features of a house for which pricing is not known.
Constraints
CODE/PROGRAM
import numpy as np
from sklearn.linear_model import Lasso
f, n = map(int, input().split())
X_train = np.empty((n, f))
y_train = np.empty(n)
# Read the training data
for i in range(n):
   row = list(map(float, input().split()))
   X_train[i] = row[:f]
   y_train[i] = row[-1]
t = int(input())
X_test = np.empty((t, f))
# Read the testing data
for i in range(t):
   X_test[i] = list(map(float, input().split()))
lasso_model = Lasso(alpha=0.02)
lasso_model.fit(X_train, y_train)
# Predict on the test data
y_pred = lasso_model.predict(X_test)
for val in y_pred:
   print("{:.2f}".format(val))
SCREEN-SHOT
Aim:
Correlation and Regression Lines
Here are the test scores of 10 students in physics and history:
Physics Scores 15 12 8 8 7 7 7 6 5 3
History Scores 10 25 17 11 13 17 20 13 9 15
Compute Karl Pearson’s coefficient of correlation between these scores.
Compute the answer correct to three decimal places.
Output Format
In the text box, using the language of your choice, print the floating point/decimal value
required. Do not leave any leading or trailing spaces.
For example, if your answer is 0.255, in Python you can print it using
print("0.255")
This is NOT the actual answer - just the format in which you should provide your answer.
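As a hand check on the required value, the coefficient can be worked out directly from the ten score pairs listed above. The intermediate sums below were computed from that data; the notation (bars for means) is a choice made for this sketch.

```latex
\begin{aligned}
\bar{x} &= \frac{78}{10} = 7.8, \qquad \bar{y} = \frac{150}{10} = 15 \\
\sum (x_i - \bar{x})(y_i - \bar{y}) &= 1192 - 10 \cdot 7.8 \cdot 15 = 22 \\
\sum (x_i - \bar{x})^2 &= 714 - 10 \cdot 7.8^2 = 105.6 \\
\sum (y_i - \bar{y})^2 &= 2468 - 10 \cdot 15^2 = 218 \\
r &= \frac{22}{\sqrt{105.6 \times 218}} \approx 0.145
\end{aligned}
```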
CODE/PROGRAM
import math
line1 = "Physics Scores 15 12 8 8 7 7 7 6 5 3"
line2 = "History Scores 10 25 17 11 13 17 20 13 9 15"
# Drop the two label words and keep the ten numeric scores
x = [int(e) for e in line1.split()[2:]]
y = [int(e) for e in line2.split()[2:]]
n = len(x)
ex = sum(x)/n
ey = sum(y)/n
# Sums of deviation products and squared deviations; the common 1/n factor cancels in the ratio
cov = sum([(xi-ex)*(yi-ey) for (xi,yi) in zip(x,y)])
stdx = math.sqrt(sum([(xi-ex)**2 for xi in x]))
stdy = math.sqrt(sum([(yi-ey)**2 for yi in y]))
result = cov/(stdx*stdy)
print(f"{result:.3f}")
SCREEN-SHOT
Aim:
Basic Statistics Warmup
You are given an array of N integers separated by spaces, all in one line.
Display the following:
       1. Mean (m): The average of all the integers.
       2. Median of this array: If the number of integers is odd, the middle element;
          otherwise, the average of the middle two elements.
       3. Mode: The element(s) which occur most frequently. If multiple elements satisfy
          this criterion, display the numerically smallest one.
       4. Standard Deviation (SD).
          SD = \sqrt{\frac{(x_1-m)^2 + (x_2-m)^2 + (x_3-m)^2 + \dots + (x_N-m)^2}{N}}
          where x_i is the i-th element of the array
       5. Lower and Upper Boundary of the 95% Confidence Interval for the mean,
          separated by a space. This might be a new term to some. However, it is an
          important concept with a simple, formulaic solution. Look it up!
Other than the modal values (which should all be integers) the answers should be in decimal
form till one place of decimal (0.0 format). An error margin of +/- 0.1 will be tolerated for the
standard deviation and the confidence interval boundaries. The mean, mode and median
values should match the expected answers exactly.
Assume that these numbers were sampled from a normal distribution and that the sample is a
reasonable representation of the distribution. You may then approximate the population
standard deviation by the standard deviation computed for the given points, with the
understanding that assumptions of normality are convenient approximations. Some relevant
YouTube videos:
Mean, Median and Mode
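The confidence interval in item 5 follows from the assumptions just stated: with the population standard deviation approximated by the computed SD, the 95% interval for the mean is m ± 1.96 * SD / sqrt(N). The sketch below illustrates that formula only; the function name mean_confidence_interval and the sample values are made up for the example.

```python
import math

def mean_confidence_interval(values, z=1.96):
    """Return (lower, upper) bounds of the z-based confidence interval for the mean."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation: divide by n, as in the SD formula above.
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    # Standard error of the mean, scaled by the normal quantile z (1.96 for 95%).
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin

# Hypothetical sample (not from the test data).
lower, upper = mean_confidence_interval([2, 4, 4, 4, 5, 5, 7, 9])
print("{:.1f} {:.1f}".format(lower, upper))
```

The value 1.96 is the two-sided 95% quantile of the standard normal distribution, which is why the normality assumption matters here.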
Input Format
The first line contains the number of integers.
The second line contains space separated integers for which you need to find the mean,
median, mode, standard deviation and confidence interval boundaries.
Constraints
10 <= N <= 2500
0 < xi <= 10^5
Output Format
A total of five lines are required.
Mean (format:0.0) on the first line
Median (format: 0.0) on the second line
Mode(s) (Numerically smallest Integer in case of multiple integers)
Standard Deviation (format:0.0)
Lower and Upper Boundary of Confidence Interval (format: 0.0) with a space between them.
CODE/PROGRAM
# Without using numpy or statistics libraries
import sys
from collections import defaultdict

# The first token is the count N; the next N tokens are the integers
tokens = sys.stdin.read().split()
n = int(tokens[0])
nums = [int(t) for t in tokens[1:n + 1]]

mean = sum(nums) / n

# Median: middle element for odd n, average of the two middle elements otherwise
ordered = sorted(nums)
median = ordered[n // 2] if n % 2 else (ordered[n // 2] + ordered[n // 2 - 1]) / 2

# Mode: the smallest value among those with the highest frequency
d = defaultdict(int)
for x in nums:
    d[x] += 1
freq = max(d.values())
mode = min(k for k in d if d[k] == freq)

# Population standard deviation (divide by n, as the problem statement specifies)
deviation = (sum((x - mean) ** 2 for x in nums) / n) ** .5

# 95% confidence interval for the mean, using z = 1.96
margin = 1.96 * deviation / (n ** .5)
lowerconfidence = mean - margin
upperconfidence = mean + margin

print('{:.1f}'.format(mean))
print('{:.1f}'.format(median))
print('{:d}'.format(mode))
print('{:.1f}'.format(deviation))
print('{:.1f} {:.1f}'.format(lowerconfidence, upperconfidence))
SCREEN-SHOT
Aim:
Correlation and Regression Lines
Here are the test scores of 10 students in physics and history:
Physics Scores 15 12 8 8 7 7 7 6 5 3
History Scores 10 25 17 11 13 17 20 13 9 15
Compute the slope of the line of regression obtained while treating Physics as the
independent variable. Compute the answer correct to three decimal places.
Output Format
In the text box, enter the floating point/decimal value required. Do not leave any leading or
trailing spaces. Your answer may look like: 0.255
This is NOT the actual answer - just the format in which you should provide your answer.
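As a hand check, the slope of the regression line of History (y) on Physics (x) can be worked out from the ten score pairs listed above using the same sums that the program computes. The symbol b for the slope is a notational choice made for this sketch.

```latex
\begin{aligned}
n &= 10, \quad \sum x_i = 78, \quad \sum y_i = 150, \quad \sum x_i y_i = 1192, \quad \sum x_i^2 = 714 \\
b &= \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}
   = \frac{11920 - 11700}{7140 - 6084}
   = \frac{220}{1056} \approx 0.208
\end{aligned}
```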
CODE/PROGRAM
def get_regr_slope(x,y):
  xy = [a*b for a, b in zip(x,y)]
  n = len(x)
  x_2 = [a*a for a in x]
  num = n*sum(xy) - sum(x)*sum(y)
  denom = n*sum(x_2) - sum(x)**2
  slope = num/denom
  print("%.3f" % slope)
physics = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
history = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
get_regr_slope(physics, history)
SCREEN-SHOT