AI & Stats Lab Exercises

The document provides information about a programming assignment to compute Pearson's correlation coefficient between student scores in three subjects - Mathematics, Physics, and Chemistry. It includes the aim, input format, constraints, output format, sample test case, and a suggestion to efficiently compute correlations for a large dataset. The code provided calculates the correlations between subject scores and prints the results to three decimal places as specified.

Uploaded by

js9118164
COMPLEX PROBLEMS

Student Name: Harsh Singh


Praval Kumar 23MAI10028
UID: 23MAI10018
Branch: ME-CSE(AIML) Section/Group: 23MAI-1A
Semester: 1st Date of Performance: 6/11/2023
Subject Name: AI LAB Subject Code: 23CSH-621

Aim:
Computing the Correlation
You are given the scores of N students in three different subjects - Mathematics, Physics
and Chemistry; all of which have been graded on a scale of 0 to 100. Your task is to compute
the Pearson product-moment correlation coefficient between the scores of different pairs of
subjects (Mathematics and Physics, Physics and Chemistry, Mathematics and Chemistry)
based on this data. This data is based on the records of the CBSE K-12 Examination - a
national school leaving examination in India, for the year 2013.
Pearson product-moment correlation coefficient
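This is a measure of linear correlation, described well on the Wikipedia page for the Pearson
correlation coefficient. The formula, in brief, is given by:
$$ r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}} $$
where x and y denote the two vectors between which the correlation is to be measured, and n is
their common length.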
Input Format
The first row contains an integer N.
This is followed by N rows containing three tab-space ('\t') separated integers, M P C
corresponding to a candidate's scores in Mathematics, Physics and Chemistry respectively.
Each row corresponds to the scores attained by a unique candidate in these three subjects.
Input Constraints
1 <= N <= 5 x 10^5
0 <= M, P, C <= 100

Output Format
The output should contain three lines, with correlation coefficients computed
and rounded off correct to exactly 2 decimal places.
The first line should contain the correlation coefficient between Mathematics and Physics
scores.
The second line should contain the correlation coefficient between Physics and Chemistry
scores.
The third line should contain the correlation coefficient between Chemistry and Mathematics
scores.
Test Cases
There is one sample test case with scores obtained in Mathematics, Physics and Chemistry by
20 students. The hidden test case contains the scores obtained by all the candidates who
appeared for the examination and took all three tests (Mathematics, Physics and Chemistry).
Think: How can you efficiently compute the correlation coefficients within the given time
constraints, while handling the scores of nearly 400k students?
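One possible answer to the "Think" prompt is to accumulate running sums in a single pass over the rows, so that no per-subject list needs to be traversed repeatedly. The sketch below is only an illustration under that assumption and is not the solution submitted in this report; the names `pearson_from_sums` and `correlations` are hypothetical.

```python
import math


def pearson_from_sums(n, sx, sy, sxx, syy, sxy):
    """Pearson r from the sums n, sum(x), sum(y), sum(x^2), sum(y^2), sum(x*y)."""
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
    return num / den  # raises ZeroDivisionError if a subject has zero variance


def correlations(rows):
    """Return (r_MP, r_PC, r_CM) for an iterable of (M, P, C) score triples."""
    n = 0
    sm = sp = sc = 0        # plain sums per subject
    smm = spp = scc = 0     # sums of squares per subject
    smp = spc = scm = 0     # sums of cross products per pair
    for m, p, c in rows:
        n += 1
        sm += m; sp += p; sc += c
        smm += m * m; spp += p * p; scc += c * c
        smp += m * p; spc += p * c; scm += c * m
    return (
        pearson_from_sums(n, sm, sp, smm, spp, smp),
        pearson_from_sums(n, sp, sc, spp, scc, spc),
        pearson_from_sums(n, sc, sm, scc, smm, scm),
    )


# Tiny made-up example (not the sample test case from the problem).
example = [(1, 2, 3), (2, 4, 1), (3, 6, 2)]
r_mp, r_pc, r_cm = correlations(example)
print("%.2f" % r_mp)
print("%.2f" % r_pc)
print("%.2f" % r_cm)
```

Because the scores are integers, the running sums stay exact in Python, and floating-point rounding only enters at the final division and square roots.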

CODE/PROGRAM:
def pearson_correlation(x, y, n):
    # Sample means of the two score lists
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sum of element-wise products, used in the covariance numerator
    sum_xy = sum(x[i] * y[i] for i in range(n))

    # Sample standard deviations (n - 1 in the denominator)
    Sx = pow(sum(pow(i - mean_x, 2) for i in x) / (n - 1), 0.5)
    Sy = pow(sum(pow(i - mean_y, 2) for i in y) / (n - 1), 0.5)

    corr = (sum_xy - n * mean_x * mean_y) / ((n - 1) * Sx * Sy)
    return corr


n = int(input())
data = [list(map(float, input().split())) for i in range(n)]

# One list per subject; "maths" avoids shadowing the standard math module
maths = [data[i][0] for i in range(n)]
physics = [data[i][1] for i in range(n)]
chem = [data[i][2] for i in range(n)]

print("%.2f" % pearson_correlation(maths, physics, n))
print("%.2f" % pearson_correlation(physics, chem, n))
print("%.2f" % pearson_correlation(chem, maths, n))
SCREEN-SHOT:
Aim:
Multiple Linear Regression: Predicting House Prices
Objective
In this challenge, we practice using multiple linear regression to predict housing prices.

Task
Charlie wants to buy a house. He does a detailed survey of the area where he wants to live, in
which he quantifies, normalizes, and maps the desirable features of houses to values on a
scale of 0 to 1 so the data can be assembled into a table. If Charlie noted F features, each row
contains F space-separated values followed by the house price in dollars per square foot
(making for a total of F+1 columns). If Charlie makes observations about H houses, his observation
table has H rows. This means that the table has a total of (F+1) x H entries.

Unfortunately, he was only able to get the price per square foot for certain houses and thus
needs your help estimating the prices of the rest! Given the feature and pricing data for a set
of houses, help Charlie estimate the price per square foot of the houses for which he has
compiled feature data but no pricing.

Important Observation: The prices per square foot form an approximately linear function for
the features quantified in Charlie's table. For the purposes of prediction, you need to figure
out this linear function.

Recommended Technique: Use a regression-based technique. At this point, you are not
expected to account for bias and variance trade-offs.
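As an illustration of the recommended regression-based technique, here is a minimal sketch of ordinary least squares in pure Python, solving the normal equations by Gauss-Jordan elimination. It is a plain unregularized fit, not the model used in the submitted program, and the function names `fit_ols` and `predict` as well as the sample numbers are made up for this example.

```python
def fit_ols(X, y):
    """Return [intercept, w1, ..., wF] minimizing the squared error."""
    rows = [[1.0] + list(r) for r in X]  # prepend a constant column for the intercept
    k = len(rows[0])
    # Augmented normal equations [A^T A | A^T y]
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]  # zero here means the features are linearly dependent
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def predict(coef, features):
    """Evaluate the fitted linear function on one feature row."""
    return coef[0] + sum(w * x for w, x in zip(coef[1:], features))


# Made-up data generated from price = 100 + 50*x1 + 20*x2.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.25]]
y = [100, 150, 120, 170, 130]
coef = fit_ols(X, y)
print("{:.2f}".format(predict(coef, [0.2, 0.4])))
```

Solving the normal equations directly is adequate for a handful of features; with many or nearly collinear features, a library least-squares routine would be the more robust choice.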

Input Format

The first line contains 2 space-separated integers, F (the number of observed features) and N (the
number of rows/houses for which Charlie has noted both the features and price per square
foot).
The N subsequent lines each contain F+1 space-separated floating-point numbers describing a row
in the table; the first F elements are the noted features for a house, and the very last element is
its price per square foot.
The next line (following the table) contains a single integer, T, denoting the number of houses
for which Charlie noted features but does not know the price per square foot.
The T subsequent lines each contain F space-separated floating-point numbers describing the
features of a house for which pricing is not known.

Constraints

CODE/PROGRAM
import numpy as np
from sklearn.linear_model import Lasso

# f = number of features, n = number of training rows
f, n = map(int, input().split())

X_train = np.empty((n, f))
y_train = np.empty(n)

# Read the training data: f feature values followed by the price
for i in range(n):
    row = list(map(float, input().split()))
    X_train[i] = row[:f]
    y_train[i] = row[-1]

t = int(input())
X_test = np.empty((t, f))

# Read the testing data: f feature values per row, no price
for i in range(t):
    X_test[i] = list(map(float, input().split()))

# Linear model with a small L1 penalty
lasso_model = Lasso(alpha=0.02)
lasso_model.fit(X_train, y_train)

# Predict on the test data
y_pred = lasso_model.predict(X_test)

for val in y_pred:
    print("{:.2f}".format(val))

SCREEN-SHOT
Aim:
Correlation and Regression Lines
Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3
History Scores 10 25 17 11 13 17 20 13 9 15
Compute Karl Pearson’s coefficient of correlation between these scores.

Compute the answer correct to three decimal places.

Output Format

In the text box, using the language of your choice, print the floating point/decimal value
required. Do not leave any leading or trailing spaces.

For example, if your answer is 0.255, in Python you can print it using

print("0.255")
This is NOT the actual answer - just the format in which you should provide your answer.
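As a hand check of the arithmetic, the coefficient can be worked out from the raw sums of the ten score pairs, with x the physics scores and y the history scores. The sums below were computed from the table above:

$$ n = 10,\quad \sum x = 78,\quad \sum y = 150,\quad \sum xy = 1192,\quad \sum x^2 = 714,\quad \sum y^2 = 2468 $$

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} = \frac{11920 - 11700}{\sqrt{1056}\,\sqrt{2180}} = \frac{220}{\sqrt{2302080}} \approx 0.145 $$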

CODE/PROGRAM
import math

line1 = "Physics Scores 15 12 8 8 7 7 7 6 5 3"
line2 = "History Scores 10 25 17 11 13 17 20 13 9 15"

# Skip the two label words and parse the scores
x = [int(e) for e in line1.split()[2:]]
y = [int(e) for e in line2.split()[2:]]
n = len(x)
ex = sum(x) / n
ey = sum(y) / n
# Sums of deviation products and squares; the 1/n factors cancel in the ratio
cov = sum((xi - ex) * (yi - ey) for (xi, yi) in zip(x, y))
stdx = math.sqrt(sum((xi - ex) ** 2 for xi in x))
stdy = math.sqrt(sum((yi - ey) ** 2 for yi in y))
result = cov / (stdx * stdy)
print(f"{result:.3f}")
SCREEN-SHOT
Aim:
Basic Statistics Warmup

You are given an array of N integers separated by spaces, all in one line.

Display the following:

1. Mean (m): The average of all the integers.
2. Median of this array: If the number of integers is odd, the middle element;
   else, the average of the middle two elements.
3. Mode: The element(s) which occur most frequently. If multiple elements satisfy
   this criterion, display the numerically smallest one.
4. Standard Deviation (SD).
   $$ \mathrm{SD} = \sqrt{\frac{(x_1-m)^2 + (x_2-m)^2 + (x_3-m)^2 + \dots + (x_N-m)^2}{N}} $$
   where x_i is the i-th element of the array.
5. Lower and Upper Boundary of the 95% Confidence Interval for the mean,
   separated by a space. This might be a new term to some. However, it is an
   important concept with a simple, formulaic solution. Look it up!

Other than the modal values (which should all be integers) the answers should be in decimal
form till one place of decimal (0.0 format). An error margin of +/- 0.1 will be tolerated for the
standard deviation and the confidence interval boundaries. The mean, mode and median
values should match the expected answers exactly.

Assume that these numbers were sampled from a normal distribution; the sample is a
reasonable representation of the distribution; a user can approximate that the population
standard deviation =~ standard deviation computed for the given points, with the
understanding that assumptions of normality are convenient approximations.

Some relevant YouTube videos: Mean, Median and Mode
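Under the normality assumption stated above, the 95% confidence interval for the mean is commonly taken as m +/- 1.96 * SD / sqrt(N), where 1.96 is the two-sided 95% critical value of the standard normal distribution. A minimal sketch of that formula follows; the function name `confidence_interval_95` and the sample numbers are hypothetical.

```python
def confidence_interval_95(nums):
    """Return (lower, upper) bounds of the 95% confidence interval for the mean."""
    n = len(nums)
    mean = sum(nums) / n
    # Population standard deviation (divide by N), as the task defines SD
    sd = (sum((x - mean) ** 2 for x in nums) / n) ** 0.5
    margin = 1.96 * sd / n ** 0.5
    return mean - margin, mean + margin


# Made-up sample with mean 5 and SD 2.
lower, upper = confidence_interval_95([2, 4, 4, 4, 5, 5, 7, 9])
print("{:.1f} {:.1f}".format(lower, upper))
```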

Input Format

The first line contains the number of integers.


The second line contains space separated integers for which you need to find the mean,
median, mode, standard deviation and confidence interval boundaries.
Constraints

10 <= N <= 2500

0 < xi <= 10^5

Output Format

A total of five lines are required.


Mean (format:0.0) on the first line
Median (format: 0.0) on the second line
Mode(s) (Numerically smallest Integer in case of multiple integers)
Standard Deviation (format:0.0)
Lower and Upper Boundary of Confidence Interval (format: 0.0) with a space between them.

CODE/PROGRAM
# Without using numpy or statistics libraries
import sys
from collections import defaultdict

# First token is N, followed by the N integers
tokens = sys.stdin.read().split()
n = int(tokens[0])
nums = sorted(int(t) for t in tokens[1:n + 1])

mean = sum(nums) / n

# Median: middle element for odd n, else the average of the two middle elements
if n % 2:
    median = nums[n // 2]
else:
    median = (nums[n // 2] + nums[n // 2 - 1]) / 2

# Mode: smallest value among those with the highest frequency
d = defaultdict(int)
for x in nums:
    d[x] += 1
freq = max(d.values())
mode = min(k for k in d if d[k] == freq)

# Population standard deviation (divide by N, as the task specifies)
deviation = (sum((x - mean) ** 2 for x in nums) / n) ** .5

# 95% confidence interval for the mean, using z = 1.96
lowerconfidence = mean - 1.96 * deviation / (n ** .5)
upperconfidence = mean + 1.96 * deviation / (n ** .5)

print('{:.1f}'.format(mean))
print('{:.1f}'.format(median))
print(mode)
print('{:.1f}'.format(deviation))
print('{:.1f} {:.1f}'.format(lowerconfidence, upperconfidence))

SCREEN-SHOT
Aim:
Your Correlation and Regression Lines
Here are the test scores of 10 students in physics and history:
Physics Scores 15 12 8 8 7 7 7 6 5 3
History Scores 10 25 17 11 13 17 20 13 9 15

Compute the slope of the line of regression obtained while treating Physics as the
independent variable. Compute the answer correct to three decimal places.
Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or
trailing spaces. Your answer may look like: 0.255
This is NOT the actual answer - just the format in which you should provide your answer.
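For context, the slope is one half of the regression line y = a + b*x; the intercept follows from the fact that the least-squares line passes through the point of means. The sketch below computes both from the raw sums. The function name `regression_line` is hypothetical, and this is an illustration rather than the program submitted for this exercise.

```python
def regression_line(x, y):
    """Return (slope, intercept) of the least-squares line of y on x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    # The fitted line passes through (mean of x, mean of y)
    intercept = (sy - slope * sx) / n
    return slope, intercept


physics = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
history = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
slope, intercept = regression_line(physics, history)
print("%.3f" % slope)
print("%.3f" % intercept)
```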

CODE/PROGRAM

def get_regr_slope(x, y):
    # Least-squares slope of y on x, computed from raw sums
    xy = [a * b for a, b in zip(x, y)]
    n = len(x)
    x_2 = [a * a for a in x]
    num = n * sum(xy) - sum(x) * sum(y)
    denom = n * sum(x_2) - sum(x) ** 2
    slope = num / denom
    print("%.3f" % slope)


# Physics is the independent variable, history the dependent one
physics = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
history = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
get_regr_slope(physics, history)
SCREEN-SHOT
