1. Functions of the three Python packages (NumPy, Pandas, Matplotlib) - 6 marks
NumPy:
- Array Operations:
Provides support for large multi-dimensional arrays and matrices, along with a large library of high-level
mathematical functions to operate on these arrays.
- Mathematical Functions:
Includes functions for operations like statistical analysis, linear algebra, Fourier transforms, and random
number generation.
- Efficiency:
Optimized for performance, allowing operations on arrays to be performed much faster than with
standard Python lists.
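The three NumPy points above can be illustrated with a minimal sketch. It assumes NumPy is installed and imported under the conventional alias np; the array values are made up for illustration.

```python
import numpy as np

# Array operations: a 2-D array (matrix) and element-wise arithmetic
a = np.array([[1.0, 2.0], [3.0, 4.0]])
doubled = a * 2                 # applied to every element, no explicit loop

# Mathematical functions: statistics and linear algebra
mean_value = a.mean()           # arithmetic mean of all four elements
product = a @ a                 # matrix product of a with itself
inverse = np.linalg.inv(a)      # matrix inverse

# Random number generation with a fixed seed for reproducibility
rng = np.random.default_rng(0)
samples = rng.normal(size=5)

# Efficiency: one vectorized call instead of a Python-level loop
values = np.arange(1_000_000, dtype=np.int64)
total = values.sum()

print(doubled)
print(mean_value, total)
```

The vectorized sum runs in compiled code, which is where most of the speed advantage over a plain Python loop comes from.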
Pandas:
- Data Structures:
Introduces data structures like Series (one-dimensional) and DataFrame (two-dimensional) for efficient
data manipulation and analysis.
- Data Manipulation:
Provides tools for data cleaning, merging, reshaping, and filtering.
- Handling Missing Data:
Includes functions to handle missing data, such as filling or dropping null values.
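A short sketch of the Pandas features listed above, assuming pandas and NumPy are installed. The names, scores, and cities are hypothetical sample data.

```python
import numpy as np
import pandas as pd

# Series: a one-dimensional labelled array
ages = pd.Series([25, 30, 35], index=["Alice", "Bob", "Charlie"])

# DataFrame: a two-dimensional table with named columns
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [88.0, np.nan, 72.0],   # np.nan marks a missing value
})

# Data manipulation: filter rows, then merge with a second table
high = df[df["score"] > 80]
cities = pd.DataFrame({"name": ["Alice", "Bob"], "city": ["Nairobi", "Lagos"]})
merged = df.merge(cities, on="name", how="left")

# Handling missing data: fill with the column mean, or drop the row
filled = df.fillna({"score": df["score"].mean()})
dropped = df.dropna()

print(merged)
print(filled)
```

Whether filling or dropping is appropriate depends on how much data is missing and why; the mean is used here only as a simple default.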
Matplotlib:
- Plotting:
Provides a comprehensive set of tools for creating static, animated, and interactive visualizations in Python.
- Customization:
Allows for extensive customization of plots, including control over line styles, font properties, and more.
- Integration:
Works well with other libraries like NumPy and Pandas, enabling easy plotting of data stored in these
structures.
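The plotting, customization, and integration points can be sketched as follows. This assumes Matplotlib and NumPy are installed; the non-interactive "Agg" backend and the in-memory buffer are choices made here so the example runs without a display or writing a file.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)   # NumPy array passed straight to Matplotlib

fig, ax = plt.subplots()
# Customization: line style, colour, width, and legend label
line, = ax.plot(x, np.sin(x), linestyle="--", color="tab:blue",
                linewidth=2, label="sin(x)")
ax.set_title("Sine wave", fontsize=14)   # font property: title size
ax.set_xlabel("x (radians)")
ax.set_ylabel("sin(x)")
ax.legend()

# Render to an in-memory PNG; a file path such as "sine.png" works the same way
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

A Pandas Series or DataFrame column can be passed to ax.plot in the same way as the NumPy array.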
2. Describe what the following command does - 3 marks
x <- 3 if(x>2) y else y <- 3*x
As written on a single line, this command is a syntax error: R requires the assignment x <- 3 and the
if statement to be separated by a newline or a semicolon. With the two statements separated, and
braces added for readability, it reads:
x <- 3
if (x > 2) {
  y
} else {
  y <- 3 * x
}
In the corrected command:
- x is assigned the value 3.
- The if condition checks whether x is greater than 2. Since x is 3, the condition is true.
- Because the condition is true, the if branch evaluates y and returns its value; nothing is assigned.
If y has not been defined previously, this results in an "object 'y' not found" error.
- The else branch, y <- 3 * x, runs only when x is not greater than 2, so it is skipped here.
3. State and describe five types of data representation in a computer - 5 marks
a. Binary (Machine Code):
The most basic form of data representation, using binary digits (0s and 1s) to represent all types of data.
b. Text (ASCII/Unicode):
Characters are represented using standards like ASCII or Unicode, allowing text data to be
encoded in a binary format.
c. Integer:
Whole numbers represented in binary form, either as signed or unsigned integers.
d. Floating-point:
Numbers with fractional parts, represented using a specific format (like IEEE 754) to encode the
value in binary.
e. Boolean:
Logical data that can be either true or false, often represented as 1 or 0 in binary.
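The five representations can be inspected directly from Python's standard library. This is a small illustrative sketch; the sample values (10, "A", 255, -1, 1.0) are arbitrary.

```python
import struct

# Binary: every value is ultimately a pattern of bits
binary_of_10 = format(10, "08b")                         # '00001010'

# Text: characters map to code points, which are encoded as bytes
ascii_code = ord("A")                                    # 65
utf8_bytes = "\u00e9".encode("utf-8")                    # e-acute takes two bytes in UTF-8

# Integer: unsigned vs signed (two's complement) in one byte
unsigned_byte = (255).to_bytes(1, "big", signed=False)   # b'\xff'
signed_byte = (-1).to_bytes(1, "big", signed=True)       # b'\xff' as well

# Floating-point: IEEE 754 single precision, 4 bytes, big-endian
float_bits = struct.pack(">f", 1.0).hex()                # '3f800000'

# Boolean: True and False behave as the integers 1 and 0
bool_as_int = (int(True), int(False))                    # (1, 0)

print(binary_of_10, ascii_code, utf8_bytes, float_bits, bool_as_int)
```

Note that the same byte, 0xff, means 255 as an unsigned integer and -1 as a signed one: the bits alone do not determine the value, the chosen representation does.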
4. Explain the difference between supervised and unsupervised learning - 4 marks
Supervised Learning:
- Definition: Involves training a model on a labeled dataset, where the correct output is known for each
training example.
- Purpose: Used for tasks like classification and regression where the goal is to predict an output based
on input data.
- Example: Predicting house prices based on features like size, location, and number of rooms.
Unsupervised Learning:
- Definition: Involves training a model on an unlabeled dataset, where the output is not provided, and
the model tries to find patterns or structures in the data.
- Purpose: Used for tasks like clustering and dimensionality reduction.
- Example: Grouping customers into segments based on purchasing behavior.
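The contrast can be sketched with NumPy alone. The data below is hypothetical, and the tiny k-means loop is a simplified stand-in for a library clustering routine, not a production implementation.

```python
import numpy as np

# Supervised: inputs come with known outputs (labels).
# Hypothetical data: house size in square metres -> price in thousands.
sizes = np.array([50.0, 80.0, 100.0, 120.0])
prices = np.array([150.0, 240.0, 300.0, 360.0])      # labels
slope, intercept = np.polyfit(sizes, prices, deg=1)  # least-squares line
predicted_price = slope * 90.0 + intercept           # prediction for an unseen house

# Unsupervised: no labels; the algorithm looks for structure on its own.
# Hypothetical data: monthly spend per customer, grouped by a 1-D k-means with k = 2.
spend = np.array([10.0, 12.0, 11.0, 95.0, 100.0, 98.0])
centres = np.array([spend.min(), spend.max()])       # simple initial guess
for _ in range(10):
    # assign each customer to the nearest centre
    segment = np.argmin(np.abs(spend[:, None] - centres[None, :]), axis=1)
    # move each centre to the mean of its assigned customers
    centres = np.array([spend[segment == k].mean() for k in range(2)])

print(round(float(predicted_price), 1), segment, centres)
```

In the first half the model is told the right answer for every training example; in the second half the segment numbers are discovered from the spend values alone.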
5. Differentiate between overfitting and underfitting in data models - 4 marks
Overfitting:
- Definition: Occurs when a model learns the training data too well, including noise and outliers, leading
to poor performance on unseen data.
- Symptoms: High accuracy on training data but low accuracy on test data.
- Solution: Use techniques like cross-validation, pruning, regularization, and simplifying the model.
Underfitting:
- Definition: Occurs when a model is too simple to capture the underlying patterns in the data, leading
to poor performance on both training and test data.
- Symptoms: Low accuracy on both training and test data.
- Solution: Use a more complex model, add informative features, or reduce regularization.
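Both failure modes can be seen by fitting polynomials of different degrees to the same noisy data. The sketch below assumes a quadratic ground truth and a noise level chosen purely for illustration; degree 1 is expected to underfit and degree 10 to overfit.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a quadratic curve (hypothetical ground truth)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = x**2 + rng.normal(scale=0.1, size=n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 2, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    # store (training error, test error) for this model complexity
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, errors[degree])
```

The straight line should show high error on both sets, while the degree-10 fit should show a very low training error that does not carry over to the test set.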
6. Briefly describe any three problem-solving strategies - 6 marks
a. Divide and Conquer:
- Approach: Break down a large problem into smaller, more manageable sub-problems, solve each
sub-problem individually, and then combine the solutions.
- Example: Sorting algorithms like Merge Sort and Quick Sort.
b. Dynamic Programming:
- Approach: Solve complex problems by breaking them down into simpler overlapping sub-problems
and storing the results of these sub-problems to avoid redundant computations.
- Example: Fibonacci sequence calculation, shortest path algorithms like Bellman-Ford and Floyd-Warshall.
c. Greedy Algorithm:
- Approach: Make a series of choices by selecting the best option available at each step without
reconsidering previous choices.
- Example: Coin change with standard denominations (greedy is not optimal for every coin system),
Kruskal’s algorithm for minimum spanning trees.
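Each of the three strategies can be shown with one small function. These are textbook-style sketches; the function names and the default coin set are chosen here for illustration.

```python
from functools import lru_cache

def merge_sort(items):
    """Divide and conquer: split, sort each half, merge the results."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]

@lru_cache(maxsize=None)
def fib(n):
    """Dynamic programming: cache each sub-result so it is computed once."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def greedy_change(amount, coins=(25, 10, 5, 1)):
    """Greedy: always take the largest coin that still fits.

    Optimal for this coin system, but not for arbitrary denominations.
    """
    used = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            amount -= coin
            used.append(coin)
    return used

print(merge_sort([5, 2, 9, 1]))
print(fib(30))
print(greedy_change(63))
```

Without the cache, fib would recompute the same sub-problems exponentially often, which is exactly the redundancy dynamic programming removes.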
7. Define the following terms - 2 marks
Algorithm:
- Definition: A finite, step-by-step sequence of well-defined instructions for solving a problem, often
expressed in pseudocode or a programming language.
Debugging:
- Definition: The process of identifying, analyzing, and removing errors or bugs in a computer program to
ensure it runs as expected.
8. Write a Python code to create a data frame with appropriate headings from the list - 4 marks
Here's a Python example to create a DataFrame from a list of dictionaries:
import pandas as pd

# List of dictionaries; the keys become the column headings
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]

# Create the DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
9. Environmental data analysis - 16 marks
Preprocessing Steps (5 marks):
a. Handling Missing Data:
Identify missing values and decide whether to fill them (imputation) or remove them. For
instance, using mean/mode for imputation or dropping rows/columns with excessive missing
data.
b. Outlier Detection:
Identify and handle outliers using statistical methods or visualization techniques like box plots.
c. Normalization/Standardization:
Normalize or standardize data to bring different features onto a similar scale, which can
improve the performance of many machine learning algorithms.
d. Encoding Categorical Data:
Convert categorical variables into numerical format using techniques like one-hot encoding.
e. Data Splitting:
Split the dataset into training and testing sets to validate the model's performance on unseen
data.
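The five preprocessing steps can be sketched with pandas as follows. The column names (so2_emissions, pm25, zone) and all readings are hypothetical, and the median fill and 1.5 * IQR rule are just one reasonable choice for each step.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: column names and values are made up for illustration
df = pd.DataFrame({
    "so2_emissions": [10.0, 12.0, np.nan, 11.0, 13.0, 95.0, 12.0, 11.0],
    "pm25": [30.0, 34.0, 33.0, 31.0, 36.0, 35.0, 33.0, 32.0],
    "zone": ["urban", "urban", "rural", "rural", "urban", "urban", "rural", "rural"],
})

# a. Missing data: fill the gap with the column median
df["so2_emissions"] = df["so2_emissions"].fillna(df["so2_emissions"].median())

# b. Outliers: keep rows within 1.5 * IQR of the quartiles (the box-plot rule)
q1, q3 = df["so2_emissions"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["so2_emissions"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[in_range].reset_index(drop=True)

# c. Standardization: zero mean and unit variance per numeric column
numeric = ["so2_emissions", "pm25"]
df[numeric] = (df[numeric] - df[numeric].mean()) / df[numeric].std()

# d. Encoding: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["zone"])

# e. Splitting: first 80% of rows for training, the rest for testing
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]

print(train)
print(test)
```

The steps follow the order listed above for readability. In practice the scaling statistics are usually computed on the training rows only, so that no information from the test set leaks into the model.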
Correlation Analysis (4 marks):
a. Calculate Correlation Coefficients:
Use methods like Pearson, Spearman, or Kendall to calculate correlation coefficients between
industrial emissions and air quality metrics.
b. Visualize Correlation:
Create correlation matrices and heatmaps to visualize the relationships between different
variables.
c. Interpret Results:
Analyze the correlation coefficients to understand the strength and direction of the
relationships.
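A minimal sketch of the three correlation steps, assuming pandas and Matplotlib are installed. The column names and monthly readings are invented for illustration, and the heatmap is drawn with plain Matplotlib rather than a dedicated plotting library.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly readings; names and values are illustrative only
df = pd.DataFrame({
    "nox_emissions": [20.0, 25.0, 30.0, 35.0, 40.0, 45.0],
    "pm25": [31.0, 36.0, 38.0, 45.0, 47.0, 55.0],
    "ozone": [60.0, 58.0, 55.0, 50.0, 49.0, 44.0],
})

# a. Correlation coefficients for every pair of columns
pearson = df.corr(method="pearson")     # linear association
spearman = df.corr(method="spearman")   # rank-based (monotonic) association

# b. Heatmap of the Pearson correlation matrix
fig, ax = plt.subplots()
image = ax.imshow(pearson.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(pearson.columns)))
ax.set_xticklabels(list(pearson.columns), rotation=45)
ax.set_yticks(range(len(pearson.index)))
ax.set_yticklabels(list(pearson.index))
fig.colorbar(image, ax=ax, label="Pearson r")

# c. Interpretation: the sign gives direction, the magnitude gives strength
r = pearson.loc["nox_emissions", "pm25"]
direction = "positive" if r > 0 else "negative"
print(pearson.round(2))
print(f"NOx vs PM2.5: r = {r:.2f} ({direction})")
```

A strong coefficient shows that two variables move together; it does not by itself show that emissions cause the change in air quality.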
Variables Selection (2 marks):
- Industrial Emissions: Key variables might include emissions of specific pollutants like CO2, NOx, SOx.
- Air Quality Metrics: Include variables like PM2.5 levels, ozone levels, and other relevant air quality
indices.
- Reasoning: These variables are chosen because they directly measure the pollutants and air quality
levels which are necessary to assess the impact of industrial emissions.
Time Series Analysis (5 marks):
a. Decomposition: Decompose the time series data into trend, seasonal, and residual components to
understand the underlying patterns.
b. Visualization: Plot time series graphs to visualize trends, seasonal patterns, and anomalies over time.
c. Modeling: Apply time series models like ARIMA, SARIMA, or Exponential Smoothing to model and
forecast air quality trends.
d. Validation: Use time series cross-validation (for example, rolling-origin evaluation, where each test
period follows its training period in time) to assess the model's accuracy.
e. Interpretation: Analyze the results to identify long-term trends, seasonal effects, and potential
impacts of industrial emissions on air quality.
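The decomposition and validation steps can be sketched with pandas alone. The series below is synthetic (a linear trend, a yearly cycle, and random noise), and the forecast is a seasonal-naive baseline used in place of ARIMA or SARIMA so the example has no extra dependencies.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly PM2.5 series: linear trend + yearly cycle + noise
rng = np.random.default_rng(0)
index = pd.date_range("2018-01-01", periods=60, freq="MS")
months = np.arange(60)
pm25 = pd.Series(
    30 + 0.2 * months + 5 * np.sin(2 * np.pi * months / 12)
    + rng.normal(scale=1.0, size=60),
    index=index,
)

# a. Additive decomposition into trend, seasonal, and residual components.
# A 12-month centred rolling mean is a simplification of the classical method.
trend = pm25.rolling(window=12, center=True).mean()
detrended = pm25 - trend
# average detrended value for each calendar month, aligned back to every row
seasonal = detrended.groupby(detrended.index.month).transform("mean")
residual = pm25 - trend - seasonal

# d. Validation: hold out the final 12 months (chronological split, no shuffling)
train, test = pm25.iloc[:-12], pm25.iloc[-12:]
# Seasonal-naive baseline: repeat the last observed year as the forecast
forecast = pd.Series(train.iloc[-12:].to_numpy(), index=test.index)
mae = float((forecast - test).abs().mean())
print(f"Seasonal-naive MAE: {mae:.2f}")
```

A baseline like this gives a reference error that a fitted model such as ARIMA should improve on before its forecasts are trusted.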
10. Discuss the two sources of errors in computational methods - 4 marks
a. Truncation Error:
- Definition: Arises when an infinite process is approximated by a finite one, such as truncating an
infinite series or using a finite number of terms.
- Example: Approximating the value of π using a limited number of terms in its series representation.
b. Round-off Error:
- Definition: Occurs due to the finite precision with which computers represent real numbers, leading to
small discrepancies between the true value and its computer representation.
- Example: When performing arithmetic operations on floating-point numbers, the precision limits of the
hardware can introduce small errors that accumulate over multiple operations.
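Both error sources can be demonstrated with the standard library. The Leibniz series is used here as one possible series for π; it converges slowly, which makes the truncation error easy to see.

```python
import math

def leibniz_pi(terms):
    """Approximate pi with the first `terms` terms of the Leibniz series."""
    return 4.0 * sum((-1) ** k / (2 * k + 1) for k in range(terms))

# Truncation error: shrinks as more terms of the infinite series are kept
error_10 = abs(math.pi - leibniz_pi(10))
error_1000 = abs(math.pi - leibniz_pi(1000))

# Round-off error: 0.1 has no exact binary floating-point representation
sum_is_exact = (0.1 + 0.2 == 0.3)

# The small representation error accumulates over repeated additions
accumulated = 0.0
for _ in range(10):
    accumulated += 0.1

# math.fsum tracks the lost low-order bits and returns the correctly rounded sum
exact_sum = math.fsum([0.1] * 10)

print(error_10, error_1000)
print(sum_is_exact, accumulated, exact_sum)
```

Truncation error is controlled by the method (more terms, smaller step size), while round-off error is controlled by the arithmetic (higher precision or compensated summation).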