
Weather Forecasting Analysis using the Linear Regression Algorithm

Rasika Jayawardena
CB012267 - MSc in IT
Asia Pacific Institute of Information Technology
Colombo, Sri Lanka
cb012267@students.apiit.lk

ABSTRACT

The paper discusses the increasing diversity of information sources and the exponential growth of data volume, particularly in open data initiatives and platforms. It highlights the importance of information visualization of open data (OD) in various fields and sectors, particularly for CSV files, and how it can help users quickly understand the structure and issues of these files.

INTRODUCTION

The exponential growth of heterogeneous data from various sources, including the Internet of Things and user-generated content, presents both challenges and opportunities for businesses and academics. This data, both structured and unstructured, has transformed operations through improved customer service, payments, business models, and online engagement.

BACKGROUND

Weather refers to the daily fluctuations of the atmosphere. Weather data is collected through various observations, such as sea, ground, and radar measurements, and is used to forecast weather through various applications and models. Weather forecasts are crucial for daily decision-making, impacting agriculture, irrigation, and marine trade, and saving lives from accidents. They also affect industries, transportation, disaster management, and energy management.

➢ Agriculture/Food
Weather forecasting and big data analytics can improve agricultural production by predicting soil erosion, overwatering, and drought, enabling farmers to estimate food prices and supermarket chains to control stock more efficiently.
➢ Tourism
Climate conditions are crucial in tourism, as they influence people's choice of destinations for various purposes. Weather forecasts are essential for ensuring safety and convenience, and can also estimate the industry's benefits based on climate change.
➢ Construction
Weather forecasting helps protect construction workers, activities, and resources from climate-related hazards, enabling earlier identification and planning, thereby ensuring worker safety and saving money.
➢ Sports
Weather conditions like rainfall, lightning, and wind significantly impact sports such as sailing and golf, and forecasting helps determine when to cover courts during rain.
➢ Transportation
Weather forecasts help shipping lines predict storm dangers and determine sailing routes, and help avoid flight delays and cancellations worldwide.
➢ Disaster Management
Natural disasters worldwide can be predicted using big data analytics, reducing fatalities and economic damage. The accuracy and lead time vary by disaster type.

RESEARCH METHODOLOGY

The data analysis cycle is a process of transforming raw data into meaningful insights that can inform decision-making. The cycle typically consists of the following steps.

• Defining the question
• Collecting the data
• Cleaning the data
• Analyzing the data
• Sharing the results

DEFINING THE QUESTION
This study examines Sri Lanka's weather data from 2010 to 2023, focusing on the impact of weather on agriculture. Variables like temperature, precipitation, and sunlight hours are crucial. Time-related variables are used to analyze seasonal changes and precipitation patterns. Extreme weather events are recognized using parameters like 'precipitation_sum' and 'temperature_2m_max'. Stakeholders, such as agricultural experts, environmental scientists, and local communities, are consulted to understand the problem and its causes. This helps frame the problem more accurately.

COLLECTING THE DATA

The Sri Lanka Weather Dataset, available on Kaggle, contains comprehensive weather data for 30 major cities in Sri Lanka from 2010 to 2023. Exploratory data analysis (EDA) is performed on the data to check column names, data types, and overall consistency. The dataset is used for better understanding and decision-making.
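
The EDA checks mentioned above (column names, data types, consistency) could be sketched in pandas roughly as follows. This is only an illustration: the inline sample stands in for the Kaggle CSV, and the helper name summarize is hypothetical rather than taken from the study.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the Kaggle CSV file.
SAMPLE_CSV = """time,temperature_2m_max,precipitation_sum,city
2010-01-01,30.0,0.0,Colombo
2010-01-02,29.9,0.1,Colombo
2010-01-01,22.3,4.7,Kandy
"""


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and missing-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
    })


# parse_dates converts the 'time' column to a datetime type while reading.
df = pd.read_csv(io.StringIO(SAMPLE_CSV), parse_dates=["time"])
summary = summarize(df)

print(df.shape)          # number of rows and columns
print(list(df.columns))  # column names
print(summary)           # dtype and missing count per column
```

With the real file, the io.StringIO(...) argument would be replaced by the path to the downloaded CSV.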
An overview of the dataset

S/N | Attribute | Description | Data Type
1 | Time | Observation timestamps. | Date
2 | Weathercode | A numerical code that represents the current weather conditions. | Numeric
3 | Temperature_2m_max | The maximum temperature at 2 meters. | Numeric
4 | Temperature_2m_min | The minimum temperature at 2 meters. | Numeric
5 | Temperature_2m_mean | The mean temperature at 2 meters. | Numeric
6 | Apparent_temperature_max | The maximum apparent temperature, determined by considering factors like wind chill and heat index. | Numeric
7 | Apparent_temperature_min | The minimum apparent temperature, determined by considering factors like wind chill and heat index. | Numeric
8 | Apparent_temperature_mean | The mean apparent temperature, determined by considering factors like wind chill and heat index. | Numeric
9 | Sunrise | Each day's sunrise time. | Datetime
10 | Sunset | Each day's sunset time. | Datetime
11 | Shortwave radiation sum | Observed shortwave radiation. | Numeric
12 | Precipitation sum | Total precipitation. | Numeric
13 | Rain sum | Total rain. | Numeric
14 | Snowfall sum | Total snowfall. | Numeric
15 | Precipitation hours | Hours with precipitation. | Numeric
16 | Windspeed_10m_max | Maximum wind speed at 10 meters above ground. | Numeric
17 | Windgusts_10m_max | Maximum wind gust at 10 meters above ground. | Numeric
18 | Winddirection_10m_dominant | Dominant wind direction at 10 meters above ground. | Numeric
19 | et0_fao_evapotranspiration | ET0 based on the Penman-Monteith equation provided by FAO. | Numeric
20 | latitude | City latitude. | Numeric
21 | longitude | City longitude. | Numeric
22 | elevation | City elevation. | Numeric
23 | Country | Country names associated with each weather observation. | String
24 | City | Observation cities' names. | String

CLEANING THE DATA
Data errors, duplicates, or mislabeled items from multiple sources can negatively impact outcomes and algorithms. Clean data preparation involves removing garbage, incorrect, duplicate, corrupt, or incomplete data from a dataset. Cleaner data sets are essential for the analytical process and information science, making data analytical tools and business intelligence more effective and user-friendly.

➢ Error-Free Data
Data cleaning is a crucial process that removes errors and garbage values from data, enhancing analysis efficiency and saving time. Inaccurate data can lead to mistakes, and it is easier to fix incorrect or corrupt data when errors are monitored and proper reporting is done.
➢ Data Quality
The analysis of weather data can be improved by addressing potential data quality issues such as missing values, incorrect date/time formats, temperature and precipitation outliers, inconsistent unit values, invalid wind direction values, and errors in geographical coordinates and elevation.
➢ Accurate and Efficient
The weather dataset, including time, temperature, precipitation, wind, and geographic coordinates, was validated, cleaned, and standardized to address issues like inconsistent temperature values, outliers, and data entry errors, enhancing its reliability for climate research and agriculture planning.
➢ Complete Data
The complete weather dataset includes essential fields like time, weather code, temperature, sunrise, sunset, precipitation sums, rain sums, snowfall sums, and more. It has been cleaned and validated to address missing values, inconsistent temperature values, outliers, and data entry errors. This refined dataset is a useful resource for climate research, agriculture planning, and energy management.
➢ Maintain Data Consistency
Consistency in data can be measured by comparing values within the same dataset or across multiple datasets. The weather dataset, which includes fields like time, weather code, temperature, sunrise, sunset, and more, requires consistency across its fields. The dataset uses uniform units, date/time formats, and measurement methods to maintain high accuracy, improving its use in climate research, agriculture planning, and energy management.

DATA CLEANING CYCLE
Data cleaning involves examining untidy raw data, identifying errors and missing values, and correcting or filling them in. The techniques vary depending on the dataset type.
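
As a rough illustration of such a cleaning pass on weather data, the sketch below removes duplicates, drops rows with unparseable dates, fills missing numeric values with the column median, and clips outliers with the IQR rule. The function name clean_weather, the sample values, and the choice of median filling and clipping are assumptions made for illustration. The paper does not state which of these options it used.

```python
import numpy as np
import pandas as pd


def clean_weather(df: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    """Drop duplicates and bad dates, fill gaps, and clip IQR outliers."""
    out = df.drop_duplicates().copy()

    # Unparseable timestamps become NaT and are then dropped.
    out["time"] = pd.to_datetime(out["time"], errors="coerce")
    out = out.dropna(subset=["time"])

    for col in numeric_cols:
        # Assumption: missing readings are filled with the column median.
        out[col] = out[col].fillna(out[col].median())

        # IQR rule: values beyond 1.5 * IQR from the quartiles are clipped.
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    return out.reset_index(drop=True)


# Hypothetical raw sample: one duplicate row, one missing value,
# one implausible temperature, and one invalid date.
raw = pd.DataFrame({
    "time": ["2010-01-01", "2010-01-02", "2010-01-02",
             "2010-01-03", "2010-01-04", "not a date"],
    "temperature_2m_max": [30.0, 29.5, 29.5, np.nan, 95.0, 30.2],
})

cleaned = clean_weather(raw, ["temperature_2m_max"])
print(cleaned)
```

Clipping keeps every row but caps extreme values. Dropping the outlying rows instead is an equally common choice when the extreme readings are believed to be data entry errors.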
To clean CSV data with Python, follow these steps:

1. Importing the necessary libraries: Load the Pandas library with "import pandas as pd". Depending on your cleaning needs, you might also need to import NumPy and Matplotlib.

2. Reading the CSV file: Pandas' read_csv() function reads a CSV file, taking parameters such as the file path, delimiter, and column names, and returns a DataFrame object for data manipulation and cleaning.

3. Checking for missing values: Use isnull() to create a Boolean mask of missing values and sum() to count them per column, then drop them with dropna() or fill them with fillna().

4. Removing duplicate rows: The function drop_duplicates() removes duplicate rows from a DataFrame. Its 'subset' parameter restricts the duplicate check to specific columns, and 'keep' controls which occurrence is retained (for example, only the first).

5. Handling outliers: The describe() function summarizes the numerical columns of a DataFrame, which helps to spot suspicious values. Using the Z-score method or the Interquartile Range (IQR) method, outliers in specific fields like temperature, precipitation, and wind speed can be mitigated. Filtering out extreme values in temperature and precipitation gives more reliable analyses for climate research, agriculture planning, and energy management.

6. Data transformation: To optimize data analysis, it may be necessary to convert data types, normalize data, or create new columns based on existing data, such as date columns.

7. Saving the cleaned data: The Pandas to_csv() function saves the cleaned data as a new CSV file, taking parameters like the file path, index, and header.

8. Data validation: After cleaning, it's crucial to validate the data by comparing it with the original or external data sources to confirm that it still contains the necessary information.

VISUALIZING THE DATA
In order to gain a better understanding of the data, plots are used to visualize it.

The Sri Lanka diagram below shows the cities that are available in the data set.

[Figure: Analyse Temperature, Precipitation and Wind speed in Sri Lanka from 2010 to 2023]
[Figure: Analyse Temperature, Precipitation and Wind speed in Sri Lanka Monthly]
[Figure: Analyse Temperature, Precipitation and Wind speed in Sri Lanka Annually]
[Figure: Analyse Temperature, Precipitation and Wind Speed for cities in Sri Lanka]
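
The monthly, annual, and per-city views named in these figure titles could be derived with pandas group-by aggregations before plotting. The sketch below is one possible way to do this, not the code behind the original figures. The sample values are made up, and the lowercase column names are assumed to match the CSV.

```python
import pandas as pd

# Hypothetical sample with the three variables named in the figure titles.
df = pd.DataFrame({
    "time": pd.to_datetime(
        ["2010-01-15", "2010-02-15", "2011-01-15", "2011-02-15"]
    ),
    "city": ["Colombo", "Colombo", "Kandy", "Kandy"],
    "temperature_2m_max": [30.0, 31.0, 24.0, 26.0],
    "precipitation_sum": [2.0, 0.0, 5.0, 1.0],
    "windspeed_10m_max": [12.0, 10.0, 8.0, 9.0],
})
cols = ["temperature_2m_max", "precipitation_sum", "windspeed_10m_max"]

# Mean of each variable per calendar month, per year, and per city.
monthly = df.groupby(df["time"].dt.month)[cols].mean()
annual = df.groupby(df["time"].dt.year)[cols].mean()
by_city = df.groupby("city")[cols].mean()

print(monthly)
print(annual)
print(by_city)
```

Each resulting table can be passed to a plotting call such as DataFrame.plot() to produce the corresponding chart. For precipitation, a sum may be more informative than a mean, depending on the question being asked.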

[Figure: Analyse Temperature, Precipitation in Seasons]

Using minimum and maximum temperatures, precipitation, and wind speed to analyze weather conditions can provide numerous benefits to agriculture, sports, tourism, and travel. Weather data is crucial to crop management, pest and disease control, and soil moisture management in agriculture. In addition to monitoring wind speed, farmers use temperature and precipitation patterns to optimize planting, irrigation, and harvesting schedules. A farmer can also adjust pest management strategies based on weather information in order to predict and manage pest and disease outbreaks. Using precipitation data, farmers are able to plan irrigation schedules and implement better water conservation strategies.
During outdoor events, sports organizers can implement safety precautions based on weather data. A ball's bounce and movement are affected by pitch hardness and moisture content. Rain and precipitation can dampen pitches, favoring spin bowling. A bowler's swing and movement are affected by wind speed, which alters the trajectory of the ball. Weather conditions can affect player performance, with high temperatures causing fatigue and dehydration, and precipitation making conditions slippery. Wind speed influences ball behavior for bowlers and batsmen alike. Teams adapt their strategies to the weather, such as using swing bowlers when the conditions are swing-friendly. Weather interruptions can change match dynamics, affecting schedules and outcomes.

Tourism destinations and outdoor activities are heavily influenced by weather conditions. Travelers prioritize destinations with favorable weather conditions for specific activities, and tourism industry professionals use weather data to promote such destinations. The weather forecast helps travelers plan outdoor activities such as hiking, sightseeing, or beach trips, ensuring they are prepared appropriately and have a pleasant experience. A traveler's knowledge of precipitation and wind speed is particularly valuable for optimizing their travel experiences.

There is a good correlation between maximum temperature and apparent maximum temperature.

The graph below shows the relationship between temperature_2m_max and precipitation sum. One might expect a very high correlation between them, but as the graph shows, there is no such correlation.

Weather Prediction using Regression
Based on regression modeling, I made predictions for three scenarios.

1. X = df['temperature_max'].values.reshape(-1, 1)
   y = df['evapotranspiration'].values.reshape(-1, 1)

2. X = df['temperature_max'].values.reshape(-1, 1)
   y = df['apparent_temperature'].values.reshape(-1, 1)

3. X = df['temperature_2m_max'].values.reshape(-1, 1)
   y = df['target'].values.reshape(-1, 1)

A regression model's R-squared indicates how well it fits the observed data. It is the proportion of variance in the dependent variable that can be explained by the independent variable.

In the first scenario, the R-squared value is 0.5033482483253012. This value is the coefficient of determination (R-squared) for a linear regression model evaluated on the given X_test and y_test. Based on the features in X_test, approximately 50.33% of the variance in y_test can be explained by the linear regression model.
In the second scenario, the R-squared value is 0.7342544181383424. This model provides a better fit than the previous one, since it explains more of the variability in the target variable. Here, a linear regression model with a different target variable can explain approximately 73.43% of the variance in y_test.
In the third scenario, the R-squared value is 0.8167653460930548. The model explains about 81.68% of the variability in the target variable, suggesting a strong fit.
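
To make the percentages above concrete, R-squared can be computed directly from its definition, one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up values and a hypothetical helper named r_squared. Scikit-learn's r2_score computes the same quantity, though the paper does not say which function produced its figures.

```python
import numpy as np


def r_squared(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return float(1.0 - ss_res / ss_tot)


# Hypothetical test targets, for illustration only.
y_test = [1.0, 2.0, 3.0, 4.0]

# A model that predicts every value exactly explains all the variance.
perfect = r_squared(y_test, y_test)

# A model that always predicts the mean explains none of it.
mean_only = r_squared(y_test, [2.5, 2.5, 2.5, 2.5])

print(perfect, mean_only)
```

On this scale, the reported values of roughly 0.50, 0.73, and 0.82 lie between the mean-only baseline and a perfect fit.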
A linear regression model's intercept represents the predicted value of the dependent variable (y) when all independent variables (X) are zero. The intercept values obtained can be interpreted as follows:

The intercept value -6.10439771 is the linear regression model's y-intercept. The model predicts a target value of approximately -6.10 when all independent variables (features) are zero. Therefore, even when all features (e.g., temperature, humidity, etc.) are zero, the model still predicts a negative value.

In the same way, the intercept value -0.75067964 is the y-intercept for another linear regression model. When all features are zero, the model predicts a target value of approximately -0.75. This indicates that the model has a baseline prediction even without any feature input.

In the remaining model, when all features are zero, the model predicts a target value of approximately 2.69. This positive baseline prediction indicates that the model expects a non-zero value even when the feature inputs are zero.
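
The role of the intercept can be checked on a small example. The sketch below fits a straight line with NumPy's polyfit to made-up data generated from y = 2x - 6 and confirms that the intercept equals the prediction at x = 0. The data and variable names are hypothetical. If scikit-learn's LinearRegression is used instead, the same quantity is exposed as the intercept_ attribute.

```python
import numpy as np

# Hypothetical data generated from y = 2x - 6, for illustration only.
X = np.array([20.0, 25.0, 30.0, 35.0]).reshape(-1, 1)
y = 2.0 * X - 6.0

# polyfit returns coefficients from the highest degree down: slope, intercept.
slope, intercept = np.polyfit(X.ravel(), y.ravel(), deg=1)

# The intercept is exactly what the fitted line predicts at x = 0.
prediction_at_zero = slope * 0.0 + intercept

print(round(slope, 3), round(intercept, 3), round(prediction_at_zero, 3))
```

Note that x = 0 lies far outside the sampled range of 20 to 35 in this example. For temperature data of this kind, a negative intercept is therefore an extrapolation and not a prediction the model was ever fitted to make.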
PLOT DIAGRAMS OF THREE SCENARIOS

CONCLUSION

Weather forecasting is among the most scientifically and technologically challenging problems in the world. To predict weather conditions, two things need to be done correctly: collecting data from the meteorological department and selecting the right data mining methods. The accuracy of the model and the timeliness of its output are the two most important aspects of weather prediction. Despite the complex nature of the weather forecasting domain, it is feasible to use data mining techniques to obtain reasonably accurate results. Weather prediction is improved by applying more than one data mining technique in parallel. By combining several forecasting and data mining techniques, we attempt to forecast different weather conditions. The proposed linear regression models achieved an R-squared of up to about 0.82 while using only a few of the many available parameters.
