Hello, I want you to review my classmate's project and give feedback in 3-4 pages. These are my professor's comments about the peer review:

"You have received a data mining project to review on Canvas. Please use Appendix A (DS) and the CRISP-DM document as guidelines, drawing on the topics we have covered this semester. Your review should include both positive comments and constructive criticism. In these reviews, I will be looking for evidence of your knowledge of the course material applied to the ideas presented in the proposal, reflected as both strengths in the write-up of the proposed project, as well as areas of suggested improvement. Points will be deducted if you merely describe the proposal without offering any constructive criticism (both positive and negative), or if you neglect to address all the sections of the CRISP-DM framework in your review. Your review should be approximately 2-4 pages. The gist of your comments will be returned, anonymously, to the authors of the project. Please don't include your name in the file you submit, though you may use the standard file-naming convention."

Now, these are the lessons we have covered so far from the DS book (DS means Data Science for Business, O'Reilly Media): Course Overview and Introduction; Data Description & Statistical Analysis; Business Intelligence & Visualization; Data Mining Overview; Predictive Modeling; Model Fitting; Avoiding Model Overfitting; Modeling Similarity – Neighbors & Clusters; Evaluating Models; Text, Web & Social Media Analytics; Other Data Science Tasks & Techniques; Data Science & Business Strategy – Evaluating Proposals; Wrapping Up & Looking Toward the Future.

I will send you three files: 1) the project you have to read and give feedback on (Conceptual Model for Predicting Indoor Air Quality), 2) the CRISP-DM file, and 3) two sample reviews for you to look at. This work is worth 100 points.
BIT 5524 Project Deliverable #3
Conceptual Model for Predicting
Indoor Air Quality
Introduction and Problem Statement
The problem our team would like to address is indoor air pollution. There is a lot of data
available regarding air pollution outside of the home, but not as much data to analyze
the air quality inside the home. This isn't a major problem for the United States, but the
air quality in homes abroad is a significant health concern for developing nations. The
primary sources of pollution include fuel-burning combustion appliances, tobacco products,
building materials that off-gas or deteriorate over time, cleaning products, and even
outdoor pollutants that make their way inside the home (Introduction to Indoor Air
Quality). These pollutants are responsible for long- and short-term illnesses.
Short-term illnesses may include symptoms like irritation of the eyes, headaches, and
fatigue. Long-term illnesses may include serious respiratory diseases, heart
disease, and cancer. Our goal is to design a conceptual project to gather data to help
us predict indoor air quality and better understand who is most susceptible to the health
problems caused by indoor air pollution.
The target will not be a specific population; rather, our business processes will outline
the steps needed to produce a predictive model that will help us estimate the air quality
in any home, which we can then apply to various groups or individuals. This project, if
implemented, would help developing nations allocate resources and address indoor air
pollution earlier in the development cycle. This research and analysis would also
benefit those studying the health effects. The impact on one's health depends on the
time spent in a polluted environment and the levels of pollution involved.
We propose using data measuring access to clean fuels, air pollution death rates,
outdoor air quality levels, households using solid fuels for cooking, and other metrics that
we find useful to predict indoor air quality. For the purposes of this project, we are
focusing on two air pollution monitoring stations around Chicago.
Data Understanding and Preparation
Most current air quality forecasting efforts use expensive technologies like air quality
monitoring stations to collect the data. Once the data are collected, it is fairly
straightforward to forecast air quality using linear statistical models, though these models
are mostly applied to outdoor air quality data. In order to predict the ambient air quality, we
first have to identify the target variable. For this project, we are going to tackle air quality
forecasting by using standard classification or regression models to predict the hourly
concentration of air pollutants (e.g., ozone, particulate matter (PM2.5), and sulfur
dioxide).
The air pollutant data come from the EPA. From their databases, we fetched consecutive
hourly measurements of various meteorological variables and pollutants reported by two
air quality monitoring stations and two air pollutant monitoring sites in the Chicago
area [12]. Each record of hourly measurements includes two important variables, PM2.5
and PM10. PM stands for particulate matter (also called particle pollution): the term
for a mixture of solid particles and liquid droplets found in the air. Some particles, such
as dust, dirt, soot, or smoke, are large or dark enough to be seen with the naked eye.
Others are so small they can only be detected using an electron microscope.
Particle pollution includes:
● PM10: inhalable particles, with diameters that are generally 10 micrometers and smaller; and
● PM2.5: fine inhalable particles, with diameters that are generally 2.5 micrometers and smaller.
How small is 2.5 micrometers? Think about a single hair from your head. The average
human hair is about 70 micrometers in diameter, making it roughly 30 times larger than the
largest fine particle. The other variable of interest is ozone (O3), which is harmful
to air quality outside of the ozone layer; EPA regulations help states reduce ozone
levels in their outdoor air. Finally, sulfur dioxide (SO2) belongs to a group of gases
that are harmful air pollutants.
After a detailed analysis of all the variables, we are focusing only on the following air
pollutants as input attributes:
1. PM2.5 – Particulate Matter 2.5
2. O3 – Ozone
3. SO2 – Sulfur Dioxide
4. AQI – Outdoor Air Quality Index
The target variable will be the Indoor Air Quality Index, a binary value that
denotes whether the indoor air quality is good or bad.
Table 1 shows our input and target variables.

Table 1
Inputs (X): PM2.5 (x1), O3 (x2), SO2 (x3)
Target (Y): Indoor AQ Index (y)
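As a sketch of how the binary target could be derived, an AQI reading can be thresholded into "good" or "bad"; the 100-point cutoff below is a hypothetical choice for illustration, not one specified in the proposal.

```python
# Sketch: deriving the binary good/bad indoor-air label from an AQI value.
# The threshold of 100 is an assumed, illustrative cutoff.
def label_air_quality(aqi, threshold=100):
    """Return 1 ("bad") if the AQI exceeds the threshold, else 0 ("good")."""
    return 1 if aqi > threshold else 0

# Example readings labeled with the default threshold.
labels = [label_air_quality(aqi) for aqi in [42, 87, 150, 310]]
```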
For this project, we found that the data available online was not entirely sufficient for a
predictive model. As part of this collection, most of the data will be numeric, which will
be easy to process, but we will also need information on airflow and the type of
circulation present in the home. Below are a few issues we will address when
preparing the collected data:
1. Subjective factors (may face language barriers when answering survey questions)
2. May need to calculate a standard metric to understand the total air flow produced
from various power/sized fans and AC
In addition to these problems, we must ensure the integrity of the data throughout our
analysis as we determine which cases are incomplete and must be scrapped or
adjusted to provide useful information in the model.
We will consolidate all of our data into Microsoft Excel, and use the filter tool and other
searches to protect the integrity of the data throughout the transformation process. Any
data that we don't collect ourselves will also be exported to Excel and examined for
irregularities that may cause an issue.
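The integrity checks described above could also be sketched in pandas rather than Excel; the column names and sample values below are illustrative assumptions, not the team's actual export.

```python
import pandas as pd

# Small in-memory sample standing in for the consolidated Excel export
# (column names PM25, O3, SO2, AQI are assumptions for illustration).
df = pd.DataFrame({
    "PM25": [12.0, None, 55.0, -1.0],
    "O3":   [30.0, 40.0, 70.0, 20.0],
    "SO2":  [5.0,  8.0,  None, 3.0],
    "AQI":  [40,   60,   130,  25],
})

# Cases with any missing reading are incomplete and must be
# scrapped or adjusted before modeling.
incomplete = df[df.isna().any(axis=1)]

# Negative concentrations are irregularities that would cause an issue.
complete = df.dropna()
clean = complete[(complete[["PM25", "O3", "SO2"]] >= 0).all(axis=1)]
```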
Data Modelling
Having identified the input and target attributes, we are going to use a standard
classification model as our data mining solution. We will evaluate the following
classification modeling techniques:
● Decision Tree Model
A decision tree is a tree model in which each branch node represents a
choice between several alternatives and each leaf node represents a decision. It
is a supervised learning technique that maps observations about an item to
conclusions about the item's target value, using entropy to select the ordering of
feature values to consider in the rules that predict PM levels in the air. Since a
decision tree generates decision rules as its model, researchers have used it to
understand which attributes were most influential in predicting the air quality class.
● Support Vector Machine
The support vector machine (SVM) is a technique based on statistical learning
theory, used for regression and time-series prediction, that overcomes many of the
drawbacks of other models and has been reported to show promising results in
similar projects.
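As a sketch of how the two candidate models might be trained and compared, assuming scikit-learn and synthetic stand-in data in place of the EPA measurements (four features mirroring PM2.5, O3, SO2 and the outdoor AQI):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-feature dataset standing in for the real pollutant records.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# Fit both candidate models on the same training partition.
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)
svm = SVC(kernel="linear").fit(X_train, y_train)

# Held-out accuracy for each model.
tree_acc = tree.score(X_test, y_test)
svm_acc = svm.score(X_test, y_test)
```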
The above two are among the most common supervised learning models. In this project, we
focus on refined modeling for predicting weekly air pollutant concentrations on the basis
of historical air pollution data. Our conceptual model emphasizes how to regularize
the model in order to improve its generalization performance and how to learn with
advanced optimization algorithms [18]. This paper describes the
designed classification model and the results achieved by the two classification models
we picked. For the modeling of air quality we used a data set that contains one dependent
variable Y and three independent variables x1, x2 and x3, represented in Table 1.
The prediction models are induced through a process of supervised learning. The
objective is, thus, to make predictions y about instances x of the target problem. For
this, a parametrized function f: x → y is induced. In order to perform supervised
learning, a dataset X of examples, and corresponding values Y, are compiled and used
for model induction. Each row of X is a feature vector x = (x1, x2, …, xn). The features xi
are equivalent to the data variables in a statistical context. The label set, Y ∈ {y1, y2, …,
yn}, specifies the value that each corresponding instance xi takes. In discrete prediction
tasks, the class labels are typically mutually exclusive but do not necessarily have to be.
For continuous value prediction, the value space typically involves real numbers, Y ∈ ℝ,
but can also apply to integers, Y ∈ ℤ.
Model Evaluation
We need a way to choose between models: different model types, tuning parameters,
and features. We plan to use a model evaluation procedure to estimate how well a
model will generalize to out-of-sample data, together with a model evaluation metric
to quantify the model's performance.
Model evaluation procedures
The model evaluation process is a very important aspect of applying learning algorithms to
real-world applications. For a given machine learning algorithm, model selection refers
to choosing a parameterized version of the model based on the training data. The
key is to select a model that will perform well on unseen data in the future. Once a
parametrized model has been selected, the evaluation process provides an estimate of
how the model will perform during future application.
Train/test split. We randomly partitioned the dataset into two parts. For the
classification model, two thirds of the original dataset was allocated to the
training set and the remaining objects were allocated to the testing set. Using the same
objects both to train a model and to estimate its accuracy may produce misleading
estimates due to overfitting.
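The two-thirds/one-third random partition described above can be sketched with a simple index shuffle (100 record indices are used here as stand-ins for the actual rows):

```python
import random

# Randomly partition 100 record indices into a 2/3 training set
# and a 1/3 testing set, as described in the text.
records = list(range(100))
random.seed(0)           # fixed seed so the split is reproducible
random.shuffle(records)

cut = (2 * len(records)) // 3
train, test = records[:cut], records[cut:]
```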
K-fold cross-validation is a statistical method used to estimate the accuracy of
learning models. It is commonly used to compare and select a model for a given
predictive modeling problem because it is easy to understand, easy to implement, and
yields skill estimates that generally have lower bias than other methods.
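A minimal cross-validation sketch, assuming scikit-learn and synthetic stand-in data; k = 5 is an assumed choice of folds, not one stated in the proposal:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the pollutant records.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# 5-fold cross-validated accuracy of a decision tree.
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0))
mean_accuracy = scores.mean()
```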
A fitting graph is another simple analytical tool for gauging the complexity of the tree
induction model and avoiding the problem of overfitting. There may be measurement error in
the variables, the model may be interpreted wrongly, and the approach cannot be used when
the data is very large.
Accuracy is simply the count of correct decisions divided by the total number of decisions:
Accuracy = (TP + TN) / (P + N)
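The accuracy formula can be checked with a small helper; the TP/TN split below is illustrative, chosen so that 78 of 100 decisions are correct (matching the tree accuracy reported later), and is not a figure from the proposal:

```python
# Accuracy = (TP + TN) / (P + N), where P and N are the total counts of
# positive and negative instances.
def accuracy(tp, tn, p, n):
    return (tp + tn) / (p + n)

# Illustrative example: 40 true positives and 38 true negatives
# out of 50 positives and 50 negatives -> 0.78.
acc = accuracy(tp=40, tn=38, p=50, n=50)
```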
The next step in our process is to validate the prediction accuracy of the selected models. We
will validate the results against the measured weight concentrations of each substance in the air.
The evaluation criteria for air quality are given in Table 2. This table will be used to check
whether the model predicts the index correctly for the input variables.
Table 2

Air Quality    Index    SO2 (1 week)   O3 (1 week)   PM2.5 (1 week, ug/m3 LC)
Very Good      Low      0-25           0-33          0-15
Good           Low      25-50          33-65         15-30
Favorable      Medium   50-120         65-120        30-50
Satisfactory   Medium   120-250        120-180       50-70
Bad            High     250-500        180-240       70-150
Very Bad       High     >500           >240          >150
The resulting classifiers will be tested on the training and test sets, and many tests will be
performed. We plan to use a confusion matrix, a convenient tool for analyzing the
performance of a classifier. It is a square matrix that specifies the accuracy of the classifier on
the classification problem; a perfect classifier would have a diagonal confusion matrix (all
off-diagonal values zero). The confusion matrix for each of the algorithms is presented
along with the percentage of correctly classified records. The model will be tested on the
final validated dataset, which consists of a total of 500 training records, with 20% of the
dataset used for validation and another 24 records used for testing.
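A confusion matrix for a three-class air-quality problem could be built as follows, assuming scikit-learn; the label sequences are made up for illustration:

```python
from sklearn.metrics import confusion_matrix

# Illustrative true and predicted labels for a 3-class problem.
true = ["High", "High", "Low", "Med", "Med", "Low"]
pred = ["High", "Med",  "Low", "Med", "High", "Low"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(true, pred, labels=["High", "Low", "Med"])

# A perfect classifier would place all counts on the diagonal.
correct = cm.trace()
```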
In order to conduct the experiments, we evaluated our first model. The decision tree was
constructed under the assumption that all input features have finite domains and that there is
one target function: the classification of the air pollution data into three classes. Every internal
node of the tree contains an input feature, and every leaf of the tree contains a class.
Table 3 (confusion matrix for the decision tree: true class vs. predicted class over the
High, Low and Med classes; the cell values recoverable from the original layout are
27, 10, 7, 26, 25 and 5).
The accuracies of different sized decision trees (single tree, medium tree, complex tree)
are presented below in Table 4.

Table 4
Method         Number of Splits   Accuracy
Single Tree    10                 76.00%
Medium Tree    20                 78.00%
Complex Tree   100                78.00%

It can be seen that the complex and medium trees give the same accuracy, 78.0%, while the
single tree gives 76.0%. The experiments were run as supervised learning, i.e., the algorithm
knows the input and output values in the training and test databases.
The second model that we chose for evaluation is the SVM, also a supervised learning model.
Table 5 shows its confusion matrix.
Table 5
                 Predicted Class
True Class     High   Low   Med
High            20     0     5
Low              0    31     6
Med              5     4    29
This shows that 20 of the 25 High samples are classified correctly, with 5 of them classified
into the Medium class. 31/37 Low-class samples are correctly classified, with 6 of them
classified into the Medium class. And 29/38 Medium-class samples are correctly classified;
9/38 are misclassified, 4 of them into the Low class and 5 into the High class. For the SVM
algorithm we tried different kernel functions to find the highest accuracy. The experiments
led to the conclusion that the SVM's accuracy was highest with the linear kernel function.
Table 6 shows the results for the SVM with different kernel functions.
Table 6
Kernel      Accuracy
Linear      80.0%
Quadratic   77.0%
Cubic       79.0%
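The kernel comparison could be reproduced along these lines, assuming scikit-learn's SVC (where the quadratic and cubic kernels correspond to polynomial kernels of degree 2 and 3) and synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic data standing in for the pollutant records.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Cross-validated accuracy for each kernel choice.
results = {}
for name, params in [("Linear", dict(kernel="linear")),
                     ("Quadratic", dict(kernel="poly", degree=2)),
                     ("Cubic", dict(kernel="poly", degree=3))]:
    results[name] = cross_val_score(SVC(**params), X, y, cv=5).mean()

# Kernel with the highest mean accuracy on this data.
best_kernel = max(results, key=results.get)
```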
Based on the two experiments conducted and the analysis of their results, we can say that
although the decision tree is faster and easier to understand, it is less accurate. The linear
SVM, with a maximum accuracy of 80.0%, is a better model than the decision tree, whose
maximum accuracy was 78.0%.
Model Deployment
Predictive model deployment gives us the opportunity to embed the analytical model in day-to-day
operations and decision-making processes, automating them.
We foresee that the benefits of this model will far outweigh any limited risks associated with
knowing the information produced. This model will help identify when indoor air pollution will be
an issue. This model will accept data that can be gathered from around the world, making it
universally beneficial. We hope this model will be used to help guide resources and improve
the air quality in high-risk areas. It will provide concrete information so policy makers and
donors know which areas require immediate attention. This model will also aid engineers,
planners and developers in making sound decisions in regards to future construction projects
and designs. In addition, the information produced may help spark new ideas to improve indoor
air quality at a lower cost.
We plan to make the model freely accessible and work with various organizations to address
any concerns and provide updates as needed. We may find that the model requires fine-tuning
in certain regions. We encourage independent testers to use our product and provide feedback.
To limit our costs, we will rely on those using the model to provide this information.
We will evaluate the success of the deployment based on the accuracy of the model in the field.
While it is important to us that the model is used to improve the indoor air quality around the
world, we understand that success is not based solely on identifying the problem.
That said, we would still like to survey those who use our model to find out how it
was used and the value it provided. This information would be more descriptive than numeric,
but useful for marketing purposes and future improvements.
References
1. Identifying Problems in the Indoor Environments, EPA.
https://www.epa.gov/indoor-air-quality-iaq/identifying-problems-indoor-environments
2. Indoor Air Pollution, Max Roser et al., 2017.
https://ourworldindata.org/indoor-air-pollution
3. Solid Fuel Use for Household Cooking, Sophie Bonjour et al., 2013.
https://ehp.niehs.nih.gov/1205987/
4. World Development Indicators, The World Bank.
https://datacatalog.worldbank.org/dataset/world-development-indicators
5. Age-Standardized Air Pollution Death Rates, Institute for Health Metrics and
Evaluation (IHME), 2016.
https://datacatalog.worldbank.org/dataset/world-development-indicators
6. Air Data: Air Quality Data Collected at Outdoor Monitors Across the USA, EPA.
https://www.epa.gov/outdoor-air-quality-data
7. Monitoring Ambient Air Quality for Health Impact Assessment.
http://www.euro.who.int/__data/assets/pdf_file/0010/119674/E67902.pdf
8. Big Data Analytics Using Neural Networks.
https://pdfs.semanticscholar.org/fab5/b9077ad7f47335fe5cebc4d4854b59394f55.pdf
9. Multi-Linear Regression Model.
http://www.stat.yale.edu/Courses/1997-98/101/linmult.htm
10. Data Science for Business.
http://shop.oreilly.com/product/0636920028918.do
11. Air Quality Measures on the National Environmental Health Tracking Network, Data.gov.
https://catalog.data.gov/dataset/air-quality-measures-on-the-national-environmental-health-tracking-network
12. Air Data: Air Quality Data Collected at Outdoor Monitors Across the US.
https://www.epa.gov/outdoor-air-quality-data
13. Bruce, Nigel, et al. "Indoor Air Pollution in Developing Countries." WHO, World
Health Organization. www.who.int/bulletin/archives/78(9)1078.pdf
14. Roser, Max, and Hannah Ritchie. "Indoor Air Pollution." Our World in Data, 2018.
ourworldindata.org/indoor-air-pollution

Missar, Paruchuri, Vasudevanpillai, Penumalli – Group 9

15. "Introduction to Indoor Air Quality." EPA, Environmental Protection Agency, 29
Jan. 2018. www.epa.gov/indoor-air-quality-iaq/introduction-indoor-air-quality
16. "Assessment of Exposure to Indoor Air Pollutants." WHO Regional Publications,
European Series, No. 78.
http://apps.who.int/iris/bitstream/handle/10665/107300/euro_series_78.pdf?sequence=1&isAllowed=y
17. "Indoor Carbon Dioxide Concentrations and Sick Building Syndrome Symptoms
in the BASE Study Revisited: Analyses of the 100 Building Dataset." Indoor
Environment Dept., Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
https://www.epa.gov/sites/production/files/2014-08/documents/base_3c2o2.pdf
18. A Systematic Review of Data Mining and Machine Learning for Air Pollution
Epidemiology. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5704396/#CR8
CRISP-DM 1.0
Step-by-step data mining guide
Pete Chapman (NCR), Julian Clinton (SPSS), Randy Kerber (NCR),
Thomas Khabaza (SPSS), Thomas Reinartz (DaimlerChrysler),
Colin Shearer (SPSS) and Rüdiger Wirth (DaimlerChrysler)
SPSS is a registered trademark and the other SPSS products named are trademarks of SPSS Inc. All other names are trademarks of their respective owners. © 2000 SPSS Inc. CRISPMWP-1104
This document describes the CRISP-DM process model and contains information about the CRISP-DM methodology, the
CRISP-DM reference model, the CRISP-DM user guide, and the CRISP-DM reports, as well as an appendix with additional
related information. This document and information herein are the exclusive property of the partners of the CRISP-DM
consortium: NCR Systems Engineering Copenhagen (USA and Denmark), DaimlerChrysler AG (Germany), SPSS Inc. (USA),
and OHRA Verzekeringen en Bank Groep B.V. (The Netherlands).
Copyright © 1999, 2000
All trademarks and service marks mentioned in this document are marks of their respective owners and are as such
acknowledged by the members of the CRISP-DM consortium.
Foreword
CRISP-DM was conceived in late …