<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Machine Learning | Nick Analytics</title>
	<atom:link href="https://www.nickanalytics.com/category/machine-learning/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.nickanalytics.com</link>
	<description></description>
	<lastBuildDate>Mon, 03 Jun 2024 19:14:25 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.1</generator>

<image>
	<url>https://www.nickanalytics.com/wp-content/uploads/2024/03/cropped-mini-logo-wordpress-32x32.jpg</url>
	<title>Machine Learning | Nick Analytics</title>
	<link>https://www.nickanalytics.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>I created a Predictive Energy Model</title>
		<link>https://www.nickanalytics.com/predict-electricity-consumption-and-production/</link>
		
		<dc:creator><![CDATA[Nick]]></dc:creator>
		<pubDate>Mon, 06 May 2024 18:40:14 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://nickanalytics.com/?p=683</guid>

					<description><![CDATA[I'm going to delve into the world of predictive energy modeling using the Enefit Energy dataset. This was one of the most interesting and challenging projects I've done so far. The goal was to predict energy consumption and production for Estonia, with predictions made hourly for the next 2 days.]]></description>
										<content:encoded><![CDATA[
<div class="et_pb_section et_pb_section_0 et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_0">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_0  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module et_pb_text et_pb_text_0  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h1><strong>Predicting Electricity Consumption and Production</strong></h1>
<p>Welcome to another blog post about my data analytics journey! Today, I&#8217;m going to delve into the world of predictive energy modeling using the <strong>Enefit Energy dataset</strong> from Kaggle. This Kaggle competition was one of the most interesting and challenging I&#8217;ve done so far. The goal was to predict energy consumption and production for the country of Estonia, with predictions made on an hourly basis for the next 2 days. Many variables were available for building a well-working model, such as the weather forecast, installed solar panel capacity, and historical consumption.</p>
<p>So, if you are ready, let&#8217;s explore the powerful world of energy!</p>
<p>&nbsp;</p>
<h2>Introduction</h2>
<p>It has become increasingly important for energy companies to predict energy <strong>consumption</strong> and <strong>production</strong> on any given day. Here are several key reasons:</p>
<p><strong>1. Balancing Supply and Demand:</strong> <span>Energy supply must match demand to keep the system reliable and avoid outages. Accurate predictions help manage the generation and supply of energy: during peak demand, additional resources are activated, while during low demand, resources can be scaled back to save costs.</span></p>
<p><strong>2. Integrating Renewable Energy Sources:</strong> As more renewable energy sources like wind and solar are integrated into the energy grid, predicting energy production becomes more complex. Accurate forecasting helps in planning the necessary backup from more controllable power sources like natural gas or hydroelectric power.</p>
<p><strong>3. Grid Stability and Reliability:</strong> Predicting consumption patterns and production levels is crucial for maintaining grid stability. Sudden changes in energy demand or supply can lead to grid instability and even failures. Predictive analytics help reduce these risks by providing advanced warnings and allowing for proactive adjustments.</p>
<p><strong>4. Operational Efficiency:</strong> By predicting when energy usage will be high or low, companies can optimize their operations, reduce waste, and lower costs. This includes more strategic purchasing of fuel and better maintenance scheduling.</p>
<p><strong>5. Economic Planning:</strong> For energy companies, being able to forecast energy trends accurately is crucial for economic planning and investment decisions. This includes deciding where and when to build new infrastructure or expand existing capabilities.</p>
<p><strong>6. Market Pricing:</strong> Energy pricing can fluctuate based on supply and demand dynamics. Accurate predictions allow companies to optimize their pricing strategies, potentially leading to better profitability or market share.</p>
<p>Overall, the ability to predict energy consumption and production with high accuracy allows companies to respond better to market demands, integrate renewable energy sources more effectively, maintain grid stability, and optimize economic outcomes.</p>
<p>&nbsp;</p>
<h2>Exploring the Dataset</h2>
<p>The dataset consists of several .csv files like <strong>electricity &amp; gas prices</strong>, <strong>clients</strong>, <strong>weather forecast</strong> (3.5 million records), <strong>historical</strong> <strong>weather</strong>, regions with <strong>weather stations</strong> and a <strong>training file</strong> (over 2 million records).</p>
<p>The <strong>forecast weather</strong> file is a history of weather forecasts over the last 1.5 years. It contains 3.5 million records covering 112 GPS areas, each with 1 prediction per hour, for 1, 2, 3, &#8230; 48 hours ahead. Elements recorded are:</p>
<p>latitude,longitude,origin_datetime,hours_ahead,temperature,dewpoint,cloudcover_high,cloudcover_low,cloudcover_mid,cloudcover_total,10_metre_u_wind_component,10_metre_v_wind_component,data_block_id,forecast_datetime,direct_solar_radiation,surface_solar_radiation_downwards,snowfall,total_precipitation</p>
<p><strong>Historical weather</strong> has the same type of information, but holds the <strong>actual weather</strong> and not the forecast.</p>
<p>Another crucial source of information is the <strong>client file</strong>. I won&#8217;t go into too much detail, but it became very important to categorize customers based on product (contract) type, county, whether the client is a business or not, and the available solar power capacity.</p>
<p>During this exploration phase I did some checks to see if I understood the data correctly and if there are any obvious pitfalls I could detect right from the start.</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_0">
				
				
				
				
				<span class="et_pb_image_wrap has-box-shadow-overlay"><div class="box-shadow-overlay"></div><img fetchpriority="high" decoding="async" width="712" height="513" src="https://nickanalytics.com/wp-content/uploads/2024/05/Weather-Stations-on-map.jpg" alt="Sales Price Distribution" title="Weather Stations on map" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/05/Weather-Stations-on-map.jpg 712w, https://www.nickanalytics.com/wp-content/uploads/2024/05/Weather-Stations-on-map-480x346.jpg 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 712px, 100vw" class="wp-image-697" /></span>
			</div><div class="et_pb_module et_pb_text et_pb_text_1  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><em>Position of the weather stations over Estonia</em></p>
<h2>Relationships between variables</h2>
<p>I did some checks to understand the relationship of certain variables to another variable. Noteworthy ones I can mention here are:</p>
<ul>
<li><strong>Energy Capacity over time:</strong> The plot I created here shows the growth of energy capacity, related to Product or Contract Type and Business/Household combinations.
<div>
<div><span>The visual shows a clear capacity growth, specifically for Product Type 1 and 3.</span></div>
</div>
</li>
</ul>
<p>&nbsp;</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_1">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/05/installed_capacity_combinations.png" class="et_pb_lightbox_image" title=""><span class="et_pb_image_wrap has-box-shadow-overlay"><div class="box-shadow-overlay"></div><img decoding="async" width="1587" height="765" src="https://nickanalytics.com/wp-content/uploads/2024/05/installed_capacity_combinations.png" alt="" title="installed_capacity_combinations" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/05/installed_capacity_combinations.png 1587w, https://www.nickanalytics.com/wp-content/uploads/2024/05/installed_capacity_combinations-1280x617.png 1280w, https://www.nickanalytics.com/wp-content/uploads/2024/05/installed_capacity_combinations-980x472.png 980w, https://www.nickanalytics.com/wp-content/uploads/2024/05/installed_capacity_combinations-480x231.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) and (max-width: 1280px) 1280px, (min-width: 1281px) 1587px, 100vw" class="wp-image-706" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_2  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><em>Product (Contract) Type: </em>0: &#8220;Combined&#8221;, 1: &#8220;Fixed&#8221;, 2: &#8220;General service&#8221;, 3: &#8220;Spot&#8221;</p>
<p>&nbsp;</p>
<p>Next relationship I investigated was between the weather components and the variable that I needed to predict (production and consumption).</p>
<ul>
<li><strong>Relevant weather components related to power <span style="color: #ff6600;">consumption</span>:</strong> In the plots below I show the most relevant attributes of the weather forecast when it comes to energy consumption. It is clear that energy consumption declines as the <strong>temperature rises</strong>, and there is likewise a negative correlation between energy consumption and the <strong>amount of radiation (sunlight)</strong>.</li>
</ul></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_2">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/05/two_cons_pairplots_1.png" class="et_pb_lightbox_image" title=""><span class="et_pb_image_wrap has-box-shadow-overlay"><div class="box-shadow-overlay"></div><img loading="lazy" decoding="async" width="480" height="251" src="https://nickanalytics.com/wp-content/uploads/2024/05/two_cons_pairplots_1.png" alt="" title="two_cons_pairplots_1" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/05/two_cons_pairplots_1.png 480w, https://www.nickanalytics.com/wp-content/uploads/2024/05/two_cons_pairplots_1-300x157.png 300w" sizes="(max-width: 480px) 100vw, 480px" class="wp-image-711" /></span></a>
			</div><div class="et_pb_module et_pb_image et_pb_image_3">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/05/three_cons_pairplots_2.png" class="et_pb_lightbox_image" title=""><span class="et_pb_image_wrap has-box-shadow-overlay"><div class="box-shadow-overlay"></div><img loading="lazy" decoding="async" width="722" height="251" src="https://nickanalytics.com/wp-content/uploads/2024/05/three_cons_pairplots_2.png" alt="" title="three_cons_pairplots_2" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/05/three_cons_pairplots_2.png 722w, https://www.nickanalytics.com/wp-content/uploads/2024/05/three_cons_pairplots_2-480x167.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 722px, 100vw" class="wp-image-712" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_3  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><i>Weather elements that influence energy consumption.</i></p>
<p>&nbsp;</p>
<p>I also investigated the influence of weather on energy production.</p>
<p>&nbsp;</p>
<ul>
<li><strong>Relevant weather components related to power <span style="color: #ff6600;">production</span>:</strong> In the plots below we see the opposite effect compared to power consumption. When there is more radiation (sunlight), there is more energy production: a positive correlation. We can also see that most energy is produced between 10:00 and 17:00. That makes sense, but it is nice to see the data back up this &#8216;no brainer&#8217;. Another positive correlation is seen with installed capacity: more capacity means more production. Another &#8216;no brainer&#8217;. The final plot shows the relationship with temperature: cold temperatures mean less production, temperatures between 5 and 15 degrees yield the highest production, and above 15 degrees there is no further increase.</li>
</ul></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_4">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/05/three_solar_pairplots_2.png" class="et_pb_lightbox_image" title=""><span class="et_pb_image_wrap has-box-shadow-overlay"><div class="box-shadow-overlay"></div><img loading="lazy" decoding="async" width="719" height="251" src="https://nickanalytics.com/wp-content/uploads/2024/05/three_solar_pairplots_2.png" alt="" title="three_solar_pairplots_2" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/05/three_solar_pairplots_2.png 719w, https://www.nickanalytics.com/wp-content/uploads/2024/05/three_solar_pairplots_2-480x168.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 719px, 100vw" class="wp-image-714" /></span></a>
			</div><div class="et_pb_module et_pb_image et_pb_image_5">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/05/three_solar_pairplots_1.png" class="et_pb_lightbox_image" title=""><span class="et_pb_image_wrap has-box-shadow-overlay"><div class="box-shadow-overlay"></div><img loading="lazy" decoding="async" width="779" height="251" src="https://nickanalytics.com/wp-content/uploads/2024/05/three_solar_pairplots_1.png" alt="" title="three_solar_pairplots_1" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/05/three_solar_pairplots_1.png 779w, https://www.nickanalytics.com/wp-content/uploads/2024/05/three_solar_pairplots_1-480x155.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 779px, 100vw" class="wp-image-713" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_4  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h2></h2>
<h2></h2>
<h2>Preprocessing of the Data</h2>
<p>Preprocessing of data is a vital step in every Data Science project. Most datasets are far from perfect and need to undergo several steps for them to serve as input to a Machine Learning model.</p>
<p>The steps I took in this Energy dataset were:</p>
<h3>1. Handling Missing Values</h3>
<p>One of the initial challenges in any data science project is dealing with missing values. In my case there weren&#8217;t many so I won&#8217;t go into that.</p>
<p>&nbsp;</p>
<h3>2. Checking for Outliers</h3>
<p>Outliers can significantly impact the performance of predictive models. I utilized the same <b>pairplot</b> techniques (see above) and <strong>Z-score analysis</strong> to identify and remove outliers from the data. As it turned out, only the electricity prices had some significant outliers. These outliers were removed and replaced with the most recent meaningful price point (a forward fill).</p>
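<p>As a rough illustration of this step (not the competition code; the price column name is hypothetical), the Z-score check and forward fill could look like this in pandas:</p>
<pre><code>import numpy as np
import pandas as pd

def clean_price_outliers(prices: pd.DataFrame, col: str = "euros_per_mwh",
                         z_thresh: float = 3.0) -> pd.DataFrame:
    prices = prices.copy()
    z = (prices[col] - prices[col].mean()) / prices[col].std()
    prices.loc[z.abs() > z_thresh, col] = np.nan  # mark extreme prices as missing
    prices[col] = prices[col].ffill()             # fill with the previous meaningful price
    return prices
</code></pre>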
<p>&nbsp;</p>
<h3>3. Encoding Categorical Variables</h3>
<p>Categorical variables need to be encoded into a numerical format before feeding them into machine learning models. I did not have to do any encoding to this data.</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_5  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h2></h2>
<h2></h2>
<h2>Building the Machine Learning Models</h2>
<p>After completing the pre-processing steps, it is time to create the Machine Learning model. The challenge is to predict energy consumption and production. These two variables can take on pretty much any value, so we consider the predictions to be &#8216;continuous&#8217; (as opposed to for example predicting a fixed outcome of &#8216;yes or no&#8217;, &#8216;true or false&#8217; etc.).</p>
<p>I decided to build different types of ML models for <strong>Energy Production </strong>and <strong>Energy Consumption</strong>.</p>
<p>&nbsp;</p>
<h3>1. A model that predicts <span style="color: #ff6600;">Energy Production</span></h3>
<p>I evaluated 6 different models suitable for predicting continuous variables: LightGBM, XGBoost, CatBoost, Random Forest, AdaBoost, and Decision Trees.</p>
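<p>A minimal sketch of this evaluation loop (assuming the preprocessed production data is already split into <code>X_train</code>, <code>X_test</code>, <code>y_train</code>, <code>y_test</code>; not the exact competition code) could look like this:</p>
<pre><code>from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

models = {
    "LightGBM": LGBMRegressor(),
    "XGBoost": XGBRegressor(),
    "CatBoost": CatBoostRegressor(verbose=0),
    "Random Forest": RandomForestRegressor(),
    "AdaBoost": AdaBoostRegressor(),
    "Decision Trees": DecisionTreeRegressor(),
}
for name, model in models.items():
    print(f"Training and evaluating {name}...")
    model.fit(X_train, y_train)            # assumes pre-split train/test data
    pred = model.predict(X_test)
    print("Mean Squared Error:", mean_squared_error(y_test, pred))
    print("Mean Absolute Error:", mean_absolute_error(y_test, pred))
</code></pre>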
<p>The outcome:</p>
<p>Training and evaluating LightGBM&#8230;<br />Mean Squared Error: 9846.536495704782<br />Mean Absolute Error: 25.022088199724102</p>
<p>Training and evaluating XGBoost&#8230;<br />Mean Squared Error: 10593.784468962709<br />Mean Absolute Error: 25.470343101038633</p>
<p><strong><span style="color: #339966;">Training and evaluating CatBoost&#8230;</span></strong><br /><strong><span style="color: #339966;">Mean Squared Error: 8252.14802273349</span></strong><br /><strong><span style="color: #339966;">Mean Absolute Error: 23.15684970360955</span></strong></p>
<p>Training and evaluating Random Forest&#8230;<br />Mean Squared Error: 11256.553856624487<br />Mean Absolute Error: 23.881381789537063</p>
<p>Training and evaluating AdaBoost&#8230;<br />Mean Squared Error: 78885.39620655986<br />Mean Absolute Error: 171.638968077889</p>
<p>Training and evaluating Decision Trees&#8230;<br />Mean Squared Error: 20994.16420299445<br />Mean Absolute Error: 32.875448153599145</p>
<p><strong>Model of Choice: CatBoost</strong></p>
<p>&nbsp;</p>
<h3>Feature Importance CatBoost</h3>
<p>Understanding which features contribute the most to my CatBoost&#8217;s predictions is crucial for making informed decisions. The outcome of this analysis is depicted in the plot below. With this information I can keep the most important features and drop the less important ones to improve the model even more.</p></div>
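<p>A hedged sketch of how a plot like the one below could be produced, assuming the fitted CatBoost model and the training features from the loop above:</p>
<pre><code>import matplotlib.pyplot as plt
import pandas as pd

cat_model = models["CatBoost"]  # the fitted CatBoost from the loop above
# CatBoost exposes per-feature importances via get_feature_importance()
importance = pd.Series(cat_model.get_feature_importance(), index=X_train.columns)
importance.sort_values().plot(kind="barh", figsize=(10, 6))
plt.title("CatBoost Feature Importance")
plt.tight_layout()
plt.show()
</code></pre>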
			</div><div class="et_pb_module et_pb_code et_pb_code_0">
				
				
				
				
				
			</div><div class="et_pb_module et_pb_image et_pb_image_6">
				
				
				
				
				<span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="1075" height="547" src="https://nickanalytics.com/wp-content/uploads/2024/05/feature-importance-catboost.png" alt="" title="feature importance catboost" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/05/feature-importance-catboost.png 1075w, https://www.nickanalytics.com/wp-content/uploads/2024/05/feature-importance-catboost-980x499.png 980w, https://www.nickanalytics.com/wp-content/uploads/2024/05/feature-importance-catboost-480x244.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1075px, 100vw" class="wp-image-722" /></span>
			</div><div class="et_pb_module et_pb_text et_pb_text_6  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><em>This plot displays the most important features (columns). Those are the <strong>installed capacity</strong>, <strong>solar power </strong>(radiation), <strong>eic_count</strong> (count of energy production sources), <strong>temperature</strong> and <strong>several others</strong>.</em></p>
<p>&nbsp;</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_7  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h2></h2>
<h3>2. A model that predicts <span style="color: #ff6600;">Energy Consumption</span></h3>
<p>For the Energy Consumption part I went for the <strong>Gradient Boosting Model</strong>. This model performed best in my tests. </p>
<p><strong>Model of Choice: Gradient Boosting Model</strong></p>
<p>&nbsp;</p>
<h3>Feature Importance Gradient Boosting</h3>
<p>For Energy Consumption I found that the most important features were: <strong>temperature</strong>, <strong>working day</strong> and <strong>hour of the day</strong>. </p>
<p>&nbsp;</p>
<h3>I created 69 Gradient Boosting Models</h3>
<p>For each combination of county, business/household status, and product (contract) type I created a separate model, all with the same parameters but trained on different data. This resulted in 69 models, each tailored to a possible consumption scenario (a sketch of this loop follows below).</p>
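<p>A minimal sketch of this per-segment training loop, assuming a pandas DataFrame <code>train</code> (the column names are illustrative, not necessarily the dataset&#8217;s):</p>
<pre><code>from sklearn.ensemble import GradientBoostingRegressor

segment_cols = ["county", "is_business", "product_type"]  # hypothetical names
models = {}
for key, seg in train.groupby(segment_cols):
    X = seg.drop(columns=["target"] + segment_cols)
    y = seg["target"]
    m = GradientBoostingRegressor()   # same parameters, different data per segment
    m.fit(X, y)
    models[key] = m
print(f"Trained {len(models)} segment models")  # 69 in this project
</code></pre>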
<p>&nbsp;</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_8  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h3>Making Predictions</h3>
<p>With my code and models, I&#8217;ve made predictions on new unseen data generated in the Kaggle competition. I&#8217;m excited to see how my model performs against other competitors and contribute to the advancement of predictive modelling in the energy world.</p>
<p>Stay tuned for updates on my model&#8217;s performance and further insights from the competition!</p>
<p>&nbsp;</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_9  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h2>Key Take Aways</h2>
<p>The predictive energy modeling project using the Enefit Energy dataset provided several key insights:</p>
<p>&nbsp;</p>
<ol>
<li><strong>Accuracy:</strong> Accurate predictions of energy consumption and production are crucial for balancing supply and demand, integrating renewables, maintaining grid stability, and optimizing operations.</li>
<li><strong>Key Relationships:</strong> Energy consumption declines with higher temperatures and lower radiation, while energy production increases with higher radiation and optimal temperatures.</li>
<li><strong>Data Preprocessing:</strong> Effective preprocessing, such as handling missing values and removing outliers, is vital for reliable model input.</li>
<li><strong>Model Performance:</strong> <span style="color: #ff6600;">CatBoost</span> was the best model for predicting energy production, whereas <span style="color: #ff6600;">Gradient Boosting</span> was best for consumption predictions.</li>
<li><strong>Feature Importance:</strong> Installed capacity, solar radiation, and temperature were crucial for production predictions, while temperature, working day, and time of day were key for consumption.</li>
<li><strong>Customized Models:</strong> Creating 69 distinct models for different scenarios improved prediction accuracy.</li>
<li><strong>Implications:</strong> These models can significantly improve energy management, resource allocation, and grid reliability.</li>
</ol></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_10  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h3>The entire code</h3>
<p>Check out all of the code of this project on Github: <a href="https://github.com/nickanalytics/Predict-Electricity-Production-and-Consumption" title="Nick Analytics - Predictive Energy Model">Nick Analytics &#8211; Predictive Energy Model</a></p></div>
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Boston Marathon: Predict Finish Times</title>
		<link>https://www.nickanalytics.com/boston-marathon-predict-finish-times-2/</link>
		
		<dc:creator><![CDATA[Nick]]></dc:creator>
		<pubDate>Fri, 01 Mar 2024 16:02:58 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://nickanalytics.com/?p=552</guid>

					<description><![CDATA[In this blog I explore the possibilities of predicting finish times of runners participating in the Boston Marathon. I’ll go over the variables that come into play when trying to make a prediction, and apply the most important ones in a Machine Learning model.]]></description>
										<content:encoded><![CDATA[
<div class="et_pb_section et_pb_section_1 et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_1">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_1  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module et_pb_text et_pb_text_11  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h1 data-sourcepos="3:1-3:62" style="text-align: left;">Predicting Finish Times in an online Dashboard</h1>
<p>Welcome to this new post about my Data Analytics journey.</p>
<p>In a previous blog post I did a statistical analysis and comparison of the Boston Marathon, editions 2022 and 2023. In this new post I want to dive deeper and see if there is a way to <strong>predict finish times</strong> of runners participating in the Boston Marathon. The idea is to take the times recorded at the checkpoints (at 5k, 10k etc.) and use these as input to a model that can predict the finish time. How cool would that be!</p>
<p>During this venture I was pretty overwhelmed by all the data and complexity I had to deal with. Should I build one model, multiple models, models for males and females, for fast and slow runners, for each checkpoint? I had to think it over, and in the end came up with a well-working solution.</p>
<p>The work I did in this project roughly consisted of two main parts:</p>
<p>1. Creating a good working finish time <strong>prediction model</strong></p>
<p>2. Building a website with a <strong>dashboard</strong> to showcase the model and see some statistics.</p>
<p>&nbsp;</p>
<p>So, if you are interested in running, and in predicting finish times, buckle up for a great read.</p>
<p>&nbsp;</p>
<h1>The dataset (year = 2023)</h1>
<p><span>In my previous blog I described how I collected the data from the official Boston Marathon website. In 2023 there were some <strong>26,000 runners</strong>, of which I could use the following properties:</span></p>
<p>&#8211; <strong>Bib number</strong> &#8211; the unique identification runners wear on their shirt<br /><span></span><span></span><span></span><span>&#8211; <strong>Age of the runner</strong><br />&#8211; <strong>Gender of the runner</strong><br />&#8211; <strong>Passing times</strong> at 5k, 10k, 15k, 20k, Half way, 25k, 30k, 35k, 40k, Finish</span></p>
<p>The environment I used to process the data and create the model (with the dashboard) was <strong>VS Code</strong> with the <strong>Python</strong> coding language.</p>
<p>&nbsp;</p>
<h1><span>Creating the Machine Learning model</span></h1>
<h3>Diagram</h3>
<p>In the diagram below I visualized the steps I took to create the Machine Learning prediction model.</p>
<p>My most important challenge was how to perceive the data. Is this a <strong>time series problem</strong> where the intermediate passing points are sequential timestamps in hours, minutes and seconds, or should I convert the timestamps to a numeric value that represents a certain <strong>duration</strong>? I decided to go for the latter option, because Machine Learning models that work with time series cannot use timestamps both as the timeline and as the target variable (the predicted value).</p>
<p>So, I took the timestamps for all runners at each checkpoint and, for mathematical reasons, converted this information from an hh:mm:ss value to a numeric value that expresses the time in minutes with decimals. Example: 1 hour, 20 minutes and 30 seconds became a value of 80.50 minutes.</p>
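<p>A small sketch of this conversion with pandas:</p>
<pre><code>import pandas as pd

def to_decimal_minutes(series: pd.Series) -> pd.Series:
    # hh:mm:ss strings become decimal minutes, e.g. 01:20:30 -> 80.5
    return pd.to_timedelta(series).dt.total_seconds() / 60.0

print(to_decimal_minutes(pd.Series(["01:20:30"])))  # 80.5
</code></pre>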
<p>So, in the end my steps to generate the model looked something like this:</p>
<p>&nbsp;</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_7">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/04/flow-chart3.png" class="et_pb_lightbox_image" title="pre-processing"><span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="859" height="680" src="https://nickanalytics.com/wp-content/uploads/2024/04/flow-chart3.png" alt="pre-processing" title="flow-chart3" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/flow-chart3.png 859w, https://www.nickanalytics.com/wp-content/uploads/2024/04/flow-chart3-480x380.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 859px, 100vw" class="wp-image-557" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_12  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><em>Overview of the modelling steps. </em></p>
<p>Let me explain a bit better what feature engineering means and also the train, test steps of Machine Learning Model creation.</p>
<p>&nbsp;</p>
<h3>Extra features</h3>
<p>During the pre-processing phase I added a bunch of new columns to the data, to see if it would improve my prediction process:</p>
<p>Features (columns) that I added to the data were:</p>
<p><strong>&#8211; Average Pace between each checkpoint</strong></p>
<p><strong>&#8211; Percentage decay between each checkpoint</strong></p>
<p><strong>&#8211; Mean pace at each checkpoint</strong></p>
<p><strong>&#8211; Average Standard Deviation at each checkpoint</strong></p>
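<p>A hedged sketch of how features like the pace and decay columns could be derived, assuming a DataFrame with one column of cumulative decimal minutes per checkpoint (the column names and the decay definition are illustrative):</p>
<pre><code>import pandas as pd

checkpoints = ["k5", "k10", "k15", "k20", "half", "k25", "k30", "k35", "k40"]
distances = [5, 10, 15, 20, 21.1, 25, 30, 35, 40]  # km

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for prev, cur, d_prev, d_cur in zip(checkpoints, checkpoints[1:],
                                        distances, distances[1:]):
        seg_km = d_cur - d_prev
        df[f"pace_{cur}"] = (df[cur] - df[prev]) / seg_km        # min/km on this segment
        avg_pace_so_far = df[prev] / d_prev
        df[f"decay_{cur}"] = df[f"pace_{cur}"] / avg_pace_so_far - 1  # relative slowdown
    return df
</code></pre>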
<p>My idea was to generate as much relevant data as I could and then put it all into a Machine Learning cycle to see which features are truly predictive. The outcome was kinda surprising, as you can see in the plot just below.</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_8">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/04/feature_importance.png" class="et_pb_lightbox_image" title="feature importance"><span class="et_pb_image_wrap has-box-shadow-overlay"><div class="box-shadow-overlay"></div><img loading="lazy" decoding="async" width="1005" height="547" src="https://nickanalytics.com/wp-content/uploads/2024/04/feature_importance.png" alt="feature importance" title="feature_importance" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/feature_importance.png 1005w, https://www.nickanalytics.com/wp-content/uploads/2024/04/feature_importance-980x533.png 980w, https://www.nickanalytics.com/wp-content/uploads/2024/04/feature_importance-480x261.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1005px, 100vw" class="wp-image-564" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_13  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><em>This plot was a check on what features would be most predictive of the runner&#8217;s finish times at the 35k check point. Basically the only things that seem to count are the <strong>current passing time</strong>, the <strong>average pace</strong> to 35k and the <strong>decay</strong> from 30k to 35k. Age and gender don&#8217;t seem to matter once you&#8217;re on the run <img decoding="async" src="https://nickanalytics.com/wp-content/themes/Divi/includes/builder/frontend-builder/assets/vendors/plugins/emoticons/img/smiley-wink.gif" alt="wink" /></em></p>
<h3 style="text-align: left;"><em></em>Deciding on the best model</h3>
<p>The best model is the one that makes the best predictions overall. In ML terms we can also say that it is the model with the lowest error. I&#8217;ve tested a couple of ML models that come out of the box (like Random Forest and Linear Regression), and decided to go for Linear Regression (LR).</p>
<p>Along the way I found out that <strong>one size fits all</strong> is not the best approach, so I created different LR models for each checkpoint (so one for the point where the 5k results come in, one for the 10k results etc). I ended up with 9 models. </p>
<p>An example of how well the model does is depicted below. This output is taken at the 10k passing point. It indicates that the MAE (Mean Absolute Error) is 7.15 minutes across all runners, so on average the prediction is about 7 minutes off, either to the upside or the downside.</p></div>
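<p>A minimal sketch (assuming scikit-learn; in older scikit-learn versions <code>normalize</code> was also a tunable parameter, as in the output below) of the per-checkpoint tuning that could produce such scores:</p>
<pre><code>import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

def tune_checkpoint_model(X, y):
    # X, y: features and finish times (decimal minutes) at one checkpoint
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
    grid = GridSearchCV(LinearRegression(), {"fit_intercept": [True, False]},
                        scoring="neg_mean_absolute_error", cv=5)
    grid.fit(X_tr, y_tr)
    pred = grid.predict(X_te)
    print("RMSE:", np.sqrt(mean_squared_error(y_te, pred)))
    print("MAE:", mean_absolute_error(y_te, pred))
    print("Best hyperparameters:", grid.best_params_)
    return grid.best_estimator_
</code></pre>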
			</div><div class="et_pb_module et_pb_code et_pb_code_1">
				
				
				
				
				<div class="et_pb_code_inner"># model scores at 10k:
RMSE of the best Linear Regression model: 9.93486201737597
MAE of the best Linear Regression model: 7.154524133407156
Best hyperparameters:
fit_intercept: False
normalize: True</div>
			</div><div class="et_pb_module et_pb_text et_pb_text_14  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p>I measured the error (deviation) at other passing point as well, and this resulted in the plot here below:</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_9">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/04/model_deviation.png" class="et_pb_lightbox_image" title="clustered histogram"><span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="861" height="400" src="https://nickanalytics.com/wp-content/uploads/2024/04/model_deviation.png" alt="clustered histogram" title="model_deviation" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/model_deviation.png 861w, https://www.nickanalytics.com/wp-content/uploads/2024/04/model_deviation-480x223.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 861px, 100vw" class="wp-image-565" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_15  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p style="text-align: left;"><em>Performance of the prediction model.</em></p>
<p style="text-align: left;"><em></em>As you can see the performance of the model(s) improved significantly the closer we got to the finish line. This makes sense of course, but it is nice to see it back like this. After 30k the model is only <strong>2 minutes off</strong> on average with only a handful of parameters. Truly remarkable !</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_16  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h1></h1>
<p>&nbsp;</p>
<h1>Constructing the Dashboard</h1>
<p>I used a Python library called Streamlit to showcase my work on a webpage. Streamlit is a really nice tool to quickly <span>create web applications for machine learning and data science projects. It allows you to write Python scripts and turn them into interactive, visually appealing web apps.</span></p>
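<p>As a flavor of how this works, here is a minimal sketch of a Streamlit tile like the predictor described below (illustrative names; <code>predict_finish</code> is a hypothetical stand-in for the trained checkpoint models):</p>
<pre><code>import streamlit as st

st.title("Boston Marathon Finish Time Predictor")
checkpoint = st.selectbox("Checkpoint", ["5k", "20k", "35k"])
minutes = st.number_input("Passing time at checkpoint (minutes)", min_value=0.0)
gender = st.radio("Gender", ["Male", "Female"])

if st.button("Predict finish time"):
    prediction = predict_finish(checkpoint, minutes, gender)  # hypothetical helper
    st.write(f"Estimated finish time: {prediction:.1f} minutes")
</code></pre>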
<h3><span>ML driven online finish time predictor</span></h3>
<p><span>I created a simple tile on the dashboard where you can predict your finish time based on the time at the checkpoint at 5k, 20k and 35k. I added gender as well because males are expected to be faster than females. The result looks like the image below:</span></p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_10">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/04/finish_time_predictor.png" class="et_pb_lightbox_image" title="streamlit dashboard"><span class="et_pb_image_wrap has-box-shadow-overlay"><div class="box-shadow-overlay"></div><img loading="lazy" decoding="async" width="596" height="635" src="https://nickanalytics.com/wp-content/uploads/2024/04/finish_time_predictor.png" alt="streamlit dashboard" title="finish_time_predictor" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/finish_time_predictor.png 596w, https://www.nickanalytics.com/wp-content/uploads/2024/04/finish_time_predictor-480x511.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 596px, 100vw" class="wp-image-571" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_17  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><em>The finish time predictor on my dashboard</em></p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_18  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h3 style="text-align: left;">Future application of the model</h3>
<p style="text-align: left;">The predictor I created is a great step in marathon finish time prediction. But of course we want to take it to a point where runners get their predicted times straight on their phone app or smartwatch during the run. As I don&#8217;t have the real time data or a large inferencing server I cannot make that happen. But understanding the variables that are in play and taking those to creating a well working Machine Learning model are crucial steps.</p>
<p style="text-align: left;">Another extension could be to create more models tailored to age and gender, combined with the checkpoint time. Lastly it would be great to distinguish professional runners from amateurs and to have an end time indication of each runner before the start. That could be something they fill in on their application.</p>
<p style="text-align: left;">So, many improvements are possible, but for now I&#8217;m happy with the progress I made.</p>
<p style="text-align: left;"></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_19  et_pb_text_align_left et_pb_text_align_justified-phone et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h3>Other elements on my dashboard</h3>
<p>I added some other tiles to my dashboard with statistical facts like average pace, number of males/females and others.</p>
<p>A preview is displayed here:</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_11">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/04/dashboard-02.png" class="et_pb_lightbox_image" title="Mean Pace at Checkpoints"><span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="1024" height="384" src="https://nickanalytics.com/wp-content/uploads/2024/04/dashboard-02.png" alt="Mean Pace at Checkpoints" title="dashboard-02" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/dashboard-02.png 1024w, https://www.nickanalytics.com/wp-content/uploads/2024/04/dashboard-02-980x368.png 980w, https://www.nickanalytics.com/wp-content/uploads/2024/04/dashboard-02-480x180.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1024px, 100vw" class="wp-image-577" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_20  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p style="text-align: left;"><em>A glimpse of other elements on my marathon dashboard.</em><em></em></p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_21  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h3>Check out my dashboard</h3>
<p>My dashboard can be found on this link: <a href="https://nickanalytics-boston-marathon---predict-finish-time-main-hyyhg1.streamlit.app/" title="Nick Analytics Dashboard">Nick Analytics Dashboard</a> or press the button:</p></div>
			</div><div class="et_pb_button_module_wrapper et_pb_button_0_wrapper et_pb_button_alignment_center et_pb_module ">
				<a class="et_pb_button et_pb_button_0 et_pb_bg_layout_light" href="https://nickanalytics-boston-marathon---predict-finish-time-main-hyyhg1.streamlit.app/" target="_blank">My Dashboard</a>
			</div><div class="et_pb_module et_pb_text et_pb_text_22  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h1></h1>
<h1>Conclusion</h1>
<p>In this blog post, I&#8217;ve described the steps I took to create a Machine Learning model that can predict finish times during the Boston Marathon. It was important to choose the right parameters to train the model. After some iterations I concluded that the most predictive features were the <strong>current passing time</strong>, the <strong>average pace</strong> to 35k and the <strong>decay</strong> from 30k to 35k. <strong>Age</strong> and <strong>gender</strong> don&#8217;t seem to matter that much <strong>once you&#8217;re actually running</strong>.</p>
<p>The models I created become more accurate as runners get closer to the finish line. At the first checkpoint the mean absolute error is 8 minutes (so, on average, about 8 minutes off in either direction). After 30k this deviation drops to below 2 minutes. Note that these figures apply to all runners participating. Creating additional models, for example for professional runners or for males/females separately, could reduce the error even more.</p>
<p>Thanks for reading my blog. Nick.</p>
<p>&nbsp;</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_23  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h3>The coding I&#8217;ve done (in VS Code)</h3>
<p>Check out the code of this project on Github: <a href="https://github.com/nickanalytics/Boston-Marathon---Predict-Finish-Times" title="Nick Analytics - Predict Finish Times">Nick Analytics &#8211; Predict Finish Times</a></p></div>
			</div><div class="et_pb_module et_pb_divider et_pb_divider_0 et_pb_divider_position_ et_pb_space"><div class="et_pb_divider_internal"></div></div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Text Clustering and Labeling</title>
		<link>https://www.nickanalytics.com/text-classification-and-labeling/</link>
		
		<dc:creator><![CDATA[Nick]]></dc:creator>
		<pubDate>Sun, 21 Jan 2024 18:32:41 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://nickanalytics.com/?p=431</guid>

					<description><![CDATA[I stumbled upon a dataset with over 8000 rows, all containing tips that had been sent out by a Dog Trainer to her clients. This list was not classified or labeled and it contained many duplicates. This caused the trainer to spend lots of time connecting the right tip(s) to the right client.]]></description>
										<content:encoded><![CDATA[
<div class="et_pb_section et_pb_section_2 et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_2">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_2  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module et_pb_text et_pb_text_24  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h1 data-sourcepos="3:1-3:62" style="text-align: left;">Classify and Label short texts</h1>
<p>Welcome to this new post about my Data Analytics journey.</p>
<p>One day I stumbled upon an Excel sheet with a few thousand rows and 1 column, all containing tips that had been sent out by a Dog Trainer to her clients. This list of tips was not classified in any way, it contained many similar cells, and it had not been labeled. This caused the Dog Trainer to spend a lot of time connecting the right tip(s) to the right client, every time a new client came forward who needed support.</p>
<p data-sourcepos="5:1-5:294">I took the challenge, improved the dataset and added a lot of value to it. Let&#8217;s see how this went.</p>
<p>&nbsp;</p>
<h1>The dataset</h1>
<p><span>I had one Excel file that consisted of <strong>8600 rows (tips)</strong>, all in written text. The size of the dataset was relatively small, but it needed some cleaning and preprocessing. The steps I took in this respect were:</span></p>
<p><span>&#8211; Load the Excel in VS Code (Python Editor) for further processing<br />&#8211; Cut the file into 3 columns (name of the dog, the tip, sequence number)<br />&#8211; Remove special characters (commas, quotes, icons) from the Tip column so as not to upset the ML process<br />&#8211; Remove leading spaces, whitespace and empty rows</span></p>
<p><span>I saved the file as a <strong>parquet type</strong>. Parquet files are faster to read by ML models, and an easy way to retain the texts.</span></p>
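<p>A hedged sketch of these cleaning steps with pandas (the file and column names are illustrative):</p>
<pre><code>import pandas as pd

df = pd.read_excel("tips.xlsx", header=None,
                   names=["dog_name", "tip", "seq_nr"])   # hypothetical layout
df["tip"] = (df["tip"].astype(str)
             .str.replace(r"[^\w\s]", "", regex=True)     # drop commas, quotes, icons
             .str.strip())                                # drop leading/trailing spaces
df = df[df["tip"] != ""]                                  # drop empty rows
df.to_parquet("tips.parquet", index=False)                # fast to read back for ML
</code></pre>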
<h1><span>Let&#8217;s classify</span></h1>
<h3>The Classification Method</h3>
<p><span>The first challenge for me was to cluster/classify all of the tips into <strong>bins</strong>. There is no easy (automatic) way to do this, so as a data scientist I had to come up with a solution. I concluded that working with unlabeled data required an unsupervised ML model to do the job. I chose K-Means clustering.</span></p>
<p><span>K-Means sounds complex, but it is a relatively simple way to group data points around centers (centroids). In the algorithm you specify how many clusters or bins (&#8216;K&#8217;) you want. It is very important to make an educated guess, because a small number of bins can mean that certain texts are grouped with texts that are not really similar. After some checking and tweaking I settled on 50. So, the system puts each Tip into 1 of 50 clusters, depending on how similar the texts are.</span></p>
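<p>As a local illustration of the idea (the actual run used Azure ML Designer, described below), a TF-IDF plus K-Means sketch in scikit-learn could look like this:</p>
<pre><code>import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tips = pd.read_parquet("tips.parquet")                  # cleaned tips from above
vectorizer = TfidfVectorizer(stop_words="english")      # assumes English texts
X = vectorizer.fit_transform(tips["tip"])
kmeans = KMeans(n_clusters=50, random_state=42, n_init=10)
tips["cluster"] = kmeans.fit_predict(X)                 # 1 of 50 bins per tip
</code></pre>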
<h3>The chosen software solution (Azure ML Designer by Microsoft)</h3>
<p>I am a <strong>certified Azure ML scientist</strong> and know how to use Azure Machine Learning by Microsoft. The system is a state-of-the-art solution that can handle virtually any type of problem that requires Machine Learning models to make predictions. So, my case was an excellent opportunity to use the system on a real-life example.</p>
<p>Running a job on Azure ML requires some configuring (and hurdles to overcome). The system is huge and has endless possibilities. My choice was to go for the <strong>Designer flow</strong>, which is a drag and drop system with a canvas that can hold different modules that each need to be configured. The entire flow of modules is called a <strong>pipeline</strong>.</p>
<p>When the pipeline is ready, you have to select a Compute Instance to run the job on. This is the moment you start to pay for what you are doing. On Azure, it&#8217;s all about compute power and compute time.</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_12">
				
				
				
				
				<span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="880" height="761" src="https://nickanalytics.com/wp-content/uploads/2024/04/Designer-04.jpg" alt="Designer Pipeline" title="Designer-04" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/Designer-04.jpg 880w, https://www.nickanalytics.com/wp-content/uploads/2024/04/Designer-04-480x415.jpg 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 880px, 100vw" class="wp-image-444" /></span>
			</div><div class="et_pb_module et_pb_text et_pb_text_25  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><em>Overview of the Designer Pipeline with data input at the top, a module to preprocess texts and the K-Means model that will take care of the clustering. The Training takes places in the final module. Here a new parquet file is created with cluster data added to it.</em></p>
<h3>Classification ready</h3>
<p>In my case it took about 30 minutes for Azure ML to complete the job. The output is a parquet file with each Tip assigned a cluster number, along with information about its distance to the closest centroid. Having this information, we can now filter out all duplicate or similar tips by picking each cluster number.</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_26  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h3 style="text-align: left;">A closer look</h3>
<p style="text-align: left;">If we zoom in further on the cluster column and we find the following data:</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_13">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/04/Designer-05.jpg" class="et_pb_lightbox_image" title="classified text"><span class="et_pb_image_wrap has-box-shadow-overlay"><div class="box-shadow-overlay"></div><img loading="lazy" decoding="async" width="416" height="323" src="https://nickanalytics.com/wp-content/uploads/2024/04/Designer-05.jpg" alt="classified text" title="Designer-05" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/Designer-05.jpg 416w, https://www.nickanalytics.com/wp-content/uploads/2024/04/Designer-05-300x233.jpg 300w" sizes="(max-width: 416px) 100vw, 416px" class="wp-image-446" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_27  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p style="text-align: left;"><em>This box shows us 15 of the 8500 rows that have been put into a cluster. </em></p>
<p style="text-align: left;">It is striking that the <strong>Preprocessing Module</strong> has cleaned almost all special characters, numbers and capital letters. Basically only words are left. These words can be compared and clustered, like the KNN model has done here. It has categorized the tips into 50 clusters or bins. When done in Azure ML Designer the system will not only add the cluster number but also a couple of other columns with the distances to the centers (centroids) of other bins. This can help to decide whether the number of clusters is correct or could be adjusted somewhat.</p>
<h3 style="text-align: left;"></h3>
<h3 style="text-align: left;">Two insightful plots</h3>
<p style="text-align: left;">I have created to plots related to this new clustering:</p>
<p style="text-align: left;">&#8211; first one is a <strong>histogram</strong> with an overview of how many tips are categorized per cluster</p>
<p style="text-align: left;">&#8211; second one is a <strong>box plot</strong><span> which displays the spread and the outliers in the lengths of the texts within each cluster</span> </p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_14">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/04/count-of-tips-per-cluster-sorted.png" class="et_pb_lightbox_image" title="clustered histogram"><span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="2025" height="1122" src="https://nickanalytics.com/wp-content/uploads/2024/04/count-of-tips-per-cluster-sorted.png" alt="clustered histogram" title="count of tips per cluster (sorted)" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/count-of-tips-per-cluster-sorted.png 2025w, https://www.nickanalytics.com/wp-content/uploads/2024/04/count-of-tips-per-cluster-sorted-1280x709.png 1280w, https://www.nickanalytics.com/wp-content/uploads/2024/04/count-of-tips-per-cluster-sorted-980x543.png 980w, https://www.nickanalytics.com/wp-content/uploads/2024/04/count-of-tips-per-cluster-sorted-480x266.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) and (max-width: 1280px) 1280px, (min-width: 1281px) 2025px, 100vw" class="wp-image-467" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_28  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p style="text-align: center;"><em>Histogram of texts grouped per cluster</em></p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_15">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/04/Length-of-text-variations-per-cluster.png" class="et_pb_lightbox_image" title=""><span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="1779" height="1087" src="https://nickanalytics.com/wp-content/uploads/2024/04/Length-of-text-variations-per-cluster.png" alt="" title="Length of text variations per cluster" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/Length-of-text-variations-per-cluster.png 1779w, https://www.nickanalytics.com/wp-content/uploads/2024/04/Length-of-text-variations-per-cluster-1280x782.png 1280w, https://www.nickanalytics.com/wp-content/uploads/2024/04/Length-of-text-variations-per-cluster-980x599.png 980w, https://www.nickanalytics.com/wp-content/uploads/2024/04/Length-of-text-variations-per-cluster-480x293.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) and (max-width: 1280px) 1280px, (min-width: 1281px) 1779px, 100vw" class="wp-image-466" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_29  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p style="text-align: center;"><em>Box plot showing <span>the spread and outliers in the lengths of the texts per </span>cluster</em></p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_30  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h1>Next step: Labeling</h1>
<p>Labeling is the process of adding one- to three-word descriptions to an item, in my case to a piece of text. The intention of the process is:</p>
<p>&nbsp;</p>
<ol>
<li><strong>Generate labels</strong> from the (already clustered) texts</li>
<li><strong>Verify</strong> if the labels cover your needs</li>
<li><strong>Add </strong>the labels to the texts</li>
<li><strong>Compare</strong> similar labels between different clusters</li>
</ol></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_31  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h3>1. Generate the labels using Data Analyst</h3>
<p>I uploaded my texts to Data Analyst (part of ChatGPT) and asked for the labels. I provided a series of about 40 relevant labels that served as input and example to the model. After some iterations the system provided me with around <strong>70 labels</strong> that should cover the essence of all of the texts. Truly amazing <img decoding="async" src="https://nickanalytics.com/wp-content/themes/Divi/includes/builder/frontend-builder/assets/vendors/plugins/emoticons/img/smiley-smile.gif" alt="smile" /></p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_32  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h3 style="text-align: left;">2. Make sure the labels cover your needs</h3>
<p style="text-align: left;">This is basically a manual process. Read texts from each cluster and see if there are 1 or more labels that cover the most important topics. Keep in mind that the goal of this exercise is to create a blueprint of 8000 training tips, to easily select the right tip to the right problem. So, in my case either the <strong>behavioral problem(s)</strong> or the <strong>training goal(s)</strong> needed to be displayed in the label. Some of the words that I used were: visit, bite, bark, barking behavior, communication, doorbell, clarity, own energy, emotions, obedience, sounds, mood and 60 others.</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_33  et_pb_text_align_left et_pb_text_align_justified-phone et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h3><strong>3. Add the labels to the texts</strong></h3>
<p>The biggest operation is to add the labels to the 8000 tips. We have the pre-processed texts and we have the labels. Now we need to link them together. I would say there are 3 ways to do that:</p>
<ul>
<li>Via ChatGPT <strong>Data Analyst</strong></li>
<li>Via the Labeling Option in <strong>Azure ML</strong></li>
<li>Via Python Code using the <strong>Spacy</strong> library</li>
</ul>
<p>First, I tried the <strong>ChatGPT Data Analyst<br /></strong>This option gave me really good results at times, but it struggled when the dataset got too big. The outcome I got looked like this:</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_16">
				
				
				
				
				<span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="1300" height="319" src="https://nickanalytics.com/wp-content/uploads/2024/04/labeled_tips.jpg" alt="" title="labeled_tips" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/labeled_tips.jpg 1300w, https://www.nickanalytics.com/wp-content/uploads/2024/04/labeled_tips-1280x314.jpg 1280w, https://www.nickanalytics.com/wp-content/uploads/2024/04/labeled_tips-980x240.jpg 980w, https://www.nickanalytics.com/wp-content/uploads/2024/04/labeled_tips-480x118.jpg 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) and (max-width: 1280px) 1280px, (min-width: 1281px) 1300px, 100vw" class="wp-image-462" /></span>
			</div><div class="et_pb_module et_pb_text et_pb_text_34  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p style="text-align: left;"><em>Note: Those are a snippet of the labels added by Data Analyst</em><em></em></p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_35  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p>Next I tried the <strong>Azure ML Labeling option:</strong></p>
<p>I uploaded the file and labels but got in trouble with the required computing power. I needed to scale up but that was not possible in my subscription. But there&#8217;s also a manual way of doing this without Machine Learning. But this requires you to train the model manually by teaching it for at least 100 tips which label(s) belong to which tips. I started doing this but it was too cumbersome so I abandoned this option.</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_17">
				
				
				
				
				<span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="671" height="434" src="https://nickanalytics.com/wp-content/uploads/2024/04/Labelling-03.jpg" alt="" title="Labelling-03" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/Labelling-03.jpg 671w, https://www.nickanalytics.com/wp-content/uploads/2024/04/Labelling-03-480x310.jpg 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 671px, 100vw" class="wp-image-464" /></span>
			</div><div class="et_pb_module et_pb_text et_pb_text_36  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p style="text-align: center;"><em>I stopped labeling after manually teaching Azure ML Labeling 3 tips. </em><em>This is simple too time consuming.</em></p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_37  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p>Third option I used was the <strong>Spacy</strong> Library in Python. Spacy is an excellent Text Processing tool with multiple language modules.</p>
<p>The steps I followed were:</p>
<p><strong>&#8211; Load the SpaCy Model</strong>: import the correct language module</p>
<p><strong>&#8211; Add Match Patterns</strong>: Each keyword is transformed into a pattern where the matching is based on the lemma (base form) of the word in the text. These patterns are added to a &#8216;Matcher&#8217; with a corresponding label.</p>
<p><strong>&#8211; Assign Labels</strong>: A function named <code>assign_labels</code> was defined, which processes a given text using the SpaCy model to convert it and then uses a Matcher to find patterns.</p>
<p>The great thing about this code is that it can be re-used to automatically tag text with predefined labels, thus applying the labeling process to new, unseen data.</p>
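<p>Here is a minimal sketch of that flow, assuming the small English pipeline and a tiny illustrative keyword-to-label mapping (the real list held around 70 labels):</p>
<pre><code># A minimal sketch of the lemma-based labeling flow (keywords and labels are
# illustrative assumptions, not the real list)
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")   # 1. load the language model
matcher = Matcher(nlp.vocab)

# 2. each keyword becomes a lemma-based pattern, added under its label
keyword_labels = {"bark": "barking behavior", "visit": "visit", "bite": "bite"}
for keyword, label in keyword_labels.items():
    matcher.add(label, [[{"LEMMA": keyword}]])

# 3. process a text and collect the labels of all matching patterns
def assign_labels(text):
    doc = nlp(text)
    return {nlp.vocab.strings[match_id] for match_id, start, end in matcher(doc)}

print(assign_labels("The dog barked the whole time we visited."))
# -> {'barking behavior', 'visit'}
</code></pre>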
<p>&nbsp;</p>
<p>The result I got looked something like this:</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_18">
				
				
				
				
				<span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="1257" height="461" src="https://nickanalytics.com/wp-content/uploads/2024/04/labeled_tips_2.jpg" alt="" title="labeled_tips_2" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/labeled_tips_2.jpg 1257w, https://www.nickanalytics.com/wp-content/uploads/2024/04/labeled_tips_2-980x359.jpg 980w, https://www.nickanalytics.com/wp-content/uploads/2024/04/labeled_tips_2-480x176.jpg 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1257px, 100vw" class="wp-image-474" /></span>
			</div><div class="et_pb_module et_pb_text et_pb_text_38  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><em>The labels are somewhat different because I reduced the input label list, but the result is excellent.</em></p>
<p><em></em></p>
<p><em></em></p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_39  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h1>Final step: combining cluster and label</h1>
<h3></h3>
<h3></h3>
<h3>Reducing the number of tips to 1 per cluster</h3>
<p>After the detailed labeling process I took a new efficiency step to bring the labeling to a higher level. I took each cluster and compared them with the labels (in Excel). I then decided to give each cluster a new top level label of max 3 words, derived from the Labels column previously generated. This step reduced the list of tips to 50, one for each cluster. </p>
<h3>Reducing the labels to 5 top level labels</h3>
<p>My final step was to reduce the 50 labels to only 5 top-level ones. These top-level labels represent the <strong>5 main areas</strong> for which tips were provided. I was able to link the 5 main areas to the number of tips used, creating insight into which areas are targeted the most.</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_19">
				
				
				
				
				<span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="816" height="141" src="https://nickanalytics.com/wp-content/uploads/2024/04/Final_Keywords.png" alt="High Level Keys" title="Final_Keywords" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/Final_Keywords.png 816w, https://www.nickanalytics.com/wp-content/uploads/2024/04/Final_Keywords-480x83.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 816px, 100vw" class="wp-image-476" /></span>
			</div><div class="et_pb_module et_pb_text et_pb_text_40  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h3>Mindmap</h3>
<p>One final way of looking at the end result is by using a &#8216;<strong>mindmap</strong>&#8216;. This technique aggregates all high-level tips and labels into a single <span>structure. It aids analysis by showing the relationships between elements.</span></p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_20">
				
				
				
				
				<span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="1600" height="814" src="https://nickanalytics.com/wp-content/uploads/2024/04/diagram.png" alt="mindmap" title="diagram" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/diagram.png 1600w, https://www.nickanalytics.com/wp-content/uploads/2024/04/diagram-1280x651.png 1280w, https://www.nickanalytics.com/wp-content/uploads/2024/04/diagram-980x499.png 980w, https://www.nickanalytics.com/wp-content/uploads/2024/04/diagram-480x244.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) and (max-width: 1280px) 1280px, (min-width: 1281px) 1600px, 100vw" class="wp-image-496" /></span>
			</div><div class="et_pb_module et_pb_text et_pb_text_41  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h1>Conclusion</h1>
<p>In this blog post, I have explored the steps of analyzing a large number of texts by clustering and labeling them. The ultimate goal of this project was to provide high-level insight into the areas that need to be targeted in order to improve dog behavior and increase the skills and knowledge of the dog owner. This work can serve as an input to an automatic Machine Learning model that could then cluster and label texts automatically.</p>
<p>My analytics work may be valuable in any organization where texts need to be labeled or classified. This may involve social media posts, chat conversations, sentiment analysis of earnings transcripts and many more.</p>
<p>Thanks for reading my blog.</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_42  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h3>The coding I&#8217;ve done (in VS Code)</h3>
<p>Check out the code of this project on Github: <a href="https://github.com/nickanalytics/Text-Clustering-and-Labeling" title="Text-Clustering-and-Labeling">Text-Clustering-and-Labeling</a></p></div>
			</div><div class="et_pb_module et_pb_divider et_pb_divider_1 et_pb_divider_position_ et_pb_space"><div class="et_pb_divider_internal"></div></div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Credit Card Fraud Detection</title>
		<link>https://www.nickanalytics.com/credit-card-fraud-detection/</link>
		
		<dc:creator><![CDATA[Nick]]></dc:creator>
		<pubDate>Fri, 29 Dec 2023 17:13:54 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://nickanalytics.com/?p=374</guid>

					<description><![CDATA[Credit card fraud is a major challenge nowadays, with cybercriminals using advanced techniques to exploit vulnerabilities in the system. However, armed with knowledge, awareness, and proactive measures, we can strengthen ourselves against the threat of fraudulent activities.]]></description>
										<content:encoded><![CDATA[
<div class="et_pb_section et_pb_section_3 et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_3">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_3  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module et_pb_text et_pb_text_43  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h1 data-sourcepos="3:1-3:62" style="text-align: left;">Credit Card Fraud Detection</h1>
<p>Welcome to this new post about my Data Analytics journey.</p>
<p>Credit card fraud is a huge challenge in the digital age, with cybercriminals employing increasingly sophisticated tactics to exploit vulnerabilities in the system. However, armed with knowledge, awareness, and proactive measures, we can strengthen ourselves against the threat of fraudulent activities.</p>
<p data-sourcepos="5:1-5:294">I came across a very nice challenge posted by <strong>American Express</strong>. This company provided a huge (50GB) set of data representing <strong>regular transactions along with fraudulent ones</strong>. Some files contained more than 10 million rows making it hard for even a strong computer to open them. Let alone processing the information and creating a prediction model. On top of that the data was completely anonymized, making it impossible to understand what each column meant.</p>
<p data-sourcepos="5:1-5:294">I took the challenge, reduced the files, investigated several prediction models, and came up with a good working one. Let&#8217;s start this blog with my first step, how to turn huge files into &#8216;manageable&#8217; ones.</p>
<p>&nbsp;</p>
<h2>The dataset</h2>
<p><span>I had these files:</span></p>
<p><span><strong>&#8211; a training set of 5 million transactions (16GB)</strong></span></p>
<p><span><strong>&#8211; a test set of 11 million transactions (33GB)</strong></span></p>
<p><span><strong>&#8211; each set has 190 columns of anonymized data</strong></span></p>
<p>&nbsp;</p>
<p>The largest dataset has 11 million rows and 190 columns. Those add up to over <strong>2 billion</strong> data points <img decoding="async" src="https://nickanalytics.com/wp-content/themes/Divi/includes/builder/frontend-builder/assets/vendors/plugins/emoticons/img/smiley-surprised.gif" alt="surprised" /><br />A training set contains the information used to train a machine learning model. Apart from all the data about the transactions, it also holds a column that indicates whether the transaction was fraudulent or normal. We call this the &#8216;target&#8217; column.</p>
<p>The test set is used to make predictions about the type of transaction (fraudulent or normal).</p>
<p><span>The columns or column titles don&#8217;t reveal anything about the transactions. They come with names like &#8216;S_2&#8217;, &#8216;P_2&#8217;, &#8216;D_39&#8217;, &#8216;B_1&#8217;, &#8216;B_2&#8217; and &#8216;R_1&#8217;, and values between 0 and 1.</span></p>
<p>&nbsp;</p>
<h2>Reducing the file sizes</h2>
<p>With my computer it was not possible to work with files of this size.<br />I used a special Python library called &#8216;<strong>Dask</strong>&#8216; to do the heavy lifting for me. Dask can read a file without actually loading it into memory. So I read each file and chopped it up into much smaller chunks. I saved those in the &#8216;feather&#8217; format, then reloaded and saved them as a parquet file. I then recombined the files and ended up with sizes reduced to <strong>10%</strong> of the original.</p>
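<p>A minimal sketch of this reduction idea, assuming hypothetical file names (the real pipeline went through feather chunks first):</p>
<pre><code># Sketch: shrink a huge CSV with Dask and Parquet (file names are assumptions)
import dask.dataframe as dd

# Dask reads lazily in ~256MB chunks, so the 16GB file never sits in memory
ddf = dd.read_csv("train_data.csv", blocksize="256MB")

# Downcasting 64-bit floats to 32-bit roughly halves their footprint
float_cols = [col for col, dtype in ddf.dtypes.items() if dtype == "float64"]
ddf = ddf.astype({col: "float32" for col in float_cols})

# Write every partition to compressed, columnar Parquet files
ddf.to_parquet("train_parquet/", compression="snappy")
</code></pre>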
<p>&nbsp;</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_44  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h2>Inspecting the data with Sweetviz</h2>
<p>I used a Python library called Sweetviz to conduct an initial investigation on the data. Sweetviz can provide very important information on the dataset with respect to:</p>
<ul>
<li><strong>Associations</strong> of the target variable (fraud or not) against all other features</li>
<li>Indication which column are <strong>categorical</strong> and which ones <strong>numerical</strong> or dates</li>
<li>The amount of <strong>missing data</strong> in each column</li>
<li><strong>Target analysis</strong>, <strong>Comparison</strong>, <strong>Feature analysis</strong>, <strong>Correlation</strong></li>
<li><strong>Numerical analysis</strong> (min/max/range, quartiles, mean, mode, standard deviation, sum etc.</li>
</ul>
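<p>Generating such a report takes only a few lines. A minimal sketch, assuming a hypothetical 5000-row sample file with the fraud flag in a column named &#8216;target&#8217;:</p>
<pre><code># Sketch: build a Sweetviz EDA report (file and column names are assumptions)
import pandas as pd
import sweetviz as sv

df = pd.read_csv("train_sample.csv")

# target_feat adds the target-association analysis to the report
report = sv.analyze(df, target_feat="target")
report.show_html("sweetviz_report.html")
</code></pre>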
<p>My first analysis of a subset of the training data (5000 records) yielded the following results:</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_21">
				
				
				
				
				<span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="986" height="831" src="https://nickanalytics.com/wp-content/uploads/2024/04/Sweetviz-03a.jpg" alt="" title="Sweetviz-03a" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/Sweetviz-03a.jpg 986w, https://www.nickanalytics.com/wp-content/uploads/2024/04/Sweetviz-03a-980x826.jpg 980w, https://www.nickanalytics.com/wp-content/uploads/2024/04/Sweetviz-03a-480x405.jpg 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 986px, 100vw" class="wp-image-392" /></span>
			</div><div class="et_pb_module et_pb_text et_pb_text_45  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p><em>The image above depicts the most important column (our target column, in black) along with all other 190 columns below it (you see only two of them). In the top section you can see the number of rows, duplicates, feature types (categorical/numerical or text).</em></p>
<p>&nbsp;</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_46  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h2 style="text-align: left;">A closer look</h2>
<p style="text-align: left;">If we zoom in further on the target column and its associations we find the following data:</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_22">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/04/Sweetviz-04a.jpg" class="et_pb_lightbox_image" title="current sales"><span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="844" height="571" src="https://nickanalytics.com/wp-content/uploads/2024/04/Sweetviz-04a.jpg" alt="current sales" title="Sweetviz-04a" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/Sweetviz-04a.jpg 844w, https://www.nickanalytics.com/wp-content/uploads/2024/04/Sweetviz-04a-480x325.jpg 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 844px, 100vw" class="wp-image-393" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_47  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p style="text-align: left;">This part shows us that 75% of the transactions are &#8216;normal&#8217;, and 25% are fraudulent.</p>
<p style="text-align: left;">Along with the plot there are 3 tables that display how closely related the target column is with the numerical columns and categorical columns. I am most interested in:</p>
<p style="text-align: left;">&#8211; the <strong>Correlation Ratio</strong> (table at the bottom left) because if there&#8217;s a strong correlation between target and numerical feature we could say that the feature influences the outcome, and thus would be of interest in our prediction model.</p>
<p style="text-align: left;">&#8211; the <strong>Association Ratio</strong> (table at the top right) for the same reason as the correlation ratio, only in this case we&#8217;re dealing with categorical features.</p>
<p style="text-align: left;">So, if we want to reduce dimensionality in our dataset we could decide to only involve high scoring ratio&#8217;s and leave out the low scoring ones.</p>
<p style="text-align: left;"></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_48  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h2 style="text-align: left;">Dimensionality reduced</h2>
<p>The exercise above has lead to a <strong>reduction of 153 columns to 39</strong>. This step is very important to keep the dataset &#8216;manageable&#8217; and suitable for machine learning. Too much complexity required extreme calculation power and in general poorer results.</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_49  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h2 style="text-align: left;">Dealing with missing data</h2>
<p>Machine Learning models require information to be complete. If this is not the case (like in our example), we need to decide how to solve this problem. There are 3 ways:</p>
<p>&#8211; If a lot of information is missing in a column, we can <strong>remove</strong> the column</p>
<p>&#8211; If only some fields are empty we can fill them with the <strong>average</strong> for the column</p>
<p>&#8211; If key columns miss data we can make a <strong>prediction model</strong> just for this purpose</p>
<p>Note that I used <strong>PyCaret</strong> for further analysis. This package deals with missing values automatically by imputing them.</p></div>
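<p>To make the first two options concrete, here is a minimal sketch with a hypothetical toy frame standing in for the transaction data:</p>
<pre><code># Sketch: drop mostly-empty columns, then mean-impute the rest
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the reduced transaction data
df = pd.DataFrame({"B_1": [0.2, np.nan, 0.4], "R_1": [np.nan, np.nan, 0.9]})

# 1. remove columns where more than half of the values are missing
df = df.loc[:, df.isna().mean() &lt;= 0.5]

# 2. fill the remaining gaps with each column's average
df = df.fillna(df.mean(numeric_only=True))
print(df)   # R_1 is dropped, the gap in B_1 becomes 0.3
</code></pre>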
			</div><div class="et_pb_module et_pb_text et_pb_text_50  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h2></h2>
<h2>Selecting a Machine Learning model with PyCaret</h2>
<p>PyCaret is a really nice Python library if you want quickly get insights which machine learning models may work best on your data, together with nice plots to back it up. </p>
<p>In this step I loaded my (reduced) dataset and set it up in PyCaret. I then put PyCaret to work telling it I wanted to get the best perfoming model. Here&#8217;s what happened:</p></div>
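<p>A minimal sketch of this step, assuming the reduced data lives in a hypothetical train_reduced.parquet file with the fraud flag in a column named &#8216;target&#8217;:</p>
<pre><code># Sketch: let PyCaret rank candidate classifiers (paths and names are assumptions)
import pandas as pd
from pycaret.classification import setup, compare_models

df = pd.read_parquet("train_reduced.parquet")

# setup() registers the data, infers feature types and imputes missing values
exp = setup(data=df, target="target", session_id=42)

# compare_models() cross-validates a suite of classifiers and returns the best
best_model = compare_models()
</code></pre>
<p>Here&#8217;s what happened:</p>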
			</div><div class="et_pb_module et_pb_image et_pb_image_23">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/04/model_select.jpg" class="et_pb_lightbox_image" title=""><span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="789" height="472" src="https://nickanalytics.com/wp-content/uploads/2024/04/model_select.jpg" alt="" title="model_select" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/model_select.jpg 789w, https://www.nickanalytics.com/wp-content/uploads/2024/04/model_select-480x287.jpg 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 789px, 100vw" class="wp-image-405" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_51  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p style="text-align: left;">The plot shows a list of ML models that have been tested (first two columns) with the results of each one of them in next columns. How do we interpret the most important indicators:</p>
<ol>
<li><strong>Accuracy</strong>: <span>overall correctness of the model</span>.</li>
<li><strong>AUC</strong>: &#8216;Area Under the Curve&#8217;. How well does the model classify the positives and negatives.</li>
<li><strong>Recall</strong>: <span>the proportion of actual positives correctly identified by the model</span></li>
<li><strong>Precision</strong>: <span>True positive predictions among <span style="text-decoration: underline;">all</span> positive predictions. In simple terms, the model could have predicted all normal transactions as normal, but may have overlooked many fraudulent ones. So, a score of 1 (perfect) only tells us something about positive (normal) predictions.</span></li>
<li><span><strong>F1</strong>: (harmonic) mean of precision and recall. It is useful when you want to consider both false positives and false negatives.</span><span></span></li>
</ol>
<p><span>For me the most important indicator is the one that is best at predicting true fraudulent transactions and as few as possible false positives. We don&#8217;t want to bother customer with legal transactions and tell them it&#8217;s a fraud.</span></p>
<p><span>I went on with <strong>Naive Bayes</strong> as it has a high accuracy and a better precision than KNN.</span></p>
<p><span>Here below I displayed the Confusion Matrix and UAC plot of this model:</span></p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_24">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/04/conf_matrix.jpg" class="et_pb_lightbox_image" title=""><span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="787" height="524" src="https://nickanalytics.com/wp-content/uploads/2024/04/conf_matrix.jpg" alt="" title="conf_matrix" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/conf_matrix.jpg 787w, https://www.nickanalytics.com/wp-content/uploads/2024/04/conf_matrix-480x320.jpg 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 787px, 100vw" class="wp-image-404" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_52  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p style="text-align: left;"><em>On the trainset it predicted 11692 correctly as &#8216;normal&#8217; transactions and 2310 correctly as &#8216;fraudulent&#8217;. 808 were predicted as &#8216;normal&#8217;, but were fraudulent, and 1785 were predicted as fraudulent, but were normal.</em></p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_25">
				
				
				
				
				<span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="696" height="498" src="https://nickanalytics.com/wp-content/uploads/2024/04/auc.jpg" alt="" title="auc" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/auc.jpg 696w, https://www.nickanalytics.com/wp-content/uploads/2024/04/auc-480x343.jpg 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 696px, 100vw" class="wp-image-403" /></span>
			</div><div class="et_pb_module et_pb_text et_pb_text_53  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p style="text-align: left;"><span>A higher AUC indicates better model performance in terms of classification.</span></p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_54  et_pb_text_align_left et_pb_text_align_justified-phone et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h2 style="text-align: left;"></h2>
<h2 style="text-align: left;">Actual Predictions</h2>
<p>I used the model and created predictions on data the model never saw before (called a test set). The nice thing is that we not only get a prediction (0 or 1), but also a <span>prediction_score (how confident is the model). It looks something like this:</span></p></div>
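<p>A minimal sketch of this scoring step, continuing from the PyCaret experiment above (the test file name is an assumption):</p>
<pre><code># Sketch: score unseen transactions with the chosen model
import pandas as pd
from pycaret.classification import predict_model

test_df = pd.read_parquet("test_reduced.parquet")

# Adds prediction_label (0 or 1) and prediction_score columns to the data
predictions = predict_model(best_model, data=test_df)
print(predictions[["prediction_label", "prediction_score"]].head())
</code></pre>
<p><span>It looks something like this:</span></p>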
			</div><div class="et_pb_module et_pb_image et_pb_image_26">
				
				
				
				
				<span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="338" height="627" src="https://nickanalytics.com/wp-content/uploads/2024/04/score.jpg" alt="" title="score" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/04/score.jpg 338w, https://www.nickanalytics.com/wp-content/uploads/2024/04/score-162x300.jpg 162w" sizes="(max-width: 338px) 100vw, 338px" class="wp-image-406" /></span>
			</div><div class="et_pb_module et_pb_text et_pb_text_55  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p style="text-align: left;"><em>Note: I just displayed the last 3 columns of all 39 columns.</em></p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_56  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h2>Conclusion</h2>
<p>In this blog post, I have explored the steps of analyzing a large dataset with fraudulent credit card transactions. I have given insights by showing graphs of the way data correlated and can be reduced to leave only relevant features. With PyCaret I tested and selected a Machine Learning model for predictions. I evaluated how accurate the predictions will be by explaining the classifiers that belong to this model.</p>
<p>My analytics work may be valuable in the financial world where the battle against fraud is taking more and more time and manpower.</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_57  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h3>The entire code</h3>
<p>Check out all of the code of this project at Github (sweetviz part): <a href="https://github.com/nickanalytics/Credit-Card-Fraud" title="Nick Analytics - Credit Card Fraud pt. 1">Nick Analytics &#8211; Credit Card Fraud pt. 1</a></p>
<p>and on Google Colab (the PyCaret part): <a href="https://colab.research.google.com/drive/12fzJuVqZ5-AMaH_h2nXqaicj_Z8doglX?usp=sharing" title="PyCaret use case">Nick Analytics &#8211; Credit Card Fraud pt. 2</a></p></div>
			</div><div class="et_pb_module et_pb_divider et_pb_divider_2 et_pb_divider_position_ et_pb_space"><div class="et_pb_divider_internal"></div></div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Sales Forecasting with Prophet</title>
		<link>https://www.nickanalytics.com/sales-forecasting-with-prophet/</link>
		
		<dc:creator><![CDATA[Nick]]></dc:creator>
		<pubDate>Mon, 04 Dec 2023 14:46:47 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://nickanalytics.com/?p=334</guid>

					<description><![CDATA[In this blog I delve into Sales Forecasting using Prophet, a library by Meta. The blog walks you through data preparation, model building and evaluation. My insights offer practical tips on how to optimize inventory, and balance marketing plans and warehouse staff.]]></description>
										<content:encoded><![CDATA[
<div class="et_pb_section et_pb_section_4 et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_4">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_4  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module et_pb_text et_pb_text_58  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h1 data-sourcepos="3:1-3:62" style="text-align: left;">Forecasting Sales Using Prophet</h1>
<p data-sourcepos="5:1-5:294">Welcome to another post about my Data Analytics journey. As you all know businesses rely heavily on accurate forecasting to make informed decisions and plan for the future. Time series forecasting, in particular, provides valuable insights into trends and patterns, making it a crucial tool for various industries.</p>
<p data-sourcepos="7:1-7:293">In this blog post, I&#8217;ll explore the process of forecasting sales using a python library called &#8216;<strong>Prophet</strong>&#8216;. Prophet was developed by Meta&#8217;s Core Data Science team, and is still a powerful tool today. I&#8217;ll walk you through each step of the predicting process, from data preparation to model evaluation, offering practical insights and tips along the way.</p>
<p>&nbsp;</p>
<h2>Introduction: e-commerce data</h2>
<p><span>Sales forecasting is an essential task for businesses across all sectors. Whether you&#8217;re a retail giant or a small-scale e-commerce store, understanding future sales trends can help:</span></p>
<p><span><strong>&#8211; optimize inventory management</strong></span></p>
<p><span><strong>&#8211; plan marketing campaigns</strong></span></p>
<p><span><strong>&#8211; allocate resources</strong> <strong>effectively</strong>. </span></p>
<p><span>The data I&#8217;m using in this project is a sample of the sales from an e-commerce webshop. Sales were recorded over a 2 year period.</span></p>
<p>&nbsp;</p>
<h2><strong>Getting Started</strong></h2>
<h3>Understanding the dataset</h3>
<p>The first step in understanding a dataset is to do some <strong>exploratory analysis</strong>. We want to know how large the file is, how many columns we&#8217;ve got, the min &amp; max values for each datetime and numerical column, and some statistical information.</p>
<p>In my case, I&#8217;ll be analyzing sales data, which typically includes information such as product names, quantities sold, and timestamps of transactions.</p>
<h3>Check data quality</h3>
<p>This step involves a number of actions, like:</p>
<p>&#8211; checking each column for <strong>missing values</strong></p>
<p>&#8211; checking the set for <strong>outliers</strong> (for example: atypical sales quantities or prices)</p>
<p>In most cases some information needs to be removed or re-engineered to make the data suitable for further processing. Empty or null values can make the model less accurate in its predictions.</p></div>
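<p>A minimal sketch of these checks, assuming the sales data sits in a hypothetical sales.csv with a &#8216;Discount&#8217; column:</p>
<pre><code># Sketch: missing-value counts plus a box plot for outlier spotting
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")

print(df.isna().sum())         # missing values per column

df.boxplot(column="Discount")  # surface atypical values
plt.show()
</code></pre>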
			</div><div class="et_pb_module et_pb_image et_pb_image_27">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/03/outliers.png" class="et_pb_lightbox_image" title="outliers"><span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="647" height="545" src="https://nickanalytics.com/wp-content/uploads/2024/03/outliers.png" alt="outliers" title="outliers" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/03/outliers.png 647w, https://www.nickanalytics.com/wp-content/uploads/2024/03/outliers-480x404.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 647px, 100vw" class="wp-image-338" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_59  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p style="text-align: left;">Ideally most points fall within the blue box, or within the whiskers (the 2 vertical black lines). But as you can see there are quite a number of individual black data points that can be considered <strong>outliers</strong>. Let me explain the plot in a bit more detail:</p>
<ul style="text-align: left;">
<li><strong>Box</strong>: The box represents the interquartile range (IQR), which spans from the 25th percentile (Q1) to the 75th percentile (Q3) of the data distribution. The length of the box indicates the spread of the middle 50% of the data. The line inside the box represents the median (50th percentile) of the data.</li>
<li><strong>Whiskers</strong>: The whiskers extend from the edges of the box to the furthest data points within 1.5 times the IQR from the quartiles. Any data points beyond the whiskers are considered outliers and are plotted individually as points.</li>
<li><strong>Outliers</strong>: The individual data points that fall outside the whiskers are plotted as individual points. These points represent values that are significantly different from the rest of the data and may need further investigation.</li>
</ul>
<p style="text-align: left;">For the objective of this project I had no need to investigate the anomalies in the Discount column. I just used it as an example of what you can find when investigating a dataset.</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_60  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h2 style="text-align: left;">Data Filtering</h2>
<p style="text-align: left;">In order to create a prediction model I experimented using just <strong>one product</strong> from the entire dataset. So I filtered the data and then made a split in order to prepare it for the Prophet model</p>
<p style="text-align: left;">Filtering steps I took in this e-commerce dataset were:</p>
<h3 style="text-align: left;">1. Identifying the Most Popular Product</h3>
<p style="text-align: left;">To demonstrate the forecasting process, I began by identifying the most popular product in the sales dataset. This involves analyzing the total quantity of each product sold over the entire time period (3 years. Result was one product (code: <span>Go-Wo-NMDVGP) sold on <strong>905</strong> days.</span></p>
<p style="text-align: left;"><span></span></p>
<h3 style="text-align: left;"><span>2. Check its sales over time (3-year period)</span></h3>
<p style="text-align: left;"><span></span>As you can see sales seem to have a pretty regular pattern (but is that the whole story&#8230;?)</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_28">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/03/current_sales.png" class="et_pb_lightbox_image" title="current sales"><span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="1442" height="498" src="https://nickanalytics.com/wp-content/uploads/2024/03/current_sales.png" alt="current sales" title="current_sales" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/03/current_sales.png 1442w, https://www.nickanalytics.com/wp-content/uploads/2024/03/current_sales-1280x442.png 1280w, https://www.nickanalytics.com/wp-content/uploads/2024/03/current_sales-980x338.png 980w, https://www.nickanalytics.com/wp-content/uploads/2024/03/current_sales-480x166.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) and (max-width: 1280px) 1280px, (min-width: 1281px) 1442px, 100vw" class="wp-image-343" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_61  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h2 style="text-align: left;">Building the Forecasting Model</h2>
<p style="text-align: left;">My goal is to make forecasts to see what sales we can expect in the coming weeks or year. We have the historical data and need to feed it into a Prophet model. Steps I took are:</p>
<h3 style="text-align: left;">1. Split the Data</h3>
<p style="text-align: left;">Having the target product, I split the sales data into <strong>training</strong> and <strong>testing</strong> <strong>sets</strong>. The training set will be used to train the Prophet model, while the testing set will be used to evaluate its performance.</p>
<h3 style="text-align: left;">2. Create the Prophet Model</h3>
<p style="text-align: left;">With the data prepared, I created a Prophet model and fit it to the training data. Prophet&#8217;s intuitive interface allows to specify various parameters, such as seasonality and holidays, to customize the forecasting model according to our dataset.</p>
<p style="text-align: left;">I created this piece of python code to create the model:</p>
<div>
<div><span>model_Go_Wo_NMDVGP</span><span> </span><span>=</span><span> </span><span>Prophet</span><span>(</span><strong>weekly_seasonality</strong><span>=</span><span>&#8216;auto&#8217;</span><span>, </span><strong>holidays</strong><span>=</span><span>None</span><span>)</span></div>
</div>
<p>This tells the model that I wanted to include seasonality on a weekly basis, and disregard any holidays.</p>
<p>&nbsp;</p>
<h2>Generating Forecasts</h2>
<p>With the trained Prophet model, I can now generate forecasts for future time periods.<br />In my example, I have predicted sales for the next 52 weeks, thus providing valuable insights into long-term trends and potential fluctuations.</p></div>
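<p>A minimal sketch of the full fit-and-forecast loop, assuming the training data is already in Prophet&#8217;s required shape, with a &#8216;ds&#8217; date column and a &#8216;y&#8217; sales column (the file name is an assumption):</p>
<pre><code># Sketch: fit Prophet on the history and forecast 52 weeks ahead
import pandas as pd
from prophet import Prophet

train_df = pd.read_csv("weekly_sales.csv")  # hypothetical file with 'ds' and 'y'

model = Prophet(weekly_seasonality='auto', holidays=None)
model.fit(train_df)

# Extend the timeline 52 weekly periods past the training data
future = model.make_future_dataframe(periods=52, freq='W')
forecast = model.predict(future)

# yhat is the forecast; yhat_lower/yhat_upper bound the confidence interval
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())
</code></pre>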
			</div><div class="et_pb_module et_pb_image et_pb_image_29">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/03/forecast.png" class="et_pb_lightbox_image" title="forecast"><span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="900" height="600" src="https://nickanalytics.com/wp-content/uploads/2024/03/forecast.png" alt="forecast" title="forecast" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/03/forecast.png 900w, https://www.nickanalytics.com/wp-content/uploads/2024/03/forecast-480x320.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 900px, 100vw" class="wp-image-344" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_62  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p style="text-align: left;"><em>The plot shows all weekly data points,<strong> a (rising) trend line</strong> and <strong>a forecast line</strong> (with  <strong>confidence interval</strong>).<br />Interestingly there are some points outside the confidence intervals indicating the wideness of the spread (variance) of sales over the weeks. Orders for over 30 items are non-typical, but they show up from time to time. It is important to further zoom to these occurances.</em></p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_63  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h2 style="text-align: left;">Plotting 3 trend components</h2>
<p>To visualize the individual components of the trends and patterns, I have a created a 3-chart plot. This forecast plot depicts the <strong>long term trend</strong>, the expected <strong>yearly trend</strong> and the <strong>weekly trend</strong>.</p></div>
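<p>Prophet produces this decomposition with a single call, continuing from the fitted model above:</p>
<pre><code># Sketch: split the forecast into trend and seasonal component plots
fig = model.plot_components(forecast)
</code></pre>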
			</div><div class="et_pb_module et_pb_image et_pb_image_30">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/03/trends.png" class="et_pb_lightbox_image" title=""><span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="900" height="600" src="https://nickanalytics.com/wp-content/uploads/2024/03/trends.png" alt="" title="trends" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/03/trends.png 900w, https://www.nickanalytics.com/wp-content/uploads/2024/03/trends-480x320.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 900px, 100vw" class="wp-image-345" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_64  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p style="text-align: left;">The plot shows 3 components:</p>
<ol>
<li><strong>Trend Component</strong>: It shows the overall trend in the data. It helps visualize the long-term behavior of the time series data, allowing you to identify patterns and trends.</li>
<li><strong>Seasonality Component</strong>: The second plot is the seasonality component of the forecast. In my case it illustrates the weekly seasonality. By examining this plot, you can identify seasonal fluctuations and understand how they contribute to the overall pattern of the time series.</li>
<li><strong>Weeky Component</strong>: The plot on the bottom depicts the weekly sales or ordering pattern of this product. Monday&#8217;s and Saturday&#8217;s don&#8217;t seem to be very popular <img decoding="async" src="https://nickanalytics.com/wp-content/themes/Divi/includes/builder/frontend-builder/assets/vendors/plugins/emoticons/img/smiley-undecided.gif" alt="undecided" /></li>
</ol>
<h2></h2>
<h2>Changepoints in the trend</h2>
<p>The Prophet tool is good at indicating clear changepoints in trend, but in our case the trendline is steadily moving up, and does not show major breaks in the trend.</p></div>
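<p>A small sketch of how the detected changepoints can be overlaid on the forecast plot, again continuing from the fitted model:</p>
<pre><code># Sketch: draw the trend line and changepoints over the forecast
from prophet.plot import add_changepoints_to_plot

fig = model.plot(forecast)
add_changepoints_to_plot(fig.gca(), model, forecast)
</code></pre>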
			</div><div class="et_pb_module et_pb_image et_pb_image_31">
				
				
				
				
				<a href="https://nickanalytics.com/wp-content/uploads/2024/03/changepoints.png" class="et_pb_lightbox_image" title=""><span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="989" height="590" src="https://nickanalytics.com/wp-content/uploads/2024/03/changepoints.png" alt="" title="changepoints" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/03/changepoints.png 989w, https://www.nickanalytics.com/wp-content/uploads/2024/03/changepoints-980x585.png 980w, https://www.nickanalytics.com/wp-content/uploads/2024/03/changepoints-480x286.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 989px, 100vw" class="wp-image-342" /></span></a>
			</div><div class="et_pb_module et_pb_text et_pb_text_65  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p style="text-align: right;"><em>In red the main trend line.</em></p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_66  et_pb_text_align_left et_pb_text_align_justified-phone et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h2 style="text-align: left;">Evaluating Performance</h2>
<h3></h3>
<p>To assess the accuracy of the forecast, I have calculated performance metrics such as Root Mean Squared Error (RMSE) and <strong>Mean Absolute Error (MAE)</strong>. These metrics quantify the difference between the predicted and actual values, providing a measure of the model&#8217;s predictive power. In my case having a wide spread in demand from 1 to sometimes over 50, resulted in an overall 3 year <strong>MAE of 9</strong>.</p>
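<p>A minimal sketch of the metric calculation, assuming a hypothetical test_df holding the held-out actuals with the same &#8216;ds&#8217; and &#8216;y&#8217; columns:</p>
<pre><code># Sketch: compare the forecast against the held-out actuals
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

merged = forecast.merge(test_df, on='ds')   # align predictions with actuals
mae = mean_absolute_error(merged['y'], merged['yhat'])
rmse = np.sqrt(mean_squared_error(merged['y'], merged['yhat']))
print(f"MAE: {mae:.1f}, RMSE: {rmse:.1f}")
</code></pre>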
<p>&nbsp;</p>
<h2>Conclusion</h2>
<p>In this blog post, I have explored the process of forecasting sales using Prophet. I have given insight into its capabilities by showing graphs of how forecasts depict trend and seasonal fluctuation. This is information that can be hidden under the surface of any business involved in sales or logistics.</p>
<p>By following the steps outlined above, businesses can leverage the power of time series analysis to make data-driven decisions and gain a competitive edge in today&#8217;s markets.</p>
<p>I believe that every data scientist or business owner should look at this as an opportunity to optimize his or her daily operations, thus driving growth and profit.</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_67  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h3>The entire code</h3>
<p>Check out all of the code of this project at Github: <a href="https://github.com/nickanalytics/Demand-Prediction-with-Prophet/blob/main/Weekly%20Demand%20Prediction%20with%20Prophet.ipynb">Nick Analytics &#8211; Demand Prediction with Prophet</a></p></div>
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>House Value Prediction with ML</title>
		<link>https://www.nickanalytics.com/house-value-prediction-with-ml/</link>
		
		<dc:creator><![CDATA[Nick]]></dc:creator>
		<pubDate>Wed, 22 Nov 2023 12:33:02 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<guid isPermaLink="false">https://nickanalytics.com/?p=199</guid>

					<description><![CDATA[Join me into the exciting world of predictive modeling in real estate ! I used Kaggle’s House Prices dataset to build a powerful machine learning model that predicts house prices. Let me show you which the steps I took to reach my goal and how I uncovered key market insights.]]></description>
										<content:encoded><![CDATA[
<div class="et_pb_section et_pb_section_5 et_section_regular" >
				
				
				
				
				
				
				<div class="et_pb_row et_pb_row_5">
				<div class="et_pb_column et_pb_column_4_4 et_pb_column_5  et_pb_css_mix_blend_mode_passthrough et-last-child">
				
				
				
				
				<div class="et_pb_module et_pb_text et_pb_text_68  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h1>Predicting House Prices: My Data Science Journey</h1>
<p>Welcome to another blog post! Today, I&#8217;m delving into the exciting world of predictive modeling in real estate using the <strong>House Prices dataset</strong> from Kaggle&#8217;s Advanced Regression Techniques competition.<br />In this (brief) post, I&#8217;ll walk you through the entire process of preprocessing the data, building a machine learning model, and making predictions. So, grab your favorite beverage, and let&#8217;s dive in!</p>
<p>&nbsp;</p>
<h2>Introduction</h2>
<p>The housing market is a complex ecosystem influenced by various factors ranging from location and size to architectural style and amenities. <strong>Predicting house prices</strong> accurately is crucial for both buyers and sellers. In this project, I aim to develop a robust predictive model that can estimate house prices based on several features provided in the dataset.</p>
<p>&nbsp;</p>
<h2>Exploring the Housing Dataset</h2>
<p><span>Before diving into the pre-processing part, let&#8217;s first take a closer look at our dataset.<br />I started by conducting some general (statistical) checks using:</span></p>
<p><span>&#8211; the </span><code>describe()</code><span> function</span></p>
<p><span>&#8211; the <code>info()</code> function</span></p>
<p><span>&#8211; the <code>shape</code> attribute</span></p>
<p><span> That gave me some knowledge about the summary statistics of the features (columns) and some key metrics like <strong>min/max values</strong>, and <strong>standard deviation</strong> of each column.</span></p>
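<p>A minimal sketch of these first checks, assuming the Kaggle train.csv sits locally:</p>
<pre><code># Sketch: first statistical look at the housing data
import pandas as pd

df = pd.read_csv("train.csv")

print(df.shape)       # number of rows and columns
df.info()             # dtypes and non-null counts per column
print(df.describe())  # min/max, quartiles, mean, std per numeric column
</code></pre>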
<p>In order to gain more insights into the <strong>distribution of the sales prices</strong> I created a distribution plot that looks like this:</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_32">
				
				
				
				
				<span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="1001" height="499" src="https://nickanalytics.com/wp-content/uploads/2024/03/Sales-Price-Distribution5.jpg" alt="Sales Price Distribution" title="Sales Price Distribution5" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/03/Sales-Price-Distribution5.jpg 1001w, https://www.nickanalytics.com/wp-content/uploads/2024/03/Sales-Price-Distribution5-980x489.jpg 980w, https://www.nickanalytics.com/wp-content/uploads/2024/03/Sales-Price-Distribution5-480x239.jpg 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1001px, 100vw" class="wp-image-210" /></span>
			</div><div class="et_pb_module et_pb_text et_pb_text_69  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p>The distribution of the Sales Prices looks to be <strong>right skewed</strong>. This means:</p>
<ul>
<li><strong>The Majority of Houses are Lower Priced:</strong> The peak of the distribution, where most of the data points lie, is towards the lower end of the price range. This suggests that there are more houses with lower prices compared to higher prices.</li>
</ul>
<ul>
<li><strong>Fewer Expensive Houses:</strong> As the distribution extends towards the higher prices, there are fewer houses with expensive prices. This could indicate that high-priced houses are less common or less frequently sold compared to lower-priced ones.</li>
</ul>
<ul>
<li><strong>Right Tail:</strong> The right-skewed nature of the distribution means that there are some houses with exceptionally high prices, leading to a longer tail on the right side of the distribution.</li>
</ul>
<p>&nbsp;</p>
<h2>Preprocessing of the Data</h2>
<p>Preprocessing of data is a vital step in every Data Science project. Most datasets are far from perfect and need to undergo several steps before they can serve as input to a Machine Learning model.</p>
<p>The steps I took in this housing dataset were:</p>
<h3>1. Handling Missing Values</h3>
<p>One of the initial challenges in any data science project is dealing with missing values. In my case most missing values existed in the &#8216;LotFrontage&#8217; feature. In order to tackle this problem I created a Random Forest ML model to predict the values of this missing data. It would have been an option to remove this column from the dataset, or to remove rows with missing data, but I decided to pursue a more solid solution.</p>
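<p>A minimal sketch of that imputation idea, continuing from the df loaded above and assuming two numeric helper features (the real model used more columns):</p>
<pre><code># Sketch: predict missing LotFrontage values with a Random Forest
from sklearn.ensemble import RandomForestRegressor

feats = ["LotArea", "GrLivArea"]            # assumed predictor columns
known = df[df["LotFrontage"].notna()]
unknown = df[df["LotFrontage"].isna()]

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(known[feats], known["LotFrontage"])

# Fill the gaps with the model's predictions
df.loc[df["LotFrontage"].isna(), "LotFrontage"] = rf.predict(unknown[feats])
</code></pre>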
<p>&nbsp;</p>
<h3>2. Checking for Outliers</h3>
<p>Outliers can significantly impact the performance of predictive models. I utilized <b>pairplots</b> (see further) and the <strong>Z-score analysis</strong> to identify and remove outliers from the training data, ensuring the model learns from clean and reliable data.</p>
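<p>A small sketch of the Z-score part, using the common cutoff of 3 standard deviations:</p>
<pre><code># Sketch: drop rows whose SalePrice deviates more than 3 standard deviations
import numpy as np
from scipy import stats

z = np.abs(stats.zscore(df["SalePrice"]))
df = df[z &lt; 3]
</code></pre>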
<p>&nbsp;</p>
<h3>3. Encoding Categorical Variables</h3>
<p>Categorical variables need to be encoded into a numerical format before feeding them into machine learning models. I employed <strong>LabelEncoding</strong> to do so. LabelEncoding transforms the categorical data in each column into numbers.</p>
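<p>A minimal sketch of that encoding step, applied to every object-typed column of the df from the steps above:</p>
<pre><code># Sketch: turn each categorical column into integer codes
from sklearn.preprocessing import LabelEncoder

for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))
</code></pre>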
<p>&nbsp;</p>
<h2>Exploratory Data Analysis</h2>
<p>&nbsp;</p>
<h3 style="text-align: left;">Understanding Feature Relationships</h3>
<p>Before getting into the model building, it&#8217;s essential to explore the <strong>relationships</strong> between our features and the target variable (Sales Price). I visualized these relationships using <strong>pairplots</strong> for both <span style="text-decoration: underline;">numeric</span> and <span style="text-decoration: underline;">categorical</span> features. This provides insights into potential correlations and trends.</p>
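<p>A minimal sketch of such a pairplot, picking a few numeric columns by hand:</p>
<pre><code># Sketch: pairwise scatter plots of a few numeric features and SalePrice
import seaborn as sns
import matplotlib.pyplot as plt

cols = ["SalePrice", "GrLivArea", "TotalBsmtSF", "OverallQual"]
sns.pairplot(df[cols])
plt.show()
</code></pre>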
<p>In the 2 images below you can see the relationship between some numeric features and price (first one). The second one shows the count of categorical features.</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_33">
				
				
				
				
				<span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="1365" height="183" src="https://nickanalytics.com/wp-content/uploads/2024/03/pairplots3.jpg" alt="" title="pairplots3" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/03/pairplots3.jpg 1365w, https://www.nickanalytics.com/wp-content/uploads/2024/03/pairplots3-1280x172.jpg 1280w, https://www.nickanalytics.com/wp-content/uploads/2024/03/pairplots3-980x131.jpg 980w, https://www.nickanalytics.com/wp-content/uploads/2024/03/pairplots3-480x64.jpg 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) and (max-width: 1280px) 1280px, (min-width: 1281px) 1365px, 100vw" class="wp-image-217" /></span>
			</div><div class="et_pb_module et_pb_image et_pb_image_34">
				
				
				
				
				<span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="1449" height="191" src="https://nickanalytics.com/wp-content/uploads/2024/03/pairplots4.jpg" alt="" title="pairplots4" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/03/pairplots4.jpg 1449w, https://www.nickanalytics.com/wp-content/uploads/2024/03/pairplots4-1280x169.jpg 1280w, https://www.nickanalytics.com/wp-content/uploads/2024/03/pairplots4-980x129.jpg 980w, https://www.nickanalytics.com/wp-content/uploads/2024/03/pairplots4-480x63.jpg 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) and (max-width: 1280px) 1280px, (min-width: 1281px) 1449px, 100vw" class="wp-image-218" /></span>
			</div><div class="et_pb_module et_pb_text et_pb_text_70  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h3>Correlation Analysis</h3>
<p>I calculated the correlation coefficients between numeric features and Sales Price and visualized them using a heatmap. This allowed us to identify the most influential features affecting house prices.</p></div>
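<p>A minimal sketch of the heatmap step:</p>
<pre><code># Sketch: correlation heatmap over the numeric features
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.select_dtypes(include="number").corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="Blues")
plt.show()
</code></pre>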
			</div><div class="et_pb_module et_pb_image et_pb_image_35">
				
				
				
				
				<span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="1628" height="1561" src="https://nickanalytics.com/wp-content/uploads/2024/03/heatmap_tiny.png" alt="" title="heatmap_tiny" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/03/heatmap_tiny.png 1628w, https://www.nickanalytics.com/wp-content/uploads/2024/03/heatmap_tiny-1280x1227.png 1280w, https://www.nickanalytics.com/wp-content/uploads/2024/03/heatmap_tiny-980x940.png 980w, https://www.nickanalytics.com/wp-content/uploads/2024/03/heatmap_tiny-480x460.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) and (max-width: 1280px) 1280px, (min-width: 1281px) 1628px, 100vw" class="wp-image-317" /></span>
			</div><div class="et_pb_module et_pb_text et_pb_text_71  et_pb_text_align_left et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p>The plot is a bit hard to read on a small screen but this map tells us that the strongest correlation (the darker blue the stronger) exists between:</p>
<ul>
<li><strong>GrLivArea</strong>: Above grade (ground) living area square feet &amp;</li>
<li><strong>TotRmsAbvGrd</strong>: Total rooms above grade (does not include bathrooms)</li>
</ul>
<p>This correlation between them is <strong>0.83</strong> (closer to 1 means stronger)</p>
<p>In Machine Learning we could now decide to remove one of these columns to reduce complexity.</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_72  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h2></h2>
<h2>Building the Machine Learning Model</h2>
<p>After completing the pre-processing steps, it is time to create the Machine Learning model. We are dealing with a challenge where we want to predict housing prices. As housing prices can take pretty much any value, we consider our predictions to be &#8216;continuous&#8217; (as opposed to for example predicting a fixed outcome of &#8216;yes or no&#8217;, &#8216;true or false&#8217; etc.).</p>
<p>&nbsp;</p>
<h3>Linear Regression</h3>
<p>I started with a simple yet powerful Linear Regression model to predict the housing prices. After training the model on my pre-processed data, I evaluated its performance using metrics such as RMSE and R2 score.</p>
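<p>A minimal sketch of this step, continuing from the preprocessed frame df above with SalePrice as the target:</p>
<pre><code># Sketch: train and score a linear regression on the preprocessed data
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X = df.drop(columns="SalePrice")
y = df["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)

print("Test score:", r2_score(y_test, preds))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, preds)))
</code></pre>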
<p>In this first iteration the outcome was:</p>
<p><span><code><strong>Validation score</strong>: 0.9244<br />
<strong>Validation RMSE</strong>: 23816.8230<br />
<strong>Test score</strong>: 0.8915<br />
<strong>Test RMSE</strong>: 23576.6045</code></span></p>
<p><strong>Interpretation</strong>:</p>
<ul>
<li>Overall, the linear regression model performs well on the test dataset, as indicated by the relatively <strong>high test score</strong> (0.8915) and the <strong>relatively low RMSE</strong> (23576.6045).</li>
<li>The test R2 score of 0.8915 suggests that the model explains roughly 89% of the variability in house prices using the available features.</li>
<li>The RMSE indicates that the model&#8217;s predictions are, roughly speaking, about $23,576.60 away from the actual house prices in the test dataset (with large errors weighted more heavily than small ones).</li>
</ul>
<h3>Feature Importance</h3>
<p>Understanding which features contribute the most to my model&#8217;s predictions is crucial for making informed decisions. I analysed feature importance using coefficients and permutation techniques, gaining insights into the key drivers of house prices.</p>
<p>The outcome of this analysis is depicted in the plot below. With this information I can keep the most important features and drop the less important ones to improve the model even further. A sketch of the permutation approach follows.</p>
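<p>A hedged sketch of the permutation technique with scikit-learn&#8217;s <code>permutation_importance</code>, reusing the fitted <code>model</code> from the sketch above. The assumption that <code>X_test</code> is a DataFrame with named columns is for illustration:</p>
<pre><code># A sketch, reusing model, X_test and y_test from the earlier example.
from sklearn.inspection import permutation_importance

result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)

# Pair each feature with its mean importance, strongest first.
importances = sorted(
    zip(X_test.columns, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in importances[:10]:
    print(f"{name}: {score:.4f}")
</code></pre></div>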
			</div><div class="et_pb_module et_pb_image et_pb_image_36">
				
				
				
				
				<span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="849" height="834" src="https://nickanalytics.com/wp-content/uploads/2024/03/feature-importance.png" alt="" title="feature importance" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/03/feature-importance.png 849w, https://www.nickanalytics.com/wp-content/uploads/2024/03/feature-importance-480x472.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 849px, 100vw" class="wp-image-230" /></span>
			</div><div class="et_pb_module et_pb_text et_pb_text_73  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p>This plot confirms the most important features (columns). Those are the <strong>size of the house</strong>, <strong>overall quality</strong> and <strong>the externals</strong>.</p>
<p>&nbsp;</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_74  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h2>Model Evaluation and Prediction</h2>
<p>&nbsp;</p>
<h3>Assessing Model Performance</h3>
<p>I evaluated my model&#8217;s performance by comparing predicted values against actual values using scatter plots. This visual representation allowed me to identify areas of improvement and assess the model&#8217;s accuracy. I created a nice plot that instantly describes how well the model performs.</p></div>
			</div><div class="et_pb_module et_pb_image et_pb_image_37">
				
				
				
				
				<span class="et_pb_image_wrap "><img loading="lazy" decoding="async" width="1448" height="450" src="https://nickanalytics.com/wp-content/uploads/2024/03/predicted_vs_actuals.png" alt="" title="predicted_vs_actuals" srcset="https://www.nickanalytics.com/wp-content/uploads/2024/03/predicted_vs_actuals.png 1448w, https://www.nickanalytics.com/wp-content/uploads/2024/03/predicted_vs_actuals-1280x398.png 1280w, https://www.nickanalytics.com/wp-content/uploads/2024/03/predicted_vs_actuals-980x305.png 980w, https://www.nickanalytics.com/wp-content/uploads/2024/03/predicted_vs_actuals-480x149.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) and (max-width: 1280px) 1280px, (min-width: 1281px) 1448px, 100vw" class="wp-image-231" /></span>
			</div><div class="et_pb_module et_pb_text et_pb_text_75  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><p>As you can see the model works pretty well and has a linear slope. There are a number of outliers, especially when <strong>prices are over 300k</strong>. The model seems to <strong>underestimate</strong> a number of cases. Further investigation, and maybe introducing a 2nd model for high-end homes, could improve my model&#8217;s accuracy.</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_76  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h3>Making Predictions</h3>
<p>With the trained model, I&#8217;ve made predictions on the test dataset to participate in the Kaggle competition. I&#8217;m excited to see how my model performs against other competitors and to contribute to the advancement of predictive modelling in real estate. A sketch of the submission step is shown below.</p>
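<p>A hedged sketch of that submission step. The variable names <code>X_kaggle</code> and <code>test_ids</code> are assumptions for illustration; the <code>Id</code> and <code>SalePrice</code> column names follow the competition&#8217;s sample submission format:</p>
<pre><code># A sketch, assuming X_kaggle holds the competition test features
# (pre-processed like the training data) and test_ids the matching Id column.
import pandas as pd

predictions = model.predict(X_kaggle)

submission = pd.DataFrame({"Id": test_ids, "SalePrice": predictions})
submission.to_csv("submission.csv", index=False)
</code></pre>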
<p>Stay tuned for updates on my model&#8217;s performance and further insights from the competition!</p></div>
			</div><div class="et_pb_module et_pb_text et_pb_text_77  et_pb_text_align_justified et_pb_bg_layout_light">
				
				
				
				
				<div class="et_pb_text_inner"><h3>The entire code</h3>
<p>Check out all the code for this project on GitHub: <a href="https://github.com/nickanalytics/House-Value-Prediction-with-ML">Nick Analytics &#8211; House-Value-Prediction-with-ML</a></p></div>
			</div>
			</div>
				
				
				
				
			</div>
				
				
			</div>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
