Vector Autoregression (VAR) for house property sales time series
This blog post aims to generate forecasts of house property sales using Vector Autoregression (VAR) models. The dataset is available on Kaggle (https://www.kaggle.com/htagholdings/property-sales?select=ma_lga_12345.csv). We use the “ma_lga_12345.csv” dataset, which contains data resampled using a Median Price Moving Average (MA) at quarterly intervals. The data range from 30 September 2007 to 30 September 2019 at the time of download (8 July 2020). We focus on predicting the house price …
Forecast analysis with Random Forest for house property sales data.
In this blog post, I forecast house property sales using a Random Forest technique with a Linear Regression and a Time Series. Two datasets were used to build these models. The raw data: 29,580 observations of recorded sales data from 2007 to 2019. The MA data: 347 observations of the Moving Average of Median Price, grouped by quarterly intervals per property type …
Web Scraping – How to retrieve 1807 skills using three lines of code
This blog post details the steps required to begin your journey into web scraping. Web scraping can help solve many of the challenges faced in an ever-increasing digital world. Some of these challenges include processing the vast amount of data online, having a system that can react quickly when this data changes frequently, and making sure that the quality of the …
Dataset: House Property Sales. Exploratory analysis.
By Maria Fernanda Ibarra Gutiérrez and Thu Trang Dinh. In this blog post, we describe the House Property Sales dataset, which can be downloaded from Kaggle (https://www.kaggle.com/htagholdings/property-sales?select=raw_sales.csv). According to the first figure, this dataset describes some characteristics of the property sales using 5 variables and 29,580 observations, from 7 February 2007 to 26 July 2019. This dataset does not have …
The impact of Covid-19 in World’s Economy
By Maria Fernanda Ibarra Gutiérrez. The Coronavirus disease (Covid-19) is a worldwide health problem that, according to the World Health Organization (WHO), has spread to 213 countries. As of 13 April 2020, there were 1,807,308 cases around the world according to the Our World in Data database (Ritchie, 2020). At the current moment, the United States has the highest number of cases …
Future predictions of Coronavirus cases using ARIMA model
This post aims to track the spread of COVID-19, also known as 2019 Novel Coronavirus. It is a new respiratory virus first identified in Wuhan in December 2019. According to the Centers for Disease Control and Prevention (2020), the virus probably emerged from an animal source, but many cases now indicate that person-to-person spread is occurring. At this time, how easily or sustainably this …
Visualization with R Studio
“The simple graph has brought more information to the data analyst’s mind than any other device.” (American mathematician John Tukey, co-creator of the Cooley–Tukey Fast Fourier Transform algorithm.) During data exploration, many approaches lead us to different aspects of the dataset. Among them, visualization is a recommended starting point because its results are clear and straightforward. The output of visualization could help …
Article review: Exploring crime patterns in Mexico City
By Maria Fernanda Ibarra Gutiérrez. Big Data analysis is a research approach of growing importance for studying several aspects of society, as we live surrounded by governmental and private systems, technological devices and social media platforms that gather information from our daily activities, choices, purchases, searches, health patterns and other digital touchpoints. Therefore, there is a large amount of data suitable for …
Introduction to scatter plot
By Maria Fernanda Ibarra Gutiérrez. This blog post looks at the ways in which scatter plots can be used to visualise multiple sets of data and the relationships between several variables. It takes a data set and deals with outliers, formatting the graphs for clarity, using bubbles to show a third variable, adding regression models and trends to the plots and splitting the data into separate …
How to learn R
There is an incredibly large number of resources available for learning R, starting with the R Manual, step-by-step books, YouTube videos, more books, R blogs and so on. Here you will find a list of only some of these resources. Where to get R (software): you can download R from https://cran.r-project.org/. RStudio: an R editor with additional extras, which provides …
Determining the Number of Clusters
By Adán José-García. Discovering the number of clusters (k) in a dataset is a fundamental problem in data clustering, or cluster analysis. Clustering is an unsupervised learning technique that aims to discover the natural partition of data objects into clusters. Clustering algorithms can be broadly divided into two groups: hierarchical and partitional. Algorithms in both categories, e.g., single-link (hierarchical) and k-means (partitional), require as input the …
Dataset I: The Atlas of Economic Complexity-Part A
In this blog post, I would like to introduce the Atlas of Economic Complexity. It is a powerful data visualisation tool developed by the Harvard Growth Lab. Even at first glance, there is an abundance of information presented in a compelling way on the homepage. If we take more time to dive into the data and tweak different settings, the website delivers even more knowledge and …
Dataset I: The Atlas of Economic Complexity-Part B
The first dataset we would like to discuss is the Atlas of Economic Complexity, available at http://atlas.cid.harvard.edu/explore. After exploring and appreciating the charts and graphs in the Part A post, we are also interested in the datasets underlying those graphs. The data sources for global goods trade and service trade are the United Nations Statistical Division and the Direction of Trade Statistics database (IMF) …
Examining Purchasing Power Parity theory by a time regression model
Introduction. According to the Bank for International Settlements (2019), the foreign exchange market (or forex market) is the largest and most liquid financial market in the world, with global daily trading of $6.6 trillion in April 2019. Among leading currencies, the British pound sterling (GBP) ranks fourth among the most widely traded currencies in the world, and the pound has a …
SDG Indicator Filtering Function
Handling missing values is a common task in the data preparation step before analysis. In R, missing values are represented by NA, and R provides many functions for dealing with NA values. Since we would like to cluster the SDG indicators later, it is highly recommended to construct a filtering function that guarantees there are no NA values in the filtered data …
Chapter 3 – DBSCAN: Density-Based Clustering
As mentioned before, agglomerative algorithms work slowly on large data sets, Kmeans cannot be applied to non-convex data sets, and neither is able to detect and remove outliers. Thus, density-based clustering, or DBSCAN, was proposed to meet requirements such as identifying and removing noise, handling data sets of arbitrary shape, and processing large data sets more efficiently …
Chapter 2 – Kmeans Clustering
Partitioning algorithms are among the most widely used and deeply studied clustering algorithms. They aim to partition the dataset into several clusters of similar objects while maximizing the between-cluster variation (Dabbura, 2018). Though there are many modified partitioning algorithms, we will focus on the Kmeans algorithm in this blog post. Kmeans has been widely applied in data mining, pattern recognition, image compression and many machine learning fields …
Introduction to ggplot2
We will use the bric_data and melted_data1 for this tutorial. bric_data has two variables: Country.Name and gini; melted_data1 has three variables: Country.Name, year, and value.
Chapter 1 – Agglomerative Hierarchical Clustering
Nowadays, clustering techniques are frequently used in data analysis. Among them, partitioning and hierarchical clustering are the two most deeply studied and widely used clustering methods. Unlike partitioning, hierarchical clustering approaches the problem by constructing a hierarchy of clusters; it is therefore heavily used in bioinformatics, e.g., for phylogenetic trees of animal evolution or virus transmission (Kilitcioglu, 2018). In this blog post, the hierarchical clustering method …
