Article, Data Science / Machine Learning

Vector Autoregression (VAR) for house property sales time series

This blog post aims to generate forecasts of house property sales using Vector Autoregression (VAR) models. The dataset is downloaded from: https://www.kaggle.com/htagholdings/property-sales?select=ma_lga_12345.csv We use the “ma_lga_12345.csv” dataset, which contains data resampled using a Median Price Moving Average (MA) in quarterly intervals. The data range from 30 September 2007 to 30 September 2019 at the time of download (8 July 2020). We focus on predicting the house price …

dinhtrang24 | July 22, 2020 | September 11, 2020

Article, Data Science / Machine Learning, Finance

Forecast analysis with Random Forest for house property sales data.

In this blog post, I will perform a House Property Sales forecast using a Random Forest technique with a Linear Regression and a Time Series. Two datasets were used to build these models. The raw data: 29,580 observations of recorded sales data from 2007 to 2019. The MA data: 347 observations of the Moving Average of Median Price, grouped by quarterly intervals per property type …

mferibg | July 15, 2020 | July 16, 2020

AlgorithmTips, Article, Data Science / Machine Learning, Security

Web Scraping – How to retrieve 1807 skills using three lines of code

This blog post details the steps required to begin your journey into web scraping. Web scraping can help solve many of the challenges faced in an ever-increasing digital world. Some of these challenges include being able to process the vast amount of data online, having a system that can react quickly when this data changes frequently, and making sure that the quality of the …

edouardpayne94 | July 13, 2020 | July 13, 2020

Article, Data posts, Data Science / Machine Learning, Finance, Finance Data

Dataset: House Property Sales. Exploratory analysis.

By Maria Fernanda Ibarra Gutiérrez and Thu Trang Dinh

In this blog post, we describe the House Property Sales database, which can be downloaded from: https://www.kaggle.com/htagholdings/property-sales?select=raw_sales.csv According to the first Figure, this database describes some characteristics of the property sales in 5 variables and 29,580 observations, from 7 February 2007 to 26 July 2019. This database does not have …

mferibg | June 29, 2020 | June 30, 2020

Article, Finance

The impact of Covid-19 in World’s Economy

By Maria Fernanda Ibarra Gutiérrez

The coronavirus disease (Covid-19) is a worldwide health problem that, according to the World Health Organization (WHO), has spread to 213 countries. As of 13 April 2020, there were 1,807,308 cases around the world according to the Our World in Data database (Ritchie, 2020). At the current moment, the United States has the highest number of cases …

mferibg | April 13, 2020 | April 14, 2020

Article, Data Science / Machine Learning, LifeScience

Future predictions of Coronavirus cases using ARIMA model

This post aims to track the spread of COVID-19, also known as 2019 Novel Coronavirus. It is a new respiratory virus first identified in Wuhan in December 2019. According to the Centers for Disease Control and Prevention (2020), the virus probably emerged initially from an animal source, but many cases now indicate that person-to-person spread is occurring. At this time, how easily or sustainably this …

dinhtrang24 | April 13, 2020 | April 17, 2020

AlgorithmTips

Visualization with R Studio

“The simple graph has brought more information to the data analyst’s mind than any other device.” (John Tukey, American mathematician and co-creator of the Cooley–Tukey Fast Fourier Transform algorithm)

During data exploration, many approaches lead us to different aspects of the dataset. Among them, visualization is a recommended starting point because its results are clear and straightforward. The output of visualization could help …

KEFANJIN | March 27, 2020

Article, Security, SocialScience Data

Article review: Exploring crime patterns in Mexico City

By Maria Fernanda Ibarra Gutiérrez

Big Data analysis is a research approach that has been growing in importance for studying several aspects of society, as we live surrounded by governmental and private systems, technological devices and social media platforms that gather information from our daily activities, choices, purchases, searches, health patterns and other digital touchpoints. Therefore, there is a large amount of data suitable for …

mferibg | February 27, 2020 | April 15, 2020

AlgorithmTips, Data posts, SocialScience Data

Introduction to scatter plot

By Maria Fernanda Ibarra Gutiérrez

This blog looks at the ways in which scatter plots can be used to visualise multiple sets of data and the relationships between several variables. It takes a data set and deals with outliers, formatting the graphs for clarity, using bubbles to show a third variable, adding regression models and trends to the plots, and splitting the data into separate …

mferibg | February 27, 2020 | February 27, 2020

AlgorithmTips, Article, Data Science / Machine Learning

How to learn R

There is an incredibly large number of resources available for learning R, starting from the R Manual and going on to step-by-step books, YouTube videos, more books, R blogs and so on. Here you will find a list of only some of these resources.

Where to get R (software): you can download R from https://cran.r-project.org/. R-Studio: an R editor with additional extras, which provides …

leospinaf | February 6, 2020 | July 24, 2020

AlgorithmTips, Data Science / Machine Learning, GuestPost

Determining the Number of Clusters

By Adán José-García

Discovering the number of clusters (k) in a dataset is a fundamental problem in data clustering, or cluster analysis. Clustering is an unsupervised learning technique that aims to discover the natural partition of data objects into clusters. Clustering algorithms can be broadly divided into two groups: hierarchical and partitional. Both categories of clustering algorithms, e.g., single-link and k-means algorithms, require as input the …

leospinaf | January 20, 2020 | January 20, 2020

Article, Data posts, Finance, Finance Data

Dataset I: The Atlas of Economic Complexity-Part A

In this blog post, I would like to introduce the Atlas of Economic Complexity. It is a powerful data visualisation tool developed by the Harvard Growth Lab. Even at first glance, there is an abundance of information presented in a compelling way on the homepage. If we take more time to dive into the data and tweak different settings, the website delivers even more knowledge and …

keithwangjunzhe | November 22, 2019 | December 4, 2019

Article, Data posts, Finance, Finance Data

Dataset I: The Atlas of Economic Complexity-Part B

The first dataset we would like to discuss is the Atlas of Economic Complexity, derived from http://atlas.cid.harvard.edu/explore. After exploring and appreciating the charts and graphs in the Part A post, we are also interested in the datasets underlying those graphs. The data sources for global goods trade and service trade are the United Nations Statistical Division and the Direction of Trade Statistics database (IMF) …

KEFANJIN | November 22, 2019 | December 4, 2019

Article, Data posts, Finance, Finance Data

Examining Purchasing Power Parity theory by a time regression model

Introduction

According to the Bank for International Settlements (2019), the foreign exchange market (or forex market) is the largest and most liquid financial market in the world, with global daily trading of $6.6 trillion in April 2019. Among leading currencies, the British pound sterling (GBP) ranks fourth among the most widely traded currencies in the world, and the pound has a …

dinhtrang24 | November 10, 2019 | February 28, 2020

Data Science / Machine Learning, SocialScience, SocialScience Data

SDG Indicator Filtering Function

Handling missing values is a common issue in the data preparation step before analysis. In R, missing values are represented by NA, and R offers many NA-related functions for dealing with them. Since we would like to cluster the SDG indicators later, it is highly recommended to construct a filtering function to guarantee there are no NA values in the filtered data …
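As a rough illustration of the filtering idea in this teaser, the sketch below keeps only the indicators that have no missing values. It is written in Python rather than R, and the names `filter_complete`, `max_missing`, and the sample `indicators` data are hypothetical, not taken from the post.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (a stand-in for R's NA)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def filter_complete(indicators, max_missing=0):
    """Keep indicators whose count of missing values is at most max_missing.

    `indicators` maps an indicator name to its list of observations.
    """
    return {
        name: values
        for name, values in indicators.items()
        if sum(is_missing(v) for v in values) <= max_missing
    }


# Hypothetical sample data: two of the three indicators contain a gap.
indicators = {
    "indicator_a": [1.0, 2.0, 3.0],
    "indicator_b": [1.0, None, 3.0],
    "indicator_c": [float("nan"), 2.0, 3.0],
}

complete = filter_complete(indicators)
print(sorted(complete))
```

Raising `max_missing` relaxes the filter, which can be useful when a strict no-gaps rule would discard too many indicators before clustering.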

KEFANJIN | July 29, 2019 | October 24, 2019

Data Science / Machine Learning

Chapter 3 – DBSCAN: Density-Based Clustering

As mentioned before, agglomerative algorithms work slowly on large data sets, k-means cannot be applied to non-convex data sets, and neither is able to detect and remove outliers. Density-based clustering, or DBSCAN, was therefore proposed to meet requirements such as distinguishing and removing noise, handling data sets of arbitrary shape, and processing large data sets more efficiently …

KEFANJIN | July 28, 2019 | October 24, 2019

Data Science / Machine Learning

Chapter 2 – Kmeans Clustering

Partitioning algorithms are among the most widely used and deeply studied clustering algorithms. They aim to partition the dataset into several clusters of similar objects while maximizing the between-cluster variation (Dabbura, 2018). Though there are many modified partitioning algorithms, we will focus on the Kmeans algorithm in this blog. Kmeans has been widely applied in data mining, pattern recognition, image compression and many machine learning fields …
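To make the partitioning idea concrete, here is a minimal sketch of the standard k-means iteration (Lloyd's algorithm) in plain Python. It is not the code from the post: the function name `kmeans`, the explicit `initial_centroids` argument, and the sample points are assumptions chosen to keep the example deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [squared_distance(p, c) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, clusters = kmeans(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print([len(c) for c in clusters])
```

In practice the initial centroids are usually chosen at random or with a seeding scheme, and the result can depend on that choice; passing them in explicitly here only keeps the sketch reproducible.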

KEFANJIN | July 12, 2019 | October 23, 2019

AlgorithmTips, Article

Introduction to ggplot2

We will use bric_data and melted_data1 for this tutorial. bric_data has two variables: Country.Name and gini. melted_data1 has three variables: Country.Name, year, and value.

keithwangjunzhe | July 5, 2019 | October 2, 2019

Data Science / Machine Learning

Chapter 1 – Agglomerative Hierarchical Clustering

Nowadays, clustering techniques are frequently used in data analysis. Among them, partitioning and hierarchical clustering are the two most deeply studied and widely used methods. Unlike partitioning, hierarchical clustering approaches the problem by constructing a hierarchy of clusters; it is therefore heavily used in bioinformatics, e.g. for phylogenetic trees of animal evolution or virus transmission (Kilitcioglu, 2018). In this blog, the hierarchical clustering method …

KEFANJIN | July 1, 2019 | October 24, 2019

Data and Methods Exploration Group

Blog on Data Science research and projects conducted by lab members
