Data Science / Machine Learning

Blog posts relating to Data Science / Machine Learning methodologies, as well as descriptions of some interesting data sets.

Paper summary: On the modelling and impact of negative edges in Graph Convolutional Networks for node classification
October 31, 2024
Introduction: In the paper “On the modelling and impact of negative edges in graph convolutional networks for node classification” (Dinh, Handl and Ospina-Forero, 2023), accepted by the NeurIPS 2023 Workshop: New Frontiers in Graph Learning, the authors examine existing Graph Convolutional Network (GCN) frameworks for node classification in signed graphs, focusing on how these frameworks integrate signed edge information and on their strengths and weaknesses. The authors conducted …
Belief aggregation through weighted Dempster’s rule of combination
August 13, 2023
Decision-making is a dynamic process that runs from evidence accumulation to belief adjustment, and belief aggregation is a crucial step in this process. In many decision-making scenarios, individuals may have different beliefs or preferences, and the goal of belief aggregation is to arrive at a consensus or collective decision that represents the overall view of the group. This process can involve various methods such …
Graph Convolutional Networks for node classification in signed graphs - Part 1
July 27, 2023
Introduction: Signed graphs are a type of graph that can simultaneously express positive and negative relationships. These data structures have been receiving increasing attention due to the rising popularity of online social networks. For example, in social graphs, people create positive relationships, such as friendships, trust, and approval, as well as negative relationships, such as foes, distrust and disapproval. Compared to unsigned graphs that only …
Ball mapper over bank’s customers.
July 3, 2023
In this blog post, I will show an R application of a Topological Data Analysis tool called Ball Mapper (BM) to visualise how the bank’s customers who have stayed with or exited the bank are distributed across the joint distribution of the customers’ characteristics. BM is a useful tool for visualising datasets with multiple dimensions. To do so, BM summarises points that are close to each …
Article review: Generalized measures for the evaluation of Community Detection methods.
April 25, 2022
In this blog post, I will summarise an article that proposes modified versions of three community detection assessment measures (Purity, Adjusted Rand Index and Normalized Mutual Information). The modified measures include network topological information to assess misclassification errors according to nodes’ integration into the network. This article was published in 2015 in the International Journal of Social Network Mining by Vincent Labatut (Labatut, 2015). …
Multivariate Time Series analysis with volatility - Oil Prices
April 4, 2022
Building on the basic univariate analysis in the last blog post, “Univariate Time Series Analysis - Oil Prices”, this blog post continues the analysis with multivariate time series. The first step is to use the Henze-Zirkler test to check multivariate normality. Setting mvnTest = "hz" in the mvn function performs the Henze-Zirkler test. The last column indicates whether the data set follows multivariate normality or …
Univariate Time Series Analysis - Oil Prices
April 4, 2022
This blog post models and forecasts a univariate time series dataset with an ARMA-GARCH model and examines the goodness of fit with some basic tests. The oil prices dataset contains the log returns of four benchmarks (West Texas Intermediate (WTI), Brent Blend, Dubai Crude and Maya) from 10/1/1997 to 4/6/2010. In this data set, each benchmark contains 698 observations; each of them was divided between …
Article review. Triadic closure in two-mode networks: Redefining the global and local clustering coefficients.
March 31, 2022
In this post, I will summarise an article that proposes a redefinition of the clustering coefficients for two-mode networks. The new definition aims to solve some problems that arise when the clustering coefficient defined for one-mode networks is applied to projected two-mode networks. This article was published in 2013 in the Elsevier journal Social Networks by Tore Opsahl (Opsahl, 2013). The author introduced the article by explaining some …
Article review: The scales of human mobility.
March 31, 2022
In this blog post, I will summarise an article that proposes a new approach to modelling human mobility. This article was published in 2020 in the journal Nature by Laura Alessandretti, Ulf Aslak and Sune Lehmann (Alessandretti et al., 2020). The authors started the article by explaining that human mobility is key to understanding other phenomena such as people’s commuting flows, money’s …
Article review. Analyzing and Modeling Real-World Phenomena with Complex Networks: A Survey of Applications
March 31, 2022
This blog post reviews a survey of the applications of complex networks to real-world problems. In particular, six applications related to Social Networks, Economy, and Security and Surveillance will be summarised. This article was published in the journal Advances in Physics, in 2008, by Luciano da Fontoura Costa, Osvaldo N. Oliveira Jr., Gonzalo Travieso, Francisco Aparecido Rodrigues, Paulino Ribeiro Villas Boas, Lucas Antiqueira, Matheus Palhares …
Introduction to Graph Convolutional Network
February 27, 2021
Many important real-world datasets come in the form of graphs or networks: social networks, citation networks, protein-interaction networks, the World Wide Web, etc. The high interpretability of graphs and the rise of deep learning have motivated a new intersection between deep learning and graph theory. When these two fields meet, they create what we call geometric deep learning or graph neural networks. It …
Article review: Predicting the direction, maximum, minimum and closing prices of daily Bitcoin exchange rate using Machine Learning techniques
December 3, 2020
Created in 2009, Bitcoin is now the most accepted cryptocurrency in the world and is traded on over 40 exchanges worldwide. Several innovative features of Bitcoin, such as a decentralized peer-to-peer payment network without central banks, anonymity and greater accessibility relative to traditional currencies, make it appealing to investors and traders. Thus, a growing body of research studies the time series of …
Article review: Modeling complex systems with adaptive networks
December 3, 2020
In this post, I will review an article that used adaptive networks to model complex systems in some real-world problems. This article was published in 2013 in the Elsevier journal Computers & Mathematics with Applications by Hiroki Sayama, Irene Pestov, Jeffrey Schmidt, Benjamin James Bush, Chun Wong, Junichi Yamanoi and Thilo Gross (Sayama et al., 2013). This article aimed to introduce fundamental concepts and properties of adaptive networks through a …
Article review: Optimal forecast combination based on neural network for time series forecasting
December 3, 2020
Time series forecasting plays an important role in various practical applications, ranging from energy, electrical load and tourism to finance. Improving forecasting performance is an important yet often difficult task. Forecast combination is considered an effective way to improve forecasting performance. With the aim of utilizing Artificial Neural Networks (ANN) to improve time series forecasting, the article titled “Optimal forecast combination based on …
Introduction to Convolutional Neural Networks
October 16, 2020
This is the first of two blog posts taking a look at the paper “CNNPred: CNN-based stock market prediction using several data sources” by Ehsan Hoseinzade and Saman Haratizadeh (Faculty of New Sciences and Technologies, University of Tehran), which attempts to showcase the application of CNNs to the stock market. In this first blog post we are going to focus on the basics of convolutional neural networks, more specifically how they are applied …
A Gentle Introduction for Seasonal ARIMAX (SARIMAX)
September 15, 2020
Introduction: When it comes to financial data, there is a high chance that seasonal patterns will be present. These are defined as patterns that have cyclic behavior. Let’s assume that there is a store that sells ice cream throughout the year. An example of a monthly seasonal pattern could be the increased ice cream sales in that store during the summer period in comparison with the …
Vector Autoregression (VAR) for house property sales time series
July 22, 2020
This blog post aims to generate forecasts of house property sales using Vector Autoregression (VAR) models. We use the “ma_lga_12345.csv” dataset (downloaded from https://www.kaggle.com/htagholdings/property-sales?select=ma_lga_12345.csv), which contains data resampled using a Median Price Moving Average (MA) at quarterly intervals. The data range from 30 September 2007 to 30 September 2019 at the time of download (8 July 2020). We focus on predicting the house price …
Forecast analysis with Random Forest for house property sales data.
July 15, 2020
In this blog post, I will perform a House Property Sales forecast using a Random Forest technique with a Linear Regression and a Time Series. To build these models, two datasets were used. The Raw data: 29580 observations of recorded sales data from 2007 to 2019. The MA data: 347 observations of the Moving Average of Median Price, grouped by quarterly intervals per property type …
Web Scraping – How to retrieve 1807 skills using three lines of code
July 13, 2020
This blog post will detail the steps required to begin your journey into web scraping. Web scraping can help solve many of the challenges faced in an increasingly digital world. Some of these challenges include being able to process the vast amount of data online, having a system that can react quickly when this data changes frequently, and making sure that the quality of the …
Dataset: House Property Sales. Exploratory analysis.
June 29, 2020
By Maria Fernanda Ibarra Gutiérrez and Thu Trang Dinh
In this blog post, we will describe the House Property Sales database (available at https://www.kaggle.com/htagholdings/property-sales?select=raw_sales.csv). According to the first Figure, this database describes some characteristics of the property sales in 5 variables and 29,580 observations from the 7th of February 2007 to the 26th of July 2019. This database does not have …
Future predictions of Coronavirus cases using ARIMA model
April 13, 2020
This post aims to track the spread of COVID-19, also known as 2019 Novel Coronavirus. It is a new respiratory virus first identified in Wuhan in December 2019. According to the Centers for Disease Control and Prevention (2020), the virus probably initially emerged from an animal source, but there are now many cases, indicating that person-to-person spread is occurring. At this time, how easily or sustainably this …
How to learn R
February 6, 2020
There is an incredibly large number of resources available to you for learning R, starting from the R Manual, step-by-step books, YouTube videos, more books, R blogs and so on. Here you will find a list of only some of these resources: Where to get R (Software): You can download R from https://cran.r-project.org/. RStudio: An R editor with additional features, which provides …
Determining the Number of Clusters
January 20, 2020
By Adán José-García
Discovering the number of clusters (k) in a dataset is a fundamental problem in data clustering or cluster analysis. Clustering is an unsupervised learning technique aiming to discover the natural partition of data objects into clusters. Clustering algorithms can be broadly divided into two groups: hierarchical and partitional. Both categories of clustering algorithms, e.g., single-link and k-means algorithms respectively, require as input the …
SDG Indicator Filtering Function
July 29, 2019
Handling missing values is a common issue in the data preparation step before analysis. In R, missing values are represented by NA, and there are abundant NA-related functions in R to deal with NA values. Since we would like to cluster the SDG indicators later, it is highly recommended to construct a filtering function to guarantee there are no NA values in the filtered data …
Chapter 3 – DBSCAN: Density-Based Clustering
July 28, 2019
As mentioned before, agglomerative algorithms work slowly on large data sets, Kmeans cannot be applied to non-convex data sets, and neither is able to detect and remove outliers. Thus, density-based clustering, or DBSCAN, was proposed to meet requirements such as identifying and removing noise, dealing with data sets of arbitrary shape, and processing large data sets more efficiently …
Chapter 2 – Kmeans Clustering
July 12, 2019
Partitioning algorithms are among the most widely used and deeply studied clustering algorithms. They aim to partition the dataset into several clusters of similar objects while maximizing the between-cluster variation (Dabbura, 2018). Though there are many modified partitioning algorithms, we will focus on the Kmeans algorithm in this blog. Kmeans has been widely applied in data mining, pattern recognition, image compression and many machine learning fields …
Chapter 1 – Agglomerative Hierarchical Clustering
July 1, 2019
Nowadays, clustering techniques are frequently used in data analysis. Among them, partitioning and hierarchical clustering are the two most deeply studied and widely used clustering methods. Unlike partitioning, hierarchical clustering approaches the problem by constructing a hierarchy of clusters; thus, it is heavily used in bioinformatics, e.g. for phylogenetic trees of animal evolution or virus transmission (Kilitcioglu, 2018). In this blog, the hierarchical clustering method …