Visualization with R Studio

“The simple graph has brought more information to the data analyst’s mind than any other device.” (John Tukey, American mathematician and co-creator of the Cooley–Tukey Fast Fourier Transform algorithm)

During data exploration, many approaches can lead us to different aspects of a dataset. Among them, visualization is a recommended starting point because its results are clear and straightforward. The output of visualization could help …

Dataset I: The Atlas of Economic Complexity-Part B

The first dataset we would like to discuss is the Atlas of Economic Complexity, available at http://atlas.cid.harvard.edu/explore. After exploring and appreciating the charts and graphs in the Part A post, we are also interested in the datasets underlying those graphs. The data sources for global goods trade and services trade are the United Nations Statistical Division and the Direction of Trade Statistics database (IMF) …

SDG Indicator Filtering Function

Handling missing values is a common task in the data preparation step before analysis. In R, missing values are represented by NA, and R provides many NA-related functions for dealing with them. Since we would like to cluster the SDG indicators later, it is highly recommended to construct a filtering function that guarantees there are no NA values in the filtered data …
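A minimal sketch of what such a filtering function could look like in base R. The data frame `sdg`, its column names, and the helper `filter_complete` are hypothetical and used only for illustration; the post's actual function may differ.

```r
# Hypothetical toy data: each row is an indicator, each column a yearly value.
sdg <- data.frame(
  indicator = c("ind_a", "ind_b", "ind_c"),
  y2015 = c(1.2, NA, 3.4),
  y2016 = c(1.5, 2.1, NA),
  stringsAsFactors = FALSE
)

# Keep only the rows that contain no NA in any column.
# complete.cases() returns TRUE for rows without missing values.
filter_complete <- function(df) {
  df[complete.cases(df), , drop = FALSE]
}

filtered <- filter_complete(sdg)
print(filtered)
```

Dropping whole rows is the simplest policy; filtering by column, or by a tolerated share of NA values, would follow the same pattern with a different logical index.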

Chapter 3 – DBSCAN: Density-Based Clustering

As mentioned before, agglomerative algorithms are slow on large data sets, Kmeans cannot be applied to non-convex data sets, and neither can detect and remove outliers. Thus, density-based clustering, or DBSCAN, was proposed to meet requirements such as identifying and removing noise, handling data sets of arbitrary shape, and processing large data sets more efficiently …

Chapter 2 – Kmeans Clustering

Partitioning algorithms are among the most widely used and deeply studied clustering algorithms. They aim to partition the dataset into several clusters of similar objects while maximizing the between-cluster variation (Dabbura, 2018). Though there are many modified partitioning algorithms, we will focus on the Kmeans algorithm in this blog. Kmeans has been widely applied in data mining, pattern recognition, image compression and many machine learning fields …
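As a rough illustration of the partitioning idea, the sketch below runs base R's `kmeans()` on synthetic data. The two Gaussian blobs, the seed, and the choice of `centers = 2` are assumptions made for the example, not taken from the post.

```r
# Synthetic data (an assumption for illustration): two well-separated blobs
# of 20 points each in two dimensions.
set.seed(42)
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
)

# Partition into 2 clusters; nstart = 10 tries several random initialisations
# and keeps the best one.
fit <- kmeans(x, centers = 2, nstart = 10)

# Cluster sizes, and the share of total variation explained by the
# between-cluster sum of squares (closer to 1 means better separation).
print(table(fit$cluster))
print(fit$betweenss / fit$totss)
```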

Chapter 1 – Agglomerative Hierarchical Clustering

Nowadays, clustering techniques are frequently used in data analysis. Among them, partitioning and hierarchical clustering are the two most deeply studied and widely used methods. Unlike partitioning, hierarchical clustering approaches the problem by constructing a hierarchy of clusters; it is therefore heavily used in bioinformatics, e.g. for phylogenetic trees of animal evolution or virus transmission (Kilitcioglu, 2018). In this blog, the hierarchical clustering method …
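To make "constructing a hierarchy of clusters" concrete, here is a small sketch using base R's `dist()`, `hclust()` and `cutree()`. The six toy points and the average-linkage choice are assumptions for illustration only.

```r
# Toy data (an assumption for illustration): two tight groups of three points.
x <- matrix(
  c(0, 0,
    0, 1,
    1, 0,
    10, 10,
    10, 11,
    11, 10),
  ncol = 2, byrow = TRUE
)

# Pairwise Euclidean distances, then bottom-up (agglomerative) merging.
d <- dist(x)
hc <- hclust(d, method = "average")

# Cut the resulting tree into 2 clusters; plot(hc) would draw the dendrogram.
groups <- cutree(hc, k = 2)
print(groups)
```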