Exploratory Data Analysis

Introduction-
Exploratory data analysis (EDA) is the practice of visualising and transforming data in order to understand it and to find directions for further analysis. Once we have acquired the data, we need to diagnose it to check for data quality problems and to confirm whether the data is correct.
Some basic and important terms in exploratory data analysis are:
Data diagnostics provides information and visualisation of missing values, outliers, unique values and negative values to help you understand the distribution and quality of your data.
Data exploration provides information and visualisation of the descriptive statistics of univariate variables, normality tests and outliers, the correlation of two variables, and the relationship between the target variable and the predictors.
Data transformation supports binning for categorising continuous variables, imputes missing values and outliers, and resolves skewness. The dlookr package used in this post also creates automated reports that support these three tasks.
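As a rough illustration of the data diagnostics step described above, here is a minimal base R sketch that counts missing, unique and negative values per variable. The data frame `df` and its values are made up purely for illustration, and the column names of the `diagnosis` table are my own choice.

```r
# Toy data frame, made up purely for illustration
df <- data.frame(
  age  = c(22, 38, NA, 35, 54),
  fare = c(7.25, 71.28, 7.92, -1, 51.86),  # -1 is a deliberately bad value
  sex  = c("male", "female", "female", NA, "male"),
  stringsAsFactors = FALSE
)

# One row per variable: type, missing count, unique count, negative count
diagnosis <- data.frame(
  variable       = names(df),
  type           = vapply(df, function(x) class(x)[1], character(1)),
  missing_count  = vapply(df, function(x) sum(is.na(x)), integer(1)),
  unique_count   = vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1)),
  negative_count = vapply(
    df,
    function(x) if (is.numeric(x)) sum(x < 0, na.rm = TRUE) else 0L,
    integer(1)
  ),
  row.names = NULL
)

print(diagnosis)
```

A table like this makes it easy to spot variables that need imputation or a closer look before any modelling.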
Before analysis we need to:
1-generate questions about our data,
2-search for answers by visualising, transforming and modelling our data,
3-use what we learn to refine our questions or generate new ones.
EDA is an important part of any data analysis. Data cleaning is only one application of EDA, in which we ask whether our data meets our expectations. To do data cleaning, we need to deploy all the tools of EDA: visualisation, transformation, and modelling. When we ask a question, it focuses our attention on a specific part of the dataset and helps us decide which graphs, models, or transformations to make.
To illustrate the basic use of EDA with the 'dlookr' package, I use the "titanic" dataset.
First, what is the 'dlookr' package?
dlookr package - A collection of tools that support data diagnosis, exploration, and transformation.
So first of all we have to install dlookr and some other important packages.
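A minimal sketch of the installation step could look like the following. The exact package list is an assumption on my part: dlookr is the package discussed here, while ggplot2 and mosaic are the other packages mentioned in this post.

```r
# Assumed package list: dlookr for EDA, ggplot2 for plots, mosaic for tally()
pkgs <- c("dlookr", "ggplot2", "mosaic")

# Install only the packages that are not already available
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

# Attach the packages for the current session
library(dlookr)
library(ggplot2)
library(mosaic)
```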
Example:
For a better understanding, we take the Titanic dataset as an example and perform exploratory analysis on it in a systematic way. 'titanic' is an R package containing data sets providing information on the fate of passengers on the fatal maiden voyage of the ocean liner "Titanic", summarised according to economic status (class), sex, age and survival.
Dataset- Titanic disaster
Data description
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.
In this blog I will do only basic exploratory data analysis on the Titanic dataset using R and ggplot, and attempt to answer a few questions about the Titanic tragedy based on the dataset.
Data dictionary
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| Passengers ID | ID no. of passengers | |
Load the data into R and examine the data structures
The data is in CSV format. You must download it from the Kaggle competition and place it in a folder. I also load the test data, which may be handy if we decide to apply some transformations.
I will go step by step. First we have to understand the data; after that we will start the analysis.
1-Import the Data
2-Find the number of rows (observations) and the number of columns
5-Names of the variables in the dataset
6-Structure of the dataset
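The steps listed above can be sketched in base R as follows. The file names `train.csv` and `test.csv` in the working directory are an assumption. If the files are not found, the sketch falls back to a few made-up rows with the same columns, so that it still runs.

```r
# Assumed file names: Kaggle's train.csv and test.csv in the working directory
if (file.exists("train.csv") && file.exists("test.csv")) {
  train_data <- read.csv("train.csv", stringsAsFactors = FALSE)
  test_data  <- read.csv("test.csv", stringsAsFactors = FALSE)
} else {
  # Made-up stand-in rows (not real passengers) with the same 12 columns
  train_data <- data.frame(
    PassengerId = 1:4,
    Survived    = c(0L, 1L, 1L, 0L),
    Pclass      = c(3L, 1L, 3L, 2L),
    Name        = c("Passenger A", "Passenger B", "Passenger C", "Passenger D"),
    Sex         = c("male", "female", "female", "male"),
    Age         = c(22, 38, NA, 40),
    SibSp       = c(1L, 1L, 0L, 0L),
    Parch       = c(0L, 0L, 0L, 1L),
    Ticket      = c("T1", "T2", "T3", "T4"),
    Fare        = c(7.25, 71.28, 7.92, 13),
    Cabin       = c("", "C85", "", ""),
    Embarked    = c("S", "C", "S", "S"),
    stringsAsFactors = FALSE
  )
  # The test set has the same columns except Survived
  test_data <- train_data[, names(train_data) != "Survived"]
}

dim(train_data)    # number of rows (observations) and columns
names(train_data)  # names of the variables
str(train_data)    # structure: type and first values of each variable
```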
Note- Here we can see that the variables have different data types, so we have to change the type of some of them. In our dataset the variables Survived, Pclass, SibSp and Parch are stored as integers, so we have to convert these variables to the factor data type.
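One possible way to do this conversion in base R is sketched below. The small `train_data` frame here is a made-up stand-in with only the relevant columns. With the real data, only the last three statements are needed.

```r
# Made-up stand-in with the integer columns named in the note
train_data <- data.frame(
  Survived = c(0L, 1L, 1L, 0L),
  Pclass   = c(3L, 1L, 3L, 2L),
  SibSp    = c(1L, 1L, 0L, 0L),
  Parch    = c(0L, 0L, 0L, 1L),
  Age      = c(22, 38, 26, 35)
)

# Columns that are categories rather than quantities
factor_cols <- c("Survived", "Pclass", "SibSp", "Parch")

# Convert each listed column to a factor, leaving the others untouched
train_data[factor_cols] <- lapply(train_data[factor_cols], factor)

str(train_data)
```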
7-Make tables that show who survived, stratified by Sex and Pclass
Note- The functions 'xtabs' and 'tally' work in almost the same way. To use 'tally' we have to call library(mosaic).
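A minimal sketch of such tables with base R's `xtabs` follows. The rows in `train_data` are made up for illustration, and the names `by_sex`, `by_class` and `by_sex_class` are my own.

```r
# Made-up stand-in rows with the three variables of interest
train_data <- data.frame(
  Survived = c(0, 1, 1, 0, 1, 0),
  Sex      = c("male", "female", "female", "male", "female", "male"),
  Pclass   = c(3, 1, 3, 2, 2, 3)
)

# Survival counts by sex
by_sex <- xtabs(~ Survived + Sex, data = train_data)

# Survival counts by ticket class
by_class <- xtabs(~ Survived + Pclass, data = train_data)

# Survival counts by sex and ticket class together
by_sex_class <- xtabs(~ Survived + Sex + Pclass, data = train_data)

print(by_sex)
print(by_class)
print(ftable(by_sex_class))  # ftable flattens the three-way table for printing

# With library(mosaic) attached, a comparable call would be:
# tally(~ Survived + Sex, data = train_data)
```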
8-How to combine two datasets
Both datasets must have the same columns to be combined. In our example we have two datasets, "train_data" and "test_data", but "train_data" has 12 columns while "test_data" has only 11, so combining them directly causes a problem. The extra variable in "train_data" is "Survived", so we have to add this variable to "test_data". After that we can combine the two datasets.
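The idea can be sketched in base R as follows. The two small data frames are made-up stand-ins with fewer columns than the real data, and `full_data` is a name I chose for the combined result. Filling `Survived` with `NA` is one reasonable choice, since survival is unknown for the test rows.

```r
# Made-up stand-ins: train_data has a Survived column, test_data does not
train_data <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0L, 1L, 1L),
  Pclass      = c(3L, 1L, 3L)
)
test_data <- data.frame(
  PassengerId = 4:5,
  Pclass      = c(2L, 3L)
)

# Add the missing column; survival is unknown for test rows, so use NA
test_data$Survived <- NA_integer_

# rbind matches data frame columns by name, so column order may differ
full_data <- rbind(train_data, test_data)

dim(full_data)
```

The `NA` values also make it easy to split the combined data back into its training and test parts later.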