Sunday, 2 September 2018

Exploratory Data Analysis using R with the Titanic dataset, Part 1



Exploratory Data Analysis

Introduction-
Exploratory data analysis (EDA) is the process of visualising and transforming data in order to understand it and to decide which analyses are worth performing. Once we have acquired the data, we need to diagnose it to check for data-quality problems and to confirm that the data is correct.



Some basic and important terms in exploratory data analysis:
      Data diagnosis provides information about, and visualisation of, missing values, outliers, and unique and negative values, to help you understand the distribution and quality of your data.
      Data exploration provides information about, and visualisation of, the descriptive statistics of single variables, normality tests and outliers, the correlation between two variables, and the relationship between the target variable and each predictor.
      Data transformation supports binning to categorise continuous variables, imputes missing values and outliers, and resolves skewness. The dlookr package used in this post also creates automated reports that support these three tasks.
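
To make the three terms concrete, here is a minimal sketch in base R. It does not use dlookr, and the small "passengers" data frame is made up purely for illustration (it is not the real Titanic data). The choice of median imputation, the age bands and the log transform are assumptions for the example, not the method of any particular package.

```r
# Made-up stand-in data: two numeric variables with missing values and one extreme fare.
passengers <- data.frame(
  Age  = c(22, 38, 26, NA, 35, 14, 54, NA),
  Fare = c(7.25, 71.28, 7.93, 51.86, 8.05, 30.07, 512.33, 13.00)
)

# Diagnosis: missing values, unique values and outliers.
missing_count <- colSums(is.na(passengers))
unique_count  <- vapply(passengers, function(x) length(unique(x[!is.na(x)])), integer(1))
fare_outliers <- boxplot.stats(passengers$Fare)$out   # points beyond the boxplot whiskers

# Exploration: descriptive statistics and the correlation of two variables.
summary(passengers$Fare)
cor(passengers$Age, passengers$Fare, use = "complete.obs")

# Transformation: impute missing ages with the median, bin age, reduce skewness of fare.
age_median <- median(passengers$Age, na.rm = TRUE)
passengers$AgeImputed <- ifelse(is.na(passengers$Age), age_median, passengers$Age)
passengers$AgeGroup <- cut(passengers$AgeImputed,
                           breaks = c(0, 18, 40, Inf),
                           labels = c("child", "adult", "senior"))
passengers$LogFare <- log1p(passengers$Fare)
```

Each group of lines corresponds to one of the three tasks above, so the same pattern can be repeated on the real data once it is loaded.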
     
   Before the analysis we need to:
1-generate questions about our data,
2-search for answers by visualising, transforming and modelling our data,
3-use what we learn to refine our questions or generate new ones.
    
EDA is an important part of any data analysis, and data cleaning is just one application of it: in data cleaning we ask whether the data meets our expectations or not. To do data cleaning, we’ll need to deploy all the tools of EDA: visualisation, transformation, and modelling. When we ask a question, the question focuses our attention on a specific part of our dataset and helps us decide which graphs, models, or transformations to make.
      To illustrate the basic use of EDA with the 'dlookr' package, I use the Titanic dataset.
So what is the 'dlookr' package?
dlookr package - A collection of tools that support data diagnosis, exploration, and transformation.
  
So first of all we have to install the required packages: dlookr itself, plus some other important packages used alongside it.

Note-All of these packages must be installed before starting the exploratory data analysis.
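
The installation step can be sketched as follows. The package list here is an assumption put together from the packages this post mentions by name (dlookr, ggplot2 and mosaic); adjust it to whatever you actually use. The sketch only installs in an interactive session, so that running it as a batch script never starts a download.

```r
# Assumed package list, based on the packages mentioned in this post.
required_pkgs <- c("dlookr", "ggplot2", "mosaic")

# requireNamespace() returns TRUE when a package is already installed.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
to_install <- required_pkgs[!is_installed]

if (length(to_install) > 0 && interactive()) {
  # Install only what is missing.
  install.packages(to_install)
} else if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
}
```

After installation, each package is attached with library(), for example library(dlookr).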

Example:

For a better understanding, we take the Titanic dataset as an example and perform the exploratory analysis on it in a systematic way. 'titanic' is an R package containing datasets that provide information on the fate of passengers on the fatal maiden voyage of the ocean liner "Titanic", summarised according to economic status (class), sex, age and survival.




Dataset- Titanic disaster

Data description


The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.
In this blog I will do only basic exploratory data analysis on the Titanic dataset using R and ggplot, and attempt to answer a few questions about the Titanic tragedy based on the dataset.
Data dictionary

Variable       Definition                                    Key
survival       Survival                                      0 = No, 1 = Yes
pclass         Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
embarked       Port of embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton
sex            Sex
Age            Age in years
sibsp          # of siblings / spouses aboard the Titanic
parch          # of parents / children aboard the Titanic
ticket         Ticket number
fare           Passenger fare
cabin          Cabin number
Passenger ID   ID number of the passenger




 
Load the data into R and examine the data structures

The data is in CSV format. You must download it from the Kaggle competition and place it in a folder. I also load the test data, which may be handy if we decide to apply some transformations.
      I will go step by step: first we have to understand the data, and after that we will start the analysis.


1-Import the Data
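
A minimal sketch of the import step, assuming the Kaggle files keep their default names "train.csv" and "test.csv" and sit in one folder. The helper name read_titanic and the na.strings setting are my own choices for illustration.

```r
# Hypothetical helper: reads both Kaggle CSV files from one folder.
# The file names "train.csv" and "test.csv" are assumed.
read_titanic <- function(data_dir = ".") {
  train_file <- file.path(data_dir, "train.csv")
  test_file  <- file.path(data_dir, "test.csv")
  # Fail early with a clear error if a file is not where we expect it.
  stopifnot(file.exists(train_file), file.exists(test_file))
  list(
    train = read.csv(train_file, stringsAsFactors = FALSE, na.strings = c("", "NA")),
    test  = read.csv(test_file,  stringsAsFactors = FALSE, na.strings = c("", "NA"))
  )
}

# Usage, once the files are in place (the folder path is yours to fill in):
# titanic    <- read_titanic("path/to/folder")
# train_data <- titanic$train
# test_data  <- titanic$test
```

Keeping strings as plain character columns at this point leaves the decision about which variables become factors to a later, explicit step.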

2-Find the number of rows (observations) and the number of columns


3- Look at the first few lines

4-Summary of the data


5-Names of variables in dataset









6- Structure of the dataset
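
Steps 2 to 6 can be sketched with base R functions as below. So that the example runs on its own, "train_data" here is a small made-up stand-in with a subset of the Titanic columns; with the real data you would run the same five calls on the imported data frame.

```r
# Made-up stand-in rows with a subset of the Titanic columns (not the real data).
train_data <- data.frame(
  PassengerId = 1:6,
  Survived    = c(0L, 1L, 1L, 0L, 0L, 1L),
  Pclass      = c(3L, 1L, 3L, 1L, 3L, 2L),
  Sex         = c("male", "female", "female", "male", "male", "female"),
  Age         = c(22, 38, 26, NA, 35, 14),
  SibSp       = c(1L, 1L, 0L, 0L, 0L, 1L),
  Parch       = c(0L, 0L, 0L, 0L, 0L, 0L),
  Fare        = c(7.25, 71.28, 7.93, 51.86, 8.05, 30.07),
  Embarked    = c("S", "C", "S", "S", "S", "C"),
  stringsAsFactors = FALSE
)

data_dim <- dim(train_data)   # step 2: number of rows and number of columns
data_dim
head(train_data, 3)           # step 3: first few rows
summary(train_data)           # step 4: summary of every column
names(train_data)             # step 5: names of the variables
str(train_data)               # step 6: structure, including each column's data type
```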












Note- Here we can see that each variable has its own data type, and that some of these types do not suit the analysis. In our dataset the variables Survived, Pclass, SibSp and Parch are stored as integers, although they only take a few distinct values, so we have to convert these variables to the factor data type.
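
The conversion described in the note can be sketched like this. The data frame is again a small made-up stand-in so that the example runs on its own; the column names follow the Kaggle file.

```r
# Made-up stand-in with the four integer columns named in the note.
train_data <- data.frame(
  Survived = c(0L, 1L, 1L, 0L),
  Pclass   = c(3L, 1L, 3L, 2L),
  SibSp    = c(1L, 1L, 0L, 0L),
  Parch    = c(0L, 0L, 2L, 0L),
  Fare     = c(7.25, 71.28, 7.93, 13.00)
)

# Convert the chosen columns to factors in one step; other columns are left untouched.
factor_cols <- c("Survived", "Pclass", "SibSp", "Parch")
train_data[factor_cols] <- lapply(train_data[factor_cols], factor)

str(train_data)   # the four columns now show up as Factor
```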





7-Make tables that show who survived, stratified by Sex and Pclass

Note-The functions 'xtabs' and 'tally' work in almost the same way. To use the tally function we first have to call library(mosaic).
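
A minimal sketch of step 7 using xtabs from base R, on a small made-up stand-in data frame (not the real Titanic data). ftable is used only to print the three-way table in a flatter layout.

```r
# Made-up stand-in rows with the three variables used in the table.
train_data <- data.frame(
  Survived = c(0, 1, 1, 0, 0, 1),
  Sex      = c("male", "female", "female", "male", "male", "female"),
  Pclass   = c(3, 1, 3, 1, 3, 2)
)

# Count passengers for every combination of Survived, Sex and Pclass.
survival_table <- xtabs(~ Survived + Sex + Pclass, data = train_data)

# Flatten the three-way table so it prints as a single block.
ftable(survival_table)
```

With the mosaic package attached, tally accepts the same kind of formula and data argument, which is why the two functions behave so similarly.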


























8-How to combine two datasets

To combine two datasets row by row, both must have the same columns. In our example we have two datasets, "train_data" and "test_data", but "train_data" has 12 columns while "test_data" has only 11, so combining them directly causes a problem. The extra variable in "train_data" is "Survived", so we first have to add this variable to "test_data". After that we can combine the two datasets.
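
One way to sketch this step is shown below, on two small made-up stand-in data frames. Filling the new "Survived" column in "test_data" with NA is my assumption, since the outcome is unknown for the test passengers. The name "full_data" is also just illustrative.

```r
# Made-up stand-ins: train_data has a Survived column, test_data does not.
train_data <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0L, 1L, 1L),
  Pclass      = c(3L, 1L, 3L)
)
test_data <- data.frame(
  PassengerId = 4:5,
  Pclass      = c(3L, 2L)
)

# Add the missing column to test_data, filled with NA (assumed placeholder).
test_data$Survived <- NA_integer_

# rbind matches data frame columns by name, so the column order may differ.
full_data <- rbind(train_data, test_data)
full_data
```

The rows with a missing "Survived" value are exactly the test rows, which makes it easy to split the combined data again later.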