Exploratory Data Analysis using R with the Titanic dataset, Part 1: 2018

Exploratory Data Analysis

Introduction

Exploratory data analysis (EDA) is the practice of visualising and transforming data in order to understand it and to decide which analyses are worth performing. Once we have acquired the data, we first need to diagnose it: check for data quality problems and confirm that the values are correct.

Some basic and important terms in exploratory data analysis:

Data diagnostics provides information about, and visualisations of, missing values, outliers, unique values and negative values, to help you understand the distribution and quality of your data.

Data exploration provides information about, and visualisations of, the descriptive statistics of individual variables, normality tests and outliers, the correlation between two variables, and the relationship between the target variable and each predictor.

Data transformation supports binning to categorise continuous variables, imputes missing values and outliers, and resolves skewness. The dlookr package used later in this post also creates automated reports that support these three tasks.
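To make the diagnostics step concrete, here is a minimal sketch in base R rather than dlookr. The data frame `toy` and its values are made up purely for illustration. It counts missing, unique and negative values per column and flags outliers with the usual boxplot rule.

```r
# Hypothetical toy data, only for illustration (not the real Titanic file).
toy <- data.frame(
  Age  = c(22, 38, NA, 35, 14, 80),
  Fare = c(7.25, 71.28, 7.92, -1, 8.05, 512.33),
  Sex  = c("male", "female", "female", "male", NA, "female"),
  stringsAsFactors = FALSE
)

# Per-column diagnostics: missing values and distinct non-missing values.
diagnosis <- data.frame(
  variable      = names(toy),
  missing_count = unname(colSums(is.na(toy))),
  unique_count  = unname(sapply(toy, function(x) length(unique(x[!is.na(x)])))),
  stringsAsFactors = FALSE
)
print(diagnosis)

# Negative values only make sense for numeric columns.
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
negative_count <- sapply(toy[numeric_cols], function(x) sum(x < 0, na.rm = TRUE))
print(negative_count)

# Outliers by the boxplot rule (more than 1.5 * IQR beyond the hinges).
fare_outliers <- boxplot.stats(toy$Fare)$out
print(fare_outliers)
```

A negative fare or a far-away value is not automatically wrong, but each one is worth a closer look before any modelling.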

Before analysis we need to:

1-Generate questions about our data,

2-Search for answers by visualising, transforming and modelling our data,

3-Use what we learn to refine our questions or generate new questions.

EDA is an important part of any data analysis, and data cleaning is only one application of it: in data cleaning we ask whether the data meets our expectations. To clean data, we need to deploy all the tools of EDA: visualisation, transformation, and modelling. Each question we ask focuses our attention on a specific part of the dataset and helps us decide which graphs, models, or transformations to make.

To illustrate the basic use of EDA with the 'dlookr' package, I use the Titanic dataset.

First, what is the 'dlookr' package?

The dlookr package is a collection of tools that support data diagnosis, exploration, and transformation.

First of all we have to install dlookr, along with a few other packages used in this post (such as ggplot2 and mosaic).

Note-These are only some of the packages available for exploratory data analysis. Install them before you continue.
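One possible way to handle the installation is sketched below. The helper names `find_missing` and `setup_packages` are hypothetical, and the package list is an assumption based on the packages named in this post. The idea is to install only what is not already present and then load everything.

```r
# Return the packages from `pkgs` that are not installed yet.
find_missing <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Install whatever is missing (needs internet access), then attach everything.
setup_packages <- function(pkgs) {
  to_install <- find_missing(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Packages named in this post; "ggplot2" is the CRAN name of ggplot.
eda_packages <- c("dlookr", "ggplot2", "mosaic")

# Uncomment to install and load them:
# setup_packages(eda_packages)
```

Checking first avoids reinstalling packages on every run of the script.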

Example:

For a better understanding we take the Titanic dataset as an example and perform exploratory analysis on it in a systematic way. 'titanic' is an R package containing data sets that provide information on the fate of passengers on the fatal maiden voyage of the ocean liner "Titanic", summarised according to economic status (class), sex, age and survival.

Dataset-Titanic disaster

Data description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.

In this blog post I will do only a basic exploratory data analysis of the Titanic dataset using R and ggplot, and attempt to answer a few questions about the Titanic tragedy based on the dataset.

Data dictionary

Variable	Definition	Key
survival	Survival	0 = No, 1 = Yes
pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
embarked	Port of embarkation	C = Cherbourg, Q = Queenstown, S = Southampton
sex	Sex	
Age	Age in years	
sibsp	Number of siblings / spouses aboard the Titanic	
parch	Number of parents / children aboard the Titanic	
ticket	Ticket number	
fare	Passenger fare	
cabin	Cabin number	
PassengerId	Passenger ID number	

Load the data into R and examine the data structures

The data is in CSV format. You must download it from the Kaggle competition and place it in a folder. I also load the test data, which may be handy if we decide to apply some transformations.

I will go step by step. First we have to understand the data; after that we will start the analysis.

1-Import the Data

2-Find the number of rows (observations) and the number of columns

3-Look at the first few lines

4-Summary of the data

5-Names of the variables in the dataset

6-Structure of the dataset
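The six steps above can be sketched in base R as follows. To keep the example runnable without the download, it reads a few made-up rows laid out like the Kaggle file. With the real data you would call `read.csv()` on the downloaded file instead; the file name `train.csv` mentioned in the comment is an assumption.

```r
# Stand-in for the Kaggle file: a few made-up rows in the same CSV layout.
# With the real download you would call read.csv("train.csv") instead
# (file name and location are assumptions).
csv_text <- "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Passenger A,male,22,1,0,T001,7.25,,S
2,1,1,Passenger B,female,38,1,0,T002,71.28,C85,C
3,1,3,Passenger C,female,26,0,0,T003,7.92,,S
4,0,2,Passenger D,male,,0,0,T004,13.00,,Q"

# 1 - Import the data
train_data <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# 2 - Number of rows (observations) and number of columns
dim(train_data)

# 3 - First few lines
head(train_data, 3)

# 4 - Summary of every column
summary(train_data)

# 5 - Names of the variables
names(train_data)

# 6 - Structure: the type of each column
str(train_data)
```

Note that the empty Age field in the last row is read as NA, which is exactly the kind of thing `summary()` makes visible.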

Note-Here we can see that each variable has its own data type, and some of those types do not match what the variable means. In our dataset the variables Survived, Pclass, SibSp and Parch are stored as integers, so we convert them to the factor data type.
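A minimal sketch of that conversion, using a made-up data frame with the same column names (the values are illustrative only):

```r
# Made-up rows with the same column names and types as the Kaggle file.
train_data <- data.frame(
  Survived = c(0L, 1L, 1L, 0L),
  Pclass   = c(3L, 1L, 3L, 2L),
  SibSp    = c(1L, 1L, 0L, 0L),
  Parch    = c(0L, 0L, 0L, 2L),
  Age      = c(22, 38, 26, NA)
)

# Columns that are stored as integers but are treated as categories here.
factor_cols <- c("Survived", "Pclass", "SibSp", "Parch")

# Convert each of them to a factor; all other columns are left untouched.
train_data[factor_cols] <- lapply(train_data[factor_cols], factor)

# Check the result.
str(train_data)
```

Converting several columns in one `lapply()` call avoids repeating `as.factor()` once per variable.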

7-Make tables that show who survived, stratified by Sex and Pclass

Note-The functions 'xtabs' and 'tally' work in almost the same way. To use tally, first load the mosaic package with library(mosaic).
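As an illustration, the tables can be built with `xtabs` from base R. The rows below are made up so that the example runs on its own. With mosaic loaded, `tally()` should accept the same kind of formula, although that is not shown here.

```r
# Made-up passengers, only to demonstrate the table functions.
train_data <- data.frame(
  Survived = c(0, 1, 1, 0, 0, 1, 0, 1),
  Sex      = c("male", "female", "female", "male",
               "male", "female", "male", "male"),
  Pclass   = c(3, 1, 3, 1, 3, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Two-way table: number of passengers in each Survived x Sex cell.
by_sex <- xtabs(~ Survived + Sex, data = train_data)
print(by_sex)

# Three-way table stratified by Sex and Pclass, flattened for easier reading.
by_sex_class <- xtabs(~ Survived + Sex + Pclass, data = train_data)
print(ftable(by_sex_class))

# Proportions within each sex (every column sums to 1).
print(prop.table(by_sex, margin = 2))
```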

8-How to combine two datasets

Both datasets must have the same columns before they can be combined. In our example we have two datasets, "train_data" and "test_data", but "train_data" has 12 columns and "test_data" has only 11, so combining them directly causes a problem. The extra variable in "train_data" is "Survived", so we add this variable to "test_data". After that we can combine the two datasets.
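One way to do this is sketched below with two small made-up data frames standing in for the real files. Filling the new column with NA is an assumption on my part: the outcome is simply unknown for the test rows.

```r
# Made-up stand-ins for the two Kaggle files.
train_data <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0L, 1L, 1L),
  Pclass      = c(3L, 1L, 3L),
  Sex         = c("male", "female", "female"),
  stringsAsFactors = FALSE
)
test_data <- data.frame(
  PassengerId = 4:5,
  Pclass      = c(2L, 3L),
  Sex         = c("male", "female"),
  stringsAsFactors = FALSE
)

# The outcome is unknown for the test rows, so fill the new column with NA.
test_data$Survived <- NA

# rbind() matches data frame columns by name, so the column order may differ.
full_data <- rbind(train_data, test_data)
dim(full_data)
```

Keeping the test rows marked with NA also makes it easy to split the combined data back into its two parts later.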

Sunday, 2 September 2018