Titanic: Machine Learning From Disaster - I

These are my notes from working through the Kaggle competition Titanic: Machine Learning From Disaster, which serves as the very first attempt for most Kaggle newcomers. Without further ado, let’s get started. This post follows the steps of Alexis Cook’s Titanic Tutorial.

1. Get started

The challenge

The competition is simple: use the Titanic passenger data to try to predict who will survive and who will die.

The data

Data is crucial for data science. Let’s have a look. The data has been split into two groups: a training set (train.csv) and a test set (test.csv).

Data Dictionary

Variable	Definition	Key
survival	Survival	0 = No, 1 = Yes
pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
sex	Sex
Age	Age in years
sibsp	# of siblings / spouses aboard the Titanic
parch	# of parents / children aboard the Titanic
ticket	Ticket number
fare	Passenger fare
cabin	Cabin number
embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES)

1st = Upper

2nd = Middle

3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way…

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way…

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.
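The coded values in the data dictionary can be turned into readable labels with a small mapping. The sketch below is only an illustration, assuming the CSV columns are named `Survived`, `Pclass`, and `Embarked` (the capitalized spellings used in the Kaggle files, which differ from the lowercase names in the dictionary above). The `decode_keys` helper, the `*Label` column names, and the toy rows are hypothetical and are not part of the tutorial.

```python
import pandas as pd

# Key mappings copied from the data dictionary and variable notes above.
SURVIVAL_LABELS = {0: "No", 1: "Yes"}
PCLASS_LABELS = {1: "Upper", 2: "Middle", 3: "Lower"}  # SES proxy
EMBARKED_LABELS = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# Assumed column names (hypothetical label column names on the right).
COLUMN_MAPPINGS = {
    "Survived": ("SurvivedLabel", SURVIVAL_LABELS),
    "Pclass": ("PclassLabel", PCLASS_LABELS),
    "Embarked": ("EmbarkedLabel", EMBARKED_LABELS),
}


def decode_keys(df):
    """Return a copy of df with a readable label column for each coded column present."""
    out = df.copy()
    for column, (label_column, mapping) in COLUMN_MAPPINGS.items():
        if column in out.columns:
            # Values missing from the mapping (including NaN) become NaN.
            out[label_column] = out[column].map(mapping)
    return out


# Toy rows for illustration only; they are not taken from train.csv.
toy = pd.DataFrame({"Survived": [0, 1], "Pclass": [3, 1], "Embarked": ["S", "C"]})
print(decode_keys(toy))
```

Because the helper skips columns that are absent, the same call should also work on the test set, which has no survival column.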

(1) train.csv

(2) test.csv

2. Show me the code

I chose Python as my programming language for working with data. R is efficient too, but I still prefer Python. The introduction to the pandas library below is copied from its official website.

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

pandas is a NumFOCUS sponsored project. This will help ensure the success of development of pandas as a world-class open-source project, and makes it possible to donate to the project.

Load the data

The loading procedure is simple.

The location of the CSV files depends on how the files are laid out on your system, so adjust the paths accordingly.

import pandas as pd
train_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')
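Since the file location varies from system to system, it can help to wrap the two `read_csv` calls in a small function that checks the paths first and then prints a quick overview. This is a minimal sketch under that assumption; `load_titanic` and `summarize` are hypothetical helper names, and the `data` default simply mirrors the directory used above.

```python
from pathlib import Path

import pandas as pd


def load_titanic(data_dir="data"):
    """Read train.csv and test.csv from data_dir and return (train, test)."""
    data_dir = Path(data_dir)
    train_path = data_dir / "train.csv"
    test_path = data_dir / "test.csv"
    # Fail early with a readable message if a file is not where we expect it.
    for path in (train_path, test_path):
        if not path.exists():
            raise FileNotFoundError(f"Expected a Kaggle CSV file at {path}")
    return pd.read_csv(train_path), pd.read_csv(test_path)


def summarize(df, name):
    """Print the size, the first rows, and the per-column missing-value counts."""
    print(f"{name}: {df.shape[0]} rows, {df.shape[1]} columns")
    print(df.head())
    print(df.isnull().sum())
```

Calling `train_data, test_data = load_titanic("data")` followed by `summarize(train_data, "train")` gives a first look at the columns described in the data dictionary and at how many values are missing in each.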

···