This post records how I worked through the Kaggle competition Titanic: Machine Learning from Disaster, which serves as the very first attempt for most Kaggle rookies. Without further ado, let’s get started. This post follows the steps of Alexis Cook’s Titanic Tutorial.
1. Get started
The challenge
The competition is simple: use the Titanic passenger data to try to predict who will survive and who will die.
The data
Data is crucial for data science. Let’s have a look. The data is split into two groups: a training set (train.csv) and a test set (test.csv).
Data Dictionary
Variable | Definition | Key |
---|---|---|
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | Sex | |
Age | Age in years | |
sibsp | # of siblings / spouses aboard the Titanic | |
parch | # of parents / children aboard the Titanic | |
ticket | Ticket number | |
fare | Passenger fare | |
cabin | Cabin number | |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
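The coded values in the Key column can be turned into readable labels with a simple lookup. The sketch below assumes the column names `Survived`, `Pclass`, and `Embarked` as they appear in the header of train.csv; the helper `decode_keys` and the two sample rows are made up for illustration and are not part of the tutorial.

```python
import pandas as pd

# Lookup tables copied from the "Key" column of the data dictionary.
SURVIVAL_LABELS = {0: "No", 1: "Yes"}
CLASS_LABELS = {1: "1st", 2: "2nd", 3: "3rd"}
PORT_LABELS = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}


def decode_keys(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the coded columns replaced by readable labels."""
    out = df.copy()
    out["Survived"] = out["Survived"].map(SURVIVAL_LABELS)
    out["Pclass"] = out["Pclass"].map(CLASS_LABELS)
    out["Embarked"] = out["Embarked"].map(PORT_LABELS)
    return out


# Made-up rows for illustration only; they are not taken from train.csv.
sample = pd.DataFrame(
    {"Survived": [0, 1], "Pclass": [3, 1], "Embarked": ["S", "C"]}
)
decoded = decode_keys(sample)
print(decoded)
```

Keeping the original numeric codes for modelling and decoding only a copy for display is usually the safer choice, since most models expect numbers.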
Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
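The notes on age, sibsp, and parch can be checked directly in pandas. This is a minimal sketch, assuming the column names `Age`, `SibSp`, and `Parch` from the train.csv header; the derived columns `AgeEstimated`, `IsInfant`, and `FamilyAboard` are hypothetical names chosen here, and the sample rows are invented.

```python
import pandas as pd


def add_note_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Add helper columns derived from the variable notes (hypothetical names)."""
    out = df.copy()
    age = out["Age"]
    # Estimated ages have the form xx.5; ages below 1 are genuine fractions,
    # so they are excluded from the "estimated" flag.
    out["AgeEstimated"] = (age >= 1) & (age % 1 == 0.5)
    out["IsInfant"] = age < 1
    # Relatives aboard: siblings/spouses plus parents/children.
    # A child travelling only with a nanny has Parch == 0.
    out["FamilyAboard"] = out["SibSp"] + out["Parch"]
    return out


# Invented rows for illustration only; None stands for a missing age.
sample = pd.DataFrame(
    {
        "Age": [22.0, 28.5, 0.42, None],
        "SibSp": [1, 0, 0, 0],
        "Parch": [0, 0, 1, 0],
    }
)
noted = add_note_columns(sample)
print(noted)
```

Missing ages compare as False in both flags, so rows without an age are neither marked as estimated nor as infants.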
(1).train.csv
(2).test.csv
2. Show me the code
I chose Python as my programming language for handling data problems. R is efficient too, but I still prefer Python. The following introduction to the pandas library is copied from its official website.
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
pandas is a NumFOCUS sponsored project. This will help ensure the success of development of pandas as a world-class open-source project, and makes it possible to donate to the project.
Load the data
The loading procedure is simple.
The location of the CSV files depends on how the files are arranged on your system, so adjust the paths below as needed.
import pandas as pd
train_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')
···
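Once the files are loaded, a quick inspection is a sensible next step. The sketch below does not read the real files; it parses two made-up rows in the train.csv column layout from an in-memory string so that it runs anywhere. With the real data you would call the same methods on `train_data` and `test_data`.

```python
import io

import pandas as pd

# Two invented rows in the train.csv column layout (illustration only).
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,30,0,0,A1,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,,1,0,B2,71.0,C85,C
"""

demo = pd.read_csv(io.StringIO(csv_text))

# Number of rows and columns.
print(demo.shape)

# First few rows, to check that the columns were parsed as expected.
print(demo.head())

# Count of missing values per column; empty CSV fields are read as NaN.
missing = demo.isna().sum()
print(missing)
```

Checking the missing-value counts early is useful because columns such as `Age` and `Cabin` may need special handling before modelling.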