2.4 Feature Engineering
- ML algorithms prefer well-defined, fixed-length inputs and outputs
- Before deep learning, feature engineering was the key to ML model quality
- Deep learning trains deep neural networks to extract features automatically
2.4.1 Tabular Data Features
- Int/Float: use directly, or bin into n unique integer values
- Categorical data: one-hot encoding
- Map rare categories to “Unknown”
- Date-time: a feature list such as
- [year, month, day, day_of_year, week_of_year, day_of_week]
- Feature combination: Cartesian product of two features, such as
- [cat, dog] x [male, female] -> [(cat, male), (cat, female), (dog, male), (dog, female)]
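The tabular transforms above can be sketched in plain Python; a minimal sketch using only the standard library (the bucket count, category vocabulary, and example values are made up for illustration):

```python
from datetime import date
from itertools import product

# Binning: map a float in [low, high) into one of n integer buckets.
def bin_value(x, low, high, n_bins=4):
    idx = int((x - low) / (high - low) * n_bins)
    return min(max(idx, 0), n_bins - 1)

# One-hot encoding; rare/unseen categories are mapped to "Unknown".
def one_hot(value, vocab):
    value = value if value in vocab else "Unknown"
    slots = vocab + ["Unknown"]
    return [1 if value == v else 0 for v in slots]

# Date-time: expand a date into the feature list from the notes.
def date_features(d):
    return [d.year, d.month, d.day, d.timetuple().tm_yday,
            d.isocalendar()[1], d.weekday()]

# Feature combination: Cartesian product of two categorical features.
crossed = list(product(["cat", "dog"], ["male", "female"]))
```

Binning and crossing both keep the representation fixed-length, which is exactly what the "well-defined fixed-length input" requirement above asks for.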
2.4.2 Text Features
- Represent text as token features
- Bag of words (BoW) model
- Limitations: needs careful vocabulary design; ignores word order and context
- Word embeddings (e.g. Word2vec):
- Vectorize words such that similar words are placed close together
- Trained by predicting a target word from its context words
- Pre-trained language models (e.g. BERT, GPT-3)
- Giant transformer models
- Trained on large amounts of unannotated data
- Fine-tuned for downstream tasks
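A minimal bag-of-words sketch using only the standard library (the toy corpus is invented for illustration); it also shows the vocabulary-design limitation: words outside the vocabulary are simply dropped, and word order is lost:

```python
from collections import Counter

def build_vocab(corpus):
    # Fixed vocabulary from the training corpus, sorted for stable indices.
    words = sorted({w for doc in corpus for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}

def bow_vector(doc, vocab):
    # Count token occurrences; out-of-vocabulary words are ignored,
    # and word order is discarded -- the BoW limitations in the notes.
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

corpus = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocab(corpus)  # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
vec = bow_vector("the cat saw the dog", vocab)  # "saw" is out-of-vocabulary
```

Every document maps to the same fixed-length count vector, which is what makes BoW usable as a classical ML feature.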
2.4.3 Image/Video Features
- Traditionally, image features were extracted with hand-crafted descriptors such as SIFT
- Now commonly use pre-trained deep neural networks
- ResNet: trained on ImageNet (image classification)
- I3D: trained on Kinetics (action classification)
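A hand-crafted image feature can be as simple as a per-channel color histogram; a minimal NumPy sketch (SIFT itself is far more involved and usually comes from a library such as OpenCV; the tiny all-black image is made up for illustration):

```python
import numpy as np

def color_histogram(img, bins=8):
    # img: H x W x 3 uint8 array. Build one histogram per color channel
    # and concatenate them into a fixed-length feature vector (3 * bins).
    feats = []
    for c in range(3):
        hist, _ = np.histogram(img[:, :, c], bins=bins, range=(0, 256))
        feats.append(hist)
    return np.concatenate(feats)

img = np.zeros((4, 4, 3), dtype=np.uint8)  # a tiny 4x4 all-black image
vec = color_histogram(img)                 # 24-dim feature vector
```

A pre-trained network replaces a rule like this with learned convolutional filters, but the interface is the same: image in, fixed-length feature vector out.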
Summary
- Features matter
- Features are hand-crafted or learned by deep neural networks