2.4 Feature Engineering
- ML algorithms prefer well-defined, fixed-length inputs and outputs
- Before deep learning, feature engineering was the key to ML model quality
- Deep learning trains deep neural networks to extract features automatically
2.4.1 Tabular Data Features
- Int/Float: use directly, or bin into n unique integer values
- Categorical data: one-hot encoding
- Map rare categories to “Unknown”
- Date-time: a feature list such as
- [year, month, day, day_of_year, week_of_year, day_of_week]
- Feature combination: Cartesian product of two features, such as
- [cat, dog] x [male, female] -> [(cat, male), (cat, female), (dog, male), (dog, female)]
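The tabular transforms above can be sketched in plain Python; a minimal sketch using only the standard library (the bucket count, category vocabulary, and example values are made up for illustration):

```python
from datetime import date
from itertools import product

# Binning: map a float in [low, high) into one of n integer buckets.
def bin_value(x, low, high, n_bins=4):
    idx = int((x - low) / (high - low) * n_bins)
    return min(max(idx, 0), n_bins - 1)

# One-hot encoding; rare/unseen categories are mapped to "Unknown".
def one_hot(value, vocab):
    value = value if value in vocab else "Unknown"
    slots = vocab + ["Unknown"]
    return [1 if value == v else 0 for v in slots]

# Date-time: expand a date into the feature list from the notes.
def date_features(d):
    return [d.year, d.month, d.day, d.timetuple().tm_yday,
            d.isocalendar()[1], d.weekday()]

# Feature combination: Cartesian product of two categorical features.
crossed = list(product(["cat", "dog"], ["male", "female"]))
```

Binning and crossing both keep the representation fixed-length, which is exactly what the "well-defined fixed-length input" requirement above asks for.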
2.4.2 Text Features
- Represent text as token features
- Bag of words (BoW) model
- Limitations: needs careful vocabulary design; ignores word order and context
- Word embeddings (e.g. Word2vec):
- Vectorize words such that similar words are placed close together
- Trained by predicting a target word from its context words
- Pre-trained language models (e.g. BERT, GPT-3)
- Giant transformer models
- Trained on large amounts of unannotated data
- Fine-tuned for downstream tasks
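A minimal bag-of-words sketch using only the standard library (the toy corpus is invented for illustration); it also shows the vocabulary-design limitation: words outside the vocabulary are simply dropped, and word order is lost:

```python
from collections import Counter

def build_vocab(corpus):
    # Fixed vocabulary from the training corpus, sorted for stable indices.
    words = sorted({w for doc in corpus for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}

def bow_vector(doc, vocab):
    # Count token occurrences; out-of-vocabulary words are ignored,
    # and word order is discarded -- the BoW limitations in the notes.
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

corpus = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocab(corpus)  # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
vec = bow_vector("the cat saw the dog", vocab)  # "saw" is out-of-vocabulary
```

Every document maps to the same fixed-length count vector, which is what makes BoW usable as a classical ML feature.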
2.4.3 Image/Video Features
- Traditionally, image features were extracted with hand-crafted descriptors such as SIFT
- Now commonly use pre-trained deep neural networks
- ResNet: trained on ImageNet (image classification)
- I3D: trained on Kinetics (action classification)
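A hand-crafted image feature can be as simple as a per-channel color histogram; a minimal NumPy sketch (SIFT itself is far more involved and usually comes from a library such as OpenCV; the tiny all-black image is made up for illustration):

```python
import numpy as np

def color_histogram(img, bins=8):
    # img: H x W x 3 uint8 array. Build one histogram per color channel
    # and concatenate them into a fixed-length feature vector (3 * bins).
    feats = []
    for c in range(3):
        hist, _ = np.histogram(img[:, :, c], bins=bins, range=(0, 256))
        feats.append(hist)
    return np.concatenate(feats)

img = np.zeros((4, 4, 3), dtype=np.uint8)  # a tiny 4x4 all-black image
vec = color_histogram(img)                 # 24-dim feature vector
```

A pre-trained network replaces a rule like this with learned convolutional filters, but the interface is the same: image in, fixed-length feature vector out.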
Summary
- Features matter
- Features are hand-crafted or learned by deep neural networks