2.4 Feature Engeineering

  • ML algorithms prefer well defined fixed length input/output
  • Feature engineering is the key to ML models before deep learning
  • Deep learning train deep neural networks to extract features.
  • 2.4.1 Tabular Data Features

  • Int/Float:directly use or bin to n unique int values
  • Categorical data:one-hot encoding
    • Map rare categories into “Unknow”
  • Data-time:a feature list such as
    • [year, month, day, day_of_year, week_of_year, day_of_week]
  • Feature combination:Cartesian product of two feature such as
    • [cat, dog] x [male, female] -> [(cat, male), (cat, female), (dog, male), (dog, female)]

2.4.2 Text Features

  • Represent text as token features
    • Bag of words(BoW) model
      • limitations:needs careful vocabulary design, missing content
  • Word Embeddings(e.g Word2vec):
    • vectorizing words such that similar words are placed closed together
    • Trained by predicting target word from context words
  • Pre-trained language models(e.g BERT, GPT-3)
    • Giant transformer models
    • Trained with large amount of unannotated data
    • Fine-tuning for downstream tasks

2.4.3 Image/Video Features

  • traditionally extract images by hand-craft features such as SIFT
  • Now commonly use pre-trained deep neural networks
    • ResNet:trained with ImageNet(image classification)
    • I3D:trained with Kinetics(action classification)

Summary

  • Features matter
  • Features are hand-crafted or learned by deep neural networks