1.4 Data Labeling

1.4.1 Semi-Supervised Learning (SSL)

  • Focus on the scenario where there is a small amount of labeled data along with a large amount of unlabeled data.
  • Make assumptions about the data distribution in order to use the unlabeled data:
    • Continuity assumption: examples that are close to each other are likely to share a label
    • Cluster assumption: the data form clusters, and examples in the same cluster tend to share a label
    • Manifold assumption: the data lie on a manifold of much lower dimension than the input space
  • Self-training: iteratively train a model on the labeled data, predict on the unlabeled data, and add the most confident predictions back as pseudo-labels

  • We can use expensive models such as deep neural networks or model ensembles/bagging, since the model is only used to predict pseudo-labels rather than being deployed (see the sketch below)
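A minimal self-training sketch, assuming scikit-learn and NumPy; the GradientBoostingClassifier base model, confidence threshold, and round count are illustrative choices, not part of the notes:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def self_train(X_l, y_l, X_u, threshold=0.95, rounds=5):
    """Iteratively pseudo-label the most confident unlabeled samples."""
    model = GradientBoostingClassifier()
    for _ in range(rounds):
        model.fit(X_l, y_l)            # train on the current labeled pool
        if len(X_u) == 0:
            break
        proba = model.predict_proba(X_u)
        keep = proba.max(axis=1) >= threshold
        if not keep.any():
            break                      # nothing confident enough to add
        # Move confident pseudo-labeled samples into the labeled pool.
        X_l = np.vstack([X_l, X_u[keep]])
        y_l = np.concatenate([y_l, model.classes_[proba[keep].argmax(axis=1)]])
        X_u = X_u[~keep]
    return model
```

scikit-learn also ships `sklearn.semi_supervised.SelfTrainingClassifier`, which wraps a similar loop around any classifier that exposes `predict_proba`.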

1.4.2 Labeling through Crowdsourcing

  • Challenges
    • Simplify user interaction
    • Quality control
    • Cost

1.4.3 Active Learning + Self-training
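The two combine naturally: let the model pseudo-label the examples it is confident about (self-training) and send the examples it is least confident about to human annotators (active learning). A hedged one-round sketch; the thresholds, labeling budget, and `query_oracle` helper (standing in for a human annotation step) are assumptions for illustration:

```python
import numpy as np

def al_plus_st_round(model, X_l, y_l, X_u, query_oracle,
                     confident=0.95, budget=100):
    """One round: pseudo-label confident samples, ask humans about uncertain ones."""
    model.fit(X_l, y_l)
    proba = model.predict_proba(X_u)
    conf = proba.max(axis=1)

    # Self-training branch: trust the model where it is very confident.
    keep = conf >= confident
    pseudo_y = model.classes_[proba[keep].argmax(axis=1)]

    # Active-learning branch: query humans on the least confident samples.
    ask = np.argsort(conf)[:budget]
    ask = ask[~keep[ask]]              # skip anything already pseudo-labeled
    human_y = query_oracle(X_u[ask])   # hypothetical human-labeling call

    X_l = np.vstack([X_l, X_u[keep], X_u[ask]])
    y_l = np.concatenate([y_l, pseudo_y, human_y])
    X_u = X_u[~keep & ~np.isin(np.arange(len(X_u)), ask)]
    return model, X_l, y_l, X_u
```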

1.4.4 Quality Control

1.4.5 Weak Supervision

  • Semi-automatically generate labels
    • Less accurate than manual labels, but good enough for training
  • Data programming: write labeling functions that encode heuristics or domain knowledge to programmatically assign noisy labels, then aggregate their votes (see the sketch below)
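A minimal data-programming sketch, assuming a text spam/ham task; the labeling functions, label names, and the simple majority vote are illustrative (frameworks such as Snorkel learn a label model instead of taking a plain vote):

```python
import numpy as np

ABSTAIN, SPAM, HAM = -1, 1, 0   # illustrative label space

# Labeling functions: cheap, noisy heuristics that may abstain.
def lf_contains_link(text):
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) < 5 else ABSTAIN

LFS = [lf_contains_link, lf_mentions_prize, lf_short_message]

def weak_labels(texts):
    """Aggregate labeling-function votes by majority, ignoring abstentions."""
    out = []
    for t in texts:
        votes = [lf(t) for lf in LFS if lf(t) != ABSTAIN]
        out.append(max(set(votes), key=votes.count) if votes else ABSTAIN)
    return np.array(out)

print(weak_labels(["Win a prize now http://spam.example", "ok thanks"]))
# -> [1 0]  (SPAM, HAM); samples where every LF abstains stay unlabeled
```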