1.1 Intro

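I previously worked through Li Mu's hands-on machine learning course (the PyTorch version). I kept up well at first, then dropped off. My learning path seems to be a loop of starting something new, abandoning it halfway, and starting something else. Still, it is like reading a book: go through it enough times and you eventually finish. So here I go again. Everything else aside, Li Mu is an excellent teacher who stays involved in education while working in industry. Enough preamble; on to the content. CS329P is Stanford's Practical Machine Learning course. I will follow it step by step and take notes, and that is the new project I am starting. I won't dwell on the intro and will start directly from 1.2, data acquisition.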

1.2 Data Acquisition

1.2.1 Data is the foundation of everything, so data acquisition is the first step. Common datasets include:

  • MNIST
  • ImageNet
  • AudioSet
  • Kinetics
  • KITTI
  • Amazon Review
  • SQuAD
  • LibriSpeech

1.2.2 Where to find datasets:

  • Papers with Code Datasets
  • Kaggle Datasets
  • Google Dataset Search
  • Toolkit datasets: TensorFlow, Hugging Face, PyTorch
  • Conference/company ML competitions
  • Open Data on AWS
  • Data lakes in your own organization

1.2.3 Datasets comparison

  • Academic datasets: clean and of proper difficulty, but limited in choice, too simplified, and usually small in scale
  • Competition datasets: closer to real ML applications, but still simplified and only available for hot topics
  • Raw data: great flexibility, but needs a lot of effort to process

1.2.4 Data integration

For this part, look at how the merge function in pandas is used. The idea is to combine data from multiple sources into one table.
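A minimal sketch of what such a merge can look like. The two tables, their column names, and their values are made up for illustration and are not from the course; only `DataFrame.merge` with its `on`, `how`, and `indicator` parameters is the real pandas API.

```python
import pandas as pd

# Hypothetical tables: names and values are invented for this example.
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "name": ["keyboard", "mouse", "monitor"],
})
reviews = pd.DataFrame({
    "product_id": [1, 1, 3, 4],
    "rating": [5, 4, 3, 2],
})

# Inner join: keep only rows whose product_id appears in both tables.
inner = products.merge(reviews, on="product_id", how="inner")

# Left join: keep every product, even one without reviews (rating is NaN).
# indicator=True adds a "_merge" column saying where each row came from.
left = products.merge(reviews, on="product_id", how="left", indicator=True)

print(inner)
print(left)
```

The join key matters most here: rows only line up if both sources share an identifier, and the choice of `how` decides whether unmatched rows are dropped or kept with missing values.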

1.2.5 Generate synthetic data

  • Use GANs
  • Data augmentations
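As a rough illustration of the data augmentation bullet, the sketch below produces several perturbed copies of one image using only NumPy. The `augment` function, the 28 x 28 random "image", and the specific transforms (flip, small shift, Gaussian noise) are assumptions chosen for the example, not the course's recipe.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped, shifted, and noised copy of an H x W image in [0, 1]."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                # horizontal flip
    shift = rng.integers(-2, 3)           # shift by -2..2 pixels
    out = np.roll(out, shift, axis=1)     # note: np.roll wraps pixels around the edge
    out = out + rng.normal(0.0, 0.05, size=out.shape)  # additive Gaussian noise
    return np.clip(out, 0.0, 1.0)         # keep values in the valid range

rng = np.random.default_rng(0)
image = rng.random((28, 28))              # stand-in for a real grayscale image
batch = np.stack([augment(image, rng) for _ in range(4)])
print(batch.shape)
```

Each call yields a slightly different variant of the same input, which is how one labeled example can be stretched into many training examples.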

1.2.6 Summary

  • Finding the right data is challenging
  • Raw data in industrial settings vs. academic datasets
  • Data integration combines data from multiple sources
  • Data augmentation is a common practice
  • Synthesizing data is getting popular