1.1 Intro
之前学习了李沐的机器学习实战pytorch版本。前面坚持的还不错,之后又断掉了,感觉学习的轨迹就是不断的开新坑,半途而废,继续开新坑。不过就像读书一样,一遍一遍读,总能最后读完的。所以,这次又来了。抛开别的不谈,李沐还是很棒的老师,在业界的同时还能兼顾教育。废话不多说,还是关注内容。 CS329P是Stanford的Practical Machine Learning课程。我跟着一步一步写笔记,这就是要开的新坑。 Intro部分不多赘述,直接从1.2,data acquisition开始。
1.2 Data Acquisition
1.2.1 数据是一切的基础。所以data acquisition是第一步。常见数据集有:
- MNIST
- ImageNet
- AudioSet
- Kinetics
- KITTI
- Amazon Review
- SQuAD
- LibriSpeech
1.2.2Where to find datasets:
- Paperwithcodes Datasets
- Kaggle Datasets
- Google Dataset search
- toolkits datasets:tensorflow huggingface pytorch
- Conference/company ML competitions
- Open Data on AWS
- Data lakes in your own organization
1.2.3 Datasets comparison
- Academic datasets:Clean,proper difficulty but limited choices,too simplified,usually small scale
- Closer to real ML applications but still simplified,and only avaliable for hot topics
- Raw data:Great flexibility but needs a lot of effort to process
1.2.4 Data integration
这一部分可以看pandas里的merge函数的使用。也就是把多种数据merge在一起。
1.2.5 Generate synthetic data
- Use GANs
- Data augmentations
1.2.6 Summary
- Finding the right data is challenging
- Raw data in industrial settings VS academic datasets
- Data integration combines data from multiple sources
- Data augmentation a common practice
- Synthesizing data is getting popular