
# How to approach machine learning problems with high-dimensional input space?

Guney Ozsan · Machine Learning · 2022-5-6 19:33

How should I approach a situation where I apply some ML algorithm (classification, and more specifically an SVM) to some high-dimensional input, and the results I get are not quite satisfactory?

1-, 2-, or 3-dimensional data can be visualized along with the algorithm's results, so you can get the hang of what's going on and have some idea of how to approach the problem. Once the data has more than 3 dimensions, other than intuitively playing around with the parameters, I am not really sure how to attack it.

### Answers

What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.

To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
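To see why, the argument can be written out explicitly in the notation above: a hyperplane separating the projected classes pulls back through y = Ax to a hyperplane separating the original classes.

```latex
\begin{aligned}
&\text{Suppose } w^{\top} y + b > 0 \;\; \forall\, y \in Y_1
  \quad\text{and}\quad w^{\top} y + b < 0 \;\; \forall\, y \in Y_2 . \\
&\text{Since } y = A x, \text{ for every } x \in X_1 :\quad
  (A^{\top} w)^{\top} x + b \;=\; w^{\top} A x + b \;=\; w^{\top} y + b \;>\; 0 , \\
&\text{and likewise } (A^{\top} w)^{\top} x + b < 0 \text{ for every } x \in X_2 , \\
&\text{so the hyperplane with normal } A^{\top} w \in \mathbb{R}^m
  \text{ separates } X_1 \text{ and } X_2 .
\end{aligned}
```

Note the converse does not hold: classes can be separable in R^m but not in some projection, which is exactly why projecting down (PCA) cannot create separability that wasn't already there.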

Here is one discussion that debates the use of PCA before SVM: link

What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm faq, particularly this entry link, contains more helpful tips. Among them:

1. Scale the features before classification.
2. Try to obtain balanced classes. If that is not possible, penalize one class more than the other. See further references on SVMs with imbalanced data.
3. Check the SVM parameters. Try many combinations to arrive at the best one.
4. Start with the RBF kernel. It almost always works best (computationally speaking).
5. Almost forgot... before testing, cross-validate!
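The tips above can be sketched in a few lines. This is a minimal illustration using scikit-learn rather than libsvm directly (scikit-learn's `SVC` wraps libsvm), on synthetic data standing in for the real problem; the specific C/gamma grid values are arbitrary choices, not recommendations.

```python
# Sketch: scale features, balance class penalties, grid-search C and
# gamma for an RBF-kernel SVM, all evaluated with cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic high-dimensional data stands in for the real problem.
X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),           # tip 1: scale features first
    ("svm", SVC(kernel="rbf",              # tip 4: start with the RBF kernel
                class_weight="balanced")), # tip 2: penalize the rarer class more
])

# Tip 3: try many (C, gamma) combinations; tip 5: cross-validate.
grid = GridSearchCV(pipe,
                    {"svm__C": [0.1, 1, 10, 100],
                     "svm__gamma": [1e-3, 1e-2, 1e-1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Scaling lives inside the pipeline so that the cross-validation folds are scaled using only their training portion, avoiding leakage.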

Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
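The "simple diagonal scaling" mentioned above is easy to state in code; here is a dependency-free sketch (the function name and the toy matrix are illustrative, not from the experiment):

```python
import math

def diagonal_scale(rows):
    """For each feature (column), subtract the mean and divide by the
    population standard deviation; columns with zero spread are left
    centered but undivided."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(m)]
    return [[(r[j] - means[j]) / stds[j] for j in range(m)] for r in rows]

scaled = diagonal_scale([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
```

After scaling, each feature has mean 0 and unit variance, so no single feature dominates the kernel distance computations.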
