Data format:
The raw feature input file uses the libsvm format: each line has the form label index1:value1 index2:value2, which is a sparse-matrix representation.
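As a rough illustration of the format, the sketch below parses one libsvm-style line by hand. The function name parse_libsvm_line and the sample line are made up for this example; in practice load_svmlight_file does this work for the whole file.

```python
def parse_libsvm_line(line):
    """Parse 'label index1:value1 index2:value2 ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


# Hypothetical sample line: label 1, non-zero values at feature indices 3 and 7.
label, features = parse_libsvm_line("1 3:0.5 7:1")
print(label, features)
```

Only the non-zero entries are stored, which is why the format suits sparse feature matrices.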
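sklearn ships with many feature selection algorithms. Which one to use depends on the dataset and on the model to be trained.
The following is a usage example for chi2. chi2 selects features with the chi-squared test; it requires non-negative feature values and works well with 0/1 features and sparse matrices.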
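# joblib is a standalone package; sklearn.externals.joblib is gone from recent scikit-learn releases.
from joblib import Memory
from sklearn.datasets import load_svmlight_file, dump_svmlight_file
from sklearn.feature_selection import SelectKBest, chi2

# Cache the parsed file on disk so repeated runs skip the slow parsing step.
mem = Memory("./mycache")

@mem.cache
def get_data():
    data = load_svmlight_file("labeled_fea.txt")
    return data[0], data[1]

X, y = get_data()

# Keep the 10000 features with the highest chi-squared scores.
data = SelectKBest(chi2, k=10000).fit_transform(X, y)

# Write the reduced matrix back out in libsvm format with 1-based indices.
dump_svmlight_file(data, y, "labeled_chi2_fea.txt", zero_based=False)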
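To make the chi2 score more concrete, here is a toy pure-Python sketch of the standard observed-versus-expected chi-squared statistic computed per feature. It is an illustration only, not sklearn's implementation; the function names chi2_scores and top_k_features and the toy data are made up for this example.

```python
def chi2_scores(X, y):
    """Chi-squared score per feature for non-negative features (dense lists)."""
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            class_prob = sum(1 for label in y if label == c) / n_samples
            # Observed: how much of feature j falls in class c.
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Expected: what class c would get if feature j ignored the label.
            expected = class_prob * feature_total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def top_k_features(scores, k):
    """Indices of the k highest-scoring features, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy data: feature 0 equals the label, feature 1 is constant.
X_toy = [[1, 1], [1, 1], [0, 1], [0, 1]]
y_toy = [1, 1, 0, 0]
scores = chi2_scores(X_toy, y_toy)
best = top_k_features(scores, k=1)
print(scores, best)
```

A feature that is spread across classes in proportion to the class sizes scores 0, while a feature concentrated in one class scores high, which is what SelectKBest ranks on.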
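sklearn also offers many classification models, and they share a uniform interface, which makes them easy to use.
Before classification you can skip feature selection, run feature selection as a separate step first, or integrate feature selection and classification through a pipeline.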
from time import time

from joblib import Memory
from sklearn import metrics
from sklearn.datasets import load_svmlight_file
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, chi2
from sklearn.linear_model import Perceptron, RidgeClassifier, SGDClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

mem = Memory("./mycache")

@mem.cache
def get_data():
    data = load_svmlight_file("labeled_fea.txt")
    return data[0], data[1]

X, y = get_data()

# The first 800000 rows are used for training, the rest for testing.
train_X = X[0:800000]
train_y = y[0:800000]
test_X = X[800000:]
test_y = y[800000:]
print(train_X.shape)
print(test_X.shape)

# Standalone feature selection: fit on the training set only, then apply to the test set.
ch2 = SelectKBest(chi2, k=10000)
train_X = ch2.fit_transform(train_X, train_y)
test_X = ch2.transform(test_X)

# Train the given classifier, then evaluate it on the test set.
def benchmark(clf):
    print('_' * 80)
    print("Training: ")
    print(clf)
    t0 = time()
    clf.fit(train_X, train_y)
    train_time = time() - t0
    print("train time: %0.3fs" % train_time)

    t0 = time()
    pred = clf.predict(test_X)
    test_time = time() - t0
    print("test time: %0.3fs" % test_time)

    score = metrics.accuracy_score(test_y, pred)
    print("accuracy: %0.3f" % score)

    clf_descr = str(clf).split('(')[0]
    return clf_descr, score, train_time, test_time

# Pick one classifier; the alternatives are left commented out.
# Note: recent scikit-learn releases use max_iter where older ones used n_iter.
clf = RandomForestClassifier(n_estimators=100)
#clf = RidgeClassifier(tol=1e-2, solver="lsqr")
#clf = Perceptron(max_iter=50)
#clf = LinearSVC()
#clf = GradientBoostingClassifier()
#clf = SGDClassifier(alpha=.0001, max_iter=50, penalty="l1")
#clf = SGDClassifier(alpha=.0001, max_iter=50, penalty="elasticnet")
#clf = NearestCentroid()
#clf = MultinomialNB(alpha=.01)
#clf = BernoulliNB(alpha=.01)

# Pipeline: combine feature selection and classification in one model.
# Recent scikit-learn releases need SelectFromModel around the L1-penalized LinearSVC,
# because a plain classifier is no longer accepted as a transformer step.
#clf = Pipeline([
#    ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False, tol=1e-3))),
#    ('classification', LinearSVC())])

benchmark(clf)
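The pipeline option can also be tried on a small scale. The following is a minimal sketch, assuming a made-up 0/1 toy dataset and BernoulliNB as the classifier; the step names and the variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Toy 0/1 data (made up): feature 0 equals the label, features 1 and 2 carry no signal.
X_toy = np.array([
    [1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 1, 1],
    [0, 0, 1], [0, 1, 0], [0, 0, 0], [0, 1, 1],
])
y_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Feature selection and classification are fitted together as one estimator.
pipe = Pipeline([
    ("feature_selection", SelectKBest(chi2, k=1)),
    ("classification", BernoulliNB(alpha=0.01)),
])
pipe.fit(X_toy, y_toy)

# Boolean mask of the columns the selector kept, and predictions on the toy rows.
selected = pipe.named_steps["feature_selection"].get_support()
pred = pipe.predict(X_toy)
print(selected, pred)
```

Because the selector is fitted inside pipe.fit, calling pipe.predict on new data applies the same column selection automatically, so there is no separate transform call to forget.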
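Note that the program above runs training and prediction in the same script. In real applications, training and prediction are separate steps. You therefore need Python's object serialization: after each training run, serialize the model object to save its state; at prediction time, deserialize the model object to restore that state.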
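One possible way to do this is with the standard pickle module. The sketch below is only an illustration under assumptions: the toy data, the BernoulliNB model, and the file name model.pkl are all made up, and the model is written to a temporary directory.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Training side: fit a model on toy data (made up) and save it to disk.
X_toy = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y_toy = np.array([1, 1, 0, 0])
clf = BernoulliNB(alpha=0.01).fit(X_toy, y_toy)

model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")  # hypothetical file name
with open(model_path, "wb") as f:
    pickle.dump(clf, f)

# Prediction side: restore the model and use it without retraining.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

pred = restored.predict(X_toy)
print(pred)
```

Only unpickle files you trust, and where possible load the model with the same scikit-learn version that saved it.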
References:
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.dump_svmlight_file.html
http://scikit-learn.org/stable/modules/feature_selection.html#feature-selection
Author: linger
Link to this post: http://blog.csdn.net/lingerlanlan/article/details/47960127