Institute of Big Data Science and Industry, Shanxi University

Local Feature Selection for Large-scale Data Sets with Limited Labels

Authors: Tian Yang, Yanfang Deng, Bin Yu, Yuhua Qian, Jianhua Dai

Abstract:

Processing large-scale data sets with limited labels has long been a difficult task in data mining. To address it, two local feature selection algorithms based on dependency degree, LARD and LRSD, have been proposed; they can process partially labeled data sets and greatly improve computational efficiency. However, these algorithms still struggle to process large-scale data with millions of samples on a typical personal computer. The related family method is more efficient than dependency degree, but it cannot be applied to partially labeled large-scale data. This paper therefore proposes a local feature selection method based on related family to accelerate data processing. Experiments show that the proposed algorithm can run 405 times faster than LARD on partially labeled data sets while maintaining high classification accuracy. In addition, the new algorithm can effectively process partially labeled large-scale data sets with 5,000,000 samples or 20,000 features on a typical personal computer.
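
For readers unfamiliar with the dependency degree on which LARD and LRSD are based, the following is a minimal sketch of the classical rough-set definition, gamma_B(D) = |POS_B(D)| / |U|, computed over labeled samples only. It is not the LARD, LRSD, or related family algorithm from the paper; the function name `dependency_degree` and the toy data are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def dependency_degree(rows, labels, features):
    """Classical rough-set dependency degree of the labels on a feature subset.

    rows     -- list of tuples of discrete feature values (labeled samples only)
    labels   -- list of decision labels, one per row
    features -- indices of the candidate feature subset B
    """
    if not rows:
        return 0.0
    # Group samples into equivalence classes by their values on B.
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[f] for f in features)].append(i)
    # A class belongs to the positive region when all its samples share one label.
    positive = sum(
        len(members)
        for members in classes.values()
        if len({labels[i] for i in members}) == 1
    )
    return positive / len(rows)


# Hypothetical toy data: four labeled samples with two discrete features.
rows = [(0, 0), (0, 1), (1, 0), (1, 0)]
labels = ["a", "b", "a", "b"]
print(dependency_degree(rows, labels, [0, 1]))
```

A greedy selector built on this measure would repeatedly add the feature that raises the dependency degree the most. Every such step regroups all samples, which gives some intuition for why dependency-based methods become costly at millions of samples.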

Keywords: Data mining, Semi-supervised learning, Local feature selection, Rough set, Related family


Wed Jul 20 10:27:05 CST 2022