Info.
|
Vol.4 - No.4 (2010.12.20) |
Title
|
Document clustering of MEDLINE abstracts based on non-negative matrix factorization using local confidence assessment |
Authors
|
Byeong-Chul Kang1, Zee-Won Sur2, Chulhwan Park3 & Man-gi Cho4 |
Institutions
|
1Insilicogen Inc., #909 Chomdan Venture Valley Suwon,
Gosaek-dong, Gweonseon-gu, Gyeonggi-do 441-813, Korea
2Bayer AG, Langenfeld, Germany 40764
3Department of Chemical Engineering, Kwangwoon University,
Kwangwoon-gil 26, Nowon-gu, Seoul 139-701, Korea
4Department of Food and Biotechnology, Dongseo University,
San 69-1 Jurye-2-dong, Sasang-gu, Busan 617-716, Korea
Correspondence and requests for materials should be addressed to
C. Park and M.-G. Cho (chpark@kw.ac.kr, mgcho@gdsu.dongseo.ac.kr) |
Abstract
|
A PubMed search is one of the most exhaustive ways of finding information on any biological or biomedical topic. However, a keyword search that is not specific enough returns far more documents than a user can read through one by one. In this work, we therefore present MedClus, a new document clustering tool for bioinformaticians that makes a PubMed keyword search result more concise by grouping the retrieved documents into clusters. MedClus contains two modules. The first is a pre-clustering module that creates the data matrix. This matrix contains term-document frequencies computed with the TF*IDF method, together with optional weights. These weights are assigned by comparing the term list with the MeSH terms contained in the related MEDLINE abstracts. The second is a clustering module based on a non-negative matrix factorization algorithm, which finds an approximate factorization of the data matrix. The application was tested in several experiments that evaluated its performance and reliability. Based on these results, a list of recommended ranges for crucial parameters, such as the number of clusters, was compiled to assist users in applying MedClus. Finally, some results were analyzed by scientists from the fields of medicine and biology, who evaluated the relevance of the terms and whether the terms were related to one another. MedClus is thus able to restructure the result list of a PubMed keyword search. It does so by extracting terms before clustering and finding latent semantics during the clustering process. Optionally, it also applies weights to terms that appear as MeSH terms in at least one of the MEDLINE abstracts. It therefore helps users refine a PubMed search result through term-based clustering, saving time and effort.
At this development stage, the software is suitable for experienced users such as bioinformaticians, database administrators and developers. In addition, the web service of the Semantic Toxicogenomics Knowledgebase, available at http://stkb2.labkm.net, has applied this technology to provide comprehensive and accurate relations between chemical and toxicological contexts. |
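The two modules described in the abstract can be illustrated with a minimal sketch. It is not the MedClus implementation: the function names, the whitespace tokenizer, the smoothed IDF formula, the `mesh_weight` factor, and the use of the standard Lee-Seung multiplicative updates are all assumptions made for illustration, and the local confidence assessment named in the title is not modeled.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs, mesh_terms=frozenset(), mesh_weight=2.0):
    """Pre-clustering step: build a term-by-document TF*IDF matrix.

    Terms that also occur in `mesh_terms` are multiplied by the
    hypothetical factor `mesh_weight` (the abstract gives no value).
    """
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({t for toks in tokenized for t in toks})
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    matrix = []
    for term in terms:
        # Assumed IDF variant: log(N / df) + 1, so common terms keep some weight.
        idf = math.log(len(docs) / doc_freq[term]) + 1.0
        boost = mesh_weight if term in mesh_terms else 1.0
        matrix.append([toks.count(term) * idf * boost for toks in tokenized])
    return terms, matrix


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    b_t = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Clustering step: approximate V (terms x docs) by W (terms x k) times H (k x docs).

    Uses multiplicative updates for the squared-error objective (an assumption).
    """
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(m)]
    h = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        w_t = transpose(w)
        num, den = matmul(w_t, v), matmul(matmul(w_t, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)]
             for a in range(k)]
        h_t = transpose(h)
        num, den = matmul(v, h_t), matmul(w, matmul(h, h_t))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(m)]
    return w, h


def assign_clusters(h):
    """Label each document with the factor that has the largest coefficient."""
    return [max(range(len(h)), key=lambda a: h[a][j]) for j in range(len(h[0]))]


# Toy stand-ins for MEDLINE abstracts (invented for illustration only).
docs = [
    "liver toxicity drug dose",
    "drug dose liver injury",
    "gene expression microarray cluster",
    "microarray gene expression profile",
]
terms, V = tfidf_matrix(docs, mesh_terms={"liver", "microarray"})
W, H = nmf(V, k=2)
labels = assign_clusters(H)
print(labels)
```

In this sketch each column of `H` describes one document in terms of `k` latent factors, and each column of `W` lists how strongly every term contributes to a factor, so the highest-weighted terms in a column of `W` can serve as a rough description of the corresponding cluster. The number of clusters `k` is the kind of parameter for which the abstract says recommended ranges were compiled.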
Keyword
|
Text-mining, Literature clustering, Non-negative matrix factorization, Local confidence assessment, Bioinformatics |
PDF File
|
# For issues published from 2010 onward, the full text is available on Springer's BioChip Journal page.
|