[Python] ML - unsupervised text classification for word labeling / topic modeling in Python: word labeling notes

IT파스칼 2023. 10. 18. 13:24

These notes collect material I will use for an assignment. The goal is to group a very large set of words.

1. Latent Dirichlet Allocation (LDA) Conquered

Documents with similar topics tend to use similar sets of words.
Topics are formed by finding groups of words that frequently appear together across documents. (I will use an existing implementation.)
The user has to provide the value of K, i.e. the number of topics to look for in the collection of documents.
Documents are assumed to be probability distributions over topics.
Topics are assumed to be probability distributions over the words used in the documents.
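
To make the last two assumptions concrete, here is a small sketch of how LDA imagines a document being written: pick a topic for each word from the document's topic distribution, then pick the word from that topic's word distribution. The vocabulary, the Dirichlet parameter 0.5, and the helper names are all made up for illustration; this only shows the assumed generative story, not how a model is fitted.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from a Dirichlet(alpha) distribution."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, doc_topic, length, rng):
    """For each word slot: pick a topic, then pick a word from that topic."""
    topic_ids = list(range(len(topic_word)))
    words = []
    for _ in range(length):
        k = rng.choices(topic_ids, weights=doc_topic)[0]
        words.append(rng.choices(vocab, weights=topic_word[k])[0])
    return words


rng = random.Random(0)
vocab = ["goal", "match", "team", "vote", "party", "election"]  # toy vocabulary (made up)
K = 2  # number of topics, chosen by the user

# Each topic is a probability distribution over the vocabulary.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(K)]
# Each document is a probability distribution over the K topics.
doc_topic = sample_dirichlet([0.5] * K, rng)

document = generate_document(topic_word, vocab, doc_topic, length=10, rng=rng)
print(doc_topic)
print(document)
```

Fitting LDA goes the other way: given only the documents, it estimates the topic and word distributions that could plausibly have produced them.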

The method is explained well here and is easy to follow.

Once you set the number of groups, similar words get bundled together.

Each bundle of similar words is then treated as one category (topic).

Each of the topics found above then has to be given a name by hand. (Step 1)

Whether this can be automated is something I still need to look into.
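
One very simple starting point, which is not real automatic naming, is to join the few highest-weighted words of each topic into a provisional label and then review it by hand. The function name `suggest_label` and the weights below are hypothetical, just to show the idea:

```python
def suggest_label(word_weights, n_words=3):
    """Return a provisional label built from the n highest-weighted words."""
    ranked = sorted(word_weights.items(), key=lambda item: item[1], reverse=True)
    return "/".join(word for word, _ in ranked[:n_words])


# Hypothetical topic-word weights, e.g. taken from a fitted topic model.
topics = {
    0: {"goal": 0.30, "team": 0.25, "match": 0.20, "vote": 0.01},
    1: {"election": 0.35, "party": 0.30, "voters": 0.15, "goal": 0.02},
}

labels = {topic_id: suggest_label(weights) for topic_id, weights in topics.items()}
print(labels)  # {0: 'goal/team/match', 1: 'election/party/voters'}
```

A person would still need to turn something like "goal/team/match" into a proper name such as "sports".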

A useful model when you have a lot of words!

It seems to be meant for cases with more than about ten thousand words (?).

Preview of the output of the model above ▼

Unsupervised Text Classification In Python - Home

Unsupervised text classification in Python using LDA (Latent Dirichlet Allocation) & NMF (Non-negative Matrix Factorization)