Implementing an LDA Model in Python

2022-04-30

The LDA topic model

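Latent Dirichlet Allocation (LDA) is a generative document-topic model with a three-layer structure: words, topics, and documents. LDA is an unsupervised learning technique. It treats every word in a document as generated in two steps: a topic is chosen with some probability from the document's topic distribution, and then a word is chosen from that topic's word distribution. The document-to-topic step follows a multinomial distribution, and so does the topic-to-word step.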
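The two-step generative process described above can be sketched with NumPy. This is only a toy illustration, not part of the original script: the function name `generate_corpus`, the corpus sizes, and the Dirichlet hyperparameters `alpha` and `beta` are all assumptions chosen for the example.

```python
import numpy as np

def generate_corpus(n_docs=5, doc_len=8, n_topics=2, vocab_size=6,
                    alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus from the LDA generative process (illustrative only)."""
    rng = np.random.default_rng(seed)
    # One word distribution per topic, drawn from a Dirichlet prior.
    topic_word = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    docs, doc_topic = [], []
    for _ in range(n_docs):
        # Topic mixture of this document, drawn from a Dirichlet prior.
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # For every word position: pick a topic, then pick a word from that topic.
        topics = rng.choice(n_topics, size=doc_len, p=theta)
        words = [int(rng.choice(vocab_size, p=topic_word[z])) for z in topics]
        docs.append(words)
        doc_topic.append(theta)
    return docs, np.array(doc_topic), topic_word

docs, doc_topic, topic_word = generate_corpus()
for words, theta in zip(docs, doc_topic):
    print(words, theta.round(2))
```

Each document is a list of word ids. Fitting LDA goes in the opposite direction: given only the words, it estimates the document-topic and topic-word distributions.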

Example code

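My understanding of LDA is not yet very deep, and I am still unsure about the right analysis methods and angles, so for now this post just records a working script. There is more to learn, such as handling semantic knowledge and choosing the number of topics based on perplexity.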

# -*- coding: utf-8 -*-
# @Time : 2022/4/11 11:35
# @Author : MinChess
# @File : lda.py
# @Software: PyCharm 
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Read the corpus (already word-segmented): one line is one document
with open('fenci.txt', 'r', encoding='utf-8') as f:
    corpus = [line.strip() for line in f]

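# Compute TF-IDF features
# Note: LDA is defined over raw term counts, so CountVectorizer (imported
# above) is the more common input; this script uses TF-IDF weights instead.
# Maximum number of features (vocabulary size)
n_features = 2000

tf_vectorizer = TfidfVectorizer(strip_accents='unicode',
                                max_features=n_features,
                                stop_words=['的'],
                                max_df=0.99,
                                min_df=0.002)  # drop terms that appear in too many or too few documents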

tf = tf_vectorizer.fit_transform(corpus)

print(tf.shape)
print(tf)

# LDA analysis
from sklearn.decomposition import LatentDirichletAllocation

# Number of topics
n_topics = 2

lda = LatentDirichletAllocation(n_components=n_topics,
                                max_iter=100,
                                learning_method='online',
                                learning_offset=50,
                                random_state=0)
lda.fit(tf)

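# Topic-word matrix (the counterpart of model.topic_word_ in the lda package)
print(lda.components_)
# One row per topic, one column per vocabulary term
print(lda.components_.shape)

# Compute perplexity
print('Perplexity:')
print(lda.perplexity(tf, sub_sampling=False))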

# Top keywords for each topic
def print_top_words(model, tf_feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print('Topic #%d:' % topic_idx)
        print(' '.join([tf_feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print("")

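# With the function defined, print the top 20 keywords per topic for now
n_top_words = 20
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tf_feature_names = tf_vectorizer.get_feature_names_out()
# Call the function
print_top_words(lda, tf_feature_names, n_top_words)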

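# Visualization
import pyLDAvis
try:
    import pyLDAvis.lda_model as lda_vis  # pyLDAvis 3.4 and later
except ImportError:
    import pyLDAvis.sklearn as lda_vis  # older pyLDAvis releases

# pyLDAvis.enable_notebook()

data = lda_vis.prepare(lda, tf, tf_vectorizer)
print(data)

# Show the figure
# pyLDAvis.show(data)
# pyLDAvis.save_html(data, 'fileobj.html')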
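Choosing the number of topics by perplexity, mentioned earlier as something still to learn, could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not a definitive recipe: the toy corpus, the helper name `perplexity_by_topic_count`, the candidate topic counts, and the use of `CountVectorizer` with a held-out split are all choices made for this example.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy, already-segmented corpus (hypothetical data, for illustration only).
toy_corpus = [
    "cat dog pet animal", "dog pet food animal", "cat pet food toy",
    "stock market price trade", "market trade bank price", "bank stock money trade",
] * 5

# Raw term counts, split into a training part and a held-out part.
counts = CountVectorizer().fit_transform(toy_corpus)
train, held_out = train_test_split(counts, test_size=0.3, random_state=0)

def perplexity_by_topic_count(train, held_out, candidates):
    """Fit one LDA model per candidate topic count and score it on held-out data."""
    scores = {}
    for k in candidates:
        model = LatentDirichletAllocation(n_components=k, max_iter=20, random_state=0)
        model.fit(train)
        scores[k] = model.perplexity(held_out)
    return scores

scores = perplexity_by_topic_count(train, held_out, candidates=[2, 3, 4, 5])
best_k = min(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(score, 2))
print("lowest held-out perplexity at n_topics =", best_k)
```

Lower perplexity is better, but the minimum does not always match the most interpretable topic count, so it is worth reading the top keywords for the leading candidates as well.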
