How to Analyse and Clean Text Data in Python

This is a beginner's tutorial (by example) on how to analyse text data in Python, using a small and simple data set of dummy tweets and well-commented code. It will show you how to write code that will:

Import a CSV file of tweets
Find tweets that contain certain things, such as hashtags and URLs
Create a wordcloud
Clean the text data using regular expressions ("regex")
Show you what tokenisation is and how to do it
Explain what stop words are and how to remove them
Create a chart showing the most frequent words in the tweets and their frequencies

The Jupyter Notebook is available on GitHub.

The code uses pandas DataFrames, so if you are not familiar with them, it is worth working through an introductory pandas tutorial first.

First, import the libraries and set the relevant configurations:

import numpy as np   # an essential python module
import pandas as pd   # for importing & transforming data
import re   # for regular expressions
import matplotlib.pyplot as plt   # for wordclouds & charts
import seaborn as sns   # for charts
sns.set_style("whitegrid");   # chart background style
plt.rcParams['figure.dpi'] = 360   # for high res chart output
from wordcloud import WordCloud   # for the wordcloud :)
import spacy   # for tokenising text
from spacy.lang.en import English  # for tokenising text
nlp = English()   # for tokenising text
from collections import Counter   # for getting freq of words

Next, let's import the CSV file, which contains 2 columns (tweet_id and tweet) and 10 rows. I created this CSV by simply copying a few tweets from Twitter into an Excel file and saving it as a CSV file.
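If you do not have such a file yet, a stand-in can be generated with Python's built-in csv module. This is only a sketch: the three rows are invented placeholders (not real tweets), and the file is written to a temporary folder rather than next to your notebook.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical placeholder rows: invented text, not real tweets.
sample_rows = [
    {"tweet_id": 1, "tweet": "Loving the new #python release! https://example.com/news"},
    {"tweet_id": 2, "tweet": "@Dilbert_Daily Be kind. be curious."},
    {"tweet_id": 3, "tweet": "Nothing to see here"},
]

# Write to a temporary folder so nothing in the working directory is overwritten.
csv_path = Path(tempfile.mkdtemp()) / "tweet_sample.csv"
with csv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["tweet_id", "tweet"])
    writer.writeheader()
    writer.writerows(sample_rows)

# Read the file back to confirm that the two expected columns are present.
with csv_path.open(newline="", encoding="utf-8") as f:
    rows_read = list(csv.DictReader(f))

print(list(rows_read[0].keys()))
```

To use the stand-in with the code in this tutorial, pass str(csv_path) to pd.read_csv in place of "tweet_sample.csv".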

# import the csv file into a Pandas dataframe
tweet_df = pd.read_csv("tweet_sample.csv")

# view the shape of the data (the number of rows and columns)
print(f"The shape of the data is: {tweet_df.shape}")

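# view the data with the "tweet" column widened to 800px 
# so that the full tweet is displayed,
# and hide the index column
# (Styler.hide_index() was removed in pandas 2.0;
# on pandas older than 1.4 use .hide_index() instead)
tweet_df\
.style.set_properties(subset=['tweet'], **{'width': '800px'})\
.hide(axis="index")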

My output looks like this:

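(Note: if you are new to f-strings, they are the print(f"...{}") construct used in the code above: any expression placed inside the curly braces is evaluated and inserted into the string.)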
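As a quick illustration of the f-string syntax, here is a minimal sketch. The values are stand-ins chosen for the example; in the tutorial the tuple comes from tweet_df.shape.

```python
# Stand-in value; in the tutorial this would be tweet_df.shape.
shape = (10, 2)

# The expression inside the braces is evaluated and inserted into the string.
message = f"The shape of the data is: {shape}"
print(message)

# A format specifier after a colon controls how the value is rendered.
share = f"{3 / 10:.0%} of tweets"
print(share)
```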

Now let's find all the tweets that contain a hashtag.

# let's find out how many tweets contain a hashtag
tweets_with_hashtags = tweet_df.loc[tweet_df["tweet"].str.contains("#")]

# view the number of tweets that contain a hashtag
print(f"Number of tweets containing hashtags: {len(tweets_with_hashtags)}")

# view the tweets that contain a hashtag
tweets_with_hashtags\
.style.set_properties(subset=['tweet'], **{'width': '800px'}).hide(axis="index")

My output:

How many contain a URL?

# how many tweets contain a URL i.e. "http"?
tweets_with_URLs = tweet_df.loc[tweet_df["tweet"].str.contains("http")]

# view the number of tweets that contain a URL
print(f"Number of tweets containing URLs: {len(tweets_with_URLs)}")

# view the tweets that contain a URL
tweets_with_URLs\
.style.set_properties(subset=['tweet'], **{'width': '800px'}).hide(axis="index")

My output:
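As an aside, the two filters above only tell us whether a tweet contains a "#" or "http". If you also want the hashtags and URLs themselves, one possible approach is re.findall. The sketch below uses invented example text rather than the tutorial's CSV, and the patterns are simple assumptions that will not cover every valid hashtag or URL.

```python
import re

# Hypothetical example text, not taken from the tutorial's CSV.
tweets = [
    "Loving the new #python release! https://example.com/news",
    "No tags or links here",
    "#data and #nlp go together http://example.org",
]

# '#\w+' matches a '#' followed by letters, digits or underscores;
# 'http\S+' matches 'http' and everything up to the next whitespace.
hashtags = [re.findall(r"#\w+", tweet) for tweet in tweets]
urls = [re.findall(r"http\S+", tweet) for tweet in tweets]

for tweet, tags, links in zip(tweets, hashtags, urls):
    print(f"{tweet!r}: hashtags={tags}, urls={links}")
```

With a DataFrame, the same patterns should work column-wise via tweet_df["tweet"].str.findall(r"#\w+").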

Let's create a wordcloud. Before we can do that, we need to create one long string containing all the tweets.

# create a single string containing all the tweets, 
# as this will be needed to be able to create a wordcloud
tweet_string = " ".join(tweet for tweet in tweet_df["tweet"])

# view the first 200 elements of the string to check 
# this worked as expected
tweet_string[0:200]

My output:

Now we can create the wordcloud from this long string and then view it. We will only view the top 100 words, so max_words will be set to 100.

# create the wordcloud
tweet_wordcloud = WordCloud(background_color="white", 
                              max_words=100, 
                             ).generate(tweet_string)

# view the wordcloud
plt.imshow(tweet_wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

My output:

Let's clean the data using Python's re library, which provides regular expressions ("regex"). If regex is new to you, the Regular Expression HOWTO in the official Python documentation is a useful guide.

First, let's replace the mentions (e.g. @Dilbert_Daily) with '@USER'.

# replace all the mentions (e.g. @Dilbert_Daily) 
# from the tweets with '@USER'
tweet_string = re.sub(r'@\w+','@USER ', tweet_string)

# view the first 200 elements of the string to check 
# this worked as expected
tweet_string[0:200]

My output:

... then replace all the URLs with '_URL_'

# replace all the URLs with '_URL_'
tweet_string = re.sub(r'http\S+','_URL_ ', tweet_string)

# view the first 200 elements of the string to check 
# this worked as expected
tweet_string[0:200]

My output:

... then convert the text to lower case so that, for example, instead of having "Be" and "be" included as 2 separate words, we only have "be":

# convert the text to lower case so, for example, instead 
# of having "Be" and "be" included
# as 2 separate words, we'd only have "be"
tweet_string = tweet_string.lower()

# view the first 200 elements of the string to check 
# this worked as expected
tweet_string[0:200]

My output:

Let's remove the extra white spaces so that there is only one space between words:

# remove extra white spaces so there is only one 
# space between words
tweet_string = re.sub(r'\s+',' ', tweet_string)

# view the first 200 elements of the string to 
# check this worked as expected
tweet_string[0:200]

My output:
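The four cleaning steps can also be wrapped in a single function, which makes it easier to rerun them while experimenting. This is a minimal sketch: clean_text is a name made up for this example, and the sample sentence is invented.

```python
import re

def clean_text(text: str) -> str:
    """Apply the four cleaning steps in the same order as the tutorial."""
    text = re.sub(r"@\w+", "@USER ", text)     # mentions -> @USER
    text = re.sub(r"http\S+", "_URL_ ", text)  # URLs -> _URL_
    text = text.lower()                        # "Be" and "be" become one word
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text

# Hypothetical example sentence.
example = "Hello @Dilbert_Daily  check https://example.com NOW"
cleaned = clean_text(example)
print(cleaned)  # hello @user check _url_ now
```

Note that the placeholders come out as "@user" and "_url_" because the lower-casing step runs after the replacements, exactly as in the step-by-step code above.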

Let's look at the wordcloud for this cleaned string:

# create the wordcloud
tweet_wordcloud = WordCloud(background_color="white", 
                              max_words=100, 
                             ).generate(tweet_string)

# view the wordcloud
plt.imshow(tweet_wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Let's change it so that the wordcloud only shows the top 25 words:

# create the wordcloud
tweet_wordcloud = WordCloud(background_color="white", 
                              max_words=25, 
                             ).generate(tweet_string)

# view the wordcloud
plt.imshow(tweet_wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Next, let's tokenise the string, i.e. split it into individual tokens (individual items in a collection), using spaCy. In this example each token will be a word, but that is not the only option used in NLP. We could have chosen tokens to be individual characters, parts of words, 2 words (known as 2-grams or bi-grams), 3 words (3-grams or tri-grams), 4 words (4-grams), sentences, and so on. For this example, though, each token will be a word.

# create a spacy document by pointing spacy to the 
# tweet string
tweet_doc = nlp(tweet_string)

# get all tokens that aren't punctuation
tweet_words = [token.text for token in tweet_doc if not token.is_punct]

# get the frequency of each word (token) in the tweet string
tweet_word_freq = Counter(tweet_words)

# get the 5 most frequent words
five_most_common_words = tweet_word_freq.most_common(5)

# view the 5 most common words
five_most_common_words

My output:
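To make the other token choices mentioned earlier more concrete, here is a small sketch of word n-grams and character tokens in plain Python. It is an illustration only: ngrams is a helper written for this example (it is not part of spaCy), and splitting on whitespace is a simplification of what the spaCy tokeniser does.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace splitting is a simplification; spaCy also separates punctuation.
words = "be the change you want to see".split()

unigrams = ngrams(words, 1)   # one word per token
bigrams = ngrams(words, 2)    # 2-grams: pairs of neighbouring words
trigrams = ngrams(words, 3)   # 3-grams: triples of neighbouring words
characters = list("be the")   # character-level tokens, spaces included

print(bigrams[:2])
print(trigrams[0])
print(characters)
```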

Now let's create a pandas DataFrame containing all the tokens (words) and the frequency of each word, so that we can use it to create a chart.

# create a Pandas dataframe containing the tokens 
# (words) and their frequencies
freq_df = pd.DataFrame.from_dict(tweet_word_freq, orient='index').reset_index()

# rename the columns to "word" and "freq"
freq_df.columns=["word", "freq"]

# sort the dataframe so that the most frequent word is 
# at the top and view the first 3 rows
freq_df.sort_values(by="freq", ascending=False).head(3)

My output:

Display a bar chart (using seaborn, which we aliased as sns) of the 25 most frequent words and their frequencies:

# display a bar chart showing the top 25 words 
# and their frequencies
fig, ax = plt.subplots(figsize=(12,6))
sns.barplot(data=freq_df.sort_values(by="freq", ascending=False).head(25), 
            y="word", 
            x="freq", 
            color='#7bbcd5')
plt.ylabel("Word")
plt.xlabel("Frequency")
plt.title("Top 25 Most Frequent Words")
sns.despine();

My output:

(Note: if you are new to creating charts in Python with seaborn, an introductory seaborn tutorial may be useful.)

Next, let's remove words that are really common but not very useful for understanding the topics being discussed, such as "and". In the NLP world these are called "stop words". They can be removed easily using the is_stop attribute of the spaCy tokeniser's tokens:

# get all tokens that aren't punctuation 
# and aren't stopwords
tweet_words = [token.text for token in tweet_doc
               if not token.is_punct and not token.is_stop]

# get the frequency of each word (token) in the tweet string
tweet_word_freq = Counter(tweet_words)

# re-create the Pandas dataframe containing the 
# tokens (words) and their frequencies
freq_df = pd.DataFrame.from_dict(tweet_word_freq, orient='index').reset_index()

# rename the columns to "word" and "freq"
freq_df.columns=["word", "freq"]

# display a bar chart showing the top 25 words and their
# frequencies (which will exclude the stopwords this time)
fig, ax = plt.subplots(figsize=(12,6))
sns.barplot(data=freq_df.sort_values(by="freq", ascending=False).head(25), 
            y="word", 
            x="freq", 
            color='#7bbcd5')
plt.ylabel("Word")
plt.xlabel("Frequency")
plt.title("Top 25 Most Frequent Words (Excluding Stopwords)")
plt.xticks([0,1,2,3])
sns.despine();

My output:

The chart now gives us a much better indication of the topics discussed in the tweet text.
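To see why removing stop words changes the picture so much, here is a toy sketch that does the same thing by hand. The stop word set below is a tiny invented one for illustration (spaCy's own list is much longer), and the sentence is made up.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word set; not spaCy's list.
STOP_WORDS = {"and", "the", "to", "a", "of", "is"}

# Whitespace splitting stands in for the spaCy tokeniser here.
tokens = "the cat and the dog like to chase the ball".split()

# Frequencies with and without the stop words.
all_counts = Counter(tokens)
content_counts = Counter(t for t in tokens if t not in STOP_WORDS)

print(all_counts.most_common(1))   # the top word is a stop word
print(sorted(content_counts))      # only the content words remain
```

Before filtering, the most frequent word is "the", which says nothing about the topic; after filtering, only the content words are left.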

There is plenty more you could do, for example:

How many tokens (words) are in the longest tweet?
How many are in the shortest?
What is the average number of tokens? (The answers to these length questions are useful later if you are going to use machine learning models.)
Are there any empty documents (tweets)? Our data set is so small that we can see there are no empty tweets, but in real data sets, which are larger, you would need to find out programmatically.
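The questions above could be answered along the following lines. This is a sketch on invented documents, with whitespace splitting used as a rough stand-in for the spaCy tokeniser, so the counts would differ slightly from spaCy's wherever punctuation is involved.

```python
from statistics import mean

# Hypothetical documents; the empty string stands in for an empty tweet.
tweets = ["be kind", "", "the quick brown fox jumps", "hello"]

# Whitespace splitting approximates word tokens.
token_counts = [len(tweet.split()) for tweet in tweets]

longest = max(token_counts)
shortest = min(token_counts)
average = mean(token_counts)

# Positions of documents that are empty or contain only whitespace.
empty_positions = [i for i, tweet in enumerate(tweets) if not tweet.strip()]

print(f"longest={longest}, shortest={shortest}, average={average}")
print(f"empty documents at positions: {empty_positions}")
```

On a DataFrame, a comparable per-tweet count should be available through tweet_df["tweet"].str.split().str.len().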

... but that's probably enough for now. Cleaning and analysing data is an important part of working with text data, and deciding what to change, and how, depends on the problem being solved and is part of the art of data science. For example, should mentions (e.g. @Dilbert_Daily) be removed or replaced, or are they a useful predictor? Will removing punctuation improve or reduce the performance of a machine learning model, or make no difference at all? Should the text be converted to lower case? There is no single right answer, so it is useful to be able to play with text data and experiment easily.

I recommend playing with your own dummy data, trying different regular expressions with the re module, and experimenting with the wordcloud, spaCy and seaborn modules. There is a great tutorial for spaCy on their website.

Original: https://dev.to/nicfoxds/how-to-analyse-clean-text-data-in-python-2hb9
