Natural Language Processing: How to Process NLP Data?

Original author: Pankaj Kumar.

Let's look at some of the most popular natural language processing tasks and how to perform them with Python. Natural language processing (NLP) uses algorithms to interpret and manipulate human language.

It is one of the most widely used fields of machine learning.

As AI continues to grow, we will need specialists who can build models that analyze speech and vocabulary, detect contextual trends, and produce insights from text and audio.

1. Preparing the datasets for a natural language processing project

Let's get ourselves some data. We will simply copy the first 30 lines from www.gutenberg.org/files/35/35-0.txt, a free novel from Project Gutenberg.

If you are interested in other free datasets, have a look at Top 11 Machine Learning Datasets.

text = '''The Time Traveller (for so it will be convenient to speak of him) was
expounding a recondite matter to us. His pale grey eyes shone and
twinkled, and his usually pale face was flushed and animated. The fire
burnt brightly, and the soft radiance of the incandescent lights in the
lilies of silver caught the bubbles that flashed and passed in our
glasses. Our chairs, being his patents, embraced and caressed us rather
than submitted to be sat upon, and there was that luxurious
after-dinner atmosphere, when thought runs gracefully free of the
trammels of precision. And he put it to us in this way—marking the
points with a lean forefinger—as we sat and lazily admired his
earnestness over this new paradox (as we thought it) and his fecundity.

"You must follow me carefully. I shall have to controvert one or two
ideas that are almost universally accepted. The geometry, for instance,
they taught you at school is founded on a misconception."

"Is not that rather a large thing to expect us to begin upon?" said
Filby, an argumentative person with red hair.

"I do not mean to ask you to accept anything without reasonable ground
for it. You will soon admit as much as I need from you. You know of
course that a mathematical line, a line of thickness _nil_, has no real
existence. They taught you that? Neither has a mathematical plane.
These things are mere abstractions."

"That is all right," said the Psychologist.

"Nor, having only length, breadth, and thickness, can a cube have a
real existence."

"There I object," said Filby. "Of course a solid body may exist. All
real things—"

"So most people think. But wait a moment. Can an _instantaneous_ cube
exist?"

"Don't follow you," said Filby.

"Can a cube that does not last for any time at all, have a real
existence?"

Filby became pensive. "Clearly," the Time Traveller proceeded, "any
real body must have extension in _four_ directions: it must have
Length, Breadth, Thickness, and—Duration. But through a natural
infirmity of the flesh, which I will explain to you in a moment, we
incline to overlook this fact. There are really four dimensions, three
which we call the three planes of Space, and a fourth, Time. There is,
however, a tendency to draw an unreal distinction between the former
three dimensions and the latter, because it happens that our
consciousness moves intermittently in one direction along the latter
from the beginning to the end of our lives."'''

2. Stemming the data

Stemming is the process of reducing words to their base form (the stem) by stripping affixes from them.

Search engines use stemming to index terms. Instead of storing every form of a word, a search engine can store only the stems. Stemming therefore reduces the size of the index and improves retrieval accuracy.
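To make the index-size claim concrete, here is a minimal sketch of an inverted index keyed by stems. The toy_stem function and its suffix list are hypothetical stand-ins invented for this illustration; they are far cruder than a real stemmer and only show why storing stems yields fewer index entries.

```python
from collections import defaultdict

# Hypothetical suffix list for illustration only; a real stemmer applies many more rules.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs, normalize):
    """Map each normalized term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for word in doc.split():
            index[normalize(word)].add(doc_id)
    return index

docs = ["the traveller talked", "travellers talk", "talking to a traveller"]

raw_index = build_index(docs, str.lower)
stem_index = build_index(docs, toy_stem)

print(len(raw_index), "terms without stemming")   # 8
print(len(stem_index), "terms with stemming")     # 5
print(sorted(stem_index["talk"]))                 # [0, 1, 2]
```

With stems as keys, a query for "talk" reaches all three documents, while the unstemmed index would need three separate entries to cover the same word forms.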

In NLTK (the Natural Language Toolkit), we have two main stemming functions:

Porter Stemmer
Lancaster Stemmer

Porter Stemmer

Without question, the Porter Stemmer is the most widely used stemmer, and it is also one of the gentlest.

It is also one of the oldest stemming algorithms still in wide use.

I will go straight to the code, assuming basic knowledge of Python lists, loops, and so on. Here is how we do it:

import re
text = re.sub("\n"," ",text)

import nltk
from nltk.stem import PorterStemmer

word_stemmer = PorterStemmer()
for word in text.split(" "):
  if len(word)>10:
    print((word,word_stemmer.stem(word)))

We then get the following output:

('incandescent', 'incandesc')
('after-dinner', 'after-dinn')
('atmosphere,', 'atmosphere,')
('way—marking', 'way—mark')
('forefinger—as', 'forefinger—a')
('earnestness', 'earnest')
('universally', 'univers')
('misconception."', 'misconception."')
('argumentative', 'argument')
('mathematical', 'mathemat')
('mathematical', 'mathemat')
('abstractions."', 'abstractions."')
('Psychologist.', 'psychologist.')
('existence."', 'existence."')
('_instantaneous_', '_instantaneous_')
('existence?"', 'existence?"')
('directions:', 'directions:')
('and—Duration.', 'and—duration.')
('dimensions,', 'dimensions,')
('distinction', 'distinct')
('consciousness', 'conscious')
('intermittently', 'intermitt')

As you can see, most of the words were shortened correctly. Stems that are not real words, such as 'mathemat', will still come out the same for all related words, so this is not a problem.
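In the output above, several tokens such as 'atmosphere,' and 'dimensions,' came back unchanged. This looks like a side effect of splitting on spaces: the trailing punctuation stays attached to the word. One possible fix, sketched below with only the standard library, is to extract alphabetic tokens first. The extract_words helper and its regular expression are assumptions made for this example, not part of NLTK.

```python
import re

# Letters, optionally joined by plain hyphens or apostrophes (e.g. "after-dinner", "Don't").
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")

def extract_words(text):
    """Return the words in text with surrounding punctuation removed."""
    return WORD_PATTERN.findall(text)

sample = 'after-dinner atmosphere, way—marking misconception." _instantaneous_ cube'
words = extract_words(sample)
print(words)
# ['after-dinner', 'atmosphere', 'way', 'marking', 'misconception', 'instantaneous', 'cube']

# Keep only the long words, as the loops in this article do.
print([w for w in words if len(w) > 10])
# ['after-dinner', 'misconception', 'instantaneous']
```

Each cleaned word can then be passed to word_stemmer.stem(word) exactly as before.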

Lancaster Stemmer

The Lancaster stemming algorithm is very aggressive.

It is the fastest algorithm here and will massively reduce the vocabulary of your corpus, but it is not the method to use if you want to keep finer distinctions between words.

from nltk.stem import LancasterStemmer
Lanc_stemmer = LancasterStemmer()

for word in text.split(" "):
  if len(word)>10:
    print((word,Lanc_stemmer.stem(word)))

This gives:

('incandescent', 'incandesc')
('after-dinner', 'after-dinn')
('atmosphere,', 'atmosphere,')
('way—marking', 'way—marking')
('forefinger—as', 'forefinger—as')
('earnestness', 'earnest')
('universally', 'univers')
('misconception."', 'misconception."')
('argumentative', 'argu')
('mathematical', 'mathem')
('mathematical', 'mathem')
('abstractions."', 'abstractions."')
('Psychologist.', 'psychologist.')
('existence."', 'existence."')
('_instantaneous_', '_instantaneous_')
('existence?"', 'existence?"')
('directions:', 'directions:')
('and—Duration.', 'and—duration.')
('dimensions,', 'dimensions,')
('distinction', 'distinct')
('consciousness', 'conscy')
('intermittently', 'intermit')

3. Lemmatizing the text data

The lemmatization process is similar to stemming.

The output we get after lemmatization is called a 'lemma', which is a root word rather than the root stem that stemming produces.

Unlike stemming, lemmatization gives us a valid word that means the same thing.

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

for word in text.split():
  if len(word)>5 and word!=lemmatizer.lemmatize(word):
    print((word,lemmatizer.lemmatize(word)))
  elif len(word)>10:
    print((word,lemmatizer.lemmatize(word)))

This gives us:

('incandescent', 'incandescent')
('lights', 'light')
('lilies', 'lily')
('bubbles', 'bubble')
('after-dinner', 'after-dinner')
('atmosphere,', 'atmosphere,')
('trammels', 'trammel')
('way—marking', 'way—marking')
('points', 'point')
('forefinger—as', 'forefinger—as')
('earnestness', 'earnestness')
('universally', 'universally')
('misconception."', 'misconception."')
('argumentative', 'argumentative')
('mathematical', 'mathematical')
('mathematical', 'mathematical')
('things', 'thing')
('abstractions."', 'abstractions."')
('Psychologist.', 'Psychologist.')
('existence."', 'existence."')
('_instantaneous_', '_instantaneous_')
('existence?"', 'existence?"')
('directions:', 'directions:')
('and—Duration.', 'and—Duration.')
('dimensions,', 'dimensions,')
('planes', 'plane')
('distinction', 'distinction')
('dimensions', 'dimension')
('consciousness', 'consciousness')
('intermittently', 'intermittently')

The difference: the PorterStemmer class chops the 'es' off a word, while the WordNetLemmatizer class treats it as a true word and returns a valid dictionary form.

In simple terms, stemming looks only at the form of a word, whereas lemmatization looks at its meaning.
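The contrast can be illustrated with a toy sketch. Neither function below is the NLTK implementation: strip_suffix is a hypothetical rule-based stand-in for a stemmer, and LEMMA_TABLE is a tiny hand-written dictionary standing in for WordNet. The point is only that a form-based rule can produce non-words, while a lookup returns known base forms.

```python
# Hypothetical lookup table standing in for a real lexical database such as WordNet.
LEMMA_TABLE = {"lilies": "lily", "bubbles": "bubble", "lights": "light"}

def strip_suffix(word):
    """Form-based: drop a plural-looking ending without checking the result is a word."""
    for suffix in ("es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def lookup_lemma(word):
    """Dictionary-based: return a known base form, or the word itself if unknown."""
    return LEMMA_TABLE.get(word, word)

for word in ["lilies", "bubbles", "lights"]:
    print(word, strip_suffix(word), lookup_lemma(word))
# lilies lili lily
# bubbles bubbl bubble
# lights light light
```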

4. Part-of-speech (POS) tagging

Part-of-speech (POS) tagging can be defined as the process of assigning one of the parts of speech to a word. It is generally called POS tagging.

In plain terms, POS tagging is the task of labeling each word in a sentence with its proper part of speech.

We know that nouns, verbs, adverbs, adjectives, pronouns, conjunctions, and their subcategories are the parts of speech.

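# Recent NLTK releases may ask for 'averaged_perceptron_tagger_eng' instead.
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag

# Tag only the first sentence. The slice [:1] keeps it as a one-element list;
# indexing with [0] would make the loop iterate over single characters.
for sentence in text.split(".")[:1]:
  token = sentence.split(" ")[1:]  # skips the first token ("The"), matching the output below
  token = [i for i in token if i]  # drop empty strings left by repeated spaces
  tokens_tag = pos_tag(token)
  print(tokens_tag)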

This gives us:

[('Time', 'NNP'), ('Traveller', 'NNP'), ('(for', 'NNP'), ('so', 'IN'), ('it', 'PRP'), ('will', 'MD'), ('be', 'VB'), ('convenient', 'JJ'), ('to', 'TO'), ('speak', 'VB'), ('of', 'IN'), ('him)', 'NN'), ('was', 'VBD'), ('expounding', 'VBG'), ('a', 'DT'), ('recondite', 'JJ'), ('matter', 'NN'), ('to', 'TO'), ('us', 'PRP')]
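Once the tags are available, they are ordinary Python tuples and can be filtered or counted. The sketch below reuses the tagged list shown above as a literal, so it runs without NLTK; the nouns and tag_counts names exist only for this example.

```python
from collections import Counter

# The (word, tag) pairs printed above, copied as a literal.
tokens_tag = [
    ('Time', 'NNP'), ('Traveller', 'NNP'), ('(for', 'NNP'), ('so', 'IN'),
    ('it', 'PRP'), ('will', 'MD'), ('be', 'VB'), ('convenient', 'JJ'),
    ('to', 'TO'), ('speak', 'VB'), ('of', 'IN'), ('him)', 'NN'),
    ('was', 'VBD'), ('expounding', 'VBG'), ('a', 'DT'), ('recondite', 'JJ'),
    ('matter', 'NN'), ('to', 'TO'), ('us', 'PRP'),
]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tokens_tag if tag.startswith('NN')]
print(nouns)  # ['Time', 'Traveller', '(for', 'him)', 'matter']

# Count how often each tag occurs and show the most frequent one.
tag_counts = Counter(tag for _, tag in tokens_tag)
print(tag_counts.most_common(1))  # [('NNP', 3)]
```

Note that '(for' and 'him)' end up among the nouns because the parentheses are still attached to the words; stripping punctuation before tagging would likely give cleaner tags.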

Now we will move on to some natural language processing tasks.

5. Removing \n characters

Let's remove all the newline characters here so that we can move forward with clean text.

import re
text = re.sub("\n"," ",text)

6. Finding synonyms

First, let's see how to get synonyms for the words in your text. I am, of course, assuming basic knowledge of Python here. In the example below, I look up synonyms only for 'long enough' words (length > 5), since we rarely need synonyms for much shorter words:

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

for word in text.split(" "):
  if len(word)>5:
    list_syn = []
    for syn in wordnet.synsets(word): 
      for lemm in syn.lemmas():
        if lemm.name() not in list_syn:
          list_syn.append(lemm.name())
    if list_syn:
      print(word + ":-")
      print(" "+str(list_syn))

I have accounted for empty synonym lists and repeated words, and we get quite a nice output:

Traveller:-
 ['traveler', 'traveller']
convenient:-
 ['convenient', 'commodious']
expounding:-
 ['exposition', 'expounding', 'elaborate', 'lucubrate', 'expatiate', 'exposit', 'enlarge', 'flesh_out', 'expand', 'expound', 'dilate', 'set_forth']
recondite:-
 ['abstruse', 'deep', 'recondite']
matter:-
 ['matter', 'affair', 'thing', 'topic', 'subject', 'issue', 'count', 'weigh']
usually:-
 ['normally', 'usually', 'unremarkably', 'commonly', 'ordinarily']
flushed:-
 ['blush', 'crimson', 'flush', 'redden', 'level', 'even_out', 'even', 'scour', 'purge', 'sluice', 'flushed', 'rose-cheeked', 'rosy', 'rosy-cheeked', 'red', 'reddened', 'red-faced']
radiance:-
 ['radiance', 'glow', 'glowing', 'radiancy', 'shine', 'effulgence', 'refulgence', 'refulgency']
incandescent:-
 ['incandescent', 'candent']
lights:-
 ['light', 'visible_light', 'visible_radiation', 'light_source', 'luminosity', 'brightness', 'brightness_level', 'luminance', 'luminousness', 'illumination', 'lightness', 'lighting', 'sparkle', 'twinkle', 'spark', 'Inner_Light', 'Light', 'Light_Within', 'Christ_Within', 'lighter', 'igniter', 'ignitor', 'illume', 'illumine', 'light_up', 'illuminate', 'fire_up', 'alight', 'perch', 'ignite', 'fall', 'unhorse', 'dismount', 'get_off', 'get_down']

7. Finding antonyms

Similarly, for antonyms:

for word in text.split(" "):
  if len(word)>5:
    list_ant = []
    for syn in wordnet.synsets(word): 
      for lemm in syn.lemmas():
        if lemm.antonyms(): 
          list_ant.append(lemm.antonyms()[0].name())
    if list_ant:
      print(word + ":-")
      print(" "+str(list_ant))

we get:

convenient:- ['inconvenient', 'incommodious'] 
expounding:- ['contract'] 
usually:- ['remarkably'] 
lights:- ['dark', 'extinguish'] 
caught:- ['unhitch'] 
passed:- ['fail', 'fail', 'be_born'] 
thought:- ['forget'] 
gracefully:- ['gracelessly', 'ungraciously', 'ungracefully'] 
points:- ['unpointedness'] 
admired:- ['look_down_on'] 
earnestness:- ['frivolity'] 
thought:- ['forget'] 
follow:- ['precede', 'predate', 'precede'] 
founded:- ['abolish'] 
argumentative:- ['unargumentative'] 
accept:- ['reject', 'refuse', 'refuse'] 
reasonable:- ['unreasonable'] 
ground:- ['figure'] 
course:- ['unnaturally'] 
mathematical:- ['verbal'] 
thickness:- ['thinness', 'thinness'] 
mathematical:- ['verbal'] 
having:- ['lack', 'abstain', 'refuse'] 
course:- ['unnaturally'] 
follow:- ['precede', 'predate', 'precede'] 
extension:- ['flexion'] 
natural:- ['unnatural', 'artificial', 'supernatural', 'flat'] 
incline:- ['indispose'] 
overlook:- ['attend_to'] 
unreal:- ['real', 'real', 'natural', 'substantial'] 
former:- ['latter', 'latter'] 
happens:- ['dematerialize', 'dematerialise'] 
consciousness:- ['unconsciousness', 'incognizance'] 
latter:- ['former', 'former'] 
beginning:- ['ending', 'end', 'finish']
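The antonym lists above contain repeats such as ['fail', 'fail', 'be_born'], because the loop appends an antonym for every matching lemma. If that is unwanted, one option is to drop duplicates while keeping the original order, sketched here with dict.fromkeys. The unique_in_order helper name is made up for this example.

```python
def unique_in_order(items):
    """Remove duplicates while preserving first-seen order (dicts keep insertion order)."""
    return list(dict.fromkeys(items))

print(unique_in_order(['fail', 'fail', 'be_born']))        # ['fail', 'be_born']
print(unique_in_order(['precede', 'predate', 'precede']))  # ['precede', 'predate']
```

In the loop above, this would mean printing unique_in_order(list_ant) instead of list_ant.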

8. Getting phrases that contain nouns

We can extract the phrases within a text, which reduces the information lost during tokenization and topic modeling. This can be done with the spaCy library:

import spacy
spacy_obj = spacy.load('en_core_web_sm')

Then we can simply run it over our input text:

spacy_text = spacy_obj(text)
for phrase in spacy_text.noun_chunks:
  print(phrase)

This gives us the phrases that contain nouns, which are among the most important aspects of a text, especially a novel:

The Time Traveller
a recondite matter
His pale grey eyes
his usually pale face
the soft radiance
the incandescent lights
a lean forefinger
this new paradox
one or two
ideas
an argumentative person
reasonable ground
a mathematical line
no real
existence
a mathematical plane
mere abstractions
the Psychologist
a
real existence
an _instantaneous_ cube
a real
existence
the Time Traveller
_four_ directions
a natural
infirmity
the three planes
an unreal distinction
the former
three dimensions
our
consciousness

If we combine these phrases, they form a kind of summary of the story.
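As a rough sketch of that idea, the phrases can be cleaned up and joined into one line. Some chunks in the output above (for example 'no real' followed by 'existence') appear to be single phrases that still contain a line break from the source text, so the example collapses whitespace first. The sample list is a short excerpt copied from the output, and normalize_phrase is a hypothetical helper.

```python
def normalize_phrase(phrase):
    """Collapse any run of whitespace, including newlines, into a single space."""
    return " ".join(phrase.split())

# A few noun chunks from the output above; "\n" marks the embedded line breaks.
chunks = ["The Time Traveller", "a recondite matter", "no real\nexistence", "a mathematical plane"]

summary = "; ".join(normalize_phrase(chunk) for chunk in chunks)
print(summary)
# The Time Traveller; a recondite matter; no real existence; a mathematical plane
```

With spaCy, the string for each chunk is available as phrase.text, so the same helper could be applied inside the noun_chunks loop.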

Closing note

If you enjoyed reading this article and want to read more, follow me as an author. Until then, keep coding!