
Scrape Google Scholar with Python


Contents:

  • Intro
    • Prerequisites
    • Imports
    • Reduce potential blocks
    • SelectorGadget extension
  • Code
    • Organic search results
    • Profile results
    • Cite results
    • Authors
    • Author articles
    • Author cited by and public access
    • Author co-authors
  • Full code
  • Links
  • Outro

Intro

This blog post is a continuation of the Google web scraping series. Here you'll see how to scrape Google Scholar with Python using the beautifulsoup, requests, and lxml libraries. An alternative API solution will be shown as well.

Note: this blog post doesn't cover every little thing that may appear in Google Scholar results. The HTML layout may change in the future, so some of the CSS selectors may stop working. Let me know if something isn't working.
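
Since the HTML layout can change at any time, it helps to wrap extraction in a small helper that returns None instead of raising an exception when an element is missing. A minimal sketch (the safe_text() helper is my own illustration, not part of the original code):

from bs4 import BeautifulSoup

def safe_text(element, selector):
  # Returns the text of the first match, or None if the selector finds nothing
  node = element.select_one(selector)
  return node.text if node else None

soup = BeautifulSoup('<h3 class="gs_rt">Some title</h3>', 'lxml')

print(safe_text(soup, '.gs_rt')) # Some title
print(safe_text(soup, '.gs_rs')) # None instead of an AttributeError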

Prerequisites

$ pip install requests
$ pip install lxml 
$ pip install beautifulsoup4
$ pip install google-search-results 

Make sure you have basic knowledge of the libraries mentioned above (except for the API one).

Also, make sure you have a basic understanding of CSS selectors, because the select() / select_one() beautifulsoup methods accept CSS selectors. Here is a CSS selectors reference.
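
In short, select_one() returns the first element matching a CSS selector, while select() returns a list of all matching elements:

from bs4 import BeautifulSoup

html = '''
<div class="gs_ri">
  <h3 class="gs_rt">First title</h3>
  <h3 class="gs_rt">Second title</h3>
</div>
'''

soup = BeautifulSoup(html, 'lxml')

print(soup.select_one('.gs_rt').text) # First title
print([title.text for title in soup.select('.gs_rt')]) # ['First title', 'Second title']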

Imports

import requests, lxml
from bs4 import BeautifulSoup
from serpapi import GoogleSearch # API solution

Reduce potential blocks

A problem that will appear at some point is a CAPTCHA, triggered by sending too many requests or by Google detecting the script as automation software sending the requests.

To bypass blocks, you can use (see also the user-agent sketch after this list):

  • Proxies (not strictly required):
import os

# https://docs.python-requests.org/en/master/user/advanced/#proxies
proxies = {
  'http': os.getenv('HTTP_PROXY') # or just type the proxy here without os.getenv()
}
  • Browser automation, such as Selenium. Adding proxies to Selenium.
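
Another lightweight option is to send a real browser user-agent (the default python-requests user-agent is an easy giveaway) and rotate it between requests. A minimal sketch, assuming a small hand-picked list of user agents:

import random, requests

user_agents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
]

# A randomly picked user-agent makes each request look a bit less like a script
headers = {'User-agent': random.choice(user_agents)}

html = requests.get('https://scholar.google.com/scholar',
                    params={'q': 'samsung', 'hl': 'en'},
                    headers=headers).text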

SelectorGadget extension

If you see me using the select_one() or select() bs4 methods to grab data via CSS selectors, it means I used the SelectorGadget extension to find them.

Element(s) highlighted in:

  • red are excluded from the search.
  • green are included in the search.
  • yellow are what SelectorGadget guesses the user is looking for and need additional clarification.

Scrape Google Scholar organic search results

This code block scrapes the title, link to the article, publication info, snippet, cited-by results, link to related articles, and link to all versions of the article.

from bs4 import BeautifulSoup
import requests, lxml, os, json

proxies = {
  'http': os.getenv('HTTP_PROXY') # or just type proxy here without os.getenv()
}

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "samsung",
  "hl": "en",
}

html = requests.get('https://scholar.google.com/scholar', headers=headers, params=params, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')

# Scrape just PDF links
for pdf_link in soup.select('.gs_or_ggsm a'):
  pdf_file_link = pdf_link['href']
  print(pdf_file_link)

# JSON data will be collected here
data = []

# Container where all needed data is located
for result in soup.select('.gs_ri'):
  title = result.select_one('.gs_rt').text
  title_link = result.select_one('.gs_rt a')['href']
  publication_info = result.select_one('.gs_a').text
  snippet = result.select_one('.gs_rs').text
  cited_by = result.select_one('#gs_res_ccl_mid .gs_nph+ a')['href']
  related_articles = result.select_one('a:nth-child(4)')['href']
  try:
    all_article_versions = result.select_one('a~ a+ .gs_nph')['href']
  except:
    all_article_versions = None

  data.append({
    'title': title,
    'title_link': title_link,
    'publication_info': publication_info,
    'snippet': snippet,
    'cited_by': f'https://scholar.google.com{cited_by}',
    'related_articles': f'https://scholar.google.com{related_articles}',
    'all_article_versions': f'https://scholar.google.com{all_article_versions}',
  })

print(json.dumps(data, indent = 2, ensure_ascii = False))

# Part of the JSON Output:
'''
[
  {
    "title": ""What? I thought Samsung was Japanese": accurate or not, perceived country of origin matters",
    "title_link": "https://www.emerald.com/insight/content/doi/10.1108/02651331111167589/full/html",
    "publication_info": "P Magnusson, SA Westjohn… - International Marketing …, 2011 - emerald.com",
    "snippet": "Purpose–Extensive research has shown that country‐of‐origin (COO) information significantly affects product evaluations and buying behavior. Yet recently, a competing perspective has emerged suggesting that COO effects have been inflated in prior research …",
    "cited_by": "https://scholar.google.com/scholar?cites=341074171610121811&as_sdt=2005&sciodt=0,5&hl=en",
    "related_articles": "https://scholar.google.com/scholar?q=related:U8bh6Ca9uwQJ:scholar.google.com/&scioq=samsung&hl=en&as_sdt=0,5",
    "all_article_versions": "https://scholar.google.com/scholar?cluster=341074171610121811&hl=en&as_sdt=0,5"
  }
]
'''

# Part of PDF Links Output:
'''
https://www.researchgate.net/profile/Peter_Magnusson/publication/232614407_What_I_thought_Samsung_was_Japanese_Accurate_or_not_perceived_country_of_origin_matters/links/09e4150881184a6ad2000000/What-I-thought-Samsung-was-Japanese-Accurate-or-not-perceived-country-of-origin-matters.pdf
https://www.researchgate.net/profile/Hong_Mo_Yang/publication/235291000_Supply_chain_management_six_sigma_A_management_innovation_methodology_at_the_Samsung_Group/links/56e03d0708aec4b3333d0445.pdf
https://www.academia.edu/download/54053930/The_Strategic_Localization_of_Transnatio20170803-32468-4ntcqr.pdf
https://mathsci2.appstate.edu/~wmcb/Class/5340/ClassNotes141/EdelmanAwards/Interfaces2002-S.pdf
'''

Scrape Google Scholar organic results with SerpApi

This code block scrapes the same as above: title, link to the article, publication info, snippet, cited-by results, link to related articles, and link to all versions of the article.

from serpapi import GoogleSearch
import os, json

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google_scholar",
  "q": "samsung",
}

search = GoogleSearch(params)
results = search.get_dict()

# This print() looks pretty awkward,
# but the point is that you can grab everything you need in 2-3 lines of code as below.
for result in results['organic_results']:
  print(f"Title: {result['title']}\nPublication info: {result['publication_info']['summary']}\nSnippet: {result['snippet']}\nCited by: {result['inline_links']['cited_by']['link']}\nRelated Versions: {result['inline_links']['related_pages_link']}\n")

# If you want more readable code, here's one example.
data = []

for result in results['organic_results']:
  data.append({
    'title': result['title'],
    'publication_info': result['publication_info']['summary'],
    'snippet': result['snippet'],
    'cited_by': result['inline_links']['cited_by']['link'],
    'related_versions': result['inline_links']['related_pages_link'],
  })

print(json.dumps(data, indent = 2, ensure_ascii = False))

# Part of Non-JSON output:
'''
Title: "What? I thought Samsung was Japanese": accurate or not, perceived country of origin matters
Publication info: P Magnusson, SA Westjohn… - International Marketing …, 2011 - emerald.com
Snippet: Purpose–Extensive research has shown that country‐of‐origin (COO) information significantly affects product evaluations and buying behavior. Yet recently, a competing perspective has emerged suggesting that COO effects have been inflated in prior research …
Cited by: https://scholar.google.com/scholar?cites=341074171610121811&as_sdt=5,44&sciodt=0,44&hl=en
Related Versions: https://scholar.google.com/scholar?q=related:U8bh6Ca9uwQJ:scholar.google.com/&scioq=samsung&hl=en&as_sdt=0,44
'''

# Part of JSON output:
'''
[
  {
    "title": ""What? I thought Samsung was Japanese": accurate or not, perceived country of origin matters",
    "publication_info": "P Magnusson, SA Westjohn… - International Marketing …, 2011 - emerald.com",
    "snippet": "Purpose–Extensive research has shown that country‐of‐origin (COO) information significantly affects product evaluations and buying behavior. Yet recently, a competing perspective has emerged suggesting that COO effects have been inflated in prior research …",
    "cited_by": "https://scholar.google.com/scholar?cites=341074171610121811&as_sdt=5,44&sciodt=0,44&hl=en",
    "related_versions": "https://scholar.google.com/scholar?q=related:U8bh6Ca9uwQJ:scholar.google.com/&scioq=samsung&hl=en&as_sdt=0,44"
  }
]
'''

Scrape Google Scholar profile results

This code block scrapes the author's name, link, affiliation(s), email (if added), interests (if added), and cited-by count.

from bs4 import BeautifulSoup
import requests, lxml, os

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

proxies = {
  'http': os.getenv('HTTP_PROXY')
}

html = requests.get('https://scholar.google.com/citations?view_op=view_org&hl=en&org=9834965952280547731', headers=headers, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')

# Container where all the data is located
for result in soup.select('.gs_ai_chpr'):
  name = result.select_one('.gs_ai_name a').text
  link = result.select_one('.gs_ai_name a')['href']
  # https://stackoverflow.com/a/6633693/15164646
  id = link
  id_identifer = 'user='
  before_keyword, keyword, after_keyword = id.partition(id_identifer)
  author_id = after_keyword
  affiliations = result.select_one('.gs_ai_aff').text
  email = result.select_one('.gs_ai_eml').text
  try:
    interests = result.select_one('.gs_ai_one_int').text
  except:
    interests = None
  # "Cited by 107390" = getting text string -> splitting by a space -> ['Cited', 'by', '21180'] and taking [2] index which is the number.
  cited_by = result.select_one('.gs_ai_cby').text.split(' ')[2] 

  print(f'{name}\nhttps://scholar.google.com{link}\n{author_id}\n{affiliations}\n{email}\n{interests}\n{cited_by}\n')

# Part of the output:
'''
Jeong-Won Lee
https://scholar.google.com/citations?hl=en&user=D41VK7AAAAAJ
D41VK7AAAAAJ
Samsung Medical Center
Verified email at samsung.com
Gynecologic oncology
107516
'''
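
As an alternative to str.partition(), the author ID can be extracted from the profile link by parsing its query string with urllib.parse from the standard library:

from urllib.parse import urlparse, parse_qs

link = '/citations?hl=en&user=D41VK7AAAAAJ'

# parse_qs() returns a dict of lists, e.g. {'hl': ['en'], 'user': ['D41VK7AAAAAJ']}
author_id = parse_qs(urlparse(link).query)['user'][0]
print(author_id) # D41VK7AAAAAJ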

Scrape Google Scholar profile results with SerpApi

This code block scrapes the same as above: author's name, link, affiliation(s), email (if added), interests (if added), and cited-by count.

from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar_profiles",
    "hl": "en",
    "mauthors": "samsung"
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['profiles']:
  name = result['name']
  try:
    email = result['email']
  except:
    email = None
  author_id = result['author_id']
  affiliation = result['affiliations']
  cited_by = result['cited_by']
  interests = result['interests'][0]['title']
  interests_link = result['interests'][0]['link']

  print(f'{name}\n{email}\n{author_id}\n{affiliation}\n{cited_by}\n{interests}\n{interests_link}\n')

# Part of the output:
'''
Jeong-Won Lee
Verified email at samsung.com
D41VK7AAAAAJ
Samsung Medical Center
107516
Gynecologic oncology
https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:gynecologic_oncology
'''

Scrape Google Scholar cite results

This code block scrapes cite results.

# This script is a starting point and probably won't work inside replit.com

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.expected_conditions import presence_of_element_located
from selenium.webdriver.common.by import By

options = Options()
options.page_load_strategy = 'normal'
driver = webdriver.Chrome(options = options)
wait = WebDriverWait(driver, 10)

# Proxy is needed: 
# https://www.selenium.dev/documentation/en/webdriver/http_proxies
# https://stackoverflow.com/a/40628176/15164646

query = "samsung"

driver.get(f'https://scholar.google.com.ua/scholar?hl=en&as_sdt=0%2C5&as_vis=1&q={query}')
cite = wait.until(presence_of_element_located((By.XPATH, "//*[@id='gs_res_ccl_mid']/div[1]/div[2]/div[3]/a[2]"))).click()

container = driver.find_element_by_css_selector('#gs_citt').text
print(container)


# Proxy method. Still throws a CAPTCHA
PROXY = "HOST:PORT"
webdriver.DesiredCapabilities.CHROME['proxy'] = {
    "httpProxy": PROXY,
    "proxyType": "MANUAL",
}

with webdriver.Chrome() as driver:
    wait = WebDriverWait(driver, 10)

    query = "samsung"
    driver.get('https://scholar.google.com.ua/scholar?hl=en&as_sdt=0%2C5&as_vis=1&q=samsung')
    cite = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='gs_res_ccl_mid']/div[1]/div[2]/div[3]/a[2]"))).click()

    container = driver.find_element_by_css_selector('#gs_citt').text
    print(container)

Scrape Google Scholar cite results with SerpApi

This code block also scrapes cite results (the q parameter takes an ID of an individual organic search result).

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google_scholar_cite",
  "q": "FDc6HiktlqEJ"
}

search = GoogleSearch(params)
results = search.get_dict()

for cite in results['citations']:
  print(f'Title: {cite["title"]}\nSnippet: {cite["snippet"]}\n')

# Output:
'''
Title: MLA
Snippet: Schwertmann, U. T. R. M., and Reginald M. Taylor. "Iron oxides." Minerals in soil environments 1 (1989): 379-438.
Title: APA
Snippet: Schwertmann, U. T. R. M., & Taylor, R. M. (1989). Iron oxides. Minerals in soil environments, 1, 379-438.
Title: Chicago
Snippet: Schwertmann, U. T. R. M., and Reginald M. Taylor. "Iron oxides." Minerals in soil environments 1 (1989): 379-438.
Title: Harvard
Snippet: Schwertmann, U.T.R.M. and Taylor, R.M., 1989. Iron oxides. Minerals in soil environments, 1, pp.379-438.
Title: Vancouver
Snippet: Schwertmann UT, Taylor RM. Iron oxides. Minerals in soil environments. 1989 Jan 1;1:379-438.
'''

Scrape Google Scholar author results

This code block scrapes a specific author's name, affiliation(s), email, and interests.

from bs4 import BeautifulSoup
import requests, lxml, os

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

proxies = {
  'http': os.getenv('HTTP_PROXY')
}


def bs4_scrape_author_result():
  html = requests.get('https://scholar.google.com/citations?hl=en&user=m8dFEawAAAAJ', headers=headers, proxies=proxies).text
  soup = BeautifulSoup(html, 'lxml')

  print('Author info:')
  name = soup.select_one('#gsc_prf_in').text
  affiliation = soup.select_one('#gsc_prf_in+ .gsc_prf_il').text

  try:
    email = soup.select_one('#gsc_prf_ivh').text
  except:
    email = None

  try:
    interests = soup.select_one('#gsc_prf_int').text
  except:
    interests = None


  print(f'{name}\n{affiliation}\n{email}\n{interests}\n')

bs4_scrape_author_result()

# Output:
'''
Jun-Youn Kim
Samsung
Verified email at plesseysemi.com
micro ledGaN power device
'''

Scrape Google Scholar author articles

This code block scrapes articles from an author's profile.

from bs4 import BeautifulSoup
import requests, lxml, os

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

proxies = {
  'http': os.getenv('HTTP_PROXY')
}

params = {
   "user": "m8dFEawAAAAJ",
   "sortby": "pubdate",
   "hl": "en"
}

def get_articles():
  html = requests.get('https://scholar.google.com/citations', headers=headers, params=params, proxies=proxies).text
  soup = BeautifulSoup(html, 'lxml')

  print('Article info:')
  for article_info in soup.select('#gsc_a_b .gsc_a_t'):
    title = article_info.select_one('.gsc_a_at').text
    title_link = f"https://scholar.google.com{article_info.select_one('.gsc_a_at')['href']}"
    authors = article_info.select_one('.gsc_a_at+ .gs_gray').text
    publications = article_info.select_one('.gs_gray+ .gs_gray').text

    print(f'Title: {title}\nTitle link: {title_link}\nArticle Author(s): {authors}\nArticle Publication(s): {publications}\n')

get_articles()

# Part of the output:
'''
Article info:
Title: Lifting propositional proof compression algorithms to first-order logic
Title link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=m8dFEawAAAAJ&sortby=pubdate&citation_for_view=m8dFEawAAAAJ:abG-DnoFyZgC
Article Author(s): J Gorzny, E Postan, B Woltzenlogel Paleo
Article Publication(s): Journal of Logic and Computation, 2020

Title: Complexity of translations from resolution to sequent calculus
Title link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=m8dFEawAAAAJ&sortby=pubdate&citation_for_view=m8dFEawAAAAJ:D03iK_w7-QYC
Article Author(s): G Reis, BW Paleo
Article Publication(s): Mathematical Structures in Computer Science 29 (8), 1061-1091, 2019
'''

Scrape Google Scholar author cited by and public access

This code block scrapes cited-by citations (all time and since 2016), h-index (all time and since 2016), i10-index (all time and since 2016), and public access mandates.

from bs4 import BeautifulSoup
import requests, lxml, os

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

proxies = {
  'http': os.getenv('HTTP_PROXY')
}

html = requests.get('https://scholar.google.com/citations?hl=en&user=9PepYk8AAAAJ', headers=headers, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')

print('Citation info:')
for cited_by_public_access in soup.select('.gsc_rsb'):
  citations_all = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std').text
  citations_since2016 = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std').text
  h_index_all = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std').text
  h_index_2016 = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std').text
  i10_index_all = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std').text
  i10_index_2016 = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std').text
  articles_num = cited_by_public_access.select_one('.gsc_rsb_m_a:nth-child(1) span').text.split(' ')[0]
  articles_link = cited_by_public_access.select_one('#gsc_lwp_mndt_lnk')['href']

print(f'{citations_all}\n{citations_since2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n{articles_num}\nhttps://scholar.google.com{articles_link}\n')
# Output:
'''
Citation info:
67599
28242
110
63
967
447
7
https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=9PepYk8AAAAJ
'''

Scrape Google Scholar co-author results

This code block scrapes co-authors from an author's profile.

from bs4 import BeautifulSoup
import requests, lxml, os

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

proxies = {
  'http': os.getenv('HTTP_PROXY')
}

html = requests.get('https://scholar.google.com/citations?hl=en&user=m8dFEawAAAAJ', headers=headers, proxies=proxies)
soup = BeautifulSoup(html.text, 'lxml')

for container in soup.select('.gsc_rsb_aa'):
  author_name = container.select_one('#gsc_rsb_co a').text
  author_affiliations = container.select_one('.gsc_rsb_a_ext').text
  author_link = container.select_one('#gsc_rsb_co a')['href']
  print(f'{author_name}\n{author_affiliations}\nhttps://scholar.google.com{author_link}\n')

# Part of the output:
'''
Christoph Benzmüller
Professor, FU Berlin
https://scholar.google.com/citations?user=zD0vtfwAAAAJ&hl=en
Pascal Fontaine
LORIA, INRIA, Université de Lorraine, Nancy, France
https://scholar.google.com/citations?user=gHe6EF8AAAAJ&hl=en
Stephan Merz
Senior Researcher, INRIA
https://scholar.google.com/citations?user=jaO3Z3wAAAAJ&hl=en
'''

Full code to scrape profile and author results

This is the full code for scraping profile and author results: articles, cited by (including the graph), and public access, together with co-authors.

from bs4 import BeautifulSoup
import requests, lxml, os, json

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

proxies = {
  'http': os.getenv('HTTP_PROXY')
}

def bs4_scrape_profile_results():
  html = requests.get('https://scholar.google.com/citations?view_op=view_org&hl=en&org=9834965952280547731', headers=headers, proxies=proxies)
  soup = BeautifulSoup(html.text, 'lxml')

  author_ids = []

  for result in soup.select('.gs_ai_chpr'):
    name = result.select_one('.gs_ai_name a').text
    link = result.select_one('.gs_ai_name a')['href']
    # https://stackoverflow.com/a/6633693/15164646
    id = link
    id_identifer = 'user='
    before_keyword, keyword, after_keyword = id.partition(id_identifer)
    author_id = after_keyword
    affiliations = result.select_one('.gs_ai_aff').text
    email = result.select_one('.gs_ai_eml').text
    try:
      interests = result.select_one('.gs_ai_one_int').text
    except:
      interests = None
    cited_by = result.select_one('.gs_ai_cby').text.split(' ')[2]

    author_ids.append(author_id)
  print(author_ids)
  return author_ids


def bs4_scrape_author_result(profiles):

  print('Author info:')
  for id in profiles:
    html = requests.get(f'https://scholar.google.com/citations?hl=en&user={id}', headers=headers, proxies=proxies)
    soup = BeautifulSoup(html.text, 'lxml')

    name = soup.select_one('#gsc_prf_in').text
    affiliation = soup.select_one('#gsc_prf_in+ .gsc_prf_il').text

    try:
      email = soup.select_one('#gsc_prf_ivh').text
    except:
      email = None

    try:
      interests = soup.select_one('#gsc_prf_int').text
    except:
      interests = None

    print(f'{name}\n{affiliation}\n{email}\n{interests}\n')

    print('Article info:')
    for article_info in soup.select('#gsc_a_b .gsc_a_t'):
      title = article_info.select_one('.gsc_a_at').text
      title_link = article_info.select_one('.gsc_a_at')['data-href']
      authors = article_info.select_one('.gsc_a_at+ .gs_gray').text
      publications = article_info.select_one('.gs_gray+ .gs_gray').text

      print(f'Title: {title}\nTitle link: https://scholar.google.com{title_link}\nArticle Author(s): {authors}\nArticle Publication(s): {publications}\n')

    print('Citation info:')
    for cited_by_public_access in soup.select('.gsc_rsb'):
      citations_all = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std').text
      citations_since2016 = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std').text
      h_index_all = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std').text
      h_index_2016 = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std').text
      i10_index_all = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std').text
      i10_index_2016 = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std').text
      articles_num = cited_by_public_access.select_one('.gsc_rsb_m_a:nth-child(1) span').text.split(' ')[0]
      articles_link = cited_by_public_access.select_one('#gsc_lwp_mndt_lnk')['href']
      print(f'{citations_all}\n{citations_since2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n{articles_num}\nhttps://scholar.google.com{articles_link}\n')

      print('Co-Author(s):')
      try:
        for container in soup.select('.gsc_rsb_aa'):
          author_name = container.select_one('#gsc_rsb_co a').text
          author_affiliations = container.select_one('.gsc_rsb_a_ext').text
          author_link = container.select_one('#gsc_rsb_co a')['href']
          print(f'{author_name}\n{author_affiliations}\nhttps://scholar.google.com{author_link}\n')
      except:
        pass

      print('Graph result:')
      years = [graph_year.text for graph_year in soup.select('.gsc_g_t')]
      citations = [graph_citation.text for graph_citation in soup.select('.gsc_g_a')]

      data = []

      for year, citation in zip(years, citations):
        print(f'{year} {citation}\n')

        data.append({
          'year': year,
          'citation': citation,
        })

      # print(json.dumps(data, indent=2))

profiles = bs4_scrape_profile_results()
bs4_scrape_author_result(profiles)

Scrape Google Scholar author articles with SerpApi

This code block scrapes article data: title, link, authors, publication, cited-by count, cited-by link, and year.

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google_scholar_author",
  "author_id": "9PepYk8AAAAJ",
  "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

for article in results['articles']:
  article_title = article['title']
  article_link = article['link']
  article_authors = article['authors']
  article_publication = article['publication']
  cited_by = article['cited_by']['value']
  cited_by_link = article['cited_by']['link']
  article_year = article['year']

  print(f"Title: {article_title}\nLink: {article_link}\nAuthors: {article_authors}\nPublication: {article_publication}\nCited by: {cited_by}\nCited by link: {cited_by_link}\nPublication year: {article_year}\n")

# Part of the output:
'''
Title: Methods for forming liquid crystal displays including thin film transistors and gate pads having a particular structure
Link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9PepYk8AAAAJ&citation_for_view=9PepYk8AAAAJ:4X0JR2_MtJMC
Authors: DG Kim, W Lee
Publication: US Patent 5,731,856, 1998
Cited by: 3467
Cited by link: https://scholar.google.com/scholar?oi=bibs&hl=en&cites=1363873152714400726
Publication year: 1998
Title: Thin film transistor, method of manufacturing the same, and flat panel display having the same
Link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9PepYk8AAAAJ&citation_for_view=9PepYk8AAAAJ:Dh4RK7yvr34C
Authors: J Park, C Kim, S Kim, I Song, Y Park
Publication: US Patent 8,188,472, 2012
Cited by: 3347
Cited by link: https://scholar.google.com/scholar?oi=bibs&hl=en&cites=12194894272882326688
Publication year: 2012
'''

Scrape Google Scholar author cited by with SerpApi

This code block scrapes cited-by citations (all time and since 2016), h-index (all time and since 2016), i10-index (all time and since 2016), and public access mandates.

from serpapi import GoogleSearch
import os, json

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google_scholar_author",
  "author_id": "9PepYk8AAAAJ",
  "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

citations_all = results['cited_by']['table'][0]['citations']['all']
citations_2016 = results['cited_by']['table'][0]['citations']['since_2016']
h_index_all = results['cited_by']['table'][1]['h_index']['all']
h_index_2016 = results['cited_by']['table'][1]['h_index']['since_2016']
i10_index_all = results['cited_by']['table'][2]['i10_index']['all']
i10_index_2016 = results['cited_by']['table'][2]['i10_index']['since_2016']

print(citations_all)
print(citations_2016)
print(h_index_all)
print(h_index_2016)
print(i10_index_all)
print(i10_index_2016)

public_access_link = results['public_access']['link']
public_access_available_articles = results['public_access']['available']

print(public_access_link)
print(public_access_available_articles)

# Output:
'''
Cited by:
67599
28242
110
63
967
447
Public access:
https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=9PepYk8AAAAJ
7
'''

Scrape Google Scholar co-authors with SerpApi

This code block scrapes co-authors from an author's page.

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google_scholar_author",
  "author_id": "m8dFEawAAAAJ",
  "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

for authors in results['co_authors']:
  author_name = authors['name']
  author_affiliations = authors['affiliations']
  author_link = authors['link']
  print(f'{author_name}\n{author_affiliations}\n{author_link}\n')

# Part of the output:
'''
Christoph Benzmüller
Professor, FU Berlin
https://scholar.google.com/citations?user=zD0vtfwAAAAJ&hl=en
Pascal Fontaine
LORIA, INRIA, Université de Lorraine, Nancy, France
https://scholar.google.com/citations?user=gHe6EF8AAAAJ&hl=en
Stephan Merz
Senior Researcher, INRIA
https://scholar.google.com/citations?user=jaO3Z3wAAAAJ&hl=en
'''

Full code using Google Scholar API to scrape profile and author results

This is the full code for scraping profile results as well as author results: articles, cited by, and public access, together with co-authors.

from serpapi import GoogleSearch
import os

def serpapi_scrape_profile_results_combo():
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google_scholar_profiles",
        "hl": "en",
        "mauthors": "samsung"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    author_ids = []

    for result in results['profiles']:
      name = result['name']
      try:
        email = result['email']
      except:
        email = None
      author_id = result['author_id']
      affiliation = result['affiliations']
      cited_by = result['cited_by']
      interests = result['interests'][0]['title']
      interests_link = result['interests'][0]['link']

      author_ids.append(author_id)

      # Delete the prints that aren't needed
      print(f'{name}\n{email}\n{author_id}\n{affiliation}\n{cited_by}\n{interests}\n{interests_link}\n')

    return author_ids


def serpapi_scrape_author_result_combo(profiles):

  for id in profiles:
    params = {
      "api_key": os.getenv("API_KEY"),
      "engine": "google_scholar_author",
      "author_id": id,
      "hl": "en",
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print('Author Info:')
    name = results['author']['name']
    affiliations = results['author']['affiliations']
    email = results['author']['email']
    # Add as many interests as needed by adding additional indexes [3] [4] [5] [6] etc.
    try:
      interests1 = results['author']['interests'][0]['title']
      interests2 = results['author']['interests'][1]['title']
    except:
      interests1 = None
      interests2 = None
    print(f'{name}\n{affiliations}\n{email}\n{interests1}\n{interests2}\n')

    print('Articles Info:')
    for article in results['articles']:
      article_title = article['title']
      article_link = article['link']
      article_authors = article['authors']
      try:
        article_publication = article['publication']
      except:
        article_publication = None
      cited_by = article['cited_by']['value']
      cited_by_link = article['cited_by']['link']
      article_year = article['year']

      print(f"Title: {article_title}\nLink: {article_link}\nAuthors: {article_authors}\nPublication: {article_publication}\nCited by: {cited_by}\nCited by link: {cited_by_link}\nPublication year: {article_year}\n")

    print('Citations Info:')
    citations_all = results['cited_by']['table'][0]['citations']['all']
    citations_2016 = results['cited_by']['table'][0]['citations']['since_2016']
    h_index_all = results['cited_by']['table'][1]['h_index']['all']
    h_index_2016 = results['cited_by']['table'][1]['h_index']['since_2016']
    i10_index_all = results['cited_by']['table'][2]['i10_index']['all']
    i10_index_2016 = results['cited_by']['table'][2]['i10_index']['since_2016']
    print(f'{citations_all}\n{citations_2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n')

    print('Public Access Info:')
    public_access_link = results['public_access']['link']
    public_access_available_articles = results['public_access']['available']

    print(f'{public_access_link}\n{public_access_available_articles}\n')

    # Graph results
    try:
      for graph_results in results['cited_by']['graph']:
        year = graph_results['year']
        citations = graph_results['citations']
        print(f'{year} {citations}\n')
    except:
      pass

    print('Co-Author(s):')
    try:
      for authors in results['co_authors']:
        author_name = authors['name']
        author_affiliations = authors['affiliations']
        author_link = authors['link']
        print(f'{author_name}\n{author_affiliations}\n{author_link}\n')
    except:
      pass 


profiles = serpapi_scrape_profile_results_combo()
serpapi_scrape_author_result_combo(profiles)

Links

Code in the online IDE • GitHub repository

Outro.

If you have any questions or suggestions, or if something isn't working correctly, feel free to drop a comment in the comments section, or reach out via Twitter at @serp_api.

You can reach out to me directly via Twitter at @dimitryzub.

Yours, Dimitry, and the rest of the SerpApi team.

Original: https://dev.to/dimitryzub/scrape-google-scholar-with-python-32oh