Selenium WebDriver

Selenium automates browsers. That's it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes, but it is certainly not limited to just that. Boring web-based administration tasks can (and should!) be automated as well.

Selenium has the support of some of the largest browser vendors who have taken (or are taking) steps to make Selenium a native part of their browser. It is also the core technology in countless other browser automation tools, APIs and frameworks.

How to install

sudo pip install selenium

[python] Scraping with Selenium

Drivers

Chrome

Gecko

Setting the MOZ_HEADLESS=1 environment variable starts Firefox in headless mode.
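
One way to set the variable only for a scraper process, rather than for the whole shell, is to build a modified copy of the environment. This is a minimal sketch; `firefox_headless_env` is an illustrative helper name, not part of Selenium.

```python
import os


def firefox_headless_env(base_env=None):
    """Return a copy of the environment with MOZ_HEADLESS=1 added.

    The mapping passed in (or os.environ) is left unmodified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MOZ_HEADLESS"] = "1"
    return env
```

A script launched with this environment, for example `subprocess.run([sys.executable, "scrape.py"], env=firefox_headless_env())` with a hypothetical `scrape.py`, starts Firefox without a window.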

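Selecting DOM elements

When a find_element_* call finds no match, selenium.common.exceptions.NoSuchElementException is raised.

Extracting the first matching element

Name	Description
find_element_by_id(id)	Returns the first element whose id attribute matches
find_element_by_name(name)	Returns the first element whose name attribute matches
find_element_by_css_selector(query)	Returns the first element matching the CSS selector
find_element_by_xpath(query)	Returns the first element matching the XPath
find_element_by_tag_name(name)	Returns the first element whose tag name is name
find_element_by_link_text(text)	Returns the first link whose text is text
find_element_by_partial_link_text(text)	Returns the first link whose text contains text
find_element_by_class_name(name)	Returns the first element whose class name is name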

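Extracting all matching elements

Name	Description
find_elements_by_name	All elements whose name attribute matches
find_elements_by_xpath	All elements matching the XPath
find_elements_by_link_text	All links whose text matches exactly
find_elements_by_partial_link_text	All links whose text contains the given text
find_elements_by_tag_name	All elements with the given tag name
find_elements_by_class_name	All elements with the given class name
find_elements_by_css_selector	All elements matching the CSS selector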
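
The find_element_by_* and find_elements_by_* helpers listed above were removed in Selenium 4.3. Current releases instead pass a locator strategy as the first argument of find_element() and find_elements(). The sketch below shows the correspondence. `LEGACY_SUFFIX_TO_BY`, `find_first` and `find_all` are illustrative names, not part of Selenium. The strategy strings are the values of the selenium.webdriver.common.by.By constants (for example, By.CSS_SELECTOR is "css selector"), written out here so the snippet can be read and loaded without a browser.

```python
# Suffix of the legacy helper -> locator strategy accepted by
# driver.find_element(by, value) and driver.find_elements(by, value).
# Each string equals the matching By constant (By.ID == "id", ...).
LEGACY_SUFFIX_TO_BY = {
    "id": "id",
    "name": "name",
    "css_selector": "css selector",
    "xpath": "xpath",
    "tag_name": "tag name",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "class_name": "class name",
}


def find_first(driver, legacy_suffix, value):
    """Equivalent of driver.find_element_by_<legacy_suffix>(value)."""
    return driver.find_element(LEGACY_SUFFIX_TO_BY[legacy_suffix], value)


def find_all(driver, legacy_suffix, value):
    """Equivalent of driver.find_elements_by_<legacy_suffix>(value)."""
    return driver.find_elements(LEGACY_SUFFIX_TO_BY[legacy_suffix], value)
```

With a real driver, `find_first(driver, "css_selector", "h1")` corresponds to `driver.find_element(By.CSS_SELECTOR, "h1")`.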

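Working with elements

Method/Attribute	Description
clear()	Clears the text
click()	Clicks the element
get_attribute(name)	Returns the value of the attribute called name
is_displayed()	Checks whether the element is displayed on screen
is_enabled()	Checks whether the element is enabled
is_selected()	Checks whether an element such as a checkbox is selected
screenshot(filename)	Saves a screenshot of the element to a file
send_keys(value)	Types keys into the element
submit()	Submits the form
value_of_css_property(name)	Returns the value of the CSS property called name
id	Internal ID that Selenium uses for the element (not the HTML id attribute)
location	Position of the element
parent	The WebDriver instance the element was found from
rect	Dictionary holding the size and position
screenshot_as_base64	Screenshot as a base64 string
screenshot_as_png	Screenshot as PNG binary data
size	Size of the element
tag_name	Tag name
text	Text inside the element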
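
As a rough illustration of how several of these members combine, the sketch below gathers a few facts about an element and fills in a form field. `describe_element` and `fill_and_submit` are illustrative helper names. The element is assumed to be a WebElement returned by one of the lookup methods.

```python
def describe_element(element):
    """Collect a few commonly used facts about a WebElement into a dict."""
    return {
        "tag": element.tag_name,
        "text": element.text.strip(),
        "visible": element.is_displayed(),
        "enabled": element.is_enabled(),
        # get_attribute() returns None when the attribute is absent.
        "href": element.get_attribute("href"),
    }


def fill_and_submit(field, value):
    """Replace the contents of a form field and submit its form."""
    field.clear()           # remove any existing text
    field.send_keys(value)  # type the new value
    field.submit()          # submit the enclosing form
```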

Entering special keys with send_keys()

from selenium.webdriver.common.keys import Keys

ARROW_DOWN / ARROW_LEFT / ARROW_RIGHT / ARROW_UP
BACKSPACE / DELETE / HOME / END / INSERT
ALT / COMMAND / CONTROL / SHIFT
ENTER / ESCAPE / SPACE / TAB
F1 / F2 / F3 / ... / F12

Working with the driver

Method	Description
add_cookie(cookie_dict)	Sets a cookie given as a dictionary
back() / forward()	Goes to the previous / next page
close()	Closes the browser window
current_url	Current URL
delete_all_cookies()	Deletes all cookies
delete_cookie(name)	Deletes the cookie called name
execute(command, params)	Executes a browser-specific (low-level driver) command
execute_async_script(script, *args)	Runs asynchronous JavaScript
execute_script(script, *args)	Runs synchronous JavaScript
get(url)	Loads a web page
get_cookie(name)	Returns a single cookie
get_cookies()	Returns all cookies as a list of dictionaries
get_log(type)	Returns a log (type: browser/driver/client/server)
get_screenshot_as_base64()	Returns a screenshot as a base64 string
get_screenshot_as_file(filename)	Saves a screenshot to a file
get_screenshot_as_png()	Returns a screenshot as PNG binary data
get_window_position(windowHandle='current')	Returns the browser window position
get_window_size(windowHandle='current')	Returns the browser window size
implicitly_wait(sec)	Sets the maximum time, in seconds, to wait for an element to appear during lookups
quit()	Shuts down the driver and closes the browser
save_screenshot(filename)	Saves a screenshot
set_page_load_timeout(time_to_wait)	Sets the page load timeout
set_script_timeout(time_to_wait)	Sets the script timeout
set_window_position(x, y, windowHandle='current')	Sets the browser window position
set_window_size(width, height, windowHandle='current')	Sets the browser window size
title	Title of the current page
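
As one example of combining these methods, get_cookies() and add_cookie() can carry a login session from one run to the next. This is a minimal sketch. `save_cookies`, `load_cookies` and the JSON file layout are assumptions made for illustration. With a real browser, the driver generally has to be on a page of the cookie's domain before add_cookie() is accepted.

```python
import json


def save_cookies(driver, path):
    """Write the current session's cookies (a list of dicts) to a JSON file."""
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(driver.get_cookies(), fp)


def load_cookies(driver, path):
    """Add every cookie stored in the JSON file to the current session.

    Returns the number of cookies that were added.
    """
    with open(path, encoding="utf-8") as fp:
        cookies = json.load(fp)
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```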

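sleep vs implicitly_wait vs set_page_load_timeout

In short:

driver.implicitly_wait(10) - When locating an element, keeps polling for up to 10 seconds until it appears, and continues immediately once it is found.
time.sleep(10) - Always waits the full 10 seconds.
driver.set_page_load_timeout(10) - Sets a 10-second timeout for a page load to complete. On timeout, selenium.common.exceptions.TimeoutException is raised.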
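
To see why a polling wait can finish much earlier than time.sleep(), here is a minimal sketch of a poll-until-true loop, similar in spirit to the polling waits described above. `wait_until` and its parameters are illustrative names, not Selenium APIs.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value as soon as it is available, or raises TimeoutError
    once `timeout` seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # done early: no need to wait out the full timeout
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll)  # short pause between attempts
```

A fixed `time.sleep(10)` always costs 10 seconds, whereas this loop costs only as long as the condition takes to become true, plus at most one polling interval.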

Waiting for a specific condition

With WebDriverWait and the expected_conditions module, you can wait until a specific condition is met.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()

# Navigate to the web page
driver.get("https://example.com")

# Wait up to 10 seconds until the title becomes "Example Domain"
wait = WebDriverWait(driver, 10)
wait.until(EC.title_is("Example Domain"))

GeckoDriver Example

Example of Selenium with Python on Docker with latest FireFox | takac.dev

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import time

# FireFox binary path (Must be absolute path)
FIREFOX_BINARY = FirefoxBinary('/opt/firefox/firefox')

# FireFox PROFILE
PROFILE = webdriver.FirefoxProfile()
PROFILE.set_preference("browser.cache.disk.enable", False)
PROFILE.set_preference("browser.cache.memory.enable", False)
PROFILE.set_preference("browser.cache.offline.enable", False)
PROFILE.set_preference("network.http.use-cache", False)
PROFILE.set_preference("general.useragent.override","Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:72.0) Gecko/20100101 Firefox/72.0")

# FireFox Options
FIREFOX_OPTS = Options()
FIREFOX_OPTS.log.level = "trace"    # Debug
FIREFOX_OPTS.headless = True
GECKODRIVER_LOG = '/geckodriver.log'

class Scraper:
    def __init__(self):
        # Keyword arguments in the Selenium 3 style
        ff_opt = dict(
            firefox_binary=FIREFOX_BINARY,
            firefox_profile=PROFILE,
            options=FIREFOX_OPTS,
            service_log_path=GECKODRIVER_LOG,
        )
        self.DRIVER = webdriver.Firefox(**ff_opt)

    def scrape(self, link):
        try:
            self.DRIVER.get(link)
            time.sleep(5)  # just in case
            html = self.DRIVER.page_source

            return html

        except Exception as e:
            print(e)

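Run it as follows:

from bs4 import BeautifulSoup
from scraper import Scraper

scraper = Scraper()
# scrape() returns the page source as a string, so parse it before prettifying
pretty_html = BeautifulSoup(scraper.scrape(link), 'html.parser').prettify()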

Switching iframe context

Stackoverflow - Python Selenium switch into an iframe within an iframe

iframe = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to.default_content()

driver.switch_to.frame(iframe)
driver.find_elements_by_tag_name('iframe')[0]

switch_to.default_content() returns you to the top of the document. In the failing version, the code switched into the first iframe, switched back to the top of the document, and then tried to find the second iframe. Selenium can't find the second iframe there, because it is inside the first iframe.

Removing the second switch_to.default_content() call, as in the snippet above, fixes the problem.
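
A minimal sketch of the working sequence, continuing one step further and entering the inner iframe as well. `enter_nested_iframe` is an illustrative helper name. It assumes the page's first iframe itself contains at least one iframe, and it uses the find_elements(by, value) form so that it does not depend on the legacy helper methods.

```python
TAG_NAME = "tag name"  # same value as selenium.webdriver.common.by.By.TAG_NAME


def enter_nested_iframe(driver):
    """Move the driver's context into the first iframe inside the first iframe."""
    # Start from the top-level document exactly once.
    driver.switch_to.default_content()

    # Find and enter the outer iframe.
    outer = driver.find_elements(TAG_NAME, "iframe")[0]
    driver.switch_to.frame(outer)

    # Do NOT call default_content() here: the inner iframe can only be
    # found from inside the outer one.
    inner = driver.find_elements(TAG_NAME, "iframe")[0]
    driver.switch_to.frame(inner)
```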

Screenshot example

from selenium import webdriver

# PhantomJS requires Selenium 3 (support was removed in Selenium 4)
driver = webdriver.PhantomJS(service_args=['--ssl-protocol=any'])
# Or the Chrome driver:
# driver = webdriver.Chrome()

driver.implicitly_wait(10)
driver.set_window_size(1024, 768)
driver.get('http://www.python.org/')
if driver.save_screenshot('out.png'):
    print('success.')
driver.quit()

NAVER Fortune example

An example that fetches today's fortune from NAVER.

driver.get('https://search.naver.com/search.naver?sm=tab_hty.top&where=nexearch&oquery=%EC%98%A4%EB%8A%98%EC%9D%98+%EC%9A%B4%EC%84%B8&ie=utf8&query=%EC%98%A4%EB%8A%98%EC%9D%98+%EC%9A%B4%EC%84%B8')

fortune = driver.execute_script("return document.getElementById('_fortune_birthCondition');")
selector_date = driver.execute_script("return arguments[0].getElementsByClassName('_selectorText')[0];", fortune)
driver.execute_script("arguments[0].setAttribute('datetime', '{}');".format(date), selector_date)
driver.execute_script("arguments[0].innerHTML = '{}';".format(date), selector_date)

# Check fortune.
driver.execute_script("arguments[0].getElementsByClassName('btn_chs _submit')[0].click();", fortune)

# Check result.
luck_result = driver.execute_script("return arguments[0].getElementsByClassName('luck_result _fortune_birthResult')[0];", fortune)
story0 = driver.execute_script("return arguments[0].getElementsByClassName('luck_story _storyList')[0];", luck_result)
content0 = driver.execute_script("return arguments[0].getElementsByClassName('_luckContent')[0].getElementsByTagName('p')[0].innerHTML;", story0)

result = 'Total luck: {}\n\n'.format(content0)

OR:

driver.set_window_size(1200, 1000)
driver.get('https://search.naver.com/search.naver?sm=tab_hty.top&where=nexearch&oquery=%EC%98%A4%EB%8A%98%EC%9D%98+%EC%9A%B4%EC%84%B8&ie=utf8&query=%EC%98%A4%EB%8A%98%EC%9D%98+%EC%9A%B4%EC%84%B8')

elem = driver.find_elements_by_tag_name('html')
print(elem[0].text)

if is_female:
    driver.execute_script("document.getElementsByClassName('_genderTarget')[1].click()")
else:
    driver.execute_script("document.getElementsByClassName('_genderTarget')[0].click()")

driver.execute_script("document.getElementById('srch_txt').value = '{}'".format(date))
driver.execute_script("document.getElementsByClassName('contents03')[0].getElementsByClassName('img_btn')[0].click()")

luck_result = WebDriverWait(driver, 10).until(
    # By.CLASS_NAME accepts a single class name; use a CSS selector to match both classes
    EC.visibility_of_element_located((By.CSS_SELECTOR, ".infor._luckText"))
)

title = driver.execute_script("return document.getElementsByClassName('infor _luckText')[0].getElementsByTagName('dd')[0].getElementsByTagName('strong')[0].innerHTML")
body = driver.execute_script("return document.getElementsByClassName('infor _luckText')[0].getElementsByTagName('dd')[0].getElementsByTagName('p')[0].innerHTML")

result = '{}\n{}'.format(title, body)

with BeautifulSoup

Building your own web crawler (3): Building an invincible crawler with Selenium

from selenium import webdriver
from bs4 import BeautifulSoup

# Set up the driver (Chrome): create a driver that uses chromedriver
driver = webdriver.Chrome('/Users/beomi/Downloads/chromedriver')
driver.implicitly_wait(3) # implicitly wait up to 3 seconds for elements to appear
# Login
driver.get('https://nid.naver.com/nidlogin.login') # go to the NAVER login URL
driver.find_element_by_name('id').send_keys('naver_id') # enter the values
driver.find_element_by_name('pw').send_keys('mypassword1234')
driver.find_element_by_xpath(
    '//*[@id="frmNIDLogin"]/fieldset/input'
    ).click() # click the button
driver.get('https://order.pay.naver.com/home') # open NAVER Pay
html = driver.page_source # get the full page source
soup = BeautifulSoup(html, 'html.parser') # parse it with BeautifulSoup
notices = soup.select('div.p_inr > div.p_info > a > span')

for n in notices:
    print(n.text.strip())

ChromeDriver Selenium for Dockerfile

Complete example:

FROM python:3.8
MAINTAINER yourname <[email protected]>

COPY . /decanbot
WORKDIR /decanbot

ENV DEBIAN_FRONTEND noninteractive
RUN apt-get -qq update && \
    ln -fs /usr/share/zoneinfo/Asia/Seoul /etc/localtime && \
    apt-get install -y tzdata && \
    dpkg-reconfigure --frontend noninteractive tzdata && \
    apt-get install -y --no-install-recommends software-properties-common curl unzip
RUN curl -s "https://dl.google.com/linux/linux_signing_key.pub" | apt-key add - && \
    add-apt-repository "deb [arch=amd64] https://dl.google.com/linux/chrome/deb/ stable main"
RUN apt-get -qq update && \
    apt-get install -y --no-install-recommends google-chrome-stable && \
    export CHROME_DRIVER_VERSION=$(curl -sS "https://chromedriver.storage.googleapis.com/LATEST_RELEASE") && \
    export CHROME_DRIVER_URL="https://chromedriver.storage.googleapis.com/${CHROME_DRIVER_VERSION}/chromedriver_linux64.zip" && \
    mkdir -p /decanbot/storage/driver && \
    curl -sS -o /decanbot/storage/driver/chromedriver_linux64.zip "${CHROME_DRIVER_URL}" && \
    unzip /decanbot/storage/driver/chromedriver_linux64.zip chromedriver -d /decanbot/storage/driver/ && \
    pip3 install -U pip && \
    pip3 install -r requirements.txt

ENTRYPOINT ["python3", "-m", "decanbot"]

GeckoDriver Selenium for Dockerfile

Example of Selenium with Python on Docker with latest FireFox | takac.dev

FROM python:3.9

ENV DEBIAN_FRONTEND noninteractive
ENV GECKODRIVER_VER v0.29.0
ENV FIREFOX_VER 87.0

RUN set -x \
   && apt update \
   && apt upgrade -y \
   && apt install -y \
       firefox-esr \
   && pip install  \
       requests \
       selenium

# Add latest FireFox
RUN set -x \
   && apt install -y \
       libx11-xcb1 \
       libdbus-glib-1-2 \
   && curl -sSLO https://download-installer.cdn.mozilla.net/pub/firefox/releases/${FIREFOX_VER}/linux-x86_64/en-US/firefox-${FIREFOX_VER}.tar.bz2 \
   && tar -jxf firefox-* \
   && mv firefox /opt/ \
   && chmod 755 /opt/firefox \
   && chmod 755 /opt/firefox/firefox

# Add geckodriver
RUN set -x \
   && curl -sSLO https://github.com/mozilla/geckodriver/releases/download/${GECKODRIVER_VER}/geckodriver-${GECKODRIVER_VER}-linux64.tar.gz \
   && tar zxf geckodriver-*.tar.gz \
   && mv geckodriver /usr/bin/

COPY ./app /app

WORKDIR /app

CMD python ./main.py

Troubleshooting

The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed

(Linux) How to fix chrome driver errors

The following options must be passed.

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

DevToolsActivePort file doesn't exist

If the following error is printed:

selenium.common.exceptions.WebDriverException: Message: unknown error: DevToolsActivePort file doesn't exist

Try adding --single-process to the Chrome driver options. If that does not help, add --disable-dev-shm-usage.

timeout: Timed out receiving message from renderer

A timeout error can occur while a page is being loaded with driver.get("https://...").

2022-06-12 16:18:50.864 1/140558718469952 decanbot ERROR driver.get(url='https://...') -> timeout: Timed out receiving message from renderer: 59.458
  (Session info: headless chrome=102.0.5005.115)
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/decanbot/decanbot/web/mana.py", line 129, in _get_last_numbers_from_update_page
    driver.get(url)
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 442, in get
    self.execute(Command.GET, {'url': url})
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 430, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: timeout: Timed out receiving message from renderer: 59.458
  (Session info: headless chrome=102.0.5005.115)
Stacktrace:
#0 0x558ee5eb2f33 <unknown>
#1 0x558ee5bfd118 <unknown>
#2 0x558ee5be8008 <unknown>
#3 0x558ee5be6c0f <unknown>
#4 0x558ee5be719c <unknown>
#5 0x558ee5bf55ff <unknown>
#6 0x558ee5bf6162 <unknown>
#7 0x558ee5c0424d <unknown>
#8 0x558ee5c0766a <unknown>
#9 0x558ee5be75c6 <unknown>
#10 0x558ee5c03f54 <unknown>
#11 0x558ee5c646e8 <unknown>
#12 0x558ee5c50e63 <unknown>
#13 0x558ee5c2682a <unknown>
#14 0x558ee5c27985 <unknown>
#15 0x558ee5ef74cd <unknown>
#16 0x558ee5efb5ec <unknown>
#17 0x558ee5ee171e <unknown>
#18 0x558ee5efc238 <unknown>
#19 0x558ee5ed6870 <unknown>
#20 0x558ee5f18608 <unknown>
#21 0x558ee5f18788 <unknown>
#22 0x558ee5f32f1d <unknown>
#23 0x7f726b340ea7 <unknown>


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/decanbot/decanbot/web/mana.py", line 132, in _get_last_numbers_from_update_page
    raise NetworkConnectionError(f"driver.get(url='{url}') -> {e.msg}")
decanbot.web.mana.NetworkConnectionError: driver.get(url='https://...') -> timeout: Timed out receiving message from renderer: 59.458
  (Session info: headless chrome=102.0.5005.115)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/decanbot/decanbot/core/mixin/context_timer.py", line 34, in timer_main
    await self.on_timer()
  File "/decanbot/decanbot/core/mixin/context_timer.py", line 53, in on_timer
    await self.on_mana_crawling()
  File "/decanbot/decanbot/core/mixin/context_timer.py", line 59, in on_mana_crawling
    manas = await self.mana.get_last_numbers_from_update_page(
  File "/decanbot/decanbot/web/mana.py", line 211, in get_last_numbers_from_update_page
    return await loop.run_in_executor(
decanbot.web.mana.NetworkConnectionError: driver.get(url='https://...') -> timeout: Timed out receiving message from renderer: 59.458
  (Session info: headless chrome=102.0.5005.115)

If the timeout keeps occurring for no apparent reason, either increase the timeout:

service = Service(executable_path=self.chromedriver_path)
driver = Chrome(options=self.options, service=service)
driver.set_page_load_timeout(60.0)  # 60 seconds

Or, if scripts are not needed, disable JavaScript or skip image loading, both of which add to the load time.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--disable-javascript")  # disable JavaScript
prefs = {"profile.managed_default_content_settings.images": 2}  # disable image loading
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=chrome_options)

If that still does not solve it, review the other options. In my case, I removed the single-process option.

self.options.add_argument("disable-gpu")
self.options.add_argument("no-sandbox")
self.options.add_argument("disable-dev-shm-usage")

# self.options.add_argument("single-process")

if not developer:
    self.options.add_argument("headless")
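
The flags from this troubleshooting section can be collected in one place so that single-process and headless are easy to toggle. This is a minimal sketch. `chrome_arguments` and `apply_arguments` are illustrative helper names, and the default set of flags simply mirrors the snippets above rather than a recommended configuration.

```python
def chrome_arguments(headless=True, single_process=False):
    """Build the list of Chrome command-line flags used in this section."""
    args = ["--disable-gpu", "--no-sandbox", "--disable-dev-shm-usage"]
    if single_process:
        # Left off by default: removing it resolved the renderer timeout above.
        args.append("--single-process")
    if headless:
        args.append("--headless")
    return args


def apply_arguments(options, args):
    """Add each flag to a ChromeOptions-like object and return the object."""
    for arg in args:
        options.add_argument(arg)
    return options
```

With Selenium installed, `apply_arguments(webdriver.ChromeOptions(), chrome_arguments(headless=not developer))` gives an options object equivalent to the one built above.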

WebDriverException : "Process unexpectedly closed with status 255"

Stackoverflow - WebDriverException : "Process unexpectedly closed with status 255" - selenium/geckodriver/AWS lambda - Python

Unless you changed the log path, check the default log file at $(pwd)/geckodriver.log. Most likely a shared library failed to load. See Firefox#Troubleshooting. In short:

apt-get update
apt-get install -y wget bzip2 libxtst6 packagekit-gtk3-module libx11-xcb-dev libdbus-glib-1-2 libxt6 libpci-dev