Selenium WebDriver
Selenium automates browsers. That's it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes, but it is certainly not limited to just that. Boring web-based administration tasks can (and should!) be automated as well.
Selenium has the support of some of the largest browser vendors who have taken (or are taking) steps to make Selenium a native part of their browser. It is also the core technology in countless other browser automation tools, APIs and frameworks.
How to install
[python] Scraping with Selenium
Drivers
Chrome
Gecko
Setting the environment variable MOZ_HEADLESS=1 starts Firefox in headless mode.
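As a minimal sketch, the variable can also be set from Python before the browser is started, since the browser process inherits the environment of the script. The helper names `enable_moz_headless` and `start_headless_firefox` are hypothetical, and the second helper assumes that Firefox, geckodriver, and the `selenium` package are installed.

```python
import os


def enable_moz_headless(env=os.environ):
    """Set MOZ_HEADLESS=1 so a Firefox started afterwards runs headless."""
    env["MOZ_HEADLESS"] = "1"
    return env


def start_headless_firefox():
    """Hypothetical helper: launch Firefox with MOZ_HEADLESS already set."""
    # Imported lazily so this module loads even where selenium is missing.
    from selenium import webdriver

    enable_moz_headless()
    return webdriver.Firefox()
```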
Selecting DOM elements
If an element cannot be found, selenium.common.exceptions.NoSuchElementException is raised.
Extracting the first matching element
Name | Description |
find_element_by_id(id) | Returns one element by its id attribute |
find_element_by_name(name) | Returns one element by its name attribute |
find_element_by_css_selector(query) | Returns one element matching a CSS selector |
find_element_by_xpath(query) | Returns one element matching an XPath expression |
find_element_by_tag_name(name) | Returns one element whose tag name is name |
find_element_by_link_text(text) | Returns one link element by its link text |
find_element_by_partial_link_text(text) | Returns one link element whose text contains text |
find_element_by_class_name(name) | Returns one element whose class name is name |
Extracting all matching elements
Name | Description |
find_elements_by_name | |
find_elements_by_xpath | |
find_elements_by_link_text | |
find_elements_by_partial_link_text | |
find_elements_by_tag_name | |
find_elements_by_class_name | |
find_elements_by_css_selector | |
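Note that in Selenium 4 the `find_element_by_*` and `find_elements_by_*` helpers listed above were deprecated and later removed in favor of `find_element(by, value)` and `find_elements(by, value)`. The sketch below maps the legacy suffixes to locator strategy strings, which are the values of the constants in `selenium.webdriver.common.by.By`. The `LEGACY_TO_STRATEGY` table and the `find_first` and `find_all` helpers are hypothetical names introduced only for illustration.

```python
# Locator strategy strings; these equal By.ID, By.NAME, By.CSS_SELECTOR, ...
# in selenium.webdriver.common.by.By.
LEGACY_TO_STRATEGY = {
    "id": "id",
    "name": "name",
    "css_selector": "css selector",
    "xpath": "xpath",
    "tag_name": "tag name",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "class_name": "class name",
}


def find_first(driver, legacy_suffix, value):
    """Equivalent of driver.find_element_by_<legacy_suffix>(value)."""
    return driver.find_element(LEGACY_TO_STRATEGY[legacy_suffix], value)


def find_all(driver, legacy_suffix, value):
    """Equivalent of driver.find_elements_by_<legacy_suffix>(value)."""
    return driver.find_elements(LEGACY_TO_STRATEGY[legacy_suffix], value)
```

For example, `find_first(driver, "css_selector", "div.content")` corresponds to `driver.find_element(By.CSS_SELECTOR, "div.content")`.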
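Manipulating elements
Method/Attribute | Description |
clear() | Clears the entered text |
click() | Clicks the element |
get_attribute(name) | Returns the value of the attribute named name |
is_displayed() | Checks whether the element is displayed on screen |
is_enabled() | Checks whether the element is enabled |
is_selected() | Checks whether an element such as a checkbox is selected |
screenshot(filename) | Saves a screenshot of the element to a file |
send_keys(value) | Sends key input to the element |
submit() | Submits the form |
value_of_css_property(name) | Returns the value of the CSS property named name |
id | Internal ID that Selenium assigns to the element (not the HTML id attribute) |
location | Position of the element |
parent | The WebDriver instance the element was found from (not the parent DOM element) |
rect | Returns a dictionary with the size and position |
screenshot_as_base64 | Returns a screenshot as a base64 string |
screenshot_as_png | Returns a screenshot as binary PNG data |
size | Size of the element |
tag_name | Tag name |
text | Text inside the element |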
Entering special keys with send_keys()
from selenium.webdriver.common.keys import Keys
ARROW_DOWN / ARROW_LEFT / ARROW_RIGHT / ARROW_UP
BACKSPACE / DELETE / HOME / END / INSERT
ALT / COMMAND / CONTROL / SHIFT
ENTER / ESCAPE / SPACE / TAB
F1 / F2 / F3 / ... / F12
Controlling the driver
Method/Attribute | Description |
add_cookie(cookie_dict) | Sets a cookie given as a dictionary |
back() / forward() | Goes to the previous / next page |
close() | Closes the browser window |
current_url | The current URL |
delete_all_cookies() | Deletes all cookies |
delete_cookie(name) | Deletes the cookie named name |
execute(command, params) | Executes a browser-specific command |
execute_async_script(script, *args) | Runs JavaScript that completes asynchronously |
execute_script(script, *args) | Runs JavaScript synchronously |
get(url) | Loads a web page |
get_cookie(name) | Returns a specific cookie |
get_cookies() | Returns all cookies as dictionaries |
get_log(type) | Returns a log (type: browser/driver/client/server) |
get_screenshot_as_base64() | Returns a screenshot as a base64 string |
get_screenshot_as_file(filename) | Saves a screenshot to a file |
get_screenshot_as_png() | Returns a screenshot as binary PNG data |
get_window_position(windowHandle='current') | Returns the position of the browser window |
get_window_size(windowHandle='current') | Returns the size of the browser window |
implicitly_wait(sec) | Sets the maximum time, in seconds, to wait for an operation to finish |
quit() | Shuts down the driver and closes the browser |
save_screenshot(filename) | Saves a screenshot |
set_page_load_timeout(time_to_wait) | Sets the timeout for loading a page |
set_script_timeout(time_to_wait) | Sets the timeout for scripts |
set_window_position(x, y, windowHandle='current') | Sets the position of the browser window |
set_window_size(width, height, windowHandle='current') | Sets the size of the browser window |
title | The current page title |
sleep vs implicitly_wait vs set_page_load_timeout
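- (Python) Difference between driver.implicitly_wait and time.sleep
- How to set page load timeout in Selenium? - GeeksforGeeks
In short:
- driver.implicitly_wait(10) - Waits up to 10 seconds when looking up an element, and continues immediately once the element is available.
- time.sleep(10) - Always waits the full 10 seconds.
- driver.set_page_load_timeout(10) - Sets a 10-second timeout for loading a page completely. When the timeout expires, selenium.common.exceptions.TimeoutException is raised.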
Waiting for a specific condition
With WebDriverWait and the expected_conditions module, you can wait until a specific condition is met.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
# Navigate to the web page
driver.get("https://example.com")
# Wait up to 10 seconds until the title becomes "Example Domain"
wait = WebDriverWait(driver, 10)
wait.until(EC.title_is("Example Domain"))
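Conceptually, an explicit wait polls a condition until it returns a truthy value or the timeout expires. The following pure-Python sketch illustrates that behavior; it is a simplified model and not Selenium's implementation, and `wait_until` and its parameters are hypothetical names. The 0.5 second default mirrors the default poll interval of `WebDriverWait`.

```python
import time


def wait_until(condition, timeout, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # Return as soon as the condition holds.
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poll_interval)
```

With a driver, `wait_until(lambda: driver.title == "Example Domain", 10)` would behave roughly like the `title_is` wait above, whereas `time.sleep(10)` always blocks for the full 10 seconds.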
GeckoDriver Example
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import time
# FireFox binary path (Must be absolute path)
FIREFOX_BINARY = FirefoxBinary('/opt/firefox/firefox')
# FireFox PROFILE
PROFILE = webdriver.FirefoxProfile()
PROFILE.set_preference("browser.cache.disk.enable", False)
PROFILE.set_preference("browser.cache.memory.enable", False)
PROFILE.set_preference("browser.cache.offline.enable", False)
PROFILE.set_preference("network.http.use-cache", False)
PROFILE.set_preference("general.useragent.override","Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:72.0) Gecko/20100101 Firefox/72.0")
# FireFox Options
FIREFOX_OPTS = Options()
FIREFOX_OPTS.log.level = "trace" # Debug
FIREFOX_OPTS.headless = True
GECKODRIVER_LOG = '/geckodriver.log'
class Scraper:
    def __init__(self):
        # Selenium 3-style keyword arguments for webdriver.Firefox
        ff_opt = {
            "firefox_binary": FIREFOX_BINARY,
            "firefox_profile": PROFILE,
            "options": FIREFOX_OPTS,
            "service_log_path": GECKODRIVER_LOG,
        }
        self.DRIVER = webdriver.Firefox(**ff_opt)

    def scrape(self, link):
        try:
            self.DRIVER.get(link)
            time.sleep(5)  # just in case
            html = self.DRIVER.page_source
            return html
        except Exception as e:
            print(e)
Run it as follows:
Switching iframe context
iframe = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to.default_content()
driver.switch_to.frame(iframe)
driver.find_elements_by_tag_name('iframe')[0]
switch_to.default_content() will return you to the top of the document. What was happening is you switched into the first iframe, switched back to the top of the document, then tried to find the second iframe. Selenium can't find the second iframe, because it's inside of the first iframe.
If you remove the second switch_to.default_content() call, as in the snippet above, you should be fine.
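The same idea can be sketched as a helper that resets to the top-level document once and then descends one frame at a time. The helper name `switch_to_nested_frame` and its `indexes` parameter are hypothetical; the sketch uses the generic `find_elements(by, value)` locator form with the `"tag name"` strategy string.

```python
def switch_to_nested_frame(driver, indexes):
    """Switch into nested iframes, e.g. [0, 0] = first iframe in the first."""
    # Reset to the top of the document only once, before descending.
    driver.switch_to.default_content()
    for index in indexes:
        # Frames are looked up relative to the current browsing context.
        frames = driver.find_elements("tag name", "iframe")
        driver.switch_to.frame(frames[index])
```

Calling `switch_to_nested_frame(driver, [0, 0])` would select the first iframe nested inside the first top-level iframe.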
Screenshot example
- Stackoverflow: How do I generate a png file w/ selenium/phantomjs from a string?
- Python Selenium + PhantomJS on AWS EC2 Ubuntu Instance - Headless Browser Automation
driver = webdriver.PhantomJS(service_args=['--ssl-protocol=any'])
## Or chrome driver:
# driver = webdriver.Chrome();
driver.implicitly_wait(10)
driver.set_window_size(1024, 768)
driver.get('http://www.python.org/')
if driver.save_screenshot('out.png'):
    print('success.')
driver.quit()
NAVER Fortune example
An example that retrieves today's fortune from NAVER.
driver.get('https://search.naver.com/search.naver?sm=tab_hty.top&where=nexearch&oquery=%EC%98%A4%EB%8A%98%EC%9D%98+%EC%9A%B4%EC%84%B8&ie=utf8&query=%EC%98%A4%EB%8A%98%EC%9D%98+%EC%9A%B4%EC%84%B8')
fortune = driver.execute_script("return document.getElementById('_fortune_birthCondition');")
selector_date = driver.execute_script("return arguments[0].getElementsByClassName('_selectorText')[0];", fortune)
driver.execute_script("arguments[0].setAttribute('datetime', '{}');".format(date), selector_date)
driver.execute_script("arguments[0].innerHTML = '{}';".format(date), selector_date)
# Check fortune.
driver.execute_script("arguments[0].getElementsByClassName('btn_chs _submit')[0].click();", fortune)
# Check result.
luck_result = driver.execute_script("return arguments[0].getElementsByClassName('luck_result _fortune_birthResult')[0];", fortune)
story0 = driver.execute_script("return arguments[0].getElementsByClassName('luck_story _storyList')[0];", luck_result)
content0 = driver.execute_script("return arguments[0].getElementsByClassName('_luckContent')[0].getElementsByTagName('p')[0].innerHTML;", story0)
result = 'Total luck: {}\n\n'.format(content0)
OR:
driver.set_window_size(1200, 1000)
driver.get('https://search.naver.com/search.naver?sm=tab_hty.top&where=nexearch&oquery=%EC%98%A4%EB%8A%98%EC%9D%98+%EC%9A%B4%EC%84%B8&ie=utf8&query=%EC%98%A4%EB%8A%98%EC%9D%98+%EC%9A%B4%EC%84%B8')
elem = driver.find_elements_by_tag_name('html')
print(elem[0].text)
if is_female:
    driver.execute_script("document.getElementsByClassName('_genderTarget')[1].click()")
else:
    driver.execute_script("document.getElementsByClassName('_genderTarget')[0].click()")
driver.execute_script("document.getElementById('srch_txt').value = '{}'".format(date))
driver.execute_script("document.getElementsByClassName('contents03')[0].getElementsByClassName('img_btn')[0].click()")
# By.CLASS_NAME accepts a single class name, so use a CSS selector for two classes.
luck_result = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, ".infor._luckText"))
)
title = driver.execute_script("return document.getElementsByClassName('infor _luckText')[0].getElementsByTagName('dd')[0].getElementsByTagName('strong')[0].innerHTML")
body = driver.execute_script("return document.getElementsByClassName('infor _luckText')[0].getElementsByTagName('dd')[0].getElementsByTagName('p')[0].innerHTML")
result = '{}\n{}'.format(title, body)
with BeautifulSoup
from selenium import webdriver
from bs4 import BeautifulSoup
# Set up the driver: create a driver that uses ChromeDriver
driver = webdriver.Chrome('/Users/beomi/Downloads/chromedriver')
driver.implicitly_wait(3)  # Implicitly wait (up to) 3 seconds for web resources
# Login
driver.get('https://nid.naver.com/nidlogin.login')  # Go to the NAVER login URL
driver.find_element_by_name('id').send_keys('naver_id')  # Enter values
driver.find_element_by_name('pw').send_keys('mypassword1234')
driver.find_element_by_xpath(
    '//*[@id="frmNIDLogin"]/fieldset/input'
).click()  # Click the button
driver.get('https://order.pay.naver.com/home')  # Go to NAVER Pay
html = driver.page_source  # Get the full page source
soup = BeautifulSoup(html, 'html.parser')  # Parse it with BeautifulSoup
notices = soup.select('div.p_inr > div.p_info > a > span')
for n in notices:
    print(n.text.strip())
ChromeDriver Selenium for Dockerfile
Complete example:
FROM python:3.8
MAINTAINER yourname <[email protected]>
COPY . /decanbot
WORKDIR /decanbot
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get -qq update && \
ln -fs /usr/share/zoneinfo/Asia/Seoul /etc/localtime && \
apt-get install -y tzdata && \
dpkg-reconfigure --frontend noninteractive tzdata && \
apt-get install -y --no-install-recommends software-properties-common curl unzip
RUN curl -s "https://dl.google.com/linux/linux_signing_key.pub" | apt-key add - && \
add-apt-repository "deb [arch=amd64] https://dl.google.com/linux/chrome/deb/ stable main"
RUN apt-get -qq update && \
apt-get install -y --no-install-recommends google-chrome-stable && \
export CHROME_DRIVER_VERSION=$(curl -sS "https://chromedriver.storage.googleapis.com/LATEST_RELEASE") && \
export CHROME_DRIVER_URL="https://chromedriver.storage.googleapis.com/${CHROME_DRIVER_VERSION}/chromedriver_linux64.zip" && \
mkdir -p /decanbot/storage/driver && \
curl -sS -o /decanbot/storage/driver/chromedriver_linux64.zip "${CHROME_DRIVER_URL}" && \
unzip /decanbot/storage/driver/chromedriver_linux64.zip chromedriver -d /decanbot/storage/driver/ && \
pip3 install -U pip && \
pip3 install -r requirements.txt
ENTRYPOINT ["python3", "-m", "decanbot"]
GeckoDriver Selenium for Dockerfile
FROM python:3.9
ENV DEBIAN_FRONTEND noninteractive
ENV GECKODRIVER_VER v0.29.0
ENV FIREFOX_VER 87.0
RUN set -x \
&& apt update \
&& apt upgrade -y \
&& apt install -y \
firefox-esr \
&& pip install \
requests \
selenium
# Add latest FireFox
RUN set -x \
&& apt install -y \
libx11-xcb1 \
libdbus-glib-1-2 \
&& curl -sSLO https://download-installer.cdn.mozilla.net/pub/firefox/releases/${FIREFOX_VER}/linux-x86_64/en-US/firefox-${FIREFOX_VER}.tar.bz2 \
&& tar -jxf firefox-* \
&& mv firefox /opt/ \
&& chmod 755 /opt/firefox \
&& chmod 755 /opt/firefox/firefox
# Add geckodriver
RUN set -x \
&& curl -sSLO https://github.com/mozilla/geckodriver/releases/download/${GECKODRIVER_VER}/geckodriver-${GECKODRIVER_VER}-linux64.tar.gz \
&& tar zxf geckodriver-*.tar.gz \
&& mv geckodriver /usr/bin/
COPY ./app /app
WORKDIR /app
CMD python ./main.py
Troubleshooting
The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed
The following options are required:
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
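A minimal sketch of how these flags might be assembled and passed to the driver. The helper names `container_chrome_arguments` and `make_chrome` are hypothetical, and `make_chrome` assumes that Chrome, a matching ChromeDriver, and the `selenium` package are available.

```python
def container_chrome_arguments(headless=True):
    """Return the Chrome flags listed above as a list of strings."""
    arguments = ["--no-sandbox", "--disable-dev-shm-usage"]
    if headless:
        arguments.insert(0, "--headless")
    return arguments


def make_chrome(headless=True):
    """Hypothetical helper: create a Chrome driver with those flags."""
    # Imported lazily so this module loads even where selenium is missing.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in container_chrome_arguments(headless):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```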
DevToolsActivePort file doesn't exist
- Fixing the DevToolsActivePort file doesn't exist error · MOONGCHI
- Stackoverflow - WebDriverException: unknown error: DevToolsActivePort file doesn't exist while trying to initiate Chrome Browser
- How to fix the DevToolsActivePort file doesn't exist error
If an error like the following is printed:
selenium.common.exceptions.WebDriverException: Message: unknown error: DevToolsActivePort file doesn't exist
Try adding --single-process to the Chrome driver options. If that does not work, add --disable-dev-shm-usage.
timeout: Timed out receiving message from renderer
- Stackoverflow - python - Selenium gives "Timed out receiving message from renderer" for all websites after some execution time - Stack Overflow
- Stackoverflow - How to disable java script in Chrome Driver Selenium Python - Stack Overflow
- Stackoverflow - Python: Disable images in Selenium Google ChromeDriver - Stack Overflow
A timeout error can occur while a page is loading in driver.get("https://...").
2022-06-12 16:18:50.864 1/140558718469952 decanbot ERROR driver.get(url='https://...') -> timeout: Timed out receiving message from renderer: 59.458
(Session info: headless chrome=102.0.5005.115)
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/decanbot/decanbot/web/mana.py", line 129, in _get_last_numbers_from_update_page
driver.get(url)
File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 442, in get
self.execute(Command.GET, {'url': url})
File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 430, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: timeout: Timed out receiving message from renderer: 59.458
(Session info: headless chrome=102.0.5005.115)
Stacktrace:
#0 0x558ee5eb2f33 <unknown>
#1 0x558ee5bfd118 <unknown>
#2 0x558ee5be8008 <unknown>
#3 0x558ee5be6c0f <unknown>
#4 0x558ee5be719c <unknown>
#5 0x558ee5bf55ff <unknown>
#6 0x558ee5bf6162 <unknown>
#7 0x558ee5c0424d <unknown>
#8 0x558ee5c0766a <unknown>
#9 0x558ee5be75c6 <unknown>
#10 0x558ee5c03f54 <unknown>
#11 0x558ee5c646e8 <unknown>
#12 0x558ee5c50e63 <unknown>
#13 0x558ee5c2682a <unknown>
#14 0x558ee5c27985 <unknown>
#15 0x558ee5ef74cd <unknown>
#16 0x558ee5efb5ec <unknown>
#17 0x558ee5ee171e <unknown>
#18 0x558ee5efc238 <unknown>
#19 0x558ee5ed6870 <unknown>
#20 0x558ee5f18608 <unknown>
#21 0x558ee5f18788 <unknown>
#22 0x558ee5f32f1d <unknown>
#23 0x7f726b340ea7 <unknown>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/decanbot/decanbot/web/mana.py", line 132, in _get_last_numbers_from_update_page
raise NetworkConnectionError(f"driver.get(url='{url}') -> {e.msg}")
decanbot.web.mana.NetworkConnectionError: driver.get(url='https://...') -> timeout: Timed out receiving message from renderer: 59.458
(Session info: headless chrome=102.0.5005.115)
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/decanbot/decanbot/core/mixin/context_timer.py", line 34, in timer_main
await self.on_timer()
File "/decanbot/decanbot/core/mixin/context_timer.py", line 53, in on_timer
await self.on_mana_crawling()
File "/decanbot/decanbot/core/mixin/context_timer.py", line 59, in on_mana_crawling
manas = await self.mana.get_last_numbers_from_update_page(
File "/decanbot/decanbot/web/mana.py", line 211, in get_last_numbers_from_update_page
return await loop.run_in_executor(
decanbot.web.mana.NetworkConnectionError: driver.get(url='https://...') -> timeout: Timed out receiving message from renderer: 59.458
(Session info: headless chrome=102.0.5005.115)
If nothing in particular is wrong but the timeout keeps occurring, increase the timeout:
service = Service(executable_path=self.chromedriver_path)
driver = Chrome(options=self.options, service=service)
driver.set_page_load_timeout(60.0)  # 60 seconds
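Besides raising the timeout, a page load can be retried a few times before giving up. The sketch below is one possible way to do that, not something the original setup is known to use; `get_with_retry` and its parameters are hypothetical, and the caller passes in the exception types to retry on (for Selenium, typically `selenium.common.exceptions.TimeoutException`).

```python
import time


def get_with_retry(driver, url, attempts=3, delay=1.0, retry_on=(Exception,)):
    """Call driver.get(url), retrying when one of retry_on is raised."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            driver.get(url)
            return attempt  # Number of attempts that were needed.
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: propagate the last error.
            time.sleep(delay)
```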
If scripts are not needed, you can also disable JavaScript or skip image loading (both are factors that increase load time).
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--disable-javascript")  # Disable JavaScript
prefs = {"profile.managed_default_content_settings.images": 2}  # Disable image loading
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=chrome_options)
If that still does not solve it, check the other options. In my case, I removed the single-process option.
self.options.add_argument("disable-gpu")
self.options.add_argument("no-sandbox")
self.options.add_argument("disable-dev-shm-usage")
# self.options.add_argument("single-process")
if not developer:
    self.options.add_argument("headless")
WebDriverException : "Process unexpectedly closed with status 255"
If you have not changed the log path, check the default log file, $(pwd)/geckodriver.log. It is likely that a shared library failed to load. See Firefox#Troubleshooting. In short:
apt-get update
apt-get install -y wget bzip2 libxtst6 packagekit-gtk3-module libx11-xcb-dev libdbus-glib-1-2 libxt6 libpci-dev
See also
- Web crawler
- Python
- WebDriver
- Beautiful Soup
- PhantomJS
- Chrome
- FireFox
- Helium - An easier-to-use Selenium for Python
- Autotab - A tool that turns complex web tasks into APIs
- Playwright - Playwright enables reliable end-to-end testing for modern web apps.
- Stagehand - An AI-based open-source browser automation framework
- Simplex - Automate browser workflows using code and natural language
Favorite site
- Selenium web site
- Python - Using Web Driver & Selenium
- Using selenium with PhantomJS
- Using PhantomJS
- UI testing with Selenium
- Selenium Tutorial: Web Scraping with Selenium and Python
- JavaScript Injection with Selenium, Puppeteer, and Marionette in Chrome and Firefox
- Use Playwright instead of Selenium for web automation | GeekNews (Selenium vs Playwright)