What should you do when saving HTML contents to a file with Python fails with HTTP Error 403: Forbidden?

 

I wrote a Python program that scrapes a specific page. (The part that saves the result to a file is omitted.)

import urllib.request

fullUrl = '......'

response = urllib.request.urlopen(fullUrl)
data = response.read()
text = data.decode('utf-8')
print(text)

It worked fine.

But when I ran the same code against a different page, I got a 403 error.

Traceback (most recent call last):
  File "pageScrap.py", line 5, in <module>
    response = urllib.request.urlopen(fullUrl)
  File "/app/python361/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/app/python361/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/app/python361/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/app/python361/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/app/python361/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/app/python361/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

 

[ Cause ]

Stack Overflow never lets us down.

"This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, it's easily detected)."
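To see what the quote is referring to, here is a minimal sketch that prints the User-Agent urllib sends when you do not set one yourself. It assumes that `urllib.request.build_opener()` gives an opener with the same default headers as the one `urlopen()` uses. No network access is needed.

```python
import urllib.request

# Build a default opener; its addheaders list holds the (name, value)
# pairs that are attached to every request sent through it.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The only default entry is the User-agent, which names urllib itself.
default_ua = default_headers["User-agent"]
print(default_ua)  # something like Python-urllib/3.6, depending on the Python version
```

A value like this is trivial for a server-side filter to match, which is consistent with the explanation in the quote.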

 

[ Solution ]

Let's add a User-Agent header to the request.

# Send a browser-like User-Agent instead of urllib's default one
req = urllib.request.Request(fullUrl, headers={'User-Agent': 'Mozilla/5.0'})
data = urllib.request.urlopen(req).read()
text = data.decode('utf-8')
print(text)
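If you want to reuse this, the snippet can be wrapped in small functions that also report the status code when the server still refuses the request. This is only a sketch: the names `build_request` and `fetch_text` are made up for illustration, and 'Mozilla/5.0' is the same placeholder value used above, not a value guaranteed to get past every server.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_text(url, encoding="utf-8"):
    """Download url and return its body as text (hypothetical helper)."""
    req = build_request(url)
    try:
        # The with-block closes the connection once the body is read.
        with urllib.request.urlopen(req) as response:
            return response.read().decode(encoding)
    except urllib.error.HTTPError as e:
        # A 403 can still happen if the server checks more than the User-Agent.
        print("HTTP error:", e.code, e.reason)
        raise
```

Calling `fetch_text(fullUrl)` then returns the page text, or prints the status code and re-raises the error if the server still rejects the request.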