Python crawlers: handling web page redirects

2018/03/05

Categories: Crawlers Tags: http JS html

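When crawling with python + requests you will often find that the URL changes (the URL you get back is no longer the URL you requested). Most of the time this is because the page was redirected. A redirect is any mechanism that sends a network request on to a different location (URL). A site's home page is the entry point to its resources, so if a redirect happens on the home page and is not handled correctly, you are likely to miss the content of the entire site.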

Three main kinds of redirect come up when crawling pages

For example, search Baidu for requests. The first result has the address https://www.baidu.com/link?url=n2d6IqviMKE2UKdm3cJo02edoksu6FX81jzThBQbkehNlFLpXO18Wry6_S3p_sp8&wd=&eqid=9b51b77c000016fb000000045a9ca929 and that address redirects to http://www.python-requests.org/

<meta http-equiv="refresh" content="0.1;url=http://www.redirectedtoxxx.com/"><!-- after 0.1 seconds this page refreshes to the page given by url -->


<meta content="always" name="referrer">
<script>try{if(window.opener&&window.opener.bds&&window.opener.bds.pdc&&window.opener.bds.pdc.sendLinkLog){window.opener.bds.pdc.sendLinkLog();}}catch(e) {};var timeout = 0;if(/bdlksmp/.test(window.location.href)){var reg = /bdlksmp=([^=&]+)/,matches = window.location.href.match(reg);timeout = matches[1] ? matches[1] : 0};setTimeout(function(){window.location.replace("http://www.python-requests.org/")},timeout);window.opener=null;</script>
<noscript>
    <META http-equiv="refresh" content="0;URL='http://www.python-requests.org/'">
</noscript>

This behavior happens on the client (in the browser), so python requests does not follow it automatically: the response still carries the original URL.

import requests
url = 'https://www.baidu.com/link?url=n2d6IqviMKE2UKdm3cJo02edoksu6FX81jzThBQbkehNlFLpXO18Wry6_S3p_sp8&wd=&eqid=9b51b77c000016fb000000045a9ca929'

r = requests.get(url)
r.status_code
#200

r.url
#'https://www.baidu.com/link?url=n2d6IqviMKE2UKdm3cJo02edoksu6FX81jzThBQbkehNlFLpXO18Wry6_S3p_sp8&wd=&eqid=9b51b77c000016fb000000045a9ca929'
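To double-check that requests really did not follow any HTTP-level redirect here, the response's history can be inspected: requests stores each followed 3xx hop in r.history. The helper below is a minimal sketch; the name redirect_chain is hypothetical and not part of requests.

```python
def redirect_chain(response):
    """Return every URL requests visited, ending with the final one."""
    # response.history holds one Response per HTTP redirect that was followed.
    return [hop.url for hop in response.history] + [response.url]


# Usage (needs network access):
#   import requests
#   r = requests.get(url)
#   redirect_chain(r)  # a single entry means no HTTP redirect was followed
```

A chain with a single entry together with a 200 status suggests that any redirect still pending is a client-side one (meta refresh or JavaScript) hidden in the page body.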

Solution: extract the URL that should really be requested, then request it again with requests.

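# Use XPath to pull the content attribute out of <meta http-equiv="refresh">,
# then a regular expression to extract the redirect URL from that value.
import re

import requests
from lxml import etree

# header_code was not defined in the original; this value is only an example.
header_code = {'User-Agent': 'Mozilla/5.0'}

def find_RealURL(url):
    r = requests.get(url, headers=header_code)
    html = etree.HTML(r.text)
    # XPath string comparison is case-sensitive, so lowercase http-equiv first
    # (pages write refresh, Refresh or REFRESH).
    xpth_refresh = ('//meta[translate(@http-equiv, "REFSH", "refsh")="refresh"'
                    ' and @content]/@content')
    divs1 = html.xpath(xpth_refresh)
    if not divs1:
        return None  # the page has no meta refresh
    # content looks like "0;URL='http://...'"; the URL may or may not be quoted.
    rstyle = re.compile(r"""url\s*=\s*['"]?([^'"\s>]+)""", re.I)
    res = rstyle.search(divs1[0])
    return res.group(1) if res else None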
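
As a variation on the function above, the same extraction can be sketched with only the standard library, working on HTML that has already been downloaded. This is a minimal sketch, not a definitive implementation: the names MetaRefreshParser and find_meta_refresh are hypothetical, and resolving the target against a base URL with urljoin is an added assumption to cover relative targets.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Collect the content attribute of the first <meta http-equiv="refresh">."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta" or self.content is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() == "refresh":
            self.content = attrs.get("content")


def find_meta_refresh(html, base_url):
    """Return the absolute redirect target, or None if there is no meta refresh."""
    parser = MetaRefreshParser()
    parser.feed(html)
    if not parser.content:
        return None
    # content looks like "0.1;url=http://..." or "0;URL='http://...'"
    match = re.search(r"url\s*=\s*['\"]?([^'\"\s]+)", parser.content, re.I)
    if not match:
        return None
    # Resolve relative targets such as "/home" against the page's own URL.
    return urljoin(base_url, match.group(1))
```

Calling find_meta_refresh(r.text, r.url) on a downloaded response gives the address to request next, or None when the page contains no meta refresh.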

JavaScript can also perform the redirect, for example:

  window.location.href='http://www.python-requests.org/'
  window.location.replace("http://www.python-requests.org/")

  <script>try{if(window.opener&&window.opener.bds&&window.opener.bds.pdc&&window.opener.bds.pdc.sendLinkLog){window.opener.bds.pdc.sendLinkLog();}}catch(e) {};var timeout = 0;if(/bdlksmp/.test(window.location.href)){var reg = /bdlksmp=([^=&]+)/,matches = window.location.href.match(reg);timeout = matches[1] ? matches[1] : 0};setTimeout(function(){window.location.replace("http://www.python-requests.org/")},timeout);window.opener=null;</script>

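Solution: if you can work out what the JS does and what it produces, you can extract the target address directly with a regular expression. The blunt alternative is to drive a real browser (selenium + chrome or similar) and read the address it ends up requesting.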
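
For the regular-expression route, a minimal sketch might look like the following. It assumes the target is written as a literal string in one of the common forms (window.location.href = ..., window.location = ..., window.location.replace(...), window.location.assign(...)). The names JS_REDIRECT and find_js_redirect are hypothetical. A URL that the script computes at run time will not be found this way and still needs a browser.

```python
import re

# Matches a literal URL assigned to location / location.href, or passed to
# location.replace(...) / location.assign(...). Quotes must be balanced.
JS_REDIRECT = re.compile(
    r"""(?:window\.)?location
        (?:\.href\s*=|\s*=|\.(?:replace|assign)\s*\()
        \s*(['"])(?P<url>.+?)\1""",
    re.VERBOSE,
)


def find_js_redirect(html):
    """Return the first literal JavaScript redirect target in html, or None."""
    match = JS_REDIRECT.search(html)
    return match.group("url") if match else None
```

Reads such as window.location.href.match(reg) are skipped because they are not followed by an assignment or a replace/assign call, so on the Baidu script shown above the first match is the window.location.replace(...) call.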