python爬虫:网页重定向问题

使用python+requests爬虫时常常遇到请求URL地址变化（返回的URL地址不再是请求时的地址），这些很大可能是网页被重定向导致。所谓重定向(Redirect)就是通过各种方法将各种网络请求重新转到其它位置（URL）。每个网站主页是网站资源的入口，当重定向发生在网站主页时，如果不能正确处理就很有可能会错失这整个网站的内容。

爬取网页时主要三种重定向的情况

服务器端重定向，在服务器端完成，一般来说爬虫可以自适应，是不需要特别处理的，如响应代码301（永久重定向）、302（暂时重定向）等。具体来说，可以通过requests请求得到的response对象中的url、status_code两个属性来判断。当status_code为301、302或其他代表重定向的代码时，表示原请求被重定向；当response对象的url属性与发送请求时的链接不一致时，也说明了原请求被重定向且已经自动处理。
客户端重定向：请求头meta refresh设置，即网页中的<meta>标签声明了网页重定向的链接，这种重定向由浏览器完成，需要编写代码进行处理。例如，某一重定向如下面的html代码第三行中的注释所示，浏览器能够自动跳转，但爬虫只能得到跳转前的页面，不能自动跳转。

如百度搜索requests后第一个结果地址https://www.baidu.com/link?url=n2d6IqviMKE2UKdm3cJo02edoksu6FX81jzThBQbkehNlFLpXO18Wry6_S3p_sp8&wd=&eqid=9b51b77c000016fb000000045a9ca929这个地址会跳转到http://www.python-requests.org/

<meta http-equiv="refresh" content="0.1;url=http://www.redirectedtoxxx.com/"><!--本网页会在0.1秒内refresh为url所指的网页-->  


<meta content="always" name="referrer">
<script>try{if(window.opener&&window.opener.bds&&window.opener.bds.pdc&&window.opener.bds.pdc.sendLinkLog){window.opener.bds.pdc.sendLinkLog();}}catch(e) {};var timeout = 0;if(/bdlksmp/.test(window.location.href)){var reg = /bdlksmp=([^=&]+)/,matches = window.location.href.match(reg);timeout = matches[1] ? matches[1] : 0};setTimeout(function(){window.location.replace("http://www.python-requests.org/")},timeout);window.opener=null;</script>
<noscript>
    <META http-equiv="refresh" content="0;URL='http://www.python-requests.org/'">
</noscript>

这种行为发生在客户端（浏览器），所以使用python requests 时不能实现自动跳转，返回结果仍然是原始URL地址。

import requests
url = 'https://www.baidu.com/link?url=n2d6IqviMKE2UKdm3cJo02edoksu6FX81jzThBQbkehNlFLpXO18Wry6_S3p_sp8&wd=&eqid=9b51b77c000016fb000000045a9ca929'

r = requests.get(url)
r.status_code
#200

r.url
#'https://www.baidu.com/link?url=n2d6IqviMKE2UKdm3cJo02edoksu6FX81jzThBQbkehNlFLpXO18Wry6_S3p_sp8&wd=&eqid=9b51b77c000016fb000000045a9ca929'

解决方法：获取真正要请求的URL，再重新requests

# xpath('//meta[@http-equiv="refresh" and @content]/@content')提取出content的值
# 正则表达式提取出重定向的url
import requests
import re
from lxml import etree

def find_RealURL(url):
    r = requests.get(url, headers=header_code)
    html = r.text
    html = etree.HTML(html)
    xpth_refresh = '//meta[@http-equiv="Refresh" and @content]/@content'
    divs1 = html.xpath(xpth_refresh)[0]
    rstyle = re.compile('URL=(.*)')
    res = re.findall(rstyle, divs1)[0]
    return res

客户端重定向：执行js代码重定向，百度搜索结果跳转也有该方法

  window.location.href='http://www.python-requests.org/'
  window.location.replace("http://www.python-requests.org/")

  <script>try{if(window.opener&&window.opener.bds&&window.opener.bds.pdc&&window.opener.bds.pdc.sendLinkLog){window.opener.bds.pdc.sendLinkLog();}}catch(e) {};var timeout = 0;if(/bdlksmp/.test(window.location.href)){var reg = /bdlksmp=([^=&]+)/,matches = window.location.href.match(reg);timeout = matches[1] ? matches[1] : 0};setTimeout(function(){window.location.replace("http://www.python-requests.org/")},timeout);window.opener=null;</script>

解决方法：如果理清js的执行过程及结果，可以直接正则提取需要的地址；简单粗暴的方法selenium+chrome等模拟浏览器获得请求地址

python爬虫:网页重定向问题

2018/03/05