Articles tagged "爬虫" (crawler)

Function description

Gets the target address a URL redirects to (302 / 301).

Function source

function get_location($url, $ua = 0) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // Skip SSL certificate checks so HTTPS targets with bad certs still resolve
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    $httpheader   = array();
    $httpheader[] = "Accept:*/*";
    $httpheader[] = "Accept-Encoding:gzip,deflate,sdch";
    $httpheader[] = "Accept-Language:zh-CN,zh;q=0.8";
    $httpheader[] = "Connection:close";
    curl_setopt($ch, CURLOPT_HTTPHEADER, $httpheader);
    // Include response headers in the output; the Location header is parsed from them
    curl_setopt($ch, CURLOPT_HEADER, true);
    if ($ua) {
        curl_setopt($ch, CURLOPT_USERAGENT, $ua);
    } else {
        // Fall back to a mobile UA when none is supplied
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Linux; U; Android 4.0.4; es-mx; HTC_One_X Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0");
    }
    curl_setopt($ch, CURLOPT_NOBODY, 1);          // no response body needed, headers are enough
    curl_setopt($ch, CURLOPT_ENCODING, "gzip");
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);  // follow the redirect chain
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $ret = curl_exec($ch);
    curl_close($ch);
    // Grab the first Location header from the collected response headers
    preg_match("/Location: (.*?)\r\n/iU", $ret, $location);
    return isset($location[1]) ? trim($location[1]) : '';
}

Usage example

// Use the default UA
echo get_location('http://example.com');

// Use a custom UA
$ua = 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) AppleWebKit/604.3.5 (KHTML, like Gecko) Version/13.0 MQQBrowser/9.0.0 Mobile/15B87 Safari/604.1 MttCustomUA/2 QBWebViewType/1 WKType/1';
echo get_location('http://example.com', $ua);
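
Note that the function returns the first Location header it finds, so with multi-hop redirects it gives the next hop rather than the final landing page. As a minimal alternative sketch (same cURL extension; the helper name get_final_url is made up here for illustration), the URL of the last request in the redirect chain can be read with curl_getinfo():

// Hypothetical helper, not part of the original function: resolve the final URL
// after all redirects via curl_getinfo() instead of parsing headers manually.
function get_final_url($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_exec($ch);
    // CURLINFO_EFFECTIVE_URL holds the URL of the last request performed
    $final = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);
    return $final;
}

echo get_final_url('http://example.com');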

 

Reposted from: https://blog.csdn.net/weixin_29924799/article/details/116287105

1. Using requests

(1) Without request headers

import requests
url = "https://www.cnblogs.com/dearvee/p/6558571.html"
response = requests.get(url)
response.encoding = 'utf-8'
print(response.text)

(2) With request headers

import requests
url = "https://www.cnblogs.com/dearvee/p/6558571.html"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
print(response.text)


2. Using urllib.request

(1) Without a Request object

from urllib import request
url = "https://www.cnblogs.com/dearvee/p/6558571.html"
response = request.urlopen(url)
print(response.read().decode('utf-8'))

(2) Constructing a Request object

from urllib import request
url = "https://www.cnblogs.com/dearvee/p/6558571.html"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}
req = request.Request(url, headers=headers)
response = request.urlopen(req)
print(response.read().decode('utf-8'))

 

3. Catching errors

from urllib import request, error
url = "https://www.douban.com"
try:
  req = request.Request(url)
  response = request.urlopen(req)
  print(response.read().decode('utf-8'))
except error.HTTPError as e:
  print(e)

 

4. Getting random User-Agent headers

from fake_useragent import UserAgent
ua = UserAgent()
print(ua.ie)       # random Internet Explorer UA string
print(ua.firefox)  # random Firefox UA string
print(ua.chrome)   # random Chrome UA string
print(ua.random)   # random UA string from any browser vendor

 

1. Fetching the page data

$url = "http://www.zongscan.com/demo333/178.html";
$request = new GuzzleRequest('GET', $url);
$client = new \GuzzleHttp\Client();
$response = $client->send($request, ['timeout' => 5]);


2. Getting the crawl result

$content = $response->getBody()->getContents();

 

3. Converting the result to an array

// Only meaningful when the response body is JSON; `true` returns associative arrays
$data = json_decode($content, true);
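
Putting the three steps together, a minimal end-to-end sketch could look like the following (assuming guzzlehttp/guzzle is installed via Composer, and that the target URL actually returns JSON; for a plain HTML page json_decode() returns null and you would work with $content directly):

<?php
require 'vendor/autoload.php'; // Composer autoloader (guzzlehttp/guzzle installed)

use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Request;

$url = "http://www.zongscan.com/demo333/178.html";

$client   = new Client();
$request  = new Request('GET', $url);
$response = $client->send($request, ['timeout' => 5]);

// Raw body of the crawled page
$content = $response->getBody()->getContents();

// Decode only if the endpoint returns JSON; otherwise keep the raw body
$data = json_decode($content, true);
var_dump($data === null ? $content : $data);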

 

Reference: https://www.zongscan.com/demo333/187.html