Articles tagged "爬虫" (crawler)

Function description

Gets the target address a URL redirects to (302 / 301).

Function source

function get_location($url, $ua = 0) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // Skip SSL certificate checks so HTTPS targets with bad certs still resolve
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    $httpheader   = array();
    $httpheader[] = "Accept:*/*";
    $httpheader[] = "Accept-Encoding:gzip,deflate,sdch";
    $httpheader[] = "Accept-Language:zh-CN,zh;q=0.8";
    $httpheader[] = "Connection:close";
    curl_setopt($ch, CURLOPT_HTTPHEADER, $httpheader);
    // Include response headers in the output; the Location header is parsed from them
    curl_setopt($ch, CURLOPT_HEADER, true);
    if ($ua) {
        curl_setopt($ch, CURLOPT_USERAGENT, $ua);
    } else {
        // Fall back to a mobile UA when none is supplied
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Linux; U; Android 4.0.4; es-mx; HTC_One_X Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0");
    }
    curl_setopt($ch, CURLOPT_NOBODY, 1);          // no response body needed, headers are enough
    curl_setopt($ch, CURLOPT_ENCODING, "gzip");
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);  // follow the redirect chain
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $ret = curl_exec($ch);
    curl_close($ch);
    // Grab the first Location header from the collected response headers
    preg_match("/Location: (.*?)\r\n/iU", $ret, $location);
    return isset($location[1]) ? trim($location[1]) : '';
}

Usage example

// Use the default UA
echo get_location('http://example.com');

// Use a custom UA
$ua = 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) AppleWebKit/604.3.5 (KHTML, like Gecko) Version/13.0 MQQBrowser/9.0.0 Mobile/15B87 Safari/604.1 MttCustomUA/2 QBWebViewType/1 WKType/1';
echo get_location('http://example.com', $ua);
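
Note that the function returns the first Location header it finds, so with multi-hop redirects it gives the next hop rather than the final landing page. As a minimal alternative sketch (same cURL extension; the helper name get_final_url is made up here for illustration), the URL of the last request in the redirect chain can be read with curl_getinfo():

// Hypothetical helper, not part of the original function: resolve the final URL
// after all redirects via curl_getinfo() instead of parsing headers manually.
function get_final_url($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_exec($ch);
    // CURLINFO_EFFECTIVE_URL holds the URL of the last request performed
    $final = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);
    return $final;
}

echo get_final_url('http://example.com');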

 

Reposted from: https://blog.csdn.net/weixin_29924799/article/details/116287105

1. Using requests

(1) Without request headers

import requests
url = "https://www.cnblogs.com/dearvee/p/6558571.html"
response = requests.get(url)
response.encoding = 'utf-8'
print(response.text)

(2) With request headers

import requests
url = "https://www.cnblogs.com/dearvee/p/6558571.html"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
print(response.text)


2. Using urllib.request

(1) Without a Request object

from urllib import request
url = "https://www.cnblogs.com/dearvee/p/6558571.html"
response = request.urlopen(url)
print(response.read().decode('utf-8'))

(2) Constructing a Request object

from urllib import request
url = "https://www.cnblogs.com/dearvee/p/6558571.html"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}
req = request.Request(url, headers=headers)
response = request.urlopen(req)
print(response.read().decode('utf-8'))

 

3. Catching errors

from urllib import request, error
url = "https://www.douban.com"
try:
  req = request.Request(url)
  response = request.urlopen(req)
  print(response.read().decode('utf-8'))
except error.HTTPError as e:
  print(e)

 

4. Getting random User-Agent headers

from fake_useragent import UserAgent
ua = UserAgent()
print(ua.ie)       # random Internet Explorer UA string
print(ua.firefox)  # random Firefox UA string
print(ua.chrome)   # random Chrome UA string
print(ua.random)   # random UA string from any browser vendor

 

1. Fetching the page data

$url = "http://www.zongscan.com/demo333/178.html";
$request = new GuzzleRequest('GET', $url);
$client = new \GuzzleHttp\Client();
$response = $client->send($request, ['timeout' => 5]);


2. Getting the crawl result

$content = $response->getBody()->getContents();

 

3. Converting the result to an array

// Only meaningful when the response body is JSON; `true` returns associative arrays
$data = json_decode($content, true);
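
Putting the three steps together, a minimal end-to-end sketch could look like the following (assuming guzzlehttp/guzzle is installed via Composer, and that the target URL actually returns JSON; for a plain HTML page json_decode() returns null and you would work with $content directly):

<?php
require 'vendor/autoload.php'; // Composer autoloader (guzzlehttp/guzzle installed)

use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Request;

$url = "http://www.zongscan.com/demo333/178.html";

$client   = new Client();
$request  = new Request('GET', $url);
$response = $client->send($request, ['timeout' => 5]);

// Raw body of the crawled page
$content = $response->getBody()->getContents();

// Decode only if the endpoint returns JSON; otherwise keep the raw body
$data = json_decode($content, true);
var_dump($data === null ? $content : $data);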

 

Reference: https://www.zongscan.com/demo333/187.html