Writing this up took real effort. If you like it, please give it a thumbs-up!!!
Background: I actually scraped a lot of Tianyancha data two years ago, not just phone numbers, addresses, and other basic details, but also each company's shareholders, patents, and outbound investments. That machine was never backed up, though, and the code is gone. Recently an education company in Shandong paid me to pull company phone numbers and addresses from Tianyancha, so I rewrote the crawler from scratch.
Preparation
selenium + PhantomJS, or selenium + Firefox. I went with the latter: selenium + Firefox.
Approach
The code for this isn't hard. It boils down to three steps: simulating login, collecting the company page URLs, and scraping each page's information.
Simulating login
Login URL: https://www.tianyancha.com/login
The login page looks like this:
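Before the login code below will run, you need a driver and the credentials in scope. Here is a minimal setup sketch: it assumes geckodriver is on your PATH, zhangHao and miMa are placeholder names for the account and password that the login code uses, and the find_element_by_css_selector calls require the Selenium 3.x API.

import time
import random

from selenium import webdriver

browser = webdriver.Firefox()           # assumes geckodriver is on PATH
loginURL = 'https://www.tianyancha.com/login'
zhangHao = 'your_phone_number'          # placeholder: the account (phone number)
miMa = 'your_password'                  # placeholder: the matching password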
The Selenium login code:
# Randomized sleeps throughout keep the timing from looking scripted
time.sleep(random.random() + 1)
browser.get(loginURL)
time.sleep(random.random() + random.randint(2, 3))
# Switch the form to the password-login tab
browser.find_element_by_css_selector('div.title:nth-child(2)').click()
time.sleep(random.uniform(0.5, 1))
# Enter the account in the phone-number field
phone = browser.find_element_by_css_selector('div.modulein:nth-child(2) > div:nth-child(2) > input:nth-child(1)')
phone.send_keys(zhangHao)
time.sleep(random.uniform(0.4, 0.9))
# Enter the password
password = browser.find_element_by_css_selector('.input-pwd')
password.send_keys(miMa)
# Click the login button and give the redirect plenty of time
click = browser.find_element_by_css_selector('div.modulein:nth-child(2) > div:nth-child(5)')
click.click()
time.sleep(random.uniform(0.5, 1) + 10)
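Tianyancha sometimes challenges automated logins with a slider captcha, so it is worth confirming the login actually went through before crawling. A minimal check, assuming a successful login redirects away from the /login URL (verify this against the live site):

if 'login' in browser.current_url:
    # Still on the login page, most likely blocked by the slider captcha
    raise RuntimeError('Login appears to have failed; solve the captcha manually and retry.')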
The page after logging in:
The search results URL for a keyword follows the pattern: https://www.tianyancha.com/search?key= + key
Using “滴滴” (Didi) as the example: https://www.tianyancha.com/search?key=滴滴
The results page looks like this:
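One detail worth noting: the keyword is Chinese. browser.get generally handles the encoding for you, but if you assemble request URLs by hand it is safer to percent-encode the key first. A small sketch using only the standard library:

from urllib.parse import quote

key = '滴滴'
searchURL = 'https://www.tianyancha.com/search?key=' + quote(key)
print(searchURL)   # -> https://www.tianyancha.com/search?key=%E6%BB%B4%E6%BB%B4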
Collecting company page URLs
Parse the HTML of the “滴滴” results page to pull out each company's URL.
Note: non-members can only view the first five pages of results.
Code:
from bs4 import BeautifulSoup

# Load the first results page for the keyword
key = '滴滴'
browser.get('https://www.tianyancha.com/search?key=' + key)
time.sleep(random.uniform(0.6, 1) + 2)
soup = BeautifulSoup(browser.page_source, 'lxml')

# Get the page count from the pagination bar; a single page of results has no bar
try:
    pages = soup.find('ul', class_='pagination').find_all('li')[-2].getText().replace('...', '')
except:
    pages = 1
finally:
    print('pages:', pages)

def getUid(soup):
    # Collect every company link on one results page
    urls = []
    divs = soup.find('div', class_='result-list sv-search-container').find_all('div', class_='search-item sv-search-company')
    for div in divs:
        urls.append(div.find('div', class_='header').find('a')['href'])
    return urls

# Non-members can only crawl the first five pages
pages = int(pages)   # the pagination text is a string; cast it before comparing
if pages > 5:
    pages = 5
urls = []
for i in range(1, pages + 1):
    url = 'https://www.tianyancha.com/search/p' + str(i) + '?key=' + key
    browser.get(url)
    time.sleep(random.uniform(0.6, 1) + 2)
    soup = BeautifulSoup(browser.page_source, 'lxml')
    urls.extend(getUid(soup))
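If the same company happens to show up on more than one results page, a one-line, order-preserving dedup cleans the list up before the detail crawl:

urls = list(dict.fromkeys(urls))   # drop duplicates, keep first-seen order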
Scraping company details
Finally, parse each company page's HTML for the fields we need. Looking at the page, the phone number and address both appear right at the top.
The code for this part:
import pandas as pd

# To avoid losing everything on a crash, re-read and re-write the Excel file on every pass
try:
    for url in urls:
        path = r'C:\Users\liuliang_i\Desktop\tianYanCha.xlsx'
        try:
            df1 = pd.read_excel(path)
        except:
            # First run: the file doesn't exist yet, so start with an empty frame
            df1 = pd.DataFrame(columns=['Company', 'Phone', 'Address', 'Url'])
        browser.get(url)
        time.sleep(random.uniform(0.4, 0.8) + 1)
        soup = BeautifulSoup(browser.page_source, 'lxml')
        company = soup.find('div', class_='header').find('h1', class_='name').getText()
        phone = soup.find('div', class_='in-block sup-ie-company-header-child-1').find_all('span')[1].getText()
        address = soup.find('div', class_='auto-folder').find('div').getText()
        # Assigning to row index df1.shape[0] appends a new row;
        # the remaining columns then target that same (now last) row
        df1.loc[df1.shape[0], 'Company'] = company
        df1.loc[df1.shape[0] - 1, 'Phone'] = phone
        df1.loc[df1.shape[0] - 1, 'Address'] = address
        df1.loc[df1.shape[0] - 1, 'Url'] = url
        df1.to_excel(path, index=False)
except:
    pass
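One weak spot above is the outer except: pass, which discards whatever went wrong without a trace. A variant that keeps crawling but records the failing URL (scrape_one here is a hypothetical wrapper around the per-URL fetch/parse/save body above):

import traceback

for url in urls:
    try:
        scrape_one(url)   # hypothetical wrapper for the per-URL body above
    except Exception:
        print('failed:', url)   # keep going, but note which page broke
        traceback.print_exc()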