Web Crawling - Saving Data (Links)

뚜스머리 2017. 7. 25. 20:47

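There are several ways to store data from the web.

You can save the files themselves, or you can save only the links to them.

However, when you store only a link, the file lives on an external server, so if it is changed or removed there is nothing you can do about it.

For essential material, saving the file itself may therefore be the better choice.

Here we use urlretrieve (https://docs.python.org/3/library/urllib.request.html#legacy-interface).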

Example 1

<http://doohaproject.tistory.com/12>

from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

html = urlopen("http://doohaproject.tistory.com/12")
bsObj = BeautifulSoup(html, "html.parser")
# Find the first <span class="imageblock"> and read the href of the <a> inside it.
Location = bsObj.find("span", {"class": "imageblock"}).find("a")["href"]
print(Location)
# Download the linked file and save it as first.zip.
urlretrieve(Location, "first.zip")

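The file is saved under the given file name in the current working directory, which is the script's directory when you run the script from there.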
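The Python documentation lists urlretrieve under its legacy interface and notes that it might become deprecated at some point. As a minimal sketch of an alternative, the same download can be written with urlopen and shutil.copyfileobj. The function name download is hypothetical and not part of the original post.

```python
import shutil
from urllib.request import urlopen


def download(url, filename):
    """Stream the response body at `url` into `filename` and return the file name.

    `download` is a hypothetical helper used here for illustration only.
    """
    # Both the response and the output file are closed when the block exits.
    with urlopen(url) as response, open(filename, "wb") as out_file:
        # Copy in chunks so large files are not loaded into memory at once.
        shutil.copyfileobj(response, out_file)
    return filename
```

In Example 1 this would replace the last line with download(Location, "first.zip").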

Example 2

<http://doohaproject.tistory.com/19>

To save all of the photos posted in the article above:

from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

baseUrl = "https://doohaproject.tistory.com/19"

def getAbsoluteUrl(source):
    # Leave full URLs untouched; protocol-relative ones ("//host/path") get a scheme.
    if source.startswith(("http://", "https://")):
        url = source
    else:
        url = "http:" + source
    return url

html = urlopen(baseUrl)
bsObj = BeautifulSoup(html, "html.parser")
downloadList = bsObj.find_all("img")

for download in downloadList:
    # Skip <img> tags without a "filename" attribute; it is also used as the local file name.
    if download.has_attr("filename"):
        cUrl = getAbsoluteUrl(download["src"])
        print(cUrl)
        urlretrieve(cUrl, download["filename"])

As in Example 1, this script saves the image files in the current working directory, which is the script's directory when you run it from there.
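The getAbsoluteUrl helper above only handles src values that are full URLs or start with "//". As a hedged alternative, urljoin from the standard library resolves protocol-relative and path-relative values against the page URL. The name to_absolute_url is hypothetical, and unlike the original helper it takes the scheme from the page URL rather than always using "http:".

```python
from urllib.parse import urljoin


def to_absolute_url(base_url, source):
    """Resolve an <img> src value against the URL of the page it came from.

    `to_absolute_url` is a hypothetical replacement for getAbsoluteUrl.
    """
    # urljoin returns full URLs unchanged and fills in whatever parts
    # (scheme, host, path) are missing from relative ones.
    return urljoin(base_url, source)
```

Inside the loop it would be called as to_absolute_url(baseUrl, download["src"]).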

Saving files this way leaves them unorganized and hard to manage.

It is therefore better to import os, create a directory, and sort files into it from the moment the data is first downloaded.
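The suggestion above can be sketched as follows, assuming one target directory per download run. The function names build_target_path and save_to_directory are hypothetical and not from the original post.

```python
import os
from urllib.request import urlretrieve


def build_target_path(directory, filename):
    """Create `directory` if needed and return the path for `filename` inside it.

    Hypothetical helper for illustration.
    """
    # exist_ok=True makes repeated runs safe when the directory already exists.
    os.makedirs(directory, exist_ok=True)
    # basename() drops any directory parts so the file stays inside `directory`.
    return os.path.join(directory, os.path.basename(filename))


def save_to_directory(url, directory, filename):
    """Download `url` into `directory` under `filename` and return the saved path."""
    target = build_target_path(directory, filename)
    urlretrieve(url, target)
    return target
```

In Example 2, the last line of the loop could then become save_to_directory(cUrl, "images", download["filename"]), where "images" is an arbitrary directory name chosen for this sketch.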
