Scraping a repeated list of links (using Python and Selenium)


emilyyoo 2024. 7. 18. 13:25


Goal:
There is a main page that holds a list of 100 links.
I want to visit every one of those links, scrape the content, and save it all to a text file.

What to figure out first: inspect the HTML.

The pages followed a pattern.
-> Check the HTML tag pattern around the parts to scrape.

- On the main page, the list to read is wrapped in a div.
<div data-block-id="~~"> The links inside this div are the ones to collect.

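- URL pattern of the detail pages: each detail-page URL is the link's href attribute with the query parameters removed.

- Check the CSS class that wraps the text to read on the detail page. After checking a few pages, they all appear to follow the same pattern.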
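
The URL pattern above can be tried on its own, without a browser. This is only a rough sketch using the standard library: `normalize_links` is a hypothetical helper name, the example.com URLs are made up, and dropping the `#fragment` along with the query string is my assumption, not something the target site requires.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize_links(hrefs, base_url):
    """Resolve each href against base_url, drop query and fragment, de-duplicate in order."""
    seen = {}
    for href in hrefs:
        if not href:
            continue  # skip anchors that have no href
        parts = urlsplit(urljoin(base_url, href))
        clean = urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))
        seen.setdefault(clean, None)  # dict keys keep first-seen order
    return list(seen)


if __name__ == '__main__':
    sample = ['https://example.com/item/1?pvs=4', '/item/1', 'https://example.com/item/2?x=1#top']
    for url in normalize_links(sample, 'https://example.com/'):
        print(url)
```

Using a dict instead of a set keeps the links in the order they appear on the main page, which makes the output file easier to compare against the list.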

Code: from the main page, open each detail page -> scrape the text -> write it to the text file, and repeat.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

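# Path to the Chrome WebDriver executable
driver_path = 'YOUR_CHROMEDRIVER_PATH'  # e.g. 'C:/path/to/chromedriver.exe'
main_page_url = 'MAIN_PAGE_URL'  # placeholder: URL of the main page
base_url = 'DETAIL_PAGE_BASE_URL'  # placeholder: URL prefix shared by every detail page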

# Create the ChromeDriver Service object
service = Service(executable_path=driver_path)

# Start the Chrome WebDriver and open the main page
driver = webdriver.Chrome(service=service)
driver.get(main_page_url)

# Wait for the page to load (adjust as needed)
time.sleep(5)

# Collect every link inside div[data-block-id="~~~~"]
div_block_id = '~~~~'
div_element = driver.find_element(By.CSS_SELECTOR, f'div[data-block-id="{div_block_id}"]')
links = div_element.find_elements(By.TAG_NAME, 'a')

# Strip query parameters and remove duplicate URLs, keeping the order on the page
hrefs = [link.get_attribute('href') for link in links]
unique_links = list(dict.fromkeys(href.split('?')[0] for href in hrefs if href))

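# Visit each link and scrape its content
with open('output.txt', 'w', encoding='utf-8') as file:
    for link_url in unique_links:
        # Build an absolute URL if the href is relative
        if not link_url.startswith('http'):
            full_link_url = base_url + link_url.lstrip('/')
        else:
            full_link_url = link_url

        print(f'Navigating to: {full_link_url}')

        # Write a separator line before each page's content
        file.write(f'================= Navigating to: {full_link_url} *****\n')

        # Open the detail page
        driver.get(full_link_url)
        time.sleep(5)  # wait for the page to load (adjust as needed)

        # Extract the data; replace CSS_CLASS_TO_READ with the class found on the detail page
        data_elements = driver.find_elements(By.CSS_SELECTOR, 'div.CSS_CLASS_TO_READ')

        for element in data_elements:
            file.write(element.text + '\n')

        # Return to the main page (not strictly required, since the URLs were collected up front)
        driver.get(main_page_url)
        time.sleep(5)  # wait for the page to load (adjust as needed)

# Quit the WebDriver
driver.quit()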
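
With 100 pages, a fixed `time.sleep(5)` is slow, and a single page that fails to load can stop the whole run. Below is one possible variation, not the code I actually ran: it swaps the fixed sleep for Selenium's `WebDriverWait` and skips pages that raise an error. The helper names `make_selenium_fetcher` and `scrape_to_file`, the 10-second timeout, and the `[skipped: ...]` marker are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List


def make_selenium_fetcher(driver, css_selector: str, timeout: float = 10.0) -> Callable[[str], List[str]]:
    """Build a function that loads a URL and returns the text of the matching elements."""
    # Imported here so the file-writing part below can be tried without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    def fetch(url: str) -> List[str]:
        driver.get(url)
        # Wait until at least one matching element is present, up to `timeout` seconds.
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
        )
        return [element.text for element in elements]

    return fetch


def scrape_to_file(urls: Iterable[str], fetch: Callable[[str], List[str]], out_path: str) -> int:
    """Write the text of every page to out_path and return how many pages were skipped."""
    failures = 0
    with open(out_path, 'w', encoding='utf-8') as file:
        for url in urls:
            file.write(f'================= Navigating to: {url} *****\n')
            try:
                texts = fetch(url)
            except Exception as error:  # e.g. Selenium's TimeoutException
                failures += 1
                file.write(f'[skipped: {type(error).__name__}]\n')
                continue
            for text in texts:
                file.write(text + '\n')
    return failures
```

With a running driver this would be called roughly as `scrape_to_file(unique_links, make_selenium_fetcher(driver, 'div.CSS_CLASS_TO_READ'), 'output.txt')`, wrapped in `try: ... finally: driver.quit()` so the browser closes even if something goes wrong.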

Note: Selenium and the other libraries were installed inside an Anaconda virtual environment.

1. Install Selenium:

conda install -c conda-forge selenium

2. Install additional libraries (e.g. requests, BeautifulSoup):

conda install -c conda-forge requests

conda install -c conda-forge beautifulsoup4
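
To confirm that the active environment actually sees these packages, a small check like the following may help. It is just a sketch with the standard library; `missing_packages` is a hypothetical helper name.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == '__main__':
    # 'bs4' is the import name of the beautifulsoup4 package.
    missing = missing_packages(['selenium', 'requests', 'bs4'])
    print('Missing packages:', missing or 'none')
```

If a package shows up as missing even after installing it, the script is probably running with a different interpreter than the one in the conda environment.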


License: Attribution, NonCommercial, NoDerivatives
