<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="3.9.5">Jekyll</generator><link href="https://birdf00t.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://birdf00t.github.io/" rel="alternate" type="text/html" hreflang="en" /><updated>2024-07-30T06:25:23+00:00</updated><id>https://birdf00t.github.io/feed.xml</id><title type="html">birdfoot</title><subtitle>my study, project blog</subtitle><author><name>birdfoot</name></author><entry><title type="html">ensemble of trees</title><link href="https://birdf00t.github.io/%ED%98%BC%EA%B3%B5%EB%A8%B8%EC%8B%A0/2024/07/29/ensemble-of-trees.html" rel="alternate" type="text/html" title="ensemble of trees" /><published>2024-07-29T00:00:00+00:00</published><updated>2024-07-29T00:00:00+00:00</updated><id>https://birdf00t.github.io/%ED%98%BC%EA%B3%B5%EB%A8%B8%EC%8B%A0/2024/07/29/ensemble-of-trees</id><content type="html" xml:base="https://birdf00t.github.io/%ED%98%BC%EA%B3%B5%EB%A8%B8%EC%8B%A0/2024/07/29/ensemble-of-trees.html"><![CDATA[<h3 id="ensemble-learning앙상블-학습">ensemble learning(앙상블 학습)</h3>

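<p>Ensemble learning is a family of machine learning algorithms that train several models and combine them to get better predictions.
It is among the best-performing approaches for structured (tabular) data, and most ensemble algorithms are built on decision trees.</p>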

<h3 id="사이킷런의-앙상블-학습-알고리즘">Ensemble learning algorithms in scikit-learn</h3>

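<p><strong>Random forest</strong></p>

<p>Random forest is the representative decision-tree-based ensemble method. Its stable performance makes it widely used, so it is a good first thing to try when applying ensemble learning.</p>

<p>A random forest builds a forest of decision trees, each grown with some randomness, and combines the predictions of the individual trees into the final prediction.</p>

<p>The training data for each tree is itself random: samples are drawn at random from the training data, and the same sample may be drawn more than once. The result is called a bootstrap sample, and by default it has the same size as the training set.</p>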

<p>Bootstrapping means sampling from a data set with replacement.</p>
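
<p>As a rough illustration of bootstrapping, the sketch below draws one bootstrap sample of row indices with NumPy. The training-set size of 1000 and all variable names are assumptions made for the example; they are not taken from scikit-learn's internals. The rows that are never drawn are the "out-of-bag" samples that the oob_score option relies on.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 1000  # hypothetical training-set size

# Draw n_samples row indices with replacement: one bootstrap sample.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows that were never drawn are "out of bag" (OOB) for this tree.
in_bag = np.unique(bootstrap_idx)
oob_idx = np.setdiff1d(np.arange(n_samples), in_bag)

print(len(bootstrap_idx), len(in_bag), len(oob_idx))
```

<p>On average about 63% of the rows end up in a bootstrap sample (1 - 1/e), so roughly a third of the training set is left out of bag for each tree.</p>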

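<p>In addition, at each node split only a randomly chosen subset of the features is considered when searching for the best split.</p>

<p>A random forest classifier considers as many features as the square root of the total number of features (with 4 features, 2 are picked at random at each node).
A random forest regressor considers all features by default.</p>

<p>By default, scikit-learn's random forest trains 100 decision trees this way. For classification it averages the per-class probabilities of the trees and predicts the class with the highest average; for regression it averages the trees' predictions.</p>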
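
<p>The averaging step can be checked by hand. The sketch below is only an illustration: it uses synthetic data from make_classification instead of the wine data, and the names rf_demo, tree_probas and mean_proba are made up for the example. It averages the class probabilities of the individual trees in estimators_ and compares the result with the forest's own predict_proba.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; the post uses the wine data set).
X, y = make_classification(n_samples=300, n_features=4, random_state=42)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X, y)

# Average the class probabilities predicted by every individual tree.
tree_probas = np.stack([tree.predict_proba(X) for tree in rf_demo.estimators_])
mean_proba = tree_probas.mean(axis=0)

# The forest's own probabilities should match the per-tree average.
print(np.allclose(mean_proba, rf_demo.predict_proba(X)))

# The predicted class is the one with the highest averaged probability.
manual_pred = rf_demo.classes_[mean_proba.argmax(axis=1)]
print((manual_pred == rf_demo.predict(X)).all())
```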

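<p>Because each tree sees randomly chosen samples and features, the forest is less prone to overfitting the training set and performs stably, so the default parameters alone often give good results.</p>

<p>The example below loads the wine data and classifies white wine.</p>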

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>

<span class="n">wine</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'https://bit.ly/wine_csv_data'</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">wine</span><span class="p">[[</span><span class="s">'alcohol'</span><span class="p">,</span><span class="s">'sugar'</span><span class="p">,</span><span class="s">'pH'</span><span class="p">]].</span><span class="n">to_numpy</span><span class="p">()</span>
<span class="n">target</span> <span class="o">=</span> <span class="n">wine</span><span class="p">[</span><span class="s">'class'</span><span class="p">].</span><span class="n">to_numpy</span><span class="p">()</span>
<span class="n">train_input</span><span class="p">,</span> <span class="n">test_input</span><span class="p">,</span> <span class="n">train_target</span><span class="p">,</span><span class="n">test_target</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">target</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
</code></pre></div></div>

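<p>Since 100 decision trees are trained, it is a good idea to use all CPU cores.</p>

<p>n_jobs = -1 : use all CPU cores<br />
return_train_score = True : return the training-set scores along with the validation scores (the default is False)</p>

<p>The result shows that the model is overfit to the training set.</p>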

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">cross_validate</span>
<span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">RandomForestClassifier</span>

<span class="n">rf</span> <span class="o">=</span> <span class="n">RandomForestClassifier</span><span class="p">(</span><span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
<span class="n">scores</span> <span class="o">=</span> <span class="n">cross_validate</span><span class="p">(</span><span class="n">rf</span><span class="p">,</span> <span class="n">train_input</span><span class="p">,</span> <span class="n">train_target</span><span class="p">,</span><span class="n">return_train_score</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span><span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">scores</span><span class="p">[</span><span class="s">'train_score'</span><span class="p">]),</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">scores</span><span class="p">[</span><span class="s">'test_score'</span><span class="p">]))</span>
</code></pre></div></div>

<p>-&gt; 0.9973541965122431 0.8905151032797809</p>

<p>feature_importances_ : the feature importances computed once the random forest model has been trained<br />
The values are in the order [alcohol, sugar, pH].</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rf</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_input</span><span class="p">,</span><span class="n">train_target</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">rf</span><span class="p">.</span><span class="n">feature_importances_</span><span class="p">)</span>
</code></pre></div></div>

<p>-&gt; [0.23167441 0.50039841 0.26792718]</p>

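<p>oob_score : when set to True, the model is evaluated on the samples left out of each tree's bootstrap sample (out-of-bag, OOB samples), and the score is stored in oob_score_</p>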

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rf</span><span class="o">=</span><span class="n">RandomForestClassifier</span><span class="p">(</span><span class="n">oob_score</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
<span class="n">rf</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_input</span><span class="p">,</span><span class="n">train_target</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">rf</span><span class="p">.</span><span class="n">oob_score_</span><span class="p">)</span>
</code></pre></div></div>

<p>-&gt; 0.8934000384837406</p>

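<p><strong>Extra trees</strong></p>

<p>Like random forest, extra trees builds an ensemble of decision trees, but by default it does not use bootstrap samples: every tree is trained on the whole training set. Instead, nodes are split at random. A single tree split this way would perform poorly, but combining many of them reduces overfitting and can raise the validation score. Because of the extra randomness, extra trees generally needs more trees than a random forest, but random splitting makes each tree fast to compute.</p>

<p>The cross-validation scores are similar to those of the random forest above. With so few features, the two models differ little.</p>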

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">ExtraTreesClassifier</span>

<span class="n">et</span> <span class="o">=</span> <span class="n">ExtraTreesClassifier</span><span class="p">(</span><span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
<span class="n">scores</span> <span class="o">=</span> <span class="n">cross_validate</span><span class="p">(</span><span class="n">et</span><span class="p">,</span><span class="n">train_input</span><span class="p">,</span> <span class="n">train_target</span><span class="p">,</span><span class="n">return_train_score</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span><span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">scores</span><span class="p">[</span><span class="s">'train_score'</span><span class="p">]),</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">scores</span><span class="p">[</span><span class="s">'test_score'</span><span class="p">]))</span>
</code></pre></div></div>

<p>-&gt; 0.9974503966084433 0.8887848893166506</p>

<p>Feature importances of the extra trees model</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">et</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_input</span><span class="p">,</span><span class="n">train_target</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">et</span><span class="p">.</span><span class="n">feature_importances_</span><span class="p">)</span>
</code></pre></div></div>

<p>-&gt; [0.20183568 0.52242907 0.27573525]</p>

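<p><strong>Gradient boosting</strong></p>

<p>Gradient boosting builds an ensemble from shallow decision trees, each one correcting the errors of the trees before it.<br />
Because the trees are shallow, the method is resistant to overfitting and can generally be expected to generalize well, but training is slow because the trees must be added one after another.</p>

<p>Trees are added to the ensemble in the manner of gradient descent; classification uses the logistic loss and regression uses the mean squared error.</p>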
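
<p>The idea of adding trees that correct earlier errors can be sketched for the regression case, where the negative gradient of the squared error is simply the residual. This is a simplified illustration and not scikit-learn's implementation: the toy data, the learning rate of 0.1, the 100 rounds and every variable name are assumptions.</p>

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))          # hypothetical toy inputs
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)  # hypothetical noisy targets

learning_rate = 0.1
prediction = np.full(len(y), y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(100):
    # For squared error, the negative gradient is simply the residual.
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X, residual)
    # Take a small step in the direction the new tree suggests.
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(mse_history[0], mse_history[-1])
```

<p>The learning rate plays the same role as the step size in gradient descent: a smaller value makes each tree contribute less, so more trees are needed to reach the same training error.</p>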

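<p>Because the method is resistant to overfitting, the number of trees can be increased without the model overfitting quickly<br />
Adding more trees can improve performance further</p>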

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">GradientBoostingClassifier</span>

<span class="n">gb</span> <span class="o">=</span> <span class="n">GradientBoostingClassifier</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
<span class="n">scores</span><span class="o">=</span><span class="n">cross_validate</span><span class="p">(</span><span class="n">gb</span><span class="p">,</span> <span class="n">train_input</span><span class="p">,</span> <span class="n">train_target</span><span class="p">,</span> <span class="n">return_train_score</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span><span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">scores</span><span class="p">[</span><span class="s">'train_score'</span><span class="p">]),</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">scores</span><span class="p">[</span><span class="s">'test_score'</span><span class="p">]))</span>
</code></pre></div></div>

<p>-&gt; 0.8881086892152563 0.8720430147331015</p>

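<p>The number of trees is increased fivefold to 500 and the learning rate is raised from the default 0.1 to 0.2, yet overfitting is still kept in check reasonably well.</p>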

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">gb</span> <span class="o">=</span> <span class="n">GradientBoostingClassifier</span><span class="p">(</span><span class="n">n_estimators</span><span class="o">=</span><span class="mi">500</span><span class="p">,</span> <span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
<span class="n">scores</span> <span class="o">=</span> <span class="n">cross_validate</span><span class="p">(</span><span class="n">gb</span><span class="p">,</span><span class="n">train_input</span><span class="p">,</span><span class="n">train_target</span><span class="p">,</span><span class="n">return_train_score</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span><span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">scores</span><span class="p">[</span><span class="s">'train_score'</span><span class="p">]),</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">scores</span><span class="p">[</span><span class="s">'test_score'</span><span class="p">]))</span>
</code></pre></div></div>

<p>-&gt; 0.9464595437171814 0.8780082549788999</p>

<p>The feature importances show that gradient boosting concentrates on sugar more than the random forest does.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">gb</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_input</span><span class="p">,</span><span class="n">train_target</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">gb</span><span class="p">.</span><span class="n">feature_importances_</span><span class="p">)</span>
</code></pre></div></div>

<p>-&gt; [0.15872278 0.68010884 0.16116839]</p>

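<p><strong>Histogram-based gradient boosting</strong></p>

<p>It is one of the most popular machine learning algorithms for structured data.</p>

<p>The input features are first divided into 256 bins (255 for values plus one reserved for missing values), so the best split at each node can be found very quickly.</p>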
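
<p>The binning step might look roughly like the sketch below, which maps one continuous feature to small integer bin indices using quantiles as bin edges. It is only meant to convey the idea: the random feature and the variable names are assumptions, and scikit-learn's actual binning differs in its details.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
feature = rng.normal(size=10_000)  # hypothetical continuous feature

n_bins = 255  # bins for non-missing values; one more is kept for missing ones
# Use quantiles as bin edges so each bin holds roughly the same number of samples.
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
binned = np.searchsorted(edges, feature).astype(np.uint8)

print(binned.min(), binned.max(), len(np.unique(binned)))
```

<p>After binning, a node only has to try the bin boundaries as split candidates instead of every distinct feature value, which is where the speed-up comes from.</p>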

<p>It keeps overfitting in check while performing slightly better than plain gradient boosting.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Only needed on scikit-learn versions before 1.0; newer versions can import the class directly.
</span><span class="kn">from</span> <span class="nn">sklearn.experimental</span> <span class="kn">import</span> <span class="n">enable_hist_gradient_boosting</span>
<span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">HistGradientBoostingClassifier</span>

<span class="n">hgb</span> <span class="o">=</span> <span class="n">HistGradientBoostingClassifier</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
<span class="n">scores</span> <span class="o">=</span> <span class="n">cross_validate</span><span class="p">(</span><span class="n">hgb</span><span class="p">,</span><span class="n">train_input</span><span class="p">,</span><span class="n">train_target</span><span class="p">,</span><span class="n">return_train_score</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">scores</span><span class="p">[</span><span class="s">'train_score'</span><span class="p">]),</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">scores</span><span class="p">[</span><span class="s">'test_score'</span><span class="p">]))</span>
</code></pre></div></div>

<p>-&gt; 0.9321723946453317 0.8801241948619236</p>

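<p>Permutation importance on the training set<br />
The result holds, in order, the importances from each repeat (importances), their mean (importances_mean), and their standard deviation (importances_std)</p>

<p>As with gradient boosting, the model concentrates on sugar.</p>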

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.inspection</span> <span class="kn">import</span> <span class="n">permutation_importance</span>

<span class="n">hgb</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_input</span><span class="p">,</span><span class="n">train_target</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">permutation_importance</span><span class="p">(</span><span class="n">hgb</span><span class="p">,</span> <span class="n">train_input</span><span class="p">,</span> <span class="n">train_target</span><span class="p">,</span> <span class="n">n_repeats</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span><span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">importances</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">importances_mean</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">importances_std</span><span class="p">)</span>
</code></pre></div></div>

<p>-&gt; [[0.08793535 0.08350972 0.08908986 0.08312488 0.09274581 0.08755051 0.08601116 0.09601693 0.09082163 0.09082163]
 [0.22782374 0.23590533 0.23936887 0.23436598 0.23725226 0.23436598 0.23359631 0.23398114 0.23994612 0.22724649]
 [0.08581874 0.08601116 0.08062344 0.07504329 0.08427939 0.07792957 0.07234943 0.07465846 0.08139311 0.08466423]]
[0.08876275 0.23438522 0.08027708]
[0.00382333 0.00401363 0.00477012]</p>
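
<p>Permutation importance measures how much the score drops when the values of one feature are shuffled. A hand-written version could look like the sketch below. It is an illustration under assumptions: the data is synthetic (make_classification), the model is a random forest instead of the histogram-based model, and manual_permutation_importance is a hypothetical helper, not a scikit-learn function.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption): with shuffle=False the two
# informative features are columns 0 and 1, and column 2 is pure noise.
X, y = make_classification(n_samples=500, n_features=3, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X, y)

def manual_permutation_importance(model, X, y, n_repeats=10, seed=42):
    """Mean and std of the accuracy drop when each column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = np.zeros((X.shape[1], n_repeats))
    for col in range(X.shape[1]):
        for rep in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops[col, rep] = baseline - model.score(X_perm, y)
    return drops.mean(axis=1), drops.std(axis=1)

means, stds = manual_permutation_importance(model, X, y)
print(means, stds)
```

<p>Because the method only needs a fitted model and a score, it works for estimators such as HistGradientBoostingClassifier that do not expose feature_importances_.</p>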

<p>Accuracy on the test set</p>

<p>The test accuracy is about 87%. Ensemble models can give better results than a single decision tree.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">result</span> <span class="o">=</span> <span class="n">permutation_importance</span><span class="p">(</span><span class="n">hgb</span><span class="p">,</span> <span class="n">test_input</span><span class="p">,</span><span class="n">test_target</span><span class="p">,</span><span class="n">n_repeats</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span><span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">hgb</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">test_input</span><span class="p">,</span><span class="n">test_target</span><span class="p">)</span>
</code></pre></div></div>

<p>-&gt; 0.8723076923076923</p>

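<h3 id="사이킷런-라이브러리">scikit-learn library</h3>

<p><strong>RandomForestClassifier (random forest classifier)</strong><br />
n_estimators : number of trees in the ensemble, default 100<br />criterion : impurity measure, default ‘gini’<br />max_depth : maximum tree depth, default None<br />min_samples_split : minimum number of samples required to split a node, default 2<br />max_features : number of features to consider when looking for the best split, default ‘sqrt’ (‘auto’ in older versions), meaning the square root of the number of features<br />bootstrap : whether to use bootstrap samples, default True<br />oob_score : whether to evaluate the trained model on OOB samples, default False<br />n_jobs : number of CPU cores to use, default None (a single core); -1 uses all cores</p>

<p><strong>ExtraTreesClassifier (extra trees classifier) : same parameters as random forest, except that bootstrap defaults to False</strong></p>

<p><strong>GradientBoostingClassifier (gradient boosting classifier)</strong><br />
loss : loss function, default ‘log_loss’ (called ‘deviance’ in older versions), the logistic loss<br />learning_rate : how much each tree contributes to the ensemble, default 0.1<br />n_estimators : number of boosting stages, i.e. trees, default 100<br />subsample : fraction of the training set used to fit each tree, default 1.0<br />max_depth : maximum depth of the individual regression trees, default 3</p>

<p><strong>HistGradientBoostingClassifier (histogram-based gradient boosting classifier)</strong><br />
learning_rate : learning rate, default 0.1; 1.0 means no shrinkage<br />max_iter : number of boosting iterations, i.e. trees, default 100<br />max_bins : number of bins for the input data, default 255</p>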

<p>Besides scikit-learn, gradient boosting is also available in the XGBoost and LightGBM libraries</p>]]></content><author><name>birdfoot</name></author><category term="혼공머신" /><summary type="html"><![CDATA[ensemble learning(앙상블 학습)]]></summary></entry><entry><title type="html">필요한 도구들</title><link href="https://birdf00t.github.io/%EC%9D%B4%EB%AF%B8%EC%A7%80%EC%B2%98%EB%A6%AC%EB%B0%94%EC%9D%B4%EB%B8%94/2024/07/08/tools-for-image-processing.html" rel="alternate" type="text/html" title="필요한 도구들" /><published>2024-07-08T00:00:00+00:00</published><updated>2024-07-08T00:00:00+00:00</updated><id>https://birdf00t.github.io/%EC%9D%B4%EB%AF%B8%EC%A7%80%EC%B2%98%EB%A6%AC%EB%B0%94%EC%9D%B4%EB%B8%94/2024/07/08/tools-for-image-processing</id><content type="html" xml:base="https://birdf00t.github.io/%EC%9D%B4%EB%AF%B8%EC%A7%80%EC%B2%98%EB%A6%AC%EB%B0%94%EC%9D%B4%EB%B8%94/2024/07/08/tools-for-image-processing.html"><![CDATA[<h3 id="1-opencv">1. OpenCV</h3>
<p>OpenCV is an open-source library for reading images and video and for operations such as resizing, rotation, drawing lines and shapes, and splitting channels. It is one of the most widely used libraries in image processing.</p>

<p>OpenCV was built with local machines in mind rather than Google Colab, so not every feature works in Colab. In Colab, ‘cv2.imshow’ is unavailable and ‘cv2_imshow’ from google.colab.patches is used instead.</p>

<p>Python's built-in list is slow for numerical work, so for computation-heavy data such as images, using NumPy arrays instead of lists is faster and takes less memory.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">cv2</span> <span class="c1"># OpenCV
</span><span class="kn">from</span> <span class="nn">google.colab.patches</span> <span class="kn">import</span> <span class="n">cv2_imshow</span> <span class="c1"># displays images in Colab
</span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="c1"># converts the downloaded bytes to a NumPy array
</span><span class="kn">import</span> <span class="nn">urllib.request</span> <span class="c1"># fetches the image from a URL
</span>
<span class="n">resp</span> <span class="o">=</span> <span class="n">urllib</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">urlopen</span><span class="p">(</span><span class="s">'https://raw.githubusercontent.com/Cobslab/imageBible/main/image/like_lenna224.png'</span><span class="p">)</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">asarray</span><span class="p">(</span><span class="nb">bytearray</span><span class="p">(</span><span class="n">resp</span><span class="p">.</span><span class="n">read</span><span class="p">()),</span> <span class="n">dtype</span><span class="o">=</span><span class="s">'uint8'</span><span class="p">)</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">imdecode</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">IMREAD_COLOR</span><span class="p">)</span>

<span class="n">cv2_imshow</span><span class="p">(</span><span class="n">image</span><span class="p">)</span> <span class="c1"># display the image
</span></code></pre></div></div>
<p><img src="/img/pasted_image_20240706013412.png" alt="" /></p>

<p>The array shape is (height, width, channels): 224 pixels high, 224 pixels wide, and 3 color channels</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Image array shape: </span><span class="si">{</span><span class="n">image</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<p>-&gt; Image array shape: (224, 224, 3)</p>

<p><strong>Resizing</strong><br />
cv2.resize changes the image size; here the target size (width, height) is set to (100, 100)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">image_small</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">resize</span><span class="p">(</span><span class="n">image</span><span class="p">,(</span><span class="mi">100</span><span class="p">,</span><span class="mi">100</span><span class="p">))</span>
<span class="n">cv2_imshow</span><span class="p">(</span><span class="n">image_small</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/img/pasted_image_20240706014013.png" alt="" /></p>

<p>The image can also be resized by scale factors: fx=2 doubles the width and fy=1 keeps the height</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">image_big</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">resize</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">dsize</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">fx</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span><span class="n">fy</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">cv2_imshow</span><span class="p">(</span><span class="n">image_big</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/img/pasted_image_20240706014246.png" alt="" /></p>

<p><strong>Flipping</strong><br />
cv2.flip mirrors an image<br />
0 : flip around the horizontal axis (upside down), 1 : flip around the vertical axis (left to right)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">image_fliped</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">flip</span><span class="p">(</span><span class="n">image</span><span class="p">,</span><span class="mi">0</span><span class="p">)</span>
<span class="n">cv2_imshow</span><span class="p">(</span><span class="n">image_fliped</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/img/pasted_image_20240706015140.png" alt="" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">image_fliped</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">flip</span><span class="p">(</span><span class="n">image</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="n">cv2_imshow</span><span class="p">(</span><span class="n">image_fliped</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/img/pasted_image_20240706015150.png" alt="" /></p>
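
<p>Since an image is just an array, the two flips correspond to reversing the row order or the column order. The sketch below shows this with plain NumPy slicing on a made-up 2x3 array (the array and variable names are assumptions for the example) rather than calling cv2.flip itself.</p>

```python
import numpy as np

# Hypothetical 2x3 single-channel "image" so the effect is easy to read.
tiny = np.array([[1, 2, 3],
                 [4, 5, 6]], dtype=np.uint8)

flipped_up_down = tiny[::-1, :]     # same effect as flip code 0
flipped_left_right = tiny[:, ::-1]  # same effect as flip code 1

print(flipped_up_down)
print(flipped_left_right)
```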

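<p><strong>Rotation</strong><br />
cv2.getRotationMatrix2D builds a rotation matrix from a center, an angle in degrees and a scale, and cv2.warpAffine applies it to rotate the image by the desired angle</p>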
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">height</span><span class="p">,</span> <span class="n">width</span> <span class="o">=</span> <span class="n">image</span><span class="p">.</span><span class="n">shape</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span>
<span class="n">matrix</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">getRotationMatrix2D</span><span class="p">((</span><span class="n">width</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span><span class="n">height</span><span class="o">/</span><span class="mi">2</span><span class="p">),</span><span class="mi">90</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">warpAffine</span><span class="p">(</span><span class="n">image</span><span class="p">,</span><span class="n">matrix</span><span class="p">,(</span><span class="n">width</span><span class="p">,</span><span class="n">height</span><span class="p">))</span>
<span class="n">cv2_imshow</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/img/pasted_image_20240706020244.png" alt="" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">matrix</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">getRotationMatrix2D</span><span class="p">((</span><span class="n">width</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span><span class="n">height</span><span class="o">/</span><span class="mi">2</span><span class="p">),</span><span class="mi">30</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">warpAffine</span><span class="p">(</span><span class="n">image</span><span class="p">,</span><span class="n">matrix</span><span class="p">,(</span><span class="n">width</span><span class="p">,</span><span class="n">height</span><span class="p">),</span><span class="n">borderValue</span><span class="o">=</span><span class="mi">200</span><span class="p">)</span>
<span class="n">cv2_imshow</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/img/pasted_image_20240706020311.png" alt="" /></p>
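
<p>To see what the rotation matrix contains, the sketch below rebuilds it with NumPy from the formula given in the OpenCV documentation for getRotationMatrix2D. The helper name rotation_matrix_2d and the sample points are assumptions for the example, and no OpenCV call is made.</p>

```python
import numpy as np

def rotation_matrix_2d(center, angle_deg, scale):
    """Build the 2x3 affine matrix documented for cv2.getRotationMatrix2D."""
    cx, cy = center
    alpha = scale * np.cos(np.deg2rad(angle_deg))
    beta = scale * np.sin(np.deg2rad(angle_deg))
    return np.array([
        [alpha, beta, (1 - alpha) * cx - beta * cy],
        [-beta, alpha, beta * cx + (1 - alpha) * cy],
    ])

matrix_90 = rotation_matrix_2d((112, 112), 90, 1)

# The rotation center must map onto itself.
center_h = np.array([112, 112, 1])  # homogeneous coordinates (x, y, 1)
print(matrix_90 @ center_h)

# A point to the right of the center ends up above it (smaller y),
# i.e. a positive angle rotates counter-clockwise on screen.
print(matrix_90 @ np.array([212, 112, 1]))
```

<p>The third column is the translation that keeps the chosen center fixed; without it the image would rotate around the top-left corner.</p>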

<p><strong>Cropping</strong><br />
Slicing returns a view that refers to the original array's data, so assigning new values to the cropped image changes the original image as well.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">croped_image</span> <span class="o">=</span> <span class="n">image</span><span class="p">[</span><span class="mi">50</span><span class="p">:</span><span class="mi">150</span><span class="p">,</span><span class="mi">50</span><span class="p">:</span><span class="mi">150</span><span class="p">]</span>
<span class="n">croped_image</span><span class="p">[:]</span><span class="o">=</span><span class="mi">200</span>
<span class="n">cv2_imshow</span><span class="p">(</span><span class="n">image</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="pasted_image_20240706021126.png" alt="" />
To leave the original image untouched, make an independent copy of the crop with the copy method</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">resp</span> <span class="o">=</span> <span class="n">urllib</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">urlopen</span><span class="p">(</span><span class="s">'https://raw.githubusercontent.com/Cobslab/imageBible/main/image/like_lenna224.png'</span><span class="p">)</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">asarray</span><span class="p">(</span><span class="nb">bytearray</span><span class="p">(</span><span class="n">resp</span><span class="p">.</span><span class="n">read</span><span class="p">()),</span> <span class="n">dtype</span><span class="o">=</span><span class="s">'uint8'</span><span class="p">)</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">cv2</span><span class="p">.</span><span class="n">imdecode</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">cv2</span><span class="p">.</span><span class="n">IMREAD_COLOR</span><span class="p">)</span>

<span class="n">croped_image</span> <span class="o">=</span> <span class="n">image</span><span class="p">[</span><span class="mi">50</span><span class="p">:</span><span class="mi">150</span><span class="p">,</span> <span class="mi">50</span><span class="p">:</span><span class="mi">150</span><span class="p">].</span><span class="n">copy</span><span class="p">()</span>
<span class="n">croped_image</span><span class="p">[:]</span><span class="o">=</span><span class="mi">200</span>
<span class="n">cv2_imshow</span><span class="p">(</span><span class="n">image</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="pasted_image_20240706021739.png" alt="" /></p>
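<p>The difference between a slice and a copy can be checked on a tiny array. This is a minimal sketch, assuming only NumPy; the 4x4 array and the variable names are made up for illustration and are not the image used above.</p>

```python
import numpy as np

# Hypothetical 4x4 single-channel image filled with zeros.
image = np.zeros((4, 4), dtype=np.uint8)

# Basic slicing returns a view: writing to it writes into `image`.
view = image[1:3, 1:3]
view[:] = 200

# .copy() allocates new memory: writing to it leaves `image` alone.
independent = image[0:2, 0:2].copy()
independent[:] = 50

shares_memory_view = np.shares_memory(image, view)
shares_memory_copy = np.shares_memory(image, independent)
print(image)
print(shares_memory_view, shares_memory_copy)
```

<p>The center of image becomes 200 through the view, while its top-left corner stays 0 because the copy owns separate memory.</p>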

<p><strong>Drawing shapes</strong><br />
OpenCV provides functions for drawing shapes on an image.</p>
<ul>
  <li>Line: cv2.line</li>
  <li>Circle: cv2.circle</li>
  <li>Rectangle: cv2.rectangle</li>
  <li>Ellipse: cv2.ellipse</li>
  <li>Polygon: cv2.polylines, cv2.fillPoly</li>
</ul>

<h3 id="2-tensorflow">2. TensorFlow</h3>

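<p>TensorFlow is an open-source machine learning library developed and released by the Google Brain team. It performs computations on multidimensional arrays and makes parallel processing and deferred execution easy.</p>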

<p><strong>Ease of use</strong></p>

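<p>TensorFlow offers high-level APIs, prebuilt layers and models, and an eager execution mode.<br />
Keras is its high-level API. The Keras API covers the standard model types, so anything from linear regression to a complex deep neural network can be set up and compiled in a few lines of code. It also lets you define model architectures freely. The Model class API defines a graph of layers, which is useful for multi-output models, directed acyclic graphs, and models with shared layers.</p>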

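<p><strong>Scalability</strong></p>
<ul>
  <li>Models can be trained on GPUs and TPUs, which handle matrix and vector operations more efficiently, so more complex models run faster than on a CPU alone.</li>
  <li>Distributed computing is supported, so a model can be trained on several machines at once.</li>
  <li>Tools are provided for deploying trained models at scale.</li>
</ul>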

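<p><strong>Flexibility</strong></p>
<ul>
  <li>It covers the whole workflow from data preprocessing to deployment, not only model architecture design.</li>
  <li>The low-level TensorFlow API lets you build custom layers for a specific problem.</li>
  <li>Tools are provided for writing custom loss functions.</li>
  <li>The tf.data API builds efficient data pipelines and is versatile enough to handle most kinds of data.</li>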
</ul>]]></content><author><name>birdfoot</name></author><category term="이미지처리바이블" /><summary type="html"><![CDATA[1. OpenCV An open-source library for reading images and video on a computer and for operations such as resizing, rotation, drawing lines and shapes, and channel splitting; it is the most widely used library in image processing.]]></summary></entry><entry><title type="html">What Is an Image</title><link href="https://birdf00t.github.io/%EC%9D%B4%EB%AF%B8%EC%A7%80%EC%B2%98%EB%A6%AC%EB%B0%94%EC%9D%B4%EB%B8%94/2024/07/08/what-is-the-image.html" rel="alternate" type="text/html" title="What Is an Image" /><published>2024-07-08T00:00:00+00:00</published><updated>2024-07-08T00:00:00+00:00</updated><id>https://birdf00t.github.io/%EC%9D%B4%EB%AF%B8%EC%A7%80%EC%B2%98%EB%A6%AC%EB%B0%94%EC%9D%B4%EB%B8%94/2024/07/08/what-is-the-image</id><content type="html" xml:base="https://birdf00t.github.io/%EC%9D%B4%EB%AF%B8%EC%A7%80%EC%B2%98%EB%A6%AC%EB%B0%94%EC%9D%B4%EB%B8%94/2024/07/08/what-is-the-image.html"><![CDATA[<h3 id="1-디지털-이미지의-구조">1. Structure of a Digital Image</h3>

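<h4 id="픽셀">Pixel</h4>

<p><strong>Pixel</strong> is a blend of "picture" and "element".<br />
In a grayscale image, a pixel holds a single number that represents brightness. In a color image, a pixel can hold several values, one per color channel such as red, green, and blue.</p>

<p><strong>Resolution</strong> is the number of pixels an image contains.<br />
A resolution of 1920x1080 means the image is 1920 pixels wide and 1080 pixels high. It is determined mainly by the size of the camera sensor. 4K refers to a horizontal pixel count of roughly 4,000 (3840 for 4K UHD, 4096 for DCI 4K).</p>

<p><strong>Pixel density</strong> is measured in pixels per inch (PPI) or pixels per centimeter and indicates how tightly the pixels are packed. The higher the pixel density, the smoother and sharper text and graphics look, so the display feels more realistic. This makes it important for virtual and augmented reality (VR and AR).</p>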
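<p>Pixel density follows from the resolution and the physical diagonal of the screen. The sketch below assumes two hypothetical displays with the same 1920x1080 resolution; the function name and the screen sizes are illustrative only.</p>

```python
import math


def pixels_per_inch(width_px, height_px, diagonal_in):
    """Return pixel density: diagonal length in pixels / diagonal in inches."""
    diagonal_px = math.hypot(width_px, height_px)
    return diagonal_px / diagonal_in


# Hypothetical displays: same resolution, different physical sizes.
monitor_ppi = pixels_per_inch(1920, 1080, 24.0)
phone_ppi = pixels_per_inch(1920, 1080, 6.0)
print(round(monitor_ppi, 1), round(phone_ppi, 1))
```

<p>The same number of pixels packed into a quarter of the diagonal gives four times the density, which is why small high-resolution panels look sharper.</p>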

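<p>A <strong>subpixel</strong> is one of the small colored elements that make up a pixel on a screen.<br />
Colors are produced by adjusting the intensity of the light emitted by the RGB subpixels. Brighter, more energy-efficient displays use RGBW, which adds a white subpixel. With a white subpixel, white can be displayed without lighting all of R, G, and B, which saves energy.</p>

<h4 id="무손실-압축과-손실-압축">Lossless and Lossy Compression</h4>

<p>Image compression speeds up training and transfer, and can improve computer vision model performance. Training on images distorted by compression also teaches a model to recognize degraded images.</p>

<p><strong>Lossless compression</strong><br />
PNG, short for Portable Network Graphics, uses lossless compression. It relies on the DEFLATE algorithm, which combines LZ77 and Huffman coding.</p>

<ul>
  <li>LZ77: replaces repeated sequences with short back-references</li>
  <li>Huffman coding: assigns shorter codes to more frequent symbols</li>
</ul>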
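<p>Python's standard zlib module implements DEFLATE, so the lossless round trip can be tried without an image library. This is a small sketch on made-up, highly repetitive bytes standing in for pixel data; it is not a PNG encoder.</p>

```python
import zlib

# Hypothetical raw pixel data: a short pattern repeated many times.
raw = bytes([0, 0, 0, 255, 255, 255] * 1000)

compressed = zlib.compress(raw, 9)       # 9 = strongest compression level
restored = zlib.decompress(compressed)

print(len(raw), len(compressed))
print(restored == raw)                   # lossless: every byte comes back
```

<p>Repetition is what LZ77 exploits, so data like this shrinks dramatically, while the decompressed bytes are identical to the input.</p>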

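<p><strong>Lossy compression</strong><br />
Lossy compression discards part of the original data, removing detail that is hard to perceive visually. It reaches higher compression ratios than lossless compression. JPEG (Joint Photographic Experts Group) is a lossy format. It is a poor fit for drawings and text, because sharp, high-contrast edges produce artifacts. It combines a transform, quantization, and entropy coding.</p>

<ul>
  <li>RGB to YCbCr conversion (with subsampling): Y is luma (brightness), and Cb and Cr carry the color</li>
  <li>Quantization: fine pixel patterns that the human eye cannot fully resolve are simplified; this is the step where loss occurs</li>
  <li>Entropy coding: encodes each data symbol with a code whose length depends on how likely the symbol is; examples are Huffman coding, arithmetic coding, and LZW coding</li>
</ul>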
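<p>The first two steps can be sketched in plain Python. The conversion uses the full-range JFIF coefficients. The quantize helper is a simplification: in JPEG, quantization is applied to DCT coefficients with a table of step sizes, whereas here a single hypothetical step is applied to one number to show where information is lost.</p>

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range JFIF conversion for 8-bit RGB values."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def quantize(value, step):
    """Round to the nearest multiple of `step`; the rounding is the loss."""
    return round(value / step) * step


y, cb, cr = rgb_to_ycbcr(255, 255, 255)   # pure white
coarse = quantize(37.0, 16)               # 37.0 can no longer be recovered
print(round(y), round(cb), round(cr), coarse)
```

<p>White has full luma and neutral chroma (Cb and Cr at 128). After quantization, 37.0 and every other value near 32 collapse to the same number, which is why the step cannot be undone.</p>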

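<h3 id="2-색-공간">2. Color Spaces</h3>

<h4 id="그레이-스케일">Grayscale</h4>

<p>There is no color spectrum, only brightness values from 0 (black) to 255 (white). This simplicity reduces computation and storage and speeds up processing. For some image processing tasks, such as edge detection or texture analysis, color can get in the way, so grayscale is a better fit.</p>

<h4 id="rgb">RGB</h4>

<p>Named after the first letters of red, green, and blue, RGB represents a wide range of colors from these three primaries. Each of the three is called a channel: a channel stores the information for one color, so an image has a red channel, a green channel, and a blue channel. Colors are produced by mixing the channels at different light intensities.</p>

<h4 id="cmyk">CMYK</h4>

<p>Print needs a different approach from emitted light, so CMYK produces color by combining cyan, magenta, and yellow inks. Because inks are imperfect, the three together often fail to give a true black, so a separate black ink is included.</p>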
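<p>A common textbook approximation converts RGB to CMYK by pulling the shared dark component into the black channel. This is only a sketch: the function name is made up, and real print workflows rely on color profiles rather than this formula.</p>

```python
def rgb_to_cmyk(r, g, b):
    """Naive 8-bit RGB to CMYK conversion (no color profile)."""
    r, g, b = r / 255, g / 255, b / 255
    k = 1 - max(r, g, b)          # black takes over the shared dark part
    if k == 1:                    # pure black: no colored ink needed
        return 0.0, 0.0, 0.0, 1.0
    c = (1 - r - k) / (1 - k)
    m = (1 - g - k) / (1 - k)
    y = (1 - b - k) / (1 - k)
    return c, m, y, k


red = rgb_to_cmyk(255, 0, 0)
black = rgb_to_cmyk(0, 0, 0)
print(red, black)
```

<p>Red needs full magenta and yellow and no black, while black uses the black ink alone instead of stacking three colored inks.</p>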

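<h4 id="hsv">HSV</h4>

<p>HSV splits a color into three components.</p>

<ul>
  <li>Hue: the color itself, such as red, blue, or green</li>
  <li>Saturation: the vividness of the hue; the lower the saturation, the more washed-out the color looks</li>
  <li>Value (brightness): how bright the color is; the lower the value, the darker and duller it looks</li>
</ul>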
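<p>The three components can be inspected with the standard colorsys module, which works on RGB values scaled to [0, 1]. The three sample colors below are chosen for illustration: the same red hue at full strength, washed out, and darkened.</p>

```python
import colorsys

# colorsys expects and returns values in the range [0, 1].
pure_red = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)
pale_red = colorsys.rgb_to_hsv(1.0, 0.5, 0.5)   # same hue, lower saturation
dark_red = colorsys.rgb_to_hsv(0.5, 0.0, 0.0)   # same hue, lower value

for name, hsv in (("pure", pure_red), ("pale", pale_red), ("dark", dark_red)):
    print(name, hsv)
```

<p>All three share hue 0.0; only saturation drops for the pale red and only value drops for the dark red.</p>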

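<h4 id="비트">Bits</h4>

<ul>
  <li>1-bit: two possible values (0 and 1); a pure black-and-white image with no shades of gray</li>
  <li>8-bit: an image with 256 levels, from 0 (black) to 255 (white)</li>
  <li>16-bit: an image with 65,536 levels, from 0 to 65,535</li>
  <li>24-bit (8 bits per channel): an image with 256x256x256 = 16,777,216 colors</li>
  <li>48-bit (16 bits per channel)</li>
</ul>

<p><strong>Why bit depth matters</strong><br />
Bit depth determines the precision with which colors and shades of gray can be represented and distinguished.</p>

<ul>
  <li>Image quality: a higher bit depth gives smoother gradients and fewer artifacts</li>
  <li>File size: a higher bit depth means a larger file</li>
  <li>Editing flexibility: the higher the bit depth, the more an image can be edited without visible quality loss</li>
  <li>Specialized imaging: fields such as astrophotography and medical imaging need a specific bit depth to capture all the data without losing detail</li>
</ul>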
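<p>The trade-off between precision and file size is simple arithmetic. The helpers below are hypothetical names, and the size is for raw, uncompressed pixel data only.</p>

```python
def distinct_values(bits):
    """Number of values representable with the given bit depth."""
    return 2 ** bits


def raw_size_bytes(width, height, bits_per_pixel):
    """Uncompressed size of the pixel data, ignoring headers."""
    return width * height * bits_per_pixel // 8


for bits in (1, 8, 16, 24):
    print(bits, distinct_values(bits))

full_hd_24bit = raw_size_bytes(1920, 1080, 24)
full_hd_48bit = raw_size_bytes(1920, 1080, 48)
print(full_hd_24bit, full_hd_48bit)
```

<p>Doubling the bit depth from 24 to 48 doubles the raw size of a 1920x1080 image from about 6.2 MB to about 12.4 MB.</p>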

<h3 id="3-이미지에서의-텐서-이해하기">3. Understanding Tensors in Images</h3>

<h4 id="텐서의-이미지-표현">Representing Images as Tensors</h4>

<ul>
  <li>A 256x256 grayscale image: (256, 256)</li>
  <li>A 256x256 RGB color image: (256, 256, 3)</li>
  <li>A batch of images, or a video, uses a 4-D tensor whose first dimension is the number of images in the batch</li>
  <li>A batch of 100 color images of size 256x256: (100, 256, 256, 3)</li>
</ul>
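<p>The number of values a tensor holds is the product of its shape, which also gives its memory footprint for uint8 data (one byte per value). A small sketch using only the standard library, with the shapes listed above:</p>

```python
import math

shapes = {
    "grayscale image": (256, 256),
    "RGB image": (256, 256, 3),
    "batch of 100 RGB images": (100, 256, 256, 3),
}

# Element count = product of all dimensions.
element_counts = {name: math.prod(shape) for name, shape in shapes.items()}
for name, shape in shapes.items():
    print(name, shape, element_counts[name])
```

<p>The batch holds 19,660,800 values, so about 19.7 MB as uint8 and four times that once it is cast to float32.</p>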

<h4 id="이미지-불러오기">Loading an Image</h4>

<p>tf.keras.utils.get_file: downloads the image at a URL and returns the local path where it was saved<br />
tf.io.read_file: reads the image file at a path as a binary string<br />
tf.image.decode_jpeg: decodes the binary string into a numeric tensor</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>

<span class="n">url</span> <span class="o">=</span> <span class="s">'https://cobslab.com/wp-content/uploads/2022/02/ai-009-1.jpg'</span>
<span class="n">image_path</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">get_file</span><span class="p">(</span><span class="s">'/content/image.jpg'</span><span class="p">,</span> <span class="n">origin</span><span class="o">=</span><span class="n">url</span><span class="p">)</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">io</span><span class="p">.</span><span class="n">read_file</span><span class="p">(</span><span class="n">image_path</span><span class="p">)</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">image</span><span class="p">.</span><span class="n">decode_jpeg</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">channels</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">image</span>
</code></pre></div></div>

<p>-&gt; &lt;tf.Tensor: shape=(952, 1048, 3), dtype=uint8, numpy=
array([[[ 1, 10, 39],
        [ 1, 10, 39],
        [ 1, 10, 39],
        …(omitted)…
        [ 1, 10, 39],
        [ 1, 10, 39],
        [ 1, 10, 39]]], dtype=uint8)&gt;</p>

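<p><strong>Interpreting the result</strong></p>

<ul>
  <li>tf.Tensor: the output is a tensor object, TensorFlow's core data structure for encapsulating data</li>
  <li>shape=(952, 1048, 3): the size of the tensor (height, width, channels)</li>
  <li>dtype=uint8: dtype is the data type of the tensor's elements; uint8 means each value is an 8-bit unsigned integer between 0 and 255</li>
  <li>numpy=array: the actual pixel values of the image</li>
</ul>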

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>

<span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">image</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/img/20240708.png" alt="" /></p>

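<h4 id="다양한-색-공간으로-작업하기">Working with Different Color Spaces</h4>

<h4 id="랜덤-텐서-생성">Creating a Random Tensor</h4>

<p>tf.random.uniform creates a tensor filled with random values. Random tensors are useful for evaluating algorithm performance and for experimenting with color patterns.</p>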

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rgb_image</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">([</span><span class="mi">100</span><span class="p">,</span><span class="mi">100</span><span class="p">,</span><span class="mi">3</span><span class="p">],</span><span class="n">maxval</span><span class="o">=</span><span class="mi">255</span><span class="p">,</span><span class="n">dtype</span><span class="o">=</span><span class="n">tf</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>

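<span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">cast</span><span class="p">(</span><span class="n">rgb_image</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">uint8</span><span class="p">))</span>  <span class="c1"># imshow expects RGB as floats in [0, 1] or integers in [0, 255]</span>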
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'RGB Image'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/img/20240708-1.png" alt="" /></p>

<h4 id="그레이-스케일로-변환">Converting to Grayscale</h4>

<p>rgb_to_grayscale converts an RGB image to grayscale. A grayscale image has only one channel, so squeeze is used to drop the leftover channel dimension of size 1.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">grayscale_image</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">image</span><span class="p">.</span><span class="n">rgb_to_grayscale</span><span class="p">(</span><span class="n">rgb_image</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">grayscale_image</span><span class="p">.</span><span class="n">numpy</span><span class="p">().</span><span class="n">squeeze</span><span class="p">(),</span><span class="n">cmap</span><span class="o">=</span><span class="s">'gray'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Grayscale Image'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/img/20240708-2.png" alt="" /><br />
The conversion from RGB to grayscale follows the formula grayscale = R x 0.299 + G x 0.587 + B x 0.114, which can be implemented directly in code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">R</span> <span class="o">=</span> <span class="n">rgb_image</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="mf">0.299</span>
<span class="n">G</span> <span class="o">=</span> <span class="n">rgb_image</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span><span class="o">*</span><span class="mf">0.587</span>
<span class="n">B</span> <span class="o">=</span> <span class="n">rgb_image</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="mi">2</span><span class="p">]</span><span class="o">*</span><span class="mf">0.114</span>
<span class="n">Y</span><span class="o">=</span><span class="n">R</span><span class="o">+</span><span class="n">G</span><span class="o">+</span><span class="n">B</span>

<span class="k">print</span><span class="p">(</span><span class="s">'Grayscale value: '</span><span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">grayscale_image</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Grayscale value from the formula: '</span><span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="n">Y</span><span class="p">))</span>
</code></pre></div></div>

<p>-&gt; Grayscale value: tf.Tensor([123.073326], shape=(1,), dtype=float32)<br />
Grayscale value from the formula: tf.Tensor(123.076744, shape=(), dtype=float32)</p>

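<p><strong>Interpreting the result</strong><br />
The value obtained by weighting and summing the R, G, and B channels is close to the value in the actual grayscale image. The two are not identical, most likely because of floating-point rounding and differences in how the TensorFlow function carries out the computation.</p>

<h4 id="hsv로-변환">Converting to HSV</h4>
<p>rgb_to_hsv converts an RGB image to HSV.
Here only the hue channel is extracted and visualized. Passing hsv as the cmap renders it with the hue colormap, which runs from red (low hue values) through all the colors of the rainbow and back to red (high hue values).</p>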

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hsv_image</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">image</span><span class="p">.</span><span class="n">rgb_to_hsv</span><span class="p">(</span><span class="n">rgb_image</span><span class="p">)</span>

<span class="n">hue_channel</span> <span class="o">=</span> <span class="n">hsv_image</span><span class="p">[:,:,</span><span class="mi">0</span><span class="p">]</span>

<span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">hue_channel</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="s">'hsv'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Hue Channel of HSV Image'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">colorbar</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s">'Hue Value'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/img/20240708-3.png" alt="" /><br />
<strong>Interpreting the result</strong><br />
Because the image is random, the hue channel is random as well.</p>

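<h4 id="픽셀-값의-정규화와-표준화">Normalizing and Standardizing Pixel Values</h4>

<p>Pixel values are rescaled to improve model performance and convergence speed.</p>

<p><strong>Normalization</strong><br />
Normalization scales pixel values to the range [0, 1]. For an 8-bit image with pixel values from 0 to 255, dividing by 255 is enough.</p>

<p><strong>Standardization</strong><br />
Standardization rescales the pixel values of an image to have mean 0 and standard deviation 1. It can speed up training, so it is an important preprocessing step for machine learning algorithms. After standardization, pixel values that were in [0, 255] are centered on 0 and mostly fall within a few standard deviations of it (they are not confined to [-1, 1]), which gives training a more consistent data set.</p>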
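<p>Both rescalings can be tried on a handful of numbers before moving to tensors. This sketch uses only the standard library and five made-up pixel values; it uses the population standard deviation.</p>

```python
import statistics

# Hypothetical 8-bit pixel values.
pixels = [0, 64, 128, 192, 255]

# Normalization: scale into [0, 1].
normalized = [p / 255 for p in pixels]

# Standardization: subtract the mean, divide by the standard deviation.
mean = statistics.fmean(pixels)
std = statistics.pstdev(pixels)
standardized = [(p - mean) / std for p in pixels]

print(normalized)
print(standardized)
```

<p>The normalized values stay between 0 and 1, while the standardized values have mean 0 and standard deviation 1 and can go beyond 1 in magnitude (the brightest pixel here lands near 1.41).</p>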

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># reduce_mean computes the mean of a tensor, here over every pixel value; standardization uses it to center the values on 0
</span><span class="n">mean</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reduce_mean</span><span class="p">(</span><span class="n">rgb_image</span><span class="p">)</span>
<span class="c1"># reduce_std computes the standard deviation of a tensor, a measure of how spread out the pixel values are
</span><span class="n">stddev</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">math</span><span class="p">.</span><span class="n">reduce_std</span><span class="p">(</span><span class="n">rgb_image</span><span class="p">)</span>
<span class="c1"># Standardize with the mean and standard deviation computed above
</span><span class="n">normalized_image</span> <span class="o">=</span> <span class="p">(</span><span class="n">rgb_image</span><span class="o">-</span><span class="n">mean</span><span class="p">)</span><span class="o">/</span><span class="n">stddev</span>
<span class="n">rgb_image</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="n">normalized_image</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>

<p>-&gt; (&lt;tf.Tensor: shape=(3,), dtype=float32, numpy=array([ 34.185913, 171.11664 , 108.85696 ], dtype=float32)&gt;,<br />
&lt;tf.Tensor: shape=(3,), dtype=float32, numpy=array([-1.2692811, 0.5908463, -0.2549167], dtype=float32)&gt;)</p>]]></content><author><name>birdfoot</name></author><category term="이미지처리바이블" /><summary type="html"><![CDATA[1. Structure of a Digital Image]]></summary></entry><entry><title type="html">Image Processing and Computer Vision</title><link href="https://birdf00t.github.io/%EC%9D%B4%EB%AF%B8%EC%A7%80%EC%B2%98%EB%A6%AC%EB%B0%94%EC%9D%B4%EB%B8%94/2024/07/03/image-processing-and-computer-vision.html" rel="alternate" type="text/html" title="Image Processing and Computer Vision" /><published>2024-07-03T00:00:00+00:00</published><updated>2024-07-03T00:00:00+00:00</updated><id>https://birdf00t.github.io/%EC%9D%B4%EB%AF%B8%EC%A7%80%EC%B2%98%EB%A6%AC%EB%B0%94%EC%9D%B4%EB%B8%94/2024/07/03/image-processing-and-computer-vision</id><content type="html" xml:base="https://birdf00t.github.io/%EC%9D%B4%EB%AF%B8%EC%A7%80%EC%B2%98%EB%A6%AC%EB%B0%94%EC%9D%B4%EB%B8%94/2024/07/03/image-processing-and-computer-vision.html"><![CDATA[<p>Image processing and computer vision both work with images, but their approaches and goals differ.</p>

<h3 id="1-image-processing이미지-처리">1. Image Processing</h3>

<p><strong>Analog image processing</strong><br />
Applying chemical treatments to film during photography, or manipulating and editing images captured with a camera.</p>

<p><strong>Digital image processing</strong><br />
Relies on mathematical algorithms and computational techniques, and includes image enhancement, image restoration, and feature extraction.</p>

<p><strong>Stages of digital image processing</strong></p>
<ol>
  <li><strong>Image acquisition</strong>: an image obtained from a capture device is converted to digital form, preserving as much quality as possible, so that a computer can process it. Various image sensors and conversion algorithms are used at this stage.</li>
  <li><strong>Image enhancement</strong>: focuses on improving image quality through noise reduction, contrast adjustment, color correction, and similar steps. Techniques include various filters, histogram equalization, and sharpening.</li>
  <li><strong>Image analysis</strong>: analyzes an image by identifying its structure, patterns, and important features through feature extraction, pattern recognition, and object detection. Techniques include edge detection, corner detection, and texture analysis.</li>
  <li><strong>Image interpretation and understanding</strong>: uses the data obtained from an image for tasks such as image classification, image retrieval, and image recognition. Techniques include pattern matching, machine learning, and deep learning.</li>
</ol>
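<p>As one concrete example from the enhancement stage, histogram equalization spreads a narrow range of gray levels over the full range by mapping each level through the cumulative histogram. The sketch below works on a flat list of made-up pixel values; the function name is illustrative, and library implementations may differ in rounding details.</p>

```python
from collections import Counter


def equalize(pixels, levels=256):
    """Histogram-equalize a flat list of integer gray levels."""
    counts = Counter(pixels)
    total = len(pixels)

    # Cumulative count of pixels at or below each gray level.
    cdf = {}
    running = 0
    for value in sorted(counts):
        running += counts[value]
        cdf[value] = running

    cdf_min = min(cdf.values())
    if total == cdf_min:              # constant image: nothing to spread
        return list(pixels)

    scale = (levels - 1) / (total - cdf_min)
    return [round((cdf[p] - cdf_min) * scale) for p in pixels]


low_contrast = [100, 100, 101, 102, 102, 103]
equalized = equalize(low_contrast)
print(equalized)
```

<p>Gray levels that originally spanned only 100 to 103 are stretched across 0 to 255, which is what raises the visible contrast.</p>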

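<h3 id="2-computer-vision컴퓨터-비전">2. Computer Vision</h3>
<p>Computer vision is the field of science that develops a machine's ability to understand and analyze visual data. Image processing concentrates on enhancing, restoring, and transforming digital images, whereas computer vision concentrates on analyzing and interpreting them. Computer vision therefore requires a higher level of understanding and includes tasks such as object recognition, pattern analysis, and image classification.</p>

<p>Computer vision mimics, through digital images, the way we perceive and understand the world, and extracts meaningful information from images. In short, it lets a computer see the way a person does.</p>

<p><strong>Low-level vision tasks</strong></p>
<ul>
  <li>Noise removal</li>
  <li>Contrast enhancement</li>
  <li>Saturation enhancement</li>
  <li>Edge detection</li>
</ul>

<p><strong>Mid-level vision tasks</strong></p>
<ul>
  <li>Segmenting an image into regions</li>
  <li>Segmenting an image into objects</li>
  <li>Estimating optical flow</li>
</ul>

<p><strong>High-level vision tasks</strong></p>
<ul>
  <li>Object recognition</li>
  <li>Scene reconstruction</li>
  <li>Learning and inference on images</li>
</ul>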

<p>Original image -&gt; image processing -&gt; computer vision</p>]]></content><author><name>birdfoot</name></author><category term="이미지처리바이블" /><summary type="html"><![CDATA[Image processing and computer vision both work with images, but their approaches and goals differ.]]></summary></entry><entry><title type="html">Extracting Keywords from Text</title><link href="https://birdf00t.github.io/teamproject/%ED%99%94%EC%9E%A5%ED%92%88%EB%A7%88%EC%BC%80%ED%8C%85%EB%B6%84%EC%84%9D/2024/06/09/KEYWORD-extract.html" rel="alternate" type="text/html" title="Extracting Keywords from Text" /><published>2024-06-09T00:00:00+00:00</published><updated>2024-06-09T00:00:00+00:00</updated><id>https://birdf00t.github.io/teamproject/%ED%99%94%EC%9E%A5%ED%92%88%EB%A7%88%EC%BC%80%ED%8C%85%EB%B6%84%EC%84%9D/2024/06/09/KEYWORD-extract</id><content type="html" xml:base="https://birdf00t.github.io/teamproject/%ED%99%94%EC%9E%A5%ED%92%88%EB%A7%88%EC%BC%80%ED%8C%85%EB%B6%84%EC%84%9D/2024/06/09/KEYWORD-extract.html"><![CDATA[<h1 id="저번에-추출해온-텍스트에서-키워드를-추출했다">Extracting keywords from the text collected last time</h1>

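<p><strong>Process</strong></p>

<p>I looked at the top 100 keywords by frequency, added the unneeded ones to the stopword list, and then checked the top 100 again. Unneeded keywords kept appearing further down the ranking,
so I cut everything below that point, which left 53 keywords.</p>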
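<p>The count, filter, and recount loop can be sketched with collections.Counter. The noun lists and the stopword set below are small made-up stand-ins for the real analysis output, not the project's data.</p>

```python
from collections import Counter

# Hypothetical noun lists, one per document.
docs_nouns = [
    ["보습", "크림", "피부"],
    ["보습", "진정", "크림"],
    ["진정", "보습"],
]
# Hypothetical stopwords: frequent words that carry no signal here.
stopwords = {"크림", "피부"}

counts = Counter(
    noun
    for nouns in docs_nouns
    for noun in nouns
    if noun not in stopwords
)
top_keywords = counts.most_common(2)
print(top_keywords)
```

<p>Inspecting most_common, moving noise words into the stopword set, and counting again mirrors the manual loop described above; the cutoff rank is a judgment call.</p>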

<p><strong>Code</strong></p>

<p>Load the saved data</p>

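<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">re</span>

<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">konlpy.tag</span> <span class="kn">import</span> <span class="n">Komoran</span><span class="p">,</span> <span class="n">Okt</span>  <span class="c1"># assumed: Okt and Komoran come from KoNLPy
</span><span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>

<span class="n">komoran</span> <span class="o">=</span> <span class="n">Komoran</span><span class="p">()</span>  <span class="c1"># assumed: the tagger instance used further down
</span>
<span class="c1"># Read the CSV files
</span><span class="n">olive</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'..//올리브영 데이터//oliveyoung_text.csv'</span><span class="p">)</span>
<span class="n">coupang</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'..//쿠팡 데이터//coupang_text.csv'</span><span class="p">)</span>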

<span class="c1"># Normalize spacing: split the text into morphemes with Okt and rejoin them with single spaces
</span><span class="n">okt</span> <span class="o">=</span> <span class="n">Okt</span><span class="p">()</span>  <span class="c1"># create the tagger once instead of on every call
</span>
<span class="k">def</span> <span class="nf">correct_spacing</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="n">tokens</span> <span class="o">=</span> <span class="n">okt</span><span class="p">.</span><span class="n">morphs</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="k">return</span> <span class="s">' '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span>

<span class="c1"># Clean the text so it is easier to analyze: strip carriage returns, turn newlines into spaces, fix spacing, drop digits
</span><span class="n">docs</span> <span class="o">=</span> <span class="n">olive</span>
<span class="n">docs</span><span class="p">[</span><span class="s">'text'</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span>
    <span class="n">docs</span><span class="p">[</span><span class="s">'text'</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span>
    <span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"</span><span class="se">\r</span><span class="s">"</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="s">" "</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="p">.</span><span class="n">apply</span><span class="p">(</span><span class="n">correct_spacing</span><span class="p">)</span>
    <span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"[0-9]"</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">)</span>

<span class="c1"># Morphological analysis
</span><span class="n">title_token_list</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># POS-tagged morphemes of each text
</span><span class="n">title_token_noun</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># nouns of each text
</span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">docs</span><span class="p">))):</span>
    <span class="n">text</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">docs</span><span class="p">[</span><span class="s">'text'</span><span class="p">][</span><span class="n">i</span><span class="p">])</span>
    <span class="c1"># Run morphological analysis with komoran.pos()
</span>    <span class="k">try</span><span class="p">:</span>
        <span class="n">pos</span> <span class="o">=</span> <span class="n">komoran</span><span class="p">.</span><span class="n">pos</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
        <span class="n">pos</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># do not reuse the previous text's result
</span>    <span class="c1"># Extract nouns of two or more characters with komoran.nouns()
</span>    <span class="k">try</span><span class="p">:</span>
        <span class="n">noun</span> <span class="o">=</span> <span class="p">[</span><span class="n">term</span> <span class="k">for</span> <span class="n">term</span> <span class="ow">in</span> <span class="n">komoran</span><span class="p">.</span><span class="n">nouns</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">term</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">]</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
        <span class="n">noun</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># do not reuse the previous text's result
</span>    <span class="n">title_token_list</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">pos</span><span class="p">)</span> <span class="c1"># store the morphological analysis result
</span>    <span class="n">title_token_noun</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">noun</span><span class="p">)</span> <span class="c1"># store the extracted nouns
</span>
<span class="c1"># Load the stopword list
</span><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"stopwords-ko.txt"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">"UTF-8"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">st</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">readlines</span><span class="p">()</span>
<span class="n">stw</span> <span class="o">=</span> <span class="p">[</span><span class="n">line</span><span class="p">.</span><span class="n">rstrip</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">st</span><span class="p">]</span> <span class="c1"># drop the trailing newline from each entry
</span>
<span class="c1"># Stopwords added by hand
</span><span class="n">user_stopwords</span> <span class="o">=</span> <span class="p">[</span><span class="s">'크림'</span><span class="p">,</span> <span class="s">'아토'</span><span class="p">,</span> <span class="s">'베리'</span><span class="p">,</span><span class="s">'라마'</span><span class="p">,</span> <span class="s">'이드'</span><span class="p">,</span><span class="s">'피부'</span><span class="p">,</span> <span class="s">'크림'</span><span class="p">,</span>
                  <span class="s">'사용'</span><span class="p">,</span> <span class="s">'쇼핑'</span><span class="p">,</span> <span class="s">'이드'</span><span class="p">,</span><span class="s">'제품'</span><span class="p">,</span> <span class="s">'화장품'</span><span class="p">,</span><span class="s">'글리'</span><span class="p">,</span><span class="s">'도움'</span><span class="p">,</span>
                  <span class="s">'레이'</span><span class="p">,</span> <span class="s">'효과'</span><span class="p">,</span> <span class="s">'부위'</span><span class="p">,</span><span class="s">'라마'</span><span class="p">,</span> <span class="s">'폴리'</span><span class="p">,</span><span class="s">'경우'</span><span class="p">,</span><span class="s">'기능'</span><span class="p">,</span>
                  <span class="s">'뷰티'</span><span class="p">,</span> <span class="s">'로켓'</span><span class="p">,</span> <span class="s">'사항'</span><span class="p">,</span> <span class="s">'상품'</span><span class="p">,</span><span class="s">'씨드'</span><span class="p">,</span> <span class="s">'세린'</span><span class="p">,</span> <span class="s">'명품'</span><span class="p">,</span>
                  <span class="s">'라이'</span><span class="p">,</span> <span class="s">'관리'</span><span class="p">,</span> <span class="s">'주의'</span><span class="p">,</span> <span class="s">'용량'</span><span class="p">,</span> <span class="s">'원료'</span><span class="p">,</span> <span class="s">'광선'</span><span class="p">,</span> <span class="s">'유지'</span><span class="p">,</span>
                  <span class="s">'알코올'</span><span class="p">,</span> <span class="s">'공급'</span><span class="p">,</span> <span class="s">'상담'</span><span class="p">,</span> <span class="s">'완료'</span><span class="p">,</span> <span class="s">'기준'</span><span class="p">,</span><span class="s">'아크릴'</span><span class="p">,</span> <span class="s">'적용'</span><span class="p">,</span>
                  <span class="s">'스킨'</span><span class="p">,</span> <span class="s">'개월'</span><span class="p">,</span> <span class="s">'알란'</span><span class="p">,</span> <span class="s">'프릴'</span><span class="p">,</span> <span class="s">'소비자'</span><span class="p">,</span> <span class="s">'보관'</span><span class="p">,</span> <span class="s">'필요'</span><span class="p">,</span>
                  <span class="s">'로에베'</span><span class="p">,</span> <span class="s">'티코'</span><span class="p">,</span> <span class="s">'다이올'</span><span class="p">,</span> <span class="s">'외부'</span><span class="p">,</span><span class="s">'제수'</span><span class="p">,</span><span class="s">'고민'</span><span class="p">,</span><span class="s">'인체'</span><span class="p">,</span>
                  <span class="s">'사이드'</span><span class="p">,</span><span class="s">'특성'</span><span class="p">,</span> <span class="s">'얼굴'</span><span class="p">,</span> <span class="s">'발라'</span><span class="p">,</span> <span class="s">'이나'</span><span class="p">,</span> <span class="s">'에센스'</span><span class="p">,</span> <span class="s">'아마이드'</span><span class="p">,</span>
                  <span class="s">'폴리머'</span><span class="p">,</span> <span class="s">'오스'</span><span class="p">,</span> <span class="s">'당량'</span><span class="p">,</span> <span class="s">'해결'</span><span class="p">,</span> <span class="s">'에탄올'</span><span class="p">,</span><span class="s">'기한'</span><span class="p">,</span> <span class="s">'에이트'</span><span class="p">,</span>
                  <span class="s">'취급'</span><span class="p">,</span> <span class="s">'증상'</span><span class="p">,</span><span class="s">'상처'</span><span class="p">,</span><span class="s">'개인'</span><span class="p">,</span><span class="s">'결과'</span><span class="p">,</span> <span class="s">'올리브'</span><span class="p">,</span> <span class="s">'만족'</span><span class="p">,</span> <span class="s">'도포'</span><span class="p">,</span>
                  <span class="s">'부문'</span><span class="p">,</span><span class="s">'센터'</span><span class="p">,</span> <span class="s">'라스트'</span><span class="p">,</span> <span class="s">'판매'</span><span class="p">,</span> <span class="s">'성인'</span><span class="p">,</span> <span class="s">'브랜드'</span><span class="p">,</span> <span class="s">'연구원'</span><span class="p">,</span> <span class="s">'베리'</span><span class="p">,</span>
                  <span class="s">'기술'</span><span class="p">,</span> <span class="s">'원료'</span><span class="p">,</span> <span class="s">'마스크'</span><span class="p">,</span> <span class="s">'기획'</span><span class="p">,</span><span class="s">'기간'</span><span class="p">,</span> <span class="s">'리프'</span><span class="p">,</span> <span class="s">'전체'</span><span class="p">,</span> <span class="s">'확인'</span><span class="p">,</span> <span class="s">'워드'</span><span class="p">,</span>
                  <span class="s">'비교'</span><span class="p">,</span> <span class="s">'데카'</span><span class="p">,</span> <span class="s">'파트'</span><span class="p">,</span> <span class="s">'코스'</span><span class="p">,</span> <span class="s">'기관'</span><span class="p">,</span> <span class="s">'반응'</span><span class="p">,</span> <span class="s">'청규'</span><span class="p">,</span> <span class="s">'아토'</span><span class="p">,</span><span class="s">'무도'</span><span class="p">,</span><span class="s">'설문'</span><span class="p">,</span>
                  <span class="s">'토너'</span><span class="p">,</span><span class="s">'랭킹'</span><span class="p">,</span><span class="s">'온라인'</span><span class="p">,</span> <span class="s">'김정문'</span><span class="p">,</span> <span class="s">'수다'</span><span class="p">,</span><span class="s">'라보'</span><span class="p">,</span> <span class="s">'지오'</span><span class="p">,</span> <span class="s">'증가'</span><span class="p">,</span> <span class="s">'방법'</span><span class="p">,</span><span class="s">'페이'</span><span class="p">,</span>
                    <span class="s">'집중'</span><span class="p">,</span> <span class="s">'물질'</span><span class="p">,</span> <span class="s">'리얼'</span><span class="p">,</span><span class="s">'직후'</span><span class="p">,</span><span class="s">'세계'</span><span class="p">,</span><span class="s">'경과'</span><span class="p">,</span><span class="s">'증정'</span><span class="p">,</span><span class="s">'이미지'</span><span class="p">,</span> <span class="s">'페이스'</span><span class="p">,</span>
                  <span class="s">'롤라이'</span><span class="p">,</span><span class="s">'대비'</span><span class="p">,</span><span class="s">'키스'</span><span class="p">,</span> <span class="s">'카프'</span><span class="p">,</span> <span class="s">'증정'</span><span class="p">,</span> <span class="s">'티에'</span><span class="p">,</span> <span class="s">'일리'</span><span class="p">,</span> <span class="s">'반점'</span><span class="p">,</span> <span class="s">'녹차'</span><span class="p">,</span> <span class="s">'가지'</span><span class="p">,</span>
                  <span class="s">'고시'</span><span class="p">,</span> <span class="s">'방법'</span><span class="p">,</span> <span class="s">'자제'</span><span class="p">,</span> <span class="s">'어린이'</span><span class="p">,</span> <span class="s">'이지'</span><span class="p">,</span><span class="s">'고객'</span><span class="p">,</span> <span class="s">'밸런스'</span><span class="p">,</span> <span class="s">'베타'</span><span class="p">,</span> <span class="s">'트리'</span><span class="p">,</span> <span class="s">'공정'</span><span class="p">,</span>
                  <span class="s">'제조'</span><span class="p">,</span> <span class="s">'책임'</span><span class="p">,</span> <span class="s">'에도'</span><span class="p">,</span> <span class="s">'왁스'</span><span class="p">,</span> <span class="s">'토인'</span><span class="p">,</span> <span class="s">'레시틴'</span><span class="p">,</span> <span class="s">'거래'</span><span class="p">,</span> <span class="s">'위원회'</span><span class="p">,</span> <span class="s">'분쟁'</span><span class="p">,</span> <span class="s">'판매업'</span><span class="p">,</span>
                  <span class="s">'프로'</span><span class="p">,</span> <span class="s">'나무'</span><span class="p">,</span> <span class="s">'다이'</span><span class="p">,</span> <span class="s">'번호'</span><span class="p">,</span> <span class="s">'충전'</span><span class="p">,</span> <span class="s">'글로리'</span><span class="p">,</span> <span class="s">'여과'</span><span class="p">,</span> <span class="s">'페이'</span><span class="p">,</span> <span class="s">'보상'</span><span class="p">,</span> <span class="s">'형성'</span><span class="p">,</span> <span class="s">'리페'</span><span class="p">,</span> <span class="s">'에이치'</span><span class="p">,</span><span class="s">'비탄'</span><span class="p">]</span>

<span class="n">stw</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">user_stopwords</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">csv</span>
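<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'불용어.csv'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">,</span> <span class="n">newline</span><span class="o">=</span><span class="s">''</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">'utf-8'</span><span class="p">)</span> <span class="k">as</span> <span class="nb">file</span><span class="p">:</span>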
    <span class="n">write</span> <span class="o">=</span> <span class="n">csv</span><span class="p">.</span><span class="n">writer</span><span class="p">(</span><span class="nb">file</span><span class="p">)</span>
    <span class="n">write</span><span class="p">.</span><span class="n">writerow</span><span class="p">(</span><span class="n">stw</span><span class="p">)</span>

<span class="c1"># Remove stopwords
</span><span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">stw</span><span class="p">:</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">title_token_noun</span><span class="p">)):</span>
        <span class="c1"># Remove every occurrence of the stopword from this token list
</span>        <span class="k">try</span><span class="p">:</span>
            <span class="k">while</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">title_token_noun</span><span class="p">[</span><span class="n">i</span><span class="p">]:</span>
                <span class="n">title_token_noun</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">remove</span><span class="p">(</span><span class="n">word</span><span class="p">)</span>
        <span class="k">except</span> <span class="p">(</span><span class="nb">TypeError</span><span class="p">,</span> <span class="nb">AttributeError</span><span class="p">):</span>
            <span class="c1"># Skip entries that are not token lists (e.g. missing values)
</span>            <span class="k">pass</span>

<span class="c1"># Save the keywords
</span><span class="n">raw_data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">()</span>
<span class="n">raw_data</span><span class="p">[</span><span class="s">'keyword'</span><span class="p">]</span> <span class="o">=</span> <span class="n">title_token_noun</span>

<span class="n">raw_data</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"oliveyoung_keyword.csv"</span><span class="p">)</span>

<span class="c1"># Most frequent keywords
</span><span class="n">noun</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">itertools</span><span class="p">.</span><span class="n">chain</span><span class="p">(</span><span class="o">*</span><span class="n">title_token_noun</span><span class="p">))</span> <span class="c1"># flatten the list of token lists
</span><span class="n">count</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span><span class="n">noun</span><span class="p">)</span>
<span class="n">top</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">count</span><span class="p">.</span><span class="n">most_common</span><span class="p">(</span><span class="mi">53</span><span class="p">))</span>
<span class="n">keyword</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="n">top</span><span class="p">.</span><span class="n">keys</span><span class="p">())</span>

<span class="c1"># Save the final keywords
</span><span class="n">raw_data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">()</span>
<span class="n">raw_data</span><span class="p">[</span><span class="s">'top53'</span><span class="p">]</span> <span class="o">=</span> <span class="n">top</span>
<span class="n">raw_data</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"oliveyoung_top.csv"</span><span class="p">)</span>

<span class="c1"># Check that each product's text contains at least one keyword;
</span><span class="c1"># print the index of every product that matches none of them
</span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">olive</span><span class="p">)):</span>
    <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">keyword</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">olive</span><span class="p">[</span><span class="s">'text'</span><span class="p">][</span><span class="n">i</span><span class="p">].</span><span class="n">find</span><span class="p">(</span><span class="n">j</span><span class="p">)</span><span class="o">!=-</span><span class="mi">1</span><span class="p">:</span>
            <span class="k">break</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">j</span><span class="o">==</span><span class="n">keyword</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]:</span>
                <span class="k">print</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
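
# Hedged alternative sketch (not the author's code): the same check written with any().
# products_without_keyword is a hypothetical helper name; texts is assumed to be a
# sequence of strings such as olive['text'], and keywords a list such as keyword.
def products_without_keyword(texts, keywords):
    # Collect the index of every text that contains none of the keywords
    return [i for i, t in enumerate(texts) if not any(k in str(t) for k in keywords)]

# Example call: print(products_without_keyword(olive['text'], keyword))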
</code></pre></div></div>]]></content><author><name>birdfoot</name></author><category term="teamProject" /><category term="화장품마케팅분석" /><summary type="html"><![CDATA[I extracted keywords from the text collected last time]]></summary></entry><entry><title type="html">Extracting text from Olive Young images</title><link href="https://birdf00t.github.io/teamproject/%ED%99%94%EC%9E%A5%ED%92%88%EB%A7%88%EC%BC%80%ED%8C%85%EB%B6%84%EC%84%9D/2024/06/04/OLIVEYOUNG-img-text.html" rel="alternate" type="text/html" title="올리브영 이미지에서 텍스트 추출하기" /><published>2024-06-04T00:00:00+00:00</published><updated>2024-06-04T00:00:00+00:00</updated><id>https://birdf00t.github.io/teamproject/%ED%99%94%EC%9E%A5%ED%92%88%EB%A7%88%EC%BC%80%ED%8C%85%EB%B6%84%EC%84%9D/2024/06/04/OLIVEYOUNG-img-text</id><content type="html" xml:base="https://birdf00t.github.io/teamproject/%ED%99%94%EC%9E%A5%ED%92%88%EB%A7%88%EC%BC%80%ED%8C%85%EB%B6%84%EC%84%9D/2024/06/04/OLIVEYOUNG-img-text.html"><![CDATA[<h2 id="화장품마다-광고-키워드를-가져오기-위해-광고-사진에서-텍스트를-추출했다">To get advertising keywords for each cosmetic product, I extracted text from the ad images</h2>

<p><strong>Process</strong></p>

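<p>I first tried tesseract. English came out fine, but Korean text was barely extracted and came out garbled.<br />
Since we are analyzing marketing in Korea, Korean matters more than English, so I gave up on tesseract
and looked for another approach.</p>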

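<p>Next I tried Google's Cloud Vision API.<br />
Korean was extracted without problems, but it failed on a large number of images. When I searched, other people
were running into the same problem, but nobody reported having solved it.</p>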

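<p>So next I switched to code that drives an image-to-text site dynamically in the browser and collects the extracted text.
Because this approach simply relies on another, already finished site,<br />
it ran without problems for most images, but for a few large GIFs the site asked me to buy a premium plan.
I checked the GIFs that failed and they contained no important text, so I dropped them.
In this post I only cover the final approach.</p>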

<p><strong>Loading the saved data</strong></p>

<p>Load the image links saved earlier.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Read the CSV file
</span><span class="n">img_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'..//올리브영 제품 링크//oliveyoung_img_link.csv'</span><span class="p">)</span>


<span class="c1"># Image link column (assumed to be named 'img_link')
</span><span class="n">image_links1</span> <span class="o">=</span> <span class="n">img_df</span><span class="p">[</span><span class="s">'img_link'</span><span class="p">].</span><span class="n">tolist</span><span class="p">()</span>

<span class="c1"># Clean up the URLs: strip the list brackets and quotes from each stringified list
</span><span class="n">image_links2</span> <span class="o">=</span> <span class="p">[</span><span class="n">url</span><span class="p">.</span><span class="n">strip</span><span class="p">(</span><span class="s">"[]"</span><span class="p">)</span> <span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">image_links1</span> <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="nb">str</span><span class="p">)]</span>
<span class="n">image_links2</span> <span class="o">=</span> <span class="p">[</span><span class="n">url</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"</span><span class="se">\'</span><span class="s">"</span><span class="p">,</span><span class="s">""</span><span class="p">)</span> <span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">image_links2</span> <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="nb">str</span><span class="p">)]</span>

<span class="n">links</span><span class="o">=</span><span class="p">[[]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">image_links2</span><span class="p">))]</span>

<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">image_links2</span><span class="p">)):</span>
    <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">image_links2</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">','</span><span class="p">):</span>
        <span class="n">j</span><span class="o">=</span><span class="n">j</span><span class="p">.</span><span class="n">strip</span><span class="p">()</span>
        <span class="n">links</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">j</span><span class="p">)</span>
</code></pre></div></div>
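
<p>The cleanup above strips the brackets and quotes by hand. As a rough sketch, and assuming each <code>img_link</code> cell is a Python list written out as a string (for example <code>"['a.jpg', 'b.gif']"</code>), the same parsing could be done with <code>ast.literal_eval</code>. The helper name <code>parse_link_cell</code> and the sample URLs are made up for illustration and are not part of the original code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import ast


def parse_link_cell(cell):
    """Turn a stringified list such as "['a.jpg', 'b.gif']" into a real list."""
    if not isinstance(cell, str):
        return []  # e.g. NaN for a product with no saved links
    try:
        parsed = ast.literal_eval(cell.strip())
    except (ValueError, SyntaxError):
        # Not a valid Python literal: fall back to manual stripping and splitting
        parts = cell.strip("[]").replace("'", "").split(",")
        return [p.strip() for p in parts if p.strip()]
    if isinstance(parsed, (list, tuple)):
        return [str(p).strip() for p in parsed]
    return [str(parsed).strip()]


# Hypothetical sample cells standing in for img_df['img_link'].tolist()
sample_cells = ["['https://example.com/a.jpg', 'https://example.com/b.gif']", float("nan")]
sample_links = [parse_link_cell(c) for c in sample_cells]
</code></pre></div></div>

<p>Unlike the list comprehension above, which drops non-string cells, this sketch returns an empty list for them, so the positions in the result still line up with the rows of the CSV.</p>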

<p>Code that runs the extraction on the site and saves the final result</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">text_get</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">):</span>
    <span class="c1"># if links[x][y][-3:]=='gif':
</span>    <span class="c1">#     return
</span>    <span class="n">driver</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">links</span><span class="p">[</span><span class="n">x</span><span class="p">][</span><span class="n">y</span><span class="p">])</span>
    <span class="n">links</span><span class="p">[</span><span class="n">x</span><span class="p">][</span><span class="n">y</span><span class="p">]</span><span class="o">=</span><span class="n">driver</span><span class="p">.</span><span class="n">current_url</span>
    <span class="n">driver</span><span class="p">.</span><span class="n">back</span><span class="p">()</span>
    <span class="n">driver</span><span class="p">.</span><span class="n">find_element</span><span class="p">(</span><span class="n">By</span><span class="p">.</span><span class="n">CLASS_NAME</span><span class="p">,</span> <span class="s">'enterUrl '</span><span class="p">).</span><span class="n">click</span><span class="p">()</span>
    <span class="n">driver</span><span class="p">.</span><span class="n">execute_script</span><span class="p">(</span><span class="s">"window.scrollTo(0,500)"</span><span class="p">)</span>
    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.5</span><span class="p">)</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">find_element</span><span class="p">(</span><span class="n">By</span><span class="p">.</span><span class="n">CSS_SELECTOR</span><span class="p">,</span> <span class="s">"#uploadfile &gt; div &gt; div.col-xl-9 &gt; div.col-12.image_back.toolColor.text-center.m-0-auto.d-block.br_10.p-3 &gt; div &gt; div.col-12.urlArea.d_none.mt-3 &gt; div &gt; input"</span><span class="p">).</span><span class="n">send_keys</span><span class="p">(</span><span class="n">links</span><span class="p">[</span><span class="n">x</span><span class="p">][</span><span class="n">y</span><span class="p">])</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">find_element</span><span class="p">(</span><span class="n">By</span><span class="p">.</span><span class="n">CLASS_NAME</span><span class="p">,</span> <span class="s">'urlBtn '</span><span class="p">).</span><span class="n">click</span><span class="p">()</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">execute_script</span><span class="p">(</span><span class="s">"window.scrollTo(0,800)"</span><span class="p">)</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">clickable_element</span> <span class="o">=</span> <span class="n">WebDriverWait</span><span class="p">(</span><span class="n">driver</span><span class="p">,</span> <span class="mi">10</span><span class="p">).</span><span class="n">until</span><span class="p">(</span>
            <span class="n">EC</span><span class="p">.</span><span class="n">element_to_be_clickable</span><span class="p">((</span><span class="n">By</span><span class="p">.</span><span class="n">CLASS_NAME</span><span class="p">,</span> <span class="s">'convertFiles'</span><span class="p">))</span>
        <span class="p">)</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
        <span class="n">clickable_element</span><span class="p">.</span><span class="n">click</span><span class="p">()</span>
        <span class="c1"># driver.find_element(By.CLASS_NAME, 'convertFiles').click()
</span>        <span class="n">driver</span><span class="p">.</span><span class="n">execute_script</span><span class="p">(</span><span class="s">"window.scrollTo(0,700)"</span><span class="p">)</span>
        <span class="c1"># Wait until the button becomes clickable
</span>        <span class="n">clickable_element</span> <span class="o">=</span> <span class="n">WebDriverWait</span><span class="p">(</span><span class="n">driver</span><span class="p">,</span> <span class="mi">10</span><span class="p">).</span><span class="n">until</span><span class="p">(</span>
            <span class="n">EC</span><span class="p">.</span><span class="n">element_to_be_clickable</span><span class="p">((</span><span class="n">By</span><span class="p">.</span><span class="n">CLASS_NAME</span><span class="p">,</span> <span class="s">'copyData'</span><span class="p">))</span>
        <span class="p">)</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">25</span><span class="p">)</span>
        <span class="n">clickable_element</span><span class="p">.</span><span class="n">click</span><span class="p">()</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">execute_script</span><span class="p">(</span><span class="s">"window.open('{}')"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="s">'https://n.lrl.kr/'</span><span class="p">))</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">switch_to</span><span class="p">.</span><span class="n">window</span><span class="p">(</span><span class="n">driver</span><span class="p">.</span><span class="n">window_handles</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
        <span class="c1"># driver.find_element(By.CLASS_NAME, 'copyData').click()
</span>        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">a</span><span class="o">=</span><span class="n">driver</span><span class="p">.</span><span class="n">find_element</span><span class="p">(</span><span class="n">By</span><span class="p">.</span><span class="n">CLASS_NAME</span><span class="p">,</span> <span class="s">'note-editable'</span><span class="p">)</span>
        <span class="n">a</span><span class="p">.</span><span class="n">click</span><span class="p">()</span>
        <span class="n">a</span><span class="p">.</span><span class="n">send_keys</span><span class="p">(</span><span class="n">Keys</span><span class="p">.</span><span class="n">CONTROL</span><span class="p">,</span> <span class="s">'v'</span><span class="p">)</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">html</span><span class="o">=</span><span class="n">driver</span><span class="p">.</span><span class="n">page_source</span>
        <span class="n">soup</span><span class="o">=</span><span class="n">BS</span><span class="p">(</span><span class="n">html</span><span class="p">,</span><span class="s">'html.parser'</span><span class="p">)</span>
        <span class="n">result</span><span class="o">=</span><span class="n">soup</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="s">"div"</span><span class="p">,</span><span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">"class"</span><span class="p">:</span><span class="s">"note-editable"</span><span class="p">})</span>
        <span class="k">if</span> <span class="n">result</span><span class="p">.</span><span class="n">text</span><span class="o">!=</span><span class="s">''</span><span class="p">:</span>
            <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">[</span><span class="n">x</span><span class="p">])</span><span class="o">&gt;</span><span class="n">y</span><span class="p">:</span>
                <span class="n">text</span><span class="p">[</span><span class="n">x</span><span class="p">][</span><span class="n">y</span><span class="p">]</span><span class="o">=</span><span class="n">result</span><span class="p">.</span><span class="n">text</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">text</span><span class="p">[</span><span class="n">x</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">text</span><span class="p">)</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">+</span><span class="s">"-"</span><span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="n">y</span><span class="p">))</span>
        <span class="k">print</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
    <span class="k">finally</span><span class="p">:</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

        <span class="n">driver</span><span class="p">.</span><span class="n">switch_to</span><span class="p">.</span><span class="n">window</span><span class="p">(</span><span class="n">driver</span><span class="p">.</span><span class="n">window_handles</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">execute_script</span><span class="p">(</span><span class="s">"window.scrollTo(0,1200)"</span><span class="p">)</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">find_element</span><span class="p">(</span><span class="n">By</span><span class="p">.</span><span class="n">CLASS_NAME</span><span class="p">,</span> <span class="s">'re_convert'</span><span class="p">).</span><span class="n">click</span><span class="p">()</span>

<span class="n">driver</span> <span class="o">=</span> <span class="n">webdriver</span><span class="p">.</span><span class="n">Chrome</span><span class="p">()</span>
<span class="n">driver</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'https://www.cardscanner.co/ko/image-to-text'</span><span class="p">)</span>
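<span class="c1"># re_index holds "x-y" strings (product index, image index); it is not defined in this post
</span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">re_index</span><span class="p">:</span>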
    <span class="n">text_get</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">i</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"-"</span><span class="p">)[</span><span class="mi">0</span><span class="p">]),</span><span class="nb">int</span><span class="p">(</span><span class="n">i</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"-"</span><span class="p">)[</span><span class="mi">1</span><span class="p">]))</span>

<span class="n">raw_data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">()</span>
<span class="n">raw_data</span><span class="p">[</span><span class="s">'text'</span><span class="p">]</span> <span class="o">=</span> <span class="n">plus_text</span>
<span class="n">raw_data</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"oliveyoung_text.csv"</span><span class="p">)</span>
</code></pre></div></div>]]></content><author><name>birdfoot</name></author><category term="teamProject" /><category term="화장품마케팅분석" /><summary type="html"><![CDATA[To get advertising keywords for each cosmetic product, I extracted text from the ad images]]></summary></entry><entry><title type="html">Collecting Olive Young product links</title><link href="https://birdf00t.github.io/teamproject/%ED%99%94%EC%9E%A5%ED%92%88%EB%A7%88%EC%BC%80%ED%8C%85%EB%B6%84%EC%84%9D/2024/06/01/OLIVEYOUNG-link.html" rel="alternate" type="text/html" title="올리브영 제품 링크 가져오기" /><published>2024-06-01T00:00:00+00:00</published><updated>2024-06-01T00:00:00+00:00</updated><id>https://birdf00t.github.io/teamproject/%ED%99%94%EC%9E%A5%ED%92%88%EB%A7%88%EC%BC%80%ED%8C%85%EB%B6%84%EC%84%9D/2024/06/01/OLIVEYOUNG-link</id><content type="html" xml:base="https://birdf00t.github.io/teamproject/%ED%99%94%EC%9E%A5%ED%92%88%EB%A7%88%EC%BC%80%ED%8C%85%EB%B6%84%EC%84%9D/2024/06/01/OLIVEYOUNG-link.html"><![CDATA[<h2 id="화장품-키워드-마케팅-분석-전-올리브영에서-제품-링크-가져오기">Collecting product links from Olive Young before the cosmetics keyword marketing analysis</h2>

<p><strong>Thoughts</strong></p>

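<p>This time I collected all the relevant links before fetching the data.
I was pressed for time, so I wrote the code with the mindset that it only had to run rather than look nice,
and it is messy enough that even I have trouble following it at a glance.</p>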

<p><strong>Collecting all the links</strong></p>

<p>The cream category had 27 pages of products in total.
I fetched the data by changing the page number in the URL.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">driver</span> <span class="o">=</span> <span class="n">webdriver</span><span class="p">.</span><span class="n">Chrome</span><span class="p">()</span>
<span class="n">driver</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'https://www.oliveyoung.co.kr/store/display/getMCategoryList.do?dispCatNo=100000100010015&amp;isLoginCnt=0&amp;aShowCnt=0&amp;bShowCnt=0&amp;cShowCnt=0&amp;gateCd=Drawer&amp;trackingCd=Cat100000100010015_MID&amp;trackingCd=Cat100000100010015_MID&amp;t_page=드로우_카테고리&amp;t_click=카테고리탭_중카테고리&amp;t_2nd_category_type=중_크림'</span><span class="p">)</span>

<span class="n">link</span><span class="o">=</span><span class="p">[]</span>

<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">28</span><span class="p">):</span> <span class="c1"># pages 1 to 27; range() excludes the stop value</span>
    <span class="n">url</span> <span class="o">=</span><span class="s">'https://www.oliveyoung.co.kr/store/display/getMCategoryList.do?dispCatNo=100000100010015&amp;fltDispCatNo=&amp;prdSort=01&amp;pageIdx='</span><span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="o">+</span><span class="s">'&amp;rowsPerPage=24&amp;searchTypeSort=btn_thumb&amp;plusButtonFlag=N&amp;isLoginCnt=0&amp;aShowCnt=&amp;bShowCnt=&amp;cShowCnt=&amp;trackingCd=Cat100000100010015_Small&amp;amplitudePageGubun=&amp;t_page=&amp;t_click=&amp;midCategory=%ED%81%AC%EB%A6%BC&amp;smallCategory=%EC%A0%84%EC%B2%B4&amp;checkBrnds=&amp;lastChkBrnd='</span>
    <span class="n">driver</span><span class="p">.</span><span class="n">switch_to</span><span class="p">.</span><span class="n">window</span><span class="p">(</span><span class="n">driver</span><span class="p">.</span><span class="n">window_handles</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="c1"># switch to the first tab
</span>    <span class="n">driver</span><span class="p">.</span><span class="n">execute_script</span><span class="p">(</span><span class="s">"window.open('{}')"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">url</span><span class="p">))</span> <span class="c1"># open the URL in a new tab
</span>    <span class="n">driver</span><span class="p">.</span><span class="n">switch_to</span><span class="p">.</span><span class="n">window</span><span class="p">(</span><span class="n">driver</span><span class="p">.</span><span class="n">window_handles</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="c1"># switch to the second tab
</span>    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
    <span class="n">html</span> <span class="o">=</span> <span class="n">driver</span><span class="p">.</span><span class="n">page_source</span>  <span class="c1"># get the HTML of the current page
</span>    <span class="n">soup</span> <span class="o">=</span> <span class="n">BS</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s">'html.parser'</span><span class="p">)</span>
    <span class="n">products</span> <span class="o">=</span> <span class="n">soup</span><span class="p">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"div"</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">"class"</span><span class="p">:</span> <span class="s">"prd_info"</span><span class="p">})</span>
    <span class="k">for</span> <span class="n">product</span> <span class="ow">in</span> <span class="n">products</span><span class="p">:</span>
        <span class="n">a</span><span class="o">=</span><span class="n">product</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">"class"</span><span class="p">:</span><span class="s">"prd_thumb goodsList"</span><span class="p">}).</span><span class="n">attrs</span><span class="p">[</span><span class="s">'href'</span><span class="p">]</span>
        <span class="n">link</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
    <span class="n">driver</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.3</span><span class="p">)</span>
</code></pre></div></div>
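
<p>The paging step can also be kept separate from the browser. This is a minimal sketch, assuming the listing endpoint accepts a reduced query with only <code>dispCatNo</code>, <code>pageIdx</code>, and <code>rowsPerPage</code> (an unverified assumption; the original URL carries many more parameters). <code>build_page_urls</code> is a hypothetical helper, not part of the original code. It makes the inclusive page range explicit, so page 27 cannot be dropped by an off-by-one.</p>

```python
from urllib.parse import urlencode

# Endpoint taken from the URL used in the post.
BASE_URL = "https://www.oliveyoung.co.kr/store/display/getMCategoryList.do"


def build_page_urls(category_no, total_pages, rows_per_page=24):
    """Return one listing URL per page, numbered 1..total_pages inclusive."""
    urls = []
    for page in range(1, total_pages + 1):  # +1 because range() excludes the stop value
        # Assumption: these three parameters are enough for the site to serve the page.
        query = urlencode({
            "dispCatNo": category_no,
            "pageIdx": page,
            "rowsPerPage": rows_per_page,
        })
        urls.append(BASE_URL + "?" + query)
    return urls


page_urls = build_page_urls("100000100010015", 27)
print(len(page_urls))
print(page_urls[-1])
```

<p>Each URL in <code>page_urls</code> could then be opened with the same tab-switching loop as above.</p>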

<p><strong>Saving links separately for products with 2,000 or more reviews</strong></p>

<p>I visited every product link collected earlier and saved the link whenever the product had 2,000 or more reviews. These are the sample links.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">re</span>

<span class="kn">from</span> <span class="nn">selenium.common.exceptions</span> <span class="kn">import</span> <span class="n">TimeoutException</span>
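<span class="n">link</span><span class="o">=</span><span class="p">[]</span> <span class="c1"># assumption: start a fresh list so it holds only the sample links</span>
<span class="n">review</span><span class="o">=</span><span class="p">[]</span>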

<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">raw_data</span><span class="p">)):</span>
    <span class="n">url</span> <span class="o">=</span> <span class="n">raw_data</span><span class="p">[</span><span class="s">'link'</span><span class="p">][</span><span class="n">i</span><span class="p">]</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">switch_to</span><span class="p">.</span><span class="n">window</span><span class="p">(</span><span class="n">driver</span><span class="p">.</span><span class="n">window_handles</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="c1"># switch to the first tab
</span>        <span class="n">driver</span><span class="p">.</span><span class="n">execute_script</span><span class="p">(</span><span class="s">"window.open('{}')"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">url</span><span class="p">))</span> <span class="c1"># open the URL in a new tab
</span>        <span class="n">driver</span><span class="p">.</span><span class="n">switch_to</span><span class="p">.</span><span class="n">window</span><span class="p">(</span><span class="n">driver</span><span class="p">.</span><span class="n">window_handles</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="c1"># switch to the second tab
</span>        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.5</span><span class="p">)</span>
        <span class="n">html</span> <span class="o">=</span> <span class="n">driver</span><span class="p">.</span><span class="n">page_source</span>  <span class="c1"># get the HTML of the current page
</span>        <span class="n">soup</span> <span class="o">=</span> <span class="n">BS</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s">'html.parser'</span><span class="p">)</span>
        <span class="n">review_count</span> <span class="o">=</span> <span class="n">soup</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">"class"</span><span class="p">:</span> <span class="s">"goods_reputation"</span><span class="p">})</span>
        <span class="n">num</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">'[^0-9]'</span><span class="p">,</span> <span class="s">''</span><span class="p">,</span> <span class="n">review_count</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="s">"span"</span><span class="p">).</span><span class="n">text</span><span class="p">))</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">num</span><span class="o">&gt;=</span><span class="mi">2000</span><span class="p">):</span>
            <span class="n">link</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
            <span class="n">review</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">num</span><span class="p">)</span>
    <span class="k">except</span> <span class="n">TimeoutException</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
    <span class="k">finally</span><span class="p">:</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.3</span><span class="p">)</span>
        <span class="k">if</span><span class="p">(</span><span class="n">i</span><span class="o">%</span><span class="mi">10</span><span class="o">==</span><span class="mi">0</span><span class="p">):</span>
            <span class="k">print</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="o">+</span><span class="s">"/"</span><span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">raw_data</span><span class="p">)))</span>
</code></pre></div></div>
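
<p>The filtering rule itself can be tried without a browser. Here is a small sketch of the same digit-stripping idea used in the loop above (<code>re.sub(r'[^0-9]', '', ...)</code>), with a guard for empty text. The helper names and the sample strings are hypothetical; the real text inside the <code>goods_reputation</code> element may look different.</p>

```python
import re


def parse_review_count(text):
    """Strip every non-digit character and return the remaining number, or 0 if none."""
    digits = re.sub(r"[^0-9]", "", text)
    return int(digits) if digits else 0


def is_sample(text, threshold=2000):
    """True when the review count reaches the threshold used for sample products."""
    return parse_review_count(text) >= threshold


# Hypothetical examples of what the review-count text might look like.
for sample_text in ["(2,345건)", "1,999", ""]:
    print(repr(sample_text), parse_review_count(sample_text), is_sample(sample_text))
```

<p>Returning 0 for empty text means a product with a missing count is simply skipped instead of raising a <code>ValueError</code> from <code>int('')</code>.</p>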

<p><strong>Getting ad image links for products with 2,000 or more reviews</strong></p>

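<p>I visited each sample link and collected all of the product description images.
Some products have a single image and others have several, so I used a two-dimensional list.</p>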
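
<p>The two-dimensional layout and the <code>src</code> versus <code>data-src</code> choice can be sketched on plain dictionaries. This is only an illustration: <code>pick_image_url</code> and <code>collect_image_urls</code> are hypothetical helpers, the URLs are made up, and the assumption is that an image whose <code>src</code> is not an absolute <code>https://</code> URL keeps its real address in <code>data-src</code>.</p>

```python
def pick_image_url(img_attrs):
    """Prefer an absolute https src; otherwise fall back to data-src (may be None)."""
    src = img_attrs.get("src") or ""
    if src.startswith("https://"):
        return src
    return img_attrs.get("data-src")


def collect_image_urls(products):
    """Build one inner list of image URLs per product, skipping images with no usable URL."""
    result = []
    for imgs in products:
        urls = [pick_image_url(attrs) for attrs in imgs]
        result.append([url for url in urls if url])
    return result


# Made-up attribute dictionaries standing in for img tags, one inner list per product.
products = [
    [
        {"src": "https://example.com/a.jpg"},
        {"src": "/blank.gif", "data-src": "https://example.com/b.jpg"},
    ],
    [{"src": "data:image/gif;base64,AAAA"}],
]
img_link = collect_image_urls(products)
print(img_link)
```

<p>Using <code>.get</code> instead of indexing avoids a <code>KeyError</code> when an image has no <code>src</code> or no <code>data-src</code> attribute.</p>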

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">NavigableString</span>

<span class="n">img_link</span><span class="o">=</span><span class="p">[[]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">link</span><span class="p">))]</span>

<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">link</span><span class="p">)):</span>
    <span class="n">url</span> <span class="o">=</span> <span class="n">link</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">switch_to</span><span class="p">.</span><span class="n">window</span><span class="p">(</span><span class="n">driver</span><span class="p">.</span><span class="n">window_handles</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="c1"># switch to the first tab
</span>        <span class="n">driver</span><span class="p">.</span><span class="n">execute_script</span><span class="p">(</span><span class="s">"window.open('{}')"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">url</span><span class="p">))</span> <span class="c1"># open the URL in a new tab
</span>        <span class="n">driver</span><span class="p">.</span><span class="n">switch_to</span><span class="p">.</span><span class="n">window</span><span class="p">(</span><span class="n">driver</span><span class="p">.</span><span class="n">window_handles</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="c1"># switch to the second tab
</span>        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.5</span><span class="p">)</span>
        <span class="n">html</span> <span class="o">=</span> <span class="n">driver</span><span class="p">.</span><span class="n">page_source</span>  <span class="c1"># get the HTML of the current page
</span>        <span class="n">soup</span> <span class="o">=</span> <span class="n">BS</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s">'html.parser'</span><span class="p">)</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>

        <span class="n">group</span> <span class="o">=</span> <span class="n">soup</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="s">"div"</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">"class"</span><span class="p">:</span> <span class="s">"iPrdViewimg"</span><span class="p">})</span>
        <span class="k">if</span> <span class="n">group</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">group</span><span class="o">=</span><span class="n">soup</span><span class="p">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"picture"</span><span class="p">)</span>
        <span class="n">img</span><span class="o">=</span><span class="p">[]</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">group</span><span class="p">:</span>
            <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">j</span><span class="p">,</span> <span class="n">NavigableString</span><span class="p">):</span>
                <span class="k">continue</span>
            <span class="n">img</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">j</span><span class="p">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"img"</span><span class="p">))</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">img</span><span class="p">:</span>
            <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">j</span><span class="p">:</span>
                <span class="k">if</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="s">"src"</span><span class="p">][</span><span class="mi">0</span><span class="p">:</span><span class="mi">8</span><span class="p">]</span><span class="o">!=</span><span class="s">"https://"</span><span class="p">):</span>
                    <span class="n">img_link</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="s">"data-src"</span><span class="p">])</span>
                <span class="k">else</span><span class="p">:</span>
                    <span class="n">img_link</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="s">"src"</span><span class="p">])</span>
    <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
    <span class="k">finally</span><span class="p">:</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.3</span><span class="p">)</span>
        <span class="k">if</span><span class="p">(</span><span class="n">i</span><span class="o">%</span><span class="mi">10</span><span class="o">==</span><span class="mi">0</span><span class="p">):</span>
            <span class="k">print</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="o">+</span><span class="s">"/"</span><span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">link</span><span class="p">)))</span>
</code></pre></div></div>]]></content><author><name>birdfoot</name></author><category term="teamProject" /><category term="화장품마케팅분석" /><summary type="html"><![CDATA[Getting product links from Olive Young before the cosmetics keyword marketing analysis]]></summary></entry><entry><title type="html">Getting Olive Young product reviews</title><link href="https://birdf00t.github.io/teamproject/%ED%99%94%EC%9E%A5%ED%92%88%EB%A7%88%EC%BC%80%ED%8C%85%EB%B6%84%EC%84%9D/2024/06/01/OLIVEYOUNG-review-data.html" rel="alternate" type="text/html" title="Getting Olive Young product reviews" /><published>2024-06-01T00:00:00+00:00</published><updated>2024-06-01T00:00:00+00:00</updated><id>https://birdf00t.github.io/teamproject/%ED%99%94%EC%9E%A5%ED%92%88%EB%A7%88%EC%BC%80%ED%8C%85%EB%B6%84%EC%84%9D/2024/06/01/OLIVEYOUNG-review-data</id><content type="html" xml:base="https://birdf00t.github.io/teamproject/%ED%99%94%EC%9E%A5%ED%92%88%EB%A7%88%EC%BC%80%ED%8C%85%EB%B6%84%EC%84%9D/2024/06/01/OLIVEYOUNG-review-data.html"><![CDATA[<h2 id="올리브영-샘플-리뷰-가져오기">Getting Olive Young sample reviews</h2>

<p><strong>Reflections</strong></p>

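<p>The reviews only appear after a click, with no change in the URL, so I had no choice but to use dynamic crawling.
I assumed every product page on the Olive Young site shared the same layout and clicked by XPATH, but some links
used a different layout, so I switched to another approach. The run also took so long that I left it on and fell asleep,
the screen turned off, the progress was lost, and I had to start over from the beginning several times. I had also set
the scroll distance imprecisely, so quite a few pages were skipped. On the last day, while trying to re-fetch only the pages that had failed,
I wrote the condition backwards, which wiped out the 110,000 rows I needed and kept only the 6 rows I did not.
At that moment I honestly wanted to disappear from the world. I should have saved first; I was far too careless.
For 4 days I barely slept, waking every few hours to check on the run, and the lack of sleep probably dulled my thinking enough that it ended up taking all 4 days.</p>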

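<p>I expected every review to be available, but Olive Young only shows 1,000 per product,
so I collected about 110,000 reviews. Honestly, I am almost relieved it was only 1,000 each, haha..</p>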

<p><strong>Getting the sample review data</strong></p>

<p>Loading the saved data</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>

<span class="c1"># sample product links
</span><span class="n">link</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'올리브영 제품 링크//oliveyoung_sample_link.csv'</span><span class="p">)</span>
<span class="n">link</span><span class="o">=</span><span class="n">link</span><span class="p">[</span><span class="s">'link'</span><span class="p">]</span>
<span class="c1"># the data I had at least managed to save
</span><span class="n">data</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'oliveyoung_reviews.csv'</span><span class="p">)</span>
</code></pre></div></div>
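
<p>Since so much progress was lost to unsaved runs, a per-product checkpoint is one way to make a long crawl resumable. The following is a minimal sketch using only the standard library, not what the original run did. The file name, the column names, and both helper functions are hypothetical.</p>

```python
import csv
import os
import tempfile


def append_checkpoint(path, index, stars, skin_types, reviews):
    """Append one product's rows to a CSV file, writing the header only once."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["index", "star", "skin_type", "review"])
        for row in zip(stars, skin_types, reviews):
            writer.writerow([index, *row])


def finished_indexes(path):
    """Return the set of product indexes already present in the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {int(row["index"]) for row in csv.DictReader(f)}


# Demo in a temporary directory with made-up rows.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "reviews_checkpoint.csv")
append_checkpoint(checkpoint_path, 26, ["5"], ["지성"], ["좋아요"])
append_checkpoint(checkpoint_path, 27, ["4", "3"], ["건성", ""], ["무난해요", "별로"])
done = finished_indexes(checkpoint_path)
print(sorted(done))
```

<p>A crawl loop could call <code>append_checkpoint</code> after each product and skip any index already in <code>finished_indexes</code>, so an interrupted run continues from where it stopped.</p>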

<p>The data had been saved in an odd form, so this function splits it back into its original items</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">str_out</span><span class="p">(</span><span class="n">st</span><span class="p">):</span>
    <span class="n">return_text</span><span class="o">=</span><span class="p">[]</span>
    <span class="n">t</span><span class="o">=</span><span class="s">""</span>
    <span class="n">input_ok</span><span class="o">=</span><span class="mi">0</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">st</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">input_ok</span><span class="o">==</span><span class="mi">1</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">i</span><span class="o">==</span><span class="s">"</span><span class="se">\'</span><span class="s">"</span><span class="p">:</span>
                <span class="n">input_ok</span><span class="o">=</span><span class="mi">0</span>
                <span class="n">return_text</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>
                <span class="n">t</span><span class="o">=</span><span class="s">""</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">t</span><span class="o">+=</span><span class="n">i</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">i</span><span class="o">==</span><span class="s">"</span><span class="se">\'</span><span class="s">"</span><span class="p">:</span>
                <span class="n">input_ok</span><span class="o">=</span><span class="mi">1</span>
    <span class="k">return</span> <span class="n">return_text</span>
</code></pre></div></div>
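
<p>If the odd form is simply a Python list written out as a string (an assumption; the saved file is not shown here), the standard library's <code>ast.literal_eval</code> is an alternative to splitting on quote characters. It also copes with items that contain an apostrophe, which a quote-based splitter would cut in two. <code>parse_saved_list</code> is a hypothetical helper.</p>

```python
import ast


def parse_saved_list(cell):
    """Turn a cell like "['a', 'b']" back into a list; return [] for anything else."""
    if not isinstance(cell, str):
        return []  # e.g. a missing value read from the CSV
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    return list(value) if isinstance(value, (list, tuple)) else []


# A made-up cell: str() of a list, one item containing an apostrophe.
saved = str(["지성,여름쿨톤", "it's great"])
restored = parse_saved_list(saved)
print(saved)
print(restored)
```

<p><code>ast.literal_eval</code> only evaluates literals, so it does not execute arbitrary code from the file the way <code>eval</code> would.</p>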

<p>Setting up the data</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># star ratings
</span><span class="n">star</span><span class="o">=</span><span class="p">[]</span>

<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">'star'</span><span class="p">])):</span>
    <span class="n">star</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">a</span><span class="p">:</span> <span class="n">a</span><span class="p">.</span><span class="n">isdigit</span><span class="p">()</span><span class="o">==</span><span class="bp">True</span><span class="p">,</span><span class="n">data</span><span class="p">[</span><span class="s">'star'</span><span class="p">][</span><span class="n">i</span><span class="p">])))</span>

<span class="n">star</span><span class="p">[</span><span class="mi">26</span><span class="p">]</span><span class="o">=</span><span class="p">[]</span>
<span class="c1"># skin types
</span><span class="n">skin_type</span><span class="o">=</span><span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">'skin_type'</span><span class="p">])):</span>
    <span class="n">a</span><span class="o">=</span><span class="n">str_out</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">'skin_type'</span><span class="p">][</span><span class="n">i</span><span class="p">])</span>
    <span class="n">skin_type</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
<span class="n">skin_type</span><span class="p">[</span><span class="mi">26</span><span class="p">]</span><span class="o">=</span><span class="p">[]</span>

<span class="c1"># review text
</span><span class="n">reviews</span><span class="o">=</span><span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">'review'</span><span class="p">])):</span>
    <span class="n">a</span><span class="o">=</span><span class="n">str_out</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s">'review'</span><span class="p">][</span><span class="n">i</span><span class="p">])</span>
    <span class="n">reviews</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
<span class="n">reviews</span><span class="p">[</span><span class="mi">26</span><span class="p">]</span><span class="o">=</span><span class="p">[]</span>
</code></pre></div></div>

<p>A function that extracts the data according to the page layout</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">info_get</span><span class="p">(</span><span class="n">index</span><span class="p">):</span>
    <span class="n">html</span> <span class="o">=</span> <span class="n">driver</span><span class="p">.</span><span class="n">page_source</span>
    <span class="n">soup</span> <span class="o">=</span> <span class="n">BS</span><span class="p">(</span><span class="n">html</span><span class="p">,</span> <span class="s">'html.parser'</span><span class="p">)</span>

    <span class="n">info</span> <span class="o">=</span> <span class="n">soup</span><span class="p">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"div"</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">"class"</span><span class="p">:</span><span class="s">"user clrfix"</span><span class="p">})</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">info</span><span class="p">)):</span>
        <span class="c1"># skin type
</span>        <span class="k">try</span><span class="p">:</span>
            <span class="n">sk</span><span class="o">=</span><span class="n">info</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">find</span><span class="p">(</span><span class="s">"p"</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">"class"</span><span class="p">:</span><span class="s">'tag'</span><span class="p">})</span>
            <span class="n">tags</span><span class="o">=</span><span class="s">""</span>
            <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">sk</span><span class="p">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"span"</span><span class="p">):</span>
                <span class="n">tags</span><span class="o">+=</span><span class="n">k</span><span class="p">.</span><span class="n">text</span><span class="o">+</span><span class="s">","</span>

            <span class="n">tags</span><span class="o">=</span><span class="n">tags</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># drop the trailing comma</span>
            <span class="n">skin_type</span><span class="p">[</span><span class="n">index</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">tags</span><span class="p">)</span>
        <span class="k">except</span><span class="p">:</span>
            <span class="n">skin_type</span><span class="p">[</span><span class="n">index</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="s">""</span><span class="p">)</span>

    <span class="n">review</span> <span class="o">=</span> <span class="n">soup</span><span class="p">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">"div"</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">"class"</span><span class="p">:</span><span class="s">"review_cont"</span><span class="p">})</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">review</span><span class="p">)):</span>
        <span class="c1"># review text
</span>        <span class="k">try</span><span class="p">:</span>
            <span class="n">text</span><span class="o">=</span><span class="n">review</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">find</span><span class="p">(</span><span class="s">"div"</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">"class"</span><span class="p">:</span> <span class="s">"txt_inner"</span><span class="p">}).</span><span class="n">text</span>
            <span class="n">reviews</span><span class="p">[</span><span class="n">index</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
        <span class="k">except</span><span class="p">:</span>
            <span class="n">reviews</span><span class="p">[</span><span class="n">index</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="s">""</span><span class="p">)</span>
        <span class="c1"># star rating
</span>        <span class="k">try</span><span class="p">:</span>
            <span class="n">st</span><span class="o">=</span><span class="n">review</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">find</span><span class="p">(</span><span class="s">"span"</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span> <span class="s">"class"</span><span class="p">:</span><span class="s">"point"</span><span class="p">}).</span><span class="n">text</span>
            <span class="n">st</span><span class="o">=</span><span class="n">st</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"5점만점에 "</span><span class="p">,</span><span class="mi">1</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span>
            <span class="n">star</span><span class="p">[</span><span class="n">index</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">st</span><span class="p">)</span>
        <span class="k">except</span><span class="p">:</span>
            <span class="n">star</span><span class="p">[</span><span class="n">index</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="s">""</span><span class="p">)</span>
</code></pre></div></div>
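
<p>The star-rating step relies on the text that follows <code>"5점만점에 "</code>. As a small sketch, assuming the rating text has the form <code>5점만점에 N점</code> (an assumption based on the split above, not verified against the page), a regular expression can return the score as an integer and signal a miss with <code>None</code> instead of an exception. <code>parse_star</code> is a hypothetical helper.</p>

```python
import re

# Assumed format of the rating text: "5점만점에 4점" (4 out of 5).
RATING_PATTERN = re.compile(r"5점만점에\s*(\d)점")


def parse_star(text):
    """Return the score as an int, or None when the text does not match the assumed format."""
    match = RATING_PATTERN.search(text)
    return int(match.group(1)) if match else None


for rating_text in ["5점만점에 4점", "5점만점에 5점", ""]:
    print(repr(rating_text), parse_star(rating_text))
```

<p>With this shape the caller can decide what to store for a missing rating, rather than relying on a bare <code>except</code> to catch the failed split.</p>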

<p>Getting reviews for every link</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">driver</span> <span class="o">=</span> <span class="n">webdriver</span><span class="p">.</span><span class="n">Chrome</span><span class="p">()</span>
<span class="n">driver</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'https://www.oliveyoung.co.kr/store/display/getMCategoryList.do?dispCatNo=100000100010015&amp;isLoginCnt=0&amp;aShowCnt=0&amp;bShowCnt=0&amp;cShowCnt=0&amp;gateCd=Drawer&amp;trackingCd=Cat100000100010015_MID&amp;trackingCd=Cat100000100010015_MID&amp;t_page=드로우_카테고리&amp;t_click=카테고리탭_중카테고리&amp;t_2nd_category_type=중_크림'</span><span class="p">)</span>

<span class="k">for</span> <span class="n">index</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">26</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">star</span><span class="p">)):</span><span class="c1">#check_index:
</span>    <span class="n">page_num</span><span class="o">=</span><span class="mi">1</span>
    <span class="n">url</span> <span class="o">=</span> <span class="n">link</span><span class="p">[</span><span class="n">index</span><span class="p">]</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">switch_to</span><span class="p">.</span><span class="n">window</span><span class="p">(</span><span class="n">driver</span><span class="p">.</span><span class="n">window_handles</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="c1"># switch to the first tab
</span>        <span class="n">driver</span><span class="p">.</span><span class="n">execute_script</span><span class="p">(</span><span class="s">"window.open('{}')"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">url</span><span class="p">))</span> <span class="c1"># open the URL in a new tab
</span>        <span class="n">driver</span><span class="p">.</span><span class="n">switch_to</span><span class="p">.</span><span class="n">window</span><span class="p">(</span><span class="n">driver</span><span class="p">.</span><span class="n">window_handles</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="c1"># switch to the second tab
</span>        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">execute_script</span><span class="p">(</span><span class="s">"window.scrollTo(0, 1500)"</span><span class="p">)</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">find_element</span><span class="p">(</span><span class="n">By</span><span class="p">.</span><span class="n">ID</span><span class="p">,</span> <span class="s">'reviewInfo'</span><span class="p">).</span><span class="n">click</span><span class="p">()</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">find_element</span><span class="p">(</span><span class="n">By</span><span class="p">.</span><span class="n">ID</span><span class="p">,</span> <span class="s">'searchType_1'</span><span class="p">).</span><span class="n">click</span><span class="p">()</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">find_element</span><span class="p">(</span><span class="n">By</span><span class="p">.</span><span class="n">ID</span><span class="p">,</span> <span class="s">'searchType_3'</span><span class="p">).</span><span class="n">click</span><span class="p">()</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">find_element</span><span class="p">(</span><span class="n">By</span><span class="p">.</span><span class="n">XPATH</span><span class="p">,</span> <span class="s">'//*[@id="gdasSort"]/li[3]/a'</span><span class="p">).</span><span class="n">click</span><span class="p">()</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">execute_script</span><span class="p">(</span><span class="s">"window.scrollTo(0, 2500)"</span><span class="p">)</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
                <span class="n">driver</span><span class="p">.</span><span class="n">execute_script</span><span class="p">(</span><span class="s">"window.scrollTo(0, 4000)"</span><span class="p">)</span>
                <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">1.5</span><span class="p">)</span>
                <span class="n">err</span><span class="o">=</span><span class="mi">0</span>
                <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">):</span>
                    <span class="k">if</span> <span class="n">page_num</span><span class="o">==</span><span class="mi">100</span><span class="p">:</span>
                        <span class="n">info_get</span><span class="p">(</span><span class="n">index</span><span class="p">)</span>
                        <span class="k">break</span>
                    <span class="k">if</span> <span class="n">page_num</span><span class="o">&gt;</span><span class="mi">10</span><span class="p">:</span>
                        <span class="n">i</span><span class="o">+=</span><span class="mi">1</span>
                    <span class="n">page_num</span><span class="o">+=</span><span class="mi">1</span>
                    <span class="k">try</span><span class="p">:</span>
                        <span class="n">info_get</span><span class="p">(</span><span class="n">index</span><span class="p">)</span>
                        <span class="n">driver</span><span class="p">.</span><span class="n">execute_script</span><span class="p">(</span><span class="s">"window.scrollTo(0, 4000)"</span><span class="p">)</span>
                        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
                        <span class="n">page</span><span class="o">=</span><span class="n">driver</span><span class="p">.</span><span class="n">find_element</span><span class="p">(</span><span class="n">By</span><span class="p">.</span><span class="n">CLASS_NAME</span><span class="p">,</span> <span class="s">'pageing'</span><span class="p">)</span>
                        <span class="n">page</span><span class="o">=</span><span class="n">page</span><span class="p">.</span><span class="n">find_elements</span><span class="p">(</span><span class="n">By</span><span class="p">.</span><span class="n">TAG_NAME</span><span class="p">,</span> <span class="s">'a'</span><span class="p">)</span>
                        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
                        <span class="n">page</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">click</span><span class="p">()</span>
                        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
                    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
                        <span class="k">print</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
                        <span class="n">err</span><span class="o">=</span><span class="mi">1</span>
                        <span class="k">break</span>
                <span class="k">if</span> <span class="n">page_num</span><span class="o">==</span><span class="mi">100</span><span class="p">:</span>
                    <span class="k">break</span>
                <span class="k">if</span> <span class="n">err</span><span class="o">==</span><span class="mi">1</span><span class="p">:</span>
                    <span class="k">break</span>
            <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
                <span class="k">print</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
                <span class="k">break</span>
    <span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"page error: link "</span><span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="n">index</span><span class="p">))</span>
    <span class="k">finally</span><span class="p">:</span>
        <span class="n">driver</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.3</span><span class="p">)</span>
        <span class="k">if</span><span class="p">(</span><span class="n">index</span><span class="o">%</span><span class="mi">10</span><span class="o">==</span><span class="mi">0</span><span class="p">):</span>
            <span class="k">print</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">index</span><span class="p">)</span><span class="o">+</span><span class="s">"/"</span><span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">link</span><span class="p">)))</span>
</code></pre></div></div>

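<p>Store the indices of links whose data was not fully collected in a separate list. A product counts as complete only when its skin type, star rating, and review lists each hold exactly 1000 entries; otherwise its lists are reset and its index is recorded for a later retry.</p>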

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">check_index</span><span class="o">=</span><span class="p">[]</span>

<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">star</span><span class="p">)):</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">skin_type</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="o">==</span><span class="nb">len</span><span class="p">(</span><span class="n">star</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="o">==</span><span class="nb">len</span><span class="p">(</span><span class="n">reviews</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="o">==</span><span class="mi">1000</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="n">reviews</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">skin_type</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">=</span><span class="p">[]</span><span class="c1">#None
</span>        <span class="n">star</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">=</span><span class="p">[]</span><span class="c1">#None
</span>        <span class="n">reviews</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">=</span><span class="p">[]</span><span class="c1">#None
</span>        <span class="n">check_index</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
</code></pre></div></div>]]></content><author><name>birdfoot</name></author><category term="teamProject" /><category term="화장품마케팅분석" /><summary type="html"><![CDATA[Collecting sample reviews from Olive Young]]></summary></entry><entry><title type="html">K nearest neighbors regression</title><link href="https://birdf00t.github.io/%ED%98%BC%EA%B3%B5%EB%A8%B8%EC%8B%A0/2024/05/24/K-nearest-neighbors-regression.html" rel="alternate" type="text/html" title="K nearest neighbors regression" /><published>2024-05-24T00:00:00+00:00</published><updated>2024-05-24T00:00:00+00:00</updated><id>https://birdf00t.github.io/%ED%98%BC%EA%B3%B5%EB%A8%B8%EC%8B%A0/2024/05/24/K-nearest-neighbors-regression</id><content type="html" xml:base="https://birdf00t.github.io/%ED%98%BC%EA%B3%B5%EB%A8%B8%EC%8B%A0/2024/05/24/K-nearest-neighbors-regression.html"><![CDATA[<h2 id="지도-학습-알고리즘">Supervised learning algorithms</h2>

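<p><strong>Classification</strong>: assigning a sample to one of several classes<br />
<strong>Regression</strong>: predicting an arbitrary numeric value rather than one of a fixed set of classes</p>

<h2 id="k-최근접-이웃">k-nearest neighbors</h2>

<p><strong>Classification</strong>: the most common class among the k nearest samples becomes the class of the new sample<br />
<strong>Regression</strong>: the mean of the target values of the k nearest samples becomes the predicted value for the new sample</p>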
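
<p>The two rules can be sketched in plain Python. This is a minimal illustration rather than how scikit-learn implements them: the function names and the toy data are made up for this example, and distance is simply the absolute difference of a single feature.</p>

```python
from collections import Counter

def k_nearest(train_x, x, k):
    """Return the indices of the k training points closest to x (one feature)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return order[:k]

def knn_classify(train_x, train_labels, x, k=3):
    """Classification: majority vote among the k nearest neighbors."""
    votes = Counter(train_labels[i] for i in k_nearest(train_x, x, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_x, train_y, x, k=3):
    """Regression: mean of the targets of the k nearest neighbors."""
    idx = k_nearest(train_x, x, k)
    return sum(train_y[i] for i in idx) / k

# Toy data, invented for illustration only
lengths = [10.0, 12.0, 14.0, 30.0, 32.0, 34.0]
species = ["small", "small", "small", "large", "large", "large"]
weights = [20.0, 30.0, 40.0, 300.0, 320.0, 340.0]

print(knn_classify(lengths, species, 13.0))  # small
print(knn_regress(lengths, weights, 13.0))   # 30.0
```

<p>Both functions share the same neighbor search; only the last step differs (a vote versus a mean).</p>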

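<h2 id="회귀의-탄생">The origin of ‘regression’</h2>

<p><strong>WHO?</strong> Francis Galton, a 19th-century statistician and sociologist<br />
<strong>HOW?</strong> He observed that the children of tall parents tend not to be taller than their parents, and described this as ‘regressing toward the mean’. Afterwards, the method of analyzing the relationship between two variables came to be called regression.</p>

<h2 id="k-최근접-이웃-회귀-모델">k-nearest neighbors regression model</h2>

<p><strong>Preparing the data</strong></p>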

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">perch_length</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mf">8.4</span><span class="p">,</span> <span class="mf">13.7</span><span class="p">,</span> <span class="mf">15.0</span><span class="p">,</span> <span class="mf">16.2</span><span class="p">,</span> <span class="mf">17.4</span><span class="p">,</span> <span class="mf">18.0</span><span class="p">,</span> <span class="mf">18.7</span><span class="p">,</span> <span class="mf">19.0</span><span class="p">,</span>
       <span class="mf">19.6</span><span class="p">,</span> <span class="mf">20.0</span><span class="p">,</span> <span class="mf">21.0</span><span class="p">,</span> <span class="mf">21.0</span><span class="p">,</span> <span class="mf">21.0</span><span class="p">,</span> <span class="mf">21.3</span><span class="p">,</span> <span class="mf">22.0</span><span class="p">,</span> <span class="mf">22.0</span><span class="p">,</span> <span class="mf">22.0</span><span class="p">,</span> <span class="mf">22.0</span><span class="p">,</span> <span class="mf">22.0</span><span class="p">,</span>
       <span class="mf">22.5</span><span class="p">,</span> <span class="mf">22.5</span><span class="p">,</span> <span class="mf">22.7</span><span class="p">,</span> <span class="mf">23.0</span><span class="p">,</span> <span class="mf">23.5</span><span class="p">,</span> <span class="mf">24.0</span><span class="p">,</span> <span class="mf">24.0</span><span class="p">,</span> <span class="mf">24.6</span><span class="p">,</span> <span class="mf">25.0</span><span class="p">,</span> <span class="mf">25.6</span><span class="p">,</span> <span class="mf">26.5</span><span class="p">,</span>
       <span class="mf">27.3</span><span class="p">,</span> <span class="mf">27.5</span><span class="p">,</span> <span class="mf">27.5</span><span class="p">,</span> <span class="mf">27.5</span><span class="p">,</span> <span class="mf">28.0</span><span class="p">,</span> <span class="mf">28.7</span><span class="p">,</span> <span class="mf">30.0</span><span class="p">,</span> <span class="mf">32.8</span><span class="p">,</span> <span class="mf">34.5</span><span class="p">,</span> <span class="mf">35.0</span><span class="p">,</span> <span class="mf">36.5</span><span class="p">,</span>
       <span class="mf">36.0</span><span class="p">,</span> <span class="mf">37.0</span><span class="p">,</span> <span class="mf">37.0</span><span class="p">,</span> <span class="mf">39.0</span><span class="p">,</span> <span class="mf">39.0</span><span class="p">,</span> <span class="mf">39.0</span><span class="p">,</span> <span class="mf">40.0</span><span class="p">,</span> <span class="mf">40.0</span><span class="p">,</span> <span class="mf">40.0</span><span class="p">,</span> <span class="mf">40.0</span><span class="p">,</span> <span class="mf">42.0</span><span class="p">,</span>
       <span class="mf">43.0</span><span class="p">,</span> <span class="mf">43.0</span><span class="p">,</span> <span class="mf">43.5</span><span class="p">,</span> <span class="mf">44.0</span><span class="p">])</span>
<span class="n">perch_weight</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mf">5.9</span><span class="p">,</span> <span class="mf">32.0</span><span class="p">,</span> <span class="mf">40.0</span><span class="p">,</span> <span class="mf">51.5</span><span class="p">,</span> <span class="mf">70.0</span><span class="p">,</span> <span class="mf">100.0</span><span class="p">,</span> <span class="mf">78.0</span><span class="p">,</span> <span class="mf">80.0</span><span class="p">,</span>
       <span class="mf">85.0</span><span class="p">,</span> <span class="mf">85.0</span><span class="p">,</span> <span class="mf">110.0</span><span class="p">,</span> <span class="mf">115.0</span><span class="p">,</span> <span class="mf">125.0</span><span class="p">,</span> <span class="mf">130.0</span><span class="p">,</span> <span class="mf">120.0</span><span class="p">,</span> <span class="mf">120.0</span><span class="p">,</span> <span class="mf">130.0</span><span class="p">,</span>
       <span class="mf">135.0</span><span class="p">,</span> <span class="mf">110.0</span><span class="p">,</span> <span class="mf">130.0</span><span class="p">,</span> <span class="mf">150.0</span><span class="p">,</span> <span class="mf">145.0</span><span class="p">,</span> <span class="mf">150.0</span><span class="p">,</span> <span class="mf">170.0</span><span class="p">,</span> <span class="mf">225.0</span><span class="p">,</span> <span class="mf">145.0</span><span class="p">,</span>
       <span class="mf">188.0</span><span class="p">,</span> <span class="mf">180.0</span><span class="p">,</span> <span class="mf">197.0</span><span class="p">,</span> <span class="mf">218.0</span><span class="p">,</span> <span class="mf">300.0</span><span class="p">,</span> <span class="mf">260.0</span><span class="p">,</span> <span class="mf">265.0</span><span class="p">,</span> <span class="mf">250.0</span><span class="p">,</span> <span class="mf">250.0</span><span class="p">,</span>
       <span class="mf">300.0</span><span class="p">,</span> <span class="mf">320.0</span><span class="p">,</span> <span class="mf">514.0</span><span class="p">,</span> <span class="mf">556.0</span><span class="p">,</span> <span class="mf">840.0</span><span class="p">,</span> <span class="mf">685.0</span><span class="p">,</span> <span class="mf">700.0</span><span class="p">,</span> <span class="mf">700.0</span><span class="p">,</span> <span class="mf">690.0</span><span class="p">,</span>
       <span class="mf">900.0</span><span class="p">,</span> <span class="mf">650.0</span><span class="p">,</span> <span class="mf">820.0</span><span class="p">,</span> <span class="mf">850.0</span><span class="p">,</span> <span class="mf">900.0</span><span class="p">,</span> <span class="mf">1015.0</span><span class="p">,</span> <span class="mf">820.0</span><span class="p">,</span> <span class="mf">1100.0</span><span class="p">,</span> <span class="mf">1000.0</span><span class="p">,</span>
       <span class="mf">1100.0</span><span class="p">,</span> <span class="mf">1000.0</span><span class="p">,</span> <span class="mf">1000.0</span><span class="p">])</span>
</code></pre></div></div>

<p><strong>Visualizing the data to see its shape (x-axis: length, the feature; y-axis: weight, the target)</strong></p>

<p>The scatter plot shows that a perch's weight increases as its length increases.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">perch_length</span><span class="p">,</span> <span class="n">perch_weight</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'length'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'weight'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p>Result » <img src="https://velog.velcdn.com/images/koeunbi093/post/6c52f094-cae1-40c9-b657-e08d46442b92/image.png" alt="" /></p>

<hr />

<p><strong>Split the data into training and test sets before using it with the model</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="n">train_input</span><span class="p">,</span> <span class="n">test_input</span><span class="p">,</span> <span class="n">train_target</span><span class="p">,</span> <span class="n">test_target</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span>
    <span class="n">perch_length</span><span class="p">,</span> <span class="n">perch_weight</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
</code></pre></div></div>

<p>Because perch_length is a 1-D array, the train_input and test_input produced by the split are 1-D as well.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">train_input</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>

<p>Result » (42,)</p>
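
<p>The 42 comes from the default split ratio: when no size is given, train_test_split holds out 25% of the samples for testing, so the 56 perch samples become 42 training and 14 test samples. The sketch below mimics that idea with the standard library only. The helper names are hypothetical, and the shuffle does not reproduce scikit-learn's actual sample order for random_state=42.</p>

```python
import math
import random

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test size, rounding the test part up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

def simple_split(inputs, targets, test_size=0.25, seed=42):
    """Shuffle the indices once, then cut them into a training and a test part."""
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)
    n_train, _ = split_sizes(len(inputs), test_size)
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    return (
        [inputs[i] for i in train_idx],
        [inputs[i] for i in test_idx],
        [targets[i] for i in train_idx],
        [targets[i] for i in test_idx],
    )

print(split_sizes(56))  # (42, 14)
```

<p>Inputs and targets are indexed with the same shuffled indices, so each length stays paired with its weight.</p>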

<hr />

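<p>scikit-learn expects a 2-D input array for training, so use the reshape function to convert the data to 2-D<br />
e.g. [1,2,3] -&gt; [ [1],[2],[3] ], shape (3, ) -&gt; shape (3, 1)</p>

<p>reshape() raises an error if the requested shape does not match the number of elements.</p>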

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_input</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div></div>

<p>Result » <span style="color: red">cannot reshape array of size 42 into shape (2,3)</span></p>

<hr />

<p>If you pass -1 for a dimension, reshape infers that dimension from the number of elements left over.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># train_input.reshape(42,1) also works
</span><span class="n">train_input</span> <span class="o">=</span> <span class="n">train_input</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="n">test_input</span><span class="o">=</span><span class="n">test_input</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">train_input</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>

<p>Result » (42, 1)</p>
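
<p>The -1 rule and the size check can be written out by hand. The following is only a sketch of the arithmetic, not NumPy's implementation; infer_shape and to_column are hypothetical helpers introduced for this example.</p>

```python
def infer_shape(n_elements, shape):
    """Resolve at most one -1 in `shape` against n_elements, or raise ValueError."""
    if shape.count(-1) > 1:
        raise ValueError("only one dimension may be -1")
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim
    if -1 in shape:
        # The unknown dimension must divide the element count exactly.
        if known == 0 or n_elements % known != 0:
            raise ValueError(f"cannot reshape size {n_elements} into shape {shape}")
        return tuple(n_elements // known if dim == -1 else dim for dim in shape)
    if known != n_elements:
        raise ValueError(f"cannot reshape size {n_elements} into shape {shape}")
    return tuple(shape)

def to_column(values):
    """Turn a flat list into a list of one-element rows: [1, 2, 3] -> [[1], [2], [3]]."""
    return [[v] for v in values]

print(infer_shape(42, (-1, 1)))  # (42, 1)
print(to_column([1, 2, 3]))      # [[1], [2], [3]]
```

<p>Calling infer_shape(42, (2, 3)) raises ValueError, because 2 × 3 = 6 does not equal 42.</p>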

<hr />

<p>Create the estimator and train the regression model</p>

<p>k-nearest neighbors regression uses KNeighborsRegressor, a class similar to the KNeighborsClassifier used for k-nearest neighbors classification.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.neighbors</span> <span class="kn">import</span> <span class="n">KNeighborsRegressor</span>
<span class="n">knr</span> <span class="o">=</span> <span class="n">KNeighborsRegressor</span><span class="p">()</span>
<span class="n">knr</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_input</span><span class="p">,</span> <span class="n">train_target</span><span class="p">)</span>
</code></pre></div></div>

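<p>Check the score on the test set</p>

<p>Classification: accuracy (the fraction of samples classified correctly)<br />
Regression: coefficient of determination (predicting a number exactly is practically impossible, so the coefficient of determination, R², is used instead of accuracy)</p>

<p>R² = 1 - (sum of (target - prediction)^2) / (sum of (target - mean)^2)</p>

<p>When the numerator and denominator become similar (that is, the predictions are close to the mean of the targets), R² approaches 0;
when the predictions are close to the targets, R² approaches 1.</p>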
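
<p>The formula translates directly into a few lines of Python. This is a hand-written sketch for checking the two limiting cases described above; the function name and the three toy targets are made up for illustration.</p>

```python
def r2_manual(targets, predictions):
    """R^2 = 1 - sum((target - prediction)^2) / sum((target - mean)^2)."""
    mean = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean) ** 2 for t in targets)
    return 1 - ss_res / ss_tot

targets = [100.0, 200.0, 300.0]

# Predictions equal to the targets: the numerator is 0, so R^2 is 1.
print(r2_manual(targets, targets))                # 1.0

# Always predicting the mean: numerator equals denominator, so R^2 is 0.
print(r2_manual(targets, [200.0, 200.0, 200.0]))  # 0.0
```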

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">knr</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">test_input</span><span class="p">,</span> <span class="n">test_target</span><span class="p">))</span>
</code></pre></div></div>

<p>Result » 0.992809406101064</p>

<hr />

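<p>Mean absolute error between targets and predictions</p>

<p>Besides R², you can check how far off the predictions are by computing the difference between each target and its predicted value.</p>

<p>The result shows that the predictions differ from the targets by about 19 g on average.</p>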

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">mean_absolute_error</span>
<span class="n">test_prediction</span> <span class="o">=</span> <span class="n">knr</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">test_input</span><span class="p">)</span>
<span class="n">mae</span> <span class="o">=</span> <span class="n">mean_absolute_error</span><span class="p">(</span><span class="n">test_target</span><span class="p">,</span> <span class="n">test_prediction</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">mae</span><span class="p">)</span>
</code></pre></div></div>

<p>Result » 19.157142857142862</p>
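
<p>What mean_absolute_error computes can be reproduced by hand: take the absolute difference for each sample and average them. A minimal sketch, with a hypothetical function name and invented numbers:</p>

```python
def mae_manual(targets, predictions):
    """Average of |target - prediction| over all samples."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)

# Toy values: the errors are 10, 10 and 30, so the mean error is 50 / 3.
print(mae_manual([100.0, 200.0, 300.0], [110.0, 190.0, 330.0]))
```

<p>Unlike R², the result is in the same unit as the target (grams here), which makes it easy to interpret.</p>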

<hr />

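<p>Check the score on the training set</p>

<p>The model was fitted on the training set, so the training score would normally be the higher one, yet here the test score is higher.<br />
This suggests the model is underfitting.</p>

<p>Overfitting: the training score is good but the test score is much worse<br />
Underfitting: the test score is higher than the training score, or both scores are too low</p>

<p>Underfitting happens when the training and test sets are very small, or when the model is too simple to be trained properly.</p>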

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">knr</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">train_input</span><span class="p">,</span><span class="n">train_target</span><span class="p">))</span>
</code></pre></div></div>

<p>Result » 0.9698823289099254</p>
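
<p>The rules of thumb above can be turned into a small check. This is only a rough sketch: diagnose_fit is a hypothetical helper, and the thresholds low and gap are arbitrary values chosen for illustration, not standard cut-offs.</p>

```python
def diagnose_fit(train_score, test_score, low=0.7, gap=0.1):
    """Label a (train, test) score pair using the rules of thumb from the text.

    `low` (what counts as "too low") and `gap` (what counts as "much worse")
    are illustrative assumptions.
    """
    if test_score > train_score or (train_score < low and test_score < low):
        return "underfitting"
    if train_score - test_score > gap:
        return "overfitting"
    return "ok"

# Scores reported in this post for the default model (k=5)
print(diagnose_fit(0.9698823289099254, 0.992809406101064))  # underfitting
```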

<hr />

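<p>To address the underfitting, make the model a little more complex.</p>

<p>A k-nearest neighbors model becomes more complex as k decreases<br />
With fewer neighbors the model becomes sensitive to local patterns in the training set; with more neighbors it follows the general trend of the data as a whole.</p>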
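
<p>The effect of k is easy to see at the two extremes. In the sketch below (toy data and a hypothetical helper, not scikit-learn code), k=1 reproduces every training target exactly, while k equal to the number of samples predicts the overall mean everywhere.</p>

```python
def predict_knn(train_x, train_y, x, k):
    """Mean target of the k training points nearest to x (one feature)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return sum(train_y[i] for i in order[:k]) / k

# Toy data, invented for illustration only
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]

for k in (1, 3, 6):
    preds = [predict_knn(train_x, train_y, x, k) for x in train_x]
    print(k, [round(p, 1) for p in preds])
```

<p>Small k follows every local wiggle of the training data (more complex), large k smooths everything toward the average (simpler).</p>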

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">knr</span><span class="p">.</span><span class="n">n_neighbors</span><span class="o">=</span><span class="mi">3</span>

<span class="n">knr</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_input</span><span class="p">,</span> <span class="n">train_target</span><span class="p">)</span>
</code></pre></div></div>

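<p>Check the test and training scores</p>

<p>The test score is now lower than the training score, and the two remain close, so the underfitting problem is resolved.</p>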

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">knr</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">train_input</span><span class="p">,</span> <span class="n">train_target</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">knr</span><span class="p">.</span><span class="n">score</span><span class="p">(</span><span class="n">test_input</span><span class="p">,</span> <span class="n">test_target</span><span class="p">))</span>
</code></pre></div></div>

<p>Result » 0.9804899950518966<br />
0.9746459963987609</p>]]></content><author><name>birdfoot</name></author><category term="혼공머신" /><summary type="html"><![CDATA[Supervised learning algorithms]]></summary></entry></feed>