데이터 분석을 위한 SQL 레시피 8장 공부

728x90

데이터를 무기로 삼기 위한 분석 기술

검색 기능 평가하기

검색 기능 개선 방법
- 검색 키워드의 흔들림을 흡수할 수 있게 '동의어 사전 추가'하기
  - 흔들름 : 검색 엔진의 키워드를 설명할 때 역시 단어가 정확하지 않고 조금 떨리는 현상
- 검색 키워드를 검색 엔진이 이해할 수 있게 '사용자 사전 추가'하기
- 검색 결과가 사용자가 원하는 순서로 나오게 '정렬 순서 조정'하기

NoMatch 비율 집계

NoMatch 비율
- 검색 총 수 중에서 검색 결과를 0으로 리턴하는 검색 결과 비율

select to_char(stamp, 'YYYY-MM-DD') AS dt
	, count(1) as search_count
	, sum(case when result_num = 0 then 1 else 0 end) as no_match_count
	, avg(case when result_num = 0 then 1 else 0 end) as no_match_rate
from access_log
where action = 'search'
group by dt;

NoMatch 키워드 집계
- 100.0*count(1)/count(1) over() as search_share
  - 전체 검색 횟수에 대한 각 키워드의 검색 비율을 백분율로 계산

-- 검색 키워드 전체 집계 결과
with search_keyword_stat as (
	select keyword
		, result_num
		, count(1) as search_count
		, 100.0*count(1)/count(1) over() as search_share
	from access_log
	where action = 'search'
	group by keyword, result_num
)

-- NoMatch 키워드 집계 결과
select keyword
	, search_count
	, search_share
	, 100.0*search_count / sum(search_count) over() no_match_share
from search_keyword_stat
where result_num = 0;

조건을 선택하는 카테고리 검색에서도 NoMatch 비율이 중요한 지표가 될 수 있다. 어떤 경우에도 검색 결과가 0이 나오지 않게 여러 가지 대책을 세워야 한다.

재검색

재검색 비율
- 사용자가 검색 결과의 출력과 관계 없이, 어떤 결과도 클릭하지 않고 새로 검색을 실행한 비율
검색 화면과 상세 화면의 접근 로그 다음 줄의 액션을 기록
- action과 next_action 모두가 search인 레코드는 재검색

with access_log_with_next_action as (
	select stamp
		, session 
		, action
		, lead(action) over(partition by session order by stamp) as next_action
	from access_log al 
)

select *
from access_log_with_next_action
order by session, stamp;

재검색 비율 집계

with access_log_with_next_action as (
	select stamp
		, session 
		, action
		, lead(action) over(partition by session order by stamp) as next_action
	from access_log al 
)

select substring(cast(stamp::date as text), 1, 10) as dt
	, count(1) as search_count
	, sum(case when next_action = 'search' then 1 else 0 end) as retry_count
	, avg(case when next_action = 'search' then 1 else 0 end) as retry_rate
from access_log_with_next_action
where action = 'search'
group by dt
order by dt;

재검색 키워드 집계
- 재검색 키워드를 집계하면 동의어 사전이 흔들림을 잡지 못하는 범위를 쉽게 확인 가능

with access_log_with_next_search as (
	select stamp
		, session
		, action
		, keyword
		, result_num
		, lead(action) over(partition by session order by stamp) as next_action
		, lead(keyword) over(partition by session order by stamp) as next_keyword
		, lead(result_num) over(partition by session order by stamp) as next_result_num
	from access_log al 
)

select keyword
	, result_num
	, count(1) as retry_count
	, next_keyword
	, next_result_num
from access_log_with_next_search
where action = 'search' and next_action = 'search'
group by keyword, result_num, next_keyword, next_result_num;

재검색 키워드를 분류해서 집계하기

NoMatch에서 조건 변경
- 해당 키워드는 동의어 사전과 사용자 사전에 추가할 키워드 후보들

with access_log_with_next_search as (
	select stamp
		, session
		, action
		, keyword
		, result_num
		, lead(action) over(partition by session order by stamp) as next_action
		, lead(keyword) over(partition by session order by stamp) as next_keyword
		, lead(result_num) over(partition by session order by stamp) as next_result_num
	from access_log al 
)

select keyword
	, result_num
	, count(1) as retry_count
	, next_keyword
	, next_result_num
from access_log_with_next_search
where action = 'search' and next_action = 'search'
	-- NoMatch 로그만 필터링
	and result_num = 0
group by keyword, result_num, next_keyword, next_result_num;

검색 결과 필터링
- 재검색한 검색 키워드가 원래의 검색 키워드를 포함하고 있다면, 검색을 조금 더 필터링하고 싶다는 의미
- 자주 사용되는 필터링 키워드가 있다면, 연관 검색어 등으로 출력해서 사용자가 요구하는 콘텐츠로 빠르게 유도

with access_log_with_next_search as (
	select stamp
		, session
		, action
		, keyword
		, result_num
		, lead(action) over(partition by session order by stamp) as next_action
		, lead(keyword) over(partition by session order by stamp) as next_keyword
		, lead(result_num) over(partition by session order by stamp) as next_result_num
	from access_log al 
)

select keyword
	, result_num
	, count(1) as retry_count
	, next_keyword
	, next_result_num
from access_log_with_next_search
where action = 'search' and next_action = 'search'
	-- 원래 키워드를 포함하는 경우만 추출
	and next_keyword like concat('%', keyword, '%')
group by keyword, result_num, next_keyword, next_result_num;

검색 키워드 변경
- 완전히 다른 키워드를 사용해 재검색 했다면, 원래 검색 키워드를 사용한 결과에 원하는 내용이 없다는 뜻
  - 즉, 동의어 사전이 제대로 기능하지 못한다는 것

with access_log_with_next_search as (
	select stamp
		, session
		, action
		, keyword
		, result_num
		, lead(action) over(partition by session order by stamp) as next_action
		, lead(keyword) over(partition by session order by stamp) as next_keyword
		, lead(result_num) over(partition by session order by stamp) as next_result_num
	from access_log al 
)

select keyword
	, result_num
	, count(1) as retry_count
	, next_keyword
	, next_result_num
from access_log_with_next_search
where action = 'search' and next_action = 'search'
	-- 원래 키워드를 포함하는 경우만 추출
	and next_keyword not like concat('%', keyword, '%')
group by keyword, result_num, next_keyword, next_result_num;

검색 이탈

검색 결과에 만족하지 못한 경우
검색 이탈 집계
- next_action is null

with access_log_with_next_action as (
	select stamp
		, session
		, action
		, lead(action) over(partition by session order by stamp) as next_action
	from access_log
)

select substring(cast(stamp::date as text), 1, 10) as dt
	, count(1) as search_count
	, sum(case when next_action is null then 1 else 0 end) as exit_count
	, avg(case when next_action is null then 1.0 else 0.0 end) as exit_rate
from access_log_with_next_action
where action = 'search'
group by dt
order by dt;

검색 이탈 키워드 집계

with access_log_with_next_search as (
	select stamp
		, session
		, action
		, keyword
		, result_num
		, lead(action) over(partition by session order by stamp) as next_action
		, lead(keyword) over(partition by session order by stamp) as next_keyword
		, lead(result_num) over(partition by session order by stamp) as next_result_num
	from access_log al 
)

select keyword
	, count(1) as search_count
	, sum(case when next_action is null then 1 else 0 end) as exit_count
	, avg(case when next_action is null then 1.0 else 0.0 end) as exit_rate
	, result_num
from access_log_with_next_search
where action = 'search'
group by keyword, result_num
-- 키워드 전체의 이탈률을 계산한 후, 이탈률이 0보다 큰 키워드만 추출
having sum(case when next_action is null then 1 else 0 end) > 0;

검색 결과의 포괄성을 지표화하기

재현율을 사용해 검색의 포괄성 평가
- 재현율 (Recall)
  - 어떤 키워드의 검색 결과에서 미리 준비한 정답 아이템이 얼마나 나왔는지를 비율로 나타낸 것
검색 결과와 정답 아이템을 결합

-- 검색 결과와 정답 아이템을 결합하는 쿼리
with search_result_with_correct_items as (
	select coalesce(r.keyword, c.keyword) as keyword
		, r.rank
		, coalesce(r.item, c.item) as item
		, case when c.item is not null then 1 else 0 end as correct
	from search_result as r
	full outer join correct_result as c
		on r.keyword = c.keyword
			and r.item = c.item
)
select *
from search_result_with_correct_items
order by keyword, rank;

검색 결과 상위 n개의 재현율 계산

, search_result_with_recall as (
	select *
		-- 검색 결과 상위에서 정답 데이터에 포함되는 아이템 수의 누계 구하기
		, sum(correct)
			-- rank가 null이라면 정렬 순서의 마지막에 위치하므로 편의상 굉장히 큰 값으로 변환해서 넣기
			over(partition by keyword order by coalesce(rank, 100000)
				rows between unbounded preceding and current row) as cum_correct
		, case 
			-- 검색 결과에 포함되지 않은 아이템은 편의상 적합률을 0으로 다루기
			when rank is null then 0.0
			else 100.0
				*sum(correct)
					over(partition by keyword order by coalesce(rank, 100000)
						rows between unbounded preceding and current row)
				/ sum(correct) over(partition by keyword)
		end as recall
	from search_result_with_correct_items
)

select *
from search_result_with_recall
order by keyword, rank;

검색 결과 상위 5개의 재현율을 키워드별로 추출하는 쿼리

, recall_over_rank_5 as (
	select keyword
		, rank 
		, recall
		-- 검색 결과 순위가 높은 순서로 번호 붙이기
		-- 검색 결과에 나오지 않는 아이템은 편의상 0으로 다루기
		, row_number() over(partition by keyword order by coalesce(rank, 0) desc) as desc_number
	from search_result_with_recall
	-- 검색 결과 상위 5개 이하 또는 검색 결과에 포함되지 않은 아이템만 출력하기
	where coalesce(rank, 0) <= 5
)

select keyword
	, recall as recall_at_5
from recall_over_rank_5
-- 검색 결과 상위 5개 중에서 가장 순위가 높은 레코드 추출
where desc_number = 1;

검색 엔진 전체의 평균 재현율을 계산하는 쿼리

select avg(recall) as average_recall_at_5
from recall_over_rank_5
-- 검색 결과 상위 5개 중에서 가장 순위가 높은 레코드 추출
where desc_number = 1;

재현율은 정답 아이템에 포함되는 아이템을 어느 정도 망라할 수 있는지를 나타내는 지표
- 망라하다 : 널리 받아들여 모두 포함하다

검색 결과의 타당성을 지표화하기

정확률 (Precision)
- 검색 결과에 포함되는 아이템 중 정답 아이템이 어느 정도 비율로 포함되는지
검색 결과 상위 n개의 정확률 계산하는 쿼리

-- 검색 결과와 정답 아이템을 결합하는 쿼리
with search_result_with_correct_items as (
	select coalesce(r.keyword, c.keyword) as keyword
		, r.rank
		, coalesce(r.item, c.item) as item
		, case when c.item is not null then 1 else 0 end as correct
	from search_result as r
	full outer join correct_result as c
		on r.keyword = c.keyword
			and r.item = c.item
)
, search_result_with_precision as (
	select *
		-- 검색 결과의 상위에서 정답 데이터에 포함되는 아이템 수의 누계 구하기
		, sum(correct)
			-- rank가 null이라면 정렬 순서의 마지막에 위치하므로
			-- 편의상 굉장히 큰 값으로 변환해서 넣기
			over(partition by keyword order by coalesce(rank, 100000)
				rows between unbounded preceding and current row) as cum_correct
		, case 
			-- 검색 결과에 포함되지 않은 아이템은 편의상 적합률을 0으로 다루기
			when rank is null then 0.0
			else 
				100.0
				* sum(correct)
					over(partition by keyword order by coalesce(rank, 100000)
						rows between unbounded preceding and current row)
				-- 재현률과 다르게 분모에 검색 결과 순위까지의 누계 아이템 수 지정
				/ count(1)
					over(partition by keyword order by coalesce(rank, 100000)
						rows between unbounded preceding and current row)
		end as precision 
	from search_result_with_correct_items
)

select *
from search_result_with_precision
order by keyword, rank;

검색 결과 상위 5개의 정확률을 키워드별로 추출하는 쿼리

, precision_over_rank_5 as (
	select keyword
		, rank 
		, precision
		-- 검색 결과 순위가 높은 순서로 번호 붙이기
		-- 검색 결과에 나오지 않는 아이템은 편의상 0으로 다루기
		, row_number() over(partition by keyword order by coalesce(rank, 0) desc) as desc_number
	from search_result_with_precision
	-- 검색 결과 상위 5개 이하 또는 검색 결과에 포함되지 않은 아이템만 출력하기
	where coalesce(rank, 0) <= 5
)

select keyword
	, precision as precision_at_5
from precision_over_rank_5
-- 검색 결과 상위 5개 중에서 가장 순위가 높은 레코드만 추출
where desc_number = 1;

select *
from search_result_with_precision
order by keyword, rank;

검색 엔진 전체의 평균 정확률을 계산하는 쿼리

select avg(precision) as average_precision_at_5
from precision_over_rank_5
-- 검색 결과 상위 5개 중에서 가장 순위가 높은 레코드만 추출
where desc_number = 1;

정확률은 검색 결과 상위에 출력되는 아이템의 타당성을 나타내는 지표
이 지표는 웹 검색 등 방대한 데이터 검색에서 적절한 검색 결과를 빠르게 찾고 싶은 경우에 굉장히 중요한 지표

검색 결과 순위와 관련된 지표 계산하기

MAP (Mean Average Precision)
- 검색 결과 상위 N개의 적합률 평균

어소시에이션 분석

데이터 마이닝 분야 중 하나인 '상관 규칙 추출'의 대표적인 방법
- ex) '상품 a를 구매한 사람의 60%는 상품 b도 구매했다'등의 상관규칙을 대량의 데이터에서 찾아내는 것을 의미
어소시에이션 분석에 사용되는 지표
- 지지도 (Support)
  - 상관 규칙이 어느 정도의 확률로 발생하는지 나타내는 값
- 확신도, 신뢰도 (Confidence)
  - 어떤 결과가 어느 정도의 확률로 발생하는지 의미하는 값
- 리프트 (Lift)
  - 어떤 조건을 만족하는 경우의 확률을 사전 조건 없이 해당 결과가 일어날 확률로 나눈 값
  - 보통 리프트 값이 1.0 이상이면 좋은 규칙이라고 판단
두 상품의 연관성을 어소시에이션 분석으로 찾기

-- 구매 로그 수와 상품별 구매 수를 세는 쿼리
with purchase_id_count as (
	-- 구매 상세 로그에서 유니크한 구매 로그 수 계산하기
	select count(distinct purchase_id) as purchase_count
	from purchase_detail_log
)
, purchase_detail_log_with_counts as (
	select d.purchase_id
		, p.purchase_count
		, d.product_id
		-- 상품별 구매 수 계산
		, count(1) over(partition by d.product_id) as product_count
	from purchase_detail_log as d
	-- 구매 로그 수를 모든 레코드 수와 결함
	cross join purchase_id_count as p
)

select *
from purchase_detail_log_with_counts
order by product_id, purchase_id;

-- 상품 조합별로 구매 수를 세는 쿼리
, product_pair_with_stat as (
	select l1.product_id as p1
		, l2.product_id as p2
		, l1.product_count as p1_count
		, l2.product_count as p2_count
		, count(1) as p1_p2_count
		, l1.purchase_count as purchase_count
	from purchase_detail_log_with_counts as l1
	join purchase_detail_log_with_counts as l2
		on l1.purchase_id = l2.purchase_id
	-- 같은 상품 조합 제외
	where l1.product_id <> l2.product_id
	group by l1.product_id, l2.product_id, l1.product_count, l2.product_count, l1.purchase_count
)

select *
from product_pair_with_stat
order by p1, p2;

어소시에이션 분석에 필요한 '구매 로그 총 수', '상품 a의 구매 수', '상품 b의 구매 수', '상품 a와 상품 b 동시 구매 수' 4개의 키 페어 구함
어소시에이션 분석의 지표 계산

select p1
	, p2
	, 100.0 * p1_p2_count / purchase_count as support
	, 100.0 * p1_p2_count / p1_count as confidence
	, (100.0 *p1_p2_count / p1_count)
		/ (100.0 * p2_count / purchase_count) as lift
from product_pair_with_stat
order by p1, p2;

특정 아이템에 흥미가 잇는 사람이 함께 찾아보는 아이템 검색

열람 수와 구매 수를 조합한 점수를 계산하는 쿼리

with rating as (
	select user_id
		, product
		-- 상품 열람 수
		, sum(case when action = 'view' then 1 else 0 end) as view_count
		-- 상품 구매 수
		, sum(case when action = 'purchase' then 1 else 0 end) as purchase_count
		-- 열람 수와 구매 수에 3:7의 비율의 가중치 주어 평균 구하기
		, 0.3*sum(case when action = 'view' then 1 else 0 end)
		+ 0.7*sum(case when action = 'purchase' then 1 else 0 end) as score
	from action_log
	group by user_id, product
)

select *
from rating
order by user_id, score desc;

아이템 사이의 유사도를 계산하고 순위를 생성하는 쿼리

with rating as (
	select user_id
		, product
		-- 상품 열람 수
		, sum(case when action = 'view' then 1 else 0 end) as view_count
		-- 상품 구매 수
		, sum(case when action = 'purchase' then 1 else 0 end) as purchase_count
		-- 열람 수와 구매 수에 3:7의 비율의 가중치 주어 평균 구하기
		, 0.3*sum(case when action = 'view' then 1 else 0 end)
		+ 0.7*sum(case when action = 'purchase' then 1 else 0 end) as score
	from action_log
	group by user_id, product
)

select r1.product as target
	, r2.product as related
	-- 모든 아이템을 열람/구매한 사용자 수
	, count(r1.user_id) as users
	-- 스코어들을 곱하고 합계를 구해 유사도 계산
	, sum(r1.score*r2.score) as score
	-- 상품 유사도 순위 구하기
	, row_number() over(partition by r1.product order by sum(r1.score*r2.score) desc) as rank 
from rating as r1
join rating as r2
	-- 공통 사용자가 존재하는 상품의 페어 만들기
	on r1.user_id = r2.user_id
-- 같은 아이템의 경우에는 페어 제외
where r1.product <> r2.product
group by r1.product, r2.product
order by target, rank;

아이템 벡터를 L2 정규화하는 쿼리

with rating as (
	select user_id
		, product
		-- 상품 열람 수
		, sum(case when action = 'view' then 1 else 0 end) as view_count
		-- 상품 구매 수
		, sum(case when action = 'purchase' then 1 else 0 end) as purchase_count
		-- 열람 수와 구매 수에 3:7의 비율의 가중치 주어 평균 구하기
		, 0.3*sum(case when action = 'view' then 1 else 0 end)
		+ 0.7*sum(case when action = 'purchase' then 1 else 0 end) as score
	from action_log
	group by user_id, product
)

-- 아이템 벡터 정규화하기
, product_base_normalized_ratings as (
	select user_id
		, product
		, score
		, sqrt(sum(score*score) over(partition by product)) as norm
		, score / SQRT(sum(score*score) over(partition by product )) as norm_score
	from rating
)

select *
from product_base_normalized_ratings;

정규화된 점수로 아이템의 유사도를 계산하는 쿼리
- 자기자신과의 유사도는 1.0
  - 내적의 값이 0이라면 전혀 유사성이 없는 아이템
- 정규화 전과 후의 순위가 달라질 수 있음

with rating as (
	select user_id
		, product
		-- 상품 열람 수
		, sum(case when action = 'view' then 1 else 0 end) as view_count
		-- 상품 구매 수
		, sum(case when action = 'purchase' then 1 else 0 end) as purchase_count
		-- 열람 수와 구매 수에 3:7의 비율의 가중치 주어 평균 구하기
		, 0.3*sum(case when action = 'view' then 1 else 0 end)
		+ 0.7*sum(case when action = 'purchase' then 1 else 0 end) as score
	from action_log
	group by user_id, product
)

-- 아이템 벡터 정규화하기
, product_base_normalized_ratings as (
	select user_id
		, product
		, score
		, sqrt(sum(score*score) over(partition by product)) as norm
		, score / SQRT(sum(score*score) over(partition by product )) as norm_score
	from rating
)


select r1.product as target
	, r2.product as related
	-- 모든 아이템을 열람/구매한 사용자 수
	, count(r1.user_id) as users
	-- 스코어들을 곱하고 합계를 구해 유사도 계산
	, sum(r1.score*r2.score) as score
	, sum(r1.norm_score * r2.norm_score) as norm_score
	-- 상품 유사도 순위 구하기
	, row_number() over(partition by r1.product order by sum(r1.norm_score*r2.norm_score) desc) as rank 
from product_base_normalized_ratings as r1
join product_base_normalized_ratings as r2
	-- 공통 사용자가 존재하는 상품의 페어 만들기
	on r1.user_id = r2.user_id
group by r1.product, r2.product
order by target, rank;

당신을 위한 추천 상품

사용자끼리의 유사도 계산하는 쿼리

with ratings as (
	select user_id
		, product
		-- 상품 열람 수
		, sum(case when action = 'view' then 1 else 0 end) as view_count
		-- 상품 구매 수
		, sum(case when action = 'purchase' then 1 else 0 end) as purchase_count
		-- 열람 수와 구매 수에 3:7의 비율의 가중치 주어 평균 구하기
		, 0.3*sum(case when action = 'view' then 1 else 0 end)
		+ 0.7*sum(case when action = 'purchase' then 1 else 0 end) as score
	from action_log
	group by user_id, product
)
-- 사용자 벡터 정규화하기
, user_base_normalized_ratings as (
	select user_id
		, product
		, score
		, sqrt(sum(score*score) over(partition by user_id)) as norm
		, score / SQRT(sum(score*score) over(partition by user_id)) as norm_score
	from ratings
)

-- 경향이 비슷한 사용자 찾기
, related_users as (
	select r1.user_id
		, r2.user_id as related_user
		, count(r1.product) as products
		, sum(r1.norm_score * r2.norm_score) as score
		, row_number() over(partition by r1.user_id order by sum(r1.norm_score * r2.norm_score) desc) as rank 
	from user_base_normalized_ratings as r1
	join user_base_normalized_ratings as r2
		on r1.product = r2.product
	where r1.user_id <> r2.user_id
	group by r1.user_id, r2.user_id
)

select *
from related_users
order by user_id, rank;

순위가 높은 유사 사용자를 기반으로 추천 아이템을 추출하는 쿼리


, related_user_base_products as (
	select u.user_id
		, r.product
		, sum(u.score*r.score) as score
		, row_number() over(partition by u.user_id order by sum(u.score*r.score) desc) as rank 
	from related_users as u
	join ratings as r
		on u.related_user = r.user_id
	where u.rank <= 1
	group by u.user_id, product
)

select *
from related_user_base_products
order by user_id;

이미 구매한 아이템을 필터링

select p.user_id
	, p.product
	, p.score
	, row_number() over(partition by p.user_id order by p.score desc) as rank 
from related_user_base_products as p
left join ratings as r
	on p.user_id = r.user_id and p.product = r.product
-- 대상 사용자가 구매하지 않은 상품만 추천
where coalesce(r.purchase_count, 0) = 0
order by p.user_id;

값의 범위가 다른 지표를 정규화해서 비교 가능한 상태로 만들기

Min-Max 정규화
- 각 지표를 0~1의 스케일로 정규화하는 방법

select user_id
	, product
	, view_count as v_count
	, purchase_count as p_count
	, 1.0 * (view_count - min(view_count) over()) 
	/ nullif((max(view_count) over() - min(view_count) over()), 0) as norm_v_count
	, 1.0 * (purchase_count - min(purchase_count) over()) 
	/ nullif((max(purchase_count) over() - min(purchase_count) over()), 0) as norm_p_count
from action_counts
order by user_id, product;

시그모이드 함수
- 곡선, 0에 가까울수록 변환 후의 값에 미치는 영향이 큼
- 0~1
- 함수를 사용해 비선형 변환을 하는 것이 직관적으로 더 이해하기 쉬움

select user_id
	, product
	, view_count as v_count
	, purchase_count as p_count
	-- 게인을 0.1로 사용한 시그모이드 함수
	, 2.0 / (1 + exp(-0.1*view_count)) - 1.0 as sigm_v_count
	-- 게인을 10으로 사용한 시그모이드 함수
	, 2.0 / (1 + exp(-10*purchase_count)) - 1.0 as sigm_p_count
from action_counts
order by user_id, product;

로그 변환

열럼 수와 구매 수처럼 '어떤 것을 세어 집계한 숫자'는 점수를 계산할 때, 로그를 취해서 값의 변화를 완만하게 표현 가능
이를 활용하면 사람이 더 직감적으로 쉽게 값의 변화를 인지할 수 있음

가중평균

분기별 상품 매출액과 매출 합계 집계

with item_sales_per_quaters as (
	select item
		-- 2016년 1분기 상품 매출 모두 더하기
		, sum(case when year_month in ('2016-01', '2016-02', '2016-03') then amount else 0 end) as sales_2016_q1
		-- 2016년 4분기의 상품 매출 모두 더하기
		, sum(case when year_month in ('2016-10', '2016-11', '2016-12') then amount else 0 end) as sales_2016_q4
	from monthly_sales 
	group by item
)

select item
	-- 2016년 1분기 상품 매출
	, sales_2016_q1
	-- 2016년 1분기의 상품 매출 합계
	, sum(sales_2016_q1) over() as sum_sales_2016_q1
	-- 2016년 4분기 상품 매출
	, sales_2016_q4
	-- 2016년 4분기의 상품 매출 합계
	, sum(sales_2016_q4) over() as sum_sales_2016_q4
from item_sales_per_quaters;

1분기 합계 매출액 < 4분기 합계 매출액
분기별 상품 매출액을 기반으로 점수 계산
- min_max 정규화

with item_sales_per_quaters as (
	select item
		-- 2016년 1분기 상품 매출 모두 더하기
		, sum(case when year_month in ('2016-01', '2016-02', '2016-03') then amount else 0 end) as sales_2016_q1
		-- 2016년 4분기의 상품 매출 모두 더하기
		, sum(case when year_month in ('2016-10', '2016-11', '2016-12') then amount else 0 end) as sales_2016_q4
	from monthly_sales 
	group by item
)

, item_scores_per_quaters as (
	select item
		, sales_2016_q1
		, 1.0
			* (sales_2016_q1 - min(sales_2016_q1) over())
			/ nullif(max(sales_2016_q1) over() - min(sales_2016_q1) over(), 0) as score_2016_q1
		, sales_2016_q4
		, 1.0
			* (sales_2016_q4 - min(sales_2016_q4) over())
			/ nullif(max(sales_2016_q4) over() - min(sales_2016_q4) over(), 0) as score_2016_q4
	from item_sales_per_quaters
)

select * from item_scores_per_quaters;

분기별 상품 점수 가중 평균으로 순위를 생성

select item
	, 0.7 * score_2016_q1 + 0.3 * score_2016_q4 as score
	, row_number() over(order by 0.7*score_2016_q1 + 0.3*score_2016_q4 desc) as rank 
from item_scores_per_quarters
order by rank;

728x90

'데이터 분석 > SQL 분석' 카테고리의 다른 글

데이터 분석을 위한 SQL 레시피 7장 공부 (0)	2024.06.14
데이터 분석을 위한 SQL 레시피 6장 공부 (0)	2024.06.13
RFM 분석 (2)	2024.06.13
데이터 분석을 위한 SQL 레시피 5장 공부 (0)	2024.06.12
Z차트를 작성하는 SQL 쿼리 (0)	2024.06.12

DA로 가는 길

데이터 분석을 위한 SQL 레시피 8장 공부

데이터를 무기로 삼기 위한 분석 기술

검색 기능 평가하기

NoMatch 비율 집계

재검색

재검색 키워드를 분류해서 집계하기

검색 이탈

검색 결과의 포괄성을 지표화하기

검색 결과의 타당성을 지표화하기

검색 결과 순위와 관련된 지표 계산하기

어소시에이션 분석

추천 시스템

특정 아이템에 흥미가 잇는 사람이 함께 찾아보는 아이템 검색

당신을 위한 추천 상품

값의 범위가 다른 지표를 정규화해서 비교 가능한 상태로 만들기

로그 변환

가중평균

'데이터 분석 > SQL 분석' 카테고리의 다른 글

댓글

티스토리툴바

데이터 분석을 위한 SQL 레시피 8장 공부

데이터를 무기로 삼기 위한 분석 기술

검색 기능 평가하기

NoMatch 비율 집계

재검색

재검색 키워드를 분류해서 집계하기

검색 이탈

검색 결과의 포괄성을 지표화하기

검색 결과의 타당성을 지표화하기

검색 결과 순위와 관련된 지표 계산하기

어소시에이션 분석

추천 시스템

특정 아이템에 흥미가 잇는 사람이 함께 찾아보는 아이템 검색

당신을 위한 추천 상품

값의 범위가 다른 지표를 정규화해서 비교 가능한 상태로 만들기

로그 변환

가중평균

'데이터 분석 > SQL 분석' 카테고리의 다른 글

관련글

댓글

티스토리툴바