데이터 분석을 위한 SQL 레시피 7장 공부

728x90

데이터 활용의 정밀도를 높이는 분석 기술

세션별로 페이지 열람 수 랭킹 비율 구하기

PERCENT_RANK의 값은 (rank - 1) / (전체수 - 1)

with session_count as (
	select session 
		, count(1) as count
	from action_log_with_noise
	group by session
)

select session
	, count
	, rank() over(order by count desc) as rank 
	, percent_rank() over(order by count desc) as percent_rank
from session_count;

데이터 타당성 확인하기

로그 데이터의 요건을 만족하는지 확인하는 쿼리

select action 
	-- session은 반드시 null이 아니어야 함
	, avg(case when session is not null then 1.0 else 0.0 end) as session
	-- user_id는 반드시 null이 아니어야 함
	, avg(case when user_id is not null then 1.0 else 0.0 end) as user_id
	-- category는 action=view의 경우 null, 이외의 경우는 null이 아니어야 함
	, avg(
		case action
			when 'view' then
				case when category is null then 1.0 else 0.0 end
			else
				case when category is not null then 1.0 else 0.0 end
		end
	) as category
	-- product는 action=view의 경우 null, 이외의 경우 null이 아니어야 함
	, avg(
		case action
			when 'view' then 
				case when products is null then 1.0 else 0.0 end
			else
				case when products is not null then 1.0 else 0.0 end
		end
	) as products
	-- amount는 action=purchase의 경우 null이 아니어야 하며 이외의 경우 null
	, avg(
		case action
			when 'purchase' then
				case when amount is not null then 1.0 else 0.0 end
			else
				case when amount is null then 1.0 else 0.0 end
		end
	) as amount
	-- stamp는 반드시 null이 아니어야 함
	, avg(case when stamp is not null then 1.0 else 0.0 end) as stamp
from invalid_action_log 
group by action;

데이터 중복 검출하기

데이터 무결성
- 데이터의 정확성, 일관성, 유효성이 유지되는 것을 의미
- 정확성이란 중복이나 누락이 없는 상태를 뜻하고, 일관성은 원인과 결과의 의미가 연속적으로 보장되어 변하지 않는 상태를 뜻함
키가 중복되는 레코드 확인
- STRING_AGG("합칠컬럼명", "구분자")
  - 컬럼 값을 지정한 구분자로 연결하여 하나의 문자열로 합칩

select id
    , count(*) as record_num
    -- 데이터를 배열로 집약하고, 쉼표로 구분된 문자열로 변환
    , string_agg(name, ',') as name_list
    , string_agg(stamp, ',') as stamp_list
from mst_categories
group by id
-- 중복된 id 확인
having count(*) > 1;

중복 데이터 배제하기

ROW_NUMBER를 사용해 중복을 배제

with dup_action_log_with_order_num as (
	select *
		-- 중복된 데이터에 순번 붙이기
		, row_number() over(partition by session, user_id, action, products order by stamp) as order_num
	from dup_action_log 
)

select session
	, user_id
	, action
	, products
	, stamp
from dup_action_log_with_order_num
where order_num = 1;

이전 액션으로부터의 경과 시간을 계산
- EXTRACT(날짜/시간 데이터 요소 from timestamp)
  - 특정날짜부분을 추출
  - epoch : 1970년 1월 1일 00:00:00 UTC 부터 현재까지의 초

with dup_action_log_with_lag_seconds as (
	select user_id
		, action
		, products
		, stamp
		-- 같은 사용자와 상품 조합에 대한 이전 액션으로부터의 경과 시간 계산
		, extract (epoch from stamp::timestamp - lag(stamp::timestamp)
			over(
				partition by user_id, action, products
				order by stamp
			)) as lag_seconds
	from dup_action_log 
)

select *
from dup_action_log_with_lag_seconds
order by stamp;

30분 이내의 같은 액션을 중복으로 보고 배제

select user_id
	, action
	, products
	, stamp
from dup_action_log_with_lag_seconds
where (lag_seconds is null or lag_seconds >= 30*60)
order by stamp;

변경된 마스터 데이터 모두 추출

IS DISTINCT FROM
- 한쪽에만 NULL이 있는 레코드를 확인할 때

select coalesce(new_mst.product_id, old_mst.product_id) as product_id
	, coalesce(new_mst.name, old_mst.name) as name
	, coalesce(new_mst.price, old_mst.price) as price
	, coalesce(new_mst.updated_at, old_mst.updated_at) as updated_at
	, case 
		when old_mst.updated_at is null then 'added'
		when new_mst.updated_at is null then 'deleted'
		when new_mst.updated_at <> old_mst.updated_at then 'updated'
	end as status
from mst_products_20170101 as new_mst
full outer join mst_products_20161201 as old_mst
	on new_mst.product_id = old_mst.product_id
where new_mst.updated_at is distinct from old_mst.updated_at;

두 순위의 유사도 계산하기

3개의 지표를 기반으로 순위를 작성

-- 경로별 방문 횟수, 방문자 수, 페이지 뷰 계산
with path_stat as (
	select path
		, count(distinct long_session) as access_users
		, count(distinct short_session) as access_count
		, count(*) as page_view
	from access_log
	group by path
)

-- 방문 횟수, 방문자 수, 페이지 뷰별로 순위 붙이기
, path_ranking as (
	select 'access_user' as type
		, path
		, rank() over(order by access_users desc) as rank
	from path_stat
	
	union all
	
	select 'access_count' as type
		, path
		, rank() over(order by access_count desc) as rank
	from path_stat
	
	union all
	
	select 'page_view' as type
		, path
		, rank() over(order by page_view desc) as rank
	from path_stat
)

select *
from path_ranking
order by type, rank;

경로별 순위들의 차이 계산

, pair_ranking as (
	select r1.path
		, r1.type as type1
		, r1.rank as rank1
		, r2.type as type2
		, r2.rank as rank2
		-- 순위 차이 계산
		, power(r1.rank - r2.rank, 2) as diff
	from path_ranking as r1
	join path_ranking as r2
		on r1.path = r2.path
)

select *
from pair_ranking
order by type1, type2, rank1;

스피어만 상관계수 계산
- 완전히 일치할 경우 1.0, 완전히 일치하지 않을 경우 -1.0

select type1
	, type2
	, 1 - (6.0*sum(diff) / (power(count(1), 3) - count(1))) as spearman
from pair_ranking
group by type1, type2
order by type1, spearman desc;

728x90

'데이터 분석 > SQL 분석' 카테고리의 다른 글

데이터 분석을 위한 SQL 레시피 8장 공부 (0)	2024.06.14
데이터 분석을 위한 SQL 레시피 6장 공부 (0)	2024.06.13
RFM 분석 (2)	2024.06.13
데이터 분석을 위한 SQL 레시피 5장 공부 (0)	2024.06.12
Z차트를 작성하는 SQL 쿼리 (0)	2024.06.12

DA로 가는 길

데이터 분석을 위한 SQL 레시피 7장 공부

데이터 활용의 정밀도를 높이는 분석 기술

세션별로 페이지 열람 수 랭킹 비율 구하기

데이터 타당성 확인하기

데이터 중복 검출하기

중복 데이터 배제하기

변경된 마스터 데이터 모두 추출

두 순위의 유사도 계산하기

'데이터 분석 > SQL 분석' 카테고리의 다른 글

댓글

티스토리툴바

데이터 분석을 위한 SQL 레시피 7장 공부

데이터 활용의 정밀도를 높이는 분석 기술

세션별로 페이지 열람 수 랭킹 비율 구하기

데이터 타당성 확인하기

데이터 중복 검출하기

중복 데이터 배제하기

변경된 마스터 데이터 모두 추출

두 순위의 유사도 계산하기

'데이터 분석 > SQL 분석' 카테고리의 다른 글

관련글

댓글

티스토리툴바