[Recommendation] Collaborative Filtering - Code Implementation (feat. python)


[Alex] Data Artisan's Blog


Alex, Yoon 2020. 9. 4. 21:37

1. Import the libraries and load the data.

import pandas as pd
import numpy as np
import scipy as sp
import scipy.sparse  # ensures sp.sparse (used for csr_matrix below) is loaded
from sklearn.metrics.pairwise import cosine_similarity
import operator
%matplotlib inline

anime = pd.read_csv('anime.csv')
rating = pd.read_csv('rating.csv')
anime.head()

	anime_id	name	genre	type	episodes	rating	members
0	32281	Kimi no Na wa.	Drama, Romance, School, Supernatural	Movie	1	9.37	200630
1	5114	Fullmetal Alchemist: Brotherhood	Action, Adventure, Drama, Fantasy, Magic, Mili...	TV	64	9.26	793665
2	28977	Gintama°	Action, Comedy, Historical, Parody, Samurai, S...	TV	51	9.25	114262
3	9253	Steins;Gate	Sci-Fi, Thriller	TV	24	9.17	673572
4	9969	Gintama'	Action, Comedy, Historical, Parody, Samurai, S...	TV	51	9.16	151266

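2. Missing ratings are recorded as -1 in this dataset. The code below is meant to replace them with NaN, and then we check the result.

# Intended: mark missing ratings (stored as -1) as NaN
rating.rating.replace({-1: np.nan}, regex=True, inplace = True) 
rating.head()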

	user_id	anime_id	rating
0	1	20	-1
1	1	24	-1
2	1	79	-1
3	1	226	-1
4	1	241	-1
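The output above still shows -1. A minimal sketch of a replacement that assigns the result back is given here, on a hypothetical toy frame (`toy_rating` and its values are made up for illustration; only the column names follow rating.csv). The later outputs in this post were produced with -1 still in the data, so applying this to the real data would change those numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-in for rating.csv (same column names, made-up rows)
toy_rating = pd.DataFrame({
    "user_id": [1, 1, 3, 5],
    "anime_id": [20, 24, 20, 20],
    "rating": [-1, -1, 8, 6],
})

# Assign the result back instead of relying on inplace=True or regex=True
toy_rating["rating"] = toy_rating["rating"].replace(-1, np.nan)

# Count the rows that are now marked as missing
print(toy_rating["rating"].isna().sum())
```

With NaN in place, `groupby` and `pivot_table` skip the unscored rows instead of treating -1 as a very low score.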

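About 19% of the rows (0.188962) have no score. They still carry the -1 placeholder, so the `rating.rating.replace(...)` call did not take effect, and -1 stays in the data as a rating value for the rest of this post.

# Share of rows per rating value; about 19% are -1 (no score)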
rating.groupby("rating").count().iloc[:,:1] / rating.count().user_id

	user_id
rating
-1	0.188962
1	0.002131
2	0.002963
3	0.005305
4	0.013347
5	0.036193
6	0.081622
7	0.176009
8	0.210657
9	0.160499
10	0.122312
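The same kind of per-value share can also be computed with `value_counts(normalize=True)`. This is a small sketch on a hypothetical toy Series (`toy` and its values are made up), not the code used in the post:

```python
import pandas as pd

# Hypothetical ratings; -1 stands for "no score" as in the dataset
toy = pd.Series([-1, -1, 8, 6, 8, 10, 7, 8, 9, 7], name="rating")

# Fraction of rows per rating value, ordered by the rating itself
share = toy.value_counts(normalize=True).sort_index()
print(share)
```

Here 2 of the 10 toy rows are -1, so `share.loc[-1]` is 0.2.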

3. We keep only the TV anime and work with that subset.

# For this analysis I'm only interested in finding recommendations for the TV category
anime_tv = anime[anime['type']=='TV']
anime_tv.head()

	anime_id	name	genre	type	episodes	rating	members
1	5114	Fullmetal Alchemist: Brotherhood	Action, Adventure, Drama, Fantasy, Magic, Mili...	TV	64	9.26	793665
2	28977	Gintama°	Action, Comedy, Historical, Parody, Samurai, S...	TV	51	9.25	114262
3	9253	Steins;Gate	Sci-Fi, Thriller	TV	24	9.17	673572
4	9969	Gintama'	Action, Comedy, Historical, Parody, Samurai, S...	TV	51	9.16	151266
5	32935	Haikyuu!!: Karasuno Koukou VS Shiratorizawa Ga...	Comedy, Drama, School, Shounen, Sports	TV	10	9.15	93351

# Join the two dataframes on the anime_id column
merged = rating.merge(anime_tv, on='anime_id', suffixes=('_user', ''))
merged.rename(columns = {'rating_user':'user_rating'}, inplace = True)

merged.shape

(5283596, 9)

merged.head()

	user_id	anime_id	user_rating	name	genre	type	episodes	rating	members
0	1	20	-1	Naruto	Action, Comedy, Martial Arts, Shounen, Super P...	TV	220	7.81	683297
1	3	20	8	Naruto	Action, Comedy, Martial Arts, Shounen, Super P...	TV	220	7.81	683297
2	5	20	6	Naruto	Action, Comedy, Martial Arts, Shounen, Super P...	TV	220	7.81	683297
3	6	20	-1	Naruto	Action, Comedy, Martial Arts, Shounen, Super P...	TV	220	7.81	683297
4	10	20	-1	Naruto	Action, Comedy, Martial Arts, Shounen, Super P...	TV	220	7.81	683297

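4. To keep things simple, we use only the users with user_id up to 30000 (the Kaggle notebook cannot handle the full dataset).

+) Strictly speaking, random sampling or selecting rows by rating value would be the proper way to reduce the data.

# For computing reasons I'm limiting the dataframe to users with user_id <= 30000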
merged=merged[['user_id', 'name', 'user_rating']]
merged_sub= merged[merged.user_id <= 30000]
merged_sub.head()

	user_id	name	user_rating
0	1	Naruto	-1
1	3	Naruto	8
2	5	Naruto	6
3	6	Naruto	-1
4	10	Naruto	-1
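As the note above says, sampling would be the more principled way to shrink the data. A minimal sketch of random user sampling follows; `toy_merged`, the seed, and the sample size are assumptions for illustration, not values from the post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the merged frame (same three columns)
toy_merged = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 4, 5, 5],
    "name": ["Naruto", "Trigun", "Naruto", "Naruto",
             "Mushishi", "Trigun", "Naruto", "Mushishi"],
    "user_rating": [8, 7, 6, 9, 10, 5, 7, 8],
})

# Draw users (not rows) at random so each sampled user keeps all ratings
rng = np.random.default_rng(seed=0)
all_users = toy_merged["user_id"].unique()
sampled_users = rng.choice(all_users, size=3, replace=False)

toy_sub = toy_merged[toy_merged["user_id"].isin(sampled_users)]
print(toy_sub)
```

Sampling whole users rather than individual rows keeps each user's rating history intact, which matters for the per-user normalization in step 6.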

5. Collaborative filtering relies on user-user similarity or item-item similarity. To compute these similarities, we first pivot the data.

User CF - row: user, column: item, values: rating

Item CF - row: item, column: user, values: rating

piv = merged_sub.pivot_table(index=['user_id'], columns=['name'], values='user_rating')

print(piv.shape)
piv.head()

(29802, 3031)

name	.hack//Roots	.hack//Sign	.hack//Tasogare no Udewa Densetsu	009-1	07-Ghost	11eyes	12-sai.: Chicchana Mune no Tokimeki	12-sai.: Chicchana Mune no Tokimeki 2nd Season	3 Choume no Tama: Uchi no Tama Shirimasenka?	30-sai no Hoken Taiiku	...	Zone of the Enders: Dolores, I	Zukkoke Knight: Don De La Mancha	ef: A Tale of Melodies.	ef: A Tale of Memories.	gdgd Fairies	gdgd Fairies 2	iDOLM@STER Xenoglossia	s.CRY.ed	xxxHOLiC	xxxHOLiC Kei
user_id
1	NaN	NaN	NaN	NaN	NaN	-1.0	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2.0	NaN

5 rows × 3031 columns
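The two orientations can be seen on a tiny example. This sketch uses a hypothetical frame (`toy` and its ratings are made up); the item-by-user table is simply the transpose of the user-by-item one.

```python
import pandas as pd

# Hypothetical long-format ratings
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "name": ["Naruto", "Trigun", "Naruto", "Trigun"],
    "user_rating": [8, 6, 7, 9],
})

# User CF layout: one row per user, one column per item
user_item = toy.pivot_table(index="user_id", columns="name", values="user_rating")

# Item CF layout: one row per item, one column per user
item_user = user_item.T

print(user_item)
print(item_user)
```

Pairs that never occur in the long table (user 2 and Trigun here) become NaN. `pivot_table` averages duplicate (user, item) pairs by default, which is worth keeping in mind if the real data contains repeats.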

6. We normalize the ratings to account for each user's individual rating scale.

This adjusts for users who tend to give relatively low scores.

# Normalize the values
piv_norm = piv.apply(lambda x: (x-np.mean(x))/(np.max(x)-np.min(x)), axis=1) # mean normalization per user: (x - mean) / (max - min)

piv_norm.head()

name	.hack//Roots	.hack//Sign	.hack//Tasogare no Udewa Densetsu	009-1	07-Ghost	11eyes	12-sai.: Chicchana Mune no Tokimeki	12-sai.: Chicchana Mune no Tokimeki 2nd Season	3 Choume no Tama: Uchi no Tama Shirimasenka?	30-sai no Hoken Taiiku	...	Zone of the Enders: Dolores, I	Zukkoke Knight: Don De La Mancha	ef: A Tale of Melodies.	ef: A Tale of Memories.	gdgd Fairies	gdgd Fairies 2	iDOLM@STER Xenoglossia	s.CRY.ed	xxxHOLiC	xxxHOLiC Kei
user_id
1	NaN	NaN	NaN	NaN	NaN	-0.034483	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	-0.196643	NaN

5 rows × 3031 columns
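To see what the normalization does, here is a worked sketch on a hypothetical 3-user table (`toy_piv` and `mean_normalize` are illustrative names, not from the post). It applies the same formula, (x - mean) / (max - min), row by row.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-item ratings; NaN means "not rated"
toy_piv = pd.DataFrame(
    {"A": [10.0, 4.0, 7.0], "B": [8.0, 2.0, 7.0], "C": [6.0, np.nan, np.nan]},
    index=pd.Index([1, 2, 3], name="user_id"),
)

def mean_normalize(row):
    # Center on the user's mean and scale by the user's rating range;
    # pandas skips NaN when computing mean, max and min
    return (row - row.mean()) / (row.max() - row.min())

toy_norm = toy_piv.apply(mean_normalize, axis=1)
print(toy_norm)
```

User 1 (ratings 10, 8, 6) becomes 0.5, 0.0, -0.5. User 3 gave the same score to everything, so the range is 0 and the whole row turns into NaN. Such users end up as all zeros after the null replacement in step 7 and are dropped there, which is consistent with the user count falling from 29802 in the pivot table to 26622 in the user similarity matrix.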

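7. Replace the null values in the normalized pivot table with 0, transpose it so that rows are items, and drop the users whose values are all zero.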

# Drop all columns containing only zeros representing users who did not rate
piv_norm.fillna(0, inplace=True)
piv_norm = piv_norm.T
piv_norm = piv_norm.loc[:, (piv_norm != 0).any(axis=0)]
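The effect of these three lines can be checked on a small table. The sketch below uses a hypothetical normalized frame (`toy_norm` values are made up) and a non-inplace variant of the same steps.

```python
import numpy as np
import pandas as pd

# Hypothetical normalized user-by-item table; user 3 has no usable ratings
toy_norm = pd.DataFrame(
    {"A": [0.5, 0.5, np.nan], "B": [0.0, -0.5, np.nan], "C": [-0.5, np.nan, np.nan]},
    index=pd.Index([1, 2, 3], name="user_id"),
)

# Fill missing values with 0 and flip to item-by-user
item_by_user = toy_norm.fillna(0).T

# Keep only the user columns that contain at least one non-zero value
item_by_user = item_by_user.loc[:, (item_by_user != 0).any(axis=0)]
print(item_by_user)
```

User 3 disappears because every value in that column is 0. Note that after `fillna(0)` a 0 can mean either "not rated" or "rated exactly at the user's mean" (user 1 on item B above); the two cases are no longer distinguishable.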

8. To speed up the similarity computation, we store the data in scipy's csr_matrix (compressed sparse row) format.

# Our data needs to be in a sparse matrix format to be read by the following functions
piv_sparse = sp.sparse.csr_matrix(piv_norm.values)

item_similarity = cosine_similarity(piv_sparse)
user_similarity = cosine_similarity(piv_sparse.T)

# Inserting the similarity matrices into dataframe objects
item_sim_df = pd.DataFrame(item_similarity, index = piv_norm.index, columns = piv_norm.index)
user_sim_df = pd.DataFrame(user_similarity, index = piv_norm.columns, columns = piv_norm.columns)
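Cosine similarity between two vectors u and v is u·v / (|u| |v|). The sketch below checks scikit-learn's result against that formula on a hypothetical 3x3 item-by-user matrix (`toy` and the helper `cosine` are illustrative, not from the post).

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item-by-user matrix (rows: items, columns: users)
toy = np.array([
    [0.5, 0.5, 0.0],
    [0.0, -0.5, 0.0],
    [-0.5, 0.0, 0.5],
])
toy_sparse = sparse.csr_matrix(toy)

# Rows are compared with rows, so the orientation decides what is compared
item_sim = cosine_similarity(toy_sparse)    # item-item, shape (3, 3)
user_sim = cosine_similarity(toy_sparse.T)  # user-user, shape (3, 3)

def cosine(u, v):
    # Plain definition: dot product divided by the product of the norms
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(item_sim[0, 1], cosine(toy[0], toy[1]))
```

Both printed values agree. Values range from -1 to 1, and the diagonal is 1 because every non-zero vector is perfectly similar to itself.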

9. Now that we have the similarities, let's write functions that take an item or a user and return the most similar items or users together with their similarity scores.

item_sim_df.head()

name	.hack//Roots	.hack//Sign	.hack//Tasogare no Udewa Densetsu	009-1	07-Ghost	11eyes	12-sai.: Chicchana Mune no Tokimeki	12-sai.: Chicchana Mune no Tokimeki 2nd Season	3 Choume no Tama: Uchi no Tama Shirimasenka?	30-sai no Hoken Taiiku	...	Zone of the Enders: Dolores, I	Zukkoke Knight: Don De La Mancha	ef: A Tale of Melodies.	ef: A Tale of Memories.	gdgd Fairies	gdgd Fairies 2	iDOLM@STER Xenoglossia	s.CRY.ed	xxxHOLiC	xxxHOLiC Kei
name
.hack//Roots	1.000000	0.257472	0.291974	0.037378	0.049659	0.050604	0.000121	0.0	0.004850	0.016794	...	0.011440	0.033142	-0.027421	-0.024537	0.006951	0.009136	0.010323	0.001812	0.001332	-0.010557
.hack//Sign	0.257472	1.000000	0.236637	0.039900	0.034780	0.054074	-0.004044	0.0	0.000782	0.018086	...	0.030886	0.011848	-0.008644	-0.009961	0.001443	0.002468	0.008348	0.012492	0.010136	0.007127
.hack//Tasogare no Udewa Densetsu	0.291974	0.236637	1.000000	0.067013	0.019580	0.067691	-0.002985	0.0	0.003263	0.024902	...	0.021230	0.001971	-0.025303	-0.030734	0.003677	0.007734	0.016217	0.024350	0.002422	-0.010017
009-1	0.037378	0.039900	0.067013	1.000000	0.016166	0.017148	0.001369	0.0	0.000000	0.026445	...	0.003870	0.000000	-0.013522	-0.015784	0.005584	0.006803	-0.005863	0.003647	0.017379	0.007374
07-Ghost	0.049659	0.034780	0.019580	0.016166	1.000000	0.082719	-0.006585	0.0	0.002402	0.019046	...	-0.004252	0.019895	-0.032360	-0.025090	-0.013317	-0.015407	-0.021932	0.016019	-0.003591	0.001685

5 rows × 3031 columns

user_sim_df.head()

user_id	1	2	3	5	7	8	10	11	12	14	...	29989	29990	29991	29993	29994	29995	29997	29998	29999	30000
user_id
1	1.000000	-0.014327	-0.000415	-0.079289	-0.004787	0.061162	0.214333	0.034211	-0.285338	-0.110447	...	0.000691	-0.006462	-0.015495	0.050925	0.045025	0.014816	0.049610	-0.000866	0.000000	0.000000
2	-0.014327	1.000000	0.117331	0.003013	0.000000	0.000000	0.000000	0.000000	0.000000	-0.003626	...	0.000000	0.000000	0.000000	0.173025	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
3	-0.000415	0.117331	1.000000	0.054561	0.096398	0.021292	0.011997	0.001684	0.069823	0.007697	...	-0.032303	0.000000	0.019270	0.124736	-0.007535	0.030651	0.051558	0.021221	0.017483	0.000000
5	-0.079289	0.003013	0.054561	1.000000	0.066806	0.016339	-0.071223	0.003867	0.088133	0.089625	...	0.026154	-0.002340	0.042167	0.019129	0.046774	0.036779	-0.001458	-0.006795	0.054113	-0.026226
7	-0.004787	0.000000	0.096398	0.066806	1.000000	-0.013548	-0.045915	-0.054953	0.061591	0.122852	...	0.023275	-0.049705	-0.005210	0.025561	-0.012488	-0.012518	0.005264	0.035924	0.032345	0.028072

5 rows × 26622 columns

# This function will return the top 10 shows with the highest cosine similarity value
def top_animes(anime_name):
    count = 1
    print('Similar shows to {} include:\n'.format(anime_name))
    result = item_sim_df.loc[~item_sim_df.index.isin([anime_name]), anime_name].sort_values(ascending = False)[:10]
    for item, score in result.items():
        print('No. {}: {}({:.2f})'.format(count, item , score))
        count +=1

# This function will return the top 10 users with the highest similarity value
def top_users(user):
    if user not in piv_norm.columns:
        return('No data available on user {}'.format(user))

    print('Most Similar Users:\n')
    # Position 0 is expected to be the user itself (similarity 1.0), so skip it
    result = user_sim_df.sort_values(by=user, ascending=False).loc[:,user][1:11]
    for other_user, sim in result.items():
        print('User #{0}, Similarity value: {1:.2f}'.format(other_user, sim))

top_animes('Cowboy Bebop')

Similar shows to Cowboy Bebop include:

No. 1: Samurai Champloo(0.24)
No. 2: Trigun(0.20)
No. 3: Tengen Toppa Gurren Lagann(0.19)
No. 4: Fullmetal Alchemist: Brotherhood(0.17)
No. 5: Baccano!(0.17)
No. 6: Mushishi(0.16)
No. 7: Ghost in the Shell: Stand Alone Complex(0.16)
No. 8: Neon Genesis Evangelion(0.16)
No. 9: Steins;Gate(0.16)
No. 10: Ghost in the Shell: Stand Alone Complex 2nd GIG(0.15)

top_users(3)

Most Similar Users:

User #4647, Similarity value: 0.55
User #2277, Similarity value: 0.54
User #29848, Similarity value: 0.52
User #3225, Similarity value: 0.47
User #23557, Similarity value: 0.45
User #13143, Similarity value: 0.44
User #10270, Similarity value: 0.42
User #27503, Similarity value: 0.42
User #934, Similarity value: 0.41
User #15384, Similarity value: 0.40

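10. User-based CF gives us similar users and their similarity scores, but a recommendation service usually recommends a list of items.

Let's write a function that recommends items based on the 10 users most similar to the given user.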

# This function constructs a list of lists containing the highest rated shows per similar user
# and returns the name of the show along with the frequency it appears in the list

def similar_user_recs(user):
    if user not in piv_norm.columns:
        return('No data available on user {}'.format(user))

    # Take the 10 users most similar to the target user (position 0 is the user itself).
    sim_users = user_sim_df.sort_values(by=user, ascending=False).index[1:11] 
    best = []
    most_common = {}


    for i in sim_users:
        # For each similar user, take the items that user rated highest.
        # Only items the target user has not rated are considered (value 0 after fillna).
        result_sorted = piv_norm.loc[:, i][(piv_norm.loc[:,user] == 0)].sort_values(ascending = False)
        best.append(result_sorted.index[:5].tolist())
    for i in range(len(best)):
        for j in best[i]:
            if j in most_common:
                most_common[j] += 1
            else:
                most_common[j] = 1
    sorted_list = sorted(most_common.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_list[:5]

similar_user_recs(3)

[('Angel Beats!', 4),
 ('Steins;Gate', 3),
 ('Fullmetal Alchemist', 3),
 ('Toradora!', 2),
 ('Nisekoi', 2)]
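The vote counting at the end of `similar_user_recs` can also be written with `collections.Counter`. This is a sketch of that alternative on a hypothetical list of per-user top picks (`count_votes` and `toy_best` are illustrative names, not part of the post's code):

```python
from collections import Counter

def count_votes(top_lists, n=5):
    # Flatten the per-user lists and count how often each title appears
    votes = Counter(title for titles in top_lists for title in titles)
    # most_common returns (title, count) pairs, highest count first
    return votes.most_common(n)

# Hypothetical top picks of three similar users
toy_best = [
    ["Angel Beats!", "Steins;Gate"],
    ["Angel Beats!", "Toradora!"],
    ["Steins;Gate", "Angel Beats!"],
]

print(count_votes(toy_best, n=2))
```

For the toy input this prints `[('Angel Beats!', 3), ('Steins;Gate', 2)]`. Like the original function, it treats every similar user's vote equally and ignores how similar that user is.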
