Machine Learning Analysis of Rice Quality and Palatability Evaluation Data to Predict Rice Taste and Interpret Variables

Yuchan Choi; Hyun-jin Park; Yu-geun Oh; Ji-eun Kwak

doi:10.7740/kjcs.2025.70.4.222

Original Research Article

The Korean Journal of Crop Science. 1 December 2025. 222-233

Yuchan Choi¹*, Hyun-jin Park¹, Yu-geun Oh¹, Ji-eun Kwak²

¹Junior Scientist, Quality Management and Evaluation Research Division, Department of Food Sciences, National Institute of Crop Science and Food, Rural Development Administration, Suwon 16613, Korea

²Junior Scientist, Planning and Coordination Division, National Institute of Crop Science and Food, Rural Development Administration, Wanju 55365, Korea

*Corresponding author

License (open-access, http://creativecommons.org/licenses/by-nc/4.0/):

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

This study aimed to develop a machine learning-based evaluation framework for predicting rice (Oryza sativa L.) eating quality using physicochemical and sensory datasets accumulated over the past decade at the National Institute of Crop Science (NICS). A total of 297 rice samples were collected, including grain shape, milling, and physicochemical traits, along with sensory panel scores of overall palatability. After data preprocessing and the removal of outliers, 216 samples were used for model training. Principal component analysis (PCA), cluster analysis, and preference mapping were conducted to identify the structural relationships among rice quality traits. Gradient Boosting Machine (GBM) and Random Forest models were then trained to classify overall eating quality into two categories (A: superior, B: inferior). Model performance was evaluated using five-fold cross-validation and an independent test set. The Random Forest model achieved the highest accuracy (0.7143) and Kappa coefficient (0.4273), indicating statistically significant generalization performance (p = 0.009). SHAP (SHapley Additive exPlanations) analysis revealed that amylose, protein, and Milled/Brown Ratio were the most influential predictors of eating quality, where lower amylose and protein content and higher Milled/Brown Ratio contributed positively to superior eating quality (A). These results demonstrate the potential of machine learning as a complementary approach to sensory testing by enhancing objectivity and efficiency, and suggest core quality indices for rice breeding and processing—namely, amylose, protein, and Milled/Brown Ratio.

Keywords

amylose, machine learning, milled/brown ratio, preference mapping, protein, random forest, rice eating quality, SHAP

MAIN

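Introduction
Materials and Methods
Data Collection and Composition
Data Preprocessing
Variable Selection
Multivariate Dimensionality Reduction Analysis
Model Development
Software and Tools
Results and Discussion
Data Exploration and Preprocessing Results
Principal Component Analysis Results
Cluster Analysis Results
Preference Map Analysis Results
Variable Selection Results
Supervised Learning and Performance Evaluation of the Machine Learning Models
SHAP Value Analysis Results
Comparison with Previous Studies
Summary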

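Introduction

Agriculture is the foundation of human civilization, and as advances in digital technology have rapidly increased the volume of agricultural data, machine learning (ML) techniques have attracted growing attention (Zhou et al., 2017). Food quality evaluation, and sensory testing of food crops in particular, faces structural constraints such as assessor subjectivity, cost, and limited repeatability. Because machine learning can learn the multivariate, nonlinear relationships between quality traits and sensory scores, it is emerging as a promising complement to sensory evaluation (Schreurs et al., 2024).

Over roughly the past decade, the National Institute of Crop Science (NICS) has systematically accumulated rice quality and palatability data, which provide a suitable basis for model training and interpretation. Using these data, this study aims to (1) characterize the structure of quality traits through dimensionality reduction (PCA) and cluster analysis, (2) predict the overall palatability score using Gradient Boosting Machine (GBM) and Random Forest models, and (3) interpret the influence of key quality variables through SHAP (SHapley Additive exPlanations) values. In doing so, we seek to demonstrate the feasibility of a machine learning-based quality evaluation framework that improves the efficiency and objectivity of sensory testing.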

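Materials and Methods

Data Collection and Composition

Machine learning models were developed using rice quality and palatability evaluation data accumulated over roughly the past decade. The main data comprise grain quality traits collected from two NICS report series, the Report on the Breeding and Utilization Promotion of New Rice Cultivars (2011-2019) and the Joint Research Report on the Development of New Rice Cultivars (2020-2022), together with palatability evaluation results obtained by the Department of Central Area Crop Science of the same institute. The dataset included cultivar and line information, grain shape traits (rough rice length, width, thickness, length/width ratio, etc.), physicochemical traits (amylose content, protein content, Toyo glossiness value, etc.), milling traits (brown/rough ratio, milled/brown ratio, head rice ratio, broken rice ratio, damaged rice ratio, etc.), and the palatability item (overall score) (Table 1). Palatability was rated on a 7-point interval scale (-3 = very poor, ..., +3 = very good), and panel scores were averaged per line for analysis. A total of 297 samples were included; the collected data were preprocessed, merged, and then used for model training.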

Table 1.

Composition of the collected dataset.

Category	Measured Traits	Variables
Lines and Cultivars	Sampling information	297 samples collected from 2011 to 2022
Morphological Characteristics	-	(Rough rice) Length, Width, Thickness, Length/Width Ratio; (Brown rice) Length, Width, Thickness, Length/Width Ratio
Physicochemical Characteristics	Grain composition and glossiness	ADV^*, Amylose (%), Protein (%), Toyo Value for Glossiness
Milling Characteristics	Milling recovery ratio (%)	Brown/Rough Ratio, Milled/Brown Ratio, Milled/Rough Ratio
Milling Characteristics	Grain processing traits	Head Rice (%), Chalky Rice (%), Broken Rice (%), Damaged Rice (%), Head Rice Milling Recovery Ratio
Palatability Result	Sensory evaluation	Overall Score

^* ADV indicates Alkali Digestion Value.

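Data Preprocessing

The collected data were preprocessed by handling missing values, removing outliers, and rescaling variables. Missing values were identified through exploratory data analysis (EDA), and samples with a high missing rate (>21%) were excluded. Outliers were detected from the multivariate distribution using the Mahalanobis distance (Supplementary Table S3): samples lying beyond the 97.5% boundary of the χ² distribution with 9 degrees of freedom (df = 9) were judged to be outliers and removed. After removal of missing values and outliers, data from 216 samples were used for supervised learning.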
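The outlier rule can be sketched in a few lines. The study worked in R with nine variables (df = 9); the Python sketch below is only an illustration, reduced to two hypothetical variables so that the 2 × 2 covariance inverse and the χ² cutoff (which for df = 2 has the closed form −2 ln(1 − p)) can be written without external libraries. The function names and sample values are assumptions, not the authors' code or data.

```python
import math

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix (n - 1 denominator).
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^2 = [dx dy] S^-1 [dx dy]^T with the 2x2 inverse written out.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

def chi2_cutoff_df2(level):
    """Quantile of the chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - level)

def flag_outliers(points, level=0.975):
    """Indices of points whose squared distance exceeds the chi-square cutoff."""
    cutoff = chi2_cutoff_df2(level)
    return [i for i, d2 in enumerate(mahalanobis_sq_2d(points)) if d2 > cutoff]

# Hypothetical (amylose %, protein %) pairs; the last one is deliberately extreme.
samples = [
    (18.0, 6.0), (18.5, 6.2), (19.0, 6.1), (18.2, 5.9), (18.8, 6.3),
    (19.2, 6.0), (18.4, 6.4), (18.6, 5.8), (19.1, 6.2), (18.7, 6.1),
    (18.3, 6.0), (18.9, 5.9), (22.0, 8.5),
]
outlier_indices = flag_outliers(samples)
```

With nine variables, the same rule would compare each distance with the 0.975 quantile of χ² for df = 9 (about 19.02) and would need a general matrix inverse instead of the 2 × 2 formula.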

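Continuous variables were standardized to a mean of 0 and a standard deviation of 1. The means and standard deviations computed for this purpose were stored separately so that the same transformation can be applied to new data later (Supplementary Table S2).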
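A minimal sketch of this "fit once, reuse later" standardization is given below. It is an illustration in Python rather than the R workflow used in the study, and the helper names (`fit_scaler`, `transform`) and the numbers are hypothetical.

```python
import statistics

def fit_scaler(values):
    """Store the mean and sample standard deviation of the training data."""
    return {"mean": statistics.mean(values), "sd": statistics.stdev(values)}

def transform(values, params):
    """Apply a previously stored standardization to any data."""
    return [(v - params["mean"]) / params["sd"] for v in values]

train_amylose = [16.0, 18.0, 20.0]      # hypothetical training values (%)
params = fit_scaler(train_amylose)      # parameters are estimated once
train_z = transform(train_amylose, params)
new_z = transform([22.0], params)       # new sample scaled with stored parameters
```

Scaling new samples with the stored parameters, rather than re-estimating them, keeps test or future data from influencing the transformation.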

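Variable Selection

Variable selection used only the training set, to prevent information leakage, and consisted of three steps. (1) Based on the multivariate dimensionality reduction analyses (correlation analysis, principal component analysis (PCA), cluster analysis, and preference map analysis), the main quality traits likely to contribute to eating quality prediction were chosen as candidate input variables. (2) Variable importance was then computed from preliminary GBM and Random Forest models. (3) The importance values of each model were normalized and averaged into a combined importance, and the final variables were selected on the basis of their interpretation (Greenwell et al., 2018). This procedure was intended to prevent overfitting while securing both model interpretability and predictive efficiency (Pudjihartono et al., 2022).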

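Multivariate Dimensionality Reduction Analysis

PCA

Before PCA, the Kaiser-Meyer-Olkin (KMO) test and Bartlett's test of sphericity were performed to check the suitability of the correlation structure among variables. The KMO value was somewhat low at 0.47, but Bartlett's test was statistically significant (χ²₍₁₉₀₎ = 4170.3, p < 0.001), confirming that meaningful correlations exist among the variables. A low KMO value may indicate weak common factors among variables; however, the purpose here was not factor extraction (factor analysis) but an exploratory PCA that explores and summarizes the structural variance patterns in the data. Therefore, on the basis of the significant Bartlett result, PCA was performed to summarize and visualize the high-dimensional quality trait variables as low-dimensional principal components. Principal components were extracted as linear combinations of the observed variables to characterize the structure of grain quality traits, and all variables were standardized so that they contributed to the analysis with equal weight. The proportion of variance explained by the main components and the variable loadings were then computed.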

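Cluster Analysis

K-means clustering was applied to group rice samples by similarity. Using the NbClust package, multiple evaluation indices (silhouette coefficient, CH index, etc.) were reviewed together; 12 of the 26 indices supported three clusters, so the optimal number of clusters was set to three (Supplementary Fig. S1). In addition, visual inspection of the changes in within-cluster cohesion and between-cluster variance using the elbow method also indicated that three clusters gave the most appropriate structure (Supplementary Fig. S2).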
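The following sketch illustrates K-means (Lloyd's algorithm) and the within-cluster sum of squares used in the elbow method. It is a pure-Python illustration, not the NbClust or R workflow of the study; the points are hypothetical two-dimensional scores, and the starting centroids are fixed by hand so that the result is deterministic.

```python
def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm; `centers` are the starting centroids."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

def within_ss(points, labels, centers):
    """Total within-cluster sum of squares (the quantity plotted in an elbow curve)."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, centers[lab]))
               for p, lab in zip(points, labels))

# Hypothetical standardized scores forming three well-separated groups.
points = [(-2.0, 0.1), (-1.8, -0.2), (-2.2, 0.0),
          (2.0, 1.9), (2.1, 2.2), (1.8, 2.0),
          (1.9, -2.0), (2.2, -1.8), (2.0, -2.1)]

wss = {}
for k, start in {1: [points[0]],
                 2: [points[0], points[3]],
                 3: [points[0], points[3], points[6]]}.items():
    labels_k, centers_k = kmeans(points, start)
    wss[k] = within_ss(points, labels_k, centers_k)

labels, centers = kmeans(points, [points[0], points[3], points[6]])
```

In an elbow plot, `wss` is drawn against k, and the number of clusters is chosen where the decrease levels off.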

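Preference Map Analysis

Preference mapping was applied to interpret the cluster analysis results together with the palatability data. A preference map is a sensory analysis technique that integrates consumer preference data with product attribute data and visualizes the relative preference positions of products in a two-dimensional space (Faye et al., 2006). Using the SensMap package, the mean palatability score (overall score) of each cluster was projected onto the quality trait space reduced by PCA. This made it possible to visualize the position of the rice samples in each cluster relative to the overall score and to analyze the association between quality traits and palatability scores.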

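Model Development

Two ensemble models, GBM and Random Forest, were applied to predict the overall palatability score as a binary classification. Input variables were set according to the variable selection results described above. The target variable (overall score) was converted into two classes based on Δ, the difference from the overall score of the reference cultivar 'Chucheong': A (superior) for Δ ≥ 0.0 and B (inferior) for Δ < 0.0. This binary approach directly reflects the Go/No-Go decision made in practical rice breeding, where breeders decide whether to select a new line by comparison with a check cultivar. The model was therefore judged to have higher field applicability and practical value than regression or multi-class models. Data splitting, resampling, and model training were fixed with set.seed(123) to ensure reproducibility. Variable selection was performed only within the cross-validation folds of the training set, and only the parameters estimated on the training set and the finally selected variables were applied to the test set.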
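The labeling rule can be written as a one-line comparison. The sketch below is illustrative Python; the function name and the example scores are hypothetical.

```python
def label_sample(overall_score, reference_score):
    """Return 'A' (superior) if the score is at least that of the reference cultivar, else 'B'."""
    delta = overall_score - reference_score
    return "A" if delta >= 0.0 else "B"

# Hypothetical panel means for three lines and for the reference cultivar.
reference = 0.10
labels = [label_sample(score, reference) for score in (0.30, 0.10, -0.20)]
```

Note that a tie with the reference (Δ = 0.0) falls in class A under this definition.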

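Model Training and Hyperparameter Settings

The data were split into a training set (80%, n = 174) and a test set (20%, n = 42) using a stratified split so that the A/B class ratio in each set matched that of the original data (Supplementary Table S1). Five-fold cross-validation was applied to the training set, and model training and hyperparameter search were carried out together using the caret::train() function. During hyperparameter optimization, the combination with the best classification accuracy was selected.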
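The idea of a stratified 80/20 split followed by five folds can be sketched as follows. This is a standard-library Python illustration, not caret's implementation; the class counts are hypothetical, and a Python seed of 123 does not reproduce the partitions generated by R's set.seed(123).

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=123):
    """Split indices so that each class keeps the same share in train and test."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train_idx, test_idx = [], []
    for lab in sorted(by_class):
        indices = by_class[lab][:]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def k_folds(indices, k=5, seed=123):
    """Shuffle the training indices and deal them into k folds."""
    rng = random.Random(seed)
    shuffled = indices[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

# Hypothetical class labels: 100 samples of class A and 50 of class B.
labels = ["A"] * 100 + ["B"] * 50
train_idx, test_idx = stratified_split(labels)
folds = k_folds(train_idx)
```

Each fold serves once as the validation part while the other four are used for fitting; the test indices are not touched until the final evaluation.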

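As a result, the optimal hyperparameters for the GBM model were number of trees (n.trees) = 200, tree depth (interaction.depth) = 2, learning rate (shrinkage) = 0.05, and minimum observations per node (n.minobsinnode) = 10 (accuracy = 0.7345). For the Random Forest model, mtry = 5 with number of trees (ntree) = 500 gave the highest cross-validated accuracy (0.7305) and was adopted as the optimal setting (Table 2).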

Table 2.

Model accuracy and optimized hyperparameters after cross-validation.

Model	Hyperparameter	Optimal value	Selection criterion	Performance (Accuracy)
GBM	n.trees, interaction.depth, shrinkage, n.minobsinnode	200, 2, 0.05, 10	Accuracy	0.7345
Random Forest	mtry, ntree	5, 500	Accuracy	0.7305

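Model Performance Evaluation and Interpretation

To verify final predictive performance, the two models with the hyperparameters optimized during training were evaluated on the independent test set (20%). The evaluation was based on the confusion matrix and the ROC (Receiver Operating Characteristic) curve, from which the main metrics were derived from the class-wise predictions: accuracy, Kappa coefficient, sensitivity, specificity, and precision. These metrics were chosen to assess both the overall predictive power of the model and the balance of discrimination between classes. In addition, a significance test of accuracy against the No Information Rate (NIR, the proportion of the largest class) was used to check whether classification performance was significantly better than an uninformed prediction. For model interpretation, SHAP (SHapley Additive exPlanations) was applied to estimate variable importance and the direction of influence on the predictions.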

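Software and Tools

All steps from data preprocessing to statistical analysis and machine learning modeling were carried out in Jamovi 2.2 and R 4.3.2 (RStudio 2023.12.1+402). The main R packages were caret, gbm, and randomForest for building and evaluating the machine learning models; psych, REdaS, FactoMineR, factoextra, cluster, and NbClust for dimensionality reduction and clustering; SensMap for sensory analysis; shapr and kernelshap for SHAP analysis; and dplyr and ggplot2 for data preprocessing and visualization.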

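Results and Discussion

Data Exploration and Preprocessing Results

Exploratory data analysis (EDA) of the collected rice quality and palatability data revealed skewed distributions in some quality trait variables. Basic statistics together with skewness and kurtosis were calculated for the 297 samples; among the milling traits, broken rice ratio and damaged rice ratio showed extremely asymmetric distributions, with skewness of 4.08 and 6.48 and kurtosis of 21.50 and 64.30, respectively (Table 3, Fig. 1). Because such severe asymmetry can lead to overfitting to particular classes or reduced predictive performance in supervised learning (Ryu, 2011), the variables with extreme values, namely chalky rice ratio, broken rice ratio, damaged rice ratio, and head rice milling recovery ratio, were excluded from the model analysis.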

Table 3.

Exploratory data analysis (EDA) of the collected dataset.

Statistic	Rough rice length	Rough rice width	Rough rice thickness	Rough rice length/width ratio	Brown rice length	Brown rice width	Brown rice thickness	Brown rice length/width ratio
N	261	261	261	261	261	261	261	261
Missing	36	36	36	36	36	36	36	36
Mean	7.37	3.31	2.25	2.23	5.07	2.80	2.01	1.81
SD	0.42	0.19	0.12	0.14	0.24	0.08	0.13	0.10
IQR	0.34	0.25	0.16	0.16	0.26	0.10	0.18	0.11
Min.	6.53	2.82	1.80	1.93	4.48	2.36	1.60	1.62
Max.	9.53	3.86	2.62	3.10	6.30	3.04	2.30	2.67
Skewness	1.46	-0.08	-0.29	1.87	1.64	-0.44	-0.62	2.96
Kurtosis	4.54	0.26	0.43	9.08	5.08	2.64	0.37	20.39

Statistic	ADV	Amylose (%)	Protein (%)	Toyo Value for Glossiness
N	261	261	261	261
Missing	36	36	36	36
Mean	6.41	18.77	6.13	75.86
SD	0.40	0.88	0.64	6.20
IQR	0.50	1.10	0.80	10
Min.	4.50	16	4.10	61
Max.	7	21	8.60	88.6
Skewness	-1.77	-0.37	0.44	-0.22
Kurtosis	5.38	0.30	0.88	-0.60

Statistic	Brown/rough ratio (%)	Milled/brown ratio (%)	Milled/rough ratio (%)	Head rice (%)	Chalky rice (%)	Broken rice (%)	Damaged rice (%)	Head rice milling recovery ratio (%)
N	261	261	261	261	234	234	234	234
Missing	36	36	36	36	63	63	63	63
Mean	82.91	90.45	75	87.62	6.02	5.46	0.74	65.44
SD	1.09	1.07	1.42	10.45	6.15	7.84	0.75	8.21
IQR	1.40	1.20	1.80	12.20	5.85	4.65	0.60	9.20
Min.	79.80	86.90	69.70	35.60	0.40	0.20	0.10	26.50
Max.	85.80	92.70	78.40	98.80	35.10	62.60	9.00	74.80
Skewness	0.25	-0.75	-0.61	-1.84	1.97	4.08	6.48	-1.84
Kurtosis	0.17	1.20	0.96	4.40	4.48	21.50	64.30	4.33

	Overall Score
N	297
Missing	0
Mean	0.06
SD	0.25
IQR	0.29
Min.	-0.92
Max.	0.70
Skewness	-0.58
Kurtosis	1.08

https://cdn.apub.kr/journalsite/sites/kjcs/2025-070-04/N0840700405/images/kjcs_2025_702_222_F1.jpg

Fig. 1.

Density and histogram of the collected data; (A) broken grain ratio, (B) damaged grain ratio, (C) Palatability Overall Score. The corresponding skewness and kurtosis values are 4.08 and 21.50 for (A), 6.48 and 64.30 for (B), and -0.58 and 1.08 for (C).

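The palatability score used as the prediction target was approximately normally distributed and did not show serious imbalance (Table 3, Fig. 1). Correlation analysis revealed many significantly high correlations among grain shape traits and among milling traits (Table 4, Fig. 2). For example, Pearson correlation coefficients were high among rough rice length, width, and thickness and among the milling-related indices, indicating possible multicollinearity in the data. High correlation among independent variables can reduce model reliability and predictive accuracy, and it makes the independent contribution of each factor difficult to judge when interpreting variables (Chan et al., 2022). To address this, the rough rice variables were removed during variable selection from the highly correlated rough rice and brown rice grain shape traits, thereby mitigating multicollinearity. Variance inflation factor (VIF) analysis showed that the grain shape variables of rough and brown rice (rough rice length, rough rice width, brown rice length, etc.) and some milling traits (milled/brown ratio, milled/rough ratio, brown/rough ratio) had very high VIF values exceeding 100 (Supplementary Table S4). This indicates that variables of the same family are strongly correlated and carry structurally redundant information, which supports selecting representative variables from each variable group. The information obtained through EDA and correlation analysis was thus used to diagnose data quality and reliability and served as a basis for stable model training.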
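To make the link between correlation and VIF concrete, the sketch below uses the simplest case: when a predictor is regressed on a single other predictor, R² equals r², so VIF = 1 / (1 − r²). This two-variable simplification is an assumption made for illustration; the VIF values above 100 reported in the study come from regressions on all other predictors at once. The correlation values are taken from Table 4(B).

```python
def vif_from_r(r):
    """VIF of a predictor regressed on one other predictor (R^2 = r^2)."""
    return 1.0 / (1.0 - r ** 2)

# Head Rice vs. Head Rice Milling recovery ratio: r = 0.99 in Table 4(B).
vif_head_rice = vif_from_r(0.99)
# Brown/Rough vs. Milled/Brown ratio: r = 0.14 in Table 4(B).
vif_brown_rough = vif_from_r(0.14)
```

A pairwise correlation of 0.99 alone already gives a VIF of about 50, whereas r = 0.14 leaves the VIF close to 1, which is why near-duplicate variables were represented by a single member of each group.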

Table 4.

Correlation coefficients among rice quality traits; (A) grain shape traits, (B) milling traits.

(A)	(R)Width	(R)Thickness	(R)L/W	(B)Length	(B)Width	(B)Thickness	(B)L/W
(R)Length	0.40^***	0.34^***	0.54^***	0.74^***	-0.01	0.35^***	0.63^***
(R)Width	—	0.77^***	-0.54^***	0.21^**	0.30^***	0.58^***	0.00
(R)Thickness		—	-0.41^***	0.11	0.27^***	0.80^***	-0.06
(R)L/W			—	0.51^***	-0.30^***	-0.23^***	0.60^***
(B)Length				—	0.01	0.07	0.83^***
(B)Width					—	0.32^***	-0.53^***
(B)Thickness						—	-0.11

(B)	Milled/ Brown	Milled/ Rough	Head Rice	Chalky Rice	Broken Rice	Damaged Rice	Head Rice Milling
Brown/Rough	0.14^*	0.79^***	0.02	0.03	-0.10	0.08	0.16^*
Milled/Brown	—	0.72^***	0.39^***	-0.32^***	-0.28^***	-0.11	0.49^***
Milled/Rough		—	0.26^***	-0.17^**	-0.24^***	-0.01	0.41^***
Head Rice			—	-0.68^***	-0.80^***	-0.08	0.99^***
Chalky Rice				—	0.11	0.09	-0.67^***
Broken Rice					—	-0.06	-0.79^***
Damaged Rice						—	-0.08

^* p < 0.05, ^** p < 0.01, ^*** p < 0.001, All coefficients are Pearson’s r values.

https://cdn.apub.kr/journalsite/sites/kjcs/2025-070-04/N0840700405/images/kjcs_2025_702_222_F2.jpg

Fig. 2.

Correlation matrix among rice quality traits.

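Principal Component Analysis Results

In the PCA of quality traits, Toyo glossiness value, head rice ratio, brown rice length, and brown rice thickness contributed strongly to the main variation. Principal component 1 (Dim1) had high loadings for amylose, Toyo glossiness value, and head rice ratio and is interpreted as a physicochemical and milling quality axis, whereas principal components 2 and 3 (Dim2 and Dim3) mainly reflect the grain shape traits of rough and brown rice (Table 5, Fig. 3). This structure shows that variables with similar attributes tend to load on the same axis, consistent with the preceding correlation analysis.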

Table 5.

Variable loadings of rice quality traits by principal components (Dim1-Dim3).

Variable	First principal component (Dim 1)	Second principal component (Dim 2)	Third principal component (Dim 3)
(R)Length^*	-0.333	0.224	0.781
(R)Width	-0.324	0.776	0.300
(R)Thickness	-0.383	0.795	0.207
(R)L/W	0.050	-0.648	0.372
(B)Length^*	-0.144	-0.053	0.819
(B)Width	0.067	0.630	-0.177
(B)Thickness	-0.315	0.777	0.092
(B)L/W	-0.160	-0.470	0.783
ADV	0.219	-0.126	-0.01
Amylose	0.554	-0.037	0.005
Protein	-0.238	-0.285	0.002
Toyo Value for Glossiness	0.548	0.368	0.021
Brown rice ratio	0.325	0.137	0.247
Head brown rice ratio	0.749	0.112	-0.119
Milling ratio	0.659	0.159	0.109
Head rice	0.846	0.107	0.232
Chalky rice	-0.662	-0.107	-0.101
Broken rice	-0.604	-0.084	-0.295
Damaged rice	-0.091	-0.053	0.296
Head Rice Milling Recovery Ratio	0.897	0.125	0.239

^* (R) indicates Rough rice; (B) indicates Brown rice.

^** Factor loadings with an absolute value ≥ 0.50 are shown in bold.

https://cdn.apub.kr/journalsite/sites/kjcs/2025-070-04/N0840700405/images/kjcs_2025_702_222_F3.jpg

Fig. 3.

PCA biplot of rice quality traits with sensory evaluation scores; (A) PC1 vs. PC2, (B) PC1 vs. PC3.

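In the PCA of quality traits, the first three principal components explained about 52.6% of the total variance and summarized the main patterns in the data, while the remaining components together still accounted for 47.4% (Table 6). Interpretation centered on the main axes is therefore valid, but model building should consider a balanced range of traits so as to minimize the loss of information contained in the remaining dimensions. In summary, amylose and some milling and grain shape traits were confirmed as major sources of variation in the data. This result provides a useful basis for the subsequent clustering and predictive modeling steps.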

Table 6.

Standard deviation, explained variance, and cumulative explained variance of principal components for rice quality traits.

	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	PC9	PC10
S.D.	2.159	1.808	1.609	1.394	1.133	1.059	1.000	0.912	0.903	0.785
Var.%	0.233	0.163	0.130	0.097	0.064	0.056	0.050	0.042	0.041	0.031
CEV	0.233	0.396	0.526	0.623	0.687	0.743	0.793	0.835	0.876	0.907

^* S.D., Standard deviation of each principal component; Var.%, proportion of variance explained by each component; CEV, cumulative explained variance.
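The rows of Table 6 are linked by simple arithmetic: for a PCA on standardized variables, each component's variance is its standard deviation squared, and the total variance equals the number of variables. The short Python sketch below recomputes the Var.% and CEV rows from the S.D. row, assuming that the 20 variables listed in Table 5 entered the PCA (this count is an inference from Table 5, not a figure stated in the text).

```python
# Standard deviations of PC1-PC10 as reported in Table 6.
component_sd = [2.159, 1.808, 1.609, 1.394, 1.133, 1.059, 1.000, 0.912, 0.903, 0.785]
n_variables = 20  # assumed: the 20 standardized variables of Table 5

# Proportion of variance = eigenvalue (SD squared) / total variance.
proportion = [sd ** 2 / n_variables for sd in component_sd]

# Cumulative explained variance is the running sum of the proportions.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)
```

Under this assumption the values agree with the published rows to within rounding, for example a cumulative share of about 0.526 for the first three components.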

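Cluster Analysis Results

K-means clustering confirmed that three clusters were optimal according to the NbClust indices and the elbow method (Supplementary Fig. S1, S2). When projected onto the PCA plane, the clusters were clearly separated in terms of quality traits (Fig. 4). Cluster 1 had high Toyo glossiness value, brown rice width, and head rice milling recovery ratio, showing generally good grain quality, whereas cluster 2 had long rough rice grains and high broken rice and chalky rice ratios, with many quality-degrading factors. Cluster 3 tended to have high protein content. These results confirm that rice cultivars form distinct groups according to appearance, physicochemical, and milling traits.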

https://cdn.apub.kr/journalsite/sites/kjcs/2025-070-04/N0840700405/images/kjcs_2025_702_222_F4.jpg

Fig. 4.

K-means clustering of collected rice lines visualized on the PCA plot of quality traits. Each point represents a rice sample, and arrows indicate the direction and contribution of each variable to the principal components. Dim1 (23.3%) and Dim2 (16.3%) account for the largest proportions of variance in the data. Samples were classified into three clusters (Groups 1-3) based on multivariate similarities derived from morphological, physicochemical, and milling characteristics.

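Preference Map Analysis Results

Preference map analysis showed that the quality traits of each cluster were closely associated with sensory preference (Fig. 5). Cluster 1 had high Toyo glossiness value and head rice ratio together with a superior overall score, indicating that appearance quality contributes positively to the impression of eating quality (Li et al., 2023). In contrast, clusters 2 and 3 had relatively low overall scores. In particular, cluster 2 had high chalky rice and damaged rice ratios and a low glossiness value, which are considered negative factors associated with poorer appearance and texture (Cuili et al., 2022). The cluster structure and the preference map results provide a basis for variable selection when building the prediction models (Bhatnagar et al., 2018).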

https://cdn.apub.kr/journalsite/sites/kjcs/2025-070-04/N0840700405/images/kjcs_2025_702_222_F5.jpg

Fig. 5.

External preference map of overall eating scores; PCA plot of rice quality traits with the overall eating scores projected by a second-order regression surface.

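Variable Selection Results

The multivariate dimensionality reduction analysis described in the Methods identified physicochemical traits (amylose, protein, Toyo glossiness value), milling traits (head rice ratio, milled/brown ratio), and grain shape traits (brown rice length, brown rice thickness, length/width ratio) as major sources of variation. Based on these variables, preliminary GBM and Random Forest models were fitted to obtain variable importance, which was min-max normalized (0-1) and then averaged into a combined importance (Table 7). Amylose, milled/brown ratio, protein, brown rice length, and Toyo glossiness value contributed most, whereas variables such as brown rice width and brown/rough ratio had low importance. Accordingly, amylose, milled/brown ratio, protein, brown rice length, Toyo glossiness value, head rice ratio, brown rice thickness, and milled/rough ratio were selected as the final input variables (Table 8).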

Table 7.

Variable importance results derived from random forest and GBM models.

Rank	Variable	Random Forest Importance	GBM Importance	Combined Importance
1	Amylose	1.000	1.000	1.000
2	Milled/Brown Ratio	0.467	0.708	0.587
3	Protein	0.459	0.397	0.428
4	Brown rice Length	0.334	0.467	0.400
5	Toyo Value	0.280	0.455	0.367
6	Head Rice	0.212	0.351	0.281
7	Brown rice Thickness	0.189	0.339	0.264
8	Milled/Rough Ratio	0.146	0.315	0.231
9	Brown rice Length/Width Ratio	0.136	0.244	0.190
10	Brown/Rough Ratio	0.171	0.205	0.188
11	ADV	0.186	0.184	0.185
12	Brown rice Width	0.020	0.192	0.106

^* The importance values were normalized to a min/max scale, and the combined importance represents the mean of the normalized Random Forest and GBM scores.
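The combined importance column can be reproduced directly from the two normalized columns of Table 7. The Python sketch below takes the published normalized values as input (it does not refit any model) and averages them per variable; only the dictionary layout and variable names in the code are choices made for this illustration.

```python
# (Random Forest, GBM) normalized importance values copied from Table 7.
importance = {
    "Amylose": (1.000, 1.000),
    "Milled/Brown Ratio": (0.467, 0.708),
    "Protein": (0.459, 0.397),
    "Brown rice Length": (0.334, 0.467),
    "Toyo Value": (0.280, 0.455),
    "Head Rice": (0.212, 0.351),
    "Brown rice Thickness": (0.189, 0.339),
    "Milled/Rough Ratio": (0.146, 0.315),
    "Brown rice Length/Width Ratio": (0.136, 0.244),
    "Brown/Rough Ratio": (0.171, 0.205),
    "ADV": (0.186, 0.184),
    "Brown rice Width": (0.020, 0.192),
}

# Combined importance = mean of the two normalized scores.
combined = {name: (rf + gbm) / 2 for name, (rf, gbm) in importance.items()}

# Rank variables from most to least important.
ranking = sorted(combined, key=combined.get, reverse=True)
```

Averaging after normalization gives each model equal weight, so a variable ranks highly only if both models rely on it.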

Table 8.

List of input variables used in machine learning models.

Category	Variable	Unit	Type
Physicochemical traits	Amylose	%	Continuous
	Protein	%	Continuous
	Toyo value for glossiness	-	Continuous
Milling traits	Milled/brown ratio	%	Continuous
	Head rice	%	Continuous
	Milled/rough ratio	%	Continuous
Grain shape traits	Brown rice length	mm	Continuous
Grain shape traits	Brown rice thickness	mm	Continuous

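Supervised Learning and Performance Evaluation of the Machine Learning Models

Five-fold cross-validation on the training set was used to determine the optimal hyperparameters of the GBM and Random Forest models, and the resulting models were applied to the independent test set to evaluate classification accuracy (Tables 9, 10). The Random Forest model had higher overall accuracy (0.7143) than GBM (0.6429), and its Kappa coefficient (0.4273) was also higher than that of GBM (0.2792). This suggests that the Random Forest model was more robust to variability in the data and learned the class boundary more stably.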

Table 9.

Confusion matrices of the GBM and Random Forest models on the test data set (42 samples).

GBM	Reference A	Reference B
Prediction A	16	9
Prediction B	6	11

Random Forest	Reference A	Reference B
Prediction A	16	6
Prediction B	6	14

Table 10.

Predictive performance of each model on the test data set.

Metric	GBM	Random Forest
Accuracy	0.6429	0.7143
95% CI	(0.4803, 0.7845)	(0.5542, 0.8428)
Kappa coefficient	0.2792	0.4273
Sensitivity	0.7273	0.7273
Specificity	0.55	0.7
Precision	0.64	0.7273
p-value (Accuracy > NIR)	0.0815	0.0094
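The metrics in Table 10 follow from the counts in Table 9. The Python sketch below recomputes them with class A as the positive class. The p-value is computed as a one-sided exact binomial test of accuracy against the NIR, which is the usual way this test is defined; that it matches the exact routine used in the study is an assumption, and the function name is hypothetical.

```python
from math import comb

def evaluate(tp, fp, fn, tn):
    """Metrics for a 2x2 confusion matrix with class A as the positive class."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    ref_pos, ref_neg = tp + fn, fp + tn      # reference (true) class totals
    pred_pos, pred_neg = tp + fp, fn + tn    # predicted class totals
    # Agreement expected by chance, used by Cohen's kappa.
    expected = (ref_pos * pred_pos + ref_neg * pred_neg) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    # No Information Rate: share of the largest reference class.
    nir = max(ref_pos, ref_neg) / n
    correct = tp + tn
    # One-sided exact binomial test: P(X >= correct) if the true accuracy were NIR.
    p_value = sum(comb(n, k) * nir ** k * (1 - nir) ** (n - k)
                  for k in range(correct, n + 1))
    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "sensitivity": tp / ref_pos,
        "specificity": tn / ref_neg,
        "precision": tp / pred_pos,
        "nir": nir,
        "p_value": p_value,
    }

# Counts from Table 9 (rows = prediction, columns = reference).
gbm = evaluate(tp=16, fp=9, fn=6, tn=11)
rf = evaluate(tp=16, fp=6, fn=6, tn=14)
```

Both models are compared with the same NIR (22 of the 42 test samples belong to class A), so the difference in significance comes only from the number of correctly classified samples, 30 for Random Forest and 27 for GBM.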

Both models had the same sensitivity for class A (superior), 0.7273, but specificity for class B (inferior) was higher for Random Forest (0.70) than for GBM (0.55). In addition, the test of accuracy against the NIR was statistically significant for Random Forest (p = 0.0094) but not for GBM (p = 0.0815).

The classification performance of the GBM and Random Forest models was compared using ROC curves (Fig. 6, Supplementary Fig. S3). The Random Forest ROC curve lay above that of GBM, and its AUC (area under the curve) was larger (GBM: 0.675, Random Forest: 0.722), indicating more stable performance in terms of overall classification accuracy and specificity.

https://cdn.apub.kr/journalsite/sites/kjcs/2025-070-04/N0840700405/images/kjcs_2025_702_222_F6.jpg

Fig. 6.

Comparison of ROC curves for Random Forest and GBM models. The ROC curves compare the classification performance between Random Forest (magenta line) and Gradient Boosting Machine (blue line) models for predicting rice eating quality. Random Forest exhibited a higher area under the curve (0.722), indicating superior discriminative ability compared to the GBM model (0.675).

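In summary, the Random Forest model showed higher accuracy and more stable class discrimination than GBM, and its accuracy on the test set was significantly above the NIR, indicating that it generalizes. Random Forest-based prediction of eating quality grade is therefore expected to serve as a more reliable tool for classification problems of this kind, which depend on multiple quality traits.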

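SHAP Value Analysis Results

SHAP values were computed with respect to the predicted probability of class A (superior) to interpret the direction and magnitude of each predictor's effect on eating quality classification (A/B). The more positive the SHAP value, the more likely a sample is to be classified as A; the more negative the value, the more likely it is to be classified as B (Lundberg and Lee, 2017; Bagheri, 2022). In both GBM and Random Forest, amylose content, milled/brown ratio, and protein content had the largest influence. Specifically, lower amylose and protein contents shifted SHAP values in the positive direction and raised the probability of class A, and a higher milled/brown ratio likewise shifted SHAP values in the positive direction. In other words, low amylose and protein together with a high milled/brown ratio contributed to the prediction of a superior eating quality grade. The detailed distributions are shown in the SHAP summary plot (Fig. 7). A SHAP dependence plot was also examined to explore the interaction between the main predictors, amylose and protein (Supplementary Fig. S4). Some influence of protein content (color) was observed, but it was not dominant relative to the effect of amylose (x-axis), so no strong interaction between the two variables was identified.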
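The sign convention can be illustrated with exact Shapley values on a toy scoring function. The Python sketch below is not the shapr or kernelshap computation used in the study: the scoring function, its coefficients, the baseline, and the sample are all hypothetical, and features absent from a coalition are simply set to the baseline value. It only shows how each feature receives a signed contribution and how the contributions add up to the difference between the sample's score and the baseline score.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating all feature coalitions.

    Features outside a coalition are replaced by their baseline value.
    """
    names = list(x)
    n = len(names)
    phi = {}
    for target in names:
        others = [f for f in names if f != target]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                with_target = {f: (x[f] if f in coalition or f == target else baseline[f])
                               for f in names}
                without_target = {f: (x[f] if f in coalition else baseline[f])
                                  for f in names}
                total += weight * (predict(with_target) - predict(without_target))
        phi[target] = total
    return phi

def toy_score(s):
    """Hypothetical score for class A; the coefficients are NOT fitted values."""
    return (0.5
            - 0.10 * (s["amylose"] - 18.8)
            - 0.08 * (s["protein"] - 6.1)
            + 0.06 * (s["milled_brown"] - 90.5))

baseline = {"amylose": 18.8, "protein": 6.1, "milled_brown": 90.5}  # hypothetical reference
sample = {"amylose": 17.8, "protein": 5.6, "milled_brown": 91.5}    # hypothetical line
phi = shapley_values(toy_score, sample, baseline)
```

In this toy case a sample with lower amylose and protein and a higher milled/brown ratio than the baseline receives a positive contribution from all three features, mirroring the direction described above.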

https://cdn.apub.kr/journalsite/sites/kjcs/2025-070-04/N0840700405/images/kjcs_2025_702_222_F7.jpg

Fig. 7.

SHAP value plots for each model; (A) GBM, (B) Random Forest. The color indicates variable magnitude, and the position denotes the direction of influence on predicted eating quality (positive = favorable, negative = unfavorable).

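Comparison with Previous Studies

The SHAP results of this study are consistent with previous studies. Several studies have consistently reported a negative relationship in which higher amylose and protein contents are associated with poorer eating quality (Gong et al., 2022; Liu et al., 2020; Youn & Kim, 2015; Cao et al., 2025). The direction found here, in which increasing amylose and protein reduce the probability of class A (negative SHAP), agrees with these findings. These results are expected to make an academic and practical contribution by strengthening the validity of variable selection and the interpretability of models in machine learning-based quality prediction systems, and by providing rice breeding research with information on the quality traits that affect eating quality.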

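Summary

Using more than ten years of rice quality and palatability data accumulated at the National Institute of Crop Science, this study presented a machine learning-based evaluation framework that can both predict and interpret the overall palatability grade (A/B). After preprocessing, 216 samples were analyzed by principal component analysis (PCA), cluster analysis, and preference mapping to characterize the structure of quality traits, and GBM (Gradient Boosting Machine) and Random Forest models were applied to predict the binary classification (A/B) of the overall palatability score. Generalization performance was evaluated by cross-validation and an independent test set, and SHAP (SHapley Additive exPlanations) was used to interpret variable importance. The Random Forest model showed higher predictive accuracy (0.7143) than GBM, and amylose, protein, and milled/brown ratio had the largest influence on eating quality prediction.

These results are of academic and practical significance in that they (1) show the feasibility of a data-driven auxiliary tool that can compensate for the cost, time, and subjectivity limitations of sensory testing, and (2) quantitatively propose the key indices (amylose, protein, and milled/brown ratio) to be managed in breeding and processing. However, the use of data from a single institution, the choice of threshold in the binary label definition, and the absence of validation on an external cohort remain limitations. Future work should strengthen the generality and usability of the model through expansion to multiple institutions and cultivars, external and time-series validation, extended targets (such as multi-grade classification), and standardization of model simplification and calibration procedures for field application.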

Supplementary Material

kjcs_2025_704_222_SF1.pdf

kjcs_2025_704_222_SF2.pdf

kjcs_2025_704_222_SF3.pdf

kjcs_2025_704_222_SF4.pdf

kjcs_2025_704_222_ST1.pdf

kjcs_2025_704_222_ST2.pdf

kjcs_2025_704_222_ST3.pdf

kjcs_2025_704_222_ST4.pdf

The supplementary materials cited in the text are available on the website of The Korean Journal of Crop Science (https://www.cropbio.or.kr/).

∙Supplementary Table 1

∙Supplementary Table 2

∙Supplementary Table 3

∙Supplementary Table 4

∙Supplementary Figure 1

∙Supplementary Figure 2

∙Supplementary Figure 3

∙Supplementary Figure 4

Acknowledgements

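This work was supported by a research program of the Rural Development Administration (project title: Database construction and model evaluation for the development of a machine learning-based rice sensory evaluation prediction system; project number: PJ017498052024).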

References

Bagheri, R. 2022. Introduction to shap values and their application in machine learning. Medium. 8th Aug.

Bhatnagar, S. R., Y. Yang, B. Khundrakpam, A. C. Evans, M. Blanchette, L. Bouchard, and C. M. Greenwood. 2018. An analytic approach for interpretable predictive models in high‐dimensional data in the presence of interactions with exposures. Genet. Epidemiol. 42(3) : 233-249.

10.1002/gepi.22112; PMID: 29423954; PMCID: PMC6175336

Breiman, L. 2001. Random forests. Mach. Learn. 45(1) : 5-32.

10.1023/A:1010933404324

Cao, N., W. Zhou, F. Zhao, G. Jiao, L. Xie, A. Lu, J. Wu, M. Zhu, Y. Liu, J. Yu, R. Zhao, X. Yang, S. Hu, Z. Sheng, X. Wei, Y. Lv, S. Tang, G. Shao, and P. Hu. 2025. OsGATA7 and SMOS1 cooperatively determine rice taste quality by repressing OsGluA2 expression and protein biosynthesis. Nat. Commun. 16(1) : 3513.

10.1038/s41467-025-58823-1; PMID: 40223143; PMCID: PMC11994747

Chan, J. Y. L., S. M. H. Leow, K. T. Bea, W. K. Cheng, S. W. Phoong, Z. W. Hong, and Y. L. Chen. 2022. Mitigating the multicollinearity problem and its machine learning approach: a review. Mathematics 10(8) : 1283.

10.3390/math10081283

Cuili, W., G. Wen, H. Peisong, W. Xiangjin, T. Shaoqing, and J. Guiai. 2022. Differences of physicochemical properties between chalky and translucent parts of rice grains. Rice Sci. 29(6) : 577-588.

10.1016/j.rsci.2022.03.002

Faye, P., D. Brémaud, E. Teillet, P. Courcoux, A. Giboreau, and H. Nicod. 2006. An alternative to external preference mapping based on consumer perceptive mapping. Food Qual. Preference 17(7-8) : 604-614.

10.1016/j.foodqual.2006.05.006

Gong, X., L. Zhu, A. Wang, H. Xi, M. Nie, Z. Chen, Y. He, Y. Tian, F. Wang, and L. Tong. 2022. Understanding the palatability, flavor, starch functional properties and storability of indica-japonica hybrid rice. Molecules 27(13) : 4009.

10.3390/molecules27134009; PMID: 35807256; PMCID: PMC9268750

Greenwell, B. M., B. C. Boehmke, and A. J. McCarthy. 2018. A simple and effective model-based variable importance measure [Preprint]. arXiv. https://arxiv.org/abs/1805.04755

10.32614/CRAN.package.vip

Ishwaran, H. 2007. Variable Importance in Binary Regression Trees and Forests. Electron. J. Stat. 1 : 519-537.

10.1214/07-EJS039

Kim, C. S., N. Kim, and K. Y. Kwahk. 2019. Research trends analysis of machine learning and deep learning: Focused on the topic modeling. J. Korea Soc. Digit. Ind. Inf. Manag. 15(2) : 19-28.

Kumar, A. 2022. Machine Learning-Sensitivity vs Specificity Difference. Machine Learning, Data Analytics, date access, pp. 1-6.

Li, C., S. Yao, B. Song, L. Zhao, B. Hou, Y. Zhang, F. Zhang, and X. Qi. 2023. Evaluation of cooked Rice for eating quality and its components in Geng Rice. Foods 12(17) : 3267.

10.3390/foods12173267; PMID: 37685200; PMCID: PMC10486766

Liu, Q., Y. Tao, S. Cheng, L. Zhou, J. Tian, Z. Xing, G. Liu, H. Wei, and H. Zhang. 2020. Relating amylose and protein contents to eating quality in 105 varieties of Japonica rice. Cereal Chem. 97(6) : 1303-1312.

10.1002/cche.10358

Lundberg, S. M. and S. I. Lee. 2017. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30.

Mohammed, A. and R. Kora. 2023. A comprehensive review on ensemble deep learning: Opportunities and challenges. J. King Saud Univ-Comput. & Inf. Sci. 35(2) : 757-774.

10.1016/j.jksuci.2023.01.014

Pudjihartono, N., T. Fadason, A. W. Kempa-Liehr, and J. M. O’Sullivan. 2022. A review of feature selection methods for machine learning-based disease risk prediction. Front. Bioinf. 2 : 927312.

10.3389/fbinf.2022.927312; PMID: 36304293; PMCID: PMC9580915

Schreurs, M., S. Piampongsant, M. Roncoroni, L. Cool, B. Herrera-Malaver, C. Vanderaa, F. A. Theßeling, Ł. Kreft, A. Botzki, P. Malcorps, L. Daenen, T. Wenseleers, and K. J. Verstrepen. 2024. Predicting and improving complex beer flavor through machine learning. Nat. Commun. 15(1) : 2368.

10.1038/s41467-024-46346-0; PMID: 38531860; PMCID: PMC10966102

Shapley, L. S. 1951. Notes on the n-person game—II: the value of an n-person game.

Youn, Y. and Y. S. Kim. 2015. Physicochemical properties of rice varieties for manufacturing frozen fried rice. Food Sci. Preserv. 22(6) : 823-830.

10.11002/kjfp.2015.22.6.823

Zhou, L., S. Pan, J. Wang, and A. V. Vasilakos. 2017. Machine learning on big data: Opportunities and challenges. Neurocomputing 237 : 350-361.

10.1016/j.neucom.2017.01.026

The Korean Journal of Crop Science, ISSN 0252-9777 (Print), 2287-8432 (Online)
