Price Prediction of Chili Commodities in Bandung Regency Using Bayesian Network

Chili is one of the agricultural commodities most widely consumed by Indonesian people. Market data from recent years show that chili prices fluctuate as supply and demand change. One consequence for farmers is that the production cost can exceed the selling price. Besides supply and demand, weather is also thought to drive price changes, because farmers take the weather into account when deciding to grow chili. Price prediction is needed to anticipate future chili prices and to help farmers decide when to plant. One approach to prediction is data mining classification. In this paper, a Bayesian Network is used as the data mining classification method to predict the price of the chili commodity in Bandung Regency based on weather information, classifying the price into an economic class and a not economic class. The results show that the prediction model obtained with the Bayesian Network achieves a precision of 0.92 and a recall of 1, with an average accuracy of 83.5% in classifying the price.


I. INTRODUCTION
INDONESIA is a country whose population still depends on agriculture. One of the agricultural commodities most consumed by Indonesian people is chili. Chili is consumed as food in Indonesia and is also widely consumed in Bangladesh (Hossen, 2016). Chili contains nutrients that are good for the body, such as vitamin A, vitamin B, vitamin C and iron (Parle & Kaura, 2013). The high demand for chili makes chili prices fluctuate (Firdaus & Gunawan, 2013) and is one of the reasons for the high inflation rate in Indonesia (Mariyono, Joko, & Sumarno, 2016). Price fluctuations have a negative impact on both farmers and the community. When the farmers' selling price is lower than the production cost, farmers make a loss (Ditakristy, Saepudin, & Nhita, 2016), which means the price of chili is not economic for farmers. A price is economic for farmers if the production cost incurred is less than the selling price; if the production cost is higher than the selling price, the price is not economic. In addition to changes in supply and demand, weather is also thought to cause changes in the price of chili. Weather plays an important role in chili planting: chili plants require enough water to produce good-quality yields, so changing weather conditions strongly affect them. Price prediction is therefore important and necessary. With a price prediction, farmers can anticipate future chili prices, which helps them determine the right planting time.

Data mining classification methods can be used to make predictions. Many studies have used data mining methods for prediction, such as Ramesh and Vardhan (2013), who predicted crop production using the SVM, KNN and ANN algorithms. These algorithms were also used for price prediction (Kaur, Gulati, & Kundra, 2014). The algorithm used in this paper is the Bayesian Network, which has been widely applied to prediction. A Bayesian Network is a probabilistic graphical model that shows the relationships between the variables that affect one another in an event, in the form of a Directed Acyclic Graph. It can also handle incomplete data. In a previous study, Eisuke, Harada, and Mizuno (2012) used a Bayesian Network to predict NIKKEI and Toyota stock prices with an average maximum error of 30% for NIKKEI and 20% for Toyota. The Bayesian Network has also been used to predict the weather, with an accuracy above 80% across the overall test scenario (Sharma & Goyal, 2015). In agriculture, the Bayesian Network has been widely used, for example by Chawla et al. (2016), Gandhi, Armstrong and Petkar (2016), Rasmussen, Madsen and Lund (2014), Newlands and Tomley-Smith (2011), and Pérez-Ariza, Nicholson, and Flores (2014), for the prediction of plant diseases, for crop prediction and as a risk management tool.
In this paper, a Bayesian Network is used to predict the price of the chili commodity in Bandung Regency based on weather information. The data used in this research are historical chili prices and weather data for Bandung Regency. The output of this study is prices classified into economic and not economic classes.

II. BAYESIAN NETWORK
A Bayesian Network is a probabilistic graphical model (PGM) based on the probability calculation of each random variable (Heckerman, 1997). The random variables in a Bayesian Network are represented as a Directed Acyclic Graph (DAG) that consists of nodes and edges. Each random variable is a node in the DAG, while a relationship between variables is denoted by an edge (Sharma & Goyal, 2015). In Figure 1, nodes A, B and C are random variables and the edges show the relations between them. There is an edge from node A to node B, which means that node A affects node B; in other words, node A is the parent of node B. Likewise, node C is the child of node B. Each node has a conditional probability table (CPT) containing the probability of the node given its parents. Suppose the DAG contains $n$ nodes $X_1, \dots, X_n$ and the parents of node $X_i$ are denoted by $Pa(X_i)$. The joint probability of all nodes is then calculated using equation (1) (Sharma & Goyal, 2015).

$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i)) \tag{1}$$
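The factorization of the joint probability into one conditional term per node can be sketched for the three-node chain A → B → C of Figure 1. The CPT values and the names `p_a`, `p_b_given_a`, `p_c_given_b` and `joint` are hypothetical and only illustrate the calculation; they are not taken from the paper.

```python
# Hypothetical CPTs for the chain A -> B -> C (illustrative values only).
p_a = {True: 0.3, False: 0.7}                      # P(A)
p_b_given_a = {True: {True: 0.8, False: 0.2},      # P(B | A), indexed [a][b]
               False: {True: 0.1, False: 0.9}}
p_c_given_b = {True: {True: 0.6, False: 0.4},      # P(C | B), indexed [b][c]
               False: {True: 0.25, False: 0.75}}


def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) * P(b | a) * P(c | b)."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]


# Summing the joint probability over every assignment should give 1.
states = (True, False)
total = sum(joint(a, b, c) for a in states for b in states for c in states)
```

Each node contributes exactly one factor, conditioned only on its parent, which is what keeps the CPTs small compared with a full joint table.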

III. RESEARCH METHOD
The aim of this paper is to predict the price of the chili commodity in Bandung Regency, with the output being chili prices classified into an economic class and a not economic class. The system design used in this paper is shown in Table I.

A. Dataset
The data used in this paper are historical monthly chili commodity prices for 2014-2016, obtained from the Department of Trade of Bandung Regency and shown in Figure 2, and weather data for Bandung Regency obtained from the Meteorology Climatology and Geophysics Agency (BMKG). There are 7 attributes used as input data: the weather attributes Solar Radiation (S), Wind Speed (W), Temperature (T), Relative Humidity (H), Evaporation (E) and Rainfall (R), and the Price (P) attribute. Sample data are shown in Table II.

B. Preprocessing
At the preprocessing stage, the data are discretized and partitioned into training and testing data. Discretization is performed on all attributes because they are continuous, as shown in Table II, while a Bayesian Network works well on discrete attributes. The discretization in this paper is done in three ways, described below.

1) Discretization
Equal width is one of the discretization techniques that can be used to discretize continuous data. Equal width discretization divides the data into $m$ categories with the same range, denoted by $w$ (Liu, Hussain, Tan, & Dash, 2002). The maximum and minimum values of the attribute are needed. Equation (2) is used for equal width discretization.

$$w = \frac{x_{\max} - x_{\min}}{m} \tag{2}$$

Solar Radiation, Temperature, Relative Humidity and Evaporation were discretized into 3 categories (Nhita, 2013). Wind Speed was discretized into 2 categories, because its data would be unevenly distributed across 3 categories. Rainfall was discretized into 4 categories based on the BMKG rainfall categories shown in Table III (Nhita, 2013).
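As a minimal sketch of equal width discretization, the function below splits the range between the minimum and maximum of an attribute into bins of the same width and returns a bin index per value. The function name `equal_width_bins` and the temperature readings are hypothetical, and the handling of the maximum value (placed in the last bin) is an assumption rather than a detail given in the paper.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equally wide intervals (0-based index)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything falls into the first bin.
        return [0] * len(values)
    # Clamp so that the maximum value lands in the last bin (index n_bins - 1).
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical monthly temperature readings, discretized into 3 categories.
temperatures = [22.1, 23.4, 24.0, 22.8, 25.3, 23.9]
labels = equal_width_bins(temperatures, 3)
```

An attribute such as Wind Speed would use `n_bins=2` under the same scheme, while Rainfall would instead be mapped with fixed category boundaries.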

2) Price Classification
In this paper chili prices are classified into an economic class and a not economic class. The class is determined by comparing the farmers' price with the future value of the production cost, denoted by $FV$. The production cost is the cost incurred during the chili planting process, such as the costs of seed, fertilizer, pesticide, land and machinery. The farmers' price is 50% of the market price; this information was obtained from a direct interview with the Association of Vegetable Farmers in Bandung Regency. To determine the economic price, the monthly inflation rate ($i$) and the lowest price of the chili commodity ($P_0$) are required. In this paper the lowest price used is 10,000, according to the interview with the farmers. The future value after $t$ months is calculated using equation (3) (Ditakristy, Saepudin, & Nhita, 2016; Capiński & Zastawniak, 2003).

$$FV = P_0 \, e^{i t} \tag{3}$$
The inflation rates for 2014, 2015 and 2016 were 8.36%, 3.35% and 3.02% respectively, according to the Indonesian Central Bureau of Statistics, giving an average inflation rate of 4.91%. A continuous interest rate is used, so the average inflation is taken as 4% per year, which corresponds to a monthly inflation rate of 0.0033. After the future value is calculated as in (3), chili prices are classified by comparing the farmers' price with the future value to obtain the economic price. The classification is shown in Table IV. Figure 3 compares the farmers' price with the economic price used in this paper and shows that most of the data are classified into the economic class. In Figure 3 the future value appears almost linear, but in practice this value may increase or decrease depending on the production cost incurred. For example, if land prices increase, the production cost will be higher.
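The price classification can be sketched as follows, assuming the future value follows the continuous compounding form $P_0 e^{it}$ and that a price is economic only when the farmers' price is strictly greater than the future value. Both points are assumptions of this sketch, as are the function names and the example market prices; the base price of 10,000, the monthly rate of 0.0033 and the 50% farmers' share come from the text.

```python
import math


def future_value(base_price, monthly_rate, months):
    """Future value under continuous compounding (assumed form of the equation)."""
    return base_price * math.exp(monthly_rate * months)


def classify_price(market_price, months, base_price=10_000,
                   monthly_rate=0.0033, farmer_share=0.5):
    """Label a market price by comparing the farmers' price with the future value."""
    farmer_price = farmer_share * market_price
    threshold = future_value(base_price, monthly_rate, months)
    # Assumption: a tie is treated as not economic.
    return "economic" if farmer_price > threshold else "not economic"


# Hypothetical market prices observed 12 months after the base period.
label_high = classify_price(30_000, months=12)
label_low = classify_price(18_000, months=12)
```

With a monthly rate this small, the threshold grows slowly over the three-year window, which is consistent with the nearly linear future value described for Figure 3.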

3) Data Partition
After the data have been discretized, they are partitioned into training and testing data. Table V shows the partition of training and testing data and its class distribution.

C. Structure Learning Bayesian Network
The aim of structure learning is to construct the structure of the Bayesian Network. The structure can be built manually based on domain knowledge or learned from data. Building the structure from data requires a dedicated algorithm. In this paper the Bayesian Network structure was built using the K2 algorithm of Cooper and Herskovits (1992). In the K2 algorithm all nodes are ordered and the maximum number of parents ($k$) for each node must be initialized. In this paper we selected $k$ = 1, 2, 3 to avoid complex structures, because a good Bayesian Network structure is one that is less complex but still clearly represents the relationships between attributes. Suppose that, in a database of cases, node $x_i$ has $r_i$ possible values, its parent set $\pi_i$ has $q_i$ instantiations, and $N_{ijl}$ is the number of cases in which $x_i$ takes its $l$-th value while its parents take their $j$-th instantiation, with $N_{ij} = \sum_{l=1}^{r_i} N_{ijl}$. The K2 algorithm searches for the parents that maximize the score of the node given by the following equation (Ruiz, 2009).

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{l=1}^{r_i} N_{ijl}! \tag{4}$$

The K2 algorithm is described in Table VI (Ruiz, 2009).
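The scoring step at the heart of K2 can be sketched in log form, which avoids overflowing factorials. This sketch covers only the Cooper and Herskovits score of one node for one candidate parent set, not the full greedy search of Table VI; the function name `k2_log_score`, the dictionary-based data layout and the tiny dataset are hypothetical.

```python
from collections import Counter
from math import lgamma


def k2_log_score(data, child, parents, arity):
    """Log of the K2 score of `child` given `parents`.

    data:   list of dicts mapping attribute name -> discrete value
    arity:  dict mapping attribute name -> number of possible values
    """
    r = arity[child]
    n_ij = Counter()    # cases per parent instantiation
    n_ijl = Counter()   # cases per (parent instantiation, child value)
    for row in data:
        j = tuple(row[p] for p in parents)
        n_ij[j] += 1
        n_ijl[(j, row[child])] += 1
    score = 0.0
    for n in n_ij.values():
        # log[(r - 1)! / (N_ij + r - 1)!], using lgamma(m + 1) = log(m!)
        score += lgamma(r) - lgamma(n + r)
    for n in n_ijl.values():
        score += lgamma(n + 1)   # log(N_ijl!)
    # Unobserved parent instantiations contribute log(1) = 0 and are skipped.
    return score


# Hypothetical discretized data: P = price class, R = rainfall category.
data = [{"P": 1, "R": 0}, {"P": 1, "R": 0}, {"P": 0, "R": 1}, {"P": 1, "R": 1}]
arity = {"P": 2, "R": 2}
score_no_parent = k2_log_score(data, "R", [], arity)
score_with_price = k2_log_score(data, "R", ["P"], arity)
```

In the greedy search, a candidate parent is kept only if it raises this score, and the search for a node stops once the maximum number of parents is reached or no candidate improves the score.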

D. Parameter Learning Bayesian Network
Parameter learning in a Bayesian Network is the process of calculating the conditional probability tables based on the graph structure that has been built. In this paper the conditional probability tables are calculated using Maximum A Posteriori (MAP) estimation to avoid zero probabilities. Let $\theta_{xyz}$ denote the probability that node $x$ takes value $z$ given that its parents take value $y$, and let $N_{xyz}$ be the number of cases in which node $x$ has value $z$ and its parents have value $y$, with $\alpha_{xy} = \sum_z \alpha_{xyz}$ and $N_{xy} = \sum_z N_{xyz}$, where $\alpha_{xyz}$ is a bias probability as in (6), $r_x$ is the number of values of node $x$ and $q_x$ is the number of instantiations of the parents of node $x$. The MAP equation is as follows (Heckerman, 1998).

$$\theta_{xyz} = \frac{N_{xyz} + \alpha_{xyz}}{N_{xy} + \alpha_{xy}} \tag{5}$$
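A minimal sketch of the MAP estimate of one CPT is given below. It assumes the bias term takes the uniform form $\alpha_{xyz} = 1/(r_x q_x)$, which is a common choice but is not confirmed by the text; the function name `map_cpt` and the counts are hypothetical.

```python
def map_cpt(counts, values, q_x):
    """Estimate P(x = z | parents = y) with a smoothing pseudo-count.

    counts: dict mapping parent value y -> dict mapping node value z -> N_xyz
    values: list of all possible values z of node x
    q_x:    number of instantiations of the parents of node x
    """
    r_x = len(values)
    alpha_xyz = 1.0 / (r_x * q_x)   # assumed form of the bias term
    alpha_xy = r_x * alpha_xyz      # sum of alpha_xyz over z
    cpt = {}
    for y, row in counts.items():
        n_xy = sum(row.get(z, 0) for z in values)
        cpt[y] = {z: (row.get(z, 0) + alpha_xyz) / (n_xy + alpha_xy)
                  for z in values}
    return cpt


# Hypothetical counts of rainfall category given the price class.
counts = {"economic": {"low": 3, "high": 0},
          "not economic": {"low": 1, "high": 1}}
cpt = map_cpt(counts, ["low", "high"], q_x=2)
```

The entry for `("economic", "high")` stays above zero even though that combination never occurs in the counts, which is the reason for preferring MAP over plain relative frequencies here.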

E. Inference Bayesian Network
Inference on the Bayesian Network is a calculation that determines the class probability of an event based on the existing structure and CPTs. It is based on Bayes' theorem, given in (7), which describes the probability of event X given that event Y has occurred (Sarshar, Granmo, Radianti, & Gonzalez, 2013).

$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)} \tag{7}$$

In this case, the probabilities of the economic class and the not economic class are calculated based on the existing graph structure and CPTs. The class with the greater probability becomes the final decision.
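The class decision can be sketched for the simplest kind of structure, in which the price node is the only parent of every weather node. Under that assumption the posterior of each class is proportional to the class prior multiplied by the CPT entry of each observed weather value. The function name `posterior` and all probabilities below are hypothetical.

```python
def posterior(prior, cpts, evidence):
    """Posterior over classes when the class node is the sole parent of each attribute.

    prior:    dict mapping class c -> P(c)
    cpts:     dict mapping attribute -> class c -> value -> P(attribute = value | c)
    evidence: dict mapping attribute -> observed value
    """
    unnormalized = {}
    for c, p_c in prior.items():
        p = p_c
        for attr, value in evidence.items():
            p *= cpts[attr][c][value]
        unnormalized[c] = p
    z = sum(unnormalized.values())   # P(evidence), the denominator of Bayes' theorem
    return {c: p / z for c, p in unnormalized.items()}


# Hypothetical prior and CPT for a single rainfall attribute R.
prior = {"economic": 0.8, "not economic": 0.2}
cpts = {"R": {"economic": {"low": 0.6, "high": 0.4},
              "not economic": {"low": 0.2, "high": 0.8}}}
post = posterior(prior, cpts, {"R": "high"})
decision = max(post, key=post.get)
```

When weather nodes also have other weather nodes as parents, each factor is instead looked up in a CPT indexed by the full parent configuration, but the final step of picking the class with the greater posterior is the same.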

F. Performance Evaluation
Performance evaluation is needed to analyze and evaluate the performance of the system in classifying prices. It is based on a confusion matrix, which consists of rows and columns that contain the number of data items in each combination of actual and predicted class. With the confusion matrix, the number of data items predicted correctly or wrongly is known. An example of a confusion matrix is shown in Table VII (Han, Pei, & Kamber, 2011). In Table VII, PP is the number of positive-class items correctly predicted as positive, NP is the number of negative-class items predicted as positive, PN is the number of positive-class items predicted as negative, and NN is the number of negative-class items correctly predicted as negative. From the confusion matrix, precision, recall, F1 score and accuracy can be calculated to assess the system's performance.

1) Precision
Precision shows the exactness of the system in predicting data into the positive class. Precision is calculated using equation (8), where PP and NP are obtained from the confusion matrix shown in Table VII (Han, Pei, & Kamber, 2011).

$$\text{Precision} = \frac{PP}{PP + NP} \tag{8}$$

2) Recall
Recall shows the sensitivity of the system in predicting the positive class. Recall is calculated using equation (9), where PP and PN are obtained from the confusion matrix shown in Table VII (Han, Pei, & Kamber, 2011).

$$\text{Recall} = \frac{PP}{PP + PN} \tag{9}$$

3) F1-Score
F1-Score is a value that summarizes the system's prediction performance. The greater the F1-Score, the better the system's performance. F1-Score is calculated using equation (10) (Han, Pei, & Kamber, 2011), where precision is obtained from equation (8) and recall from equation (9).

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{10}$$

4) Accuracy
Accuracy is the percentage of data that the system predicts correctly. Equation (11) is used to calculate accuracy (Han, Pei, & Kamber, 2011).

$$\text{Accuracy} = \frac{PP + NN}{PP + NP + PN + NN} \tag{11}$$
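The four measures can be computed together from the confusion matrix counts, as in the sketch below. The function name `evaluate` and the counts are hypothetical and are not taken from the paper's tables; the argument names follow the PP, NP, PN and NN notation of Table VII.

```python
def evaluate(pp, np_, pn, nn):
    """Return (precision, recall, f1, accuracy) from confusion matrix counts.

    pp:  positive items predicted as positive
    np_: negative items predicted as positive
    pn:  positive items predicted as negative
    nn:  negative items predicted as negative
    """
    precision = pp / (pp + np_)
    recall = pp / (pp + pn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (pp + nn) / (pp + np_ + pn + nn)
    return precision, recall, f1, accuracy


# Hypothetical counts for illustration only.
precision, recall, f1, accuracy = evaluate(pp=8, np_=2, pn=1, nn=4)
```

The sketch does not guard against empty denominators; a test set with no predicted or no actual positives would need explicit handling.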

IV. RESULTS AND DISCUSSION
The research used 2 kinds of scenarios: dataset scenarios and learning scenarios for the Bayesian Network structure. The dataset scenarios use either all attributes or only the rainfall and price attributes. For the structure learning scenarios we selected k = 1, 2, 3 as the maximum number of parents in the Bayesian Network structure.

A. Dataset Scenario
This study used 2 dataset scenarios, detailed in Table VIII. The aim of these scenarios is to analyze the effect of using certain attributes to predict chili prices.

B. Results of Performance Evaluation
The resulting structures and dataset scenarios were evaluated to determine the system performance. Table IX shows the performance evaluation of the scenarios. Scenario 2 cannot be run with k = 2 or 3 because it uses only 2 attributes, Rainfall and Price, so the maximum number of parents must be 1. The average accuracies on the training and testing data are 82.5% and 83.5% respectively.

1) Analysis of the Influence of k
The dataset scenarios were used to create Bayesian Network structures. Choosing k = 1, 2 and 3 produces different structures, each of which is explained here. The inference using the structure in Figure 4 is as follows (Witten, Frank, Hall, & Pal, 2016).
Figure 4 shows that the price node is directly connected to the other nodes: the price node is the parent of all other nodes. This indicates a relationship between the price variable and the weather variables, in which the price variable can affect the other variables. There is no direct relationship among the weather variables themselves. The structure for k = 2 is shown in Figure 5. The inference using the structure in Figure 5 is as follows (Witten, Frank, Hall, & Pal, 2016). As in Figure 4, the price node in Figure 5 is the parent of all other nodes, which implies a direct edge between the price node and each of them. The difference lies in the weather variables, which are now connected to one another. For example, there is an edge from the evaporation node to the rainfall node, so the value of the evaporation variable affects the value of the rainfall variable. The structure for k = 3 is shown in Figure 6. The inference using the structure in Figure 6 is as follows (Witten, Frank, Hall, & Pal, 2016).
The structure for k = 3 is not much different from the structure for k = 2. The difference is that the temperature and humidity nodes are both influenced by the wind speed and solar radiation variables. The structure for dataset scenario 2 is shown in Figure 7. The inference using the structure in Figure 7 is as follows (Witten, Frank, Hall, & Pal, 2016).
The resulting structure shows that the price node affects the rainfall node, as in the structures for dataset scenario 1.
Overall, the structures produced by the K2 algorithm are good, and the relationships between attributes are well illustrated. In terms of classification performance, Table IX shows that on the training data performance increases as k increases. On the testing data, however, k = 3 performs worse than k = 2. For k = 2 the testing data give a precision of 0.92, a recall of 1 and an F1-score of 0.96, with 92% accuracy. Considering both training and testing results, the best graph structure is k = 2 with dataset scenario 1, which gives greater accuracy than k = 1 on the training data and greater accuracy than both k = 1 and k = 3 on the testing data. This difference is caused by the uneven distribution of data in each attribute. The value of k therefore influences the classification result: as k increases, a node can have more parents, and more parents mean more combinations of conditional probabilities. Overall, the best structure for the Bayesian Network is k = 2, which describes the relations among all variables.

2) Dataset Analysis
Based on the results in Table IX, each dataset scenario gives good performance. Scenario 1 is the best dataset scenario for training, giving a precision of 1 and a recall of 0.94 with k = 3, while scenario 1 with k = 2 is the best for testing, with a precision of 0.92 and a recall of 1. This means that each scenario can be used to predict chili prices. Table X shows the confusion matrix of scenario 1 with k = 2. It can be seen that the system mostly predicts the price into the economic class. This means that the price of chili in the next month is still classified as economic for farmers, so farmers can plant chili next month. The dominance of the economic class in the predictions may be caused by the uneven distribution of the training and testing data, in which the majority class is the economic class. In addition, the amount of data used was small. The uneven distribution and the small amount of data make it difficult for the system to find prediction patterns.

V. CONCLUSION
Weather variables can be used to predict chili prices because there is a relation between weather variables and chili prices. As shown in the learned Bayesian Network structure, there are edges from the price node to the other nodes, which are the weather variables. The best Bayesian Network structure has a maximum of 2 parents per node and describes the relations between the nodes. Price prediction using the Bayesian Network algorithm gives good performance, with precision, recall and F1 score on the testing data of 0.92, 1 and 0.96 respectively. On the training data it gives 0.85, 0.94 and 0.89 respectively, and the average testing accuracy is 83.5%. The number of parents of a node affects the performance of the system: the more parents a node has, the more probability combinations must be calculated, which can lead to prediction errors. The uneven distribution of data also affects the performance of the system. In this paper the amount of data is small and the economic class is dominant, which makes it harder for the system to learn the prediction pattern and leads it to predict most of the data into the economic class.