Artificial intelligence in agriculture - the case of AgroTerra and rdl by red_mad_robot
AgroTerra has improved its Multivariate Analysis (MVA) model with the help of rdl by red_mad_robot, raising the model's accuracy by 20%. AgroTerra is engaged in crop farming and seed production; the company annually produces about 1 million tons of agricultural products, as well as soybean and wheat seeds and sunflower and corn hybrid seeds. The multivariate analysis model allows the company to identify the factors that had the greatest impact on yield and quality in the past season and to adjust its production technology and management system accordingly.
AgroTerra has been using the MVA for several years to determine the top factors affecting margins. For example, suppose the company planted 100 fields and harvested the crop. One field yielded five tons per hectare, while another yielded three. The question arises: "What's wrong?" This is the question that the MVA answers.
However, the model itself was difficult to develop, refine, and enrich with new data for two reasons. First, it was based on closed-source code; second, the amount of data available in agriculture is limited. AgroTerra therefore turned to rdl by red_mad_robot, a firm specializing in industrial artificial intelligence, machine learning, computer vision, and predictive analytics.
Context
Different industries have different amounts of data available. Companies whose products are used by millions of people collect a lot of data - that's big data. Agribusiness has far less data: it is limited by geography, and the production cycle lasts an entire year. To test new hypotheses under these constraints, synthetic data comes to the rescue. Its main difference from conventional data is that it is created by algorithms rather than recorded from real events. Synthetic data is actively used to develop machine learning models.
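As a rough illustration of data "created by algorithms, not real events", the sketch below estimates the mean and standard deviation of each column from a handful of observed records and then samples new records from normal distributions. The field values, the column names, and the `synthesize` helper are hypothetical; the article does not describe how AgroTerra actually generates its synthetic data.

```python
import random
from statistics import mean, stdev

# Hypothetical observed field records (values invented for illustration).
real_fields = [
    {"rainfall_mm": 410.0, "avg_temp_c": 18.2, "yield_t_ha": 4.8},
    {"rainfall_mm": 355.0, "avg_temp_c": 19.1, "yield_t_ha": 3.9},
    {"rainfall_mm": 298.0, "avg_temp_c": 20.4, "yield_t_ha": 3.1},
    {"rainfall_mm": 450.0, "avg_temp_c": 17.6, "yield_t_ha": 5.0},
    {"rainfall_mm": 380.0, "avg_temp_c": 18.9, "yield_t_ha": 4.2},
]


def synthesize(rows, n, seed=0):
    """Sample n synthetic rows from per-column normal distributions.

    Simplifying assumption: columns are treated as independent, so
    relationships between factors (e.g. rainfall and yield) are NOT preserved.
    """
    rng = random.Random(seed)
    stats = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        stats[column] = (mean(values), stdev(values))
    return [
        {column: rng.gauss(mu, sigma) for column, (mu, sigma) in stats.items()}
        for _ in range(n)
    ]


synthetic_fields = synthesize(real_fields, n=200)
```

A generator that is useful for model training would also need to preserve the dependencies between columns; the independent sampling here only shows the basic idea.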
"We are used to AI tools being applied in various fields, but few people think that machine learning models can be successfully applied to agriculture as well. At AgroTerra, we pay great attention to data collection and have already accumulated enough data to generate synthetic data and take the accuracy of our analytical models to a new level. To investigate the quality of this data and find the most appropriate tools for its analysis, we turned to the expertise of rdl," comments Nikolai Kashchuk, advanced analytics manager.
Solution in synthetic data
In 2022, AgroTerra decided to develop its models on open-source software, that is, software whose source code is open for analysis and editing. Work on improving the model began with exploratory data analysis (EDA) and proceeded in stages: first researching the data and the options for enriching it, then building different models. Once specific model validity metrics had been selected, rdl proceeded to build the models.
In total, three machine learning methods were considered: random forest, boosting, and stacking. A random forest works by voting: it trains many individual models (decision trees) and counts their "opinions" in favor of one yield factor or another. For example, if most of the trees conclude that rainfall had the greatest impact on yield, the final model takes this into account and makes its predictions accordingly. In boosting, models are trained step by step, and each subsequent model corrects the mistakes of the previous one. For example, if treating precipitation as the most important factor still leaves errors, the next model may attribute them to air temperature, and so on. In stacking, several different models each make their own prediction, and a final model takes those predictions as input and renders the verdict.
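The difference between the three methods can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the models rdl built: one-split trees ("stumps") are the base learner, averaging stumps trained on resampled data stands in for a random forest (a real random forest also grows deeper trees and samples factors at random), and stacking is shown as a weighted blend of two models fitted on a hold-out set. The data is generated, and every function name is hypothetical.

```python
import random
from statistics import mean


def fit_stump(X, y):
    """Fit a one-split regression tree (stump) that minimizes squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            ml, mr = mean(left), mean(right)
            err = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or err < best[0]:
                best = (err, j, t, ml, mr)
    if best is None:  # no valid split: fall back to predicting the mean
        m = mean(y)
        return lambda row: m
    _, j, t, ml, mr = best
    return lambda row: ml if row[j] <= t else mr


def fit_bagging(X, y, n_models=25, seed=0):
    """Random-forest-like: average many stumps, each trained on a bootstrap sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda row: mean(m(row) for m in models)


def fit_boosting(X, y, n_rounds=25, lr=0.5):
    """Boosting: each new stump is fitted to the errors left by the previous ones."""
    base = mean(y)
    residual = [yi - base for yi in y]
    models = []
    for _ in range(n_rounds):
        m = fit_stump(X, residual)
        models.append(m)
        residual = [r - lr * m(row) for r, row in zip(residual, X)]
    return lambda row: base + lr * sum(m(row) for m in models)


def fit_stacking(model_a, model_b, X_hold, y_hold):
    """Stacking (simplified): a final model learns how to weight two base models."""
    a = [model_a(row) for row in X_hold]
    b = [model_b(row) for row in X_hold]
    den = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    num = sum((ai - bi) * (yi - bi) for ai, bi, yi in zip(a, b, y_hold))
    w = num / den if den else 0.5  # least-squares weight for model_a
    return lambda row: w * model_a(row) + (1 - w) * model_b(row)


def make_fields(n, seed):
    """Generated fields: yield depends on rainfall (a step) and temperature."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        rain, temp = rng.random(), rng.random()
        X.append([rain, temp])
        y.append(3.0 + 2.0 * (rain > 0.5) + 1.0 * temp + rng.gauss(0, 0.2))
    return X, y


def mse(model, X, y):
    return mean((model(row) - yi) ** 2 for row, yi in zip(X, y))


X_train, y_train = make_fields(120, seed=1)
X_hold, y_hold = make_fields(60, seed=2)
X_test, y_test = make_fields(60, seed=3)

forest_like = fit_bagging(X_train, y_train)
boosted = fit_boosting(X_train, y_train)
stacked = fit_stacking(forest_like, boosted, X_hold, y_hold)

for name, model in [("bagging", forest_like), ("boosting", boosted), ("stacking", stacked)]:
    print(name, round(mse(model, X_test, y_test), 3))
```

The final model in stacking is fitted on data the base models did not see during training, so that it learns how much to trust each of them rather than memorizing their training errors.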
AgroTerra chose the stacking method, but the accuracy of the model still needed to be improved. To address this, rdl reduced the number of factors on which the model was trained and proposed additional data pre-processing. As a result, the accuracy of the model increased by 20%.
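The article does not say how rdl decided which factors to drop. One common and simple option, shown here purely as an assumption, is to rank factors by the absolute value of their correlation with yield and keep the strongest ones. The table values and the `top_factors` helper are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def top_factors(table, target, k):
    """Return the k factor names most correlated (in absolute value) with target."""
    scores = {
        name: abs(pearson(column, table[target]))
        for name, column in table.items()
        if name != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical per-field measurements (values invented for illustration).
fields = {
    "rainfall_mm": [410, 355, 298, 450, 380, 320],
    "avg_temp_c": [18.2, 19.1, 20.4, 17.6, 18.9, 19.8],
    "field_id_digit": [3, 7, 1, 4, 9, 2],  # an irrelevant factor
    "yield_t_ha": [4.8, 3.9, 3.1, 5.0, 4.2, 3.4],
}

kept = top_factors(fields, "yield_t_ha", k=2)
print(kept)
```

A correlation filter only detects linear relationships, so in practice it would be one of several checks rather than the sole criterion.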
"If the data is insufficient, the models become extremely unstable during training and give significantly different results with different initialization parameters. That is why it is important in such problems to choose models that are less affected by random data unrelated to the problem being solved," commented Ivan Timofeev, data scientist at rdl by red_mad_robot.
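The instability described in the quote can be demonstrated with a small experiment. In this sketch, which is an illustration on generated data rather than rdl's procedure, the same simple model (a least-squares line) is refitted with 20 different random seeds, and the spread of its predictions for one query point is compared between a dataset of 8 fields and one of 200. The seed, which controls the bootstrap resample, plays the role of the "initialization parameter"; all names and numbers are hypothetical.

```python
import random
from statistics import mean, stdev


def make_fields(n, seed=0):
    """Generated fields: yield grows linearly with rainfall, plus noise."""
    rng = random.Random(seed)
    rain = [rng.uniform(250, 450) for _ in range(n)]
    yld = [1.0 + 0.01 * r + rng.gauss(0, 0.5) for r in rain]
    return rain, yld


def fit_on_bootstrap(rain, yld, seed):
    """Fit a least-squares line on a bootstrap resample chosen by the seed."""
    rng = random.Random(seed)
    idx = [rng.randrange(len(rain)) for _ in range(len(rain))]
    xs, ys = [rain[i] for i in idx], [yld[i] for i in idx]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0  # guard against a degenerate resample
    return lambda r: my + slope * (r - mx)


def prediction_spread(n_fields, query_rain=350.0, n_seeds=20):
    """Standard deviation of the prediction at query_rain across seeds."""
    rain, yld = make_fields(n_fields)
    preds = [fit_on_bootstrap(rain, yld, seed)(query_rain) for seed in range(n_seeds)]
    return stdev(preds)


spread_small = prediction_spread(8)
spread_large = prediction_spread(200)
print(f"spread with 8 fields:   {spread_small:.3f} t/ha")
print(f"spread with 200 fields: {spread_large:.3f} t/ha")
```

A larger spread across seeds means the result depends more on chance than on the data, which is the behavior the quote warns about for small datasets.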
In the future, AgroTerra plans to use the model as a standard tool for seasonal analysis. This will help the company identify the root causes of yield deviations, make management decisions, and translate them into technological and managerial changes that contribute to the growth of business margins.