Markus Cozowicz,
Principal Data Scientist at Microsoft
What is the appeal of reinforcement learning, and how does it differ from or resemble other machine learning (ML) areas? Let us use the standard cartpole scenario: the task is to balance an inverted pendulum by moving its anchor point.
The trained agent outputs the actions left and right based on the currently observed state of the pendulum. When using RL frameworks like RLlib, an agent is a deep learning model that outputs categorical predictions (move left or right) from input features such as pole angle and velocity. To achieve continuous control, we invoke the model repeatedly. In contrast to supervised learning, the agent is not trained on manually labeled data, but in a simulation environment in which it learns to act.

Similar to recent breakthroughs in natural language processing (NLP), RL is a self-supervised approach that automatically creates labeled data as part of the training process. NLP models like ELMo, BERT, and GPT-3 learn from large amounts of text by using randomly masked words as labels and the surrounding context as input features. An additional challenge RL faces is the vast state space that needs to be explored; RL algorithms therefore pay special attention to balancing exploration of the environment against exploiting already gained knowledge. All these approaches have become tractable because deep learning, and even more traditional tools such as gradient boosted trees, are incredibly powerful at modelling complex relationships, an innovation fueled by massive amounts of labeled data collected in corpora like ImageNet. In summary, RL intelligently generates labeled data using simulators to train regular supervised ML models while taking the exploration/exploitation trade-off into account.
Reinforcement Learning for Inventory Planning
Over the past few years, consumer expectations have changed dramatically: consumers expect free (or low-cost) and fast (two-day or faster) shipping. A 2019 study published by the National Retail Federation (NRF) showed that 75% of U.S. consumers expect free shipping, and the same report found that 39% expect free two-day or faster shipping. Many will shop elsewhere if they do not get it.
Unfortunately, providing fast shipping is not easy, mostly because it is not cheap. To meet consumer demands, retailers and CPG companies are striving to optimize and balance their fulfillment processes. One key operational aspect that many companies take for granted is investing in inventory planning, specifically inventory optimization and placement. Time and again we have heard that the traditional decision-making methods used across industries for inventory optimization and placement have failed to keep up with modern e-commerce. Companies are now looking for AI/ML-powered technology to answer questions such as: How much to stock? When to stock? Where to stock, leveraging data across their network? Gartner predicts that by 2023, intelligent algorithms and AI techniques will be an embedded or augmented component across 25% of all supply chain technology solutions.
Use case: inventory planners and managers need to decide how much to stock and where to stock it.
Based on our experience with customers and research collaborations, we will share how RL can help address inventory optimization challenges. To start with, we need to answer a few design questions:
Simulation: continuous or discrete time?
OpenAI Gym and RLlib, popular RL frameworks, operate in discrete time by imposing a step interface to move through the simulation, which is a natural match for daily or weekly inventory planning decisions. SimPy enables continuous-time scenarios such as trucks moving goods between warehouses: a truck is modelled as a process that emits events when pickups and deliveries occur, and the SimPy engine takes care of invoking the corresponding simulation logic when the time has come. Deciding between the two approaches is a trade-off between simulation accuracy and execution performance: are the results accurate enough if we aggregate courier movements daily? Can we run the simulation fast enough to explore all the relevant states?
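The event-driven mechanics that SimPy provides can be sketched in plain Python: a truck is a process that schedules its own delivery events, and the clock jumps straight to the next event instead of ticking in fixed steps. Travel times and quantities below are illustrative assumptions.

```python
import heapq

events = []               # priority queue of (time, sequence, action)
seq = 0                   # tie-breaker so simultaneous events stay comparable
stock = {"warehouse": 0}

def schedule(time, action):
    global seq
    heapq.heappush(events, (time, seq, action))
    seq += 1

def deliver(now):
    stock["warehouse"] += 10        # delivery event: drop off 10 units
    schedule(now + 4.5, deliver)    # round trip back to the DC and out again

schedule(4.5, deliver)              # first trip takes 4.5 time units
while events:
    now, _, action = heapq.heappop(events)
    if now >= 20:                   # stop the simulation at t = 20
        break
    action(now)

print(stock["warehouse"])  # deliveries at t = 4.5, 9.0, 13.5, 18.0 -> 40
```

A full SimPy model replaces this hand-rolled queue with generator-based processes, but the trade-off stays the same: fewer, coarser events run faster yet capture less detail.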
Action-space: what to decide on?
Consider a standard two-echelon supply chain with distribution centers (DCs) as the first mile, warehouses/sortation facilities as the middle mile, and the last mile delivering to the destination (the customer). Inventory planners must decide how to distribute goods. For the simple value stream below, there are 2 DCs x 3 warehouses x 100 products = 600 quantities to predict.
Supply chain value-stream
Modelling this problem with a single agent that predicts all quantities simultaneously requires exploring a 600-dimensional action space. Multi-agent approaches allow us to break the decision space into smaller pieces: train one model that predicts a single product quantity, supplying additional features that specify the DC, warehouse, and product. There is an excellent article on multi-agent support in RLlib we recommend reading, as well as exploring Microsoft Research's MARO (Multi-Agent Resource Optimization) platform.
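The decomposition can be sketched as follows; the DC, warehouse, and SKU identifiers are illustrative assumptions, not a prescribed schema.

```python
from itertools import product

dcs = ["dc-1", "dc-2"]
warehouses = ["wh-1", "wh-2", "wh-3"]
skus = [f"sku-{i}" for i in range(100)]

# Single-agent view: one action vector covering every combination at once.
joint_action_dim = len(dcs) * len(warehouses) * len(skus)

# Multi-agent view: one decision per (DC, warehouse, SKU) triple, each
# predicting a single quantity; the identifiers become input features so one
# shared, small model can serve all agents.
per_agent_features = [
    {"dc": dc, "warehouse": wh, "product": sku}
    for dc, wh, sku in product(dcs, warehouses, skus)
]

print(joint_action_dim, len(per_agent_features))  # 600 600
```

The same 600 decisions are made either way; what changes is that each agent only ever explores a one-dimensional action space.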
Observation space: what information is visible to the model?
As the agent explores our simulated environment, RL algorithms collect observations, composed of features and rewards, to learn from. Obvious features such as current inventory levels, in-transit quantities, and upcoming supply can be augmented with traditional time series demand forecasts. Going further, planners can include actions computed by traditional heuristics as additional features; the agent can then decide to follow the heuristic's suggestion or deviate when appropriate. If the planner chooses a multi-agent approach, it is important to include information on the DC, warehouse, and product the agent should act for.
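An observation vector along these lines might be assembled as below; the feature names and the one-hot identifier scheme are assumptions for illustration.

```python
def build_observation(state, demand_forecast, heuristic_qty, dc_idx, n_dcs):
    # raw state: current stock, in-transit quantities, upcoming supply
    features = [state["on_hand"], state["in_transit"], state["upcoming_supply"]]
    features += demand_forecast        # e.g. a 3-day time series forecast
    features.append(heuristic_qty)     # suggestion from a classic heuristic
    one_hot = [0.0] * n_dcs            # tell a shared model which DC it acts for
    one_hot[dc_idx] = 1.0
    return features + one_hot

obs = build_observation(
    {"on_hand": 120, "in_transit": 30, "upcoming_supply": 50},
    demand_forecast=[40, 42, 38], heuristic_qty=45, dc_idx=1, n_dcs=2)
print(obs)  # [120, 30, 50, 40, 42, 38, 45, 0.0, 1.0]
```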
Reward: what signal does RL get from simulation?
As mentioned initially, RL algorithms train supervised models that require labeled data. The reward returned by the simulation serves as the label, with the important nuance that exploration/exploitation is taken into account. Generally, reward is expressed as a single number; incurred inventory cost is one example. Real-world supply chain scenarios more often require simultaneous optimization of multiple criteria: shipment cost, restock cost, fulfillment cost, and so on. Provided all elements can be assigned a monetary value, one can simply sum the various inputs. Unfortunately, it is not always that obvious, fulfillment delays being one example. Inventory planners and managers can assign appropriate values based on their business needs and gain insights in return: how much does operating cost increase if we aim to reduce fulfillment delays by 10%?
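A minimal sketch of such a scalar reward: monetary costs are summed directly, while fulfillment delays, which lack an obvious monetary value, enter through a tunable penalty. The weight is an assumption a planner would set from business needs.

```python
def reward(shipment_cost, restock_cost, fulfillment_cost,
           delayed_orders, delay_penalty=5.0):
    monetary = shipment_cost + restock_cost + fulfillment_cost
    # negative total cost: the agent maximizes reward, so lower cost is better
    return -(monetary + delay_penalty * delayed_orders)

print(reward(shipment_cost=10.0, restock_cost=5.0,
             fulfillment_cost=5.0, delayed_orders=2))  # -30.0
```

Sweeping `delay_penalty` and re-training is one way to answer the planner's question above: the resulting policies trace out the cost of tighter delay targets.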
How to handle constraints?
The inventory planning scenario imposes a set of constraints: DCs have limited stock, and warehouses and transport links have limited capacity, to name a few. A naive implementation simply terminates the training episode when an invalid state is reached; given the large state space, one quickly realizes that such behavior is computationally infeasible. Alternatively, one can guide learning by returning a large negative reward, but in our experience training hardly recovers from such states. We have been most successful at rescaling actions to a feasible region: rather than deciding directly on quantities, the agent learns to predict product/warehouse preferences. For more details on the subject, we recommend reading A Survey of Multi-Objective Sequential Decision-Making.
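One way to sketch this rescaling, under illustrative assumptions: the agent outputs unconstrained preference scores per warehouse, which are softmax-normalized and scaled by the stock actually available at the DC, so the allocation can never violate the stock constraint.

```python
import math

def allocate(preferences, available_stock):
    # softmax-style normalization maps raw scores to shares that sum to 1
    weights = [math.exp(p) for p in preferences]
    total = sum(weights)
    return [available_stock * w / total for w in weights]

qty = allocate([0.0, 0.0, 0.0], available_stock=90)
print(qty)                  # equal preferences -> [30.0, 30.0, 30.0]
print(sum(qty) <= 90)       # True: the constraint holds by construction
```

Because every action the agent can express is feasible, no episodes are wasted on invalid states and no penalty shaping is needed for this constraint.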
How to incorporate historic data?
It is important to strike a balance between exposing the agent to a rich set of scenarios and keeping those scenarios realistic. OR-Gym, an environment for OR and RL research, does not rely directly on historic data, but instead requires the user to express behavior in terms of statistical distribution parameters from which data is sampled; the average number of items sold per day is one example. While in theory this allows training a generic agent covering a wide range of demand and supply patterns, the overall complexity of a real-world supply chain quickly makes it computationally infeasible. Given access to historic data, we can use past inventory levels to seed the simulation with a variety of starting points, and demand and supply data can be continuously replayed. The number of days we replay the data is an intuitive parameter controlling how far we allow an agent's inventory levels to diverge from reality. Augmenting demand and supply with randomness opens the space further. Finally, to come full circle, one can carefully combine both approaches: use historic inventory levels and supply while sampling demand. When modelling supply chains, complexity rapidly increases, and data scientists need to trade off accurately reflecting the real world against computational efficiency. One might simplify without impacting training too much; as always, carefully evaluating results is the best advice.
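The replay-plus-randomness idea can be sketched as follows; the demand history, noise level, and replay horizon are illustrative assumptions.

```python
import random

def sampled_demand(historic_demand, replay_days, noise=0.1, seed=None):
    rng = random.Random(seed)
    for day in range(replay_days):
        base = historic_demand[day % len(historic_demand)]  # loop the history
        # jitter each day's demand by up to +/- noise to open the state space
        yield max(0, round(base * (1 + rng.uniform(-noise, noise))))

history = [40, 55, 30]  # units sold per day, taken from historic data
demand = list(sampled_demand(history, replay_days=6, noise=0.1, seed=7))
print(demand)  # six days of demand, each within 10% of the replayed history
```

Raising `noise` or the replay horizon widens the set of states the agent sees, at the cost of drifting further from the recorded reality.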
RL enables end-to-end modelling of a multitude of supply chain scenarios while incorporating traditional methods such as time series forecasting.