In this paper, a stochastic dynamic programming model is developed for maintenance planning of a deteriorating multistate production system. The quality of the batch/lot of items produced in each stage is used as a condition-monitoring indicator for condition-based maintenance. The machine has m-1 operational states plus a non-operational state referred to as the failure state. At the start of each stage, four actions are available to management: (1) renew the system; (2) perform maintenance; (3) continue production; and (4) inspect the system. Maintenance is assumed to be imperfect, meaning that after maintenance the system is restored, with known probabilities, to a state no worse than its current one. Since the system state evolves in a Markovian manner at the end of each stage, and the quality of the items produced depends on the system state, the problem can be modeled as a Markov decision process (MDP). Because the MDP is at the core of reinforcement learning, it is shown that, for large-scale problems, the proposed stochastic dynamic programming formulation can be used to develop reinforcement-learning algorithms; to this end, a Q-learning algorithm is proposed.
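To make the setting concrete, the four-action, m-state formulation described above can be sketched with tabular Q-learning on a toy instance. This is only an illustration, not the paper's model or algorithm: the state count, transition probabilities, and cost figures below are invented assumptions, and the quality signal is collapsed into a simple per-stage cost.

```python
import random

random.seed(0)

# Toy instance, purely illustrative: M = 4 system states
# (0 = as-new, ..., 2 = worst operational, 3 = failure state)
# and the four actions from the paper. All probabilities and
# costs are assumptions for this sketch, not from the paper.
M = 4
ACTIONS = ["renew", "maintain", "continue", "inspect"]
COST = {"renew": 10.0, "maintain": 4.0, "continue": 0.0, "inspect": 1.0}
QUALITY_LOSS = [0.0, 1.0, 3.0, 8.0]   # per-stage cost from poor batch quality

def step(state, action):
    """Sample the next state and the (negative) cost for one stage."""
    if action == "renew":
        nxt = 0                          # renewal: back to as-new
    elif action == "maintain":
        # imperfect maintenance: restored to any non-worse state
        nxt = random.randint(0, state)
    else:                                # continue or inspect
        # Markovian deterioration: stay put or degrade by one state
        nxt = min(state + 1, M - 1) if random.random() < 0.3 else state
    reward = -(COST[action] + QUALITY_LOSS[nxt])
    return nxt, reward

# Tabular Q-learning over the stage-by-stage MDP
Q = {(s, a): 0.0 for s in range(M) for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.95, 0.2
state = 0
for _ in range(50000):
    if random.random() < eps:            # epsilon-greedy exploration
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    nxt, reward = step(state, action)
    best_next = max(Q[(nxt, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = nxt

# Greedy maintenance policy recovered from the learned Q-table
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(M)}
print(policy)
```

With costs of this shape, the learned policy typically continues production in good states and renews or maintains once the system has deteriorated; the full model in the paper additionally handles the inspection decision through the batch-quality observations.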