FPGA On-Device ARM Cortex Deep Reinforcement Learning with DQN | Ganapathi Pulipaka – AI Scientist

Deep-Q-Learning with experience replay relies on large replay buffers and on backpropagation with episodic optimization. This abstract presents a real-world project that runs Deep-Q-Learning on-device, on the ARM Cortex processor of a Xilinx PYNQ-Z1 FPGA board, using the OpenAI Gym CartPole environment. The resulting Deep-Q-Learning algorithm ran 126x faster than a conventional Deep-Q-Learning algorithm. In place of backpropagation, the project used an extreme learning machine (ELM) method, the online sequential ELM (OS-ELM).

Index Terms – Machine Learning; Deep Reinforcement Learning; Data Science; AI; Deep Learning


Motivation

  • Deep-Q-Learning with experience replay relies on large replay buffers and on backpropagation with episodic optimization, which requires a large memory allocation.
  • DeepMind trained Deep-Q-Learning agents on Atari 2600 games in the Arcade Learning Environment, using stochastic gradient descent to update the weights.
  • The open question is how Deep-Q-Learning can be trained efficiently on FPGA devices, without backpropagation and with a substantial speedup.
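To make the memory cost mentioned above concrete, here is a minimal sketch of an experience replay buffer. The class name `ReplayBuffer`, its methods, and the capacity are hypothetical choices for illustration and are not taken from the project; the point is only that every stored transition stays in memory until it is evicted.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples.

    Hypothetical illustration; not the project's implementation.
    """

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random minibatch drawn from the stored transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=3)
for step in range(5):
    buf.push(step, 0, 1.0, step + 1, False)
```

Memory use grows linearly with the capacity and with the size of each stored state, which is why large buffers are a poor fit for a small on-device target.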


Hypothesis

  • With the advent of deep neural networks, Q-learning has evolved into Deep-Q-Learning. Tabular Q-learning can only handle a limited set of state-action pairs, whose Q-values are stored in a Q-table. In Deep-Q-Learning, the state is the input to a Deep-Q-Network, which outputs a Q-value for each action.
  • In Q-learning, the agent attempts to discover the optimal policy from its history of interactions with the environment. This history records each state visited, each action taken, and each reward gained, and so tracks the agent's experiences.
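The tabular case described above can be sketched as follows, assuming the standard Q-learning update rule. The function name, the toy two-step history, and the values of `alpha` and `gamma` are assumptions made for this example, not details from the project.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, done,
                      n_actions, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update and return the new Q-value."""
    # A terminal transition has no future value to bootstrap from.
    if done:
        best_next = 0.0
    else:
        best_next = max(q_table[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# Q-table keyed by (state, action); unseen pairs default to 0.0.
q_table = defaultdict(float)

# Toy history of (state, action, reward, next_state, done) experiences.
history = [(0, 1, 0.0, 1, False), (1, 1, 1.0, 2, True)]
for s, a, r, s_next, done in history:
    q_learning_update(q_table, s, a, r, s_next, done, n_actions=2)
```

The table needs one entry per state-action pair, so it does not scale to CartPole's continuous state; a Deep-Q-Network replaces the lookup with a function that maps a state to one Q-value per action.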


Methods and Results

  • Leveraging the Bellman equation, the Deep-Q-Learning agent is trained much faster on the CartPole environment without backpropagation.
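As a rough illustration of training without backpropagation, the sketch below pairs a Bellman target with a single-sample OS-ELM-style update: the hidden layer keeps fixed random weights, and only the output weights are adjusted by a recursive least-squares step. This is not the project's FPGA implementation. The class `OSELMQ`, the helper `q_target`, the `tanh` activation, the regularized initialization `P = I / reg`, and all sizes are assumptions made for this example.

```python
import math
import random


class OSELMQ:
    """Q-function with a fixed random hidden layer and RLS-trained outputs."""

    def __init__(self, n_inputs, n_hidden, n_actions, reg=1.0, seed=0):
        rng = random.Random(seed)
        # Input weights and biases are random and never trained.
        self.w = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
        self.b = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        # Output weights (n_hidden x n_actions) start at zero.
        self.beta = [[0.0] * n_actions for _ in range(n_hidden)]
        # Assumed initialization: P = I / reg instead of an initial batch phase.
        self.p = [[1.0 / reg if i == j else 0.0 for j in range(n_hidden)]
                  for i in range(n_hidden)]

    def hidden(self, x):
        return [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + bi)
                for row, bi in zip(self.w, self.b)]

    def _output(self, h):
        n_actions = len(self.beta[0])
        return [sum(h[i] * self.beta[i][a] for i in range(len(h)))
                for a in range(n_actions)]

    def predict(self, x):
        return self._output(self.hidden(x))

    def update(self, x, target):
        """One recursive least-squares step toward `target` for input `x`."""
        h = self.hidden(x)
        n = len(h)
        ph = [sum(self.p[i][j] * h[j] for j in range(n)) for i in range(n)]
        denom = 1.0 + sum(h[i] * ph[i] for i in range(n))
        # P <- P - (P h h^T P) / (1 + h^T P h); P is symmetric.
        for i in range(n):
            for j in range(n):
                self.p[i][j] -= ph[i] * ph[j] / denom
        # With the updated P, the gain P h simplifies to (old P h) / denom.
        gain = [v / denom for v in ph]
        pred = self._output(h)
        for a, (t, y) in enumerate(zip(target, pred)):
            for i in range(n):
                self.beta[i][a] += gain[i] * (t - y)


def q_target(model, state, action, reward, next_state, done, gamma=0.99):
    """Bellman target: only the taken action's Q-value is changed."""
    target = model.predict(state)
    best_next = 0.0 if done else max(model.predict(next_state))
    target[action] = reward + gamma * best_next
    return target


# CartPole has a 4-dimensional state and 2 actions.
model = OSELMQ(n_inputs=4, n_hidden=16, n_actions=2)
state, next_state = [0.1, -0.2, 0.05, 0.3], [0.12, -0.1, 0.04, 0.2]
model.update(state, q_target(model, state, 1, 1.0, next_state, True))
```

Each update costs a fixed number of multiply-accumulate operations on small matrices and needs no gradient pass through the hidden layer, which is the kind of workload that maps naturally onto FPGA logic.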


Conclusion

  • The on-device Deep-Q-Learning project ran more efficiently without backpropagation.

BIO

Ganapathi Pulipaka is an AI research scientist at DeepSingularity, working on AI infrastructure, supercomputing, high-performance computing (HPC), AI strategy, and neural network architecture, and breaking new ground in machine learning for conversational AI, NLP, robotics, IoT, IIoT, and reinforcement learning algorithms. He is ranked the #5 data science influencer by Onalytica.