

Python common.discount method code examples

This article collects typical usage examples of the common.discount method from the Python module baselines.common. If you are wondering what common.discount does, how to call it, or what a realistic use looks like, the selected example below should help; you can also explore the rest of baselines.common for related usage. A minimal sketch of the call itself comes first.
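A minimal sketch, assuming that common.discount(x, gamma) returns the discounted cumulative sum y[t] = x[t] + gamma * y[t+1] along the first axis (this behavior is an assumption stated here for illustration, not something shown by the example below):

import numpy as np
from baselines import common

rewards = np.array([1.0, 0.0, 2.0])
# Assumption: discounted cumulative sum, y[t] = x[t] + gamma * y[t+1].
returns = common.discount(rewards, 0.9)
# Expected under that assumption: [2.62, 1.8, 2.0]
# (last element 2.0; 0.0 + 0.9*2.0 = 1.8; 1.0 + 0.9*1.8 = 2.62)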


One code example of the common.discount method is shown below.

Example 1: run

# Required module import: from baselines import common [as alias]
# Or alternatively: from baselines.common import discount [as alias]
# The example also uses NumPy: import numpy as np
def run(self, update_counters=True):
        ob = self.env.reset()
        prev_ob = np.float32(np.zeros(ob.shape))
        if self.obfilter: ob = self.obfilter(ob)
        terminated = False
    
        obs = []
        acs = []
        ac_dists = []
        logps = []
        rewards = []

        for _ in range(self.max_pathlength):
            if self.animate:
                self.env.render()
            # The policy sees the current and previous observations stacked together.
            state = np.concatenate([ob, prev_ob], -1)
            obs.append(state)
            ac, ac_dist, logp = self.policy.act(state)
            acs.append(ac)
            ac_dists.append(ac_dist)
            logps.append(logp)
            prev_ob = np.copy(ob)
            # Rescale the action from [-1, 1] to the environment's action bounds, then clip.
            scaled_ac = self.env.action_space.low + (ac + 1.) * 0.5 * (self.env.action_space.high - self.env.action_space.low)
            scaled_ac = np.clip(scaled_ac, self.env.action_space.low, self.env.action_space.high)
            ob, rew, done, _ = self.env.step(scaled_ac)
            if self.obfilter: ob = self.obfilter(ob)
            rewards.append(rew)
            if done:
                terminated = True
                break
        # Keep a running window of the most recent 100 episode returns.
        self.rewards.append(sum(rewards))
        self.rewards = self.rewards[-100:]
        if update_counters:
            self._num_rollouts += 1
            self._num_steps += len(rewards)
              
        path = {"observation" : np.array(obs), "terminated" : terminated,
                "reward" : np.array(rewards), "action" : np.array(acs),
                "action_dist": np.array(ac_dists), "logp" : np.array(logps)}
        
        rew_t = path["reward"]
        value = self.policy.predict(path["observation"], path)
        vtarg = common.discount(np.append(rew_t, 0.0 if path["terminated"] else value[-1]), self.gamma)[:-1]
        vpred_t = np.append(value, 0.0 if path["terminated"] else value[-1])
        delta_t = rew_t + self.gamma*vpred_t[1:] - vpred_t[:-1]
        adv_GAE = common.discount(delta_t, self.gamma * self.lam)
        
        if np.mean(self.rewards) >= self.score and not self.finished:
            self.episodes_till_done = self._num_rollouts
            self.frames_till_done = self._num_steps
            self.finished = True      
        
        return path, vtarg, value, adv_GAE 
Developer ID: wgrathwohl, Project: BackpropThroughTheVoidRL, Lines of code: 55, Source file: a2c_cont.py
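The last few lines of run() are where common.discount actually appears: once to turn the reward sequence into discounted value targets, and once to turn the one-step TD residuals delta_t = r_t + gamma*V(s_{t+1}) - V(s_t) into Generalized Advantage Estimates. Below is a self-contained sketch of that pattern with made-up numbers, using a hand-written discounted_sum as an assumed stand-in for common.discount; it is an illustration only, not code from the project:

import numpy as np

def discounted_sum(x, gamma):
    # y[t] = x[t] + gamma * y[t+1]; assumed stand-in for baselines.common.discount.
    y = np.zeros(len(x))
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + gamma * running
        y[t] = running
    return y

gamma, lam = 0.99, 0.97
rew_t = np.array([0.0, 1.0, 0.0, 2.0])   # made-up rewards
value = np.array([0.5, 0.6, 0.4, 0.3])   # made-up V(s_t) predictions
terminated = False

# Value targets: discounted rewards, bootstrapped with value[-1] on truncation.
vtarg = discounted_sum(np.append(rew_t, 0.0 if terminated else value[-1]), gamma)[:-1]
vpred_t = np.append(value, 0.0 if terminated else value[-1])
# One-step TD residuals, then GAE by discounting them with gamma * lam.
delta_t = rew_t + gamma * vpred_t[1:] - vpred_t[:-1]
adv_GAE = discounted_sum(delta_t, gamma * lam)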


Note: the baselines.common.discount examples in this article were compiled by 純淨天空 from open-source code and documentation platforms such as GitHub and MSDocs, and the snippets were selected from open-source projects contributed by their respective authors. Copyright of the source code remains with the original authors; consult the corresponding project's license before distributing or using it, and do not reproduce this article without permission.