PropertyValue
nif:beginIndex
  • 0 (xsd:integer)
nif:broaderContext
nif:endIndex
  • 190 (xsd:integer)
nif:isString
  • Q-learning is based on estimating the expected total discounted future rewards (the quality) of each state-action pair under a policy π: Qπ(st, at) = E[rt+1 + γrt+2 + γ2rt+2 + … + γT-trT|π].
rdf:type