sentence7 | SoMeSci

Property	Value
nif:beginIndex	0 (xsd:integer)
nif:broaderContext	sms:PMC5381785
nif:endIndex	190 (xsd:integer)
nif:isString	Q-learning is based on estimating the expected total discounted future rewards (the quality) of each state-action pair under a policy π: Qπ(st, at) = E[rt+1 + γrt+2 + γ2rt+2 + … + γT-trT|π].
rdf:type	nif:Context nif:OffsetBasedString nif:Sentence