




A Survey on the Optimization of Large Language Model-based Agents
arXiv:2503.12434v1 [cs.AI] 16 Mar 2025
SHANGHENG DU, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, China
JIABAO ZHAO+, School of Computer Science and Technology, Donghua University, China
JINXIN SHI, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, China
ZHENTAO XIE, School of Computer Science and Technology, East China Normal University, China
XIN JIANG, School of Computer Science and Technology, East China Normal University, China
YANHONG BAI, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, China
LIANG HE, School of Computer Science and Technology, East China Normal University, China
With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks. However, current work typically relies on prompt design or fine-tuning strategies applied to vanilla LLMs, which often leads to limited effectiveness or suboptimal performance in complex agent-related environments. Although LLM optimization techniques can improve model performance across many general tasks, they lack specialized optimization towards critical agent functionalities such as long-term planning, dynamic environmental interaction, and complex decision-making. Although numerous recent studies have explored various strategies to optimize LLM-based agents for complex agent tasks, a systematic review summarizing and comparing these methods from a holistic perspective is still lacking. In this survey, we provide a comprehensive review of LLM-based agent optimization approaches, categorizing them into parameter-driven and parameter-free methods. We first focus on parameter-driven optimization, covering fine-tuning-based optimization, reinforcement learning-based optimization, and hybrid strategies, analyzing key aspects such as trajectory data construction, fine-tuning techniques, reward function design, and optimization algorithms. Additionally, we briefly discuss parameter-free strategies that optimize agent behavior through prompt engineering and external knowledge retrieval. Finally, we summarize the datasets and benchmarks used for evaluation and tuning, review key applications of LLM-based agents, and discuss major challenges and promising future directions. Our repository for related references is available at /YoungDubbyDu/LLM-Agent-Optimization.
1 Introduction
The development of autonomous agents has been a long-term pursuit in Artificial Intelligence (AI). AI agents have evolved from early rule-based and expert system-based architectures to reinforcement learning (RL)-driven agents, which are now widely applied in many fields [35]. Traditional RL-based agents optimize policies through interaction with environments, using structured reward functions to achieve goals and improve performance over time. However, these approaches often require extensive training, rely on well-defined state-action spaces, and struggle with generalization across diverse tasks.

+ Corresponding author.

Authors' Contact Information: Shangheng Du, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, Shanghai, China, dsh@.cn; Jiabao Zhao, School of Computer Science and Technology, Donghua University, Shanghai, China, jbzhao@; Jinxin Shi, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, Shanghai, China, jinxinshi@; Zhentao Xie, School of Computer Science and Technology, East China Normal University, Shanghai, China, ecnudavidtao@; Xin Jiang, School of Computer Science and Technology, East China Normal University, Shanghai, China, 51275901099@; Yanhong Bai, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, Shanghai, China, Lucky_Baiyh@; Liang He, School of Computer Science and Technology, East China Normal University, Shanghai, China, lhe@.
In recent years, Large Language Models (LLMs) such as GPT-4 [120], PaLM 2 [5], and DeepSeek-R1 [52] have achieved remarkable success, demonstrating exceptional capabilities in language understanding, reasoning, planning, and complex decision-making. Building on these strengths, LLMs can serve as agents, providing a promising pathway to improve autonomous decision-making and achieve AGI [169]. Unlike conventional RL-based agents, which optimize explicit reward-driven policies, LLM-based agents operate through text-based instructions, prompt templates, and in-context learning (ICL), allowing greater flexibility and generalization. These agents leverage the comprehension and reasoning capabilities of LLMs to interact with environments through natural language, execute complex multi-step tasks, and dynamically adapt to evolving scenarios. Existing LLM agents utilize various methods such as task decomposition [64], self-reflection [133], memory augmentation [210], and multi-agent collaboration [86] to achieve high performance across a range of domains, including software development [67], mathematical reasoning [1], embodied intelligence [212], web navigation [28], and more.
However, despite their strengths, LLMs are not inherently designed for autonomous decision-making and long-term tasks. Their training objectives focus on next-token prediction rather than the reasoning, planning, or interactive learning required for agent-based tasks, so they lack explicit training on agent-centric tasks. As a result, deploying LLMs as agents in complex environments presents several key challenges: 1) LLM-based agents struggle with long-horizon planning and multi-step reasoning, as their generated content may lead to task inconsistencies or error accumulation over extended interactions. 2) Limited memory capacity in LLMs hinders agents from utilizing past experiences for reflection, leading to suboptimal decision-making and task performance. 3) The adaptability of LLM-based agents to novel environments is constrained, as they primarily rely on pre-trained knowledge or fixed contexts, limiting their ability to handle dynamic scenarios. These limitations are particularly evident in open-source LLMs, which lag behind proprietary models like GPT-4 in agent-specific capabilities. Additionally, the high cost and lack of transparency of closed-source LLMs highlight the need for optimizing open LLMs to enhance agent capabilities.
Existing techniques, such as supervised fine-tuning (SFT) [122] and reinforcement learning with human feedback (RLHF) [121], have made significant strides in improving LLM performance on instruction-following tasks, but they fail to fully address the challenges of decision-making, long-term planning, and adaptability for LLM-based agents. Optimizing LLM-based agents requires a broader understanding of dynamic environments and agent behaviors, which calls for specialized techniques that go beyond traditional LLM fine-tuning and prompt engineering methods. To address these challenges, numerous recent studies have explored various strategies to optimize LLM-based agents for complex agent tasks. These methods ensure that agents can generalize across diverse environments, refine strategies based on feedback, and efficiently utilize external resources such as tools, memory, and retrieval mechanisms.
In this paper, we provide a comprehensive survey on LLM-based agent optimization, systematically categorizing methods into parameter-driven and parameter-free optimization strategies. Our work focuses on the technical methodologies employed to optimize agent capabilities, such as agent tuning and RL, to improve agent performance. Specifically, Parameter-driven Optimization refines LLM parameters to enhance agent performance. This category includes conventional fine-tuning approaches, covering key stages such as agent trajectory data construction and fine-tuning strategies. In addition, we explore RL-based optimization, which is divided into two distinct optimization directions: reward function-based methods leveraging traditional RL techniques like Actor-Critic [147] and Proximal Policy Optimization (PPO) [136], and preference alignment-based methods utilizing Direct Preference Optimization (DPO) [132] to align agent policies with human preferences or task-specific objectives. Finally, we discuss hybrid fine-tuning optimization strategies, an emerging area that combines SFT with RL to iteratively refine agent behavior. In contrast, we also briefly outline Parameter-free Optimization methods that focus on improving agent behavior without modifying model parameters. These methods leverage prompt engineering, in-context learning, and retrieval-augmented generation (RAG), incorporating various types of information into prompts to guide agents' actions. They are categorized into feedback-based optimization, experience-based optimization, tool-based optimization, retrieval-augmented optimization, and multi-agent collaborative optimization.
Fig. 1. An Overview of the Paper Organization.
Comparison to related surveys. Despite the growing research interest in LLM-based agents, existing surveys primarily focus on general LLM optimization or specific agent abilities such as planning, memory, and role-playing, without treating LLM-based agent optimization as a distinct research area. Surveys on LLM optimization mainly cover fine-tuning [115, 122] and self-evolution approaches [150], but lack discussions on the specialized optimization required for agent capabilities. On the other hand, existing agent-related surveys generally categorize works based on architectural components such as planning [64], memory [210], or multi-agent coordination [86], rather than systematically summarizing the techniques dedicated to optimizing LLM-based agent behaviors and performance. In comparison, this work is the first survey dedicated to LLM-based agent optimization techniques, facilitating a clearer understanding and comparison of existing methods and providing directions for future research.
Scope and rationales. (1) We survey only LLM-based agent optimization algorithms that improve agent task performance, such as problem-solving and decision-making, covering parameter-driven and parameter-free approaches. We exclude works centered on general LLM efficiency, role-playing, or dialogue. (2) Our selection includes papers from AI and NLP conferences and journals, as well as recent high-impact preprints from arXiv, to ensure coverage of the latest advancements. (3) We focus on studies published since 2022 to reflect recent advancements in LLM-based agent optimization.
Organization of this survey. The schematic representation of this manuscript's layout can be found in Figure 1. Section 2 provides the background knowledge and related concepts. In Section 3, we systematically review parameter-driven optimization approaches that modify LLM parameters to enhance agent capabilities, categorizing them into three main strategies: fine-tuning-based optimization (§3.1), RL-based optimization (§3.2), and hybrid optimization (§3.3). Section 4 summarizes and classifies existing work on parameter-free optimization strategies. Then, Section 5 presents datasets and benchmarks, while Section 6 reviews practical applications across various domains. Finally, Section 7 highlights challenges and future directions.
2 Background
2.1 Reinforcement Learning-based Agent Optimization
RL has long been a fundamental approach in agent optimization, allowing agents to learn from interactions with environments. Current RL methods mainly optimize agent behaviors using value-based and policy-based approaches [35, 106, 117]. Value-based methods, such as Q-learning [25, 163], optimize an agent's action-value function to maximize long-term rewards. These methods are effective in discrete action spaces but struggle with high-dimensional state or action spaces. Policy-based methods, including Policy Gradient [48, 124], directly optimize the agent's policy by adjusting parameters based on reward gradients. To improve stability and sample efficiency, PPO [136] introduced a constraint on policy updates, mitigating performance degradation during training. Actor-Critic methods [147] combine value estimation with policy learning, improving convergence efficiency and decision robustness. Beyond single-agent settings, Multi-Agent Reinforcement Learning (MARL) extends RL techniques to scenarios involving multiple interacting agents, enabling both cooperative and competitive dynamics [12, 204].
In recent years, RL has also been increasingly applied to aligning AI agents with human intentions, particularly in preference-based optimization. RLHF [121] has emerged as a prominent approach, refining agent policies based on human-provided signals to improve alignment with desired behaviors. DPO [132] optimizes policies directly from preference data without reward modeling, improving alignment and controllability. Overall, RL-based optimization has evolved from early value-based and policy-based learning to more advanced techniques that integrate structured feedback and multi-agent coordination, providing a foundation for improving decision-making in LLM-based agents.
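Similarly, the preference-alignment objective of DPO [132] can be written as a single loss over chosen and rejected responses scored against a frozen reference model; the sketch below assumes the per-response log-probabilities have already been summed, and the beta value is illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss (illustrative sketch).

    Each argument is the summed log-probability of a complete response
    under either the policy being trained or the frozen reference model.
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # Widen the chosen-vs-rejected margin relative to the reference model,
    # without ever fitting an explicit reward model.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```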
2.2 LLM Fine-Tuning
LLM fine-tuning is a critical method for adapting pre-trained models to specific tasks through optimizing parameters, making them more suited to the desired application. The most popular approach is SFT, where LLMs are trained on labeled data to improve task-specific performance. Instruction Tuning is a commonly used method in SFT, where LLMs are further trained on instruction-output pairs to enhance their ability to follow human commands [98, 205]. Another major development is parameter-efficient fine-tuning (PEFT), including methods like P-Tuning [103], LoRA [59], and QLoRA [30]. These techniques adjust a small subset of parameters, significantly reducing the computational cost of fine-tuning while preserving LLM performance, making them highly efficient for real-world applications. Additionally, RLHF has been used to fine-tune LLMs by integrating human feedback, improving their decision-making and output alignment with user preferences [121]. These optimization techniques enable LLMs to adapt more efficiently to a wide range of tasks, enhancing their effectiveness in real-world scenarios.
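As a concrete illustration of PEFT, the snippet below shows how LoRA adapters [59] are commonly attached to a causal LLM with the Hugging Face peft library; the base model name, rank, and target modules are illustrative choices, not settings reported by the works surveyed here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"          # illustrative base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Inject low-rank adapters into the attention projections; only these
# adapter weights (a small fraction of all parameters) are trained.
lora_cfg = LoraConfig(
    r=8,                                   # rank of the low-rank update
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()         # only the adapters are trainable
```

Training then proceeds as ordinary SFT (next-token cross-entropy on instruction-output pairs) while the frozen base weights stay untouched, which is what keeps the computational cost low.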
2.3 LLM-based RAG
RAG combines LLMs with external information retrieval systems to enhance the relevance and accuracy of generated outputs. By retrieving relevant documents from external sources, RAG allows LLMs to address the knowledge constraints inherent in the models. The evolution of RAG methods has been marked by significant advancements in retrieval and generation integration [44]. Early Naive RAG methods focus on directly retrieving relevant documents to augment the generative process, improving the quality of responses in tasks requiring factual knowledge. To address the challenges of Naive RAG, Advanced RAG was introduced, refining the retrieval process by incorporating more effective ranking, filtering, and document selection strategies. Subsequently, Modular RAG introduces a modular framework that optimizes the retrieval and generative components independently. This modular approach enables task-specific optimizations, allowing for more flexibility and scalability in applications across different domains [8, 193]. These advancements in RAG highlight its potential to enhance LLMs by enabling dynamic access to external knowledge, making them more adaptable and capable of addressing complex tasks in real-world scenarios.
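To make the retrieve-then-generate pattern concrete, the sketch below implements a Naive-RAG-style pipeline with a simple TF-IDF retriever and prompt stuffing; the `generate` callable is a placeholder for any LLM API and is an assumption of this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query, documents, k=3):
    """Return the k documents most similar to the query (TF-IDF retriever)."""
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    top_indices = scores.argsort()[::-1][:k]
    return [documents[i] for i in top_indices]

def rag_answer(query, documents, generate):
    """Naive RAG: prepend retrieved context to the prompt, then generate."""
    context = "\n".join(retrieve(query, documents))
    prompt = (f"Answer the question using the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
    return generate(prompt)                # placeholder LLM call
```

Advanced and Modular RAG variants replace the retriever, add re-ranking and filtering stages, or optimize the retrieval and generation components independently, but the overall data flow stays the same.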
3 Parameter-driven Optimization of LLM-based Agents
Comparison with LLM parameter optimization. Parameter-driven LLM optimization focuses on "how to create a better model", aiming to enhance general language understanding, instruction following, and broad task performance. In contrast, LLM-based agent parameter optimization addresses "how to use the model to solve complex agent tasks", emphasizing decision-making, multi-step reasoning, and task execution in dynamic environments. Although general LLM optimization improves fluency and factual accuracy across diverse applications, LLM-agent optimization is task-specific, requiring models to adapt strategies, interact with environments, and refine behaviors for autonomous problem-solving. Parameter-driven optimization of LLM-based agents primarily relies on expert trajectory data or self-generated trajectory data obtained through environment exploration, then employs various optimization techniques to iteratively refine policies and enhance performance.
In this section, we discuss how parameter-driven optimization methods improve the performance of LLM-based agents. Specifically, we categorize these methods into three main technical approaches according to different strategies for parameter tuning: conventional fine-tuning-based optimization, reinforcement learning-based optimization, and hybrid optimization.
3.1 Conventional Fine-Tuning-based Optimization
Conventional fine-tuning-based agent optimization involves tuning pre-trained LLMs' parameters through various fine-tuning techniques, such as instruction tuning and parameter-efficient fine-tuning. Trajectories for fine-tuning are typically constructed in the form of SFT data and are used to adjust the agent's parameters to better align with task-specific requirements. The optimization process typically consists of two major steps: 1) constructing high-quality trajectory data tailored to agent tasks; 2) fine-tuning LLM-based agents using these trajectory data; the complete process is presented in Figure 2. Previous studies [40, 83, 122] have shown that the quality of training data significantly impacts model performance, highlighting the importance of generating, filtering, and effectively utilizing high-quality trajectories. This makes trajectory construction a critical step in the fine-tuning pipeline, directly influencing the LLM-based agent's overall performance. In Table 1, we provide a comprehensive overview of fine-tuning-based agent optimization methods, highlighting the data processing techniques and fine-tuning strategies used in each work. It is important to note that this section excludes fine-tuning methods that involve reinforcement learning or preference alignment techniques (e.g., DPO, PPO), which will be addressed in §3.2. Instead, in this section, we focus only on the traditional LLM fine-tuning techniques applied in existing works, aiming to ensure that each stage of the conventional fine-tuning-based agent optimization workflow is clearly introduced.

Fig. 2. Workflow of Fine-Tuning-based Optimization for LLM-based Agents.
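To illustrate the two-step workflow summarized in Figure 2, the sketch below converts a single interaction trajectory into chat-style SFT records whose targets are the agent's thought-action outputs; the record fields and format are hypothetical and only meant to show the shape of such data.

```python
def trajectory_to_sft(task_instruction, steps):
    """Turn one agent trajectory into supervised fine-tuning examples.

    steps: list of dicts with keys "observation", "thought", and "action"
           (a hypothetical ReAct-style record, for illustration only).
    """
    examples = []
    history = [{"role": "user", "content": task_instruction}]
    for step in steps:
        if step.get("observation"):
            history.append({"role": "user", "content": step["observation"]})
        target = f"Thought: {step['thought']}\nAction: {step['action']}"
        # Each example pairs the dialogue so far with the next agent output.
        examples.append({"messages": list(history), "target": target})
        history.append({"role": "assistant", "content": target})
    return examples
```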
3.1.1 Trajectory Data Construction for Agent Fine-Tuning. The construction of high-quality trajectories is a crucial step before the fine-tuning of LLM-based agents, which aims to equip LLMs with agent abilities. This process involves the generation of trajectory data, followed by evaluation and filtering, and the potential utilization of low-quality samples, to construct refined data that meet the requirements for effective fine-tuning.
Data Acquisition and Generation. High-quality trajectory data construction begins with the acquisition and generation of initial data, which requires not only a diverse set of trajectories, but also sufficient alignment with the target tasks to ensure effective learning. Methods for acquiring and generating such data can generally be classified into four broad categories: expert-annotated data, strong LLM-generated trajectories, self-exploration environment-interaction trajectories, and multi-agent collaboration-based construction. Here, we introduce the utilization and construction processes of each category and review the relevant studies.
(1) Expert-annotated data. Expert-annotated trajectories refer to high-quality datasets manually crafted by human experts, often considered the gold standard for fine-tuning. These data ensure task reliability and alignment, as experts can meticulously design and annotate trajectories tailored to specific cases.
Many works [14, 39, 144, 158, 177] utilize ReAct-style expert trajectories as initial datasets, with data including thoughts, observations, and actions [189], which enable agents to mimic expert decision-making processes more effectively. For instance, IPR [177] leverages such trajectories to help agents acquire foundational skills. Similarly, ETO [144] and AGILE [39] apply Chain-of-Thought (CoT) methods [164] to expert trajectories for imitation learning, reinforcing task-specific behaviors. To ensure alignment with pre-trained LLM domains, Agent-FLAN [22] transforms ReAct-style expert trajectories into multi-turn dialogues, segmenting the dialogue into different task-specific turns, such as instruction-following and reasoning. StepAgent [29] introduces a two-phase learning process, where agents first observe discrepancies between their policies and expert trajectories, then iteratively refine their actions. Additionally, AgentOhana [202] standardizes heterogeneous agent expert trajectories into a unified format to improve data consistency. Despite their reliability and alignment with specific tasks, these datasets are resource-intensive and lack scalability, so they are commonly supplemented with other data acquisition methods to enhance dataset diversity.

Table 1. Comparison of Conventional Fine-Tuning-based Optimization for LLM-based Agents: Data Construction and Fine-Tuning. Note: MA - Multi-Agent Framework; LQ - Low-Quality Data Utilization.

| Method | Generation | Filtering | MA | LQ | Fine-tune Approach | Base Model |
| --- | --- | --- | --- | --- | --- | --- |
| AgentTuning [199] | Strong LLM | Human or Rule | √ | √ | Instruction Tuning | Llama-2-7B/13B/70B |
| SMART [197] | Multi-agent | Environment | / | √ | LoRA | Llama-2-7B |
| Agent-FLAN [22] | Expert | Model | √ | √ | Instruction Tuning | Llama-2-7B |
| Self-Talk [153] | Multi-agent | Human or Rule | / | √ | LoRA | MosaicAI-7B-Chat |
| ENVISIONS [178] | Self-exploration | Environment | / | √ | SFT | Llama2-7B/13B-Chat |
| AgentGym [170] | Strong LLM & Expert | Environment | / | √ | BC | Llama-2-7B-Chat |
| FireAct [14] | Strong LLM | Environment | / | / | LoRA | GPT3.5, Llama-2-7B/13B, CodeLlama-7B/13B/34B-Instruct |
| NAT [158] | Strong LLM | Environment | / | √ | SFT | Llama-2-7B/13B-Chat |
| AgentLumos [192] | Strong LLM | Human or Rule | / | / | LoRA | Llama-2-7B/13B |
| STE [154] | Self-exploration | Model | / | √ | SFT | Llama-2-7B/13B-Chat, Mistral-7B-Instruct |
| OPTIMA [19] | Multi-agent | Human or Rule | √ | / | SFT | Llama-3-8B |
| Zhou et al. [216] | Strong LLM | Human or Rule | √ | / | LoRA | OpenChat v3.2, Llama-2-7B, AgentLM-7B |
| AgentOhana [202] | Expert | Model | / | / | QLoRA | xLAM-v0.1 |
| COEVOL [85] | Expert | Model | √ | / | SFT | Llama-2-7B, Mistral-7B |
| AGENTBANK [143] | Strong LLM | Environment | / | √ | Instruction Tuning | Llama-2-Chat |
| ADASWITCH [146] | Self-exploration | Model | √ | √ | SFT | DeepSeek-Coder-1.3B, StarCoder2-3B |
| IPR [177] | Expert & Self-exploration | Environment | / | √ | Instruction Tuning | Llama-2-7B |
| Re-ReST [33] | Self-exploration | Environment | / | √ | LoRA | Llama-2-7B/13B, Llama-3-8B, CodeLlama-13B, VPGen |
| ATM [219] | Multi-agent | / | √ | / | MITO | Llama-2-7B |
| Aksitov et al. [3] | Self-exploration | Model-based | / | / | SFT | PaLM-2-base-series |
| SWIFTSAGE [94] | Self-exploration | Environment | √ | / | SFT | T5-Large |
| AGILE [39] | Expert | / | / | / | BC | Vicuna-13B, Meerkat-7B |
| NLRL [40] | Self-exploration | / | / | / | SFT | Llama-3.1-8B-Instruct |
| ETO [144] | Expert | / | / | √ | BC | Llama-2-7B-Chat |
| Retrospex [171] | Expert | / | / | √ | BC | Flan-T5-Large, Llama-3-8B-Instruct |
| ToRA [49] | Strong LLM | Human or Rule | / | √ | BC | Llama-2-series, CodeLlama-series |
| SaySelf [179] | Strong LLM | Human or Rule | / | / | SFT | Mistral-7B, Llama-3-8B |
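Many of the filtering strategies listed in Table 1 (environment-, model-, or rule-based) ultimately reduce to scoring candidate trajectories and keeping those above a threshold; the function below sketches reward-based filtering under that assumption, with the field names and threshold being illustrative.

```python
def filter_trajectories(trajectories, min_reward=1.0):
    """Split trajectories by a final-reward threshold (illustrative sketch)."""
    kept, discarded = [], []
    for traj in trajectories:
        target = kept if traj.get("reward", 0.0) >= min_reward else discarded
        target.append(traj)
    # Discarded (low-quality) samples need not be wasted: as the LQ column in
    # Table 1 indicates, several methods reuse them, e.g. as negative examples.
    return kept, discarded
```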
(2) Strong LLM-generated trajectories. Strong LLM-generated trajectories leverage powerful LLMs like ChatGPT and GPT-4 to autonomously generate task-specific data. These trajectories are usually produced by reasoning frameworks such as ReAct and CoT, allowing the model to interact with the environment and simulate processes of reasoning, decision-making, and acting.
AgentTuning [199] and FireAct [14] employ ReAct and CoT to guide agent behavior while incorporating Reflexion [139] refinements, improving the diversity of generated data. Some works integrate tools and structured annotations to enhance trajectory informativeness. NAT [158] generates multiple trajectories under different temperature settings, using ReAct prompts and integrating tools such as calculators and APIs during interactions. AgentLumos [192] utilizes GPT-4 and GPT-4V to annotate datasets within planning and grounding modules, producing LUMOS-I and LUMOS-O style data. Other methods explore multi-role simulation to enrich trajectory complexity. Zhou et al. [216] employ GPT-4 to simulate problem generators, action planners, and environment agents, enabling iterative interaction-driven data generation. AGENTBANK [143] also leverages GPT-4 for environment interaction data and GPT-3.5 for CoT rationales, and finally transforms the data into chatbot-style formats for improved usability.
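The collection procedures described in this category can be pictured as a simple ReAct-style interaction loop between a strong LLM and an environment; the outline below is a sketch in which `llm`, `env`, and the prompt format are placeholder assumptions rather than the exact pipelines of the cited works.

```python
def collect_trajectory(llm, env, task, max_steps=10):
    """Roll out one ReAct-style trajectory with a strong LLM (illustrative)."""
    observation = env.reset(task)          # hypothetical environment interface
    steps = []
    for _ in range(max_steps):
        prompt = f"Task: {task}\nObservation: {observation}\nThought and Action:"
        output = llm(prompt)               # e.g. "Thought: ...\nAction: ..."
        observation, reward, done = env.step(output)
        steps.append({"prompt": prompt, "output": output, "reward": reward})
        if done:
            break
    return steps
```

Varying the sampling temperature or the prompting framework across rollouts, as NAT [158] does, is one way to increase the diversity of the collected trajectories.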
(3) Self-exploration environment-interaction trajectories. Given the high costs of expert annotation and p