




A Survey on the Optimization of Large Language Model-based Agents
arXiv:2503.12434v1 [cs.AI] 16 Mar 2025
SHANGHENG DU, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, China
JIABAO ZHAO+, School of Computer Science and Technology, Donghua University, China
JINXIN SHI, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, China
ZHENTAO XIE, School of Computer Science and Technology, East China Normal University, China
XIN JIANG, School of Computer Science and Technology, East China Normal University, China
YANHONG BAI, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, China
LIANG HE, School of Computer Science and Technology, East China Normal University, China
With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks. However, current work typically relies on prompt design or fine-tuning strategies applied to vanilla LLMs, which often leads to limited effectiveness or suboptimal performance in complex agent-related environments. Although LLM optimization techniques can improve model performance across many general tasks, they lack specialized optimization towards critical agent functionalities such as long-term planning, dynamic environmental interaction, and complex decision-making. Although numerous recent studies have explored various strategies to optimize LLM-based agents for complex agent tasks, a systematic review summarizing and comparing these methods from a holistic perspective is still lacking. In this survey, we provide a comprehensive review of LLM-based agent optimization approaches, categorizing them into parameter-driven and parameter-free methods. We first focus on parameter-driven optimization, covering fine-tuning-based optimization, reinforcement learning-based optimization, and hybrid strategies, analyzing key aspects such as trajectory data construction, fine-tuning techniques, reward function design, and optimization algorithms. Additionally, we briefly discuss parameter-free strategies that optimize agent behavior through prompt engineering and external knowledge retrieval. Finally, we summarize the datasets and benchmarks used for evaluation and tuning, review key applications of LLM-based agents, and discuss major challenges and promising future directions. Our repository for related references is available at /YoungDubbyDu/LLM-Agent-Optimization.
1 Introduction
The development of autonomous agents has been a long-term pursuit in Artificial Intelligence (AI). AI agents have evolved from early rule-based and expert system-based architectures to reinforcement learning (RL)-driven agents, which are now widely applied in many fields [35]. Traditional RL-based agents optimize policies through interaction with environments, using structured reward functions to achieve goals and improve performance over time. However, these approaches often require extensive training, rely on well-defined state-action spaces, and struggle with generalization across diverse tasks.

+ Corresponding author.

Authors' Contact Information: Shangheng Du, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, Shanghai, China, dsh@.cn; Jiabao Zhao, School of Computer Science and Technology, Donghua University, Shanghai, China, jbzhao@; Jinxin Shi, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, Shanghai, China, jinxinshi@; Zhentao Xie, School of Computer Science and Technology, East China Normal University, Shanghai, China, ecnudavidtao@; Xin Jiang, School of Computer Science and Technology, East China Normal University, Shanghai, China, 51275901099@; Yanhong Bai, Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University, Shanghai, China, Lucky_Baiyh@; Liang He, School of Computer Science and Technology, East China Normal University, Shanghai, China, lhe@.
In recent years, Large Language Models (LLMs) such as GPT-4 [120], PaLM 2 [5], and DeepSeek-R1 [52] have achieved remarkable success, demonstrating exceptional capabilities in language understanding, reasoning, planning, and complex decision-making. Building on these strengths, LLMs can serve as agents, providing a promising pathway to improve autonomous decision-making and achieve AGI [169]. Unlike conventional RL-based agents, which optimize explicit reward-driven policies, LLM-based agents operate through text-based instructions, prompt templates, and in-context learning (ICL), allowing greater flexibility and generalization. These agents leverage the comprehension and reasoning capabilities of LLMs to interact with environments through natural language, execute complex multi-step tasks, and dynamically adapt to evolving scenarios. Existing LLM agents utilize various methods such as task decomposition [64], self-reflection [133], memory augmentation [210], and multi-agent collaboration [86] to achieve high performance across a range of domains, including software development [67], mathematical reasoning [1], embodied intelligence [212], web navigation [28], and more.
However, despite their strengths, LLMs are not inherently designed for autonomous decision-making and long-term tasks. Their training objectives focus on next-token prediction rather than the reasoning, planning, or interactive learning required for agent-based tasks, so they lack explicit training on agent-centric tasks. As a result, deploying LLMs as agents in complex environments presents several key challenges: 1) LLM-based agents struggle with long-horizon planning and multi-step reasoning, as their generated content may lead to task inconsistencies or error accumulation over extended interactions. 2) Limited memory capacity in LLMs hinders agents from utilizing past experiences for reflection, leading to suboptimal decision-making and task performance. 3) The adaptability of LLM-based agents to novel environments is constrained, as they primarily rely on pre-trained knowledge or fixed contexts, limiting their ability to handle dynamic scenarios. These limitations are particularly evident in open-source LLMs, which lag behind proprietary models like GPT-4 in agent-specific capabilities. Additionally, the high cost and lack of transparency of closed-source LLMs highlight the need for optimizing open LLMs to enhance agent capabilities.
Existing techniques, such as supervised fine-tuning (SFT) [122] and reinforcement learning with human feedback (RLHF) [121], have made significant strides in improving LLM performance on instruction-following tasks, but they fail to fully address the challenges of decision-making, long-term planning, and adaptability for LLM-based agents. Optimizing LLM-based agents requires a broader understanding of dynamic environments and agent behaviors, which calls for specialized techniques that go beyond traditional LLM fine-tuning and prompt engineering methods. To address these challenges, numerous recent studies have explored various strategies to optimize LLM-based agents for complex agent tasks. These methods ensure that agents can generalize across diverse environments, refine strategies based on feedback, and efficiently utilize external resources such as tools, memory, and retrieval mechanisms.
In this paper, we provide a comprehensive survey on LLM-based agent optimization, systematically categorizing methods into parameter-driven and parameter-free optimization strategies. Our work focuses on the technical methodologies employed to optimize agent capabilities, such as agent tuning and RL, to improve agent performance. Specifically, Parameter-driven Optimization refines LLM parameters to enhance agent performance. This category includes conventional fine-tuning approaches, covering key stages such as agent trajectory data construction and fine-tuning strategies. In addition, we explore RL-based optimization, which is divided into two distinct optimization directions: reward function-based methods leveraging traditional RL techniques like Actor-Critic [147] and Proximal Policy Optimization (PPO) [136], and preference alignment-based methods utilizing Direct Preference Optimization (DPO) [132] to align agent policies with human preferences or task-specific objectives. Finally, we discuss hybrid fine-tuning optimization strategies, an emerging area that combines SFT with RL to iteratively refine agent behavior. In contrast, we also briefly outline Parameter-free Optimization methods that focus on improving agent behavior without modifying model parameters. These methods leverage prompt engineering, in-context learning, and retrieval-augmented generation (RAG), incorporating various types of information into prompts to guide agents' actions. They are categorized into feedback-based optimization, experience-based optimization, tool-based optimization, retrieval-augmented optimization, and multi-agent collaborative optimization.
Fig. 1. An Overview of the Paper Organization.
Comparison to related surveys. Despite the growing research interest in LLM-based agents, existing surveys primarily focus on general LLM optimization or specific agent abilities such as planning, memory, and role-playing, without treating LLM-based agent optimization as a distinct research area. Surveys on LLM optimization mainly cover fine-tuning [115, 122] and self-evolution approaches [150], but lack discussions on the specialized optimization required for agent capabilities. On the other hand, existing agent-related surveys generally categorize works based on architectural components such as planning [64], memory [210], or multi-agent coordination [86], rather than systematically summarizing the techniques dedicated to optimizing LLM-based agent behaviors and performance. In comparison, this work is the first survey dedicated to LLM-based agent optimization techniques, facilitating a clearer understanding and comparison of existing methods and providing directions for future research.
Scope and rationales. (1) We survey only LLM-based agent optimization algorithms that improve agent task performance, such as problem-solving and decision-making, covering parameter-driven and parameter-free approaches. We exclude works centered on general LLM efficiency, role-playing, or dialogue. (2) Our selection includes papers from AI and NLP conferences and journals, as well as recent high-impact preprints from arXiv, to ensure coverage of the latest advancements. (3) We focus on studies published since 2022 to reflect recent advancements in LLM-based agent optimization.
Organization of this survey. The schematic representation of this manuscript's layout can be found in Figure 1. Section 2 provides the background knowledge and related concepts. In Section 3, we systematically review parameter-driven optimization approaches that modify LLM parameters to enhance agent capabilities, categorizing them into three main strategies: fine-tuning-based optimization (§3.1), RL-based optimization (§3.2), and hybrid optimization (§3.3). Section 4 summarizes and classifies existing work on parameter-free optimization strategies. Then, Section 5 presents datasets and benchmarks, while Section 6 reviews practical applications across various domains. Finally, Section 7 highlights challenges and future directions.
2 Background
2.1 Reinforcement Learning-based Agent Optimization
RL has long been a fundamental approach in agent optimization, allowing agents to learn from interactions with environments. Current RL methods mainly optimize agent behaviors using value-based and policy-based approaches [35, 106, 117]. Value-based methods, such as Q-learning [25, 163], optimize an agent's action-value function to maximize long-term rewards. These methods are effective in discrete action spaces but struggle with high-dimensional state or action spaces. Policy-based methods, including Policy Gradient [48, 124], directly optimize the agent's policy by adjusting parameters based on reward gradients. To improve stability and sample efficiency, PPO [136] introduced a constraint on policy updates, mitigating performance degradation during training. Actor-Critic methods [147] combine value estimation with policy learning, improving convergence efficiency and decision robustness. Beyond single-agent settings, Multi-Agent Reinforcement Learning (MARL) extends RL techniques to scenarios involving multiple interacting agents, enabling both cooperative and competitive dynamics [12, 204].
In recent years, RL has also been increasingly applied to aligning AI agents with human intentions, particularly in preference-based optimization. RLHF [121] has emerged as a prominent approach, refining agent policies based on human-provided signals to improve alignment with desired behaviors. DPO [132] optimizes policies directly from preference data without reward modeling, improving alignment and controllability. Overall, RL-based optimization has evolved from early value-based and policy-based learning to more advanced techniques that integrate structured feedback and multi-agent coordination, providing a foundation for improving decision-making in LLM-based agents.
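Similarly, the preference-alignment objective of DPO [132] can be written as a single loss over chosen and rejected responses scored against a frozen reference model; the sketch below assumes the per-response log-probabilities have already been summed, and the beta value is illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss (illustrative sketch).

    Each argument is the summed log-probability of a complete response
    under either the policy being trained or the frozen reference model.
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # Widen the chosen-vs-rejected margin relative to the reference model,
    # without ever fitting an explicit reward model.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```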
2.2 LLM Fine-Tuning
LLM fine-tuning is a critical method for adapting pre-trained models to specific tasks through optimizing parameters, making them more suited to the desired application. The most popular approach is SFT, where LLMs are trained on labeled data to improve task-specific performance. Instruction Tuning is a commonly used method in SFT, where LLMs are further trained on instruction-output pairs to enhance their ability to follow human commands [98, 205]. Another major development is parameter-efficient fine-tuning (PEFT), including methods like P-Tuning [103], LoRA [59], and QLoRA [30]. These techniques adjust a small subset of parameters, significantly reducing the computational cost of fine-tuning while preserving LLM performance, making them highly efficient for real-world applications. Additionally, RLHF has been used to fine-tune LLMs by integrating human feedback, improving their decision-making and output alignment with user preferences [121]. These optimization techniques enable LLMs to adapt more efficiently to a wide range of tasks, enhancing their effectiveness in real-world scenarios.
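As a concrete illustration of PEFT, the snippet below shows how LoRA adapters [59] are commonly attached to a causal LLM with the Hugging Face peft library; the base model name, rank, and target modules are illustrative choices, not settings reported by the works surveyed here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"          # illustrative base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Inject low-rank adapters into the attention projections; only these
# adapter weights (a small fraction of all parameters) are trained.
lora_cfg = LoraConfig(
    r=8,                                   # rank of the low-rank update
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()         # only the adapters are trainable
```

Training then proceeds as ordinary SFT (next-token cross-entropy on instruction-output pairs) while the frozen base weights stay untouched, which is what keeps the computational cost low.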
2.3 LLM-based RAG
RAG combines LLMs with external information retrieval systems to enhance the relevance and accuracy of generated outputs. By retrieving relevant documents from external sources, RAG allows LLMs to address the knowledge constraints inherent in the models. The evolution of RAG methods has been marked by significant advancements in retrieval and generation integration [44]. Early Naive RAG methods focus on directly retrieving relevant documents to augment the generative process, improving the quality of responses in tasks requiring factual knowledge. To address the challenges of Naive RAG, Advanced RAG was introduced, refining the retrieval process by incorporating more effective ranking, filtering, and document selection strategies. Subsequently, Modular RAG introduces a modular framework that optimizes the retrieval and generative components independently. This modular approach enables task-specific optimizations, allowing for more flexibility and scalability in applications across different domains [8, 193]. These advancements in RAG highlight its potential to enhance LLMs by enabling dynamic access to external knowledge, making them more adaptable and capable of addressing complex tasks in real-world scenarios.
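To make the retrieve-then-generate pattern concrete, the sketch below implements a Naive-RAG-style pipeline with a simple TF-IDF retriever and prompt stuffing; the `generate` callable is a placeholder for any LLM API and is an assumption of this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query, documents, k=3):
    """Return the k documents most similar to the query (TF-IDF retriever)."""
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    top_indices = scores.argsort()[::-1][:k]
    return [documents[i] for i in top_indices]

def rag_answer(query, documents, generate):
    """Naive RAG: prepend retrieved context to the prompt, then generate."""
    context = "\n".join(retrieve(query, documents))
    prompt = (f"Answer the question using the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
    return generate(prompt)                # placeholder LLM call
```

Advanced and Modular RAG variants replace the retriever, add re-ranking and filtering stages, or optimize the retrieval and generation components independently, but the overall data flow stays the same.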
3 Parameter-driven Optimization of LLM-based Agents
Comparison with LLM parameter optimization. Parameter-driven LLM optimization focuses on "how to create a better model", aiming to enhance general language understanding, instruction following, and broad task performance. In contrast, LLM-based agent parameter optimization addresses "how to use the model to solve complex agent tasks", emphasizing decision-making, multi-step reasoning, and task execution in dynamic environments. Although general LLM optimization improves fluency and factual accuracy across diverse applications, LLM-agent optimization is task-specific, requiring models to adapt strategies, interact with environments, and refine behaviors for autonomous problem-solving. Parameter-driven optimization of LLM-based agents primarily relies on expert trajectory data or self-generated trajectory data obtained through environment exploration, then employs various optimization techniques to iteratively refine policies and enhance performance.
In this section, we discuss how parameter-driven optimization methods improve the performance of LLM-based agents. Specifically, we categorize these methods into three main technical approaches according to different strategies for parameter tuning: conventional fine-tuning-based optimization, reinforcement learning-based optimization, and hybrid optimization.
3.1 Conventional Fine-Tuning-based Optimization
Conventional fine-tuning-based agent optimization involves tuning pre-trained LLMs' parameters through various fine-tuning techniques, such as instruction tuning and parameter-efficient fine-tuning. Trajectories for fine-tuning are typically constructed in the form of SFT data and are used to adjust the agent's parameters to better align with task-specific requirements. The optimization process typically consists of two major steps: 1) constructing high-quality trajectory data tailored to agent tasks; 2) fine-tuning LLM-based agents using these trajectory data; the complete process is presented in Figure 2. Previous studies [40, 83, 122] have shown that the quality of training data significantly impacts model performance, highlighting the importance of generating, filtering, and effectively utilizing high-quality trajectories. This makes trajectory construction a critical step in the fine-tuning pipeline, directly influencing the LLM-based agent's overall performance. In Table 1, we provide a comprehensive overview of fine-tuning-based agent optimization methods, highlighting the data processing techniques and fine-tuning strategies used in each work. It is important to note that this section excludes fine-tuning methods that involve reinforcement learning or preference alignment techniques (e.g., DPO, PPO), which will be addressed in §3.2. Instead, in this section, we focus only on the traditional LLM fine-tuning techniques applied in existing works, aiming to ensure that each stage of the conventional fine-tuning-based agent optimization workflow is clearly introduced.

Fig. 2. Workflow of Fine-Tuning-based Optimization for LLM-based Agents.
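To illustrate the two-step workflow summarized in Figure 2, the sketch below converts a single interaction trajectory into chat-style SFT records whose targets are the agent's thought-action outputs; the record fields and format are hypothetical and only meant to show the shape of such data.

```python
def trajectory_to_sft(task_instruction, steps):
    """Turn one agent trajectory into supervised fine-tuning examples.

    steps: list of dicts with keys "observation", "thought", and "action"
           (a hypothetical ReAct-style record, for illustration only).
    """
    examples = []
    history = [{"role": "user", "content": task_instruction}]
    for step in steps:
        if step.get("observation"):
            history.append({"role": "user", "content": step["observation"]})
        target = f"Thought: {step['thought']}\nAction: {step['action']}"
        # Each example pairs the dialogue so far with the next agent output.
        examples.append({"messages": list(history), "target": target})
        history.append({"role": "assistant", "content": target})
    return examples
```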
3.1.1 Trajectory Data Construction for Agent Fine-Tuning. The construction of high-quality trajectories is a crucial step before the fine-tuning of LLM-based agents, which aims to equip LLMs with agent abilities. This process involves the generation of trajectory data, followed by evaluation and filtering, and the potential utilization of low-quality samples, to construct refined data that meet the requirements for effective fine-tuning.
Data Acquisition and Generation. High-quality trajectory data construction begins with the acquisition and generation of initial data, which requires not only a diverse set of trajectories, but also sufficient alignment with the target tasks to ensure effective learning. Methods for acquiring and generating such data can generally be classified into four broad categories: expert-annotated data, strong LLM-generated trajectories, self-exploration environment-interaction trajectories, and multi-agent collaboration-based construction. Here, we introduce the utilization and construction processes of each category and review the relevant studies.
(1) Expert-annotated data. Expert-annotated trajectories refer to high-quality datasets manually crafted by human experts, often considered the gold standard for fine-tuning. These data ensure task reliability and alignment, as experts can meticulously design and annotate trajectories tailored to specific cases.
Many works [14, 39, 144, 158, 177] utilize ReAct-style expert trajectories as initial datasets, with data including thoughts, observations, and actions [189], which enable agents to mimic expert decision-making processes more effectively. For instance, IPR [177] leverages such trajectories to help agents acquire foundational skills. Similarly, ETO [144] and AGILE [39] apply Chain-of-Thought (CoT) methods [164] to expert trajectories for imitation learning, reinforcing task-specific behaviors. To ensure alignment with pre-trained LLM domains, Agent-FLAN [22] transforms ReAct-style expert trajectories into multi-turn dialogues, segmenting the dialogue into different task-specific turns, such as instruction-following and reasoning. StepAgent [29] introduces a two-phase learning process, where agents first observe discrepancies between their policies and expert trajectories, then iteratively refine their actions. Additionally, AgentOhana [202] standardizes heterogeneous agent expert trajectories into a unified format to improve data consistency. Despite their reliability and alignment with specific tasks, these datasets are resource-intensive and lack scalability, so they are commonly supplemented with other data acquisition methods to enhance dataset diversity.

Table 1. Comparison of Conventional Fine-Tuning-based Optimization for LLM-based Agents: Data Construction and Fine-Tuning. Note: MA - Multi-Agent Framework; LQ - Low-Quality Data Utilization.

| Method | Generation | Filtering | MA | LQ | Fine-tune Approach | Base Model |
| --- | --- | --- | --- | --- | --- | --- |
| AgentTuning [199] | Strong LLM | Human or Rule | √ | √ | Instruction Tuning | Llama-2-7B/13B/70B |
| SMART [197] | Multi-agent | Environment | / | √ | LoRA | Llama-2-7B |
| Agent-FLAN [22] | Expert | Model | √ | √ | Instruction Tuning | Llama-2-7B |
| Self-Talk [153] | Multi-agent | Human or Rule | / | √ | LoRA | MosaicAI-7B-Chat |
| ENVISIONS [178] | Self-exploration | Environment | / | √ | SFT | Llama2-7B/13B-Chat |
| AgentGym [170] | Strong LLM & Expert | Environment | / | √ | BC | Llama-2-7B-Chat |
| FireAct [14] | Strong LLM | Environment | / | / | LoRA | GPT3.5, Llama-2-7B/13B, CodeLlama-7B/13B/34B-Instruct |
| NAT [158] | Strong LLM | Environment | / | √ | SFT | Llama-2-7B/13B-Chat |
| AgentLumos [192] | Strong LLM | Human or Rule | / | / | LoRA | Llama-2-7B/13B |
| STE [154] | Self-exploration | Model | / | √ | SFT | Llama-2-7B/13B-Chat, Mistral-7B-Instruct |
| OPTIMA [19] | Multi-agent | Human or Rule | √ | / | SFT | Llama-3-8B |
| Zhou et al. [216] | Strong LLM | Human or Rule | √ | / | LoRA | OpenChat v3.2, Llama-2-7B, AgentLM-7B |
| AgentOhana [202] | Expert | Model | / | / | QLoRA | xLAM-v0.1 |
| COEVOL [85] | Expert | Model | √ | / | SFT | Llama-2-7B, Mistral-7B |
| AGENTBANK [143] | Strong LLM | Environment | / | √ | Instruction Tuning | Llama-2-Chat |
| ADASWITCH [146] | Self-exploration | Model | √ | √ | SFT | DeepSeek-Coder-1.3B, StarCoder2-3B |
| IPR [177] | Expert & Self-exploration | Environment | / | √ | Instruction Tuning | Llama-2-7B |
| Re-ReST [33] | Self-exploration | Environment | / | √ | LoRA | Llama-2-7B/13B, Llama-3-8B, CodeLlama-13B, VPGen |
| ATM [219] | Multi-agent | / | √ | / | MITO | Llama-2-7B |
| Aksitov et al. [3] | Self-exploration | Model-based | / | / | SFT | PaLM-2-base-series |
| SWIFTSAGE [94] | Self-exploration | Environment | √ | / | SFT | T5-Large |
| AGILE [39] | Expert | / | / | / | BC | Vicuna-13B, Meerkat-7B |
| NLRL [40] | Self-exploration | / | / | / | SFT | Llama-3.1-8B-Instruct |
| ETO [144] | Expert | / | / | √ | BC | Llama-2-7B-Chat |
| Retrospex [171] | Expert | / | / | √ | BC | Flan-T5-Large, Llama-3-8B-Instruct |
| ToRA [49] | Strong LLM | Human or Rule | / | √ | BC | Llama-2-series, CodeLlama-series |
| SaySelf [179] | Strong LLM | Human or Rule | / | / | SFT | Mistral-7B, Llama-3-8B |
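Many of the filtering strategies listed in Table 1 (environment-, model-, or rule-based) ultimately reduce to scoring candidate trajectories and keeping those above a threshold; the function below sketches reward-based filtering under that assumption, with the field names and threshold being illustrative.

```python
def filter_trajectories(trajectories, min_reward=1.0):
    """Split trajectories by a final-reward threshold (illustrative sketch)."""
    kept, discarded = [], []
    for traj in trajectories:
        target = kept if traj.get("reward", 0.0) >= min_reward else discarded
        target.append(traj)
    # Discarded (low-quality) samples need not be wasted: as the LQ column in
    # Table 1 indicates, several methods reuse them, e.g. as negative examples.
    return kept, discarded
```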
(2) Strong LLM-generated trajectories. Strong LLM-generated trajectories leverage powerful LLMs like ChatGPT and GPT-4 to autonomously generate task-specific data. These trajectories are usually produced by reasoning frameworks such as ReAct and CoT, allowing the model to interact with the environment and simulate processes of reasoning, decision-making, and acting.
AgentTuning [199] and FireAct [14] employ ReAct and CoT to guide agent behavior while incorporating Reflexion [139] refinements, improving the diversity of generated data. Some works integrate tools and structured annotations to enhance trajectory informativeness. NAT [158] generates multiple trajectories under different temperature settings, using ReAct prompts and integrating tools such as calculators and APIs during interactions. AgentLumos [192] utilizes GPT-4 and GPT-4V to annotate datasets within planning and grounding modules, producing LUMOS-I and LUMOS-O style data. Other methods explore multi-role simulation to enrich trajectory complexity. Zhou et al. [216] employ GPT-4 to simulate problem generators, action planners, and environment agents, enabling iterative interaction-driven data generation. AGENTBANK [143] also leverages GPT-4 for environment interaction data and GPT-3.5 for CoT rationales, and finally transforms the data into chatbot-style formats for improved usability.
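The collection procedures described in this category can be pictured as a simple ReAct-style interaction loop between a strong LLM and an environment; the outline below is a sketch in which `llm`, `env`, and the prompt format are placeholder assumptions rather than the exact pipelines of the cited works.

```python
def collect_trajectory(llm, env, task, max_steps=10):
    """Roll out one ReAct-style trajectory with a strong LLM (illustrative)."""
    observation = env.reset(task)          # hypothetical environment interface
    steps = []
    for _ in range(max_steps):
        prompt = f"Task: {task}\nObservation: {observation}\nThought and Action:"
        output = llm(prompt)               # e.g. "Thought: ...\nAction: ..."
        observation, reward, done = env.step(output)
        steps.append({"prompt": prompt, "output": output, "reward": reward})
        if done:
            break
    return steps
```

Varying the sampling temperature or the prompting framework across rollouts, as NAT [158] does, is one way to increase the diversity of the collected trajectories.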
(3) Self-exploration environment-interaction trajectories. Given the high costs of expert annotation and p