LLM에 코드베이스 문맥을 전달하기 위한 esbuild 번들러 도입 시도 (raw)

*원제: 그냥 LLM이 코드베이스 전체를 이해해주면 안 될까? 번들링을 통한 RAG 시도* # 목차 1. [개요](#개요) 1. [배경지식](#배경지식) - [“코드베이스”란?](#코드베이스) - [LLM이 방대한 코드베이스를 이해하려면](#LLM_코드베이스) - [RAG의 필요성](#RAG) 1. [가설](#가설) 1. [가설 검증](#가설검증) 1. [검증 방법](#검증방법) - [번들링이란?](#번들링이란) 1. [테스트 설계](#테스트설계) 1. [번들링 전 (원본 코드베이스)](#번들링전) 1. [번들링 후 (LLM에게 전달할 내용)](#번들링후) 1. [테스트 1: ChatGPT를 통한 API 문서 생성](#테스트1) - [테스트 1 결과: ChatGPT를 통한 API 문서 생성](#테스트1결과) - [대조군과의 비교](#테스트1비교) 1. [테스트 2: ChatGPT를 통한 매뉴얼 생성](#테스트2) - [테스트 2 결과: ChatGPT를 통한 매뉴얼 생성](#테스트2결과) 1. [한계점](#한계점) - [ChatGPT와의 활용에 관한 한계점](#ChatGPT한계점) - [번들링에 관한 한계점](#번들링한계점) 1. [발견](#발견) - [전통적인 도구를 써서 RAG를 성취한다는 점](#전통적) - [다른 언어로 작성된 코드베이스에 대한 일반화 가능성](#일반화) - [Semantic search와의 시너지](#다른방법과의비교) 1. [결론](#결론) 1. [업데이트 (2025-04-23)](#업데이트1) <div id="개요"> # 개요 </div> 이 글에서는 많은 파일을 내포하는 특정 TypeScript 코드베이스를 번들링(bundling)을 통해 1개의 JavaScript 파일로 압축한 후 LLM에게 전달하였을 때, LLM이 코드베이스와 관련된 질문에 얼마나 정확하게 답변할 수 있는지 확인할 것이다. 소프트웨어 시스템 분석[^software_analysis_static] [^software_analysis_dynamic]을 위해서는 코드베이스에 작성된 파일들을 ‘방문’하는 것이 필수적이다. 예를 들어, 비즈니스 로직에 관한 질문에 LLM이 대답하기 위해서는, LLM은 그 비즈니스 로직과 관련된 모든 파일을 읽을 수 있어야 한다[^rag]. 한편, 코드베이스는 일종의 비정형 데이터의 모음으로 볼 수 있는데, LLM이 비정형 데이터를 효율적으로 사용할 수 있게 하는 일반화된 방법은 여럿 존재한다 [^semantic_search] [^chatgpt_fileupload]. 이 글에서는 코드베이스를 RAG하기 위한 한 가지 방법으로써 코드베이스 전체를 압축한 후 LLM에게 제공하는 것을 시도할 것이다. [^software_analysis_static]: 정적 분석에 속하는 예: 호출 그래프 작성을 통한 제어 흐름 파악 & 우선순위 확인, (웹 서버의 경우) “컨트롤러” 코드를 시작으로 분석 진행, 코드베이스에 이미 도입된 아키텍쳐 패턴이 무엇인지 파악한 후, 연역적 추론을 통한 분석 방법 수립. [Wikipedia: Call graph](https://en.wikipedia.org/wiki/Call_graph) [^software_analysis_dynamic]: 동적 분석에 속하는 예: 통합 테스트 과정에서 호출되는 코드와 그렇지 않은 코드를 분류하여 각 코드간 관계를 파악. [Wikipedia: Runtime verification](https://en.wikipedia.org/wiki/Runtime_verification) [^rag]: LLM 모델에 지식을 빠르게 추가하기 위한 방법으로 Retrieval-augmented generation(RAG)가 일반적으로 사용된다. [Wikipedia: Retrieval-augmented generation](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) [^semantic_search]: 의미론적 검색 기법. [Wikipedia: Semantic_search](https://en.wikipedia.org/wiki/Semantic_search) [^chatgpt_fileupload]: [OpenAI: "How does the new file uploads capability work? "](https://help.openai.com/en/articles/8555545-file-uploads-faq) - ChatGPT의 파일 업로드(File Upload)는 전처리, 변환, 추출을 거쳐 정보를 정제한다. 예를 들어 Word 문서를 읽을 때에는 알고리즘적으로 미리 해석하고 문서의 제목/부제목들을 추출하여 메타데이터로 활용한다. <div id="배경지식"></div> # 배경지식 <div id="코드베이스"></div> ## “코드베이스”란? ![리눅스 커널의 코드베이스](https://blog.atj.sh/post-attachments/v/01962f18-fa38-7c12-878a-57f1a4551539) *사진: 리눅스 코드베이스* 한 소프트웨어를 이루는 소스 코드 파일의 모임을 코드베이스라고 한다. 소프트웨어의 기능이 많을수록, 코드베이스는 자연스럽게 방대해진다. <div id="LLM_코드베이스"></div> ## LLM이 방대한 코드베이스를 이해하려면 코드베이스와 관련된 작업을 LLM이 처리해줄 수 있을까? - “로그인 페이지에 버튼을 추가해줘” - “소스코드에서 '휴면 계정'과 관련된 로직을 제거해줘” - “오늘 작업한 코드들에 대해 단위 테스트 코드를 작성해줘” - “이 폴더에 들어있는 소스코드들을 한 번 검토해줘” 당연하게도 코드베이스를 LLM에게 전달하면 될 것이다. 하지만 코드베이스가 방대할수록 고려할 점이 늘어난다. - LLM이 한 번에 받아들일 수 있는 정보의 양은 LLM마다 다르다. 그리고 더 많은 정보를 전달할수록 비용이 더 많이 부과된다. - 설사 LLM이 받아들일 수 있는 양이 무제한이더라도, 우리는 그 방대한 코드베이스 전체를 네트워크를 통해 전부 전달해야 하는 번거로움을 부담해야 할 것이다. 이것은 비단 코드베이스만의 문제는 아니다. LLM을 통해 방대한 정보를 처리시키기 위해서는 반드시 해결해야 하는 문제이다. <div id="RAG"></div> ## RAG의 필요성 ![RAG의 절차](https://blog.atj.sh/post-attachments/v/01962f19-6d74-7441-ae63-50e354c17e4b) *사진: RAG의 절차* 방대한 지식을 LLM에게 알려주는 방법 중 하나로 **RAG**(Retrieval-Augmented Generation)가 있다. RAG를 구현하는 수단 중 하나로는 semantic search가 있다. 쉽게 말하자면 유사도 검색을 통한 지식 선택이다. 응답을 생성하기 전에 지식 베이스에서 ‘질문과 관련되었을 확률이 높은 정보들만 쿼리’하여 LLM에게 전달한다. 질문마다 필요한 지식만을 추려서 답변에 반영하는 것이다. 하지만, 데이터베이스화를 아무리 잘 했더라도, 원하는 지식이 제대로 조회되지 않는다면 결국 답변의 품질이 떨어질 것이다. 즉, RAG 시스템을 구축하여 LLM의 응답 품질을 높일 수 있지만, 지식을 적절하게 제공해야 하는 것은 변치 않는다. RAG 방법론은 지금도 새롭게 연구되고 있다. <div id="가설"></div> # 가설 **코드베이스를 (LLM이 허용하는 선에서) 통째로 전달할 수 있다면, 코드베이스 전체에 대한 유의미한 RAG를 비교적 쉽게 성취할 수 있을 것이다.** 이러한 RAG의 장점은 다음과 같을 것이다. - RAG를 구현하기 위해 ‘검색 시스템’을 반드시 구현하지 않아도 된다. - 코드베이스 전체에 대한 문맥을 확보할 수 있어, 질문자가 고려하지 못한 부분에 대해서도 LLM이 파악해줄 수 있다. 해결해야 하는 부분은 아래와 같을 것이다. - 코드베이스를 ‘압축'해야 한다. - 무손실 압축은 사실상 불가능하기에, 정보 손실을 감수하고 압축을 시도해야 한다. - 압축 방법의 일반화가 가능해야 한다. <div id="가설검증"></div> # 가설 검증 <div id="검증방법"></div> ## 검증 방법 코드베이스 전체를 LLM에 RAG하기 위한 압축 방법으로써 번들링을 사용하고자 한다. 많은 파일을 내포하는 TypeScript 코드베이스를 번들링을 통해 1개의 JavaScript 파일로 압축한 후 LLM에게 전달할 것이다. 그리고, LLM이 코드베이스와 관련된 질문에 얼마나 정확하게 답변할 수 있는지 확인할 것이다. <div id="번들링이란"></div> ### 번들링이란? ![번들링 툴을 요약한 그림](https://blog.atj.sh/post-attachments/v/01962f1a-3d8d-76f3-903e-d652830647ea) TypeScript/JavaScript 언어 진영에서의 “번들링”(bundle, bundling)은 코드베이스 안에 작성되어 있는 여러 소스 파일들을 단 몇 개의 소스 파일로 엮는 행위를 뜻한다. 번들링의 특징: - 주석, Dead Code, 테스트 코드 등 비즈니스 로직과 관계없는 것들은 제외된 코드가 도출된다. 그럼에도 동작의 동일성이 보장된다. - 방대한 의존성 패키지들(예: node_modules)을 전부 모아 한 파일에 포함시킨다. - Syntactic sugar를 적극 활용하여 코드를 짧게 줄인다. <div id="테스트설계"></div> ## 테스트 설계 번들링을 위해 사용한 설정[^esbuild_config]은 아래와 같다. [^esbuild_config]: 이전에 작성한 적이 있는 esbuild 빌드 설정을 조금 수정하여 그대로 사용하였다. 원본은 GitHub에서 확인 가능하다. [GitHub: build-option.ts](https://github.com/atjsh/mini-dice-v1/blob/main/builders/server/src/lib/build-option.ts) ``` const projectRoot = path.resolve(process.cwd(), "..", ".."); const commonBuildOptions = { entryPoints: [ { in: `${projectRoot}/apps/server/src/main.ts`, out: "index" } ], tsconfig: `${projectRoot}/apps/server/tsconfig.json`, format: "esm", platform: "node", target: "node19", outExtension: { ".js": ".mjs" }, legalComments: "none", banner: { /** https://github.com/evanw/esbuild/issues/1921#issuecomment-1152991694 */ js: "import{createRequire}from'module';const require=createRequire(import.meta.url);" + "import{fileURLToPath}from'node:url';import{dirname as __pathDirname}from'node:path';const __filename=fileURLToPath(import.meta.url);const __dirname=__pathDirname(__filename);" }, charset: "utf8", plugins: [ esbuildDecorators({ tsconfig: `${projectRoot}/apps/server/tsconfig.json` }), esbuildProgressPulgin() ] } satisfies esbuild.BuildOptions; await esbuild.build({ ...commonBuildOptions, outdir: `${projectRoot}/apps/server/dist/bundle-minify-nokeepnames-package-external`, bundle: true, minify: true, keepNames: false external: [ "@fastify/aws-lambda", "@fastify/cookie", "@fastify/helmet", "@nestjs/axios", "@nestjs/common", "@nestjs/config", "@nestjs/core", "@nestjs/jwt", "@nestjs/passport", "@nestjs/platform-fastify", "@nestjs/typeorm", "aws-lambda", "axios", "class-transformer", "class-validator", "fastify", "joi", "jsonwebtoken", "lodash", "passport", "passport-jwt", "pg", "reflect-metadata", "rxjs", "typeorm", "uuid" ], } satisfies esbuild.BuildOptions); ``` 외부 패키지들은 번들되지 않도록 하고, 소스코드 내부 코드만 번들되도록 설정하였다. 테스트 환경: - 테스트 대상 코드베이스: TypeScript로 작성된 Node.js 런타임용 웹 서버[^source] - 번들링 수단: [esbuild](https://esbuild.github.io/getting-started/#bundling-for-node)로 minified bundle 1회 진행 - LLM: OpenAI o3-mini 모델 [^price] [^source]: [GitHub: atjsh/mini-dice-v1](https://github.com/atjsh/mini-dice-v1), [Mini Dice 소개](https://blog.atj.sh/post/8) [^price]: o3-mini는 OpenAI에서 제공하는 가장 저렴한 LLM 중 하나이다. [OpenAI Pricing](https://platform.openai.com/docs/pricing) <div id="번들링전"></div> ## 번들링 전 (원본 코드베이스) - 코드 줄: 약 15,000여줄의 TypeScript - 파일 개수: 약 300여개 - 소스 코드 크기: 약 2MB - 외부 패키지들을 포함한 소스 코드 크기: 약 1GB [^node_modules] [^node_modules]: devDependencies에 속하는 패키지 포함. <div id="번들링후"></div> ## 번들링 후 (LLM에게 전달할 내용) ![코드베이스 파일 전체를 JavaScript 파일 1개로 압축한 결과물](https://blog.atj.sh/post-attachments/v/01962f1a-8154-7bc1-879b-43be067e7e19) *사진: 코드베이스 파일 전체를 JavaScript 파일 1개로 압축한 결과물* :img[]{src="https://blog.atj.sh/post-attachments/v/01962f1c-6f8c-71b1-8d27-941610085bf6" .half}:img[]{src="https://blog.atj.sh/post-attachments/v/01962f1c-81f0-7a13-b083-8c9b3f68bc52" .half} - *왼쪽 사진: 토큰 개수 계산 결과* - *오른쪽 사진: 파일 정보* 번들링 후 코드베이스는 다음과 같이 압축되었다. - 문자수: 239,143개 - 토큰 개수: 약 76,706개 - 파일 크기: 248KB - 파일 개수: 1개 <div id="테스트1"></div> ## 테스트 1: ChatGPT를 통한 API 문서 생성 **코드베이스를 압축한 파일 1개만 있어도, ChatGPT가 Swagger API 문서를 생성할 수 있을까?** ![](https://blog.atj.sh/post-attachments/v/01962f1f-caae-72c1-a473-4e1ebffba937) *스웨거(Swagger)는 개발자가 REST 웹 서비스를 설계, 빌드, 문서화, 소비하는 일을 도와주는 오픈 소스 소프트웨어 프레임워크이다.* **테스트 절차** 1. 번들 생성 2. 압축된 코드베이스를 아래 프롬프트 안에 텍스트로 첨부 3. ChatGPT에 전달 **프롬프트** ``` Rules: - The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in IETF RFC 2119 (https://datatracker.ietf.org/doc/html/rfc2119). Context: - [index.mjs] is a bundled JavaScript code of a TypeScript codebase. - [index.mjs] was generated by esbuild to reduce size and remove unnecessary contents. - [index.mjs] keeps every business logic that original codebases implements. Request: - Create a OpenAPI Specification 3.0 content based on [filename]. And, response the specification's content only. Requirements: - The result MUST be in JSON format. - The result MUST include every HTTP API and it's HTTP Method as path information. - The result SHOULD include security, schema, response information for each API. - The result MAY include tag information for each API. `package.json`'s content: (코드베이스에서 쓰인 의존성 목록) `index.mjs`'s content: (번들된 코드 내용) Response: ``` <div id="테스트1결과"></div> ## 테스트 1 결과: ChatGPT를 통한 API 문서 생성 **대부분의 시도에서 21개 엔드포인트 중 19개 이상의 엔드포인트를 정확하게 문서화하는 데 성공했다.** 엔드포인트가 포함된 API 문서는 YAML 형식 오류 없이 생성되었으며, URL, Method 정보가 정확하게 기재되었다. :img{src="https://blog.atj.sh/post-attachments/v/01962f75-a787-7ad3-a71f-e5561556d0b5"} *사진: ChatGPT가 생성한 Swagger API 문서를 [https://editor.swagger.io](https://editor.swagger.io/)에서 확인한 모습* <div id="테스트1비교"></div> ### 대조군과의 비교 ChatGPT를 통해 생성한 API 문서가 얼마나 사실에 가까운지 확인하기 위해, 웹 프레임워크에서 자체적으로 생성한 Swagger 내용과 비교하였다. #### 대조군 <details> <summary> 웹 프레임워크에서 자체적으로 생성한 Swagger 파일 확인하기 </summary> [웹 프레임워크에서 자체적으로 생성한 Swagger 파일 다운로드 (yaml)](https://blog.atj.sh/post-attachments/v/01962fb2-ea7d-7aa0-bbb9-bc6a004fca61) ```yaml openapi: 3.0.0 paths: /: get: operationId: AppController_root parameters: [] responses: '200': description: '' tags: - App /ads.txt: get: operationId: AppController_adsTxt parameters: [] responses: '200': description: '' tags: - App /auth/access-token: get: operationId: AccessTokenController_getAccessToken parameters: [] responses: '200': description: '' tags: - AccessToken /auth/logout: post: operationId: LocalJwtController_logout parameters: [] responses: '201': description: '' tags: - LocalJwt get: operationId: LocalJwtController_logoutGet parameters: [] responses: '200': description: '' tags: - LocalJwt /auth/google-oauth/{web}: get: operationId: GoogleOAuthController_authUserWithGoogleOauthCode parameters: - name: code required: true in: query schema: type: string - name: web required: true in: path schema: type: string responses: '200': description: '' tags: - GoogleOAuth /temp-signup: post: operationId: TempSignupController_temporarySignUpUser parameters: [] requestBody: required: true content: application/json: schema: $ref: '#/components/schemas/TemporarySignUpDto' responses: '201': description: '' tags: - TempSignup /frontend-error: post: operationId: FrontendErrorController_insert parameters: [] responses: '201': description: '' tags: - FrontendError /land-comments: post: operationId: UserLandCommentController_registerComment parameters: [] requestBody: required: true content: application/json: schema: $ref: '#/components/schemas/Function' responses: '201': description: '' tags: - UserLandComment /land-events: get: operationId: LandEventController_getLandEvents parameters: [] responses: '200': description: '' tags: - LandEvent /profile/me: get: operationId: PublicProfileController_getProfile parameters: [] responses: '200': description: '' tags: - PublicProfile patch: operationId: PublicProfileController_updateUserById parameters: [] responses: '200': description: '' tags: - PublicProfile /profile/others: get: operationId: PublicProfileController_getOthersProfiles parameters: [] responses: '200': description: '' tags: - PublicProfile /profile/complete-signup: patch: operationId: PublicProfileController_completeSignup parameters: [] responses: '200': description: '' tags: - PublicProfile /profile: delete: operationId: PublicProfileController_terminateUser parameters: [] responses: '200': description: '' tags: - PublicProfile /user-interaction-web: post: operationId: UserInteractionWebController_makeUserInteractionAndGetWebMessageResponse parameters: [] requestBody: required: true content: application/json: schema: $ref: '#/components/schemas/UserInteractionDto' responses: '201': description: '' tags: - UserInteractionWeb /dice-toss: post: operationId: DiceTossController_tossDiceAndGetWebMessageResponse parameters: [] responses: '201': description: '' tags: - DiceToss /recent-skill-logs: get: operationId: RecentSkillLogsController_getRecentSkillLogsWeb parameters: - name: limit required: true in: query schema: type: number responses: '200': description: '' tags: - RecentSkillLogs /scenarios/d1/map: get: operationId: D1Controller_getAliases parameters: [] responses: '200': description: '' tags: - D1 /health/detail: get: operationId: HealthController_readiness parameters: [] responses: '200': description: '' tags: - Health /health: get: operationId: HealthController_simple parameters: [] responses: '200': description: '' tags: - Health info: title: Cats example description: The cats API description version: '1.0' contact: {} tags: - name: cats description: '' servers: [] components: schemas: TemporarySignUpDto: type: object properties: {} Function: type: object properties: {} UserInteractionDto: type: object properties: {} ``` </details> #### o3-mini Generation <details> <summary> ChatGPT가 생성한 Swagger 파일 확인하기 </summary> [ChatGPT가 생성한 Swagger 파일 다운로드 (yaml)](https://blog.atj.sh/post-attachments/v/01962fbd-fd8d-75c2-a115-f39c9cd0c1ad) ```yaml openapi: "3.0.0" info: title: "Server API" version: "1.0.0" servers: - url: "https://example.com" description: "Primary server (update with actual URL)" components: securitySchemes: bearerAuth: type: http scheme: bearer bearerFormat: JWT cookieAuth: type: apiKey in: cookie name: refreshToken schemas: ContactResponse: type: string example: "contact: lifegame2021team@gmail.com" AdsTxtResponse: type: string example: "ADS TXT content from configuration" TokenResponse: type: object properties: accessToken: type: string example: "eyJhbGciOiJIUzI1NiIsInR..." GenericResponse: type: object additionalProperties: true ErrorResponse: type: object properties: message: type: string paths: /: get: tags: - Root summary: "Get root contact info" responses: "200": description: "Contact information" content: application/json: schema: $ref: "#/components/schemas/ContactResponse" /ads.txt: get: tags: - Root summary: "Get ads.txt content" responses: "200": description: "ADS TXT content" content: text/plain: schema: $ref: "#/components/schemas/AdsTxtResponse" /auth/access-token: get: tags: - Auth summary: "Obtain a new access token using a valid refresh token" security: - cookieAuth: [] responses: "200": description: "Access token generated successfully" content: application/json: schema: $ref: "#/components/schemas/TokenResponse" "401": description: "Unauthorized – invalid or missing credentials" content: application/json: schema: $ref: "#/components/schemas/ErrorResponse" /auth/logout: post: tags: - Auth summary: "Logout user (POST)" security: - cookieAuth: [] responses: "200": description: "User successfully logged out" content: application/json: schema: $ref: "#/components/schemas/GenericResponse" get: tags: - Auth summary: "Logout user (GET)" security: - cookieAuth: [] responses: "200": description: "User successfully logged out" content: application/json: schema: $ref: "#/components/schemas/GenericResponse" /auth/google-oauth/{web}: get: tags: - Auth summary: "Authenticate user with Google OAuth" parameters: - in: path name: web required: true schema: type: string description: "A dynamic segment representing the web redirect path" - in: query name: code required: true schema: type: string description: "The OAuth authorization code returned by Google" responses: "200": description: "Google OAuth callback response" content: application/json: schema: $ref: "#/components/schemas/GenericResponse" "400": description: "Bad Request" content: application/json: schema: $ref: "#/components/schemas/ErrorResponse" /temp-signup: post: tags: - Temp Signup summary: "Temporarily sign up a new user with hCaptcha verification" requestBody: required: true content: application/json: schema: type: object properties: hCaptchaSuccessToken: type: string username: type: string countryCode3: type: string required: - hCaptchaSuccessToken - username - countryCode3 responses: "200": description: "Temporary signup successful" content: application/json: schema: type: boolean example: true "400": description: "Bad Request – invalid hCaptcha token or missing fields" content: application/json: schema: $ref: "#/components/schemas/ErrorResponse" /land-events: get: tags: - Land Events summary: "Retrieve a list of recent land events" parameters: - in: header name: timezone schema: type: string required: false description: "Timezone identifier (e.g., 'Asia/Seoul')" security: - bearerAuth: [] responses: "200": description: "A list of land event summaries" content: application/json: schema: type: array items: $ref: "#/components/schemas/GenericResponse" "401": description: "Unauthorized" content: application/json: schema: $ref: "#/components/schemas/ErrorResponse" /dice-toss: post: tags: - Dice Toss summary: "Toss dice and receive a gaming response message" requestBody: required: true content: application/json: schema: type: object properties: userId: type: string required: - userId security: - bearerAuth: [] responses: "200": description: "Dice toss result with game data" content: application/json: schema: $ref: "#/components/schemas/GenericResponse" "400": description: "Bad Request – invalid input" content: application/json: schema: $ref: "#/components/schemas/ErrorResponse" /profile/me: get: tags: - Profile summary: "Retrieve the authenticated user's profile information" security: - bearerAuth: [] responses: "200": description: "User profile data" content: application/json: schema: $ref: "#/components/schemas/GenericResponse" "401": description: "Unauthorized" content: application/json: schema: $ref: "#/components/schemas/ErrorResponse" patch: tags: - Profile summary: "Update the authenticated user's profile" security: - bearerAuth: [] requestBody: required: true content: application/json: schema: type: object properties: username: type: string countryCode3: type: string required: - username - countryCode3 responses: "200": description: "Profile updated successfully" content: application/json: schema: $ref: "#/components/schemas/GenericResponse" "400": description: "Bad Request – validation errors" content: application/json: schema: $ref: "#/components/schemas/ErrorResponse" /profile/others: get: tags: - Profile summary: "Retrieve profiles of other users" security: - bearerAuth: [] parameters: - in: query name: limit required: false schema: type: integer description: "Maximum number of profiles to return" - in: query name: page required: false schema: type: integer description: "Page number for pagination" - in: query name: updatedAfter required: false schema: type: string format: date-time description: "Filter profiles updated after this datetime" responses: "200": description: "List of other user profiles" content: application/json: schema: type: array items: $ref: "#/components/schemas/GenericResponse" "401": description: "Unauthorized" content: application/json: schema: $ref: "#/components/schemas/ErrorResponse" /profile/complete-signup: post: tags: - Profile summary: "Complete the authenticated user's signup process" security: - bearerAuth: [] requestBody: required: true content: application/json: schema: type: object properties: countryCode3: type: string username: type: string required: - countryCode3 - username responses: "200": description: "Signup completed successfully" content: application/json: schema: $ref: "#/components/schemas/GenericResponse" "400": description: "Bad Request" content: application/json: schema: $ref: "#/components/schemas/ErrorResponse" /profile: delete: tags: - Profile summary: "Terminate (delete) the authenticated user's account" security: - bearerAuth: [] responses: "200": description: "User account terminated successfully" content: application/json: schema: $ref: "#/components/schemas/GenericResponse" "401": description: "Unauthorized" content: application/json: schema: $ref: "#/components/schemas/ErrorResponse" /recent-skill-logs: get: tags: - Skill Logs summary: "Retrieve recent skill logs for the authenticated user" security: - bearerAuth: [] parameters: - in: query name: limit required: false schema: type: integer description: "The maximum number of logs to retrieve" responses: "200": description: "A list of recent skill logs" content: application/json: schema: type: array items: $ref: "#/components/schemas/GenericResponse" "401": description: "Unauthorized" content: application/json: schema: $ref: "#/components/schemas/ErrorResponse" /land-comments: post: tags: - Land Comments summary: "Register a comment on a land (game event)" security: - bearerAuth: [] requestBody: required: true content: application/json: schema: type: object properties: userId: type: string comment: type: string required: - userId - comment responses: "200": description: "Comment registered successfully" content: application/json: schema: $ref: "#/components/schemas/GenericResponse" "400": description: "Bad Request" content: application/json: schema: $ref: "#/components/schemas/ErrorResponse" /scenarios/d1/map: get: tags: - Scenarios summary: "Retrieve skill group aliases for scenario D1" security: - bearerAuth: [] responses: "200": description: "A mapping of skill groups and their aliases" content: application/json: schema: type: object additionalProperties: true /user-interaction-web: post: tags: - User Interaction summary: "Perform a user interaction call to trigger a skill" security: - bearerAuth: [] requestBody: required: true content: application/json: schema: type: object properties: callingSkillRoute: type: string callingSkillParam: type: object required: - callingSkillRoute responses: "200": description: "Response from the user interaction call" content: application/json: schema: $ref: "#/components/schemas/GenericResponse" "400": description: "Bad Request" content: application/json: schema: $ref: "#/components/schemas/ErrorResponse" /health/detail: get: tags: - Health summary: "Detailed health check" responses: "200": description: "Detailed health status" content: application/json: schema: type: string example: "OK" /health: get: tags: - Health summary: "Simple health check" responses: "200": description: "Simple health status" content: application/json: schema: type: string example: "m" security: - bearerAuth: [] ``` </details> <div id="테스트2"></div> ## 테스트 2: ChatGPT를 통한 매뉴얼 생성 **코드베이스를 압축한 파일 1개만 있어도, ChatGPT가 비즈니스 로직을 이해하고 ‘매뉴얼’을 생성할 수 있을까?** **테스트 절차** 1. 편의를 위해 테스트 1에서 오간 대화를 그대로 유지 2. 프롬프트 전달 **프롬프트** ``` Generate manual for 'POST /temp-signup' endpoint. [for non-coder] 1. ELI5 for teammates who cannot code 2. runtime behavior 3. 'how dose it works' explanation - should include technical details 4. business logics 5. possible edge cases [for coder] 1. ELI5 for teammates who cannot code 2. runtime behavior 3. 'how dose it works' explanation - should include technical details 3.1. technical detail for framework and libraries 3.2. sequence diagram if possible 4. possible design pattern it implements 5. possible improvements 6. possible edge cases tone: no emoji, professional tone, literacy style each in markdown format one in english, one in korean. ``` <div id="테스트2결과"></div> ## 테스트 2 결과: ChatGPT를 통한 매뉴얼 생성 **비개발자/개발자용 매뉴얼을 동시에 성공적으로 생성할 수 있었다.** :img{src="https://blog.atj.sh/post-attachments/v/01962f7e-6c1c-7212-8ec9-47b14d773f16"} *사진: ChatGPT가 생성해낸 시퀀스 다이어그램을 통한 시스템 설명* <details> <summary> "비개발자를 위한 매뉴얼" 생성물 확인하기 </summary> POST /temp-signup 엔드포인트 매뉴얼 (Korean) *비개발자를 위한 설명* **1. ELI5 설명 (코드를 모르는 팀원을 위한 설명)** 이 엔드포인트는 임시 사용자 계정을 생성합니다. hCaptcha라는 사람임을 확인하는 절차(비밀 코드를 보여주는 것과 같은 방식)를 거쳐서, 이름, 이메일, 국가 코드 등의 정보를 제출하면 임시로 계정이 만들어지고, 그 계정으로 서비스를 이용할 수 있게 됩니다. **2. 런타임 동작** 사용자가 회원가입 정보를 전송하면 서버는 먼저 hCaptcha 토큰을 검사하여 사용자가 로봇이 아니라는 것을 확인합니다. 그 후, 임시 사용자 계정을 생성하고, 사용자 브라우저에 보안 쿠키(리프레시 토큰)를 설정합니다. 마지막으로, 회원가입이 성공했음을 true/false 값으로 반환합니다. **3. 동작 방식 (높은 수준의 설명)** 입력 검증: hCaptcha 토큰, 사용자 이름, 이메일, 국가 코드 등 필수 필드가 포함되어 있는지 확인합니다. 인간 여부 확인: hCaptcha 토큰을 외부 hCaptcha 서비스에 전송하여 사람이 맞는지 확인합니다. 계정 생성: 검증이 성공하면, 임시 사용자 정보를 데이터베이스(또는 내부 저장소)에 저장합니다. 쿠키 설정: 생성된 리프레시 토큰을 HTTP 전용 쿠키로 설정하여, 이후 사용자가 계속 인식될 수 있도록 합니다. 응답: 회원가입 성공 여부를 응답합니다. **4. 비즈니스 로직** 사용자 검증: hCaptcha를 사용하여 스팸과 자동 가입을 방지합니다. 임시 계정 생성: 사용자가 즉시 서비스를 이용할 수 있도록 임시 계정을 만들되, 이후에 정식 가입 절차가 추가될 수 있습니다. 보안 관리: 리프레시 토큰을 보안 쿠키로 발급하여 세션 보안을 강화합니다. **5. 가능한 엣지 케이스** 유효하지 않은 hCaptcha 토큰: hCaptcha 토큰이 없거나 잘못된 경우 회원가입이 거부됩니다. 불완전한 데이터: 사용자 이름, 이메일, 국가 코드 중 하나라도 빠진 경우 실패할 수 있습니다. 서버 오류: 일시적인 데이터베이스 또는 네트워크 문제로 인해 예기치 못한 오류가 발생할 수 있습니다. 중복 가입: 이메일이나 사용자 이름이 이미 사용 중인 경우 추가 검증이 필요할 수 있습니다. </details> <details> <summary> "개발자를 위한 매뉴얼" 생성물 확인하기 </summary> POST /temp-signup 엔드포인트 매뉴얼 (Korean) *개발자를 위한 설명* **1. ELI5 설명 (코드를 모르는 팀원을 위한 설명)** 회원가입 과정을 티켓 부스에 비유한다면, 사용자는 hCaptcha라는 비밀 코드를 제시하여 로봇이 아님을 증명합니다. 그러면 부스에서는 임시 회원 카드를 발급하고, 나중에 클럽에 다시 방문할 수 있도록 브레이슬릿(쿠키)을 제공합니다. **2. 런타임 동작** POST 요청이 /temp-signup 경로로 오면, 서버는 JSON 형식의 페이로드를 처리합니다. 입력값이 검증되고, 외부 hCaptcha 서비스에서 hCaptcha 토큰을 확인한 후, 임시 사용자 레코드를 생성합니다. 이후, 보안 HTTP 전용 쿠키에 리프레시 토큰을 설정하고, 최종적으로 가입 성공 여부를 Boolean 값으로 응답합니다. **3. 동작 방식 – 기술적 세부사항** **3.1. 프레임워크 및 라이브러리 관련 기술 세부사항** NestJS 프레임워크: 이 엔드포인트는 NestJS 컨트롤러로 구현되어 있으며, @Controller(), @Post() 등 데코레이터와 의존성 주입(@Injectable())을 사용합니다. hCaptcha 연동: @nestjs/axios의 HttpService를 사용하여 "https://hcaptcha.com/siteverify" URL에 POST 요청을 보내 hCaptcha 토큰의 유효성을 검사합니다. 환경변수로 관리되는 hCaptcha 비밀 키(HCAPTCHA_SECRET_KEY)를 사용합니다. 데이터베이스 연동: TypeORM(@nestjs/typeorm)을 통해 임시 사용자 레코드를 데이터베이스에 저장합니다. 쿠키 관리: @fastify/cookie 패키지를 사용하여 보안 쿠키를 관리합니다. 쿠키는 HTTP 전용, 보안, 서명된 쿠키로 설정됩니다. 데이터 검증: NestJS의 Validation Pipe 및 선택적 라이브러리(class-validator 혹은 Joi)를 사용해 입력 데이터의 유효성을 보장합니다. **3.2. Sequence Diagram (텍스트 기반)** ```mermaid sequenceDiagram participant Client participant Controller as POST /temp-signup participant hCaptchaService participant UserService participant RefreshTokenService Client->>Controller: JSON payload {hCaptchaSuccessToken, username, email, countryCode3} Controller->>hCaptchaService: Validate hCaptcha token hCaptchaService-->>Controller: Return true/false alt hCaptcha valid Controller->>UserService: Create temporary user record UserService-->>Controller: Return new user object Controller->>RefreshTokenService: Generate refresh token RefreshTokenService-->>Controller: Return token Controller->>Client: Set HTTP-only cookie and return true else Controller->>Client: Return error response (signup failed) end ``` **4. Possible Design Patterns** 서비스 지향 아키텍처: 인증, 사용자 관리, 토큰 관리 등 각각의 책임을 개별 서비스로 분리합니다. 의존성 주입: NestJS의 주요 특징을 활용해 각 서비스가 다른 서비스에 의존할 수 있도록 구성합니다. 파사드 패턴: 컨트롤러는 여러 내부 서비스(hCaptcha, User, Token 서비스)를 단순화하여 클라이언트에 제공하는 역할을 합니다. **5. Possible Improvements** 고도화된 검증: class-validator 또는 Joi와 같은 라이브러리를 사용해 더 상세한 요청 스키마 검증을 수행. 에러 핸들링 개선: 클라이언트에 좀 더 구체적인 오류 메시지를 제공할 수 있도록 에러 핸들링 구조를 개선. 속도 제한: 엔드포인트 남용을 방지하기 위해 속도 제한(Rate Limiting) 기능 도입. 로깅 및 모니터링: 인터셉터를 통한 상세 로깅을 추가하여 운영 및 감사에 대비. 중복 체크: 기존 사용자와의 중복 가입 여부를 미리 확인하여 불필요한 데이터 생성 방지. **6. Possible Edge Cases** 유효하지 않거나 만료된 hCaptcha 토큰: 외부 hCaptcha 서비스에서 유효성 검증에 실패한 경우. 필수 입력 누락: 사용자 이름, 이메일 또는 국가 코드의 누락. 데이터베이스 장애: 데이터베이스 연결 장애 시 임시 계정 생성 실패. 쿠키 구성 오류: 쿠키 설정 옵션(HTTP 전용, 보안 등)이 제대로 설정되지 않은 경우로 인한 보안 취약성. 예외 처리 미흡: 외부 hCaptcha API 응답에서 비표준 응답이 올 때 처리할 로직의 부재. </details> 이 테스트에서 나눈 대화의 전문을 아카이브시켜 두었다. [ChatGPT 대화 기록](https://chatgpt.com/share/67fbc4cb-5988-8001-b968-b80682e2e8a1)에서 확인해볼 수 있다. <div id="한계점"></div> # 한계점 <div id="ChatGPT한계점"></div> ## ChatGPT와의 활용에 관한 한계점 :img{src="https://blog.atj.sh/post-attachments/v/01962f25-1c49-7b91-8720-e57b5d738d00"} *사진: GPT-4.5 모델이 응답을 거부하는 모습.* 번들된 JavaScript 파일을 ChatGPT 대화에 첨부할 때, “ChatGPT 파일 업로드”를 통해 업로드하는 것보다, 프롬프트 안에 파일 내용을 직접 포함시키는 것이 답변의 정확도를 거의 완벽한 수준으로 높인다는 점을 발견했다. 금번 테스트에서는 o3-mini 모델을 사용하였다. 한편, 다른 모델을 통한 답변 생성을 시도했을 때에는 ‘메시지 길이 제한’에 걸려 응답을 받아낼 수 없었다. <div id="번들링한계점"></div> ## 번들링에 관한 한계점 :img{src="https://blog.atj.sh/post-attachments/v/01962f25-2ae0-7c12-84bb-b7fddae990de"} *사진: 소스 코드 파일 위치를 묻는 질문에 응답을 거부하는 모습.* 번들링을 하기 위해서는, 번들러 도구 사용법을 공부해야 할 것이다. 코드베이스에 대한 번들링을 시도하기 위해서는 번들링에 실패할 만한 문법적 오류가 소스코드에 존재해서는 안 된다. 번들링 중 손실되는 정보는 LLM에 전달되지 않는다. 별도의 RAG가 필요하다. <div id="발견"></div> # 발견 <div id="전통적"></div> ## 전통적인 도구를 써서 RAG를 성취한다는 점 코드베이스가 context window에 포함될 수 있게끔 ‘손실 압축’하여도 동작과 관련된 지식을 전달할 수 있음을 확인하였다. 또한, 이미 존재하는 번들링 도구를 그대로 사용하였음에 의미가 있다. <div id="일반화"></div> ## 다른 언어로 작성된 코드베이스에 대한 일반화 가능성 TypeScript/JavaScript 코드 진영에서는 오랫동안 번들링에 대한 연구가 이루어진 상황이라, 번들링 시스템을 구성하는 난도는 비교적 낮다. 하지만 다른 언어는 상황이 다르다. 단적인 예로써 컴파일 언어(C++ 등)로 작성되는 코드베이스에 대해서는 이러한 번들링이 가능한지 확인이 필요할 것이다. 다만, LLM이 .exe 파일을 이해하고, 파이썬 포팅하는 데 성공했다며 주장하는 사례가 존재한다[^exe]. 따라서, 코드베이스에 대한 적절한 압축이 이루어진다면 LLM이 이해하는 데에는 문제가 없다고 볼 수 있을 것이다. [^exe]: [GeekNews: 27년 된 EXE 파일을 Claude 3.7에 업로드한 후 일어난 놀라운 일](https://news.hada.io/topic?id=19493) <div id="다른방법과의비교"></div> ## Semantic search와의 시너지 **semantic search와 번들링된 코드베이스 정보를 동시에 RAG에 활용한다면, 서로 상호보완적으로 작용하여 상승효과를 볼 수 있을 것이라고 생각한다.** 현재 상용 코딩 AI 에이전트들은 여러 파일로 구성된 코드베이스의 전체적인 맥락을 온전히 이해하지 못하는 경우가 다반사이다. 예를 들어, “운영자 로그인 로그가 ‘ISMS 로그 테이블’에 기록되도록 기능을 추가하여 주십시오”라는 요청이 주어지면, AI 에이전트는 ‘운영자 로그인’ 기능과 ‘ISMS 로그 테이블’과 관련된 모든 소스 코드를 탐색하는 과정에서 semantic search(의미 기반 검색)를 수행한다. Semantic search 기법은 입력된 키워드와 가장 유사한 원본 데이터를 식별하는 데 기여하나, 검색 대상에 포함되지 않은 파일은 에이전트가 인지하지 못할 위험이 있다. 한편, 번들링을 통해 압축된 코드베이스를 LLM에게 사전에 제공하고, 여기에 semantic search를 결합한다면, 코드베이스의 전반적인 맥락을 보다 충실히 이해함과 동시에 데이터베이스화된 원본 소스 코드에 효과적으로 접근할 수 있어 유리한 상황을 마련할 수 있다. <div id="결론"></div> # 결론 LLM에 대해 코드베이스 전체를 RAG하기 위해, 코드베이스 전체를 한 파일로 압축하여 LLM에게 전달하는 테스트를 설계했다. 현존하는 번들링 툴을 그대로 사용하여 코드베이스를 압축하였을 때, LLM은 코드베이스 전체에 대한 API 문서화를 진행하거나, 특정 API에 대한 자세한 매뉴얼을 작성할 수 있었다. 코드베이스 압축을 진행하며 발생하는 정보 손실에 대해서는 원본 파일에 대한 추가적인 Semantic search를 통해 보완할 수 있을 것으로 전망한다. *[GeekNews에서 댓글을 남기실 수 있습니다.](https://news.hada.io/topic?id=20309)* <div id="업데이트1"></div> # 업데이트 (2025-04-23) 코드베이스를 압축 없이 tarball로 묶어 LLM에게 주었을 때 context window가 충분히 큰 경우 코드베이스를 잘 인식할 것이다. 더 쉽고 일반적인 방법이라고 생각하여 공유한다.

LLM에 코드베이스 문맥을 전달하기 위한 esbuild 번들러 도입 시도 (Raw Markdown)