NCP-GENL Exam - NVIDIA Actual Exam Questions and Answers - 70 Questions

Question No : 1

Which statement best differentiates model parallelism from data parallelism?

A.Data parallelism is optimal for models exceeding GPU memory, while model parallelism suits large datasets
B.Model parallelism splits batches across GPUs, while data parallelism splits network layers
C.Model parallelism divides model layers across GPUs, while data parallelism replicates the model and splits batches
D.Model parallelism requires gradient all-reduce, while data parallelism transfers activations

Answer:
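
Background sketch for Question 1: the toy below contrasts the two partitioning schemes in plain Python. The two "layers", the helper names, and the two-device setup are hypothetical and stand in for real GPU workloads; it is an illustration under those assumptions, not a distributed-training implementation.

```python
# Toy "model": two layers, each a plain function on a list of numbers.
def layer1(xs):
    return [2 * x for x in xs]

def layer2(xs):
    return [x + 1 for x in xs]

def data_parallel(batch, num_devices):
    """Every device holds the full model and receives one shard of the batch."""
    shards = [batch[i::num_devices] for i in range(num_devices)]
    outputs = [layer2(layer1(shard)) for shard in shards]  # one entry per device
    return shards, outputs

def model_parallel(batch):
    """Each device holds one layer; the whole batch flows through them in turn."""
    activations = layer1(batch)   # "device 0" holds layer1
    return layer2(activations)    # "device 1" holds layer2

batch = [1, 2, 3, 4]
shards, dp_out = data_parallel(batch, num_devices=2)
mp_out = model_parallel(batch)
```

Both schemes compute the same results; they differ only in what is split (the batch or the layers) and therefore in what has to be communicated between devices.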

Question No : 2

Which technique most directly reduces a language model's memory footprint and can provide faster inference, especially on hardware like NVIDIA A100 or H100 GPUs?

A.Quantizing model weights to lower precision formats such as FP16 or INT8
B.Increasing batch size during inference to utilize more GPU memory
C.Training the model with Next Sentence Prediction (NSP) objectives
D.Using advanced sampling techniques such as beam search and temperature scaling

Answer:
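
Background sketch for Question 2: a minimal symmetric per-tensor INT8 quantizer in plain Python, assuming a single scale for the whole tensor. The function names and sample weights are made up for illustration; production toolchains use calibrated, often per-channel, schemes.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Storage per weight: 4 bytes for FP32, 2 for FP16, 1 for INT8.
fp32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights)
```

The rounding error per weight is bounded by half the scale, while storage shrinks by 4x relative to FP32 (2x relative to FP16).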

Question No : 3

When evaluating text generation quality for summarization tasks, which combination of metrics provides the most comprehensive assessment of model performance?

A.Perplexity measurement only for complete evaluation without additional metric complexity or overhead
B.Word count comparison for length similarity without considering content quality or semantic accuracy
C.Random sampling evaluation for diverse coverage without systematic metric application or analysis
D.ROUGE scores for content overlap, BLEU for fluency assessment, and human evaluation for coherence validation

Answer:
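
Background sketch for Question 3: overlap metrics such as ROUGE and BLEU are built on n-gram matching. The simplified function below computes only clipped unigram overlap (ROUGE-1 style recall and BLEU-1 style precision without a brevity penalty); the function name and example sentences are invented, and real evaluations would use an established metrics package.

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Return (precision, recall) of clipped unigram matches."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())              # clipped match count
    precision = overlap / max(sum(cand.values()), 1)  # BLEU-1 style
    recall = overlap / max(sum(ref.values()), 1)      # ROUGE-1 style
    return precision, recall

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
precision, recall = unigram_overlap(candidate, reference)
```

Here five of six words match in both directions, which says nothing about whether "lay" preserves the meaning; this is the kind of gap that overlap scores alone leave open.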

Question No : 4

Your team must optimize a large conversational AI model for edge deployment on NVIDIA Jetson AGX Orin with limited memory.
Profiling shows:
• Model size nearly fills memory
• Inference latency is too high
• Attention layers have activation outliers
• Weights are concentrated in a small range
Customers require low latency and minimal accuracy loss.
Which optimization approach best satisfies these constraints?

A.Apply INT4 weight-only quantization using GPTQ, keep FP16 activations, introduce grouped quantization, and use activation checkpointing to reduce memory usage.
B.Perform INT8 post-training quantization with outlier calibration, retain FP16 for attention projections as needed, utilize TensorRT QDQ (Quantize-Dequantize) fusion, and enable INT8 KV-cache compression.
C.Use structured pruning to create high sparsity aligned with hardware, combine with INT8 quantization after pruning, enable dynamic quantization for activations, and implement sliding window attention to save memory.
D.Implement quantization-aware training with learned step sizes, leverage mixed precision (INT8/INT4) based on layer sensitivity, integrate quantization-friendly distillation loss, and deploy with TensorRT's unified memory optimization.

Answer:

Question No : 5

Which TWO of the following statements accurately describe the differences between Post-training Quantization (PTQ) and Quantization-aware Training (QAT) techniques in model optimization? Pick the 2 correct responses below

A.PTQ introduces quantization operations, such as fake quantization nodes, into the model during training while QAT adopts fixed quantization parameters for model quantization.
B.PTQ is a simple technique that is applied to pre-trained models while QAT incorporates quantization operations directly into the training.
C.PTQ adopts static quantization, in which the quantization parameters are fixed, while QAT can dynamically adapt the quantization parameters during training or inference.
D.PTQ is often a more complex and time-consuming process than QAT because it incorporates the quantization effects during the training.

Answer:
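
Background sketch for Question 5: the toy below shows only where the quantization step sits in each workflow, using a one-weight model y = w * x. The `fake_quant` and `train` helpers, the coarse 0.25 grid, and the straight-through gradient are assumptions chosen for illustration; the example is too small to show any accuracy difference between the two approaches.

```python
def fake_quant(w, scale, qmin=-127, qmax=127):
    """Quantize then immediately dequantize, so the value stays a float
    but carries the rounding error of the integer grid."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale

def train(x, y, steps, lr, scale=None):
    """Fit y = w * x by gradient descent on squared error.
    If scale is given, the forward pass uses the fake-quantized weight
    and the gradient is passed straight through to the float weight."""
    w = 0.0
    for _ in range(steps):
        w_fwd = fake_quant(w, scale) if scale is not None else w
        grad = 2 * (w_fwd * x - y) * x
        w -= lr * grad
    return w

scale = 0.25  # coarse grid so the rounding is visible

# Post-training style: train in float, quantize once afterwards.
w_float = train(x=1.0, y=0.6, steps=200, lr=0.1)
w_ptq = fake_quant(w_float, scale)

# Quantization-aware style: the quantizer is part of every training step.
w_qat = fake_quant(train(1.0, 0.6, 200, 0.1, scale), scale)
```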

Question No : 6

Which method supports the creation of a language model that is both lightweight and capable of maintaining strong performance across tasks?

A.Performing distributed hyperparameter tuning to explore a wide range of model settings
B.Selecting advanced sampling techniques to diversify the generated outputs
C.Utilizing knowledge distillation to train a smaller model that learns from a teacher model
D.Using sliding-window attention mechanisms for handling long input sequences

Answer:
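
Background sketch for Question 6: knowledge distillation commonly trains a student to match a teacher's temperature-softened output distribution. The loss below is one common formulation (KL divergence scaled by the squared temperature); the logits and function names are hypothetical, and a real setup would usually combine this term with the ordinary task loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 4.0, 1.0]
loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
```

A student whose logits track the teacher's gets a loss near zero, while one that ranks the classes differently is penalized more heavily.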

Question No : 7

When designing comprehensive evaluation frameworks for production LLM systems, which components ensure robust performance assessment across diverse use cases? Pick the 2 correct responses below

A.Manual evaluation only without automated systems or systematic measurement and tracking methodologies
B.Single metric optimization focusing exclusively on accuracy without considering other performance dimensions
C.Benchmark dataset integration with domain-specific test sets and systematic performance tracking capabilities
D.Multi-dimensional metrics covering accuracy, fluency, relevance, and safety with automated scoring systems

Answer:

Question No : 8

Which practice helps prevent overfitting when fine-tuning a large language model on a small, domain-specific dataset?

A.Continuing training until the model achieves zero loss on the training set
B.Ignoring validation data and focusing only on the training set
C.Increasing model size with each epoch
D.Using early stopping based on validation loss during training

Answer:
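
Background sketch for Question 8: early stopping can be expressed as a small loop over per-epoch validation losses. The `patience` parameter, the function name, and the loss values are assumptions for illustration; training frameworks typically provide their own callback for this.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stopped_at): the epoch with the lowest validation
    loss and the epoch at which training halts because the loss has not
    improved for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # validation loss has stalled; stop training
    return best_epoch, epoch

val_losses = [0.90, 0.70, 0.62, 0.65, 0.69, 0.75, 0.80]
best_epoch, stopped_at = early_stopping_epoch(val_losses, patience=2)
```

With these numbers the loop keeps the weights from epoch 2 and halts at epoch 4, before the later, higher validation losses are ever reached.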

Question No : 9

You’re implementing a RAG system for a technical support chatbot with access to 10TB of documentation.
Current challenges:
• Documentation updates daily with version-specific information
• Users often ask about error messages with slight variations
• Need to handle multi-hop reasoning (e.g., 'error X usually means Y, and Y is fixed by Z')
• Latency budget: 500ms end-to-end
• Accuracy requirement: 95% for known issues
Which RAG implementation best balances these requirements?

A.Implement hierarchical indexing with sparse (BM25) for initial retrieval and dense embeddings for reranking, use incremental indexing for daily updates, add query expansion with LLM-generated variations, and implement iterative retrieval for multi-hop reasoning
B.Build knowledge graph from documentation, use graph neural networks for retrieval, implement fuzzy matching for error variations, maintain separate indices per version, and use beam search for multi-hop paths
C.Deploy hybrid sparse-dense retrieval in single stage, use vector database with HNSW index, implement document version tagging, generate multiple query embeddings, and limit to top-3 documents for latency
D.Use dense-only retrieval with sentence transformers, implement semantic caching for common queries, rebuild entire index nightly, and use chain-of-thought prompting to handle multi-hop in single retrieval

Answer:

Question No : 10

Which of the following actions best represents a standard method for quantitatively evaluating the generative capability of a large language model (LLM)?

A.Increasing the model's training data without measuring outcomes
B.Relying exclusively on user feedback for all assessments
C.Measuring model performance using metrics such as BLEU, ROUGE, and perplexity
D.Modifying prompts to test new task capabilities

Answer:
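
Background sketch for Question 10: perplexity is the exponential of the average negative log-likelihood the model assigns to the observed tokens. The probability lists below are invented; in practice they would come from the model's output distribution on held-out text.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that spreads probability evenly over 4 choices at every step.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that assigns high probability to each observed token.
confident = perplexity([0.9, 0.8, 0.95, 0.85])
```

The uniform case gives a perplexity of 4, matching the intuition of "choosing among four equally likely tokens"; lower values indicate the model found the text less surprising.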

Question No : 11

A government agency is deploying an LLM for citizen services (benefits eligibility, tax questions, immigration status).
Requirements:
• Must serve all citizens equitably
• Audit trail for all decisions
• Ability to correct errors rapidly
• Compliance with accessibility standards
The model performs well in testing, but stakeholders worry about real-world fairness.
Which deployment strategy best ensures responsible AI practices?

A.Phased rollout starting with low-risk queries, expanding based on fairness metrics from each phase
B.Parallel deployment with human agents handling sensitive cases while the LLM handles routine queries despite model biases
C.Full deployment with a prominent feedback mechanism and weekly bias analysis of user interactions
D.Blue-green deployment with ability to instantly rollback to previous versions if bias is detected

Answer:

Question No : 12

When combining automated benchmark results with human-in-the-loop evaluation, which approaches optimize the balance between scalability and assessment quality? Pick the 2 correct responses below

A.Stratified sampling for human evaluation with focus on edge cases and automated metric disagreements
B.Automated evaluation only without human oversight to maximize efficiency and processing speed
C.Random human evaluation without consideration for automated results or systematic sampling strategies
D.Complete human evaluation of all samples for maximum accuracy regardless of time and cost constraints
E.Active learning approaches to identify samples requiring human judgment based on model uncertainty

Answer:

Question No : 13

When optimizing throughput for a 3B parameter model on A100 GPUs, profiling shows 70% memory utilization but only 50% SM activity.
Which TWO techniques would improve throughput? Pick the 2 correct responses below

A.Use smaller sequence lengths to process more samples per batch
B.Enable torch.compile() or TensorRT optimization for kernel fusion and better SM utilization
C.Increase batch size until memory utilization reaches 90-95% for better GPU saturation
D.Reduce model precision from FP16 to INT8 to fit larger batches
E.Implement gradient accumulation to simulate larger batch sizes without increasing memory

Answer:

Question No : 14

A team is developing a language translation system and must choose between a Recurrent Neural Network (RNN) with attention and a Transformer model.
Which TWO statements correctly describe the main differences between these architectures? Pick the 2 correct responses below

A.Transformers are slower at processing long documents, while RNNs process their inputs in parallel, enabling faster training and better handling of long-range dependencies.
B.Transformers can model dependencies between any parts of the input sequence regardless of their distance, while RNNs struggle with very long sequences due to vanishing gradients.
C.Both RNNs and Transformers process data sequentially, making them inefficient for long documents. However, Transformers show better contextual comprehension.
D.RNNs are slower at processing long documents, while Transformers process their inputs in parallel, enabling faster training and better handling of long-range dependencies.

Answer:

Question No : 15

When deploying a 13B parameter model across 4 A100 40GB GPUs for inference, the team faces OOM errors despite theoretical calculations showing sufficient memory.
Which TWO strategies would most effectively resolve this issue? Pick the 2 correct responses below

A.Apply activation checkpointing, allowing intermediate activations to be recomputed on demand instead of being stored, thus reducing GPU memory requirements.
B.Enable NVIDIA Multi-Instance GPU (MIG) features to partition each A100 GPU into multiple, smaller instances to share resources more flexibly.
C.Increase the server’s system RAM to provide additional swap space for GPU memory overflow during inference.
D.Distribute the model layers evenly across GPUs using model parallelism and optimize the pipeline scheduling to balance memory and computation.

Answer:
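
Background sketch for Question 15: a back-of-the-envelope memory estimate shows why a weights-only calculation can look sufficient while inference still runs out of memory. The layer count (40), hidden size (5120), batch sizes, and sequence length are assumptions that do not come from the question; the estimate also ignores activations, framework overhead, and fragmentation.

```python
def weights_gb(num_params, bytes_per_param=2):
    """Memory for the weights alone (FP16 by default), in decimal GB."""
    return num_params * bytes_per_param / 1e9

def kv_cache_gb(layers, hidden, batch, seq_len, bytes_per_value=2):
    """Key/value cache: 2 tensors per layer, one hidden-size vector per token."""
    return 2 * layers * hidden * batch * seq_len * bytes_per_value / 1e9

total_gpu_gb = 4 * 40          # four 40 GB GPUs
w = weights_gb(13e9)           # 13B parameters in FP16

# Hypothetical model shape and workloads (not given in the question).
kv_small = kv_cache_gb(layers=40, hidden=5120, batch=32, seq_len=4096)
kv_large = kv_cache_gb(layers=40, hidden=5120, batch=64, seq_len=4096)

fits_small = w + kv_small <= total_gpu_gb
fits_large = w + kv_large <= total_gpu_gb
```

Under these assumptions the weights take 26 GB of the 160 GB total, yet doubling the batch pushes the cache alone past the combined capacity, and an uneven split across GPUs can exhaust a single device even earlier.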
