evaluation ยท week 1
Test what stuck from this story.
๐ story quiz
What does the BLEU metric actually measure?
Why can an answer score high on BLEU yet still be a serious failure?
Why evaluate an answer with overlap, faithfulness, and an LLM judge together instead of one metric alone?
score: 0/0
Re-read the story โ