Bot Contest
Here I'll be posting information on various Bot contests that challenge and test a Bot's AI and realism. Feel free to post comments and updates on contests, as well as announcements for new contests.
Posts 172 - 183 of 4,091
View Contest Winners in the Hall of Fame.
rexmundi
23 years ago
I actually also posted something similar to the forum off of the contest site. If anyone has anything to add, either in support of or in dispute to my comments, I would appreciate the effort.
thanks.
The Professor
23 years ago
I think the highest and lowest scores were thrown out.. Actually, I just did the math, and that's exactly the case.
I think Judge 6 & 7 should be thrown out completely, the former for being always too high, the latter for always being insultingly low.
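For anyone who wants to redo the math, here is a minimal sketch of averaging judge scores after dropping the single highest and single lowest one, as described above. The function name `trimmed_average` and the sample scores are made up for illustration; the contest's actual computation isn't shown in this thread.

```python
def trimmed_average(scores):
    """Average the scores after discarding one highest and one lowest value."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to trim both ends")
    trimmed = sorted(scores)[1:-1]  # drop the lowest and the highest score
    return sum(trimmed) / len(trimmed)


# Hypothetical scores from seven judges (not the real contest data).
example_scores = [30, 28, 31, 29, 27, 45, 6]
print(trimmed_average(example_scores))  # 29.0
```

Note that this only removes one score at each end per bot, so a judge who is consistently high or low still affects any bot where another judge happens to be more extreme.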
SirRahz
23 years ago
I get the impression some of the judging was done real quick - my scores cover quite a range too.
Mr. Crab
23 years ago
Yes, the range is incredible -- but forget the range for a second, just look at the scores. OK, Gizzle didn't do great on some of the questions. But to get those sub-10 scores, he'd have to have bombed them generally -- and some of them were right-on, I don't think anyone could argue.
Since we know how they were supposed to be scored, why not score them ourselves? Might take some time, but we'd at least have a Forge consensus. I'll post here when I've had a go at it -- if anyone else does it, post them ranked and scored... I wonder how close we'll be.
The Professor
23 years ago
Here's something hilarious! Davine's voting score is 0.83. I just realized that the lowest score you can give someone is 1, so somehow Davine is so bad that his total is less than one!
Mr. Crab
23 years ago
Actually, because of the way it works, if 1 person gave him a 1, he'd have a .5 -- if two did, he'd have a .67... and so on. It's the total of the votes divided by the number of votes plus one.
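One reading of that description is "sum of the votes divided by (number of votes + 1)". Under that assumption, a single vote of 1 gives 0.5 and five votes of 1 give about 0.83, which matches the figure quoted for Davine. The sketch below uses a hypothetical function name, `displayed_vote_score`, and is a guess at the formula, not the site's actual code.

```python
def displayed_vote_score(votes):
    """Assumed formula: total of all votes divided by (vote count + 1)."""
    return sum(votes) / (len(votes) + 1)


print(displayed_vote_score([1]))                # 0.5
print(round(displayed_vote_score([1] * 5), 2))  # 0.83
```

With this formula the displayed value only approaches the true average as more votes come in, so a bot with few votes always shows lower than the votes it actually received.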
Mr. Crab
23 years ago
The scoring:
- 5 points if the Bot answered the question correctly and did so in a creative way.
- 4 points if the Bot gave an appropriate response to the question.
- 3 points if the response is incorrect or imperfect, but in relation with the question.
- 1 point for a vague or non-committal response.
- 0 points if the response has no relation with the question.
- Credit is given for responses that attempt to maintain the conversation.
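As a rough sketch, tallying a bot under the rubric above could look like this. The category labels, the `POINTS` table name, and the sample labels are my own shorthand, and the "credit for maintaining the conversation" item is left out because the rubric puts no number on it.

```python
# Points per response category, taken from the rubric; label names are made up.
POINTS = {
    "creative_correct": 5,
    "appropriate": 4,
    "imperfect_but_related": 3,
    "vague": 1,
    "unrelated": 0,
}


def total_score(labels):
    """Sum the rubric points for one bot's labelled responses."""
    return sum(POINTS[label] for label in labels)


# Hypothetical labels for one bot's ten answers (not taken from a real log).
example_labels = [
    "appropriate", "vague", "creative_correct", "unrelated",
    "imperfect_but_related", "appropriate", "appropriate", "vague",
    "unrelated", "imperfect_but_related",
]
print(total_score(example_labels))  # 25 out of a possible 50
```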
My results:
1 Alice 32 / Sarah 32 (one of these should be disqualified for being too close to the other)
2 Chat-Bot 29 (Talk-Bot 27 should be disqualified for being too close to Chat-Bot)
4 Jabberwacky 28
4 Midnight Blue 28
5 Hex 27
8 Dogh'd 24
8 Elbot 24
8 Eugene 24
11 Gizzle 23
11 Lars Talk 23
11 MarkBot 23
13 Desti 22
13 Oraknabo 22
15 Cheez 21
15 Stan 21
18 Albert2 20
18 Eliza 20
18 Jabberwock 20
22 Bob 19
22 Gabber 19
22 Hal 19
22 Mitbolel 19
26 Fred23 18
26 Gaia 18
26 Hillbilly Hank 18
26 Liddora 18
28 B.O.B. 17
28 Cara 17
30 Robitron 16
30 Whinsey 16
32 Ella 15
32 Steve Slacker 15
34 Aston 13
34 Zinc 13
35 Fanboy 12 -- or 17 after +5 points for Frank Miller response, but it's a stretch
36 Paula 11
39 Amira 9
39 Chas 9
39 FairyPrincess 9
40 Jenna Dark 8
41 Catty 6
45 Claude 5
45 Iya 5
45 MGonz 5
45 Yu 5
46 Milo 4
48 Billy 1
48 Sensation Bot 1
I'd be interested to see how other Forgers compare. In particular, I hope I wasn't over-generous to my own bot. Regardless, this exercise highlighted for me that a) judges obviously were not following instructions, and b) the 10-question segment as judged just isn't a great barometer. For example, Gabber at 22, 11 on Challenge, has correct answers that are followed by utter nonsense, and definitely should not be so high. On the flip side, Yu at 45, 46 on Challenge is earnestly trying to have a conversation with a ruthless interrogator and not succeeding just 'cause she's not a dictionary. You can see from her log that even though she does ask too many questions herself, she's got some sense when you can get a word in edgewise.
Next post: my suggestion for scoring instructions.
Mr. Crab
23 years ago
How about 2 scores:
SCORE A -- responsiveness (per question)
5 excellent, non-robotic response.
4 apt, appropriate response. need not be correct answer to a question unless it's obvious
3 appropriate response, but lacking in believability or having minor grammatical implausibilities
1 non-responsive, but sensible. e.g. an effective topic-changer.
0 non-responsive and senseless, or with damning grammatical or believability issues
SCORE B -- manner (overall impression)
5 believable, conversational, and with a distinct personality
4 believable and conversational
3 promising but obviously thwarted by interrogation-style questioning (trying to start a real conversation e.g. answers every question with another question or in some other way shows promise but is just not getting along conversationally with the unnatural conversation being had with it)
1-2 inconsistent -- judge's tilt
0 hapless
Scoring: add all 10 A scores, then add B score times 5 for a score out of 75.
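A minimal sketch of that combination, assuming ten per-question A scores drawn from {0, 1, 3, 4, 5} and one overall B score from 0 to 5. The function name `composite_score` and the sample numbers are hypothetical.

```python
ALLOWED_A = {0, 1, 3, 4, 5}     # per-question responsiveness values
ALLOWED_B = {0, 1, 2, 3, 4, 5}  # overall manner values


def composite_score(a_scores, b_score):
    """Add the ten A scores, then add five times the B score (maximum 75)."""
    if len(a_scores) != 10:
        raise ValueError("expected exactly ten A scores")
    if any(a not in ALLOWED_A for a in a_scores):
        raise ValueError("A scores must be 0, 1, 3, 4 or 5")
    if b_score not in ALLOWED_B:
        raise ValueError("B score must be an integer from 0 to 5")
    return sum(a_scores) + 5 * b_score


# Hypothetical judging sheet for one bot.
print(composite_score([5, 4, 4, 3, 1, 0, 4, 5, 3, 4], 3))  # 48
```

The B score carries a third of the total (25 of 75), so a bot with personality can make up for a few missed questions.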
The Professor
23 years ago
I like that scoring system. Here's what I got for the top 3:
Eugene: 34 (judges: 30)
Oraknabo: 32 (judges: 17)
ChatBot: 35 (judges: 30)
Similar.. some are hard to decide between. Oraknabo certainly seems to have lost popularity points in the ten-questions voting.
Mr. Crab
23 years ago
Well right, the 10 questions are just the 10 questions, there's still the logs and individual chats to review when voting. But the 10-questions is important because most people visiting the site probably only look at the bots near the top of the list.
Prof -- those scores are based on the existing scoring system, right? Not my proposed one you say you like. You appear to be more inclined to generosity than I... I guess there might be some variance no matter who the judges are then, eh?
The Professor
23 years ago
Apparently so. Yes, my scores are based on the contest rules.
"Are you nervous" -> "Yes, I sure am" could easily be scored as 4 or 1. It was an appropriate response, but also vague. Vary scoring on just 3 of these, and you're already nine points off from other judges.
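To spell out the nine-point figure: if two judges agree on seven answers but one reads three borderline answers as "appropriate" (4) and the other as "vague" (1), the totals differ by 3 x (4 - 1) = 9. The score sheets below are invented purely to show the arithmetic.

```python
# Two hypothetical judges scoring the same ten answers under the contest rubric.
generous_judge = [4, 4, 4, 3, 5, 0, 1, 3, 4, 3]
strict_judge = [1, 1, 1, 3, 5, 0, 1, 3, 4, 3]  # first three read as "vague"

gap = sum(generous_judge) - sum(strict_judge)
print(sum(generous_judge), sum(strict_judge), gap)  # 31 22 9
```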
Shadyman
23 years ago
yeah. Hey crab, nice scoring summary... maybe you've got some extra brain room in those bug ears
LOL with their system now, Steve is down from 32 to 37...
