Bot Contest
Here I'll be posting information on various Bot contests that challenge and test a Bot's AI and realism. Feel free to post comments and updates on contests, as well as announcements for new contests.
Posts 178 - 189 of 4,091
Mr. Crab
23 years ago
The scoring:
-5 points if the Bot answered the question correctly and did so in a creative way.
-4 points if the Bot gave an appropriate response to the question.
-3 points if the response is incorrect or imperfect, but in relation with the question.
-1 point for a vague or non-committal response.
-0 points if the response has no relation with the question.
-Credit is given for responses that attempt to maintain the conversation.
My results:
1 Alice 32 / Sarah 32 (one of these should be disqualified for being too close to the other)
2 Chat-Bot 29 (Talk-Bot 27 should be disqualified for being too close to Chat-Bot)
4 Jabberwacky 28
4 Midnight Blue 28
5 Hex 27
8 Dogh'd 24
8 Elbot 24
8 Eugene 24
11 Gizzle 23
11 Lars Talk 23
11 MarkBot 23
13 Desti 22
13 Oraknabo 22
15 Cheez 21
15 Stan 21
18 Albert2 20
18 Eliza 20
18 Jabberwock 20
22 Bob 19
22 Gabber 19
22 Hal 19
22 Mitbolel 19
26 Fred23 18
26 Gaia 18
26 Hillbilly Hank 18
26 Liddora 18
28 B.O.B. 17
28 Cara 17
30 Robitron 16
30 Whinsey 16
32 Ella 15
32 Steve Slacker 15
34 Aston 13
34 Zinc 13
35 Fanboy 12 -- or 17 after +5 points for Frank Miller response, but it's a stretch
36 Paula 11
39 Amira 9
39 Chas 9
39 FairyPrincess 9
40 Jenna Dark 8
41 Catty 6
45 Claude 5
45 Iya 5
45 MGonz 5
45 Yu 5
46 Milo 4
48 Billy 1
48 Sensation Bot 1
I'd be interested to see how other Forgers compare. In particular, I hope I wasn't over-generous to my own bot. Regardless, this exercise highlighted for me that a) judges obviously were not following instructions, and b) the 10-question segment as judged just isn't a great barometer. For example, Gabber at 22, 11 on Challenge, has correct answers that are followed by utter nonsense, and definitely should not be so high. On the flip side, Yu at 45, 46 on Challenge is earnestly trying to have a conversation with a ruthless interrogator and not succeeding just cause she's not a dictionary. You can see from her log that even though she does ask too many questions herself, she's got some sense when you can get a word in edgewise.
Next post: my suggestion for scoring instructions.
Mr. Crab
23 years ago
How about 2 scores:
SCORE A -- responsiveness (per question)
5 excellent, non-robotic response.
4 apt, appropriate response. need not be correct answer to a question unless it's obvious
3 appropriate response, but lacking in believability or having minor grammatical implausibilities
1 non-responsive, but sensible. e.g. an effective topic-changer.
0 non-responsive and senseless, or with damning grammatical or believability issues
SCORE B -- manner (overall impression)
5 believable, conversational, and with a distinct personality
4 believable and conversational
3 promising but obviously thwarted by interrogation-style questioning (trying to start a real conversation e.g. answers every question with another question or in some other way shows promise but is just not getting along conversationally with the unnatural conversation being had with it)
1-2 inconsistent -- judge's tilt
0 hapless
Scoring: add all 10 A scores, then add B score times 5 for a score out of 75.
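For anyone who wants to try the two-score proposal on a transcript, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `combined_score` and the validation rules are assumptions read off the scale above (A takes the values 0, 1, 3, 4 or 5 per question; B is a single 0 to 5 value), not part of any contest tooling.

```python
# Minimal sketch of the proposed two-score total (hypothetical helper).
ALLOWED_A = {0, 1, 3, 4, 5}   # per-question responsiveness values listed above
ALLOWED_B = {0, 1, 2, 3, 4, 5}  # overall manner; 1-2 is the "judge's tilt" band


def combined_score(a_scores, b_score):
    """Return sum of the ten A scores plus 5 times the B score (max 75)."""
    if len(a_scores) != 10:
        raise ValueError("expected exactly 10 per-question A scores")
    if any(a not in ALLOWED_A for a in a_scores):
        raise ValueError("A scores must each be one of 0, 1, 3, 4, 5")
    if b_score not in ALLOWED_B:
        raise ValueError("B score must be an integer from 0 to 5")
    return sum(a_scores) + 5 * b_score


# A perfect transcript reaches the stated ceiling of 75.
best_possible = combined_score([5] * 10, 5)
```

Weighting B by 5 means the overall impression counts for a third of the total (25 of 75), so a bot that is thwarted by interrogation-style questions can still recover points on manner.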
The Professor
23 years ago
I like that scoring system. Here's what I got for the top 3:
Eugene: 34 (judges: 30)
Oraknabo: 32 (judges: 17)
ChatBot: 35 (judges: 30)
Similar.. some are hard to decide between. Oraknabo certainly seems to have lost popularity points in the ten-questions voting.
Mr. Crab
23 years ago
Well right, the 10 questions are just the 10 questions, there's still the logs and individual chats to review when voting. But the 10-questions is important because most people visiting the site probably only look at the bots near the top of the list.
Prof -- those scores are based on the existing scoring system, right? Not my proposed one you say you like. You appear to be more inclined to generosity than I... I guess there might be some variance no matter who the judges are then, eh?
The Professor
23 years ago
Apparently so. Yes, my scores are based on the contest rules.
"Are you nervous" -> "Yes, I sure am" could easily be scored as 4 or 1. It was an appropriate response, but also vague. Vary scoring on just 3 of these, and you're already nine points off from other judges.
Shadyman
23 years ago
yeah. Hey crab, nice scoring summary... maybe you've got some extra brain room in those bug ears
LOL with their system now, Steve is down from 32 to 37...

Shadyman
23 years ago
prof, don't you find this insulting? A PF bot (Steve) has a ranking of 2.86
haha 286.. Maybe that's suggesting something from the judges?
ppppp To them... I also find it interesting how Alice, (I think) a previous winner (and you "can't use Alice's engine" in your bot), was first in the judging part...


Shadyman
23 years ago
What I don't understand is the 10 questions marking scheme... The way they have it written, it's "You start from 50 points and go down, lowest score wins" because each of the things is negative:
Scoring guidelines for the 10 Question phase
-5 points if the Bot answered the question correctly and did so in a creative way.
-4 points if the Bot gave an appropriate response to the question.
-3 points if the response is incorrect or imperfect, but in relation with the question.
-1 point for a vague or non-committal response.
-0 points if the response has no relation with the question.
Either way, Steve should have been given at least:
4 points for his "joke", if not 5;
3 points for his witty response of "how to build a chatterbot";
4 points for "do you like talking to people";
1 point (at least) for the "can I have a picture of you" question;
1 point for the George Bush comment
---------- +
13 points at *minimum*...
Almost half the judges have him < 7... something obviously wrong here

Wendell
23 years ago
I joined the Forge today to be able to post to the message board. I have spoken with Jon and I sympathize with him but I do not think the contest was unfair. All the bots were asked the same questions and the bot's name was removed so the judges didn't know what bot they were judging.
Understand, I started and created the Chatterbox Challenge. Agentland is simply the sponsor. You can't however expect someone to put up $2000 and get nothing in return but I don't feel I have sold my soul to the devil here. We have not always agreed on everything but I strongly feel we have put forth a good contest. It is not easy bringing together 58 bots from all over the world.
The problem with the 10-question scores is the questions that were used. They were hard questions and hard to grade. Asking "what is the color of a red apple" is a lot easier to grade than "how do you make a chatterbot?" When you have those types of questions you are going to have a lot of varying opinions. Unfortunately these judges weren't even in the same ballpark. They were given instructions on how to grade the bots and were even given a test bot to score. So it wasn't like they were clueless on what needed to be done, although it seems a few were.
Concerning Oraknabo, Jon decided to create a bot with an abrasive personality and he has done just that. The majority of people, including myself, don't have any desire to talk with a rude person, much less a rude bot. Concerning the popular vote, you can't expect people to then give you a high score. Concerning the 10 questions, even the people on PF are not in agreement. The Professor gave him a 32 while Mr Crab gave him a 22. I don't want to sound like an expert here, but no one has talked to all the bots in this contest like I have. I am also a programmer, so I know what it takes to make a chatterbot. I would score Oraknabo like this:
1) WHERE ARE YOU
Oraknabo: I think you should ask somebody else. - 0
2) ARE YOU NERVOUS
Oraknabo: You're just trying to change the subject aren't you? - 0
3) DO YOU LIKE TALKING TO PEOPLE
Oraknabo: I hate TALKING TO PEOPLE! - 4
4) WHAT IS THE CAPITAL OF FRANCE
Oraknabo: To what degree must I idly stand by and tolerate your ignorance? - 0
5) WHEN DID THE FIRST MAN WALK ON THE MOON
Oraknabo: Arf - 0
6) WHAT DO YOU THINK ABOUT GEORGE BUSH
Oraknabo: Do you want the truth? - 3
7) CAN YOU TELL JOKES
Oraknabo: For a price.... - 3
8) WHO IS YOUR FAVORITE STAR WARS CHARACTER
Oraknabo: I suppose you like a white dwarf. - 0
9) HOW CAN I MAKE A CHATTERBOT
Oraknabo: I think the more appropriate question is why... - 4
10) CAN I HAVE A PICTURE OF YOU
Oraknabo: I'd rather you didn't - 4
Now that is a total of 18.00 points, which is almost exactly what the judges gave him, and to be honest I think I was more than generous here. So what's my point? Well, we just had three people who are not judges in the contest and we have the following scores: 32, 22, 18... even we are not in agreement. The problem is not in the point system as much as it is in how people interpret it.
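For anyone who wants to check the numbers, here is a rough Python sketch that adds up the ten per-question scores listed above and compares the three totals quoted in this thread. The variable names are made up for illustration, and the figure after each reply is read as a score with the dash as a separator, which is an interpretation of the list rather than something the rules spell out.

```python
# Per-question scores for Oraknabo as listed above, questions 1 through 10.
wendell_scores = [0, 0, 4, 0, 0, 3, 3, 0, 4, 4]

# Totals from the three non-judges quoted in this thread.
totals = {
    "The Professor": 32,
    "Mr Crab": 22,
    "Wendell": sum(wendell_scores),  # 18
}

# Gap between the most and least generous scorer: 32 - 18 = 14 points.
spread = max(totals.values()) - min(totals.values())

for name, total in totals.items():
    print(f"{name}: {total} / 50")
print(f"spread between scorers: {spread} points")
```

A 14-point gap on a 50-point scale, with all three people scoring the same ten replies, is the disagreement being described here.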
I think instead of trying to make some elaborate scoring systems we should make questions that yield a more cut-and-dried response.
I hope everybody reading this will look at the bigger picture here. The contest is still evolving. I don't want to press this issue too far. I'd rather have a contest with its flaws than none at all. So let's not throw in the towel here but let's work towards making this a better contest for the future. I welcome all your opinions and look forward to some meaningful dialogue on the board.
Sincerely,
Chris Cowart
Mr. Crab
23 years ago
I'm inclined to re-score those bots using my proposed scoring, but dammit if it isn't too much work.
rexmundi
23 years ago
I don't want to seem defensive, because I really do get what you're saying Chris, but like you had to even explain your responses on the site, I'd like to take a stab at the same for mine:
1) WHERE ARE YOU
Oraknabo: I think you should ask somebody else.
Not a total wash, this one at least shows that the bot can differentiate a question from a statement and it is not really in any way inappropriate. Sure it doesn't merit a 5 but a 0?
2) ARE YOU NERVOUS
Oraknabo: You're just trying to change the subject aren't you?
Again, not a total non sequitur. Should at least get a one. This answer is light years beyond anything Davine would have said.
3) DO YOU LIKE TALKING TO PEOPLE
Oraknabo: I hate TALKING TO PEOPLE!
4 is fair enough. Not too witty, but on point.
4) WHAT IS THE CAPITAL OF FRANCE
Oraknabo: To what degree must I idly stand by and tolerate your ignorance?
This is probably where I would disagree the most. If you don't even know the capital of France, then he's really got no interest in answering your question. This is both appropriate AND clever in my opinion and it evokes an emotional response on top of it all.
5) WHEN DID THE FIRST MAN WALK ON THE MOON
Oraknabo: Arf
Yes, this was lame.
6) WHAT DO YOU THINK ABOUT GEORGE BUSH
Oraknabo: Do you want the truth?
Fair. But it's probably what I would have said for real. I've seen much less realistic answers on this one.
7) CAN YOU TELL JOKES
Oraknabo: For a price....
OK, three is fair, but he was going somewhere with this one.
8) WHO IS YOUR FAVORITE STAR WARS CHARACTER
Oraknabo: I suppose you like a white dwarf.
OK, dumb answer. (Benji's fault.)
9) HOW CAN I MAKE A CHATTERBOT
Oraknabo: I think the more appropriate question is why...
10) CAN I HAVE A PICTURE OF YOU
Oraknabo: I'd rather you didn't
And both of these are fine, though I think 9 is sort of clever, so it may deserve a 5.
-------
But aside from all this, I really just don't think 10 questions are the way to do it. At least not just one round. It's not a game show. The interactivity of the bot should be the primary thing judged in the contest. Maybe if there were a few rounds of question sessions that were averaged in the end I might feel better about it. That may be something to think about for next year.
One really interesting thing about the bots here is that they accumulate like and dislike for their chat partner and can change the tone of their responses to the guest. Oraknabo often starts out very insulting, but can actually become friendlier through the conversation. I think it's sad that this probably won't even be noticed because people are too busy looking for the "right answers".
Also, because of the PF's seek function, many of my bot's answers have payoffs 2 and 3 replies later, so scoring him on the immediate reply following the question, just because that's how bots are expected to work, is also not fair.
I really don't mean unfair in a personal sense. I mean it more in the sense of unjust. I really don't think anyone singled me out and gave me a bad score, and, like I have said before, I'm OK with my ranking once the extremes are thrown out. I just had to say something because the scores were so outrageously different.
I have read a little about the problems in previous competitions and I really don't mean to start anything. I am pretty appeased with the decision to pare off the 2 extreme scores, whether or not it has any effect on the ranking anywhere, so I'll drop it. I do have some dispute with my scores and I know I can't and shouldn't be able to bully anyone into changing their score. While I don't think I was personally singled out, I do think the unpleasant personality of my bot, which I believe is a plus on the side of realism, has heavily weighed against me. I guess I just expected more objectivity and more understanding that it was deliberate.