I didn't realize that with Java's regex class, Matcher, m.group(0) denotes the entire pattern (the whole match) rather than the first capturing group. I spent some time debugging this, hence this note.
As is stated in the documentation, “Capturing groups are indexed from left to right, starting at one. Group zero denotes the entire pattern, so the expression m.group(0) is equivalent to m.group().”
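To make the indexing concrete, here is a minimal sketch. The class name `GroupDemo` and the helper `firstMention` are made up for illustration, and the pattern is a simplified version with no lookbehind. It returns group 0 and group 1 for the first match so the two can be compared side by side.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Hypothetical helper: returns {group(0), group(1)} for the first match, or null if none.
    public static String[] firstMention(String text) {
        Matcher m = Pattern.compile("@([a-zA-Z0-9_-]+)").matcher(text);
        if (!m.find()) {
            return null;
        }
        // group(0) is the text matched by the whole pattern, "@" included;
        // group(1) is only what the first pair of parentheses captured.
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] groups = firstMention("thanks @Alice!");
        System.out.println(groups[0]); // prints @Alice
        System.out.println(groups[1]); // prints Alice
    }
}
```

So if you want the user name without the leading @, group 1 is the one to ask for.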
Here is some sample code that picks Twitter user names out of a string using Java. A set is returned, so a user name that is mentioned more than once is stored only once. All user names are returned in lowercase. This function has gone through pretty thorough testing and works pretty well.
In addition, this is also a pretty good sample of negative lookbehind regex usage: we do not want to match where the @ is preceded by any valid Twitter user name character.
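A small sketch of what the lookbehind buys you, comparing a pattern without it against one with it. The class name `LookbehindDemo`, the constants `NAIVE` and `GUARDED`, and the helper `names` are all made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindDemo {
    // No lookbehind: any "@" followed by user name characters counts.
    public static final Pattern NAIVE = Pattern.compile("@([a-zA-Z0-9_-]+)");
    // Negative lookbehind: reject an "@" that directly follows a user name character.
    public static final Pattern GUARDED = Pattern.compile("(?<![a-zA-Z0-9_-])@([a-zA-Z0-9_-]+)");

    // Hypothetical helper: collects group(1) of every match, in order of appearance.
    public static List<String> names(Pattern p, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "mail user@example.com or ping @bob";
        System.out.println(names(NAIVE, text));   // prints [example, bob]
        System.out.println(names(GUARDED, text)); // prints [bob]
    }
}
```

The lookbehind is zero-width, so it only checks the character before the @ and does not consume it or become part of the match.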
Update: Angle brackets in Java code caused my code formatter to add some junk inside the code. Be aware! I need to look for a good code formatter for WordPress…
[code language="java"]
package com.haidongji.java;

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Tweet {
    /**
     * Get usernames mentioned in a string.
     *
     * @param s
     *            string, a tweet
     * @return the set of usernames mentioned in the text of the tweet.
     *         A username-mention is "@" followed by a Twitter username.
     *         A Twitter username is composed of:
     *         English letters, digits, dash, and underscore.
     *         The username-mention cannot be immediately preceded or followed
     *         by any character valid in a Twitter username. Therefore:
     *         user@example.com does NOT contain a mention of the username example.
     *         Twitter usernames are case-insensitive, so they are returned in lowercase.
     */
    public static Set<String> getMentionedUsersFromString(String s) {
        Set<String> set = new HashSet<>();
        // Negative lookbehind: the "@" must not follow a username character; see spec above.
        String pattern = "(?<![a-zA-Z0-9_-])@([a-zA-Z0-9_-]+)";
        Matcher m = Pattern.compile(pattern).matcher(s);
        while (m.find()) {
            // group(1) is the name without the "@"; group(0) would be the whole match.
            set.add(m.group(1).toLowerCase());
        }
        return set;
    }
}
[/code]
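For a quick check of the deduplication and lowercasing described above, here is a self-contained sketch. The class `MentionCheck` and the method `mentions` are names I made up. The regex is the one intended in the post, with the lookbehind written as `(?<!` with no space. I use a TreeSet instead of a HashSet only so that the printed order is predictable.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MentionCheck {
    // Compiled once and reused; the pattern text matches the spec in the post.
    private static final Pattern MENTION =
            Pattern.compile("(?<![a-zA-Z0-9_-])@([a-zA-Z0-9_-]+)");

    // Hypothetical stand-in for the function in the post, returning a sorted set.
    public static Set<String> mentions(String tweet) {
        Set<String> names = new TreeSet<>();
        Matcher m = MENTION.matcher(tweet);
        while (m.find()) {
            names.add(m.group(1).toLowerCase());
        }
        return names;
    }

    public static void main(String[] args) {
        // "@Bob" and "@bob" collapse into one entry; "me@x.org" is not a mention.
        System.out.println(mentions("@Bob and @bob, cc @a_b-c; mail me@x.org")); // prints [a_b-c, bob]
    }
}
```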