The amount of information out there is staggering
Tom Potok works at the Oak Ridge National Laboratory in Tennessee. He has been in the field of text and data mining for twenty years and has worked on a wide variety of projects. Some of the biggest challenges are the sheer amount of information out there and figuring out how the mind works with text.
“My name is Tom Potok, and I’ve been working in the field for probably about twenty years. I work at the Oak Ridge National Laboratory, which is in Oak Ridge, Tennessee. I’ve done a wide variety of things, from intelligent agents to swarm intelligence to very intense data mining to high-performance computing, and now we’re even looking at neuromorphic and quantum ways to look at text analysis. So it’s been quite a wide variety of things.”
“Probably some of our most interesting work is, well, there are two. One was on a system called Piranha. Piranha did very large-scale text clustering. It has been used by a number of people, and it’s actually been licensed; we’ve got a start-up company that’s using it. It takes very large sets of text and clusters them, using a novel term weighting scheme to pull information together in a better and faster way than is typically done.
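As an illustration only, the general idea of weighting terms and then grouping similar documents can be sketched in a few lines of Python. The interview does not describe the system's actual term weighting scheme, so this sketch swaps in ordinary smoothed TF-IDF and a simple greedy grouping by cosine similarity. The function names and the `threshold` parameter are assumptions made for the example, not part of the system discussed.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and keep purely alphabetic tokens (a deliberately crude tokenizer).
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    # Standard smoothed TF-IDF, used here as a stand-in weighting scheme.
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    # Greedy grouping: a document joins the first cluster whose first member
    # is similar enough, otherwise it starts a new cluster.
    vectors = tfidf_vectors(docs)
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


docs = [
    "cats chase mice",
    "cats chase birds",
    "stocks fell sharply today",
    "stocks rose sharply today",
]
print(cluster(docs))
```

A greedy pass like this is only meant to show the shape of the problem; clustering very large text collections would need a far more scalable approach.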
Some other work we’ve done is a system called Death Star. The name started off as sort of a joke. It was a recommender system; it was asking, ‘Can I take a couple of my documents and use those to go out and scour the internet?’ One of the big problems you have is that you know where to find documents and relevant material in your own area, but how do you find things that you’re not aware of? That’s what we’re trying to address. We’ve got about 9000 RSS feeds that we take; we take a collection of our papers, go scan the internet, bring papers back, and then present them on a Twitter feed. So it’s very quick and very simple to find new information.
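The workflow described here (start from a few of your own documents, compare incoming items against them, surface the closest matches) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the system's actual method: the feed items are hard-coded instead of fetched from RSS, similarity is plain bag-of-words cosine, and `recommend`, `top_k`, `min_score` and the stop-word list are hypothetical names chosen for the example.

```python
import math
from collections import Counter

# Small illustrative stop-word list (an assumption, not from the interview).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def bag_of_words(text):
    # Lowercase, strip surrounding punctuation, and drop stop words.
    words = (w.strip(".,:;!?()\"'").lower() for w in text.split())
    return Counter(w for w in words if w and w not in STOP_WORDS)


def cosine(a, b):
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def recommend(seed_docs, candidates, top_k=3, min_score=0.1):
    # Build one profile from the seed documents, then rank candidates against it.
    profile = Counter()
    for doc in seed_docs:
        profile.update(bag_of_words(doc))
    scored = [(cosine(profile, bag_of_words(text)), title) for title, text in candidates]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_k]]


seeds = [
    "Clustering large text collections with term weighting",
    "Text mining on high performance computing systems",
]
feed_items = [
    ("A", "New term weighting scheme for text clustering"),
    ("B", "Local bakery wins regional bread award"),
    ("C", "Scaling text mining to large computing clusters"),
]
print(recommend(seeds, feed_items))
```

In a real pipeline the candidate list would come from the feeds themselves and the ranked titles would be posted somewhere readers can follow; both steps are left out here to keep the sketch self-contained.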
There used to be a time when you could go to a library, look through journals, and pretty much find the collection of information you wanted. Now there are repositories like the arXiv repository, and it’s just staggering. There’s so much new information put out there daily that trying to keep up and find new and relevant documents is very, very hard. It’s a very challenging problem. And so these types of systems help people ask, ‘Can I find something interesting and relevant to my work and to my research?’
I think the biggest challenge has been really in just understanding text, trying to figure out how the mind uses and develops and works with text. It’s very counterintuitive.
But I think, for the most part it’s a good field and it’s a fun field and a nice field to work in.”