
Good morning, Robert!

One fact of life is... about the food chain in academia, and in the life of the mind more generally.

It reminds me of the time when Cherkassky had one of his first big plenaries, at an IJCNN, and there was also a big discussion or question-and-answer session later. Someone asked me, roughly: "Paul, what do YOU make of all this Cherkassky stuff? Is it really a new right way to do things? Or is it just a heap of formal nonsense?" My answer, roughly: "After watching how science works, I really appreciate how it is like a kind of vast ecology in itself. We can't all do the same thing. There are lots of things which need to be done, as one part of the system. So this work has its place in the food chain, but no, it's not the whole thing and not the big picture." Later, some Chinese folks at the meeting said they had a really good laugh at this, because in their culture "we know what it means when you say someone has his place in the food chain." But I didn't mean it THAT way...

Last night, Amir gave us an important window into the front lines of practical forecasting, a serious and important area. It's a highly empirical area. I couldn't help remembering that folks who sell consulting services to the five big markets he mentioned must always be agile enough to adapt to their clients' preconceptions -- TO SOME DEGREE. They need to show they can do all the stuff their client might think is important -- but they also need to establish clear value added, by showing they can do better in some way.

The sheer variety of methods probably seems a bit bewildering to students, as it comes, just one after another. In fact, this kind of bewildering variety is pretty typical of a lot of areas of science these days. Facing such a bewildering variety, most humans go off in one of two directions:

(1) Embrace the bewilderment (and the market). Get deep into it, play with the toys -- and make money. Think of all the many things you can do in playing with data, and hack through it.

(2) Retreat from the confusion, and keep yourself sane and coherent by embracing a theoretical framework, and sticking with it... either by becoming a pure theorist or by using that theory consistently in playing with data.

(In the world of physics, I tend to think of (1) as the common approach in practical electronics and photonics, while (2) creates monasteries to worship imaginary superstrings who have yet to manifest their divinity in experiments on earth.)

For those interested in human minds -- there was a nice little paper by Levine and Leven in Science many years ago, on "the new Coke", which described the psychological variable, "tolerance of cognitive dissonance," which drives this choice. Amir, being highly tolerant of cognitive dissonance, has immersed himself very deeply in the empirical data of forecasting itself. He himself is learning from data, and is not living what he called a "Monte Carlo" life.

By contrast, in my talk, I mentioned folks like maximum likelihood theorists, Bayesians, and a bit of Vapnik/Cherkassky people, who are more (2) than (1).

In graduate school, I started out believing in Bayesian theory, which in practical terms led me to maximum likelihood thinking. And yes, I was born with a low tolerance of cognitive dissonance, based in part on the German/Hungarian side of my family. Things have to make sense to me, and I don't like it when they don't.

In the maximum likelihood theory, "all life is about choosing a stochastic model based on experience." (Hey, Carnap and Jeffreys really were serious philosophers.)

Amir often echoed maximum likelihood thinking in PART of his presentation. In maximum likelihood theory, the choice of MODEL is based on data, on the likelihood of the data conditional on the model. The METHOD is basically always the same -- compare models, or values of parameters for a given model, by comparing the probability of seeing the data you did based on that model. Always the same method, but always a variety of models and parameters/weights to consider. The neural network approach is very special here, because an MLP roughly represents ALL POSSIBLE nonlinear input-output models; thus we can use the same METHOD (maximizing likelihood by minimizing error) AND the same structure (generalized MLP), effectively automating the process of model selection -- as is required when we try to build something like a brain.
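To make "always the same method" concrete, here is a minimal sketch in Python. It assumes independent zero-mean Gaussian noise, which is the case where maximizing likelihood and minimizing square error pick the same winner. The toy series, the two candidate models and all function names are made up for illustration.

```python
import math

def gaussian_log_likelihood(residuals):
    """Log-likelihood of residuals under i.i.d. zero-mean Gaussian noise,
    with the noise variance set to its maximum likelihood estimate."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n  # ML estimate of the variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

# Hypothetical toy series, invented for this sketch.
y = [1.0, 1.2, 1.5, 1.9, 2.4, 3.0, 3.7]

def persistence(prev):
    """Candidate model A: the next value equals the last value."""
    return prev

def growth(prev, rate=1.22):
    """Candidate model B: the next value is the last value times a fixed rate."""
    return rate * prev

# One-step-ahead residuals of each model on the same data.
res_a = [y[t] - persistence(y[t - 1]) for t in range(1, len(y))]
res_b = [y[t] - growth(y[t - 1]) for t in range(1, len(y))]

sse_a = sum(r * r for r in res_a)
sse_b = sum(r * r for r in res_b)
ll_a = gaussian_log_likelihood(res_a)
ll_b = gaussian_log_likelihood(res_b)

# Same METHOD for both models: prefer the one under which the data are more likely.
best = "growth" if ll_b > ll_a else "persistence"
```

Under this noise assumption the model with the smaller sum of squared errors is always the one with the larger likelihood, so the comparison reduces to minimizing error.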

But then, there were times when Amir talked about exponential smoothing, about deseasonalization and about the (Hill? Holt?) model ... when a maximum likelihood theorist, like the one I used to be, would be cringing and barely suppressing a scream. "What are all these ad hoc things? What do they mean? Why do they work? More important -- when could they work? Aren't these like those folk chartists or technicians on Wall Street who..."

BUT: BEYOND (1) and (2), it is absolutely crucial to science that some of us take a third path.

Like Amir, we must learn from data, from experience. But if we can't fully UNDERSTAND that experience, we need to work to understand it, by developing a BETTER theoretical framework capable of coping with the empirical data.

In a way... We need to practice syncretism! We need to avoid excessive belief in a global model or theory, to the point where we don't pay enough attention to the new data points as they flood into our experience. But we also need to avoid being driven by past data without UNDERSTANDING; we need to keep adapting our global model or theory to cope with the data we didn't fully understand. This is really just the basic SCIENTIFIC METHOD, enunciated by Francis Bacon, based on the basic principles of learning we have already discussed from people like the Reverend Occam (and Immanuel Kant).

The paper by Levine and Leven talks about more than just tolerance of cognitive dissonance. It also talks about something called "novelty seeking," which according to the research they review is also relatively heritable. (Bernie Baars, editor of the Journal of Consciousness, has also discussed this with me.) Statistically, most people are either high on tolerance of cognitive dissonance AND novelty seeking (pushing them into choice (1)), or low on both (pushing them to (2)). We really need both kinds of people; we need diversity in our ecology of the mind, for many reasons. But, since I am half Irish, I seem to have fallen into the less common mix of high novelty seeking AND low tolerance of cognitive dissonance, which creates a kind of conflict as one always tries to put things into a theoretical framework -- and then jubilantly encourages the inner trickster who delights in undermining the whole thing. That mix causes a lot of extra work and pain... but it also pushes us to try to get deep into the unexplained data, and extract the theoretical principles. That, too, is a crucial part of what we need for real progress in science. In certain moods, I think of extreme (1) as "the sponge personality," and (2) as "the Nazi personality" -- both quite numerous in our world, even in K-12.

This point is so important that I feel I must give another example. Many years ago, I attended the inaugural workshop at NSF for a big funding initiative called "CyberPhysical Systems" led by Helen Gill, who retired just a few months ago.

Everybody in the room invited from academia came from control theory. (Yes, I would have wanted more diversity.) The vast majority of leading people had talked a lot about the new research challenges Helen wanted to support, but really didn't have useful tangible results (if you actually want something to work). The main counterexample was Professor Shankar Sastry of Berkeley, who had DOZENS of amazing things working, bringing together a relatively open variety of approaches. When people asked him how he was so uniquely successful, he said: "It's basically simple. I get deep into these application challenges, and then I try to EXTRACT the basic principles that one can learn from them."

So now: back to prediction.

In early graduate school, I was a maximum likelihood theorist, but was happy to get deep into the long-term data on population and nationalism, given to me by the famous political scientist, Prof. Karl Deutsch, who agreed to be my thesis advisor. German as he was, it was really my Irish side that made me want to help the guy, who had "eaten" about ten graduate students before me who had trouble making sense of the data. (By the way, the technical/mathematical side of this is of course given in detail in my PhD thesis, reprinted in full in my book "The Roots of Backpropagation.") Applying maximum likelihood analysis, I concluded that a "vector ARMA" model could solve his problem, and make the first decent forecasts of this data. I used "backpropagation" (the chain rule for ordered derivatives) to develop the first fast computer program to estimate vector ARMA processes, and applied it to this data -- and it worked.

But then came the trickster/scientist. As a matter of honesty, I felt I also needed to test out a kind of "devil's advocate" method (inspired a bit by reading the work of Jay Forrester of MIT), which was basically just the "exponential smoothing" described by Amir last night. My version (in the book) was very easy and clean.
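For anyone who has not met exponential smoothing before, here is a minimal sketch of the textbook simple form in Python. It is NOT the version from the book, just the standard recursion; the function name, the choice to start from the first observation, and the toy numbers are assumptions made for illustration.

```python
def exponential_smoothing_forecasts(series, alpha):
    """One-step-ahead forecasts from simple exponential smoothing.

    forecasts[t] is the forecast of series[t] made from series[:t].
    The first forecast is initialised to the first observation
    (one common convention, assumed here).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    forecasts = [level]
    for value in series[:-1]:
        # New level = weighted average of the latest observation and the old level.
        level = alpha * value + (1.0 - alpha) * level
        forecasts.append(level)
    return forecasts

# Hypothetical toy data.
toy = [10.0, 12.0, 11.0, 13.0]
smoothed = exponential_smoothing_forecasts(toy, alpha=0.5)
naive = exponential_smoothing_forecasts(toy, alpha=1.0)
```

With alpha = 1 the forecast is just the last observed value; smaller values of alpha average over more of the past.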

Maximum likelihood thinking said it could not possibly outperform vector ARMA here. But it did, by a factor of two! Fred Mosteller, a famous statistician on my thesis committee, got me to test all of this on simulated data, as well as real data, to establish the principle ... and I did.

Oops. So much for maximum likelihood. Maximum likelihood would lead to errors TWICE as large as exponential smoothing. What then?

NOT turn into a sponge! How can we extract the basic principle here?

That is when I came up with "the pure robust method," which can be used BOTH in ordinary statistical modeling (econometric modeling, especially) and in the training of time-lagged recurrent networks. The Ford package for training time-lagged recurrent networks (TLRN) has consistently beaten everything else in forecasting competitions, but all it does is minimize square error in the usual method of maximum likelihood theory (made possible by backpropagation). When I compared that method, the minimization of square error, with the pure robust method, for MLPs in forecasting data from the chemical industry, ERRORS WERE REDUCED BY A FACTOR OF THREE. **IF** that pattern holds up with TLRN, then a new computer package which uses TLRN AND the pure robust method could get errors one-third the size of those from the Ford package. If a university could start by replicating the Ford capability, and THEN adding pure robustness, it could really do a whole lot better than anyone else here.

For some data. There is lots of data which behaves like the chemical industry data we studied. (To see that study of chemical data, just look at chapter 10 of HIC, posted on my web site.) But there is lots of data which doesn't. When I tried this out on an econometric model taken from DOD, in 1977, I found out that maximum likelihood and pure robust were BOTH extremes; the best was something in the middle, similar in flavor to the "Hill" or "Holt" weighted linear model that Amir talked about last night. But this "something in the middle" was NOT a mixture of models, and was NOT just a linear method. It was a truly general method for training A TLRN or AN econometric model, just as general as maximum likelihood or pure robust. I called it "the compromise method." It cut errors in half for predicting GNP, in the econometric model I mentioned just now, compared with the best of maximum likelihood and pure robust both. It included a way of ADAPTING the degree of robustness, to accommodate to DIFFERENT variables. Chapter 10 of HIC is still our best knowledge on that subject. One could do better, with some deep thinking, but no one ever has, even since HIC. Perhaps if Amir had been working with me, he might be ready to do more... but deep thinking is not easy stuff. I must apologize that I have allocated my own deep thinking to other things here lately.

Many of the empirical issues which Amir mentioned with detrending and deseasonalization were basically addressing a kind of scaling problem. The maximum likelihood theorists would say that a more proper way to address some of those concerns is by recognizing that random disturbances are not fixed Gaussians; thus simple devices like assuming percentage error, and doing maximum likelihood based on a better error model, are better. In that case, one needs to compute a Jacobian to compare models with different error models, but it's all straightforward in principle. One can adapt similar approaches to the compromise method and so on.
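Here is a rough sketch, in Python, of what that Jacobian comparison can look like in the simplest case. It assumes only two error models: additive Gaussian error on y, and Gaussian error on log y (which is roughly "percentage error"). Because the second model is a density over log y, its log-likelihood needs the extra term log |d(log y)/dy| = -log y for each data point before it can be compared with the first. The data, predictions and function names are all made up for illustration.

```python
import math

def gaussian_ll(residuals):
    """Gaussian log-likelihood with the variance at its ML estimate."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def additive_error_ll(actual, predicted):
    """Log-likelihood of y under: y = prediction + Gaussian error."""
    return gaussian_ll([a - p for a, p in zip(actual, predicted)])

def percentage_error_ll(actual, predicted):
    """Log-likelihood of y under: log y = log prediction + Gaussian error.

    The Jacobian term converts the density over log y into a density
    over y, so the result is comparable with additive_error_ll.
    """
    log_res = [math.log(a) - math.log(p) for a, p in zip(actual, predicted)]
    jacobian = -sum(math.log(a) for a in actual)
    return gaussian_ll(log_res) + jacobian

# Hypothetical data whose errors grow with the level of the series.
actual = [10.0, 21.0, 38.0, 84.0, 155.0, 330.0]
predicted = [10.5, 20.0, 40.0, 80.0, 160.0, 320.0]

ll_additive = additive_error_ll(actual, predicted)
ll_percentage = percentage_error_ll(actual, predicted)
ll_log_scale_only = gaussian_ll(
    [math.log(a) - math.log(p) for a, p in zip(actual, predicted)]
)
```

On this toy data the percentage-error model still wins after the correction, but by far less than the uncorrected log-scale number would suggest, which is the reason the Jacobian is needed.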

There is more to be said about Amir's presentation, which covered many important issues, but this email is long and complicated enough for now. You will also see all of these issues and more referred to, if briefly, in the Erdos pdf of slides and words also posted on my website (www.werbos.com/Erdos.pdf).

I hope this helps.

Best regards,

Paul
